E2-VOR: An End-to-End En/Decoder Architecture for Efficient Video Object Recognition
ACM Transactions on Design Automation of Electronic Systems (IF 2.2). Pub Date: 2022-06-17. DOI: 10.1145/3543852
Zhuoran Song, Naifeng Jing, Xiaoyao Liang

High-resolution video object recognition (VOR) is evolving rapidly but is very compute-intensive, because VOR relies on compute-intensive deep neural networks (DNNs) for better accuracy. Although many works have been proposed for speedup, they mostly focus on DNN algorithms and hardware acceleration on the edge side. We observe that most video streams need to be losslessly compressed before going online, so the encoder has access to all the video information. Moreover, since the cloud has abundant computing power to handle a sophisticated VOR algorithm, we propose to take a one-shot effort for a modified VOR algorithm at the encoding stage in the cloud and to integrate full VOR regeneration into a slightly extended decoder on the device. The scheme enables light-weight VOR with server-class accuracy by simply leveraging the classic and economical video decoder universal to any mobile device. Meanwhile, the scheme saves massive computing power by not repetitively processing the same video on different user devices, which makes it extremely sustainable for green computing across the whole network.

We propose E2-VOR, an end-to-end encoder and decoder architecture for efficient VOR. We carefully design the scheme to have minimal impact on the transmitted video bitstream. In the cloud, the VOR-extended video encoder tracks objects on a macro-block basis and packs intelligent information into the video stream for increased VOR accuracy and a fast regeneration process. On the edge device, we extend the traditional video decoder with a small piece of dedicated hardware to enable efficient VOR regeneration. Our experiments show that E2-VOR achieves a 5.0× performance improvement with less than 0.4% VOR accuracy loss compared to the state-of-the-art FAVOS scheme. On average, E2-VOR runs at over 54 frames per second (FPS) for 480P videos on an edge device.
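To make the decoder-side regeneration concrete, the sketch below illustrates the general idea of propagating an object mask across frames using the per-macro-block motion vectors that a standard video decoder already extracts from the bitstream. This is a minimal illustration of the technique, not the paper's actual interface: the function name, array layouts, and the 16-pixel macro-block size are assumptions chosen for clarity, and E2-VOR's real design additionally packs encoder-side VOR metadata into the stream and uses dedicated hardware.

```python
import numpy as np

MB = 16  # macro-block size in pixels (typical for H.264; an assumption here)

def propagate_mask(prev_mask: np.ndarray, motion_vectors: np.ndarray) -> np.ndarray:
    """Propagate a binary object mask to the current frame using
    per-macro-block motion vectors (dy, dx) decoded from the bitstream.

    prev_mask:      (H, W) binary mask for the previous frame
    motion_vectors: (H//MB, W//MB, 2) integer vectors; each current-frame
                    block was predicted from the previous-frame region at
                    (block position + vector)
    """
    H, W = prev_mask.shape
    next_mask = np.zeros_like(prev_mask)
    for by in range(H // MB):
        for bx in range(W // MB):
            dy, dx = motion_vectors[by, bx]
            sy, sx = by * MB + dy, bx * MB + dx  # source block in prev frame
            if 0 <= sy <= H - MB and 0 <= sx <= W - MB:
                next_mask[by*MB:(by+1)*MB, bx*MB:(bx+1)*MB] = \
                    prev_mask[sy:sy+MB, sx:sx+MB]
    return next_mask

# Usage: a 32x32 frame (2x2 macro-blocks) where the object occupies the
# top-left block and moves one full block to the right.
mask = np.zeros((32, 32), dtype=np.uint8)
mask[:16, :16] = 1
mv = np.zeros((2, 2, 2), dtype=int)
mv[0, 1] = (0, -16)  # top-right block is predicted from the block to its left
mv[0, 0] = (0, 16)   # top-left block is predicted from the (empty) block right of it
out = propagate_mask(mask, mv)
# The mask now covers the top-right block and has left the top-left one.
```

Reusing motion vectors this way is what makes per-frame regeneration cheap on the device: the expensive DNN inference happens once in the cloud, while the decoder only performs block copies it must do anyway.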



