An efficient parallel-pipelined intra prediction architecture to support DCT/DST engine of HEVC encoder
Journal of Real-Time Image Processing (IF 3), Pub Date: 2022-02-21, DOI: 10.1007/s11554-022-01206-2
Lakshmi Poola, P. Aparna

Intra prediction in high-efficiency video coding (HEVC) has increased in complexity owing to the addition of five variable-sized prediction units (PUs) and 35 directional predictions. In this work, we propose an efficient parallel-pipelined architecture that processes 8 samples in parallel every clock cycle. The functional units needed to predict the PU samples operate in a pipelined fashion. With this balanced combination of parallelism and pipelining, the design achieves higher throughput with limited hardware resources than existing designs in the literature. The samples are processed row-wise so that they can be transform coded directly, which eliminates the need for an intermediate memory buffer of 8 K between the two modules. A compact reconfigurable reference buffer of size 0.8 KB is incorporated to reduce the read-write latency associated with fetching reference samples. A dedicated arithmetic module in the intra engine ensures that multipliers are reused, which increases hardware efficiency. The architecture supports all PU sizes and directional modes. The proposed design is implemented and tested on a field-programmable gate array (FPGA) platform operating at 150 MHz. It achieves a throughput of 8 samples per clock cycle at a hardware cost of 16.2 K look-up tables (LUTs) and 5.7 K registers, and supports real-time HD 4K video encoding applications.
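The throughput figures in the abstract can be turned into a rough back-of-the-envelope budget. The sketch below is illustrative only and is not taken from the paper: the 150 MHz clock and 8 samples per cycle come from the abstract, while the 3840x2160 frame size, 4:2:0 chroma subsampling (1.5 samples per pixel), and the 30 fps target are assumptions, and the function names are hypothetical. It also ignores pipeline fill latency and any stalls, so the results are upper bounds.

```python
import math

# Figures stated in the abstract.
CLOCK_HZ = 150_000_000
SAMPLES_PER_CYCLE = 8

# Assumptions (not stated in the abstract): 4K UHD frame, 4:2:0 chroma.
ASSUMED_WIDTH = 3840
ASSUMED_HEIGHT = 2160
ASSUMED_SAMPLES_PER_PIXEL = 1.5  # one luma sample plus two quarter-size chroma planes


def cycles_per_block(n, samples_per_cycle=SAMPLES_PER_CYCLE):
    """Ideal cycle count to emit one n x n block, ignoring pipeline latency."""
    return math.ceil(n * n / samples_per_cycle)


def peak_samples_per_second(clock_hz=CLOCK_HZ, samples_per_cycle=SAMPLES_PER_CYCLE):
    """Peak predicted-sample rate when the pipeline never stalls."""
    return clock_hz * samples_per_cycle


def samples_per_frame(width=ASSUMED_WIDTH, height=ASSUMED_HEIGHT,
                      samples_per_pixel=ASSUMED_SAMPLES_PER_PIXEL):
    """Luma plus chroma samples in one frame under the assumed format."""
    return int(width * height * samples_per_pixel)


def prediction_passes_per_frame(fps):
    """How many times each sample could be predicted per frame at a given fps.

    A value above 1 leaves headroom for evaluating more than one candidate
    mode or PU size per sample; this is an upper bound, not a measured result.
    """
    return peak_samples_per_second() / (samples_per_frame() * fps)


if __name__ == "__main__":
    for n in (4, 8, 16, 32, 64):
        print(f"{n}x{n} block: {cycles_per_block(n)} cycles (ideal)")
    print(f"peak rate: {peak_samples_per_second():,} samples/s")
    print(f"passes per frame at an assumed 30 fps: {prediction_passes_per_frame(30):.2f}")
```

Under these assumptions the peak rate is 1.2 billion samples per second, which gives only a few prediction passes per sample at 30 fps. How the actual design spends that budget across candidate modes and PU sizes is described in the paper, not in the abstract.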




Updated: 2022-02-22