Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards, Journal of Electronic Imaging



Journal of Electronic Imaging (IF 1.0), Pub Date: 2011-07-01, DOI: 10.1117/1.3606588
Francesc Massanes, Marie Cadennes, Jovan G Brankov


In this paper we describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block-matching algorithm (BMA) uses the summed absolute difference (SAD) error criterion and a full grid search (FS) to find the optimal block displacement. In this evaluation we compared the execution time of GPU and CPU implementations for images of various sizes, using integer and non-integer search grids. The results show that a GPU card can shorten computation time by a factor of 200 for an integer search grid and by a factor of 1000 for a non-integer search grid. The additional speedup for the non-integer search grid comes from the GPU's built-in hardware for image interpolation. Further, when using multiple GPU cards, the evaluation shows that the method used to split the data across cards matters, but that an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized, non-full-grid-search, CPU-based motion estimation methods, namely the implementation of the pyramidal Lucas-Kanade optical flow algorithm in OpenCV and the Simplified Unsymmetrical multi-Hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed a modest improvement, even though its computational complexity is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA C1060 Tesla GPU cards.
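To make the two ingredients named in the abstract concrete, the following is a minimal CPU reference sketch of full-grid-search block matching with the SAD criterion on an integer grid. It is not the paper's CUDA implementation: the function names `sad` and `full_search`, the list-of-rows image layout, and the rule of skipping candidates that fall outside the image are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumption: an image is a list of rows of numeric pixel intensities.
Image = Sequence[Sequence[float]]


def sad(ref: Image, tgt: Image, top: int, left: int,
        dy: int, dx: int, block: int) -> float:
    """Summed absolute difference between the block at (top, left) in `ref`
    and the block displaced by (dy, dx) in `tgt`."""
    total = 0
    for y in range(block):
        for x in range(block):
            total += abs(ref[top + y][left + x]
                         - tgt[top + dy + y][left + dx + x])
    return total


def full_search(ref: Image, tgt: Image, top: int, left: int,
                block: int, radius: int) -> Tuple[int, int, float]:
    """Test every integer displacement within +/- `radius` pixels and return
    (dy, dx, cost) for the displacement with the smallest SAD.

    Assumes `ref` and `tgt` have the same size and the block lies fully
    inside `ref`, so the zero displacement is always a valid candidate.
    """
    height, width = len(tgt), len(tgt[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Hypothetical boundary rule: ignore candidates that would
            # read pixels outside the target image.
            inside = (0 <= top + dy and top + dy + block <= height
                      and 0 <= left + dx and left + dx + block <= width)
            if not inside:
                continue
            cost = sad(ref, tgt, top, left, dy, dx, block)
            # Strict comparison keeps the first minimum found on ties.
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best
```

For a block of B×B pixels and a search radius r, this evaluates up to (2r+1)² candidates at B² absolute differences each. Every candidate is independent of the others, which is what makes the full search expensive on a CPU and well suited to parallel evaluation.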


Updated: 2011-07-01
