A Fast Parallel Particle Filter for Shared Memory Systems, IEEE Signal Processing Letters


A Fast Parallel Particle Filter for Shared Memory Systems
IEEE Signal Processing Letters (IF 3.9), Pub Date: 2020-01-01, DOI: 10.1109/lsp.2020.3014035
Alessandro Varsi, Jack Taylor, Lykourgos Kekempanos, Edward Pyzer Knapp, Simon Maskell

Particle Filters (PFs) are Sequential Monte Carlo methods which are widely used to solve filtering problems of dynamic models under Non-Linear Non-Gaussian noise. Modern PF applications have demanding accuracy and run-time constraints that can be addressed through parallel computing. However, an efficient parallelization of PFs can only be achieved by effectively parallelizing the bottleneck: resampling and its constituent redistribution step. A pre-existing implementation of redistribute on Shared Memory Architectures (SMAs) achieves $O(\frac{N}{T}\log_2 N)$ time complexity over $T$ parallel cores. This redistribute implementation is, however, highly computationally intensive and cannot be effectively parallelized due to the inherently limited number of cores of SMAs. In this paper, we propose a novel parallel redistribute on OpenMP 4.5 which takes $O(\frac{N}{T} + \log_2 N)$ steps and fully exploits the computational power of SMAs. The proposed approach is up to six times faster than the $O(\frac{N}{T}\log_2 N)$ one and its implementation on GPU provides a further three-time speed-up vs its equivalent on a 32-core CPU. We also show on an exemplary PF that our redistribution is no longer the bottleneck.
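As an illustration of what the redistribution step computes, the sketch below takes $N$ particles and a vector of copy counts (as produced by a resampling scheme) and returns a new population in which particle $i$ appears `ncopies[i]` times. It is a minimal sequential Python sketch, not the OpenMP algorithm proposed in the paper; the function name `redistribute` and the argument names are hypothetical. Each output slot is located with a binary search over a prefix sum, so the total work here is $O(N \log_2 N)$, and the slots are independent of one another, which is the property a parallel implementation can exploit.

```python
from bisect import bisect_right
from itertools import accumulate


def redistribute(particles, ncopies):
    """Return a list in which particles[i] appears ncopies[i] times.

    Illustrative sketch only: names and structure are assumptions,
    not the implementation described in the paper.
    """
    n = len(particles)
    if len(ncopies) != n or sum(ncopies) != n:
        raise ValueError("ncopies must have length N and sum to N")

    # Inclusive prefix sum: ends[i] is one past the last output slot
    # that belongs to particle i.
    ends = list(accumulate(ncopies))

    # Output slot j belongs to the first particle whose cumulative
    # count exceeds j. Every slot is computed independently.
    return [particles[bisect_right(ends, j)] for j in range(n)]


if __name__ == "__main__":
    # Particle "b" is copied three times, "d" once; "a" and "c" die out.
    print(redistribute(["a", "b", "c", "d"], [0, 3, 0, 1]))
```

Because no output slot depends on another, the list comprehension could in principle be split across $T$ workers, each handling about $N/T$ slots; how the paper reduces the per-slot search cost is not shown here.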


Updated: 2020-01-01
