Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures
IEEE Transactions on Parallel and Distributed Systems (IF 5.3). Pub Date: 2020-08-01. DOI: 10.1109/tpds.2020.2978045
Peng Zhang, Jianbin Fang, Canqun Yang, Chun Huang, Tao Tang, Zheng Wang

As many-core accelerators integrate ever more processing units, it becomes increasingly difficult for a parallel application to make effective use of all available resources. An effective way of improving hardware utilization is to exploit spatial and temporal sharing of the heterogeneous processing units by multiplexing computation and communication tasks, a strategy known as heterogeneous streaming. Achieving effective heterogeneous streaming requires carefully partitioning the hardware among tasks and matching the granularity of task parallelism to the resource partition. However, finding the right resource partitioning and task granularity is extremely challenging, because the space of possible solutions is large and the optimal solution varies across programs and datasets. This article presents an automatic approach for quickly deriving a good hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the performance of the target application under a given resource partition and task granularity configuration; the model serves as a utility for quickly searching for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insight into low-level hardware details, we use machine learning to learn the model automatically: a predictive model is first trained offline on training programs, and is then used at runtime to predict the performance of any unseen program. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x speedup on the XeonPhi platform and a 1.1x speedup on the GPU platform. These results translate to over 93 percent of the performance delivered by a theoretically perfect predictor.
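The abstract's core idea, a learned performance model used as a utility function to rank candidate (resource partition, task granularity) configurations at runtime, can be sketched as follows. This is a hedged illustration only, not the paper's actual model or feature set: the linear weights, the feature names (workload size, task count, accelerator cores, granularity), and the candidate grid are all made up here for demonstration.

```python
# Hedged sketch (NOT the paper's implementation): a performance model,
# assumed here to be a simple linear predictor learned offline, used to
# rank candidate (resource partition, task granularity) configurations.

def predict_runtime(weights, features):
    """Estimated runtime of a configuration: dot(weights, features)."""
    return sum(w * f for w, f in zip(weights, features))

def best_config(weights, prog_features, candidates):
    """Search the candidate space for the configuration with the lowest
    predicted runtime.

    `candidates` is a list of (accelerator_cores, task_granularity) pairs;
    each pair is appended to the program-level features to form the model
    input, mirroring the runtime configuration search described above.
    """
    best, best_cost = None, float("inf")
    for cores, gran in candidates:
        feats = prog_features + [cores, gran]
        cost = predict_runtime(weights, feats)
        if cost < best_cost:
            best, best_cost = (cores, gran), cost
    return best

# Illustrative weights "learned offline" (fabricated for this sketch).
weights = [0.5, 0.2, -0.01, 0.05]
prog_features = [100.0, 8.0]          # e.g. workload size, number of tasks
candidates = [(c, g) for c in (16, 32, 60) for g in (1, 4, 16)]
print(best_config(weights, prog_features, candidates))  # → (60, 1)
```

Because the model is cheap to evaluate, the full candidate grid can be scanned exhaustively at launch time; a real system would replace the linear predictor with the offline-trained model and derive features from the actual program and dataset.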
