Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach
arXiv - CS - Programming Languages. Pub Date: 2020-03-05, DOI: arxiv-2003.04294
Peng Zhang, Jianbin Fang, Canqun Yang, Chun Huang, Tao Tang, Zheng Wang

This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
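The abstract describes a two-phase workflow: learn a performance model offline from training programs, then use it at runtime to score candidate (resource partition, task granularity) configurations and pick the best one for an unseen program. The sketch below illustrates that idea only; the feature set, the choice of a scikit-learn random-forest regressor, and the configuration space are illustrative assumptions, not the paper's actual model or features.

```python
# Minimal sketch of the offline-train / runtime-search idea (assumed setup).
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# --- Offline phase: learn a performance model from training programs ---
# X_train: program features plus a (partition, granularity) configuration;
# y_train: the measured performance for that program/configuration pair.
X_train = np.random.rand(500, 6)          # placeholder training data
y_train = np.random.rand(500)             # placeholder measured performance
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# --- Runtime phase: search for a good configuration for an unseen program ---
def best_configuration(program_features, partitions, granularities):
    """Score every (partition, granularity) candidate with the learnt model
    and return the configuration predicted to perform best."""
    candidates = list(itertools.product(partitions, granularities))
    feats = np.array([np.concatenate([program_features, [p, g]])
                      for p, g in candidates])
    predicted = model.predict(feats)       # predicted performance per config
    return candidates[int(np.argmax(predicted))]

# Example: 4 hypothetical program-level features, sweep a small config space.
config = best_configuration(np.array([0.3, 0.7, 0.1, 0.9]),
                            partitions=range(1, 9),
                            granularities=[16, 32, 64, 128])
print("Predicted-best (partition, granularity):", config)
```

Because the model only needs a cheap prediction per candidate, an exhaustive sweep like this can run at program start-up, which is why the paper can search the configuration space at runtime rather than profiling each configuration directly.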

Updated: 2020-03-10