Efficient algorithms for task mapping on heterogeneous CPU/GPU platforms for fast completion time
Journal of Systems Architecture (IF 3.7), Pub Date: 2020-12-19, DOI: 10.1016/j.sysarc.2020.101936
Zexin Li, Yuqun Zhang, Ao Ding, Husheng Zhou, Cong Liu

In GPU-based embedded systems, mapping computation and data for multiple applications while minimizing completion time is challenging because the policy space is large. Achieving fast completion time on heterogeneous embedded systems requires a fine-grained mapping framework that explores a set of critical factors. In this paper, we present a theoretical framework that yields a sub-optimal solution via three practical mapping algorithms with low time complexity. We evaluate these algorithms on StarPU with a large set of popular benchmarks. Experimental results demonstrate that the algorithms proposed in the original EMSOFT paper achieve up to 30% faster completion time than state-of-the-art mapping techniques and perform consistently well across different workloads. We further extend these algorithms to minimize the completion time and improve the runtime performance of complex heterogeneous applications on resource-limited infrastructure. We also extend the evaluation by deploying StarPU under multiple setups, with an additional benchmark suite that simulates real-world runtime neural networks. Experimental results demonstrate that our extended algorithm achieves substantially faster completion time than state-of-the-art mapping techniques (30% to 37% faster on average across multiple resource-constrained scenarios).




Updated: 2020-12-20