Global Analysis of C Concurrency in High-Level Synthesis
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IF 2.8), Pub Date: 2020-10-21, DOI: 10.1109/tvlsi.2020.3026112
Nadesh Ramanathan, George A. Constantinides, John Wickerson

When mapping C programs to hardware, high-level synthesis (HLS) tools reorder independent instructions, aiming to obtain a schedule that requires as few clock cycles as possible. However, when synthesizing multithreaded C programs, reordering opportunities are limited by the presence of atomic operations (“atomics”), the fundamental concurrency primitives in C. Existing HLS tools analyze and schedule each thread in isolation. In this article, we argue that thread-local analysis is conservative, especially since HLS compilers have access to the entire program. Hence, we propose a global analysis that exploits information about memory accesses by all threads when scheduling each thread. Implemented in the LegUp HLS tool, our analysis is sensitive to sequentially consistent (SC) and weak atomics and supports loop pipelining. Since the semantics of C atomics is complicated, we formally verify that our analysis correctly implements the C memory model using the Alloy model checker. Compared with thread-local analysis, our global analysis achieves a 2.3× average speedup on a set of lock-free data structures and data-flow patterns. We also apply our analysis to a larger application: a lock-free, streamed, and load-balanced implementation of Google's PageRank, where we see a 1.3× average speedup compared with the thread-local analysis.
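
As a rough illustration of why atomics limit reordering, the sketch below shows a message-passing pattern written with C11 atomics. It is not taken from the article: the names `payload`, `produced_count`, `ready`, `producer`, and `consumer` are hypothetical, and the comments describe only what the C memory model requires, not how LegUp or the authors' analysis actually schedules the code. The store to `payload` must not move after the release store to `ready`, because the consumer reads `payload` after its acquire load. The update to `produced_count`, by contrast, touches a location that (by assumption) no other thread accesses, so moving it across the atomic would be unobservable. An analysis that looks at one thread at a time cannot know this, whereas one that sees the accesses of all threads could in principle exploit it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;         /* plain data, later read by the consumer thread */
static int produced_count;  /* plain data, touched only by the producer (assumption) */
static atomic_bool ready;   /* atomic flag; static storage, so it starts as false */

/* Producer thread body: publishes one value and returns how many
 * values have been published so far. */
int producer(int value) {
    payload = value;        /* must stay ordered before the release store */
    produced_count += 1;    /* no other thread reads or writes this location */
    atomic_store_explicit(&ready, true, memory_order_release);
    return produced_count;
}

/* Consumer thread body: returns true and fills *out once the flag is set,
 * or returns false if nothing has been published yet. */
bool consumer(int *out) {
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;         /* ordered after the acquire load, so it sees the published value */
    return true;
}
```

In a real program `producer` and `consumer` would run in separate threads; calling `consumer` before and after `producer` from a single thread is enough to exercise the two code paths.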

Updated: 2020-10-21