SLAP: A Split Latency Adaptive VLIW pipeline architecture which enables on-the-fly variable SIMD vector-length, arXiv - CS - Hardware Architecture



arXiv - CS - Hardware Architecture. Pub Date: 2021-02-26, DOI: arxiv-2102.13301
Ashish Shrivastava, Alan Gatherer, Tong Sun, Sushma Wokhlu, Alex Chandra

Over the last decade, the relative latency of access to shared memory by multicore processors has increased, as wire resistance came to dominate latency and low wire density layout pushed multiport memories farther away from their ports. Various techniques were deployed to improve average memory access latency, such as speculative prefetching and branch prediction, but these often lead to high variance in execution time, which is unacceptable in real-time systems. Smart DMAs can be used to copy data directly into a layer 1 SRAM, but with overhead. The VLIW architecture, the de facto signal processing engine, suffers badly from a breakdown in lockstep execution of scalar and vector instructions. We describe the Split Latency Adaptive Pipeline (SLAP) VLIW architecture, a cache performance improvement technology that requires zero change to object code while removing smart DMAs and their overhead. SLAP builds on the Decoupled Access and Execute concept by 1) breaking lockstep execution of functional units, 2) enabling variable vector length for variable data-level parallelism, and 3) adding a novel triangular load mechanism. We discuss the SLAP architecture and demonstrate its performance benefits on real traces from a wireless baseband system, where even the most compute-intensive functions suffer from an Amdahl's law limitation due to a mixture of scalar and vector processing.
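
The Amdahl's law limitation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the textbook form of the law, speedup = 1 / ((1 - p) + p / s), where p is the fraction of runtime that vectorizes and s is the vector speedup. The function name `amdahl_speedup` and the 90% vector fraction are hypothetical choices for illustration, not figures or code from the paper.

```python
def amdahl_speedup(vector_fraction: float, vector_width: int) -> float:
    """Overall speedup when only `vector_fraction` of the runtime
    is accelerated by a factor of `vector_width` (textbook Amdahl's law)."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must lie in [0, 1]")
    if vector_width < 1:
        raise ValueError("vector_width must be at least 1")
    # The scalar part runs at its original speed regardless of vector width.
    scalar_fraction = 1.0 - vector_fraction
    return 1.0 / (scalar_fraction + vector_fraction / vector_width)


if __name__ == "__main__":
    # Illustrative assumption: 90% of the runtime is vectorizable.
    for width in (4, 16, 64):
        print(f"vector width {width:>2}: speedup {amdahl_speedup(0.9, width):.2f}")
```

Under this assumed 90/10 split, the speedup can never reach 10x however wide the vector unit becomes, because the scalar remainder sets the ceiling. That is the sense in which mixed scalar and vector code limits even compute-intensive functions.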


Updated: 2021-03-01
