Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators
arXiv - CS - Hardware Architecture · Pub Date: 2020-11-30 · DOI: arxiv-2012.00158
Benjamin Y. Cho, Jeageun Jung, Mattan Erez

DL inference queries play an important role in diverse internet services, and a large fraction of datacenter cycles is spent processing them. In particular, the matrix-matrix multiplication (GEMM) operations of fully-connected MLP layers dominate many inference tasks. We find that, contrary to common assumptions, the GEMM operations of datacenter DL inference tasks are memory-bandwidth bound: (1) strict query latency constraints force small-batch operation, which limits reuse and increases bandwidth demands; and (2) large, colocated models require reading large weight matrices from main memory, again demanding high bandwidth while offering no reuse opportunities. We demonstrate the large potential of accelerating these small-batch GEMMs with processing in the main CPU memory. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploits GEMM locality and enables long-running PIM kernels despite the complex address-mapping functions employed by the CPU, which would otherwise destroy locality. Our evaluation of StepStone variants at the channel, device, and within-device PIM levels, together with optimizations that balance parallelism benefits against data-distribution overheads, demonstrates $12\times$ better minimum latency than a CPU and $2.8\times$ greater throughput under strict query latency constraints. End-to-end performance analysis of recent recommendation and language models shows that StepStone PIM outperforms a fast CPU (by up to $16\times$) and prior main-memory acceleration approaches (by up to $2.4\times$ over the best prior approach).
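To make the reuse argument concrete, the following rough roofline-style estimate (a sketch, not taken from the paper; the layer sizes and data types are assumptions) shows how the arithmetic intensity of a fully-connected layer Y = X * W scales with batch size: the K x N weight matrix must be streamed from main memory regardless of the batch, so FLOPs per byte grow roughly linearly with the batch and remain very low at the small batches that strict latency constraints allow.

```python
# Rough estimate (illustrative only) of why small-batch GEMM is bandwidth bound:
# Y = X * W with X of shape (batch, K) and W of shape (K, N) performs
# 2 * batch * K * N FLOPs while streaming the K x N weight matrix from memory,
# so arithmetic intensity is roughly proportional to the batch size.

def gemm_arithmetic_intensity(batch, K, N, bytes_per_elem=4):
    flops = 2 * batch * K * N
    # Weight traffic dominates when batch << K, N; activation traffic is small.
    bytes_moved = (batch * K + K * N + batch * N) * bytes_per_elem
    return flops / bytes_moved

# Hypothetical MLP layer sizes, chosen only for illustration.
K, N = 1024, 1024
for batch in (1, 4, 256):
    ai = gemm_arithmetic_intensity(batch, K, N)
    print(f"batch={batch:4d}: ~{ai:.1f} FLOP/byte")
```

At batch 1 this estimate gives roughly 0.5 FLOP per byte, far below the compute-to-bandwidth ratio of a modern server CPU, so execution time is set almost entirely by how quickly the weights can be read from memory; this is the bandwidth bottleneck that memory-side acceleration targets.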

Updated: 2020-12-02