ConvAix: An Application-Specific Instruction-Set Processor for the Efficient Acceleration of CNNs, IEEE Open Journal of Circuits and Systems

ConvAix: An Application-Specific Instruction-Set Processor for the Efficient Acceleration of CNNs
IEEE Open Journal of Circuits and Systems, Pub Date: 2020-11-16, DOI: 10.1109/ojcas.2020.3037758
Andreas Bytyn, Rainer Leupers, Gerd Ascheid

ConvAix is an application-specific instruction-set processor (ASIP) that enables the energy-efficient processing of convolutional neural networks (CNNs) while retaining substantial flexibility through its instruction-set architecture (ISA) based design. By combining data-level parallelism (DLP), instruction-level parallelism (ILP), and subword parallelism, the proposed design offers sufficient processing power to execute state-of-the-art CNNs in real time. ConvAix’s arithmetic logic units (ALUs) are C-programmable, offering the flexibility required to implement many different convolution layer types, e.g., depthwise-separable convolutions and residual blocks, as well as fully connected and pooling layers. The processor comprises a total of 256 ALUs and leverages low-precision computations down to 4 bits. Furthermore, it exploits sparsity in feature maps and weights via zero-guarding of redundant computations to maximize its energy efficiency. The processor was implemented in a modern 28 nm CMOS technology operating at a 1 V supply voltage, with a resulting clock frequency of 513 MHz. The final design offers a precision-dependent peak throughput between 263 GOP/s (int16) and 1.1 TOP/s (int4) while consuming between 972 mW and 340 mW of power, resulting in effective energy efficiencies ranging from 176 GOP/s/W to 2 TOP/s/W. Well-known CNNs, such as AlexNet, MobileNet, and ResNet-18, are simulated based on the placed-and-routed netlist, achieving between 233 (AlexNet) and 69 (ResNet-18) frames per second for a batch size of 1, including the time for off-chip transfers.
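The quoted peak-throughput figures can be sanity-checked from the ALU count and clock frequency given in the abstract. The sketch below is only a back-of-the-envelope estimate, not the authors' own derivation. It assumes that each ALU completes one multiply-accumulate per cycle (counted as 2 operations) and that int4 mode packs 4 subwords into each int16 lane. Neither assumption is stated in the abstract, and the names `peak_gops`, `OPS_PER_MAC`, and `SUBWORDS` are hypothetical.

```python
# Back-of-the-envelope check of the peak-throughput figures in the abstract.
# Assumptions (NOT stated in the abstract):
#   - each ALU performs one multiply-accumulate (MAC) per cycle, counted as 2 ops
#   - int4 mode processes 4 subwords per ALU per cycle relative to int16

NUM_ALUS = 256        # from the abstract
CLOCK_HZ = 513e6      # from the abstract (513 MHz)
OPS_PER_MAC = 2       # assumption: one multiply plus one add
SUBWORDS = {"int16": 1, "int4": 4}  # assumption: subword packing factor


def peak_gops(precision: str) -> float:
    """Return the estimated peak throughput in GOP/s for the given precision."""
    return NUM_ALUS * OPS_PER_MAC * SUBWORDS[precision] * CLOCK_HZ / 1e9


if __name__ == "__main__":
    for precision in SUBWORDS:
        print(f"{precision}: {peak_gops(precision):.1f} GOP/s")
```

Under these assumptions the estimate comes to roughly 263 GOP/s for int16 and about 1.05 TOP/s for int4, which is in line with the reported 263 GOP/s and 1.1 TOP/s. The effective energy efficiencies in the abstract are measured on real workloads and cannot be reproduced from peak numbers alone.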

Updated: 2021-01-12
