Special issue on SoC and AI processors, ETRI Journal



ETRI Journal (IF 1.3), Pub Date: 2020-08-10, DOI: 10.4218/etr2.12316
Ji‐Hoon Kim, Minjae Lee, Jongsun Park, Ho‐Young Cha


Artificial Intelligence (AI) has evolved into a general technology for a wide range of purposes and has been applied in all aspects of economy and society. It has already been extensively used in various fields, including medical services, finance, security, education, transportation, and logistics, and has led to the emergence of new commercial activities, business models, and game‐changing product applications. AI is a driving force for economic and social development at the forefront of the technological revolution and industrial transformation. Additionally, the System‐on‐a‐Chip (SoC) plays a vital role in post‐PC era products such as smartphones, tablets, and various wearable devices, where form factor, cost, and energy efficiency are critical drivers. An SoC contains multiple processing parts such as the central processing unit (CPU), graphics processing unit (GPU), image processing unit (IPU), digital signal processor (DSP), video encoder/decoder, modems, and neural processing unit (NPU). Specifically, AI processors, another name for NPUs, are optimized for the mathematics and algorithms commonly used by neural networks. They can run neural networks and machine learning tasks faster and more efficiently than CPUs.

In this special issue, we have selected papers that represent the current state‐of‐the‐art in AI processors as well as in essential SoC blocks covering radar, RF/analog, hardware security, and design methodology.

The first paper “40TFLOPS Artificial Intelligence Processor with Function‐safe Programmable Many‐Cores for ISO26262 ASIL‐D” by Jinho Han et al. presents an AI processor architecture that delivers high throughput for neural network acceleration while reducing the external memory bandwidth required for processing the neural network. For high throughput, the proposed super thread core (STC) includes 128 × 128 nano cores operating at a 1.2 GHz clock frequency, and a general‐purpose processor (GPP) core is integrated to control the STC and process AI algorithms. For functional safety, which has become very important in automotive systems, various microarchitectural techniques are adopted, including a self‐recovering cache and a dynamic lockstep (DLS) function, to achieve the ASIL‐D fault tolerance level of the ISO26262 standard. The entire AI processor, fabricated in a 28‐nm CMOS process, yields a peak performance of up to 40 TFLOPS at a 1.2 GHz operating frequency and 1.1 V supply voltage, with a measured energy efficiency of 1.3 TOPS/W and an ISO26262 ASIL‐D compliant single‐point fault tolerance rate of 99.64%.
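As a rough sanity check on these figures, the peak throughput can be estimated from the core count and the clock frequency. The sketch below assumes that each nano core completes one multiply‐accumulate per cycle, counted as two floating‐point operations; this per‐core rate is an assumption made for illustration and is not stated in the editorial.

```python
# Back-of-the-envelope peak throughput for a 128 x 128 array of nano cores.
# ASSUMPTION (not from the source): one multiply-accumulate per core per
# cycle, counted as 2 floating-point operations.
ROWS, COLS = 128, 128
CLOCK_HZ = 1.2e9
FLOPS_PER_CORE_PER_CYCLE = 2  # hypothetical: 1 multiply + 1 add


def peak_tflops(rows, cols, clock_hz, flops_per_cycle):
    """Return the theoretical peak throughput in TFLOPS."""
    return rows * cols * flops_per_cycle * clock_hz / 1e12


if __name__ == "__main__":
    peak = peak_tflops(ROWS, COLS, CLOCK_HZ, FLOPS_PER_CORE_PER_CYCLE)
    print(f"{peak:.2f} TFLOPS")
```

Under that assumption the estimate comes to about 39.3 TFLOPS, which is consistent with the quoted peak of up to 40 TFLOPS.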

The next paper, titled “An impulse radio (IR) radar SoC for through‐the‐wall human‐detection applications” by Piljae Park et al., proposes a through‐the‐wall radar (TTWR) SoC and its architecture, together with test standards and methods. The radar can be used at disaster scenes where visibility is limited by smoke, walls, and collapse debris. Additive reception based on coherent clocks, combined with reconfigurability, fulfills the demands of TTWR, and a clock‐based single‐chip IR radar transceiver is implemented in 130‐nm CMOS technology. By utilizing repetitive coherent clock schemes, the proposed SoC achieves signal‐to‐noise ratio (SNR) enhancements. Furthermore, the paper reports test results, under various pseudo‐disaster conditions, of a hand‐held prototype radar operating in real time with the proposed TTWR SoC.
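The general reason why repeated coherent reception raises the SNR can be illustrated with a small simulation. The sketch below is a generic textbook model, not the receiver described in the paper: it assumes perfectly aligned echoes and additive white Gaussian noise, and the function names are hypothetical. Averaging N repetitions leaves the echo unchanged while the noise power falls by a factor of N, so the ideal gain is 10·log10(N) dB.

```python
import math
import random
import statistics


def coherent_gain_db(n_pulses):
    """Ideal SNR gain (dB) from coherently adding n_pulses echoes in white noise."""
    return 10.0 * math.log10(n_pulses)


def simulated_gain_db(n_pulses, trials=2000, sigma=1.0, seed=0):
    """Estimate the gain by averaging n_pulses noisy receptions per trial.

    The echo is assumed identical in every repetition, so averaging leaves
    it unchanged and only the noise power shrinks. The gain is the ratio of
    single-shot noise power to averaged noise power.
    """
    rng = random.Random(seed)
    averaged_noise = [
        sum(rng.gauss(0.0, sigma) for _ in range(n_pulses)) / n_pulses
        for _ in range(trials)
    ]
    residual_power = statistics.pvariance(averaged_noise, mu=0.0)
    return 10.0 * math.log10(sigma**2 / residual_power)


if __name__ == "__main__":
    n = 64
    print(f"ideal:     {coherent_gain_db(n):.2f} dB")
    print(f"simulated: {simulated_gain_db(n):.2f} dB")
```

The simulated value should land close to the ideal one; any real receiver would fall short of it because of clock jitter, target motion, and non‐white interference.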

The third paper “AB9: a Neural Processor for Inference Acceleration” by Yong Cheol Peter Cho et al. presents a neural processor for inference acceleration built around a systolic tensor core (STC) that exploits the data‐reuse and parallelism characteristics inherent in neural networks while also providing fast access to large on‐chip memory. AB9 shows superior performance and power efficiency compared with a general‐purpose GPU (GPGPU) for YOLOv2, and has been fabricated in a 28‐nm CMOS process with a 40 TFLOPS STC that includes 32 k arithmetic units and over 36 MB of on‐chip SRAM.
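To make the idea of a systolic array concrete, the sketch below models a generic output‐stationary systolic schedule for matrix multiplication. It is a textbook illustration under stated assumptions, not the AB9 microarchitecture: operands enter skewed so that the k‐th operand pair reaches processing element (i, j) at cycle i + j + k, each element keeps its own accumulator, and the function name is hypothetical.

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle model of an output-stationary systolic array computing a @ b.

    Returns the result matrix and the number of multiply-accumulates
    performed in each cycle.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]  # one accumulator per processing element
    cycles = n + m + k_dim - 2         # the last operand pair reaches PE(n-1, m-1)
    macs_per_cycle = []
    for t in range(cycles):
        active = 0
        for i in range(n):
            for j in range(m):
                k = t - i - j  # skewed input: operand k arrives at cycle i + j + k
                if 0 <= k < k_dim:
                    acc[i][j] += a[i][k] * b[k][j]
                    active += 1
        macs_per_cycle.append(active)
    return acc, macs_per_cycle


if __name__ == "__main__":
    result, activity = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result)
    print(activity)
```

In this schedule each element of `a` is read once and then passed along a row of processing elements, and each element of `b` along a column, which is the kind of data reuse that systolic designs rely on to cut memory traffic.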

To alleviate the heavy computational and memory burdens of deep neural networks, the following paper “Automated Optimization for Memory‐efficient High Performance Deep Neural Network Accelerators” by Hyun Mi Kim et al. investigates an efficient memory structure and operating scheme that, together with dataflow control, provide an intuitive solution for high‐performance accelerators. The authors propose an efficient and flexible architecture that operates at high frequency despite its large memory size and PE array. They further improve the efficiency and usability of their architecture by presenting an automation algorithm for optimization. The experiments show that the proposed architecture with increased data reuse, such as a diagonal write path, improves performance by 1.44× on average across a wide range of neural networks. The automated optimizations improve the performance from 3.8× to 14.79×, which enhances usability even further.

Recently, the importance of security has emerged in various computing fields, such as mobile, biomedical, and automotive systems. The paper “An Analysis and Efficient Hardware Implementation of True Random Number Generator Based‐on Beta Source” by Seongmo Park et al. proposes an efficient hardware random number generator based on a beta source. The proposed generator generates the values “0” and “1” and provides a method to distinguish between pseudo‐random and true random numbers by comparing them with simple cumulative operations. The generator produces labeled data that indicate whether a count value is a true random number, based on the bit values of the binary count value and on a comparison with generated labeling data used as a reference. The generated random numbers pass the test procedures outlined in the SP800‐22 and SP800‐90B standards issued by the National Institute of Standards and Technology (NIST).
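For readers unfamiliar with the SP800‐22 suite, the sketch below implements its simplest member, the frequency (monobit) test, which checks whether ones and zeros occur in roughly equal proportion. It illustrates only this one test and says nothing about how the authors evaluated their generator; the function names are hypothetical, and the 0.01 significance level is the value conventionally used with the suite.

```python
import math


def monobit_p_value(bits):
    """SP800-22 frequency (monobit) test: p-value for a sequence of 0/1 bits."""
    n = len(bits)
    s_n = sum(1 if b else -1 for b in bits)  # map 1 -> +1, 0 -> -1 and sum
    s_obs = abs(s_n) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))


def passes_monobit(bits, alpha=0.01):
    """A sequence passes when its p-value is at least the significance level."""
    return monobit_p_value(bits) >= alpha


if __name__ == "__main__":
    sample = [int(c) for c in "1011010101"]  # toy 10-bit input
    print(f"p-value = {monobit_p_value(sample):.6f}")
    print("pass" if passes_monobit(sample) else "fail")
```

The 10‐bit input is only a toy example; meaningful testing needs far longer sequences and the full set of tests, and SP800‐90B adds separate entropy‐estimation procedures that are not sketched here.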

To improve design productivity in SoC development, high‐level synthesis (HLS) has become popular and has been used to automatically synthesize a register‐transfer level (RTL) circuit from a behavioral description written in a high‐level programming language such as C/C++. However, HLS tools often generate designs with a larger area overhead owing to unnecessary redundant instances. In the paper entitled “Function‐Level Module Sharing Techniques in High‐Level Synthesis” by Hiroki Nishigawa et al., the authors present two HLS techniques for module sharing at the function level and demonstrate their effectiveness with experimental results.

The paper “Field Programmable Analog Arrays for Implementation of Generalized Balanced OTA‐C Odd/Even‐nth‐Order Elliptic Filters” by Maha Diab and Soliman Mahmoud presents an architecture for a field‐programmable analog array that uses the operational transconductance amplifier (OTA) as its building block and can be used in analog signal processing units operating at low frequencies, such as those handling biopotential signals. The architecture eliminates the need for switches in the signal path and has a flexible structure. Moreover, this work presents a simplified direct circuit realization method for the synthesis of OTA‐C even/odd‐nth‐order elliptic filters. The proposed method results in an OTA‐C symmetric balanced structure with a minimum number of components and grounded capacitors for a balanced design.

The final paper “W‐band MMIC Chipset in 0.1 um mHEMT Technology” by Jong Min Lee et al. reports a 0.1‐μm metamorphic high electron mobility transistor (mHEMT) and a W‐band monolithic microwave integrated circuit (MMIC) chipset fabricated with in‐house technology to verify the performance and usability of the developed technology. The direct current (DC) characteristics of the mHEMT include a drain current density of 747 mA/mm and a maximum transconductance of 1.354 S/mm. The RF characteristics include a cut‐off frequency of 210 GHz and a maximum oscillation frequency of 252 GHz. To increase the frequency of an input signal, a frequency multiplier is developed that consists of three common‐source doublers connected in cascade. The authors also present a low‐noise amplifier with a four‐stage, single‐ended architecture with a common‐source stage and a W‐band IR module with an external off‐chip coupler.
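The effect of cascading three doublers is simple arithmetic: the input frequency is multiplied by 2³ = 8. The sketch below works this out and, as an illustration only, derives the input range that would map onto the W band. The 75 to 110 GHz band limits are the usual IEEE designation and are not taken from the editorial, and the paper's actual input frequency is not given here.

```python
W_BAND_GHZ = (75.0, 110.0)  # IEEE W-band designation; not stated in the editorial
DOUBLER_STAGES = 3          # three cascaded doublers, as described in the text


def multiplication_factor(stages):
    """Overall frequency multiplication of a cascade of doublers."""
    return 2 ** stages


def output_frequency_ghz(input_ghz, stages=DOUBLER_STAGES):
    """Output frequency of the cascade for a given input frequency."""
    return input_ghz * multiplication_factor(stages)


def input_range_ghz(band=W_BAND_GHZ, stages=DOUBLER_STAGES):
    """Input frequencies that land inside the given output band."""
    factor = multiplication_factor(stages)
    return band[0] / factor, band[1] / factor


if __name__ == "__main__":
    print(multiplication_factor(DOUBLER_STAGES))
    print(input_range_ghz())
    print(output_frequency_ghz(11.75))  # hypothetical input frequency
```

A cascade of this kind lets a comparatively low‐frequency source drive W‐band circuits, at the cost of conversion loss and added phase noise in each stage.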

The Guest Editors thank all the authors, reviewers, and editorial staff members of the ETRI Journal for making this special issue a success. We are most pleased to have been part of this effort and to see the timely publication of these high‐quality technical articles.


Updated: 2020-08-18
