Current journal: IEEE Micro
  • Path2SL: Leveraging InfiniBand Resources to Reduce Head-of-Line Blocking in Fat Trees
    IEEE Micro (IF 2.570) Pub Date : 2019-10-25
    German Maglione-Mathey; Jesus Escudero-Sahuquillo; Pedro Javier Garcia; Francisco J. Quiles; José Duato

    The number of endnodes in high-performance computing and datacenter systems is constantly increasing. Hence, it is crucial to minimize the impact of network congestion to guarantee suitable network performance. InfiniBand is a prominent interconnect technology that allows the implementation of efficient topologies and routing algorithms, as well as queuing schemes that reduce the head-of-line (HoL) blocking caused by congestion. Here, we thoroughly explain and evaluate a queuing scheme called Path2SL that optimizes the use of the InfiniBand Virtual Lanes to reduce HoL blocking in fat-tree network topologies.
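
    The article's own SL assignment is specific to Path2SL; the Python sketch below only illustrates the underlying idea of spreading fat-tree paths across Service Levels (SLs), which InfiniBand maps to Virtual Lanes (VLs). The destination-based hash and the num_sls parameter are assumptions for illustration, not the Path2SL mapping.

        # Illustrative sketch, not the Path2SL algorithm itself: assign each
        # destination a Service Level so that a congested flow stalls only
        # its own Virtual Lane instead of every flow on the physical link.
        def path_to_sl(dst_id: int, num_sls: int = 8) -> int:
            """Map a destination endnode to one of num_sls Service Levels."""
            return dst_id % num_sls

        # Example: 16 destinations spread over 8 SLs; at most 2 share a VL.
        sl_table = {dst: path_to_sl(dst) for dst in range(16)}
        print(sl_table)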

    Updated: 2020-01-17
  • A Bunch-of-Wires (BoW) Interface for Interchiplet Communication
    IEEE Micro (IF 2.570) Pub Date : 2019-10-30
    Ramin Farjadrad; Mark Kuemerle; Bapi Vinnakota

    Multichiplet system-in-package designs have recently received considerable attention as a mechanism to combat high SoC design costs and to economically manufacture large ASICs. These designs require low-power, area-efficient, off-die on-package die-to-die communication. Current technologies either extend on-die high-wire-count buses using silicon interposers or use off-package serial buses. The former approach leads to expensive packaging; the latter leads to complex and high-power designs. We propose a simple bunch-of-wires interface that combines ease of development with low-cost packaging techniques. We develop the interface and show how it can be used in multichiplet systems.

    Updated: 2020-01-17
  • Toward FPGA-Based HPC: Advancing Interconnect Technologies
    IEEE Micro (IF 2.570) Pub Date : 2019-10-31
    Joshua Lant; Javier Navaridas; Mikel Luján; John Goodacre

    HPC architects currently face myriad challenges from ever-tighter power constraints and changing workload characteristics. In this article, we discuss the current state of FPGAs within HPC systems. Recent technological advances show that they are well placed for penetration into the HPC market. However, a number of research problems still need to be overcome; we address the requirements for system architectures and interconnects to enable their proper exploitation, highlighting the necessity of allowing FPGAs to act as full-fledged peers within a distributed system rather than being attached to a CPU. We argue that this model requires a reliable, connectionless, hardware-offloaded transport supporting a global memory space. Our results show that our full hardware implementation gives latency improvements of up to 25% over a software-based transport, and that our solution can outperform the state of the art on HPC workloads such as matrix–matrix multiplication, achieving 10% higher computing throughput.
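
    As a concrete illustration of the transport concept, the sketch below encodes a global memory address as a (node, offset) pair inside a fixed-format, connectionless read-request datagram. The field layout, widths, and opcode are assumptions for illustration, not the article's wire protocol.

        # Hypothetical datagram layout for a connectionless remote read in a
        # global memory space; every field here is an assumption.
        import struct

        READ_REQ = struct.Struct(">BHQI")  # opcode, node id, offset, length

        def make_read(node: int, offset: int, length: int) -> bytes:
            """Build a self-contained read request; no connection state needed."""
            return READ_REQ.pack(0x01, node, offset, length)

        pkt = make_read(node=7, offset=0x1000, length=64)
        print(pkt.hex())  # 15 bytes: 01 0007 0000000000001000 00000040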

    Updated: 2020-01-17
  • Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects
    IEEE Micro (IF 2.570) Pub Date : 2019-10-30
    Ammar Ahmad Awan; Arpan Jain; Ching-Hsiang Chu; Hari Subramoni; Dhableswar K. Panda

    Heterogeneous high-performance computing systems with GPUs are equipped with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed deep learning (DL). In this article, we choose Horovod, a distributed training middleware, to analyze and profile various DNN training workloads using TensorFlow and PyTorch, in addition to standard MPI microbenchmarks. We use a wide variety of systems with CPUs like Intel Xeon and IBM POWER9, GPUs like Volta V100, and various interconnects to analyze the following metrics: 1) message size with Horovod's tensor fusion; 2) message size without tensor fusion; 3) number of MPI/NCCL calls; and 4) time taken by each MPI/NCCL call. We observed extreme performance variations for non-power-of-two message sizes on different platforms. To address this, we design a message-padding scheme for Horovod, demonstrate significantly smoother allreduce latency profiles, and report cases where end-to-end training time improves.
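
    The padding scheme itself lives inside Horovod's fusion pipeline; the sketch below only shows the core idea under simple assumptions: zero-pad a buffer to the next power-of-two length before a sum-allreduce (zeros leave the sum unchanged), then truncate the result. The function names are illustrative, not Horovod's API.

        import numpy as np

        def next_pow2(n: int) -> int:
            """Smallest power of two >= n (for n >= 1)."""
            return 1 << (n - 1).bit_length()

        def padded_allreduce(buf, allreduce):
            """Avoid non-power-of-two message sizes by padding, then truncating."""
            padded = np.zeros(next_pow2(buf.size), dtype=buf.dtype)
            padded[:buf.size] = buf            # zeros are a no-op for a sum
            return allreduce(padded)[:buf.size]

        # Single-process demo with an identity stand-in for the collective:
        out = padded_allreduce(np.arange(6, dtype=np.float32), lambda x: x)
        print(out)  # [0. 1. 2. 3. 4. 5.]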

    Updated: 2020-01-17
  • High-Quality Fault Resiliency in Fat Trees
    IEEE Micro (IF 2.570) Pub Date : 2019-10-30
    John Gliksberg; Antoine Capra; Alexandre Louvet; Pedro Javier García; Devan Sohier

    Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of supercomputers. In this article, we present Dmodc, a fast deterministic routing algorithm for parallel generalized fat trees (PGFTs), which minimizes congestion risk even under massive network degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula by relying on a fast preprocessing phase. This allows complete rerouting of networks with tens of thousands of nodes in less than a second. In turn, this greatly helps centralized fabric management react to faults with high-quality routing tables and has no impact on running applications in current and future very large scale high-performance computing clusters.
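
    Dmodc's closed-form expression and its fault-aware preprocessing are defined in the article; the sketch below shows only the general flavor of arithmetic fat-tree routing, using the classic d-mod-k rule as a stand-in: each switch computes its forwarding entries as a pure function of the destination, with no path search.

        def dmodk_uplink(dest: int, level: int, k: int) -> int:
            """Choose one of k uplinks for `dest` at a switch on `level` (0 = leaf)."""
            return (dest // k**level) % k

        # Every entry is independent arithmetic, so computing a full set of
        # forwarding tables is one pass over the destinations, which is what
        # makes sub-second rerouting of very large networks plausible.
        k, levels = 4, 3
        table = {(lvl, d): dmodk_uplink(d, lvl, k)
                 for lvl in range(levels) for d in range(k**levels)}
        print(table[(0, 37)], table[(1, 37)], table[(2, 37)])  # 1 1 2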

    Updated: 2020-01-17
  • A High-Throughput Network Processor Architecture for Latency-Critical Applications
    IEEE Micro (IF 2.570) Pub Date : 2019-12-10
    Sourav Roy; Arvind Kaushik; Rajkumar Agrawal; Joseph Gergen; Wim Rouwet; John Arends

    This article presents recent advancements in the Advanced IO Processor (AIOP), a network processor architecture designed by NXP Semiconductors. The AIOP is a multicore accelerated computing architecture where each core is equipped with dedicated hardware for rapid task switching on every hardware accelerator call. A hardware preemption controller snoops on accelerator completions and sends task preemption requests to the cores, thus reducing the latency of real-time tasks. A technique of priority thresholding is used to avoid latency uncertainty on lower priority tasks and head-of-line blocking. In this way, the AIOP handles the conflicting requirements of high throughput and low latency for next-generation wireless applications such as WiFi and 5G. In the presence of frequent preemptions, throughput drops by only 3% on the AIOP, compared to 25% on a similar network processor. Moreover, the absolute throughput and latency numbers are 2X better. The area and power overhead of adding hardware task scheduling and preemption is only about 3%.
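
    The article implements thresholding in hardware; the sketch below is a minimal software analogue under an assumed rule: a ready task preempts the running one only when its priority exceeds the running task's priority by a margin, which bounds how often lower-priority work is interrupted. The threshold value is an assumption.

        def should_preempt(ready_prio: int, running_prio: int,
                           threshold: int = 2) -> bool:
            """Preempt only when the priority gap justifies the switch cost."""
            return ready_prio >= running_prio + threshold

        # A slightly-higher-priority arrival does not preempt; a much
        # higher-priority one does, keeping low-priority latency predictable.
        print(should_preempt(5, 4), should_preempt(7, 4))  # False True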

    Updated: 2020-01-17
  • Warp: A Hardware Platform for Efficient Multimodal Sensing With Adaptive Approximation
    IEEE Micro (IF 2.570) Pub Date : 2020-01-14
    Phillip Stanley-Marbell; Martin Rinard

    In this article, we present Warp, the first open hardware platform designed explicitly to support research in approximate computing. Warp incorporates 21 sensors together with computation and circuit-level facilities for approximate computing research, in a 3.6 cm × 3.3 cm × 0.5 cm device. Warp supports a wide range of precision and accuracy versus power and performance tradeoffs.

    Updated: 2020-01-17
  • ΔNN: Power-Efficient Neural Network Acceleration Using Differential Weights
    IEEE Micro (IF 2.570) Pub Date : 2019-10-21
    Hoda Mahdiani; Alireza Khadem; Azam Ghanbari; Mehdi Modarressi; Farima Fattahi-Bayat; Masoud Daneshtalab

    The enormous and ever-increasing complexity of state-of-the-art neural networks has impeded the deployment of deep learning on resource-limited embedded and mobile devices. To reduce the complexity of neural networks, this article presents ΔNN, a power-efficient architecture that leverages a combination of the approximate value locality of neuron weights and the algorithmic structure of neural networks. ΔNN keeps each weight as its difference (Δ) to the nearest smaller weight: each weight reuses the calculations of the smaller weight, followed by a calculation on the Δ value to make up the difference. We also round the Δ up or down to the nearest power of two to further reduce complexity. The experimental results show that ΔNN boosts the average performance by 14%–37% and reduces the average power consumption by 17%–49% over some state-of-the-art neural network designs.
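
    A minimal software sketch of the differential-weight idea, assuming integer weights that all multiply the same input activation (as when one activation fans out to many neurons): sort the weights, keep each as a power-of-two-rounded delta to the previous one, and build each product from the previous product with one shift-add. The rounding makes the results approximate, which is the value locality the article exploits; this is an illustration, not the ΔNN hardware datapath.

        def round_pow2(v: int) -> int:
            """Round a nonnegative integer to the nearest power of two (0 stays 0)."""
            if v <= 0:
                return 0
            lo = 1 << (v.bit_length() - 1)
            return lo if v - lo <= 2 * lo - v else 2 * lo

        def delta_products(weights, x):
            """Approximate [w * x for w in weights] using differential weights."""
            order = sorted(range(len(weights)), key=lambda i: weights[i])
            products = [0] * len(weights)
            prev_w = prev_p = 0
            for i in order:
                delta = round_pow2(weights[i] - prev_w)  # power-of-two delta
                prev_w += delta                          # reconstructed weight
                prev_p += delta * x                      # one shift-add in hardware
                products[i] = prev_p
            return products

        print(delta_products([7, 3, 8, 3], x=5))  # [35, 10, 40, 15]; exact: [35, 15, 40, 15]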

    Updated: 2020-01-17
  • AutoML for Architecting Efficient and Specialized Neural Networks
    IEEE Micro (IF 2.570) Pub Date : 2019-11-12
    Han Cai; Ji Lin; Yujun Lin; Zhijian Liu; Kuan Wang; Tianzhe Wang; Ligeng Zhu; Song Han

    Efficient deep learning inference requires algorithm and hardware codesign to enable specialization: we usually need to change the algorithm to reduce the memory footprint and improve energy efficiency. However, the extra degree of freedom from the neural architecture design makes the design space much larger: it is not only about designing the hardware architecture but also about codesigning the neural architecture to fit the hardware architecture. It is difficult for human engineers to exhaust the design space by heuristics. We propose design automation techniques for architecting efficient neural networks given a target hardware platform. We investigate automatically designing specialized and fast models, automated channel pruning, and automated mixed-precision quantization. We demonstrate that such learning-based, automated design achieves better performance and efficiency than rule-based human design. Moreover, we shorten the design cycle by 200× compared to previous work, so that we can afford to design specialized neural network models for different hardware platforms.
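
    The article uses learned controllers; the toy sketch below only conveys the hardware-aware search loop they automate: score candidate channel-pruning ratios with an accuracy proxy, measure (here, model) latency on the target platform, and keep the most accurate candidate that meets the latency budget. All functions and constants are assumptions.

        def search(ratios, accuracy_proxy, measure_latency, budget_ms):
            """Pick the most accurate pruning ratio within the latency budget."""
            feasible = [(accuracy_proxy(r), r) for r in ratios
                        if measure_latency(r) <= budget_ms]
            return max(feasible)[1] if feasible else None

        # Toy model: more pruning lowers latency but also lowers accuracy.
        best = search(
            ratios=[0.0, 0.25, 0.5, 0.75],
            accuracy_proxy=lambda r: 0.76 - 0.10 * r,
            measure_latency=lambda r: 30.0 * (1.0 - r) + 5.0,
            budget_ms=28.0,
        )
        print(best)  # 0.25: the lightest pruning that fits the budget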

    Updated: 2020-01-17
  • In-Hardware Moving Compute to Data Model to Accelerate Thread Synchronization on Large Multicores
    IEEE Micro (IF 2.570) Pub Date : 2019-11-22
    Masab Ahmad; Halit Dogan; José A. Joao; Omer Khan

    In this article, the moving-compute-to-data (MC2D) model is proposed to accelerate thread synchronization by pinning shared data to dedicated cores and using in-hardware core-to-core messaging to communicate critical code execution. The MC2D model optimizes shared-data locality by eliminating unnecessary data movement, and it alleviates contended synchronization using nonblocking communication between threads. This article evaluates task-parallel algorithms under their synchronization-centric classification to demonstrate that the performance benefit of the MC2D model correlates with the number and frequency of synchronizations. The evaluation on a Tilera TILE-Gx72 multicore shows that the MC2D model delivers the highest performance-scaling gains for ordered and unordered algorithms that expose significant synchronization due to task- and data-level dependencies. The MC2D model is also shown to deliver performance on par with the traditional atomic-operations-based model for highly data-parallel algorithms from the unordered category.
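
    A minimal software analogue of the pattern, with Python queues standing in for the in-hardware core-to-core messages: shared state is owned by one server thread (the "data core"), and clients send it nonblocking messages instead of taking a lock, so the data never migrates between caches. This illustrates the delegation idea only, not the MC2D hardware mechanism.

        import threading, queue

        requests = queue.Queue()
        counter = 0  # shared data, touched only by the server thread

        def server():
            global counter
            while True:
                msg = requests.get()
                if msg is None:      # shutdown sentinel
                    break
                counter += msg       # critical section runs where the data lives

        def client(n):
            for _ in range(n):
                requests.put(1)      # nonblocking send; no lock contention

        srv = threading.Thread(target=server); srv.start()
        workers = [threading.Thread(target=client, args=(1000,)) for _ in range(4)]
        for w in workers: w.start()
        for w in workers: w.join()
        requests.put(None); srv.join()
        print(counter)  # 4000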

    Updated: 2020-01-17
  • [Front cover]
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07

    Presents the front cover for this issue of the publication.

    Updated: 2020-01-04
  • Keep Your Career Options Open
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07

    Advertisement, IEEE.

    Updated: 2020-01-04
  • Masthead
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07

    Provides a listing of current staff, committee members and society officers.

    Updated: 2020-01-04
  • Table of contents
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07

    Presents the table of contents for this issue of this publication.

    Updated: 2020-01-04
  • 3-D Chips! Chips are Getting Denser and Taller Than Ever!!
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07
    Lizy Kurian John

    Presents the introductory editorial for this issue of the publication.

    Updated: 2020-01-04
  • Security & Privacy
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07

    Advertisement, IEEE.

    Updated: 2020-01-04
  • Going Vertical: The Future of Electronics
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07
    Vijaykrishnan Narayanan

    This special issue provides an overview of the foundational advances enabling 3-D monolithic systems, reports on exciting new advances, and identifies open opportunities and challenges.

    Updated: 2020-01-04
  • Back-End-of-Line Compatible Transistors for Monolithic 3-D Integration
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07
    Suman Datta; Sourav Dutta; Benjamin Grisafe; Jeff Smith; Srivatsa Srinivasa; Huacheng Ye

    The manufacturers of high-performance logic have been ardent champions of Moore's law, which has resulted in an exponential increase in areal transistor density to 100 million transistors per square millimeter of silicon real estate. However, it is the memory chip makers who have taken the first step toward escaping the confines of scaling within the horizontal plane and have embraced the vertical, or third, dimension. The dynamic random access memory manufacturers have adopted stacked capacitors that tower above the silicon plane that hosts the access and peripheral transistors, whereas the NAND flash memory technologists can stack 128 layers of charge-trap flash cells on top of each other in a monolithic fashion. To enable monolithic three-dimensional (M3D) integration of high-performance logic, one needs to solve the fundamental challenge of low-temperature (<400 °C) in situ synthesis of high-mobility n-type and p-type semiconductor thin films that can be utilized for the fabrication of back-end-of-line (BEOL) compatible complementary MOS transistors under the constraint of a limited thermal budget. This article discusses recent progress in the selection and optimization of semiconductor materials for BEOL-compatible transistors to enable sequential M3D integration for a range of applications.

    Updated: 2020-01-04
  • Monolithic 3-D Integration
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07
    Mindy D. Bishop; H.-S. Philip Wong; Subhasish Mitra; Max M. Shulaker

    The demands of future applications in computing (from self-driving cars to bioinformatics) overwhelm the projected capabilities of current electronic systems. The need to process unprecedented amounts of loosely structured data is driving the push for ultradense and fine-grained integration of traditionally off-chip components (e.g., sensors, memories) with energy-efficient computation units, all within a single chip. Monolithic 3-D integration is a leading approach for building such future systems, as it naturally enables ultradense connectivity between various heterogeneous technologies inside a single chip. This article discusses exciting recent progress toward realizing monolithic 3-D systems and elucidates key benefits that these new systems offer. Monolithic 3-D integration promises to enable the next wave of gains in performance (both energy and speed) for coming generations of applications, as well as to provide the means for developing rich additional functionalities, such as sensing immersed in computation, that lie beyond the scope of traditional computing today.

    Updated: 2020-01-04
  • Recent Advances in Compute-in-Memory Support for SRAM Using Monolithic 3-D Integration
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07
    Zhixiao Zhang; Xin Si; Srivatsa Srinivasa; Akshay Krishna Ramanathan; Meng-Fan Chang

    Computing-in-memory (CiM) is a popular design alternative to overcome the von Neumann bottleneck and improve the performance of artificial intelligence computing applications. Monolithic three-dimensional (M3D) technology is a promising solution to extend Moore's law through the development of CiM for data-intensive applications. In this article, we first discuss the motivation and challenges associated with two-dimensional CiM designs, and then examine the possibilities presented by emerging M3D technologies. Finally, we review recent advances and trends in the implementation of CiM using M3D technology.

    Updated: 2020-01-04
  • IEEE Computer Society Has You Covered!
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07

    Advertisement, IEEE.

    Updated: 2020-01-04
  • A Logic-on-Memory Processor-System Design With Monolithic 3-D Technology
    IEEE Micro (IF 2.570) Pub Date : 2019-09-27
    Sai Pentapati; Lingjun Zhu; Lennart Bamberg; Da Eun Shim; Alberto García-Ortiz; Sung Kyu Lim

    In recent years, the size of transistors has been scaled down to a few nanometers and further shrinking will eventually reach the atomic scale. Monolithic three-dimensional (M3D) ICs use the third dimension for placement and routing, which helps reduce footprint and improve power and performance of circuits without relying on technology shrinking. This article explores the benefits of M3D ICs using OpenPiton, a scalable open-source Reduced Instruction Set Computer (RISC)-V-based multicore SoC. With a logic-on-memory 3-D integration scheme, we analyze the power and performance benefits of two OpenPiton single-tile systems with smaller and larger memory architectures. The logic-on-memory M3D design shows 36.8% performance improvement compared to the corresponding tile design in 2-D. In addition, at isoperformance, M3D shows 13.5% total power saving.

    Updated: 2020-01-04
  • Network-on-Chip Design Guidelines for Monolithic 3-D Integration
    IEEE Micro (IF 2.570) Pub Date : 2019-08-27
    Itir Akgun; Dylan Stow; Yuan Xie

    Monolithic three-dimensional (M3D) integration is viewed as a promising improvement over through-silicon-via-based 3-D integration due to its greater inter-tier connectivity, higher circuit density, and lower parasitic capacitance. With M3D integration, network-on-chip (NoC) communication fabric can benefit from reduced link distances and improved intra-router efficiency. However, the sequential fabrication methods utilized for M3D integration impose unique interconnect requirements for each of the possible partitioning schemes at transistor, gate, and block granularities. Further, increased cell density introduces contention of available routing resources. Prior work on M3D NoCs has focused on the benefits of reduced distances, but has not considered these process-imposed circuit complications. In this article, NoC topology decisions are analyzed in conjunction with these M3D interconnect requirements to provide an equivalent architectural comparison between M3D partitioning schemes.

    Updated: 2020-01-04
  • Monolithically Integrated RRAM- and CMOS-Based In-Memory Computing Optimizations for Efficient Deep Learning
    IEEE Micro (IF 2.570) Pub Date : 2019-11-08
    Shihui Yin; Yulhwa Kim; Xu Han; Hugh Barnaby; Shimeng Yu; Yandong Luo; Wangxin He; Xiaoyu Sun; Jae-Joon Kim; Jae-sun Seo

    Resistive RAM (RRAM) has been presented as a promising memory technology for deep neural network (DNN) hardware design, with nonvolatility, high density, a high ON/OFF ratio, and compatibility with the logic process. However, prior RRAM works for DNNs have shown limitations in parallelism for in-memory computing, array efficiency with large peripheral circuits, multilevel analog operation, and demonstration of monolithic integration. In this article, we propose circuit- and device-level optimizations to improve the energy and density of RRAM-based in-memory computing architectures. We report experimental results based on a prototype chip design of 128 × 64 RRAM arrays and CMOS peripheral circuits, where RRAM devices are monolithically integrated in a commercial 90-nm CMOS technology. We demonstrate CMOS peripheral circuit optimization using an input-splitting scheme and investigate the implications of a higher low-resistance state for energy efficiency and robustness. Employing the proposed techniques, we demonstrate RRAM-based in-memory computing with up to 116.0 TOPS/W energy efficiency and 84.2% CIFAR-10 accuracy. Furthermore, we investigate four-level programming with a single RRAM device, and report system-level performance and DNN accuracy results using the circuit-level benchmark simulator NeuroSim.

    Updated: 2020-01-04
  • Analyzing the Monolithic Integration of a ReRAM-Based Main Memory Into a CPU's Die
    IEEE Micro (IF 2.570) Pub Date : 2019-09-27
    Meenatchi Jagasivamani; Candace Walden; Devesh Singh; Luyi Kang; Shang Li; Mehdi Asnaashari; Sylvain Dubois; Bruce Jacob; Donald Yeung

    Nonvolatile memory, such as resistive RAM (ReRAM), is compatible with standard CMOS logic processes, allowing a sizable main memory system to be integrated into a CPU's die. ReRAM bitcells are fabricated within crosspoint subarrays that leave the bulk of transistors underneath the subarrays vacant. This permits placing the memory system over the CPU, improving area, parallelism, and power. Our work quantifies the impact of integrating ReRAM into a CPU's die. When integrating ReRAM over CPU logic, the best area efficiency occurs when 48% of the die is covered with ReRAM. The CPU's area increases by 18.8%, but we can recoup 35.5% of the die area by utilizing the free transistors underneath the crosspoint subarrays. When integrating ReRAM over CPU cache, up to 85.3% of the cache can be covered with ReRAM. Our work also shows that on-die ReRAM can support very high bandwidth through massively parallel memory access. At 28 nm, 4–16k independent ReRAM banks could be integrated onto the CPU die, providing 512–1024 GB/s of peak bandwidth. At more advanced technology nodes, 5–10 TB/s may be possible.

    Updated: 2020-01-04
  • MEMTI: Optimizing On-Chip Nonvolatile Storage for Visual Multitask Inference at the Edge
    IEEE Micro (IF 2.570) Pub Date : 2019-10-04
    Marco Donato; Lillian Pentecost; David Brooks; Gu-Yeon Wei

    The combination of specialized hardware and embedded nonvolatile memories (eNVM) holds promise for energy-efficient deep neural network (DNN) inference at the edge. However, integrating DNN hardware accelerators with eNVMs still presents several challenges. Multilevel programming is desirable for achieving maximal storage density on chip, but the stochastic nature of eNVM writes makes them prone to errors and further increases the write energy and latency. In this article, we present MEMTI, a memory architecture that leverages a multitask learning technique for maximal reuse of DNN parameters across multiple visual tasks. We show that by retraining and updating only 10% of all DNN parameters, we can achieve efficient model adaptation across a variety of visual inference tasks. The system performance is evaluated by integrating the memory with the open-source NVIDIA Deep Learning Accelerator (NVDLA) architecture.
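
    The sketch below illustrates the parameter-reuse idea in PyTorch under simple assumptions (PyTorch and the layer sizes are stand-ins, not the article's models): a shared backbone is frozen, and only a small per-task head is retrained, so the bulk of the weights can stay fixed in dense multilevel eNVM.

        import torch
        import torch.nn as nn

        backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # shared weights
        for p in backbone.parameters():
            p.requires_grad = False     # frozen: never rewritten on chip

        heads = {task: nn.Linear(64, 10) for task in ("scene", "object")}

        frozen = sum(p.numel() for p in backbone.parameters())
        per_task = sum(p.numel() for p in heads["scene"].parameters())
        print(frozen, per_task)  # 8256 650: the head is <10% of all parameters

        # Fine-tuning for one task updates only that task's head:
        opt = torch.optim.SGD(heads["scene"].parameters(), lr=1e-2)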

    Updated: 2020-01-04
  • Call for Papers: IEEE Transactions on Computers
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07

    Prospective authors are requested to submit new, unpublished manuscripts for inclusion in the upcoming event described in this call for papers.

    Updated: 2020-01-04
  • Antitrust in Three Acts
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07
    Shane Greenstein

    Economic analysis frames the debates in U.S. antitrust, and as such, it resembles an Italian opera. While the best economists in the world sing in front of judges, most of the audience loses something in the translation. Without a guide, it is easy for the spectacle to distract. Observers miss crucial details and lose the plot. That motivates today's topic: How does antitrust economics inform a case against the big four tech firms in the United States (Google/Alphabet, Facebook, Amazon, and Apple)? This question is under serious consideration today, so the answer is more than an academic exercise.

    Updated: 2020-01-04
  • IEEE COMPUTER SOCIETY
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07

    Advertisement, IEEE.

    Updated: 2020-01-04
  • Computing Edge
    IEEE Micro (IF 2.570) Pub Date : 2019-11-07

    Advertisement.

    Updated: 2020-01-04
Contents have been reproduced by permission of the publishers.