A 161.6 TOPS/W Mixed-mode Computing-in-Memory Processor for Energy-Efficient Mixed-Precision Deep Neural Networks (유회준교수 연구실)

A mixed-mode computing-in-memory (CIM) processor for mixed-precision deep neural network (DNN) processing is proposed. Because of their bit-serial processing of multi-bit data, previous CIM processors could not exploit the energy-efficient computation of mixed-precision DNNs. This paper proposes an energy-efficient mixed-mode CIM processor with two key features: 1) Mixed-Mode Mixed-precision CIM (M3-CIM), which achieves a 55.46% energy-efficiency improvement, and 2) a digital CIM for in-memory MAC that increases the throughput of M3-CIM. The proposed CIM processor was simulated in 28 nm CMOS technology and occupies 1.96 mm². It achieves a state-of-the-art energy efficiency of 161.6 TOPS/W with 72.8% accuracy on ImageNet (ResNet50).
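
To make the bit-serial argument concrete, here is a minimal Python sketch (an illustration, not the chip's datapath) of how a multi-bit dot product decomposes into shifted 1-bit partial sums, so the array-operation count, and hence energy, scales with the activation bit-width that mixed-precision DNNs vary per layer:

```python
import numpy as np

def bit_serial_dot(acts, weights, act_bits):
    """Dot product with bit-serial activations: each activation bit plane is
    applied to the weights and shifted into the accumulator, so the number of
    array operations scales linearly with the activation bit-width."""
    acc = 0
    for b in range(act_bits):
        plane = (acts >> b) & 1               # 1-bit slice of every activation
        acc += int(plane @ weights) << b      # partial sum weighted by 2^b
    return acc

acts = np.array([3, 1, 2, 0])                 # 2-bit unsigned activations
weights = np.array([1, -2, 3, 4])
assert bit_serial_dot(acts, weights, act_bits=2) == int(acts @ weights)
```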

Related papers:

Wooyoung Jo, Sangjin Kim, Juhyoung Lee, Soyeon Um, Zhiyong Li, and Hoi-jun Yoo, “A 161.6 TOPS/W Mixed-mode Computing-in-Memory Processor for Energy-Efficient Mixed-Precision Deep Neural Networks”, Int’l Symp. on Circuits and Systems (ISCAS), May 2022.

A 36.2 dB High SNR and PVT/Leakage-robust eDRAM Computing-In-Memory Macro with Segmented BL and Reference Cell Array (유회준교수 연구실)

Computing-in-memory (CIM) achieves high energy efficiency through analog DNN computation inside the memory macros. However, as the DNN size increases, the energy efficiency of CIM is reduced by external memory access (EMA). One promising solution is eDRAM-based CIM, which increases memory capacity with its high-density cell. Although eDRAM-CIM has a higher density than SRAM-CIM, it suffers from both poor robustness and a low signal-to-noise ratio (SNR). In this work, an energy-efficient eDRAM-CIM macro is proposed that improves computational robustness and SNR with three key features: 1) high-SNR voltage-based accumulation with a segmented BL architecture (SBLA), resulting in 17.1 dB higher SNR; 2) cancellation of PVT/leakage-induced errors with a common-mode error-canceling (CMEC) circuit, resulting in 51.4% PVT-variation reduction and 51.4% refresh-power reduction; and 3) a ReLU-based zero-gating ADC (ZG-ADC), reducing ADC power by up to 58.1%. With these features, the proposed eDRAM-CIM macro achieves 81.5-to-115.0 TOPS/W energy efficiency with 209-to-295 μW power consumption when performing 4b×4b MAC operations at a 250 MHz core frequency. The proposed macro also achieves 91.52% accuracy on the CIFAR-10 object classification dataset (ResNet-20) without accuracy drop even under PVT variation.
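
Among the three features, the ZG-ADC is the easiest to capture in a behavioral model: a single comparator first checks the sign of the analog MAC result, and the full conversion is gated off whenever ReLU would zero the output anyway. A minimal sketch; the comparator cost of 5% of a conversion is an assumed figure for illustration:

```python
import numpy as np

def zero_gating_adc(v_bl, v_ref, adc, energy_per_conv=1.0):
    """Behavioral ZG-ADC model: a single comparator first checks the sign of
    the analog MAC voltage, and the full conversion is gated off (output 0)
    whenever ReLU would zero the result anyway."""
    outputs, energy = [], 0.0
    for v in v_bl:
        energy += 0.05 * energy_per_conv   # assumed comparator cost (illustrative)
        if v <= v_ref:                     # ReLU(negative) == 0: skip conversion
            outputs.append(0)
        else:
            outputs.append(adc(v - v_ref)) # full multi-bit conversion
            energy += energy_per_conv
    return np.array(outputs), energy

# Toy 4-bit ADC over a 0.5 V positive range; about half the MAC results are negative.
quantize = lambda v: min(int(v / 0.5 * 15), 15)
v_bl = np.random.uniform(-0.5, 0.5, 1000)
out, e = zero_gating_adc(v_bl, v_ref=0.0, adc=quantize)
print(f"ADC energy vs. always-convert: {e / len(v_bl):.2f}x")
```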

Related papers:

Ha, Sangwoo, et al. “A 36.2 dB High SNR and PVT/Leakage-robust eDRAM Computing-In-Memory Macro with Segmented BL and Reference Cell Array.” IEEE Transactions on Circuits and Systems II: Express Briefs (2022).

Ha, Sangwoo, et al. “A 36.2 dB High SNR and PVT/Leakage-robust eDRAM Computing-In-Memory Macro with Segmented BL and Reference Cell Array.” IEEE International Symposium on Circuits and Systems (ISCAS), May 2022.

An Energy-efficient High-quality FHD Super-resolution Mobile Accelerator SoC with Hybrid-precision and Energy-efficient Cache Subsystem (유회준교수 연구실)

With the rise of contactless communication and streaming services, super-resolution (SR) on mobile devices has become one of the most important image processing technologies. The popularity of high-end application processors (APs) and high-resolution displays in mobile devices also drives the development of lightweight mobile SR-CNNs, which show high reconstruction quality. However, the large size and wide dynamic range of both images and the intermediate feature maps in CNN hidden layers pose challenges for mobile platforms. Constrained by the limited power and shared bandwidth of mobile platforms, a low-power and energy-efficient architecture is required.

This paper presents an image processing SoC targeting the non-sparse SR task. It contributes the following two key features: 1) a heterogeneous architecture with 8-bit-only FP-FXP hybrid precision for the SR task, and 2) a data-lifetime-aware, two-way-optimized cache subsystem for energy-efficient depth-first image processing. With highly optimized heterogeneous cores and cache subsystem, our SoC achieves 2.6× higher energy efficiency than the previous SRNPU and a 107 frames-per-second (fps) frame rate for 4× SR image generation to Full-HD scale with 0.92 mJ/frame energy consumption.
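
The depth-first idea behind the cache subsystem can be sketched in a few lines: push one image stripe through every layer before fetching the next, so intermediate feature maps have short lifetimes and stay on-chip. A toy version with per-pixel layers; a real convolution pipeline would also need halo rows between stripes:

```python
import numpy as np

def depth_first_pipeline(image, layers, stripe_h=32):
    """Push each horizontal stripe through *all* layers before fetching the
    next one, so intermediate feature maps live only as long as one stripe
    (short data lifetime -> they fit in the on-chip cache instead of DRAM)."""
    out = []
    for y in range(0, image.shape[0], stripe_h):
        stripe = image[y:y + stripe_h]        # small working set
        for layer in layers:                  # layer-fused execution
            stripe = layer(stripe)
        out.append(stripe)
    return np.concatenate(out, axis=0)

# Toy per-pixel layers; real conv layers would need halo rows between stripes.
layers = [lambda x: np.maximum(x, 0.0), lambda x: x * 2.0]
img = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(depth_first_pipeline(img, layers), layers[1](layers[0](img)))
```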

Related papers:

Z. Li, S. Kim, D. Im, D. Han and H. -J. Yoo, “An 0.92 mJ/frame High-quality FHD Super-resolution Mobile Accelerator SoC with Hybrid-precision and Energy-efficient Cache,” 2022 IEEE Custom Integrated Circuits Conference (CICC), 2022.

A Low-Power Graph Convolutional Network Processor with Sparse Grouping for 3D Point Cloud Semantic Segmentation in Mobile Devices (유회준교수 연구실)

A low-power graph convolutional network (GCN) processor is proposed for accelerating 3D point cloud semantic segmentation (PCSS) in real time on mobile devices. Three key features enable low-power GCN-based 3D PCSS. First, a new hardware-friendly GCN algorithm, sparse grouping-based dilated graph convolution (SG-DGC), is proposed. SG-DGC reduces 71.7% of the overall computation and 76.9% of external memory access (EMA) through sparse grouping of the point cloud. Second, a two-level pipeline (TLP) consisting of a point-level pipeline (PLP) and a group-level pipeline (GLP) is proposed to improve the low utilization caused by the imbalanced workload of GCN. The PLP enables point-level module-wise fusion (PMF), which reduces 47.4% of EMA for low power consumption. Also, center point feature reuse (CPFR) reuses the computation results of redundant operations and reduces 11.4% of computation. Finally, the GLP increases core utilization by 21.1% by balancing the workloads of graph generation and graph convolution, enabling 1.1× higher throughput. The processor is implemented in 65 nm CMOS technology, and the 4.0 mm² 3D PCSS processor shows 95 mW power consumption while operating in real time at 30.8 fps on 3D PCSS of an indoor scene with 4k points.
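
The grouping step of SG-DGC can be roughly pictured in Python: bucket points into coarse cells and search neighbors only within a cell, which is where the computation and EMA savings come from. This toy version uses a simple voxel grid and omits the paper's exact grouping and dilation rules:

```python
import numpy as np

def sparse_group_knn(points, k=4, cell=0.5):
    """Toy sparse grouping: bucket points into coarse voxel cells and search
    neighbors only inside each cell, so the pairwise-distance work drops from
    O(N^2) to roughly O(N^2 / #groups)."""
    keys = np.floor(points / cell).astype(np.int64)
    groups = {}
    for i, key in enumerate(map(tuple, keys)):
        groups.setdefault(key, []).append(i)
    neighbors = {}
    for idx in groups.values():
        sub = points[idx]
        d = np.linalg.norm(sub[:, None] - sub[None, :], axis=-1)
        order = np.argsort(d, axis=1)[:, 1:k + 1]   # column 0 is the point itself
        for row, i in enumerate(idx):
            neighbors[i] = [idx[j] for j in order[row]]
    return neighbors

pts = np.random.rand(1000, 3)                       # toy indoor-scene point cloud
nbrs = sparse_group_knn(pts)
```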

Related papers:

Kim, Sangjin, et al. “A Low-Power Graph Convolutional Network Processor With Sparse Grouping for 3D Point Cloud Semantic Segmentation in Mobile Devices.” IEEE Transactions on Circuits and Systems I: Regular Papers (2022).

Kim, Sangjin, et al. “A 54.7 fps 3D Point Cloud Semantic Segmentation Processor with Sparse Grouping based Dilated Graph Convolutional Network for Mobile Devices.” 2020 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2020.

TSUNAMI: Triple Sparsity-aware Ultra Energy-efficient Neural Network Training Accelerator with Multi-modal Iterative Pruning (유회준교수 연구실)

This paper proposes TSUNAMI, a processor that supports energy-efficient deep neural network training. TSUNAMI supports multi-modal iterative pruning to generate zeros in both activations and weights. A tile-based dynamic activation pruning unit and a weight-memory-shared pruning unit eliminate the additional memory accesses that pruning would otherwise require. A coarse-zero skipping controller skips multiple unnecessary multiply-and-accumulate (MAC) operations at once, while a fine-zero skipping controller skips the remaining randomly located unnecessary MAC operations. A weight-sparsity balancer resolves the utilization drop caused by weight-sparsity imbalance, and the workload of each convolution core is assigned by a random channel allocator. TSUNAMI achieves an energy efficiency of 3.42 TFLOPS/W at 0.78 V and 50 MHz with floating-point 8-bit activations and weights, and a world-best energy efficiency of 405.96 TFLOPS/W at a 90% sparsity condition.
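
A minimal software analogue of the two skipping controllers is shown below; the tile width of 8 is an assumption for illustration. The coarse check drops an all-zero weight tile in one step, and the fine check skips the remaining individual zero operands:

```python
import numpy as np

def sparse_mac(acts, weights, tile=8):
    """Two-level zero skipping: the coarse check drops a whole weight tile
    whose entries are all zero in a single step, and the fine check skips the
    remaining individual zero operands inside surviving tiles."""
    acc, macs = 0, 0
    for t in range(0, len(weights), tile):
        w_tile = weights[t:t + tile]
        if not w_tile.any():                  # coarse skip: all-zero tile
            continue
        for a, w in zip(acts[t:t + tile], w_tile):
            if a == 0 or w == 0:              # fine skip: single zero operand
                continue
            acc += int(a) * int(w)
            macs += 1
    return acc, macs

acts = np.random.randint(-4, 5, 64)
weights = np.random.randint(-4, 5, 64)
weights[np.random.rand(64) < 0.9] = 0         # emulate ~90% pruned weights
acc, macs = sparse_mac(acts, weights)
assert acc == int(acts @ weights)
print(f"MACs executed: {macs}/64")
```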

Related papers:

Kim, Sangyeob, et al. “TSUNAMI: Triple Sparsity-Aware Ultra Energy-Efficient Neural Network Training Accelerator With Multi-Modal Iterative Pruning.” IEEE Transactions on Circuits and Systems I: Regular Papers (2022).

Neuro-CIM: A 310.4 TOPS/W Neuromorphic Computing-in-Memory Processor with Low WL/BL activity and Digital-Analog Mixed-mode Neuron Firing (유회준교수 연구실)

Previous computing-in-memory processors could not achieve high energy efficiency above 100 TOPS/W because they require multi-WL driving for parallel processing and many high-precision ADCs for high accuracy. Some processors exploited heavy weight-data sparsity to obtain high energy efficiency, but in practical cases (e.g., an ImageNet task with ResNet-18) the sparsity is limited to below 30%, so their performance is limited.

In this paper, an energy-efficient neuromorphic computing-in-memory processor is proposed with four key features. Most-significant-bit (MSB) word skipping is proposed to reduce BL activity, and early stopping is proposed to achieve even lower BL activity. In addition, mixed-mode firing is proposed for aggregation across multiple macros, and voltage folding is proposed to extend the dynamic range. As a result, the proposed computing-in-memory processor achieves world-best energy efficiencies of 62.1 TOPS/W (I=4b, W=8b) and 310.4 TOPS/W (I=4b, W=1b).
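
A behavioral sketch of MSB-first word processing with early stopping for a neuron-firing decision is given below, assuming unsigned inputs and weights; the stopping bound and threshold value are illustrative rather than the silicon implementation:

```python
import numpy as np

def msb_first_fire(acts, weights, w_bits, theta):
    """Bit-serial firing decision, MSB first, with early stopping: once the
    bound on the not-yet-read LSB planes cannot move the partial sum across
    the firing threshold, the remaining BL accesses are skipped. Assumes
    unsigned activations and weights, so later planes only add."""
    acc = 0.0
    max_plane = float(np.sum(acts))            # largest possible 1-bit plane sum
    for planes, b in enumerate(range(w_bits - 1, -1, -1), start=1):
        acc += float(((weights >> b) & 1) @ acts) * (1 << b)
        remaining = max_plane * ((1 << b) - 1) # bound on all lower bit planes
        if acc > theta or acc + remaining <= theta:
            break                              # outcome already decided
    return acc > theta, planes

acts = np.random.randint(0, 16, 256)           # 4-bit inputs
weights = np.random.randint(0, 256, 256)       # 8-bit weights
fire, planes = msb_first_fire(acts, weights, w_bits=8, theta=2.4e5)
print(f"bit planes read: {planes}/8")
```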

Related papers:

Kim et al., “Neuro-CIM: A 310.4TOPS/W Neuromorphic Computing-in-Memory Processor with Low WL/BL activity and Digital-Analog Mixed-mode Neuron Firing”, Symposium on VLSI Circuits (S. VLSI), Jun. 2022.

DSPU: A 281.6mW Real-Time Depth Signal Processing Unit for Deep Learning-Based Dense RGB-D Data Acquisition with Depth Fusion and 3D Bounding Box Extraction in Mobile Platforms (유회준교수 연구실)

A low-latency and low-power dense RGB-D acquisition and 3D bounding-box extraction system-on-chip, DSPU, is proposed. The DSPU produces accurate dense RGB-D data through CNN-based monocular depth estimation and sensor fusion with a low-power ToF sensor. Furthermore, it runs a 3D point cloud-based neural network for 3D bounding-box extraction. The architecture of the DSPU accelerates the system by alleviating both data-intensive and computation-intensive operations. Finally, the DSPU achieves real-time end-to-end RGB-D data acquisition and 3D bounding-box extraction at 281.6 mW.
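
The fusion step can be pictured as a confidence-weighted blend of the dense CNN depth with the sparse metric ToF measurements. The DSPU's actual fusion is CNN-based, so this is only a stand-in for the data flow:

```python
import numpy as np

def fuse_depth(cnn_depth, tof_depth, tof_conf, w=0.8):
    """Confidence-weighted depth fusion: where the ToF sensor has a confident
    return, pull the dense CNN estimate toward the metric ToF measurement;
    elsewhere keep the CNN prediction unchanged."""
    alpha = w * tof_conf                      # per-pixel fusion weight in [0, w]
    return alpha * tof_depth + (1.0 - alpha) * cnn_depth

cnn = np.random.uniform(0.5, 5.0, (240, 320)).astype(np.float32)   # dense estimate
tof = cnn + np.random.normal(0, 0.05, cnn.shape).astype(np.float32)
conf = (np.random.rand(240, 320) < 0.3).astype(np.float32)         # sparse returns
fused = fuse_depth(cnn, tof, conf)
```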

Related papers:

Im, Dongseok, et al. “DSPU: A 281.6 mW Real-Time Depth Signal Processing Unit for Deep Learning-Based Dense RGB-D Data Acquisition with Depth Fusion and 3D Bounding Box Extraction in Mobile Platforms.” 2022 IEEE International Solid-State Circuits Conference (ISSCC). Vol. 65. IEEE, 2022.

HNPU: An Adaptive DNN Training Processor Utilizing Stochastic Dynamic Fixed-point and Active Bit-precision Searching (유회준교수 연구실)

This paper presents HNPU, an energy-efficient DNN training processor adopting algorithm-hardware co-design. The HNPU supports stochastic dynamic fixed-point representation and a layer-wise adaptive-precision searching unit for low-bit-precision training. It additionally utilizes slice-level reconfigurability and sparsity to maximize its efficiency in both DNN inference and training. An adaptive-bandwidth reconfigurable accumulation network enables reconfigurable DNN allocation and maintains high core utilization even under various bit-precision conditions. Fabricated in a 28 nm process, the HNPU achieves at least 5.9× higher energy efficiency and 2.5× higher area efficiency in actual DNN training compared with previous state-of-the-art on-chip learning processors.
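
Stochastic dynamic fixed-point is straightforward to emulate: choose the fractional bit-width per tensor from its dynamic range, then round stochastically so the quantization error is zero-mean, which keeps tiny training updates from being systematically lost. A sketch; the bit-width selection rule here is a plausible choice rather than HNPU's exact unit:

```python
import numpy as np

def stochastic_dfxp(x, word_bits=8, rng=np.random.default_rng(0)):
    """Dynamic fixed-point with stochastic rounding: the fractional bit-width
    is chosen per tensor from its dynamic range, and each value rounds up with
    probability equal to its fractional part, making the quantization error
    zero-mean so tiny training updates are not systematically lost."""
    max_abs = float(np.max(np.abs(x))) + 1e-12
    int_bits = int(np.ceil(np.log2(max_abs))) + 1           # sign + integer field
    frac_bits = word_bits - int_bits                        # dynamic scale 2^-frac
    scaled = x * (2.0 ** frac_bits)
    floor = np.floor(scaled)
    q = floor + (rng.random(x.shape) < (scaled - floor))    # stochastic round
    lim = 2 ** (word_bits - 1)
    q = np.clip(q, -lim, lim - 1)
    return q / (2.0 ** frac_bits), frac_bits

grad = np.random.randn(512) * 1e-2
gq, fb = stochastic_dfxp(grad)
print(f"chosen fractional bits: {fb}, mean error: {np.mean(gq - grad):+.2e}")
```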

Related papers:

Han, Donghyeon, et al. “HNPU: An adaptive DNN training processor utilizing stochastic dynamic fixed-point and active bit-precision searching.” IEEE Journal of Solid-State Circuits 56.9 (2021): 2858-2869.

ECIM: Exponent Computing in Memory for an Energy-Efficient Heterogeneous Floating-Point Processor (유회준교수 연구실)

The authors propose a heterogeneous floating-point (FP) computing architecture that maximizes energy efficiency by separately optimizing exponent processing and mantissa processing. The proposed exponent-computing-in-memory (ECIM) architecture and mantissa-free-exponent-computing (MFEC) algorithm reduce the power consumption of both memory and FP MAC operations while resolving the limitations of previous FP computing-in-memory processors. Also, a bfloat16 DNN training processor with the proposed features and sparsity-exploitation support is implemented and fabricated in 28 nm CMOS technology. It achieves 13.7 TFLOPS/W energy efficiency while supporting FP operations with a CIM architecture.
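
The heterogeneous split is easiest to see on a single bfloat16 multiply: the exponent path reduces to a plain integer addition, which is the operation ECIM can evaluate inside the memory array, while the small mantissa multiply remains in outside logic. A simplified model that ignores subnormals and overflow:

```python
import struct

def split_bf16(x):
    """Treat x as bfloat16 (the top 16 bits of its float32 encoding) and
    unpack the sign, 8-bit exponent, and 7-bit mantissa fields."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0] >> 16
    return (bits >> 15) & 1, (bits >> 7) & 0xFF, bits & 0x7F

def bf16_mul(a, b):
    """FP multiply split into heterogeneous paths: the exponent path is a
    plain integer addition (the part ECIM evaluates in memory), while the
    mantissa path is a small 8b x 8b integer multiply in outside logic."""
    sa, ea, ma = split_bf16(a)
    sb, eb, mb = split_bf16(b)
    sign = sa ^ sb
    exp = ea + eb - 127                 # exponent path: integer add (in-memory)
    man = (0x80 | ma) * (0x80 | mb)     # mantissa path: restore implicit 1s
    if man & (1 << 15):                 # product in [2, 4): renormalize
        exp, man = exp + 1, man >> 1
    return (-1.0 if sign else 1.0) * (man / (1 << 14)) * 2.0 ** (exp - 127)

assert bf16_mul(1.5, 2.0) == 3.0
assert bf16_mul(-3.0, 0.5) == -1.5
```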

Related papers:

  1. Lee et al., “ECIM: Exponent Computing in Memory for an Energy-Efficient Heterogeneous Floating-Point DNN Training Processor,” in IEEE Micro, Jan 2022
  2. Lee et al., “An Energy-efficient Floating-Point DNN Processor using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory”, 2021 IEEE Hot Chips 33 Symposium (HCS), 2021
  3. Lee et al., “A 13.7 TFLOPS/W Floating-point DNN Processor using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory,” 2021 Symposium on VLSI Circuits, 2021

 

OmniDRL: An Energy-efficient Deep Reinforcement Learning Processor (유회준교수 연구실)

We present an energy-efficient deep reinforcement learning (DRL) processor, OmniDRL, for DRL training on edge devices. Recently, the need for DRL training has been growing because DRL can be adapted to the distinct characteristics of each user. However, the massive amount of external and internal memory access limits the implementation of DRL training on resource-constrained platforms. OmniDRL proposes four key features that reduce external memory access by compressing as much data as possible and reduce internal memory access by directly processing the compressed data. Group-sparse training (GST) enables a high weight-compression ratio for every DRL iteration. A group-sparse training core is proposed to take full advantage of the compressed weights from GST. Exponent mean delta encoding additionally compresses the exponents of both weights and feature maps. A world-first on-chip sparse weight transposer enables DRL training on compressed weights without an off-chip transposer. As a result, OmniDRL, fabricated in 28 nm CMOS technology with a 3.6×3.6 mm² die area, achieves 7.42 TFLOPS/W energy efficiency for training a robot agent (MuJoCo HalfCheetah, TD3), which is 2.4× higher than the previous state-of-the-art.
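
Exponent mean delta encoding is simple to emulate: DNN weight and feature-map exponents cluster tightly, so storing one base exponent per group plus small signed deltas compresses the 8-bit exponent field. A toy encoder; the group size of 16 is an assumption:

```python
import numpy as np

def exp_mean_delta_encode(exps, group=16):
    """Toy exponent mean-delta encoder: per group of values, store one base
    exponent plus small signed per-value deltas. DNN weight/feature exponents
    cluster tightly, so most deltas need far fewer than the native 8 bits."""
    bases, deltas = [], []
    for g in range(0, len(exps), group):
        chunk = exps[g:g + group]
        base = int(np.round(chunk.mean()))
        bases.append(base)
        deltas.append(chunk - base)            # small signed residuals
    return bases, deltas

def delta_bits(d):
    return max(1, int(np.max(np.abs(d))).bit_length() + 1)  # sign + magnitude

w = (np.random.randn(1024) * 0.05).astype(np.float32)
exps = ((w.view(np.uint32) >> 23) & 0xFF).astype(np.int64)  # IEEE-754 exponents
bases, deltas = exp_mean_delta_encode(exps)
decoded = np.concatenate([b + d for b, d in zip(bases, deltas)])
assert np.array_equal(decoded, exps)                         # lossless
print(f"avg delta width: {np.mean([delta_bits(d) for d in deltas]):.1f} b vs. 8 b raw")
```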

Related papers:

  1. Lee et al., “OmniDRL: An Energy-Efficient Deep Reinforcement Learning Processor With Dual-Mode Weight Compression and Sparse Weight Transposer,” in IEEE Journal of Solid-State Circuits, April 2022
  2. Lee et al., “OmniDRL: An Energy-Efficient Mobile Deep Reinforcement Learning Accelerators with Dual-mode Weight Compression and Direct Processing of Compressed Data”, 2021 IEEE Hot Chips 33 Symposium (HCS), 2021
  3. Lee et al., “OmniDRL: A 29.3 TFLOPS/W Deep Reinforcement Learning Processor with Dual-mode Weight Compression and On-chip Sparse Weight Transposer,” 2021 Symposium on VLSI Circuits, 2021
  4. Lee et al., “Low-power Autonomous Adaptation System with Deep Reinforcement Learning,” 2022 AICAS, 2022
  5. Lee et al., “Energy-Efficient Deep Reinforcement Learning Accelerator Designs for Mobile Autonomous Systems,” 2021 AICAS, 2021
