KAIST Prof. Joo-Young Kim's Center (AI Semiconductor System Research Center) Receives the Minister of Science and ICT Award


[Prof. Joo-Young Kim (Center Director)]
 
On November 10, the AI Semiconductor System Research Center (AISS), led by KAIST Prof. Joo-Young Kim, received the Minister of Science and ICT Award in recognition of its outstanding achievements in talent development.
 
AISS, directed by Prof. Kim, has been carrying out the Ministry of Science and ICT's University ICT Research Center (ITRC) support program since 2020, working on many fronts to develop core and applied AI semiconductor technologies for the coming AI-driven society and to train master's- and doctoral-level specialists.
 
In 2021 in particular, the center steadily trained 96 student researchers through diverse programs such as internships, technology transfer, entrepreneurship education, and creative self-directed projects. Notably, 42% of them recorded achievements such as earning master's or doctoral degrees or joining global companies in the AI semiconductor field, setting an example for other centers.
 
AISS currently conducts active research under principal investigator Prof. Joo-Young Kim, with project leads including Profs. Hoi-Jun Yoo, Lee-Sup Kim, In-Cheol Park, Seung-Tak Ryu, and Hyun-Sik Kim of KAIST; Han-Jun Kim and Jin-Ho Song of Yonsei University; Ji-Hoon Kim and Sung-Min Park of Ewha Womans University; and Kyu-Ho Lee of UNIST. With some 110 master's- and doctoral-level researchers participating, a 10% increase over 2021, the center is striding toward becoming Korea's true hub for AI semiconductors.
 
Accepting the award, principal investigator Prof. Joo-Young Kim remarked, “Building on our universities' ICT and intelligent-semiconductor capabilities, we will continue to strengthen ties with Korea's leading universities and companies and play a pivotal role in training the system-semiconductor workforce essential for Korea to become a true semiconductor powerhouse.”
 
[Award ceremony photo]
 

 

Sunwha Koh, Younggwang Jung, Daijoon Hyun, and Youngsoo Shin, “Routability optimization for extreme aspect ratio design using convolutional neural network,” Proc. Int'l Symp. on Circuits and Systems (ISCAS), May 2021.

  • Title: Routability optimization for extreme aspect ratio design using convolutional neural network
  • Authors: Sunwha Koh, Younggwang Jung, Daijoon Hyun, Youngsoo Shin
  • Abstract:

     Circuits placed with a very low (or high) aspect ratio are susceptible to routing overflows. Such designs are difficult to close and usually end up with larger area and low area utilization. We observe that a non-uniform setting of the utilization target greatly helps in these designs, specifically low utilization in the center and gradually higher utilization toward the ends. We introduce a convolutional neural network (CNN) model to predict the setting of utilization target values. Experiments indicate that routing congestion overflows are reduced by 29% on average across test designs, along with a 40% reduction in wirelength.

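The key idea above, a utilization target that is lowest at the die center and rises toward both ends, can be illustrated without the trained CNN. Below is a minimal Python/NumPy sketch of such a non-uniform target profile; the quadratic shape and the parameter names (`u_center`, `u_end`) are illustrative assumptions, not the authors' model.

```python
import numpy as np

def utilization_profile(num_bins, u_center=0.55, u_end=0.75):
    """Hypothetical non-uniform utilization targets along the long axis
    of an extreme-aspect-ratio floorplan: lowest in the center, rising
    smoothly toward both ends (the kind of setting the paper's CNN predicts)."""
    # Normalized position in [-1, 1], where 0 is the die center.
    x = np.linspace(-1.0, 1.0, num_bins)
    # Quadratic interpolation between the center and end targets.
    return u_center + (u_end - u_center) * x**2

# Example: 20 placement bins across a 10:1 aspect-ratio die.
targets = utilization_profile(20)
print(np.round(targets, 3))  # low in the middle, ~0.75 at the ends
```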

ARCHON: A 332.7TOPS/W 5b Variation-Tolerant Analog CNN Processor Featuring Analog Neuronal Computation Unit and Analog Memory


Abstract: In this paper, we present a fully analog CNN processor featuring a convolution, pooling, and non-linearity (ReLU) datapath kept end-to-end in the analog domain, with no analog-to-digital conversion between layers. The processor adopts a variation-tolerant analog design approach, including an analog memory with a write-with-feedback scheme, which allows the fully analog processor to be robust to PVT variations. The 28nm chip achieves a peak efficiency of 332.7TOPS/W at 5b equivalent precision.
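As a rough illustration of what “fully analog, end-to-end” means, the behavioral NumPy sketch below chains two layers with no quantization step between them and models PVT variation as a random per-output gain error. It is a toy model under our own assumptions (a dense MAC stands in for convolution), not the ARCHON circuit.

```python
import numpy as np

rng = np.random.default_rng(0)

def analog_layer_relu(x, w, gain_sigma=0.02):
    """Behavioral model of one analog compute + ReLU stage: the MAC result
    stays an analog quantity (a float here), and PVT variation is modeled
    as a random per-output gain error. Purely illustrative."""
    y = np.tensordot(w, x, axes=([1], [0]))               # analog MAC
    y *= 1.0 + gain_sigma * rng.standard_normal(y.shape)  # PVT gain spread
    return np.maximum(y, 0.0)                             # analog ReLU

# Two stacked stages with no quantization (no ADC) in between.
x = rng.standard_normal(16)
h = analog_layer_relu(x, rng.standard_normal((8, 16)))
out = analog_layer_relu(h, rng.standard_normal((4, 8)))
print(out)
```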


NAND-Net: A 133.6TOPS/W Compute-In-Memory SRAM Macro with Fully Parallel One-Step Multi-Bit Computation


Authors: Edward Choi, Injun Choi, Chanhee Jeon, Gichan Yun, Donghyeon Yi, Sohmyung Ha, Ik-Joon Chang, Minkyu Je

SRAM-based compute-in-memory (CIM) structures have shown ways to perform deep neural network (DNN) computations in the mixed-signal domain with high energy efficiency, but they suffer from accuracy tradeoffs and limitations arising from analog nonidealities. Recently, circuit techniques were developed to support multi-bit analog computation in SRAM-based CIM macros [1], [2], which compute multiplication and accumulation using transistor currents. However, the transistor current is nonlinear with respect to the gate voltage, significantly degrading DNN accuracy. Some works address this problem with charge-based computation [3], [4], where the multiplication results between 1b weights and multi-bit inputs are first stored in capacitors. Multi-bit-weight computations are then achieved by shifting and adding the multiplication results either in the digital domain [3] or in the analog domain using a charge-sharing method [1]. The digital method typically requires higher ADC precision and one ADC for every accumulation, making it power-hungry. The analog charge-sharing method requires control switches, exposing it to charge-injection noise and dissipating considerable power to turn the switches on and off. To address these issues, this work proposes an 8T1C SRAM-based CIM macro structure, which supports (1) multi-bit-weight charge-based computation without additional switches for charge sharing; (2) a simple and fast computation in which the multi-bit-weight multiply-accumulate-averaging (MAV) voltage is formed immediately when the input is applied, namely “one-step” computation; (3) a compact 8T1C bit cell using a metal-oxide-metal (MOM) capacitor, which occupies only 1.5× the cell area of a conventional 6T SRAM under logic rules; and (4) no additional power consumption for bit shifts, enabling energy-efficient computing. We fabricated the proposed 4kb SRAM CIM macro in a 65nm process, whose structure is shown in Fig. 1; it supports fully parallel computation of 1024 MAV operations with 64 4b inputs and 16 4b weights.
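To make the “one-step” MAV idea concrete, here is a hedged behavioral model in NumPy: each bit cell contributes a 1b×1b product on an identical capacitor, charge sharing yields their average in a single step, and bit planes are recombined with powers of two. The operand sizes and the digital recombination below are simplified assumptions for illustration; the real macro forms the multi-bit result in the analog domain.

```python
import numpy as np

def mav_voltage(x_bits, w_bits, v_dd=1.0):
    """Behavioral charge-domain multiply-accumulate-averaging (MAV): each
    cell stores (input AND weight) * VDD on an identical capacitor, and
    shorting the capacitors yields their average in one step."""
    products = (x_bits & w_bits) * v_dd   # per-cell 1b x 1b analog product
    return products.mean()                # charge sharing = averaging

def multibit_mac(x_4b, w_4b):
    """Hypothetical 4b x 4b MAC built from binary MAV planes; powers of two
    stand in for the macro's switch-free analog bit-shift."""
    acc, n = 0.0, len(x_4b)
    for wi in range(4):
        for xi in range(4):
            w_plane = (w_4b >> wi) & 1
            x_plane = (x_4b >> xi) & 1
            acc += (2 ** (wi + xi)) * mav_voltage(x_plane, w_plane) * n
    return acc

rng = np.random.default_rng(1)
x = rng.integers(0, 16, size=64)          # 4b inputs
w = rng.integers(0, 16, size=64)          # 4b weights
print(multibit_mac(x, w), np.dot(x, w))   # behavioral MAV vs exact dot product
```

Because charge sharing over equal capacitors computes an exact average, recombining the bit planes reproduces the exact integer dot product, which is why the two printed values match.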

 


A 49.5 mW Multi-scale Linear Quantized Online Learning Processor for Real-Time Adaptive Object Detection (Prof. Hoi-Jun Yoo's Lab)

Online training is essential to maintain high object detection (OD) accuracy in various environments. However, conventional online learning schemes impose additional computation workload, external memory access (EMA), and high bit-precision requirements on mobile devices. In this paper, we present a low-power real-time online learning OD processor for mobile devices with three key features: 1) multi-scale linear quantization, and an architecture to support it, for low-bit fixed-point arithmetic at all stages of online learning; 2) low-gradient channel skipping for computation and EMA reduction; 3) gradient norm estimation to support gradient-norm clipping with less than 0.1% additional computation for fast adaptation. As a result, the proposed processor achieves 34 frames-per-second real-time OD with accurate online learning while consuming only 49.5mW.
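As a rough sketch of the two quantization-related features, the Python snippet below implements plain symmetric linear quantization with a per-tensor scale, plus norm clipping driven by an externally estimated gradient norm. Both are simplified stand-ins for the processor's multi-scale scheme and gradient norm estimation; the function names are ours.

```python
import numpy as np

def linear_quantize(x, n_bits=8):
    """Symmetric linear quantization to n-bit fixed point with a per-tensor
    scale: a simplified stand-in for the processor's multi-scale scheme."""
    qmax = 2 ** (n_bits - 1) - 1
    peak = np.abs(x).max()
    scale = peak / qmax if peak > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def clip_by_estimated_norm(grad, max_norm, est_norm):
    """Gradient-norm clipping driven by an estimated norm, so the exact
    norm need not be recomputed (the spirit of the paper's estimation idea)."""
    return grad * min(1.0, max_norm / (est_norm + 1e-12))

g = np.random.default_rng(2).standard_normal(1000) * 3.0
q, s = linear_quantize(g, n_bits=8)
g_hat = q * s                                   # dequantized view
print(float(np.abs(g - g_hat).max()), s)        # worst-case error ~ scale/2
print(np.linalg.norm(clip_by_estimated_norm(g, 1.0, np.linalg.norm(g))))
```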

Related papers:

Song, Seokchan, et al. “A 49.5 mW Multi-scale Linear Quantized Online Learning Processor for Real-Time Adaptive Object Detection.” IEEE Transactions on Circuits and Systems II: Express Briefs (2022).

Song, Seokchan, et al. “A 49.5 mW Multi-Scale Linear Quantized Online Learning Processor for Real-Time Adaptive Object Detection.” IEEE International Symposium on Circuits and Systems (ISCAS), May 2022.


A 161.6 TOPS/W Mixed-mode Computing-in-Memory Processor for Energy-Efficient Mixed-Precision Deep Neural Networks (Prof. Hoi-Jun Yoo's Lab)

A mixed-mode computing-in-memory (CIM) processor for mixed-precision deep neural network (DNN) processing is proposed. Due to bit-serial processing of multi-bit data, previous CIM processors could not exploit the energy-efficient computation of mixed-precision DNNs. This paper proposes an energy-efficient mixed-mode CIM processor with two key features: 1) Mixed-Mode Mixed-precision CIM (M3-CIM), which achieves a 55.46% energy-efficiency improvement; 2) a digital CIM for in-memory MAC operations, which increases the throughput of M3-CIM. The proposed CIM processor was simulated in 28nm CMOS technology and occupies 1.96 mm². It achieves a state-of-the-art energy efficiency of 161.6 TOPS/W with 72.8% accuracy on ImageNet (ResNet50).
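The bottleneck the paper targets can be seen in a small behavioral model: in a bit-serial CIM, cycle count (and thus energy) scales with input bit-width, so exploiting mixed precision lets low-precision layers finish in fewer cycles. The NumPy sketch below illustrates that scaling; it models the generic bit-serial scheme, not the M3-CIM circuit itself.

```python
import numpy as np

def bit_serial_mac(x, w, x_bits):
    """Behavioral bit-serial CIM MAC: inputs are streamed one bit per cycle,
    so the cycle count scales with the input bit-width."""
    acc, cycles = 0, 0
    for b in range(x_bits):
        x_plane = (x >> b) & 1                 # one input bit per cycle
        acc += (2 ** b) * np.dot(x_plane, w)   # in-memory bitwise MAC
        cycles += 1
    return acc, cycles

rng = np.random.default_rng(3)
w = rng.integers(-8, 8, size=256)
x8 = rng.integers(0, 256, size=256)            # a layer at 8b precision
x4 = x8 >> 4                                   # the same layer at 4b precision
print(bit_serial_mac(x8, w, 8)[1], bit_serial_mac(x4, w, 4)[1])  # 8 vs 4 cycles
```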

Related papers:

Wooyoung Jo, Sangjin Kim, Juhyoung Lee, Soyeon Um, Zhiyong Li, and Hoi-jun Yoo, “A 161.6 TOPS/W Mixed-mode Computing-in-Memory Processor for Energy-Efficient Mixed-Precision Deep Neural Networks”, Int’l Symp. on Circuits and Systems (ISCAS), May 2022.


A 36.2 dB High SNR and PVT/Leakage-robust eDRAM Computing-In-Memory Macro with Segmented BL and Reference Cell Array (Prof. Hoi-Jun Yoo's Lab)

Computing-in-memory (CIM) achieves high energy efficiency through analog DNN computation inside memory macros. However, as DNN size increases, the energy efficiency of CIM is reduced by external memory access (EMA). One promising solution is eDRAM-based CIM, which increases memory capacity with a high-density cell. Although eDRAM-CIM has a higher density than SRAM-CIM, it suffers from both poor robustness and a low signal-to-noise ratio (SNR). In this work, an energy-efficient eDRAM-CIM macro is proposed that improves computational robustness and SNR with three key features: 1) high-SNR voltage-based accumulation with a segmented BL architecture (SBLA), resulting in 17.1 dB higher SNR; 2) cancellation of PVT/leakage-induced errors with a common-mode error canceling (CMEC) circuit, resulting in 51.4% PVT-variation reduction and 51.4% refresh-power reduction; 3) a ReLU-based zero-gating ADC (ZG-ADC), reducing ADC power by up to 58.1%. With these features, the proposed eDRAM-CIM macro achieves 81.5-to-115.0 TOPS/W energy efficiency with 209-to-295 μW power consumption when performing 4b×4b MAC operations at a 250 MHz core frequency. The proposed macro also achieves 91.52% accuracy on the CIFAR-10 object-classification dataset (ResNet-20) without accuracy drop even under PVT variation.
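The ReLU-based zero-gating ADC admits a simple behavioral reading: if the accumulated analog value is non-positive, ReLU would output zero anyway, so the conversion can be gated off and its energy saved. The sketch below counts skipped conversions for roughly zero-mean MAC outputs; the distribution and ADC parameters are our illustrative assumptions, not the chip's.

```python
import numpy as np

def zero_gating_adc(v_acc, n_bits=4, v_ref=1.0):
    """Behavioral ReLU-based zero-gating ADC: when the accumulated voltage
    is non-positive, ReLU would output zero anyway, so the conversion is
    gated off. Returns (output code, whether the ADC actually fired)."""
    if v_acc <= 0.0:
        return 0, False                         # conversion skipped
    code = min(int(v_acc / v_ref * (2 ** n_bits - 1)), 2 ** n_bits - 1)
    return code, True                           # conversion performed

rng = np.random.default_rng(4)
v = rng.normal(0.0, 0.3, size=10_000)           # MAC outputs, roughly zero-mean
fired = sum(zero_gating_adc(x)[1] for x in v)
print(f"ADC conversions skipped: {100 * (1 - fired / v.size):.1f}%")
```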

Related papers:

Ha, Sangwoo, et al. “A 36.2 dB High SNR and PVT/Leakage-robust eDRAM Computing-In-Memory Macro with Segmented BL and Reference Cell Array.” IEEE Transactions on Circuits and Systems II: Express Briefs (2022).

Ha, Sangwoo, et al. “A 36.2 dB High SNR and PVT/Leakage-robust eDRAM Computing-In-Memory Macro with Segmented BL and Reference Cell Array.” IEEE International Symposium on Circuits and Systems (ISCAS), May 2022.


An Energy-efficient High-quality FHD Super-resolution Mobile Accelerator SoC with Hybrid-precision and Energy-efficient Cache Subsystem (Prof. Hoi-Jun Yoo's Lab)

With the rise of contactless communication and streaming services, super-resolution (SR) on mobile devices has become one of the most important image-processing technologies. The popularity of high-end application processors (APs) and high-resolution mobile displays has also driven the development of lightweight mobile SR-CNNs with high reconstruction quality. However, the large size and wide dynamic range of both images and the intermediate feature maps in CNN hidden layers pose challenges for mobile platforms. Given the constraints of limited power and shared bandwidth on mobile platforms, a low-power, energy-efficient architecture is required.

This paper presents an image-processing SoC targeting the non-sparse SR task. It contributes the following two key features: 1) a heterogeneous architecture with 8-bit-only FP-FXP hybrid precision for the SR task, and 2) a data-lifetime-aware, two-way optimized cache subsystem for energy-efficient depth-first image processing. With highly optimized heterogeneous cores and cache subsystem, our SoC delivers 2.6× higher energy efficiency than the previous SRNPU and a 107 frames-per-second (fps) frame rate for 4× SR image generation at Full-HD scale, with 0.92 mJ/frame energy consumption.
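Why depth-first processing helps the cache subsystem can be shown with a back-of-the-envelope model: layer-by-layer execution keeps a full-resolution intermediate feature map alive between layers, while depth-first execution pushes one small tile through all layers so only a tile-sized intermediate must stay cached. The sketch below compares the two peak footprints; the channel widths and tile size are hypothetical, and convolution halo overlaps are ignored.

```python
def peak_footprint_layer_first(h, w, channels):
    """Layer-by-layer execution: a full-resolution intermediate map stays
    alive between layers, so the footprint is a whole feature map."""
    return max(h * w * c for c in channels)

def peak_footprint_depth_first(tile_h, tile_w, channels):
    """Depth-first execution: one small tile passes through every layer
    before the next tile starts, so only a tile-sized intermediate lives
    in the cache (halo overlaps ignored in this sketch)."""
    return max(tile_h * tile_w * c for c in channels)

channels = [32, 32, 64, 64, 3]                   # hypothetical SR-CNN widths
full = peak_footprint_layer_first(1080, 1920, channels)
tile = peak_footprint_depth_first(64, 64, channels)
print(f"{full / tile:.0f}x smaller peak intermediate footprint")
```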

Related papers:

Z. Li, S. Kim, D. Im, D. Han and H. -J. Yoo, “An 0.92 mJ/frame High-quality FHD Super-resolution Mobile Accelerator SoC with Hybrid-precision and Energy-efficient Cache,” 2022 IEEE Custom Integrated Circuits Conference (CICC), 2022.

 


 

A Low-Power Graph Convolutional Network Processor with Sparse Grouping for 3D Point Cloud Semantic Segmentation in Mobile Devices (Prof. Hoi-Jun Yoo's Lab)

A low-power graph convolutional network (GCN) processor is proposed for accelerating 3D point cloud semantic segmentation (PCSS) in real time on mobile devices. Three key features enable low-power GCN-based 3D PCSS. First, a new hardware-friendly GCN algorithm, sparse-grouping-based dilated graph convolution (SG-DGC), is proposed. SG-DGC reduces 71.7% of the overall computation and 76.9% of EMA through sparse grouping of the point cloud. Second, a two-level pipeline (TLP) consisting of a point-level pipeline (PLP) and group-level pipelining (GLP) is proposed to improve the low utilization caused by the imbalanced workload of GCN. The PLP enables point-level module-wise fusion (PMF), which reduces 47.4% of EMA for low power consumption. Also, center-point feature reuse (CPFR) reuses the computation results of redundant operations and reduces 11.4% of computation. Finally, the GLP increases core utilization by 21.1% by balancing the workloads of graph generation and graph convolution, enabling 1.1× higher throughput. The processor is implemented in 65nm CMOS technology, and the 4.0 mm² 3D PCSS processor shows 95 mW power consumption while operating in real time at 30.8 fps on 3D PCSS of indoor scenes with 4k points.
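The essence of sparse grouping can be sketched in a few lines: partition the cloud into spatial groups and build graphs within groups, so neighbor search and graph convolution cost on the order of the sum of squared group sizes instead of N² over the whole cloud. The voxel-based grouping rule and cell size below are our own assumptions, not the exact SG-DGC procedure.

```python
import numpy as np

def sparse_grouping(points, cell=0.5):
    """Group 3D points by voxel cell so graph construction and convolution
    run per group instead of over the full cloud (the spirit of SG-DGC;
    the grouping rule and cell size are illustrative assumptions)."""
    keys = np.floor(points / cell).astype(np.int64)
    groups = {}
    for i, k in enumerate(map(tuple, keys)):
        groups.setdefault(k, []).append(i)
    return groups

rng = np.random.default_rng(5)
pts = rng.uniform(0, 5, size=(4096, 3))          # ~4k points, as in the paper
groups = sparse_grouping(pts)
sizes = [len(v) for v in groups.values()]
# Pairwise work inside groups (sum of n_g^2) vs over the whole cloud (N^2).
print(len(groups), sum(n * n for n in sizes), pts.shape[0] ** 2)
```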

Related papers:

Kim, Sangjin, et al. “A Low-Power Graph Convolutional Network Processor With Sparse Grouping for 3D Point Cloud Semantic Segmentation in Mobile Devices.” IEEE Transactions on Circuits and Systems I: Regular Papers (2022).

Kim, Sangjin, et al. “A 54.7 fps 3D Point Cloud Semantic Segmentation Processor with Sparse Grouping based Dilated Graph Convolutional Network for Mobile Devices.” 2020 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2020.

 


 

TSUNAMI: Triple Sparsity-aware Ultra Energy-efficient Neural Network Training Accelerator with Multi-modal Iterative Pruning (Prof. Hoi-Jun Yoo's Lab)

This paper proposes TSUNAMI, which supports energy-efficient deep neural network training. TSUNAMI supports multi-modal iterative pruning to generate zeros in activations and weights. A tile-based dynamic activation pruning unit and a weight-memory-shared pruning unit eliminate the additional memory accesses that pruning would otherwise require. A coarse-zero skipping controller skips multiple unnecessary multiply-and-accumulate (MAC) operations at once, and a fine-zero skipping controller skips randomly located unnecessary MAC operations. A weight-sparsity balancer resolves the utilization degradation caused by weight-sparsity imbalance, and the workload of each convolution core is assigned by a random channel allocator. TSUNAMI achieves an energy efficiency of 3.42 TFLOPS/W at 0.78 V and 50 MHz with floating-point 8-bit activations and weights, and a world-leading energy efficiency of 405.96 TFLOPS/W under a 90% sparsity condition.
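A minimal behavioral model of the two skipping controllers, under our own simplifications: a coarse pass drops whole activation blocks that are entirely zero in one decision, and a fine pass then drops the remaining individual zero operands. The block size and sparsity levels below are illustrative, and the sketch counts MACs rather than cycles.

```python
import numpy as np

def count_macs_with_zero_skipping(acts, weights, block=16):
    """Behavioral model of TSUNAMI-style zero skipping: a coarse pass skips
    entire all-zero activation blocks at once; a fine pass then skips the
    remaining per-element zero operands."""
    done = 0
    for start in range(0, acts.size, block):
        a = acts[start:start + block]
        w = weights[start:start + block]
        if not a.any():                 # coarse skip: whole block is zero
            continue
        nz = (a != 0) & (w != 0)        # fine skip: individual zeros
        done += int(nz.sum())
    return done

rng = np.random.default_rng(6)
a = rng.standard_normal(4096) * (rng.random(4096) > 0.9)   # ~90% sparse acts
w = rng.standard_normal(4096) * (rng.random(4096) > 0.5)   # ~50% sparse weights
print(count_macs_with_zero_skipping(a, w), "of", a.size, "MACs performed")
```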

Related papers:

Kim, Sangyeob, et al. “TSUNAMI: Triple Sparsity-Aware Ultra Energy-Efficient Neural Network Training Accelerator With Multi-Modal Iterative Pruning.” IEEE Transactions on Circuits and Systems I: Regular Papers (2022).
