M.S. student Wonhoon Park (advised by Prof. Hoi-Jun Yoo) wins the Distinguished Design Award at the 2022 IEEE A-SSCC

[Prof. Hoi-Jun Yoo, Wonhoon Park]
 
EE M.S. student Wonhoon Park (advised by Prof. Hoi-Jun Yoo) won the Distinguished Design Award at the 2022 IEEE Asian Solid-State Circuits Conference (A-SSCC) Student Design Contest.
The conference was held in Taipei, Taiwan from November 6th to 9th.
 
A-SSCC is an international conference held annually by the IEEE. Wonhoon Park presented a paper titled “An Efficient Unsupervised Learning-based Monocular Depth Estimation Processor with Partial-Switchable Systolic Array Architecture in Edge Devices,” which was selected as a winning entry for its excellence.
 
Details are as follows. 
 
-Conference: 2022 IEEE Asian Solid-State Circuits Conference (A-SSCC)
-Location: Taipei, Taiwan
-Date: November 6-9, 2022
-Award: Distinguished Design Award (Student Design Contest)
-Authors: Wonhoon Park, Dongseok Im, Hankyul Kwon, and Hoi-Jun Yoo (Advisory Professor)
-Paper Title: An Efficient Unsupervised Learning-based Monocular Depth Estimation Processor with Partial-Switchable Systolic Array Architecture in Edge Devices

Professor Joo-Young Kim’s Artificial Intelligence Semiconductor System Research Center wins the Minister of Science and ICT Award

[Prof. Joo-Young Kim]
 
On November 10, the Artificial Intelligence Semiconductor System Research Center (AISS), led by Professor Joo-Young Kim of KAIST, received the Minister of Science and ICT Award in recognition of its outstanding talent-development record.
 
AISS, headed by Professor Joo-Young Kim, has been carrying out the Ministry of Science and ICT’s university ICT research center support program since 2020, dedicating itself to nurturing talent from multiple angles.
 
In 2021 in particular, 96 student researchers received continuous training through programs spanning internships, technology transfer, entrepreneurship education, and creative activities, and the center’s remarkable outcomes, including employment results, have made it a model for other centers.
 
AISS is currently carrying out active research under research director Professor Joo-Young Kim, together with Professors Hoi-Jun Yoo, Lee-Sup Kim, In-Cheol Park, Seung-Tak Ryu, and Hyun-Sik Kim of KAIST; Han-Jun Kim and Jin-Ho Song of Yonsei University; Ji-Hoon Kim and Seong-Min Park of Ewha Womans University; and Kyu-Ho Lee of UNIST. In addition, the number of participating master’s- and doctoral-level students has grown 10% from 2021 to 110, a strong step toward becoming Korea’s hub in the field of artificial intelligence semiconductors.
 
Professor Joo-Young Kim, the research director who received the award, said, “Building on the university’s ICT and intelligent-semiconductor technology capabilities, we will continue to strengthen ties with leading universities and companies in Korea to foster the system-semiconductor workforce essential for Korea to become a true semiconductor technology powerhouse.”
 

 

KAIST EE Professor Hyun-Sik Kim’s team wins the Prime Minister’s Award at the 23rd Korea Semiconductor Design Challenge

[From left: Prof. Hyun-Sik Kim, Ph.D. candidate Gyuwan Lim, Ph.D. candidate Gyeong-Gu Kang]

 

EE Professor Hyun-Sik Kim’s team of Ph.D. students received the Prime Minister’s Award at the 23rd Korea Semiconductor Design Challenge.

 

The 23rd Korea Semiconductor Design Challenge, jointly organized by the Korean Ministry of Trade, Industry and Energy and the Korea Semiconductor Industry Association (KSIA), is held to cultivate students’ design skills and discover creative ideas in the field of semiconductor design.

 

The winners, Gyuwan Lim and Gyeong-Gu Kang, were selected for achieving high resolution and high uniformity in their mobile Display Driver IC (DDI) design while maintaining an ultra-small chip area.

 

The DDI chip is a key component of a display system: it converts digital display data into analog signals (digital-to-analog conversion, DAC) and writes them to the display panel. The KAIST team solved the problems of degraded uniformity and growing chip area that accompany higher-resolution DDI chips.

 

The award-winning DDI design uses low-voltage MOSFETs combined with a voltage amplifier in place of the conventional high-voltage MOSFETs. This dramatically shrinks the channel area, which is reduced further by a novel LSU technique that generates a 10-bit output voltage from an 8-bit input.
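As a rough numeric illustration (a hypothetical sketch of the resolution bookkeeping only; the announcement does not detail the LSU circuit, so the interpolation scheme below is our assumption, not the team’s design), inserting sub-levels between adjacent 8-bit levels can yield 10-bit output resolution:

```python
import numpy as np

# Hypothetical sketch: derive 10-bit output levels from an 8-bit core by
# interpolating 4 sub-levels between adjacent coarse levels. Illustrates
# only the resolution arithmetic, not the award-winning circuit.
VDD = 5.0
coarse = np.linspace(0.0, VDD, 2**8)       # 256 coarse levels (8-bit)
step = VDD / (2**8 - 1)                    # spacing between coarse levels

def dac_10bit(code10: int) -> float:
    hi, lo = code10 >> 2, code10 & 0b11    # split: 8-bit coarse + 2-bit fine
    return coarse[hi] + (lo / 4.0) * step  # interpolated sub-level

assert abs(dac_10bit(4) - dac_10bit(0) - step) < 1e-9  # 4 codes per coarse step
```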

 

The team achieved high uniformity by designing the amplifier and the chip’s operation to be robust against variations in the CMOS fabrication process. The novel DDI design is expected to significantly reduce cost while raising the quality of mobile device displays thanks to the reduced chip area, achieving high resolution and high uniformity at the same time.

 

The results of this study were also presented at ISSCC 2022, a highly reputable international conference in the field of integrated circuits.

 

 

Design of Processing-in-Memory with Triple Computational Path and Sparsity Handling for Energy-Efficient DNN Training

Title : Design of Processing-in-Memory with Triple Computational Path and Sparsity Handling for Energy-Efficient DNN Training

Authors : Wontak Han, Jaehoon Heo, Junsoo Kim, Sukbin Lim, Joo-Young Kim

Publications : IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), 2022

As machine learning (ML) and artificial intelligence (AI) have become mainstream technologies, many accelerators have been proposed to cope with their computation kernels. However, they access external memory frequently due to the large size of deep neural network models, suffering from the von Neumann bottleneck. Moreover, as privacy issues become more critical, on-device training is emerging as a solution. On-device training is challenging, however, because it must run under a limited power budget while requiring far more computation and memory access than inference. In this paper, we present an energy-efficient processing-in-memory (PIM) architecture supporting end-to-end on-device training, named T-PIM. Its macro design includes an 8T-SRAM cell-based PIM block that computes in-memory AND operations and three computational datapaths for end-to-end training. The three computational paths integrate arithmetic units for forward propagation, backward propagation, and gradient calculation with weight update, respectively, allowing the weight data stored in the memory to remain stationary. T-PIM also supports variable bit precision to cover various ML scenarios: fully variable input bit precision with 2-bit, 4-bit, 8-bit, and 16-bit weight precision for forward propagation, and the same input bit precision with 16-bit weight precision for backward propagation. In addition, T-PIM implements sparsity-handling schemes that skip computation for input data and power off the arithmetic units for weight data, reducing both unnecessary computation and leakage power. Finally, we fabricated the T-PIM chip on a 5.04mm² die in a 28-nm CMOS logic process. It operates at 50–280MHz with a supply voltage of 0.75–1.05V, dissipating 5.25–51.23mW in inference and 6.10–37.75mW in training. As a result, it achieves 17.90–161.08TOPS/W energy efficiency for inference with 1-bit activations and 2-bit weights, and 0.84–7.59TOPS/W for training with 8-bit activations/errors and 16-bit weights. In conclusion, T-PIM is the first PIM chip that supports end-to-end training, demonstrating a 2.02× performance improvement over the latest PIM chip that only partially supports training.
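To make the in-memory AND computation concrete, here is a minimal software sketch (our illustration, assuming unsigned operands; a simple all-zero bit-plane skip stands in for T-PIM’s sparsity handling) of how AND-based bit-serial PIM accumulates a dot product:

```python
import numpy as np

def pim_bitserial_dot(inputs, weights, in_bits=4, w_bits=4):
    """Sketch of an AND-based bit-serial PIM dot product.

    Each cycle one input bit-plane is broadcast to the array; the 8T-SRAM
    cells effectively AND it with each stored weight bit, and the popcounted
    partial sums are shift-accumulated over all bit-plane pairs.
    """
    acc = 0
    for i in range(in_bits):                  # input bit-planes (serial)
        in_plane = (inputs >> i) & 1
        if not in_plane.any():                # sparsity: skip all-zero plane
            continue
        for j in range(w_bits):               # stored weight bit columns
            w_plane = (weights >> j) & 1
            and_out = in_plane & w_plane      # in-memory AND per cell
            acc += int(and_out.sum()) << (i + j)  # popcount, shift, add
    return acc

x = np.array([3, 0, 5, 1], dtype=np.int64)
w = np.array([2, 7, 0, 4], dtype=np.int64)
assert pim_bitserial_dot(x, w) == int((x * w).sum())
```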

T-PIM: A 2.21-to-161.08TOPS/W Processing-In-Memory Accelerator for End-to-End On-Device Training

Title : T-PIM: A 2.21-to-161.08TOPS/W Processing-In-Memory Accelerator for End-to-End On-Device Training

Authors : Jaehoon Heo, Junsoo Kim, Wontak Han, Sukbin Lim, Joo-Young Kim

Publications : IEEE Custom Integrated Circuits Conference (CICC) 2022

As the number of edge devices grows to tens of billions, the center of intelligent computing has shifted from cloud datacenters to edge devices. On-device training, which enables the personalization of a machine learning (ML) model for each user, is crucial to the success of edge intelligence. However, battery-powered edge devices cannot afford the huge amount of computation and memory access involved in training. Processing-in-Memory (PIM) is a promising technology to overcome the memory bandwidth and energy problem by combining processing logic into the memory. Many PIM chips [1-5] have accelerated ML inference using analog- or digital-based logic with sparsity handling. Two-way transpose PIM [6] supports backpropagation, but it lacks the gradient calculation and weight update required for end-to-end ML training.

This paper presents T-PIM, the first PIM accelerator that can perform end-to-end on-device training with sparsity handling while also supporting low-latency ML inference. T-PIM makes four key contributions: 1) it runs the complete four computational stages of ML training on a single chip (Fig. 1); 2) it allows various data-mapping strategies for the two major computational layers, i.e., fully-connected (FC) and convolutional (CONV), as well as the two computational directions, i.e., forward and backward; 3) it supports fully variable bit-width for input data and power-of-two bit-widths for weight data using serial and configurable arithmetic units; 4) it accelerates ML training and saves energy by exploiting fine-grained sparsity in all data types (activation, error, and weight).
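The two computational directions in contribution 2) can be pictured with a small sketch (our simplification, not the chip’s actual mapping): with the weights held stationary in the array, the forward pass accumulates along one dimension and the backward pass along the other, so W never leaves the memory:

```python
import numpy as np

class StationaryWeightArray:
    """Toy model of a weight-stationary array used in both directions."""
    def __init__(self, W):
        self.W = np.asarray(W)   # weights stay resident in the PIM array

    def forward(self, x):        # y = W @ x, row-wise accumulation
        return self.W @ x

    def backward(self, dy):      # dx = W.T @ dy, column-wise accumulation
        return self.W.T @ dy

arr = StationaryWeightArray(np.arange(6).reshape(2, 3))
print(arr.forward(np.array([1, 2, 3])), arr.backward(np.array([1, -1])))
```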

 

Optimizing ADC Utilization through Value-Aware Bypass in ReRAM-based DNN Accelerator

Title: Optimizing ADC Utilization through Value-Aware Bypass in ReRAM-based DNN Accelerator

 

Author: Hancheon Yun, Hyein Shin, Myeonggu Kang, Lee-Sup Kim

 

Conference : IEEE/ACM Design Automation Conference (DAC) 2021

 

Abstract: ReRAM-based Processing-In-Memory (PIM) has been widely studied as a promising approach to Deep Neural Network (DNN) acceleration thanks to its energy-efficient analog operations. However, the domain-conversion process for these analog operations requires frequent accesses to a power-hungry Analog-to-Digital Converter (ADC), hindering overall energy efficiency. Although previous works have tried to address this problem, the ADC cost has not been sufficiently reduced because their approaches are not well suited to ReRAM. In this paper, we propose mixed-signal value-aware bypass techniques to optimize the ADC utilization of ReRAM-based PIM. By exploiting the bit-line (BL) level value distribution, the proposed work bypasses redundant ADC operations depending on the magnitude of the value. Evaluation results show that our techniques successfully reduce ADC accesses and improve overall energy efficiency by 2.48×–3.07× compared to ISAAC.
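A toy model of the idea (our illustration; the paper’s mixed-signal circuit and its thresholds differ) is to let bit-lines whose analog value is known to be small skip the full-resolution conversion:

```python
import numpy as np

def value_aware_readout(bl_values, full_bits=8, bypass_thresh=4.0):
    """Toy model of value-aware ADC bypass (illustrative, not the paper's
    circuit). Bit-lines whose analog partial sum is small are treated as
    near-zero instead of paying for a full-resolution ADC read."""
    out, full_reads = [], 0
    for v in bl_values:
        if abs(v) < bypass_thresh:        # coarse magnitude check
            out.append(0)                 # bypass the power-hungry ADC
        else:
            full_reads += 1               # full conversion only when needed
            out.append(min(max(int(round(v)), 0), 2**full_bits - 1))
    return np.array(out), full_reads

codes, reads = value_aware_readout(np.array([0.6, 37.2, 1.9, 122.5]))
print(codes, reads)   # small values bypassed; only 2 full ADC reads
```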

Fault-free: A Fault-resilient Deep Neural Network Accelerator based on Realistic ReRAM Devices

Title : Fault-free: A Fault-resilient Deep Neural Network Accelerator based on Realistic ReRAM Devices

 

Author: Hyein Shin, Myeonggu Kang, Lee-Sup Kim

 

Conference: IEEE/ACM Design Automation Conference (DAC) 2021

 

Abstract: Energy-efficient Resistive RAM (ReRAM) based deep neural network (DNN) accelerators suffer from a severe Stuck-At-Fault (SAF) problem that drastically degrades inference accuracy. The SAF problem gets even worse in realistic ReRAM devices with low cell resolution. To address the issue, we propose a fault-resilient DNN accelerator based on realistic ReRAM devices. We first analyze the SAF problem in a realistic ReRAM device and then propose a 3-stage offline fault-resilient compilation and a lightweight online compensation scheme. The proposed work enables reliable execution of DNNs with only 5% area and 0.8% energy overhead relative to an ideal ReRAM-based DNN accelerator.
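One simple way to see why online compensation can be lightweight (a sketch under our own assumptions, not the paper’s 3-stage flow) is that stuck cells hold known values, so their error contribution is sparse and can be corrected digitally:

```python
import numpy as np

# Sketch: cells stuck at a known conductance add a predictable, sparse
# error term that can be subtracted from the crossbar output.
rng = np.random.default_rng(0)
W = rng.integers(0, 4, size=(4, 6))           # intended 2-bit conductances
saf_mask = rng.random(W.shape) < 0.1          # ~10% of cells are stuck
stuck_val = rng.integers(0, 4, size=W.shape)  # value each stuck cell holds
W_faulty = np.where(saf_mask, stuck_val, W)

x = rng.integers(0, 8, size=6)
y_faulty = W_faulty @ x
# Online compensation: only the (sparse) faulty cells need digital correction.
delta = (W_faulty - W) * saf_mask
y_corrected = y_faulty - delta @ x
assert np.array_equal(y_corrected, W @ x)
```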

A Convergence Monitoring Method for DNN Training of On-Device Task Adaptation

Title : A Convergence Monitoring Method for DNN Training of On-Device Task Adaptation

 

Author : Seungkyu Choi, Jaekang Shin, Lee-Sup Kim

 

Conference : IEEE/ACM International Conference On Computer Aided Design 2021

 

Abstract: DNN training has become a major on-device workload for executing various vision tasks with high performance. Accordingly, training architectures that employ approximate computing have been steadily studied for efficient acceleration. However, most of these works evaluate their schemes on from-scratch training, where inaccurate computing is not tolerable. Moreover, previous solutions are mostly extensions of inference-oriented techniques, e.g., sparsity/pruning, quantization, and dataflow. Therefore, issues in practical workloads that limit the end-to-end speed of DNN training remain unresolved. In this work, targeting transfer-learning-based task adaptation, a practical on-device training workload, we propose a convergence-monitoring method that removes the redundancy in massive training iterations. By utilizing the network’s output values, we detect the training intensity of incoming tasks and monitor the prediction convergence at the given intensity, providing early exits within the scheduled training iterations. As a result, accurate approximation is performed across various tasks with minimal overhead. Unlike sparsity-driven approximation, our method enables runtime optimization and is easily applicable to off-the-shelf accelerators, achieving significant speedup. Evaluation results on various datasets show a geomean 2.2× speedup over the baseline and a 1.8× speedup over the latest convergence-related training method.
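The early-exit idea can be sketched as follows (a minimal illustration with an assumed plateau rule; the paper derives its stopping decision from the network’s output and the detected training intensity):

```python
class ConvergenceMonitor:
    """Illustrative early-exit monitor, not the paper's exact criterion.

    Watches a scalar derived from the network's output (e.g. mean max-softmax
    confidence) and stops training once it has plateaued for `patience` checks.
    """
    def __init__(self, tol=1e-3, patience=3):
        self.tol, self.patience = tol, patience
        self.prev, self.flat = None, 0

    def should_stop(self, metric: float) -> bool:
        if self.prev is not None and abs(metric - self.prev) < self.tol:
            self.flat += 1          # another near-flat step
        else:
            self.flat = 0           # still improving; reset the counter
        self.prev = metric
        return self.flat >= self.patience

monitor = ConvergenceMonitor(tol=1e-3, patience=3)
for step, conf in enumerate([0.62, 0.71, 0.76, 0.780, 0.7805, 0.7809, 0.7810]):
    if monitor.should_stop(conf):
        print(f"early exit at step {step}")   # exits before the full schedule
        break
```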

Deferred Dropout: An Algorithm-Hardware Co-Design DNN Training Method Provisioning Consistent High Activation Sparsity

Title : Deferred Dropout: An Algorithm-Hardware Co-Design DNN Training Method Provisioning Consistent High Activation Sparsity

 

Author: Kangkyu Park, Yunki Han, Lee-Sup Kim

 

Conference : IEEE/ACM International Conference On Computer Aided Design 2021

 

Abstract: This paper proposes a deep neural network training method that provisions consistently high activation sparsity and the ability to adjust that sparsity. To improve training performance, prior work reduces the memory footprint of training by exploiting the input activation sparsity induced by the ReLU function. However, that approach relies solely on the sparsity inherent to the function, so the footprint reduction is not guaranteed. In particular, models for natural language processing tasks such as BERT do not use the function, so they exhibit almost zero activation sparsity and the previous approach loses its efficiency. In this paper, a new training method, Deferred Dropout, and its hardware architecture are proposed. With the proposed method, input activations are dropped out after the conventional forward-pass computation. In contrast to conventional dropout, where activations are zeroed before the forward-pass computation, the dropping is deferred until the computation completes. The sparsified activations are then compressed and stashed in memory. This approach is based on our observation that networks preserve training quality even if only a few high-magnitude activations are used in the backward pass. The hardware architecture lets designers exploit the tradeoff between training quality and activation sparsity. Evaluation results demonstrate that the proposed method achieves 1.21–3.60× memory footprint reduction and 1.06–1.43× speedup on the TPUv3 architecture compared to the prior work.
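A minimal sketch of the mechanism (our simplification, assuming a plain top-k magnitude criterion and a dense restore for the backward pass):

```python
import numpy as np

def defer_dropout(act, keep_ratio=0.1):
    """Sketch of Deferred Dropout (illustrative, not the exact hardware flow).

    The dense activation is used for the forward pass as usual; only
    afterwards are the highest-magnitude entries kept, stored compressed
    (values + indices) for the backward pass.
    """
    flat = act.ravel()
    k = max(1, int(flat.size * keep_ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # top-k magnitudes
    vals = flat[idx]                               # stash these two arrays
    def restore():
        out = np.zeros_like(flat)
        out[idx] = vals
        return out.reshape(act.shape)
    return vals, idx, restore

a = np.array([[0.1, -2.0, 0.3], [4.0, -0.2, 0.05]])
vals, idx, restore = defer_dropout(a, keep_ratio=0.34)
print(restore())   # only the two largest-magnitude activations survive
```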

A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators

Title : A Framework for Area-efficient Multi-task BERT Execution on ReRAM-based Accelerators

 

Author : Myeonggu Kang, Hyein Shin, Jaekang Shin, Lee-Sup Kim

 

Conference : IEEE/ACM International Conference On Computer Aided Design 2021

 

Abstract : With its superior algorithmic performance, BERT has become the de facto standard model for various NLP tasks. Accordingly, multiple BERT models are often deployed on a single system, which is also called multi-task BERT. Although ReRAM-based accelerators show sufficient potential to execute a single BERT model via in-memory computation, processing multi-task BERT on a ReRAM-based accelerator greatly increases the overall area because of the multiple fine-tuned models. In this paper, we propose a framework for area-efficient multi-task BERT execution on ReRAM-based accelerators. First, we decompose the fine-tuned model of each task by utilizing the base model. We then propose a two-stage weight compressor, which shrinks the decomposed models by analyzing the properties of the ReRAM-based accelerator, along with a profiler that generates hyper-parameters for the compressor. By sharing the base model and compressing the decomposed models, the proposed framework successfully reduces the total area of the ReRAM-based accelerator without any additional training procedure. It achieves 0.26× the area of the baseline while maintaining algorithmic performance.
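The decomposition step can be pictured with a small sketch (our illustration; simple magnitude pruning stands in for the paper’s two-stage, ReRAM-aware compressor):

```python
import numpy as np

def compress_task(W_task, W_base, keep_ratio=0.05):
    """Store each task as a sparse delta on top of the shared base model."""
    delta = W_task - W_base                    # fine-tuning changes little
    k = max(1, int(delta.size * keep_ratio))
    thresh = np.sort(np.abs(delta).ravel())[-k]
    return np.where(np.abs(delta) >= thresh, delta, 0.0)  # per-task storage

def reconstruct(W_base, sparse_delta):
    return W_base + sparse_delta               # approximate fine-tuned weight

rng = np.random.default_rng(1)
W_base = rng.standard_normal((4, 4))
W_task = W_base + 0.01 * rng.standard_normal((4, 4))
delta = compress_task(W_task, W_base, keep_ratio=0.25)
err = np.abs(reconstruct(W_base, delta) - W_task).max()
print(f"max reconstruction error: {err:.4f}")
```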