Abstract
Dynamic IR drop analysis, which computes the maximum IR drop occurring during actual circuit operation, is a very time-consuming process. In this work, we propose a fast dynamic IR drop analysis method based on U-net, a type of image-to-image translation neural network. The U-net input is an image clip made up of three maps: the effective resistance to each gate, the current consumption of each gate over time, and the distance to the nearest power pad. For faster prediction, instead of predicting every clip we predict only the time windows likely to show a high IR drop, and we apply a fast approximation of the PDN resistance. Experimental results show that the proposed prediction method is about 20x faster than actual dynamic IR drop analysis with an error of about 15%.
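As an illustration of the input and output described above, here is a minimal sketch assuming PyTorch; the toy network TinyUNet and the tensor names eff_res, cur_draw, and pad_dist are hypothetical placeholders, not the trained model from this work.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy encoder-decoder standing in for the U-net used in this work."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
                                 nn.MaxPool2d(2))
        self.dec = nn.Sequential(nn.Upsample(scale_factor=2),
                                 nn.Conv2d(16, 1, 3, padding=1))
    def forward(self, x):
        return self.dec(self.enc(x))       # (N, 1, H, W) predicted IR drop map

# One input clip: three per-gate maps rasterized onto the same H x W grid.
H = W = 64
eff_res  = torch.rand(H, W)                # effective resistance to each gate
cur_draw = torch.rand(H, W)                # current drawn in the selected time window
pad_dist = torch.rand(H, W)                # distance to the nearest power pad
clip = torch.stack([eff_res, cur_draw, pad_dist]).unsqueeze(0)   # (1, 3, H, W)

ir_drop_map = TinyUNet()(clip)             # trained against golden dynamic IR drop maps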
Abstract
We propose fast OPC using an RNN. An RNN consists of multiple neural network instances connected in sequence; exploiting this structure, RNN-based OPC can predict the mask biases of several nearby segments at once, which yields higher accuracy than predicting each segment individually. We use an RNN with a bidirectional GRU structure, in which GRU cells are connected in both directions, and we also propose the RNN input representation, a training-data sampling scheme, and an efficient mapping between segments and neural network instances. Experimental results show 36% lower EPE than an ANN-based ML-OPC method.
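As a toy illustration of how one pass can produce a bias for every segment, here is a minimal sketch assuming PyTorch; BiGRUOPC, the feature dimension, and seg_feats are hypothetical placeholders, not the actual input encoding proposed above.

import torch
import torch.nn as nn

class BiGRUOPC(nn.Module):
    """Bidirectional GRU that maps a sequence of segment features to mask biases."""
    def __init__(self, feat_dim=8, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)     # one mask bias per segment
    def forward(self, seg_feats):                # (batch, n_segments, feat_dim)
        h, _ = self.rnn(seg_feats)
        return self.head(h).squeeze(-1)          # (batch, n_segments)

seg_feats = torch.rand(4, 10, 8)   # 4 patterns, 10 nearby segments, 8 features each
biases = BiGRUOPC()(seg_feats)     # would be trained against model-based OPC results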
The KAIST ITRC Artificial Intelligence Semiconductor System (AISS) Research Center is being launched with Professor Joo-Young Kim of our department as its director. The center was newly selected under the 2020 University ICT Research Center program, which is run by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Ministry of Science and ICT. Professor Kim plans to lead the project through 2025 with a total budget of about 5 billion KRW, aiming to develop convergent semiconductor system technologies for a contact-free, AI-driven society.
The center will be located in Daejeon as a hub research center linking Seoul, Daejeon, and Ulsan, and will carry out joint research with Yonsei University, Ewha Womans University, and UNIST.
The opening ceremony will be held this Friday at 10:30 AM. To help prevent the spread of the coronavirus, the ceremony will take place online. (https://us02web.zoom.us/j/84476190909 (ZOOM))
Professor Hyun-Wook Park of our department, Vice President for Research, will also take part as a speaker, so we ask for your interest.
[link]
http://www.aitimes.kr/news/articleView.html?idxno=18666 (press coverage)
Professor Hoi-Jun Yoo's research team in our department has developed an artificial intelligence (AI) semiconductor that processes generative adversarial networks (GANs) efficiently and at low power.
The AI semiconductor developed by the team can process multiple deep neural networks and can train them even on low-power mobile devices. With this chip, the team succeeded in running generative AI techniques such as image synthesis, style transfer, and damaged-image restoration on a mobile device.
The work, with Ph.D. student Sang-Hoon Kang as first author, was presented on February 17 at the International Solid-State Circuits Conference (ISSCC) in San Francisco, attended by about 3,000 semiconductor researchers. (Paper title: GANPU: A 135TFLOPS/W Multi-DNN Training Processor for GANs with Speculative Dual-Sparsity Exploitation)
Although many accelerators have recently been developed to bring AI to mobile devices, previous work supports only the inference stage or is limited to training a single deep neural network. The team developed GANPU (Generative Adversarial Networks Processing Unit), an AI semiconductor that can handle multi-DNN workloads such as GANs, not just a single DNN, and can also train them on mobile devices, thereby widening the range of AI applications on mobile hardware.
Because the chip can train a GAN entirely on the device without sending data to a server, it enables privacy-preserving processing, which makes its applications especially promising. With its newly developed techniques, GANPU achieves 4.8x higher energy efficiency than the previous best DNN training chip.
As an application example, the team demonstrated a system that lets a user directly edit a photo taken with a tablet camera: when additions, deletions, or modifications are specified for 17 facial attributes such as hair, glasses, and eyebrows, GANPU completes the edited face automatically in real time.
[Link]
https://www.ytn.co.kr/_ln/0115_202004070308303992
https://news.kaist.ac.kr/news/html/news/?mode=V&mng_no=6831
Title: A Full HD 60 fps CNN Super Resolution Processor with Selective Caching based Layer Fusion for Mobile Devices
Authors: Ju-Hyoung Lee, Dong-Joo Shin, Jin-Su Lee, Jin-Mook Lee, Sang-Hoon Kang, and Hoi-Jun Yoo
Recently, super-resolution algorithms based on convolutional neural networks (SR-CNN) have been widely adopted to let mobile devices deliver a better user experience (UX) through video quality enhancement or far-object recognition. However, the distinct architecture of SR-CNNs makes it hard to meet the high throughput requirement on conventional hardware targeting classification CNNs: the intermediate feature maps of SR do not shrink as they pass through the layers, whereas a classification CNN's feature maps shrink due to pooling or strided convolutions. Because of the huge feature maps in SR-CNN, it requires more external memory access (EMA), a larger on-chip memory footprint, and a heavier computation workload than a classification CNN.
In this work, we propose a high-throughput SR-CNN processor that minimizes EMA and the on-chip memory footprint with three key features: 1) a selective caching based layer fusion (SCLF) algorithm to reduce the overall memory cost (the product of on-chip memory size and EMA), 2) a memory compaction scheme to further reduce the on-chip memory footprint, and 3) a cyclic ring core architecture to increase PE utilization for SCLF. As a result, the implemented processor achieves 60 frames-per-second throughput in generating full HD images.
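To make the memory cost metric concrete, the rough sketch below compares per-layer processing with fused processing; the layer sizes, caching fraction, and buffer sizes are hypothetical placeholders, not numbers from the paper.

def memory_cost(fmap_bytes_per_layer, fused):
    """Memory cost = on-chip buffer bytes x external memory access bytes."""
    if fused:
        # Layer fusion keeps intermediate tiles on chip, so EMA covers only the
        # network input and output, but a slice of each map must be cached.
        on_chip = max(fmap_bytes_per_layer) // 8          # assume ~1/8 of a map cached
        ema = fmap_bytes_per_layer[0] + fmap_bytes_per_layer[-1]
    else:
        # Per-layer processing streams every intermediate map through DRAM.
        on_chip = 64 * 1024                               # small working buffer
        ema = 2 * sum(fmap_bytes_per_layer)               # write then re-read each map
    return on_chip * ema

fmaps = [8_000_000] * 6    # SR feature maps do not shrink across layers
print(memory_cost(fmaps, fused=False), memory_cost(fmaps, fused=True))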
Figure 1. An illustration of the proposed SR computing algorithm & proposed ring core architecture
Title: 1.32 TOPS/W Energy Efficient Deep Neural Network Learning Processor with Direct Feedback Alignment based Heterogeneous Core Architecture
Authors: Dong-Hyeon Han, Jin-Su Lee, Jin-Mook Lee and Hoi-Jun Yoo
An energy efficient deep neural network (DNN) learning processor is proposed using direct feedback alignment (DFA).
The proposed processor achieves 2.2x faster learning than previous learning processors through pipelined DFA (PDFA). Since the computation direction of back-propagation (BP) is reversed from inference, the gradient of the 1st layer cannot be generated until the errors have propagated from the last layer back to the 1st layer. In contrast, the proposed processor applies DFA, which propagates the errors directly from the last layer. This means that PDFA can propagate errors during the next inference computation, so the weight update of the 1st layer does not need to wait for error propagation through all the layers. To enhance energy efficiency by 38.7%, the heterogeneous learning core (LC) architecture is optimized with an 11-stage pipelined data-path, which shows 2x longer data reuse than conventional BP. Furthermore, the direct error propagation core (DEPC) utilizes random number generators (RNGs) to remove the external memory access (EMA) caused by error propagation (EP), improving energy efficiency by 19.9%.
The proposed PDFA-based learning processor is evaluated on an object tracking (OT) application and achieves 34.4 frames-per-second (FPS) throughput with 1.32 TOPS/W energy efficiency.
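For readers unfamiliar with DFA, here is a minimal NumPy sketch of the update rule for a two-hidden-layer MLP; the dimensions and the fixed random feedback matrices B1 and B2 are illustrative only and are not taken from the processor.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64))                     # one input sample
W1, W2, W3 = (rng.standard_normal(s) * 0.1 for s in [(64, 128), (128, 128), (128, 10)])
B1, B2 = rng.standard_normal((10, 128)), rng.standard_normal((10, 128))  # fixed feedback

relu = lambda z: np.maximum(z, 0.0)
h1 = relu(x @ W1); h2 = relu(h1 @ W2); y = h2 @ W3
e = y - np.eye(10)[[3]]                              # output error for a dummy target

# DFA: every hidden layer receives the output error through its own fixed random
# matrix, so layer-1 and layer-2 updates need no layer-by-layer error propagation.
dW3 = h2.T @ e
dW2 = h1.T @ ((e @ B2) * (h2 > 0))
dW1 = x.T  @ ((e @ B1) * (h1 > 0))
for W, dW in [(W1, dW1), (W2, dW2), (W3, dW3)]:
    W -= 0.01 * dW                                   # in-place SGD step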
Figure 1. Back-propagation vs Pipelined DFA
Figure 2. Layer Level vs Neuron-level vs Partial-sum Level Pipeline
Figure 3. Overall Architecture of Proposed Processor
Title: LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16
Authors: Jin-Su Lee, Ju-Hyoung Lee, Dong-Hyeon Han, Jin-Mook Lee, Gwang-Tae Park, and Hoi-Jun Yoo
Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration. Most previous DNN inference accelerators use parameters trained on a cloud server with public datasets and downloaded to the device to implement AI. However, local DNN learning with domain-specific and private data is required to adapt to each user's preferences on edge or mobile devices. Since edge and mobile devices have only limited computation capability and run on battery power, an energy-efficient DNN learning processor is necessary. In this paper, we present an energy-efficient on-chip learning accelerator. Its data precision is optimized while maintaining training accuracy through fine-grained mixed precision (FGMP) of FP8-FP16, which reduces external memory access (EMA) and enhances throughput with high accuracy. In addition, sparsity is exploited with intra-channel as well as inter-channel accumulation to support the three DNN learning steps at higher throughput and better energy efficiency. An input load balancer (ILB) is also integrated to improve PE utilization under the unbalanced amount of input data caused by irregular sparsity. External memory access is reduced by 38.9% and energy efficiency is improved 2.08x for ResNet-18 training. The fabricated chip occupies 16mm2 in 65nm CMOS, and the energy efficiency is 3.48TFLOPS/W (FP8) at 0.0% sparsity and 25.3TFLOPS/W (FP8) at 90% sparsity.
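As a conceptual illustration of FGMP (a software sketch, not the LNPU datapath), the code below keeps each small group of values in FP8 when its magnitude fits an assumed FP8 range and falls back to FP16 otherwise, which is what reduces the average bits moved to external memory.

import numpy as np

FP8_MAX = 240.0    # assumed max magnitude of the FP8 format; illustrative only

def choose_precision(values, group=16):
    """Return per-group bit widths and the average bits per value."""
    groups = values.reshape(-1, group)
    fp8_ok = np.abs(groups).max(axis=1) < FP8_MAX    # range check per group
    bits = np.where(fp8_ok, 8, 16)
    return bits, bits.mean()

acts = np.random.randn(4096) * 10.0
bits, avg_bits = choose_precision(acts)
print(f"average {avg_bits:.1f} bits per value vs. 16 for pure FP16")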
Title: A 2.1TFLOPS/W Mobile Deep RL Accelerator with Transposable PE Array and Experience Compression
Authors: Chang-Hyeon Kim, Sang-Hoon Kang, Dong-Joo Shin, Sung-Pill Choi, Young-Woo Kim and Hoi-Jun Yoo
Recently, deep neural networks (DNNs) have been actively used for action control so that autonomous systems, such as robots, can perform human-like behaviors and operations. Unlike recognition tasks, real-time operation is essential in action control, and remote learning on a server over a network is too slow. New learning techniques, such as reinforcement learning (RL), are needed to determine and select the correct robot behavior locally. In this paper, we propose a DRL accelerator with a transposable PE array and an experience compressor to realize real-time DRL operation of autonomous agents in dynamic environments. It supports on-chip data compression and decompression so that ~10,000 DRL experiences can be compressed by 65%, and it enables adaptive data reuse for inference and training, which reduces power and peak memory bandwidth by 31% and 41%, respectively. The proposed DRL accelerator is fabricated in 65nm CMOS technology and occupies a 4×4 mm2 die area. This is the first fully trainable DRL processor, and it achieves 2.16 TFLOPS/W energy efficiency at 0.73V with 16b weights at 50MHz.
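The experience-compression idea can be sketched in software as a replay buffer that stores each experience in compressed form and decompresses it only when sampled; the class below uses generic zlib compression and hypothetical names, not the chip's compression scheme.

import io
import random
import zlib
import numpy as np

class CompressedReplay:
    """Replay buffer that keeps experiences compressed until they are sampled."""
    def __init__(self, capacity=10_000):
        self.buf, self.capacity = [], capacity
    def push(self, state, action, reward, next_state):
        raw = io.BytesIO()
        np.savez(raw, s=state, a=action, r=reward, ns=next_state)
        blob = zlib.compress(raw.getvalue())          # shrink before storing
        if len(self.buf) >= self.capacity:
            self.buf.pop(0)                           # drop the oldest experience
        self.buf.append(blob)
    def sample(self, n):
        batch = []
        for blob in random.sample(self.buf, n):
            d = np.load(io.BytesIO(zlib.decompress(blob)))
            batch.append((d["s"], d["a"], d["r"], d["ns"]))
        return batch

rb = CompressedReplay()
for _ in range(32):
    rb.push(np.random.rand(84, 84).astype(np.float32), 1, 0.5,
            np.random.rand(84, 84).astype(np.float32))
batch = rb.sample(8)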
Fig. 1. DRL Accelerator with tPE Array, Implementation & Measurement Results
Title: CNNP-v2: An Energy Efficient Memory-Centric Convolutional Neural Network Processor Architecture
Authors: Sung-Pill Choi, Kyeong-Ryeol Bong, Dong-Hyeon Han, and Hoi-Jun Yoo
An energy efficient memory-centric convolutional neural network (CNN) processor architecture is proposed for smart devices such as wearable or internet of things (IoT) devices. To achieve energy-efficient processing, it has two key features. First, 1-D shift convolution PEs with a fully distributed memory architecture achieve 3.1TOPS/W energy efficiency: even though the design has 1024 massively parallel MAC units, its fully locally routed structure allows the supply voltage to be scaled down to 0.46V, giving high energy efficiency compared with a conventional architecture. Second, a fully configurable 2-D mesh core-to-core interconnection supports various input feature sizes to maximize utilization. The proposed architecture is evaluated on a 16mm2 chip fabricated in a 65nm CMOS process, and it performs real-time face recognition with only 9.4mW at 10MHz and 0.48V.
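One way to read the 1-D shift convolution idea (a functional interpretation, not the chip's actual datapath) is that a 3x3 convolution is computed as three 1-D row passes whose partial sums are accumulated with a vertical shift, as in the sketch below.

import numpy as np
from scipy.signal import correlate2d

def shift_conv2d(fmap, kernel):
    """'Valid' 3x3 CNN-style convolution built from 1-D row convolutions."""
    H, W = fmap.shape
    out = np.zeros((H - 2, W - 2))
    for dy in range(3):                               # one 1-D pass per kernel row
        row_partial = np.array([np.convolve(fmap[y + dy], kernel[dy][::-1], 'valid')
                                for y in range(H - 2)])
        out += row_partial                            # vertical shift folded into y + dy
    return out

fmap, k = np.random.rand(8, 8), np.random.rand(3, 3)
assert np.allclose(shift_conv2d(fmap, k), correlate2d(fmap, k, mode='valid'))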