Title: LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16
Authors: Jin-Su Lee, Ju-Hyoung Lee, Dong-Hyeon Han, Jin-Mook Lee, Gwang-Tae Park, and Hoi-Jun Yoo
Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration. Most previous DNN inference accelerators rely on parameters that are trained on a cloud server using public datasets and then downloaded to the device. However, local DNN learning with domain-specific and private data is required to adapt to individual users' preferences on edge or mobile devices. Since edge and mobile devices have limited computation capability and run on battery power, an energy-efficient DNN learning processor is necessary. In this paper, we present an energy-efficient on-chip learning accelerator. Fine-grained mixed precision (FGMP) of FP8-FP16 optimizes its data precision while maintaining training accuracy, which reduces external memory access (EMA) and enhances throughput. In addition, sparsity is exploited with intra-channel accumulation as well as inter-channel accumulation, which supports the three DNN learning steps with higher throughput and improves energy efficiency. An input load balancer (ILB) is also integrated to improve PE utilization when irregular sparsity leaves the PEs with unbalanced amounts of input data. For ResNet-18 training, external memory access is reduced by 38.9% and energy efficiency is improved by 2.08 times. The fabricated chip occupies 16 mm² in 65nm CMOS, and its energy efficiency is 3.48TFLOPS/W (FP8) at 0% sparsity and 25.3TFLOPS/W (FP8) at 90% sparsity.
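To make the load-balancing motivation concrete, the following Python sketch models why irregular sparsity can leave PEs idle and how rebalancing the inputs helps. It is only an illustration under stated assumptions, not the chip's ILB: the function names, the per-channel nonzero counts, the choice of four PEs, and the greedy "largest item to the least-loaded PE" policy are all hypothetical. It also assumes that zero inputs are skipped, so a PE's work is proportional to the number of nonzero inputs it receives, and that all PEs wait for the slowest one.

```python
from typing import List


def utilization(loads: List[int]) -> float:
    """Fraction of PE-cycles doing useful work when every PE waits for the busiest one."""
    busiest = max(loads)
    if busiest == 0:
        return 1.0
    return sum(loads) / (len(loads) * busiest)


def assign_static(work: List[int], num_pes: int) -> List[int]:
    """Round-robin assignment that ignores how many nonzeros each item carries."""
    loads = [0] * num_pes
    for i, w in enumerate(work):
        loads[i % num_pes] += w
    return loads


def assign_balanced(work: List[int], num_pes: int) -> List[int]:
    """Greedy balancing (an assumed policy): largest item first, to the least-loaded PE."""
    loads = [0] * num_pes
    for w in sorted(work, reverse=True):
        loads[loads.index(min(loads))] += w
    return loads


# Hypothetical nonzero counts for 8 input channels with irregular sparsity.
nonzeros = [9, 1, 8, 0, 7, 2, 9, 0]

static_loads = assign_static(nonzeros, num_pes=4)
balanced_loads = assign_balanced(nonzeros, num_pes=4)

print(static_loads, f"{utilization(static_loads):.2f}")      # [16, 3, 17, 0] 0.53
print(balanced_loads, f"{utilization(balanced_loads):.2f}")  # [9, 9, 9, 9] 1.00
```

In this toy case the same 36 nonzero inputs keep the PEs busy about 53% of the time under the static assignment and 100% of the time after balancing, which is the kind of utilization gap a load balancer is meant to close.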