With the rise of contactless communication and streaming services, super-resolution (SR) on mobile devices has become one of the most important image processing technologies. In addition, the popularity of high-end application processors (APs) and high-resolution mobile displays drives the development of lightweight mobile SR-CNNs that deliver high reconstruction quality. However, the large size and wide dynamic range of both images and intermediate feature maps in CNN hidden layers pose challenges for mobile platforms. Given the limited power and shared bandwidth of a mobile platform, a low-power, energy-efficient architecture is required.
This paper presents an image processing SoC that targets the non-sparse SR task. It contributes two key features: 1) a heterogeneous architecture that uses only 8-bit FP-FXP hybrid precision for the SR task, and 2) a data-lifetime-aware, two-way optimized cache subsystem for energy-efficient depth-first image processing. With highly optimized heterogeneous cores and cache subsystem, the SoC achieves 2.6x higher energy efficiency than the previous SRNPU and a frame rate of 107 frames per second (fps) when running 4x SR image generation to Full-HD scale, with an energy consumption of 0.92 mJ/frame.
Z. Li, S. Kim, D. Im, D. Han and H. -J. Yoo, “An 0.92 mJ/frame High-quality FHD Super-resolution Mobile Accelerator SoC with Hybrid-precision and Energy-efficient Cache,” 2022 IEEE Custom Integrated Circuits Conference (CICC), 2022.
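As a quick sanity check on the figures quoted in the abstract above, energy per frame multiplied by frame rate gives average power. The sketch below assumes that the 0.92 mJ/frame and 107 fps figures refer to the same operating point and that Full-HD means 1920×1080 output pixels; the function names are illustrative only.

```python
# Figures quoted in the abstract.
ENERGY_PER_FRAME_MJ = 0.92      # mJ per frame
FRAME_RATE_FPS = 107            # frames per second
FULL_HD_PIXELS = 1920 * 1080    # assumed output resolution for "Full-HD"


def average_power_mw(energy_per_frame_mj: float, fps: float) -> float:
    """Average power in mW: (mJ/frame) * (frames/s) = mJ/s = mW."""
    return energy_per_frame_mj * fps


def energy_per_pixel_nj(energy_per_frame_mj: float, pixels: int) -> float:
    """Energy per output pixel in nJ (1 mJ = 1e6 nJ)."""
    return energy_per_frame_mj * 1e6 / pixels


if __name__ == "__main__":
    power = average_power_mw(ENERGY_PER_FRAME_MJ, FRAME_RATE_FPS)
    print(f"average power: {power:.2f} mW")  # prints "average power: 98.44 mW"
    per_pixel = energy_per_pixel_nj(ENERGY_PER_FRAME_MJ, FULL_HD_PIXELS)
    print(f"energy per output pixel: {per_pixel:.3f} nJ")
```

Under these assumptions the two headline numbers together imply an average power of roughly 98 mW.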
A low-power graph convolutional network (GCN) processor is proposed for accelerating 3D point cloud semantic segmentation (PCSS) in real time on mobile devices. Three key features enable low-power GCN-based 3D PCSS. First, a new hardware-friendly GCN algorithm, sparse grouping-based dilated graph convolution (SG-DGC), is proposed. SG-DGC reduces overall computation by 71.7% and external memory access (EMA) by 76.9% through sparse grouping of the point cloud. Second, a two-level pipeline (TLP) consisting of a point-level pipeline (PLP) and a group-level pipeline (GLP) is proposed to improve the low utilization caused by the imbalanced workload of the GCN. The PLP enables point-level module-wise fusion (PMF), which reduces EMA by 47.4% for low power consumption. In addition, center point feature reuse (CPFR) reuses the results of redundant operations and reduces computation by 11.4%. Finally, the GLP increases core utilization by 21.1% by balancing the workload of graph generation and graph convolution, enabling 1.1× higher throughput. The processor is implemented in 65 nm CMOS technology, and the 4.0 mm² 3D PCSS processor consumes 95 mW while operating in real time at 30.8 fps on 3D PCSS of an indoor scene with 4k points.
Kim, Sangjin, et al. “A Low-Power Graph Convolutional Network Processor With Sparse Grouping for 3D Point Cloud Semantic Segmentation in Mobile Devices.” IEEE Transactions on Circuits and Systems I: Regular Papers (2022).
Kim, Sangjin, et al. “A 54.7 fps 3D Point Cloud Semantic Segmentation Processor with Sparse Grouping based Dilated Graph Convolutional Network for Mobile Devices.” 2020 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2020.
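This paper proposes TSUNAMI, which supports energy-efficient deep neural network training. TSUNAMI supports multi-modal iterative pruning to generate zeros in activations and weights. A tile-based dynamic activation pruning unit and a weight-memory-shared pruning unit eliminate the additional memory accesses that pruning would otherwise require. The coarse-zero skipping controller skips multiple unnecessary multiply-and-accumulate (MAC) operations at once, and the fine-zero skipping controller skips randomly located unnecessary MAC operations. A weight sparsity balancer resolves the utilization degradation caused by weight sparsity imbalance, and the workload of each convolution core is allocated by a random channel allocator. TSUNAMI achieves an energy efficiency of 3.42 TFLOPS/W at 0.78 V and 50 MHz with 8-bit floating-point activations and weights, and a world-leading energy efficiency of 405.96 TFLOPS/W under 90% sparsity.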
Kim, Sangyeob, et al. “TSUNAMI: Triple Sparsity-Aware Ultra Energy-Efficient Neural Network Training Accelerator With Multi-Modal Iterative Pruning.” IEEE Transactions on Circuits and Systems I: Regular Papers (2022).
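The two granularities of zero skipping described for TSUNAMI can be illustrated in software. The following is only a minimal sketch of the general idea, not the chip's controller logic: the tile size of 4, the function name `sparse_dot`, and the rule "skip a tile when either operand tile is all zero" are assumptions made for illustration.

```python
def sparse_dot(acts, weights, tile=4):
    """Dot product that counts how many MACs are actually executed.

    Coarse skipping: a whole tile is skipped when either operand tile is all zero.
    Fine skipping: inside a surviving tile, individual zero operands are skipped.
    The tile size is a hypothetical parameter, not taken from the paper.
    """
    assert len(acts) == len(weights)
    total = 0.0
    macs_done = 0
    for start in range(0, len(acts), tile):
        a_tile = acts[start:start + tile]
        w_tile = weights[start:start + tile]
        # Coarse-zero skipping: drop the entire tile in one step.
        if not any(a_tile) or not any(w_tile):
            continue
        for a, w in zip(a_tile, w_tile):
            # Fine-zero skipping: drop scattered zero operands one by one.
            if a == 0 or w == 0:
                continue
            total += a * w
            macs_done += 1
    return total, macs_done


if __name__ == "__main__":
    acts = [0, 0, 0, 0, 1, 0, 2, 0]
    weights = [1, 2, 3, 4, 5, 6, 0, 7]
    print(sparse_dot(acts, weights))  # prints (5.0, 1)
```

In this example a dense implementation would execute 8 MACs; the first tile is removed by the coarse check and three of the remaining four operations by the fine check.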
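Previous computing-in-memory (CIM) processors could not achieve energy efficiency above 100 TOPS/W because they require multi-WL driving for parallel processing and many high-precision ADCs for high accuracy. Some processors relied heavily on weight data sparsity to obtain high energy efficiency, but in practical cases (e.g., the ImageNet task with ResNet-18) sparsity is limited to less than 30%, which limits their performance.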
This paper proposes an energy-efficient neuromorphic computing-in-memory processor with four key features. Most-significant-bit (MSB) word skipping is proposed to reduce BL activity, and early stopping is proposed to lower BL activity further. In addition, mixed-mode firing is proposed for aggregation across multiple macros, and voltage folding is proposed to extend the dynamic range. As a result, the proposed processor achieves world-leading energy efficiency of 62.1 TOPS/W (I=4b, W=8b) and 310.4 TOPS/W (I=4b, W=1b).
- Kim, et al. “Neuro-CIM: A 310.4TOPS/W Neuromorphic Computing-in-Memory Processor with Low WL/BL activity and Digital-Analog Mixed-mode Neuron Firing”, Symposium on VLSI Circuits (S. VLSI), Jun. 2022
A low-latency and low-power dense RGB-D acquisition and 3D bounding-box extraction system-on-chip, DSPU, is proposed. The DSPU produces accurate dense RGB-D data through CNN-based monocular depth estimation and sensor fusion with a low-power ToF sensor. Furthermore, it runs a 3D point cloud-based neural network for 3D bounding-box extraction. The architecture of the DSPU accelerates the system by alleviating the data-intensive and computation-intensive operations. Finally, the DSPU achieves real-time end-to-end RGB-D acquisition and 3D bounding-box extraction while consuming 281.6 mW.
Im, Dongseok, et al. “DSPU: A 281.6 mW Real-Time Depth Signal Processing Unit for Deep Learning-Based Dense RGB-D Data Acquisition with Depth Fusion and 3D Bounding Box Extraction in Mobile Platforms.” 2022 IEEE International Solid-State Circuits Conference (ISSCC). Vol. 65. IEEE, 2022.
This paper presents HNPU, an energy-efficient DNN training processor that adopts algorithm-hardware co-design. The HNPU supports a stochastic dynamic fixed-point representation and a layer-wise adaptive precision searching unit for low-bit-precision training. It additionally exploits slice-level reconfigurability and sparsity to maximize its efficiency in both DNN inference and training. An adaptive-bandwidth reconfigurable accumulation network enables reconfigurable DNN allocation and maintains high core utilization across various bit-precision conditions. Fabricated in a 28 nm process, the HNPU achieves at least 5.9× higher energy efficiency and 2.5× higher area efficiency in actual DNN training compared with the previous state-of-the-art on-chip learning processors.
Han, Donghyeon, et al. “HNPU: An adaptive DNN training processor utilizing stochastic dynamic fixed-point and active bit-precision searching.” IEEE Journal of Solid-State Circuits 56.9 (2021): 2858-2869.
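One common reading of "stochastic dynamic fixed-point" is a fixed-point format whose binary point is chosen per tensor from the data range, combined with stochastic rounding. The sketch below illustrates that reading only; it is not taken from the HNPU paper, and the helper names, the 8-bit default, and the max-magnitude rule for placing the binary point are all assumptions.

```python
import math
import random


def choose_frac_bits(values, total_bits):
    """Place the binary point so the largest magnitude fits (assumed rule)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    int_bits = math.floor(math.log2(max_abs)) + 1  # may be negative for small data
    return total_bits - 1 - int_bits               # one bit is reserved for the sign


def stochastic_round(x, rng):
    """Round up with probability equal to the fractional part, so E[result] = x."""
    lower = math.floor(x)
    return lower + (1 if rng.random() < (x - lower) else 0)


def quantize(values, total_bits=8, rng=None):
    """Return integer codes and the shared number of fractional bits."""
    rng = rng or random.Random(0)
    frac_bits = choose_frac_bits(values, total_bits)
    scale = 2.0 ** frac_bits
    qmin, qmax = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    codes = [min(qmax, max(qmin, stochastic_round(v * scale, rng))) for v in values]
    return codes, frac_bits


def dequantize(codes, frac_bits):
    return [c / 2.0 ** frac_bits for c in codes]


if __name__ == "__main__":
    data = [0.3, -1.7, 0.05]
    codes, frac_bits = quantize(data)
    print(codes, frac_bits, dequantize(codes, frac_bits))
```

Because the rounding direction is random but unbiased, small gradient updates are preserved on average rather than being rounded away every time, which is the usual motivation for stochastic rounding in low-precision training.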
The authors propose a heterogeneous floating-point (FP) computing architecture that maximizes energy efficiency by separately optimizing exponent processing and mantissa processing. The proposed exponent-computing-in-memory (ECIM) architecture and mantissa-free-exponent-computing (MFEC) algorithm reduce the power consumption of both memory and FP MAC while resolving the limitations of previous FP computing-in-memory processors. In addition, a bfloat16 DNN training processor with the proposed features and support for sparsity exploitation is implemented and fabricated in 28 nm CMOS technology. It achieves 13.7 TFLOPS/W energy efficiency while supporting FP operations with a CIM architecture.
- Lee et al., “ECIM: Exponent Computing in Memory for an Energy-Efficient Heterogeneous Floating-Point DNN Training Processor,” in IEEE Micro, Jan 2022
- Lee et al., “An Energy-efficient Floating-Point DNN Processor using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory”, 2021 IEEE Hot Chips 33 Symposium (HCS), 2021
- Lee et al., “A 13.7 TFLOPS/W Floating-point DNN Processor using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory,” 2021 Symposium on VLSI Circuits, 2021
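The reason exponent and mantissa processing can be optimized separately is that a floating-point multiplication decomposes into an integer addition of exponents and a multiplication of mantissas. The sketch below demonstrates only that decomposition using Python's standard `math.frexp` and `math.ldexp`; it is not an implementation of ECIM or MFEC, and the function name is illustrative.

```python
import math


def multiply_split(a: float, b: float) -> float:
    """Multiply two floats by handling exponent and mantissa on separate paths."""
    mant_a, exp_a = math.frexp(a)   # a == mant_a * 2**exp_a, with 0.5 <= |mant_a| < 1
    mant_b, exp_b = math.frexp(b)
    exponent = exp_a + exp_b        # exponent path: a small integer addition
    mantissa = mant_a * mant_b      # mantissa path: a fixed-width multiplication
    return math.ldexp(mantissa, exponent)  # recombine: mantissa * 2**exponent


if __name__ == "__main__":
    print(multiply_split(3.5, -0.125))  # prints -0.4375
```

The exponent path involves only short integer additions, which is the kind of operation that lends itself to a different hardware treatment from the mantissa multiplier.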
We present an energy-efficient deep reinforcement learning (DRL) processor, OmniDRL, for DRL training on edge devices. The need for DRL training is growing because DRL can adapt to each individual user. However, the massive amount of external and internal memory access limits the implementation of DRL training on resource-constrained platforms. OmniDRL proposes four key features that reduce external memory access by compressing as much data as possible and reduce internal memory access by directly processing the compressed data. Group-sparse training (GST) enables a high weight compression ratio in every DRL iteration. A group-sparse training core is proposed to take full advantage of the compressed weights from GST. Exponent mean delta encoding additionally compresses the exponents of both weights and feature maps. A world-first on-chip sparse-weight transposer enables DRL training on compressed weights without an off-chip transposer. OmniDRL is fabricated in 28 nm CMOS technology and occupies a 3.6×3.6 mm² die area. It achieves 7.42 TFLOPS/W energy efficiency when training a robot agent (MuJoCo HalfCheetah, TD3), which is 2.4× higher than the previous state of the art.
- Lee et al., “OmniDRL: An Energy-Efficient Deep Reinforcement Learning Processor With Dual-Mode Weight Compression and Sparse Weight Transposer,” in IEEE Journal of Solid-State Circuits, April 2022
- Lee et al., “OmniDRL: An Energy-Efficient Mobile Deep Reinforcement Learning Accelerators with Dual-mode Weight Compression and Direct Processing of Compressed Data”, 2021 IEEE Hot Chips 33 Symposium (HCS), 2021
- Lee et al., “OmniDRL: A 29.3 TFLOPS/W Deep Reinforcement Learning Processor with Dual-mode Weight Compression and On-chip Sparse Weight Transposer,” 2021 Symposium on VLSI Circuits, 2021
- Lee et al., “Low-power Autonomous Adaptation System with Deep Reinforcement Learning,” 2022 AICAS, 2022
- Lee et al., “Energy-Efficient Deep Reinforcement Learning Accelerator Designs for Mobile Autonomous Systems,” 2021 AICAS, 2021
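The abstract does not spell out how exponent mean delta encoding works. One plausible reading, sketched here purely as an assumption, is that the exponents of a group of values are stored as a shared mean plus small per-value deltas, which need fewer bits than the raw exponents when the values have similar magnitudes. All names below are hypothetical, and the input group is assumed to be non-empty.

```python
import math


def encode_exponents(values):
    """Split off binary exponents and store them as (rounded mean, deltas)."""
    exps = [math.frexp(v)[1] for v in values]   # v == m * 2**e with 0.5 <= |m| < 1
    mean = round(sum(exps) / len(exps))
    deltas = [e - mean for e in exps]
    return mean, deltas


def decode_exponents(mean, deltas):
    """Recover the original exponents from the shared mean and the deltas."""
    return [mean + d for d in deltas]


def bits_needed(deltas):
    """Smallest two's-complement width that can hold every delta."""
    lo, hi = min(deltas), max(deltas)
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


if __name__ == "__main__":
    weights = [0.5, 0.75, 1.0, 0.3]
    mean, deltas = encode_exponents(weights)
    print(mean, deltas, bits_needed(deltas))  # prints 0 [0, 0, 1, -1] 2
```

In this toy group each exponent delta fits in 2 bits, compared with the 8-bit exponent field of a format such as bfloat16; the actual on-chip format and grouping may differ.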
Abstract: The conventional delay-and-sum algorithm assumes that a target object is composed of substances with an identical speed of sound (SoS), i.e., 1540 m/s, and applies the corresponding delay to the received RF signals to synthesize output images. However, this assumption compromises image resolution because body tissues are inhomogeneous. In this paper, we propose an SoS-adaptive Rx beamforming method that generates high-resolution ultrasonic images. A neural network (NN) approach is adopted to reconstruct the SoS distribution and determine the accurate time of flight (ToF) of each channel from the generated SoS map.
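To make the role of the speed-of-sound assumption concrete, the sketch below implements a simplified receive-only delay-and-sum for a single focal point with a constant SoS. It is an illustration under stated assumptions, not the proposed method: the transmit path is ignored, nearest-sample indexing replaces interpolation, and the function names and geometry are hypothetical.

```python
import math


def rx_delays(element_x, focus, sos=1540.0):
    """One-way time of flight (s) from the focal point to each array element.

    element_x: lateral element positions in metres (elements lie on z = 0).
    focus: (x, z) of the focal point in metres.
    sos: assumed uniform speed of sound in m/s.
    """
    fx, fz = focus
    return [math.hypot(fx - x, fz) / sos for x in element_x]


def delay_and_sum(rf, element_x, focus, fs, sos=1540.0):
    """Sum each channel's sample at its computed delay (nearest sample)."""
    total = 0.0
    for channel, delay in zip(rf, rx_delays(element_x, focus, sos)):
        idx = round(delay * fs)
        if 0 <= idx < len(channel):
            total += channel[idx]
    return total


if __name__ == "__main__":
    fs = 1e6                       # sampling rate in Hz (illustrative)
    elements = [0.0, 0.0]          # two co-located elements keep the example small
    focus = (0.0, 0.0154)          # 15.4 mm deep, i.e. 10 us at 1540 m/s
    echo = [0.0] * 32
    echo[10] = 1.0                 # the echo arrives at sample 10
    rf = [echo, list(echo)]
    print(delay_and_sum(rf, elements, focus, fs, sos=1540.0))
    print(delay_and_sum(rf, elements, focus, fs, sos=1400.0))
```

With the matching SoS the echoes add coherently, whereas an incorrect SoS shifts the delay by one sample in this toy case and the echo is missed, which is the resolution loss that the SoS-adaptive delays aim to avoid.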
Abstract: In this paper, we present a scalable lesion-quantifying neural network based on b-mode-to-quantitative neural style transfer. Quantitative tissue characteristics have great potential in diagnostic ultrasound because pathological changes cause variations in biomechanical properties. The proposed system simultaneously provides four clinically critical quantitative tissue images (sound speed, attenuation coefficient, effective scatterer diameter, and effective scatterer concentration) by applying quantitative style information to structurally accurate b-mode images. The proposed system was evaluated through numerical simulation and through phantom and ex-vivo measurements. The numerical simulation shows that the proposed framework outperforms the baseline model as well as existing state-of-the-art methods while achieving a significant parameter reduction per quantitative variable. In phantom and ex-vivo studies, the BQI-Net demonstrates sufficient sensitivity and specificity in identifying and classifying cancerous lesions.