Multiple-Resolution Decoding Architecture for QC-LDPC Codes (Prof. In-Cheol Park's Lab)

Abstract

This paper proposes a hardware-efficient multiple-resolution decoding architecture for quasi-cyclic low-density parity-check (QC-LDPC) codes, specifically designed to meet the stringent requirements of the 5G New Radio (NR) standard. The proposed architecture adopts a single-instruction multiple-data (SIMD) approach to dynamically adjust the bitwidth of log-likelihood ratio (LLR) values based on the Eb/N0 condition, significantly reducing hardware complexity and improving the throughput-to-area ratio. Unlike conventional single-resolution decoders, the architecture processes 2-bit LLR values in high Eb/N0 regions and scales up to 4-bit or 8-bit LLR values for moderate and low Eb/N0 conditions, maintaining robust error-correcting performance. Key innovations include SIMD-based designs for the variable-node units (VNUs), check-node units (CNUs), and quasi-cyclic shifting networks (QSNs), as well as optimized memory access scheduling to support all 51 lifting sizes defined in the 5G NR standard. Designed in a 65-nm CMOS process, the decoder achieves a peak throughput of 27.24 Gbps under error-free conditions with a throughput-to-area ratio improvement of 2.07× compared to state-of-the-art designs. Furthermore, the proposed architecture demonstrates superior flexibility, supporting all code rates and lifting sizes specified in the 5G NR standard. Simulation results confirm that the proposed decoder meets the peak throughput requirement under error-free conditions while maintaining robust performance in challenging channel environments.
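The resolution-switching idea can be pictured with a short sketch: LLRs are quantized to 2, 4, or 8 bits depending on the channel condition, and several narrow LLRs are packed into one word so that a single datapath processes multiple values at once, SIMD-style. This is an illustrative model only; the function names and the uniform clipping quantizer are our assumptions, not details from the paper.

```python
def quantize_llr(llr, bits):
    # Symmetric uniform quantizer (an assumed scheme, for
    # illustration): clip the LLR to the representable range
    # of a signed `bits`-bit value.
    qmax = (1 << (bits - 1)) - 1
    return max(-qmax, min(qmax, round(llr)))

def pack_lanes(values, bits):
    # Pack several narrow two's-complement LLRs into one word,
    # mimicking a SIMD datapath that reuses one wide lane as
    # several 2-bit (or 4-bit) sub-lanes.
    mask = (1 << bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * bits)
    return word

# High Eb/N0: coarse 2-bit LLRs, four of them per 8-bit word.
llrs = [quantize_llr(x, 2) for x in (1.7, -0.4, 0.9, -2.3)]
word = pack_lanes(llrs, 2)
```

Switching to 4-bit or 8-bit resolution in low-SNR regions then simply changes `bits`, trading parallelism for precision on the same hardware lanes.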

 

Main Figure


A Reconfigurable Spiking Neural Network Computing-in-memory Processor using 1T1C eDRAM for Enhanced System-level Efficiency (Prof. Hoi-Jun Yoo's Lab)

Abstract

Spiking Neural Network (SNN) Computing-In-Memory (CIM) achieves high macro-level energy efficiency but struggles with system-level efficiency due to excessive external memory access (EMA) caused by intermediate activation memory demands. To address this, a high-capacity SNN-CIM capable of managing large weight loads is essential. This paper introduces a high-density 1T1C eDRAM-based SNN-CIM processor that significantly enhances system-level energy efficiency through two key features: (1) a high-density, low-power Reconfigurable Neuro-Cell Array (ReNCA) that reuses the 1T1C cell array and employs a charge pump, achieving a 41% area reduction and a 90% power reduction; and (2) a reconfigurable CIM architecture with a dual-mode ReNCA and a Dynamic Adjustable Neuron Link (DAN Link) that optimizes EMA for both activations and weights. These innovations collectively improve system-level energy efficiency by 10×, setting a new benchmark for performance.
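For context, an SNN neuron integrates weighted input current into a membrane potential and emits a one-bit spike when a threshold is crossed; it is this intermediate state (potentials and spikes) that inflates the activation memory traffic targeted above. A minimal discrete-time leaky integrate-and-fire (LIF) sketch, with leak and threshold values chosen purely for illustration:

```python
def lif_step(v, i_in, leak=0.9, v_th=1.0):
    # One LIF update: the membrane potential decays by `leak`,
    # accumulates the input current, and emits a binary spike
    # (with reset to 0) once it crosses the threshold.
    v = leak * v + i_in
    if v >= v_th:
        return 0.0, 1  # reset potential, spike out
    return v, 0

# Two timesteps: sub-threshold first, then a spike.
v, s0 = lif_step(0.0, 0.5)   # v = 0.5, no spike
v, s1 = lif_step(v, 0.6)     # 0.45 + 0.6 >= 1.0 -> spike
```

Because the outputs are single bits while the potentials persist across timesteps, a high-capacity on-chip array that keeps this state local (as ReNCA does) directly cuts EMA.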

 

Main Figure


HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models (Prof. Joo-Young Kim's Lab)

Abstract

The rapid increase in demand for long-context language models has revealed fundamental performance limitations in conventional Transformer architectures, particularly their quadratic computational complexity. Hybrid Transformer-Mamba models, which interleave attention layers with efficient state-space model layers such as Mamba-2, have emerged as promising solutions combining the strengths of both Transformer and Mamba. However, maintaining high compute utilization and performance across workloads (e.g., varying sequence lengths and batch sizes) in Hybrid models is challenging due to their heterogeneous compute patterns and the shifting performance bottleneck between the two key computational kernels: FlashAttention-2 (FA-2) and State-Space Duality (SSD).

In this paper, we introduce HLX, a unified pipelined architecture designed to deliver optimized performance across workloads for Hybrid models. Through detailed kernel-level analysis, we identify two key blockers that limit compute utilization: inter-operation dependencies in FA-2 and excessive memory traffic in SSD. To overcome these hurdles, we propose two novel fine-grained pipelined dataflows, PipeFlash and PipeSSD. PipeFlash effectively hides operational dependencies in attention computations, while PipeSSD is the first to introduce fused pipelined execution for SSD computations, substantially enhancing data reuse and reducing memory traffic. In addition, we propose a unified hardware architecture that processes both PipeFlash and PipeSSD in an efficient pipelining scheme to maximize compute utilization. Across sequence lengths from 1K to 128K, the proposed HLX architecture achieves up to 97.5% and 78.4% compute utilization for FA-2 and SSD, respectively, resulting in average speedups of 1.75× and 2.91× over an A100, and average speedups of 2.78× (FA-2), 1.84× (FA-3), and 4.95× (SSD) over an H100. For end-to-end latency and batching, HLX achieves 1.56× and 1.38× speedups over the A100, and 2.08× and 1.76× (1.84× and 1.72×) speedups over the H100 running FA-2 (FA-3). It also reduces area and power consumption by up to 89.8% and 63.8%, respectively, compared to the GPU baselines.
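The FA-2 dependency targeted above can be seen in the online-softmax recurrence: each step rescales the running max, normalizer, and accumulator produced by the previous step, serializing the work that PipeFlash overlaps. A scalar, single-query sketch (real kernels operate on tiles of the score matrix; the variable names are ours):

```python
import math

def online_softmax_attention(scores, values):
    # Streaming softmax-weighted sum: the running max m,
    # normalizer l, and accumulator acc carried across iterations
    # form the inter-operation dependency chain described above.
    m, l, acc = float("-inf"), 0.0, 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)   # rescale old partials (0.0 on first step)
        p = math.exp(s - m_new)
        l = l * scale + p
        acc = acc * scale + p * v
        m = m_new
    return acc / l
```

Each iteration reads the (m, l, acc) triple written by the previous one, so without pipelining the multiply and exponential units stall on this chain.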

 

Main Figure

〈Comparison of the Transformer, Mamba, and Hybrid models, and the latency breakdown of the Hybrid-2.7B model on an A100 GPU as a function of sequence length.〉

Distinguishing Pathologic Gait in Older Adults Using Instrumented Insoles and Deep Neural Networks (Prof. Young-Ju Lee's Lab)

Abstract

Gait abnormalities are common in the older population owing to aging- and disease-related changes in physical and neurological functions. Differentiating the causes of gait abnormalities is challenging because various abnormal gaits share similar patterns in older patients.

 

Main Figure


A 44.2-TOPS/W CNN Processor With Variation-Tolerant Analog Datapath and Variation Compensating Circuit (Prof. SeongHwan Cho's Lab)

Abstract

Convolutional neural network (CNN) processors that exploit analog computing for high energy efficiency suffer from two major issues. First, frequent data conversions between the layers limit energy efficiency. Second, computing errors arise in the analog circuits because they are vulnerable to process, voltage, and temperature (PVT) variations. In this article, a CNN processor featuring a variation-tolerant analog datapath with analog memory (AMEM) is proposed, so that data conversion between layers is not needed. To minimize the computing error, both the AMEM and the ANU are designed such that their performance is not affected by PVT variations. In addition, a variation compensating circuit is proposed. A prototype implemented in a 28-nm complementary metal-oxide-semiconductor (CMOS) process achieves an energy efficiency of 437.9 TOPS/W in the analog datapath and 44.2 TOPS/W for the total system, and maintains its classification accuracy to within 0.5%p across variations of ±10% in supply voltage and −20 °C to 85 °C in temperature.
 
Main Figure

“SP-PIM: A Super-Pipelined Processing-In-Memory Accelerator With Local Error Prediction for Area/Energy-Efficient On-Device Learning,” IEEE Journal of Solid-State Circuits (2024) (Prof. Joo-Young Kim's Lab)

Heo, Jaehoon, et al. “SP-PIM: A Super-Pipelined Processing-In-Memory Accelerator With Local Error Prediction for Area/Energy-Efficient On-Device Learning.” IEEE Journal of Solid-State Circuits (2024).

Abstract: On-device learning (ODL) is crucial for edge devices as it restores machine learning (ML) model accuracy in changing environments. However, implementing ODL on battery-limited devices faces challenges due to large intermediate data generation and frequent processor-memory data movement, causing significant power consumption. To address this, some edge ML accelerators use processing-in-memory (PIM), but they still suffer from high latency, power overheads, and incomplete handling of data sparsity during training. This paper presents SP-PIM, a high-throughput super-pipelined PIM accelerator that overcomes these limitations. SP-PIM implements multi-level pipelining based on local error prediction (EP), increasing training speed by 7.31× and reducing external memory access by 59.09%. It exploits activation and error sparsity with an optimized PIM macro. Fabricated using 28-nm CMOS technology, SP-PIM achieves a training speed of 8.81 epochs/s, showing state-of-the-art area (560.6 GFLOPS/mm²) and power efficiency (22.4 TFLOPS/W). A cycle-level simulator further demonstrates SP-PIM’s scalability and efficiency.
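The activation and error sparsity exploited during training amounts to skipping multiply-accumulate work whenever an operand is zero. A toy functional model (not SP-PIM's actual macro; names are ours) that counts the MACs saved:

```python
def sparse_mac(operands, weights):
    # Accumulate only non-zero operand pairs; in a PIM macro the
    # skipped pairs translate into saved cycles and energy.
    total, skipped = 0.0, 0
    for a, w in zip(operands, weights):
        if a == 0:
            skipped += 1   # zero activation/error: no MAC issued
        else:
            total += a * w
    return total, skipped

# ReLU activations and predicted errors are often mostly zero,
# so the skip counter dominates in practice.
out, saved = sparse_mac([0, 2, 0, 3], [5, 1, 7, 2])
```

With locally predicted errors, the backward pass sees the same kind of zero-rich operands as the forward pass, which is what lets one optimized macro serve both.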

Main Figure


 

“PRIMO: A Full-Stack Processing-in-DRAM Emulation Framework for Machine Learning Workloads,” 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (Prof. Joo-Young Kim's Lab)

Heo, Jaehoon, et al. “PRIMO: A Full-Stack Processing-in-DRAM Emulation Framework for Machine Learning Workloads.” 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023.

Abstract: The increasing size of deep learning models has made excessive memory access between AI processors and DRAM a major system bottleneck. Processing-in-DRAM (DRAM-PIM) offers a solution by integrating compute logic within memory, reducing external memory access. Existing simulators are often too slow for full applications, and FPGA-based emulators have been introduced, but none include the full software stack. This paper introduces PRIMO, the first full-stack DRAM-PIM emulation framework for end-to-end ML inference. PRIMO allows software developers to test ML workloads without real DRAM-PIM chips and helps designers explore design space. Our real-time FPGA emulator delivers results significantly faster than CPU-based simulations. We also provide a PIM compiler and driver to support various ML workloads with high bandwidth utilization. PRIMO achieves 106.64-6093.56× faster emulation for ML tasks compared to CPU simulations.

Main Figure


A 28nm 4.96 TOPS/W End-to-End Diffusion Accelerator with Reconfigurable Hyper-Precision Unified Non-Matrix Processing Engine (Prof. Joo-Young Kim's Lab)

Title: A 28nm 4.96 TOPS/W End-to-End Diffusion Accelerator with Reconfigurable Hyper-Precision Unified Non-Matrix Processing Engine

Venue: ESSERC 2024

 

Abstract: This paper presents Picasso, an end-to-end diffusion accelerator. Picasso proposes a novel hyper-precision data type and reconfigurable architecture that can maximize hardware efficiency with extended dynamic range, with no compromise in accuracy. Picasso also proposes a unified engine operating all non-matrix operations in a streamlined processing flow and minimizes the end-to-end latency by sub-block pipeline scheduling. The accelerator is fabricated in 28nm CMOS technology and achieves an energy efficiency of 4.96 TOPS/W and a peak performance of 9.83 TOPS. Compared with prior works, Picasso achieves speedups of 8.4×-26.8× while improving energy and area efficiency by 1.1×-2.8× and 3.6×-30.5×, respectively.

 

Main Figure

“Morphling: A Throughput-Maximized TFHE-based Accelerator using Transform-domain Reuse,” IEEE HPCA 2024 (Prof. Joo-Young Kim's Lab)

Prasetiyo, Adiwena Putra, and Joo-Young Kim, “Morphling: A Throughput-Maximized TFHE-based Accelerator using Transform-domain Reuse,” IEEE HPCA 2024.

Abstract:

Fully Homomorphic Encryption (FHE) has become increasingly important in modern computing, particularly for preserving privacy in cloud computing by enabling computation directly on encrypted data. Despite its potential, FHE poses major computational challenges, including huge computational and memory requirements. The bootstrapping operation, which is essential particularly in the Torus-FHE (TFHE) scheme, involves intensive computation characterized by an enormous number of polynomial multiplications. For instance, performing a single bootstrapping at the 128-bit security level requires more than 10,000 polynomial multiplications. Our in-depth analysis reveals that domain-transform operations, i.e., the Fast Fourier Transform (FFT), contribute up to 88% of these operations, making them the bottleneck of the TFHE system. To address these challenges, we propose Morphling, an accelerator architecture that combines a 2D systolic array with the strategic use of transform-domain reuse to reduce the domain-transform overhead in TFHE. This approach reduces the number of required domain-transform operations by up to 83.3%, allowing more computational cores in a given die area. In addition, we optimize the microarchitecture for end-to-end TFHE operation, with a merge-split pipelined FFT for efficient domain transforms, a double-pointer method for high-throughput polynomial rotation, and a specialized buffer design. Furthermore, we introduce custom instructions for tiling, batching, and scheduling of multiple ciphertext operations, facilitating software-hardware co-optimization and effectively mapping high-level applications such as an XGBoost classifier, neural networks, and VGG-9. As a result, Morphling, with four 2D systolic arrays and four vector units with domain-transform reuse, occupies a 74.79 mm² die area and consumes 53.00 W in a 28-nm process. It achieves a throughput of up to 147,615 bootstrappings per second, demonstrating improvements of 3440× over the CPU, 143× over the GPU, and 14.7× over the state-of-the-art TFHE accelerator. It can run various deep learning models with sub-second latency.
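Transform-domain reuse can be illustrated with polynomial multiplication via the FFT: when one operand (e.g., a key polynomial) multiplies many ciphertext polynomials, its forward transform is computed once and reused, cutting the number of domain transforms. A small sketch using cyclic convolution mod x^n − 1 (TFHE itself uses negacyclic multiplication over the torus; this simplification and all names are ours):

```python
import cmath

def fft(a, invert=False):
    # Radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def cyclic_products_with_reuse(a, bs):
    # Multiply one polynomial a by many polynomials bs mod x^n - 1,
    # computing fft(a) only once: for N products this needs N + 1
    # forward transforms instead of 2N.
    n = len(a)
    fa = fft(a)                       # transformed once, reused below
    out = []
    for b in bs:
        prod = [x * y for x, y in zip(fa, fft(b))]
        out.append([round((z / n).real) for z in fft(prod, invert=True)])
    return out
```

Sharing one transformed operand across N products saves nearly half of the forward transforms; Morphling generalizes this kind of saving across the bootstrapping dataflow in hardware.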


“LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference,” IEEE Micro, Sep. 2024 (Prof. Joo-Young Kim's Lab)

Seungjae Moon, Jung-Hoon Kim, Junsoo Kim, Seongmin Hong, Junseo Cha, Minsu Kim, Sukbin Lim, Gyubin Choi, Dongjin Seo, Jongho Kim, Hunjong Lee, Hyunjun Park, Ryeowook Ko, Soongyu Choi, Jongse Park, Jinwon Lee, and Joo-Young Kim, “LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference,” IEEE Micro, Sep. 2024

Abstract: The explosive arrival of OpenAI's ChatGPT has fueled the globalization of large language models (LLMs), which consist of billions of pretrained parameters that embody aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. The LPU perfectly balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency, and is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements the LPU as an intuitive software framework for running LLM applications. The LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B models, respectively, which is 2.09× and 1.37× faster than the GPU. Synthesized using a Samsung 4-nm process, the LPU has a total area of 0.824 mm² and a power consumption of 284.31 mW. LPU-based servers achieve 1.33× and 1.32× better energy efficiency than NVIDIA H100 and L4 servers, respectively.

Main Figure