Abstract
Spiking Neural Network (SNN) Computing-In-Memory (CIM) achieves high macro-level energy efficiency but struggles at the system level because intermediate activations demand excessive external memory access (EMA). Addressing this requires a high-capacity SNN-CIM that can manage large weight loads on chip. This paper introduces a high-density 1T1C eDRAM-based SNN-CIM processor that significantly enhances system-level energy efficiency through two key features: (1) a high-density, low-power Reconfigurable Neuro-Cell Array (ReNCA) that reuses the 1T1C cell array and employs a charge pump, achieving a 41% area and 90% power reduction; and (2) a reconfigurable CIM architecture with a dual-mode ReNCA and a Dynamic Adjustable Neuron Link (DAN Link) that reduces EMA for both activations and weights. Together, these features improve system-level energy efficiency by 10×, setting a new performance benchmark.
Main Figure
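Since the abstract above centers on spiking-neuron hardware, a minimal software sketch of the leaky integrate-and-fire (LIF) dynamics that a neuro-cell array typically implements may help. This is a generic textbook model with hypothetical parameter names (`v_th`, `leak`), not the paper's actual circuit behavior.

```python
def lif_neuron(inputs, v_th=1.0, leak=0.9):
    """Leaky integrate-and-fire neuron: leaks and accumulates input into a
    membrane potential, emits a binary spike when the potential crosses the
    threshold, then subtract-resets. Parameters are illustrative only."""
    v = 0.0
    spikes = []
    for x in inputs:
        v = leak * v + x          # leaky integration of the input current
        if v >= v_th:
            spikes.append(1)      # fire
            v -= v_th             # subtract-reset after firing
        else:
            spikes.append(0)
    return spikes

# A constant sub-threshold input still fires periodically once charge builds up.
print(lif_neuron([0.4] * 8))      # → [0, 0, 1, 0, 0, 1, 0, 0]
```

The binary spike train is what makes SNN-CIM attractive: multiply-accumulate against a 0/1 activation degenerates to conditional accumulation, but the intermediate membrane potentials and spike maps are exactly the activation state whose memory traffic the paper targets.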

The rapid increase in demand for long-context language models has revealed fundamental performance limitations in conventional Transformer architectures, particularly their quadratic computational complexity. Hybrid Transformer-Mamba models, which interleave attention layers with efficient state-space model layers such as Mamba-2, have emerged as promising solutions combining the strengths of both Transformer and Mamba. However, maintaining high compute utilization and performance across workloads (e.g., varying sequence length and batch size) in these hybrid models is challenging due to their heterogeneous compute patterns and shifting performance bottlenecks between the two key computational kernels: FlashAttention-2 (FA-2) and State-Space Duality (SSD).
In this paper, we introduce HLX, a unified pipelined architecture designed to deliver optimized performance across workloads for hybrid models. Through detailed kernel-level analysis, we identify two key blockers that limit compute utilization: inter-operation dependencies in FA-2 and excessive memory traffic in SSD. To overcome these hurdles, we propose two novel fine-grained pipelined dataflows, PipeFlash and PipeSSD. PipeFlash effectively hides operational dependencies in attention computations, while PipeSSD is the first to introduce fused pipelined execution for SSD computations, substantially enhancing data reuse and reducing memory traffic. In addition, we propose a unified hardware architecture that processes both PipeFlash and PipeSSD in an efficient pipelining scheme to maximize compute utilization. Across sequence lengths from 1K to 128K, the proposed HLX architecture achieves up to 97.5% and 78.4% compute utilization for FA-2 and SSD, respectively, resulting in average speedups of 1.75× (FA-2) and 2.91× (SSD) over the A100, and average speedups of 2.78× (FA-2), 1.84× (FA-3), and 4.95× (SSD) over the H100. For end-to-end latency and batched inference, HLX achieves 1.56× and 1.38× speedups over the A100, and 2.08× and 1.76× speedups over the H100 running FA-2 (1.84× and 1.72× over the H100 running FA-3). It also significantly reduces area and power consumption by up to 89.8% and 63.8%, respectively, compared to the GPU baselines.
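To make the FA-2 "inter-operation dependency" concrete, here is a minimal sketch of the online-softmax recurrence that FlashAttention-2 computes per key/value block: the running max and sum force a rescaling of the accumulator at every block, so each block's update depends serially on the previous one. This is the dependency chain PipeFlash aims to hide; the code is a generic single-head, unscaled sketch, not HLX's actual dataflow.

```python
import numpy as np

def flash_attention_blocks(q, K_blocks, V_blocks):
    """Online-softmax attention over key/value blocks (FlashAttention-style).
    Each block's update rescales the running accumulator by exp(m - m_new),
    which serializes the blocks -- the inter-operation dependency in FA-2."""
    m = -np.inf                        # running row maximum
    l = 0.0                            # running softmax denominator
    acc = np.zeros_like(V_blocks[0][0])
    for K, V in zip(K_blocks, V_blocks):
        s = K @ q                      # scores for this key block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale everything computed so far
        p = np.exp(s - m_new)          # unnormalized probabilities
        l = l * scale + p.sum()
        acc = acc * scale + p @ V
        m = m_new
    return acc / l

# Blocked result matches a monolithic softmax-attention reference.
rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((8, 4))
V = rng.standard_normal((8, 4))
blocked = flash_attention_blocks(q, [K[:4], K[4:]], [V[:4], V[4:]])
s = K @ q
p = np.exp(s - s.max())
reference = (p / p.sum()) @ V
```

The rescaling step (`acc * scale`) is why consecutive blocks cannot simply be issued back to back at full utilization: the exponent, sum, and accumulator updates form a loop-carried dependency that a fine-grained pipeline must overlap with the next block's score computation.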

Abstract
Gait abnormalities are common in the older population owing to aging- and disease-related changes in physical and neurological functions. Differentiating the causes of gait abnormalities is challenging because various abnormal gaits share similar patterns in older patients.
Main Figure


Heo, Jaehoon, et al. “SP-PIM: A Super-Pipelined Processing-In-Memory Accelerator With Local Error Prediction for Area/Energy-Efficient On-Device Learning.” IEEE Journal of Solid-State Circuits (2024).
Abstract: On-device learning (ODL) is crucial for edge devices as it restores machine learning (ML) model accuracy in changing environments. However, implementing ODL on battery-limited devices faces challenges due to large intermediate data generation and frequent processor-memory data movement, causing significant power consumption. To address this, some edge ML accelerators use processing-in-memory (PIM), but they still suffer from high latency, power overheads, and incomplete handling of data sparsity during training. This paper presents SP-PIM, a high-throughput super-pipelined PIM accelerator that overcomes these limitations. SP-PIM implements multi-level pipelining based on local error prediction (EP), increasing training speed by 7.31× and reducing external memory access by 59.09%. It exploits activation and error sparsity with an optimized PIM macro. Fabricated using 28-nm CMOS technology, SP-PIM achieves a training speed of 8.81 epochs/s, showing state-of-the-art area (560.6 GFLOPS/mm²) and power efficiency (22.4 TFLOPS/W). A cycle-level simulator further demonstrates SP-PIM’s scalability and efficiency.
Main Figure

Heo, Jaehoon, et al. “PRIMO: A Full-Stack Processing-in-DRAM Emulation Framework for Machine Learning Workloads.” 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023.
Abstract: The increasing size of deep learning models has made excessive memory access between AI processors and DRAM a major system bottleneck. Processing-in-DRAM (DRAM-PIM) offers a solution by integrating compute logic within memory, reducing external memory access. Existing simulators are often too slow for full applications, and FPGA-based emulators have been introduced, but none include the full software stack. This paper introduces PRIMO, the first full-stack DRAM-PIM emulation framework for end-to-end ML inference. PRIMO allows software developers to test ML workloads without real DRAM-PIM chips and helps designers explore the design space. Our real-time FPGA emulator delivers results significantly faster than CPU-based simulations. We also provide a PIM compiler and driver to support various ML workloads with high bandwidth utilization. PRIMO achieves 106.64–6093.56× faster emulation for ML tasks compared to CPU-based simulations.
Main Figure

Title: A 28nm 4.96 TOPS/W End-to-End Diffusion Accelerator with Reconfigurable Hyper-Precision Unified Non-Matrix Processing Engine
Venue: ESSERC 2024
Abstract: This paper presents Picasso, an end-to-end diffusion accelerator. Picasso proposes a novel hyper-precision data type and reconfigurable architecture that can maximize hardware efficiency with extended dynamic range, with no compromise in accuracy. Picasso also proposes a unified engine operating all non-matrix operations in a streamlined processing flow and minimizes the end-to-end latency by sub-block pipeline scheduling. The accelerator is fabricated in 28nm CMOS technology and achieves an energy efficiency of 4.96 TOPS/W and a peak performance of 9.83 TOPS. Compared with prior works, Picasso achieves speedups of 8.4×-26.8× while improving energy and area efficiency by 1.1×-2.8× and 3.6×-30.5×, respectively.
Main Figure
Prasetiyo, Adiwena Putra, and Joo-Young Kim, “Morphling: A Throughput-Maximized TFHE-based Accelerator using Transform-domain Reuse,” IEEE HPCA 2024.
Abstract:
Fully Homomorphic Encryption (FHE) has become increasingly important in modern computing, particularly for preserving privacy in cloud computing by enabling computation directly on encrypted data. Despite its potential, FHE poses major computational challenges, including huge compute and memory requirements. The bootstrapping operation, which is essential in the Torus-FHE (TFHE) scheme, involves intensive computation characterized by an enormous number of polynomial multiplications. For instance, performing a single bootstrapping at the 128-bit security level requires more than 10,000 polynomial multiplications. Our in-depth analysis reveals that domain-transform operations, i.e., the Fast Fourier Transform (FFT), contribute up to 88% of these operations and are the bottleneck of the TFHE system. To address these challenges, we propose Morphling, an accelerator architecture that combines a 2D systolic array with the strategic use of transform-domain reuse to reduce the overhead of domain transforms in TFHE. This novel approach reduces the number of required domain-transform operations by up to 83.3%, allowing more computational cores in a given die area. In addition, we optimize the microarchitecture for end-to-end TFHE operation, with a merge-split pipelined FFT for efficient domain transforms, a double-pointer method for high-throughput polynomial rotation, and a specialized buffer design. Furthermore, we introduce custom instructions for tiling, batching, and scheduling of multiple ciphertext operations, facilitating software-hardware co-optimization and effectively mapping high-level applications such as an XGBoost classifier, neural networks, and VGG-9. As a result, Morphling, with four 2D systolic arrays and four vector units with domain-transform reuse, takes 74.79 mm² of die area and consumes 53.00 W in a 28nm process. It achieves a throughput of up to 147,615 bootstrappings per second, demonstrating improvements of 3440× over the CPU, 143× over the GPU, and 14.7× over the state-of-the-art TFHE accelerator, and it can run various deep learning models with sub-second latency.
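The transform-domain reuse idea can be illustrated in a few lines: when one polynomial is multiplied against many others (as an accumulator is against bootstrapping-key polynomials), its forward transform can be computed once and shared, and the keys can be kept pre-transformed. The sketch below uses ordinary cyclic convolution over the reals rather than TFHE's negacyclic arithmetic, and the function names are hypothetical; it shows only the reuse principle, not Morphling's design.

```python
import numpy as np

def poly_mul_fft(a, b):
    """Cyclic polynomial product via FFT: transform both operands, multiply
    pointwise, transform back -- three transforms per product."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def poly_mul_many_reused(a, bs_transformed):
    """Transform-domain reuse: FFT(a) is computed once and shared across all
    products, and the b operands (like TFHE keys) are kept pre-transformed,
    so each product costs one pointwise multiply and one inverse FFT."""
    A = np.fft.fft(a)                  # single forward transform, reused below
    return [np.fft.ifft(A * B).real for B in bs_transformed]

rng = np.random.default_rng(1)
a = rng.standard_normal(8)
bs = [rng.standard_normal(8) for _ in range(4)]

naive = [poly_mul_fft(a, b) for b in bs]                        # 12 transforms
reused = poly_mul_many_reused(a, [np.fft.fft(b) for b in bs])   # 5 transforms
```

With k products against pre-transformed operands, the transform count drops from 3k to k + 1, which is the kind of reduction that lets an accelerator spend its die area on more compute cores instead of FFT hardware.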

Seungjae Moon, Jung-Hoon Kim, Junsoo Kim, Seongmin Hong, Junseo Cha, Minsu Kim, Sukbin Lim, Gyubin Choi, Dongjin Seo, Jongho Kim, Hunjong Lee, Hyunjun Park, Ryeowook Ko, Soongyu Choi, Jongse Park, Jinwon Lee, and Joo-Young Kim, “LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference,” IEEE Micro, Sep. 2024
Abstract: The explosive arrival of OpenAI’s ChatGPT has fueled the globalization of large language models (LLMs), which consist of billions of pretrained parameters that embody aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. LPU perfectly balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency. LPU is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements LPU as an intuitive software framework to run LLM applications. LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B models, respectively, which is 2.09× and 1.37× faster than the GPU. LPU, synthesized using a Samsung 4nm process, has a total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33× and 1.32× better energy efficiency than NVIDIA H100 and L4 servers, respectively.
Main Figure