AI in EE

AI IN DIVISIONS

AI in Circuit Division

HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models (Prof. Kim, Joo-Young's Lab)

Abstract

The rapid increase in demand for long-context language models has revealed fundamental performance limitations in conventional Transformer architectures, particularly their quadratic computational complexity. Hybrid Transformer-Mamba models, which interleave attention layers with efficient state-space model layers such as Mamba-2, have emerged as promising solutions that combine the strengths of both Transformer and Mamba. However, maintaining high compute utilization and performance across workloads (e.g., varying sequence lengths and batch sizes) in Hybrid models is challenging due to their heterogeneous compute patterns and shifting performance bottlenecks between the two key computational kernels: FlashAttention-2 (FA-2) and State-Space Duality (SSD).
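To make the scaling contrast concrete, the following minimal sketch (not from the paper; the FLOP constants and dimensions are simplified illustrative assumptions) compares how per-layer compute grows with sequence length for an attention layer (quadratic in length) versus a state-space layer in the style of Mamba-2's SSD (linear in length):

```python
# Illustrative sketch: quadratic vs. linear compute scaling.
# Constants are rough assumptions, not figures from the HLX paper.

def attention_flops(seq_len: int, d_model: int = 2560) -> int:
    # QK^T and PV matmuls dominate: roughly 2 * (2 * L^2 * d) FLOPs.
    return 4 * seq_len * seq_len * d_model

def ssd_flops(seq_len: int, d_model: int = 2560, d_state: int = 128) -> int:
    # The state-space recurrence/scan grows linearly: roughly O(L * d * N).
    return 2 * seq_len * d_model * d_state

for L in (1024, 8192, 131072):
    ratio = attention_flops(L) / ssd_flops(L)
    print(f"L={L:>6}: attention/SSD FLOP ratio ~ {ratio:.1f}")
```

Under these toy constants the ratio grows linearly with sequence length, which is why the dominant kernel (and hence the performance bottleneck) shifts between attention and SSD as the workload changes.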

In this paper, we introduce HLX, a unified pipelined architecture designed to deliver optimized performance across workloads for Hybrid models. Through detailed kernel-level analysis, we identify two key blockers that limit compute utilization: inter-operation dependencies in FA-2 and excessive memory traffic in SSD. To overcome these hurdles, we propose two novel fine-grained pipelined dataflows, PipeFlash and PipeSSD. PipeFlash effectively hides operational dependencies in attention computations, while PipeSSD is the first to introduce fused pipelined execution for SSD computations, substantially enhancing data reuse and reducing memory traffic. In addition, we propose a unified hardware architecture that can process both PipeFlash and PipeSSD in an efficient pipelining scheme to maximize compute utilization. Finally, across sequence lengths from 1K to 128K, the proposed HLX architecture
achieves up to 97.5% and 78.4% compute utilization for FA-2 and SSD, respectively, resulting in average speedups of 1.75× and 2.91× over A100, and average speedups of 2.78× (FA-2), 1.84× (FA-3), and 4.95× (SSD) over H100. For end-to-end latency and batched inference, HLX achieves 1.56× and 1.38× speedups over A100, and 2.08× and 1.76× speedups over H100 running FA-2 (1.84× and 1.72× when running FA-3). It also significantly reduces area and power consumption, by up to 89.8% and 63.8%, compared to the GPU baselines.

 

Main Figure

〈Comparison of Transformer, Mamba, and Hybrid models, and the latency breakdown of the Hybrid-2.7B model on an A100 GPU across sequence lengths.〉