Research

Research Highlights

[Professor Myoungsoo Jung’s Research Team Pioneers the ‘CXL-GPU’ Market: KAIST Develops a High-Capacity, High-Performance GPU]
 
< Professor Myoungsoo Jung’s Research Team >
 

Recently, big tech companies at the forefront of large-scale AI services have been racing to grow their models and training data to deliver better performance to users. The latest large language models require tens of terabytes (TB, 10^12 bytes) of memory for training. A domestic research team has now developed a high-capacity, high-performance AI accelerator built on a next-generation interface technology, positioning it to compete with NVIDIA, which currently dominates the AI accelerator market.

 

Professor Myoungsoo Jung’s research team announced on the 8th that it has developed a technology that optimizes the memory read/write performance of high-capacity GPU devices using the next-generation interface technology Compute Express Link (CXL).

 

The internal memory capacity of the latest GPUs is only a few tens of gigabytes (GB, 10^9 bytes), so a single GPU cannot train or run inference for today’s large models. To supply the memory capacity that large-scale AI models require, the industry generally connects many GPUs together, but the high price of the latest GPUs makes this approach drive the total cost of ownership (TCO) up sharply.

 


< Representative Image of the CXL-GPU >

 

For this reason, the ‘CXL-GPU’ architecture, which uses the next-generation interconnect technology CXL to attach large-capacity memory directly to GPU devices, is being actively examined across the industry. High capacity alone, however, is not enough for practical AI services: because large-scale AI services demand fast inference and training, read/write performance to the memory expansion device attached to the GPU must be comparable to that of the GPU’s existing local memory.

 

*CXL-GPU: It supports high capacity by integrating the memory space of memory expansion devices connected via CXL into the GPU memory space. The CXL controller automatically handles operations needed for managing the integrated memory space, allowing the GPU to access the expanded memory space in the same manner as accessing its local memory. Unlike the traditional method of purchasing additional expensive GPUs to increase memory capacity, CXL-GPU can selectively add memory resources to the GPU, significantly reducing system construction costs.
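Conceptually, the integrated memory space described above is a single address range in which each access is transparently routed either to the GPU’s local memory or to a CXL memory expander. The real routing is done by the CXL controller in hardware; the minimal Python sketch below (all names and capacities are hypothetical, chosen only for illustration) shows just the addressing idea:

```python
# Hypothetical sketch of a unified GPU address space backed by local
# memory plus a CXL memory expander. Real CXL-GPU routing happens in
# the CXL controller hardware; this only models the addressing scheme.

LOCAL_CAPACITY = 40 * 2**30      # e.g. 40 GB of GPU-local memory (illustrative)

class UnifiedMemory:
    def __init__(self, local_capacity, expander_capacity):
        self.local = {}          # stands in for GPU-local memory
        self.expander = {}       # stands in for the CXL memory expander
        self.local_capacity = local_capacity
        self.total = local_capacity + expander_capacity

    def write(self, addr, value):
        if addr >= self.total:
            raise ValueError("address outside unified space")
        # Addresses past local capacity land on the expander, but the
        # GPU issues the same load/store either way.
        if addr < self.local_capacity:
            self.local[addr] = value
        else:
            self.expander[addr - self.local_capacity] = value

    def read(self, addr):
        if addr < self.local_capacity:
            return self.local.get(addr)
        return self.expander.get(addr - self.local_capacity)

mem = UnifiedMemory(LOCAL_CAPACITY, 2 * 2**40)   # +2 TB via CXL (illustrative)
mem.write(LOCAL_CAPACITY + 123, "tensor shard")  # transparently hits the expander
print(mem.read(LOCAL_CAPACITY + 123))
```

The point of the sketch is that the caller never chooses a device: capacity can grow by enlarging the expander without changing how the GPU addresses memory.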

 

The research team developed technology that addresses the causes of degraded memory read/write performance in CXL-GPU devices. By letting the memory expansion device determine its own write timing, the GPU can issue writes to the expansion device and to its local memory simultaneously; because the GPU no longer has to wait for those writes to complete, the write performance degradation is resolved.
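Letting the expander decide its own write timing resembles a posted (fire-and-forget) write: the GPU enqueues the write and continues, rather than blocking until the device acknowledges completion. The toy Python sketch below (latency value and names are illustrative, not from the paper) contrasts the issue cost with the device-side completion:

```python
# Toy model of posted writes to a memory expander: the "GPU" enqueues
# writes and moves on, while the expander drains them at its own pace.
import queue
import threading
import time

EXPANDER_WRITE_LATENCY = 0.01   # illustrative device-side cost per write

backing_store = {}
pending = queue.Queue()

def expander_worker():
    # The expander "decides its write timing": it consumes queued
    # writes in the background, independent of the issuing GPU.
    while True:
        item = pending.get()
        if item is None:
            break
        addr, value = item
        time.sleep(EXPANDER_WRITE_LATENCY)
        backing_store[addr] = value
        pending.task_done()

def posted_write(addr, value):
    # Returns immediately; the GPU can keep writing to local memory
    # in parallel instead of stalling on the expander.
    pending.put((addr, value))

threading.Thread(target=expander_worker, daemon=True).start()

start = time.perf_counter()
for i in range(10):
    posted_write(i, i * i)
issue_time = time.perf_counter() - start   # far below 10 device latencies

pending.join()                             # every write eventually lands
print(f"issued 10 writes in {issue_time * 1000:.2f} ms")
```

Issuing the ten writes costs only the enqueue time, while the roughly 100 ms of device work overlaps with whatever the caller does next, which is the essence of hiding write latency.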

 


< Proposed CXL-GPU Architecture > 

 

Furthermore, the research team developed a technology that provides hints from the GPU device side to enable the memory expansion device to perform memory reads in advance.

With this technology, the memory expansion device can begin its memory reads earlier, so that when the GPU actually needs the data it is served from the expander’s cache (a small but fast temporary data store), yielding faster read performance.
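The hint mechanism is conceptually a prefetch: the GPU tells the expander which addresses it will soon need, the expander pulls them from its slow backing media into its cache ahead of time, and the later demand read becomes a cache hit. A hypothetical sketch (the class, counters, and data are invented for illustration):

```python
# Hypothetical sketch of read-ahead hints: hinted addresses are pulled
# into the expander's cache early, so demand reads avoid slow media.

SLOW_MEDIA = {addr: addr * 2 for addr in range(100)}  # stands in for backing media

class Expander:
    def __init__(self):
        self.cache = {}          # small, fast cache inside the expander
        self.media_reads = 0     # counts accesses to the slow media

    def hint(self, addrs):
        # GPU-side hint: start reading these addresses ahead of time,
        # off the critical path of the computation.
        for a in addrs:
            if a not in self.cache:
                self.media_reads += 1
                self.cache[a] = SLOW_MEDIA[a]

    def read(self, addr):
        # Demand read: a hit is served from the cache, so the slow
        # media access has already happened in the background.
        if addr in self.cache:
            return self.cache[addr]
        self.media_reads += 1
        return SLOW_MEDIA[addr]

dev = Expander()
dev.hint([10, 11, 12])           # issued before the data is needed
value = dev.read(11)             # served from the cache
print(value, dev.media_reads)    # demand read added no media access
```

The benefit shows up in the counter: the demand read triggers no additional media access, because the hint already paid that cost while the GPU was busy elsewhere.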


< CXL-GPU Hardware Prototype >

 

This research was conducted using the ultra-fast CXL controller and CXL-GPU prototype from Panmnesia*, a semiconductor fabless startup. In verification tests on Panmnesia’s CXL-GPU prototype, the research team confirmed that AI services ran 2.36 times faster than with existing GPU memory expansion technology. The results will be presented in July at HotStorage (the USENIX Workshop on Hot Topics in Storage and File Systems) in Santa Clara.

 

*Panmnesia possesses a proprietary CXL controller, built entirely on domestic technology, that is the first in the industry to cut the round-trip latency of CXL memory management operations to double-digit nanoseconds (a nanosecond is 10^-9 seconds). This is more than three times faster than the latest CXL controllers worldwide. Panmnesia used this high-speed CXL controller to attach multiple memory expansion devices directly to the GPU, allowing a single GPU to form a terabyte-scale memory space.

Professor Jung stated, “Accelerating the market adoption of CXL-GPU can significantly reduce the memory expansion costs for big tech companies operating large-scale AI services.”

 


< Evaluation Results of CXL-GPU Execution Time >