Prof. Dongsu Han’s Team Receives NeurIPS Spotlight for Edge-Assisted LLM Inference Technology

<(From left) Professor Dongsu Han, Dr. Jinwoo Park and Master’s student Seunggeun Cho>
A research team led by Professor Dongsu Han from KAIST’s School of Electrical Engineering has developed an edge-assisted inference framework that dramatically reduces large language model (LLM) service costs by utilizing affordable consumer-grade GPUs.
 

Today, LLM inference services rely entirely on dedicated accelerators and GPUs in data centers, so serving large language models at scale requires substantial financial and infrastructure investment. High-performance consumer-grade GPUs, far cheaper than data center GPUs, are now widely available at the edge outside data centers, but structural limitations of existing LLM inference architectures prevent their efficient use over the public internet, where bandwidth and latency are constrained.

 

The research team developed SpecEdge, an edge-assisted inference framework that addresses these challenges. SpecEdge reduces LLM inference costs by distributing computation between consumer-grade edge GPUs and data center GPUs, and it adopts speculative decoding to keep communication between the two efficient over the internet. In speculative decoding, a relatively small language model quickly generates multiple high-probability tokens, which a large language model then verifies. SpecEdge deploys a small draft model on edge GPUs to generate high-probability token sequences, then sends them to data center GPUs for batch verification.
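The draft-then-verify loop can be illustrated with a minimal sketch. The toy functions below stand in for the real models and are purely illustrative, not SpecEdge's actual API: the draft model proposes k tokens autoregressively, and the target model checks them in one batched pass, keeping the longest matching prefix and substituting its own token at the first mismatch.

```python
def draft_next(token):
    """Toy deterministic stand-in for the small edge-side draft model."""
    return (token * 7 + 3) % 100

def target_next(token):
    """Toy stand-in for the large server-side target model.
    It mostly agrees with the draft model but occasionally differs."""
    nxt = (token * 7 + 3) % 100
    return nxt if token % 10 != 0 else (nxt + 1) % 100

def speculative_step(last_token, k=4):
    """One round of speculative decoding: draft k tokens, verify them
    against the target model, keep the accepted prefix, and take the
    target's own token at the first mismatch."""
    # Edge: draft k tokens autoregressively with the small model.
    drafts, tok = [], last_token
    for _ in range(k):
        tok = draft_next(tok)
        drafts.append(tok)
    # Server: verify all k drafts (conceptually one batched pass).
    accepted, tok = [], last_token
    for d in drafts:
        expected = target_next(tok)
        if d == expected:
            accepted.append(d)
            tok = d
        else:
            accepted.append(expected)  # target's correction
            break
    return accepted

out = [1]
for _ in range(3):
    out += speculative_step(out[-1], k=4)
print(out)  # → [1, 10, 74, 21, 50, 54, 81, 70, 94]
```

Each round thus yields several tokens at the cost of a single verification pass on the large model, which is why offloading the drafting to a cheap edge GPU reduces data center load.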

 

<SpecEdge Framework Diagram>

 

SpecEdge employs a strategy where edge GPUs continue generating tokens while waiting for verification results from the server. After initial token generation, the edge pre-generates additional tokens along the highest-probability path, allowing immediate utilization of pre-generated tokens when all verification results match. Additionally, server-side pipeline optimization intelligently batches verification requests from multiple edges to maximize server GPU utilization. While one edge GPU drafts tokens, the server verifies other requests, eliminating idle time and enabling processing of more requests.

 

<Edge GPU Proactive Draft>
<Pipeline Batch Optimization for Server GPU>
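The proactive-drafting idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: function names and parameters (`proactive_round`, `extra`) are invented for the example, and the toy models here always agree, which is the best case where pre-drafted tokens are reusable.

```python
def draft_next(tok):
    """Toy deterministic stand-in for the edge-side draft model."""
    return (tok * 7 + 3) % 100

def target_next(tok):
    """Toy stand-in for the server-side target model; in this sketch it
    always agrees with the draft model (the best case for proactive drafting)."""
    return (tok * 7 + 3) % 100

def proactive_round(last_token, k=4, extra=2):
    """Illustrative sketch of proactive drafting: the edge submits k draft
    tokens for verification and, while waiting for the server's reply,
    keeps drafting `extra` tokens along the highest-probability path."""
    # Draft k tokens and submit them for verification.
    drafts, tok = [], last_token
    for _ in range(k):
        tok = draft_next(tok)
        drafts.append(tok)
    # While "waiting" on the network round trip, pre-draft extra tokens.
    pre_drafts = []
    for _ in range(extra):
        tok = draft_next(tok)
        pre_drafts.append(tok)
    # Verification result arrives: check the k submitted drafts.
    tok, all_accepted = last_token, True
    for d in drafts:
        if target_next(tok) != d:
            all_accepted = False
            break
        tok = d
    # Best case: everything matched, so the pre-drafted tokens are usable
    # immediately, hiding one network round trip of latency.
    if all_accepted:
        return drafts + pre_drafts
    return drafts  # a real system would roll back to the mismatch point

print(proactive_round(1))  # → [10, 73, 14, 1, 10, 73]
```

On the server side, the complementary optimization is scheduling: while one edge's pre-drafting fills its waiting time, the server verifies batches from other edges, so neither side idles.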

 

This research demonstrates the potential to reduce dependence on data center GPUs by leveraging widely deployed edge GPUs. The SpecEdge framework, which can be extended to NPUs at the edge, addresses cost concerns and limited GPU availability in data centers, providing opportunities to deploy high-quality LLM services. This could lower barriers to entry in the AI service market and stimulate competition, laying the foundation for the development of Korea’s AI industry ecosystem.

 

Professor Dongsu Han stated, “We will continue research to enable the use of user edge devices as LLM infrastructure, beyond edge cloud GPUs,” adding that “utilizing user edge resources will reduce the cost burden on service providers, lower barriers to accessing high-quality LLMs, and serve as the foundation for AI for everyone.”

 

This research was conducted with Dr. Jinwoo Park and Master’s student Seunggeun Cho from KAIST. The findings will be presented as a Spotlight paper (top 3.2% of submissions) at the Annual Conference on Neural Information Processing Systems (NeurIPS), a top-tier international conference in artificial intelligence, held in San Diego, USA, from December 2–7 (Paper title: SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs).