Professor Myoungsoo Jung's Research Team Develops World's First AI Semiconductor for CXL 3.0-Based Search Engines

A research team led by Professor Myoungsoo Jung of the School of Electrical Engineering has developed the world's first AI semiconductor for CXL 3.0-based search engines.


Services that have recently gained attention, such as image search, databases, recommendation systems, and advertising, rely on the Approximate Nearest Neighbor Search (ANNS) algorithm.

When ANNS is deployed in real services, the required datasets are very large, demanding a correspondingly large amount of memory.

To work around this, prior systems have relied on compression-based and storage-based approaches, which suffer from low accuracy and low performance, respectively.
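Graph-based ANNS methods, the family targeted by this work, trade exactness for speed by greedily walking a proximity graph toward the query instead of scanning every vector. A minimal sketch of that greedy best-first search follows; the graph, vectors, and parameter names are all illustrative, not the team's implementation:

```python
import heapq

def l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_anns(graph, vectors, entry, query, ef=3):
    """Greedy best-first search over a proximity graph.

    graph:   node -> list of neighbor nodes
    vectors: node -> feature vector
    ef:      candidate beam width (larger = more accurate, slower)
    Returns the ef closest (distance, node) pairs found.
    """
    visited = {entry}
    # min-heap of (distance, node) candidates still to expand
    candidates = [(l2(vectors[entry], query), entry)]
    best = list(candidates)  # running result set
    while candidates:
        d, node = heapq.heappop(candidates)
        # stop once the closest open candidate is worse than everything kept
        if len(best) >= ef and d > max(best)[0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            nd = l2(vectors[nb], query)
            heapq.heappush(candidates, (nd, nb))
            best.append((nd, nb))
    best.sort()
    return best[:ef]
```

Because every hop touches neighbor lists and raw vectors at arbitrary addresses, the whole dataset must sit in memory for this search to stay fast, which is exactly the pressure the compression and storage workarounds try (and fail) to relieve.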


To resolve the fundamental problem of limited memory expansion, the team turned to Compute Express Link (CXL). CXL is a protocol for CPU-device interconnection built on the PCI Express (PCIe) interface, providing high-speed connections to accelerators and memory expanders.

A CXL switch also provides scalability by allowing multiple memory expanders to be attached to a single port. However, memory expansion over CXL incurs longer access latency than local memory.


The team's AI semiconductor, CXL-ANNS, uses a CXL switch and CXL memory expanders to hold all the data ANNS needs in memory, eliminating the accuracy and performance loss of prior approaches.

It further raises performance through near-data processing and a locality-aware data placement scheme, both of which exploit the characteristics of ANNS.

The team built a CXL-ANNS prototype in-house and compared its performance against prior work.

CXL-ANNS delivered an average 111× performance improvement over prior work. In particular, compared with the approach Microsoft uses in its production services, it showed a 92× improvement.


This work will be presented under the title "CXL-ANNS: Software-Hardware Collaborative Memory Disaggregation and Computation for Billion-Scale Approximate Nearest Neighbor Search" at the USENIX Annual Technical Conference (USENIX ATC) 2023, a top-tier systems venue, to be held this July in Boston, USA.


Further details are available on the lab's website (http://camelab.org).


[Figure 1. Hardware prototype]


[Figure 2. CXL-ANNS logo]


[Figure 3. The CXL-ANNS research team (from left): Ph.D. candidates Junhyeok Jang and Seungjun Lee, M.S. candidate Hanjin Choi, Ph.D. candidates Miryeong Kwon and Hanyeoreum Bae, and Professor Myoungsoo Jung, all of the School of Electrical Engineering]


For more information, see the links below:


The Korea Economic Daily (Hankyung): https://www.hankyung.com/it/article/202305259204i

Herald Business: http://news.heraldcorp.com/view.php?ud=20230525000225

ChosunBiz: https://biz.chosun.com/science-chosun/technology/2023/05/25/4UW5LPX3WVARVIS3QBBICPINFM/

ET News: https://www.etnews.com/20230525000092

NeuroScaler: Neural Video Enhancement at Scale


 High-definition live streaming has experienced tremendous growth. However, the video quality of live video is often limited by the streamer’s uplink bandwidth. Recently, neural-enhanced live streaming has shown great promise in enhancing the video quality by running neural super-resolution at the ingest server. Despite its benefit, it is too expensive to be deployed at scale. To overcome the limitation, we present NeuroScaler, a framework that delivers efficient and scalable neural enhancement for live streams. First, to accelerate end-to-end neural enhancement, we propose novel algorithms that significantly reduce the overhead of video super-resolution, encoding, and GPU context switching. Second, to maximize the overall quality gain, we devise a resource scheduler that considers the unique characteristics of the neural-enhancing workload. Our evaluation on a public cloud shows NeuroScaler reduces the overall cost by 22.3× and 3.0-11.1× compared to the latest per-frame and selective neural-enhancing systems, respectively.
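NeuroScaler's scheduler is considerably more sophisticated than this, but the core idea of spending a fixed GPU budget where it buys the most quality can be sketched as a greedy gain-per-cost allocator; all names and numbers here are illustrative assumptions, not the paper's algorithm:

```python
def schedule(streams, budget):
    """Greedily allocate a GPU budget to live streams.

    streams: list of (name, expected_quality_gain, gpu_cost) tuples.
    budget:  total GPU-seconds available this scheduling window.
    Returns the names of streams chosen for neural enhancement.
    Illustrative only: a real scheduler would also model diminishing
    returns and per-stream anchor-frame selection.
    """
    # rank by expected quality gain per unit of GPU cost
    ranked = sorted(streams, key=lambda s: s[1] / s[2], reverse=True)
    chosen, spent = [], 0.0
    for name, gain, cost in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen
```

The point of the sketch is that a neural-enhancing workload has very uneven marginal value across streams, so a scheduler aware of that skew can beat uniform per-frame enhancement by a wide margin.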

TSPipe: Learn from Teacher Faster with Pipelines


The teacher-student (TS) framework, training a (student) network by utilizing an auxiliary superior (teacher) network, has been adopted as a popular training paradigm in many machine learning schemes, since the seminal work on knowledge distillation (KD) for model compression and transfer learning. Many recent self-supervised learning (SSL) schemes also adopt the TS framework, where teacher networks are maintained as the moving average of student networks, called the momentum networks. This paper presents TSPipe, a pipelined approach to accelerate the training process of any TS framework, including KD and SSL. Under the observation that the teacher network does not need a backward pass, our main idea is to schedule the computation of the teacher and student network separately, and fully utilize the GPU during training by interleaving the computations of the two networks and relaxing their dependencies. In case the teacher network requires a momentum update, we use delayed parameter updates only on the teacher network to attain high model accuracy. Compared to existing pipeline parallelism schemes, which sacrifice either training throughput or model accuracy, TSPipe provides better performance trade-offs, achieving up to 12.15× higher throughput.
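The momentum (EMA) update mentioned above is simple enough to show concretely; parameters are plain lists of floats here rather than model tensors, purely for illustration:

```python
def momentum_update(teacher, student, m=0.99):
    """Exponential-moving-average update of teacher parameters.

    teacher, student: flat lists of parameter values.
    m: momentum coefficient; larger m means a slower-moving teacher.
    No gradient ever flows through the teacher, which is the property
    that lets TSPipe schedule the teacher's forward pass independently
    of the student's backward pass.
    """
    return [m * t + (1.0 - m) * s for t, s in zip(teacher, student)]
```

Because this update depends only on parameter values, it can be applied with a delay (as TSPipe does) without breaking any gradient computation.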

Hierarchical User Status Classification for Imbalanced Biometric Data Class (https://ieeexplore.ieee.org/document/9722653)

With the proliferation of Internet of Things technologies, health care services that target a household equipped with IoT devices are widely emerging. In the meantime, the number of global single households is expected to rapidly grow. Contactless radar-based sensors are recently investigated as a convenient and practical means to collect biometric data of subjects in single households. In this paper, biometric data collected by contactless radar-based sensors installed in single households of the elderly under uncontrolled environments are analyzed, and a deep learning-based classification model is proposed that estimates a user's status in one of the predefined classes. In particular, the issue of imbalanced class sizes in the generated dataset is managed by reorganizing the classes into a hierarchical structure and designing the architecture for a deep learning-based status classification model. The experimental results verify that the proposed classification model has a noticeable impact in mitigating the issue of imbalanced class sizes as it enhances the classification accuracy of the individual class by up to 65% while improving the overall status classification accuracy by 6%.

Multi-head CNN and LSTM with Attention for User Status Estimation from Biometric Information (https://ieeexplore.ieee.org/document/9722697)

With Internet of Things technologies, healthcare services for smart homes are emerging. In the meantime, the number of households of single-living elderly who are distant from using smart devices is increasing, and contactless radar-based sensors are recently introduced to monitor the users in single households. In this paper, contactless radar-based sensors were installed in over 100 households of single-living elderly to collect their biometric data under uncontrolled environments. In addition, a deep learning-based classification model is proposed that estimates the user status in predefined classes. In particular, the classification model is designed with a multi-head convolutional neural network with long short-term memory (LSTM) and an attention mechanism. The proposed model aims to extract features in diverse resolutions from the biometric data while capturing the temporal causalities and relative importance of the features. The experimental results verify that the proposed classification model improves the status classification accuracy by 2.8% to 31.7% in terms of F1 score for the real-world dataset.


PreGNN: Hardware Acceleration to Take Preprocessing Off the Critical Path in Graph Neural Networks

In this paper, we observe that the main performance bottleneck of emerging graph neural networks (GNNs) is not the inference algorithms themselves, but their graph data preprocessing. To take such preprocessing off the critical path in GNNs, we propose PreGNN, a novel hardware automation architecture that accelerates all the tasks of GNN preprocessing from the beginning to the end. Specifically, PreGNN accelerates graph generation in parallel, samples neighbor nodes of a given graph, and prepares graph datasets entirely in hardware. To reduce the long latency of GNN preprocessing over hardware, we also propose simple, efficient combinational logic that can perform radix sort and arrange the data in a self-governing manner. We implement PreGNN in a customized coprocessor prototype that contains a 16nm FPGA with 64GB DRAM. The results show that PreGNN can shorten the end-to-end latency of GNN inferences by 10.7× while consuming 3.3× less energy, compared to a GPU-only system.
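Radix sort is a natural fit for combinational hardware because each pass is a comparison-free, stable redistribution by one digit. A software sketch of the same least-significant-digit scheme (not PreGNN's circuit, just the algorithm it implements):

```python
def radix_sort(values, base=16):
    """LSD radix sort over non-negative integers.

    Each pass performs a stable bucket pass on one base-`base` digit,
    mirroring how hardware can sort with simple routing logic instead
    of comparators.
    """
    if not values:
        return []
    out = list(values)
    shift = 1
    while shift <= max(out):
        buckets = [[] for _ in range(base)]
        for v in out:
            buckets[(v // shift) % base].append(v)  # stable within a digit
        out = [v for b in buckets for v in b]
        shift *= base
    return out
```

Stability of each pass is what makes the final order correct: earlier (less significant) passes are preserved whenever a later digit ties.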



Disentangling Sources of Risk for Distributional Multi-Agent Reinforcement Learning

Abstract: In cooperative multi-agent reinforcement learning, the outcomes of agent-wise policies are highly stochastic due to the two sources of risk: (a) random actions taken by teammates and (b) random transition and rewards. Although the two sources have very distinct characteristics, existing frameworks are insufficient to control the risk-sensitivity of agent-wise policies in a disentangled manner. To this end, we propose Disentangled RIsk-sensitive Multi-Agent reinforcement learning (DRIMA) to separately access the risk sources. For example, our framework allows an agent to be optimistic with respect to teammates (who can prosocially adapt) but more risk-neutral with respect to the environment (which does not adapt). Our experiments demonstrate that DRIMA significantly outperforms prior state-of-the-art methods across various scenarios in the StarCraft Multi-agent Challenge environment. Notably, DRIMA shows robust performance where prior methods learn only a highly suboptimal policy, regardless of reward shaping, exploration scheduling, and noisy (random or adversarial) agents.
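One common way distributional RL exposes a risk knob is to evaluate only a quantile slice of the return distribution rather than its mean. The sketch below is an illustrative quantile-based risk operator of that kind, not DRIMA's actual estimator; it shows how "optimistic" and "risk-averse" can be two settings of one parameter, which DRIMA then applies separately per risk source:

```python
def risk_sensitive_value(returns, level):
    """Mean of a quantile slice of sampled returns.

    returns: list of sampled return values.
    level > 0: keep the best `level` fraction (optimistic).
    level < 0: keep the worst |level| fraction (risk-averse).
    Illustrative distributional-RL risk operator only.
    """
    ordered = sorted(returns)
    k = max(1, int(abs(level) * len(ordered)))
    kept = ordered[-k:] if level > 0 else ordered[:k]
    return sum(kept) / len(kept)
```

With separate `level` settings for teammate-induced and environment-induced randomness, an agent can be optimistic toward adaptable teammates while staying risk-neutral toward the non-adaptive environment, which is the disentanglement the abstract describes.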

Robust Continual Test-time Adaptation: Instance-aware BN and Prediction-balanced Memory


Conference on Neural Information Processing Systems (NeurIPS), 2022.


Test-time adaptation (TTA) is an emerging paradigm that addresses distributional shifts between training and testing phases without additional data acquisition or labeling cost; only unlabeled test data streams are used for continual model adaptation. Previous TTA schemes assume that the test samples are independent and identically distributed (i.i.d.), even though they are often temporally correlated (non-i.i.d.) in application scenarios, e.g., autonomous driving. We discover that most existing TTA methods fail dramatically under such scenarios. Motivated by this, we present a new test-time adaptation scheme that is robust against non-i.i.d. test data streams. Our novelty is mainly two-fold: (a) Instance-Aware Batch Normalization (IABN) that corrects normalization for out-of-distribution samples, and (b) Prediction-balanced Reservoir Sampling (PBRS) that simulates i.i.d. data stream from non-i.i.d. stream in a class-balanced manner. Our evaluation with various datasets, including real-world non-i.i.d. streams, demonstrates that the proposed robust TTA not only outperforms state-of-the-art TTA algorithms in the non-i.i.d. setting, but also achieves comparable performance to those algorithms under the i.i.d. assumption.
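The class-balancing idea behind PBRS can be sketched with a small buffer that, when full, evicts from whichever predicted class currently dominates. This is a simplified stand-in (the paper's PBRS also does time-uniform reservoir sampling within the majority class), with all names chosen for illustration:

```python
import random

def pbrs_insert(memory, item, pred_class, capacity, rng=random):
    """Insert into a prediction-balanced buffer (simplified PBRS).

    memory: list of (item, pred_class) pairs, mutated in place.
    When the buffer is full and the incoming prediction is not the
    majority class, evict a random sample of the majority class, so
    the buffer stays balanced even under a temporally correlated
    (non-i.i.d.) stream.
    """
    if len(memory) < capacity:
        memory.append((item, pred_class))
        return
    counts = {}
    for _, c in memory:
        counts[c] = counts.get(c, 0) + 1
    majority = max(counts, key=counts.get)
    if pred_class != majority:
        victims = [i for i, (_, c) in enumerate(memory) if c == majority]
        memory[rng.choice(victims)] = (item, pred_class)
```

Adapting the model from this buffer instead of the raw stream is what lets normalization statistics and updates behave as if the data were i.i.d.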

FedBalancer: Data and Pace Control for Efficient Federated Learning on Heterogeneous Clients


ACM International Conference on Mobile Systems, Applications, and Services (MobiSys) 2022.


Federated Learning (FL) trains a machine learning model on distributed clients without exposing individual data. Unlike centralized training that is usually based on carefully-organized data, FL deals with on-device data that are often unfiltered and imbalanced. As a result, conventional FL training protocol that treats all data equally leads to a waste of local computational resources and slows down the global learning process. To this end, we propose FedBalancer, a systematic FL framework that actively selects clients' training samples. Our sample selection strategy prioritizes more "informative" data while respecting privacy and computational capabilities of clients. To better utilize the sample selection to speed up global training, we further introduce an adaptive deadline control scheme that predicts the optimal deadline for each round with varying client training data. Compared with existing FL algorithms with deadline configuration methods, our evaluation on five datasets from three different domains shows that FedBalancer improves the time-to-accuracy performance by 1.20∼4.48× while improving the model accuracy by 1.1∼5.0%. We also show that FedBalancer is readily applicable to other FL approaches by demonstrating that FedBalancer improves the convergence speed and accuracy when operating jointly with three different FL algorithms.
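The idea of prioritizing "informative" samples is often operationalized as keeping high-loss samples. The sketch below is a simplified stand-in for FedBalancer's on-client selection, not its exact rule (the real system also mixes in some low-loss samples and adapts the threshold across rounds); the function and parameter names are illustrative:

```python
def select_samples(losses, keep_ratio=0.6):
    """Select indices of high-loss ('more informative') samples.

    losses: per-sample training losses measured on the client.
    keep_ratio: fraction of the local dataset to train on this round.
    Returns the selected indices in ascending order.
    """
    k = max(1, int(keep_ratio * len(losses)))
    # rank samples by loss, highest first, and keep the top k
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])
```

Running each round on this smaller, higher-value subset is what lets a shorter, adaptively predicted round deadline still deliver useful updates from slow clients.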


Predicting Mind-Wandering with Facial Videos in Online Lectures


International Workshop on Computer Vision for Physiological Measurement (CVPM) 2022.


The importance of online education has been brought to the forefront due to COVID. Understanding students' attentional states is crucial for lecturers, but this could be more difficult in online settings than in physical classrooms. Existing methods that gauge online students' attention status typically require specialized sensors such as eye-trackers and thus are not easily deployable to every student in real-world settings. To tackle this problem, we utilize facial video from student webcams for attention state prediction in online lectures. We conduct an experiment in the wild with 37 participants, resulting in a dataset consisting of 15 hours of lecture-taking students' facial recordings with corresponding 1,100 attentional state probings. We present PAFE (Predicting Attention with Facial Expression), a facial-video-based framework for attentional state prediction that focuses on the vision-based representation of traditional physiological mind-wandering features related to partial drowsiness, emotion, and gaze. Our model only requires a single camera and outperforms gaze-only baselines.