In this paper, we observe that the main performance bottleneck of emerging graph neural networks (GNNs) is not the inference algorithms themselves, but their graph data preprocessing. To take such preprocessing off the critical path in GNNs, we propose PreGNN, a novel hardware automation architecture that accelerates all the tasks of GNN preprocessing from beginning to end. Specifically, PreGNN generates graphs in parallel, samples neighbor nodes of a given graph, and prepares graph datasets entirely in hardware. To reduce the long latency of GNN preprocessing in hardware, we also propose simple, efficient combinational logic that can perform radix sort and arrange the data in a self-governing manner. We implement PreGNN in a customized coprocessor prototype that contains a 16nm FPGA with 64GB DRAM. The results show that PreGNN can shorten the end-to-end latency of GNN inference by 10.7x while consuming 3.3x less energy, compared to a GPU-only system.
https://ieeexplore.ieee.org/document/9837798
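For reference, the sketch below is a plain software rendering of the radix-sort step that the abstract above attributes to PreGNN's combinational logic. It only illustrates the underlying algorithm on 32-bit node IDs; the byte-wise digit choice and the function name are our own assumptions, not the hardware design.

```python
# Minimal software sketch of LSD radix sort over 32-bit node IDs.
# PreGNN performs this step in combinational logic; this is only an
# algorithmic illustration, not the hardware design.

def radix_sort_u32(keys):
    """Sort a list of unsigned 32-bit integers byte by byte (LSD first)."""
    for shift in (0, 8, 16, 24):          # one pass per byte
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)
        keys = [k for bucket in buckets for k in bucket]  # stable gather
    return keys

if __name__ == "__main__":
    import random
    ids = [random.getrandbits(32) for _ in range(10_000)]
    assert radix_sort_u32(ids) == sorted(ids)
```

An LSD radix sort is a natural fit for fixed-width node IDs because each pass is a stable bucketing step that maps directly onto hardware queues.
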
Abstract: In cooperative multi-agent reinforcement learning, the outcomes of agent-wise policies are highly stochastic due to two sources of risk: (a) random actions taken by teammates and (b) random transitions and rewards. Although the two sources have very distinct characteristics, existing frameworks are insufficient to control the risk-sensitivity of agent-wise policies in a disentangled manner. To this end, we propose Disentangled RIsk-sensitive Multi-Agent reinforcement learning (DRIMA), which assesses the two risk sources separately. For example, our framework allows an agent to be optimistic with respect to teammates (who can prosocially adapt) but risk-neutral with respect to the environment (which does not adapt). Our experiments demonstrate that DRIMA significantly outperforms prior state-of-the-art methods across various scenarios in the StarCraft Multi-Agent Challenge environment. Notably, DRIMA shows robust performance where prior methods learn only a highly suboptimal policy, regardless of reward shaping, exploration scheduling, and noisy (random or adversarial) agents.
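To make the notion of disentangled risk-sensitivity concrete, here is a toy aggregation of sampled returns that applies a different risk operator per source of randomness: an optimistic upper quantile over the teammate axis and a risk-neutral mean over the environment axis. This is our own numpy illustration of the idea, not DRIMA's actual value factorization or training objective.

```python
# Toy illustration of disentangled risk-sensitivity (not DRIMA's estimator).
# returns[i, j] = sampled return when teammates behave as in sample i and
# the environment transitions/rewards as in sample j.
import numpy as np

def disentangled_value(returns, teammate_quantile=0.75):
    """Optimistic w.r.t. teammates, risk-neutral w.r.t. the environment."""
    # Risk-neutral over environment randomness (axis 1): plain expectation.
    env_neutral = returns.mean(axis=1)
    # Optimistic over teammate randomness (axis 0): an upper quantile.
    return np.quantile(env_neutral, teammate_quantile)

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=(64, 128))
print(disentangled_value(samples))   # optimistic toward teammates
print(samples.mean())                # fully risk-neutral baseline
```
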

Conference
Conference on Neural Information Processing Systems (NeurIPS), 2022.
Abstract
Test-time adaptation (TTA) is an emerging paradigm that addresses distributional shifts between training and testing phases without additional data acquisition or labeling cost; only unlabeled test data streams are used for continual model adaptation. Previous TTA schemes assume that the test samples are independent and identically distributed (i.i.d.), even though they are often temporally correlated (non-i.i.d.) in application scenarios, e.g., autonomous driving. We discover that most existing TTA methods fail dramatically under such scenarios. Motivated by this, we present a new test-time adaptation scheme that is robust against non-i.i.d. test data streams. Our novelty is mainly two-fold: (a) Instance-Aware Batch Normalization (IABN) that corrects normalization for out-of-distribution samples, and (b) Prediction-balanced Reservoir Sampling (PBRS) that simulates an i.i.d. data stream from a non-i.i.d. stream in a class-balanced manner. Our evaluation with various datasets, including real-world non-i.i.d. streams, demonstrates that the proposed robust TTA not only outperforms state-of-the-art TTA algorithms in the non-i.i.d. setting, but also achieves comparable performance to those algorithms under the i.i.d. assumption.
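A minimal sketch of the reservoir-sampling idea in (b): keep one bounded buffer per predicted class and fill each with standard reservoir sampling, so that the union approximates a class-balanced i.i.d. subsample of a temporally correlated stream. The fixed per-class capacity and the interface are simplifying assumptions relative to the paper's PBRS.

```python
# Simplified sketch of prediction-balanced reservoir sampling (PBRS).
# One reservoir per *predicted* class; each reservoir is a standard
# uniform reservoir sampler, so the union approximates a class-balanced
# i.i.d. subsample of a temporally correlated stream.
import random
from collections import defaultdict

class BalancedReservoir:
    def __init__(self, per_class_capacity):
        self.cap = per_class_capacity
        self.buffers = defaultdict(list)   # predicted class -> samples
        self.seen = defaultdict(int)       # predicted class -> count

    def add(self, sample, predicted_class):
        self.seen[predicted_class] += 1
        buf = self.buffers[predicted_class]
        if len(buf) < self.cap:
            buf.append(sample)
        else:
            j = random.randrange(self.seen[predicted_class])
            if j < self.cap:
                buf[j] = sample            # replace uniformly at random

    def batch(self):
        return [s for buf in self.buffers.values() for s in buf]
```
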

Conference
ACM International Conference on Mobile Systems, Applications, and Services (MobiSys) 2022.
Abstract
Federated Learning (FL) trains a machine learning model on distributed clients without exposing individual data. Unlike centralized training that is usually based on carefully-organized data, FL deals with on-device data that are often unfiltered and imbalanced. As a result, the conventional FL training protocol, which treats all data equally, leads to a waste of local computational resources and slows down the global learning process. To this end, we propose FedBalancer, a systematic FL framework that actively selects clients’ training samples. Our sample selection strategy prioritizes more “informative” data while respecting the privacy and computational capabilities of clients. To better utilize sample selection to speed up global training, we further introduce an adaptive deadline control scheme that predicts the optimal deadline for each round with varying client training data. Compared with existing FL algorithms with deadline configuration methods, our evaluation on five datasets from three different domains shows that FedBalancer improves the time-to-accuracy performance by 1.20∼4.48× while improving the model accuracy by 1.1∼5.0%. We also show that FedBalancer is readily applicable to other FL approaches by demonstrating that FedBalancer improves the convergence speed and accuracy when operating jointly with three different FL algorithms.
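As a rough, hypothetical illustration of loss-based sample selection on a client under a round deadline (not FedBalancer's exact selection criterion or its adaptive deadline predictor), the sketch below keeps the highest-loss samples that fit a time budget, assuming a fixed per-sample training cost.

```python
# Rough sketch of on-client sample selection: prioritize high-loss samples
# under a per-round time budget. FedBalancer's actual selection threshold
# and adaptive deadline control are more involved; this only conveys the idea.

def select_samples(losses, per_sample_cost, round_deadline):
    """losses: list of (sample_id, loss); returns ids to train on this round."""
    budget = int(round_deadline / per_sample_cost)   # how many samples fit
    ranked = sorted(losses, key=lambda x: x[1], reverse=True)
    return [sample_id for sample_id, _ in ranked[:budget]]

# Example: 10 samples, 0.5 s each, 2.5 s deadline -> train on the 5 hardest.
losses = list(enumerate([0.1, 2.3, 0.4, 1.9, 0.2, 3.1, 0.05, 0.9, 1.2, 0.6]))
print(select_samples(losses, per_sample_cost=0.5, round_deadline=2.5))
```
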

Conference
International Workshop on Computer Vision for Physiological Measurement (CVPM) 2022.
Abstract
The importance of online education has been brought to the forefront due to COVID-19. Understanding students’ attentional states is crucial for lecturers, but this can be more difficult in online settings than in physical classrooms. Existing methods that gauge online students’ attention status typically require specialized sensors such as eye-trackers and thus are not easily deployable to every student in real-world settings. To tackle this problem, we utilize facial video from student webcams for attention state prediction in online lectures. We conduct an experiment in the wild with 37 participants, resulting in a dataset consisting of 15 hours of lecture-taking students’ facial recordings with corresponding 1,100 attentional state probings. We present PAFE (Predicting Attention with Facial Expression), a facial-video-based framework for attentional state prediction that focuses on the vision-based representation of traditional physiological mind-wandering features related to partial drowsiness, emotion, and gaze. Our model only requires a single camera and outperforms gaze-only baselines.
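A minimal sketch of this kind of prediction setup, under our own assumptions: per-frame facial features (e.g., eye openness, gaze offset, emotion scores) are aggregated over a probe window and fed to a binary classifier. The feature choices, window statistics, and logistic-regression classifier are illustrative stand-ins, not PAFE's actual pipeline.

```python
# Minimal sketch of window-level attention-state classification from
# per-frame facial features. Feature names, statistics, and the classifier
# are illustrative assumptions, not PAFE's actual model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def window_features(frames):
    """frames: (T, F) per-frame features -> one (2*F,) window vector."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

rng = np.random.default_rng(1)
X = np.stack([window_features(rng.normal(size=(300, 8))) for _ in range(200)])
y = rng.integers(0, 2, size=200)          # 1 = attentive, 0 = mind-wandering
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))
```
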

Abstract:
We propose a new low-diameter interconnection network called FleX, which offers high flexibility when installing interconnections in an HPC system. FleX consists of multiple layers with connections only between neighboring layers and not within each layer. These structural properties make it easy to achieve a low diameter regardless of the scale. The cross-like connections between adjacent layers in FleX provide various alternative minimal paths, allowing FleX to have high resiliency and a large bisection width. We also discuss the minimal routing scheme and a stochastic load balancing scheme (LBR) for the proposed interconnection network. Through cycle-based simulations, we evaluate the performance of FleX and also conduct cost and power-consumption analyses in comparison with other interconnection networks. We verify that FleX has high configuration flexibility with regard to cost and performance, and, at the same cost, provides lower latency and higher saturation throughput than legacy interconnection networks for HPC systems. Moreover, we verify that, combined with the proposed LBR, FleX can further expand its saturation throughput while sacrificing only a small amount of latency.
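To see why a layered topology with only inter-layer links can keep the diameter low, the toy below builds a small layered graph with straight and "cross" links between adjacent layers and measures its diameter by breadth-first search. The wiring rule is a stand-in of our own, not FleX's actual connection pattern.

```python
# Toy layered topology: nodes live in layers, edges only between adjacent
# layers (straight + "cross" links). The wiring rule here is our own
# stand-in, not FleX's topology; it only shows how diameter is measured.
from collections import deque

def build_layers(num_layers, width, crosses=(1, 2)):
    adj = {(l, i): set() for l in range(num_layers) for i in range(width)}
    for l in range(num_layers - 1):
        for i in range(width):
            for off in (0, *crosses, *(-c for c in crosses)):
                j = (i + off) % width
                adj[(l, i)].add((l + 1, j))
                adj[(l + 1, j)].add((l, i))
    return adj

def diameter(adj):
    def bfs(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(bfs(v) for v in adj)

print(diameter(build_layers(num_layers=4, width=16)))
```
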

Abstract:
Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirement, with model sizes reaching hundreds of GBs to TBs. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory-bandwidth-intensive operations, which is at odds with the slow CPU memory and causes performance overheads. Prior work proposed caching frequently accessed embeddings inside GPU memory as a means to filter down the embedding-layer traffic to CPU memory, but this paper observes several limitations of such a cache design. In this work, we present a fundamentally different approach to designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that sees not only past but also “future” cache accesses. ScratchPipe exploits this property to guarantee that the active working set of the embedding layers can “always” be captured inside our proposed cache design, enabling embedding-layer training to be conducted at GPU memory speed.
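The key observation, as stated above, is that RecSys training already knows which embedding rows upcoming batches will touch. The sketch below is a software analogy of such a lookahead cache with Belady-style eviction over a known future access trace; it conveys the idea only and is not ScratchPipe's actual cache architecture.

```python
# Software analogy of a lookahead embedding cache: because the training input
# pipeline queues batches ahead of time, future accesses are known, so the
# cache can keep rows and evict the one reused furthest in the future
# (Belady-style). This is an illustration, not ScratchPipe's actual design.
from collections import defaultdict

class LookaheadCache:
    def __init__(self, capacity, future_accesses):
        self.capacity = capacity
        self.cache = set()                       # embedding row ids on "GPU"
        self.future = defaultdict(list)          # row id -> future positions
        for t, row in enumerate(future_accesses):
            self.future[row].append(t)

    def access(self, t, row):
        self.future[row].pop(0)                  # consume this access
        if row in self.cache:
            return "hit"
        if len(self.cache) >= self.capacity:
            # Evict the cached row whose next use is furthest away.
            victim = max(self.cache,
                         key=lambda r: self.future[r][0] if self.future[r]
                         else float("inf"))
            self.cache.remove(victim)
        self.cache.add(row)
        return "miss"

trace = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 1]
cache = LookaheadCache(capacity=3, future_accesses=trace)
print([cache.access(t, r) for t, r in enumerate(trace)])
```
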

Abstract
Graph neural networks (GNNs) can extract features by learning both the representation of each object (i.e., graph nodes) and the relationships across different objects (i.e., the edges that connect nodes), achieving state-of-the-art performance in various graph-based tasks. Despite their strengths, utilizing these algorithms in a production environment faces several challenges, as the number of graph nodes and edges amounts to several billions to hundreds of billions, requiring substantial storage space for training. Unfortunately, state-of-the-art ML frameworks employ an in-memory processing model, which significantly hampers the productivity of ML practitioners as it mandates that the overall working set fit within DRAM capacity. In this work, we first conduct a detailed characterization of a state-of-the-art, large-scale GNN training algorithm, GraphSAGE. Based on the characterization, we then explore the feasibility of utilizing capacity-optimized NVMe SSDs for storing memory-hungry GNN data, which enables large-scale GNN training beyond the limits of main memory size. Given the large performance gap between DRAM and SSDs, however, blindly utilizing SSDs as a direct substitute for DRAM leads to significant performance loss. We therefore develop SmartSAGE, our software/hardware co-design based on an in-storage processing (ISP) architecture. Our work demonstrates that an ISP-based large-scale GNN training system can achieve both high-capacity storage and high performance, opening up opportunities for ML practitioners to train large GNN datasets without being hampered by the physical limitations of main memory size.
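To make the capacity-versus-bandwidth tension concrete, the sketch below shows the naive baseline of keeping node features on an SSD-backed numpy memmap and gathering sampled rows per minibatch; this is the "SSD as a direct DRAM substitute" setup whose overheads motivate SmartSAGE, not the in-storage processing co-design itself. The file name and sizes are arbitrary assumptions.

```python
# Baseline sketch: node features stored on SSD via np.memmap and gathered
# per minibatch. This is the naive "SSD as DRAM substitute" setup whose
# overheads motivate SmartSAGE; the ISP co-design itself is not shown here.
import numpy as np

NUM_NODES, FEAT_DIM = 100_000, 128
features = np.memmap("node_features.bin", dtype=np.float32,
                     mode="w+", shape=(NUM_NODES, FEAT_DIM))

def gather_batch(sampled_node_ids):
    """Fetch feature rows for the sampled neighborhood of one minibatch."""
    # Each fancy-index read turns into many small SSD accesses, which is
    # where the DRAM-vs-SSD performance gap shows up.
    return np.asarray(features[sampled_node_ids])

batch = np.random.randint(0, NUM_NODES, size=4096)
print(gather_batch(batch).shape)      # (4096, 128)
```
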

Geon Lee and Kijung Shin
ICDM 2021: IEEE International Conference on Data Mining 2021
Abstract: Group interactions arise in our daily lives (email communications, on-demand ride sharing, comment interactions on online communities, to name a few), and they together form hypergraphs that evolve over time. Given such temporal hypergraphs, how can we describe their underlying design principles? If their sizes and time spans are considerably different, how can we compare their structural and temporal characteristics?
In this work, we define 96 temporal hypergraph motifs (TH-motifs), and propose the relative occurrences of their instances as an answer to the above questions. TH-motifs categorize the relational and temporal dynamics among three connected hyperedges that appear within a short time. For scalable analysis, we develop THyMe+, a fast and exact algorithm for counting the instances of TH-motifs in massive hypergraphs, and show that THyMe+ is up to 2,163× faster than baselines while requiring less space. Using it, we investigate 11 real-world temporal hypergraphs from various domains. We demonstrate that TH-motifs provide important information useful for downstream tasks and reveal interesting patterns, including the striking similarity between temporal hypergraphs from the same domain.
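For intuition only, the brute-force sketch below enumerates triples of connected hyperedges arriving within a time window delta and buckets them by a coarse pairwise-overlap signature. The actual TH-motif definition distinguishes 96 classes, and THyMe+ counts them exactly without this kind of exhaustive enumeration.

```python
# Naive enumeration of temporally close hyperedge triples, bucketed by a
# coarse overlap signature. TH-motifs distinguish 96 classes and THyMe+
# counts them exactly without brute force; this is only for intuition.
from itertools import combinations
from collections import Counter

def coarse_motif_counts(hyperedges, delta):
    """hyperedges: list of (timestamp, frozenset_of_nodes), sorted by time."""
    counts = Counter()
    for i, (t_i, e_i) in enumerate(hyperedges):
        # Hyperedges arriving within delta after e_i.
        window = [e for t, e in hyperedges[i + 1:] if t - t_i <= delta]
        for e_j, e_k in combinations(window, 2):
            overlaps = (bool(e_i & e_j), bool(e_i & e_k), bool(e_j & e_k))
            if sum(overlaps) < 2:   # require the three hyperedges to be connected
                continue
            counts[overlaps] += 1
    return counts

edges = [(0, frozenset("ab")), (1, frozenset("bc")),
         (2, frozenset("cd")), (5, frozenset("ab"))]
print(coarse_motif_counts(edges, delta=3))
```
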

Hyeonjeong Shin, Taehyung Kwon, Neil Shah, and Kijung Shin
WSDM 2022: International Conference on Web Search and Data Mining 2022
Abstract: A variety of tasks on dynamic graphs, including anomaly detection, community detection, compression, and graph understanding, have been formulated as problems of identifying constituent (near) bi-cliques (i.e., complete bipartite graphs). Even when we restrict our attention to maximal ones, there can be exponentially many near bi-cliques, and thus finding all of them is practically impossible for large graphs. Then, two questions naturally arise: (Q1) What is a “good” set of near bi-cliques? That is, given a set of near bi-cliques in the input dynamic graph, how should we evaluate its quality? (Q2) Given a large dynamic graph, how can we rapidly identify a high-quality set of near bi-cliques in it? Regarding Q1, we measure how concisely, precisely, and exhaustively a given set of near bi-cliques describes the input dynamic graph. We combine these three perspectives systematically based on the Minimum Description Length principle. Regarding Q2, we propose CutNPeel, a fast search algorithm for a high-quality set of near bi-cliques. By adaptively re-partitioning the input graph, CutNPeel reduces the search space and at the same time improves the search quality. Our experiments using six real-world dynamic graphs demonstrate that CutNPeel is (a) High-quality: providing near bi-cliques of up to 51.2% better quality than its state-of-the-art competitors, (b) Fast: up to 68.8x faster than the next-best competitor, and (c) Scalable: scaling to graphs with 134 million edges. We also show successful applications of CutNPeel to graph compression and pattern discovery.
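As a rough illustration of the MDL view in Q1, the sketch below scores a set of near bi-cliques by the bits needed to list their node sets plus the bits needed to correct the edges they miss or wrongly cover. This simplified, static-bipartite encoding is our own; the paper's cost model over dynamic graphs is more refined.

```python
# Rough MDL-flavored score for a set of near bi-cliques describing a graph.
# This simplified encoding (list node sets + correct errors) is our own
# illustration, not CutNPeel's exact cost model.
import math

def description_length(edges, bicliques, num_left, num_right):
    """edges: set of (u, v); bicliques: list of (set_of_u, set_of_v)."""
    covered = set()
    model_bits = 0.0
    for left, right in bicliques:
        # Bits to list the member nodes of each near bi-clique.
        model_bits += len(left) * math.log2(num_left)
        model_bits += len(right) * math.log2(num_right)
        covered |= {(u, v) for u in left for v in right}
    # Bits to correct every mismatch between the model and the real graph.
    errors = len(covered ^ edges)
    error_bits = errors * math.log2(num_left * num_right)
    return model_bits + error_bits

edges = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 3)}
print(description_length(edges, [({0, 1}, {0, 1})], num_left=4, num_right=4))
```
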
