Hierarchical User Status Classification for Imbalanced Biometric Data Class (https://ieeexplore.ieee.org/document/9722653)

With the proliferation of Internet of Things (IoT) technologies, healthcare services that target households equipped with IoT devices are widely emerging. Meanwhile, the number of single-person households worldwide is expected to grow rapidly. Contactless radar-based sensors have recently been investigated as a convenient and practical means of collecting biometric data from subjects in single-person households. In this paper, biometric data collected by contactless radar-based sensors installed in single-person households of the elderly under uncontrolled environments are analyzed, and a deep learning-based classification model is proposed that estimates a user's status as one of a set of predefined classes. In particular, the issue of imbalanced class sizes in the generated dataset is managed by reorganizing the classes into a hierarchical structure and designing the architecture of the deep learning-based status classification model accordingly. The experimental results verify that the proposed classification model noticeably mitigates the issue of imbalanced class sizes, enhancing the classification accuracy of individual classes by up to 65% while improving the overall status classification accuracy by 6%.
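
As a rough illustration of the hierarchical idea, a minimal PyTorch sketch is given below. The class grouping, layer sizes, and names are assumptions for illustration, not the paper's actual architecture: a coarse head first assigns a super-class, and a dedicated fine head then resolves the final status, so each head faces a more balanced sub-problem.

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Two-level status classifier: a coarse head picks a super-class,
    and a per-super-class fine head resolves the final status.
    The grouping below is hypothetical; the paper derives its hierarchy
    from its own class definitions."""

    def __init__(self, in_dim, n_super, fine_per_super):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, 64), nn.ReLU())
        self.coarse_head = nn.Linear(64, n_super)
        self.fine_heads = nn.ModuleList(
            [nn.Linear(64, n) for n in fine_per_super])

    def forward(self, x):
        h = self.backbone(x)
        coarse_logits = self.coarse_head(h)
        fine_logits = [head(h) for head in self.fine_heads]
        return coarse_logits, fine_logits

# Rare statuses grouped under one super-class keep the coarse problem
# balanced, and each fine head faces a more balanced sub-problem.
model = HierarchicalClassifier(in_dim=32, n_super=3,
                               fine_per_super=[2, 2, 3])
coarse, fine = model(torch.randn(8, 32))
```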

Multi-head CNN and LSTM with Attention for User Status Estimation from Biometric Information (https://ieeexplore.ieee.org/document/9722697)

With Internet of Things technologies, healthcare services for smart homes are emerging. Meanwhile, the number of single-living elderly who are unaccustomed to smart devices is increasing, and contactless radar-based sensors have recently been introduced to monitor users in single households. In this paper, contactless radar-based sensors were installed in over 100 households of single-living elderly to collect their biometric data under uncontrolled environments. In addition, a deep learning-based classification model is proposed that estimates the user status among predefined classes. In particular, the classification model is designed as a multi-head convolutional neural network with long short-term memory (LSTM) and an attention mechanism. The proposed model extracts features at diverse resolutions from the biometric data while capturing the temporal causalities and the relative importance of the features. The experimental results verify that the proposed classification model improves the status classification accuracy by 2.8% to 31.7% in terms of F1 score on the real-world dataset.
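
A minimal PyTorch sketch of such an architecture is shown below; the kernel sizes, channel counts, and hidden sizes are assumptions, since the paper defines its own configuration. Parallel 1-D convolution heads with different kernel sizes extract features at multiple temporal resolutions, an LSTM captures temporal causality, and additive attention scores the relative importance of time steps.

```python
import torch
import torch.nn as nn

class MultiHeadCNNLSTM(nn.Module):
    """Sketch of the described architecture: multi-head 1-D convolutions
    at several temporal resolutions, an LSTM over time, and additive
    attention pooling. All sizes are illustrative assumptions."""

    def __init__(self, in_ch, n_classes, kernels=(3, 5, 7), ch=16, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Conv1d(in_ch, ch, k, padding=k // 2) for k in kernels])
        self.lstm = nn.LSTM(ch * len(kernels), hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                            # x: (batch, channels, time)
        feats = torch.cat([torch.relu(h(x)) for h in self.heads], dim=1)
        out, _ = self.lstm(feats.transpose(1, 2))    # (batch, time, hidden)
        w = torch.softmax(self.attn(out), dim=1)     # attention over time steps
        ctx = (w * out).sum(dim=1)                   # importance-weighted summary
        return self.fc(ctx)

logits = MultiHeadCNNLSTM(in_ch=4, n_classes=5)(torch.randn(8, 4, 100))
```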


 

PreGNN: Hardware Acceleration to Take Preprocessing Off the Critical Path in Graph Neural Networks

In this paper, we observe that the main performance bottleneck of emerging graph neural networks (GNNs) is not the inference algorithms themselves, but their graph data preprocessing. To take such preprocessing off the critical path in GNNs, we propose PreGNN, a novel hardware automation architecture that accelerates all the tasks of GNN preprocessing from beginning to end. Specifically, PreGNN accelerates graph generation in parallel, samples neighbor nodes of a given graph, and prepares graph datasets entirely in hardware. To reduce the long latency of GNN preprocessing over hardware, we also propose simple, efficient combinational logic that can perform radix sort and arrange the data in a self-governing manner. We implement PreGNN in a customized coprocessor prototype that contains a 16nm FPGA with 64GB DRAM. The results show that PreGNN can shorten the end-to-end latency of GNN inference by 10.7x while consuming 3.3x less energy, compared to a GPU-only system.

https://ieeexplore.ieee.org/document/9837798
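
The hardware radix sort can be pictured with a small software analogue: a least-significant-digit radix sort orders edges by destination node ID one fixed-width digit at a time, and each counting pass is simple enough to map onto combinational logic. The sketch below is illustrative only; the parameter names and edge-list format are assumptions, not PreGNN's actual interface.

```python
def radix_sort_by_node_id(edges, key_bits=32, radix_bits=8):
    """Software analogue of a hardware LSD radix sort: edges are ordered
    by destination node ID one digit at a time, and each stable per-digit
    pass is simple enough to realize as combinational logic."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for src, dst in edges:
            buckets[(dst >> shift) & mask].append((src, dst))
        edges = [e for b in buckets for e in b]   # stable per-digit pass
    return edges

print(radix_sort_by_node_id([(0, 7), (1, 2), (2, 7), (3, 1)]))
```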

 

Disentangling Sources of Risk for Distributional Multi-Agent Reinforcement Learning

Abstract: In cooperative multi-agent reinforcement learning, the outcomes of agent-wise policies are highly stochastic due to two sources of risk: (a) random actions taken by teammates and (b) random transitions and rewards. Although the two sources have very distinct characteristics, existing frameworks are insufficient to control the risk-sensitivity of agent-wise policies in a disentangled manner. To this end, we propose Disentangled RIsk-sensitive Multi-Agent reinforcement learning (DRIMA), which treats the risk sources separately. For example, our framework allows an agent to be optimistic with respect to teammates (who can prosocially adapt) but risk-neutral with respect to the environment (which does not adapt). Our experiments demonstrate that DRIMA significantly outperforms prior state-of-the-art methods across various scenarios in the StarCraft Multi-Agent Challenge environment. Notably, DRIMA shows robust performance where prior methods learn only a highly suboptimal policy, regardless of reward shaping, exploration scheduling, and noisy (random or adversarial) agents.
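
A toy sketch of the disentangling idea is given below, under the assumption of a quantile-based return distribution; the paper's actual training procedure differs. Each risk source gets its own risk level, so an agent can be optimistic toward teammates while staying risk-neutral toward the environment, exactly as in the example above.

```python
import numpy as np

def tail_mean(quantiles, alpha):
    """Risk measure over quantile estimates of a return distribution:
    alpha in (0, 1] averages the worst alpha-fraction (risk-averse CVaR),
    alpha = 1 is risk-neutral, and a negative alpha averages the best
    |alpha|-fraction (optimistic)."""
    q = np.sort(np.asarray(quantiles))
    k = max(1, int(np.ceil(abs(alpha) * len(q))))
    return q[:k].mean() if alpha > 0 else q[-k:].mean()

def disentangled_value(env_quantiles, team_quantiles, alpha_env, alpha_team):
    # Hypothetical decomposition: separate quantile sets for environment
    # randomness and teammate randomness, each judged under its own risk
    # level (the disentangling idea, not DRIMA's exact update rule).
    return (tail_mean(env_quantiles, alpha_env)
            + tail_mean(team_quantiles, alpha_team))

rng = np.random.default_rng(0)
# Risk-neutral toward the environment, optimistic toward teammates.
v = disentangled_value(rng.normal(size=32), rng.normal(size=32),
                       alpha_env=1.0, alpha_team=-0.25)
print(v)
```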


Robust Continual Test-time Adaptation: Instance-aware BN and Prediction-balanced Memory

Conference

Conference on Neural Information Processing Systems (NeurIPS), 2022.

Abstract

Test-time adaptation (TTA) is an emerging paradigm that addresses distributional shifts between training and testing phases without additional data acquisition or labeling cost; only unlabeled test data streams are used for continual model adaptation. Previous TTA schemes assume that the test samples are independent and identically distributed (i.i.d.), even though they are often temporally correlated (non-i.i.d.) in application scenarios, e.g., autonomous driving. We discover that most existing TTA methods fail dramatically under such scenarios. Motivated by this, we present a new test-time adaptation scheme that is robust against non-i.i.d. test data streams. Our novelty is mainly two-fold: (a) Instance-Aware Batch Normalization (IABN) that corrects normalization for out-of-distribution samples, and (b) Prediction-balanced Reservoir Sampling (PBRS) that simulates an i.i.d. data stream from a non-i.i.d. stream in a class-balanced manner. Our evaluation with various datasets, including real-world non-i.i.d. streams, demonstrates that the proposed robust TTA not only outperforms state-of-the-art TTA algorithms in the non-i.i.d. setting, but also achieves comparable performance to those algorithms under the i.i.d. assumption.
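
A minimal sketch of the PBRS component is given below, assuming a fixed memory capacity split evenly across predicted classes; the paper's exact eviction rule may differ. Within each predicted class, classic reservoir sampling keeps a uniform subsample of that class's stream, so batches drawn from the memory look class-balanced and approximately i.i.d. even when the input stream is temporally correlated.

```python
import random
from collections import defaultdict

class PredictionBalancedReservoir:
    """Sketch of prediction-balanced reservoir sampling: a fixed-size
    memory keeps roughly equal counts per *predicted* class, and within
    each class, reservoir sampling keeps a uniform subsample of that
    class's stream. Simplified; the paper's eviction rule may differ."""

    def __init__(self, capacity, n_classes):
        self.capacity, self.n_classes = capacity, n_classes
        self.memory = defaultdict(list)   # predicted class -> stored samples
        self.seen = defaultdict(int)      # per-class stream counts

    def add(self, sample, pred_class):
        self.seen[pred_class] += 1
        per_class = self.capacity // self.n_classes
        bucket = self.memory[pred_class]
        if len(bucket) < per_class:
            bucket.append(sample)
        else:
            j = random.randrange(self.seen[pred_class])
            if j < per_class:             # classic reservoir-sampling step
                bucket[j] = sample

    def batch(self):
        # A class-balanced, approximately i.i.d. batch for adaptation.
        return [s for b in self.memory.values() for s in b]
```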


FedBalancer: Data and Pace Control for Efficient Federated Learning on Heterogeneous Clients

Conference

ACM International Conference on Mobile Systems, Applications, and Services (MobiSys) 2022.

Abstract

Federated Learning (FL) trains a machine learning model on distributed clients without exposing individual data. Unlike centralized training, which is usually based on carefully organized data, FL deals with on-device data that are often unfiltered and imbalanced. As a result, the conventional FL training protocol that treats all data equally leads to a waste of local computational resources and slows down the global learning process. To this end, we propose FedBalancer, a systematic FL framework that actively selects clients' training samples. Our sample selection strategy prioritizes more "informative" data while respecting the privacy and computational capabilities of clients. To better utilize sample selection to speed up global training, we further introduce an adaptive deadline control scheme that predicts the optimal deadline for each round with varying client training data. Compared with existing FL algorithms with deadline configuration methods, our evaluation on five datasets from three different domains shows that FedBalancer improves the time-to-accuracy performance by 1.20∼4.48× while improving the model accuracy by 1.1∼5.0%. We also show that FedBalancer is readily applicable to other FL approaches by demonstrating that FedBalancer improves the convergence speed and accuracy when operating jointly with three different FL algorithms.
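
As a rough sketch of the sample-selection idea (the threshold coordination and deadline prediction in the paper are more involved), a client could keep every sample whose loss exceeds a server-provided threshold, plus a small random slice of the easy ones so they are not forgotten:

```python
import numpy as np

def select_samples(losses, loss_threshold, easy_ratio=0.1, rng=None):
    """Simplified FedBalancer-style client-side selection: prioritize
    "informative" samples (loss above a server-coordinated threshold)
    plus a small random slice of easy samples. The threshold and ratio
    mechanics here are assumptions, not the paper's exact scheme."""
    rng = rng or np.random.default_rng()
    hard = np.flatnonzero(losses > loss_threshold)
    easy = np.flatnonzero(losses <= loss_threshold)
    keep_easy = (rng.choice(easy, size=int(len(easy) * easy_ratio),
                            replace=False) if len(easy) else easy)
    return np.concatenate([hard, keep_easy])

losses = np.random.default_rng(0).exponential(size=100)
idx = select_samples(losses, loss_threshold=np.median(losses))
print(len(idx), "of 100 samples selected for this round")
```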

 


Predicting Mind-Wandering with Facial Videos in Online Lectures

Conference

International Workshop on Computer Vision for Physiological Measurement (CVPM) 2022.

Abstract

The importance of online education has been brought to the forefront due to COVID-19. Understanding students' attentional states is crucial for lecturers, but this can be more difficult in online settings than in physical classrooms. Existing methods that gauge online students' attention status typically require specialized sensors such as eye-trackers and thus are not easily deployable to every student in real-world settings. To tackle this problem, we utilize facial video from student webcams for attention state prediction in online lectures. We conduct an experiment in the wild with 37 participants, resulting in a dataset consisting of 15 hours of lecture-taking students' facial recordings with 1,100 corresponding attentional state probes. We present PAFE (Predicting Attention with Facial Expression), a facial-video-based framework for attentional state prediction that focuses on the vision-based representation of traditional physiological mind-wandering features related to partial drowsiness, emotion, and gaze. Our model only requires a single camera and outperforms gaze-only baselines.
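
The pipeline's final stage can be pictured as a lightweight classifier over per-probe feature vectors. The sketch below uses random placeholder data and three illustrative features standing in for the drowsiness, emotion, and gaze families named above; it is not PAFE's actual feature set or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative only: each attention probe is summarized by one vector of
# hand-picked features from the three families the abstract names, e.g.
# mean eye-openness (drowsiness), dominant-emotion score, gaze dispersion.
rng = np.random.default_rng(0)
X = rng.normal(size=(1100, 3))      # one placeholder row per probe
y = rng.integers(0, 2, size=1100)   # 1 = mind-wandering (placeholder labels)

clf = LogisticRegression().fit(X, y)
print("predicted state for first probe:", clf.predict(X[:1]))
```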

 

FleX: A Flex Interconnected HPC System with Stochastic Load Balancing Scheme

Abstract:

We propose a new low-diameter interconnection network called FleX, which offers high flexibility when installing interconnections in an HPC system. FleX consists of multiple layers with connections only between neighboring layers, not within each layer. These structural properties make it easy to achieve a low diameter regardless of the scale. The cross-like connections between adjacent layers in FleX provide various alternative minimal paths, allowing FleX to have high resiliency and a wide bisection width. We also discuss the minimal routing scheme and a stochastic load balancing scheme (LBR) for the proposed interconnection network. Through cycle-based simulations, the performance of FleX is evaluated, and cost and power consumption analyses in comparison with other interconnection networks are also conducted. We verify that FleX has high configuration flexibility with regard to cost and performance, and provides lower latency and higher saturation throughput at the same cost than legacy interconnection networks for HPC systems. Moreover, in synergy with the proposed LBR, we verify that FleX can expand its saturation throughput further while sacrificing only a slight amount of latency.
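
A toy sketch of a stochastic load-balanced routing step in the spirit of LBR is shown below; the exact weighting rule is an assumption. Among the alternative minimal next hops that the cross-like inter-layer links provide, a router favors the less occupied output port, spreading traffic across the minimal paths.

```python
import random

def next_hop(minimal_hops, queue_len):
    """Pick one of the minimal next hops at random, weighted inversely by
    output-queue occupancy. Illustrative only; FleX's LBR scheme is
    defined in the paper."""
    weights = [1.0 / (1 + queue_len[h]) for h in minimal_hops]
    return random.choices(minimal_hops, weights=weights, k=1)[0]

# Two alternative minimal next hops toward the destination layer; the
# lightly loaded port is chosen more often.
print(next_hop(["portA", "portB"], {"portA": 8, "portB": 1}))
```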


Youngeun Kwon and Minsoo Rhu, “Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards,” The 49th IEEE/ACM International Symposium on Computer Architecture (ISCA-49), New York, NY, June 2022

Abstract:

Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers. A critical challenge of training RecSys is its high memory capacity requirement, reaching hundreds of GBs to TBs of model size. In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers. Unfortunately, training embeddings involves several memory-bandwidth-intensive operations that are at odds with the slow CPU memory, causing performance overheads. Prior work proposed to cache frequently accessed embeddings inside GPU memory as a means to filter down the embedding layer traffic to CPU memory, but this paper observes several limitations with such cache designs. In this work, we present a fundamentally different approach to designing embedding caches for RecSys. Our proposed ScratchPipe architecture utilizes unique properties of RecSys training to develop an embedding cache that sees not only the past but also the “future” cache accesses. ScratchPipe exploits this property to guarantee that the active working set of embedding layers can “always” be captured inside our proposed cache design, enabling embedding layer training to be conducted at GPU memory speed.
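
The “look forward” idea can be pictured with a small software cache model: because training inputs are queued ahead of execution, the cache knows future embedding accesses and can evict the resident entry reused farthest in the future (Belady's rule). The sketch below is a simplified software analogue under that assumption, not the ScratchPipe hardware design.

```python
def run_lookahead_cache(accesses, capacity):
    """Software model of a lookahead embedding cache: since future
    accesses are known from queued batches, evict the resident key whose
    next use is farthest away, keeping the active working set cached."""
    cache, hits = set(), 0
    for i, key in enumerate(accesses):
        if key in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            future = accesses[i + 1:]
            victim = max(cache, key=lambda k: future.index(k)
                         if k in future else len(future) + 1)
            cache.remove(victim)
        cache.add(key)
    return hits / len(accesses)

# Embedding-row IDs from queued minibatches; hit rate with 2 cache slots.
print(run_lookahead_cache([1, 2, 3, 1, 2, 4, 1, 2], capacity=2))
```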

 


Yunjae Lee, Jinha Chung, and Minsoo Rhu, “SmartSAGE: Training Large-scale Graph Neural Networks using In-Storage Processing Architectures,” The 49th IEEE/ACM International Symposium on Computer Architecture (ISCA-49), New York, NY, June 2022

Abstract

Graph neural networks (GNNs) can extract features by learning both the representation of each object (i.e., graph nodes) and the relationships across different objects (i.e., the edges that connect nodes), achieving state-of-the-art performance in various graph-based tasks. Despite their strengths, utilizing these algorithms in a production environment faces several challenges, as the number of graph nodes and edges can amount to several billions to hundreds of billions, requiring substantial storage space for training. Unfortunately, state-of-the-art ML frameworks employ an in-memory processing model, which significantly hampers the productivity of ML practitioners as it mandates that the overall working set fit within DRAM capacity. In this work, we first conduct a detailed characterization of a state-of-the-art, large-scale GNN training algorithm, GraphSAGE. Based on the characterization, we then explore the feasibility of utilizing capacity-optimized NVMe SSDs for storing memory-hungry GNN data, which enables large-scale GNN training beyond the limits of main memory size. Given the large performance gap between DRAM and SSD, however, blindly utilizing SSDs as a direct substitute for DRAM leads to significant performance loss. We therefore develop SmartSAGE, our software/hardware co-design based on an in-storage processing (ISP) architecture. Our work demonstrates that an ISP-based large-scale GNN training system can achieve both high-capacity storage and high performance, opening up opportunities for ML practitioners to train large GNN datasets without being hampered by the physical limitations of main memory size.
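
The capacity side of the idea can be sketched in a few lines: a node-feature table too large for DRAM lives in a file on an NVMe SSD and is memory-mapped, so a GraphSAGE-style minibatch gather touches only the sampled nodes' rows. The file name and shapes below are placeholders; the ISP co-design in the paper additionally pushes the gather into the storage device itself.

```python
import numpy as np

# Placeholder sizes: the node-feature table is stored on SSD and
# memory-mapped rather than loaded into DRAM.
N_NODES, FEAT_DIM = 100_000, 128
feats = np.lib.format.open_memmap("features.npy", mode="w+",
                                  dtype=np.float32,
                                  shape=(N_NODES, FEAT_DIM))

# A GraphSAGE-style gather reads only the sampled nodes' rows from SSD.
sampled = np.random.default_rng(0).integers(0, N_NODES, size=1024)
minibatch = feats[sampled]
print(minibatch.shape)   # (1024, 128)
```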
