Sangwon Lee, Gyuyoung Park, and Myoungsoo Jung
12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2020, Poster
https://www.usenix.org/conference/hotstorage20/presentation/lee
We present TensorPRAM, a scalable heterogeneous deep learning accelerator that realizes an FPGA-based domain-specific architecture and can be used to form a computational array for deep neural networks (DNNs). The current design of TensorPRAM includes systolic-array hardware, which accelerates general matrix multiplication (GEMM) and the convolutions of DNNs. To reduce data movement overhead between the host and the accelerator, we further replace TensorPRAM’s on-board memory with a dense but byte-addressable storage class memory (PRAM). We prototype TensorPRAM by placing all the logic of a general processor, a front-end host interface module, and the systolic-array and PRAM controllers into a single FPGA chip, such that one or more TensorPRAMs can be attached to the host over a PCIe fabric as a scalable option. Our real-system evaluations show that TensorPRAM reduces the execution time of various DNN workloads by 99% and 48%, on average, compared with a processor-only accelerator and a systolic-array-only accelerator, respectively.
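As a rough illustration of how a systolic array computes GEMM, the following sketch simulates a generic output-stationary array in software. It is not TensorPRAM's hardware design: the function name `systolic_gemm`, the output-stationary dataflow, and the cycle count are assumptions chosen for clarity. Each processing element (i, j) accumulates one output element, and operands are skewed so that A[i][k] and B[k][j] meet at that element on cycle t = i + j + k.

```python
def systolic_gemm(A, B):
    """Simulate an output-stationary systolic array computing C = A x B.

    Hypothetical sketch: PE (i, j) holds C[i][j]; on cycle t it sees the
    operand pair with k = t - i - j, mimicking skewed input wavefronts.
    """
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    cycles = M + N + K - 2  # last pair arrives at t = (M-1)+(N-1)+(K-1)
    for t in range(cycles):
        for i in range(M):
            for j in range(N):
                k = t - i - j
                if 0 <= k < K:  # operands have reached PE (i, j)
                    C[i][j] += A[i][k] * B[k][j]
    return C, cycles


result, cycles = systolic_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

In this model the number of cycles grows with M + N + K rather than with the product M * N * K, which is the appeal of spreading the multiply-accumulate work over a grid of processing elements.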

Junhyeok Jang, Donghyun Gouk, Jinwoo Shin, and Myoungsoo Jung
12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2020, Poster
https://www.usenix.org/conference/hotstorage20/presentation/jang
Flash block reclaiming, called garbage collection (GC), is the major performance bottleneck and sits on the critical path in modern SSDs. Thus, both industry and academia have paid significant attention to addressing the overhead imposed by GC. To eliminate GC overhead from the users’ viewpoint, several studies perform GC during user idle times. While these scheduling methods, called background GC, are a very practical approach, their main challenge is predicting the exact arrival time of the next I/O request.
We propose GC-Tutor, a GC scheduler that makes GC overhead invisible to users by precisely predicting future I/O arrival times with a deep learning algorithm. For the prediction of future arrivals, applying conventional deep neural networks (DNNs) to SSDs is unfortunately infeasible, as typical model training takes tens of hours or days. Instead, GC-Tutor leverages a lightweight online-learning method that learns the dynamic request arrival behavior from a small amount of runtime information within the target SSD. Our evaluation results show that GC-Tutor reduces request suspension time by 82.4% and 67.9% compared with conventional rule-based and DNN-only GC schedulers, respectively, while increasing prediction accuracy by 16.9%, on average, under diverse real workloads.
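The background GC decision described above can be sketched as follows. This is only an illustration under stated assumptions: the predictor is a simple exponentially weighted moving average of inter-arrival gaps, used here as a stand-in for GC-Tutor's learning method, and the names `ArrivalPredictor` and `should_start_gc` are hypothetical.

```python
class ArrivalPredictor:
    """Online estimate of the next I/O arrival time.

    Assumption: an exponentially weighted moving average (EWMA) of
    inter-arrival gaps, not the model used by GC-Tutor.
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha          # weight given to the newest gap
        self.estimate = None        # current estimate of the gap
        self.last_arrival = None    # timestamp of the latest request

    def observe(self, arrival_time):
        """Update the gap estimate with a newly arrived request."""
        if self.last_arrival is not None:
            gap = arrival_time - self.last_arrival
            if self.estimate is None:
                self.estimate = gap
            else:
                self.estimate = (self.alpha * gap
                                 + (1 - self.alpha) * self.estimate)
        self.last_arrival = arrival_time

    def predicted_next_arrival(self):
        """Return the predicted next arrival, or None if unknown."""
        if self.estimate is None:
            return None
        return self.last_arrival + self.estimate


def should_start_gc(predictor, now, gc_duration):
    """Start background GC only if it should finish before the next request."""
    next_arrival = predictor.predicted_next_arrival()
    if next_arrival is None:
        return False  # no history yet, so stay conservative
    return now + gc_duration <= next_arrival


predictor = ArrivalPredictor(alpha=0.5)
for t in (0.0, 10.0, 20.0):
    predictor.observe(t)
print(should_start_gc(predictor, now=21.0, gc_duration=5.0))
print(should_start_gc(predictor, now=27.0, gc_duration=5.0))
```

The scheduler only hides GC from users when the prediction is accurate; an overestimated idle window makes a request wait behind GC, which is why prediction accuracy matters for request suspension time.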

We study the problem of unsupervised domain adaptation, which aims to obtain a prediction model for the target domain using labeled data from the source domain and unlabeled data from the target domain. A large body of recent research builds on the idea of extracting features that are not only invariant across both domains but also highly discriminative for the target domain.
In this paper, we propose an idea for improving discriminativeness: adding an extra artificial class and training the model on the given data together with GAN-generated samples of the new class.
The model trained with the new class samples extracts more discriminative features by repositioning the data of the existing classes in the target domain, thereby increasing the distances among the target clusters in the feature space. Our idea is highly generic, so it is compatible with many existing methods such as DANN, VADA, and DIRT-T.
We conduct various experiments on the standard datasets commonly used to evaluate unsupervised domain adaptation and demonstrate that our algorithm achieves state-of-the-art (SOTA) performance in many scenarios.

The carrier sense multiple access (CSMA) algorithm has been used for wireless medium access control (MAC) in standard 802.11 implementations due to its simplicity and generality. An extensive body of research on CSMA has accumulated, not only in the context of practical protocols but also as a distributed approach to optimal MAC scheduling. However, the current state-of-the-art CSMA (or its extensions) still suffers from poor performance, especially in multi-hop scenarios, and often requires patch-based solutions rather than a universal solution. In this paper, we propose an algorithm that adopts an experience-driven approach and trains a CSMA-based wireless MAC using deep reinforcement learning. We name our protocol Neuro-DCF. Two key challenges are (i) a stable training method for distributed execution and (ii) a unified training method that embraces various interference patterns and configurations. For (i), we adopt a multi-agent reinforcement learning framework, and for (ii), we introduce a novel graph neural network (GNN) based training structure. We provide extensive simulation results demonstrating that Neuro-DCF significantly outperforms 802.11 DCF and O-DCF, a recent theory-based MAC protocol, especially in terms of improving delay performance while preserving optimal utility. We believe our multi-agent reinforcement learning based approach will be of broad interest for other learning-based network controllers in different layers that require distributed operation.


Accurate individual load forecasting is an essential element of smart grid services. When training individual forecasting models for multiple customers, discrepancies in data distribution among customers should be considered. There are two simple ways to build such models: constructing a model for each customer independently, or training a single model that covers all customers. The independent approach is more accurate than the single-model approach, but it deploys many models, which makes resource use and management inefficient; the single-model approach has the opposite trade-off. Clustering-based forecasting could be a compromise between the two. However, previous studies are of limited use for individual forecasting because they focus on aggregated load and do not consider concept drift, which degrades accuracy over time. Therefore, we propose a distribution-aware temporal pooling framework that enhances clustering-based forecasting. For the clustering, we propose Variational Recurrent Deep Embedding (VaRDE), which works in a distribution-aware manner and is therefore suitable for processing individual load. It assigns customers to clusters anew each time, so the cluster to which a customer belongs changes dynamically to handle distribution change. We conducted experiments with real data, and the results showed better performance than previous studies with only a few models, even on unseen data, indicating high scalability.

Recently, satellite image analytics based on convolutional neural networks has been vigorously investigated; however, several challenges remain before such artificial intelligence systems can be applied in practice: (a) model explainability, which improves the reliability of the system by providing evidence for its prediction results; and (b) domain shift among images captured by multiple satellites whose image sensor specifications vary. To resolve these two issues in developing a deep model for satellite image analytics, we propose a multi-domain learning method based on attention-based adapters. As plug-ins to the backbone network, the adapter modules are designed to extract domain-specific features as well as to improve visual attention for input images. In addition, we discuss a strategy that alternates training between the backbone network and the adapters in order to effectively separate domain-invariant features from domain-specific features. Finally, we use Grad-CAM/LIME to provide visual explanations of the proposed network architecture. The experimental results demonstrate that the proposed method improves test accuracy, and its improvement in visual explainability is also validated.

The scale of model parameters and datasets is growing rapidly in pursuit of high accuracy in various areas. Training a large-scale deep neural network (DNN) model requires a huge amount of computation and memory; therefore, parallelization techniques for training large-scale DNN models have attracted attention. A number of approaches have been proposed to parallelize large-scale DNN models, but these schemes lack scalability because of their long communication time and limited worker memory. They often sacrifice accuracy to reduce communication time.
In this work, we propose an efficient parallelism strategy named group hybrid parallelism (GHP) to minimize training time without any accuracy loss. Two key ideas inspired our approach. First, grouping workers and training them by group reduces unnecessary communication overhead among workers, which saves a huge amount of network resources when training large-scale networks. Second, mixing data and model parallelism can reduce communication time and mitigate the worker memory issue. Data and model parallelism are complementary, so training time can be reduced when they are combined. We analyze a training time model of data and model parallelism and, based on this model, present heuristics that determine the parallelization strategy that minimizes training time.
We evaluated group hybrid parallelism in comparison with existing parallelism schemes, and our experimental results show that group hybrid parallelism outperforms them.

Rotary unmanned aerial vehicles (UAVs), also known as drones, have various advantages, yet their practical applications are restricted by their limited flight range. Increasing the flight range by enhancing the hardware is a challenging task. In this study, we introduce the first step toward systematic low-power optimization for drones based on the framework of electronic design automation (EDA). We attempt drone power management without in-depth knowledge of aerodynamics and control theory. Instead, we introduce a novel power model of drones using physical parameters that can affect power consumption, such as the three-axis velocity and acceleration, drone height, wind velocity, and the weight and volume of payloads. We detail the experimental setup, power modeling, accuracy verification, and optimization for minimum-energy paths. We achieved over 90% accuracy in power modeling without depending on aerodynamics. The proposed approach shows the feasibility of energy-aware flight trajectory optimization for rotary UAVs that considers the external forces affecting drones, such as wind. The proposed method achieves up to 24.01% energy savings through path changes that consider external forces.

Personalized recommendations are one of the most widely deployed machine learning (ML) workloads served from cloud datacenters. As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior studies. Unfortunately, little has been explored and understood regarding the training side of this emerging ML workload. In this paper, we first perform a detailed workload characterization study on training recommendation models, root-causing sparse embedding layer training as one of the most significant performance bottlenecks. We then propose an algorithm-architecture co-design called Tensor Casting, which enables the development of a generic accelerator architecture for tensor gather-reduce that encompasses all the key primitives of training embedding layers. When prototyped on a real CPU-GPU system, Tensor Casting provides 1.9-15x improvements in training throughput compared to state-of-the-art approaches.
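To make the embedding-layer primitives concrete, here is a minimal pure-Python sketch of a gather followed by a sum reduction (forward pass) and the matching gradient accumulation (backward pass). It illustrates the generic operations only, not the Tensor Casting algorithm or its accelerator; the function names `gather_reduce` and `scatter_add_grad` and the choice of sum pooling are assumptions.

```python
def gather_reduce(table, indices_per_sample):
    """Forward pass: gather embedding rows per sample and sum them."""
    dim = len(table[0])
    pooled_outputs = []
    for indices in indices_per_sample:
        pooled = [0.0] * dim
        for idx in indices:            # sparse, data-dependent row reads
            for d in range(dim):
                pooled[d] += table[idx][d]
        pooled_outputs.append(pooled)
    return pooled_outputs


def scatter_add_grad(indices_per_sample, grad_outputs):
    """Backward pass of sum pooling: every gathered row receives its
    sample's output gradient; rows used several times accumulate."""
    dim = len(grad_outputs[0])
    grad_rows = {}                     # only touched rows get a gradient
    for indices, grad in zip(indices_per_sample, grad_outputs):
        for idx in indices:
            acc = grad_rows.setdefault(idx, [0.0] * dim)
            for d in range(dim):
                acc[d] += grad[d]
    return grad_rows


table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lookups = [[0, 2], [2]]                # sample 0 reads rows 0 and 2
print(gather_reduce(table, lookups))
print(scatter_add_grad(lookups, [[1.0, 1.0], [0.5, 0.5]]))
```

Both passes are dominated by irregular memory accesses over a few rows of a large table, which gives some intuition for why sparse embedding training can become a bottleneck.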

In cloud ML inference systems, batching is an essential technique for increasing throughput, which helps optimize total cost of ownership. Prior graph batching combines individual DNN graphs into a single one, allowing multiple inputs to be executed concurrently. We observe that coarse-grained graph batching handles dynamic inference request traffic suboptimally, leaving significant performance on the table. This paper proposes LazyBatching, an SLA-aware batching system that considers both scheduling and batching at the granularity of individual graph nodes, rather than the entire graph, for flexible batching. We show that LazyBatching can intelligently determine the set of nodes that can be efficiently batched together, achieving average improvements of 15x, 1.5x, and 5.5x over graph batching in terms of average response time, throughput, and SLA satisfaction, respectively.