The scale of model parameters and datasets is rapidly growing for high accuracy in various areas. To train a large-scale deep neural network (DNN) model, a huge amount of computation and memory is required; therefore, a parallelization technique for training large-scale DNN models has attracted attention. A number of approaches have been proposed to parallelize large-scale DNN models, but these schemes lack scalability because of their long communication time and limited worker memory. They often sacrifice accuracy to reduce communication time.
In this work, we proposed an efficient parallelism strategy named group hybrid parallelism (GHP) to minimize the training time without any accuracy loss. Two key ideas inspired our approach. First, grouping workers and training them by groups reduces unnecessary communication overhead among workers. It saves a huge amount of network resources in the course of training large-scale networks. Second, mixing data and model parallelism can reduce communication time and mitigate the worker memory issue. Data and model paralleism are complementary to each other so the training time can be enhanced when they are combined. We analyzed the training time model of the data and model parallelism, and based on the training time model, we demonstrated the heuristics that determine the parallelization strategy for minimizing training time.
We evaluated group hybrid parallelism in comparison with existing parallelism schemes, and our experimental results show that group hybrid parallelism outperforms them.