As artificial intelligence technology matures, competition to develop DNNs with higher accuracy has intensified. To reach higher accuracy, DNNs keep growing in size, which sharply increases the time and cost of development. In this paper, we propose EDDIS, a distributed DNN training platform that integrates heterogeneous GPU resources to provide high-speed distributed training. EDDIS provides a methodology for adapting existing TensorFlow/PyTorch code for distributed training, and it offers an asynchronous parameter update method that avoids the straggler problem of synchronous parameter updates, thereby delivering superior distributed training performance.
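To illustrate the idea behind asynchronous parameter updates, the following is a minimal sketch, not EDDIS's actual API: each worker computes a gradient at its own pace and the shared parameters are updated immediately, so fast workers never block on a straggler as they would behind a synchronous barrier. The toy gradient, learning rate, and thread-based "parameter server" are illustrative assumptions.

```python
# Minimal sketch of asynchronous parameter updates (hypothetical, not EDDIS's API).
import threading
import time
import random
import torch

params = torch.zeros(4)   # shared "parameter server" state
lock = threading.Lock()
lr = 0.1

def worker(worker_id: int, steps: int) -> None:
    for _ in range(steps):
        # Read the current parameters; a stale copy is acceptable
        # in asynchronous training.
        with lock:
            local = params.clone()
        # Simulate uneven compute speed -- the source of stragglers.
        time.sleep(random.uniform(0.01, 0.05 * worker_id))
        # Toy gradient: pull parameters toward an all-ones target.
        grad = local - torch.ones_like(local)
        # Apply the update immediately, with no synchronization barrier.
        with lock:
            params.add_(grad, alpha=-lr)

threads = [threading.Thread(target=worker, args=(i + 1, 20)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(params)  # converges toward the target despite uneven worker speeds
```

In a synchronous scheme, every step would wait for the slowest worker before applying the averaged update; in the asynchronous scheme sketched above, updates land as soon as each worker finishes, at the cost of applying gradients computed from slightly stale parameters.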