Progress Indication for Deep Learning Model Training: A Feasibility Demonstration.
Qifei Dong, Gang Luo. Published in: IEEE Access: Practical Innovations, Open Solutions (2020)
Deep learning is the state-of-the-art approach for many machine learning tasks. Yet, training a deep learning model on a large data set is often time-consuming, taking days or even months. During model training, it is desirable to offer a non-trivial progress indicator that continuously projects the remaining training time and the fraction of the training work completed. This makes the training process more user-friendly, and the information given by the progress indicator can also assist in workload management. In this paper, we present the first set of techniques to support non-trivial progress indicators for deep learning model training when early stopping is allowed. We report an implementation of these techniques in TensorFlow and our evaluation results on both convolutional and recurrent neural networks. Our experiments show that the progress indicator offers useful information even when the run-time system load varies over time, and that it can self-correct its initial estimation errors, if any, as training proceeds.
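The abstract gives no implementation details, so the sketch below is only a rough, hypothetical illustration of the basic idea of continuously projecting the remaining training time and the fraction of work completed: a TensorFlow Keras callback that averages the per-epoch times observed so far and extrapolates to a fixed epoch budget. The class name ProgressIndicatorCallback and the max_epochs parameter are illustrative assumptions, and the sketch deliberately omits the paper's handling of early stopping and varying run-time system load.

    import time
    import tensorflow as tf

    class ProgressIndicatorCallback(tf.keras.callbacks.Callback):
        # Hypothetical sketch: projects remaining training time by averaging
        # the per-epoch times observed so far and extrapolating to a fixed
        # epoch budget. It does NOT model early stopping or load variation,
        # which the paper's actual techniques address.
        def __init__(self, max_epochs):
            super().__init__()
            self.max_epochs = max_epochs
            self.epoch_times = []

        def on_epoch_begin(self, epoch, logs=None):
            self._epoch_start = time.time()

        def on_epoch_end(self, epoch, logs=None):
            self.epoch_times.append(time.time() - self._epoch_start)
            epochs_done = epoch + 1
            avg_epoch_time = sum(self.epoch_times) / len(self.epoch_times)
            remaining_seconds = avg_epoch_time * (self.max_epochs - epochs_done)
            fraction_done = epochs_done / self.max_epochs
            print(f"[progress] {fraction_done:.0%} of the epoch budget done, "
                  f"about {remaining_seconds:.0f} s of training remaining")

Such a callback would be passed to model.fit(..., epochs=max_epochs, callbacks=[ProgressIndicatorCallback(max_epochs)]); the projection is refreshed at the end of every epoch, which mirrors, at epoch granularity, the continuous projection described above.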