Title: HW/SW Methods for Scalable Training of Deep Learning Models
Committee:
Dr. Tushar Krishna, ECE, Chair, Advisor
Dr. Alexandros Daglis, CoC
Dr. Alexey Tumanov, CoC
Dr. Srinivas Sridharan, Meta
Dr. Zhihao Jia, CMU
Abstract: The objective of the proposed thesis is to present novel HW/SW techniques for designing platforms for distributed training of Deep Learning (DL) models. DL applications are becoming an integral part of our society due to their wide applicability across domains such as vision, language processing, recommendation systems, and speech processing. Before being deployed, DL models need to be trained on training samples over many iterations to reach the desired accuracy. To improve accuracy, DL models are constantly growing in both model size and training-set size, making training extremely challenging: a single model can take months or even years to train. Distributed training aims to improve training speed by distributing the training task across many accelerators. However, distributed training introduces new overheads, such as communication overhead, that can limit scalability if left unaddressed.
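
To make the communication overhead concrete, below is a minimal data-parallel training sketch, assuming PyTorch with the gloo backend; the linear model, batch size, and hyperparameters are illustrative placeholders rather than the proposed techniques. Each accelerator computes gradients on its local mini-batch, then gradients are averaged across all ranks with an all-reduce; that collective is the communication step whose cost grows with model and cluster size.

    # Minimal data-parallel training sketch (illustrative; assumes PyTorch
    # with the gloo backend; model and hyperparameters are placeholders).
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn

    def train(rank: int, world_size: int):
        # Each rank holds a full model replica and trains on its own data shard.
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        model = nn.Linear(1024, 1024)                 # placeholder model
        opt = torch.optim.SGD(model.parameters(), lr=0.01)

        for _ in range(10):                           # placeholder training loop
            x = torch.randn(32, 1024)                 # this rank's local mini-batch
            loss = model(x).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            # Communication step: average gradients across all ranks.
            # This all-reduce is the overhead that grows with model size and
            # accelerator count, and can limit scalability if left unaddressed.
            for p in model.parameters():
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
            opt.step()
        dist.destroy_process_group()

    if __name__ == "__main__":
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        torch.multiprocessing.spawn(train, args=(2,), nprocs=2)

In this sketch the all-reduce volume is proportional to the number of model parameters on every iteration, which is why hiding or reducing that communication is central to scalable distributed training.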