Title: Hardware-Friendly Model Compression for Deep Learning Accelerators

Committee:

Dr. Arijit Raychowdhury, ECE, Chair, Advisor

Dr. Justin Romberg, ECE

Dr. Shimeng Yu, ECE

Dr. Asif Khan, ECE

Dr. Yingyan Lin, CS

Abstract: The objective of the proposed research is to make energy-efficient Deep Neural Network (DNN) algorithms deployable on edge devices by developing hardware-aware DNN compression methods. The rising popularity of intelligent mobile devices and the computational cost of deep-learning-based models call for efficient and accurate on-device inference schemes. In particular, we propose four compression techniques. In the first, LGPS, we present a hardware-aware pruning method in which the locations of non-zero weights are derived in real time from a linear-feedback shift register (LFSR), so the sparsity pattern never needs to be stored. Using the proposed method, we demonstrate total energy and area savings of up to 63.96% and 64.23%, respectively, for the VGG-16 network on down-sampled ImageNet under iso-compression-rate and iso-accuracy conditions. Second, we propose a novel model compression scheme that allows inference to be carried out using bit-level sparsity, which can be implemented efficiently with in-memory computing macros. We introduce BitS-Net, a method that leverages bit sparsity (binary weight/activation representations containing more zeros than ones) in Compute-in-Memory (CIM) macros built with Resistive Random-Access Memory (RRAM) to develop energy-efficient DNN accelerators operating in inference mode. We demonstrate that BitS-Net improves energy efficiency by up to 5x for ResNet models on the ImageNet dataset. Third, we explore deep-learning quantization by developing knowledge distillation and gradual quantization for pruned networks. Finally, to achieve highly energy-efficient DNNs, we introduce a novel twofold sparsity method that sparsifies DNN models at the bit and network levels simultaneously, adding two separate regularization terms to the loss function so that both forms of sparsity are learned at the same time. We show that the proposed method sparsifies the network and enables the design of a highly energy-efficient deep-learning accelerator, ultimately helping bring artificial intelligence (AI) into our daily lives.
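
To make the LGPS idea concrete, the sketch below shows how an LFSR can regenerate the positions of the surviving weights on the fly instead of storing an index list. This is a minimal illustration, not the exact LGPS hardware: the 16-bit Fibonacci LFSR taps, the seed, and the modulo mapping onto weight positions are all illustrative assumptions.

    def lfsr_indices(seed, num_indices, table_size, taps=(16, 14, 13, 11)):
        # Positions of non-zero weights are regenerated from the LFSR state,
        # so no index memory is needed. Assumes num_indices <= table_size.
        state = seed & 0xFFFF
        indices = []
        while len(indices) < num_indices:
            fb = 0
            for t in taps:                        # XOR of the tap bits
                fb ^= (state >> (t - 1)) & 1
            state = ((state << 1) | fb) & 0xFFFF  # shift in the feedback bit
            idx = state % table_size
            if idx not in indices:                # keep positions unique
                indices.append(idx)
        return indices

    # e.g. positions of 8 surviving weights inside a 64-entry weight block
    print(lfsr_indices(seed=0xACE1, num_indices=8, table_size=64))

Because the same seed reproduces the same index stream, the accelerator only stores the seed and the non-zero values; the sparsity pattern itself costs no memory.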
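The quantity BitS-Net exploits can be measured as the fraction of zero bits in the quantized operands: in an RRAM CIM macro, each '1' bit draws bit-line current, so energy tracks the number of ones. Below is a minimal sketch of this metric, assuming unsigned 8-bit quantization (the bit width and encoding are assumptions, not the exact BitS-Net format).

    import numpy as np

    def bit_sparsity(q_weights, n_bits=8):
        # Fraction of zero bits across unsigned n_bits-wide codes. Higher
        # bit sparsity translates directly into lower CIM MAC energy.
        w = np.asarray(q_weights, dtype=np.uint32)
        ones = sum(int(((w >> b) & 1).sum()) for b in range(n_bits))
        return 1.0 - ones / (w.size * n_bits)

    w = np.random.randint(0, 256, size=1024)        # assumed 8-bit weights
    print(f"bit sparsity: {bit_sparsity(w):.3f}")   # ~0.5 for uniform codes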
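For the quantization study, the distillation objective we refer to is the standard soft-target loss; the sketch below pairs it with a hypothetical bit-width schedule to show what gradual quantization of a pruned network looks like in training code. The temperature, loss weighting, and schedule are illustrative assumptions, not the proposal's tuned settings.

    import torch
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
        # Hinton-style distillation: soft teacher targets at temperature T
        # plus the usual hard-label cross-entropy on the student.
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Hypothetical gradual-quantization schedule: distill at each step so
    # accuracy is recovered before the bit width is tightened further.
    bit_schedule = [8, 6, 4]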
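Finally, a sketch of the twofold objective: the task loss plus one regularizer per sparsity type. The specific regularizers here are stand-ins chosen for illustration (L1 for network-level sparsity, distance to low-Hamming-weight codes for bit-level sparsity); the proposal's exact formulations are not reproduced.

    import torch

    def twofold_sparsity_loss(task_loss, weights, lam_net=1e-4, lam_bit=1e-4):
        # Stand-in regularizers: L1 drives whole weights to zero
        # (network-level sparsity); distance to the nearest code with at
        # most one '1' bit encourages bit-level sparsity. Assumes weights
        # are already scaled onto an 8-bit integer grid.
        levels = torch.tensor([0.0] + [float(1 << b) for b in range(8)])
        l_net = sum(w.abs().sum() for w in weights)
        l_bit = sum(torch.min((w.reshape(-1, 1) - levels).abs(),
                              dim=1).values.sum()
                    for w in weights)
        return task_loss + lam_net * l_net + lam_bit * l_bit

Minimizing both terms at once is what lets a single training run produce a model that is sparse at the network level and cheap at the bit level in a CIM macro.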