Friday, November 8, 2019

Exploiting Dynamic Sparsity in Neural Network Accelerators

Yuan Zhou will present on Wednesday, 11/13/19 at 1:00PM ET.

Convolutional neural networks (CNNs) have demonstrated human-level performance in many vision tasks, including image classification and segmentation, and underpin real-time applications such as autonomous driving and robotic manipulation. While modern CNNs continue to achieve higher accuracy, they also have larger model sizes and require more computation. As a result, it is challenging to deploy these compute-intensive CNN models to a wider range of applications, especially on embedded and mobile platforms that are area- and power-constrained.

In this talk, I will introduce our work on reducing the computational cost of CNN inference by exploiting dynamic sparsity at run time. I will first present channel gating, a fine-grained dynamic pruning technique for CNN inference. Channel gating identifies the regions in each CNN layer's feature map that contribute less to the classification result, and turns off a subset of channels when computing the activations in these less important regions. Because channel gating preserves memory locality along the channel dimension, a CNN with channel gating can be mapped efficiently onto a slightly modified weight-stationary CNN accelerator. Running a ResNet-18 model with channel gating that achieves a 2.8x theoretical FLOP reduction, our accelerator delivers a 2.3x speedup over the baseline on ImageNet. We further demonstrate that channel gating is well suited for near-memory CNN acceleration by simulating a compute-constrained, high-bandwidth platform. With sufficient memory bandwidth, the actual speedup of the channel gating accelerator scales almost linearly with the theoretical FLOP reduction.
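
To make the mechanism concrete, here is a minimal PyTorch-style sketch of the channel gating idea (illustrative only, not the exact implementation or accelerator mapping). A small "base" subset of input channels is always computed, and the resulting partial sum decides, per output activation, whether the remaining channels are computed. The module name and the parameters base_frac and tau are assumptions made for this example.

import torch
import torch.nn as nn

class ChannelGatingConv(nn.Module):
    def __init__(self, in_ch, out_ch, base_frac=0.25, tau=0.0):
        super().__init__()
        self.split = max(1, int(in_ch * base_frac))    # input channels that are always computed
        self.conv_base = nn.Conv2d(self.split, out_ch, 3, padding=1, bias=False)
        self.conv_rest = nn.Conv2d(in_ch - self.split, out_ch, 3, padding=1, bias=False)
        self.tau = tau                                  # gating threshold

    def forward(self, x):
        x_base, x_rest = x[:, :self.split], x[:, self.split:]
        partial = self.conv_base(x_base)                # partial sum from the base channels
        gate = (partial.detach() > self.tau).float()    # 1 where the activation looks important
        # In hardware, gated-off positions skip the remaining channels entirely;
        # masking the result here only emulates that pruning decision.
        return partial + gate * self.conv_rest(x_rest)

In practice the threshold is learned and a smooth approximation of the gate is used during training; the sketch only illustrates the inference-time data flow that a weight-stationary accelerator can exploit.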

I will then briefly talk about our ongoing work, precision gating. Compared with channel gating, precision gating reduces computation by exploiting dynamic sparsity along a different dimension. Rather than pruning away entire channels at unimportant pixel locations, precision gating uses low arithmetic precision at those locations while keeping the original precision at important ones. We believe precision gating is also well suited for near-memory CNN acceleration, since it effectively reduces the computational cost of CNN inference.
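
As an analogous illustration, the sketch below (again illustrative, not the exact implementation) computes every output at low precision first and spends full precision only where the low-precision result suggests the location is important. The stand-in quantizer and the parameters n_bits and delta are assumptions made for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize(x, n_bits=4):
    # Simple uniform symmetric quantizer, a stand-in for the low-precision datapath.
    scale = x.abs().max() / (2 ** (n_bits - 1) - 1) + 1e-8
    return torch.round(x / scale) * scale

class PrecisionGatingConv(nn.Module):
    def __init__(self, in_ch, out_ch, n_bits=4, delta=0.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.1)
        self.n_bits, self.delta = n_bits, delta

    def forward(self, x):
        # Cheap low-precision pass over all output positions.
        out_lo = F.conv2d(quantize(x, self.n_bits), quantize(self.weight, self.n_bits), padding=1)
        gate = (out_lo > self.delta).float()            # 1 at important output positions
        # Full precision is only needed where the gate fires; masking here
        # emulates skipping the extra high-precision work elsewhere.
        out_hi = F.conv2d(x, self.weight, padding=1)
        return gate * out_hi + (1 - gate) * out_lo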