Monday, November 18, 2019

Intermediate Languages for Automated Spatial Computing

Yi-Hsiang Lai presenting on Wednesday, 11/20/19.

In the pursuit of higher compute performance under strict power constraints, there is an increasing need to deploy applications to heterogeneous hardware architectures with accelerators, such as processing-in-memory (PIM) devices and FPGAs. Although these heterogeneous computing platforms are becoming widely available, they are very difficult to program. As a result, their use has been limited to a small subset of programmers with specialized hardware knowledge.

To tackle this challenge, we introduce HeteroCL, a programming infrastructure composed of a Python-based domain-specific language (DSL) and a compilation flow. The HeteroCL DSL provides a clean programming abstraction that decouples the algorithm specification from three important types of hardware customization: compute, data types, and memory architectures. HeteroCL further captures the interdependence among these customization techniques, allowing programmers to explore various performance/area/accuracy trade-offs in a systematic and productive manner. In addition, our framework produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencils with dataflow architectures. HeteroCL also incorporates the T2S framework developed by Intel Labs. T2S is an intermediate programming model, extended from Halide, for high-performance systolic architectures. Similar to HeteroCL, T2S cleanly decouples the temporal definition from the spatial mapping, which enables productive programming and efficient design-space exploration.
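
As a concrete illustration of this decoupling, here is a minimal sketch modeled loosely on the examples in the HeteroCL paper; the kernel, tensor names, and exact primitive signatures are illustrative and may differ from the released API.

    import heterocl as hcl

    hcl.init(hcl.Float())  # default data type for the algorithm

    # Algorithm specification only: a small 1D blur, no hardware detail.
    def blur(A):
        return hcl.compute((8, 8), lambda y, x:
                           A[y, x] + A[y, x+1] + A[y, x+2], "B")

    A = hcl.placeholder((8, 10), "A")

    # Compute customization: tile and pipeline loops without touching blur().
    s = hcl.create_schedule([A], blur)
    yo, yi = s[blur.B].split(blur.B.axis[0], factor=4)
    s[blur.B].pipeline(yi)

    # Memory customization: partition the input buffer for parallel access.
    s.partition(A, dim=2)

    # Data type customization: quantize B to fixed point in a separate scheme.
    sm = hcl.create_scheme([A], blur)
    sm.quantize([blur.B], hcl.Fixed(12, 10))
    s2 = hcl.create_schedule_from_scheme(sm)

    f = hcl.build(s)  # lower to the selected backend (e.g., HLS C for FPGAs)

Because the customizations live in the schedule and scheme objects rather than in blur() itself, the algorithm code stays intact while different trade-off points are explored.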

Experimental results show that HeteroCL allows programmers to efficiently explore the design space in both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, all while keeping the algorithm code intact.

Friday, November 8, 2019

Exploiting Dynamic Sparsity in Neural Network Accelerators

Yuan Zhou presenting on Wednesday, 11/13/19 at 1:00PM ET.

Convolutional neural networks (CNNs) have demonstrated human-level performance on many vision-related tasks, including image classification and segmentation, as well as real-time applications such as autonomous driving and robotic manipulation. While modern CNNs continue to achieve higher accuracy, they also have larger model sizes and require more computation. As a result, it is challenging to deploy these compute-intensive CNN models to a wider range of applications, especially on embedded and mobile platforms that are area- and power-constrained.

In this talk, I will introduce our work on reducing the computational costs of CNNs by exploiting dynamic sparsity at run time. I will first present channel gating, a fine-grained dynamic pruning technique for CNN inference. Channel gating identifies the regions in the feature map of each CNN layer that contribute less to the classification result, and turns off a subset of channels for computing the activations in these less important regions. Since channel gating preserves the memory locality in the channel dimension, a CNN with channel gating can be effectively mapped to a slightly modified weight-stationary CNN accelerator. Running a channel gating ResNet-18 model with 2.8x theoretical FLOP reduction, our accelerator achieves 2.3x speedup over the baseline on ImageNet. We further demonstrate that channel gating is suitable for near-memory CNN acceleration by simulating a compute-constrained, high-bandwidth platform. With sufficient memory bandwidth, the actual speedup of the channel gating accelerator scales almost linearly with the theoretical FLOP reduction. 
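
To make the mechanism concrete, below is a minimal PyTorch-style sketch of a channel-gated convolution. The class name, base-channel fraction, and scalar threshold are assumptions; the actual channel gating formulation (partial-sum normalization, per-channel thresholds, and a smooth gate approximation for training) differs in detail.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ChannelGatedConv(nn.Module):
        """Illustrative channel-gated 3x3 convolution (a sketch, not the paper's code)."""

        def __init__(self, in_ch, out_ch, base_frac=0.25):
            super().__init__()
            self.base_ch = max(1, int(in_ch * base_frac))
            # "Base" channels are always computed; the rest are conditional.
            self.conv_base = nn.Conv2d(self.base_ch, out_ch, 3, padding=1)
            self.conv_rest = nn.Conv2d(in_ch - self.base_ch, out_ch, 3, padding=1)
            self.threshold = nn.Parameter(torch.zeros(1))  # learned gate threshold

        def forward(self, x):
            x_base, x_rest = x[:, :self.base_ch], x[:, self.base_ch:]
            y_partial = self.conv_base(x_base)
            # Gate decision per output pixel: 1 = "important", compute fully.
            # (Training would need a smooth approximation; the hard step shown
            # here is what an accelerator evaluates at inference time.)
            gate = (y_partial > self.threshold).float()
            # In software we mask the conditional branch; on hardware the
            # multiply-accumulates for gated-off pixels are skipped entirely.
            return F.relu(y_partial + self.conv_rest(x_rest) * gate)

The masking here only emulates the savings; the modified weight-stationary accelerator described above realizes them by actually skipping the gated-off multiply-accumulates.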

I will then briefly talk about our ongoing work, precision gating. Compared with channel gating, precision gating reduces computation by exploiting dynamic sparsity in another direction. Rather than pruning away entire channels at unimportant pixel locations, precision gating uses low arithmetic precision at these locations while keeping the original precision at important ones. We believe precision gating is also well suited to near-memory CNN acceleration, since it can effectively reduce the computational cost of CNN inference.
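
As a rough sketch of the idea (not the paper's exact formulation), the NumPy function below computes a matrix product using only the high-order bits of the activations first, then adds the low-order contribution only at outputs whose coarse result exceeds a threshold. The function name, bit widths, and threshold handling are all illustrative.

    import numpy as np

    def precision_gated_matmul(x, w, delta, total_bits=8, low_bits=4):
        """Illustrative precision-gated matrix product (names/widths assumed).

        x: non-negative integer activations; w: integer weights.
        """
        shift = total_bits - low_bits
        x_msb = (x >> shift) << shift      # keep only the high-order bits
        x_lsb = x - x_msb                  # low-order remainder
        y_coarse = x_msb @ w               # cheap low-precision first pass
        gate = y_coarse > delta            # per-output importance decision
        # Add the low-order contribution only where the gate fires; elsewhere
        # the coarse result is kept and those multiply-accumulates are saved.
        return y_coarse + (x_lsb @ w) * gate

Only the gated-on outputs pay for the extra low-order work, which is where the computational savings come from; most outputs settle for the cheap low-precision pass.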