Authors: Hengyu Zhao, Jiawen Liu, Matheus Almeida Ogleari,
Dong Li, Jishen Zhao
Abstract: Neural networks (NNs) have been adopted in a wide range of application domains, such as image classification, speech recognition, object detection, and computer vision. However, training NNs – especially deep neural networks (DNNs) – can be energy- and time-consuming, because of frequent data movement between processor and memory. Furthermore, training involves massive fine-grained operations with various computation and memory access characteristics. Exploiting high parallelism with such diverse operations is challenging. To address these challenges, we propose a software/hardware co-design of a heterogeneous processing-in-memory (PIM) system. Our hardware design incorporates hundreds of fixed-function arithmetic units and ARM-based programmable cores on the logic layer of a 3D die-stacked memory to form a heterogeneous PIM architecture attached to the CPU. Our software design offers a programming model and a runtime system that program, offload, and schedule various NN training operations across the compute resources provided by the CPU and the heterogeneous PIM. By extending the OpenCL programming model and employing a hardware heterogeneity-aware runtime system, we enable high program portability and easy program maintenance across various heterogeneous hardware, optimize system energy efficiency, and improve hardware utilization.
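To give a rough sense of the kind of offload the extended OpenCL model targets, the sketch below expresses one NN training operation (a plain SGD weight update) with only standard OpenCL 1.2 host calls. The kernel name sgd_update, the update rule, and the use of the default device are illustrative assumptions, not the paper's actual API; in the proposed system the heterogeneity-aware runtime, rather than the programmer, would decide whether such a kernel runs on the CPU, a PIM fixed-function unit, or a PIM ARM core.

    /* Hypothetical sketch: offloading one training operation (SGD weight
     * update) via standard OpenCL 1.2 host code. Names are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    static const char *kSrc =
        "__kernel void sgd_update(__global float *w,              \n"
        "                         __global const float *grad,     \n"
        "                         const float lr, const int n) {  \n"
        "    int i = get_global_id(0);                            \n"
        "    if (i < n) w[i] -= lr * grad[i]; /* w := w - lr*dw */ \n"
        "}                                                        \n";

    int main(void) {
        enum { N = 1 << 20 };
        float *w = malloc(N * sizeof(float));
        float *g = malloc(N * sizeof(float));
        for (int i = 0; i < N; ++i) { w[i] = 1.0f; g[i] = 0.5f; }

        cl_platform_id plat; cl_device_id dev; cl_int err;
        clGetPlatformIDs(1, &plat, NULL);
        /* Here the default device is taken; the proposed runtime would
         * instead schedule across CPU and heterogeneous PIM resources. */
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "sgd_update", &err);

        cl_mem dw = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   N * sizeof(float), w, &err);
        cl_mem dg = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   N * sizeof(float), g, &err);

        float lr = 0.01f; int n = N;
        clSetKernelArg(k, 0, sizeof(cl_mem), &dw);
        clSetKernelArg(k, 1, sizeof(cl_mem), &dg);
        clSetKernelArg(k, 2, sizeof(float), &lr);
        clSetKernelArg(k, 3, sizeof(int), &n);

        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dw, CL_TRUE, 0, N * sizeof(float), w, 0, NULL, NULL);

        printf("w[0] after one update: %f\n", w[0]);  /* expect 0.995 */

        clReleaseMemObject(dw); clReleaseMemObject(dg);
        clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        free(w); free(g);
        return 0;
    }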