Authors: Hengyu Zhao, Jiawen Liu, Matheus Almeida Ogleari,
Dong Li, Jishen Zhao
Abstract: Neural networks (NNs) have been adopted in a wide range of application domains, such as image classification, speech recognition, object detection, and computer vision. However, training NNs, especially deep neural networks (DNNs), can be energy- and time-consuming because of frequent data movement between processor and memory. Furthermore, training involves massive fine-grained operations with various computation and memory access characteristics. Exploiting high parallelism with such diverse operations is challenging. To address these challenges, we propose a software/hardware co-design of a heterogeneous processing-in-memory (PIM) system. Our hardware design incorporates hundreds of fixed-function arithmetic units and ARM-based programmable cores on the logic layer of a 3D die-stacked memory to form a heterogeneous PIM architecture attached to the CPU. Our software design offers a programming model and a runtime system that program, offload, and schedule various NN training operations across the compute resources provided by the CPU and the heterogeneous PIM. By extending the OpenCL programming model and employing a hardware-heterogeneity-aware runtime system, we enable high program portability and easy program maintenance across various heterogeneous hardware, optimize system energy efficiency, and improve hardware utilization.
