(Presenting 12/5)
Increasingly,
accelerators such as GPUs, FPGAs, and TPUs are being co-located with the main
CPU in server systems to handle variable needs across and within workloads.
While much prior work has optimized data movement among homogeneous
components, the inefficiencies of similar data transfers across
heterogeneous components have received far less attention. In this talk, we will
present the costs of the excessive data transfers that occur in many current
CPU-GPU systems, and show how poorly they scale as we evolve toward multi-GPU
systems.
Because these transfers happen on demand and at coarse (page) granularities,
they are inefficient: non-useful data is moved along with the data that is
actually needed.
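
To make the inefficiency concrete, below is a minimal sketch, assuming a CUDA
unified-memory system that migrates 4 KiB pages on demand and a hypothetical
64-byte Record type and sum_hot kernel (none of which are taken from the talk):
a kernel that reads one 4-byte field per record still drags the other 60 bytes
of every record across the interconnect, because whole pages migrate.

    // Sketch only: assumes on-demand 4 KiB page migration (Pascal-class
    // GPUs and later); Record and sum_hot are illustrative, not the
    // talk's actual workload.
    #include <cstdio>
    #include <cuda_runtime.h>

    struct Record {            // 64-byte record: only 'hot' is read on the GPU
        float hot;             // 4 useful bytes
        float cold[15];        // 60 bytes that migrate anyway
    };

    __global__ void sum_hot(const Record* r, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(out, r[i].hot);  // one float used per record
    }

    int main() {
        const int n = 1 << 20;
        Record* r;  float* out;
        cudaMallocManaged(&r, n * sizeof(Record));
        cudaMallocManaged(&out, sizeof(float));
        // First CPU touch places the pages in host memory.
        for (int i = 0; i < n; ++i) r[i].hot = 1.0f;
        *out = 0.0f;

        // Each GPU fault migrates a whole 4 KiB page (64 records), so all
        // 64 MiB cross the bus even though only 4 MiB of 'hot' fields are
        // ever read: roughly 16x more traffic than useful data.
        sum_hot<<<(n + 255) / 256, 256>>>(r, out, n);
        cudaDeviceSynchronize();
        printf("sum = %f\n", *out);
        cudaFree(r); cudaFree(out);
        return 0;
    }
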
We propose (i) compiler-based approaches that re-layout the data
before the movement and (ii) novel address translation mechanisms that handle the
consequences of the new data layout. We will present experimental results
showing the benefits of such an approach.
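
As one illustration of the relayout idea, here is a hand-written sketch,
assuming the same hypothetical Record type, of the kind of packing a compiler
pass might generate: the single accessed field is gathered into a dense buffer
before the transfer, so every byte moved is useful (4 MiB instead of 64 MiB in
this example). The talk's actual compiler transformation, and the address
translation support that lets unmodified indexing work over the new layout,
are not modeled here.

    // Sketch only: hand-written relayout standing in for the proposed
    // compiler pass; the address translation hardware is not modeled.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    struct Record { float hot; float cold[15]; };

    __global__ void sum_hot_packed(const float* hot, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(out, hot[i]);  // dense, fully useful accesses
    }

    int main() {
        const int n = 1 << 20;
        std::vector<Record> records(n);
        for (int i = 0; i < n; ++i) records[i].hot = 1.0f;

        // Relayout on the CPU: gather only the accessed field into a
        // dense staging buffer before moving anything.
        std::vector<float> packed(n);
        for (int i = 0; i < n; ++i) packed[i] = records[i].hot;

        float *d_hot, *d_out;
        cudaMalloc(&d_hot, n * sizeof(float));
        cudaMalloc(&d_out, sizeof(float));
        cudaMemset(d_out, 0, sizeof(float));
        cudaMemcpy(d_hot, packed.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);

        sum_hot_packed<<<(n + 255) / 256, 256>>>(d_hot, d_out, n);

        float result = 0.0f;
        cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %f\n", result);
        cudaFree(d_hot); cudaFree(d_out);
        return 0;
    }
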