Wednesday, December 5, 2018

Integrated Data Transfer and Address Translation for CPU-GPU Environments

(Presenting 12/5)
Increasingly, accelerators such as GPUs, FPGAs, and TPUs are being co-located with the main CPU in server systems to handle the variable needs across and within workloads. While much prior work has optimized data movement among homogeneous components, the inefficiencies of similar data transfers across heterogeneous components have not been explored as thoroughly. In this talk, we will present the costs of the excessive data transfers that occur in many current CPU-GPU systems, and show how poorly they scale as we evolve toward multi-GPU systems. Because data transfers happen on demand and at coarse (page) granularity, they are inefficient, especially since non-useful data gets moved as well. We propose (i) compiler-based approaches that re-lay out the data before the movement and (ii) novel address translation mechanisms that handle the consequences of the new data layout. We will present experimental results showing the benefits of such an approach.
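To build intuition for why page-granularity, on-demand migration moves non-useful data, and why re-laying out data before movement helps, here is a hypothetical back-of-envelope sketch (not the authors' system; the 4 KiB page size, 64-byte record size, and array-of-structs vs. struct-of-arrays layouts are illustrative assumptions). It counts how many pages must migrate when a GPU kernel reads only one small field of each record:

```python
# Hypothetical sketch: page-granularity migration vs. a compiler data relayout.
# All sizes below are illustrative assumptions, not measurements.
PAGE_SIZE = 4096          # bytes per migrated page (typical 4 KiB)
RECORD_SIZE = 64          # array-of-structs: one 64-byte record per element
HOT_FIELD_SIZE = 8        # the kernel reads only one 8-byte field per record
N = 1_000_000             # number of records

def pages_touched(bytes_per_elem: int, stride: int, n: int) -> int:
    """Count distinct pages migrated when n elements are read at a fixed stride."""
    touched = set()
    for i in range(n):
        start = i * stride
        touched.add(start // PAGE_SIZE)
        touched.add((start + bytes_per_elem - 1) // PAGE_SIZE)
    return len(touched)

# Array-of-structs: the hot field is scattered, so nearly every page holding
# a record must migrate even though only 1/8 of its bytes are useful.
aos_pages = pages_touched(HOT_FIELD_SIZE, RECORD_SIZE, N)

# After a struct-of-arrays relayout, the hot fields are packed contiguously,
# so only the pages holding useful bytes migrate.
soa_pages = pages_touched(HOT_FIELD_SIZE, HOT_FIELD_SIZE, N)

print(aos_pages, soa_pages)  # the relayout migrates roughly 8x fewer pages
```

The relayout is what creates the need for the second proposed mechanism: once the compiler packs hot fields into a new contiguous region, the original virtual addresses no longer map linearly onto the moved data, so the address translation path must account for the transformed layout.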