(Zixuan Wang, UCSD, presenting on Wed. 3/16/22 at 1:00 & 7:00 PM ET at our CRISP task-level meeting.)
Modern deep learning (DL) training
is memory-intensive, constrained by the memory capacity of each computation
component and cross-device communication bandwidth. In response to such
constraints, current approaches include increasing parallelism in distributed
training and optimizing inter-device communication. However, model parameter
communication is becoming a key performance
bottleneck in distributed DL training. To
improve parameter communication performance, we propose COARSE, a disaggregated
memory extension for distributed DL training. COARSE is built on modern
cache-coherent interconnect (CCI) protocols and MPI-like collective
communication for synchronization, allowing
low-latency and parallel access to training data
and model parameters shared among worker
GPUs. To enable high-bandwidth transfers between GPUs and the disaggregated
memory system, we propose a decentralized parameter communication scheme to
decouple and localize parameter synchronization traffic. Furthermore, we
propose dynamic tensor routing and
partitioning to fully utilize the non-uniform
serial bus bandwidth that varies across different cloud computing systems. Finally,
we design a deadlock avoidance mechanism and a dual synchronization scheme to ensure
high-performance parameter synchronization. Our evaluation shows that COARSE achieves
up to 48.3% faster DL training compared to the state-of-the-art MPI AllReduce
communication.
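
The abstract does not spell out COARSE's actual partitioning policy, so the following is only a rough, hypothetical illustration of what bandwidth-aware tensor partitioning can look like: a flattened parameter tensor is split into chunks sized proportionally to each link's bandwidth, so transfers over faster and slower links finish at roughly the same time. The function name, signature, and bandwidth figures below are assumptions for illustration, not COARSE's API.

    # Hypothetical sketch, not COARSE's implementation: split a parameter
    # tensor across links in proportion to each link's measured bandwidth.
    import numpy as np

    def partition_by_bandwidth(tensor: np.ndarray, link_bandwidths_gbps: list[float]):
        """Return one chunk per link, sized proportionally to its bandwidth,
        so that all chunk transfers complete in roughly equal time."""
        flat = tensor.ravel()
        total_bw = sum(link_bandwidths_gbps)
        # Elements assigned to each link, proportional to its share of bandwidth.
        counts = [int(len(flat) * bw / total_bw) for bw in link_bandwidths_gbps]
        counts[-1] += len(flat) - sum(counts)  # absorb rounding remainder
        offsets = np.cumsum([0] + counts[:-1])
        return [flat[o:o + c] for o, c in zip(offsets, counts)]

    # Example: a 4M-parameter tensor shared over three links with unequal
    # bandwidth (e.g. NVLink vs. PCIe paths in a heterogeneous cloud node).
    chunks = partition_by_bandwidth(np.zeros(4_000_000, dtype=np.float32),
                                    [50.0, 25.0, 12.5])
    print([len(c) for c in chunks])  # chunk sizes follow the 4:2:1 bandwidth ratio
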