Monday, November 8, 2021

GraphLily: Near-Memory Acceleration for Graph Linear Algebra

(Yixiao Du, Cornell, presenting on Wednesday, November 10, 2021 at 1:00 p.m. & 7:00 p.m. ET. )

Graph processing is typically memory-bound due to low compute to memory access ratio and irregular data access patterns. The emerging high-bandwidth memory (HBM) delivers exceptional bandwidth and is adopted on FPGAs with near-memory integration and customizable datapaths, which brings the potential to significantly boost the performance of graph processing and relieve the programming burden.
We present GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra operations. Operators of different GraphBLAS semirings share the same datapath, which is configurable at run time. GraphLily further builds a middleware to provide runtime support. With this middleware, we can port existing GraphBLAS programs to FPGAs with slight modifications to the original code intended for CPU/GPU execution. GraphLily provides efficient, memory-optimized accelerators for the two widely-used kernels in graph processing, namely, sparse-matrix dense-vector multiplication (SpMV) and sparse-matrix sparse-vector multiplication (SpMSpV). Leveraging the near-memory connectivity, the SpMV accelerator uses a sparse matrix storage format tailored to HBM that enables streaming, vectorized accesses to each channel and concurrent accesses to multiple channels. Besides, the SpMV accelerator exploits data reuse in accesses of the dense vector by introducing a scalable on-chip buffer design. The SpMSpV accelerator complements the SpMV accelerator to handle cases where the input vector has a high sparsity. The evaluation shows that compared with state-of-the-art graph processing frameworks on CPUs and GPUs, GraphLily achieves up to 2.5x and 1.1x higher throughput, while reducing the energy consumption by 8.1x and 2.4x; compared with prior single-purpose graph accelerators on FPGAs, GraphLily achieves 1.2x–1.9x higher throughput. GraphLily is a prototype of near-memory graph processing. We are working to further improve the performance.