Monday, November 8, 2021
GraphLily: Near-Memory Acceleration for Graph Linear Algebra
We present GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra operations. Operators of different GraphBLAS semirings share the same datapath, which is configurable at run time. GraphLily further builds a middleware to provide runtime support. With this middleware, existing GraphBLAS programs can be ported to FPGAs with only slight modifications to the original code intended for CPU/GPU execution. GraphLily provides efficient, memory-optimized accelerators for the two widely used kernels in graph processing, namely, sparse-matrix dense-vector multiplication (SpMV) and sparse-matrix sparse-vector multiplication (SpMSpV). Leveraging the near-memory connectivity, the SpMV accelerator uses a sparse matrix storage format tailored to HBM that enables streaming, vectorized accesses to each channel and concurrent accesses to multiple channels. In addition, the SpMV accelerator exploits data reuse in accesses to the dense vector through a scalable on-chip buffer design. The SpMSpV accelerator complements the SpMV accelerator to handle cases where the input vector is highly sparse. Our evaluation shows that, compared with state-of-the-art graph processing frameworks on CPUs and GPUs, GraphLily achieves up to 2.5x and 1.1x higher throughput while reducing energy consumption by 8.1x and 2.4x; compared with prior single-purpose graph accelerators on FPGAs, GraphLily achieves 1.2x–1.9x higher throughput. GraphLily is a prototype of near-memory graph processing, and we are working to further improve its performance.
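For readers less familiar with the GraphBLAS formulation, the sketch below shows in plain Python how one sparse matrix-vector kernel parameterized by a semiring can serve different graph algorithms, mirroring GraphLily's runtime-configurable datapath. The function and data layout are illustrative, not GraphLily's API.

```python
# A minimal sketch (illustrative names, not GraphLily's API) of the
# GraphBLAS idea: graph traversal as sparse matrix-vector multiplication
# over a configurable semiring, so one datapath serves many algorithms.

def semiring_spmv(matrix, vector, add, mul, identity):
    """y = A (.) x over the semiring (add, mul, identity).

    matrix: dict row -> list of (col, value) pairs (sparse rows)
    vector: dict index -> value (sparse input vector)
    """
    result = {}
    for row, entries in matrix.items():
        acc = identity
        for col, val in entries:
            if col in vector:
                acc = add(acc, mul(val, vector[col]))
        if acc != identity:
            result[row] = acc
    return result

# Adjacency lists of a 4-node graph: row -> [(neighbor, edge weight)].
A = {0: [(1, 1)], 1: [(2, 1), (3, 1)], 2: [(3, 1)], 3: []}

# Same kernel, two semirings:
frontier = {1: True}
# Boolean (OR, AND) semiring: one pull-style BFS frontier expansion.
print(semiring_spmv(A, frontier, lambda a, b: a or b,
                    lambda a, b: a and b, False))   # {0: True}
# Tropical (min, +) semiring: one shortest-path relaxation step.
dist = {1: 0}
print(semiring_spmv(A, dist, min, lambda a, b: a + b, float("inf")))  # {0: 1}
```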
Monday, October 25, 2021
DRAM-CAM: Extending Sieve for General-Purpose Exact Pattern Matching
Exact pattern matching is a widely used kernel in many application domains. A prominent example is k-mer matching in bioinformatics, where k-mers extracted from input DNA sequences are exactly matched against a library of reference DNA patterns for genome classification. Previously, we proposed three DRAM-based in-situ accelerator designs, dubbed Sieve, to alleviate the bottleneck stage of k-mer matching in genomics.
We observe that k-mer matching shares many similarities with other workloads dominated by exact pattern matching (e.g., text processing and data mining), so we extend Sieve with several cost-effective modifications to accommodate applications beyond bioinformatics. This enhanced architecture is named Sieve++.
We show that Sieve++ as an exact pattern matching accelerator offers significant advantages over traditional architectures. Sieve++ outperforms a CPU baseline by two orders of magnitude for five out of six selected benchmarks that represent a broad set of real-world exact pattern matching use cases.
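To make the kernel concrete, here is a minimal software sketch of exact k-mer matching, the operation Sieve performs inside DRAM arrays. The hash-set library and function names are illustrative stand-ins for a CPU baseline, not the in-situ design.

```python
# Software sketch of the k-mer matching kernel Sieve/Sieve++ accelerate:
# every length-k substring of an input sequence is exactly matched against
# a library of reference patterns. The in-situ design instead stores the
# patterns in DRAM subarrays and matches by activating rows in parallel.

def build_library(reference_kmers):
    # On a CPU this is typically a hash set.
    return set(reference_kmers)

def match_kmers(sequence, library, k):
    """Return (position, k-mer) pairs of sequence that hit the library."""
    hits = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in library:
            hits.append((i, kmer))
    return hits

library = build_library(["ACGT", "GGCA", "TTAC"])
print(match_kmers("TTACGTGGCA", library, k=4))
# [(0, 'TTAC'), (2, 'ACGT'), (6, 'GGCA')]
```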
Monday, October 18, 2021
Ayudante: A Deep Reinforcement Learning Approach to Assist Persistent Memory Programming
(Hanxian Huang, UCSD, presenting on Wednesday, October 20, 2021 at 1:00pm and 7:00pm ET)
Programming persistent memory (PM) imposes non-trivial labor on developers, who must write code that adopts new PM-aware libraries and APIs. In addition, non-expert PM code can be error-prone. To ease the burden on PM programmers, we propose Ayudante, a deep reinforcement learning (RL)-based PM programming assistant framework consisting of two key components: a deep RL-based PM code generator and a code refining pipeline. Given a piece of C, C++, or Java source code developed for conventional volatile memory systems, our code generator automatically generates the corresponding PM code and checks its data persistence. The code refining pipeline parses the generated code to provide a report for further program testing and performance optimization. Our evaluation on an Intel server equipped with Optane DC PM demonstrates that both microbenchmark programs and a key-value store application generated by Ayudante pass PMDK checkers. Performance evaluation on the microbenchmarks shows that the generated code achieves speedup and memory access performance comparable to PMDK code examples.
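The overall shape of the framework can be sketched schematically. The toy code below is not Ayudante's implementation: the checker, policy, and reward are hypothetical placeholders that only illustrate how a generator proposing PM-aware edits can be driven by a persistence checker's feedback.

```python
# Schematic sketch (all names hypothetical) of the two-component structure
# described above: an RL-based generator proposes PM-aware edits to
# volatile-memory code, and a persistence checker scores the result.

def persistence_checker(code):
    # Stand-in for a real checker such as the PMDK tools the paper uses;
    # here we simply flag stores that are never followed by a flush.
    return ["unflushed store"
            for line in code if "store" in line and "flush" not in line]

def rl_generate(volatile_code, policy, steps=10):
    """Greedy placeholder for the deep-RL loop: apply the policy's edit,
    keep it if the checker reports fewer issues (i.e., higher reward)."""
    code = list(volatile_code)
    for _ in range(steps):
        candidate = policy(code)
        if len(persistence_checker(candidate)) < len(persistence_checker(code)):
            code = candidate
    return code

def toy_policy(code):
    # One hypothetical action: add a flush after each unflushed store.
    return [line + "; flush" if "store" in line and "flush" not in line
            else line for line in code]

pm_code = rl_generate(["load x", "store x"], toy_policy)
print(pm_code)                       # ['load x', 'store x; flush']
print(persistence_checker(pm_code))  # []
```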
Friday, October 8, 2021
Reticle: A Virtual Machine for Programming Modern FPGAs
(Luis Vega presenting on Wednesday, October 13, 2021 at 1:00 p.m. and 7:00 p.m. ET)
Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past: they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable primitives. We describe Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. We show how to use a standard instruction selection approach to lower intermediate programs to assembly programs, which can be both faster and more effective than the complex metaheuristics that existing FPGA toolchains use. We use Reticle to implement linear algebra operators and coroutines and find that Reticle compilation runs up to 100x faster than current approaches while producing comparable or better run-time and utilization.
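The instruction-selection step can be illustrated with a toy pattern matcher that maps intermediate operations onto a device's special-purpose units. The IR tuples and unit names below are made up for illustration; they are not Reticle's actual languages.

```python
# Toy sketch of instruction selection for lowering a portable IR to
# target-specific assembly: pattern-match IR operations onto device
# primitives, e.g., fuse a multiply feeding an add into one DSP
# multiply-accumulate. (Real selectors also check that the fused value
# has no other uses; omitted here for brevity.)

def select_instructions(ir):
    """ir: list of (dest, op, operands) tuples in dataflow order."""
    asm, i = [], 0
    while i < len(ir):
        dest, op, args = ir[i]
        nxt = ir[i + 1] if i + 1 < len(ir) else None
        # Pattern: t = a * b immediately consumed by d = t + c => DSP mac.
        if op == "mul" and nxt and nxt[1] == "add" and dest in nxt[2]:
            other = [a for a in nxt[2] if a != dest]
            asm.append((nxt[0], "dsp.mac", args + other))
            i += 2
        else:
            # Fall back to general LUT logic for unmatched operations.
            asm.append((dest, "lut." + op, args))
            i += 1
    return asm

ir = [("t0", "mul", ["a", "b"]),
      ("t1", "add", ["t0", "c"]),
      ("t2", "xor", ["t1", "d"])]
for instr in select_instructions(ir):
    print(instr)
# ('t1', 'dsp.mac', ['a', 'b', 'c'])
# ('t2', 'lut.xor', ['t1', 'd'])
```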
Monday, June 28, 2021
Dual-stream Multiple Instance Learning Network for Whole Slide Image Classification with Self-supervised Contrastive Learning
(Bin Li presenting on June 30 at 1:00 PM and 7:00 PM Eastern Time)
Whole slide imaging is one of the essential tools in modern histopathology and presents a great opportunity for machine learning applications. Whole slide images (WSIs) have very high resolutions and usually lack localized patch-level annotations. Consequently, the classification of WSIs is often cast as a multiple instance learning (MIL) problem, where a patch-based classifier is trained using slide-level labels. We propose a MIL-based method for WSI classification and tumor detection that does not require localized annotations. First, we introduce a novel MIL aggregator that models the relations of the instances in a dual-stream architecture with trainable distance measurement. Second, since WSIs can produce large or unbalanced bags that hinder the training of MIL models, we propose to use self-supervised contrastive learning to extract robust representations for MIL without the need for patch-level labels, alleviating the prohibitive memory cost of training deep models on large bags. Our model offers significant improvement in classification accuracy for both unbalanced and balanced WSI datasets and under both binary and multiple class MIL settings. Additional benchmark results on standard MIL datasets further demonstrate the superior performance of our MIL aggregator on general MIL problems. Collaborating with hardware and computer vision groups, we demonstrate the use of WSI classification as a driving use case for developing deep learning accelerators and efficient training paradigms for related problems such as video and language analysis.
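The dual-stream aggregation can be sketched in a few lines of numpy. Dimensions and the random weights below are illustrative; in the real model, the projections are trained end to end on top of contrastively learned patch embeddings.

```python
import numpy as np

# Simplified sketch of a dual-stream MIL aggregator: stream 1 scores each
# instance and picks the critical one; stream 2 weights every instance by
# a distance to that critical instance and aggregates the bag.

rng = np.random.default_rng(0)
bag = rng.normal(size=(50, 128))   # 50 instance embeddings (WSI patches)
w_inst = rng.normal(size=128)      # stream 1: instance-classifier weights
w_q = rng.normal(size=(128, 64))   # stream 2: query projection
w_v = rng.normal(size=(128, 64))   # stream 2: value projection

# Stream 1: max-pooled instance score identifies the critical instance.
scores = bag @ w_inst
critical = bag[np.argmax(scores)]

# Stream 2: attention from the scaled dot-product distance between each
# instance's query and the critical instance's query.
q, q_crit = bag @ w_q, critical @ w_q
logits = (q @ q_crit) / np.sqrt(q.shape[1])
att = np.exp(logits - logits.max())   # numerically stable softmax
att /= att.sum()

# Bag embedding: attention-weighted sum of instance values. A linear head
# on top would produce the slide-level prediction.
bag_embedding = att @ (bag @ w_v)
print(scores.max(), bag_embedding.shape)   # critical score, (64,)
```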
Tuesday, June 1, 2021
Efficient and Secure Learning across Memory Hierarchy
(Saransh Gupta, UC San Diego, presenting on Wednesday, June 2, 2021 at 1:00 & 7:00 PM ET)
Tuesday, May 25, 2021
Corundum: Statically-Enforced Persistent Memory Safety
Monday, May 17, 2021
FANS: FPGA-Accelerated Near-Storage Sorting
Large-scale sorting is an important yet demanding task for data center applications. Beyond powerful processing capability, a high-performance sorting system requires efficient utilization of the available bandwidth at every level of the memory hierarchy. As data sizes explode, the frequent data transfers between the host and the storage device increasingly become a performance bottleneck. Fortunately, the emergence of near-storage computing devices gives us the opportunity to accelerate large-scale sorting by avoiding this back-and-forth data transfer. Near-storage sorting is promising for extra performance improvement and power reduction. However, how to achieve optimal sorting performance on existing near-storage computing devices remains an open question.
In this work, we first perform an in-depth analysis of sorting performance on the newly released Samsung SmartSSD platform. Contrary to previous belief, our analysis shows that end-to-end sorting performance is bound not only by flash bandwidth but also by main memory bandwidth, the configuration of the sorting kernel, and the intermediate sorting state. Based on our modeling, we propose FANS, an FPGA-accelerated near-storage sorting system that selects an optimized design configuration and achieves the theoretical maximum end-to-end performance when using a single Samsung SmartSSD device. The experiments demonstrate more than 3x performance speedup over state-of-the-art FPGA-accelerated flash storage.
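A back-of-the-envelope model conveys the flavor of this bottleneck analysis. The bandwidths and formulae below are hypothetical illustrations, not FANS's actual performance model.

```python
# Toy bottleneck model: end-to-end near-storage sort time depends not only
# on flash bandwidth but also on DRAM traffic, which grows with the number
# of merge passes implied by the kernel configuration.

def sort_time_estimate(data_gb, flash_gbps, dram_gbps, merge_ways, chunk_gb):
    """Rough seconds to sort data_gb with a merge_ways-way merge kernel."""
    passes, sorted_span = 0, chunk_gb
    while sorted_span < data_gb:        # merge passes until one sorted run
        sorted_span *= merge_ways
        passes += 1
    flash_time = 2 * data_gb / flash_gbps          # read input + write output
    dram_time = 2 * data_gb * passes / dram_gbps   # each pass reads + writes DRAM
    return max(flash_time, dram_time)              # bound by the slower resource

# A wider merge kernel means fewer passes, shifting the bottleneck from
# DRAM back to flash -- the kind of trade-off a design-space model exposes.
for ways in (2, 4, 16):
    t = sort_time_estimate(data_gb=32, flash_gbps=3, dram_gbps=8,
                           merge_ways=ways, chunk_gb=1)
    print(f"{ways:>2}-way merge: ~{t:.1f} s")
```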
Monday, May 10, 2021
Compiler-Driven Simulation of Reconfigurable Hardware Accelerators
(Zhijing Li of Cornell presenting on Wed. 5/12/21 at 1PM & 7PM ET)
This work proposes a compiler-driven simulation workflow that can model the performance of arbitrary hardware accelerators while introducing little programming overhead. The key idea is to separate structure representation from simulation by developing an intermediate language that can flexibly represent a wide variety of hardware constructs. We design the Event Queue (EQueue) dialect of MLIR, a dialect that can model arbitrary hardware accelerators with explicit data movement and distributed event-based control. We also implement a simulator to model EQueue-structured programs. We show the effectiveness of our simulator by comparing it against a state-of-the-art application-specialized simulator, which it matches while offering higher extensibility. We further demonstrate that our simulator can save designers' time through a reusable lowering pipeline and visualizable outputs.
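The event-based execution model can be illustrated with a tiny discrete-event simulator. The operation names and latencies below are made up; the sketch only shows the general pattern of timestamp-ordered events with explicit dependencies, not the EQueue dialect itself.

```python
import heapq

# Tiny discrete-event simulator in the spirit of event-based control:
# operations post completion events with explicit timestamps, and a
# central queue processes them in time order.

LATENCY = {"dma.load": 10, "pe.compute": 4, "dma.store": 10}

def simulate(program):
    """program: list of (op, deps) where deps index earlier operations."""
    queue, done = [], {}
    for idx, (op, deps) in enumerate(program):
        if not deps:                       # ready at time 0
            heapq.heappush(queue, (LATENCY[op], idx))
    while queue:
        time, idx = heapq.heappop(queue)   # next event in timestamp order
        done[idx] = time
        print(f"t={time:>3}  {program[idx][0]} #{idx} completes")
        for jdx, (op, deps) in enumerate(program):
            if (jdx not in done and all(d in done for d in deps)
                    and all(jdx != q_idx for _, q_idx in queue)):
                start = max(done[d] for d in deps)
                heapq.heappush(queue, (start + LATENCY[op], jdx))
    return max(done.values())

# Load two tiles in parallel, compute on both, store the result.
prog = [("dma.load", []), ("dma.load", []),
        ("pe.compute", [0, 1]), ("dma.store", [2])]
print("total cycles:", simulate(prog))   # 24
```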
Monday, March 22, 2021
Defense against the Dark Arts: New Attacks and Defenses for Shared Last-Level Caches
Caches in modern processors improve performance by enabling fast access to data. However, the sharing of the processor caches (particularly the last-level cache) between different processor cores, VMs, and applications results in vulnerability to side-channel and covert-channel attacks via caches. Cache side-channel attacks allow malware to infer the data-access patterns of sensitive applications based on their cache accesses, leaking encryption keys, confidential IP such as DNN models, and more. Similarly, cache covert channels allow malicious processes to use cache-induced timing variation as a means of covert communication (e.g., to exfiltrate sensitive information from a sandbox).
In this talk, I will first present our Streamline attack (to appear in ASPLOS'21), the current fastest cache covert-channel attack and the first to achieve >1 MB/s bandwidth (3-6x higher than prior attacks) by leveraging asynchronous communication protocols. With this attack, we demonstrate that cache covert channels can have higher information capacity than previously understood, while having fewer ISA- or microarchitecture-specific limitations. Next, I will present MIRAGE, a defense against cache side-channel attacks (to appear in USENIX Security'21) that provides a practical way to enable a fully associative last-level cache, which is immune to set-conflict-based cache attacks. This defense promises to meaningfully end the arms race between attackers and defenders in the area of conflict-based cache attacks, where successive defenses have been broken by newer attacks in the past few years.
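For intuition, the basic decode step of any cache covert channel can be shown with synthetic numbers. The sketch below performs no real cache accesses; the latencies and threshold are invented purely to illustrate how a receiver turns access timing into bits.

```python
# Toy illustration (synthetic latencies, no real cache access) of covert-
# channel decoding: the sender evicts or leaves a shared cache line, and
# the receiver maps its own access latency to a bit. Streamline's
# asynchronous design streams over a large shared buffer instead of
# handshaking with the receiver on every bit.

HIT_NS, MISS_NS, THRESHOLD_NS = 40, 300, 150

def sender_encode(bits):
    # Synthetic stand-in for "evict the shared line" (1) vs "leave it" (0).
    return [MISS_NS if b else HIT_NS for b in bits]

def receiver_decode(latencies):
    # A slow (miss-like) access means the sender evicted the line: bit 1.
    return [1 if ns > THRESHOLD_NS else 0 for ns in latencies]

message = [1, 0, 1, 1, 0]
assert receiver_decode(sender_encode(message)) == message
print("decoded:", receiver_decode(sender_encode(message)))
```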