Tuesday, May 26, 2020

Sieve: Scalable In-situ DRAM-based Accelerator Designs for Massively Parallel k-mer Matching

(Lingxi Wu presenting on May 27, 2020 at 11:00 a.m. and 7:00 p.m.)

The rapid influx of biosequence data, coupled with the stagnation of the processing power of conventional computing systems, highlights the critical need to explore high-performance accelerator designs that can meet the ever-increasing throughput demands of modern bioinformatics pipelines. This work argues that processing in memory (PIM) is a viable and effective solution for alleviating the bottleneck of k-mer matching, a widely used genome sequence comparison and classification algorithm characterized by highly random access patterns and low computational intensity.
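
To make the kernel concrete, here is a minimal software sketch of k-mer matching (our own illustration, not Sieve's in-DRAM design): each length-k window of a read is probed against a reference index, and each probe is an essentially random memory access, which is why the workload is memory-bound.

```cpp
// Minimal software sketch of k-mer matching (illustrative only): slide a
// window of length k over a read and probe each k-mer against a pre-built
// reference index.
#include <iostream>
#include <string>
#include <unordered_set>

int main() {
    const int k = 5;  // k-mer length (real pipelines often use k = 21..31)
    // Toy reference index; in practice this holds millions of k-mers,
    // hence the highly random, memory-bound lookups.
    std::unordered_set<std::string> reference = {"ACGTA", "CGTAC", "TTGCA"};

    std::string read = "GACGTACT";
    int hits = 0;
    for (size_t i = 0; i + k <= read.size(); ++i) {
        if (reference.count(read.substr(i, k)))  // one random memory probe per k-mer
            ++hits;
    }
    std::cout << "matched k-mers: " << hits << "\n";  // prints 2 (ACGTA, CGTAC)
}
```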

This work proposes and evaluates three DRAM-based in-situ k-mer matching accelerator designs, dubbed Sieve (one optimized for area, one optimized for throughput, and one that balances hardware cost and performance). Sieve leverages a novel data mapping scheme that allows simultaneous comparison of millions of DNA base pairs, lightweight matching circuitry for fast pattern matching, and an early termination mechanism that prunes unnecessary DRAM row activations to reduce latency and save energy. Evaluation of Sieve on state-of-the-art workloads with real-world datasets shows that the most aggressive design provides an average 408x/41x speedup and 93x/61x energy savings over multi-core-CPU/GPU baselines for k-mer matching. Sieve's performance scales linearly with the size of the reference sequence data, substantially boosting the efficiency of modern genome sequencing pipelines.
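
The early-termination idea can be sketched in software under some simplifying assumptions (one base position compared per DRAM row activation, a match-so-far flag per stored k-mer); this toy model is not Sieve's actual circuitry, but it shows why pruning row activations saves work:

```cpp
// Simplified software model of the early-termination idea (an
// assumption-laden sketch, not the actual DRAM circuitry): stored k-mers
// are compared against the query one base position at a time, mimicking
// one row activation per position, and we stop activating rows as soon
// as every candidate has already mismatched.
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string query = "ACGTA";
    std::vector<std::string> stored = {"ACGTT", "TGGCA", "ACGTA", "CCCCC"};
    std::vector<bool> alive(stored.size(), true);  // match-so-far flags

    size_t rows_activated = 0;
    for (size_t pos = 0; pos < query.size(); ++pos) {  // one "row" per base
        ++rows_activated;
        size_t survivors = 0;
        for (size_t j = 0; j < stored.size(); ++j) {
            if (alive[j] && stored[j][pos] != query[pos]) alive[j] = false;
            if (alive[j]) ++survivors;
        }
        if (survivors == 0) break;  // early termination: skip remaining rows
    }
    for (size_t j = 0; j < stored.size(); ++j)
        if (alive[j]) std::cout << "match at entry " << j << "\n";
    std::cout << "rows activated: " << rows_activated
              << " of " << query.size() << "\n";
}
```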

Friday, May 15, 2020

aCortex: A Multi-Purpose Mixed-Signal Neural Inference Accelerator Based on Non-Volatile Memory Devices

(Mohammad Bavandpour, UCSB, presenting on May 20, 2020 at 11:00AM & 7:00PM ET)

We introduce “aCortex”, an extremely energy-efficient, fast, compact, and versatile neuromorphic processor architecture suitable for accelerating a wide range of neural network inference models. The most important feature of our processor is a configurable mixed-signal computing array of vector-by-matrix multiplier (VMM) blocks that utilize embedded nonvolatile memory (NVM) arrays for storing weight matrices. In this architecture, the power-hungry analog peripheral circuitry for data integration and conversion is shared among a very large array of VMM blocks, enabling efficient, instant analog-domain VMM operation for different neural layer types with a wide range of layer specifications. This approach also maximizes the processor’s area efficiency by sharing the area-hungry high-voltage programming and switching circuitry, as well as the analog peripherals, among a large 2D array of NVM blocks. Such a compact implementation further boosts energy efficiency by lowering the cost of digital data transfers. Other unique features of aCortex include a configurable chain of buffers and data buses, a simple and efficient Instruction Set Architecture (ISA) with its corresponding multi-agent controller, and a customized refresh-free embedded DRAM memory. In this work, we focus specifically on 55-nm 2D-NOR and 3D-NAND flash memory technologies and present detailed system-level area/energy/speed estimates for several common benchmarks, namely Inception-v1 and ResNet-152, two state-of-the-art deep feedforward networks for image classification, and GNMT, Google’s deep recurrent network for language translation.
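
The core analog operation is an ordinary vector-by-matrix multiply, with the weight matrix stored as NVM conductances. The hedged digital model below only illustrates the math; the array itself computes it instantly in the analog domain via Ohm's and Kirchhoff's laws:

```cpp
// Sketch of the operation an NVM-based VMM block performs: output
// currents I_j = sum_i V_i * G[i][j], where the stored conductances G
// encode a weight matrix. This digital model is illustrative only.
#include <iostream>
#include <vector>

std::vector<double> vmm(const std::vector<double>& v,                 // input voltages
                        const std::vector<std::vector<double>>& G) {  // conductances (weights)
    std::vector<double> out(G[0].size(), 0.0);
    for (size_t i = 0; i < v.size(); ++i)        // each input drives one row
        for (size_t j = 0; j < out.size(); ++j)  // currents sum on each column
            out[j] += v[i] * G[i][j];
    return out;
}

int main() {
    std::vector<std::vector<double>> G = {{0.1, 0.2}, {0.3, 0.4}};  // 2x2 weight matrix
    std::vector<double> y = vmm({1.0, 0.5}, G);
    std::cout << y[0] << " " << y[1] << "\n";  // 0.25 0.4
}
```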

Monday, May 11, 2020

HeteroRefactor: Refactoring for Heterogeneous Computing with FPGA

(Jason Lau, UCLA, presenting on Wednesday, May 13, 2020)

Heterogeneous computing with field-programmable gate arrays (FPGAs) has demonstrated orders-of-magnitude improvements in computing efficiency for many applications. However, the use of such platforms has so far been limited to a small subset of programmers with specialized hardware knowledge. High-level synthesis (HLS) tools have made significant progress in raising the level of programming abstraction from hardware description languages to C/C++, but they usually cannot compile and generate accelerators for kernel programs that use pointers, dynamic memory management, or recursion; such programs require manual refactoring to become HLS-compatible. In addition, experts must supply heavily handcrafted optimizations to improve resource efficiency, which in turn affects the maximum operating frequency, parallelization, and power efficiency.
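
As a hypothetical illustration of the kind of manual refactoring involved (our own toy example, not one from the paper), an expert might rewrite a recursive kernel with an explicit, statically sized stack before handing it to an HLS tool:

```cpp
// Toy example of manual HLS refactoring: typical HLS tools reject the
// recursive form, so an expert rewrites it with an explicit fixed-depth
// stack before synthesis.
#include <iostream>

// Recursive form: usually not synthesizable by HLS tools.
int sum_tree_recursive(const int* tree, int node, int n) {
    if (node >= n) return 0;
    return tree[node] + sum_tree_recursive(tree, 2 * node + 1, n)
                      + sum_tree_recursive(tree, 2 * node + 2, n);
}

// HLS-friendly form: recursion replaced by an explicit bounded stack.
int sum_tree_iterative(const int* tree, int n) {
    int stack[64];  // fixed bound chosen by the programmer
    int top = 0, sum = 0;
    stack[top++] = 0;
    while (top > 0) {
        int node = stack[--top];
        if (node >= n) continue;
        sum += tree[node];
        stack[top++] = 2 * node + 1;
        stack[top++] = 2 * node + 2;
    }
    return sum;
}

int main() {
    int tree[] = {1, 2, 3, 4, 5};
    std::cout << sum_tree_recursive(tree, 0, 5) << " "
              << sum_tree_iterative(tree, 5) << "\n";  // 15 15
}
```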

We propose HeteroRefactor, a new dynamic invariant analysis and automated refactoring technique. First, HeteroRefactor monitors FPGA-specific dynamic invariants: the required bitwidth of integer and floating-point variables, and the size of recursive data structures and stacks. Second, using this knowledge of dynamic invariants, it refactors the kernel to make traditionally HLS-incompatible programs synthesizable and to further optimize the accelerator’s resource usage and frequency. Third, to guarantee correctness, it selectively offloads computation from the CPU to the FPGA only if an input falls within the observed dynamic invariants. On average, for a recursive program of 175 LOC, an expert FPGA programmer would need to write 185 additional LOC to implement an HLS-compatible version, whereas HeteroRefactor automates that transformation. Our results on Xilinx FPGAs show that HeteroRefactor reduces BRAM usage by 83% and increases frequency by 42% for recursive programs, reduces BRAM by 41% through integer bitwidth reduction, and reduces DSP usage by 50% through floating-point precision tuning.
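
A rough sketch of the bitwidth-invariant idea, under our own simplifying assumptions (illustrative only, not HeteroRefactor's actual instrumentation): profile the values a variable takes, size the datapath to the widest observed width, and guard the FPGA path at run time:

```cpp
// Sketch of the bitwidth dynamic-invariant idea: record the widest
// bitwidth a variable is observed to need during profiling, then at run
// time offload to the FPGA kernel only when the input still fits.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

int bits_needed(uint64_t v) {  // minimum unsigned bitwidth for v
    int b = 1;
    while (v >>= 1) ++b;
    return b;
}

int main() {
    std::vector<uint64_t> profile_inputs = {3, 17, 200, 4095};
    int invariant_bits = 1;
    for (uint64_t v : profile_inputs)  // profiling phase: observe the invariant
        invariant_bits = std::max(invariant_bits, bits_needed(v));
    std::cout << "synthesize a " << invariant_bits << "-bit datapath\n";  // 12-bit

    uint64_t new_input = 70000;        // run time: guard the FPGA path
    if (bits_needed(new_input) <= invariant_bits)
        std::cout << "offload to FPGA kernel\n";
    else
        std::cout << "fall back to CPU (input outside invariant)\n";
}
```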