Monday, December 7, 2020

Multiple Instance Learning Network for Whole Slide Image Classification with Hardware Acceleration

(Bin Li, U. Wisconsin-Madison, presenting on 12/9/20 at 11:00 a.m. and 7:00 p.m. ET) 

We propose a novel multiple instance learning (MIL) model for disease detection in whole slide images (WSIs) as a driving application for developing hardware acceleration architectures. Our model applies a self-attention mechanism to latent instance representations to model the dependencies between instances, and it demonstrates state-of-the-art performance on both classic MIL benchmark datasets and real-world clinical WSI datasets. Collaborating with hardware research groups, we propose a processing-in-memory (PIM) design for our application, as well as for general long-sequence attention-based models, which outperforms other PIM-based and GPU-based baselines by a large margin. Meanwhile, we continue to work with PIM, GPU, and FPGA groups on software-hardware co-design and acceleration performance benchmarking.
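
As a rough illustration of the modeling idea, the sketch below applies self-attention across instance (patch) embeddings from one slide and pools them into a bag-level representation. It is a minimal NumPy sketch; the projection matrices, dimensions, and mean pooling are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_mil_pool(instances, Wq, Wk):
    # instances: (n, d) latent representations of the patches (instances) of one slide
    Q, K = instances @ Wq, instances @ Wk
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # pairwise instance dependencies
    attn = softmax(scores, axis=-1)              # self-attention weights
    contextual = attn @ instances                # attended instance features
    return contextual.mean(axis=0)               # bag-level embedding for the slide label

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 64))                # 50 patch embeddings of dimension 64
Wq, Wk = rng.normal(size=(64, 32)), rng.normal(size=(64, 32))
bag_embedding = attention_mil_pool(feats, Wq, Wk)   # would feed a slide-level classifier
```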

Monday, November 30, 2020

ReTail: Request-Level Latency Prediction in Multicore-Enabled Cloud Servers

(Shuang Chen of Cornell presenting December 2, 2020 at 11:00 AM & 8:00 PM ET)

Latency-critical (LC) cloud services, such as web search, have strict quality-of-service (QoS) constraints on tail latency. Improving energy efficiency usually takes second place to meeting these latency constraints. Per-core dynamic voltage and frequency scaling (DVFS) can offer significant efficiency benefits; however, it is challenging to determine which requests can afford to run at lower frequencies without hurting the end-to-end tail latency of the entire service.


We introduce ReTail, a framework for QoS-aware power management of LC services using request-level latency prediction. ReTail is composed of (1) a general and systematic process to collect and select the application features that best correlate with the processing latency of its requests, (2) a simple yet accurate request-level latency predictor based on linear regression, and (3) a runtime power management system that meets the QoS constraints of LC applications while maximizing the server's energy savings. Experimental results show that, compared to the best state-of-the-art per-core power manager, ReTail achieves an average of 11% (up to 48%) energy savings while still meeting QoS.
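
A rough sketch of the idea, not ReTail's internals: fit a linear model from per-request features to processing latency, then pick the lowest core frequency whose predicted latency still meets the per-request target. The feature set, the 1/frequency scaling model, and the frequency list below are illustrative assumptions.

```python
import numpy as np

def fit_latency_model(features, latencies_us):
    X = np.hstack([features, np.ones((len(features), 1))])   # add a bias column
    coef, *_ = np.linalg.lstsq(X, latencies_us, rcond=None)  # least-squares linear regression
    return coef

def pick_frequency(coef, feat, target_us, freqs_ghz, f_max=3.0):
    predicted = np.append(feat, 1.0) @ coef                  # predicted latency at f_max
    for f in sorted(freqs_ghz):                              # try the lowest frequency first
        if predicted * (f_max / f) <= target_us:             # assume latency scales ~ 1/f
            return f
    return f_max                                             # fall back to the highest frequency
```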

Monday, October 12, 2020

A case for in-network persistence

(Korakit Seemakhupt, UVA, presenting Wed. October 14 at 11:00 a.m. and 7:00 p.m. ET) 

To guarantee data persistence, storage workloads (such as databases, key-value stores, and file systems) typically use a synchronous protocol that puts network and server stack latency on the critical path of request processing. The use of fast, byte-addressable persistent memory (PM) has helped mitigate the storage overhead of the server stack; yet networking remains a dominant factor in the end-to-end latency of request processing. Emerging programmable network devices can reduce network latency by moving parts of an application's computation into the network (e.g., caching results for read requests); however, for update requests, the client still has to wait for the server to persistently commit the updates.

In this work, we introduce in-network data persistence, which extends the data-persistence domain from servers into the network, and present PMNet, a network device (e.g., a switch or NIC) with PM for persisting data in-network. PMNet logs incoming update requests and acknowledges clients directly, without making them wait for the server to commit the request. In case of a failure, the logged requests act as redo logs for the server to recover. We implement PMNet on an FPGA and evaluate its performance against PM-optimized key-value stores and a database. Our evaluation shows that PMNet can improve the throughput of update requests by 4.27x on average, and the 99th-percentile tail latency by 3.23x.
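
The sketch below is a toy functional model of the protocol described above: the in-network device persists a redo record and acknowledges the client immediately, the server drains the log asynchronously, and recovery replays any records that never reached the server. Class and method names are illustrative, not PMNet's actual interface.

```python
from collections import OrderedDict

class InNetworkPersistenceSketch:
    def __init__(self):
        self.pm_log = OrderedDict()   # stand-in for PM on the switch/NIC
        self.server_store = {}        # backing store on the server

    def handle_update(self, req_id, key, value):
        self.pm_log[req_id] = (key, value)    # persist a redo record in-network
        return "ACK"                          # client proceeds without waiting on the server

    def server_commit(self, req_id):
        key, value = self.pm_log.pop(req_id)  # server drains the log in the background
        self.server_store[key] = value

    def recover(self):
        # After a server crash, replay the remaining redo records to restore lost updates.
        for key, value in self.pm_log.values():
            self.server_store[key] = value
```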

Tuesday, August 11, 2020

Acceleration of Bioinformatics Workloads

 (Cameron Martino, UCSD, presenting on 8/12/20)

Humans are host to a unique set of trillions of microbes that encode 99% of the genetic function found in the body. The sequencing of microbial genetic material has revolutionized our ability to profile the microbial communities living in and on us all. These microbial profiles have been recognized as effective biomarkers in fields ranging from cancer to forensics. Despite these advances, the ability to process microbial profiles at the scale and speed required by many applications has lagged behind sequencing technology, largely because of the time and compute power needed to process these large datasets. Here, we describe a 10-fold acceleration of processing pipelines that also improves processing accuracy. We then describe a GPU implementation of UniFrac, a widely used metric for comparing microbial community profiles, which reduces run time from hours to minutes. Finally, we discuss the immediate application of these improvements to the current COVID-19 pandemic, highlighting the importance of acceleration in bioinformatics workloads.
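
For reference, unweighted UniFrac between two samples reduces to the fraction of phylogenetic branch length unique to one of them. The NumPy sketch below encodes that per-pair definition (the GPU work discussed here batches it over all sample pairs); the array layout, with one entry per tree branch, is an illustrative simplification.

```python
import numpy as np

def unweighted_unifrac(present_a, present_b, branch_lengths):
    a = np.asarray(present_a, dtype=bool)     # True if sample A has any taxon under this branch
    b = np.asarray(present_b, dtype=bool)     # same for sample B
    length = np.asarray(branch_lengths, dtype=float)
    shared = length[a & b].sum()              # branch length observed in both samples
    union = length[a | b].sum()               # branch length observed in either sample
    return 0.0 if union == 0 else 1.0 - shared / union

print(unweighted_unifrac([1, 1, 0, 1], [1, 0, 1, 1], [0.5, 0.2, 0.3, 1.0]))  # 0.25
```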

Monday, July 27, 2020

Look-Up Table based Energy Efficient Processing in SRAM for Neural Network Acceleration

(Akshay Krishna Ramanathan presenting on Wed. 7/29/2020 at 11:00 a.m. and 7:00 p.m. ET)

This work presents a Look-Up Table (LUT) based Processing-In-Memory (PIM) technique with the potential to run neural network inference tasks. We implement a bitline-computing-free technique that avoids frequent bitline accesses to the cache sub-arrays and thereby considerably reduces memory access energy overhead. The LUT, in conjunction with the compute engines, enables sub-array-level parallelism while executing, through data lookup, complex operations that would otherwise require multiple cycles. Sub-array-level parallelism and a systolic input data flow ensure that data movement is confined to the SRAM slice.

Our proposed LUT-based PIM methodology exploits substantial parallelism using look-up tables without altering the memory structure or organization; that is, it preserves the bit-cells and peripherals of the existing monolithic SRAM arrays. Our solution achieves 1.72x higher performance and 3.14x lower energy than a state-of-the-art processing-in-cache solution. The sub-array-level design modifications needed to incorporate the LUT and compute engines increase the overall cache area by 5.6%. We achieve a 3.97x speedup over a neural network systolic accelerator of similar area. The reconfigurable nature of the compute engines enables various neural network operations, thereby supporting sequential networks (RNNs) and transformer models. Our quantitative analysis demonstrates 101x and 3x faster execution, and 91x and 11x higher energy efficiency, than a CPU and a GPU, respectively, while running the transformer model BERT-Base.
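
A toy sketch of the LUT idea: precompute a small product table once, then replace the multiplications in a dot product with table lookups plus accumulation, the way a LUT held in an SRAM sub-array could. The 4-bit operand width and the software table are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# 4-bit x 4-bit product table, computed once and then only looked up.
LUT = np.array([[a * b for b in range(16)] for a in range(16)], dtype=np.int32)

def lut_dot(activations, weights):
    x = np.asarray(activations, dtype=np.int64) & 0xF   # 4-bit activations
    w = np.asarray(weights, dtype=np.int64) & 0xF       # 4-bit weights
    return int(LUT[x, w].sum())                         # lookups + accumulate, no multiplier

print(lut_dot([3, 7, 12], [5, 2, 9]))                   # 3*5 + 7*2 + 12*9 = 137
```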

Monday, July 20, 2020

Video Analytic Platform and Deep Graph Matching

(Feng Shi and Ziheng Xu, UCLA, presenting on July 22, 2020 at 11:00 a.m. and 7:00 p.m. ET)

As a fundamental problem in pattern recognition, graph matching has applications in a variety of fields, especially computer vision. Applications of graph matching in video analytics include (but are not limited to) scene understanding, object keypoint matching, and verifying scene generation. The graph similarity (or matching) problem attempts to find node correspondences between graphs. Obtaining an exact solution is NP-hard, and traditional heuristic algorithms incur long latencies; recently, deep graph architectures have been proposed to find approximate solutions, balancing speed and accuracy. Previous work in this direction mainly focused on using localized node embeddings to obtain an approximate node alignment. However, local features alone cannot reflect the whole structure; the overall graph topology plays a vital role in determining the edge alignment between graphs. Diffusion wavelets, which describe the probability distributions of graph signals over a graph, are powerful tools for exploring graph topology. In this work, we present WaveSim, a lightweight deep graph matching framework that incorporates graph diffusion wavelets to calculate diffusion distances. We also prove mathematically that a Quadratic Assignment Programming (QAP) problem of a high-order combinatorial nature can be transformed into a lower-dimensional problem. Experiments show that WaveSim achieves remarkable and robust performance and can be extended to matching problems on large graphs.
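
To show why diffusion captures global topology, here is a generic NumPy sketch that builds a diffusion distribution per node from the graph Laplacian heat kernel and compares nodes by those distributions. It is a plain heat kernel, not WaveSim's diffusion-wavelet construction, and the toy graph is illustrative.

```python
import numpy as np

def heat_kernel(adj, t=1.0):
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj               # combinatorial graph Laplacian
    evals, evecs = np.linalg.eigh(laplacian)
    return evecs @ np.diag(np.exp(-t * evals)) @ evecs.T     # row i: diffusion from node i

def diffusion_distance(adj, i, j, t=1.0):
    H = heat_kernel(adj, t)
    return np.linalg.norm(H[i] - H[j])                       # how differently heat spreads from i and j

path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]                     # path graph 0-1-2
print(diffusion_distance(path, 0, 2), diffusion_distance(path, 0, 1))
```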

Tuesday, July 14, 2020

GCN meets GPU: Decoupling “When to Sample” from “How to Sample”

(Morteza Ramezani, Penn State, presenting on Wed. 7/15/20)

Graphs are powerful and versatile data structures for modeling many real-world problems, and learning on graph-based data is attracting growing interest in both academia and industry. Recently, many models have been introduced for learning on graphs. Among these, Graph Convolutional Networks (GCNs) and their variants have received significant attention and have become the de facto methods for learning on graphs. However, node dependencies in graphs lead to a "neighborhood explosion" that makes training GCNs on large graphs with GPUs impractical. To address this problem, sampling-based methods have been introduced and coupled with stochastic gradient descent for training GCNs.

While effective in alleviating the neighborhood explosion, these methods incur computational overheads from preprocessing and loading new samples in heterogeneous (CPU-GPU) systems, which, due to bandwidth and memory bottlenecks, significantly degrade training performance. By decoupling the frequency of sampling from the sampling strategy, we propose LazyGCN, a general yet effective framework that can be integrated with any sampling strategy to substantially improve training time. The idea behind LazyGCN is to sample periodically and effectively recycle the sampled nodes to mitigate data preparation overhead. We theoretically analyze the proposed algorithm and provide corroborating empirical evidence on large real-world graphs, demonstrating that the proposed scheme can significantly reduce the number of sampling steps and yield superior speedups without compromising accuracy.
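
A minimal sketch of the periodic-sampling idea: draw a fresh set of sampled subgraphs only every few epochs ("when to sample") and recycle them in between, regardless of which sampler produced them ("how to sample"). The `sampler` and `train_step` callables below are placeholders, not LazyGCN's actual interfaces.

```python
import random

def lazy_train(graph, sampler, train_step, epochs, recycle_period=5, n_batches=32):
    cached_batches = []
    for epoch in range(epochs):
        if epoch % recycle_period == 0:                                  # periodic sampling
            cached_batches = [sampler(graph) for _ in range(n_batches)]  # fresh subgraphs
        random.shuffle(cached_batches)               # recycle cached subgraphs this epoch
        for subgraph in cached_batches:
            train_step(subgraph)                     # gradient step on the sampled subgraph

# e.g., lazy_train(my_graph, my_sampler, my_train_step, epochs=30)
```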

Thursday, July 2, 2020

Mirage: A Highly Parallel and Flexible RRAM Accelerator


(Xiao Liu, UCSD, presenting on July 8, 2020 at 11:00 AM & 7:00 PM ET)


Emerging resistive memory (RRAM) based crossbars are a promising technology for accelerating neural network applications. Such a structure can support massively parallel multiply-accumulate operations, which are used intensively in convolutional neural networks (CNNs), and has been demonstrated to offer higher performance and power efficiency than CMOS-based accelerators. However, previously proposed RRAM-based neural network designs lack several desirable features of neural network accelerators. First, the pipeline of existing architectures is inefficient, as data dependencies between different layers of the network can significantly stall execution. Second, existing RRAM-based accelerators suffer from limited flexibility: oversized networks cannot be executed on the accelerator, while undersized networks cannot utilize all of the RRAM crossbar arrays.

To address these issues, we propose Mirage, a novel architectural design that enables high parallelism and flexibility for RRAM-based CNN accelerators. Mirage consists of a Fine-grained Parallel RRAM Architecture (FPRA) and Auto Assignment (AA). Motivated by the thread-block design in GPUs, FPRA addresses the data dependency issue in the pipeline: when inter-layer parallelism is involved, FPRA unifies the data dependencies of each layer and handles them with shared input and output memory. AA provides the ability to execute a network of any size on the accelerator. When the network is oversized, AA uses dynamic reconfiguration to fold the network onto the available hardware; when the network is undersized, AA uses FPRA to exploit the extra hardware for higher performance.

We evaluate Mirage on seven popular image recognition neural network models of various sizes. We find that Mirage achieves a 2.0x average speedup compared to the state-of-the-art RRAM-based accelerator. Additionally, Mirage can adapt networks to RRAM-based accelerators of various sizes, and we show that it delivers better performance scalability than prior works.
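As background for the mapping problem, the sketch below tiles a layer's weight matrix onto fixed-size crossbars: each tile produces a partial matrix-vector product and the partial sums are accumulated, so an oversized layer simply needs more tiles folded over the available arrays. The 128x128 crossbar size is an illustrative assumption, not Mirage's configuration.

```python
import numpy as np

XBAR = 128  # assumed crossbar rows/columns

def crossbar_mvm(weights, x):
    rows, cols = weights.shape
    y = np.zeros(cols)
    for r in range(0, rows, XBAR):
        for c in range(0, cols, XBAR):
            tile = weights[r:r + XBAR, c:c + XBAR]     # one crossbar's worth of weights
            y[c:c + XBAR] += x[r:r + XBAR] @ tile      # analog MAC modeled digitally
    return y

rng = np.random.default_rng(0)
W, v = rng.normal(size=(300, 200)), rng.normal(size=300)
assert np.allclose(crossbar_mvm(W, v), v @ W)          # tiled result matches the dense MVM
```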

Monday, June 22, 2020

Efficient and Reliable Digital Processing in-Memory Acceleration

(Presented by Minxuan Zhou, UCSD, on June 24th.)

Digital processing in-memory (PIM) is a promising technology that minimizes data movement overhead while enabling an extremely high degree of parallelism. These characteristics provide a great opportunity for accelerating emerging big-data applications, but exploiting them requires considering both software and hardware to obtain an efficient and reliable solution. In this talk, we first illustrate our software-hardware co-design methodology for digital PIM using graph processing workloads as an example. We then introduce several new PIM-specific methods that improve system reliability by efficiently managing thermal issues. Our results show that it is possible to achieve orders-of-magnitude speedups over state-of-the-art graph processing algorithms while ensuring that thermal and reliability constraints are met.

Tuesday, June 9, 2020

AI-PiM: PiM Accelerators for AI Applications on RISC-V

(Vaibhav Varma, UVA, presenting on Wed. 6/17/2020)

Artificial intelligence (AI) and machine learning (ML) have emerged as the fastest-growing workloads in recent years, with applications ranging from object detection and face recognition to self-driving smart cars. This rise in AI applications, combined with IoT infrastructure, is leading to a new paradigm of the Artificial Intelligence of Things (AIoT), in which IoT edge devices are augmented with AI/ML capabilities to enable smart sensing applications. This has fueled increased interest in integrating AI accelerators into edge devices, and Processing-in-Memory (PiM) accelerators are prime candidates for this integration. PiM accelerators promise improved performance and power characteristics by breaking the memory wall, but they are notoriously difficult to program, which hinders their integration into the traditional computing stack. In this talk, we present AI-PiM as a solution to this problem. AI-PiM is a hardware/software co-design methodology that efficiently integrates PiM accelerators into the RISC-V processor pipeline as functional units. Along with hardware integration, AI-PiM also focuses on RISC-V ISA extensions that target the PiM functional units directly, resulting in a tight integration of PiM accelerators with the processor.
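
To illustrate what an ISA extension for a PiM functional unit might look like at the encoding level, the sketch below packs a hypothetical R-type "pim.mac rd, rs1, rs2" instruction into the RISC-V custom-0 opcode space. The field layout follows the standard R-type format, but the funct3/funct7 values and the instruction itself are made-up examples, not AI-PiM's actual extension.

```python
def encode_r_type(opcode, rd, funct3, rs1, rs2, funct7):
    # Standard RISC-V R-type layout: funct7 | rs2 | rs1 | funct3 | rd | opcode
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | \
           (funct3 << 12) | (rd << 7) | opcode

# Hypothetical pim.mac x10, x11, x12 in the custom-0 opcode space (0x0B).
PIM_MAC = encode_r_type(opcode=0x0B, rd=10, funct3=0b000,
                        rs1=11, rs2=12, funct7=0b0000001)
print(hex(PIM_MAC))   # 32-bit instruction word an extended decoder would route to the PiM unit
```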

Tuesday, May 26, 2020

Sieve: Scalable In-situ DRAM-based Accelerator Designs for Massively Parallel k-mer Matching

(Lingxi Wu presenting on May 27, 2020 at 11:00 a.m. and 7:00 p.m.)

The rapid influx of biosequence data, coupled with the stagnation of the processing power of conventional computing systems, highlights the critical need for high-performance accelerator designs that can meet the ever-increasing throughput demands of modern bioinformatics pipelines. This work argues that processing in memory (PIM) is a viable and effective solution to alleviate the bottleneck of k-mer matching, a widely used genome sequence comparison and classification algorithm characterized by highly random access patterns and low computational intensity.

This work proposes and evaluates three DRAM-based in-situ k-mer matching accelerator designs (one optimized for area, one optimized for throughput, and one that strikes a balance between hardware cost and performance), collectively dubbed Sieve. They leverage a novel data mapping scheme that allows simultaneous comparison of millions of DNA base pairs, lightweight matching circuitry for fast pattern matching, and an early termination mechanism that prunes unnecessary DRAM row activations to reduce latency and save energy. Evaluation of Sieve using state-of-the-art workloads with real-world datasets shows that the most aggressive design provides an average of 408x/41x speedup and 93x/61x energy savings over multi-core-CPU/GPU baselines for k-mer matching. Sieve's performance scales linearly with the reference sequence data, substantially boosting the efficiency of modern genome sequencing pipelines.
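
For readers unfamiliar with the kernel being accelerated, here is a functional sketch of k-mer matching: build a set of length-k substrings from the reference sequences and check which k-mers of a query read are present. Sieve performs this in DRAM with massive parallelism; the Python below only captures the functional behavior, and the k value is illustrative.

```python
def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_match(reference_seqs, read, k=31):
    ref_index = set()
    for seq in reference_seqs:
        ref_index |= kmers(seq, k)                       # index every reference k-mer
    hits = [km for km in kmers(read, k) if km in ref_index]
    return len(hits), len(hits) / max(1, len(read) - k + 1)   # hit count and hit fraction

print(kmer_match(["ACGTACGTGG"], "CGTACGT", k=5))        # (3, 1.0) for this toy example
```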

Friday, May 15, 2020

aCortex: a Multi-Purpose Mixed-Signal Neural Inference Accelerator Based on Non-Volatile Memory Devices

(Mohammad Bavandpour, UCSB, presenting on May 20, 2020 at 11:00AM & 7:00PM ET)

We introduce “aCortex”, an extremely energy-efficient, fast, compact, and versatile neuromorphic processor architecture suitable for accelerating a wide range of neural network inference models. The most important feature of our processor is a configurable mixed-signal computing array of vector-by-matrix multiplier (VMM) blocks that use embedded non-volatile memory (NVM) arrays to store weight matrices. In this architecture, the power-hungry analog peripheral circuitry for data integration and conversion is shared among a very large array of VMM blocks, enabling efficient, instant analog-domain VMM operation for different neural layer types with a wide range of layer specifications. This approach also maximizes the processor’s area efficiency by sharing the area-hungry high-voltage programming switching circuitry, as well as the analog peripheries, across a large 2D array of NVM blocks. Such a compact implementation further boosts energy efficiency by lowering the cost of digital data transfer. Other unique features of aCortex include a configurable chain of buffers and data buses, a simple and efficient instruction set architecture (ISA) with its corresponding multi-agent controller, and a customized refresh-free embedded DRAM memory. In this work, we focus specifically on 55-nm 2D-NOR and 3D-NAND flash memory technologies and present detailed system-level area/energy/speed estimates for several common benchmarks, namely Inception-v1 and ResNet-152, two state-of-the-art deep feedforward networks for image classification, and GNMT, Google’s deep recurrent network for language translation.
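
A toy model of what a mixed-signal VMM block computes: weights are quantized to a small number of levels (as NVM cell conductances would be), the input vector plays the role of applied voltages, and the column current sums give the dot products. Signed values stand in for a differential pair of cells, and the level count and scaling are illustrative, not aCortex's device parameters.

```python
import numpy as np

def program_weights(weights, levels=32):
    w = np.asarray(weights, dtype=float)
    scale = np.abs(w).max() or 1.0
    quantized = np.round(w / scale * (levels - 1)) / (levels - 1)   # discrete "conductance" levels
    return quantized, scale

def analog_vmm(inputs, quantized, scale):
    # Kirchhoff-style current summation per column, rescaled back to weight units.
    return (np.asarray(inputs, dtype=float) @ quantized) * scale

q, s = program_weights(np.random.default_rng(0).normal(size=(64, 10)))
y = analog_vmm(np.ones(64), q, s)    # one vector-by-matrix multiply through the "array"
```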

Monday, May 11, 2020

HeteroRefactor: Refactoring for Heterogeneous Computing with FPGA

(Jason Lau, UCLA, presenting on Wednesday, May 13, 2020)

Heterogeneous computing with field-programmable gate arrays (FPGAs) has demonstrated orders-of-magnitude improvements in computing efficiency for many applications. However, the use of such platforms so far has been limited to a small subset of programmers with specialized hardware knowledge. High-level synthesis (HLS) tools have made significant progress in raising the level of programming abstraction from hardware description languages to C/C++, but they usually cannot compile and generate accelerators for kernel programs with pointers, memory management, and recursion, and they require manual refactoring to make such programs HLS-compatible. In addition, experts need to provide heavily handcrafted optimizations to improve resource efficiency, which affects the maximum operating frequency, parallelization, and power efficiency.

We propose a new dynamic invariant analysis and automated refactoring technique called HeteroRefactor. First, HeteroRefactor monitors FPGA-specific dynamic invariants: the required bitwidth of integer and floating-point variables, and the size of recursive data structures and stacks. Second, using this knowledge of dynamic invariants, it refactors the kernel to make traditionally HLS-incompatible programs synthesizable and to further optimize the accelerator’s resource usage and frequency. Third, to guarantee correctness, it selectively offloads computation from the CPU to the FPGA only if an input falls within the dynamic invariants. On average, for a recursive program of 175 LOC, an expert FPGA programmer would need to write 185 additional LOC to implement an HLS-compatible version, whereas HeteroRefactor automates this transformation. Our results on a Xilinx FPGA show that HeteroRefactor minimizes BRAM usage by 83% and increases frequency by 42% for recursive programs, reduces BRAM by 41% through integer bitwidth reduction, and reduces DSP usage by 50% through floating-point precision tuning.
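
A minimal sketch of the bitwidth dynamic-invariant idea: run the kernel on profiling inputs, record the largest magnitude each tracked variable takes, and derive the integer bitwidth the FPGA version actually needs. The monitor class and the tracked kernel below are placeholders, not HeteroRefactor's instrumentation.

```python
import math

class BitwidthMonitor:
    def __init__(self):
        self.max_abs = {}

    def observe(self, name, value):
        self.max_abs[name] = max(self.max_abs.get(name, 0), abs(int(value)))
        return value

    def required_bits(self, name, signed=True):
        m = self.max_abs.get(name, 0)
        bits = max(1, math.ceil(math.log2(m + 1)))   # bits needed for the observed range
        return bits + 1 if signed else bits          # one extra bit for the sign

mon = BitwidthMonitor()
acc = 0
for x in [3, 17, 250]:                               # profiling inputs
    acc = mon.observe("acc", acc + x)
print(mon.required_bits("acc"))                      # bitwidth to synthesize for "acc"
```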

Tuesday, April 28, 2020

MEG1.1: A RISC-V-based System Simulation Infrastructure for Exploring Memory Optimization using FPGAs and High-Bandwidth Memory

(Nicholas Beckwith, U Penn., presenting on Wednesday, April 29, 2020)

In this presentation, we propose MEG1.1, a configurable, cycle-exact, RISC-V-based full-system emulation infrastructure using FPGAs and high-bandwidth memory (HBM). MEG1.1 extends MEG1.0 by providing out-of-order RISC-V cores as well as OS and architectural support to help integrate users' customized accelerators. Furthermore, MEG1.1 provides an HBM memory interface that fully exposes the HBM's bandwidth to the user. Leveraging MEG1.1, we present a cross-layer system optimization as an illustrative case study to demonstrate its usability. In this case study, we present a reconfigurable memory controller that improves on the address mapping of a standard memory controller. This reconfigurable memory controller, along with its OS support, allows the user to improve the memory bandwidth available to the out-of-order RISC-V cores as well as to custom near-memory accelerators. We also present the challenges and research directions for MEG2.0, which aims to significantly reduce cost and improve the portability, flexibility, and usability of MEG1.0 and MEG1.1 without sacrificing performance or fidelity.
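
As a toy illustration of what a reconfigurable address mapping means, the sketch below parameterizes which physical-address bit ranges select channel, bank, row, and column, so a different mapping can spread traffic across HBM channels. The field widths and bit ordering are illustrative assumptions, not MEG1.1's actual controller layout.

```python
def make_mapper(col_bits=6, chan_bits=4, bank_bits=4):
    def map_addr(paddr):
        col = paddr & ((1 << col_bits) - 1)
        chan = (paddr >> col_bits) & ((1 << chan_bits) - 1)          # interleave on low-order bits
        bank = (paddr >> (col_bits + chan_bits)) & ((1 << bank_bits) - 1)
        row = paddr >> (col_bits + chan_bits + bank_bits)
        return {"channel": chan, "bank": bank, "row": row, "column": col}
    return map_addr

mapper = make_mapper()
print(mapper(0x1234ABCD))   # where one physical address lands under this mapping
```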

Tuesday, April 21, 2020

BaM: Enabling Accelerator Memory Accesses into the SSD


(Zaid Qureshi, David Min, and Vikram Sharma Mailthody presenting 4/22/20.) 

Storage-class memories (SCMs) have been considered a prime candidate to address the growing memory footprint of applications. An ideal SCM for tomorrow's data center has terabytes of capacity, a few hundred nanoseconds to a couple of microseconds of latency, is energy efficient, offers high memory parallelism, is scalable, and is very cheap. Among the several types of SCM, 3D XPoint and Flash have shown promising results. Compared to 3D XPoint, Flash offers higher throughput thanks to several levels of parallelism, has higher density, consumes very little power per memory access, and has proven to be scalable and cost-efficient.

However, studying Flash as part of the main memory system is challenging because existing simulators and emulators do not provide the needed flexibility and cannot address practical system-level challenges. In this talk, we will discuss our attempt at modeling an SSD using an FPGA. We show that CPUs offer very limited memory parallelism over PCIe and are inefficient at exploiting the massive parallelism offered by these emerging NVM devices. To increase memory-level parallelism, we connect a GPU to an FPGA. We will then discuss our findings, along with several application- and system-level challenges we encountered.
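
A back-of-the-envelope sketch (Little's Law) of why so much parallelism is needed: the number of outstanding 4 KB requests required to saturate an SSD-like device at a given bandwidth and latency. The numbers below are illustrative, not measurements from this work; the point is that the required in-flight request count far exceeds what a CPU comfortably sustains over PCIe.

```python
def outstanding_requests(bandwidth_gbps, latency_us, req_kb=4):
    reqs_per_sec = bandwidth_gbps * 1e9 / (req_kb * 1024)   # requests completed per second
    return reqs_per_sec * (latency_us * 1e-6)               # Little's Law: in-flight = rate * latency

print(outstanding_requests(bandwidth_gbps=6, latency_us=80))   # ~117 requests in flight for these numbers
```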