Monday, December 7, 2020
Multiple Instance Learning Network for Whole Slide Image Classification with Hardware Acceleration
Monday, November 30, 2020
ReTail: Request-Level Latency Prediction in Multicore-Enabled Cloud Servers
(Shuang Chen of Cornell presenting December 2, 2020 at 11:00 AM & 8:00 PM ET)
Latency-critical (LC) cloud services, such as web search, have strict quality-of-service (QoS) constraints in terms of tail latency. Improving energy efficiency usually takes second place to meeting these latency constraints. Per-core Dynamic Voltage and Frequency Scaling (DVFS) can offer significant efficiency benefits; however, it is challenging to determine which requests can afford to run at reduced frequency without hurting the end-to-end tail latency of the entire service.
We introduce ReTail, a framework for QoS-aware power management for LC services using request-level latency prediction. ReTail is composed of (1) a general and systematic process to collect and select the features of an application that best correlate with the processing latency of its requests, (2) a simple yet accurate request-level latency predictor using linear regression, and (3) a runtime power management system that meets the QoS constraints of LC applications, while maximizing the server's energy savings. Experimental results show that compared to the best state-of-the-art per-core power manager, ReTail achieves an average of 11% (up to 48%) energy savings, while at the same time meeting QoS.
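As a rough, hypothetical sketch of how a request-level linear-regression predictor can drive per-core frequency selection (this is not ReTail's actual feature set or power manager; the features, frequency levels, and scaling model below are assumed purely for illustration):

```python
import numpy as np

# Hypothetical per-request features (e.g., request size, queue depth);
# ReTail's real feature collection and selection are application-specific.
def predict_latency_us(features, weights, bias):
    """Linear-regression latency predictor: latency ~ w . x + b."""
    return float(np.dot(weights, features) + bias)

def pick_frequency(features, weights, bias, qos_target_us,
                   freq_levels_ghz=(1.0, 1.5, 2.0, 2.6)):
    """Choose the lowest per-core frequency whose scaled prediction
    still meets the tail-latency target (illustrative scaling model)."""
    base_freq = max(freq_levels_ghz)
    base_pred = predict_latency_us(features, weights, bias)
    for f in sorted(freq_levels_ghz):
        # Assume compute latency scales inversely with core frequency.
        scaled = base_pred * (base_freq / f)
        if scaled <= qos_target_us:
            return f
    return base_freq  # fall back to the highest frequency
```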
Monday, October 12, 2020
A case for in-network persistence
(Korakit Seemakhupt, UVA, presenting Wed. October 14 at 11:00 a.m. and 7:00 p.m. ET)
To guarantee data persistence, storage workloads (such as databases, key-value stores, and file systems) typically use a synchronous protocol that puts network and server stack latency on the critical path of request processing. The use of fast, byte-addressable persistent memory (PM) has helped mitigate the storage overhead of the server stack; yet networking remains a dominant factor in the end-to-end latency of request processing. Emerging programmable network devices can reduce network latency by moving parts of an application's computation into the network (e.g., caching results for read requests); however, for update requests, the client still has to stall while the server persistently commits the updates.
In this work, we introduce in-network data persistence, which extends the data-persistence domain from servers to the network, and present PMNet, a network device (e.g., switch or NIC) with PM for persisting data in-network. PMNet logs incoming update requests and acknowledges clients directly, without having them wait on the server to commit the request. In case of a failure, the logged requests act as redo logs for the server to recover. We implement PMNet using an FPGA and evaluate its performance against PM-optimized key-value stores and a database. Our evaluation shows that PMNet can improve the throughput of update requests by 4.27x on average and reduce the 99th-percentile tail latency by 3.23x.
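A minimal sketch of the in-network persistence idea, under our own assumptions rather than PMNet's actual FPGA design: the network device appends each update to a persistent log and acknowledges the client immediately, and the log serves as a redo log when the server recovers. All names below are illustrative.

```python
class InNetworkLog:
    """Toy model of a PM-backed network device that logs updates in-network."""
    def __init__(self):
        self.redo_log = []     # stands in for persistent memory on the device
        self.next_seq = 0

    def handle_update(self, key, value, forward_to_server):
        seq = self.next_seq
        self.next_seq += 1
        self.redo_log.append((seq, key, value))   # persist before acking
        forward_to_server(seq, key, value)        # server commits asynchronously
        return ("ACK", seq)                       # client unblocks here, not at the server

    def recover(self, apply_to_server, last_committed_seq):
        """After a server failure, replay logged updates it had not yet committed."""
        for seq, key, value in self.redo_log:
            if seq > last_committed_seq:
                apply_to_server(key, value)
```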
Tuesday, August 11, 2020
Acceleration of Bioinformatics Workloads
(Cameron Martino, UCSD, presenting on 8/12/20)
Humans are host to a unique set of trillions of microbes that encode 99% of the genetic function found in the body. The sequencing of microbial genetic material has led to a revolution in our ability to profile the microbial communities living in and on us all. These microbial profiles have been recognized as effective biomarkers in fields ranging from cancer to forensics. Despite these revelations, the ability to employ microbial profiling at the scale and speed necessary for many applications has lagged behind sequencing technology, often due to the expense in both time and compute power needed to process these large datasets. Here, we describe a 10-fold acceleration of processing pipelines that also improves processing accuracy. We then describe a GPU implementation of UniFrac, a widely used metric for comparing microbial community profiles, which reduces run time from hours to minutes. Finally, we discuss the immediate application of these improvements to the current COVID-19 pandemic, highlighting the importance of accelerating bioinformatics workloads.
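For context on the metric named above, unweighted UniFrac measures the fraction of a phylogenetic tree's branch length that leads only to taxa observed in one of the two samples being compared. A minimal reference computation is sketched below (presence flags are assumed to be already propagated up the tree); the GPU implementation discussed in the talk parallelizes this kind of computation across many sample pairs.

```python
def unweighted_unifrac(branch_lengths, in_sample_a, in_sample_b):
    """Reference (CPU) unweighted UniFrac over per-branch presence flags.

    branch_lengths: length of each branch in the phylogenetic tree
    in_sample_a/b:  whether any taxon under that branch is present in the sample
    """
    unique = observed = 0.0
    for length, a, b in zip(branch_lengths, in_sample_a, in_sample_b):
        if a or b:
            observed += length          # branch present in at least one sample
            if a != b:
                unique += length        # branch observed in only one sample
    return unique / observed if observed else 0.0
```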
Monday, July 27, 2020
Look-Up Table based Energy Efficient Processing in SRAM for Neural Network Acceleration
Monday, July 20, 2020
Video Analytic Platform and Deep Graph Matching
As a fundamental problem in pattern recognition, graph matching has applications in a variety of fields, especially computer vision. Uses of graph matching in video analytics include (but are not limited to) scene understanding, object keypoint matching, and checking the feasibility of scene generation. The graph similarity (or matching) problem attempts to find node correspondences between graphs. Computing an exact solution is NP-hard and incurs long latency; recently, deep graph architectures have been proposed to find approximate solutions, trading off speed and accuracy. Previous work in this direction mainly uses localized node embeddings to obtain an approximate node alignment. However, local features alone cannot reflect the whole structure; the overall graph topology plays a vital role in determining the edge alignment between graphs. Diffusion wavelets, which describe the probability distributions of graph signals diffusing over a graph, are powerful tools for exploring graph topology. In this work, we present WaveSim, a lightweight deep graph matching framework that incorporates graph diffusion wavelets to compute diffusion distances. We also prove mathematically that a Quadratic Assignment Programming (QAP) problem, despite its high-order combinatorial nature, can be transformed into a lower-dimensional problem. Experiments show that WaveSim achieves strong and robust performance and extends to matching problems on large graphs.
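As a rough illustration of the diffusion-distance idea, not WaveSim itself: the heat kernel exp(-tL) of a graph Laplacian L describes how a unit of probability placed on each node spreads over the graph, and comparing these diffusion patterns gives a topology-aware distance between nodes. The sketch below uses a dense Laplacian and a single scale t; WaveSim's wavelet construction and learned matching are more involved.

```python
import numpy as np
from scipy.linalg import expm

def diffusion_embeddings(adj, t=1.0):
    """Rows of exp(-t L) describe how heat placed on each node spreads."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return expm(-t * laplacian)

def diffusion_distance(adj, u, v, t=1.0):
    """L2 distance between the diffusion patterns of nodes u and v."""
    H = diffusion_embeddings(adj, t)
    return float(np.linalg.norm(H[u] - H[v]))

# Example: on a 4-node path graph, adjacent nodes are closer in diffusion distance.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(diffusion_distance(adj, 0, 1), diffusion_distance(adj, 0, 3))
```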
Tuesday, July 14, 2020
GCN meets GPU: Decoupling “When to Sample” from “How to Sample”
Thursday, July 2, 2020
Mirage: A Highly Parallel And Flexible RRAM Accelerator
(Xiao Liu, UCSD, presenting on July 8, 2020 at 11:00 AM & 7:00 PM ET)
Resistive RAM (RRAM) crossbar arrays can support massively parallel multiply-accumulate operations, which are used intensively in convolutional neural networks (CNNs). Such structures have been demonstrated to offer higher performance and power efficiency than CMOS-based accelerators. However, previously proposed RRAM-based neural network designs lack several desirable features for neural network accelerators. First, the pipeline of existing architectures is inefficient, as data dependencies between different layers of the network can significantly stall execution. Second, existing RRAM-based accelerators suffer from limited flexibility: oversized networks cannot be executed on the accelerator, while undersized networks cannot utilize all of the RRAM crossbar arrays. To address these issues, we propose Mirage, a novel architectural design that enables high parallelism and flexibility for RRAM-based CNN accelerators. Mirage consists of a Fine-grained Parallel RRAM Architecture (FPRA) and Auto Assignment (AA). Motivated by the thread-block design of GPUs, FPRA addresses the data-dependency issue in the pipeline: when inter-layer parallelism is exploited, FPRA unifies the data dependencies of each layer and handles them with shared input and output memory. AA provides the ability to execute a network of any size on the accelerator: when the network is oversized, AA uses dynamic reconfiguration to fold the network onto the available hardware; when the network is undersized, AA uses FPRA to make full use of the spare hardware for higher performance. We evaluate Mirage on seven popular image-recognition neural network models with various network sizes.
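A simplified sketch of how the folding idea in AA might look in software terms (our reading of the abstract, not the actual Mirage hardware; the array counts and per-layer demands below are invented):

```python
def assign_layers_to_crossbars(layer_array_demand, num_physical_arrays):
    """Map each layer's required crossbar arrays onto physical arrays.

    Oversized model: physical arrays are reused across passes ("folded").
    Undersized model: the spare count is returned so extra arrays can
    replicate layers for higher parallelism.
    """
    mapping, next_array = {}, 0
    for layer, demand in enumerate(layer_array_demand):
        slots = []
        for _ in range(demand):
            slots.append(next_array % num_physical_arrays)  # wrap-around = folding
            next_array += 1
        mapping[layer] = slots
    spare = max(0, num_physical_arrays - next_array)
    return mapping, spare

# Hypothetical model needing 6 arrays on a 4-array chip: layers fold onto reused arrays.
print(assign_layers_to_crossbars([2, 3, 1], 4))
```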
Monday, June 22, 2020
Efficient and Reliable Digital Processing in-Memory Acceleration
Tuesday, June 9, 2020
AI-PiM: PiM accelerators for AI applications for RISC-V
Tuesday, May 26, 2020
Sieve: Scalable In-situ DRAM-based Accelerator Designs for Massively Parallel k-mer Matching
Friday, May 15, 2020
aCortex: a Multi-Purpose Mixed-Signal Neural Inference Accelerator Based on Non-Volatile Memory Devices
We introduce “aCortex”, an extremely energy-efficient, fast, compact, and versatile neuromorphic processor architecture suitable for accelerating a wide range of neural network inference models. The most important feature of our processor is a configurable mixed-signal computing array of vector-by-matrix multiplier (VMM) blocks utilizing embedded nonvolatile memory (NVM) arrays for storing weight matrices. In this architecture, power-hungry analog peripheral circuitry for data integration and conversion is shared among a very large array of VMM blocks, enabling efficient analog-domain VMM operation for different neural layer types with a wide range of layer specifications. This approach also maximizes the processor's area efficiency by sharing the area-hungry high-voltage programming switching circuitry, as well as the analog peripheries, among a large 2D array of NVM blocks. Such a compact implementation further boosts energy efficiency by lowering the cost of digital data transfer. Other unique features of aCortex include a configurable chain of buffers and data buses, a simple and efficient Instruction Set Architecture (ISA) with its corresponding multi-agent controller, and a customized refresh-free embedded DRAM memory. In this work, we specifically focus on 55-nm 2D-NOR and 3D-NAND flash memory technologies and present detailed system-level area/energy/speed estimations targeting several common benchmarks, namely Inception-v1 and ResNet-152, two state-of-the-art deep feedforward networks for image classification, and GNMT, Google's deep recurrent network for language translation.
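For intuition, the core analog VMM maps inputs onto word-line voltages and weights onto NVM cell conductances, so each bit-line current is a dot product, I_j = sum_i G_ij * V_i. Below is a small numerical model of this ideal operation, ignoring device non-idealities, ADC quantization, and the shared peripheral circuitry described above; the values are made up.

```python
import numpy as np

def analog_vmm(conductances_siemens, input_volts):
    """Ideal crossbar VMM: bit-line current = sum of G_ij * V_i along each column."""
    return conductances_siemens.T @ input_volts   # Kirchhoff current summation

# Hypothetical 3x2 weight matrix stored as conductances (in siemens).
G = np.array([[10.0,  2.0],
              [ 5.0,  8.0],
              [ 1.0,  4.0]]) * 1e-6
V = np.array([0.2, 0.1, 0.3])   # input activations encoded as word-line voltages
print(analog_vmm(G, V))         # per-column currents to be digitized by shared ADCs
```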
Monday, May 11, 2020
HeteroRefactor: Refactoring for Heterogeneous Computing with FPGA
Heterogeneous computing with field-programmable gate arrays (FPGAs) has demonstrated orders-of-magnitude improvements in computing efficiency for many applications. However, the use of such platforms has so far been limited to a small subset of programmers with specialized hardware knowledge. High-level synthesis (HLS) tools have made significant progress in raising the level of programming abstraction from hardware description languages to C/C++, but they usually cannot compile and generate accelerators for kernel programs with pointers, memory management, and recursion, and they require manual refactoring to make such programs HLS-compatible. In addition, experts need to provide heavily handcrafted optimizations to improve resource efficiency, which in turn affects the maximum operating frequency, parallelization, and power efficiency.
We propose a new dynamic invariant analysis and automated refactoring technique, called HeteroRefactor. First, HeteroRefactor monitors FPGA-specific dynamic invariants: the required bitwidth of integer and floating-point variables, and the size of recursive data structures and stacks. Second, using this knowledge of dynamic invariants, it refactors the kernel to make traditionally HLS-incompatible programs synthesizable and to further optimize the accelerator's resource usage and frequency. Third, to guarantee correctness, it selectively offloads computation from CPU to FPGA only if an input falls within the observed dynamic invariants. On average, for a recursive program of 175 LOC, an expert FPGA programmer would need to write 185 additional LOC to implement an HLS-compatible version, whereas HeteroRefactor automates this transformation. Our results on Xilinx FPGAs show that HeteroRefactor reduces BRAM usage by 83% and increases frequency by 42% for recursive programs, reduces BRAM by 41% through integer bitwidth reduction, and reduces DSP usage by 50% through floating-point precision tuning.
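A toy illustration of the dynamic bitwidth invariant, not the HeteroRefactor tool itself: profile the values a kernel variable actually takes, derive the smallest two's-complement width that covers them, and emit a narrowed HLS type for that variable. The variable and sample values below are hypothetical.

```python
def required_signed_bitwidth(observed_values):
    """Smallest two's-complement width that holds every observed value."""
    width = 1
    for v in observed_values:
        while not (-(1 << (width - 1)) <= v < (1 << (width - 1))):
            width += 1
    return width

# Profile a kernel variable, then emit a narrowed HLS type declaration for it.
samples = [-3, 17, 250, -128]
width = required_signed_bitwidth(samples)          # -> 9 bits for these samples
print(f"ap_int<{width}> x;  // was a 32-bit int before refactoring")
```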
Tuesday, April 28, 2020
MEG1.1: A RISCV-based System Simulation Infrastructure for Exploring Memory Optimization using FPGAs and High Bandwidth Memory
In this presentation, we propose MEG1.1, a configurable, cycle-exact, RISC-V-based full-system emulation infrastructure using FPGAs and HBM. MEG1.1 extends MEG1.0 by providing out-of-order RISC-V cores as well as OS and architectural support to help integrate the user's customized accelerators. Furthermore, MEG1.1 provides an HBM memory interface that fully exposes the HBM's bandwidth to the user. Leveraging MEG1.1, we present a cross-layer system optimization as an illustrative case study to demonstrate its usability. In this case study, we present a reconfigurable memory controller that improves on the address mapping of a standard memory controller. This reconfigurable memory controller, along with its OS support, allows the user to improve the memory bandwidth available to the out-of-order RISC-V cores as well as to custom near-memory accelerators. We also present the challenges and research directions for MEG 2.0, which aims to significantly reduce the cost and improve the portability, flexibility, and usability of MEG 1.0 and 1.1 without sacrificing performance or fidelity.
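To make the address-mapping point concrete, a simplified model of what a reconfigurable memory controller can change is which physical-address bits select the channel and bank. The bit positions below are invented for illustration and are not MEG1.1's actual HBM mapping.

```python
def decode_address(addr, bank_shift=13, bank_bits=4, channel_shift=6, channel_bits=3):
    """Extract channel and bank indices from a physical address.

    A reconfigurable controller can move these bit fields (e.g., place the
    channel bits lower) so that streaming accesses spread across HBM channels.
    """
    channel = (addr >> channel_shift) & ((1 << channel_bits) - 1)
    bank = (addr >> bank_shift) & ((1 << bank_bits) - 1)
    return channel, bank

# The same access stream under two mappings: channel interleaving differs.
addrs = [i * 64 for i in range(8)]                            # consecutive cache lines
print([decode_address(a) for a in addrs])                     # channel bits at bit 6: lines spread out
print([decode_address(a, channel_shift=20) for a in addrs])   # high channel bits: all hit channel 0
```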