Monday, July 27, 2020

Look-Up Table based Energy Efficient Processing in SRAM for Neural Network Acceleration

(Akshay Krishna Ramanathan presenting on Wed. 7/29/2020 at 11:00 a.m. and 7:00 p.m. ET)

This work presents a Look-Up Table (LUT) based Processing-In-Memory (PIM) technique with the potential for running Neural Network inference tasks. We implement a bitline-computing-free technique that avoids frequent bitline accesses to the cache sub-arrays, thereby considerably reducing the memory access energy overhead. The LUT, in conjunction with the compute engines, enables sub-array-level parallelism while executing, through data lookup, complex operations that would otherwise require multiple cycles. Sub-array-level parallelism and a systolic input data flow ensure that data movement remains confined to the SRAM slice.

Our proposed LUT-based PIM methodology exploits substantial parallelism using look-up tables without altering the memory structure or organization; that is, it preserves the bit-cells and peripherals of the existing monolithic SRAM arrays. Our solution achieves 1.72x higher performance and 3.14x lower energy compared to a state-of-the-art processing-in-cache solution. The sub-array-level design modifications to incorporate the LUT along with the compute engines increase the overall cache area by 5.6%. We achieve a 3.97x speedup over a neural network systolic accelerator of similar area. The reconfigurable nature of the compute engines enables various neural network operations, thereby supporting sequential networks (RNNs) and transformer models. Our quantitative analysis demonstrates 101x and 3x faster execution, and 91x and 11x higher energy efficiency, than a CPU and a GPU respectively while running the transformer model BERT-Base.
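
As a concrete illustration of the lookup idea, here is a minimal software sketch (ours, not the paper's hardware design; the 4-bit operand slicing is an assumption): a multi-cycle operation such as an 8-bit multiply is answered by a small precomputed product table plus shifts and adds.

    # Hypothetical software illustration of LUT-based computing; the paper
    # realizes this inside SRAM sub-arrays, not in Python.
    LUT = [[a * b for b in range(16)] for a in range(16)]  # 4-bit x 4-bit products

    def lut_mul8(x, y):
        """8-bit multiply via four LUT lookups plus shifts and adds."""
        xh, xl = x >> 4, x & 0xF
        yh, yl = y >> 4, y & 0xF
        return (LUT[xh][yh] << 8) + ((LUT[xh][yl] + LUT[xl][yh]) << 4) + LUT[xl][yl]

    assert lut_mul8(200, 123) == 200 * 123

In hardware, each table read replaces the multiple cycles an arithmetic unit would spend, which is what enables the sub-array-level parallelism described above.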

Monday, July 20, 2020

Video Analytic Platform and Deep Graph Matching

(Feng Shi and Ziheng Xu, UCLA, presenting on July 22, 2020 at 11:00 a.m. and 7:00 p.m. ET)

As a fundamental problem in pattern recognition, graph matching has applications in a variety of fields, especially computer vision. Uses of graph matching in video analytics include (but are not limited to) scene understanding, object keypoint matching, and checking the availability of scene generation. The graph similarity (or matching) problem attempts to find node correspondences between graphs. Obtaining an exact solution is NP-hard and traditional heuristic algorithms suffer from long latency; recently, research employing deep graph architectures has been proposed to find approximate solutions, improving both speed and accuracy. Previous works in this line of research mainly focused on using localized node embeddings to obtain an approximate node alignment. However, local features alone cannot reflect the whole structure; the overall graph topology plays a vital role in determining the edge alignment between graphs. Diffusion wavelets, which depict the probability distributions of graph signals over a graph, are powerful tools for exploring graph topology. In this work, we present WaveSim, a lightweight deep graph matching framework that incorporates graph diffusion wavelets to calculate the diffusion distance. We also mathematically prove that it is possible to transform a Quadratic Assignment Programming (QAP) problem, with its high-order combinatorial nature, into a lower-dimensional problem. Experiments show that WaveSim achieves remarkable and robust performance and can be extended to matching problems with large graphs.
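
For intuition, here is a hedged sketch of one standard diffusion-distance construction (heat-kernel diffusion on the normalized Laplacian). The abstract does not spell out WaveSim's exact wavelet construction, so every choice below, including the kernel and the distance metric, is an assumption:

    import numpy as np

    def diffusion_distance(adj, t=1.0):
        """Distances between nodes' heat-kernel diffusion profiles exp(-t*L)."""
        deg = adj.sum(axis=1)
        d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
        lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt  # normalized Laplacian
        w, v = np.linalg.eigh(lap)
        psi = v @ np.diag(np.exp(-t * w)) @ v.T  # row i = node i's diffusion profile
        # Euclidean distance between the diffusion profiles of every node pair.
        return np.linalg.norm(psi[:, None, :] - psi[None, :, :], axis=-1)

    # Tiny example: a 4-node path graph; structurally close nodes get small distances.
    A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
    D = diffusion_distance(A, t=0.5)

Comparing such profiles across two graphs gives a topology-aware signal that local node embeddings alone cannot provide, which is the gap the paragraph above identifies.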

Tuesday, July 14, 2020

GCN meets GPU: Decoupling “When to Sample” from “How to Sample”

(Morteza Ramezani, Penn State, presenting on Wed. 7/15/20)

Graphs are powerful and versatile data structures for modeling many real-world problems, and learning on graph-based data is attracting growing interest in both academia and industry. Recently, many models have been introduced for learning on graphs. Among these, Graph Convolutional Networks (GCNs) and their variants have received significant attention and have become the de facto methods for learning on graphs. However, node dependency in graphs leads to a "neighborhood explosion" that makes training GCNs for large graphs on a GPU impractical. To address this problem, sampling-based methods have been introduced and coupled with stochastic gradient descent for training GCNs.

While effective in alleviating the neighborhood explosion, these methods incur computational overheads in preprocessing and loading new samples on heterogeneous (CPU-GPU) systems due to bandwidth and memory bottlenecks, which significantly degrade sampling performance. By decoupling the frequency of sampling from the sampling strategy, we propose LazyGCN, a general yet effective framework that can be integrated with any sampling strategy to substantially improve training time. The idea behind LazyGCN is to perform sampling periodically and effectively recycle the sampled nodes to mitigate the data preparation overhead. We theoretically analyze the proposed algorithm and give corroborating empirical evidence on large real-world graphs, demonstrating that the proposed scheme can significantly reduce the number of sampling steps and yield superior speedups without compromising accuracy.
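
The recycling idea fits in a few lines. The following is a minimal sketch (the names and stub functions are ours, not the paper's API): draw a fresh sample only every R iterations and reuse it in between, so the CPU-side preparation cost is amortized over R GPU steps.

    import random

    def sample_subgraph(nodes, size):
        """Stand-in for any sampling strategy (node-, layer-, or subgraph-wise)."""
        return random.sample(nodes, size)

    def train_step(model_state, subgraph):
        """Stand-in for one SGD step on the sampled subgraph."""
        return model_state + 1  # placeholder update

    nodes = list(range(10_000))
    model_state, R, num_iters = 0, 5, 50  # R = recycling period

    cached = None
    for it in range(num_iters):
        if it % R == 0:                                # "when to sample"
            cached = sample_subgraph(nodes, size=256)  # "how to sample" is pluggable
        model_state = train_step(model_state, cached)  # otherwise recycle the sample

Any sampling strategy can be plugged into sample_subgraph unchanged, which is exactly the decoupling the title refers to.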

Thursday, July 2, 2020

Mirage: A Highly Parallel And Flexible RRAM Accelerator

(Xiao Liu, UCSD, presenting on July 8, 2020 at 11:00 AM & 7:00 PM ET)

Emerging resistive memory (RRAM) based crossbars are a promising technology for accelerating neural network applications.
Such a structure can support massively parallel multiply-accumulate operations, which are used intensively in convolutional neural networks (CNNs).
This structure has been demonstrated to offer higher performance and power efficiency than CMOS-based accelerators.
However, previously proposed RRAM-based neural network designs lack several desirable features for neural network accelerators.
First, the pipeline of existing architectures is inefficient, as data dependencies between different layers of the network can significantly stall execution.
Second, existing RRAM-based accelerators suffer from limited flexibility.
Oversized networks cannot be executed on the accelerator, while undersized networks cannot utilize all of the RRAM crossbar arrays.

To address these issues, we propose Mirage, a novel architectural design to enable high parallelism and flexibility for RRAM-based CNN accelerators.
Mirage consists of a Fine-grained Parallel RRAM Architecture (FPRA) and Auto Assignment (AA).
Motivated by the thread block design of GPUs, FPRA addresses the data dependency issue in the pipeline.
When inter-layer parallelism is involved, FPRA unifies the data dependencies of each layer and handles them with shared input and output memory.
AA provides the ability to execute any-sized network on the accelerator.
When the network is oversized, AA utilizes dynamic reconfiguration to fold the network to fit with available hardware.
When the network is undersized, AA utilizes the FPRA to maximize the use of the extra hardware for higher performance.
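
To make the fold/replicate intuition concrete, here is a toy sketch (entirely illustrative; the abstract does not give AA's actual algorithm, and the crossbar capacity and the assignment policies below are assumptions):

    import math

    CROSSBAR_CELLS = 128 * 128  # assumed weights one crossbar array can hold

    def auto_assign(layer_weights, num_crossbars):
        """Toy AA: (arrays, passes_or_replicas) per layer -- fold an oversized
        network into sequential passes, replicate an undersized one."""
        need = [math.ceil(w / CROSSBAR_CELLS) for w in layer_weights]
        total = sum(need)
        if total > num_crossbars:
            # Oversized: give each layer a proportional share of arrays and
            # time-multiplex (fold) the remainder over multiple passes.
            share = [max(1, n * num_crossbars // total) for n in need]
            return [(s, math.ceil(n / s)) for n, s in zip(need, share)]
        # Undersized: hand out spare arrays as extra replicas for parallelism.
        copies = [1] * len(need)
        for i in range(num_crossbars - total):
            copies[i % len(need)] += 1
        return list(zip(need, copies))

    # Example: three layers needing 45 arrays on 16 available arrays are
    # folded; on 64 arrays the spare 19 would become replicas instead.
    print(auto_assign([100_000, 200_000, 400_000], num_crossbars=16))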

We evaluate Mirage on seven popular image recognition neural network models with various network sizes. 
We find that Mirage achieves a 2.0x average speedup compared to the state-of-the-art RRAM-based accelerator.
Additionally, Mirage can adapt networks to RRAM-based accelerators of various sizes, and we show that Mirage delivers better performance scalability than prior works.