Authors: Hengyu Zhao, Jiawen Liu, Matheus Almeida Ogleari,
Dong Li, Jishen Zhao
Abstract: Neural networks (NNs) have been adopted in a wide range of application domains, such as image classification, speech recognition, object detection, and computer vision. However, training NNs, especially deep neural networks (DNNs), can be energy- and time-consuming because of frequent data movement between processor and memory. Furthermore, training involves massive fine-grained operations with various computation and memory access characteristics. Exploiting high parallelism with such diverse operations is challenging. To address these challenges, we propose a software/hardware co-design of a heterogeneous processing-in-memory (PIM) system. Our hardware design incorporates hundreds of fixed-function arithmetic units and ARM-based programmable cores on the logic layer of a 3D die-stacked memory to form a heterogeneous PIM architecture attached to the CPU. Our software design offers a programming model and a runtime system that program, offload, and schedule various NN training operations across the compute resources provided by the CPU and the heterogeneous PIM. By extending the OpenCL programming model and employing a hardware-heterogeneity-aware runtime system, we enable high program portability and easy program maintenance across various heterogeneous hardware, optimize system energy efficiency, and improve hardware utilization.
