Tuesday, December 17, 2019

Graph Analytics Accelerator Supporting Sparse Data Representation using Crossbar Architectures

(Nagadastagiri Challapalle presenting on Wednesday, December 18, 2019 at 1:00PM ET)

Graph analytics applications are ubiquitous in this era of a connected world. These applications have very low compute-to-bytes-transferred ratios and exhibit poor locality, which limits their computational efficiency on general-purpose computing systems. Conventional hardware accelerators employ custom dataflow and memory hierarchy organizations to overcome these challenges. Processing-in-memory (PIM) accelerators leverage massively parallel, compute-capable memory arrays to perform in-situ operations on graph data, or employ custom compute elements near the memory to exploit the larger internal bandwidth. In this work, we present GaaS-X, a graph analytics accelerator that inherently supports sparse graph data representations using an in-situ compute-enabled crossbar memory architecture. We alleviate the overheads of redundant writes, sparse-to-dense conversions, and redundant computations on invalid edges that are present in state-of-the-art crossbar-based PIM accelerators. GaaS-X achieves 7.7× and 2.4× performance gains and 22× and 5.7× energy savings, respectively, over two state-of-the-art crossbar accelerators, and offers orders-of-magnitude improvements over GPU and CPU solutions.
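To make the sparsity point concrete, here is a small NumPy sketch (an illustration only, not the GaaS-X hardware mapping): a single sparse matrix-vector step of a graph kernel over an edge list does work only for the valid edges, whereas the dense form implied by a naive crossbar mapping also spends effort on every zero entry.

```python
# Illustrative NumPy sketch, not the GaaS-X hardware mapping: one SpMV step of
# a graph kernel over a sparse edge list vs. a dense adjacency matrix.
import numpy as np

num_vertices = 5
# COO-style edge list: (src, dst, weight) -- only valid edges are stored.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0), (3, 2, 1.0), (4, 3, 1.0)]
x = np.ones(num_vertices)

# Sparse traversal: work is proportional to the number of valid edges.
y_sparse = np.zeros(num_vertices)
for src, dst, w in edges:
    y_sparse[dst] += w * x[src]

# Dense form: a naive crossbar mapping would also spend writes and computation
# on every zero (invalid) edge -- the redundancy GaaS-X avoids.
A = np.zeros((num_vertices, num_vertices))
for src, dst, w in edges:
    A[dst, src] = w
y_dense = A @ x

assert np.allclose(y_sparse, y_dense)
```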

Monday, December 2, 2019

MEG: A RISCV-Based System Simulation Infrastructure for Exploring Memory Optimization Using FPGAs and Hybrid Memory Cube

Emerging 3D memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide increased bandwidth and massive memory-level parallelism. Efficiently integrating emerging memories into existing systems poses new challenges and requires detailed evaluation in a real computing environment. In this paper, we propose MEG, an open-source, configurable, cycle-exact, RISC-V-based full-system simulation infrastructure using FPGAs and HMC. MEG has three highly configurable design components: (i) an HMC adaptation module that not only enables communication between the HMC device and the processor cores but can also be extended to fit other memories (e.g., HBM, non-volatile memory) with minimal effort, (ii) a reconfigurable memory controller, along with its OS support, that can be effectively leveraged by system designers to perform software-hardware co-optimization, and (iii) a performance monitor module that effectively improves the observability and debuggability of the system to guide performance optimization. We provide a prototype implementation of MEG on a Xilinx VCU110 board and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications. We hope that our open-source release of MEG fills a gap in the space of publicly available FPGA-based full-system simulation infrastructures specifically targeting memory systems and inspires further collaborative software/hardware innovations.

Monday, November 18, 2019

Intermediate Languages for Automated Spatial Computing

Yi-Hsiang Lai presenting on Wed. 11/20/19

With the pursuit of improving compute performance under strict power constraints, there is an increasing need for deploying applications to heterogeneous hardware architectures with accelerators, such as PIMs and FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge.

To tackle this challenge, we introduce HeteroCL, a programming infrastructure composed of a Python-based domain-specific language (DSL) and a compilation flow. The HeteroCL DSL provides a clean programming abstraction that decouples algorithm specification from three important types of hardware customization in compute, data types, and memory architectures. HeteroCL further captures the interdependence among these different customization techniques, allowing programmers to explore various performance/area/accuracy trade-offs in a systematic and productive manner. In addition, our framework produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencils with dataflow architectures. HeteroCL further incorporates the T2S framework developed by Intel Labs. T2S is an intermediate programming model extended from Halide for high-performance systolic architectures. Similar to HeteroCL, T2S cleanly decouples the temporal definition from the spatial mapping, which enables productive programming and efficient design space exploration.
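As a flavor of what the decoupling looks like in practice, the sketch below is adapted from HeteroCL's public getting-started examples (the exact API may differ across versions): the algorithm is plain Python-embedded code, and hardware customization is applied through a separate schedule.

```python
# Minimal HeteroCL-style example, adapted from the project's public tutorials;
# the exact API may differ across HeteroCL versions.
import heterocl as hcl

hcl.init()
A = hcl.placeholder((10, 10), "A")

def simple_compute(A):
    # Algorithm specification only: add 1 to every element of A.
    return hcl.compute(A.shape, lambda x, y: A[x, y] + 1, "B")

# Hardware customization (scheduling) is expressed separately, so the
# algorithm body above stays intact while the schedule is explored.
s = hcl.create_schedule([A], simple_compute)
f = hcl.build(s)
```

Data-type and memory customizations are applied through similarly separate primitives, which is what lets the algorithm code stay unchanged while trade-offs are explored.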

Experimental results show that HeteroCL allows programmers to explore the design space efficiently in both performance and accuracy by combining different types of hardware customization and targeting spatial architectures, while keeping the algorithm code intact.

Friday, November 8, 2019

Exploiting Dynamic Sparsity in Neural Network Accelerators

Yuan Zhou Presenting on Wednesday, 11/13/19 at 1:00PM ET.

Convolutional neural networks (CNNs) have demonstrated human-level performance in many vision-related tasks, including image classification, segmentation, and real-time tasks such as autonomous driving and robotic manipulation. While modern CNNs continue to achieve higher accuracy, they also have larger model sizes and require more computation. As a result, it is challenging to deploy these compute-intensive CNN models to a wider range of applications, especially for embedded and mobile platforms which are area- and power-constrained. 

In this talk, I will introduce our work on reducing the computational costs of CNNs by exploiting dynamic sparsity at run time. I will first present channel gating, a fine-grained dynamic pruning technique for CNN inference. Channel gating identifies the regions in the feature map of each CNN layer that contribute less to the classification result, and turns off a subset of channels for computing the activations in these less important regions. Since channel gating preserves the memory locality in the channel dimension, a CNN with channel gating can be effectively mapped to a slightly modified weight-stationary CNN accelerator. Running a channel gating ResNet-18 model with 2.8x theoretical FLOP reduction, our accelerator achieves 2.3x speedup over the baseline on ImageNet. We further demonstrate that channel gating is suitable for near-memory CNN acceleration by simulating a compute-constrained, high-bandwidth platform. With sufficient memory bandwidth, the actual speedup of the channel gating accelerator scales almost linearly with the theoretical FLOP reduction. 
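The following NumPy sketch illustrates the gating idea at a single output pixel (a conceptual toy, not the paper's exact formulation or its learned thresholds): a cheap partial sum over a base subset of input channels decides which outputs get the remaining channels computed.

```python
# Conceptual NumPy toy of channel gating at one output pixel of a 1x1 conv;
# the base/conditional channel split and the threshold are illustrative.
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out = 64, 32
x = rng.standard_normal(C_in)            # input activations at one pixel
W = rng.standard_normal((C_out, C_in))   # layer weights
base = C_in // 4                         # "base" subset of input channels
threshold = 0.5

# 1) Cheap partial sums from the base channels only.
partial = W[:, :base] @ x[:base]

# 2) Gate: only outputs whose partial sum looks important get the remaining
#    channels computed; the rest keep the cheap partial result.
important = np.abs(partial) > threshold
y = partial.copy()
y[important] += W[important, base:] @ x[base:]

print("fraction of outputs computed in full:", important.mean())
```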

I will then briefly talk about our ongoing work, precision gating. Compared with channel gating, precision gating reduces computation by exploiting dynamic sparsity in another direction. Rather than pruning away entire channels at non-important pixel locations, precision gating uses low arithmetic precision at these locations while keeping the original precision at important locations. We believe precision gating is also suitable for near-memory CNN acceleration since it can effectively reduce the computational costs of CNN inference. 
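A similarly hedged toy sketch of precision gating: everything is first computed at low precision, and only outputs flagged as important are recomputed at full precision. The 4-bit quantizer and the threshold below are illustrative stand-ins.

```python
# Conceptual NumPy toy of precision gating: a low-precision pass everywhere,
# then full precision only where the result looks important.
import numpy as np

def quantize(v, bits=4, scale=1.0):
    step = scale / (2 ** (bits - 1))
    q = np.clip(np.round(v / step), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q * step

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
W = rng.standard_normal((32, 64))

y_low = quantize(W) @ quantize(x)        # cheap low-precision pass
important = np.abs(y_low) > 1.0          # gate on output magnitude
y = y_low.copy()
y[important] = W[important] @ x          # full precision only where needed
```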

Tuesday, October 29, 2019

MEDAL: Scalable DIMM based Near Data Processing Accelerator for DNA Seeding Algorithm


(Presenting on Wed. 10/30/2019)
Abstract: 
Computational genomics has proven its great potential to support precise and customized health care. However, with the wide adoption of Next Generation Sequencing (NGS) technology, DNA alignment, a crucial step in computational genomics, is becoming more and more challenging due to booming bio-data. Consequently, various hardware approaches have been explored to accelerate DNA seeding - the core and most time-consuming step in DNA alignment.



Most previous hardware approaches leverage multi-core CPUs, GPUs, and FPGAs to accelerate DNA seeding. However, DNA seeding is memory-bound, while the above hardware approaches focus on computation. For this reason, Near Data Processing (NDP) is a better solution for DNA seeding. Unfortunately, existing NDP accelerators for DNA seeding face two grand challenges, i.e., fine-grained random memory access and the scalability demands of booming bio-data. To address these challenges, we propose a practical, energy-efficient, Dual-Inline Memory Module (DIMM)-based NDP Accelerator for DNA Seeding Algorithm (MEDAL), which is built from off-the-shelf DRAM components. For small databases that fit within a single DRAM rank, we propose an intra-rank design, together with an algorithm-specific address mapping, bandwidth-aware data mapping, and Individual Chip Select (ICS), to address the challenge of fine-grained random memory access, improving parallelism and bandwidth utilization. Furthermore, to tackle the challenge of scalability for large databases, we propose three inter-rank designs (polling-based communication, interrupt-based communication, and a Non-Volatile DIMM (NVDIMM)-based solution). In addition, we propose an algorithm-specific data compression technique to reduce the memory footprint, create more room for the data mapping, and reduce the communication overhead. Experimental results show that, for the three proposed designs, MEDAL can achieve on average 30.50x/8.37x/3.43x speedup and 289.91x/6.47x/2.89x energy reduction when compared with a 16-thread CPU baseline and two state-of-the-art NDP accelerators, respectively.
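The abstract does not spell out the compression scheme, but a classic illustration of algorithm-specific compression for DNA data is packing the four-letter alphabet into 2 bits per base; the sketch below shows that idea in Python (MEDAL's actual technique may differ).

```python
# Illustrative 2-bit packing of DNA bases (A/C/G/T), a common way to shrink a
# reference or index footprint; MEDAL's actual compression scheme may differ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack a DNA string into a bytearray, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        out[i // 4] |= CODE[ch] << (2 * (i % 4))
    return out

def unpack(buf, n):
    """Recover the first n bases from a packed buffer."""
    return "".join(BASES[(buf[i // 4] >> (2 * (i % 4))) & 0x3] for i in range(n))

seq = "ACGTTGCA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(len(seq), "bases ->", len(packed), "bytes")   # 8 bases -> 2 bytes
```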



Bio of Wenqin Huangfu: 
Wenqin Huangfu is a fourth-year Ph.D. student at the University of California, Santa Barbara. His advisor is Professor Yuan Xie.

Wenqin's research interests include domain-specific accelerators, Processing-In-Memory (PIM), Near-Data-Processing (NDP), and emerging memory technologies. Currently, Wenqin is focusing on hardware acceleration of bioinformatics with PIM and NDP technologies.

Monday, October 14, 2019

SubZero: Zero-copy IO for Persistent Main Memory File Systems

Juno Kim, UC San Diego, presenting on Oct 14th

POSIX-style read() and write() have long been the standard interface for accessing file data. However, the data copy their semantics require imposes an unnecessary overhead on accessing files stored in persistent memories attached to the processor memory bus (PMEM). PMEM-aware file systems provide direct-access (DAX)-based mmap() to avoid the copy, but it forces the programmer to manage concurrency control and atomicity guarantees. 

We propose a new IO interface, called SubZero IO, that avoids data movement overheads when accessing persistent memory-backed files while still interacting cleanly with legacy read() and write(). SubZero IO provides two new system calls – peek() and patch() – that allow for data access and modification without a copy. They can improve performance significantly, but they require substantial changes to how an application performs IO. To avoid this, we describe a PMEM-aware implementation of read() and write() that can transparently provide most of the benefits of peek() and patch() for some applications.
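As a rough user-space analogy of what copy-free access buys (this is ordinary Python mmap, not the proposed peek()/patch() kernel interface), the snippet below reads and updates file bytes in place through a mapped view instead of shuttling them through read()/write() buffers.

```python
# Rough user-space analogy of copy-avoiding file access (ordinary Python mmap,
# not the SubZero peek()/patch() interface): read and modify file bytes in
# place through a mapped view instead of read()/write() buffers.
import mmap

with open("data.bin", "wb") as f:
    f.write(b"hello persistent world")

f = open("data.bin", "r+b")
m = mmap.mmap(f.fileno(), 0)     # file pages mapped into the address space
view = memoryview(m)             # zero-copy window onto those pages
print(bytes(view[:5]))           # b'hello' -- a copy happens only at this point
view[:5] = b"HELLO"              # in-place update, no intermediate buffer
view.release()
m.close()
f.close()
```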


Measurements of simple benchmarks show that SubZero outperforms copy-based read() and write() by up to 2× and 6×, respectively. At the application level, peek() improves GET performance of the Apache Web Server by 3.8×, and patch() boosts the SET performance of Kyoto Cabinet up to 2.7×.

Wednesday, October 9, 2019

FloatPIM: Acceleration of DNN Training in PIM

Presented by Saransh Gupta of UC San Diego on Wednesday, October 9th at 1:00PM ET.

We present PIM methods that perform in-memory computations on digital data and support high-precision floating-point operations. First, we introduce an operation-dependent variable voltage application scheme, which improves the performance and energy efficiency of existing PIM operations by 2x. Then, we propose an in-memory deep neural network (DNN) architecture, which supports not only DNN inference but also training completely in memory. To achieve this, we natively enable, for the first time, high-precision floating-point operations in memory. Our design also enables fast communication between neighboring memory blocks to reduce the internal data movement of the PIM architecture.

Tuesday, May 21, 2019

Bring Your Own Datatypes to TVM

Gus Smith, University of Washington, presenting on May 22nd.

Deep learning is hungry for computational power, and it seems it will only be satiated through extreme hardware specialization. Google’s TPU and Intel’s Nervana both employ custom hardware to accelerate deep learning. The exploration of new numerical datatypes, which specify how mathematical values are expressed and operated on in hardware, has been key to extracting the best performance from hardware accelerators. Previously, numerical computations used IEEE 754 floating point, a standard which is designed to be general-purpose. However, the general-purpose nature of IEEE floats often leaves a lot of potential performance on the table. As a result, a number of new datatypes have sprung up as competitors to IEEE floating point, including Google's bfloat16, Intel's Flexpoint, Facebook's Deepfloat, and the Posit format from John Gustafson.

By supporting these new custom datatypes in TVM, an extensible deep learning compiler stack developed at the University of Washington, we can enable future workloads which utilize a variety of custom datatypes. In this talk, I discuss how we are taking the first steps towards supporting custom datatypes in TVM, by allowing users to "bring their own datatypes." Many datatype researchers first develop a software-emulated version of their datatype, before developing it in hardware; our framework allows users to plug these software-emulated versions of their datatypes directly into TVM, to compile and test real models.
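A concrete example of a software-emulated datatype of the kind that could be plugged in: the snippet below emulates bfloat16 by truncating a float32 to its top 16 bits using NumPy (round-to-nearest is omitted for brevity; this is an illustration, not the TVM registration code itself).

```python
# Software-emulated bfloat16: keep a float32's sign, exponent, and top 7
# mantissa bits by zeroing the low 16 bits. Truncation only; round-to-nearest
# is omitted for brevity. This is an illustration, not TVM's registration API.
import numpy as np

def to_bfloat16(x):
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.14159265, 0.1, 100.5], dtype=np.float32)
print(x, "->", to_bfloat16(x))
```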

Thursday, May 2, 2019

Data-Free Quantization for Deep Neural Networks

(Ritchie Zhao presenting on Friday, 5/3/19.) 

Quantization is key to improving the execution time and energy efficiency of neural networks on both commodity GPUs and specialized accelerators. The majority of existing literature focuses on training quantized DNNs. However, industry shows great demand for data-free quantization - techniques that quantize a floating-point model without (re)training. Our talk focuses on this latter topic.

DNN weights and activations follow a bell-shaped distribution post-training, while practical hardware uses a linear quantization grid. This leads to challenges in dealing with outliers in the distribution. Prior work has addressed this by clipping the outliers or using specialized hardware. In this work, we propose outlier channel splitting (OCS), which duplicates channels containing outliers, then halves the duplicated values. The network remains functionally identical, but affected outliers are moved toward the center of the distribution. OCS requires no additional training and works on commodity hardware. Experimental evaluation on ImageNet classification and language modeling shows that OCS can outperform state-of-the-art clipping techniques with only minor overhead.
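Because OCS is simple to state, a toy NumPy check makes the functional-equivalence claim concrete: duplicate the input channel that feeds an outlier weight, halve both copies, and the layer output is unchanged while the weight range (and hence the quantization grid) shrinks.

```python
# Toy NumPy check of outlier channel splitting for a linear layer y = W @ x.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
W[2, 5] = 8.0                        # plant an outlier weight in input channel 5
x = rng.standard_normal(6)
y = W @ x

j = np.unravel_index(np.abs(W).argmax(), W.shape)[1]   # channel holding the outlier
W_split = np.hstack([W, W[:, [j]] / 2.0])              # append a halved copy...
W_split[:, j] /= 2.0                                   # ...and halve the original
x_split = np.append(x, x[j])                           # duplicate that input channel

assert np.allclose(W_split @ x_split, y)               # functionally identical
print("max |W| before/after:", np.abs(W).max(), np.abs(W_split).max())
```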


Tuesday, April 23, 2019

PMTest: A Fast and Flexible Testing Framework for Persistent Memory Programs


(Sihang Liu presenting on Wed. 4/24 at the Task 2.4 liaison meeting)

Recent non-volatile memory technologies such as 3D XPoint and NVDIMMs have enabled persistent memory (PM) systems that can manipulate persistent data directly in memory. This advancement of memory technology has spurred the development of a new set of crash-consistent software (CCS) for PM - applications that can recover persistent data from memory in a consistent state in the event of a crash (e.g., power failure). CCS developed for persistent memory ranges from kernel modules to user-space libraries and custom applications. However, ensuring crash consistency in CCS is difficult and error-prone. Programmers typically employ low-level hardware primitives or transactional libraries to enforce ordering and durability guarantees that are required for ensuring crash consistency. Unfortunately, hardware can reorder instructions at runtime, making it difficult for the programmers to test whether the implementation enforces the correct ordering and durability guarantees. 

We believe that there is an urgent need for developing a testing framework that helps programmers identify crash consistency bugs in their CCS. We find that prior testing tools lack generality, i.e., they work only for one specific CCS or memory persistency model and/or introduce significant performance overhead. To overcome these drawbacks, we propose PMTest, a crash consistency testing framework that is both flexible and fast. PMTest provides flexibility through two basic assertion-like software checkers that test two fundamental characteristics of all CCS: the ordering and durability guarantees. These checkers can also serve as the building blocks of other application-specific, high-level checkers. PMTest enables fast testing by deducing the persist order without exhausting all possible orders. In an evaluation with eight programs, PMTest not only identified 45 synthetic crash consistency bugs, but also detected 3 new bugs in a file system (PMFS) and in applications developed using a transactional library (PMDK), while on average being 7.1x faster than the state-of-the-art tool.
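To give a feel for what assertion-like ordering and durability checkers do, here is a hypothetical, heavily simplified Python sketch; PMTest itself is a C/C++ framework and its real interface and persist-order inference are different.

```python
# Hypothetical, heavily simplified sketch of assertion-like crash-consistency
# checkers in the spirit of PMTest (the real tool's interface differs).
# We record (write, flush, fence) events and check ordering/durability on them.
events = []   # (op, addr) tuples in program order

def pm_write(addr): events.append(("write", addr))
def pm_flush(addr): events.append(("flush", addr))
def pm_fence():     events.append(("fence", None))

def is_persisted(addr, upto=None):
    """Durable once the last write to addr is followed by a flush, then a fence."""
    trace = events if upto is None else events[:upto]
    w = max((i for i, (op, a) in enumerate(trace) if op == "write" and a == addr),
            default=None)
    if w is None:
        return False
    f = next((i for i in range(w + 1, len(trace)) if trace[i] == ("flush", addr)), None)
    return f is not None and any(op == "fence" for op, _ in trace[f + 1:])

def assert_persisted_before(addr_a, addr_b):
    """Ordering checker: addr_a must be durable before the last write to addr_b."""
    wb = max(i for i, (op, a) in enumerate(events) if op == "write" and a == addr_b)
    assert is_persisted(addr_a, upto=wb), f"{addr_a} not persisted before {addr_b}"

# Example: the log entry must be durable before the commit flag is published.
pm_write("log"); pm_flush("log"); pm_fence()
pm_write("commit_flag")
assert_persisted_before("log", "commit_flag")
```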

Tuesday, April 16, 2019

Tuning Applications for Efficient GPU Offloading to In-memory Processing


(Presenting on Wed. 04/17/2019 at 2:00PM ET)

Authors: Yudong Wu, Mingyao Shen, Yi Hui Chen, Yuanyuan Zhou


Data movement between processors and main memory is a critical bottleneck for data-intensive applications. This problem is more severe with Graphics Processing Units (GPUs) due to their massively parallel data processing capability. Recent research has shown that Processing-in-Memory (PIM) can greatly alleviate the data movement bottleneck by reducing traffic between GPUs and memory devices: it offloads a relatively small execution context instead of transferring massive amounts of data between memory devices and processors. However, conventional application code that is highly optimized for locality, so that it executes efficiently on GPUs, is not a natural match for PIM offloading. To address this challenge, our project investigates how application code can be restructured to improve the benefit of PIM offloading from GPUs. In addition, we study approaches to dynamically determine how much to offload, as well as how to leverage all resources, including GPUs, when offloading, to achieve the best possible overall performance. In our experimental evaluation over 14 applications, our approach improves application offloading performance by 21% on average.

Tuesday, April 9, 2019

PIMCloud: Exploring Near-data Processing for Interactive Cloud Services

Shuang Chen from Cornell University
Presenting Wednesday, Apr 10 at 2:00PM EDT

Data centers host latency-critical (LC) as well as best-effort jobs. The former rely critically on an adequate provisioning of hardware resources to meet their QoS target. Many recent industry implementations and research proposals assume a single LC application per node; this is in part to make it easy to carve out resources for that one LC application, while allowing best-effort jobs to compete for the rest.

Two big changes in data centers, however, are about to shake up the status quo. First, the microservices model is making the number of LC applications hosted in data centers explode, making it impractical (and inefficient) to assume one LC application per node. That means multiple LC applications competing for resources on a single node, each with its own QoS needs. Second, the arrival of processing-in-memory (PIM) capabilities introduces a complex scheduling challenge (and opportunity).

In this talk, I will present our PIMCloud project, show some initial results, and discuss our ongoing work. First, I will present PARTIES, a novel hardware resource manager that enables successful colocation of multiple LC applications on a single node of a traditional data center. (This work will be presented next week at ASPLOS 2019.) Second, I will discuss how we envision augmenting this framework to accommodate PIM capabilities. Specifically, I will discuss some challenges and opportunities in future nodes where memory channels are themselves compute-capable.

Tuesday, March 19, 2019

Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach

(Hengyu Zhao Presenting on Wed. 3/20 at 2:00PM ET)


Authors: Hengyu Zhao, Jiawen Liu, Matheus Almeida Ogleari, Dong Li, Jishen Zhao



Abstract: Neural networks (NNs) have been adopted in a wide range of application domains, such as image classification, speech recognition, object detection, and computer vision. However, training NNs – especially deep neural networks (DNNs) – can be energy and time consuming, because of frequent data movement between processor and memory. Furthermore, training involves massive fine-grained operations with various computation and memory access characteristics. Exploiting high parallelism with such diverse operations is challenging. To address these challenges, we propose a software/hardware co-design of a heterogeneous processing-in-memory (PIM) system. Our hardware design incorporates hundreds of fixed-function arithmetic units and ARM-based programmable cores on the logic layer of a 3D die-stacked memory to form a heterogeneous PIM architecture attached to the CPU. Our software design offers a programming model and a runtime system that program, offload, and schedule various NN training operations across the compute resources provided by the CPU and the heterogeneous PIM. By extending the OpenCL programming model and employing a hardware heterogeneity-aware runtime system, we enable high program portability and easy program maintenance across various heterogeneous hardware, optimize system energy efficiency, and improve hardware utilization.

Thursday, March 14, 2019

A Binary Learning Framework for Hyperdimensional Computing


(Presenting Fri. 3/15/19 at 2:00PM ET)
Authors: Mohsen Imani, John Messerly, Fan Wu, Wang Pi, and Tajana S. Rosing

Brain-inspired Hyperdimensional (HD) computing is a computing paradigm emulating a neuron's activity in high-dimensional space. In practice, HD first encodes all data points into high-dimensional vectors, called hypervectors, and then performs the classification task in an efficient way using a well-defined set of operations. In order to provide acceptable classification accuracy, current HD computing algorithms need to map data points to hypervectors with non-binary elements. However, working with non-binary vectors significantly increases the HD computation cost and the memory requirements for both training and inference. In this paper, we propose BinHD, a novel learning framework which enables HD computing to be trained and tested using binary hypervectors. BinHD encodes data points into binary hypervectors and provides a framework which enables HD to perform the training task with a significantly lower resource and memory footprint. In inference, BinHD binarizes the model and simplifies the costly cosine similarity used in existing HD computing algorithms to a hardware-friendly Hamming distance metric. In addition, for the first time, BinHD introduces the concept of a learning rate in HD computing, which gives HD an extra knob to control training efficiency and accuracy. We accordingly design digital hardware to accelerate BinHD computation. Our evaluations on four practical classification applications show that BinHD in training (inference) can achieve 12.4× and 6.3× (13.8× and 9.9×) energy efficiency and speedup, respectively, compared to the state-of-the-art HD computing algorithm while providing similar classification accuracy.
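The sketch below shows the binary-hypervector pipeline in miniature with NumPy: random item hypervectors, a toy encoder, majority-vote bundling for training, and Hamming-distance lookup for inference. The encoder and dimensions are illustrative, not BinHD's exact formulation.

```python
# Miniature NumPy sketch of binary HD classification: random item hypervectors,
# a toy encoder, majority-vote bundling for training, Hamming distance for
# inference. Illustrative only; not BinHD's exact encoder or parameters.
import numpy as np

D = 4096                                    # hypervector dimensionality
rng = np.random.default_rng(0)
num_features, num_classes = 32, 3
item_memory = rng.integers(0, 2, size=(num_features, D), dtype=np.uint8)

def encode(sample):
    """Toy binary encoder: majority vote over the hypervectors of active features."""
    active = sample > 0
    votes = item_memory[active].sum(axis=0)
    return (votes > active.sum() / 2).astype(np.uint8)

X = rng.standard_normal((60, num_features))
y = rng.integers(0, num_classes, size=60)

# Training: bundle (bitwise majority) the encoded hypervectors of each class.
class_hvs = np.zeros((num_classes, D), dtype=np.uint8)
for c in range(num_classes):
    enc = np.array([encode(s) for s in X[y == c]])
    class_hvs[c] = (enc.sum(axis=0) > len(enc) / 2).astype(np.uint8)

# Inference: nearest class hypervector under Hamming distance.
query = encode(X[0])
hamming = (class_hvs != query).sum(axis=1)
print("predicted class:", hamming.argmin())
```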



Tuesday, March 12, 2019

Persistence Parallelism Optimization: a Holistic Approach from Memory Bus to RDMA Network

Xing Hu from the University of California, Santa Barbara
Presenting Wednesday, March 13th at 2:00PM CST 
Emerging non-volatile memories (NVMs), such as phase change memory (PCM) and resistive RAM (ReRAM), incorporate the features of fast byte-addressability and data persistence, which are beneficial for data services such as file systems and databases. To support data persistence, a persistent memory system requires ordering for write requests. The data path of a persistent request consists of three segments: through the cache hierarchy to the memory controller, through the bus from the memory controller to the memory devices, and through the network from a remote node to a local node. Previous work has contributed significantly to improving persistence parallelism in the first segment of the data path. However, we observe that the memory bus and the Remote Direct Memory Access (RDMA) network remain severely under-utilized because the persistence parallelism in these two segments is not fully leveraged during ordering.

We propose an architecture to further improve the persistence parallelism in the memory bus and the RDMA network. First, we utilize inter-thread persistence parallelism for barrier epoch management with better bank-level parallelism (BLP). Second, we enable intra-thread persistence parallelism for remote requests through RDMA network with buffered strict persistence. With these features, the architecture efficiently supports persistence through all three segments of the write datapath. Experimental results show that for local applications, the proposed mechanism can achieve 1.3× performance improvement, compared to the original buffered persistence work. In addition, it can achieve 1.93× performance improvement for remote applications serviced through the RDMA network.  

Friday, March 8, 2019

Image Analysis at Run-Time


Bin Li from UW-Madison


Presenting Friday, March 8th at 1:00PM CST

Whether experiments are focused and small-scale or automated and high-throughput, effectively quantifying images is now a critical, widespread need that continues to grow in many ways:

Scale. Automated microscopes can acquire millions of images faster than they can be analyzed by eye. Large-scale computing resources for analysis are often inaccessible to the biologists who need them.

Size. Many experiments, from basic research to patient studies, involve huge images, often 10,000+ pixels in each dimension, using light sheet microscopy and/or large field-of-view detectors.

Dimensionality. Researchers are performing quantitative, higher-throughput experiments using these multi-dimensional image types via time-lapse, three-dimensional, & multi-spectral imaging.

Scope. Researchers need to create complex workflows involving software for microscope control, high-throughput image processing, cloud computing, deep learning, & data mining.

Modality. Researchers are struggling to identify and apply appropriate analytical methods for the explosion of novel types of microscopy, including super-resolution, single-molecule, and others.

Complexity. Microscopy is now being used for profiling, to extract a “signature” of samples based on morphological measurements of imaged cells and microenvironment, often more subtle & complex than humans perceive.

The overall goal of quantitative and systematic biomedical imaging is to have a general platform for run-time computer vision applications with hardware acceleration, integrating knowledge from biology, instrumentation, engineering, and computer science, so that new instruments with run-time processing ability can address many current challenges and allow us to thoroughly study nature's variability.

Wednesday, March 6, 2019

RISC-V Support for Efficient Hardware Undo+Redo Logging in Persistent Memory Systems

Persistent memory is a new tier of memory that combines the benefits of both storage systems and main memory. It has the data persistence of storage with the fast load/store interface of memory. Most previous persistent memory designs place careful control over the order of writes arriving at persistent memory. This can prevent caches and memory controllers from optimizing system performance through write coalescing and reordering. This write-order control can be relaxed by employing undo+redo logging for data in persistent memory systems. However, traditional software logging mechanisms are expensive to adopt in persistent memory due to performance and energy overheads. Previously proposed hardware logging schemes are inefficient and do not fully address the issues in software.
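For readers unfamiliar with undo+redo logging, the following is a purely software illustration of the invariant the logging maintains (the talk's scheme enforces it in hardware, in the cache hierarchy, not in application code): both the old and the new value are made durable before the in-place update, so a crash can always roll back or roll forward.

```python
# Purely software illustration of the undo+redo invariant (the proposed design
# enforces this in hardware, not in application code): make the old value
# (undo) and the new value (redo) durable before updating data in place.
log = []                     # stand-in for a persistent log region
data = {"balance": 100}      # stand-in for data in persistent memory

def persist(entry):
    log.append(entry)        # imagine a cache-line write-back + fence here

def logged_update(key, new_value):
    persist(("undo", key, data[key]))    # old value first
    persist(("redo", key, new_value))    # then the new value
    data[key] = new_value                # only now update in place

def recover(crashed_data):
    """Roll forward using redo records; undo records would be used to roll back
    an update whose redo record never became durable."""
    for kind, key, value in log:
        if kind == "redo":
            crashed_data[key] = value
    return crashed_data

logged_update("balance", 250)
print(recover({"balance": 100}))         # {'balance': 250}
```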

To address these challenges, we propose a hardware undo+redo logging scheme which maintains data persistence by leveraging the write-back, write-allocate policies used in commodity caches. Furthermore, we develop a cache force-write-back mechanism in hardware to significantly reduce the performance and energy overheads from forcing data into persistent memory. The evaluation across persistent memory microbenchmarks and real workloads demonstrates that this design significantly improves system throughput and reduces both dynamic energy and memory traffic. It also provides strong consistency guarantees compared to software approaches.

Additionally, most persistent memory research is done using x86 due to extensive support in its instruction set. RISC-V, in contrast, is widely used in academia and, more recently, in industry research. In this work, we also propose changes to RISC-V to support persistent memory. We fully integrate persistent memory and logging into a RISC-V system running on an FPGA as a proof of concept. This implementation enables us to identify key challenges and optimizations for persistent memory not found on other ISAs. It also introduces new avenues of research into persistent memory using different architectures. Additionally, we make RISC-V compatible with existing persistent memory work, including benchmarks, file systems, and logging mechanisms.

Wednesday, February 20, 2019

Initial Design of the Hustle Type System

Matthew Dutson from UW-Madison
Presenting Friday, Feb 22 at 1:00PM CST

Hustle is a scalable, relational data platform currently under development. It has a microservices architecture for maximum flexibility and adaptability. In implementing the type system for Hustle's execution engine, we consider a number of design tradeoffs. We seek to balance abstraction with performance, doing so in a way that allows future developers to easily add custom types and expand the functionality of existing types. We look at factors including the cost of virtual function lookups and dealing with generic types whose sizes are unknown at compile time. We also analyze the tradeoffs of various null value representations, looking to minimize wasted space while conforming to users' expectations of null semantics and behavior.

Tuesday, February 19, 2019

Exploiting HMC Characteristics

Today's computer systems face the memory wall problem: DRAM can provide only a fraction of the bandwidth that a multi-core processor can utilize. One promising emerging memory is the Hybrid Memory Cube (HMC). HMC promises to provide higher bandwidth and better random access performance (than DRAM) to data-hungry applications. HMC also has applications in near-memory computing.


In this work, hardware and software researchers at the University of Wisconsin are collaborating to examine how database operations could potentially leverage HMC for query execution. While running a full-fledged database engine on HMC is our target, this task is onerous, as porting a full-fledged database platform to new hardware like the HMC (and leveraging the in-memory computing component) is estimated to take a few person-years. So we take an initial first step by breaking down higher-level database operations into a set of four data access kernels. We encapsulate these kernels in a new benchmark called the Four Bases benchmark, as we think these four primitives can be used to compose the DNA of most higher-level database operations. Studying the behaviour of hardware on this new micro-benchmark lets us explore the hardware-software synergy now, and provides insights into what changes may be needed on both the software and hardware sides to make use of such memory systems.


We have an initial implementation of the Four Bases benchmark on an FPGA-based RISC-V platform that connects to HMC. We observe that the random access latency is about 2x that of sequential access, which is much smaller than the 10x that is often seen with DRAM. This behavior may impact the cost model of a database query optimizer and may require rethinking database operator algorithms.


In this talk, we will present our work in this area, share early results, and seek feedback and collaboration from the broader CRISP community.


This is joint work with Prof. Li's group.

Friday, February 15, 2019

Passive Memristor Crossbars with Etch-Down Fabrication and Fine-Tuning Characteristics

(Presenting April 8, 2019) Memristors and their arrays with tunable conductance states are promising building blocks for the hardware implementation of artificial neural networks with analog computation. In this work, we report a CMOS-compatible, etch-down fabrication technique for passive memristor crossbar arrays and its analog computing application. Switching and fine-tuning characteristics of memristor devices are presented across a 64×64 crossbar array. In addition, vector-matrix multiplication (VMM) is demonstrated with the forward inference of a single-layer neural network for MNIST image recognition, and its accuracy is analyzed as a function of programming precision.
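The analog computation being exploited is just a vector-matrix multiply realized by Ohm's and Kirchhoff's laws; the NumPy sketch below models it, with 16 discrete conductance levels standing in for finite programming (fine-tuning) precision. The numbers are illustrative, not measurements from the fabricated array.

```python
# NumPy model of the crossbar's analog vector-matrix multiply: row voltages
# times programmed conductances, summed as column currents. Sixteen discrete
# conductance levels stand in for finite fine-tuning precision (illustrative).
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(0, 1, size=(64, 10))        # target weights, e.g. one NN layer

levels = 16                                       # programmable conductance levels
g = np.round(weights * (levels - 1)) / (levels - 1)

v_in = rng.uniform(0, 1, size=64)                 # input voltages on the rows
i_out = v_in @ g                                  # column currents = VMM result

print("max |error| from finite precision:", np.abs(i_out - v_in @ weights).max())
```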

Wednesday, February 13, 2019

A survey of query optimization methods

(Presenting Fri. February 15, 2019) Query optimizers are fundamental components of database systems that are responsible for producing good execution plans. To navigate the search space of execution plans, optimizers use transformation rules and a search strategy. Extending, maintaining, and debugging a query optimizer entails understanding and modifying the library of rules. Today, query optimizers use object-oriented languages to specify the valid transformations and search strategies, which results in a sizable and complex codebase. In Hustle, a new database engine we are developing, our goal is to express the transformation rules in a domain-specific language to minimize code complexity and produce a concise and maintainable query optimizer.

Persistent Memory and Architecture Implication

(Larry Chiu of IBM presenting Wednesday, February 13 at 2:00PM ET)


Opportunities and challenges surround new persistent memory and storage hierarchies. New rules break away from the architecture and design assumptions of the past. Larry will discuss factors that may accelerate or reduce the adoption of these new technologies in enterprise storage use cases.

Larry Chiu, Director of Global Storage Systems, is leading IBM Storage Research Strategy across a wide range of storage and data management technologies. Larry developed the first IBM enterprise RAID storage, founded the industry's first IBM storage virtualization appliance, and built the industry's first autonomic storage tiering engine for the enterprise market. Larry has a Master of Science in Computer Engineering from the University of Southern California and a Master of Science in Technology Commercialization from the University of Texas at Austin.

Sunday, February 10, 2019

GRAM: Graph Processing in a ReRAM-based Computational Memory

The performance of graph processing for real-world graphs is limited by inefficient memory behavior in traditional systems because of random memory access patterns. Offloading computations to the memory is a promising strategy to overcome such challenges. In this work, we exploit resistive memory (ReRAM)-based processing-in-memory (PIM) technology to accelerate graph applications. The proposed solution, GRAM, can efficiently execute the vertex-centric model, which is widely used in large-scale parallel graph processing programs, in the computational memory. The hardware-software co-design used in GRAM maximizes computation parallelism while minimizing data movement.
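For reference, the vertex-centric model that GRAM targets looks like the following plain-Python PageRank loop: each vertex repeatedly gathers contributions from its neighbors and updates its own value. In GRAM this gather/update work is carried out inside the ReRAM arrays rather than by a software loop.

```python
# Plain-Python sketch of the vertex-centric model (PageRank): each vertex
# gathers contributions from its in-neighbors and updates its own value.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0)]            # (src, dst)
num_v = 4
out_deg = [sum(1 for s, _ in edges if s == v) for v in range(num_v)]

rank = [1.0 / num_v] * num_v
damping = 0.85
for _ in range(20):
    contrib = [rank[v] / out_deg[v] for v in range(num_v)]
    new_rank = [(1 - damping) / num_v] * num_v
    for src, dst in edges:                                   # "gather" along edges
        new_rank[dst] += damping * contrib[src]
    rank = new_rank                                          # "apply" phase

print([round(r, 3) for r in rank])
```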

Monday, February 4, 2019

Compiler and Hardware Design for In-Memory and Near-Memory Acceleration of Cognitive Applications


(Sitao Huang, Vikram Sharma Mailthody, and Zaid Qureshi presenting Wed. 2/6 at 2PM ET)

The increasing deployment of machine learning for applications such as image analytics and search has resulted in a number of special-purpose accelerators in this domain. In this talk, we present two of our recent works. The first work is a compiler that optimizes the mapping of the computation graph for efficient execution on a memristor-based hybrid (analog-digital) deep learning accelerator. By building the compiler, we have made special-purpose accelerators more accessible to software developers and enabled the generation of better-performing executables. In our second work, we are exploring near-memory acceleration for applications like image retrieval. These applications are constrained by the bandwidth available to accelerators such as GPUs and could potentially benefit from near-memory processing.

Friday, February 1, 2019

Hopscotch: A Microbenchmark for Memory Performance Evaluation

(Presenting Friday, Feb 01, 2019) Due to the ever-increasing gap between the speed of computing elements and the speed at which memory systems can provide data, we are hitting a memory wall. As the CRISP center comes up with new memory architectures to fight this issue, we need good memory benchmarks to evaluate the performance of these emerging architectures. Ideally, these benchmarks should help us find memory system bottlenecks, should be able to profile read and write channels independently and combined, and should be tunable to match an application of interest. Existing memory benchmarks such as STREAM, ApexMAP, and Spatter evaluate memory performance using interesting access patterns. However, they are not flexible enough to allow evaluating all combinations of read and write patterns. We are developing a micro-benchmark in which we can tune the access pattern of the read and write channels independently, using parameters that control spatial and temporal locality (see the sketch below). The benchmark will include a collection of kernels for exercising different areas of the memory system. To ensure zero redundancy, we are going to employ diversity analysis using metrics like stall percentage, read/write ratio, median stride length, and unit-stride percentage. The current implementation supports a few interesting read and write patterns on CPU and GPU platforms using OpenMP and CUDA. Future plans include exploring the effect of memory controller scheduling and finding new patterns in major application domains by eliminating the patterns we have already covered.
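A minimal sketch of the kind of tunable kernel described (illustrative Python, not the actual Hopscotch code): read and write index streams are generated with independent stride parameters, so the locality of each channel can be varied separately.

```python
# Illustrative Python sketch (not the actual Hopscotch code): read and write
# index streams with independently tunable strides, so the spatial locality of
# each channel can be varied on its own.
import numpy as np, time

N = 1 << 22
src = np.arange(N, dtype=np.int64)
dst = np.zeros(N, dtype=np.int64)

def kernel(read_stride, write_stride):
    n = N // max(read_stride, write_stride)
    r_idx = (np.arange(n) * read_stride) % N     # strided read addresses
    w_idx = (np.arange(n) * write_stride) % N    # strided write addresses
    t0 = time.perf_counter()
    dst[w_idx] = src[r_idx]
    return time.perf_counter() - t0

for rs, ws in [(1, 1), (1, 64), (64, 1), (64, 64)]:
    print(f"read_stride={rs:3d} write_stride={ws:3d}  {kernel(rs, ws):.4f}s")
```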

Thursday, January 31, 2019

A little latency goes a long way: Memory latency and its impact on CPU-driven inference performance

(Ameen Akel of Micron is presenting Mon. Feb 4, 2019)

Memory media latencies are often a hot topic: systems architects often seek the lowest possible latencies, which, in turn, drives memory manufacturers to architect memories for the minimum possible access latencies. While some applications may benefit from lower media latencies, most applications exhibit very little sensitivity. We aim to dispel the memory latency myth: the performance of workloads like DNN inference, across a wide variety of models, is not strongly correlated with memory media latency. Freeing memory companies of the memory latency shackles enables favorable memory media architecture tradeoffs that systems architects may not expect.

Monday, January 28, 2019

Hierarchical and Distributed Machine Learning Inference Beyond the Edge

(Presenting Wednesday Jan 30, 2019) Networked applications with heterogeneous sensors are a growing source of data in the Internet of Things (IoT) environment. Many IoT applications use machine learning (ML) to make real-time predictions. The current dominant approach to deploying ML inference is monolithic, i.e., when inference needs to be performed using data generated by multiple sensors, the features generated by each sensor are joined in a centralized cloud-based tier to perform the inference computation. Since inference typically occurs with high frequency, the monolithic approach can quickly lead to burdensome levels of communication, which wastes energy, reduces data privacy, and often bottlenecks the network, violating real-time constraints. In this work, we study a novel approach that mitigates these issues by “pushing” ML inference computations out of the cloud and onto a hierarchy of IoT devices, which compute successively more compressed representations of raw sensor data. We present a new technical challenge of “rewriting” the functional form of an ML inference computation to factor it over a network of devices without significantly reducing prediction accuracy. We present novel “hierarchy-aware” neural network architectures which enable users to trade off between communication cost and accuracy. We also present novel exact factoring algorithms for other popular ML models, including gradient boosted trees and random forests, that preserve accuracy while substantially reducing communication. We evaluate our approach with three real-world problems: urban energy demand prediction, human activity prediction, and server performance prediction. Our approach yields substantial reductions in energy use and latency on IoT devices, while providing the same level of prediction quality as the current monolithic inference. Measurements on a common IoT device show that energy use and latency can be reduced by up to 63% and 67%, respectively, without reducing accuracy relative to the full-communication setting.
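A toy example of the "rewriting" idea for the simplest possible model, a linear predictor: each device ships only its partial dot product instead of its raw features, and the aggregated prediction is bit-for-bit identical. (Names and sizes below are made up; the talk's factoring algorithms for neural networks and tree ensembles are more involved.)

```python
# Toy NumPy example of factoring a linear model over devices: each sensor node
# ships one partial dot product instead of its raw feature vector.
import numpy as np

rng = np.random.default_rng(0)
features_a = rng.standard_normal(100)    # raw data at sensor A
features_b = rng.standard_normal(100)    # raw data at sensor B
w_a = rng.standard_normal(100)
w_b = rng.standard_normal(100)
bias = 0.3

# Monolithic: ship 200 raw features to the cloud and compute there.
monolithic = w_a @ features_a + w_b @ features_b + bias

# Factored: each device computes its partial locally and ships one scalar.
partial_a = w_a @ features_a             # computed on device A
partial_b = w_b @ features_b             # computed on device B
factored = partial_a + partial_b + bias  # aggregation tier just sums

assert np.isclose(monolithic, factored)
print("communication: 200 floats -> 2 floats")
```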


Sunday, January 27, 2019

PIMProf: A Performance Profiler for Processing-in-Memory Architectures


Monday, 1/28/19 at 2:00PM ET – Task 1.5 (Evaluation Through Architectural Simulation and Prototyping)
PIM architectures have drawn increasing research interest as a mitigation of the data movement bottleneck in current DRAM-based architectures, and a variety of them have been proposed for accelerating various data-intensive workloads. However, for a given workload, it is difficult to determine which part of a program should be offloaded to a given PIM architecture to gain the best performance and how much performance gain is possible. We propose PIMProf, a tool that uses a combination of static and runtime analysis to automatically detect PIM candidates and estimate the speedup of the program. Our key ideas are as follows: First, PIMProf uses static analysis to capture the dependency between computation and data access, and constructs both a control flow graph and a data dependency graph of the program. Second, PIMProf profiles the computation cost and memory access cost of the program, and attributes the costs to the nodes and edges of the graph. Finally, we show how to formalize the PIM offloading problem as a cost-minimization problem on the weighted graph.
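The kind of cost-minimization problem this leads to can be sketched in a few lines: give each program region a CPU cost and a PIM cost (node weights), charge a transfer cost when adjacent regions land on different sides (edge weights), and pick the cheapest placement. The brute-force search and all the numbers below are illustrative only, not PIMProf's actual formulation.

```python
# Brute-force sketch of a weighted-graph offloading cost model: node weights
# are per-region CPU/PIM costs, edge weights are data-transfer costs paid when
# adjacent regions are placed on different sides. All numbers are made up.
from itertools import product

regions = ["init", "gather", "compute", "reduce"]
cpu_cost = {"init": 5, "gather": 40, "compute": 10, "reduce": 8}
pim_cost = {"init": 9, "gather": 12, "compute": 30, "reduce": 6}
transfer = {("init", "gather"): 4, ("gather", "compute"): 15, ("compute", "reduce"): 3}

def total_cost(placement):               # placement maps region -> "cpu" | "pim"
    cost = sum(cpu_cost[r] if placement[r] == "cpu" else pim_cost[r] for r in regions)
    cost += sum(w for (a, b), w in transfer.items() if placement[a] != placement[b])
    return cost

best = min((dict(zip(regions, p)) for p in product(["cpu", "pim"], repeat=len(regions))),
           key=total_cost)
print(best, "->", total_cost(best))
```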

Thursday, January 24, 2019

Monolithic-3D Integration Augmented Design Techniques for Computing in SRAMs

Presenting Friday, January 25th, at 2PM EST
Task 3.6: Cognitive Architectures
In-memory computing has emerged as a promising solution to address the logic-memory performance gap. We propose design techniques using monolithic-3D integration to achieve reliable multi-row activation, which in turn helps perform computation as part of data readout. Our design is 1.8x faster than existing techniques for Boolean computations. We quantitatively show no impact on cell stability when multiple rows are activated, thereby requiring no extra hardware for maintaining cell stability during computations. An in-memory digital-to-analog conversion technique is proposed using a 3D-CAM primitive. The design effectively utilizes relatively low-strength layer-2 transistors and provides 7x power savings when compared with a specialized in-memory converter. Lastly, we present a linear classifier system that makes use of the above-mentioned techniques and is 47x faster while computing vector-matrix multiplication using a dedicated hardware engine.

Tuesday, January 22, 2019

Wearout and active accelerated recovery for processing in emerging technology memories

(Presenting Wednesday, January 23rd, at 2pm EST) Trying to break the memory wall includes various efforts to bring memory closer to the processor, or to push processing into the memory stack, generally called Processing in Memory (PIM). If the memory is one of the various "flavors" of non-volatile emerging memory technologies, such as spin-transfer torque RAM (STTRAM), phase-change memory (PCM), resistive RAM (RRAM), memristors, 3D XPoint, etc., limited endurance becomes an important issue in addition to all the other general challenges common to PIM. Endurance refers to the fact that most non-volatile memories, including all these new emerging technologies, but also more traditional ones like Flash and EEPROM, have a limited lifetime in terms of how many times they can be written and erased - once the limit is exceeded, the number of faults in the memory increases rapidly and the memory device can no longer be used reliably.

For storage applications (the main use of non-volatile memories until now), one way to deal with the limited endurance is to overprovision the device (i.e., leave some of the native capacity unutilized up-front and allocate it later as memory blocks start failing due to the limited endurance) and to use wear leveling by adding a level of memory virtualization in the form of a Flash Translation Layer (FTL) that maps logical blocks to physical blocks in a dynamic way, such that write/erase cycles are more or less equally distributed across the physical blocks and no single block gets overwritten too many times in a row. Although the concept of an FTL was first introduced for Flash, similar mechanisms will work (and will likely be necessary) for all emerging memory technologies with limited endurance (e.g., although not explicitly stated, it is likely that Intel Optane 3D XPoint uses a similar mechanism for wear leveling). FTLs are an OK solution for storage applications but are suboptimal (to say the least) for main memory applications, and even more so for processing in memory, both from a latency point of view during normal logical-to-physical mapping, and especially because of the extra-long delays needed for moving data when a logical block has to be re-allocated to a new physical block. Because of this, methods that intrinsically compensate for the limited endurance would be especially preferable for PIM.

One such method is to take advantage of the recovery mechanisms associated with the stress that leads to the limited endurance in the first place. Since stress and wearout are mechanisms that take a device out of physical equilibrium, it turns out that simple thermodynamics tends to partially reverse the effect of stress when the stress is removed. This is a general physics argument that has been experimentally demonstrated for several wearout mechanisms, including Flash wearout, but also other more general ones, such as NBTI/PBTI, hot electrons, electromigration, etc. In this talk I will go over several of these mechanisms and explain the source of stress and ways to reverse it. The main idea is to go beyond simple passive recovery (just removing stress and waiting) by reversing the direction of stress (active recovery) and accelerating the process (e.g., by increasing temperature). Such accelerated active recovery can lead to many orders of magnitude improvement in endurance, thus making processing in emerging technology memories practical.

Friday, January 18, 2019

Deep Learning for Pancreatic Cancer Histopathology Image Analysis

(Adib Keikhosravi / Kevin Eliceiri Presenting Fri. 1/18/19) 
Whole slide imaging (WSI), or virtual microscopy, is an imaging modality used to convert animal or human pathology tissue slides into digital images for teaching, research, or clinical applications. This method is popular due to education and clinical demands. Although modern whole slide scanners can now scan tissue slides at high resolution in a relatively short period of time, significant challenges, including the high cost of equipment and data storage, remain unsolved. Machine learning and deep learning techniques in Computer Aided Diagnosis (CAD) platforms have begun to be widely used for biomedical image analysis by physicians and researchers. We are trying to build a platform for histopathological image super-resolution and cancer grading and staging, with the main focus on pancreatic cancer. We present a computational approach for improving the resolution of images acquired from commonly available low-magnification commercial slide scanners. Images from such scanners can be acquired cheaply and are efficient in terms of storage and data transfer. However, they are generally of poorer quality than images from high-resolution scanners and microscopes and do not have the resolution needed in diagnostic or clinical environments, and hence are not used in such settings. First, we developed a deep learning framework that implements regularized sparse coding to smoothly reconstruct high-resolution images, given their low-resolution counterparts. Results show that our method indeed produces images which are similar to images from high-resolution scanners, both in quality and in quantitative measures, and compares favorably to several state-of-the-art methods across a number of test images. To further improve the results, we used a convolutional neural network (CNN)-based approach, which is specifically trained to take low-resolution slide scanner images of cancer data and convert them into high-resolution images. We validate these resolution improvements with computational analysis to show that the enhanced images offer the same quantitative results. This project is still ongoing, and we are now trying to use intermediate resolutions to improve image quality using recurrent neural networks. On the other hand, current approaches for pathological grading/staging of many cancer types, such as breast and pancreatic cancer, lack accuracy and interobserver agreement. Google Research recently used Inception for high-accuracy tumor cell localization. However, as our group has been discovering the prognostic role of stromal reorganization in different cancer types, including pancreatic cancer, which is projected to be the second leading cause of cancer-related death by 2030, we use a holistic approach that includes both stroma and cells from small TMA punches of different grades of cancer, accompanied by normal samples. For this study we used transfer learning from four award-winning networks, VGG16, VGG19, GoogLeNet, and ResNet101, for the task of pancreatic cancer grading. Although all these networks have shown great performance on natural image classification, ResNet showed the highest performance, with 88% accuracy in four-tier grading and higher accuracy in all one-by-one comparisons among normal samples and the different grades. We fine-tuned this network again for different TNM classification and staging tasks, and although all the images were selected from small regions of the pancreas, the results show the promising capability of CNNs in helping pathologists with diagnosis.
To achieve higher accuracy, we have almost doubled the size of the dataset; training is still running, and we will update the audience in future talks.

Monday, January 14, 2019

Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks


(Presenting 1/16)

Emerging fast, non-volatile memories will enable systems with large amounts of non-volatile main memory (NVMM) attached to the CPU memory bus, bringing the possibility of dramatic performance gains for IO-intensive applications. This paper analyzes the impact of state-of-the-art NVMM file systems on some of these applications and explores how those applications best leverage the performance that NVMMs offer.
Our analysis leads to several conclusions about how systems and applications should adapt to NVMMs. We propose FiLe Emulation with DAX (FLEX), a technique for moving file operations into user space, and show that it and other simple changes can dramatically improve application performance. We examine the scalability of NVMM file systems in light of rising core counts and pronounced NUMA effects in modern systems, and propose changes to Linux's virtual file system (VFS) to improve scalability. We also show that adding NUMA-aware interfaces to an NVMM file system can significantly improve performance.

String Figure: A Scalable and Elastic Memory Network Architecture

(Presenting Mon 1/14)

The demand for server memory capacity and performance has been rapidly increasing due to the expanding working set sizes of modern applications, such as big data analytics, in-memory computing, deep learning, and server virtualization. One promising technique for tackling these requirements is the memory network, where the server memory system consists of multiple 3D die-stacked memory nodes interconnected by a high-speed network. However, current memory network designs face substantial scalability challenges, including (1) maintaining high throughput and low latency in large-scale memory networks at low hardware cost, (2) efficiently interconnecting an arbitrary number of memory nodes, and (3) supporting flexible memory network scale expansion and reduction without major modification of the memory network design and physical implementation.

To address these challenges, we propose String Figure, a high-throughput, elastic, and scalable memory network architecture. String Figure consists of three design components. First, we propose an algorithm to generate random interconnect topologies that achieve high network throughput and near-optimal path lengths in large-scale memory networks with over one thousand nodes; our topology also ensures that the number of required router ports does not increase as the network scale grows. Second, we design a compute+table hybrid routing protocol that reduces both the computation and storage overhead of routing. Third, we propose a set of network reconfiguration mechanisms that allows both static and dynamic network scale expansion and reduction. Our experiments based on RTL simulation demonstrate that String Figure can interconnect over one thousand memory nodes with a shortest path length within five hops across various synthetic and real workloads. Our design also achieves 1.3× throughput improvement and a 36% reduction in system energy consumption compared with traditional memory network designs.
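As a rough illustration of the first component (not String Figure's actual construction), the sketch below builds a random topology with a bounded number of ports per node and measures hop counts with BFS; the real algorithm additionally guarantees the throughput and near-optimal path-length properties described above.

```python
# Rough illustration (not String Figure's actual algorithm): build a random
# topology with a bounded number of ports per node, then measure hop counts.
import random
from collections import deque

def random_topology(num_nodes, ports_per_node, seed=0):
    rng = random.Random(seed)
    adj = {n: set() for n in range(num_nodes)}
    for n in range(num_nodes):
        attempts = 0
        while len(adj[n]) < ports_per_node and attempts < 10 * num_nodes:
            m = rng.randrange(num_nodes)
            attempts += 1
            if m != n and m not in adj[n] and len(adj[m]) < ports_per_node:
                adj[n].add(m)
                adj[m].add(n)
    return adj

def max_hops(adj, src=0):
    """BFS from src; returns the longest shortest-path length (inf if disconnected)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values()) if len(dist) == len(adj) else float("inf")

adj = random_topology(num_nodes=1024, ports_per_node=6)
print("max hops from node 0:", max_hops(adj))
```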