Wednesday, February 20, 2019

Initial Design of the Hustle Type System

Matthew Dutson from UW-Madison
Presenting Friday, Feb 22 at 1:00PM CST

Hustle is a scalable, relational data platform currently under development. It has a microservices architecture for maximum flexibility and adaptability. In implementing the type system for Hustle's execution engine, we consider a number of design tradeoffs. We seek to balance abstraction with performance, doing so in a way that allows future developers to easily add custom types and expand the functionality of existing types. We look at factors including the cost of virtual function lookups and the handling of generic types whose sizes are unknown at compile time. We also analyze the tradeoffs of various null value representations, looking to minimize wasted space while conforming to users' expectations of null semantics and behavior.
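To make the dispatch and null-representation tradeoffs concrete, here is a minimal C++ sketch (hypothetical names, not Hustle's actual code) contrasting a virtual type interface, whose per-value indirect call is the lookup cost in question, with a validity-bitmap null representation that spends one bit per row instead of an in-band sentinel value.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: a virtual type interface pays one indirect call per
// value, which is the virtual-lookup cost weighed in the talk.
struct Type {
  virtual ~Type() = default;
  virtual std::size_t size() const = 0;  // for generic types, known only at runtime
  virtual std::string toString(const void* datum) const = 0;
};

struct Int64Type : Type {
  std::size_t size() const override { return sizeof(int64_t); }
  std::string toString(const void* datum) const override {
    return std::to_string(*static_cast<const int64_t*>(datum));
  }
};

// One common null representation: a separate validity bitmap, which avoids
// reserving a sentinel value and wastes only one bit per row.
struct NullableInt64Column {
  std::vector<int64_t> values;
  std::vector<uint8_t> validity;  // packed, 1 bit per row

  bool isNull(std::size_t row) const {
    return (validity[row / 8] & (1u << (row % 8))) == 0;
  }
};
```

The alternative of resolving types at compile time with templates removes the indirect call but makes it harder for future developers to add custom types without touching the engine; that is the abstraction-versus-performance balance the talk examines.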

Tuesday, February 19, 2019

Exploiting HMC Characteristics

Today's computer systems face the memory wall problem: DRAM can provide only a fraction of the bandwidth that a multi-core processor can consume. One promising emerging memory is the Hybrid Memory Cube (HMC). HMC promises higher bandwidth and better random access performance than DRAM for data-hungry applications. HMC also has applications in near-memory computing.


In this work, hardware and software researchers at the University of Wisconsin are collaborating to examine how database operations could leverage HMC for query execution. While running a full-fledged database engine on HMC is our target, this task is onerous: porting a full-fledged database platform to new hardware like the HMC (and leveraging its in-memory computing component) is estimated to take a few person-years. So we take an initial first step by breaking down higher-level database operations into a set of four data access kernels. We encapsulate these kernels in a new benchmark called the Four Bases benchmark, as we think these four primitives can be used to compose the DNA of most higher-level database operations. Studying the behavior of hardware on this new microbenchmark lets us explore the hardware-software synergy now and provides insight into what changes may be needed on both the software and hardware sides to make use of these emerging memory systems.
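The abstract does not name the four primitives, so the following is only a hedged C++ illustration of the kind of data access kernels such a benchmark might contain: a sequential scan and a random gather, the two extremes whose relative cost separates HMC from DRAM.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative access kernels only; the actual Four Bases primitives are not
// enumerated in the abstract.

// Sequential scan: streams through a column in order, friendly to prefetchers.
int64_t sequential_sum(const std::vector<int64_t>& col) {
  return std::accumulate(col.begin(), col.end(), int64_t{0});
}

// Random gather: follows an index vector (e.g., the probe side of a join),
// defeating prefetching and row-buffer locality.
int64_t gather_sum(const std::vector<int64_t>& col,
                   const std::vector<uint32_t>& idx) {
  int64_t sum = 0;
  for (uint32_t i : idx) sum += col[i];
  return sum;
}
```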


We have an initial implementation of the Four Bases benchmark on an FPGA-based RISC-V platform that connects to HMC. We observe that random access latency is about 2x that of sequential access, which is much smaller than the roughly 10x gap often seen with DRAM. This behavior may impact the cost models of database query optimizers and may require rethinking database operator algorithms.
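As a rough illustration of why that ratio matters, consider a toy cost model (hypothetical constants, not tied to any particular optimizer) that charges random accesses a penalty relative to sequential ones; shrinking the penalty from roughly 10x to 2x can flip the optimizer's choice between a full scan and an index probe.

```cpp
#include <cstdio>

// Toy cost model: cost = sequential_accesses * 1.0 + random_accesses * penalty.
// The penalty constant is what the observed 2x (HMC) vs ~10x (DRAM) latency
// ratio would change.
double plan_cost(double seq_accesses, double rand_accesses, double penalty) {
  return seq_accesses * 1.0 + rand_accesses * penalty;
}

int main() {
  // Hypothetical plans: a full scan touching 1M rows sequentially vs an
  // index probe touching 200K rows at random.
  std::printf("DRAM-like penalty 10x: scan=%.0f  probe=%.0f\n",
              plan_cost(1e6, 0, 10.0), plan_cost(0, 2e5, 10.0));
  std::printf("HMC-like penalty 2x:   scan=%.0f  probe=%.0f\n",
              plan_cost(1e6, 0, 2.0), plan_cost(0, 2e5, 2.0));
  return 0;
}
```

Under the DRAM-like penalty the scan is cheaper (1,000,000 vs 2,000,000), while under the HMC-like penalty the probe wins (400,000 vs 1,000,000); shifts of this kind are what would force optimizer cost models and operator algorithms to be revisited.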


In this talk, we will present our work in this area, share early results, and seek feedback and collaboration from the broader CRISP community.


This is joint work with Prof. Li's group.

Friday, February 15, 2019

Passive Memristor Crossbars with Etch-Down Fabrication and Fine-Tuning Characteristics

(Presenting April 8, 2019) Memristors and their arrays with tunable conductance states are promising building blocks for the hardware implementation of artificial neural networks with analog computation. In this work, we report a CMOS-compatible, etch-down fabrication technique for passive memristor crossbar arrays and its analog computing application. Switching and fine-tuning characteristics of memristor devices are presented across a 64×64 crossbar array. In addition, vector-matrix multiplication (VMM) is demonstrated with the forward inference of a single-layer neural network for MNIST image recognition, and its accuracy is analyzed as a function of programming precision.
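A minimal numerical sketch of the analog VMM idea (hypothetical sizes and precision, not the fabricated 64×64 array): the crossbar computes output currents as a conductance matrix times an input voltage vector, and quantizing the conductances models the finite programming precision whose effect on accuracy the talk analyzes.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// I = G * V: each output current is the dot product of one conductance row
// with the input voltage vector (Ohm's law plus Kirchhoff's current law).
std::vector<double> crossbar_vmm(const std::vector<std::vector<double>>& G,
                                 const std::vector<double>& V) {
  std::vector<double> I(G.size(), 0.0);
  for (std::size_t r = 0; r < G.size(); ++r)
    for (std::size_t c = 0; c < V.size(); ++c)
      I[r] += G[r][c] * V[c];
  return I;
}

// Quantize a weight to (2^bits - 1) conductance levels to mimic finite
// programming precision (an illustrative model, not the measured device data).
double quantize(double w, double w_max, int bits) {
  const double levels = (1 << bits) - 1;
  return std::round(w / w_max * levels) / levels * w_max;
}
```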

Wednesday, February 13, 2019

A survey of query optimization methods

(Presenting Fri. February 15, 2019) Query optimizers are fundamental components of database systems, responsible for producing good execution plans. To navigate the search space of execution plans, optimizers use transformation rules and a search strategy. Extending, maintaining, and debugging a query optimizer entails understanding and modifying the library of rules. Today's query optimizers specify the valid transformations and search strategies in object-oriented languages, which results in a sizable and complex codebase. In Hustle, a new database engine we are developing, our goal is to express the transformation rules in a domain-specific language to minimize code complexity and produce a concise and maintainable query optimizer.
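To make "transformation rule" concrete, here is a hedged C++ sketch (hypothetical names, not Hustle's DSL) of how a single algebraic rewrite such as join commutativity can be stated declaratively as a pattern plus a rewrite, which is the style of specification a DSL would make concise.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical plan node and rule representation -- not Hustle's actual DSL.
struct Plan {
  std::string op;                               // e.g. "Join", "Scan"
  std::vector<std::shared_ptr<Plan>> children;
};

struct Rule {
  std::string name;
  std::function<bool(const Plan&)> matches;                   // the pattern
  std::function<std::shared_ptr<Plan>(const Plan&)> rewrite;  // the transformation
};

// Join commutativity stated as data: Join(A, B) => Join(B, A). A search
// strategy can then enumerate rules from a table instead of hard-coding them.
const Rule kJoinCommutativity{
    "JoinCommutativity",
    [](const Plan& p) { return p.op == "Join" && p.children.size() == 2; },
    [](const Plan& p) {
      auto rewritten = std::make_shared<Plan>(p);
      std::swap(rewritten->children[0], rewritten->children[1]);
      return rewritten;
    }};
```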

Persistent Memory and Architecture Implication

(Larry Chiu of IBM presenting Wednesday, February 13 at 2:00PM ET)


Opportunities and challenges surround new persistent memory and the storage hierarchy. New rules break away from the architecture and design assumptions of the past. Larry will discuss factors that may accelerate or slow the adoption of these new technologies in enterprise storage use cases.

Larry Chiu, Director of Global Storage Systems, leads IBM Storage Research strategy across a wide range of storage and data management technologies. Larry developed the first IBM Enterprise RAID Storage system, founded the industry's first IBM storage virtualization appliance, and built the industry's first autonomic storage tiering engine for the enterprise market. Larry holds a Master of Science in Computer Engineering from the University of Southern California and a Master of Science in Technology Commercialization from the University of Texas at Austin.

Sunday, February 10, 2019

GRAM: Graph Processing in a ReRAM-based Computational Memory

The performance of graph processing on real-world graphs is limited by inefficient memory behavior in traditional systems, caused by random memory access patterns. Offloading computation to memory is a promising strategy to overcome these challenges. In this work, we exploit resistive memory (ReRAM) based processing-in-memory (PIM) technology to accelerate graph applications. The proposed solution, GRAM, can efficiently execute the vertex-centric model, which is widely used in large-scale parallel graph processing programs, in computational memory. The hardware-software co-design used in GRAM maximizes computation parallelism while minimizing data movement.
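For readers unfamiliar with the vertex-centric model GRAM targets, below is a minimal C++ sketch of one push-style iteration (a generic PageRank-like example, not GRAM's hardware mapping): each vertex scatters a contribution to its neighbors, and these irregular per-edge accesses are exactly the memory behavior that processing-in-memory aims to keep close to the data.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Compressed sparse row graph: offsets index into the neighbor array.
struct Graph {
  std::vector<std::size_t> offsets;    // size = num_vertices + 1
  std::vector<std::size_t> neighbors;  // size = num_edges
};

// One vertex-centric (push) iteration. The scattered updates to
// next[neighbor] follow random indices, the access pattern that PIM avoids
// shipping across the memory bus.
void pagerank_iteration(const Graph& g, const std::vector<double>& rank,
                        std::vector<double>& next, double damping = 0.85) {
  std::fill(next.begin(), next.end(), (1.0 - damping) / rank.size());
  for (std::size_t u = 0; u + 1 < g.offsets.size(); ++u) {
    const std::size_t degree = g.offsets[u + 1] - g.offsets[u];
    if (degree == 0) continue;
    const double share = damping * rank[u] / degree;
    for (std::size_t e = g.offsets[u]; e < g.offsets[u + 1]; ++e)
      next[g.neighbors[e]] += share;  // scatter along outgoing edges
  }
}
```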

Monday, February 4, 2019

Compiler and Hardware Design for In-Memory and Near-Memory Acceleration of Cognitive Applications


(Sitao Huang, Vikram Sharma Mailthody, and Zaid Qureshi presenting Wed. 2/6 at 2PM ET)

The increasing deployment of machine learning for applications such as image analytics and search has resulted in a number of special-purpose accelerators in this domain. In this talk, we present two of our recent works. The first is a compiler that optimizes the mapping of a computation graph for efficient execution on a memristor-based hybrid (analog-digital) deep learning accelerator. By building this compiler, we have made special-purpose accelerators more accessible to software developers and enabled the generation of better-performing executables. In the second work, we are exploring near-memory acceleration for applications such as image retrieval. These applications are constrained by the bandwidth available to accelerators like GPUs and could potentially benefit from near-memory processing.
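As a hedged illustration of one mapping decision such a compiler makes (illustrative only; the compiler's actual interfaces are not described in the abstract), a layer's weight matrix that exceeds one crossbar must be tiled across several fixed-size crossbars, with the partial sums accumulated digitally.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper: how many crossbar tiles of size tile x tile are needed
// for a rows x cols weight matrix? Ceiling division in each dimension.
std::pair<std::size_t, std::size_t> tile_counts(std::size_t rows,
                                                std::size_t cols,
                                                std::size_t tile) {
  return {(rows + tile - 1) / tile, (cols + tile - 1) / tile};
}

// Example: a 512 x 1024 layer on 128 x 128 crossbars needs 4 x 8 = 32 tiles;
// placing those tiles and scheduling the digital accumulation of their
// partial sums is part of what the mapping optimization decides.
```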

Friday, February 1, 2019

Hopscotch: A Microbenchmark for Memory Performance Evaluation

(Presenting Friday, Feb 01, 2019) Due to the ever-increasing gap between the speed of computing elements and the speed at which memory systems can supply data, we are hitting a memory wall. As the CRISP center develops new memory architectures to fight this issue, we need good memory benchmarks to evaluate the performance of these emerging architectures. Ideally, these benchmarks should help us find memory system bottlenecks, should be able to profile the read and write channels both independently and combined, and should be tunable to match applications of interest. Existing memory benchmarks such as STREAM, ApexMAP, and Spatter evaluate memory performance using interesting access patterns, but they are not flexible enough to allow evaluation with all combinations of read and write patterns. We are developing a microbenchmark in which the access patterns of the read and write channels can be tuned independently using parameters that control spatial and temporal locality. The benchmark will include a collection of kernels for exercising different areas of the memory system. To ensure zero redundancy, we will employ diversity analysis using metrics such as stall percentage, read/write ratio, median stride length, and unit-stride percentage. The current implementation supports a few interesting read and write patterns on CPU and GPU platforms using OpenMP and CUDA. Future plans include exploring the effect of memory controller scheduling and finding new patterns in major application domains by eliminating the patterns we have already covered.
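As a hedged sketch of the "independently tunable read and write channels" idea (the function and parameter names are hypothetical, not Hopscotch's actual interface), the kernel below reads with one stride and writes with another, so spatial locality can be varied separately on each channel.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical kernel: read stride and write stride are independent knobs.
// A stride of 1 gives a unit-stride (streaming) channel; a large stride
// defeats spatial locality on that channel only.
void strided_read_write(const std::vector<int64_t>& src,
                        std::vector<int64_t>& dst,
                        std::size_t read_stride, std::size_t write_stride) {
  const std::size_t n = std::min(src.size(), dst.size());
  if (n == 0) return;
  std::size_t r = 0, w = 0;
  for (std::size_t i = 0; i < n; ++i) {
    dst[w] += src[r];            // one read and one write per iteration
    r = (r + read_stride) % n;   // read-channel locality knob
    w = (w + write_stride) % n;  // write-channel locality knob
  }
}
```

An OpenMP or CUDA version of the same loop, plus parameters for temporal locality (for example, how often indices repeat), would follow the same pattern; this sketch only shows the independent-stride idea.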