Monday, November 14, 2022
Variational Auto-encoder for Synthetic Training Data Generation
Monday, November 7, 2022
Accelerating SQLite with Lookahead Information Passing (LIP)
Kevin Gaffney, U. Wisconsin Madison, presenting on 11/9/22 at 1PM & 7PM EST
In the two decades following its initial release, SQLite has become the most widely deployed database engine in existence. Today, SQLite is found in nearly every smartphone, computer, web browser, television, and automobile. While it supports complex analytical queries, SQLite is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution and a B-tree storage format. However, fueled by the rise of edge computing and data science, there is a growing need for efficient in-process online analytical processing (OLAP). DuckDB, a database engine nicknamed "the SQLite for analytics", has recently emerged to meet this demand. While DuckDB has shown strong performance on OLAP benchmarks, it is unclear how SQLite compares holistically. In this talk, I will discuss SQLite in the context of this changing workload landscape. I will present results from our evaluation of SQLite on three benchmarks, each representing a different flavor of in-process data management: transactional, analytical, and blob processing. I will delve into analytical data processing in SQLite, identifying key bottlenecks and weighing potential solutions. As a result of our optimizations, SQLite is now up to 4.2X faster on the Star Schema Benchmark (SSB). Finally, I will discuss the future of SQLite, envisioning how it will evolve to meet new demands and challenges.
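For readers unfamiliar with Lookahead Information Passing, the sketch below illustrates the general idea as described in the literature: build Bloom filters over the join keys of the filtered dimension tables and probe them to discard fact-table rows early, before the joins run. The BloomFilter class, column names, and sizes are illustrative assumptions, not SQLite's implementation.

```python
# Minimal Python sketch of Lookahead Information Passing (LIP) for a
# star-schema join; names and sizes are illustrative, not SQLite internals.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def lip_join(fact_rows, dimensions):
    """Probe per-dimension Bloom filters before doing any real join work.

    fact_rows: iterable of dicts, e.g. {"date_key": ..., "part_key": ...}
    dimensions: {fk_column: set of dimension keys that survive its filter}
    """
    filters = {}
    for fk, keys in dimensions.items():
        bf = BloomFilter()
        for k in keys:
            bf.add(k)
        filters[fk] = bf

    for row in fact_rows:
        # A row enters the join pipeline only if every filter may match;
        # any false positives are caught later by the actual joins.
        if all(filters[fk].may_contain(row[fk]) for fk in filters):
            yield row

# Example: keep only fact rows whose keys survive both dimension filters.
facts = [{"date_key": 1, "part_key": 7}, {"date_key": 2, "part_key": 9}]
dims = {"date_key": {1, 3}, "part_key": {7, 8}}
print(list(lip_join(facts, dims)))  # almost surely just the first row
```

Published descriptions of LIP also reorder the filters adaptively so the most selective one is probed first; that adaptivity is omitted here for brevity.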
Monday, October 24, 2022
Improving Memory Security and Reliability by Overcoming the Threat of Rowhammer
(Moin Qureshi, Professor of Computer Science at GA Tech, presenting on Wed. 10/26 at 1:00 & 7:00 PM ET.)
Rowhammer allows an attacker to induce bit flips in a DRAM row by rapidly activating neighboring rows. Rowhammer is not just a reliability concern but a severe security threat, as it can be used to escalate privilege or break confidentiality. The problem continues to worsen for two reasons: (1) the threshold of activations needed to induce Rowhammer decreases with each generation, having dropped by 30x in the last seven years, and (2) attackers continue to devise complex patterns that can break all hardware-based defenses, including the ones commercially employed in current chips. Currently, there is no guaranteed solution for Rowhammer. Hardware-based mitigation of Rowhammer typically consists of two parts: a tracker to identify aggressor rows and a mitigating action. At low thresholds, tracking incurs significant SRAM overheads (several megabytes). Furthermore, the common mitigating action of refreshing neighboring victim rows is susceptible to the Half-Double attack from Google. In this talk, I will discuss our recent solutions that enable low-cost tracking of aggressor rows even at ultra-low thresholds (ISCA’22), a new mitigating action of performing dynamic row migration that is resilient to complex attack patterns (ASPLOS’22 and MICRO’22), and a Rowhammer-aware ECC design that provides built-in memory integrity protection while incurring virtually zero performance and storage overheads (HPCA’22).
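To make the tracker/mitigation split concrete, here is a toy sketch of the tracking half using a Misra-Gries-style frequent-item summary, a building block several proposals use. The class name, counter budget, and threshold are illustrative assumptions and do not represent the ISCA’22 design discussed in the talk.

```python
# Illustrative sketch of the "tracker" half of a Rowhammer mitigation: a
# Misra-Gries style summary that flags rows whose activation counts may have
# crossed a threshold. Generic frequent-item tracking, not the ISCA'22 design.

class AggressorTracker:
    def __init__(self, num_counters=64, threshold=4800):
        self.num_counters = num_counters   # SRAM budget: entries we can afford
        self.threshold = threshold         # activations before mitigation fires
        self.counters = {}                 # row_id -> estimated activation count

    def on_activate(self, row_id):
        """Called on every row activation; returns True if mitigation is needed."""
        if row_id in self.counters:
            self.counters[row_id] += 1
        elif len(self.counters) < self.num_counters:
            self.counters[row_id] = 1
        else:
            # Table full: decrement all entries (Misra-Gries), evicting zeros.
            for r in list(self.counters):
                self.counters[r] -= 1
                if self.counters[r] == 0:
                    del self.counters[r]
        if self.counters.get(row_id, 0) >= self.threshold:
            self.counters[row_id] = 0      # reset after the mitigating action
            return True                    # e.g. refresh victims or migrate the row
        return False
```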
Brief Bio: Moinuddin Qureshi is a Professor of Computer Science at the Georgia Institute of Technology. His research interests include computer architecture, hardware security, and quantum computing. Qureshi received his Ph.D. from the University of Texas at Austin in 2007. He was a research scientist at the IBM T. J. Watson Research Center (2007-2011), where he developed the caching algorithms for IBM POWER7 systems. Qureshi received the 2022 ACM SIGARCH Maurice Wilkes Award for contributions to high-performance memory systems. He is a member of the Hall of Fame of all three flagship architecture conferences: ISCA, MICRO, and HPCA. His research has been recognized with multiple best-paper awards and multiple IEEE Micro Top Picks awards. His papers were also awarded the 2019 and 2021 NVMW Persistent Impact Prizes, in recognition of “exceptional impact on the fields of study related to non-volatile memories”. Qureshi received the 2020 “Outstanding Researcher Award” from Intel and an “Outstanding Technical Achievement” award from IBM Research. More information at https://www.cc.gatech.edu/~moin/
Monday, October 17, 2022
SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning
Zihao Ye presenting on Wed. 10/19/22 at 1:00 & 7:00 PM ET
Sparse tensors are rapidly becoming critical components of modern deep learning workloads. However, developing high-performance sparse operators can be difficult and tedious, and existing vendor libraries cannot satisfy the escalating demands from new operators. Sparse tensor compilers simplify the development of operators, but efficient sparse compilation for deep learning remains challenging because a single sparse format cannot maximize hardware efficiency, and single-shot compilers cannot keep up with the latest hardware and system advances. We show that the key to addressing both challenges is two forms of composability. In this work, we propose SparseTIR, a sparse tensor compilation abstraction that offers composable formats and composable transformations for deep learning workloads. SparseTIR constructs a search space over these composable components for performance tuning. With these improvements, SparseTIR obtains consistent performance speedups over vendor libraries and frameworks on sparse workloads such as graph neural networks, sparse convolution, and network pruning.
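As a rough illustration of what composable formats can mean, the sketch below splits one sparse matrix into a regular ELL part plus a COO remainder and computes SpMV as the sum of the two pieces. The decomposition, function names, and width are illustrative assumptions, not SparseTIR's actual abstractions.

```python
# Split a sparse matrix into an ELL part (rows padded to a fixed width,
# friendly to vector hardware) plus a COO remainder, then compute SpMV as the
# sum of the two. Illustrative only; not SparseTIR's IR.
import numpy as np
import scipy.sparse as sp

def decompose_ell_coo(A_csr, ell_width):
    n = A_csr.shape[0]
    ell_cols = np.zeros((n, ell_width), dtype=np.int64)
    ell_vals = np.zeros((n, ell_width))
    coo_rows, coo_cols, coo_vals = [], [], []
    for i in range(n):
        start, end = A_csr.indptr[i], A_csr.indptr[i + 1]
        cols, vals = A_csr.indices[start:end], A_csr.data[start:end]
        k = min(len(cols), ell_width)
        ell_cols[i, :k], ell_vals[i, :k] = cols[:k], vals[:k]
        coo_rows.extend([i] * (len(cols) - k))   # overflow goes to COO
        coo_cols.extend(cols[k:])
        coo_vals.extend(vals[k:])
    return (ell_cols, ell_vals), (np.array(coo_rows, dtype=np.int64),
                                  np.array(coo_cols, dtype=np.int64),
                                  np.array(coo_vals))

def spmv(ell, coo, x):
    (ell_cols, ell_vals), (r, c, v) = ell, coo
    y = (ell_vals * x[ell_cols]).sum(axis=1)     # dense, regular ELL part
    np.add.at(y, r, v * x[c])                    # irregular COO remainder
    return y

A = sp.random(256, 256, density=0.05, format="csr", random_state=0)
x = np.random.rand(256)
y = spmv(*decompose_ell_coo(A, ell_width=8), x)
assert np.allclose(y, A @ x)
```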
Sunday, October 9, 2022
MEMulator & PiMulator: Emerging Memory and Processing-in-Memory Architecture Emulation
(Sergiu Mosanu, UVA, presenting on October 12, 2022.)
Main memory is a crucial aspect of SoC and architecture design, affecting system performance, power, and cost. The development of emerging memory technologies, increasingly specialized DRAM flavors, and Processing-in-Memory (PiM) architectures introduces the need for system-level modeling and evaluation. However, it is challenging to mimic both the software and hardware aspects of emerging memory and PiM architectures with high performance and fidelity using currently available tools. We develop a system emulation framework that employs a modular, parameterizable, FPGA-synthesizable memory and PiM model. Implemented in SystemVerilog, the memory and PiM model allows users to generate any desired memory configuration on the FPGA fabric with complete control over the structure and distribution of the PiM logic units. We emulate a whole system by interfacing the memory emulation model with CPU soft cores and a soft memory controller. We demonstrate strategies to model several pioneering bitwise-PiM architectures and provide detailed benchmark performance results showing the platform's ability to facilitate design-space exploration. We observe an emulation vs. simulation weighted-average speedup of 28x when running a memory benchmark workload. This comprehensive FPGA-based memory emulation enables fast, high-fidelity design-space exploration and evaluation of processing-in-memory architectures as part of a whole system stressed with heavy workloads.
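For intuition about what a bitwise-PiM model has to capture, here is a tiny behavioral sketch of a subarray that performs row-wide logic operations in place, in the spirit of Ambit-style designs. It is a software stand-in assumed for illustration; the emulator itself is a SystemVerilog model running on FPGA.

```python
# Toy behavioral model of a bitwise processing-in-memory subarray: whole DRAM
# rows are the operands, and the result lands in another row. Illustrative
# only; this is not the emulator's SystemVerilog implementation.
import numpy as np

class BitwisePimSubarray:
    def __init__(self, num_rows=512, row_bits=8192, seed=0):
        rng = np.random.default_rng(seed)
        self.rows = rng.integers(0, 2, size=(num_rows, row_bits), dtype=np.uint8)

    def row_and(self, src_a, src_b, dst):
        self.rows[dst] = self.rows[src_a] & self.rows[src_b]

    def row_or(self, src_a, src_b, dst):
        self.rows[dst] = self.rows[src_a] | self.rows[src_b]

    def row_not(self, src, dst):
        self.rows[dst] = 1 - self.rows[src]

# Example: bulk bitmap intersection without moving data to the CPU.
pim = BitwisePimSubarray()
pim.row_and(src_a=0, src_b=1, dst=2)
assert np.array_equal(pim.rows[2], pim.rows[0] & pim.rows[1])
```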
Monday, September 26, 2022
An MLIR-based Intermediate Representation for Accelerator Design with Decoupled Customizations
Hongzheng Chen and Niansong Zhang presenting on Wed. 9/28/22.
The growing number of specialized accelerators deployed in data centers and on edge devices calls for methods to generate high-performance accelerators efficiently. However, custom processing engines, memory hierarchies, data types, and data communication complicate accelerator design. In this talk, we will present HCL, our MLIR-based accelerator IR, which decouples the algorithm and hardware customizations at the IR level. We will provide several case studies to demonstrate how our IR can support a wide range of applications, how our IR primitives can be composed across different designs, and how we can achieve high performance and productivity at the same time. Finally, we will discuss the benefits and ongoing efforts of our work.
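The sketch below conveys the flavor of decoupled customizations: the algorithm (a naive matmul) is written once, while a separate schedule records a tiling customization that is applied without editing the algorithm. The Schedule class and split primitive are hypothetical stand-ins, not the HCL IR or its API.

```python
# Decoupling an algorithm from its customizations: matmul is defined once, and
# a separate "schedule" records transformations (here, loop tiling) applied
# without touching the algorithm. Names are illustrative, not the HCL API.
import numpy as np

def matmul(A, B):                        # the algorithm, written naively
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

class Schedule:
    """Records customizations separately from the algorithm definition."""
    def __init__(self):
        self.tile = {}

    def split(self, axis, factor):        # e.g. map an axis to a hardware tile
        self.tile[axis] = factor
        return self

def apply(schedule, A, B):
    """Re-emit the loop nest according to the recorded customizations."""
    ti = schedule.tile.get("i", A.shape[0])
    tj = schedule.tile.get("j", B.shape[1])
    M, N = A.shape[0], B.shape[1]
    C = np.zeros((M, N))
    for i0 in range(0, M, ti):
        for j0 in range(0, N, tj):        # tiled loops; inner body vectorized
            C[i0:i0 + ti, j0:j0 + tj] = A[i0:i0 + ti] @ B[:, j0:j0 + tj]
    return C

A, B = np.random.rand(64, 32), np.random.rand(32, 48)
s = Schedule().split("i", 16).split("j", 16)
assert np.allclose(matmul(A, B), apply(s, A, B))
```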
Thursday, September 15, 2022
Accelerating Few Shot Learning with HD Computing in ReRAM
Weihong Xu, UCSD, presenting on Wed. 9/21/22 at 1:00 pm & 7:00 pm ET.
Hyperdimensional (HD) computing is a lightweight algorithm with fast learning capability that can efficiently realize classification and few-shot learning (FSL) workloads. However, the traditional von Neumann architecture is highly inefficient for HD algorithms due to limited memory bandwidth and capacity. Processing in-memory (PIM) is an emerging computing paradigm that addresses these issues by using memories as computing units. This talk introduces efficient PIM architectures for HD computing as well as the application of HD to few-shot classification. First, we present Tri-HD, a PIM-based HD computing architecture on ReRAM that accelerates all phases of the general HD computing pipeline, namely encoding, training, retraining, and inference. Tri-HD is enabled by efficient in-memory logic operations and shows orders-of-magnitude performance improvements over a CPU. However, Tri-HD is not suitable for area- and power-constrained devices because it suffers from high HD encoding complexity and a complex dataflow. To this end, we present an algorithm and PIM co-design, FSL-HD, to realize energy-efficient FSL. FSL-HD reduces the encoding complexity by more than 10x and is equipped with an optimized dataflow for practical ReRAM. As a result, FSL-HD shows superior FSL accuracy, flexibility, and hardware efficiency compared to state-of-the-art ReRAM-based FSL accelerators.
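For readers new to HD computing, the following numpy sketch walks through the pipeline phases named above (encoding, training, inference) with a random-projection encoder and bipolar hypervectors. The dimensionality and encoding scheme are illustrative assumptions unrelated to the Tri-HD or FSL-HD hardware designs.

```python
# Minimal HD computing classification pipeline: encode features into bipolar
# hypervectors, bundle per-class prototypes, classify by similarity.
import numpy as np

D = 4096                                     # hypervector dimensionality
rng = np.random.default_rng(0)

def encode(x, projection):
    """Encode a feature vector into a bipolar hypervector (random projection)."""
    return np.sign(projection @ x)

def train(samples, labels, projection, num_classes):
    """Bundle (sum) the encoded samples of each class into a class prototype."""
    prototypes = np.zeros((num_classes, D))
    for x, y in zip(samples, labels):
        prototypes[y] += encode(x, projection)
    return np.sign(prototypes)

def infer(x, prototypes, projection):
    """Pick the class whose prototype is most similar (highest dot product)."""
    return int(np.argmax(prototypes @ encode(x, projection)))

# Tiny synthetic example: two well-separated Gaussian classes.
features = 32
projection = rng.standard_normal((D, features))
X0 = rng.standard_normal((20, features)) + 3.0
X1 = rng.standard_normal((20, features)) - 3.0
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)
protos = train(X, y, projection, num_classes=2)
print(infer(X0[0], protos, projection), infer(X1[0], protos, projection))  # expected: 0 1
```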
Monday, April 18, 2022
How Good is my HTAP System?
(Elena Milkai, U Wisc-Madison, presenting on April 20th at 1PM & 7PM ET)
Hybrid Transactional and Analytical Processing (HTAP) systems have recently gained popularity as they combine OLAP and OLTP processing to reduce administrative and synchronization costs between dedicated systems. However, there is no precise characterization of the features that distinguish a good HTAP system from a poor one. In this work, we seek to solve this problem from the perspectives of both performance and freshness. To simultaneously capture the performance of both transactional and analytical processing, we introduce a new concept called the throughput frontier, which visualizes both transactional and analytical throughput in a single 2D graph. The throughput frontier can capture information regarding the performance of each engine, the interference between the two engines, and various system design decisions. To capture how well an HTAP system supports real-time analytics, we define a freshness metric that quantifies how recent the data snapshot seen by each analytical query is. We also develop a practical way to measure freshness in a real system. We design a new hybrid benchmark called HATtrick that incorporates both the throughput frontier and freshness as metrics. Using the benchmark, we evaluate three representative HTAP systems under various data sizes and system configurations and demonstrate how the metrics reveal important system characteristics and performance information.
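To make the freshness idea concrete, here is a hedged sketch of one way such a metric could be computed: for each analytical query, take the lag between the newest committed transaction at query time and the commit timestamp of the snapshot the query actually saw, then average over queries. The exact formula and field names are assumptions, not necessarily HATtrick's definition.

```python
# A simple snapshot-staleness metric for analytical queries in an HTAP system.
from dataclasses import dataclass

@dataclass
class AnalyticalQuery:
    issued_at: float           # wall-clock time the query started (seconds)
    snapshot_commit_ts: float  # commit time of the newest transaction visible to it

def freshness_lag(queries, latest_commit_ts_at):
    """Average staleness of analytical snapshots, in seconds.

    latest_commit_ts_at(t) returns the commit time of the newest transaction
    that had committed by wall-clock time t (e.g., from the OLTP engine's log).
    """
    lags = [latest_commit_ts_at(q.issued_at) - q.snapshot_commit_ts for q in queries]
    return sum(lags) / len(lags)

# Example: OLTP commits arrive continuously, so the newest commit at time t is t.
queries = [AnalyticalQuery(issued_at=10.0, snapshot_commit_ts=9.2),
           AnalyticalQuery(issued_at=12.0, snapshot_commit_ts=11.5)]
print(freshness_lag(queries, latest_commit_ts_at=lambda t: t))  # 0.65
```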
Thursday, March 10, 2022
Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems
(Zixuan Wang, UCSD, presenting on Wed. 3/16/22 at 1:00 & 7:00 PM ET at our CRISP task-level meeting.)
Modern deep learning (DL) training is memory-intensive, constrained by the memory capacity of each compute component and by cross-device communication bandwidth. In response to such constraints, current approaches include increasing parallelism in distributed training and optimizing inter-device communication. However, model parameter communication is becoming a key performance bottleneck in distributed DL training. To improve parameter communication performance, we propose COARSE, a disaggregated memory extension for distributed DL training. COARSE is built on modern cache-coherent interconnect (CCI) protocols and MPI-like collective communication for synchronization, allowing low-latency, parallel access to training data and model parameters shared among worker GPUs. To enable high-bandwidth transfers between GPUs and the disaggregated memory system, we propose a decentralized parameter communication scheme that decouples and localizes parameter synchronization traffic. Furthermore, we propose dynamic tensor routing and partitioning to fully utilize the non-uniform serial bus bandwidth that varies across cloud computing systems. Finally, we design deadlock-avoidance and dual-synchronization mechanisms to ensure high-performance parameter synchronization. Our evaluation shows that COARSE achieves up to 48.3% faster DL training compared to state-of-the-art MPI AllReduce communication.
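For context on the baseline COARSE is compared against, below is a minimal in-process simulation of ring AllReduce, the collective commonly used for gradient synchronization. Workers are plain arrays here; a real deployment would use MPI or NCCL over the interconnect, and nothing in this sketch reflects COARSE's own communication scheme.

```python
# Ring AllReduce simulated in-process: reduce-scatter then all-gather.
import numpy as np

def ring_allreduce(grads):
    """Sum the gradient vectors of all workers with the ring algorithm."""
    n = len(grads)
    # Each worker's vector is split into n chunks; chunks travel around the ring.
    chunks = [np.array_split(g.copy(), n) for g in grads]

    # Reduce-scatter: after n-1 steps each chunk has been summed across all workers.
    for step in range(n - 1):
        for i in range(n):                      # worker i sends to worker i+1
            c = (i - step - 1) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # All-gather: circulate the fully reduced chunks so every worker has them all.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c]

    return [np.concatenate(c) for c in chunks]

workers = [np.random.rand(1024) for _ in range(4)]
reduced = ring_allreduce(workers)
assert all(np.allclose(r, sum(workers)) for r in reduced)
```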