Tuesday, October 29, 2019

MEDAL: Scalable DIMM based Near Data Processing Accelerator for DNA Seeding Algorithm


(Presenting on Wed. 10/30/2019)
Abstract: 
Computational genomics has proven its great potential to support precise and customized health care. However, with the wide adoption of the Next Generation Sequencing (NGS) technology, `DNA Alignment', as the crucial step in computational genomics, is becoming more and more challenging due to the booming bio-data. Consequently, various hardware approaches have been explored to accelerate DNA seeding - the core and most time consuming step in DNA alignment.



Most previous hardware approaches leverage multi-core CPUs, GPUs, and FPGAs to accelerate DNA seeding. However, DNA seeding is memory-bound, while those approaches focus on computation. For this reason, Near Data Processing (NDP) is a better fit for DNA seeding. Unfortunately, existing NDP accelerators for DNA seeding face two grand challenges: fine-grained random memory access and the scalability demanded by booming bio-data. To address these challenges, we propose a practical, energy-efficient, Dual-Inline Memory Module (DIMM)-based NDP Accelerator for the DNA Seeding Algorithm (MEDAL), built from off-the-shelf DRAM components. For small databases that fit within a single DRAM rank, we propose an intra-rank design, together with an algorithm-specific address mapping, bandwidth-aware data mapping, and Individual Chip Select (ICS), to address the challenge of fine-grained random memory access, improving parallelism and bandwidth utilization. Furthermore, to tackle the challenge of scalability for large databases, we propose three inter-rank designs: polling-based communication, interrupt-based communication, and a Non-Volatile DIMM (NVDIMM)-based solution. In addition, we propose an algorithm-specific data compression technique to reduce the memory footprint, free up more space for data mapping, and reduce communication overhead. Experimental results show that, across the three proposed designs, MEDAL achieves on average 30.50x/8.37x/3.43x speedup and 289.91x/6.47x/2.89x energy reduction compared with a 16-thread CPU baseline and two state-of-the-art NDP accelerators, respectively.
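For context, the seeding step in aligners such as BWA-MEM is typically an FM-index backward search: each one-base extension of a seed requires two lookups into a multi-gigabyte occurrence table at effectively random addresses, which is the fine-grained random access pattern the abstract refers to. The sketch below illustrates that access pattern only; the data layout and names are illustrative assumptions, not MEDAL's actual data structures.

    /*
     * Minimal sketch of FM-index backward extension, the core loop of DNA
     * seeding in aligners such as BWA-MEM.  The count array C[] and the
     * occurrence tables occ[] are assumed to be precomputed from the BWT of
     * the reference; names and layout are illustrative, not MEDAL's design.
     */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t  C[4];      /* cumulative counts of bases A,C,G,T in the BWT   */
        uint64_t *occ[4];    /* occ[b][i] = #occurrences of base b in BWT[0..i) */
    } fm_index_t;

    /* Extend the suffix-array interval [lo, hi) backwards by one base.  The two
     * occ[] lookups land on effectively random cache lines of a multi-gigabyte
     * table, which is the fine-grained random access pattern MEDAL targets.   */
    static void backward_ext(const fm_index_t *fm, uint8_t base,
                             uint64_t *lo, uint64_t *hi)
    {
        *lo = fm->C[base] + fm->occ[base][*lo];
        *hi = fm->C[base] + fm->occ[base][*hi];
    }

    /* Seed a read: extend from its last base until the interval becomes empty,
     * returning the length of the maximal exact suffix match.                 */
    static size_t seed_len(const fm_index_t *fm, const uint8_t *read,
                           size_t len, uint64_t bwt_len)
    {
        uint64_t lo = 0, hi = bwt_len;
        size_t i = len;
        while (i > 0) {
            uint64_t nlo = lo, nhi = hi;
            backward_ext(fm, read[i - 1], &nlo, &nhi);
            if (nlo >= nhi)
                break;                 /* no further exact extension possible */
            lo = nlo; hi = nhi; i--;
        }
        return len - i;
    }

Because each lookup depends on the interval produced by the previous one, the accesses cannot be coalesced into long sequential bursts, which is why placing the computation next to the DRAM devices pays off.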



Bio of Wenqin Huangfu: 
Wenqin Huangfu is a fourth-year Ph.D. student at the University of California, Santa Barbara, advised by Professor Yuan Xie.

Wenqin’s research interests include domain-specific accelerators, Processing-In-Memory (PIM), Near-Data Processing (NDP), and emerging memory technologies. Currently, Wenqin is focusing on hardware acceleration of bioinformatics with PIM and NDP technology.

Monday, October 14, 2019

SubZero: Zero-copy IO for Persistent Main Memory File Systems

Juno Kim, UC San Diego, presenting on Oct 14th

POSIX-style read() and write() have long been the standard interface for accessing file data. However, the data copy their semantics require imposes unnecessary overhead when accessing files stored in persistent memory attached to the processor memory bus (PMEM). PMEM-aware file systems provide direct-access (DAX) mmap() to avoid the copy, but it forces the programmer to manage concurrency control and atomicity guarantees.

We propose a new IO interface, called SubZero IO, that avoids data movement overheads when accessing persistent memory-backed files while still interacting cleanly with legacy read() and write(). SubZero IO provides two new system calls – peek() and patch() – that allow for data access and modification without a copy. They can improve performance significantly, but they require substantial changes to how an application performs IO. To avoid this, we describe a PMEM-aware implementation of read() and write() that can transparently provide most of the benefits of peek() and patch() for some applications.
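As a rough illustration of the intended usage: the abstract names peek() and patch() but does not give their signatures, so the prototypes below are assumptions made for illustration rather than SubZero's real interface. The idea is that a server handling a GET can pass the transport layer a pointer obtained from peek() instead of first read()ing the file into a private buffer.

    /*
     * Usage sketch only.  The peek()/patch() prototypes below are assumed, not
     * taken from SubZero; they exist solely to illustrate the zero-copy idea.
     */
    #include <sys/types.h>
    #include <stddef.h>

    /* Assumed semantics: peek() returns a pointer directly into the PMEM-backed
     * file data instead of copying it out; patch() publishes a prepared buffer
     * as the new contents of a file range without a second copy.              */
    extern const void *peek(int fd, off_t offset, size_t len);
    extern int         patch(int fd, off_t offset, void *buf, size_t len);

    /* Serve a GET request: with read(), file_size bytes would first be copied
     * into a private buffer; with peek(), the server hands the transport layer
     * a pointer straight into persistent memory.                              */
    int serve_get(int fd, size_t file_size,
                  void (*send)(const void *buf, size_t len))
    {
        const void *data = peek(fd, 0, file_size);
        if (data == NULL)
            return -1;
        send(data, file_size);
        return 0;
    }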


Measurements of simple benchmarks show that SubZero outperforms copy-based read() and write() by up to 2× and 6×, respectively. At the application level, peek() improves GET performance of the Apache Web Server by 3.8×, and patch() boosts the SET performance of Kyoto Cabinet up to 2.7×.

Wednesday, October 9, 2019

FloatPIM: Acceleration of DNN Training in PIM

Presented by Saransh Gupta of UC San Diego on Wednesday, October 9th at 1:00PM ET.

We present PIM methods that perform in-memory computations on digital data and support high-precision floating-point operations. First, we introduce an operation-dependent variable voltage application scheme that improves the performance and energy efficiency of existing PIM operations by 2x. Then, we propose an in-memory deep neural network (DNN) architecture that supports not only DNN inference but also training entirely in memory. To achieve this, we natively enable, for the first time, high-precision floating-point operations in memory. Our design also enables fast communication between neighboring memory blocks to reduce the internal data movement of the PIM architecture.
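As background on what supporting floating-point operations in memory involves, a floating-point multiply decomposes into a sign XOR, an integer addition of exponents, and a fixed-point multiplication of mantissas, all of which map onto the bitwise operations digital PIM arrays already provide. The sketch below shows that standard decomposition; it is not FloatPIM's actual in-memory circuit, and rounding is omitted for brevity.

    /*
     * Sketch of the standard decomposition of a floating-point multiply into
     * integer operations (sign XOR, exponent add, fixed-point mantissa multiply).
     * Illustrates why bit-serial digital PIM logic can support FP math; it is
     * not FloatPIM's in-memory implementation, and rounding is omitted.
     */
    #include <stdint.h>

    typedef struct {
        uint32_t sign;   /* 0 or 1                                  */
        int32_t  exp;    /* unbiased exponent                       */
        uint32_t mant;   /* 24-bit mantissa with implicit leading 1 */
    } fp32_parts_t;

    static fp32_parts_t fp_mul(fp32_parts_t a, fp32_parts_t b)
    {
        fp32_parts_t r;
        r.sign = a.sign ^ b.sign;                 /* bitwise XOR of signs     */
        r.exp  = a.exp + b.exp;                   /* integer add of exponents */
        uint64_t m = (uint64_t)a.mant * b.mant;   /* fixed-point mantissa mul */
        /* Normalize: the product of two 24-bit mantissas is 47 or 48 bits.   */
        if (m & (1ULL << 47)) { m >>= 24; r.exp += 1; }
        else                  { m >>= 23; }
        r.mant = (uint32_t)m;                     /* truncated, not rounded   */
        return r;
    }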