Tuesday, April 23, 2019

PMTest: A Fast and Flexible Testing Framework for Persistent Memory Programs


(Sihang Liu presenting at the Wed. 4/24 Task 2.4 liaison meeting) 

Recent non-volatile memory technologies such as 3D XPoint and NVDIMMs have enabled persistent memory (PM) systems that can manipulate persistent data directly in memory. This advancement in memory technology has spurred the development of a new class of crash-consistent software (CCS) for PM: applications that can recover persistent data from memory in a consistent state in the event of a crash (e.g., a power failure). CCS developed for persistent memory ranges from kernel modules to user-space libraries and custom applications. However, ensuring crash consistency in CCS is difficult and error-prone. Programmers typically employ low-level hardware primitives or transactional libraries to enforce the ordering and durability guarantees that crash consistency requires. Unfortunately, hardware can reorder writes at runtime, making it difficult for programmers to test whether their implementation enforces the correct ordering and durability guarantees. 
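
On x86, for instance, enforcing these guarantees typically means pairing each store to PM with an explicit cache-line flush and a store fence. The sketch below is a generic illustration of that idiom (the data structure and commit protocol are our own example, not code from the paper); dropping either the flush or the fence on the value is exactly the kind of bug that is hard to catch by testing.

```cpp
// Generic x86 illustration of the low-level primitives used to enforce
// ordering and durability when writing to persistent memory.
#include <immintrin.h>

struct Record {
    long value;
    bool valid;   // consumers treat the record as committed only when true
};

void commit(Record* r, long v) {
    r->value = v;
    _mm_clflush(&r->value);   // force the cached line back to persistent memory
    _mm_sfence();             // ordering: value must be durable before the flag

    r->valid = true;          // omitting the flush/fence above is a typical
    _mm_clflush(&r->valid);   // crash-consistency bug: the flag may persist
    _mm_sfence();             // before the value it guards
}
```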

We believe that there is an urgent need for a testing framework that helps programmers identify crash consistency bugs in their CCS. We find that prior testing tools lack generality, i.e., they work only for one specific CCS or memory persistency model, and/or they introduce significant performance overhead. To overcome these drawbacks, we propose PMTest, a crash consistency testing framework that is both flexible and fast. PMTest provides flexibility through two basic assertion-like software checkers that test the two fundamental guarantees of all CCS: ordering and durability. These checkers can also serve as the building blocks of application-specific, high-level checkers. PMTest enables fast testing by deducing the persist order without exhausting all possible orders. In our evaluation with eight programs, PMTest not only identified 45 synthetic crash consistency bugs, but also detected 3 new bugs in a file system (PMFS) and in applications developed using a transactional library (PMDK), while on average being 7.1x faster than the state-of-the-art tool.
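
As a rough, self-contained illustration of the idea (the class and function names below are our own and do not reflect PMTest's actual API or implementation), the two basic checks can be thought of as assertions over a log of persist operations, from which application-specific checks are composed:

```cpp
// Minimal sketch of assertion-like crash-consistency checkers built on a
// simple persist log. Illustrative only; not PMTest's API.
#include <cassert>
#include <cstdint>
#include <unordered_map>

class PersistLog {
public:
    // Called by instrumented code whenever a location is flushed and fenced.
    void recordPersist(const void* addr) {
        log_[reinterpret_cast<uintptr_t>(addr)] = ++clock_;
    }
    // Durability checker: the location must already be persisted.
    bool isPersisted(const void* addr) const {
        return log_.count(reinterpret_cast<uintptr_t>(addr)) != 0;
    }
    // Ordering checker: 'first' must have persisted before 'second'.
    bool isOrderedBefore(const void* first, const void* second) const {
        auto a = log_.find(reinterpret_cast<uintptr_t>(first));
        auto b = log_.find(reinterpret_cast<uintptr_t>(second));
        return a != log_.end() && b != log_.end() && a->second < b->second;
    }
private:
    std::unordered_map<uintptr_t, uint64_t> log_;
    uint64_t clock_ = 0;
};

int main() {
    PersistLog log;
    int value = 42, valid = 0;

    // The CCS persists the value, then the valid flag.
    log.recordPersist(&value);
    valid = 1;
    log.recordPersist(&valid);

    // Application-specific check composed from the two basic checkers:
    assert(log.isPersisted(&value));
    assert(log.isOrderedBefore(&value, &valid));
    return 0;
}
```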

Tuesday, April 16, 2019

Tuning Applications for Efficient GPU Offloading to In-memory Processing


(Presenting on Wed. 04/17/2019 at 2:00PM ET)

Authors: Yudong Wu, Mingyao Shen, Yi Hui Chen, Yuanyuan Zhou


Data movement between processors and main memory is a critical bottleneck for data-intensive applications. The problem is even more severe with Graphics Processing Units (GPUs) because of their massively parallel data processing capability. Recent research has shown that Processing-in-Memory (PIM) can greatly alleviate the data movement bottleneck by reducing traffic between GPUs and memory devices: instead of transferring massive amounts of data between memory devices and processors, it offloads a relatively small execution context to the memory side. However, conventional application code that is highly optimized for locality, so that it executes efficiently on the GPU, is not a natural match for PIM offloading. To address this challenge, our project investigates how application code can be restructured to increase the benefit of offloading from GPUs to PIM. In addition, we study approaches to dynamically determine how much work to offload, as well as how to leverage all resources, including the GPUs, when offloading, to achieve the best possible overall performance. In our experimental evaluation over 14 applications, our approach improves application offloading performance by 21% on average.
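
As a simplified, hypothetical example of the dynamic-offload decision (the interfaces and the proportional-split policy below are our own illustration, not the mechanism used in the project), one could split a data-parallel workload between the GPU and PIM in proportion to their measured throughputs, so that both sides finish at roughly the same time and neither resource sits idle:

```cpp
// Sketch of a throughput-proportional split between GPU and PIM execution.
#include <cstddef>

struct Throughput { double gpu; double pim; };  // elements/sec, measured online

// Fraction of elements to offload to PIM so both sides finish together:
// (1 - f) * N / gpu == f * N / pim  =>  f = pim / (gpu + pim).
double pimOffloadFraction(Throughput t) {
    return t.pim / (t.gpu + t.pim);
}

// Hypothetical dispatch: the first (1 - f) * N elements stay on the GPU,
// the remaining f * N elements are offloaded to PIM.
void dispatch(std::size_t n, Throughput t,
              void (*runOnGpu)(std::size_t, std::size_t),
              void (*runOnPim)(std::size_t, std::size_t)) {
    double f = pimOffloadFraction(t);
    std::size_t split = static_cast<std::size_t>(n * (1.0 - f));
    runOnGpu(0, split);   // locality-friendly portion stays on the GPU
    runOnPim(split, n);   // memory-bound portion goes to PIM
}
```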

Tuesday, April 9, 2019

PIMCloud: Exploring Near-data Processing for Interactive Cloud Services

Shuang Chen from Cornell University
Presenting Wednesday, Apr 10 at 2:00PM EDT

Data centers host latency-critical (LC) as well as best-effort jobs. The former rely critically on an adequate provisioning of hardware resources to meet their QoS targets. Many recent industry implementations and research proposals assume a single LC application per node, in part because it makes it easy to carve out resources for that one LC application while allowing best-effort jobs to compete for the rest.

Two big changes in data centers, however, are about to shake up the status quo. First, the microservices model is causing the number of LC applications hosted in data centers to explode, making it impractical (and inefficient) to assume one LC application per node. That means multiple LC applications competing for resources on a single node, each with its own QoS needs. Second, the arrival of processing-in-memory (PIM) capabilities introduces a complex scheduling challenge (and opportunity).

In this talk, I will present our PIMCloud project, show some initial results, and discuss our ongoing work. First, I will present PARTIES, a novel hardware resource manager that enables successful colocation of multiple LC applications on a single node of a traditional data center. (This work will be presented next week at ASPLOS 2019.) Second, I will discuss how we envision augmenting this framework to accommodate PIM capabilities. Specifically, I will discuss some challenges and opportunities in future nodes where memory channels are themselves compute-capable.
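
To make the colocation idea concrete, the sketch below shows one interval of a simple feedback loop that shifts a resource (here, cores) from an application with latency slack to one that is missing its QoS target. This is our own highly simplified illustration of QoS-driven resource reallocation, not the PARTIES algorithm or its implementation.

```cpp
// Sketch of one adjustment interval for colocated LC applications.
#include <string>
#include <vector>

struct App {
    std::string name;
    double tailLatencyMs;   // measured tail latency this interval
    double qosTargetMs;     // latency target
    int cores;              // example resource: dedicated cores
};

void adjust(std::vector<App>& apps) {
    for (App& victim : apps) {
        if (victim.tailLatencyMs <= victim.qosTargetMs) continue;  // QoS met
        // Find a donor with the most slack and at least one spare core.
        App* donor = nullptr;
        for (App& a : apps) {
            if (&a == &victim || a.cores <= 1) continue;
            if (a.tailLatencyMs < 0.8 * a.qosTargetMs &&
                (donor == nullptr ||
                 a.tailLatencyMs / a.qosTargetMs <
                 donor->tailLatencyMs / donor->qosTargetMs)) {
                donor = &a;
            }
        }
        if (donor) { donor->cores -= 1; victim.cores += 1; }  // shift one core
    }
}
```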