CRISP Blog: March 2020

Tuesday, March 24, 2020

Scaling-In General Purpose Computing within the DRAM Hierarchy for Map-Reduce Workloads

(Siddhartha Balakrishna Rai is presenting on Wed. 3/25/20)
This talk explores the design space of hardware choices (where to place cores, how many, and how to interface them) and software choices (how to place data and how to map computations) for placing RISC-V cores at the rank, chip, and bank levels of the DIMMs in the DRAM hierarchy. The goal is to take advantage of the locality vs. parallelism trade-offs to speed up Map-Reduce workloads.

Tuesday, March 10, 2020

Cross-Failure Bug Detection in Persistent Memory Programs

(Sihang Liu presenting Wed. 3/11/2020 at 11:00 AM and 7:00 PM Eastern Time.)

Persistent memory (PM) technologies, such as Intel’s Optane memory, deliver high performance, byte-addressability, and persistence, allowing programs to directly manipulate persistent data in memory without any OS intermediaries. An important requirement of these programs is that persistent data must remain consistent across a failure, which we refer to as the crash consistency guarantee.

However, maintaining crash consistency is not trivial. We identify that a consistent recovery critically depends not only on the execution before the failure, but also on the recovery and resumption after the failure. We refer to these stages as the pre- and post-failure execution stages. To detect crash consistency bugs holistically, we categorize the underlying causes of inconsistent recovery that arise from incorrect interactions between the pre- and post-failure execution.

First, a program is not crash-consistent if the post-failure stage reads from locations that are not guaranteed to be persisted in all possible access interleavings during the pre-failure stage. This type of programming error leads to a race that we refer to as a cross-failure race. Second, a program is not crash-consistent if the post-failure stage reads persistent data that has been left semantically inconsistent during the pre-failure stage, such as a stale log or uncommitted data. We refer to this type of bug as a cross-failure semantic bug. Together, they form the cross-failure bugs in PM programs.

In this work, we present XFDetector, a tool that detects cross-failure bugs by automatically injecting failures into the pre-failure execution and checking for cross-failure races and semantic bugs in the post-failure continuation. XFDetector has detected four new bugs in three pieces of PM software: one of PMDK’s examples, a PM-optimized Redis database, and a PMDK library function.
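To make the idea of a cross-failure race concrete, here is a minimal toy sketch in Python. It is not XFDetector's implementation. It assumes a simplified model in which a pre-failure program is a list of `write` and `flush` operations, a single `flush` is enough to persist an address (real hardware also needs ordering fences), and recovery observes the latest written values. The names `replay`, `find_races`, `recovery`, and `commit_vars` are hypothetical, and treating reads of a designated commit flag as benign is an assumption made here to keep the example short.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Op = Tuple[str, str, int]      # (kind, address, value); value is unused for "flush"
Read = Callable[[str], int]


def replay(prefix: List[Op]) -> Tuple[Dict[str, int], Set[str]]:
    """Return (memory image, addresses written but not yet flushed)."""
    image: Dict[str, int] = {}
    dirty: Set[str] = set()
    for kind, addr, value in prefix:
        if kind == "write":
            image[addr] = value
            dirty.add(addr)
        elif kind == "flush":
            dirty.discard(addr)
    return image, dirty


def find_races(
    trace: List[Op],
    recovery: Callable[[Read], None],
    commit_vars: FrozenSet[str] = frozenset(),
) -> List[Tuple[int, str]]:
    """Inject a failure after every prefix of the trace and run recovery.

    Reports (failure point, address) for each post-failure read of an
    address that was not guaranteed to be persisted at that point.
    """
    reports: List[Tuple[int, str]] = []
    for cut in range(len(trace) + 1):
        image, dirty = replay(trace[:cut])

        def read(addr: str) -> int:
            # Toy assumption: reads of a commit flag are benign, because
            # recovery is written to handle either value of the flag.
            if addr in dirty and addr not in commit_vars:
                reports.append((cut, addr))
            return image.get(addr, 0)

        recovery(read)
    return reports


def recovery(read: Read) -> None:
    """Post-failure stage: trust `data` only if `valid` is set."""
    if read("valid") == 1:
        read("data")


# Buggy: `valid` is set and flushed before `data` is flushed.
buggy: List[Op] = [
    ("write", "data", 42), ("write", "valid", 1),
    ("flush", "valid", 0), ("flush", "data", 0),
]
# Fixed: `data` is flushed before `valid` is written.
fixed: List[Op] = [
    ("write", "data", 42), ("flush", "data", 0),
    ("write", "valid", 1), ("flush", "valid", 0),
]

buggy_reports = find_races(buggy, recovery, frozenset({"valid"}))
fixed_reports = find_races(fixed, recovery, frozenset({"valid"}))
```

In this toy model, `buggy_reports` contains a race on `data` at failure points 2 and 3, where recovery sees `valid == 1` while `data` is still unflushed, and `fixed_reports` is empty. Cross-failure semantic bugs, such as reading a stale log, are not covered by this sketch.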