Tuesday, May 25, 2021

Corundum: statically-enforced persistent memory safety

(Morteza Hoseinzadeh of UC San Diego Presenting on 5/26 at 1:00 p.m. & 7:00 p.m. ET)

Fast, byte-addressable, persistent main memories (PM) make it possible to build complex data structures that can survive system failures. Programming for PM is challenging, not least because it combines well-known programming challenges like locking, memory management, and pointer safety with novel PM-specific bug types. It also requires logging updates to PM to facilitate recovery after a crash. A misstep in any of these areas can corrupt data, leak resources, or prevent successful recovery after a crash. Existing PM libraries in a variety of languages -- C, C++, Java, Go -- simplify some of these problems, but they still require the programmer to learn (and flawlessly apply) complex rules to ensure correctness. Opportunities for data-destroying bugs abound.

This paper presents Corundum, a Rust-based library with an idiomatic PM programming interface and leverages Rust’s type system to statically avoid most common PM programming bugs. Corundum lets programmers develop persistent data structures using familiar Rust constructs and have confidence that they will be free of those bugs. We have implemented Corundum and found its performance to be as good as or better than Intel's widely-used PMDK library, HP's Atlas, Mnemosyne, and go-pmem.

Monday, May 17, 2021

FANS: FPGA-Accelerated Near-Storage Sorting

(Weikang Qiao presenting on Wed. May 19th at 1:00 & 7:00 PM ET)
 
Large-scale sorting is always an important yet demanding task for data center applications. In addition to powerful processing capability, high-performance sorting system requires efficient utilization of the available bandwidth of various levels in the memory hierarchy. Nowadays, with the explosive data size, the frequent data transfers between the host and the storage device are becoming increasingly a performance bottleneck. Fortunately, the emergence of near-storage computing devices gives us the opportunity to accelerate large-scale sorting by avoiding the back and forth data transfer. Near-storage sorting is promising for extra performance improvement and power reduction. However, it is still an open question of how to achieve the optimal sorting performance on the existing near-storage computing device. 

In this work, we first perform an in-depth analysis of the sorting performance on the newly released Samsung SmartSSD platform. Contrary to the previous belief, our analysis shows that the end-to-end sorting performance is bound by not only the bandwidth of the flash, but also the main memory bandwidth, the configuration of the sorting kernel and the intermediate sorting status. Based on our modeling, we propose FANS, an FPGA accelerated near-storage sorting system which selects the optimized design configuration and achieves the theoretically maximum end-to-end performance when using a single Samsung SmartSSD device. The experiments demonstrate more than 3× performance speedup over the state-of-art FPGA-accelerated flash storage. 

Monday, May 10, 2021

Compiler-Driven Simulation of Reconfigurable Hardware Accelerators

(Zhijing Li of Cornell presenting on Wed. 5/12/21 at 1PM & 7PM ET) 

As customized accelerator design has become increasingly popular to keep up with the demand for high performance computing, it poses challenges for modern simulator design to adapt to such a large variety of accelerators. Existing simulators tend to two extremes: low-level and general approaches, such as RTL simulation, that can model any hardware but require substantial effort and long execution times; and higher-level application-specific models that can be much faster and easier to use but require one-off engineering effort.

This work proposes a compiler-driven simulation workflow that can model the performance of arbitrary hardware accelerators, while introducing little programming overhead. The key idea is to separate structure representation from simulation by developing an intermediate language that can flexibly represent a wide variety of hardware constructs. We design the Event Queue dialect of MLIR, a dialect that can model arbitrary hardware accelerators with explicit data movement and distributed event-based control. We also implement a simulator to model EQueue-structured programs. We show the effectiveness of our simulator by comparing it with a state-of-the-art application-specialized simulator, while offering higher extensibility. We demonstrate our simulator can save designers' time with reusable lowering pipeline and visualizable outputs.