Thursday, December 6, 2018

High-performance In-Memory Data Partitioning

[Presenting on Friday, Dec 7th]

Data partitioning is an important primitive for in-memory data processing systems, and in many cases it is the key performance bottleneck. This important primitive has been the focus of many studies in the past. However, as we argue in this talk, these previous studies have been narrow in their scope leaving many unanswered questions that are of paramount importance in practice. Consequently, to the best of our knowledge, there is no clear answer to the seemingly simple question of what is an efficient partitioning strategy for in-memory data systems. In this talk, we carefully consider this data partitioning primitive in the context of multi-core in-memory data settings. We look at past work in this area and note that many of these studies miss looking at many important aspects such as the impact of the tuple size and the impact of the data formats (e.g. row-store vs. column-store). We build on this initial observation and examine a number of data partitioning strategies, leading to a better understanding of how data partitioning methods perform on modern multi-core large memory systems. We note a few interesting observations, including how relatively simple methods work quite well in practice across a broad spectrum of data parameters. To help future researchers, we propose a partitioning benchmark so that work in this area can take a broader and more realistic perspective when working on data partitioning methods. Overall, the key contribution of this talk is to separate the wheat from the chaff in previous research in this area, analyze the relative performance of various methods on a broad set of data parameters, and help provide a more systematic evaluation framework for future work in this area. We also point to opportunities for new research directions in this area.