Tuesday, April 28, 2020

MEG1.1: A RISC-V-based System Simulation Infrastructure for Exploring Memory Optimization using FPGAs and High Bandwidth Memory

(Nicholas Beckwith, U Penn., presenting on Wednesday, April 29, 2020)

In this presentation, we propose MEG1.1, a configurable, cycle-exact, RISC-V-based full-system emulation infrastructure using FPGAs and HBM. MEG1.1 extends MEG1.0 by providing out-of-order RISC-V cores as well as the OS and architectural support needed to integrate users' custom accelerators. Furthermore, MEG1.1 provides an HBM memory interface that fully exposes the HBM's bandwidth to the user. Leveraging MEG1.1, we present a cross-layer system optimization as an illustrative case study of its usability: a reconfigurable memory controller that improves on the address mapping of a standard memory controller. This reconfigurable memory controller, along with its OS support, allows the user to improve the memory bandwidth available to the out-of-order RISC-V cores as well as to custom near-memory accelerators. We also present the challenges and research directions for MEG2.0, which aims to significantly reduce the cost and improve the portability, flexibility, and usability of MEG1.0 and MEG1.1 without sacrificing performance or fidelity.
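To make concrete what reconfiguring an address mapping involves, here is a minimal Python sketch of the idea. The bit-field layouts, field widths, and the two example schemes are our own illustrative assumptions for exposition; they are not MEG1.1's actual controller design.

```python
# Toy model of a reconfigurable address-mapping stage in a memory
# controller for a multi-channel memory such as HBM. The bit-field
# layouts below are illustrative assumptions, not MEG1.1's design.

def bits(addr, lo, width):
    """Extract `width` bits of `addr` starting at bit position `lo`."""
    return (addr >> lo) & ((1 << width) - 1)

def channel_interleaved(addr):
    """Spread consecutive cache lines across channels: good for
    sequential streams from the out-of-order cores."""
    # Layout (assumed): [row | bank | channel | 64B line offset]
    return {"offset":  bits(addr, 0, 6),
            "channel": bits(addr, 6, 4),   # 16 (pseudo-)channels
            "bank":    bits(addr, 10, 4),
            "row":     addr >> 14}

def bank_interleaved(addr):
    """Keep a stream within one channel but rotate across banks: can
    help a near-memory accelerator that owns a single channel."""
    # Layout (assumed): [row | channel | bank | 64B line offset]
    return {"offset":  bits(addr, 0, 6),
            "bank":    bits(addr, 6, 4),
            "channel": bits(addr, 10, 4),
            "row":     addr >> 14}

class ReconfigurableMapper:
    """The OS (or a driver) selects a mapping per workload phase,
    standing in for reprogramming the FPGA-side controller."""
    def __init__(self):
        self.scheme = channel_interleaved

    def select(self, scheme):
        self.scheme = scheme

    def map(self, addr):
        return self.scheme(addr)

mapper = ReconfigurableMapper()
print(mapper.map(0x1234_5678))    # default: channel-interleaved
mapper.select(bank_interleaved)   # switch for an accelerator phase
print(mapper.map(0x1234_5678))
```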

Tuesday, April 21, 2020

BaM: Enabling Accelerator Memory Accesses into the SSD

(Zaid Qureshi, David Min, and Vikram Sharma Mailthody, presenting on Wednesday, April 22, 2020)

Storage-class memory (SCM) has been considered a prime candidate for addressing applications' growing memory footprints. An ideal SCM for tomorrow's data center offers terabytes of capacity, latencies from a few hundred nanoseconds to a couple of microseconds, energy efficiency, a high degree of memory parallelism, scalability, and low cost. Among the several types of SCM, 3D XPoint and Flash have shown promising results. Compared to 3D XPoint, Flash offers higher throughput thanks to its several levels of internal parallelism, has higher density, consumes very little power per memory access, and has proven to be scalable and cost-efficient.
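As a back-of-envelope illustration of why that internal parallelism matters, the short Python sketch below computes the aggregate bandwidth that falls out of keeping many Flash channels and dies busy at once. All device parameters are assumed round numbers for exposition, not figures from the talk.

```python
# Back-of-envelope arithmetic: internal parallelism lets Flash deliver
# high throughput despite long per-access latency. All numbers are
# illustrative assumptions, not measurements.

page_bytes   = 4096     # one Flash read unit (assumed)
read_latency = 50e-6    # ~50 us per page read (assumed)
channels     = 16       # independent channels on the device (assumed)
dies_per_ch  = 4        # dies that can work concurrently (assumed)

# One die in isolation:
per_die_bw = page_bytes / read_latency              # ~82 MB/s
# With every channel and die busy at once:
aggregate_bw = per_die_bw * channels * dies_per_ch  # ~5.2 GB/s

print(f"per-die:   {per_die_bw / 1e6:.0f} MB/s")
print(f"aggregate: {aggregate_bw / 1e9:.1f} GB/s")
```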

However, studying Flash as part of the main-memory system is challenging, as existing simulators and emulators do not provide the needed flexibility and cannot address practical system-level challenges. In this talk, we will discuss our attempt at modeling an SSD using an FPGA. We show that CPUs expose only a small amount of memory-level parallelism over PCIe and are therefore inefficient at exploiting the massive parallelism offered by these emerging NVM devices. To increase the memory-level parallelism, we connect a GPU to the FPGA. We will then discuss our findings, along with several application- and system-level challenges we encountered.
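The gap between CPU- and GPU-driven access can be seen with a simple Little's-law estimate: sustained throughput is roughly the number of in-flight requests times the request size, divided by latency, capped by the link. The sketch below uses assumed request sizes, latencies, and link caps purely for illustration; it is not a model of the actual system.

```python
# Little's law view of memory-level parallelism (MLP) over PCIe:
# throughput ~= in-flight requests * request size / latency,
# saturating at the link cap. All numbers are illustrative assumptions.

REQUEST_BYTES = 512     # small random read (assumed)
LATENCY_S     = 100e-6  # end-to-end device latency over PCIe (assumed)
LINK_CAP_GBPS = 16.0    # roughly PCIe 3.0 x16 (assumed)

def sustained_gbps(outstanding):
    """Throughput with `outstanding` requests kept in flight."""
    raw = outstanding * REQUEST_BYTES / LATENCY_S / 1e9
    return min(raw, LINK_CAP_GBPS)

# A few CPU cores might sustain dozens of in-flight requests...
print(f"  64 in flight: {sustained_gbps(64):5.2f} GB/s")
# ...while thousands of GPU threads can each hold one:
print(f"8192 in flight: {sustained_gbps(8192):5.2f} GB/s (link-bound)")
```

The point of the estimate is that only a requester with very high MLP, such as a GPU, can keep enough requests in flight to saturate the device and the link.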

Tuesday, April 7, 2020

INFaaS: Model-less Inference Serving

(Francisco Romero and Qian Li presenting on Wednesday, April 8 at 11:00 AM and 7:00 PM Eastern Time)

Despite existing work in machine-learning inference serving, ease of use and cost efficiency remain key challenges. Developers must manually match the performance, accuracy, and cost constraints of their applications to decisions about selecting the right model and model optimizations, suitable hardware architectures, and auto-scaling configurations. These interacting decisions are difficult for developers to make, especially when application load varies, applications evolve, and the available resources change over time. Consequently, applications often end up overprovisioning resources.

In this talk, we will introduce INFaaS, a model-less inference-as-a-service system that relieves applications of making these decisions. INFaaS provides a simple interface that allows applications to specify their inference task along with its performance and accuracy requirements. To implement this interface, INFaaS generates and leverages model-variants: versions of already-trained models that differ in resource footprint, latency, cost, and accuracy. Based on the characteristics of the model-variants, INFaaS automatically navigates the decision space on behalf of applications to meet their specific objectives: (a) it selects a model, hardware architecture, and any compiler optimizations, and (b) it makes scaling and resource-allocation decisions. By sharing hardware resources across models and applications, INFaaS achieves up to 150× cost savings and 1.5× higher throughput, and violates latency objectives 1.5× less frequently than state-of-the-art systems.
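As a toy illustration of what model-less selection means for point (a), the Python sketch below picks the cheapest variant that satisfies a caller's latency and accuracy requirements. The variant table and the greedy cheapest-feasible policy are our own simplifications for exposition, not INFaaS's actual selection algorithm or profiling data.

```python
# Toy sketch of "model-less" variant selection: given a query's latency
# and accuracy requirements, pick the cheapest registered variant that
# satisfies both. The table and policy are illustrative assumptions,
# not INFaaS's actual algorithm.

from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    latency_ms: float   # profiled latency on its target hardware
    accuracy: float     # accuracy of this trained variant
    cost: float         # relative cost per query on that hardware

VARIANTS = [
    Variant("resnet50-cpu",      120.0, 0.76, 1.0),
    Variant("resnet50-gpu",        8.0, 0.76, 6.0),
    Variant("resnet50-int8-cpu",  45.0, 0.74, 0.8),
    Variant("mobilenet-cpu",      30.0, 0.71, 0.5),
]

def select(max_latency_ms, min_accuracy):
    """Return the cheapest variant meeting the caller's requirements."""
    feasible = [v for v in VARIANTS
                if v.latency_ms <= max_latency_ms
                and v.accuracy >= min_accuracy]
    if not feasible:
        raise LookupError("no variant satisfies the requirements")
    return min(feasible, key=lambda v: v.cost)

# The application states *what* it needs, not *which* model to run:
print(select(max_latency_ms=50, min_accuracy=0.73).name)  # resnet50-int8-cpu
print(select(max_latency_ms=10, min_accuracy=0.75).name)  # resnet50-gpu
```

In the real system the same interface also has to drive point (b), scaling and resource allocation under changing load, which is what makes the decision space hard to navigate by hand.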