Despite existing work on machine learning inference serving, ease of use and cost efficiency remain key challenges. Developers must manually map the performance, accuracy, and cost constraints of their applications onto decisions about selecting the right model and model optimizations, suitable hardware architectures, and auto-scaling configurations. These interacting decisions are difficult for developers to make, especially as application load fluctuates, applications evolve, and the available resources change over time. Consequently, applications often end up overprovisioning resources.
In this talk, we will introduce INFaaS, a model-less inference-as-a-service system that relieves applications of making these decisions. INFaaS provides a simple interface that lets applications specify their inference task along with its performance and accuracy requirements. To implement this interface, INFaaS generates and leverages model-variants: versions of already-trained models that differ in resource footprint, latency, cost, and accuracy. Based on the characteristics of these model-variants, INFaaS automatically navigates the decision space on behalf of applications to meet their objectives: (a) it selects a model, a hardware architecture, and any compiler optimizations, and (b) it makes scaling and resource-allocation decisions. By sharing hardware resources across models and applications, INFaaS saves up to 150× in cost, achieves 1.5× higher throughput, and violates latency objectives 1.5× less often than state-of-the-art systems.
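To make the model-variant idea concrete, the minimal Python sketch below shows how a system could turn high-level objectives (a latency target and a minimum accuracy) into a concrete variant choice. The catalog entries, field names, numbers, and the "cheapest feasible variant" policy are illustrative assumptions for this sketch, not INFaaS's actual API, profiling data, or decision algorithm.

# Minimal sketch, assuming a hypothetical variant catalog and selection policy;
# names and numbers are illustrative, not the actual INFaaS interface or data.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ModelVariant:
    name: str             # variant of an already-trained model (e.g., quantized, compiled)
    hardware: str         # hardware architecture it runs on
    latency_ms: float     # profiled inference latency
    accuracy: float       # profiled accuracy
    cost_per_hour: float  # cost of the resources it occupies

def select_variant(variants: List[ModelVariant],
                   latency_slo_ms: float,
                   min_accuracy: float) -> Optional[ModelVariant]:
    # Keep only variants that satisfy both objectives, then take the cheapest.
    feasible = [v for v in variants
                if v.latency_ms <= latency_slo_ms and v.accuracy >= min_accuracy]
    return min(feasible, key=lambda v: v.cost_per_hour, default=None)

if __name__ == "__main__":
    catalog = [
        ModelVariant("resnet50-fp32",     "cpu", 120.0, 0.76, 0.10),
        ModelVariant("resnet50-tensorrt", "gpu",   8.0, 0.75, 0.90),
        ModelVariant("mobilenetv2-int8",  "cpu",  25.0, 0.71, 0.10),
    ]
    # The application states only its objectives; the variant choice is automatic.
    choice = select_variant(catalog, latency_slo_ms=30.0, min_accuracy=0.70)
    print(choice)  # picks mobilenetv2-int8 on CPU: the cheapest feasible variant

In a real deployment the same objectives would also drive the scaling and resource-allocation decisions described above; this sketch covers only the variant-selection step.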