(Presenting Wednesday Jan 30, 2019) Networked applications with heterogeneous sensors are a growing source of data in the Internet of Things (IoT) environment. Many IoT applications use machine learning (ML) to make real-time predictions. The current dominant approach to deploying ML inference is monolithic: when inference must be performed using data generated by multiple sensors, the features generated by each sensor are joined in a centralized cloud-based tier, where the inference computation is performed. Since inference typically occurs at high frequency, the monolithic approach can quickly lead to burdensome levels of communication, which wastes energy, reduces data privacy, and often bottlenecks the network, violating real-time constraints. In this work, we study a novel approach that mitigates these issues by “pushing” ML inference computations out of the cloud and onto a hierarchy of IoT devices, which compute successively more compressed representations of raw sensor data. We present a new technical challenge: “rewriting” the functional form of an ML inference computation so that it factors over a network of devices without significantly reducing prediction accuracy. We present novel “hierarchy-aware” neural network architectures that let users trade off communication cost against accuracy. We also present novel exact factoring algorithms for other popular ML models, including gradient boosted trees and random forests, that preserve accuracy while substantially reducing communication. We evaluate our approach on three real-world problems: urban energy demand prediction, human activity prediction, and server performance prediction. Our approach achieves substantial reductions in energy use and latency on IoT devices while providing the same prediction quality as current monolithic inference. Measurements on a common IoT device show that energy use and latency can be reduced by up to 63% and 67%, respectively, without reducing accuracy relative to the full-communication setting.
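
To make the idea of factoring an inference computation over devices concrete, the sketch below shows the simplest possible case: an additive (here, linear) model whose weights are partitioned by sensor device, so each device transmits a single partial score instead of its raw feature vector, and the cloud tier only sums the partials. This is a minimal illustration under that additivity assumption; the device names, feature sizes, and function names are hypothetical, and this is not the paper's actual factoring algorithm for neural networks or tree ensembles.

```python
# Illustrative sketch only: factoring an additive (linear) model over a
# two-level device hierarchy so that each sensor device sends one scalar
# partial score instead of its raw feature vector. Device names and sizes
# are hypothetical examples.
import numpy as np

rng = np.random.default_rng(0)

# Three sensor devices, each producing its own slice of the feature vector.
features_per_device = {"sensor_a": 4, "sensor_b": 8, "sensor_c": 2}
weights = {name: rng.normal(size=d) for name, d in features_per_device.items()}
bias = 0.5

def monolithic_inference(raw_readings):
    """Centralized baseline: every device ships its raw features to the cloud."""
    x = np.concatenate([raw_readings[name] for name in features_per_device])
    w = np.concatenate([weights[name] for name in features_per_device])
    return float(w @ x + bias)

def device_partial_score(name, local_features):
    """Runs on each IoT device: compresses its local features to one scalar."""
    return float(weights[name] @ local_features)

def factored_inference(raw_readings):
    """Cloud tier receives only one scalar per device and sums the partials."""
    partials = [device_partial_score(n, raw_readings[n]) for n in features_per_device]
    return sum(partials) + bias

readings = {name: rng.normal(size=d) for name, d in features_per_device.items()}
# The factored computation is exact: it matches the monolithic prediction
# while communicating one number per device instead of all raw features.
assert np.isclose(monolithic_inference(readings), factored_inference(readings))
print("monolithic:", monolithic_inference(readings))
print("factored:  ", factored_inference(readings))
```

The same partial-sum intuition extends to additive ensembles (e.g., a tree ensemble in which each tree reads features from only one device), which is one way to see why exact factoring of gradient boosted trees and random forests can preserve accuracy; the general algorithms and the hierarchy-aware neural architectures described in the abstract go beyond this simplified picture.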