Thursday, December 13, 2018

Joint Parsing for Understanding 3D Scenes and Human Activities in Videos

(Presenting Fri. 12/14.) We propose a computational framework that jointly parses a single RGB image and reconstructs a holistic 3D configuration composed of a set of CAD models using a stochastic grammar model. Specifically, we introduce a Holistic Scene Grammar (HSG) to represent the 3D scene structure, which characterizes a joint distribution over the functional and geometric space of indoor scenes; a schematic form of this distribution is sketched below.

As the 3D environment becomes larger and more complex, the complexity of the query-reasoning system grows rapidly. Increasingly, these tasks must run at line speed just to keep up with the rate of new data production, and real-time processing is often needed to draw timely inferences. The algorithms currently entail deep learning, dynamic programming, Monte Carlo methods, graph analytics, and natural language processing. Because these core algorithms are applied massively across AI applications, we extract them into modules that can be accelerated by the emerging in-memory processing technologies from CRISP; a sketch of such a modular interface also follows below. Deployments of real-time video analytics will need to do as much processing in the cameras as possible, and so will span edge devices to the cloud in an end-to-end solution.
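For readers less familiar with grammar-based scene parsing, the joint distribution can be written schematically as a posterior over parse graphs. The factorization below is a generic sketch of this family of models; the symbols pg, F, and G are illustrative and not the exact notation of our paper:

    p(pg \mid I) \;\propto\; p(pg)\, p(I \mid pg), \qquad p(pg) = p(F)\, p(G \mid F),

where a parse graph pg = (F, G) pairs functional labels F (what the scene and its objects are for) with a geometric configuration G (the CAD layout), and the likelihood p(I \mid pg) measures how well the rendered 3D configuration explains the observed image I.

To make the "core algorithms as modules" idea concrete, here is a minimal Python sketch of how such kernels might sit behind a common interface so that an in-memory processing backend can be swapped in when the hardware supports it. All names here (KERNELS, register, dispatch, the "pim" backend tag) are hypothetical illustrations, not an existing CRISP API:

    import math
    import random
    from typing import Callable, Dict

    # Hypothetical registry mapping a core-algorithm name to its
    # per-backend implementations ("cpu", "pim", ...).
    KERNELS: Dict[str, Dict[str, Callable]] = {}

    def register(name: str, backend: str = "cpu"):
        """Register an implementation of a core algorithm under a backend tag."""
        def wrap(fn: Callable) -> Callable:
            KERNELS.setdefault(name, {})[backend] = fn
            return fn
        return wrap

    def dispatch(name: str, *args, pim_available: bool = False, **kwargs):
        """Route a call to the in-memory-processing variant when available,
        falling back to the CPU reference implementation."""
        impls = KERNELS[name]
        backend = "pim" if pim_available and "pim" in impls else "cpu"
        return impls[backend](*args, **kwargs)

    @register("monte_carlo_step", backend="cpu")
    def mc_step_cpu(state, propose, log_score):
        """One Metropolis accept/reject step (CPU reference version).
        log_score is an unnormalized log-probability."""
        candidate = propose(state)
        delta = log_score(candidate) - log_score(state)
        if delta >= 0 or random.random() < math.exp(delta):
            return candidate
        return state

A Monte Carlo, dynamic-programming, or graph-analytics kernel written against this kind of interface would not need to change when a processing-in-memory implementation of the same kernel is later registered under the "pim" tag.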
We also have ongoing collaborations to apply our research in the context of various applications, and our project will collect a diverse set of applications (especially from task 3.4) into a benchmark suite of challenging workloads. We distill key benchmark tasks relevant to every level of the system, along with associated QoS metrics, and use these to evaluate the effectiveness of the systems and programming environments designed by our lab and the other labs under CRISP. A key aspect of developing the benchmark suite is to develop domain-specific metrics for system efficiency that complement general-purpose QoS metrics (performance, power, etc.); an illustrative harness pairing the two kinds of metrics is sketched below.
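As a concrete illustration of pairing general-purpose QoS metrics with a domain-specific one, the following Python sketch times a task over a workload and reports mean and tail latency alongside a video-analytics metric (whether the task can sustain a 30 fps feed). The measure_task helper and the metric names are invented for illustration and are not part of the benchmark suite itself:

    import statistics
    import time
    from typing import Callable, Iterable

    def measure_task(task: Callable[[object], object],
                     workload: Iterable[object]) -> dict:
        """Run a benchmark task over a workload and collect QoS metrics."""
        latencies_ms = []
        for item in workload:
            start = time.perf_counter()
            task(item)
            latencies_ms.append((time.perf_counter() - start) * 1e3)
        latencies_ms.sort()
        return {
            # General-purpose QoS metrics.
            "mean_latency_ms": statistics.mean(latencies_ms),
            "p99_latency_ms": latencies_ms[int(0.99 * (len(latencies_ms) - 1))],
            # Domain-specific metric: can the task keep up with 30 fps video?
            "sustains_30fps": max(latencies_ms) <= 1000.0 / 30.0,
        }

    if __name__ == "__main__":
        print(measure_task(lambda frame: sum(range(10_000)), range(100)))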