Thursday, December 13, 2018

Joint Parsing for Understanding 3D Scenes and Human Activities in Videos

(Presenting Fri. 12/14.) We propose a computational framework to jointly parse a single RGB image and reconstruct a holistic 3D configuration composed of a set of CAD models using a stochastic grammar model. Specifically, we introduce a Holistic Scene Grammar (HSG) to represent the 3D scene structure, which characterizes a joint distribution over the functional and geometric space of indoor scenes. Furthermore, as the 3D environment becomes larger and more complex, the complexity of the query-reasoning system grows rapidly. Increasingly, these tasks must run at line speed just to keep up with the rate of new data production, and real-time processing is often needed in order to draw timely inferences. The algorithms involved include deep learning, dynamic programming, Monte-Carlo methods, graph analytics, and natural language processing. Because these core algorithms are applied widely across AI applications, we extract them into modules that can be accelerated by emerging in-memory processing technologies from CRISP. Deployments of real-time video analytics will need to do as much processing in the cameras as possible, and so will span edge devices to the cloud in implementing an end-to-end solution.
We also have ongoing collaborations to apply our research in the context of various applications, and our project will collect a diverse set of applications (especially from task 3.4) into a benchmark suite of challenging applications. We distill key benchmark tasks relevant to every level of the system, along with associated QoS metrics, and use these to evaluate the effectiveness of the systems and programming environments designed by our lab and other labs under CRISP. A key aspect of developing the benchmark suite is developing domain-specific metrics for system efficiency to complement general-purpose QoS metrics (performance, power, etc.).

Tuesday, December 11, 2018

Perceptual Compression for Video Storage and Processing Systems

(Presenting 12/12)

Compressed videos constitute 70% of Internet traffic, and video upload growth rates far outpace compute and storage improvement trends. Leveraging perceptual cues like saliency, i.e., regions where viewers focus their perceptual attention, can reduce compressed video size while maintaining perceptual quality, but requires significant changes to video codecs and ignores the data management of this perceptual information.

In this talk, we describe Vignette, a new compression technique and storage manager for perception-based video compression. Vignette complements off-the-shelf compression software and hardware codec implementations. Vignette’s compression technique uses a neural network to predict saliency information used during transcoding, and its storage manager integrates perceptual information into the video storage system to support a perceptual compression feedback loop. Vignette’s saliency-based optimizations reduce storage by up to 95% with minimal quality loss, and Vignette videos lead to power savings of 50% on mobile phones during video playback.
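As an illustration of the general idea (not Vignette's actual implementation), the sketch below assigns a coarser quantization parameter (QP) to low-saliency tiles of a frame, so that bits are concentrated where viewers actually look. The tile grid, QP range, and linear mapping are all hypothetical choices for this example.

```python
# Illustrative sketch: map per-tile saliency scores (0.0-1.0) to H.264/HEVC-style
# quantization parameters (QP). Higher QP means coarser quantization and fewer
# bits. The QP range and linear mapping here are hypothetical, not Vignette's
# actual scheme.

def saliency_to_qp(saliency, qp_min=22, qp_max=46):
    """Low-saliency tiles get a high QP (aggressive compression);
    high-saliency tiles get a low QP (near-transparent quality)."""
    s = max(0.0, min(1.0, saliency))
    return round(qp_max - s * (qp_max - qp_min))

def plan_frame(saliency_map):
    """saliency_map: 2D list of per-tile saliency scores for one frame."""
    return [[saliency_to_qp(s) for s in row] for row in saliency_map]

frame_saliency = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.4],   # viewer attention concentrated on the center tile
    [0.1, 0.2, 0.1],
]
qp_plan = plan_frame(frame_saliency)
```

In a real pipeline the saliency map would come from the neural network predictor and the QP plan would be handed to an off-the-shelf codec's rate-control interface.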

Monday, December 10, 2018

HotSpot Extensions for Microchannels

(Trey West, UVA, presenting on Mon 12/10/18.) 
Modern applications present issues that are increasingly difficult to solve with traditional 2D architectures, which has motivated research into 3D processing and memory architectures. A dominant problem in this field is that 3D architectures produce more heat than 2D architectures, yet are difficult to cool using traditional 2D approaches like heat sinks. New 3D cooling techniques are necessary, and with them, new ways to model them. My research has focused on extending HotSpot, an existing thermal modeling tool capable of modeling 3D architectures, with functionality for modeling 3D cooling techniques, with an emphasis on microchannel cooling.
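To give a flavor of the kind of modeling involved (a toy example, not HotSpot's actual grid-based model), the sketch below computes the steady-state junction temperature of a 1D stack of layers treated as a series chain of thermal resistances; swapping one layer for a lower-resistance microchannel layer stands in for liquid cooling. All resistance and power values are made up for illustration.

```python
# Toy steady-state thermal model: a vertical stack of layers treated as a
# series chain of thermal resistances (K/W). Heat generated at the source
# flows through every layer to an ambient heat sink.

def junction_temperature(power_w, layer_resistances, t_ambient=45.0):
    """Temperature at the heat source = ambient + power * total series resistance."""
    return t_ambient + power_w * sum(layer_resistances)

# A hypothetical 3D stack: [die, interlayer, die, sink interface].
stack_air_cooled   = [0.30, 0.20, 0.30, 0.50]
# Replacing the interface layer with a microchannel layer lowers its
# effective thermal resistance (liquid convection beats an air-cooled sink).
stack_microchannel = [0.30, 0.20, 0.30, 0.08]

t_air   = junction_temperature(60.0, stack_air_cooled)     # 60 W source
t_micro = junction_temperature(60.0, stack_microchannel)
```

Real thermal simulators solve a full 3D resistance-capacitance network per grid cell, and microchannel models additionally account for coolant flow rate and heat pickup along the channel; the point here is only the structure of the calculation.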

Thursday, December 6, 2018

High-performance In-Memory Data Partitioning

[Presenting on Friday, Dec 7th]

Data partitioning is an important primitive for in-memory data processing systems, and in many cases it is the key performance bottleneck. This primitive has been the focus of many studies in the past. However, as we argue in this talk, these previous studies have been narrow in scope, leaving many unanswered questions that are of paramount importance in practice. Consequently, to the best of our knowledge, there is no clear answer to the seemingly simple question of what constitutes an efficient partitioning strategy for in-memory data systems. In this talk, we carefully consider this data partitioning primitive in the context of multi-core in-memory data settings. We look at past work in this area and note that many of these studies overlook important aspects such as the impact of tuple size and of data formats (e.g., row-store vs. column-store). We build on this initial observation and examine a number of data partitioning strategies, leading to a better understanding of how data partitioning methods perform on modern multi-core, large-memory systems. We note a few interesting observations, including how relatively simple methods work quite well in practice across a broad spectrum of data parameters. To help future researchers, we propose a partitioning benchmark so that work in this area can take a broader and more realistic perspective when designing data partitioning methods. Overall, the key contribution of this talk is to separate the wheat from the chaff in previous research in this area, analyze the relative performance of various methods on a broad set of data parameters, and provide a more systematic evaluation framework for future work. We also point to opportunities for new research directions in this area.
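As a concrete (and deliberately simple) statement of the primitive under study, the sketch below hash-partitions tuples into a power-of-two number of output partitions. This is the kind of baseline against which more elaborate variants (radix passes, software write buffers, non-temporal stores) are typically compared; the hash function and partition count here are arbitrary.

```python
# Minimal hash partitioning: scatter (key, payload) tuples into 2^b partitions
# using the low bits of a hash of the key. Real in-memory partitioners layer
# optimizations on top of this scatter loop, but the core primitive is the same.

def hash_partition(tuples, num_partitions):
    assert num_partitions & (num_partitions - 1) == 0, "power of two"
    mask = num_partitions - 1
    parts = [[] for _ in range(num_partitions)]
    for key, payload in tuples:
        parts[hash(key) & mask].append((key, payload))
    return parts

data = [(k, "row%d" % k) for k in range(1000)]
partitions = hash_partition(data, 8)
```

Even at this level one can see where the questions raised in the talk enter: the payload width determines how many bytes each append moves, and a column-store input would scatter keys and payloads into separate arrays rather than into tuples.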

Wednesday, December 5, 2018

Integrated Data Transfer and Address Translation for CPU-GPU Environments

(Presenting 12/5)
Increasingly, accelerators such as GPUs, FPGAs, TPUs, etc. are being co-located with the main CPU on server systems to deal with the variable needs across and within workloads. While a lot of prior works have optimized for data movement amongst homogeneous components, the possible inefficiencies of similar data transfers across heterogeneous components has not been explored as much. In this talk, we will present costs due to excessive data transfers that happen in many current CPU-GPU systems, and their poor scalability as we evolve into multi-GPU systems. Since data transfers happen at coarse (page) granularities, and on demand, it results in poor efficiencies, especially with non-useful data being moved as well. We propose (i) compiler based approaches to relayout the data before the movement and (ii) novel address translation mechanisms to handle the consequence of the new data layout. We will present experimental results showing the benefits of such an approach.

Friday, November 30, 2018

Ultra-Dense, Low Power and Resilient Physical Unclonable Function Based on 3D NAND Flash Memory Array

(Presenting Mon. 12/3) -- 3D NAND flash memory has become an integral part of cyber-physical systems, helping cope with the huge data explosion in this era of the Internet of Things (IoT). Moreover, hardware security primitives such as physical unclonable functions (PUFs) have become indispensable in the functional circuits of these cyber-physical systems for protection against security vulnerabilities and adversarial attacks. In this talk, we present for the first time a PUF exploiting the intrinsic variability in the string current of the ubiquitous 3D NAND flash memory, which arises from process variations and inherent material imperfections such as grain boundaries and their associated traps. The proposed PUF exhibits excellent performance metrics, including uniformity (50%), diffuseness (50%), and uniqueness (50.08%), and is resilient to machine learning attacks. The ultra-dense 3D NAND flash memory array also enables a significantly large set of challenge-response pairs (CRPs) for strong PUF operation.
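The metrics quoted above have standard definitions, which the sketch below computes for some made-up response bitstrings: uniformity is the fraction of 1s within one response, and uniqueness is the mean pairwise fractional Hamming distance between responses of different PUF instances to the same challenge. The response values themselves are invented for illustration.

```python
# Standard PUF quality metrics on response bitstrings (lists of 0/1 bits).
# Ideal values: uniformity ~50% (balanced 0s and 1s within one response) and
# uniqueness ~50% (responses of different chips differ in half their bits).
from itertools import combinations

def uniformity(response):
    return 100.0 * sum(response) / len(response)

def uniqueness(responses):
    """Mean pairwise fractional Hamming distance across PUF instances, in %."""
    dists = [sum(a != b for a, b in zip(r1, r2)) / len(r1)
             for r1, r2 in combinations(responses, 2)]
    return 100.0 * sum(dists) / len(dists)

# Made-up 8-bit responses from three hypothetical PUF instances.
chips = [
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 1, 0],
]
```

Real evaluations use responses thousands of bits long, measured across temperature and voltage corners; with only three 8-bit toy responses the numbers here are noisy by construction.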

Thursday, November 29, 2018

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

(Presenting on Friday, 11/30/18)  There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. In this talk, we introduce TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.
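To make the operator-fusion idea concrete (a plain-Python sketch, not TVM's actual IR or API), the example below contrasts evaluating two elementwise operators with an intermediate buffer against a fused version that applies both in a single pass, eliminating the intermediate write and read.

```python
# Sketch of high-level operator fusion for y = relu(x * 2) over a vector.
# Unfused, each operator materializes a full intermediate buffer; fused,
# both operators are applied per element in one loop, so the intermediate
# value never touches memory. Compilers like TVM perform this transformation
# on a graph representation of the model rather than on Python lists.

def unfused(xs):
    scaled = [x * 2 for x in xs]          # intermediate buffer written...
    return [max(0, s) for s in scaled]    # ...then read back

def fused(xs):
    return [max(0, x * 2) for x in xs]    # one pass, no intermediate

data = [-3, -1, 0, 2, 5]
```

For memory-bound elementwise chains, fusion roughly halves the DRAM traffic per fused pair, which is why it is one of the first graph-level optimizations such compilers apply.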

Tuesday, November 27, 2018

HeteroCL: An Intermediate Programming Abstraction for Heterogeneous Computing

(Presenting on Wed. 11/28/18) With the pursuit of better performance under strict physical constraints, there is an increasing need to deploy applications on heterogeneous compute architectures with accelerators. Among these accelerators, intelligent memory and storage (IMS) architectures have been proposed to provide an environment where computation can be placed as close as possible to the memory cells. This kind of architecture can greatly increase parallelism and energy efficiency, enabling us to run data-intensive applications efficiently.

This project aims at developing an intuitive programming model that provides high-level abstractions for programming heterogeneous accelerator architectures, including FPGAs and IMS accelerators.
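One hallmark of such intermediate programming abstractions is decoupling what is computed from how it is mapped to hardware. The plain-Python sketch below (not HeteroCL's actual API) expresses a vector add once and applies an optional "schedule" decision, loop tiling, without changing the result, mimicking how a single algorithm description can be retargeted to CPUs, FPGAs, or IMS accelerators.

```python
# Sketch of algorithm/schedule decoupling: the algorithm (vector add) is
# written once; a schedule decision (tile the loop, e.g., to match an
# accelerator's burst or bank width) changes the iteration order but never
# the computed result.

def vadd(a, b, tile=None):
    n = len(a)
    out = [0] * n
    if tile is None:                      # default schedule: flat loop
        for i in range(n):
            out[i] = a[i] + b[i]
    else:                                 # tiled schedule
        for t in range(0, n, tile):
            for i in range(t, min(t + tile, n)):
                out[i] = a[i] + b[i]
    return out

a = list(range(10))
b = list(range(10, 20))
```

In a real abstraction the schedule would be a separate object applied to the algorithm by the compiler, so the same source can be tuned per target without being rewritten.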

Thursday, November 15, 2018

A 3D In-memory Architecture for Exploiting Internal Memory Bandwidth in an Area-constrained Logic Layer 

Bulk primitives such as element-wise operations on large vectors, reduction, and scan appear in many applications. The required computation per element is low for such operations. As a result, the cost of moving data from DRAM to a separate processor surpasses the cost of computation, and consequently data movement comprises a significant portion of execution time and energy consumption. To alleviate this cost, prior work has proposed two main approaches: (i) adding logic to the row buffers of memory arrays, and (ii) adding processing elements to the logic layer of 3D stacked memories. The first approach imposes a significant hardware overhead and is not flexible enough to support all the required operations; consequently, the second approach seems more practical. However, due to the limited area of the logic layer, processing elements with a traditional architecture that fit within it cannot provide enough parallelism to consume all the available internal memory bandwidth. The goal of this project is to propose a new architecture that is efficient enough to consume all the available bandwidth and small enough to fit in the logic layer.


(Presented on Monday, 11/26/18)
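For reference, the bulk primitives named above are simple to state; the sketch below shows elementwise addition, reduction, and inclusive scan over a vector, each doing only one arithmetic operation per element, which is exactly why data movement rather than compute dominates their cost on conventional systems.

```python
# The three bulk primitives discussed above, in their simplest form.
# Each performs O(1) arithmetic per element, so on a conventional system the
# DRAM traffic to fetch the inputs dwarfs the compute -- the motivation for
# executing them near memory.
from itertools import accumulate
from operator import add

def elementwise_add(a, b):
    return [x + y for x, y in zip(a, b)]

def reduction(a):
    total = 0
    for x in a:
        total += x
    return total

def inclusive_scan(a):
    return list(accumulate(a, add))

v = [1, 2, 3, 4]
```

An in-memory design parallelizes these loops across vaults or banks so that every internal channel of the stack streams data concurrently, rather than funneling everything through an external memory bus.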

Welcome to our blog!

CRISP researchers and students will share posts here three times per week beginning November 26.