Thursday, January 31, 2019

A little latency goes a long way: Memory latency and its impact on CPU-driven inference performance

(Ameen Akel of Micron is presenting Mon. Feb 4, 2019)

Memory media latency is a perennially hot topic: systems architects seek the lowest possible latencies, which, in turn, drives memory manufacturers to architect memories for the minimum possible access latency.  While some applications may benefit from lower media latencies, most exhibit very little sensitivity.  We aim to dispel the memory latency myth: the performance of workloads like DNN inference, across a wide variety of models, is not strongly correlated with memory media latency.  Freeing memory companies from these latency shackles enables favorable memory media architecture tradeoffs that systems architects may not expect.

Monday, January 28, 2019

Hierarchical and Distributed Machine Learning Inference Beyond the Edge

(Presenting Wednesday Jan 30, 2019) Networked applications with heterogeneous sensors are a growing source of data in the Internet of Things (IoT) environment. Many IoT applications use machine learning (ML) to make real-time predictions. The current dominant approach to deploying ML inference is monolithic, i.e., when inference needs to be performed using data generated by multiple sensors, the features generated by each sensor are joined in a centralized cloud-based tier to perform the inference computation. Since inference typically occurs with high frequency, the monolithic approach can quickly lead to burdensome levels of communication, which wastes energy, reduces data privacy, and often bottlenecks the network, violating real-time constraints. In this work, we study a novel approach that mitigates these issues by “pushing” ML inference computations out of the cloud and onto a hierarchy of IoT devices, which compute successively more compressed representations of raw sensor data. We present a new technical challenge of “rewriting” the functional form of an ML inference computation to factor it over a network of devices without significantly reducing prediction accuracy. We present novel “hierarchy-aware” neural network architectures that enable users to trade off between communication cost and accuracy. We also present novel exact factoring algorithms for other popular ML models, including gradient boosted trees and random forests, that preserve accuracy while substantially reducing communication. We evaluate our approach on three real-world problems: urban energy demand prediction, human activity prediction, and server performance prediction. Our approach yields substantial reductions in energy use and latency on IoT devices while providing the same level of prediction quality as the current monolithic approach. Measurements on a common IoT device show that energy use and latency can be reduced by up to 63% and 67%, respectively, without reducing accuracy relative to the full-communication setting.
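As a rough illustration of the factoring idea for additive tree ensembles (our sketch, not the authors' algorithm or code), the snippet below assumes each tree depends only on one sensor's features, so each device can evaluate its own trees locally and transmit a single partial score upstream; the parent simply adds the partial scores, reproducing the centralized prediction exactly. All names and values are hypothetical.

```python
# Minimal sketch: factoring an additive tree-ensemble prediction across devices.
# Assumes every tree reads features from a single sensor, so its partial score
# can be computed locally and only one number per device is sent upstream.

def tree_predict(tree, features):
    """Evaluate one decision tree given a dict of local feature values."""
    node = tree
    while "leaf" not in node:
        f, t = node["feature"], node["threshold"]
        node = node["left"] if features[f] <= t else node["right"]
    return node["leaf"]

def device_partial_score(local_trees, local_features):
    """Runs on an IoT device: sum of the trees that depend only on local data."""
    return sum(tree_predict(t, local_features) for t in local_trees)

def aggregate(partial_scores, bias=0.0):
    """Runs on the parent node / cloud: adds partial scores from children."""
    return bias + sum(partial_scores)

# Toy example: two sensors, one single-split tree each (hypothetical values).
trees_sensor_a = [{"feature": "temp", "threshold": 20.0,
                   "left": {"leaf": 0.5}, "right": {"leaf": 1.5}}]
trees_sensor_b = [{"feature": "power", "threshold": 100.0,
                   "left": {"leaf": -0.2}, "right": {"leaf": 0.8}}]

score_a = device_partial_score(trees_sensor_a, {"temp": 23.0})
score_b = device_partial_score(trees_sensor_b, {"power": 80.0})
print(aggregate([score_a, score_b]))  # identical to centralized evaluation
```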


Sunday, January 27, 2019

PIMProf: A Performance Profiler for Processing-in-Memory Architectures


Monday, 1/28/19 at 2:00PM ET – Task 1.5 (Evaluation Through Architectural Simulation and Prototyping)
PIM architectures have drawn increasing research interest as a way to mitigate the data-movement bottleneck of current DRAM-based architectures, and a variety of them have been proposed to accelerate various data-intensive workloads. However, for a given workload, it is difficult to determine which parts of a program should be offloaded to a given PIM architecture to gain the best performance, and how much performance gain is possible. We propose PIMProf, a tool that uses a combination of static and runtime analysis to automatically detect PIM candidates and estimate the speedup of the program. Our key ideas are as follows: First, PIMProf uses static analysis to capture the dependency between computation and data access, and constructs both a control flow graph and a data dependency graph of the program. Second, PIMProf profiles the computation cost and memory access cost of the program and attributes these costs to the nodes and edges of the graph. Finally, we show how to formalize the PIM offloading problem as a cost-minimization problem over the weighted graph.
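To make the final step concrete, here is a minimal, hypothetical sketch (not PIMProf's actual formulation) of offloading as cost minimization: the program is reduced to a chain of regions, each with an assumed CPU cost and PIM cost, and switching sides pays a data-movement cost; a small dynamic program then picks the placement with the lowest total cost.

```python
# Illustrative only: offloading as cost minimization over a straight-line chain
# of regions. Each region has a hypothetical (cpu_cost, pim_cost); moving data
# between CPU and PIM between consecutive regions costs `switch_cost`.

def best_offloading(regions, switch_cost):
    """regions: list of (cpu_cost, pim_cost). Returns (min_cost, placement)."""
    INF = float("inf")
    # dp[side] = (best total cost, placement) for schedules ending on `side`;
    # assume execution starts on the CPU, so starting on PIM pays one switch.
    dp = {"cpu": (0.0, []), "pim": (switch_cost, [])}
    for cpu_c, pim_c in regions:
        new_dp = {}
        for here, run_c in (("cpu", cpu_c), ("pim", pim_c)):
            best = (INF, [])
            for there, (cost, path) in dp.items():
                total = cost + run_c + (switch_cost if there != here else 0.0)
                if total < best[0]:
                    best = (total, path + [here])
            new_dp[here] = best
        dp = new_dp
    return min(dp.values())

# Hypothetical profile: (cpu_cost, pim_cost) per region, switch cost of 1.
cost, placement = best_offloading([(10, 4), (2, 8), (9, 3), (1, 7)], 1.0)
print(cost, placement)  # which regions are worth offloading, and at what cost
```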

Thursday, January 24, 2019

Monolithic-3D Integration Augmented Design Techniques for Computing in SRAMs

Presenting Friday, January 25th, at 2PM EST
Task 3.6: Cognitive Architectures
In-memory computing has emerged as a promising solution to address the logic-memory performance gap. We propose design techniques using monolithic-3D integration to achieve reliable multi-row activation, which in turn enables computation as part of data readout. Our design is 1.8x faster than existing techniques for Boolean computations. We quantitatively show that cell stability is unaffected when multiple rows are activated, so no extra hardware is required to maintain cell stability during computations. We also propose an in-memory digital-to-analog conversion technique using a 3D-CAM primitive. The design makes effective use of the relatively low-strength layer-2 transistors and provides 7x power savings compared with a specialized in-memory converter. Lastly, we present a linear classifier system that makes use of the above-mentioned techniques and computes vector-matrix multiplication 47x faster than a dedicated hardware engine.
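As a rough behavioral model of computation-during-readout (an assumption-level sketch, not the paper's circuit design), simultaneously activating two SRAM rows is often described as sensing the bitwise AND of the rows on the bitline and their NOR on the complementary bitline; the snippet below mimics that behavior in software.

```python
# Behavioral model of multi-row activation in an SRAM array (assumed semantics):
# activating two rows at once yields AND on the bitline and NOR on the
# complementary bitline; other Boolean functions can be derived from these.

def multi_row_read(row_a, row_b):
    """Return (AND, NOR) results of activating two rows together, bit by bit."""
    and_bits = [a & b for a, b in zip(row_a, row_b)]
    nor_bits = [1 - (a | b) for a, b in zip(row_a, row_b)]
    return and_bits, nor_bits

row0 = [1, 0, 1, 1, 0, 1, 0, 0]
row1 = [1, 1, 0, 1, 0, 0, 1, 0]
and_res, nor_res = multi_row_read(row0, row1)
or_res = [1 - b for b in nor_res]   # OR recovered by inverting NOR
print(and_res)  # [1, 0, 0, 1, 0, 0, 0, 0]
print(or_res)   # [1, 1, 1, 1, 0, 1, 1, 0]
```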

Tuesday, January 22, 2019

Wearout and active accelerated recovery for processing in emerging technology memories

(Presenting Wednesday, January 23rd, at 2pm EST) Trying to break the memory wall includes various efforts to bring memory closer to the processor, or to push processing into the memory stack, generally called Processing in Memory (PIM). If the memory is one of the various "flavors" of non-volatile emerging memory technologies, such as spin-transfer torque RAM (STTRAM), phase-change memory (PCM), resistive RAM (RRAM), memristor, 3D XPoint, etc., limited endurance becomes an important issue in addition to all the other general challenges common to PIM. Endurance refers to the fact that most non-volatile memories, including all of these emerging technologies but also more traditional ones like Flash and EEPROM, have a limited lifetime in terms of how many times they can be written and erased; once the limit is exceeded, the number of faults in the memory increases rapidly and the memory device can no longer be used reliably. For storage applications (the main use of non-volatile memories until now), one way to deal with limited endurance is to overprovision the device (i.e., leave some of the native capacity unutilized up front and allocate it later as memory blocks start failing due to limited endurance) and to use wear leveling, adding a level of memory virtualization in the form of a Flash Translation Layer (FTL) that dynamically maps logical blocks to physical blocks so that write/erase cycles are distributed more or less equally across the physical blocks and no single block is overwritten too many times in a row. Although the concept of an FTL was first introduced for Flash, similar mechanisms will work (and likely be necessary) for all emerging memory technologies with limited endurance (e.g., although not explicitly stated, it is likely that Intel Optane 3D XPoint uses a similar mechanism for wear leveling). FTLs are an acceptable solution for storage applications but are suboptimal (to say the least) for main memory, and even more so for processing in memory: the logical-to-physical mapping adds latency on every access, and moving data when a logical block must be re-allocated to a new physical block adds extra-long delays. Because of this, methods that intrinsically compensate for limited endurance would be especially preferable for PIM.
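For readers unfamiliar with the mechanism, the toy model below (purely illustrative; any real FTL is far more sophisticated) shows the remapping idea: logical blocks are dynamically mapped onto the least-worn free physical block from an overprovisioned pool, so repeated writes to one logical block are spread across many physical blocks.

```python
# Toy FTL-style remapping: writes to a logical block go to the least-worn free
# physical block, and the overprovisioned spare pool absorbs the remapping so
# no single physical block is overwritten too many times in a row.

class ToyFTL:
    def __init__(self, logical_blocks, physical_blocks):
        assert physical_blocks > logical_blocks   # overprovisioning
        self.map = {}                              # logical -> physical
        self.wear = [0] * physical_blocks          # per-block write counts
        self.free = set(range(physical_blocks))

    def write(self, logical_block):
        # Release the previously mapped physical block back to the free pool.
        old = self.map.pop(logical_block, None)
        if old is not None:
            self.free.add(old)
        # Pick the least-worn free physical block (simple wear leveling).
        target = min(self.free, key=lambda p: self.wear[p])
        self.free.remove(target)
        self.wear[target] += 1
        self.map[logical_block] = target
        return target

ftl = ToyFTL(logical_blocks=4, physical_blocks=6)
for _ in range(1000):
    ftl.write(0)          # hammering a single logical block...
print(max(ftl.wear))      # ...still spreads the wear across physical blocks
```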

One such method is to take advantage of the recovery mechanisms associated with the stress that leads to the limited endurance in the first place. Since stress and wearout are mechanisms that take a device out of physical equilibrium, it turns out that simple thermodynamics tends to partially reverse the effect of stress when the stress is removed. This is a general physics argument that has been experimentally demonstrated for several wearout mechanisms, including Flash wearout, but also for other, more general ones such as NBTI/PBTI, hot electrons, electromigration, etc. In this talk I will go over several of these mechanisms and explain the source of stress and ways to reverse it. The main idea is to go beyond simple passive recovery (just remove stress and wait) by reversing the direction of stress (active recovery) and accelerating the process (e.g., by increasing temperature). Such accelerated active recovery can lead to many orders of magnitude improvement in endurance, thus making processing in emerging technology memories practical.

Friday, January 18, 2019

Deep Learning for Pancreatic Cancer Histopathology Image Analysis

(Adib Keikhosravi / Kevin Eliceiri Presenting Fri. 1/18/19) 
Whole slide imaging (WSI), or virtual microscopy, is an imaging modality used to convert animal or human pathology tissue slides into digital images for teaching, research, or clinical applications. This method is popular due to educational and clinical demands. Although modern whole-slide scanners can now scan tissue slides at high resolution in a relatively short period of time, significant challenges, including the high cost of equipment and data storage, remain unsolved. Machine learning and deep learning techniques in Computer Aided Diagnosis (CAD) platforms have begun to be widely used for biomedical image analysis by physicians and researchers. We are building a platform for histopathological image super-resolution and cancer grading and staging, with a main focus on pancreatic cancer. We present a computational approach for improving the resolution of images acquired from commonly available low-magnification commercial slide scanners. Images from such scanners can be acquired cheaply and are efficient in terms of storage and data transfer. However, they are generally of poorer quality than images from high-resolution scanners and microscopes, lack the resolution needed in diagnostic or clinical environments, and hence are not used in such settings. First, we developed a deep learning framework that implements regularized sparse coding to smoothly reconstruct high-resolution images given their low-resolution counterparts. Results show that our method produces images that are similar to images from high-resolution scanners, both in quality and in quantitative measures, and compares favorably to several state-of-the-art methods across a number of test images. To further improve the results, we used a convolutional neural network (CNN) based approach that is specifically trained to take low-resolution slide scanner images of cancer tissue and convert them into high-resolution images. We validate these resolution improvements with computational analysis to show that the enhanced images offer the same quantitative results. This project is still ongoing, and we are now exploring the use of intermediate resolutions to improve image quality with recurrent neural networks.
Separately, current approaches for pathological grading/staging of many cancer types, such as breast and pancreatic cancer, lack accuracy and interobserver agreement. Google Research recently used Inception for high-accuracy tumor cell localization. However, as our group has been uncovering the prognostic role of stromal reorganization in different cancer types, including pancreatic cancer, which is projected to be the second leading cause of cancer death by 2030, we use a holistic approach that includes both stroma and cells from small tissue microarray (TMA) punches of different cancer grades, accompanied by normal samples. For this study we used transfer learning from four award-winning networks, VGG16, VGG19, GoogLeNet, and ResNet101, for the task of pancreatic cancer grading. Although all of these networks have shown great performance on natural image classification, ResNet showed the highest performance, with 88% accuracy in four-tier grading and higher accuracy in all one-to-one comparisons between normal tissue and the different grades. We fine-tuned this network again for different TNM classification and staging tasks, and although all the images were selected from small regions of the pancreas, the results show the promising capability of CNNs in helping pathologists with diagnosis.
To achieve higher accuracy, we have almost doubled the size of the dataset; training is still running, and we will update the audience in future talks.
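For context, a transfer-learning setup of the kind described above might look like the following PyTorch sketch (our illustration, not the authors' code); the batch and class labels below are placeholders standing in for TMA image tiles.

```python
# Minimal transfer-learning sketch: reuse an ImageNet-pretrained ResNet and
# replace its classifier head for four-tier grading (normal + three grades).
import torch
import torch.nn as nn
from torchvision import models

num_classes = 4                                   # e.g. normal, grade 1-3
model = models.resnet101(pretrained=True)         # ImageNet weights
for p in model.parameters():                      # freeze the backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch standing in for TMA tiles.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
logits = model(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(float(loss))
```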

Monday, January 14, 2019

Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks


(Presenting 1/16)

Emerging fast, non-volatile memories will enable systems with large amounts of non-volatile main memory (NVMM) attached to the CPU memory bus, bringing the possibility of dramatic performance gains for IO-intensive applications. This paper analyzes the impact of state-of-the-art NVMM file systems on some of these applications and explores how those applications best leverage the performance that NVMMs offer.
Our analysis leads to several conclusions about how systems and applications should adapt to NVMMs. We propose FiLe Emulation with DAX (FLEX), a technique for moving file operations into user space, and show that it and other simple changes can dramatically improve application performance. We examine the scalability of NVMM file systems in light of rising core counts and pronounced NUMA effects in modern systems, and propose changes to Linux’s virtual file system (VFS) to improve scalability. We also show that adding NUMA-aware interfaces to an NVMM file system can significantly improve performance.
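The core idea behind moving file operations into user space on a DAX-capable file system can be sketched as follows (an illustration under our own assumptions, not the paper's implementation; the mount point is hypothetical): the file is mapped once, and subsequent updates become ordinary memory stores rather than per-operation system calls.

```python
# Sketch: on a DAX-mounted NVMM file system, map a file into the address space
# once, then perform "file writes" as plain memory stores in user space.
import mmap
import os

PATH = "/mnt/pmem/example.dat"      # hypothetical DAX-mounted NVMM file
SIZE = 4096

fd = os.open(PATH, os.O_CREAT | os.O_RDWR, 0o644)
os.ftruncate(fd, SIZE)              # one syscall to size the file up front

with mmap.mmap(fd, SIZE) as buf:    # map once...
    buf[0:5] = b"hello"             # ...then updates are plain memory stores
    buf.flush()                     # flush so the stored data reaches the media
os.close(fd)
```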

String Figure: A Scalable and Elastic Memory Network Architecture

(Presenting Mon 1/14)

The demand for server memory capacity and performance has been rapidly increasing due to the expanding working set sizes of modern applications, such as big data analytics, in-memory computing, deep learning, and server virtualization. One promising technique for tackling these requirements is the memory network, in which the server memory system consists of multiple 3D die-stacked memory nodes interconnected by a high-speed network. However, current memory network designs face substantial scalability challenges, including (1) maintaining high throughput and low latency in large-scale memory networks at low hardware cost, (2) efficiently interconnecting an arbitrary number of memory nodes, and (3) supporting flexible memory network scale expansion and reduction without major modification of the memory network design and physical implementation.

To address these challenges, we propose String Figure, a high-throughput, elastic, and scalable memory network architecture. String Figure consists of three design components. First, we propose an algorithm to generate random interconnect topologies that achieve high network throughput and near-optimal path lengths in large-scale memory networks with over one thousand nodes; our topology also ensures that the number of required router ports does not increase as the network scale grows. Second, we design a compute+table hybrid routing protocol that reduces both the computation and storage overhead of routing. Third, we propose a set of network reconfiguration mechanisms that allow both static and dynamic network scale expansion and reduction. Our experiments based on RTL simulation demonstrate that String Figure can interconnect over one thousand memory nodes with a shortest path length within five hops across various synthetic and real workloads. Our design also achieves a 1.3× throughput improvement and a 36% reduction in system energy consumption compared with traditional memory network designs.
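As a loose illustration of the first component (not the String Figure algorithm itself), the sketch below builds a random topology whose per-node port count stays fixed as the network grows: a ring guarantees connectivity, random chords keep shortest paths small, and a BFS samples the worst-case hop count.

```python
# Sketch: random topology with a bounded per-node port count, plus a BFS-based
# check of shortest-path lengths as the network scales.
import random
from collections import deque

def random_topology(n, ports=6, seed=0):
    rng = random.Random(seed)
    adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}  # base ring
    nodes = list(range(n))
    for _ in range(4 * n):                      # try adding random chords
        a, b = rng.sample(nodes, 2)
        if b not in adj[a] and len(adj[a]) < ports and len(adj[b]) < ports:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def eccentricity(adj, src):
    """Longest shortest path (in hops) from src, via breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

adj = random_topology(1024, ports=6)
print(max(eccentricity(adj, s) for s in range(0, 1024, 64)))  # sampled worst case
```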