Thursday, January 31, 2019
A little latency goes a long way: Memory latency and its impact on CPU-driven inference performance
(Ameen Akel of Micron is presenting Mon. Feb 4, 2019)
Memory media latencies are often a hot topic: systems architects seek the lowest possible latencies, which, in turn, drives memory manufacturers to architect memories for the minimum possible access latencies. While some applications may benefit from lower media latencies, most applications exhibit very little sensitivity. We aim to dispel the memory latency myth: workloads like DNN inference, across a wide variety of models, are not strongly correlated with memory media latency. Freeing memory companies of the memory latency shackles enables favorable memory media architecture tradeoffs that systems architects may not expect.
Monday, January 28, 2019
Hierarchical and Distributed Machine Learning Inference Beyond the Edge
(Presenting Wednesday Jan 30, 2019) Networked applications with heterogeneous sensors are a growing source of data in the Internet of Things (IoT) environment. Many IoT applications use machine learning (ML) to make real-time predictions. The current dominant approach to deploying ML inference is monolithic, i.e., when inference needs to be performed using data generated by multiple sensors, the features generated by each sensor are joined in a centralized cloud-based tier to perform the inference computation. Since inference typically occurs with high frequency, the monolithic approach can quickly lead to burdensome levels of communication, which wastes energy, reduces data privacy, and often bottlenecks the network, violating real-time constraints. In this work, we study a novel approach that mitigates these issues by “pushing” ML inference computations out of the cloud and onto a hierarchy of IoT devices, which compute successively more compressed representations of raw sensor data. We present a new technical challenge of “rewriting” the functional form of an ML inference computation to factor it over a network of devices without significantly reducing prediction accuracy. We present novel “hierarchy-aware” neural network architectures that enable users to trade off between communication cost and accuracy. We also present novel exact factoring algorithms for other popular ML models, including gradient boosted trees and random forests, that preserve accuracy while substantially reducing communication. We evaluate our approach on three real-world problems: urban energy demand prediction, human activity prediction, and server performance prediction. Our approach achieves substantial reductions in energy use and latency on IoT devices while providing the same level of prediction quality as the current monolithic approach. Measurements on a common IoT device show that energy use and latency can be reduced by up to 63% and 67%, respectively, without reducing accuracy relative to the full-communication setting.
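As a rough illustration of the kind of factoring described above (not the authors' actual architecture), the following Python sketch, written against PyTorch, places a small encoder on each sensor device so that only a few compressed values per device cross the network to a central tier that fuses them; the module names, dimensions, and code size are hypothetical.

import torch
import torch.nn as nn

# Hedged, illustrative sketch of a hierarchy-aware model: each sensor runs a
# small encoder locally and only its compressed code is transmitted; a central
# head fuses the codes. Names and sizes are hypothetical.
class SensorEncoder(nn.Module):
    """Runs on (or near) one sensor; compresses raw readings to a few values."""
    def __init__(self, in_dim, code_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                 nn.Linear(32, code_dim))
    def forward(self, x):
        return self.net(x)

class CentralHead(nn.Module):
    """Runs in the cloud/gateway tier; fuses compressed codes from all sensors."""
    def __init__(self, n_sensors, code_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_sensors * code_dim, 64), nn.ReLU(),
                                 nn.Linear(64, out_dim))
    def forward(self, codes):
        return self.net(torch.cat(codes, dim=-1))

# Example: 3 sensors, 16 raw features each, compressed to 4 values per sensor.
encoders = [SensorEncoder(16, 4) for _ in range(3)]
head = CentralHead(n_sensors=3, code_dim=4, out_dim=1)
raw = [torch.randn(1, 16) for _ in range(3)]        # per-device raw readings
codes = [enc(x) for enc, x in zip(encoders, raw)]   # only these cross the network
prediction = head(codes)

Shrinking or growing code_dim is one simple way to explore the communication/accuracy trade-off the abstract describes.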
Sunday, January 27, 2019
PIMProf: A Performance Profiler for Processing-in-Memory Architectures
(Presenting Monday, 1/28/19 at 2:00PM ET – Task 1.5: Evaluation Through Architectural Simulation and Prototyping)
PIM architectures have drawn increasing research interest as a way to mitigate the data movement bottleneck in current DRAM-based architectures, and a variety of them have been proposed for accelerating various data-intensive workloads. However, for a given workload, it is difficult to determine which parts of a program should be offloaded to a given PIM architecture to gain the best performance, and how much performance gain is possible. We propose PIMProf, a tool that uses a combination of static and runtime analysis to automatically detect PIM offloading candidates and estimate the speedup of the program. Our key ideas are as follows: First, PIMProf uses static analysis to capture the dependency between computation and data access, and constructs both a control flow graph and a data dependency graph of the program. Second, PIMProf profiles the computation cost and memory access cost of the program and attributes these costs to the nodes and edges of the graph. Finally, we show how to formalize the PIM offloading problem as a cost-minimization problem on the weighted graph.
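To make the last step concrete, here is a hedged Python sketch of posing offloading as cost minimization on a weighted graph; it illustrates the general formulation rather than PIMProf's actual algorithm, and the region names and costs are made up.

from itertools import product

# Hedged, illustrative sketch (not PIMProf's actual algorithm): each program
# region has a CPU cost and a PIM cost; each edge has a data-transfer cost paid
# only when its endpoints are placed on different sides. We search for the
# placement with minimum total cost.
nodes = {"A": (10, 4), "B": (6, 9), "C": (12, 3)}        # region: (cpu_cost, pim_cost), hypothetical
edges = {("A", "B"): 5, ("B", "C"): 2, ("A", "C"): 7}    # hypothetical transfer costs

def total_cost(placement):
    cost = sum(nodes[n][1] if placement[n] == "PIM" else nodes[n][0] for n in nodes)
    cost += sum(w for (u, v), w in edges.items() if placement[u] != placement[v])
    return cost

names = list(nodes)
best = min((dict(zip(names, p)) for p in product(["CPU", "PIM"], repeat=len(names))),
           key=total_cost)
print(best, "total cost:", total_cost(best))

Brute force is only viable for a handful of regions; the point is just to show the shape of the objective that a profiler-driven optimizer would minimize.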
Thursday, January 24, 2019
Monolithic-3D Integration Augmented Design Techniques for Computing in SRAMs
(Presenting Friday, January 25th, at 2PM EST – Task 3.6: Cognitive Architectures)
In-memory computing has emerged as a promising solution to address the logic-memory performance gap. We propose design techniques that use monolithic-3D integration to achieve reliable multi-row activation, which in turn enables computation as part of data readout. Our design is 1.8x faster than existing techniques for Boolean computations. We quantitatively show that there is no impact on cell stability when multiple rows are activated, so no extra hardware is required to maintain cell stability during computations. We also propose an in-memory digital-to-analog conversion technique using a 3D-CAM primitive. The design makes effective use of the relatively low-strength layer-2 transistors and provides 7x power savings compared with a specialized in-memory converter. Lastly, we present a linear classifier system that makes use of the above-mentioned techniques and is 47x faster at computing vector-matrix multiplication than a dedicated hardware engine.
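As background on what "computation as part of data readout" via multi-row activation can look like, here is a deliberately simplified Python model of a common in-SRAM computing scheme in which activating two rows at once lets the bitline pair be sensed as the AND and NOR of the stored words; this is an assumed, generic behavioral model for illustration, not the circuit proposed in the talk.

import numpy as np

# Behavioral model of multi-row activation for in-SRAM Boolean computing
# (generic illustration, not the talk's monolithic-3D circuit): activating two
# rows and sensing the bitline pair yields AND and NOR of the stored words,
# from which other functions can be composed.
array = np.array([[1, 0, 1, 1, 0, 1, 0, 0],     # row 0, one bit per column
                  [1, 1, 0, 1, 0, 0, 1, 0]], dtype=np.uint8)

def multi_row_activate(rows):
    bits = array[rows]
    land = np.bitwise_and.reduce(bits, axis=0)     # modeled result on the bitline
    lnor = 1 - np.bitwise_or.reduce(bits, axis=0)  # modeled result on bitline-bar
    return land, lnor

land, lnor = multi_row_activate([0, 1])
lor = 1 - lnor                                     # OR derived from NOR
lxor = lor & (1 - land)                            # XOR = OR AND NOT(AND)
print("AND:", land, "OR:", lor, "XOR:", lxor)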
Tuesday, January 22, 2019
Wearout and active accelerated recovery for processing in emerging technology memories
(Presenting Wednesday, January 23rd, at 2pm EST) Efforts to break the memory wall include bringing memory closer to the processor and pushing processing into the memory stack, generally called Processing in Memory (PIM). If the memory is one of the various "flavors" of non-volatile emerging memory technologies, such as spin-transfer torque RAM (STTRAM), phase-change memory (PCM), resistive RAM (RRAM), memristor, 3D XPoint, etc., limited endurance becomes an important issue in addition to all the other general challenges common to PIM. Endurance refers to the fact that most non-volatile memories, including all these new emerging technologies, but also more traditional ones like Flash and EEPROM, have a limited lifetime in terms of how many times they can be written and erased - once the limit is exceeded, the number of faults in the memory increases rapidly and the memory device can no longer be used reliably.
For storage applications (the main use of non-volatile memories until now), one way to deal with the limited endurance is to overprovision the device (i.e., leave some of the native capacity unutilized up-front and allocate it later as memory blocks start failing due to the limited endurance) and to use wear leveling, by adding a level of memory virtualization in the form of a Flash Translation Layer (FTL) that maps logical blocks to physical blocks in a dynamic way, so that write/erase cycles are more or less equally distributed across the physical blocks and no single block gets overwritten too many times in a row. Although the concept of an FTL was first introduced for Flash, similar mechanisms will work (and will likely be necessary) for all emerging memory technologies with limited endurance (e.g., although not explicitly stated, it is likely that Intel Optane 3D XPoint uses a similar mechanism for wear leveling). FTLs are an OK solution for storage applications but are suboptimal (to say the least) for main memory applications, and even more so for processing in memory, both because of the latency of normal logical-to-physical mapping and, especially, because of the extra-long delays needed to move data when a logical block must be re-allocated to a new physical block. Because of this, methods that intrinsically compensate for the limited endurance would be especially preferable for PIM.
One such method is to take advantage of the recovery mechanisms associated with the stress that leads to the limited endurance in the first place. Since stress and wearout are mechanisms that take a device out of physical equilibrium, it turns out that simple thermodynamics tends to partially reverse the effect of stress when the stress is removed. This is a general physics argument that has been experimentally demonstrated for several wearout mechanisms, including Flash wearout, but also other more general ones, such as NBTI/PBTI, hot electrons, electromigration, etc. In this talk I will go over several of these mechanisms and explain the source of stress and ways to reverse it. The main idea is to go beyond simple passive recovery (just remove stress and wait) by reversing the direction of stress (active recovery) and accelerate the process (e.g. by increasing temperature). Such accelerated active recovery can lead to many orders of magnitude improvement in endurance, thus making processing in emerging technology memories practical.
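For readers less familiar with the FTL-style wear leveling discussed in the first paragraph, the following deliberately simplified Python sketch shows dynamic logical-to-physical remapping with a per-block write counter; it illustrates the general idea only and does not correspond to any particular product's FTL.

# Deliberately simplified wear-leveling sketch (illustrative only, not a real
# FTL): every write remaps the logical block to the least-worn healthy free
# physical block, spreading write/erase cycles across the device.
class TinyFTL:
    def __init__(self, physical_blocks, endurance_limit):
        self.wear = [0] * physical_blocks           # writes seen by each physical block
        self.limit = endurance_limit
        self.free = set(range(physical_blocks))     # currently unmapped physical blocks
        self.l2p = {}                               # logical -> physical map
        self.data = {}                              # physical -> payload

    def write(self, logical, payload):
        if logical in self.l2p:                     # retire the old mapping
            self.free.add(self.l2p[logical])
        healthy = [p for p in self.free if self.wear[p] < self.limit]
        if not healthy:
            raise RuntimeError("device worn out: no healthy free blocks left")
        target = min(healthy, key=lambda p: self.wear[p])
        self.free.remove(target)
        self.wear[target] += 1
        self.l2p[logical] = target
        self.data[target] = payload

    def read(self, logical):
        return self.data[self.l2p[logical]]

# Overprovisioned device: 4 hot logical blocks spread across 8 physical blocks.
ftl = TinyFTL(physical_blocks=8, endurance_limit=1000)
for i in range(4000):
    ftl.write(logical=i % 4, payload=f"rev{i}")
print("max/min wear:", max(ftl.wear), min(ftl.wear))   # wear stays roughly balanced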
Friday, January 18, 2019
Deep Learning for Pancreatic Cancer Histopathology Image Analysis
(Adib Keikhosravi / Kevin Eliceiri Presenting Fri. 1/18/19)
Whole slide imaging (WSI), or virtual microscopy, is an imaging modality used to convert animal or human pathology tissue slides to digital images for teaching, research, or clinical applications. The method is popular due to educational and clinical demands. Although modern whole slide scanners can now scan tissue slides at high resolution in a relatively short period of time, significant challenges, including the high cost of equipment and data storage, remain unsolved. Machine learning and deep learning techniques in Computer Aided Diagnosis (CAD) platforms have begun to be widely used for biomedical image analysis by physicians and researchers. We are building a platform for histopathological image super-resolution and cancer grading and staging, with a main focus on pancreatic cancer. We present a computational approach for improving the resolution of images acquired from commonly available low-magnification commercial slide scanners. Images from such scanners can be acquired cheaply and are efficient in terms of storage and data transfer. However, they are generally of poorer quality than images from high-resolution scanners and microscopes, lack the resolution needed in diagnostic or clinical environments, and hence are not used in such settings. First, we developed a deep learning framework that implements regularized sparse coding to smoothly reconstruct high-resolution images given their low-resolution counterparts. Results show that our method produces images that are similar to images from high-resolution scanners, both in quality and in quantitative measures, and compares favorably to several state-of-the-art methods across a number of test images. To further improve the results, we used a convolutional neural network (CNN) based approach, trained specifically to take low-resolution slide scanner images of cancer tissue and convert them into high-resolution images. We validate these resolution improvements with computational analysis to show that the enhanced images offer the same quantitative results. This project is ongoing, and we are now exploring the use of intermediate resolutions to improve image quality using recurrent neural networks.
Separately, current approaches for pathological grading/staging of many cancer types, such as breast and pancreatic cancer, lack accuracy and interobserver agreement. Google Research recently used Inception for high-accuracy tumor cell localization. However, as our group has been discovering the prognostic role of stromal reorganization in different cancer types, including pancreatic cancer, which is projected to become the second leading cause of cancer-related death by 2030, we use a holistic approach that includes both stroma and cells from small TMA punches of different grades of cancer, accompanied by normal samples. For this study we used transfer learning from four award-winning networks, VGG16, VGG19, GoogLeNet, and ResNet101, for the task of pancreatic cancer grading. Although all of these networks have shown great performance on natural image classification, ResNet showed the highest performance, with 88% accuracy in four-tier grading, and higher accuracy in all one-by-one comparisons among normal tissue and the different grades. We fine-tuned this network again for different TNM classification and staging tasks, and although all the images were selected from small regions of the pancreas, the results show the promising capability of CNNs to help pathologists with diagnosis. To achieve higher accuracies we have almost doubled the size of the dataset; training is still running, and we will update the audience in future talks.
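As a rough sketch of the transfer-learning setup described above, and not the authors' actual training code, the following Python example fine-tunes an ImageNet-pretrained ResNet with a new four-class head for four-tier grading; the dataset path, transforms, and hyperparameters are placeholders.

import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Hedged sketch of transfer learning for four-tier grading (illustrative only;
# the dataset path, transforms, and hyperparameters are placeholders, not the
# authors' configuration).
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_ds = datasets.ImageFolder("tma_patches/train", transform=tfm)   # hypothetical path
loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

model = models.resnet101(pretrained=True)        # ImageNet-pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 4)    # replace the head: 4 grading classes

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):                            # placeholder epoch count
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()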
Monday, January 14, 2019
Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks
(Presenting 1/16)
Emerging fast, non-volatile memories will enable systems with large amounts of non-volatile main memory (NVMM) attached to the CPU memory bus, bringing the possibility of dramatic performance gains for IO-intensive applications. This paper analyzes the impact of state-of-the-art NVMM file systems on some of these applications and explores how those applications best leverage the performance that NVMMs offer.
Our analysis leads to several conclusions about how systems and applications should adapt to NVMMs. We propose FiLe Emulation with DAX (FLEX), a technique for moving file operations into user space, and show that it and other simple changes can dramatically improve application performance. We examine the scalability of NVMM file systems in light of the rising core counts and pronounced NUMA effects in modern systems, and propose changes to Linux’s virtual file system (VFS) to improve scalability. We also show that adding NUMA-aware interfaces to an NVMM file system can significantly improve performance.
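The mechanism FLEX builds on, accessing a DAX-mapped persistent-memory file with ordinary loads and stores instead of read/write system calls, can be sketched as follows; this Python example illustrates the general DAX idea, not FLEX itself, and the file path is a placeholder.

import mmap
import os

# Illustrative sketch of DAX-style user-space file access (the general idea
# FLEX builds on, not FLEX itself): map the file once, then perform "file I/O"
# as plain memory loads and stores with no per-access system call. The path is
# a placeholder for a file on a DAX-mounted NVMM file system.
path = "/mnt/pmem/example.dat"                     # hypothetical DAX-mounted file
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 4096)                             # size the file before mapping

buf = mmap.mmap(fd, 4096)                          # on a DAX file system this maps NVMM directly
buf[0:5] = b"hello"                                # a store, not a write() syscall
print(bytes(buf[0:5]))                             # a load, not a read() syscall

buf.flush()                                        # msync; real NVMM code would also flush cache lines
buf.close()
os.close(fd)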
String Figure: A Scalable and Elastic Memory Network Architecture
(Presenting Mon 1/14)
The demand for server memory capacity and performance has been rapidly increasing due to the expanding working set size of modern applications, such as big data analytics, in-memory computing, deep learning, and server virtualization. One promising technique for meeting these requirements is the memory network, where the server memory system consists of multiple 3D die-stacked memory nodes interconnected by a high-speed network. However, current memory network designs face substantial scalability challenges, including (1) maintaining high throughput and low latency in large-scale memory networks at low hardware cost, (2) efficiently interconnecting an arbitrary number of memory nodes, and (3) supporting flexible memory network scale expansion and reduction without major modification of the memory network design and physical implementation.
To address these challenges, we propose String Figure, a high-throughput, elastic, and scalable memory network architecture. String Figure consists of three design components. First, we propose an algorithm to generate random interconnect topologies that achieve high network throughput and near-optimal path lengths in large-scale memory networks with over one thousand nodes; our topology also ensures that the number of required router ports does not increase as the network scale grows. Second, we design a compute+table hybrid routing protocol that reduces both the computation and storage overhead of routing. Third, we propose a set of network reconfiguration mechanisms that allow both static and dynamic network scale expansion and reduction. Our experiments based on RTL simulation demonstrate that String Figure can interconnect over one thousand memory nodes with a shortest path length within five hops across various synthetic and real workloads. Our design also achieves a 1.3× throughput improvement and a 36% reduction in system energy consumption compared with traditional memory network designs.
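As a hedged illustration of the topology property claimed above, and not String Figure's actual construction or routing algorithm, the Python sketch below wires each of roughly one thousand nodes to a fixed number of random peers (so the router port count stays constant) and samples shortest-path hop counts with BFS; the node count and degree are arbitrary choices.

import random
from collections import deque

# Illustrative sketch (not String Figure's actual algorithm): build a random
# fixed-degree topology over many memory nodes and sample shortest-path hop
# counts with BFS to check that path lengths stay small at large scale.
def random_fixed_degree(n, degree, seed=0):
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        while len(adj[v]) < degree:                # give every node at least `degree` peers
            u = rng.randrange(n)
            if u != v:
                adj[v].add(u)
                adj[u].add(v)
    return adj

def farthest_hops(adj, src):
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return max(dist.values())

adj = random_fixed_degree(n=1024, degree=8)                     # ~1K nodes, fixed port count
worst = max(farthest_hops(adj, s) for s in range(0, 1024, 64))  # sample 16 source nodes
print("sampled worst-case hop count:", worst)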