CRISP Blog

Monday, November 14, 2022

Variational Auto-encoder for synthetic training data generation

Bin Li of U. Wisconsin Madison presenting on Wednesday, November 16, 2022 at both 1:00 & 7:00 PM EST

The scale and variety of training data are crucial to the generalizability of deep neural networks. However, obtaining labeled training data can be time-consuming and difficult in specialized domains, such as bioimaging and medical imaging. We proposed a synthetic training data generation framework for data enrichment and augmentation based on deep generative models. Our framework consists of a variational autoencoder (VAE) and a conditional generative adversarial network (cGAN). We demonstrated the use of this framework on an important bioimage analysis task named collagen fiber tracking. The VAE was trained using a limited amount of manually labeled collagen fiber images and was used to generate synthetic collagen fiber centerlines with increased varieties. The cGAN was trained to map the synthetic collagen fiber centerlines into realistic-looking collagen fiber images, resulting in a synthetic training dataset with image-centerline pairs. At last, we trained a U-Net using enriched image-centerline pairs for collagen fiber centerline tracking. Evaluations based on collagen images collected from pancreas, liver, and breast cancer samples show that our pipeline achieves better centerline tracking than several popular fiber centerline tracking tools. The generalizability of the network is further increased when synthetic data is incorporated for training.

Monday, November 7, 2022

Accelerating SQLite with Lookahead Information Passing (LIP)

Kevin Gaffney, U. Wisconsin Madison, presenting on 11/9/22 at 1PM & 7PM EST

In the two decades following its initial release, SQLite has become the most widely deployed database engine in existence. Today, SQLite is found in nearly every smartphone, computer, web browser, television, and automobile. While it supports complex analytical queries, SQLite is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution and a B-tree storage format. However, fueled by the rise of edge computing and data science, there is a growing need for efficient in-process online analytical processing (OLAP). DuckDB, a database engine nicknamed "the SQLite for analytics", has recently emerged to meet this demand. While DuckDB has shown strong performance on OLAP benchmarks, it is unclear how SQLite compares holistically. In this talk, I will discuss SQLite in the context of this changing workload landscape. I will present results from our evaluation of SQLite on three benchmarks, each representing a different flavor of in-process data management, including transactional, analytical, and blob processing. I will delve into analytical data processing on SQLite, identifying key bottlenecks and weighing potential solutions. As a result of our optimizations, SQLite is now up to 4.2X faster on SSB. Finally, I will discuss the future of SQLite, envisioning how it will evolve to meet new demands and challenges.

Monday, October 24, 2022

Improving Memory Security and Reliability by Overcoming the Threat of Rowhammer

(Moin Qureshi, Professor of Computer Science at GA Tech, presenting on Wed. 10/26 at 1:00 & 7:00 PM ET.)

Rowhammer allows an attacker to induce bit flips in a row by rapidly accessing neighboring rows. Rowhammer is not just a reliability concern but a severe security threat as it can be used to escalate privilege or break confidentiality. The problem of Rowhammer continues to become worse for two reasons: (1) The threshold of activations needed to induce Rowhammer reduces with each generation, coming down by 30x in the last 7 years (2) Attackers continue to come up with complex patterns that can break all hardware-based defenses, including the ones commercially employed in current chips. Currently, there is no guaranteed solution for Rowhammer. Hardware-based mitigation of Rowhammer typically consists of two parts: a tracker to identify aggressor rows and a mitigating action. At low thresholds tracking incurs significant SRAM overheads (several megabytes). Furthermore, the common mitigating action of refreshing neighboring victim rows is susceptible to the Half-Double attack from Google. In this talk, I will discuss our recent solutions that enable low-cost tracking of aggressor rows even at ultra-low thresholds (ISCA’22), a new mitigating action of performing dynamic row migration that is resilient to complex attacks patterns (ASPLOS’22 and MICRO’22), and a Rowhammer-aware ECC design that provides in-built memory integrity-protection while incurring virtually zero performance and storage overheads (HPCA’22).

Brief Bio: Moinuddin Qureshi is a Professor of Computer Science at the Georgia Institute of Technology. His research interests include computer architecture, hardware security, and quantum computing. Qureshi received his Ph.D. from the University of Texas at Austin in 2007. He was a research scientist at the IBM T. J. Watson Research Center (2007-2011), where he developed the caching algorithms for Power 7 Systems. Qureshi received the 2022 ACM SIGARCH Maurice Wilkes Award for contributions to high-performance memory systems. He is a member of Hall-of-Fame of the trifecta of architecture conferences: ISCA, MICRO, and HPCA. His research has been recognized with multiple best-paper awards and multiple IEEE Top-Picks awards. His papers were also awarded the 2019 NVMW Persistent Impact Prize and 2021 NVMW Persistent Impact Prize, in recognition of “exceptional impact on the fields of study related to non-volatile memories”. Qureshi received the 2020 “Outstanding Researcher Award” from Intel and an “Outstanding Technical Achievement” award from IBM Research. More information at https://www.cc.gatech.edu/~moin/

Monday, October 17, 2022

SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning

Zihao Ye presenting on Wed. 10/19/22 at 1:00 & 7:00 PM ET

Sparse tensors are rapidly becoming critical components of modern deep learning workloads. However, developing high-performance sparse operators can be difficult and tedious, and existing vendor libraries cannot satisfy the escalating demands from new operators. Sparse tensor compilers simplify the development of operators, but efficient sparse compilation for deep learning remains challenging because a single sparse format cannot maximize hardware efficiency, and single-shot compilers cannot keep up with latest hardware and system advances. We show that the key to addressing both challenges is two forms of composability. In this paper, we propose SparseTIR, a sparse tensor compilation abstraction that offers composable formats and composable transformations for deep learning workloads. SparseTIR constructs a search space over these composable components for performance tuning. With these improvements, SparseTIR obtains consistent performance speedups vs vendor libraries and frameworks on sparse workloads such as Graph Neural Networks, Sparse Convolution, and Network Pruning.

Sunday, October 9, 2022

MEMulator & PiMulator: Emerging Memory and Processing-in-Memory Architecture Emulation

(Sergiu Mosanu, UVA, presenting on October 12, 2022.)

Main memory is a crucial SoC and architecture design aspect, affecting system performance, power, and cost. The development of emerging memory technologies, increasingly specialized DRAM memory flavors, and Processing-in-Memory (PiM) architectures introduce the need for system-level modeling and evaluation. However, it is challenging to mimic both software and hardware aspects of emerging memory and PiM architectures using the currently available tools with high performance and fidelity. We develop a system emulation framework that employs a modular, parameterizable, FPGA synthesizable memory and PiM model. Implemented in System Verilog, the memory and PiM model allows users to generate any desired memory configuration on the FPGA fabric with complete control over the structure and distribution of the PiM logic units. We emulate a whole system by interfacing the memory emulation model with CPU soft cores and a soft memory controller. We demonstrate strategies to model several pioneering bitwise-PiM architectures and provide detailed benchmark performance results showing the platform's ability to facilitate design space exploration. We observe an emulation vs. simulation weighted-average speedup of 28x when running a memory benchmark workload. This comprehensive FPGA-based memory emulation enables fast, high-fidelity design-space exploration and evaluation of processing-in-memory architectures as part of a whole system stressed with heavy workloads.

Monday, September 26, 2022

An MLIR-based Intermediate Representation for Accelerator Design with Decoupled Customizations

Hongzheng Chen and Niansong Zhang presenting on Wed. 9/28/22.

The increasing specialized accelerators deployed in data centers and edge devices call for the need of generating high-performance accelerators efficiently. However, the custom processing engines, memory hierarchy, data type, and data communication, complicate the accelerator design. In this talk, we will present our MLIR-based accelerator IR HCL, which decouples the algorithm and hardware customizations at the IR level. We will provide several case studies to demonstrate how our IR can support a wide range of applications, how our IR primitives can be composed with different designs, and how we can achieve high performance and productivity at the same time. Finally, we will discuss the benefits and ongoing efforts of our work.

Thursday, September 15, 2022

Accelerating Few Shot Learning with HD Computing in ReRAM

Weihong Xu, UCSD, presenting on Wed. 9/21/22 at 1:00 pm & 7:00 pm ET.

Hyperdimensional (HD) computing is a lightweight algorithm with fast learning capability that can efficiently realize classification and few-shot learning (FSL) workloads. However, traditional von-Neumann architecture is highly inefficient for HD algorithms due to the limited memory bandwidth and capacity. Processing in-memory (PIM) is an emerging computing paradigm which tries to address these issues by using memories as computing units. Our talk introduces the efficient PIM architecture for HD as well as application of HD on few-shot classification. First, we show the PIM-based HD computing architecture on ReRAM, Tri-HD, to accelerate all phases of the general HD computing pipeline namely, encoding, training, retraining, and inference. Tri-HD is enabled by efficient in-memory logic operations and shows orders of magnitude performance improvements over CPU. However, Tri-HD is not suitable for area and power-constrainted device since it suffers from high HD encoding complexity and complex dataflow. To this end, we present an algorithm and PIM co-design, FSL-HD, to realize energy-efficient FSL. FSL-HD significantly reduces the encoding complexity by >10x and is equipped with an optimized dataflow for practical ReRAM. As a result, FSL-HD shows superior FSL accuracy, flexibility, and hardware efficiency compared to state-of-the-art ReRAM-based FSL accelerators.

Monday, April 18, 2022

How Good is my HTAP System?

(Elena Milkai, U Wisc-Madison, presenting on April 20th at 1PM & 7PM ET)

Hybrid Transactional and Analytical Processing (HTAP) systems have recently gained popularity as they combine OLAP and OLTP processing to reduce administrative and synchronization costs between dedicated systems. However, there is no precise characterization of the features that distinguish a good HTAP system from a poor one. In this paper, we seek to solve this problem from the perspectives of both performance and freshness. To simultaneously capture the performance of both transactional and analytical processing, we introduce a new concept called throughput frontier, which visualizes both transactional and analytical throughput in a single 2D graph. The throughput frontier can capture information regarding the performance of each engine, the interference between the two engines, and various system design decisions. To capture how well an HTAP system supports real-time analytics, we define a freshness metric which quantifies how recent is the snapshot of the data seen by each analytical query. We also develop a practical way to measure freshness in a real system. We design a new hybrid benchmark called HATtrick which incorporates both throughput frontier and freshness as metrics. Using the benchmark, we evaluate three representative HTAP systems under various data size and system configurations and demonstrate how the metrics reveal important system characteristics and performance information.

Thursday, March 10, 2022

Enabling Efficient Large-Scale Deep Learning Training with Cache Coherent Disaggregated Memory Systems

(Zixuan Wang, UCSD, presenting on Wed. 3/16/22 at 1:00 & 7:00 PM ET at our CRISP task-level meeting.)

Modern deep learning (DL) training is memory consuming, constrained by the memory capacity of each computation component and cross-device communication bandwidth. In response to such constraints, current approaches include increasing parallelism in distributed training and optimizing inter-device communication. However, model parameter
communication is becoming a key performance bottleneck in distributed DL training. To improve parameter communication performance, we propose COARSE, a disaggregated memory extension for distributed DL training. COARSE is built on modern cache-coherent interconnect (CCI) protocols and MPI-like collective communication for synchronization, to allow
low-latency and parallel access to training data and model parameters shared among worker GPUs. To enable high bandwidth transfers between GPUs and the disaggregated memory system, we propose a decentralized parameter communication scheme to decouple and localize parameter synchronization traffic. Furthermore, we propose dynamic tensor routing and
partitioning to fully utilize the non-uniform serial bus bandwidth varied across different cloud computing systems. Finally, we design a deadlock avoidance and dual synchronization to ensure high-performance parameter synchronization. Our evaluation shows that COARSE achieves up to 48.3% faster DL training compared to the state-of-the-art MPI AllReduce
communication.

Contact Zixuan for questions, zxwang@ucsd.edu.

Monday, November 8, 2021

GraphLily: Near-Memory Acceleration for Graph Linear Algebra

(Yixiao Du, Cornell, presenting on Wednesday, November 10, 2021 at 1:00 p.m. & 7:00 p.m. ET. )

Graph processing is typically memory-bound due to low compute to memory access ratio and irregular data access patterns. The emerging high-bandwidth memory (HBM) delivers exceptional bandwidth and is adopted on FPGAs with near-memory integration and customizable datapaths, which brings the potential to significantly boost the performance of graph processing and relieve the programming burden.
We present GraphLily, a graph linear algebra overlay, to accelerate graph processing on HBM-equipped FPGAs. GraphLily supports a rich set of graph algorithms by adopting the GraphBLAS programming interface, which formulates graph algorithms as sparse linear algebra operations. Operators of different GraphBLAS semirings share the same datapath, which is configurable at run time. GraphLily further builds a middleware to provide runtime support. With this middleware, we can port existing GraphBLAS programs to FPGAs with slight modifications to the original code intended for CPU/GPU execution. GraphLily provides efficient, memory-optimized accelerators for the two widely-used kernels in graph processing, namely, sparse-matrix dense-vector multiplication (SpMV) and sparse-matrix sparse-vector multiplication (SpMSpV). Leveraging the near-memory connectivity, the SpMV accelerator uses a sparse matrix storage format tailored to HBM that enables streaming, vectorized accesses to each channel and concurrent accesses to multiple channels. Besides, the SpMV accelerator exploits data reuse in accesses of the dense vector by introducing a scalable on-chip buffer design. The SpMSpV accelerator complements the SpMV accelerator to handle cases where the input vector has a high sparsity. The evaluation shows that compared with state-of-the-art graph processing frameworks on CPUs and GPUs, GraphLily achieves up to 2.5x and 1.1x higher throughput, while reducing the energy consumption by 8.1x and 2.4x; compared with prior single-purpose graph accelerators on FPGAs, GraphLily achieves 1.2x–1.9x higher throughput. GraphLily is a prototype of near-memory graph processing. We are working to further improve the performance.

Monday, October 25, 2021

DRAM-CAM: Extending Sieve for General-Purpose Exact Pattern Matching

(Lingxi Wu, UVA, presenting on Wednesday, October 27, 2021 at 1:00 PM & 7:00 PM ET.)

Exact pattern matching is a widely used kernel in many application domains. A prominent example is k-mer matching in bioinformatics where input DNA sequences of size K are exactly matched against a library of reference DNA patterns for genome classification. Previously we propose three DRAM-based in-situ accelerator designs, dubbed Sieve, to alleviate the bottleneck stage of K-mer matching in genomics.

We notice k-mer matching shares many similarities with other exact pattern matching intensive workloads (e.g., text processing and data mining), thus we extend Sieve with several cost-effective modifications to make it capable of accommodating applications beyond bioinformatics. This enhanced architecture is named Sieve ++.

We show that Sieve ++ as an exact pattern matching accelerator offers significant advantages over traditional architectures. Sieve ++ outperforms CPU by two orders of magnitude for five out of six selected benchmarks that represent a broad set of real-world exact pattern matching use cases.

Monday, October 18, 2021

Ayudante: A Deep Reinforcement Learning Approach to Assist Persistent Memory Programming

(Hanxian Huang, UCSD, presenting on Wednesday, October 20, 2021 at 1:00pm and 7:00pm ET)

Programming PM imposes non-trivial labor effort on writing code to adopt new PM-aware libraries and APIs. In addition, non-expert PM code can be error-prone. In order to ease the burden of PM programmers, we propose Ayudante, a deep reinforcement learning (RL)- based PM programming assistant framework consisting of two key components: a deep RL-based PM code generator and a code refining pipeline. Given a piece of C, C++, or Java source code developed for conventional volatile memory systems, our code generator automatically generates the corresponding PM code and checks its data persistence. The code refining pipeline parses the generated code to provide a report for further program testing and performance optimization. Our evaluation on an Intel server equipped with Optane DC PM demonstrates that both microbenchmark programs and a key-value store application generated by Ayudante pass PMDK checkers. Performance evaluation on the microbenchmarks shows that the generated code achieves comparable speedup and memory access performance as PMDK code examples.

Friday, October 8, 2021

Reticle: A Virtual Machine for Programming Modern FPGAs

(Luis Vega presenting on Wednesday, October 13, 2021 at 1:00 p.m. and 7:00 p.m. ET)

Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past—they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable primitives. We describe Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. We show how to use a standard instruction selection approach to lower intermediate programs to assembly programs, which can be both faster and more effective than the complex metaheuristics that existing FPGA toolchains use. We use Reticle to implement linear algebra operators and coroutines and find that Reticle compilation runs up to 100times faster than current approaches while producing comparable or better run-time and utilization.

Monday, June 28, 2021

Dual-stream Multiple Instance Learning Network for Whole Slide Image Classification with Self-supervised Contrastive Learning

(Bin Li presenting on June 30 at 1:00 PM and 7:00 PM Eastern Time)

Whole slide imaging is one of the essential tools in modern histopathology and presents a great opportunity for machine learning applications. Whole slide images (WSIs) have very high resolutions and usually lack localized patch-level annotations. Consequently, the classification of WSIs is often cast as a multiple instance learning (MIL) problem, where a patch-based classifier is trained using slide-level labels. We propose a MIL-based method for WSI classification and tumor detection that does not require localized annotations. First, we introduce a novel MIL aggregator that models the relations of the instances in a dual-stream architecture with trainable distance measurement. Second, since WSIs can produce large or unbalanced bags that hinder the training of MIL models, we propose to use self-supervised contrastive learning to extract good representations for MIL and alleviate the issue of prohibitive memory cost for large bags. We demonstrate that self-supervised contrastive learning can produce robust representations for MIL without the need for patch-level labels, alleviating the issue of prohibitive memory cost for training deep models. Our model offers significant improvement in classification accuracy for both unbalanced and balanced WSI datasets and under both binary and multiple class MIL settings. Additional benchmark results on standard MIL datasets further demonstrate the superior performance of our MIL aggregator on general MIL problems. Collaborating with hardware and computer vision groups, we demonstrate the use of WSI classification as a driving use case for developing deep learning accelerators and efficient training paradigms for related problems such as video and language analysis.

Tuesday, June 1, 2021

Efficient and Secure Learning across Memory Hierarchy

(Saransh Gupta, UC San Diego, presenting on Wednesday, June 2, 2021 at 1:00 & 7:00 PM ET)

Recent years have witnessed a rapid growth in the amount of generated data. Learning algorithms, like hyperdimensional (HD) computing, promise to reduce the computation complexity of processing such a huge amount of data. However, traditional computing systems are highly inefficient for such algorithms, mainly due to the limited cache capacity and memory bandwidth. In this talk, we propose a processing in-memory (PIM)-based HD computing architecture that accelerates all phases of the HD computing pipeline namely, encoding, training, retraining, and inference. Our architecture is enabled by fast and energy-efficient in-memory logic operations, combined with a hardware-friendly distance metric. Our proposed PIM solution provides 434x speedup as compared to the state-of-the-art HD computing implementations.

While this makes learning less reliant on the cloud, many applications, most notably in healthcare, finance, and defense, still need cloud computing and demand privacy which today’s solutions cannot fully provide. Fully homomorphic encryption (FHE) elevates the bar of today’s solutions by adding confidentiality of data during processing, while introducing significant data size expansion. In this talk, we also present a design of the first PIM-based accelerators of both client and server using the latest Ring-GSW based fully homomorphic encryption schemes. Our design supports various security levels and provides on average 2007x higher throughput than the best implementation while running FHE-enabled neural networks.

Tuesday, May 25, 2021

Corundum: statically-enforced persistent memory safety

(Morteza Hoseinzadeh of UC San Diego Presenting on 5/26 at 1:00 p.m. & 7:00 p.m. ET)

Fast, byte-addressable, persistent main memories (PM) make it possible to build complex data structures that can survive system failures. Programming for PM is challenging, not least because it combines well-known programming challenges like locking, memory management, and pointer safety with novel PM-specific bug types. It also requires logging updates to PM to facilitate recovery after a crash. A misstep in any of these areas can corrupt data, leak resources, or prevent successful recovery after a crash. Existing PM libraries in a variety of languages -- C, C++, Java, Go -- simplify some of these problems, but they still require the programmer to learn (and flawlessly apply) complex rules to ensure correctness. Opportunities for data-destroying bugs abound.

This paper presents Corundum, a Rust-based library with an idiomatic PM programming interface and leverages Rust’s type system to statically avoid most common PM programming bugs. Corundum lets programmers develop persistent data structures using familiar Rust constructs and have confidence that they will be free of those bugs. We have implemented Corundum and found its performance to be as good as or better than Intel's widely-used PMDK library, HP's Atlas, Mnemosyne, and go-pmem.

Monday, May 17, 2021

FANS: FPGA-Accelerated Near-Storage Sorting

(Weikang Qiao presenting on Wed. May 19th at 1:00 & 7:00 PM ET)

Large-scale sorting is always an important yet demanding task for data center applications. In addition to powerful processing capability, high-performance sorting system requires efficient utilization of the available bandwidth of various levels in the memory hierarchy. Nowadays, with the explosive data size, the frequent data transfers between the host and the storage device are becoming increasingly a performance bottleneck. Fortunately, the emergence of near-storage computing devices gives us the opportunity to accelerate large-scale sorting by avoiding the back and forth data transfer. Near-storage sorting is promising for extra performance improvement and power reduction. However, it is still an open question of how to achieve the optimal sorting performance on the existing near-storage computing device.

In this work, we first perform an in-depth analysis of the sorting performance on the newly released Samsung SmartSSD platform. Contrary to the previous belief, our analysis shows that the end-to-end sorting performance is bound by not only the bandwidth of the flash, but also the main memory bandwidth, the configuration of the sorting kernel and the intermediate sorting status. Based on our modeling, we propose FANS, an FPGA accelerated near-storage sorting system which selects the optimized design configuration and achieves the theoretically maximum end-to-end performance when using a single Samsung SmartSSD device. The experiments demonstrate more than 3× performance speedup over the state-of-art FPGA-accelerated flash storage.

Monday, May 10, 2021

Compiler-Driven Simulation of Reconfigurable Hardware Accelerators

(Zhijing Li of Cornell presenting on Wed. 5/12/21 at 1PM & 7PM ET)

As customized accelerator design has become increasingly popular to keep up with the demand for high performance computing, it poses challenges for modern simulator design to adapt to such a large variety of accelerators. Existing simulators tend to two extremes: low-level and general approaches, such as RTL simulation, that can model any hardware but require substantial effort and long execution times; and higher-level application-specific models that can be much faster and easier to use but require one-off engineering effort.

This work proposes a compiler-driven simulation workflow that can model the performance of arbitrary hardware accelerators, while introducing little programming overhead. The key idea is to separate structure representation from simulation by developing an intermediate language that can flexibly represent a wide variety of hardware constructs. We design the Event Queue dialect of MLIR, a dialect that can model arbitrary hardware accelerators with explicit data movement and distributed event-based control. We also implement a simulator to model EQueue-structured programs. We show the effectiveness of our simulator by comparing it with a state-of-the-art application-specialized simulator, while offering higher extensibility. We demonstrate our simulator can save designers' time with reusable lowering pipeline and visualizable outputs.

Monday, March 22, 2021

Defense against the Dark Arts: New Attacks and Defenses for Shared Last-Level Caches

(Gururaj Saileshwar presenting on Wednesday, March 24 at 1:00 p.m. and 7 p.m. ET)

Caches in modern processors improve performance by enabling fast access to data. However, the sharing of the processor caches (particularly the last-level cache) between different processor-cores, VMs and applications, results in vulnerability to side-channel and covert-channel attacks via caches. Cache side-channel attacks allow malware to infer data-access patterns of sensitive applications based on their cache-accesses, to leak encryption keys, confidential IP like DNN models, etc. Similarly, cache covert-channels allow malicious processes to use timing variation due to caches as a means of covert communication (e.g., to exfiltrate sensitive information from a sandbox).

In this talk, I will first present our Streamline Attack (to appear in ASPLOS’21), the current fastest cache covert-channel attack and the first to achieve >1MB/s bandwidth (3-6x higher than prior attacks) by leveraging asynchronous communication protocols. With this attack, we demonstrate that cache covert-channels can have higher information capacity than previously understood, while having fewer ISA or micro-architecture specific limitations. Next, I will present MIRAGE, a Defense Against Cache Side-Channel Attacks (to appear in USENIX Security’21) that provides a practical way to enable a fully-associative last-level cache, which is immune to set-conflict based cache attacks. This defense promises to meaningfully end the arms-race between attackers and defenders in the area of conflict-based cache attacks, where successive defenses have been broken by newer attacks in the past few years.

Monday, December 7, 2020

Multiple Instance Learning Network for Whole Slide Image Classification with Hardware Acceleration

(Bin Li, U. Wisconsin-Madison, presenting on 12/9/20 at 11:00 a.m. and 7:00 p.m. ET)

We propose a novel multiple instance learning (MIL) model for disease detection in whole slide images (WSI) as a driving application for developing hardware acceleration architectures. Our model applies self-attention mechanism on latent representations to model the dependencies between the instances. Our model demonstrates the state-of-the-art performance on both classic MIL benchmark datasets and real-word clinical WSI datasets. Collaborating with hardware research groups, we propose a processing-in-memory (PIM) design for our application as well as general long-sequence attention-based models, which outperforms other PIM-based and GPU-based baselines by a large margin. Meanwhile, we are continuously working with PIM, GPU, and FPGA groups on software-hardware co-design and acceleration performance benchmarking.