CRISP Blog: November 2022

Monday, November 14, 2022

Variational Auto-encoder for synthetic training data generation

Bin Li of U. Wisconsin Madison presenting on Wednesday, November 16, 2022 at both 1:00 & 7:00 PM EST

The scale and variety of training data are crucial to the generalizability of deep neural networks. However, obtaining labeled training data can be time-consuming and difficult in specialized domains such as bioimaging and medical imaging. We propose a synthetic training data generation framework for data enrichment and augmentation based on deep generative models. Our framework consists of a variational autoencoder (VAE) and a conditional generative adversarial network (cGAN). We demonstrate the framework on collagen fiber tracking, an important bioimage analysis task. The VAE is trained on a limited number of manually labeled collagen fiber images and is then used to generate synthetic collagen fiber centerlines with greater variety. The cGAN is trained to map the synthetic centerlines to realistic-looking collagen fiber images, yielding a synthetic training dataset of image-centerline pairs. Finally, we train a U-Net on the enriched image-centerline pairs for collagen fiber centerline tracking. Evaluations on collagen images collected from pancreas, liver, and breast cancer samples show that our pipeline achieves better centerline tracking than several popular fiber centerline tracking tools. The generalizability of the network increases further when synthetic data is incorporated into training.
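
The data flow described above (sample centerlines, translate them to images, then train on the enriched pairs) can be sketched in a few lines of Python. This is only an illustration of how the three stages connect, not the presenter's implementation: `sample_centerline` is a hypothetical stand-in for decoding a latent sample with the trained VAE (here a random walk), `render_image` is a hypothetical stand-in for the cGAN generator (here scaling plus noise), and the U-Net training step is omitted.

```python
import random


def sample_centerline(rng, size=16):
    """Hypothetical stand-in for decoding a latent sample with the trained VAE.

    Draws one random-walk "fiber" as a binary size x size mask.
    """
    mask = [[0] * size for _ in range(size)]
    row = rng.randrange(size)
    for col in range(size):
        mask[row][col] = 1
        # Drift up or down by at most one pixel, staying inside the mask.
        row = min(size - 1, max(0, row + rng.choice((-1, 0, 1))))
    return mask


def render_image(mask, rng, noise=0.1):
    """Hypothetical stand-in for the cGAN generator: mask -> grayscale image."""
    return [[min(1.0, max(0.0, 0.8 * px + rng.gauss(0.0, noise))) for px in row]
            for row in mask]


def build_training_set(real_pairs, num_synthetic, seed=0):
    """Enrich labeled (image, centerline) pairs with synthetic ones."""
    rng = random.Random(seed)
    synthetic_pairs = []
    for _ in range(num_synthetic):
        centerline = sample_centerline(rng)    # stage 1: sample a centerline
        image = render_image(centerline, rng)  # stage 2: centerline -> image
        synthetic_pairs.append((image, centerline))
    # Stage 3 (not shown) would train the tracking network on this list.
    return list(real_pairs) + synthetic_pairs


rng = random.Random(42)
real_mask = sample_centerline(rng)
# Pretend this single pair was labeled by hand.
real_pairs = [(render_image(real_mask, rng), real_mask)]
train_set = build_training_set(real_pairs, num_synthetic=8)
print(len(train_set), "image-centerline pairs available for training")
```

Because each synthetic image is produced from its centerline, every synthetic pair comes with a label for free, which is what makes this kind of enrichment attractive when manual annotation is scarce.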

Monday, November 7, 2022

Accelerating SQLite with Lookahead Information Passing (LIP)

Kevin Gaffney of U. Wisconsin Madison presenting on Wednesday, November 9, 2022 at both 1:00 & 7:00 PM EST

In the two decades following its initial release, SQLite has become the most widely deployed database engine in existence. Today, SQLite is found in nearly every smartphone, computer, web browser, television, and automobile. While it supports complex analytical queries, SQLite is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution and a B-tree storage format. However, fueled by the rise of edge computing and data science, there is a growing need for efficient in-process online analytical processing (OLAP). DuckDB, a database engine nicknamed "the SQLite for analytics", has recently emerged to meet this demand. While DuckDB has shown strong performance on OLAP benchmarks, it is unclear how SQLite compares holistically. In this talk, I will discuss SQLite in the context of this changing workload landscape. I will present results from our evaluation of SQLite on three benchmarks, each representing a different flavor of in-process data management: transactional, analytical, and blob processing. I will delve into analytical data processing on SQLite, identifying key bottlenecks and weighing potential solutions. As a result of our optimizations, SQLite is now up to 4.2x faster on the Star Schema Benchmark (SSB). Finally, I will discuss the future of SQLite, envisioning how it will evolve to meet new demands and challenges.
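
The abstract does not spell out how Lookahead Information Passing works, so the following is a minimal sketch of the general idea as it is usually described: build a Bloom filter over the dimension-table keys that pass a predicate, then probe that filter while scanning the fact table so that most non-matching rows are dropped before the join. It is not SQLite's implementation. The table names, data, and `BloomFilter` class are hypothetical, and the filtering is done by hand in Python on top of the standard `sqlite3` module purely for illustration.

```python
import hashlib
import sqlite3


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Toy star schema (hypothetical names and data, loosely in the spirit of SSB).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (s_key INTEGER PRIMARY KEY, s_region TEXT);
    CREATE TABLE lineorder (lo_id INTEGER PRIMARY KEY,
                            lo_suppkey INTEGER, lo_revenue INTEGER);
""")
regions = ["ASIA", "EUROPE", "AMERICA", "AFRICA"]
conn.executemany("INSERT INTO supplier VALUES (?, ?)",
                 [(k, regions[k % 4]) for k in range(1, 41)])
conn.executemany("INSERT INTO lineorder VALUES (?, ?, ?)",
                 [(i, 1 + (i * 7) % 40, 100 + i) for i in range(1, 1001)])

# Reference answer: let SQLite run the join and the aggregate.
(expected,) = conn.execute("""
    SELECT SUM(lo_revenue) FROM lineorder
    JOIN supplier ON lo_suppkey = s_key
    WHERE s_region = 'ASIA'
""").fetchone()

# 1) Build a Bloom filter over the dimension keys that pass the predicate.
asia_keys = {k for (k,) in conn.execute(
    "SELECT s_key FROM supplier WHERE s_region = 'ASIA'")}
lip_filter = BloomFilter()
for key in asia_keys:
    lip_filter.add(key)

# 2) Probe the filter during the fact-table scan, before the join.
fact_rows = conn.execute(
    "SELECT lo_suppkey, lo_revenue FROM lineorder").fetchall()
survivors = [(k, rev) for k, rev in fact_rows if lip_filter.might_contain(k)]

# 3) The exact join still runs, but only on the surviving rows.
lip_total = sum(rev for k, rev in survivors if k in asia_keys)

print(f"fact rows: {len(fact_rows)}, rows reaching the join: {len(survivors)}")
print(f"SQL total: {expected}, filtered total: {lip_total}")
```

Since a Bloom filter never produces false negatives, the filtered plan returns the same total as the plain join. The benefit is that the comparatively expensive join step sees only a fraction of the fact table, which matters most for star-schema queries with selective dimension predicates.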