A 3D In-memory Architecture for Exploiting Internal Memory Bandwidth in an Area-constrained Logic Layer
Bulk primitives such as element-wise operations on large vectors, reduction, and scan appear in many applications. The computation required per element is low for these operations. As a result, the cost of moving data from DRAM to a separate processor exceeds the cost of the computation itself, and data movement accounts for a significant portion of execution time and energy consumption. To alleviate this cost, prior work has proposed two main approaches: (i) adding logic to the row buffers of memory arrays, and (ii) adding processing elements to the logic layer of 3D-stacked memories. The first approach imposes significant hardware overhead and is not flexible enough to support all the required operations, so the second approach seems more practical. Due to the limited area of the logic layer, however, processing elements based on a traditional architecture cannot provide enough parallelism to consume all of the internal memory bandwidth available there. The goal of this project is to propose a new architecture that is efficient enough to consume all of the available bandwidth yet small enough to fit in the logic layer.
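To illustrate why such primitives are memory-bound, the following minimal C sketch (illustrative only; the function names and data types are not from the project) shows the three bulk primitives mentioned above. Each loop performs roughly one arithmetic operation per element loaded from memory, so on a conventional processor the time to stream the data from DRAM dominates the time spent computing.

```c
#include <stddef.h>

/* Element-wise addition: one add per pair of elements streamed from memory. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Reduction: one add per element, producing a single scalar. */
float vec_sum(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Inclusive scan (prefix sum): one add per element. */
void vec_scan(const float *a, float *out, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        s += a[i];
        out[i] = s;
    }
}
```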
(Presented on Monday, 11/26/18)