New techniques efficiently accelerate sparse tensors for massive AI models
Researchers from MIT and NVIDIA have developed two techniques that accelerate the processing of sparse tensors, a type of data structure that’s used for high-performance computing tasks. The complementary techniques could result in significant improvements to the performance and energy-efficiency of systems like the massive machine-learning models that drive generative artificial intelligence.
 Tensors are data structures used by machine-learning models. Both of the new methods seek to efficiently exploit what’s known as sparsity, meaning zero values, in the tensors. When processing these tensors, one can skip over the zeros and save on both computation and memory. For instance, anything multiplied by zero is zero, so the hardware can skip that operation. And because zeros don’t need to be stored, the tensor can be compressed so that a larger portion fits in on-chip memory.
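To make the two savings concrete, here is a minimal sketch in Python. It is an illustration only, not the researchers' hardware design: the function names `compress` and `sparse_dot` and the (position, value) storage format are assumptions chosen for clarity.

```python
from typing import Dict, List, Tuple

def compress(dense: List[float]) -> List[Tuple[int, float]]:
    """Store only (position, value) pairs for the nonzero entries."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]

def sparse_dot(a: List[Tuple[int, float]],
               b: List[Tuple[int, float]]) -> Tuple[float, int]:
    """Dot product of two compressed vectors.

    Returns the result and the number of multiplications actually performed.
    """
    b_lookup: Dict[int, float] = dict(b)  # position -> nonzero value
    total = 0.0
    multiplies = 0
    for position, value in a:
        # Multiply only where both operands are nonzero; every other
        # product would be zero and is skipped.
        if position in b_lookup:
            total += value * b_lookup[position]
            multiplies += 1
    return total, multiplies

a = [0, 3, 0, 0, 2, 0, 0, 1]
b = [4, 5, 0, 0, 0, 0, 0, 2]
result, multiplies = sparse_dot(compress(a), compress(b))
print(result, multiplies)  # 17.0 2
```

In this toy case each eight-element vector is stored as three pairs, and the dot product needs two multiplications instead of eight.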
 However, there are several challenges to exploiting sparsity. Finding the nonzero values in a large tensor is no easy task. Existing approaches often limit the locations of nonzero values by enforcing a sparsity pattern to simplify the search, but this limits the variety of sparse tensors that can be processed efficiently.
 Another challenge is that the number of nonzero values can vary in different regions of the tensor. This makes it difficult to determine how much space is required to store different regions in memory. To make sure the region fits, more space is often allocated than is needed, causing the storage buffer to be underutilized. This increases off-chip memory traffic, which increases energy consumption.
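A rough way to see the underutilization problem is to size every buffer slot for the densest region and then measure how much of the reserved space holds real data. The sketch below is illustrative arithmetic under that assumption; the function name `worst_case_utilization` and the example counts are hypothetical, not taken from the researchers' work.

```python
from typing import List

def worst_case_utilization(nnz_per_region: List[int]) -> float:
    """Fraction of reserved buffer space that holds nonzero values
    when every region gets a slot sized for the densest region."""
    if not nnz_per_region:
        raise ValueError("need at least one region")
    slot_size = max(nnz_per_region)            # worst-case allocation per region
    if slot_size == 0:
        raise ValueError("all regions are empty")
    reserved = slot_size * len(nnz_per_region)  # total space set aside
    used = sum(nnz_per_region)                  # space actually occupied
    return used / reserved

# Hypothetical nonzero counts for four regions of a tensor.
counts = [2, 9, 4, 1]
print(f"{worst_case_utilization(counts):.0%}")  # 44%
```

With these made-up counts, 36 entries are reserved but only 16 are used, so more than half of the buffer sits idle.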
 The MIT and NVIDIA researchers crafted two solutions to address these problems. For one, they developed a technique that allows the hardware to efficiently find the nonzero values for a wider variety of sparsity patterns.
 For the other solution, they created a method that can handle the case where the data do not fit in memory, which increases the utilization of the storage buffer and reduces off-chip memory traffic.
 Both methods boost the performance and reduce the energy demands of hardware accelerators specifically designed to speed up the processing of sparse tensors.
 