![FlashAttention-3 enables H100 GPU power for LLMs](https://varindia.com/public/storage/news/2024/07/DLUargeqEiej674wAibzaixRdl97RfIA1yxSZC4k.jpg)
Researchers from Colfax Research, Meta, Nvidia, Princeton University, Georgia Tech, and Together AI have unveiled FlashAttention-3, a method that dramatically accelerates attention computation on Nvidia Hopper GPUs (H100 and H800). Building on the earlier FlashAttention and FlashAttention-2, it further optimizes resource usage on Hopper GPUs to improve the performance and efficiency of LLM training and inference.
Attention is a core component of the transformer architecture used in large language models (LLMs): it enables the model to compute the relationships between the tokens in an input sequence. But as LLMs grow larger and handle longer input sequences, the computational cost of attention becomes a bottleneck, and this is the challenge FlashAttention-3 addresses.
While the attention mechanism is very effective, it is also computationally expensive. The cost of attention computation grows quadratically with the length of the input sequence. As LLMs are scaled to handle longer and longer input sequences, the attention mechanism becomes a major bottleneck.
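For illustration, here is a minimal PyTorch sketch of standard, unfused attention (a generic baseline, not FlashAttention-3 itself; the tensor shapes are arbitrary). The intermediate score matrix has one entry per pair of tokens, which is where the quadratic cost comes from.

```python
import math
import torch

def naive_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (batch, heads, n, n): one score per token pair
    weights = torch.softmax(scores, dim=-1)          # row-wise exponentiation and normalization
    return weights @ v                               # weighted sum of value vectors

q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)
out = naive_attention(q, k, v)  # doubling seq_len from 1024 to 2048 quadruples the score matrix
```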
Furthermore, modern hardware accelerators such as GPUs are optimized for matrix multiplication (matmul) operations, which are the building blocks of deep learning models. These accelerators also have computational units for other types of operations such as exponentiation, but those units are hundreds of times slower than the matmul components.
Attention combines matrix multiplications with special functions, such as the exponentials in softmax, that GPUs are less well optimized for. An important part of optimizing attention is therefore scheduling the workload so that operations do not block one another and the GPU's different memory tiers are used efficiently.
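One rough way to see this imbalance, assuming a CUDA-capable GPU and PyTorch, is to time a half-precision matmul against an elementwise exponential. The measured exponential rate is also limited by memory bandwidth, so the numbers only loosely reflect the unit-level gap, but the difference in useful work done per millisecond is stark.

```python
import torch

# Requires a CUDA GPU; sizes are arbitrary and results vary across hardware.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

def time_op(fn, iters=50):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()                                    # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

matmul_ms = time_op(lambda: a @ b)      # ~2 * 4096^3 FLOPs on the tensor cores
exp_ms = time_op(lambda: torch.exp(a))  # 4096^2 exponentials on the slower special-function units
print(f"matmul: {matmul_ms:.3f} ms for {2 * 4096**3 / 1e9:.0f} GFLOP")
print(f"exp:    {exp_ms:.3f} ms for {4096**2 / 1e6:.0f} M exponentials")
```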
FlashAttention-3 takes advantage of new features in Nvidia Hopper GPUs to maximize performance. These features enable higher throughput on matrix multiplication operations, faster data transfer across different memory segments, and better efficiency on low-precision operations.
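As a small illustration of the low-precision point, the sketch below casts a tensor to an 8-bit floating-point format and back, assuming a recent PyTorch build that exposes the torch.float8_e4m3fn dtype; FlashAttention-3's actual FP8 path runs inside its fused CUDA kernel.

```python
import torch

x = torch.randn(4, 8, dtype=torch.float16)
x_fp8 = x.to(torch.float8_e4m3fn)  # 8-bit storage: half the bytes of FP16, coarser precision
x_back = x_fp8.to(torch.float16)   # cast back to compare against the original values
print("max round-trip error:", (x - x_back).abs().max().item())
```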
FlashAttention-3 introduces several innovations to improve the performance of attention computation on H100 GPUs.
FlashAttention-3 schedules operations in a way that maximizes the overlap between computation and the movement of data between different memory segments of the GPU. This reduces the time the GPU spends idle waiting for data to be transferred. It also interleaves the matrix multiplication and softmax operations to reduce the possibility of bottlenecks in computing attention values.
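The simplified, single-head Python sketch below shows the block-wise structure this builds on: keys and values are streamed block by block, and the softmax is computed "online" so earlier partial results are rescaled as new blocks arrive. The real FlashAttention-3 kernel runs this loop in CUDA, overlapping the per-block matmuls with the softmax work and with asynchronous loads of the next key/value blocks; the block size and shapes here are arbitrary.

```python
import math
import torch

def blockwise_attention(q, k, v, block_size=128):
    # q, k, v: (seq_len, head_dim); single head, no masking, for clarity
    n, d = q.shape
    scale = 1.0 / math.sqrt(d)
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, n, block_size):          # stream over key/value blocks
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = (q @ kb.T) * scale                # first matmul (tensor-core work)
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)  # rescale previously accumulated results
        p = torch.exp(scores - new_max)            # softmax exponentials (non-matmul work)
        out = out * correction + p @ vb            # second matmul, accumulated block by block
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax((q @ k.T) / math.sqrt(64), dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-4)
```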