The transformer architecture introduced scaled dot-product self-attention as a mechanism for relating different positions of a single sequence in order to compute a representation of that sequence. Given query, key, and value matrices Q, K, and V, the attention weights are computed as softmax(QK^T / sqrt(d_k)), where d_k is the dimension of the keys, and are applied to V to produce the output. The 1/sqrt(d_k) factor keeps the dot products from growing with d_k and pushing the softmax into regions with very small gradients. Multi-head attention runs h attention heads in parallel, each on its own learned linear projection of Q, K, and V, then concatenates the head outputs and projects the result back to the model dimension. This allows the model to jointly attend to information from different representation subspaces at different positions. Positional encodings are added to the input embeddings to provide sequence order information, since self-attention on its own is permutation-equivariant: permuting the input positions simply permutes the outputs in the same way, so the layer has no notion of order. Causal (masked) self-attention prevents each position from attending to subsequent positions, which enables autoregressive language modeling. Cross-attention in encoder-decoder architectures pairs decoder queries with encoder keys and values, allowing the decoder to focus on relevant input positions. FlashAttention computes exact attention while reducing reads and writes to GPU memory: it tiles the computation so that the full N×N attention matrix is never materialized in either the forward or the backward pass, and the backward pass recomputes attention blocks on the fly instead of storing them. The attention mechanism generalizes beyond NLP to computer vision (ViT), graphs (GAT), and protein structure prediction (AlphaFold), demonstrating its versatility as a general information routing mechanism.
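The scaled dot-product, causal masking, and multi-head steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the single-sequence (unbatched) shapes, the square d_model × d_model projection matrices, and the random weights are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n_q, n_k) similarity scores
    if causal:
        # True strictly above the diagonal marks "future" key positions.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)  # exp(-inf) = 0 after softmax
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads, causal=False):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model) (assumed shapes)."""
    n, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        # Each head works on its own d_head-wide slice of the projections.
        cols = slice(h * d_head, (h + 1) * d_head)
        head_out, _ = scaled_dot_product_attention(
            Q[:, cols], K[:, cols], V[:, cols], causal=causal
        )
        heads.append(head_out)
    # Concatenate head outputs and project back to the model dimension.
    return np.concatenate(heads, axis=-1) @ W_o

# Usage: 4 positions, model dimension 8, 2 heads, random illustrative weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) * 0.1 for _ in range(4))

out, weights = scaled_dot_product_attention(X, X, X, causal=True)
mha_out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
```

With `causal=True`, every entry of `weights` above the diagonal is zero, so position i only mixes values from positions 0 through i. The per-head loop is written for readability; practical implementations usually reshape to a (heads, n, d_head) tensor and batch the matrix products instead.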
