Title: Attention Is All You Need — Summary Notes

The paper introduces the Transformer, a model architecture based entirely on
attention mechanisms, dispensing with recurrence and convolutions. The key
innovation is multi-head self-attention, which lets the model jointly attend to
information from different representation subspaces at different positions.

Key components (minimal code sketches follow the list):
- Scaled dot-product attention: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V
- Multi-head attention: several attention functions run in parallel, with their
  outputs concatenated and linearly projected
- Positional encoding: sine/cosine functions inject position information, since
  the model has no recurrence to track token order
- Feed-forward networks: two linear transformations with a ReLU activation in
  between, applied identically at each position
- Residual connections and layer normalization around each sub-layer
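
A minimal NumPy sketch of the scaled dot-product attention formula above. The
function names, toy shapes, and single-head/self-attention usage are
illustrative choices, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_q, seq_k) similarity scores
    weights = softmax(scores, axis=-1)  # each query's distribution over keys
    return weights @ V                  # weighted sum of values: (seq_q, d_v)

# Toy usage: 4 positions, d_k = d_v = 8. Self-attention feeds the same
# sequence representation X in as queries, keys, and values.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```

The 1/sqrt(d_k) scaling keeps the dot products from growing with dimension,
which would otherwise push the softmax into regions with vanishing gradients.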
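And a sketch of the sinusoidal positional encoding, following the paper's
formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the helper name and the
assumption of an even d_model are mine:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of position encodings.

    Assumes d_model is even, so sine and cosine dims pair up cleanly.
    """
    pos = np.arange(max_len)[:, None]             # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)  # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16); added to the token embeddings before layer 1
```

Each dimension is a sinusoid of a different wavelength, so relative offsets
between positions correspond to fixed linear transformations of the encoding.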

The paper demonstrates that recurrence is not necessary for sequence modeling,
challenging the then-prevailing view that sequential processing is fundamental
to language understanding.
