# Attention Mechanisms
The core mechanism that lets LLMs relate each token in a sequence to every other token when building contextual representations.
## Overview
Attention lets a model compute, for each token, a weighted combination of the representations of all tokens in the sequence, with the weights reflecting how relevant each token is to the one being processed.
## Key Concepts
### Scaled Dot-Product Attention
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension; scaling by $\sqrt{d_k}$ keeps the dot products in a range where the softmax does not saturate.
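A minimal NumPy sketch of the formula above; the function name, the expected shapes, and the optional `mask` argument are illustrative assumptions, not taken from the text.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v).
    mask: optional boolean array, True where attention is allowed.
    """
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled to stabilize the softmax.
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get -inf, so the softmax gives them zero weight.
        scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```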
### Multi-Head Attention
- Multiple attention heads in parallel
- Each head learns different relationships
- Outputs are concatenated and projected back to the model dimension (a minimal sketch follows this list)
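The sketch below illustrates the split–attend–concatenate–project pattern for a single unbatched sequence; the function name, argument names, and the plain-array weight matrices are assumptions made for illustration.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Project x into per-head Q/K/V, attend within each head, then
    concatenate the head outputs and apply the output projection.

    x: (seq, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).
    """
    seq, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq, d_model) -> (num_heads, seq, d_head)
        return t.reshape(seq, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

    # Scaled dot-product attention within each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    per_head = weights @ V  # (num_heads, seq, d_head)

    # Concatenate heads and project back to d_model.
    concat = per_head.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o
```

Because each head works in its own $d_\text{model}/\text{num\_heads}$-dimensional subspace, the total cost stays comparable to a single full-width attention head.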
## Variations
### Causal/Autoregressive Attention
- Masks out future tokens, so each position attends only to itself and earlier positions
- Used in GPT-style decoder models (see the mask sketch after this list)
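A toy sketch of how such a mask can be built and applied; the sizes and random scores are illustrative only.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean mask: position i may attend only to positions j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Masked-out (future) positions get -inf, so the softmax assigns them zero weight.
scores = np.random.randn(4, 4)
masked = np.where(causal_mask(4), scores, -np.inf)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # row i has nonzero weights only in columns 0..i
```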
### Bidirectional Attention
- Every token can attend to the full context, both before and after it
- Used in BERT-style encoder models (see the unmasked example below)
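For contrast with the causal case, the same toy computation with no mask; every query position ends up with nonzero weight on every key position (illustrative only).

```python
import numpy as np

# Without a mask, the softmax weights form a dense matrix:
# each token attends to all tokens, past and future.
scores = np.random.randn(4, 4)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # every row sums to 1 with no zeroed-out entries
```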