| Concept | Description |
|---|---|
| Large Language Model | Neural network trained on massive text data to understand and generate language |
| Foundation Model | Pre-trained model that can be adapted for many downstream tasks |
| Parameters | Weights learned during training (billions in modern LLMs) |
| Context Window | Maximum tokens the model can process at once |
| Tokens | Text broken into chunks (words, subwords, characters) |
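
The "Tokens" row is easiest to see in code. Below is a minimal sketch of greedy longest-match subword tokenization; the vocabulary and input text are made up purely for illustration. Production tokenizers (typically BPE variants) learn their vocabularies from data, but the idea of breaking text into reusable chunks is the same, and the context window simply caps how many such chunks fit in one forward pass.

```python
# Toy greedy longest-match subword tokenizer (illustrative only).
# VOCAB is a hand-picked, hypothetical vocabulary, not from any real model.
VOCAB = ["trans", "form", "er", "s", " ", "under", "stand", "lang", "uage", "."]

def tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary piece at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                match = piece
                break
        if match is None:      # unknown character: fall back to a single char
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("transformers understand language."))
# ['trans', 'form', 'er', 's', ' ', 'under', 'stand', ' ', 'lang', 'uage', '.']
```

A 4,096-token context window, for example, would cap the model at 4,096 such chunks per request, counting both the prompt and the generated output.
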
| Component | Purpose | Key Feature |
|---|---|---|
| Self-Attention | Relate tokens to each other | Captures long-range dependencies |
| Multi-Head Attention | Multiple attention patterns | Different relationship types |
| Feed-Forward Network | Process attention output | Non-linear transformations |
| Layer Normalization | Stabilize training | Normalize activations |
| Positional Encoding | Token position info | Sequence order awareness |
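
As a concrete example of the last row, here is the sinusoidal positional encoding from the original Transformer paper, sketched in NumPy. The sequence length and model dimension below are arbitrary, and many modern LLMs use learned or rotary position embeddings instead, so treat this as one common choice rather than the only one.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions: cosine
    return pe

print(sinusoidal_positional_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
```

Adding this matrix to the token embeddings gives each position a distinct, smoothly varying signature, which is how an otherwise order-blind attention layer gains sequence-order awareness.
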
Attention(Q,K,V) = softmax(QK^T / √d_k) × V
The scaling factor √d_k keeps the dot products from growing too large, which would otherwise push the softmax into regions with vanishingly small gradients.
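
A direct NumPy transcription of the formula is sketched below. The random Q, K, and V matrices are chosen only for illustration, and the sketch omits masking and the learned projection matrices that a real multi-head attention layer would apply.

```python
import numpy as np

def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)    # (seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # (seq_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query tokens, d_k = 4
K = rng.normal(size=(5, 4))   # 5 key tokens
V = rng.normal(size=(5, 6))   # 5 value vectors, d_v = 6
print(scaled_dot_product_attention(Q, K, V).shape)    # (3, 6)
```

The softmax is taken row-wise over the keys, so each query token ends up with a weighted average of the value vectors, which is exactly how self-attention lets every token draw on information from every other token in the window.
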