Core Concepts
Autoregressive language models generate text one token at a time, where each new token depends on all previous tokens in the sequence.
This approach enables the model to capture long-range dependencies and produce coherent, contextually relevant text.
Autoregressive Formula
The probability of a sequence factors, by the chain rule, into the product of each token's probability conditioned on all preceding tokens.
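In symbols, for a token sequence x_1, ..., x_T:

P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})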
Model Features
- Generates text sequentially, one token at a time
- Each prediction depends on all previous context
- Uses causal masking to prevent future information leakage
- Probabilistic output allows for creative variations
- Sampling parameters such as temperature and top-k control randomness and diversity (see the sketch after this list)
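The generation loop can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's API: `model` stands in for any callable that returns next-token logits given the tokens produced so far, and the temperature and top-k defaults are arbitrary.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=5):
    """Turn raw next-token scores into a probability distribution and sample one token id."""
    scores = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    # Keep only the top-k candidates; all other tokens get zero probability.
    keep = np.argsort(scores)[-top_k:]
    masked = np.full_like(scores, -np.inf)
    masked[keep] = scores[keep]
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

def generate(model, prompt_ids, max_new_tokens=20, temperature=0.8, top_k=5):
    """Autoregressive loop: every new token is conditioned on all tokens before it."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(tokens)  # hypothetical callable: scores for the next token
        tokens.append(sample_next_token(logits, temperature, top_k))
    return tokens
```

Lower temperatures concentrate probability on the most likely tokens; higher temperatures and larger top-k values produce more varied output.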
Transformer Visualization
Model Architecture
- Embedding Layer: maps input tokens to vector space
- Positional Encoding: adds positional information using sinusoidal (sine and cosine) functions
- Transformer Blocks: a multi-layer stack with attention mechanisms
- Multi-Head Attention: captures relationships between tokens
- Feed Forward Network: processes each position independently
- Output Layer: produces a probability distribution over the vocabulary
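The components above can be wired together in a compact PyTorch sketch. This is illustrative only, not the architecture of any particular model: the class name TinyAutoregressiveLM, the layer sizes, and the choice of nn.TransformerEncoder with a causal mask are all assumptions made here for brevity.

```python
import math
import torch
import torch.nn as nn

class TinyAutoregressiveLM(nn.Module):
    """Minimal decoder-style stack mirroring the components listed above (illustrative only)."""

    def __init__(self, vocab_size=1000, d_model=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)                   # Embedding Layer
        self.register_buffer("pos", self._sinusoids(max_len, d_model))   # Positional Encoding
        block = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)             # attention + feed-forward
        self.blocks = nn.TransformerEncoder(block, n_layers)             # Transformer Blocks
        self.out = nn.Linear(d_model, vocab_size)                        # Output Layer

    @staticmethod
    def _sinusoids(max_len, d_model):
        # Sine/cosine positional encoding table of shape (max_len, d_model).
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) tensor of integer token ids.
        seq_len = token_ids.size(1)
        x = self.embed(token_ids) + self.pos[:seq_len]
        # Causal mask so position t can only attend to positions <= t.
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.blocks(x, mask=causal_mask)
        return self.out(x)  # logits over the vocabulary at every position
```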
Generated Tokens and Token Probabilities
The visualization starts from the input prompt "Once upon a" and, at each generation step, shows the top 5 most probable next tokens.
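To reproduce a view like this, the probability of each candidate next token comes from a softmax over the output-layer logits at the last position. The prompt token ids below are made up; a real tokenizer would supply them, and the untrained sketch above would give near-uniform probabilities.

```python
import torch

# Made-up token ids standing in for the prompt "Once upon a".
prompt_ids = torch.tensor([[12, 407, 9]])

model = TinyAutoregressiveLM()                     # sketch defined above
with torch.no_grad():
    logits = model(prompt_ids)                     # (batch, seq_len, vocab_size)
    next_probs = torch.softmax(logits[0, -1], dim=-1)
    top5 = torch.topk(next_probs, k=5)             # the 5 most probable next tokens

for token_id, prob in zip(top5.indices.tolist(), top5.values.tolist()):
    print(f"token {token_id}: p = {prob:.3f}")
```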