Core Concepts

Autoregressive language models generate text one token at a time, where each new token depends on all previous tokens in the sequence.

This approach enables the model to capture long-range dependencies and produce coherent, contextually relevant text.
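
To make the loop concrete, here is a minimal sketch of autoregressive decoding in Python. The model itself is replaced by a hypothetical next_token_distribution function that returns a toy distribution; a real model would compute that distribution from all previous tokens.

```python
import numpy as np

# Toy vocabulary; a real model works over tens of thousands of subword tokens.
VOCAB = ["Once", "upon", "a", "time", "<eos>"]

def next_token_distribution(tokens):
    # Hypothetical stand-in for a model forward pass. A real model would
    # condition on every token in `tokens`; this toy version is uniform.
    return np.full(len(VOCAB), 1.0 / len(VOCAB))

def generate(prompt_tokens, max_new_tokens=5, seed=0):
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)    # depends on ALL previous tokens
        next_id = rng.choice(len(VOCAB), p=probs)  # sample the next token
        tokens.append(VOCAB[next_id])
        if tokens[-1] == "<eos>":                  # stop at the end-of-sequence token
            break
    return tokens

print(generate(["Once", "upon", "a"]))
```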

Autoregressive Formula

P(x) = \prod_{t=1}^{T} P(x_t \mid x_{<t})

By the chain rule, the probability of a sequence is the product of each token's probability conditioned on all tokens that precede it.
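
As a toy illustration of this factorization, here is the calculation for a three-token sequence; the conditional probabilities below are made-up numbers, not values from any actual model.

```python
# Illustrative numbers only: the chain-rule factorization of a 3-token sequence.
# P("Once upon a") = P("Once") * P("upon" | "Once") * P("a" | "Once upon")
p_x1 = 0.020   # P(x1 = "Once")
p_x2 = 0.400   # P(x2 = "upon" | x1)
p_x3 = 0.650   # P(x3 = "a" | x1, x2)

p_sequence = p_x1 * p_x2 * p_x3
print(f"P(sequence) = {p_sequence:.4f}")   # 0.0052
```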

Model Features

  • Generates text sequentially, one token at a time
  • Each prediction depends on all previous context
  • Uses causal masking to prevent future information leakage (see the sketch after this list)
  • Probabilistic output allows for creative variations
  • Sampling parameters such as temperature, top-k, and top-p control randomness and diversity
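
The causal mask mentioned above can be sketched in a few lines of NumPy: positions above the diagonal are set to minus infinity so that, after the softmax, each token assigns zero attention weight to tokens that come after it. The 3-token sequence length matches the demo's prompt; everything else here is illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    # Strictly upper-triangular entries correspond to "future" tokens.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = np.zeros((3, 3))          # pretend attention scores for a 3-token prompt
masked = scores + causal_mask(3)   # -inf above the diagonal

print(softmax(masked))
# Row 0 attends only to token 0; row 2 spreads its weight over tokens 0-2.
```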

Transformer Visualization

[Interactive panel: animates each token's path through the model (Embedding → Attention → FFN → Softmax), showing the current generation step t and the prediction confidence; each prediction depends only on previous tokens.]

Generation Controls

[Interactive panel: a view-mode toggle and sliders with current values 8, 12, 1.0, 50, and 0.9.]
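
Assuming the sampling-related sliders are the usual temperature, top-k, and top-p controls (the 1.0, 50, and 0.9 values match common defaults for those settings), a sketch of how the three controls reshape a next-token distribution might look like this; the logits are toy numbers.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9, seed=0):
    """Sketch of common sampling controls: temperature, top-k, and top-p (nucleus)."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=np.float64) / temperature   # temperature scaling

    # Top-k: keep only the k highest-scoring tokens.
    if top_k is not None and top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability >= p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    probs = mask / mask.sum()

    return rng.choice(probs.size, p=probs)

logits = [2.0, 1.0, 0.5, 0.1, -1.0]   # toy scores for a 5-token vocabulary
print(sample_next_token(logits, temperature=1.0, top_k=3, top_p=0.9))
```

Lower temperatures sharpen the distribution toward the most likely token, while top-k and top-p both truncate the long tail before sampling.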

Model Architecture

Embedding Layer

Maps input tokens to vector space
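
A minimal sketch of the lookup, with toy sizes and random weights standing in for trained ones (the demo's real shapes, vocabulary 50257 and d_model 4096, appear in the Tensor Shapes section below):

```python
import numpy as np

# Each token id selects one row of a learned weight matrix.
vocab_size, d_model = 100, 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = np.array([[17, 42, 3]])        # arbitrary ids for a 3-token prompt, batch of 1
embeddings = embedding_table[token_ids]    # shape [1, 3, 8]  (real model: [1, 3, 4096])
print(embeddings.shape)
```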

Positional Encoding

Adds positional information using sinusoidal (sine and cosine) functions of the token position
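
Assuming the standard sinusoidal scheme from the original Transformer paper (sine on even dimensions, cosine on odd ones, at geometrically spaced frequencies), a sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]              # [seq_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angles = positions / (10000 ** (dims / d_model))     # [seq_len, d_model/2]
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                         # cosine on odd dimensions
    return pe

pe = positional_encoding(seq_len=3, d_model=8)
print(pe.shape)   # (3, 8); added element-wise to the token embeddings
```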

Transformer Blocks

A stack of layers, each combining multi-head attention with a feed-forward network
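
A structural sketch of the stack, using identity placeholders for attention and the feed-forward network (real versions follow in the next sections) and assuming a pre-norm layout with residual connections; the depth here is arbitrary:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(x):       # identity placeholder for multi-head self-attention
    return x

def feed_forward(x):    # identity placeholder for the position-wise FFN
    return x

def transformer_block(x):
    x = x + attention(layer_norm(x))       # residual connection around attention
    x = x + feed_forward(layer_norm(x))    # residual connection around the FFN
    return x

x = np.random.default_rng(0).normal(size=(1, 3, 8))   # [batch, seq, d_model]
for _ in range(4):                                     # arbitrary toy depth
    x = transformer_block(x)
print(x.shape)                                         # (1, 3, 8)
```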

Multi-Head Attention

Captures relationships between tokens
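
A toy multi-head self-attention with random projection weights and the causal mask from earlier; the sizes are small stand-ins for the demo's real shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Scaled dot-product attention across several heads, with a causal mask."""
    batch, seq, d_model = x.shape
    d_head = d_model // num_heads
    wq, wk, wv, wo = (rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4))

    def split_heads(t):   # [batch, seq, d_model] -> [batch, heads, seq, d_head]
        return t.reshape(batch, seq, num_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)   # [batch, heads, seq, seq]
    scores += np.triu(np.full((seq, seq), -np.inf), k=1)     # causal mask
    weights = softmax(scores)                                # attention weights
    out = weights @ v                                        # [batch, heads, seq, d_head]
    out = out.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)
    return out @ wo, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3, 8))
out, weights = multi_head_attention(x, num_heads=2, rng=rng)
print(out.shape, weights.shape)   # (1, 3, 8) (1, 2, 3, 3)
```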

Feed Forward Network

Processes each position independently
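
A sketch with toy dimensions; the 4x inner width and ReLU activation are common conventions rather than something the demo specifies (GPT-style models typically use GELU).

```python
import numpy as np

# The same two-layer MLP is applied to every position independently.
def feed_forward(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU nonlinearity
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
w1 = rng.normal(scale=0.02, size=(d_model, d_ff))
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

x = rng.normal(size=(1, 3, d_model))          # [batch, seq, d_model]
print(feed_forward(x, w1, b1, w2, b2).shape)  # (1, 3, 8): shape is preserved
```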

Output Layer

Produces probability distribution over tokens
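
A sketch of the final projection and softmax with toy sizes (the demo's real shapes are [1, 3, 4096] hidden states and [1, 3, 50257] logits):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 100
w_out = rng.normal(scale=0.02, size=(d_model, vocab_size))

hidden = rng.normal(size=(1, 3, d_model))      # final hidden states [batch, seq, d_model]
logits = hidden @ w_out                        # [1, 3, 100]  (real model: [1, 3, 50257])
next_token_probs = softmax(logits[:, -1, :])   # distribution for the NEXT token
print(next_token_probs.shape, next_token_probs.sum())   # (1, 100), sums to 1.0
```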

Generated Tokens

Generated so far: Once upon a ...

Input prompt: "Once upon a"

Token Probabilities

Showing the top 5 most probable next tokens
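
A small sketch of how such a top-5 list can be read off a probability vector; the tokens and probabilities here are invented for illustration, not the demo's actual values.

```python
import numpy as np

vocab = ["time", "midnight", "hill", "day", "dream", "cold", "rainy"]
probs = np.array([0.55, 0.12, 0.10, 0.08, 0.06, 0.05, 0.04])   # toy distribution

top5 = np.argsort(probs)[::-1][:5]      # indices of the 5 highest probabilities
for idx in top5:
    print(f"{vocab[idx]:>10s}  {probs[idx]:.2f}")
```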

Tensor Shapes

Input tokens: [1, 3] (batch of 1, 3-token prompt)
Embedding: [1, 3, 4096] (one 4096-dimensional vector per token)
Attention weights: [12, 1, 3, 3] (12 attention maps, each 3×3 over the token pairs)
FFN output: [1, 3, 4096] (same shape as its input)
Logits: [1, 3, 50257] (one score per vocabulary token at each position)