// CHAPTER 02 OF 05

The Transformer Stack

Tokenization → Embeddings → Attention → Transformer. Four ideas that unlocked modern AI.

3. Tokenization

Before AI can read anything, it breaks text into tiny pieces called tokens. A token isn't always a full word — "playing" might split into "play" and "ing".

Why not just use full words? Language is messy: new words appear daily, people make typos and mix languages. Tokens let AI handle anything with a fixed set of building blocks.

💡 AI doesn't read text the way humans do. It reads tokens — and from tokens, it builds meaning step by step.
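The "play" + "ing" split above can be sketched in a few lines. This is a toy greedy longest-match tokenizer with a hand-made vocabulary (an assumption for illustration); real tokenizers such as BPE learn their vocabulary from data rather than using a fixed list like this.

```python
# Toy vocabulary — real tokenizers learn tens of thousands of entries from data.
VOCAB = {"play", "ing", "run", "jump", "ed", "s", "un", "happy"}

def tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest substring first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:  # no vocabulary entry matched → emit the single character
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("playing"))  # ['play', 'ing']
print(tokenize("unhappy"))  # ['un', 'happy']
```

Note how an unseen word still tokenizes: it just falls back to smaller pieces, which is exactly why a fixed set of building blocks can cover typos and new words.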
4. Embeddings

Once text is tokenized, each token gets turned into a vector — basically a list of numbers that represents its meaning. The model works with numbers, not words.

Think of it as a map. Similar words live close together; different words are far apart. The model understands meaning through distance and direction.

The interactive "meaning space" map plots example words by category: professions (👨‍⚕️ doctor, 👩‍⚕️ nurse, 🎭 actor, 💃 actress), places (🏥 hospital), royalty (🤴 prince), and nature (⛰️ mountain, 🌊 ocean). Clicking a word highlights which neighbors are semantically close and which are far away in meaning.
💡 The model doesn't think in definitions. It thinks in geometry. "actor" → "actress" follows the same pattern as "prince" → "princess". That's wild!
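"Meaning through distance" can be made concrete with cosine similarity. A minimal sketch with hand-made 3-D vectors (the numbers are invented for illustration; real embeddings have hundreds or thousands of learned dimensions):

```python
import numpy as np

# Toy 3-D embeddings — values are made up to mimic the "map" idea.
emb = {
    "doctor":   np.array([0.90, 0.80, 0.10]),
    "nurse":    np.array([0.85, 0.75, 0.15]),
    "mountain": np.array([0.10, 0.20, 0.90]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["doctor"], emb["nurse"]))     # close to 1 → semantically close
print(cosine(emb["doctor"], emb["mountain"]))  # much lower → far away in meaning
```

The same geometry underlies the "actor" → "actress" pattern: the difference vector between related word pairs points in a roughly consistent direction.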
5. Attention

A word's meaning changes depending on context. "Apple" can be a fruit or a company. Embeddings alone can't handle this — they give each word a fixed representation.

Attention lets every word look at every other word and decide what matters. It's the model asking: "given everything in this sentence, what should I focus on?"

In the interactive demo, clicking a word in the sentence "She bought shares in Apple last week" shows where the model focuses its attention. Context words like "bought" and "shares" are what tell the model that "Apple" here means the company, not the fruit.
💡 Before attention, models read word-by-word, left to right. Attention changed that — now the model sees everything at once and decides what matters.
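"Every word looks at every other word" is, at its core, one formula: softmax(QKᵀ/√d)·V. A minimal NumPy sketch of single-head self-attention over random toy vectors (no learned weights, just the mechanism):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # how much each token attends to each other
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))        # 3 tokens, 4-dim vectors
out, w = attention(x, x, x)        # self-attention: Q = K = V come from the same tokens
print(w.round(2))                  # each row is a distribution that sums to 1
```

Each row of `w` is one token's "focus budget" spread over the whole sentence, which is exactly the "what should I focus on?" question from above.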
6. Transformer

Everything — tokens, embeddings, attention — comes together in the transformer. It's the architecture that powers almost every modern AI today.

The transformer was introduced in 2017 in a paper called "Attention Is All You Need". The radical idea: make attention the core mechanism and process everything in parallel.

How transformers refine understanding layer by layer:

📝 Early layers (surface): basic grammar and sentence structure
🔗 Middle layers (relationships): word relationships and context
🧩 Deep layers (reasoning): complex reasoning and connections

Text → Tokens → Vectors → Attention → Understanding
Why transformers are fast: Old models read text one word at a time (slow). Transformers process all tokens in parallel. That's why they can scale to billions of parameters — they use GPUs the same way you'd parallelize tasks across multiple cores.
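The serial-vs-parallel difference is easy to see in code. A sketch contrasting a recurrent-style loop (each step waits for the previous one) with a transformer-style matrix multiply that transforms every token at once (random toy values, no learned model):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.normal(size=(seq_len, d))   # one vector per token
W = rng.normal(size=(d, d))

# Recurrent-style: step t needs the result of step t-1 → inherently serial.
h = np.zeros(d)
states = []
for t in range(seq_len):
    h = np.tanh(x[t] + h @ W)
    states.append(h)

# Transformer-style: one matrix multiply handles every position at once,
# which is the shape of work GPUs parallelize across thousands of cores.
out = np.tanh(x @ W)
print(out.shape)  # (6, 8) — all positions computed together
```

The loop's output depends on order and cannot be split across cores; the single `x @ W` has no such dependency, which is what lets transformers scale.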
💡 GPT, Claude, Gemini, Llama — they all use the transformer architecture. It's the foundation of basically all modern AI language systems.
Chapter 2 done! 🔥 6 concepts down. Up next: LLMs & Hallucination.