Before AI can read anything, it breaks text into tiny pieces called tokens. A token isn't always a full word — "playing" might split into "play" and "ing".
Why not just use full words? Language is messy: new words appear daily, people make typos and mix languages. Tokens let AI handle anything with a fixed set of building blocks.
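To make that concrete, here's a minimal sketch of a greedy subword tokenizer in Python. The vocabulary is invented for illustration; real tokenizers such as BPE or WordPiece learn their pieces from data rather than using a hand-written list.

```python
# Toy subword tokenizer: greedily match the longest known piece.
# VOCAB is made up for illustration; real tokenizers learn their pieces.
VOCAB = {"play", "ing", "she", "bought", "share", "s", "in", "apple", "last", "week"}

def tokenize(word: str) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own piece.
            pieces.append(word[i])
            i += 1
    return pieces

print(tokenize("playing"))  # ['play', 'ing']
```

Because any unknown character still falls back to a single-character piece, even typos and brand-new words get some tokenization rather than breaking the model.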
Once text is tokenized, each token gets turned into a vector — basically a list of numbers that represents its meaning. The model works with numbers, not words.
Think of it as a map. Words with similar meanings live close together; unrelated words sit far apart. The model understands meaning through distance and direction.
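Here's a tiny sketch of that idea with made-up three-dimensional vectors (real embeddings have hundreds or thousands of learned dimensions). Cosine similarity measures how closely two vectors point in the same direction:

```python
import numpy as np

# Toy "embeddings"; the numbers are invented for illustration.
embeddings = {
    "cat":   np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.85, 0.75, 0.2]),
    "stock": np.array([0.1, 0.2, 0.95]),
}

def cosine_similarity(a, b):
    # 1.0 means pointing the same way, near 0.0 means unrelated directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))    # high: ~0.99
print(cosine_similarity(embeddings["cat"], embeddings["stock"]))  # low:  ~0.3
```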
A word's meaning changes depending on context. "Apple" can be a fruit or a company. Embeddings alone can't handle this — they give each word a fixed representation.
Attention lets every word look at every other word and decide what matters. It's the model asking: "given everything in this sentence, what should I focus on?"
Take the sentence "She bought shares in Apple last week." When the model reaches "Apple", attention puts weight on "bought" and "shares", so "Apple" resolves to the company rather than the fruit.
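Here is scaled dot-product attention in miniature. The queries, keys, and values below are random stand-ins (in a real model they are learned projections of the token embeddings), so the printed weights are arbitrary; the point is the mechanics, not the numbers.

```python
import numpy as np

words = ["She", "bought", "shares", "in", "Apple", "last", "week"]
d = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(words), d))  # queries: what each word is looking for
K = rng.normal(size=(len(words), d))  # keys: what each word offers
V = rng.normal(size=(len(words), d))  # values: the information to pass along

scores = Q @ K.T / np.sqrt(d)                                          # how much each word matches each other word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax: rows sum to 1
output = weights @ V                                                   # each word's new, context-aware vector

# Attention weights for "Apple": one number per word in the sentence.
for word, w in zip(words, weights[words.index("Apple")]):
    print(f"{word:>7}: {w:.2f}")
```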
Everything — tokens, embeddings, attention — comes together in the transformer. It's the architecture that powers almost every modern AI today.
The transformer was introduced in 2017 in a paper called "Attention Is All You Need". The radical idea: make attention the core mechanism and process every token in parallel instead of one at a time.
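A rough sketch of one transformer block, just to show the shape of the computation: single-head attention with random weights, no positional encoding, and none of the other details of a trainable model.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                                            # embedding size
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(x):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return weights @ V

def feed_forward(x):
    return np.maximum(x @ W1, 0) @ W2             # two linear layers with a ReLU between

def transformer_block(x):
    x = layer_norm(x + attention(x))              # every token attends to every other token, in parallel
    x = layer_norm(x + feed_forward(x))           # then each token is transformed independently
    return x

tokens = rng.normal(size=(7, d))                  # 7 token embeddings, e.g. our example sentence
print(transformer_block(tokens).shape)            # (7, 16): same shape, now context-aware
```

Stack dozens of these blocks and you have the backbone of today's large language models.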