Before AI can read anything, it breaks text into tiny pieces called tokens. A token isn't always a full word — "playing" might split into "play" and "ing".
Why not just use full words? Language is messy: new words appear daily, people make typos and mix languages. Tokens let AI handle anything with a fixed set of building blocks.
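To make that concrete, here's a minimal sketch of a greedy subword tokenizer in Python. The vocabulary is invented for illustration; real tokenizers such as BPE or WordPiece learn their pieces from data rather than using a hand-written list.

```python
# Toy subword tokenizer: greedily match the longest known piece.
# VOCAB is made up for illustration; real tokenizers learn their pieces.
VOCAB = {"play", "ing", "she", "bought", "share", "s", "in", "apple", "last", "week"}

def tokenize(word: str) -> list[str]:
    pieces, i = [], 0
    while i < len(word):
        # Try the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own piece.
            pieces.append(word[i])
            i += 1
    return pieces

print(tokenize("playing"))  # ['play', 'ing']
```

Because any unknown character still falls back to a single-character piece, even typos and brand-new words get some tokenization rather than breaking the model.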
Once text is tokenized, each token gets turned into a vector — basically a list of numbers that represents its meaning. The model works with numbers, not words.
Think of it as a map. Words with similar meanings live close together; unrelated words sit far apart. The model understands meaning through distance and direction.
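Here's a tiny sketch of that idea with made-up three-dimensional vectors (real embeddings have hundreds or thousands of learned dimensions). Cosine similarity measures how closely two vectors point in the same direction:

```python
import numpy as np

# Toy "embeddings"; the numbers are invented for illustration.
embeddings = {
    "cat":   np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.85, 0.75, 0.2]),
    "stock": np.array([0.1, 0.2, 0.95]),
}

def cosine_similarity(a, b):
    # 1.0 means pointing the same way, near 0.0 means unrelated directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))    # high: ~0.99
print(cosine_similarity(embeddings["cat"], embeddings["stock"]))  # low:  ~0.3
```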
A word's meaning changes depending on context. "Apple" can be a fruit or a company. Embeddings alone can't handle this — they give each word a fixed representation.
Attention lets every word look at every other word and decide what matters. It's the model asking: "given everything in this sentence, what should I focus on?"
Take the sentence "She bought shares in Apple last week." When the model reaches "Apple", attention puts weight on "bought" and "shares", so "Apple" resolves to the company rather than the fruit.
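Here is scaled dot-product attention in miniature. The queries, keys, and values below are random stand-ins (in a real model they are learned projections of the token embeddings), so the printed weights are arbitrary; the point is the mechanics, not the numbers.

```python
import numpy as np

words = ["She", "bought", "shares", "in", "Apple", "last", "week"]
d = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(len(words), d))  # queries: what each word is looking for
K = rng.normal(size=(len(words), d))  # keys: what each word offers
V = rng.normal(size=(len(words), d))  # values: the information to pass along

scores = Q @ K.T / np.sqrt(d)                                          # how much each word matches each other word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax: rows sum to 1
output = weights @ V                                                   # each word's new, context-aware vector

# Attention weights for "Apple": one number per word in the sentence.
for word, w in zip(words, weights[words.index("Apple")]):
    print(f"{word:>7}: {w:.2f}")
```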
Everything — tokens, embeddings, attention — comes together in the transformer. It's the architecture that powers almost every modern AI today.
The transformer was introduced in 2017 in a paper called "Attention Is All You Need". The radical idea: make attention the core mechanism and process every token in parallel instead of one at a time.
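A rough sketch of one transformer block, just to show the shape of the computation: single-head attention with random weights, no positional encoding, and none of the other details of a trainable model.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                                            # embedding size
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(x):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return weights @ V

def feed_forward(x):
    return np.maximum(x @ W1, 0) @ W2             # two linear layers with a ReLU between

def transformer_block(x):
    x = layer_norm(x + attention(x))              # every token attends to every other token, in parallel
    x = layer_norm(x + feed_forward(x))           # then each token is transformed independently
    return x

tokens = rng.normal(size=(7, d))                  # 7 token embeddings, e.g. our example sentence
print(transformer_block(tokens).shape)            # (7, 16): same shape, now context-aware
```

Stack dozens of these blocks and you have the backbone of today's large language models.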