Early AI models were unimodal — text in, text out. Multimodal AI can process and generate across multiple types of data: text, images, audio, video, and even code — often simultaneously.
The key insight is that the transformer architecture generalizes across modalities once the input is turned into tokens: images can be split into patches (visual tokens), audio into spectrogram frames, video into sequences of frames, all processed by the same attention mechanism you met in Chapter 2.
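To make the "visual tokens" idea concrete, here is a minimal Python sketch of ViT-style patchification: the image is cut into fixed-size patches and each patch is projected into the same embedding space the transformer uses for text tokens. The function name and the random projection matrix are illustrative stand-ins for a real model's learned projection layer, not any specific model's code.

```python
# Minimal sketch: turn an image into a sequence of "visual tokens".
import numpy as np

def image_to_patch_tokens(image: np.ndarray, patch_size: int = 16, embed_dim: int = 768) -> np.ndarray:
    """image: (H, W, C) array; returns (num_patches, embed_dim) token embeddings."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly into patches"

    # Cut the image into a grid of (patch_size x patch_size) patches, then flatten each patch.
    patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)

    # In a real model this is a learned linear projection; random weights here for illustration.
    projection = np.random.randn(patch_size * patch_size * c, embed_dim) * 0.02
    return patches @ projection

tokens = image_to_patch_tokens(np.random.rand(224, 224, 3))
print(tokens.shape)  # (196, 768): 14x14 patches, each now a "visual token"
```

From here on, the attention layers treat these 196 vectors exactly like a 196-token text sequence.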
You met basic AI agents in Chapter 5. The 2025-2026 wave is different: agentic AI systems that run for hours or days, use dozens of tools, spawn sub-agents, and complete complex multi-step tasks with minimal human input.
| Basic agents (Chapter 5) | Agentic systems (2025-2026) |
| --- | --- |
| Single LLM | Multi-model orchestration |
| Basic tool use (search, calculator) | Browser, IDE, APIs, file system |
| Short context window | 200K+ token memory |
| Human in the loop constantly | Runs autonomously for hours |
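The jump from the left column to the right is mostly about the control loop: the model repeatedly chooses a tool, the runtime executes it, and the result is fed back into the model's context until the task is done. Below is a minimal Python sketch of that loop; the tool names and the call_model() stub are hypothetical stand-ins, not any vendor's API.

```python
# Minimal sketch of an agent's tool-calling loop (stubbed model, hypothetical tools).
import json

def search_web(query: str) -> str:   # hypothetical tool
    return f"Top results for {query!r} ..."

def read_file(path: str) -> str:     # hypothetical tool
    with open(path) as f:
        return f.read()

TOOLS = {"search_web": search_web, "read_file": read_file}

def call_model(history: list[dict]) -> dict:
    """Stub standing in for a real LLM call that returns either a tool request
    ({"tool": ..., "args": ...}) or a final answer ({"done": True, "answer": ...})."""
    return {"done": True, "answer": "stubbed answer"}

def run_agent(task: str, max_steps: int = 20) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):                    # cap autonomy with a step budget
        action = call_model(history)
        if action.get("done"):
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])   # execute the requested tool
        history.append({"role": "tool", "content": json.dumps({"tool": action["tool"], "result": result})})
    return "stopped: step budget exhausted"

print(run_agent("Summarize the latest release notes"))
```

Everything in the right-hand column (more tools, bigger memory, longer runs) is an elaboration of this same loop rather than a different mechanism.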
Systems like Devin (software engineer agent), Operator (browser-based agent), and Claude's computer use can control a computer like a human would — clicking, typing, navigating — to complete complex tasks end-to-end.
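Underneath, computer-use agents run an observe-act loop: capture a screenshot, ask a multimodal model for the next UI action, execute it, and repeat until the goal is met. The sketch below shows the shape of that loop; the helper functions and the choose_action() stub are hypothetical placeholders, not the actual interfaces of Devin, Operator, or Claude.

```python
# Minimal sketch of the observe-act loop behind computer-use agents.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str              # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

def take_screenshot() -> bytes:          # placeholder for real screen capture
    return b"<png bytes>"

def execute(action: Action) -> None:     # placeholder for real mouse/keyboard control
    print(f"executing {action.kind} at ({action.x}, {action.y}) text={action.text!r}")

def choose_action(screenshot: bytes, goal: str, step: int) -> Action:
    """Stub standing in for a multimodal model that reads the screenshot
    and proposes the next UI action toward the goal."""
    return Action(kind="done") if step > 0 else Action(kind="click", x=640, y=360)

def run_computer_agent(goal: str, max_steps: int = 50) -> None:
    for step in range(max_steps):
        action = choose_action(take_screenshot(), goal, step)
        if action.kind == "done":
            return
        execute(action)

run_computer_agent("Open the settings page and enable dark mode")
```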
Here's the honest answer about AI's future: nobody knows exactly, and anyone who claims certainty is overconfident. But here are the well-evidenced trends shaping the next 3–5 years:
What makes a model "multimodal"?