AI learns from human-generated data. And humans are biased. So — spoiler — AI can be biased too. AI bias happens when a model produces systematically unfair results for certain groups of people.
This isn't theoretical. Real examples have included:
- Facial recognition systems with much higher error rates for darker-skinned faces
- Hiring algorithms that downranked women because training data reflected historical hiring bias
- Loan approval AI that denied credit more often in certain zip codes (proxy for race)
- Medical AI trained primarily on male patients performing worse for women
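A common thread in these examples is that bias shows up as a measurable gap: the model's error rate differs across groups. A minimal sketch of that kind of audit, using invented toy data:

```python
# Toy bias audit: compare a model's error rate across demographic groups.
# All data here is invented purely for illustration.
predictions = [1, 0, 1, 1, 0, 1, 0, 0]   # model outputs
labels      = [1, 0, 0, 1, 1, 1, 0, 1]   # ground truth
groups      = ["A", "A", "A", "A", "B", "B", "B", "B"]

def error_rate_by_group(preds, labels, groups):
    errors, totals = {}, {}
    for p, y, g in zip(preds, labels, groups):
        totals[g] = totals.get(g, 0) + 1
        errors[g] = errors.get(g, 0) + (p != y)
    return {g: errors[g] / totals[g] for g in totals}

rates = error_rate_by_group(predictions, labels, groups)
print(rates)  # a large gap between groups is a red flag
```

On this toy data, group B's error rate is double group A's. Real audits are more involved (statistical significance, intersectional groups), but the core move is exactly this: slice the evaluation by group instead of averaging everyone together.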
Here's the tricky part: "fair" is not one thing. Researchers have defined dozens of fairness metrics, and several of them are provably impossible to satisfy simultaneously except in special cases (for example, when every group has the same base rate of the outcome being predicted).
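To make that concrete, here is a sketch of two common metrics colliding on the same (invented) predictions. Demographic parity asks for equal positive-prediction rates across groups; equal opportunity asks for equal true positive rates. When the groups have different base rates, a classifier can satisfy one but not the other:

```python
# Two fairness metrics on the same toy predictions (data invented for illustration).
# Demographic parity: P(pred=1 | group) equal across groups.
# Equal opportunity:  P(pred=1 | group, label=1) equal across groups.
def positive_rate(preds):
    return sum(preds) / len(preds)

def true_positive_rate(preds, labels):
    positives = [p for p, y in zip(preds, labels) if y == 1]
    return sum(positives) / len(positives)

# Group A: 3 of 4 people qualified. Group B: 1 of 4 qualified.
preds_a, labels_a = [1, 1, 1, 0], [1, 1, 1, 0]
preds_b, labels_b = [1, 0, 0, 0], [1, 0, 0, 0]

# Equal opportunity holds: every qualified person is approved in both groups.
print(true_positive_rate(preds_a, labels_a),
      true_positive_rate(preds_b, labels_b))   # 1.0 1.0
# But demographic parity fails: approval rates are 0.75 vs 0.25.
print(positive_rate(preds_a), positive_rate(preds_b))
```

The classifier here is perfectly accurate and treats every qualified individual identically, yet one group gets approved three times as often. Which metric "should" win is exactly the kind of value judgment the next paragraph is about.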
This means AI fairness isn't a purely technical problem. It's a social and political one. Engineers have to make choices about which type of fairness matters most for their specific use case — and that requires involving affected communities.
Alignment is the problem of making sure AI systems actually do what we want — not just what we literally asked for. This sounds easy. It's one of the hardest open problems in AI.
The misalignment problem gets worse as AI gets more capable. One especially worrying failure mode is goal misgeneralization: the AI learns a proxy goal that happens to work during training, then keeps pursuing that proxy in the real world, where it can diverge dangerously from what we actually wanted.
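A toy illustration of the idea, in the spirit of a well-known reinforcement-learning example where an agent trained to collect a coin that always appeared at the right edge of the level learned "go right" instead of "get the coin" (the setup below is a made-up sketch, not that experiment):

```python
# Toy goal misgeneralization: the learned proxy ("always go right")
# matches the true goal ("reach the coin") on every training level,
# then fails the moment the coin moves somewhere new.
def proxy_policy(world_width, coin_pos):
    """Learned proxy: walk to the rightmost cell, ignoring the coin entirely."""
    return world_width - 1

def true_goal_reached(agent_pos, coin_pos):
    return agent_pos == coin_pos

# Training distribution: coin is always at the right edge. Proxy looks aligned.
train_ok  = true_goal_reached(proxy_policy(10, 9), 9)   # True
# Deployment: coin in the middle. The proxy confidently does the wrong thing.
deploy_ok = true_goal_reached(proxy_policy(10, 4), 4)   # False
print(train_ok, deploy_ok)
```

The unsettling part: nothing in the training signal could distinguish the proxy from the real goal, so no amount of training-set performance would have caught this.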
RLHF (which you met in Chapter 4) is one alignment technique — humans rate AI responses and the model learns from that feedback. But it has limits: humans can be manipulated, fooled, or simply disagree on what "good" looks like.
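Under the hood, the reward model in RLHF is typically trained on pairs of responses, one preferred by a human rater and one rejected. A minimal sketch of that pairwise loss (a Bradley-Terry-style objective; the scores here are made up, and real systems compute this over batches with a neural network):

```python
import math

def preference_loss(score_preferred, score_rejected):
    """-log sigmoid(preferred - rejected): near zero when the reward model
    agrees with the human rater, large when it disagrees."""
    margin = score_preferred - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# Reward model already ranks the preferred response higher: small loss.
agree_loss = preference_loss(2.0, 0.0)
# Reward model ranks the rejected response higher: large loss.
disagree_loss = preference_loss(0.0, 2.0)
print(round(agree_loss, 3), round(disagree_loss, 3))
```

Notice what this loss optimizes: agreement with the raters' comparisons, nothing more. That is precisely why the limits above bite — if the raters can be fooled, or disagree with each other, the reward model faithfully learns those flaws.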
Once we have alignment goals, how do we actually implement them? Modern AI safety uses a stack of techniques, on both the technical side and the governance side.
On the governance side, the EU's AI Act (2024) was the world's first major AI regulation, classifying AI systems by risk and imposing requirements on high-risk uses like hiring, credit scoring, and critical infrastructure.
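The Act's core mechanism is a four-tier risk classification, from prohibited practices down to minimal-risk systems. A sketch of that idea as a lookup table (the tier names follow the Act's scheme; the specific use-case assignments and obligation summaries below are illustrative simplifications, not legal advice):

```python
# Sketch of the EU AI Act's risk-tier structure as a lookup table.
# Assignments are illustrative; the real Act defines these in detail.
RISK_TIERS = {
    "social_scoring":          "unacceptable",  # banned outright
    "hiring":                  "high",
    "credit_scoring":          "high",
    "critical_infrastructure": "high",
    "chatbot":                 "limited",       # transparency duties
    "spam_filter":             "minimal",
}

OBLIGATIONS = {
    "unacceptable": "prohibited",
    "high": "risk management, documentation, human oversight",
    "limited": "disclose that users are interacting with AI",
    "minimal": "no specific obligations",
}

def obligations(use_case):
    tier = RISK_TIERS.get(use_case)
    return OBLIGATIONS.get(tier, "needs a risk assessment")

print(obligations("hiring"))
print(obligations("social_scoring"))
```

The design choice worth noticing: regulation attaches to the *use case*, not to the underlying model, which is why the same technology can be minimal-risk in one deployment and high-risk in another.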
What does "AI alignment" primarily refer to?