AI learns from human-generated data. And humans are biased. So — spoiler — AI can be biased too. AI bias happens when a model produces systematically unfair results for certain groups of people.
This isn't theoretical. Real examples have included:
- Facial recognition systems with much higher error rates for darker-skinned faces
- Hiring algorithms that downranked women because training data reflected historical hiring bias
- Loan approval AI that denied credit more often in certain zip codes (proxy for race)
- Medical AI trained primarily on male patients performing worse for women
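A common thread in these examples is that bias shows up as a measurable gap: the model's error rate differs across groups. A minimal sketch of that kind of audit, using invented toy data:

```python
# Toy bias audit: compare a model's error rate across demographic groups.
# All data here is invented purely for illustration.
predictions = [1, 0, 1, 1, 0, 1, 0, 0]   # model outputs
labels      = [1, 0, 0, 1, 1, 1, 0, 1]   # ground truth
groups      = ["A", "A", "A", "A", "B", "B", "B", "B"]

def error_rate_by_group(preds, labels, groups):
    errors, totals = {}, {}
    for p, y, g in zip(preds, labels, groups):
        totals[g] = totals.get(g, 0) + 1
        errors[g] = errors.get(g, 0) + (p != y)
    return {g: errors[g] / totals[g] for g in totals}

rates = error_rate_by_group(predictions, labels, groups)
print(rates)  # a large gap between groups is a red flag
```

On this toy data, group B's error rate is double group A's. Real audits are more involved (statistical significance, intersectional groups), but the core move is exactly this: slice the evaluation by group instead of averaging everyone together.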
Here's the tricky part: "fair" is not one thing. Researchers have defined dozens of fairness metrics, and several of them are provably impossible to satisfy simultaneously except in special cases (for example, when every group has the same base rate of the outcome being predicted).
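To make that concrete, here is a sketch of two common metrics colliding on the same (invented) predictions. Demographic parity asks for equal positive-prediction rates across groups; equal opportunity asks for equal true positive rates. When the groups have different base rates, a classifier can satisfy one but not the other:

```python
# Two fairness metrics on the same toy predictions (data invented for illustration).
# Demographic parity: P(pred=1 | group) equal across groups.
# Equal opportunity:  P(pred=1 | group, label=1) equal across groups.
def positive_rate(preds):
    return sum(preds) / len(preds)

def true_positive_rate(preds, labels):
    positives = [p for p, y in zip(preds, labels) if y == 1]
    return sum(positives) / len(positives)

# Group A: 3 of 4 people qualified. Group B: 1 of 4 qualified.
preds_a, labels_a = [1, 1, 1, 0], [1, 1, 1, 0]
preds_b, labels_b = [1, 0, 0, 0], [1, 0, 0, 0]

# Equal opportunity holds: every qualified person is approved in both groups.
print(true_positive_rate(preds_a, labels_a),
      true_positive_rate(preds_b, labels_b))   # 1.0 1.0
# But demographic parity fails: approval rates are 0.75 vs 0.25.
print(positive_rate(preds_a), positive_rate(preds_b))
```

The classifier here is perfectly accurate and treats every qualified individual identically, yet one group gets approved three times as often. Which metric "should" win is exactly the kind of value judgment the next paragraph is about.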
This means AI fairness isn't a purely technical problem. It's a social and political one. Engineers have to make choices about which type of fairness matters most for their specific use case — and that requires involving affected communities.
Alignment is the problem of making sure AI systems actually do what we want — not just what we literally asked for. This sounds easy. It's one of the hardest open problems in AI.
The misalignment problem gets worse as AI gets more capable. One especially worrying failure mode is goal misgeneralization: the AI learns a proxy goal that happens to work during training, then keeps pursuing that proxy in the real world, where it can diverge dangerously from what we actually wanted.
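A toy illustration of the idea, in the spirit of a well-known reinforcement-learning example where an agent trained to collect a coin that always appeared at the right edge of the level learned "go right" instead of "get the coin" (the setup below is a made-up sketch, not that experiment):

```python
# Toy goal misgeneralization: the learned proxy ("always go right")
# matches the true goal ("reach the coin") on every training level,
# then fails the moment the coin moves somewhere new.
def proxy_policy(world_width, coin_pos):
    """Learned proxy: walk to the rightmost cell, ignoring the coin entirely."""
    return world_width - 1

def true_goal_reached(agent_pos, coin_pos):
    return agent_pos == coin_pos

# Training distribution: coin is always at the right edge. Proxy looks aligned.
train_ok  = true_goal_reached(proxy_policy(10, 9), 9)   # True
# Deployment: coin in the middle. The proxy confidently does the wrong thing.
deploy_ok = true_goal_reached(proxy_policy(10, 4), 4)   # False
print(train_ok, deploy_ok)
```

The unsettling part: nothing in the training signal could distinguish the proxy from the real goal, so no amount of training-set performance would have caught this.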
RLHF (which you met in Chapter 4) is one alignment technique — humans rate AI responses and the model learns from that feedback. But it has limits: humans can be manipulated, fooled, or simply disagree on what "good" looks like.
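Under the hood, the reward model in RLHF is typically trained on pairs of responses, one preferred by a human rater and one rejected. A minimal sketch of that pairwise loss (a Bradley-Terry-style objective; the scores here are made up, and real systems compute this over batches with a neural network):

```python
import math

def preference_loss(score_preferred, score_rejected):
    """-log sigmoid(preferred - rejected): near zero when the reward model
    agrees with the human rater, large when it disagrees."""
    margin = score_preferred - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# Reward model already ranks the preferred response higher: small loss.
agree_loss = preference_loss(2.0, 0.0)
# Reward model ranks the rejected response higher: large loss.
disagree_loss = preference_loss(0.0, 2.0)
print(round(agree_loss, 3), round(disagree_loss, 3))
```

Notice what this loss optimizes: agreement with the raters' comparisons, nothing more. That is precisely why the limits above bite — if the raters can be fooled, or disagree with each other, the reward model faithfully learns those flaws.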
Once we have alignment goals, how do we actually implement them? Modern AI safety uses a stack of techniques, on both the technical side and the governance side.
On the governance side, the EU's AI Act (2024) was the world's first major AI regulation, classifying AI systems by risk and imposing requirements on high-risk uses like hiring, credit scoring, and critical infrastructure.
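The Act's core mechanism is a four-tier risk classification, from prohibited practices down to minimal-risk systems. A sketch of that idea as a lookup table (the tier names follow the Act's scheme; the specific use-case assignments and obligation summaries below are illustrative simplifications, not legal advice):

```python
# Sketch of the EU AI Act's risk-tier structure as a lookup table.
# Assignments are illustrative; the real Act defines these in detail.
RISK_TIERS = {
    "social_scoring":          "unacceptable",  # banned outright
    "hiring":                  "high",
    "credit_scoring":          "high",
    "critical_infrastructure": "high",
    "chatbot":                 "limited",       # transparency duties
    "spam_filter":             "minimal",
}

OBLIGATIONS = {
    "unacceptable": "prohibited",
    "high": "risk management, documentation, human oversight",
    "limited": "disclose that users are interacting with AI",
    "minimal": "no specific obligations",
}

def obligations(use_case):
    tier = RISK_TIERS.get(use_case)
    return OBLIGATIONS.get(tier, "needs a risk assessment")

print(obligations("hiring"))
print(obligations("social_scoring"))
```

The design choice worth noticing: regulation attaches to the *use case*, not to the underlying model, which is why the same technology can be minimal-risk in one deployment and high-risk in another.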
What does "AI alignment" primarily refer to?