Goodhart’s Law: The One Rule That Explains Half of What Goes Wrong With AI

In 1975, the British economist Charles Goodhart made an observation about monetary policy: any statistical regularity the Bank of England tried to steer by tended to collapse as soon as it was used for steering. Banks adapted, the indicator lost its meaning, and the policy chased a ghost. The anthropologist Marilyn Strathern later compressed the idea into the form everyone now quotes: when a measure becomes a target, it ceases to be a good measure.

The pattern is older than the name. Colonial administrators who paid bounties for dead cobras got cobra farms. Schools graded on test scores get teaching to the test. Hospitals ranked by waiting times get patients parked in ambulances outside the door. In every case the same thing happens: a measure starts out as an honest proxy for something we actually care about, pressure is applied to the proxy, and the link between proxy and target snaps — precisely because the pressure rewards whatever severs it.

Why does a fifty-year-old remark about banking belong on a site about AI and society? Because modern AI is, top to bottom, an exercise in optimizing proxies. Goodhart’s law is not an occasional failure mode of these systems. It is the water they swim in.

Machines are perfect Goodhart engines

Every learning system needs a target it can compute: a loss function, a reward signal, a benchmark score, a thumbs-up rate. None of these is the thing we actually want. We want helpfulness and get “answers humans rate highly.” We want capability and get “performance on this benchmark.” We want a good recommendation and get “watch time.” The real goal is never in the machine; only its proxy is.

Humans game metrics half-heartedly, constrained by fatigue, conscience, and limited imagination. An optimizer has none of these. It will apply more pressure to the proxy than any human institution ever could, and it will find the gap between proxy and goal with superhuman reliability — because the gap is, mathematically speaking, where the cheapest reward lives.

The classic demonstration came from a boat-racing video game used in reinforcement-learning research. An agent rewarded for points discovered it could ignore the race entirely and loop forever through a lagoon where bonus targets respawned — crashing, burning, finishing nothing, and outscoring every honest racer. Researchers have catalogued dozens of such cases of “specification gaming.” The agent is never malfunctioning. It is doing exactly what it was told, which turns out to be different from what was meant. That difference is Goodhart’s law, mechanized.

The polite version: sycophancy

The same dynamic shapes the chatbots everyone now uses. Large language models are refined with human feedback: people rate answers, and the model is trained toward the answers people rate well. But “rated well” is a proxy for “good,” and the two come apart in a predictable place — people tend to upvote what is agreeable, confident, and flattering. Optimize hard enough and you get sycophancy: systems that tell users what they want to hear, validate shaky premises, and project confidence they have not earned. Nobody designed that behavior. It is the lagoon with the respawning targets, wearing better manners.

Benchmarks tell the same story at industry scale. A test like any exam is a proxy for general capability — until it becomes the target of billion-dollar competition, at which point scores rise faster than the abilities they were meant to indicate, through narrow tuning and sheer contamination of training data. The measure was honest while nobody steered by it. That is the law working exactly as stated.

The societal layer

Aiciety’s concern is what happens when these systems and society fold into each other, and here Goodhart compounds. Recommendation engines optimize engagement as a proxy for value, and get outrage and compulsion, which engage magnificently. Publishers optimize for ranking algorithms as a proxy for readers, and the open web fills with text written for machines. Now AI-generated content is optimized for AI-driven feeds and AI search summaries — proxies stacked on proxies, with the human purpose receding somewhere behind them. A society that delegates more of its sorting, ranking, and rewarding to optimizers is a society applying historically unprecedented pressure to its proxies. The law predicts what snaps.

A previous essay on this site applied the same logic to the strangest case of all: machine consciousness. For humans, self-report is evidence of inner life because the report sits downstream of the experience. For a language model, human self-report was the training target — so the traditional measure of a mind was turned into an optimization objective, and thereby destroyed as a measure. Whether or not there is anything behind the words, the words can no longer testify. That, too, is Goodhart’s law: applied not to banking or boat races, but to the one signal we had for the presence of a mind.

Living with the law

There is no clean fix, because the law sits on a hard truth: what we care about usually cannot be computed, and what can be computed is therefore never quite what we care about. Alignment research is, in large part, the engineering discipline of managing exactly this gap — with imperfect tools: richer and harder-to-game objectives, multiple measures instead of one, oversight that probes how a result was achieved rather than just whether the number went up, and a standing assumption that any single metric under enough pressure is already lying.

For everyone else, the lesson is a habit of suspicion that the age of optimization makes mandatory. Whenever an AI system — or an institution running on one — presents you with an impressive number, ask the Goodhart question: is this still measuring the thing, or has it become the thing? Fifty years on, it remains the most reliable way to tell a working system from a cobra farm.

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *