AI is getting worse: here’s why

AI outputs are degrading in real time. Day by day, month by month, they’re getting more “blurry,” the way photocopies of photocopies get progressively less sharp.

AI models are increasingly trained on synthetic data generated by other AI models, because the human-generated data they’ve been trained on--the “clean fuel” of books, court records, scientific journals, etc.--is just about used up.

Which is fine as long as the training domain is a "hard" domain like coding or math, where there's always a right answer. As the supply of human-generated data dwindles, AI labs prompt existing models to output synthetic data, a "referee" machine verifies those outputs, and only the verified examples get fed into the new model.
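Here's a minimal sketch of that hard-domain loop. Everything in it is hypothetical stand-in code, not any lab's actual pipeline; the point is just that the referee can check answers objectively because there is exactly one right answer to check against.

```python
import random

def generate_synthetic_example() -> dict:
    """Stand-in for prompting a model: an addition problem plus a claimed answer."""
    a, b = random.randint(1, 99), random.randint(1, 99)
    # The "model" is right most of the time, but not always.
    claimed = a + b if random.random() < 0.8 else a + b + random.choice([-3, -1, 1, 3])
    return {"a": a, "b": b, "prompt": f"What is {a} + {b}?", "claimed_answer": claimed}

def hard_referee(example: dict) -> bool:
    """Hard-domain grading: verify the answer exactly. No judgment call involved."""
    return example["claimed_answer"] == example["a"] + example["b"]

# Only verified examples make it into the next model's training set.
synthetic = [generate_synthetic_example() for _ in range(1000)]
training_set = [ex for ex in synthetic if hard_referee(ex)]
print(f"{len(training_set)} of {len(synthetic)} synthetic examples passed verification")
```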

But in a "soft" domain like medicine, law or business, it's not so simple. As human-generated data runs dry, AI labs can and do spin up synthetic data, but this is where the trouble starts. The referee can't grade this soft data right or wrong, so it has to grade based on plausibility: does this sound like a good lawyer? It validates the data that sounds most confident and standard, and discards anything outside the mainstream.
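A rough sketch of a plausibility-only referee (again, hypothetical code, not any real system): with no answer key, all it can do is keep whatever resembles the majority answer.

```python
from collections import Counter

def soft_referee(candidates: list[str], keep_fraction: float = 0.9) -> list[str]:
    """No answer key, so grade by typicality: keep whatever 'sounds standard'."""
    counts = Counter(candidates)
    ranked = sorted(candidates, key=lambda c: counts[c], reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

# 1,000 synthetic "legal takes": one phrasing dominates, a few dissent.
candidates = ["standard boilerplate advice"] * 960 + ["unusual but correct edge-case advice"] * 40
kept = soft_referee(candidates)
print(Counter(kept))  # the edge-case advice is gone: it was never wrong, just rare
```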

And then what? A paper published in Nature has the answer. When AI models are trained, generation after generation, on soft data made by other machines rather than by humans, collapse sets in. With each generation, information is lost.

Specifically? To quote the Nature paper, "The tails of information die first." The probable, the common, the average get verified, because LLMs are designed to find the center of a data set. If 900 synthetic articles say the sky is "light blue" and 10 human articles say it's "cerulean," the machine treats "cerulean" as an error or outlier and discards it.

These tails--edge cases, non-mainstream viewpoints, counterintuitive takes--are weeded out because they don’t chime with the large volume of synthetic data produced for the training process.
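A toy simulation makes the point. This is illustrative only, not the Nature paper's actual experiment: each generation samples a new corpus from the previous one, a plausibility cutoff drops anything too rare to sound standard, and the rare-but-correct phrasing vanishes and never comes back.

```python
import random
from collections import Counter

corpus = ["light blue"] * 900 + ["cerulean"] * 10   # generation 0: human-written data
PLAUSIBILITY_CUTOFF = 0.02                           # drop any phrasing under 2% frequency

for generation in range(1, 4):
    counts = Counter(corpus)
    # The "referee" keeps only phrasings common enough to sound standard.
    kept = {p: c for p, c in counts.items() if c / len(corpus) >= PLAUSIBILITY_CUTOFF}
    # The next model "generates" a new corpus by sampling from what survived.
    phrases, weights = zip(*kept.items())
    corpus = random.choices(list(phrases), weights=list(weights), k=len(corpus))
    print(f"generation {generation}: {dict(Counter(corpus))}")
```

"Cerulean" is gone after the first pass, and since later generations only ever see what earlier generations produced, nothing can bring it back.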

When this goes on long enough, the model degrades. It's still good for math and coding, and it can still answer straightforward questions correctly. But ask it a "human" question that requires niche, non-mainstream knowledge and the model just makes shit up (to use the technical term).

This is why AI models continue to score poorly on PersonQA assessments, which test a model's ability to retrieve and reason about facts when asked about real people. If asked, "What did Jamie Dimon say about the market long-term?" the AI might mash up 1,000 plausible synthetic versions of what he could have said rather than retrieve what he actually said.

But AI companies are figuring it out, right? No. The data shows that, as models evolve, their grip on the truth is slipping. OpenAI's own PersonQA testing, its benchmark for facts about real people, shows a startling decline.

- o1 was wrong 16% of the time.
- o3 was wrong 33% of the time.
- o4-mini was wrong 48% of the time.

This is why your brand cannot take raw AI and put it out into the world with your name on it. Because you could be dead wrong in public.
