
Subtitle: Why undisclosed synthetic feedback loops are a structural safety risk for large AI models
Frontier AI systems are increasingly trained and fine-tuned on synthetic data—model-generated text, reasoning traces, and dialogues produced by earlier models rather than collected from the world.
Used carefully and in small doses, synthetic data can be useful.
Used heavily and recursively, it becomes something else: a hall of illusions.
In such a hall, models are no longer primarily learning from reality. They are learning from their own reflections.
In a recent preprint, “The Hall of Illusions: How Heavy Synthetic Data Training Erodes Real-World Performance” (Lei, 2025, https://doi.org/10.5281/zenodo.17782033), I formalize this risk and demonstrate it in simple, fully reproducible experiments. In a controlled feedback loop, models trained repeatedly on mixtures of real and synthetic data become increasingly confident and increasingly wrong on held-out real data, especially in the long tail of rare events and edge cases.
The lesson is not exotic. It is the same geometry engineers already know from mirror cavities in optics and feedback loops in control systems: when you feed a system too much of its own output and too little fresh input, it will eventually lose contact with the external signal.
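To make that geometry concrete, here is a back-of-the-envelope provenance sketch (an illustration with assumed numbers, not a result from the preprint): suppose each training generation mixes a fraction r of fresh real data with a fraction 1 - r of the previous generation's output, and that the synthetic portion inherits the previous mix proportionally. Then roughly (1 - r)^m of the training signal has already been reflected through a model at least m times.

```python
# Back-of-the-envelope provenance accounting (assumed numbers, illustration
# only): if each generation trains on a fraction r of fresh real data and
# 1 - r of the previous generation's model output, then under this
# simplification roughly (1 - r) ** m of the training signal has passed
# through a model at least m times.
for r in (0.5, 0.2, 0.1):
    shares = ", ".join(f">={m} reflections: {(1 - r) ** m:.0%}" for m in range(1, 6))
    print(f"fresh real fraction r={r}: {shares}")
```

Even a seemingly healthy 20% of fresh data per generation leaves roughly half of the signal at three or more reflections.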
Today, large AI labs are building increasingly powerful systems while providing almost no public visibility into how much of their training and fine-tuning data is synthetic, how their synthetic feedback loops are structured, or whether they test for collapse on held-out real data.
From the outside, we cannot audit their feedback loops.
We can only see the structure, and we know how such structures behave.
⸻
What this letter is—and is not—saying
This letter does not claim that all synthetic data is harmful.
It does not call for a ban on self-training or distillation.
It does say that heavy, recursive, and undisclosed use of synthetic data is a structural risk, and that this risk should be measured and disclosed wherever synthetic data is used at scale.
This is not just a research aesthetics question.
It is a safety and governance question.
⸻
Our requests to AI labs and developers
We call on organizations that train and deploy large models—especially so-called frontier labs—to adopt the following minimum standards when using synthetic data at scale:
For any major training or fine-tuning stage, report an approximate breakdown of real versus synthetic tokens (for example, “pretraining: ≤10% synthetic; post-training: 60–80% synthetic”). Exact recipes are not required; order-of-magnitude clarity is.
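As an illustration of how lightweight such a disclosure could be (the record format below is hypothetical, not something any lab publishes today), it amounts to a per-stage count of real versus synthetic tokens:

```python
# Hypothetical disclosure record (illustrative format, not an existing standard):
# per training stage, approximate real and synthetic token counts and the
# resulting synthetic fraction, reported to order-of-magnitude precision.
from dataclasses import dataclass

@dataclass
class StageDisclosure:
    stage: str
    real_tokens: float        # approximate count, e.g. 9e12
    synthetic_tokens: float   # approximate count, e.g. 1e12

    def synthetic_fraction(self) -> float:
        return self.synthetic_tokens / (self.real_tokens + self.synthetic_tokens)

report = [
    StageDisclosure("pretraining", real_tokens=9e12, synthetic_tokens=1e12),
    StageDisclosure("post-training", real_tokens=2e9, synthetic_tokens=6e9),
]
for s in report:
    # prints one line per stage, e.g. "post-training: ~75% synthetic"
    print(f"{s.stage}: ~{s.synthetic_fraction():.0%} synthetic")
```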
Before relying on heavy synthetic data, run simple but explicit experiments where a model is retrained on mixtures of real and synthetic data across several generations, and track performance on held-out real-only test sets. Publish the setup and results, even if they are small-scale, so others can see whether collapse appears.
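A minimal version of such a collapse test, using a toy categorical “world” as a stand-in for a real training pipeline (all names and parameters below are illustrative assumptions, not the preprint's code), might look like this:

```python
# Toy collapse test (illustrative stand-in for a real training pipeline):
# the "world" is a Zipf-like categorical distribution; each generation the
# "model" is refit by counting a sample drawn partly from the world and
# partly from the previous model, then scored on held-out real data.
import numpy as np

rng = np.random.default_rng(0)
K = 200                                      # vocabulary of events
p_true = 1.0 / np.arange(1, K + 1)           # Zipf-like, heavy-tailed world
p_true /= p_true.sum()
real_test = rng.choice(K, 50_000, p=p_true)  # held-out real-only test set

n_train, synthetic_fraction = 2_000, 0.9
p_model = np.bincount(rng.choice(K, n_train, p=p_true), minlength=K) / n_train

for gen in range(10):
    n_real = int((1 - synthetic_fraction) * n_train)
    sample = np.concatenate([
        rng.choice(K, n_real, p=p_true),             # fresh real data
        rng.choice(K, n_train - n_real, p=p_model),  # model's own output
    ])
    p_model = np.bincount(sample, minlength=K) / n_train    # "retrain"
    # Cross-entropy on held-out real data (small epsilon to avoid log 0),
    # plus the number of real-world events the model now assigns zero mass.
    ce = -np.log(p_model[real_test] + 1e-9).mean()
    lost = int((p_model == 0).sum())
    print(f"gen {gen}: held-out cross-entropy={ce:.2f}, events with zero mass={lost}")
```

In runs like this, the held-out cross-entropy tends to drift upward while more and more rare events receive exactly zero probability: the model grows more certain about a narrower world. A real collapse test would do the same bookkeeping with an actual model, an actual generation pipeline, and real-only held-out data.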
Keep dedicated test sets that are built from real-world data only and held out from every training and fine-tuning stage.
Use these to monitor whether performance on reality is drifting while synthetic-like benchmarks stay flat or improve.
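One possible shape for that monitoring (the function name, scores, and tolerance below are all hypothetical) is a plain baseline-versus-checkpoint comparison on frozen, real-only test sets:

```python
# Hypothetical drift monitor (names and thresholds are illustrative): compare
# each new checkpoint's score on frozen real-only test sets against a pinned
# baseline and flag the run when the regression exceeds a tolerance.
from typing import Dict

def real_world_drift(baseline: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.02) -> Dict[str, float]:
    """Return per-test-set score regressions larger than `tolerance`.

    Scores are assumed to be 'higher is better' accuracies on held-out,
    real-only test sets that never feed back into training or generation.
    """
    return {name: baseline[name] - current[name]
            for name in baseline
            if baseline[name] - current[name] > tolerance}

baseline_scores = {"rare_entities": 0.71, "long_tail_qa": 0.64, "fresh_news": 0.58}
checkpoint_scores = {"rare_entities": 0.66, "long_tail_qa": 0.63, "fresh_news": 0.52}

flags = real_world_drift(baseline_scores, checkpoint_scores)
if flags:
    print("Real-world drift detected:", flags)  # flags rare_entities and fresh_news
```

The specific threshold matters less than the design choice: drift on reality is measured against a pinned baseline on real-only data, not against benchmarks that may themselves be synthetic-flavored.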
Safety, alignment, and governance teams should treat heavy synthetic training as a risk parameter, not a mere optimization detail. They should have the authority to block or modify training runs whose synthetic feedback loops are opaque, unmeasured, or clearly collapsing on real-world tests.
Provide enough information—synthetic fractions, high-level training structure, and collapse-test summaries—that external researchers and regulators can meaningfully evaluate the risk of synthetic collapse. A hall of illusions is most dangerous when no one outside the cavity can see where the light is coming from.
⸻
Why this matters now
The world is being asked to rely on AI systems for search, assistance, education, creativity, and in some cases, decision support in high-stakes domains. Many of these systems are already being nudged toward synthetic-heavy regimes because it is cheaper and faster than collecting more real data.
If we do nothing, we risk drifting into a future where major models are increasingly confident, increasingly detached from the long tail of reality, and increasingly trained inside feedback loops that no one outside the labs can audit.
The good news is that this risk is measurable and controllable—if labs are willing to measure it and to be honest about what they find.
Toy experiments, like those in Lei (2025), exist because the companies building the largest halls of illusions have chosen not to show their own tests. That choice can be reversed.
⸻
Call to sign
We invite researchers, engineers, policymakers, and users of AI systems to sign this letter and to ask, at minimum, for transparency and basic collapse testing wherever synthetic data is used at scale.
The question is not whether synthetic data is clever or convenient.
The question is whether we are willing to accept undocumented halls of illusions at the core of the systems we increasingly depend on.
Initial signatory:
Lei, Yu
Author of “The Hall of Illusions: How Heavy Synthetic Data Training Erodes Real-World Performance”