A controversial research paper is throwing cold water on the AI industry's agent dreams. Published in mid-2025, "Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models" claims to mathematically prove that large language models can't reliably handle complex computational and agentic tasks. But as Google, OpenAI, and dozens of startups pour billions into agentic AI, they're betting the math is wrong - or at least incomplete.
The big AI companies promised 2025 would be "the year of the AI agents." It turned out to be the year of talking about AI agents. Now a research paper is suggesting the wait might be permanent.
Published without fanfare during the height of agent hype, "Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models" delivers a mathematical gut punch to the agentic AI vision. The paper, authored by former SAP CTO Vishal Sikka and his teenage prodigy son, claims to prove that LLMs are fundamentally incapable of carrying out computational and agentic tasks beyond a certain complexity. Even reasoning models that go beyond pure word prediction won't fix the problem, according to their analysis.
"There is no way they can be reliable," Sikka told Wired in a recent interview. The researcher, who studied under AI pioneer John McCarthy before his career at SAP, Infosys, and Oracle, now runs AI services startup Vianai. His verdict on agents running critical systems like nuclear power plants? Forget it. You might get one to file some papers and save time, but mistakes are inevitable.
The timing couldn't be more awkward for an industry that's bet its future on autonomous AI systems. Speaking at Davos this week, Google's Demis Hassabis reported breakthroughs in minimizing hallucinations, while hyperscalers and startups race to ship agent products. But the mathematical critique has support from an unlikely source - OpenAI itself.
In a paper published last September, OpenAI scientists wrote that "despite significant progress, hallucinations continue to plague the field, and are still present in the latest models." They illustrated the point by asking three models, including ChatGPT, to provide the title of the lead author's dissertation. All three made up fake titles, and all three misreported the publication year. In a blog post about the research, OpenAI glumly stated that in AI models, "accuracy will never reach 100 percent."
The reliability problem is already killing agent adoption in enterprise. "The value has not been delivered," says Himanshu Tyagi, cofounder of open source AI company Sentient. He points out that dealing with hallucinations can disrupt entire workflows, negating much of an agent's value. It's a chicken-and-egg problem - companies won't deploy agents at scale until they're reliable, but the technology can't improve without real-world deployment.
Now a startup called Harmonic claims to have cracked part of the puzzle. Cofounded by Robinhood CEO Vlad Tenev and Stanford-trained mathematician Tudor Achim, Harmonic just reported a breakthrough in AI coding that tops benchmarks on reliability. Their secret? Using formal methods of mathematical reasoning to verify LLM outputs by encoding them in the Lean programming language, which is known for its verification capabilities.
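To give a sense of what "encoding in Lean" means, here is a toy sketch in Lean 4. It is not Harmonic's code and says nothing about how Aristotle works internally; the theorem name and the claims are illustrative assumptions. The idea is simply that a statement only compiles if its proof checks, so a wrong claim is rejected by the checker rather than by a human reviewer.

```lean
-- A general claim, stated as a theorem. Lean accepts it only because
-- the supplied proof (a core library lemma) actually establishes it.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete arithmetic claim, checked by evaluation.
example : 17 * 19 = 323 := by decide

-- A hallucinated variant would fail to compile, so it never gets through:
-- example : 17 * 19 = 324 := by decide
```

The catch, as the article notes, is that this only works for claims that can be stated formally in the first place.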
"Are we doomed to be in a world where AI just generates slop and humans can't really check it? That would be a crazy world," Achim told Wired. Their product, called Aristotle (points for humility), focuses on "mathematical superintelligence" and coding - domains where verification is possible. Things like history essays remain beyond its boundaries. For now.
Achim doesn't buy the doom-and-gloom narrative. "I would say that most models at this point have the level of pure intelligence required to reason through booking a travel itinerary," he argues. His bigger claim? That hallucinations aren't just unavoidable - they're necessary. "I think hallucinations are intrinsic to LLMs and also necessary for going beyond human intelligence," Achim says. "The way that systems learn is by hallucinating something. It's often wrong, but sometimes it's something that no human has ever thought before."
Even Sikka, the mathematical skeptic, acknowledges that workarounds exist. "Our paper is saying that a pure LLM has this inherent limitation - but at the same time it's true that you can build components around LLMs that overcome those limitations," he admits. The industry's bet is that guardrails, verification systems, and hybrid architectures can filter out the "imaginative bullshit" that LLMs love to produce.
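What "components around LLMs" might look like in the simplest case can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not anyone's production design: `stub_model` is a hypothetical stand-in for an LLM call, and `factors_15` is a hypothetical checker for a toy task where answers can be verified mechanically. The wrapper accepts a candidate only if the checker passes, and gives up rather than returning an unverified answer.

```python
from typing import Callable, Optional, Tuple

Pair = Tuple[int, int]


def verified_answer(
    generate: Callable[[int], Pair],
    verify: Callable[[Pair], bool],
    max_attempts: int = 3,
) -> Optional[Pair]:
    """Return the first candidate that passes verification, else None."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate
    # Nothing passed the check: refuse rather than guess.
    return None


def stub_model(attempt: int) -> Pair:
    """Hypothetical stand-in for an LLM: wrong twice, then right."""
    guesses = [(3, 7), (5, 5), (3, 5)]
    return guesses[attempt % len(guesses)]


def factors_15(pair: Pair) -> bool:
    """Hypothetical checker: is this a nontrivial factorization of 15?"""
    a, b = pair
    return a > 1 and b > 1 and a * b == 15


print(verified_answer(stub_model, factors_15))
```

The reliability here comes from the checker, not the generator, which is why the approach is limited to tasks where such a check can be written.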
The philosophical debate extends beyond pure mathematics. Computer pioneer Alan Kay, a friend of Sikka's, suggests the argument is "posed well enough to get comments from real computational theorists" - reminiscent of his famous 1984 take on the Macintosh as "the first personal computer good enough to be criticized." But he thinks the mathematical question misses the bigger picture. Instead, Kay invokes Marshall McLuhan's "the medium is the message" dictum: Don't ask whether something is good or bad. Find out what's going on.
What's going on is a massive industry push toward cognitive automation, mathematical limitations be damned. Google, OpenAI, Anthropic, and dozens of well-funded startups have too much at stake to let theoretical concerns slow them down. AI coding agents already took off in 2025, showing that at least narrow agent use cases can work.
The resolution might be that both sides are right. Hallucinations will remain a permanent feature of LLM-based systems. Pure mathematical reliability is impossible. But the gap between the errors models produce and the errors guardrails catch will narrow year by year. Tasks that agents perform will always require some verification - and yes, disasters will happen when people get sloppy. But eventually, proponents argue, agents will match or surpass human reliability while being faster and cheaper.
The AI agent debate boils down to a tension between mathematical truth and economic inevitability. Sikka's paper argues what many suspected - that pure LLMs can't be perfectly reliable. OpenAI's own research concedes that accuracy will never reach 100 percent. But the industry isn't building pure LLMs anymore. It's building hybrid systems with verification layers, guardrails, and domain-specific architectures. Whether that's enough to overcome fundamental mathematical limitations remains an open question. What seems likely is that 2026 won't be "the year of the agent" either - but it'll be another year of more agents, incrementally better and more widely deployed. The massive automation of human cognitive activity is coming, mathematical proof or not. Whether that improves our work and lives is, as Alan Kay suggests, not a question mathematics can settle.