I’m fed up. Fed up with reading nonsense served lukewarm as if it were revelation. You’d think the trend of self-proclaimed experts would eventually run out of steam, but no, it’s thriving like never before, spreading like a viral algorithm. An army of new AI prophets has emerged from nowhere: yesterday they were still selling premium detergent or evangelizing PowerPoint emptiness; today they reinvent themselves as doctors of LLM reasoning, armed with three blog articles, two YouTube videos, and enough self-confidence to power a data center.
They burst onto the stage like PowerPoint rockstars: lapel mic adjusted to the millimeter, sleeves rolled up to project a fake “authentic” casualness, jeans more expensive than your rent but artfully faded to say “I’m just like you,” while knowing full well they’re not. Their calibrated smile gleams under the spotlights while their slides, dripping with hollow buzzwords, scroll by to the beat of an old top 40 hit.
And with an inspired tone, they present us with their latest digital darling, a pseudo-Einstein in silicon that supposedly spends its nights meditating on the metaphysics of some random guy between matrix multiplications.
To hear them tell it, GPT-5, endowed with an IQ “superior to Einstein’s” according to the most enthusiastic press releases, supposedly engages in bottomless reflections on the human condition, scribbling existential dissertations before dozing off, proud of its conceptual genius. Except no. An LLM no more wakes up with an intellectual career plan than my toaster wonders in the morning whether it should brown my bread in a Kantian or a Schopenhauerian fashion. Behind the flattering veneer, there is no intention, no intuition, no consciousness. Just machinery that chains words together. WORDS, PERIOD.
And the real problem isn’t just that these speeches embellish reality. It’s that they lie, plain and simple, and most often out of ignorance. They disguise a statistical engine as an inspired guru, dress up a probability calculator as an enlightened consciousness, and blur, whether deliberately or naively, the understanding of the very thing they claim to explain.
Result: an audience captivated by the illusion that thinking is just knowing how to stack phrases like aligning plastic blocks.
NOTE: The explanation that follows may seem a bit technical if you’re not familiar with the subject. So to avoid losing everyone along the way, I’ve divided this article into two parts.
- First, a short (and popularized) version that’s clear and jargon-free, summarizing the essential: GPT-5 (and its peers) do not reason in the proper sense of the term, despite what some embellished speeches claim.
- Then, for the curious, a more detailed version that dives deeper into the real limitations of these models, with concrete examples and research results.
Short (and popularized) version
To reason, for us, isn’t just chaining words that seem to “sound right.” It’s observing, reflecting, connecting what we know, what we’ve experienced, and what we feel to form an idea. It’s testing it, adjusting it if it doesn’t hold up, or dropping it if it doesn’t work. It’s also perceiving when something’s off, sensing that an element is missing or that a lead isn’t the right one.
A generative AI does none of this. It doesn’t see, doesn’t hear, doesn’t remember. It doesn’t understand what it produces. It simply predicts the most probable sequence of words based on a given context, relying on billions of examples it has analyzed. Its “reflex” isn’t to think, but to calculate. It’s a large-scale guessing game, without intention, without intuition, and without awareness of its own limitations.
Certain methods, like “Chain-of-Thought” (chaining reasoning steps), can give the impression that the machine follows reasoning comparable to ours. But research shows that, in some cases, multiplying steps degrades the quality of responses. Talking about “reasoning” in this context is therefore a misuse of language: it leads to confusing statistical calculation with real thought. And this shift isn’t trivial, because it shapes our perception of these systems and fuels expectations that don’t correspond to what they can actually accomplish.
The long version
In humans, reasoning means articulating a logical chain of ideas based on an internal model of the world. This model is built over time through experience, nourished by the memory of past events, enriched by observation, and often modulated by emotions. Neuroscience shows that this process simultaneously engages several brain networks: the prefrontal regions, which play a central role in planning and evaluation; the hippocampus, involved in memory and in relating pieces of information; and associative circuits that integrate perceptions, memories, and general knowledge. This network makes it possible to establish links between different situations, anticipate consequences, formulate hypotheses, and test them.
Reasoning also means knowing how to suspend judgment, recognize a contradiction, or reconsider an idea when the facts contradict it, operations that mobilize both conscious analysis and automatic error-detection mechanisms. It’s telling yourself: “If A implies B, and B is false, then A must be false too,” and understanding not only the formal validity of this logical chain but also its relevance in context. In other words, it’s being able to evaluate both the structure and the value of a conclusion before even stating it, integrating knowledge, reference points, and experiences that our brain’s biology keeps malleable and adaptive.
And in humans, this evaluation doesn’t occur in a vacuum: it relies on a sensory and social environment, where each perception, each interaction, can influence reasoning. We don’t just recognize patterns or repeat familiar sequences; we constantly update our hypotheses based on what’s happening around us, what we see, hear, feel. It’s this continuous loop between the world, our senses, and our judgment that distinguishes human reasoning from simple statistical pattern alignment.
As the authors of Thinking, Fast and Slow? On the Distinction Between Reasoning and Pattern Matching in LLMs point out, human thought involves the dynamic updating of our hypotheses as new information arrives. We constantly confront our mental representations with perceived reality and with feedback from our environment.
An LLM, on the other hand, possesses no internal representation of the world, no consciousness, no intention. Its functioning relies on a mathematical architecture that, from the provided context, estimates the most probable sequence of words according to statistical patterns learned from vast sets of text. In other words, where we reason in constant interaction with reality, it simply extends a statistically probable sequence.
There’s no recall of lived experiences, no sensory integration, no biological evaluation mechanism. The LLM doesn’t access the meaning of what it produces: it manipulates symbols (tokens) and probabilities, not lived concepts. Each generated word is the consequence of a calculation, not the result of understanding or reflection.
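To make this concrete, here is a deliberately naive sketch of the only operation actually performed at each step: turning a context into scores over a vocabulary, then picking a likely token. The vocabulary and the numbers below are invented for illustration; the hard-coded scores stand in for what a trained network would compute, and nothing here is a real model.

```python
import math

# Toy vocabulary and scores, invented for illustration only.
vocabulary = ["bread", "thinks", "toasts", "philosophy"]

def softmax(scores):
    # Turn raw scores (logits) into a probability distribution.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(logits):
    # In a real LLM, the logits come from the network given the context;
    # here they are hard-coded. The "decision" is just an argmax.
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return vocabulary[best], probs[best]

token, p = next_token([2.1, 0.3, 1.7, 0.2])
print(f"most probable continuation: '{token}' (p = {p:.2f})")
# No understanding, no intention anywhere in this loop: scores in, token out.
```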
And yet, observing certain responses makes it easy to understand where the illusion comes from. The way an LLM chains its sentences, the structured style it can adopt—all this easily gives the impression that it really “reasons.” This impression is reinforced when using specific techniques that lead it to detail its “reflection” steps. The best known bears the seductive name Chain-of-Thought.
The Chain-of-Thought misunderstanding
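Before looking at its limits, it’s worth recalling what the technique concretely amounts to. There is no extra reasoning module: a Chain-of-Thought prompt is simply a prompt that asks the model to write out intermediate steps before the answer. A minimal sketch, with a question and wording invented for illustration:

```python
question = "A train leaves at 9:40 and arrives at 11:05. How long is the trip?"

# Direct prompting: ask for the answer straight away.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-Thought prompting: the same question, plus an instruction to spell
# out intermediate steps. The model then continues with text that *looks like*
# step-by-step reasoning, because such text was frequent in its training data.
cot_prompt = f"{question}\nLet's think step by step, then give the final answer."
```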
Zhou et al.’s (2023) work shows that the length of a Chain-of-Thought directly influences performance, and not always positively. Too many intermediate steps mechanically increase the risk of errors: an approximation introduced early can propagate from one link to another until it distorts the final conclusion. Conversely, too short a path can neglect essential logical transitions or ignore information necessary for complete problem resolution.
The results thus draw a bell-shaped relationship: there is an optimal zone where the chain is detailed enough to illuminate the answer, but not so long that it becomes counterproductive. This equilibrium point isn’t the fruit of an adaptive strategy: the model doesn’t “know” to stop once the solution is found. It strictly follows the imposed structure, even if that leads it into unnecessary steps that can degrade the accuracy of the response.
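A back-of-the-envelope model (mine, not the paper’s) makes the shape of this curve intuitive: assume each intermediate step has a small, independent chance of introducing an error, and that a minimum number of steps is needed to cover the problem at all. The numbers below are assumptions chosen purely for illustration.

```python
P_ERROR_PER_STEP = 0.05   # assumed chance that any given step goes wrong
REQUIRED_STEPS = 4        # assumed minimum steps the problem actually needs

def chance_of_correct_chain(n_steps):
    if n_steps < REQUIRED_STEPS:
        return 0.0  # too short: an essential logical transition is missing
    return (1 - P_ERROR_PER_STEP) ** n_steps  # too long: small errors compound

for n in [2, 4, 8, 16, 32, 64]:
    print(f"{n:>2} steps -> P(correct) ~ {chance_of_correct_chain(n):.2f}")
# The probability peaks near the minimum useful length and then decays:
# roughly the bell-shaped trade-off the studies report.
```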
In humans, this length management is flexible, adaptive: we can decide to take a detour to explore a lead, go back if a point seems shaky, or interrupt the process if we sense we already have the right answer. An LLM remains trapped in the structure prescribed to it; it mechanically unfolds the steps, even if they lead straight into a dead end.
Moreover, as several works show, including The Unreasonable Effectiveness of Chain-of-Thought Reasoning and The Curse of CoT (2025), excessive lengthening of chains can degrade final accuracy. Each additional step increases the risk of propagating an error, and in certain contexts, such as in-context learning based on explicit patterns, this degradation becomes systematic. The authors of The Curse of CoT notably identify a “distance effect”: adding intermediate steps lengthens the context between the examples and the response, disrupting the model’s ability to exploit the demonstrations. They also highlight a striking duality: the explicit reasoning produced by CoT, often noisy, interferes with more robust implicit reasoning, to the point where combining the two deteriorates the final result.
Architectures that imitate reflection
Let’s take a concrete example: the Reflexion method, introduced in 2023 by Shinn, Cassano et al. On paper, you might think it’s an agent capable of learning from its mistakes. The scenario seems appealing: the AI accomplishes a task, evaluates its own work, identifies its weak points, then corrects itself in a new attempt.
The actual functioning is more prosaic, and especially more mechanical. Here’s the cycle:
- We ask an AI agent, generally based on a large language model, to execute a given task.
- After this first attempt, we ask it to produce a textual self-evaluation of its response, supposedly pointing out errors or limitations.
- This evaluation text is then summarized and kept in an external memory.
- During the next attempt, this summary is reinjected into the input prompt.
The model then generates a new response influenced by this additional information. But it doesn’t “reflect” on its previous response in the human sense of the term.
It keeps no internal trace of its errors, doesn’t review them with hindsight, and doesn’t develop conceptual understanding of what it did wrong. It simply reacts to the text provided to it, as it would for any other instruction, because it learned to imitate this kind of correction in its training data.
In practice, Reflexion is therefore an external control architecture: it’s the structure surrounding the model that orchestrates the response → critique → new response process. The AI is a component of this system, but the “improvement process” is entirely imposed from outside.
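A minimal sketch of that external orchestration, assuming a placeholder generate() function in place of a real model call (this is my illustration of the loop described above, not the authors’ code):

```python
def generate(prompt: str) -> str:
    # Placeholder for a real model call; returns canned text so the sketch runs.
    return f"(model output for: {prompt[:40]}...)"

def reflexion_style_loop(task: str, attempts: int = 3) -> str:
    memory = []                      # external memory, entirely outside the model
    answer = ""
    for _ in range(attempts):
        notes = "\n".join(memory)    # past critiques, reinjected into the prompt
        answer = generate(f"Task: {task}\nPrevious critiques:\n{notes}\nAnswer:")
        critique = generate(f"Point out weaknesses in this answer:\n{answer}")
        memory.append(critique)      # the "reflection" is stored here as plain text
    return answer

print(reflexion_style_loop("Write a function that reverses a string"))
# Every arrow in the answer -> critique -> new answer cycle is driven by this
# outer code; nothing inside the model retains or "learns from" the critique.
```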
To use a simple image, it’s like an actor who is handed, before replaying a scene, a sheet listing his previous errors. He might correct certain details in his performance, but not because he’s acquired awareness of his role—simply because he’s following precise instructions dictated by another.
What maintains the illusion is that, in fact, these techniques effectively improve results. The responses produced seem more relevant, more accomplished. And eventually, one could be convinced that such progress can only come from some form of authentic reflection. Yet between “generating more satisfactory output” and “reasoning” in the human sense, there’s a gap that no raw performance can bridge.
Added to this is another documented flaw: the unfaithfulness of the produced chains. A study (Chain-of-Thought Reasoning In The Wild Is Not Always Faithful, 2025) shows that models can settle on an answer based on an implicit bias, then construct, after the fact, an argument chain designed to justify that answer. This phenomenon, called post-hoc rationalization, is observed even without any deliberate bias in the prompt. The researchers also describe “unfaithful illogical shortcuts”: the model omits essential steps but gives the illusion of rigorous reasoning.
Why this linguistic abuse is a problem
So, you might ask, what’s the harm in saying that “AI reasons”? After all, it’s just a way of speaking. That’s precisely where the trap closes: through poorly chosen words, we imperceptibly modify how we perceive and use these systems.
- In education, presenting a language model as a “mind” capable of reasoning installs an illusion of understanding. Learners end up relying on its responses as intellectual authority, skipping the necessary steps to build their own reasoning. They then lose the opportunity to verify, doubt, confront ideas—which is the very essence of learning.
- Cognitively, we get used to delegating not just execution, but also the formulation of our own thoughts. It’s an insidious externalization: we let a statistical engine fill the gaps in our reflection, to the point of forgetting that the machine doesn’t “think” but merely guesses the most probable continuation. This habit ends up weakening our intellectual vigilance, like a muscle we no longer exercise.
- Ethically, words shape responsibilities. Saying that “an AI reasoned thus” amounts to attributing intention or judgment it doesn’t have. This blurs the lines: who should be accountable if the resulting decision causes harm? The designer? The user? The organization that chose to trust it? The more we attribute human qualities to the machine, the more we blur the sharing of human responsibilities.
Rather than talking about reasoning, let’s talk about reasoning simulation. Or, more honestly still, instruction-guided production. Because that’s what it’s about: text generation oriented by context, sometimes enriched by peripheral mechanisms—external memory, intermediate evaluation, reinjection of relevant elements—that frame the model’s behavior.
This vocabulary has less luster than the anthropomorphic metaphors propagated, with a certain aplomb, by many (pseudo) AI experts. That’s true. But it has the advantage of not misrepresenting reality. Using these precise words reminds us that we’re talking about a purely computational process: each output is only the result of a probability calculation performed from examples encountered during training. Nothing more. Nothing resembling autonomous reflection or any form of awareness.
Using precise terms, even if they’re less seductive, also helps keep our critical thinking alert. And even when we admit that these systems “reason” in the computational sense, we must keep certain structural limitations in mind. Recent theoretical results (Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers, 2025) show that, for certain tasks, the minimum length of a reasoning chain grows linearly with the size of the problem. In other words, even with an optimal CoT, there are fundamental constraints that set a ceiling on what these architectures can accomplish, and that ceiling comes at a computational cost.
This terminological and technical clarity prevents language from gradually opening the door to serious misunderstandings, where we would attribute intentions or judgments to the machine that are only reflections of our own projections. In other words, putting words in their proper place gives us the possibility to think lucidly about what these systems actually do, and what they will never do. It’s an indispensable prerequisite before deciding what we can, or cannot, entrust to them.
We live in an era where language, carefully calibrated by algorithms, can make a machine appear intelligent. But it’s just an appearance! An illusion of thought, maintained by text fluidity and apparent response coherence. And if we let ourselves be seduced by this appearance, we risk not only attributing faculties it doesn’t have, but also, imperceptibly, delegating our own.
This shift is insidious: it doesn’t happen in a day, but through accumulated small habits—accepting without verifying, repeating without understanding, relying on the tool rather than our own judgment.
So, the next time you hear that an AI “reasons,” ask yourself this simple but essential question: Is it the one thinking… or have I stopped doing so?
To go (even) further
For those who would like to dig a little deeper into the subject and have the courage to dive into the sometimes arid prose of academic publications, certain recent works offer valuable insight. Among them, three papers available on arXiv constitute a good starting point:
- The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
- Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
- Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers
Of course, there are others…