randombio.com | Science Dies in Unblogginess | Believe All Science | I Am the Science
Friday, August 22, 2025 | computer science
The Turing Test is worse than useless
Chatbots outperform humans on the Turing test. Tweaking the test until it gets the right result won't help
The Turing test has seeped into public consciousness as the definitive
way of deciding whether a machine is intelligent. But the invention of
chatbots has shown that it's fatally flawed.
The Turing test is a purely functional test. A tester gives questions to a machine and a human. There's no restriction on what's asked. The tester then decides, based on the answers, which one is a human.
This differs slightly from the original test by adding an authentic human as a control (which is essential in any such test) and by having more than one tester. In practice, there is a large panel of testers, often college students. This means the Turing test can give you a number instead of a subjective yes-or-no answer. Often the testers disagree with each other, so the Turing test is effectively a statistical test: if half the students guess wrong, the score is 50% and the machine is declared to be intelligent.
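The panel-based scoring described above can be sketched in a few lines. The 50% threshold and the panel of ten judges are illustrative assumptions for this sketch, not part of Turing's original formulation:

```python
# Sketch of panel-based Turing test scoring: each judge records whether
# they labeled the machine "human". The machine "passes" under the
# criterion above if judges misidentify it at least half the time.
def turing_score(verdicts):
    """verdicts: list of booleans, True if a judge called the machine human."""
    return sum(verdicts) / len(verdicts)

# Hypothetical panel of 10 student judges: 5 fooled, 5 not.
panel = [True] * 5 + [False] * 5
score = turing_score(panel)
print(score)         # 0.5
print(score >= 0.5)  # True: by this criterion the machine "passes"
```

Note that nothing in this tally refers to the machine at all; it is purely a statistic about the judges' behavior, which is the article's point.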
People like it because it's easy to understand (no math or computer science expertise is required), and because they have sympathy for Alan Turing. Turing, it is said, was gay and committed suicide. That is tragic. We may never know what was going on in his mind. But it doesn't change the fact that the test doesn't measure anything about the machine's internal state. It's essentially a test of a human's ability to be fooled—a reverse IQ test for humans. Some say a new and improved version that provides “richer, contextually structured testing environments” might make the test more robust.[1] But the problem is deeper than that. Tweaking the test until it gets the right result won't help because it doesn't measure anything; it is merely a statistical analysis of human behavior.

[Image: scene from I, Carrot]
While some programs like Eliza score only 20% on the Turing test, ChatGPT 4.5 often scores better than a human. In one experiment,[2] GPT-4.5 “passed” because it was judged to be a human 73% of the time, while the human was judged to be real only 27% of the time. Examining the actual results, we find that it was the test that failed: Respondents were not evaluating intelligence but “socio-emotional factors” such as “lack of knowledge” and spelling errors made by the human. Although the test still has many defenders, an influential paper by Ricardo Restrepo Echavarría [3] calls a spade a spade: GPT invariably “responded more intelligently” than the human; GPT is not intelligent; therefore the test has failed.
The Turing test assumes at least three things:
Humans are able to identify intelligence when they encounter it.
The questions asked are actually designed to measure intelligence.
The human in the test is intelligent in the same way as the machine.
For a test to be useful, we would have to know what intelligence is and then measure something. Normally when we measure something, we get a number (the “measurement”) and an error term that tells us how precise the measurement may be. In the Turing test, nothing about the AI is actually being measured: all we get is the error term. If a machine scores higher than a human, that may be very nice but it's still a failure of the test.
It's a conundrum: if we knew what intelligence really was, we would know whether a machine possesses it, in which case we wouldn't need a test.
Georgios Mappouras [4] summarizes some other well-known problems:
What line of questioning should the interrogator use?
How many questions are needed?
What specific behaviors should we examine?
How should the human test subject be selected? Should we pick a smart one or a dumb one?
In Searle's Chinese Room counterexample, the operator of the room can only read and write English. The operator has rules for selecting and manipulating Chinese characters in response to questions. Given a question in Chinese, the room replies correctly, despite the operator having no understanding of Chinese. The thought experiment was designed to show how a computer could appear intelligent while having no ability to understand anything. This, of course, is exactly how a chatbot works.
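The room's mechanism can be sketched as a lookup table. The question-and-answer pairs here are invented for illustration; the point is that the code produces correct-looking Chinese replies by pure symbol matching, with no model of what any symbol means:

```python
# Toy Chinese Room: rules map input symbol strings to output symbol
# strings. The program manipulates the symbols without any
# representation of their meaning.
RULES = {
    "你好吗？": "我很好，谢谢。",          # "How are you?" -> "I'm fine, thanks."
    "天空是什么颜色？": "天空是蓝色的。",  # "What color is the sky?" -> "The sky is blue."
}

def room_reply(question: str) -> str:
    # Pure pattern matching: find the rule, emit its output.
    return RULES.get(question, "对不起，我不明白。")  # "Sorry, I don't understand."

print(room_reply("你好吗？"))  # a fluent answer, zero understanding
```

A chatbot replaces the hand-written table with learned probabilities over token sequences, but the relationship between symbols and meaning is the same: there isn't one.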
Now, you might argue that those rules are what constitute understanding of a language. But language is more than a collection of probabilities and some syntax; we cannot decide whether something is true without understanding what the words mean, which is to say we need a model of the world.
Even if we changed the task and asked the machine to figure out the language from first principles, it would still need to know what properties the symbols represent before it could do any more than mindless pattern matching. The ability to decide truth is essentially the ability to model the world accurately. That is extremely hard even for humans.
Suppose we defined intelligence as the ability to generate abstract concepts and use them to solve novel problems. The question then becomes how we can know if the machine actually created an abstract concept and whether it is a meaningful one (which is to say, related to the fundamental properties in the patterns, as opposed to some accidental similarity like brightness). With a neural network architecture, it's nearly impossible to make that determination. All we have is a set of weights and matrices far too complicated to make sense of. We're reduced to examining functionality, where we ask questions and evaluate the machine's output. It's a short-cut that doesn't work because our interpretation is merely subjective and there's no sure way to know if we're right.
Other tests that have been suggested suffer the same drawbacks. What if you gave it a coffee machine with the label “Black & Decker 12-cup digital coffee maker, programmable, with washable basket filter, auto brew, water window, and keep hot plate” and other marketing gibberish scraped off, and asked it to determine its function? What if it managed to get a finite score on the Stanford-Binet test? What if it managed to ‘learn’ a new skill, like figuring out how to pick up women in a country-and-western bar, that exceeded its programming and, indeed, its physical ability?
Or to take a particularly dystopic example, what if it decided to invent its own test and used it to determine whether the humans were intelligent—and the humans failed?
These things would certainly be useful, maybe even interesting, but they would not necessarily indicate intelligence. A suitable chatbot, given a suitable function to optimize, could just as easily identify patterns in its input and put them together in a new way to minimize that function without understanding them.
Well then, what if the machine invented its own optimization function? In that case, we'd be back to evaluating it subjectively.
Claims in the popular press are easy to ridicule: that two AIs spontaneously invented a new language and started talking to each other; that an AI proved some hitherto unsolvable mathematical theorem (as if it woke up one day and just decided to do it); or that Gemini's recent bout of apparent existential despair indicates awareness of its own existence. But they illustrate the folly of trying to interpret the behavior of a machine in terms of something we cannot define.
No test that treats the AI as a black box can measure intelligence. The best it can do is measure the humans' ability to fool themselves into believing they've found the answer.
Let's take two commonly used examples. An AI could work out a new strategy for playing basketball. It could identify specific diseases in CT scans. Both of these are measurable outcomes, yet it is clear that any success depends on the optimization function that is used. Without a pre-established measure of success the AI can do nothing; with it, the machine need only iterate through every possibility until it finds one. While it may do pattern recognition better than a human, no matter how sophisticated the search strategy it's still no more intelligent than a chess-playing program.
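The "iterate through every possibility" argument can be made concrete. In this sketch, the candidate set and loss function are illustrative assumptions; the machine finds the optimum by exhaustive, mindless search, with success defined entirely by the pre-established measure it was handed:

```python
# Given a pre-established measure of success (a loss function), a
# machine can simply try every candidate and keep the best one,
# understanding nothing about the task.
def best_candidate(candidates, loss):
    best, best_loss = None, float("inf")
    for c in candidates:          # exhaustive, mindless search
        c_loss = loss(c)
        if c_loss < best_loss:
            best, best_loss = c, c_loss
    return best

# Hypothetical task: find the number closest to a target of 7.
result = best_candidate(range(100), loss=lambda x: abs(x - 7))
print(result)  # 7: "success" is whatever the supplied loss says it is
```

Without the loss function the search can do nothing; with it, the search needs no intelligence at all, which is why success at such tasks measures the optimization criterion rather than the machine.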
Indeed, in our attempts to measure the IQ of a machine, we might discover that IQ is actually a philosophical question, which means it's by definition unanswerable. Or we may discover that humans are not quite as intelligent as we thought. What the quest for a test may do is help us understand how we think and what it means to be intelligent. Chatbots have taught us an important lesson: we don't have that yet.
[1] Rahimov A, Zamler O, Azaria A (2025). The Turing test is more relevant than ever. arXiv:2505.02558.
[2] Jones CR, Bergen BK (2025). Large language models pass the Turing test. arXiv:2503.23674v1.
[3] Restrepo Echavarría R (2025). ChatGPT-4 in the Turing test. Minds and Machines 35(8). Paywalled.
[4] Mappouras G (2025). Turing Test 2.0: The general intelligence threshold. arXiv:2505.19550.
aug 22 2025, 4:50 am. updated aug 28 2025