It has been all over the tech news for the past 24 hours: an “artificial intelligence” agent or chatbot known as “Eugene Goostman” has passed the famous Turing Test. Or has it?
Did the Singularity just happen?
So, what really happened in this Turing Test challenge and what does it mean for the future of humanity and our machines?
The original press release states, “The 65 year-old iconic Turing Test was passed for the very first time by supercomputer Eugene Goostman during Turing Test 2014 held at the renowned Royal Society in London on Saturday,” and continues, “If a computer is mistaken for a human more than 30% of the time during a series of five minute keyboard conversations it passes the test. No computer has ever achieved this, until now. Eugene managed to convince 33% of the human judges that it was human.”
First, some background is required to understand exactly what the Turing Test is and what it is not. Perhaps understandably, there is some confusion over exactly what Turing said and what he meant by the Test.
The Turing Test originally appeared in Turing’s 1950 paper “Computing Machinery and Intelligence” (see the reference below). In the paper, Turing outlines the procedure of the Test, which he calls the Imitation Game.
Here’s exactly what Turing said that leads to the confusion:

“I believe that in about fifty years’ time it will be possible to programme computers, with a storage capacity of about 10⁹, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.”
Clearly this is not a statement of a success criterion for the Test, and it does not appear in the section of the paper where the Test or Imitation Game is described and specified, but in a later section defending the notion against objections. In the earlier section where the Imitation Game is defined, Turing arguably implies that a single instance of fooling a human would constitute passing the Test. In that limited sense the Turing Test was passed years ago, and in fact the Eugene Goostman software has previously fooled judges in Turing Test competitions.
The press release actually anticipated this objection right up front: “Some will claim that the Test has already been passed. The words Turing Test have been applied to similar competitions around the world.”
However, this recent test was not a single one-off but used a panel of 25 judges, each of whom reviewed five competing software programs.
Turing himself was ambiguous and incomplete in describing the Test, and he made various statements about it on different occasions, some in his technical publications and others in broadcast interviews. For example, the original paper never states how long the procedure should take or what other criteria might be used for ending the test. Further, it is not well known that Turing offered a somewhat different formulation of the Test in 1952, in which he suggested the use of a panel of randomly selected judges. In the 1952 version, Turing states that at least 50%, a simple majority, is required to pass his Test. This formulation was presented in the BBC broadcast “Can Digital Computers Think,” in which Turing also predicted that the Test wouldn’t be passed for at least 100 years, or not before 2052. Adding to the ambiguity, in this presentation Turing failed to mention that there should be both a human and a computer participant in each trial, perhaps modifying the idea of the Test in a problematic way that introduces bias.
The idea that 30% is a passing mark appears to be a misunderstanding resulting from taking Turing’s 1950 prediction about future computer performance out of context. The confusion is understandable, since this mistaken figure is widely but erroneously repeated in the literature. But if you read the original in context, it is clear that Turing was predicting machine performance, not stating the success criterion for the Imitation Game.
According to the press release, “Eugene was ‘born’ in 2001,” and he has been under continuous development since. The announcement continues, “This year we improved the ‘dialog controller’ which makes the conversation far more human-like when compared to programs that just answer questions. Going forward we plan to make Eugene smarter and continue working on improving what we refer to as ‘conversation logic’.”
Based on the press frenzy, you might have assumed that this performance was a huge leap forward for Eugene. It wasn’t. Back in 2012, Goostman fooled 29% of the judges, so the 2014 result is an improvement of just 4 percentage points. With a panel of 25 judges, that means just one additional judge was fooled compared with the 2012 trial, hardly a great leap ahead in performance. However, this test was “open ended”: the judges were not restricted in their questioning or question areas. While open ended tests are obviously more challenging, the actual difference here is small and possibly not due to any real advancement in “intelligence” or performance.
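The arithmetic is easy to check. A rough sketch (using the figures reported above; the rounding and the use of the 2014 panel size for both years are my assumptions) also shows that a panel this small carries roughly nine points of sampling noise, so a four-point change is well within it:

```python
# Figures from the article: 25 judges, 29% fooled in 2012, 33% in 2014.
judges = 25
fooled_2012 = round(0.29 * judges)  # about 7 judges
fooled_2014 = round(0.33 * judges)  # about 8 judges
print(fooled_2014 - fooled_2012)    # prints 1: one additional judge fooled

# Standard error of a proportion with n = 25: sqrt(p * (1 - p) / n).
# At p = 0.33 this is roughly 0.094, i.e. about +/- 9 percentage points,
# so the 4-point rise is smaller than the sampling noise alone.
p = 0.33
se = (p * (1 - p) / judges) ** 0.5
print(round(se, 3))                 # prints 0.094
```

In other words, even before asking whether the judges were representative, a single-judge swing on a 25-person panel is statistically indistinguishable from chance.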
The press release states, “Our main idea was that he can claim that he knows anything, but his age also makes it perfectly reasonable that he doesn’t know everything. We spent a lot of time developing a character with a believable personality.” But beyond deflecting questions, the chatbot isn’t really able to do much, so the practical implications are quite limited.
Eugene doesn’t have the ability to learn, or even to remember very much within the five minute test period. It doesn’t use a database of the domain knowledge a real 13-year-old would have. It is, in essence, a parlor trick. This will greatly limit the utility of the software in practice, and I have to question some of the claimed application areas, at least in the near term.
In my experience, it is trivially easy to get nonsense out of Eugene if you know what to say. [Editor’s note: the chatbot was down as of publication]
While text based agents that can reliably imitate human interactions will eventually be a big deal, this software’s performance is insufficient to maintain the required illusion. Imagine using a Eugene Goostman bot as a “mindclone” to answer your office telephone while you skip out to the beach or a long lunch. I am sorry to report that this software simply isn’t going to do a very good job if your boss calls, and even if it fooled the caller, it couldn’t explain to you what they wanted. Generally speaking, this test result may say more about the gullibility of the panel of judges than about the intelligence or power of the machine, and the absence of a control group makes it impossible to rule this out.
A stronger form of the Turing Test would include a control group and could be conducted following Turing’s 1952 description, using the Internet to gather a large pool of potential non-specialist judges from which to randomly sample. In addition, the success criterion of a simple majority, as stated by Turing in 1952, makes some sense, while the widely reported 30% threshold does not. This stronger Internet-based test seems like a viable and useful idea for furthering AI research.
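As a rough illustration of how such a protocol might run, here is a minimal simulation sketch. Everything in it is hypothetical: the pool size, panel size, and fool rate are assumed parameters for illustration, not data from any actual event, and a real test would of course use human verdicts rather than random ones.

```python
import random

def run_stronger_test(pool_size=10_000, panel_size=101, p_fooled=0.33, seed=42):
    """Hypothetical sketch of the 1952-style protocol the article proposes:
    randomly sample a panel from a large Internet pool of judges, collect one
    fooled/not-fooled verdict per judge, and pass only on a simple majority."""
    rng = random.Random(seed)
    panel = rng.sample(range(pool_size), panel_size)   # random selection of judges
    # Each judge's verdict is simulated here with an assumed fool probability;
    # a real test would also run a control (human-vs-human) condition to
    # calibrate how often judges misidentify actual humans.
    verdicts = [rng.random() < p_fooled for _ in panel]
    fooled = sum(verdicts)
    return fooled, fooled > panel_size // 2            # simple-majority criterion

fooled, passed = run_stronger_test()
print(fooled, passed)
```

An odd panel size avoids ties, and drawing the panel at random from a large pool addresses the judge-selection bias the article raises.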
But sadly we’ll have to wait a bit longer for the Singularity to happen.
A. M. Turing (1950) Computing Machinery and Intelligence. Mind 59: 433–460.
Stuart M. Shieber, ed. (2004) The Turing Test: Verbal Behavior as the Hallmark of Intelligence. MIT Press (Bradford Books).