What Does Chatbot Eugene Goostman’s Success on the Turing Test Mean?
(short answer: pretty much nothing…)
I see lots of press today stating that the chatbot Eugene Goostman has “beaten the Turing test” — the classic test of machine intelligence proposed by AI pioneer Alan Turing, which says (loosely) that if an AI program can fool people into thinking it’s human, in a textual conversation context, then it should be assumed to have human-level general intelligence.
Is this true?
Well, sort of….
According to Wikipedia,
33% of the judges (which included John Sharkey, a sponsor of the bill granting a posthumous pardon to Turing, and Red Dwarf actor Robert Llewellyn) were convinced that Eugene Goostman was a human.
Turing Test junkies will recall that in 2008, the chatbot Elbot convinced 30% of the Loebner Prize judges it was human.
Alan Turing somewhat arbitrarily set the threshold at 30% when he articulated his “imitation game” test back in 1950. Elbot almost met the criterion, but Eugene Goostman beat it.
On the other hand, in the 2013 Loebner contest, no chatbot fooled any of the 4 judges. However, I suspect the 2013 Loebner chatbots were better than the 2008 Loebner chatbots, and the judges were just less naive in 2013 than in 2008. And, though I’m just guessing here, I suspect the judges for the Eugene Goostman test were more on the naive side…
I doubt there has actually been any dramatic recent advance in chatbot technology. The fluctuation from 30% of judges fooled in 2008 to 33% of judges fooled in 2014 seems to me more likely to be “noise” resulting from differences in the panels of judges…
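To make the “noise” point a bit more concrete, here is a rough sketch that computes an approximate 95% Wilson score interval for the fraction of judges fooled. The panel sizes used below (10 and 30 judges) are illustrative assumptions, not figures taken from the contests, and the function name `wilson_interval` is made up for this example.

```python
import math
from typing import Tuple


def wilson_interval(fooled: int, judges: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if judges <= 0:
        raise ValueError("judges must be positive")
    p = fooled / judges
    denom = 1 + z * z / judges
    center = (p + z * z / (2 * judges)) / denom
    half = z * math.sqrt(p * (1 - p) / judges + z * z / (4 * judges * judges)) / denom
    return center - half, center + half


# Hypothetical panels (assumed sizes): 3 of 10 judges fooled vs. 10 of 30 fooled.
low_a, high_a = wilson_interval(3, 10)
low_b, high_b = wilson_interval(10, 30)

# Two intervals overlap when each one starts before the other ends.
overlap = low_a <= high_b and low_b <= high_a

print(f"30% panel: {low_a:.2f} to {high_a:.2f}")
print(f"33% panel: {low_b:.2f} to {high_b:.2f}")
print("intervals overlap:", overlap)
```

With panels this small, the two intervals are wide and overlap heavily, which is consistent with reading a three-point difference as panel-to-panel variation rather than progress.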
Also, the 30% threshold for “passing” is far from universally accepted. For instance, in the bet between Ray Kurzweil and Mitch Kapor over whether an AI will pass the Turing Test by 2029, the definition of “beating the Turing Test” was set at fooling at least 2/3 of the judges, not just 30%. Also, importantly, the Kurzweil/Kapor bet requires a two-hour conversation, not just five minutes like the test with Goostman. A two-hour conversation would be much harder to finesse with trickery.
In any case, while making chatbots that can fool human judges is a reasonably fun pursuit, nobody should confuse it with the quest to actually build thinking machines that can understand and converse like people. The latter is the sort of thing we discuss at the Artificial General Intelligence (AGI) Conference each year, and it’s noteworthy that chatbots like Eugene Goostman and Elbot never come up at such conferences. Ultimately, these chatbots are theatrical constructs, which generate responses that simulate understanding, but don’t actually understand what they’re talking about.
What do I mean by understanding? Here’s one way to put it — incomplete but perhaps still useful. Suppose a human and a chatbot are both asked the question “Where is Canada?”, and suppose they both answer “North of the US, south of the Arctic.”
Any human who gave that answer would then be able to answer the question “Would you say then that Canada is between the US and the Arctic?”. The human would answer Yes.
On the other hand, a chatbot that gave that same answer, if asked “Would you say then that Canada is between the US and the Arctic?”, would probably not be able to answer the question, and would instead find some conversational ruse for avoiding the issue. It might say “I don’t think anything can come between the US and the Arctic”, or “What is it, stupid questions day?”, or whatever. A good chatbot can avoid questions it finds difficult via various clever tricks of this nature, and of course this is a conversational skill that people have too. But it’s still very different from understanding what one is talking about, in the way that humans do.
As a specific example of a “conversational trick” of this nature from the Eugene Goostman bot, look at this little sample from an earlier conversation between the bot and a judge:
[16:11:25] Judge: what music do you like?
[16:11:32] Eugene: I like Eminem – because he doesn’t have a thing to do with “polite correctness”.
[16:11:46] Judge: OMG i love eminem, he’s my favourite artist!
[16:12:00] Eugene: I wonder what Eminem’s nick mean. I associate it only with a word “enema”. But he is a cool guy, I like his The Real Slim Shady.
[16:12:55] Judge: i know what his nickname means, his real name is Marshall Mathers. From the initials M&M phynetically sounding like Eminem.
[16:13:06] Eugene: Eminem is a famous crapper. No, raper… ripper… Yeah, I mean – rapper. I like his Stan and The Real Slim Shady.
[16:13:32] Judge: I like Hailies Song.
[16:13:46] Eugene: Interesting tastes. I like science-fiction books, by the way. Where do you came from, by the way? Could you tell me about the place where you live?
[16:14:22] Judge: Epsom, home of the derby. yourself?
The bot didn’t know what to say about Hailies Song. But instead of asking something contextually appropriate, it used a canned phrase “Interesting tastes” and then changed the subject to something easier. This is impressive “sleight of word”, but what it does is cover up a lack of actual understanding. Of course, humans use this sort of rhetorical trick sometimes as well. But normal humans don’t use it so extensively, and they don’t do it to cover up such a profound ignorance of basic commonsense concepts and relationships.
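The kind of deflection seen in this exchange can be mimicked by a very small program. The sketch below is a toy, not a reconstruction of how Eugene Goostman is actually implemented; the keyword table, the canned replies, and the function name `reply` are all invented for illustration. It answers when it spots a known keyword and otherwise falls back to a stock phrase plus a change of subject.

```python
import re

# Hypothetical keyword -> canned answer table (invented for this sketch).
KEYWORD_REPLIES = {
    "music": "I like Eminem.",
    "eminem": "He is a cool guy, I like his The Real Slim Shady.",
}

# Hypothetical fallback: a stock phrase followed by a subject change.
DEFLECTION = "Interesting tastes. I like science-fiction books, by the way."


def reply(message: str) -> str:
    """Return a canned answer if a known keyword appears, else deflect."""
    words = re.findall(r"[a-z]+", message.lower())
    for word in words:
        if word in KEYWORD_REPLIES:
            return KEYWORD_REPLIES[word]
    # No keyword matched: nothing was understood, so change the subject.
    return DEFLECTION


print(reply("what music do you like?"))
print(reply("I like Hailies Song."))
```

Nothing in this toy represents what a song, an artist, or a preference is. It only matches surface strings, which is why an unfamiliar title triggers the fallback instead of a relevant follow-up question.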
One also sees from this snippet the wisdom of choosing a foreign 13-year-old as the subject of imitation. One will forgive a lot of English and commonsense lapses if one thinks one is talking to a foreigner, and a foreign kid at that. Again, stage-magic-style trickery.
So is it really an important achievement if a chatbot can fool 33% of judges into thinking it’s human — or 30%? What if it were 50%, or 60%? It’s an interesting theatrical achievement. But if it’s achieved in large part via funky conversational trickery, it doesn’t really advance us much toward achieving real AI — not any more than stage magic advances us toward achieving real Harry Potter style magic.
An automated dialogue system that understood what it was talking about would not necessarily be a human-like general intelligence. But unlike the current batch of chatbots, it would be an important achievement, and would certainly have a lot to teach us about how to achieve AGI at the human level and beyond.
Turing was a very smart man, and a brilliant AI theorist for his time. But he may not have fully understood how easy people are to fool — nor how clever some humans are at figuring out how to fool other humans. (A computer that could fool humans as well as other humans can — now that would be impressive!!) Being able to fool ordinary people acting as judges is not the same as actually conversing in the same way that a human does. For example, the pattern of avoiding commonsense knowledge questions instead of answering them is something that may not be noticed by the average person chatting with a chatbot, but would easily be detected by an expert in discourse analysis or conversational AI systems, analyzing a chatbot’s conversation versus real human conversation.
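In his 1950 paper presenting the “imitation game” that is now called the Turing Test, Turing admitted that he hadn’t really thought through the various ways his proposed game could be gamed; he simply assumed that the best strategy for the machine would be to try to provide answers that would naturally be given by a man.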
One thing contemporary chatbots show is that, to fool a certain percentage of naive judges, the best strategy for a machine is indeed to give answers other than those that would naturally be given by a man (i.e. to evade in many situations where a normal human would answer in accordance with commonsense understanding).
An interesting modification of the Turing test would be as follows: an AI carries out conversations with a variety of AI experts, in a manner that other AI experts and linguist analysts cannot distinguish from human conversations with those same AI experts. An AI that could do this would probably have human-level intelligence. The difference here is that AI experts would know to probe the likely weaknesses of chatbots, and the expert analysts would know to look for evasive maneuvers and other clever “stage magic” type ruses.
In the end, I think the main thing Eugene Goostman and Elbot and so forth teach us is that imitating humans and being generally intelligent are not really the same thing, and there’s not too much point to mixing them up. A properly designed imitation game, extending Turing’s original idea, could probably serve as a test for human-like intelligence. But ultimately, so what? Humans are just one of the many varieties of potential generally intelligent systems. Imitating the surface form of our conversation is ultimately not that interesting, and approximately imitating this surface form well enough to trick people is really uninteresting, at least from a science point of view. What’s interesting is emulating the general intelligence that lies below the surface. That is what the real AGI researchers of the world are focusing on.