What Does Chatbot Eugene Goostman’s Success on the Turing Test Mean?

(short answer: pretty much nothing…)

I see lots of press today stating that the chatbot Eugene Goostman has “beaten the Turing test” — the classic test of machine intelligence proposed by AI pioneer Alan Turing, which says (loosely) that if an AI program can fool people into thinking it’s human, in a textual conversation context, then it should be assumed to have human-level general intelligence.

Is this true?


[Video: Ben Goertzel interviewed by Adam Ford]

Well, sort of….

According to Wikipedia,
33% of the judges (which included John Sharkey, a sponsor of the bill granting a posthumous pardon to Turing, and Red Dwarf actor Robert Llewellyn) were convinced that Eugene Goostman was a human.

Turing Test junkies will recall that in 2008, the chatbot Elbot convinced 30% of the Loebner Prize judges it was human.

Alan Turing somewhat arbitrarily set the threshold at 30% when he articulated his “imitation game” test back in 1950. Elbot almost met the criterion, but Eugene Goostman beat it.

On the other hand, in the 2013 Loebner contest, no chatbot fooled any of the 4 judges. However, I suspect the 2013 Loebner chatbots were better than the 2008 Loebner chatbots, and the judges were just less naive in 2013 than in 2008. And — I’m just guessing here — I suspect the judges for the Eugene Goostman test were more on the naive side…

I doubt there has actually been any dramatic recent advance in chatbot technology. The fluctuation from 30% of judges fooled in 2008 to 33% of judges fooled in 2014 seems to me more likely to be “noise” resulting from differences in the panels of judges…
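
To see just how wide the error bars are on such small judge panels, here is a quick back-of-the-envelope check in Python. The panel sizes below are illustrative assumptions on my part, not the actual figures from either contest.

    from math import sqrt

    def wilson_interval(successes, n, z=1.96):
        """95% Wilson score confidence interval for a binomial proportion."""
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return center - half, center + half

    # Hypothetical panels: 3 of 10 judges fooled (30%) vs. 10 of 30 (33%).
    for fooled, judges in [(3, 10), (10, 30)]:
        lo, hi = wilson_interval(fooled, judges)
        print(f"{fooled}/{judges} fooled: 95% CI = ({lo:.2f}, {hi:.2f})")

The intervals come out to roughly (0.11, 0.60) and (0.19, 0.51); they overlap almost completely, so a 30%-to-33% shift tells us essentially nothing about chatbot progress.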

Also, the 30% threshold for “passing” is far from universally accepted. For instance, in Ray Kurzweil’s bet with Mitch Kapor that no AI will pass the Turing Test by 2029, the definition of “beating the Turing Test” was set at fooling at least 2/3 of the judges, not just 30%. Also, importantly, the Kurzweil/Kapor bet requires a two-hour conversation, not just five minutes like the test with Goostman. A two-hour conversation would be much harder to finesse with trickery.

In any case, while making chatbots that can fool human judges is a reasonably fun pursuit, nobody should confuse it with the quest to actually build thinking machines that can understand and converse like people. The latter is the sort of thing we discuss at the Artificial General Intelligence (AGI) Conference each year, and it’s noteworthy that chatbots like Eugene Goostman and Elbot never come up at such conferences. Ultimately, these chatbots are theatrical constructs, which generate responses that simulate understanding, but don’t actually understand what they’re talking about.

What do I mean by understanding?   Here’s one way to put it — incomplete but perhaps still useful.  Suppose a human and a chatbot are both asked the question “Where is Canada?”, and suppose they both answer “North of the US, south of the Arctic.”

Any human who gave that answer would then be able to answer the question “Would you say then that Canada is between the US and the Arctic?”. The human would answer Yes.

On the other hand, a chatbot that gave that same answer, if asked “Would you say then that Canada is between the US and the Arctic?”, would probably not be able to answer the question — but would rather find some conversational ruse for avoiding the issue. It might say “I don’t think anything can come between the US and the Arctic”, or “What is it, stupid questions day?”, or whatever. A good chatbot can avoid questions it finds difficult via various clever tricks of this nature — and of course, this is a conversational skill that people have too. But it’s still very different from understanding what one is talking about, in the way that humans do.
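
To make the distinction concrete, here is a minimal sketch of how such a follow-up probe could be mechanized. The ask callable and the deflection markers are purely illustrative assumptions on my part, not a description of any real test harness.

    # Hypothetical probe: ask a factual question, then a follow-up that the
    # subject's own answer logically entails, and check for evasion.
    DEFLECTION_MARKERS = [
        "stupid question", "why do you ask", "by the way",
        "i don't think anything", "let's talk about",
    ]

    def probe_consistency(ask):
        """`ask` maps a question string to the subject's reply string."""
        ask("Where is Canada?")  # e.g. "North of the US, south of the Arctic."
        reply = ask("Would you say then that Canada is between the US "
                    "and the Arctic?").lower()
        if any(marker in reply for marker in DEFLECTION_MARKERS):
            return False  # dodged the question
        return reply.startswith(("yes", "no"))  # gave a direct answer

A human subject passes this probe trivially; a chatbot that merely pattern-matched its way to the first answer will usually trip one of the deflection heuristics.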

As a specific example of a “conversational trick” of this nature from the Eugene Goostman bot, look at this little sample from an earlier conversation between the bot and a judge:

[16:11:25] Judge: what music do you like?

[16:11:32] Eugene: I like Eminem – because he doesn’t have a thing to do with “polite correctness”.

[16:11:46] Judge: OMG i love eminem, he’s my favourite artist!

[16:12:00] Eugene: I wonder what Eminem’s nick mean. I associate it only with a word “enema”. But he is a cool guy, I like his The Real Slim Shady.

[16:12:55] Judge: i know what his nickname means, his real name is Marshall Mathers. From the initials M&M phynetically sounding like Eminem.

[16:13:06] Eugene: Eminem is a famous crapper. No, raper… ripper… Yeah, I mean – rapper. I like his Stan and The Real Slim Shady.

[16:13:32] Judge: I like Hailies Song.

[16:13:46] Eugene: Interesting tastes. I like science-fiction books, by the way. Where do you came from, by the way? Could you tell me about the place where you live?

[16:14:22] Judge: Epsom, home of the derby. yourself?

The bot didn’t know what to say about Hailies Song. But instead of saying something contextually appropriate, it used the canned phrase “Interesting tastes” and then changed the subject to something easier. This is impressive “sleight of word”, but what it does is cover up a lack of actual understanding. Of course, humans use this sort of rhetorical trick sometimes as well. But normal humans don’t use it so extensively, and they don’t do it to cover up such a profound ignorance of basic commonsense concepts and relationships.
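
Mechanically, the trick is simple. Here is a toy sketch of the general shape of the fallback pattern; the keywords and canned phrases are invented for illustration, and this is of course not the Goostman bot’s actual code.

    import random

    # Toy illustration of the fallback trick: if nothing in the input matches
    # a known pattern, acknowledge vaguely and pivot to a prepared topic.
    KNOWN_TOPICS = {
        "eminem": "Eminem is a cool guy. I like his The Real Slim Shady.",
        "music": "I like Eminem - he doesn't have a thing to do with \"polite correctness\".",
    }
    PIVOTS = [
        "I like science-fiction books, by the way.",
        "Could you tell me about the place where you live?",
    ]

    def reply(user_input):
        text = user_input.lower()
        for keyword, canned in KNOWN_TOPICS.items():
            if keyword in text:
                return canned
        # No match: canned acknowledgement plus a subject change.
        return "Interesting tastes. " + random.choice(PIVOTS)

    print(reply("I like Hailies Song."))  # -> "Interesting tastes. ..."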

One also sees from this snippet the wisdom of choosing a foreign 13-year-old as the subject of imitation. One will forgive a lot of English and commonsense lapses if one thinks one is talking to a foreigner, and a foreign kid at that. Again, stage magic style trickery.

So is it really an important achievement if a chatbot can fool 33% of judges into thinking it’s human — or 30%? What if it were 50%, or 60%? It’s an interesting theatrical achievement. But if it’s achieved in large part via funky conversational trickery, it doesn’t really advance us much toward achieving real AI — not any more than stage magic advances us toward achieving real Harry Potter style magic.

An automated dialogue system that understood what it was talking about would not necessarily be a human-like general intelligence. But unlike the current batch of chatbots, it would be an important achievement, and would certainly have a lot to teach us about how to achieve AGI at the human level and beyond.

Turing was a very smart man, and a brilliant AI theorist for his time.  But he may not have fully understood how easy people are to fool — nor how clever some humans are at figuring out how to fool other humans.   (A computer that could fool humans as well as other humans can — now that would be impressive!!)  Being able to fool ordinary people acting as judges is not the same as actually conversing in the same way that a human does.   For example, the pattern of avoiding commonsense knowledge questions instead of answering them is something that may not be noticed by the average person chatting with a chatbot, but would easily be detected by an expert in discourse analysis or conversational AI systems, analyzing a chatbot’s conversation versus real human conversation.
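
As a rough sketch of the kind of statistic such an analyst might compute, one could simply measure how often each speaker deflects rather than answers. The marker list below is a crude stand-in of my own invention for a proper dialogue-act classifier.

    # Hypothetical deflection-rate metric over a speaker's conversational turns.
    EVASION_MARKERS = ["by the way", "let's talk about", "interesting tastes",
                       "why do you ask"]

    def deflection_rate(turns):
        """Fraction of turns that contain an evasion marker."""
        if not turns:
            return 0.0
        hits = sum(any(m in t.lower() for m in EVASION_MARKERS) for t in turns)
        return hits / len(turns)

    bot_turns = [
        "Interesting tastes. I like science-fiction books, by the way.",
        "Eminem is a famous rapper. I like his Stan.",
    ]
    print(deflection_rate(bot_turns))  # 0.5 on this tiny sample

A chatbot’s score on such a metric would stand far above the human baseline, even when its individual evasions fool individual judges.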

In his 1950 paper presenting the “imitation game” that is now called the Turing Test, Turing admitted that he hadn’t really thought through the various ways his proposed game could be gamed:

It might be urged that when playing the “imitation game” the best strategy for the machine may possibly be something other than imitation of the behaviour of a man. This may be, but I think it is unlikely that there is any great effect of this kind. In any case there is no intention to investigate here the theory of the game, and it will be assumed that the best strategy is to try to provide answers that would naturally be given by a man.

One thing contemporary chatbots show is that, to fool a certain percentage of naive judges, the best strategy for a machine is indeed to give answers other than those that would naturally be given by a man (i.e. to evade in many situations where a normal human would answer in accordance with commonsense understanding).

An interesting modification of the Turing Test would be the following: have an AI carry out conversations with a variety of AI experts, in a manner that other AI-expert and linguist analysts could not distinguish from human conversations with those same AI experts. An AI that could do this would probably have human-level intelligence. The difference here is that AI experts would know to probe the likely weaknesses of chatbots; and the expert analysts would know to look for evasive maneuvers and other clever “stage magic” type ruses.

In the end, I think the main thing Eugene Goostman and Elbot and so forth teach us is that imitating humans and being generally intelligent are not really the same thing, and there’s not too much point to mixing them up. A properly designed imitation game, extending Turing’s original idea, could probably serve as a test for human-like intelligence — but ultimately, so what? Humans are just one of the many varieties of potential generally intelligent systems. Imitating the surface form of our conversation is ultimately not that interesting; and approximately imitating this surface form well enough to trick people is really uninteresting, at least from a science point of view. What’s interesting is emulating the general intelligence that lies below the surface. That is what the real AGI researchers of the world are focusing on.

    Comments

    1. Good insights from Ben for sure. An article worth the read.
      >>…A properly designed imitation game, extending Turing’s original idea, could probably serve as a test for human-like intelligence — but ultimately, so what?

      Maybe we should explore a stop-gap method to AGI? … perhaps work on a bridge between narrow AI and AGI, rather than giving up in frustration in the quest of “emulating the general intelligence that lies below the surface.”

      After all, there are already some people, working in the very field, now questioning the whole ‘Singularity’ premise.

      Imitating human-like behavior (consider the daily routine of the average “joe”: a 9-5 day, predictable weekends, year after year, till retirement) might not be so hard to emulate. Each such individual emulated would most likely have a finite quota of “common sense” that chatbot algorithms can grow from.

      Point? Maybe chatbot algorithms, while not the best way forward to AGI, might be the stepping stones to the fork in the road that narrow AI can then take to reach the AGI superhighway.

    2. The whole thing with it being a 13-year-old foreigner “with a strange sense of humor” is such a transparent ruse for masking goofs that it makes the whole spectacle embarrassingly underwhelming. That tells me right away that the creators of Eugene have little confidence in its AI.

      -Typo? Weird syntax? => Hey, English isn’t his first language!
      -Answer which has no relation to the question asked? => Hey, we said he has a weird sense of humor!
      -Makes a joke instead of answering a question? => Hey, he’s immature!

      Then on top of it, every now and then ‘Eugene’ gives these answers that don’t sound at all like a 13-year-old Ukrainian but rather like the words of an adult, albeit clearly lifted from Wikipedia. In one section from the 2012 ‘show’, when asked about his home town he (pardon the expression) robotically rattles off a bunch of sterile factoids about Odessa, primarily its great opera house. A 13-year-old kid who likes Eminem is a fan of the Odessa opera scene?

      A judge would have to be a serious Aspie to think ‘Eugene’ was anything but a run-o-the-mill chat bot.

    3. Here is my interview with Ben Goertzel on “Eugene Goostman” ‘passing’ the Turing Test – hot off the press! https://www.youtube.com/watch?v=_5OfaGTwbiI

    4. With a few minutes of exposure, you not only learn the way the AI cheats, but also its personality, making the test unfair.

      Wouldn’t the judges already be familiar with the chatbot, since these bots are often trained online, and considering the judges’ personal interest and curiosity to try out different bots?

    5. A control should really be set up: have the judges speak with an actual human first, to establish a threshold for each panel. More importantly, there should be both a required level of literacy (the bot has to pass as an adult with a 6th-grade reading level) and judges who actively try to trap it.

    6. Ah, but aren’t we all just pretending to understand the world? 😛 Great article, thank you!

    7. I fully agree. Some time ago I wrote a similar post http://forum.complexevents.com/viewtopic.php?f=13&t=321&p=1544#p1544
