Beyond the Turing Test — The Winograd Schema Challenge
Nuance Communications, Inc. today announced an annual competition to develop programs that can solve the Winograd Schema Challenge, a test developed by Hector Levesque, Professor of Computer Science at the University of Toronto, and winner of the 2013 IJCAI Award for Research Excellence. Nuance announced the challenge at the 28th AAAI Conference in Quebec, Canada.
Nuance is sponsoring the yearly competition in cooperation with CommonsenseReasoning.org, a research group dedicated to furthering and promoting research in the field of formal commonsense reasoning. CommonsenseReasoning.org will organize, administer, and evaluate the Winograd Schema Challenge. A winning program that passes the test will receive a grand prize of $25,000. The test is designed to judge whether a program has truly modeled human-level intelligence.
Artificial Intelligence (AI) has long been measured by the “Turing Test,” proposed in 1950 by one of the great pioneers of computer science, Alan Turing, who sought a way to determine whether a computer program exhibited human-level intelligence. The test is considered passed if the program can convince a human judge that he or she is conversing with another human and not a machine. No system has ever passed the Turing Test, and most programs that have tried rely on considerable trickery to fool humans. Even the recently unveiled program modeling a 13-year-old boy has left many skeptical. These efforts have also suggested that the Turing Test may not be an ideal way to judge a machine’s intelligence.
The Winograd Schema Challenge is an alternative to the Turing Test that provides a more accurate measure of genuine machine intelligence. Rather than base the test on the sort of short free-form conversation suggested by the Turing Test, the Winograd Schema Challenge poses a set of multiple-choice questions that have a form where the answers are expected to be fairly obvious to a layperson, but ambiguous for a machine without human-like reasoning or intelligence.
An example of a Winograd Schema question is the following: “The trophy would not fit in the brown suitcase because it was too big. What was too big? Answer 0: the trophy or Answer 1: the suitcase?” A human who answers these questions correctly typically uses their abilities in spatial reasoning, their knowledge about the typical sizes of objects, and other types of commonsense reasoning to determine the correct answer.
“There has been renewed interest in AI and Natural Language Processing (NLP) as a means of humanizing the complex technological landscape that we encounter in our day-to-day lives,” said Charles Ortiz, Senior Principal Manager of AI and Senior Research Scientist, Natural Language and Artificial Intelligence Laboratory, Nuance Communications.
“The Winograd Schema Challenge provides us with a tool for concretely measuring research progress in commonsense reasoning, an essential element of our intelligent systems. Competitions such as the Winograd Schema Challenge can help guide more systematic research efforts that will, in the process, allow us to realize new systems that push the boundaries of current AI capabilities and lead to smarter personal assistants and intelligent systems.”
The test will be administered on a yearly basis by CommonsenseReasoning.org starting in 2015. The first submission deadline will be October 1, 2015. The 2015 Commonsense Reasoning Symposium, to be held at the AAAI Spring Symposium at Stanford from March 23-25, 2015, will include a special session for presentations and discussions on progress and issues related to the Winograd Schema Challenge. Contest details can be found at http://commonsensereasoning.org/winograd.html.
The winner that meets the baseline for human performance will receive a grand prize of $25,000. In the case of multiple winners, a panel of judges will base its choice on either further testing or examination of traces of program execution. If no program meets that threshold, a first prize of $3,000 and a second prize of $2,000 will be awarded to the two highest-scoring entries. In the case of teams, the prize will be given to the team lead, whose responsibility will be to divide the prize among the team members as appropriate.
The Turing Test is intended to serve as a test of whether a machine has achieved human-level intelligence. In one of its best-known versions, a person attempts to determine whether he or she is conversing (via text) with a human or a machine. However, it has been criticized as inadequate. At its core, the Turing Test measures a human’s ability to judge deception: can a machine fool a human into thinking that it too is human? A chatbot like Eugene Goostman can fool at least some judges into thinking it is human, but that likely reveals more about how easy it is to fool some humans, especially in the course of a short conversation, than about the bot’s intelligence. It also suggests that the Turing Test may not be an ideal way to judge a machine’s intelligence.

The alternative: the Winograd Schema Challenge. Rather than base the test on the sort of short free-form conversation suggested by the Turing Test, the Winograd Schema Challenge (WSC) poses a set of multiple-choice questions that have a particular form. Two examples follow; the second, from which the WSC gets its name, is due to Terry Winograd.
I. The trophy would not fit in the brown suitcase because it was too big (small). What was too big (small)?
Answer 0: the trophy
Answer 1: the suitcase
II. The town councilors refused to give the demonstrators a permit because they feared (advocated) violence. Who feared (advocated) violence?
Answer 0: the town councilors
Answer 1: the angry demonstrators

The answers to the questions (in the above examples, 0 if the first word in each pair is used; 1 if the parenthesized alternate is used) are expected to be obvious to a layperson. A human who answers these questions correctly typically uses their abilities in spatial and interpersonal reasoning, their knowledge about the typical sizes of objects and about how political demonstrations unfold, as well as other types of commonsense reasoning, to determine the correct answer. During Commonsense-2013, the Winograd Schema Challenge was therefore proposed as a promising method for tracking progress in automating commonsense reasoning.

Features of the Challenge. Winograd Schemas typically share the following features. [Details can be found in Levesque (2011) and Levesque et al. (2012).]
- Two entities or sets of entities, not necessarily people or sentient beings, are mentioned in the sentences by noun phrases.
- A pronoun or possessive adjective is used to reference one of the parties (of the right sort so it can refer to either party).
- The question involves determining the referent of the pronoun.
- There is a special word that is mentioned in the sentence and possibly the question. When it is replaced with an alternate word, the answer changes, although the question still makes sense (e.g., in the above examples, “big” can be changed to “small” and “feared” to “advocated”).
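The features above give a Winograd Schema a regular enough shape that it can be captured in a simple data structure. The sketch below is one possible encoding in Python; the class and field names (`premise`, `special`, `alternate`, and so on) are illustrative assumptions, not a format mandated by the challenge.

```python
# A minimal, hypothetical representation of a Winograd Schema.
# Field names are illustrative; the challenge does not prescribe a format.
from dataclasses import dataclass


@dataclass
class WinogradSchema:
    premise: str            # sentence with "{}" where the special word goes
    question: str           # question with "{}" where the special word goes
    answers: tuple          # (answer 0, answer 1)
    special: str            # word for which answer 0 is correct
    alternate: str          # replacement word that flips the answer to 1

    def instance(self, use_alternate: bool = False):
        """Return (sentence, question, correct answer) for one variant."""
        word = self.alternate if use_alternate else self.special
        correct = self.answers[1] if use_alternate else self.answers[0]
        return self.premise.format(word), self.question.format(word), correct


trophy = WinogradSchema(
    premise="The trophy would not fit in the brown suitcase "
            "because it was too {}.",
    question="What was too {}?",
    answers=("the trophy", "the suitcase"),
    special="big",
    alternate="small",
)

# Swapping the special word changes the correct referent.
sentence, question, answer = trophy.instance(use_alternate=True)
# answer is now "the suitcase"; with use_alternate=False it is "the trophy"
```

A test harness for the challenge would iterate over such schema pairs, pose both variants of each question to the entrant's program, and count a schema as solved only if both variants are answered correctly.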
The test is projected to consist of at least 40 Winograd Schemas and will be administered on a yearly basis, with a non-repeating set of test questions supplied each year. Ernest Davis has created an initial library of more than 100 sample Winograd Schemas that participants can use to test their systems during development, at http://www.cs.nyu.edu/davise/papers/WS.html. This library will be augmented each year with the examples from the previous year’s test. Further details on establishing a baseline for human performance for each year’s test, and on the threshold that entries must minimally meet to qualify for prizes, will be given at the WSC website.

Rules for entering. Individuals or teams may enter. If approved by the organizers, a team can include an industry partner.

Prize fund. The winner that meets the baseline for human performance will receive a grand prize of $25,000. Details of other prizes are given at the WSC website.

Important dates. The test will be administered on a yearly basis starting in 2015. The first submission deadline will be October 1, 2015. Additional details will appear at http://www.commonsensereasoning.org/winograd. The 2015 Commonsense Reasoning Symposium, to be held at the AAAI Spring Symposium at Stanford from March 23-25, 2015, will include a special session for presentations and discussions on progress and issues related to the Winograd Schema Challenge.

More information. Visit http://www.commonsensereasoning.org/winograd or contact Leora Morgenstern at firstname.lastname@example.org or Charlie Ortiz at email@example.com.