Best of H+: Gesture Recognition, Mind-Reading Machines, and Social Robotics

“Domo Arigato, Mister Roboto.”

Recall the classic Styx song and ask yourself a simple question: how can you tell that the band members are pretending to be robots? It’s the stiff, jerky movements, right?

When we think of a gesture or a voice as “robotic,” we mean that it’s abrupt, rigid, emotionless. To be “robotic” is the opposite of “human.” Human and animal motion is fantastically complex and responsive to the environment, in a way that robots can’t yet replicate. What’s more, human motion is social: our faces and gestures respond to social cues, and we modulate our body language to deal with the presence of others. For example, we can negotiate a crowded hallway without colliding. We’re very good at moving and expressing ourselves physically in a social environment. Robots don’t have such subtle modulation, yet. But new research is helping them catch up.

Before we can teach robots to interact in a social environment, we need to identify how humans do it. This is the goal of the field of gesture recognition. For a machine to identify and classify human gestures, it needs a few components. First, it needs a camera that can perceive depth and motion very accurately. One way to do this is to project random patterns onto nearby objects with an infrared laser, and measure the distances to objects by the distortion of the patterns. Second, a gesture recognition technology needs a feature selector to pick out salient sub-patterns in an image, like facial features or joint angles. This is usually a filter laid over the data, based on edge detection or something similar. It works like the face detector found in many digital cameras. The feature selector simplifies the picture down to its most important elements. Third, a gesture recognition technology needs a database of typical gestures, movements, or expressions. And finally, it needs an algorithm, usually a hidden Markov model, to identify from the features which gesture is being performed. (A hidden Markov model is a statistical model that treats a sequence of observed events as the output of an unobserved chain of states; given the observations, it lets you estimate which hidden states, or which stored gesture, best account for them.) Put it all together and you get a machine that can watch you move and identify which motion you’re executing.
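To make that last step concrete, here is a minimal sketch in Python of how a hidden Markov model can pick a gesture out of a small database. Everything in it is an assumption made for illustration: the two gestures, the three motion symbols standing in for the feature selector’s output, and all of the probabilities are invented, and real systems learn their models from recorded data rather than writing them by hand. The sketch scores a sequence of observed hand motions against each stored model with the standard forward algorithm and returns the best match.

```python
import math
from typing import Dict, List, NamedTuple, Sequence

# Hypothetical motion symbols that a feature selector might emit per video frame.
LEFT, RIGHT, STILL = 0, 1, 2


class HMM(NamedTuple):
    start: List[float]        # start[s]    = P(first hidden state is s)
    trans: List[List[float]]  # trans[i][j] = P(next state is j | current state is i)
    emit: List[List[float]]   # emit[s][o]  = P(observing symbol o | state s)


def log_likelihood(model: HMM, obs: Sequence[int]) -> float:
    """Forward algorithm with per-step rescaling: returns log P(obs | model)."""
    n = len(model.start)
    alpha = [model.start[s] * model.emit[s][obs[0]] for s in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate probability mass through the transitions, then weight
            # each state by how well it explains the current observation.
            alpha = [
                sum(alpha[i] * model.trans[i][j] for i in range(n)) * model.emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid numeric underflow
    return total


# A toy "database of typical gestures" (all numbers are made up).
MODELS: Dict[str, HMM] = {
    # Wave: hidden states "moving left" / "moving right" that keep swapping.
    "wave": HMM(
        start=[0.5, 0.5],
        trans=[[0.6, 0.4], [0.4, 0.6]],
        emit=[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    ),
    # Swipe right: move right for a while, then come to rest and stay there.
    "swipe_right": HMM(
        start=[1.0, 0.0],
        trans=[[0.8, 0.2], [0.0, 1.0]],
        emit=[[0.05, 0.9, 0.05], [0.05, 0.05, 0.9]],
    ),
}


def classify(models: Dict[str, HMM], obs: Sequence[int]) -> str:
    """Return the name of the gesture model that best explains the observations."""
    return max(models, key=lambda name: log_likelihood(models[name], obs))


if __name__ == "__main__":
    print(classify(MODELS, [LEFT, RIGHT, LEFT, RIGHT, LEFT, RIGHT]))
    print(classify(MODELS, [RIGHT, RIGHT, RIGHT, STILL, STILL]))
```

The design choice worth noticing is that each gesture gets its own small model, and recognition is just a comparison of likelihoods. Adding a new gesture means adding one more entry to the database, not retraining everything.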

If you’re one of millions of consumers, you already have a device that can do this: your Kinect. A massive, years-long feat of machine learning research developed at Microsoft Research Cambridge allows the Kinect to identify your movements with startling accuracy. Though it was intended as a gee-whiz controller for video games, the Kinect has been adapted by intrepid amateurs to do everything from manipulating surgical tools to making a Princess Leia hologram. And the industry is taking notice: PrimeSense, the gesture recognition company responsible for the Kinect, just received a new round of funding from the investing group Silver Lake, which also invested in Skype. Gesture recognition, it seems, has potential far beyond its initial application – current technology is opening up a new world of machines that can detect and react to your movements.

One thing we can do with gesture recognition is to gain a better understanding of human gestures and expressions in social context. With machines that can categorize movements, gestures, and expressions, we can better understand how humans react to social cues and learn through imitation. We can even make progress towards developing robots that can mimic human gestures and expressions, and replicate the kind of social cognitive development that human infants undergo. Two labs at major research institutions are making great strides in the new field of “social robotics.”

The human-computer interaction lab at Cambridge University is working on a large-scale project to teach computers to recognize emotions from facial expressions. Cambridge researchers Rana el Kaliouby and Peter Robinson are developing a mind-reading machine. Specifically, they’re training an algorithm on sample faces to predict emotional affect from videos of facial expressions. The HCI lab is collaborating with autism researcher Simon Baron-Cohen, who developed a taxonomy of human expressions and the emotions they represent (the sort of social knowledge that autistic people often don’t have instinctively). Together, the computer scientists and the autism researchers developed the Mind Reading DVD, a tutorial on identifying 412 different emotional states from videos of facial expressions, as a therapeutic aid for autistic people. Computer programs, again using hidden Markov models, were able to learn from the examples in the DVD how to correctly read an emotional state when looking at a facial expression. In fact, the computer program outperformed the vast majority of human subjects at guessing emotional states.

Another project at the Cambridge lab focuses on “mind-reading” for personal computers: observing the user’s emotional state from body language and facial expression. This could potentially make computers more responsive to the user’s point of view – socially intelligent computers won’t wait for a response from a user who has left the room or do irrelevant tasks while the user is frantically working towards a deadline. One goal is to make online learning tutorials responsive to the student’s emotional state. That is, to develop computers that can detect boredom, confusion, or loss of interest on the online student’s face, and respond by changing the pace of the tutorial, just as a good teacher would. The hope is to ultimately create socially intelligent computers that can respond naturally to non-verbal emotional cues. The same process is being explored with humanoid robots, except that in this case robots are trained to make facial expressions to express empathy and rapport with humans. If we’re going to have robots in our daily life, making sure they can communicate effectively and win users’ trust is paramount. Understanding how facial expressions and body postures relate to emotional states can help us better understand how humans learn social intelligence, as well as making computers and robots more useful and responsive to us.

The social robotics lab at Yale University is also using machines to understand human movements in a social context. Yale’s approach is more focused on creating robots that can emulate the same kind of social cognition that humans and animals exhibit. The Yale lab hopes to build robots that learn the same way we learn – by observation and imitation of other people. The lab is currently developing “wolfpack” groups of robots that can observe each other’s movements and exhibit following behavior. The ability of animals and humans to “follow the leader” and stay out of each other’s way without colliding is an example of how we learn to move in a social context. To stay out of the alpha wolf’s way, the beta wolves have to understand that he’s not just a moving shape but an intentional agent moving to seek a goal. In other words, a computational model that can mimic a wolfpack is a social model. Building robots to imitate wolf behavior implies building robots with a “theory of mind” – the ability to understand that other individuals have motivations of their own.

Another project at the Yale lab is “intention from motion” – the ability to observe other agents moving, infer that moving agents have intentions and goals, and learn to interact with them. For example, a mobile-controlled robot car at the lab learned to play Tag and Catch, almost entirely on its own, by observing other players. Again, this is basically a feat of social cognition – the robot has to see the shapes passing in front of its vision sensors as individuals, as playmates, and to figure out what they’re trying to do in the game. This is the skill that most children learn, and that children with developmental deficits struggle with. The little robot car passed this developmental test with flying colors.

The lab’s humanoid robot Nico, through moving and touching objects and imitating adults, learned about the world much the same way that human infants do: it learned to distinguish its own arm from other objects. And it learned first to reach out and touch objects, then to point to objects, and finally to follow the adult’s gaze to objects of mutual interest. (Again, human babies with certain developmental deficits often fail at those tasks – they never learn to point or to follow their mother’s gaze.) Just like a human baby, Nico learned about the world by interacting with an attentive adult. Once again, this is the development of motor skills in a social context, and it requires an implicit theory of mind (Nico can understand that the experimenter is an individual who can have mental states like “wants to look at the red block.”) What’s interesting here is that these aren’t pre-programmed movements; the robots in Yale’s social robotics lab are learning to interact with the world independently, and in particular they’re learning to interact in a social world with intentional agents, just like animals and humans do.

It’s clear that robots and programs that can respond to social cues have produced some promising results. But where could social robotics lead us in the future? Well, the future’s “Mr. Roboto” might be a little less lonely. We could see true humanoid robots with responsive, natural facial expressions and gestures, possibly to care for the elderly or disabled. We could use gesture and facial-expression recognition technology as a therapeutic tool for autism. And gesture recognition might give us the power to make automated responses based on gait, body language, or expressions.

For instance, a video surveillance system could pick out unusual gaits to recognize possible terrorists, or observe facial expressions to gauge audience response at a lecture or movie. It could even pick out the most interested customers for marketing purposes (like a window shopping display that uses gesture recognition to distinguish interested from uninterested customers). We could envision a future in which all our appliances and environments know how we’re feeling, and can intuit what we want from our posture and facial expressions; imagine if the chipper little animated paperclip in Microsoft Word could tell when we just wanted to be left alone! Or imagine advertising billboards and banner ads that could tell from your eyes whether you were intrigued; a dating website that could tell whether visitors to your profile were attracted to you; a tool that could estimate whether someone is lying in an interview or bluffing in a negotiation.

The best prediction algorithms for facial expressions perform as well as the top 6% of humans – so for many of us, automating social perceptions may do better than our own intuitions. Like many technologies, this offers both opportunities for greater richness and convenience in everyday life, and risks of privacy invasion or abuse of power. But either way it’s likely that we’ll see a lot more machines that can identify our motions and gestures in context.

One might speculate, even more ambitiously, that social robotics might be the future of artificial intelligence. After all, humans and animals aren’t brains in boxes. We learn by interacting with our environment, especially with other living and moving things. Our knowledge of the world is contextual and richly structured, based on experience in the physical world.

Common machine learning algorithms may take in arbitrary data sets of images or text files, and apply statistical rules to organize them into general categories and predictions. But we humans can do much more than that, because we know something about the world and we have context-specific knowledge. A picture of a cow is just an arbitrary set of pixels to most computer programs, but to us, it’s a 3-d animal, so we know how it would look from different angles, we know that a cow is more like a horse than it is like a truck, and we know that it gives milk. And, for humans and animals at least, we build up this kind of rich contextual knowledge by actually moving through the “meatspace” world. It’s conceivable that the best artificial learners will not be computer programs, but robots with physical bodies.

Humans naturally develop “intuitive theories” of how the world works through interacting with it. We develop an “intuitive theory of physics” from moving objects around and seeing them fall. Given a picture of blocks stacked in different configurations on a table, people are remarkably good at figuring out if they’ll fall down or remain stable, and computers can only mimic this effect if they include a model of Newton’s laws of physics. Similarly, humans develop an “intuitive theory of mind” from interacting with other humans: we perceive them as individuals with intentions and motivations of their own, and this makes us much better at predicting their behavior. Some evolutionary biologists even think that we evolved our big brains especially for intuiting complex motivations in social groups; social cognition is what we’re designed for. (For example, see the classic 1944 Heider-Simmel experiment where a video of moving geometrical shapes is instantly recognizable as a big triangle bullying the little triangle. We naturally see intention in even abstract movements.) This kind of contextual knowledge – intuitive theories of physics and theories of mind – is developed by interacting in a physical and social environment. Cognitive scientist Josh Tenenbaum of MIT has explored the idea of these “intuitive theories” and demonstrated that implementing them into computer programs significantly improves predictive ability, compared to context-free, general-purpose machine learning algorithms. He calls this advantage of experiential learning the “blessing of abstraction.”
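As a rough illustration of what “including a model of Newton’s laws” can mean for the stacked-blocks example, here is a small Python sketch. It is not the method used in any of the research mentioned here; it is a deliberately simplified, side-on (one-dimensional) picture with hypothetical names, in which each block is described only by the horizontal position of its center, its width, and its mass. The physical rule it encodes is static balance: at every joint in the stack, the combined center of mass of everything above must sit over the surface where the two blocks touch.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    center: float  # horizontal position of the block's center
    width: float   # horizontal extent of the block
    mass: float    # mass, assumed evenly distributed


def is_stable(stack: List[Block]) -> bool:
    """Check a single column of blocks; stack[0] rests on the table.

    Assumes rigid blocks and ignores friction, so this is only a sketch
    of the idea, not a full physics simulation.
    """
    for i in range(len(stack) - 1):
        lower, upper = stack[i], stack[i + 1]

        # Region where the lower block actually touches the block above it.
        left = max(lower.center - lower.width / 2, upper.center - upper.width / 2)
        right = min(lower.center + lower.width / 2, upper.center + upper.width / 2)
        if left >= right:
            return False  # the upper block is not resting on the lower one

        # Combined center of mass of every block above this joint.
        above = stack[i + 1:]
        total_mass = sum(b.mass for b in above)
        center_of_mass = sum(b.center * b.mass for b in above) / total_mass

        # If the weight above is not over the contact region, the pile tips.
        if not (left <= center_of_mass <= right):
            return False
    return True


if __name__ == "__main__":
    gentle_lean = [Block(0.0, 1.0, 1.0), Block(0.2, 1.0, 1.0), Block(0.4, 1.0, 1.0)]
    steep_lean = [Block(0.0, 1.0, 1.0), Block(0.4, 1.0, 1.0), Block(0.8, 1.0, 1.0)]
    print(is_stable(gentle_lean), is_stable(steep_lean))
```

A person glancing at the two piles would guess the answer without doing any arithmetic. A program with no built-in notion of weight and support has nothing comparable to fall back on, which is the point of the “intuitive physics” argument.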

Maybe social robots can reap this “blessing” by interacting directly with the world and with us, like Yale’s robot Nico, which learned to reach and point the way a baby does, or like the humanoid robot of the Xpero research team at Bonn-Rhein-Sieg University, which spontaneously developed tool use after spending time in a room manipulating boxes and balls. Let me repeat that for emphasis: a robot playing with balls and blocks spontaneously discovered the use of tools, in this case the trick of using a long, thin object to reach something high up. Robots that physically manipulate objects and imitate human social interactions could have enormous learning potential.

And we might learn as much from them as they from us. To simulate a behavior is to begin to understand it. Developing robots that can identify and imitate our gestures and expressions can help us catalogue and understand our own movements. Designing programs that infer emotions from faces and postures can teach us about how humans learn to “read minds,” and allow us to help autistic people who don’t know how to read faces. Robots that mimic animal social behavior or the early stages of human childhood development can teach us about how people and animals learn to move in a social environment, and how we infer other individuals’ goals from their movements. We modulate our movements to respond to a physical and social environment, in very subtle and sensitive ways, and researchers even now don’t fully comprehend everything we can do. As we teach robots to do what we do instinctively – less Robot Dance, more Fred Astaire panache – we’ll discover new worlds within ourselves, and even perhaps improve upon our own biological capabilities. Domo arigato, indeed.


  1. Great article!

  2. I don’t see any android or actroid in your article

    but clearly : this should have been in it : robots need to be more human than human ( robots designed by human : should be a ideal human : “lol” )

    A Japanese firm will sell a multifunction actroid for $20,000.

    The singularity is here : but you don’t see it.

    SO ….

    Where is the gap between “mechanoid/ robotics” and biology ?

    ” I DON’T KNOW ” It is trouble water ….

    If the capacity of robotics and sensors become equal or superior to biological actuators and sensors …

    Yesterday I read an article about a photovoltaic solar panel that can regenerate itself.

    There are artificial actuators 1000 better faster than yours …

    You are right about one thing : we do not have the electronic brain yet : or do we ?


    ( the singularity is here )

  3. I believe the popular term is “emotional machine”
