Neural Networks and Deep Learning

One of the SCI FOO sessions I enjoyed the most this year was a discussion of deep learning by AI researcher Juergen Schmidhuber. For an overview of recent progress, see this recent paper. Also of interest: Michael Nielsen’s pedagogical book project.

An application which especially caught my attention is described by Schmidhuber here:

Many traditional methods of Evolutionary Computation [15-19] can evolve problem solvers with hundreds of parameters, but not millions. Ours can [1,2], by greatly reducing the search space through evolving compact, compressed descriptions [3-8] of huge solvers. For example, a Recurrent Neural Network [34-36] with over a million synapses or weights learned (without a teacher) to drive a simulated car based on a high-dimensional video-like visual input stream.

More details here. They trained a deep neural net to drive a car using visual input (pixels from the driver’s perspective, generated by a video game); output consists of steering orientation and accelerator/brake activation. There was no hard-coded structure corresponding to physics; the neural net optimized a utility function primarily defined by time between crashes. It learned how to drive the car around the track after fewer than 10k training sessions.
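To give a feel for what "evolving compact, compressed descriptions" might look like, here is a toy sketch of my own, not Schmidhuber's actual method. It assumes the genome is a handful of cosine coefficients that are decoded into a much longer weight vector, and it uses a simple (1+1) evolution strategy as the search. The names `decode`, `fitness`, and `evolve`, the cosine encoding, and the stand-in fitness function (distance to a target weight vector rather than time between crashes) are all illustrative assumptions.

```python
import math
import random


def decode(coeffs, n_weights):
    """Expand a few cosine coefficients into n_weights network weights."""
    return [
        sum(c * math.cos(math.pi * k * (i + 0.5) / n_weights)
            for k, c in enumerate(coeffs))
        for i in range(n_weights)
    ]


def fitness(weights, target):
    """Toy stand-in for a driving score: negative squared error to a target."""
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def evolve(target, n_coeffs=4, generations=300, sigma=0.1, seed=0):
    """(1+1) evolution strategy over the compressed genome only."""
    rng = random.Random(seed)
    best = [0.0] * n_coeffs
    best_fit = fitness(decode(best, len(target)), target)
    for _ in range(generations):
        # Mutate the 4 coefficients, never the 200 decoded weights directly.
        child = [c + rng.gauss(0.0, sigma) for c in best]
        child_fit = fitness(decode(child, len(target)), target)
        if child_fit >= best_fit:
            best, best_fit = child, child_fit
    return best, best_fit


N_WEIGHTS = 200
# Hypothetical target that happens to be expressible in the compressed space.
target = decode([0.5, -1.0, 0.3, 0.2], N_WEIGHTS)
initial_fit = fitness(decode([0.0] * 4, N_WEIGHTS), target)
genome, final_fit = evolve(target)
print(f"fitness improved from {initial_fit:.3f} to {final_fit:.3f}")
```

The point of the sketch is only the shape of the idea: the search runs over 4 numbers while the decoded solver has 200 weights, so the search space is far smaller than the solver it describes.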

For some earlier discussion of deep neural nets and their application to language translation, see here. Schmidhuber has also worked on Solomonoff universal induction.

These TED videos give you some flavor of Schmidhuber’s sense of humor 🙂 Apparently his younger brother (mentioned in the first video) has transitioned from theoretical physics to algorithmic finance. Schmidhuber on China.

I’ve also been reading Michael Nielsen’s online book on neural nets and deep learning. I particularly liked the subsection quoted below. For people who think deep learning is anything close to a solved problem, or anticipate a near term, quick take-off to the Singularity, I suggest they read the passage below and grok it deeply.

Neural Networks and Deep Learning (Chapter 3):

You have to realize that our theoretical tools are very weak. Sometimes, we have good mathematical intuitions for why a particular technique should work. Sometimes our intuition ends up being wrong […] The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well. — Question and answer with neural networks researcher Yann LeCun

Once, attending a conference on the foundations of quantum mechanics, I noticed what seemed to me a most curious verbal habit: when talks finished, questions from the audience often began with “I’m very sympathetic to your point of view, but […]”. Quantum foundations was not my usual field, and I noticed this style of questioning because at other scientific conferences I’d rarely or never heard a questioner express their sympathy for the point of view of the speaker. At the time, I thought the prevalence of the question suggested that little genuine progress was being made in quantum foundations, and people were merely spinning their wheels. Later, I realized that assessment was too harsh. The speakers were wrestling with some of the hardest problems human minds have ever confronted. Of course progress was slow! But there was still value in hearing updates on how people were thinking, even if they didn’t always have unarguable new progress to report.

You may have noticed a verbal tic similar to “I’m very sympathetic […]” in the current book. To explain what we’re seeing I’ve often fallen back on saying “Heuristically, […]”, or “Roughly speaking, […]”, following up with a story to explain some phenomenon or other. These stories are plausible, but the empirical evidence I’ve presented has often been pretty thin. If you look through the research literature you’ll see that stories in a similar style appear in many research papers on neural nets, often with thin supporting evidence. What should we think about such stories?

 

In many parts of science – especially those parts that deal with simple phenomena – it’s possible to obtain very solid, very reliable evidence for quite general hypotheses. But in neural networks there are large numbers of parameters and hyper-parameters, and extremely complex interactions between them. In such extraordinarily complex systems it’s exceedingly difficult to establish reliable general statements. Understanding neural networks in their full generality is a problem that, like quantum foundations, tests the limits of the human mind. Instead, we often make do with evidence for or against a few specific instances of a general statement. As a result those statements sometimes later need to be modified or abandoned, when new evidence comes to light.

 

[ Sufficiently advanced AI will come to resemble biology, even psychology, in its complexity and resistance to rigorous generalization … ]

One way of viewing this situation is that any heuristic story about neural networks carries with it an implied challenge. For example, consider the statement I quoted earlier, explaining why dropout works (from ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012): “This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.” This is a rich, provocative statement, and one could build a fruitful research program entirely around unpacking the statement, figuring out what in it is true, what is false, what needs variation and refinement. Indeed, there is now a small industry of researchers who are investigating dropout (and many variations), trying to understand how it works, and what its limits are. And so it goes with many of the heuristics we’ve discussed. Each heuristic is not just a (potential) explanation, it’s also a challenge to investigate and understand in more detail.

 

Of course, there is not time for any single person to investigate all these heuristic explanations in depth. It’s going to take decades (or longer) for the community of neural networks researchers to develop a really powerful, evidence-based theory of how neural networks learn. Does this mean you should reject heuristic explanations as unrigorous, and not sufficiently evidence-based? No! In fact, we need such heuristics to inspire and guide our thinking. It’s like the great age of exploration: the early explorers sometimes explored (and made new discoveries) on the basis of beliefs which were wrong in important ways. Later, those mistakes were corrected as we filled in our knowledge of geography. When you understand something poorly – as the explorers understood geography, and as we understand neural nets today – it’s more important to explore boldly than it is to be rigorously correct in every step of your thinking. And so you should view these stories as a useful guide to how to think about neural nets, while retaining a healthy awareness of the limitations of such stories, and carefully keeping track of just how strong the evidence is for any given line of reasoning. Put another way, we need good stories to help motivate and inspire us, and rigorous in-depth investigation in order to uncover the real facts of the matter.
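For readers who have not seen dropout, the mechanism behind the heuristic quoted above is tiny, which makes the difficulty of explaining it all the more striking. Below is a minimal sketch, assuming the common "inverted dropout" variant in which surviving activations are rescaled at training time; this is not necessarily the exact formulation used in the Krizhevsky, Sutskever, and Hinton paper. The function name `dropout` and the parameter `keep_prob` are illustrative choices.

```python
import random


def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and
    rescale the survivors so the expected activation is unchanged."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)  # survivor, scaled up
        else:
            out.append(0.0)            # dropped for this training step
    return out


rng = random.Random(0)
hidden = [1.0] * 10000                 # hypothetical layer of activations
dropped = dropout(hidden, keep_prob=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)
print(f"mean activation after dropout: {mean_after:.3f}")
```

Each training step sees a different random subset of neurons, which is the sense in which a neuron "cannot rely on the presence of particular other neurons." Why that should improve generalization is exactly the kind of story the passage says still needs evidence.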

See also here from an earlier post:

… evolution has [ encoded the results of a huge environment-dependent optimization ] in the structure of our brains (and genes), a process that AI would have to somehow replicate. A very crude estimate of the amount of computational power used by nature in this process leads to a pessimistic prognosis for AI even if one is willing to extrapolate Moore’s Law well into the future. [ Moore’s Law (Dennard scaling) may be toast for the next decade or so! ] Most naive analyses of AI and computational power only ask what is required to simulate a human brain, but do not ask what is required to evolve one. I would guess that our best hope is to cheat by using what nature has already given us: emulating the human brain as much as possible.

If indeed there are good (deep) generalized learning architectures to be discovered, that will take time. Even with such a learning architecture at hand, training it will require interaction with a rich exterior world — either the real world (via sensors and appendages capable of manipulation) or a computationally expensive virtual world. Either way, I feel confident in my bet that a strong version of the Turing test (allowing, e.g., me to communicate with the counterpart over weeks or months; to try to teach it things like physics and watch its progress; eventually for it to teach me) won’t be passed until at least 2050 and probably well beyond.

Turing as polymath: … In a similar way Turing found a home in Cambridge mathematical culture, yet did not belong entirely to it. The division between ‘pure’ and ‘applied’ mathematics was at Cambridge then as now very strong, but Turing ignored it, and he never showed mathematical parochialism. If anything, it was the attitude of a Russell that he acquired, assuming that mastery of so difficult a subject granted the right to invade others.

###

Stephen is the Vice-President for Research and Graduate Studies, and Professor of Theoretical Physics, Michigan State University. This post recently appeared on his blog here: http://infoproc.blogspot.com/