# Build An Optimal Scientist, Then Retire

Jürgen Schmidhuber (pronounced Yirgan Shmidhoobuh) is one of the world’s most interesting minds in artificial intelligence. Schmidhuber is co-director of the Swiss AI lab IDSIA in Lugano and a professor of Cognitive Robotics at the Tech University Munich. Since the 1980s, he has worked on topics in computer science and robotics including artificial curiosity, theories of surprise, incremental program evolution ("the first approach that made it possible to evolve entire soccer team strategies from scratch", according to his website), universal learning algorithms, optimally self-improving theoretical constructs called Gödel machines, artificial ants, robots that are taught how to tie shoelaces using reinforcement learning, and much more. A search for Jürgen on Google Scholar returns over 4,000 results.

At a recent talk at Singularity Summit 2009 in New York, a gathering of futurists and researchers from cutting-edge fields including AI, nanotech, and biotech, Dr. Schmidhuber touched the surface of some of his research interests, including a tongue-in-cheek argument that the Singularity must occur in 1540, based on a seemingly accelerating trend of major events that occurred between 1444 and 1517.

Dr. Schmidhuber is also an artist, creating "low-complexity art" based on principles from algorithmic information theory. In this interview, I ask Dr. Schmidhuber about his work, his philosophy towards artificial intelligence, and views on the future.

**h+: Your website states that your "main scientific ambition is to build an optimal scientist, then retire.” What makes you think that creating generally intelligent AI will be possible in the next few decades rather than taking centuries or never occurring?**

**JÜRGEN SCHMIDHUBER:** In the new millennium, work at IDSIA already led to theoretically optimal universal problem solvers, such as the asymptotically fastest algorithm for all well-defined problems, and the Gödel Machine. AI is becoming a formal science! The basic principles of the new methods are very simple. This makes me optimistic that the answer to an essential remaining open question is also simple: If an intelligent agent can execute only a fixed number of computational instructions per unit time interval (say, 10 trillion elementary operations per second), what is the best way of using them to get as close as possible to the recent theoretical limits of universal AIs?

**h+: The website for the Dalle Molle Institute for AI, which you co-direct, lists dozens of fascinating projects. Can you tell us a little bit about which projects you are working on currently?**

**JS: **We have several projects on brain-like recurrent neural nets (RNN) — networks of neurons with feedback connections. Biological RNN can learn many behaviors/sequence processing tasks/algorithms/programs that are not learnable by traditional machine learning methods. This explains the rapidly growing interest in artificial RNN for technical applications: general computers which can learn algorithms to map input sequences to output sequences, with or without a teacher. They are computationally more powerful and biologically more plausible than other adaptive approaches such as Hidden Markov Models (no continuous internal states), feedforward networks and Support Vector Machines (no internal states at all). Our artificial RNN have recently given state-of-the-art results in time series prediction, adaptive robotics and control, connected handwriting recognition, image classification, aspects of speech recognition, protein analysis, stock market prediction, and other sequence learning problems. We are continuing to improve them (see resources).

We also have ongoing projects based on a simple principle explaining essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, music, jokes, and art & science in general. Any data becomes temporarily interesting by itself to some self-improving but computationally limited subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more "beautiful." Curiosity is the desire to create or discover more non-random, non-arbitrary, regular data that is novel and surprising not in the traditional sense of Boltzmann and Shannon but in the sense that it allows for compression progress because its regularity was not yet known. This drive maximizes interestingness, the first derivative of subjective beauty or compressibility… that is, the steepness of the learning curve. It motivates exploring infants, pure mathematicians, composers, artists, dancers, comedians, yourself, and (since 1990) our increasingly complex artificial systems. Ongoing project: build artificial robotic scientists and artists equipped with curiosity and creativity (see resources).

**h+ What is the "asymptotically fastest algorithm for all well-defined problems?" Maybe you could expand on that somewhat for non-technical readers?**

**JS: **At IDSIA my former postdoc Marcus Hutter (now professor in Canberra) wrote down an algorithm that takes any formally well-defined problem (say, a Traveling Salesman Problem or whatever) as an input and solves it as quickly as the unknown fastest program that provably solves all instances of the given problem class (all TSPs in our example), save for a very small multiplicative slowdown (1% or less) and an additive constant that does not depend on the problem size (the number of cities in our example). Most problems are so big that the constant becomes totally negligible. Is that the end of computer science? Almost but not quite. Our universe is full of small problems where the additive constant is still relevant. However, the self-referential Gödel machine (also developed at IDSIA) can deal with such constants in a way that is again theoretically optimal in a sense.

**h+: You work on both biologically inspired and more formal theoretical approaches to AI. Would you call yourself a "neat" or a "scruffy" with respect to AI?**

**JS:** I am promoting the New AI, that is, AI as a Formal Science, as opposed to a bunch of heuristics. Heuristics come and go, theorems are for eternity. The new millennium results of IDSIA describe the first general AIs that are provably theoretically optimal in various important senses. However, I am ready to admit that inspiration for the neat and mathematically rigorous systems often comes from scruffy biological systems. In fact, several of our most successful practical AI systems are not based on the recent theoretical optimality results. I believe, however, that theory and practice will converge soon.

**h+: What do you think of using virtual worlds vs. real-world robotics to train AI systems? Is there a major difference?**

**JS: **Answer A… in our research, virtual and real worlds actually complement each other. We use machine learning and artificial curiosity to learn or improve simulations of the real world, then train the robot in the sim to achieve desirable goals (mental trials can be much faster and safer than real trials). Then we transfer the learned behavior back to the real robot, and so on. Problem: current hardware is too slow when it comes to modeling very complex robots in very complex environments.

Answer B… no difference to the extent that the real world itself may be just a sim. Neither Heisenberg’s uncertainty principle nor Bell’s inequality exclude the possibility, that the Universe, including all observers inhabiting it, is in principle computable by a completely deterministic computer program, as first suggested by computer pioneer Konrad Zuse in 1967. Then the simplest explanation of our universe is the simplest program that computes it. In 1997 I pointed out that the simplest such program actually computes all possible universes with all types of physical constants and laws, not just ours. More papers on this can be found on the IDSIA site (see resources).

**h+: You’re known for designing something called a Gödel machine. Can you tell us a little bit about what that does?**

**JS: **It’s a self-referential universal problem solver. In 1931, Gödel exhibited the limits of mathematics and computation by creating a formula that speaks about itself, claiming to be unprovable by an algorithmic theorem prover: either the formula is true but unprovable, or math itself is flawed in an algorithmic sense. This inspired my Gödel machine — an agent-controlling program that speaks about itself, ready to rewrite itself in arbitrary fashion once it has found a proof that the rewrite is useful according to an arbitrary user-defined utility function (all well-defined problems can be encoded by such a utility function). Any self-rewrite of the Gödel machine is necessarily globally optimal — no local maxima! — since this proof necessarily must have demonstrated the uselessness of continuing the proof search for even better rewrites. A Gödel machine will optimally speed up its proof searcher and other program parts, provided the speed up’s utility is indeed provable. More papers on this can be found on the IDSIA Gödel Machine page. (see resources)

**h+: In your excellent talk at the Singularity Summit 2009, you described simple algorithmic principles that underlie discovery, subjective beauty, selective attention, curiosity and creativity. What are those principles?**

**JS: **They are very simple indeed. All we need is (1) An adaptive predictor or compressor of the continually growing sensory data history, reflecting what’s currently known about sequences of actions and sensory inputs, (2) A learning algorithm (e.g., a recurrent neural network algorithm) that continually improves the predictor or compressor (detecting novel spatio-temporal patterns that subsequently become known patterns), (3) Intrinsic rewards measuring the predictor’s or compressor’s improvements due to the learning algorithm, (4) A reward optimizer or reinforcement learner that translates those rewards into action sequences expected to optimize future reward, thus motivating the agent to create additional novel patterns predictable or compressible in previously unknown ways.

We implemented / discussed the following variants:

(A) Intrinsic reward as measured by improvement in mean squared error (1991),

(B) Intrinsic reward as measured by relative entropies between the agent’s priors and posteriors (1995),

(C) Learning of probabilistic, hierarchical programs and skills through zero-sum intrinsic reward games (1997-2002),

(D) Mathematically optimal, intrinsically motivated systems driven by compression progress (2006-2009).

How does the theory informally explain the motivation to create or perceive art and music? For example, why are some songs interesting to some observer? Not the song he just heard ten times in a row. It became too predictable in the process. Not the other weird one with the completely unfamiliar rhythm and tonality. It seems too irregular and contains too much arbitrariness and subjective noise. The observer is interested in songs that are unfamiliar enough to contain somewhat unexpected harmonies or melodies or beats etc., but familiar enough to allow for quickly recognizing the presence of a new learnable regularity or compressibility in the sound stream: a novel pattern! Sure, this song will get boring over time, but not yet.

All of this perfectly fits our principle: the current compressor of the observer tries to compress his history of acoustic and other inputs where possible. The action selector tries to find history-influencing actions such that the continually growing historic data allows for improving the compressor’s performance. The interesting musical and other subsequences are precisely those with previously unknown yet learnable types of regularities, because they lead to compressor improvements. The boring patterns are those that are either already perfectly known or arbitrary or random, or whose structure seems too hard to understand. Similar statements not only hold for other dynamic art including film and dance (taking into account the compressibility of controller action sequences), but also for painting and sculpture, which also cause dynamic pattern sequences due to attention-shifting actions of the observer.

How does the theory explain the nature of inductive sciences such as physics? If the history of the entire universe were computable, and there is no evidence against this possibility, then its simplest explanation would be the shortest program that computes it. Unfortunately there is no general way of finding the shortest program computing any given data. Therefore physicists have traditionally proceeded incrementally, analyzing just a small aspect of the world at any given time, trying to find simple laws that allow for describing their limited observations better than the best previously known law, essentially trying to find a program that compresses the observed data better than the best previously known program. An unusually large compression breakthrough deserves the name discovery. For example, Newton’s law of gravity can be formulated as a short piece of code that allows for substantially compressing many observation sequences involving falling apples and other objects. Although its predictive power is limited — for example, it does not explain quantum fluctuations of apple atoms — it still allows for greatly reducing the number of bits required to encode the data stream by assigning short codes to events that are predictable with high probability under the assumption that the law holds. Einstein’s general relativity theory yields additional compression progress as it compactly explains many previously unexplained deviations from Newton’s predictions. Most physicists believe there is still room for further advances, and this is what is driving them to invent new experiments unveiling novel, previously unpublished patterns. Physicists are just following their compression progress drive!

We have ongoing projects based on a simple principle explaining essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, music, jokes…

How does the compression progress drive explain humor? Some subjective observers who read a given joke for the first time may think it is funny. Why? As the eyes are sequentially scanning the text the brain receives a complex visual input stream. The latter is subjectively partially compressible as it relates to the observer’s previous knowledge about letters and words. That is, given the reader’s current knowledge and current compressor, the raw data can be encoded by fewer bits than required to store random data of the same size. The punch line at the end, however, is unexpected. Initially this failed expectation results in sub-optimal data compression — storage of expected events does not cost anything, but deviations from predictions require extra bits to encode them. The compressor, however, does not stay the same forever. Within a short time interval, its learning algorithm improves its performance on the data seen so far, by discovering the non-random, non-arbitrary and therefore compressible pattern relating the punch line to previous text and previous knowledge of the reader. This saves a few bits of storage. The number of saved bits (or a similar measure of learning progress) becomes the observer’s intrinsic reward, possibly strong enough to motivate him to read on in search for more reward through additional yet unknown patterns. The recent joke, however, will never be novel or funny again.

**h+: If intelligent machines were created tomorrow, what sort of implications do you think that would have for humanity and civilization?**

**JS: **Gödel machines and the like will rapidly improve themselves and become incomprehensible. It’s a bit like asking an ant of 10 million years ago: If humans were created tomorrow, what sort of implications do you think that would have for all the ant colonies? In hindsight we know that many ant colonies are still doing fine, but some of them (for example, those in my house) have goal conflicts with humans, and live dangerously.

neat. because the world needs better intentionality simulators.

His team now has the best method for connected handwriting recognition. It uses a self-learning recurrent neural network that won several handwriting competitions at ICDAR 2009. But when you study papers on this at http://www.idsia.ch/~juergen/rnn.html you’ll find the recurrent network is not trained by those super-universal learning algorithms, but by “greedy” techniques such as gradient descent. Even in his own lab, being practical is sometimes preferred over being theoretically optimal đź™‚