Build An Optimal Scientist, Then Retire


Jürgen Schmidhuber (pronounced Yirgan Shmidhoobuh) is one of the world’s most interesting minds in artificial intelligence. Schmidhuber is co-director of the Swiss AI lab IDSIA in Lugano and a professor of Cognitive Robotics at the Tech University Munich. Since the 1980s, he has worked on topics in computer science and robotics including artificial curiosity, theories of surprise, incremental program evolution ("the first approach that made it possible to evolve entire soccer team strategies from scratch", according to his website), universal learning algorithms, optimally self-improving theoretical constructs called Gödel machines, artificial ants, robots that are taught how to tie shoelaces using reinforcement learning, and much more. A search for Jürgen on Google Scholar returns over 4,000 results.

At a recent talk at Singularity Summit 2009 in New York, a gathering of futurists and researchers from cutting-edge fields including AI, nanotech, and biotech, Dr. Schmidhuber scratched the surface of some of his research interests, including a tongue-in-cheek argument that the Singularity must occur in 1540, based on a seemingly accelerating trend of major events that occurred between 1444 and 1517.

Dr. Schmidhuber is also an artist, creating "low-complexity art" based on principles from algorithmic information theory. In this interview, I ask Dr. Schmidhuber about his work, his philosophy towards artificial intelligence, and views on the future.

h+: Your website states that your "main scientific ambition is to build an optimal scientist, then retire." What makes you think that creating generally intelligent AI will be possible in the next few decades rather than taking centuries or never occurring?

JÜRGEN SCHMIDHUBER: In the new millennium, work at IDSIA already led to theoretically optimal universal problem solvers, such as the asymptotically fastest algorithm for all well-defined problems, and the Gödel Machine. AI is becoming a formal science! The basic principles of the new methods are very simple. This makes me optimistic that the answer to an essential remaining open question is also simple: If an intelligent agent can execute only a fixed number of computational instructions per unit time interval (say, 10 trillion elementary operations per second), what is the best way of using them to get as close as possible to the recent theoretical limits of universal AIs?

h+: The website for the Dalle Molle Institute for AI, which you co-direct, lists dozens of fascinating projects. Can you tell us a little bit about which projects you are working on currently?

JS: We have several projects on brain-like recurrent neural nets (RNN) — networks of neurons with feedback connections. Biological RNN can learn many behaviors/sequence processing tasks/algorithms/programs that are not learnable by traditional machine learning methods. This explains the rapidly growing interest in artificial RNN for technical applications: general computers which can learn algorithms to map input sequences to output sequences, with or without a teacher. They are computationally more powerful and biologically more plausible than other adaptive approaches such as Hidden Markov Models (no continuous internal states), feedforward networks and Support Vector Machines (no internal states at all). Our artificial RNN have recently given state-of-the-art results in time series prediction, adaptive robotics and control, connected handwriting recognition, image classification, aspects of speech recognition, protein analysis, stock market prediction, and other sequence learning problems. We are continuing to improve them (see resources).

Photo credit: Jürgen Schmidhuber

We also have ongoing projects based on a simple principle explaining essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, music, jokes, and art & science in general. Any data becomes temporarily interesting by itself to some self-improving but computationally limited subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more "beautiful." Curiosity is the desire to create or discover more non-random, non-arbitrary, regular data that is novel and surprising not in the traditional sense of Boltzmann and Shannon but in the sense that it allows for compression progress because its regularity was not yet known. This drive maximizes interestingness, the first derivative of subjective beauty or compressibility… that is, the steepness of the learning curve. It motivates exploring infants, pure mathematicians, composers, artists, dancers, comedians, yourself, and (since 1990) our increasingly complex artificial systems. Ongoing project: build artificial robotic scientists and artists equipped with curiosity and creativity (see resources).

Prof. Jürgen Schmidhuber. Photo credit: idsia.ch

h+: What is the "asymptotically fastest algorithm for all well-defined problems"? Maybe you could expand on that somewhat for non-technical readers?

JS: At IDSIA my former postdoc Marcus Hutter (now professor in Canberra) wrote down an algorithm that takes any formally well-defined problem (say, a Traveling Salesman Problem or whatever) as an input and solves it as quickly as the unknown fastest program that provably solves all instances of the given problem class (all TSPs in our example), save for a very small multiplicative slowdown (1% or less) and an additive constant that does not depend on the problem size (the number of cities in our example). Most problems are so big that the constant becomes totally negligible. Is that the end of computer science? Almost but not quite. Our universe is full of small problems where the additive constant is still relevant. However, the self-referential Gödel machine (also developed at IDSIA) can deal with such constants in a way that is again theoretically optimal in a sense.
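Read literally, the guarantee described above can be written schematically as follows. The symbols are assumptions introduced here for illustration: $x$ is a problem instance, $p$ is any program that provably solves the whole problem class, $t_p(x)$ is its running time on $x$, $\varepsilon$ is the small multiplicative slowdown (the "1% or less"), and $c_p$ is the additive constant, which may depend on $p$ but not on the size of $x$. This is a sketch of the informal statement in the answer, not the exact theorem.

```latex
\[
  t_{\mathrm{fast}}(x) \;\le\; (1+\varepsilon)\, t_p(x) + c_p
  \qquad \text{for every provably correct } p,\ \text{with } \varepsilon \le 0.01 .
\]
```

Because $c_p$ stays fixed while $t_p(x)$ grows with the instance, the constant is negligible for large problems but can dominate for small ones, which is the gap the answer attributes to the Gödel machine.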

h+: You work on both biologically inspired and more formal theoretical approaches to AI. Would you call yourself a "neat" or a "scruffy" with respect to AI?

JS: I am promoting the New AI, that is, AI as a Formal Science, as opposed to a bunch of heuristics. Heuristics come and go, theorems are for eternity. The new millennium results of IDSIA describe the first general AIs that are provably theoretically optimal in various important senses. However, I am ready to admit that inspiration for the neat and mathematically rigorous systems often comes from scruffy biological systems. In fact, several of our most successful practical AI systems are not based on the recent theoretical optimality results. I believe, however, that theory and practice will converge soon.

h+: What do you think of using virtual worlds vs. real-world robotics to train AI systems? Is there a major difference?

JS: Answer A… in our research, virtual and real worlds actually complement each other. We use machine learning and artificial curiosity to learn or improve simulations of the real world, then train the robot in the sim to achieve desirable goals (mental trials can be much faster and safer than real trials). Then we transfer the learned behavior back to the real robot, and so on. Problem: current hardware is too slow when it comes to modeling very complex robots in very complex environments.

Answer B… no difference to the extent that the real world itself may be just a sim. Neither Heisenberg’s uncertainty principle nor Bell’s inequality excludes the possibility that the Universe, including all observers inhabiting it, is in principle computable by a completely deterministic computer program, as first suggested by computer pioneer Konrad Zuse in 1967. Then the simplest explanation of our universe is the simplest program that computes it. In 1997 I pointed out that the simplest such program actually computes all possible universes with all types of physical constants and laws, not just ours. More papers on this can be found on the IDSIA site (see resources).

Robot on robot horse riding off into the sunset, steered by a Gödel machine. Photo credit: idsia.ch

h+: You’re known for designing something called a Gödel machine. Can you tell us a little bit about what that does?

JS: It’s a self-referential universal problem solver. In 1931, Gödel exhibited the limits of mathematics and computation by creating a formula that speaks about itself, claiming to be unprovable by an algorithmic theorem prover: either the formula is true but unprovable, or math itself is flawed in an algorithmic sense. This inspired my Gödel machine — an agent-controlling program that speaks about itself, ready to rewrite itself in arbitrary fashion once it has found a proof that the rewrite is useful according to an arbitrary user-defined utility function (all well-defined problems can be encoded by such a utility function). Any self-rewrite of the Gödel machine is necessarily globally optimal — no local maxima! — since this proof necessarily must have demonstrated the uselessness of continuing the proof search for even better rewrites. A Gödel machine will optimally speed up its proof searcher and other program parts, provided the speed up’s utility is indeed provable. More papers on this can be found on the IDSIA Gödel Machine page. (see resources)

h+: In your excellent talk at the Singularity Summit 2009, you described simple algorithmic principles that underlie discovery, subjective beauty, selective attention, curiosity and creativity. What are those principles?

JS: They are very simple indeed. All we need is (1) An adaptive predictor or compressor of the continually growing sensory data history, reflecting what’s currently known about sequences of actions and sensory inputs, (2) A learning algorithm (e.g., a recurrent neural network algorithm) that continually improves the predictor or compressor (detecting novel spatio-temporal patterns that subsequently become known patterns), (3) Intrinsic rewards measuring the predictor’s or compressor’s improvements due to the learning algorithm, (4) A reward optimizer or reinforcement learner that translates those rewards into action sequences expected to optimize future reward, thus motivating the agent to create additional novel patterns predictable or compressible in previously unknown ways.

We implemented / discussed the following variants:

(A) Intrinsic reward as measured by improvement in mean squared error (1991),
(B) Intrinsic reward as measured by relative entropies between the agent’s priors and posteriors (1995),
(C) Learning of probabilistic, hierarchical programs and skills through zero-sum intrinsic reward games (1997-2002),
(D) Mathematically optimal, intrinsically motivated systems driven by compression progress (2006-2009).
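Variant (A) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IDSIA implementation: the one-weight linear predictor, the gradient-descent learner, and the names `mse`, `learn_step`, and `curiosity_rewards` are all hypothetical. The intrinsic reward at each step is the drop in mean squared error on the stored history after one learning step.

```python
import random

def mse(w, history):
    """Mean squared error of the toy predictor y ≈ w * x on the stored history."""
    return sum((y - w * x) ** 2 for x, y in history) / len(history)

def learn_step(w, history, lr=0.5):
    """One gradient-descent step on the mean squared error (assumed learner)."""
    grad = sum(-2 * x * (y - w * x) for x, y in history) / len(history)
    return w - lr * grad

def curiosity_rewards(history, steps=50):
    """Intrinsic reward per step = predictor improvement caused by learning."""
    w, rewards = 0.0, []
    for _ in range(steps):
        before = mse(w, history)
        w = learn_step(w, history)
        rewards.append(before - mse(w, history))
    return rewards

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(200)]
regular = [(x, 2 * x) for x in xs]             # stream with a learnable regularity
noise = [(x, random.gauss(0, 1)) for x in xs]  # stream with nothing to learn

rewards_regular = curiosity_rewards(regular)
rewards_noise = curiosity_rewards(noise)
print("total reward, regular stream:", sum(rewards_regular))
print("total reward, noise stream:  ", sum(rewards_noise))
print("first vs last reward, regular:", rewards_regular[0], rewards_regular[-1])
```

Under these assumptions the regular stream should yield a large total reward that fades toward zero once the pattern is known, while the noise stream yields almost none. That matches the claim that both fully known data and random data are boring to such an agent.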

How does the theory informally explain the motivation to create or perceive art and music? For example, why are some songs interesting to some observer? Not the song he just heard ten times in a row. It became too predictable in the process. Not the other weird one with the completely unfamiliar rhythm and tonality. It seems too irregular and contains too much arbitrariness and subjective noise. The observer is interested in songs that are unfamiliar enough to contain somewhat unexpected harmonies or melodies or beats etc., but familiar enough to allow for quickly recognizing the presence of a new learnable regularity or compressibility in the sound stream: a novel pattern! Sure, this song will get boring over time, but not yet.

Illustration of the low Kolmogorov complexity (or algorithmic simplicity). Photo credit: Jürgen Schmidhuber

All of this perfectly fits our principle: the current compressor of the observer tries to compress his history of acoustic and other inputs where possible. The action selector tries to find history-influencing actions such that the continually growing historic data allows for improving the compressor’s performance. The interesting musical and other subsequences are precisely those with previously unknown yet learnable types of regularities, because they lead to compressor improvements. The boring patterns are those that are either already perfectly known or arbitrary or random, or whose structure seems too hard to understand. Similar statements not only hold for other dynamic art including film and dance (taking into account the compressibility of controller action sequences), but also for painting and sculpture, which also cause dynamic pattern sequences due to attention-shifting actions of the observer.

How does the theory explain the nature of inductive sciences such as physics? If the history of the entire universe were computable, and there is no evidence against this possibility, then its simplest explanation would be the shortest program that computes it. Unfortunately there is no general way of finding the shortest program computing any given data. Therefore physicists have traditionally proceeded incrementally, analyzing just a small aspect of the world at any given time, trying to find simple laws that allow for describing their limited observations better than the best previously known law, essentially trying to find a program that compresses the observed data better than the best previously known program. An unusually large compression breakthrough deserves the name discovery. For example, Newton’s law of gravity can be formulated as a short piece of code that allows for substantially compressing many observation sequences involving falling apples and other objects. Although its predictive power is limited — for example, it does not explain quantum fluctuations of apple atoms — it still allows for greatly reducing the number of bits required to encode the data stream by assigning short codes to events that are predictable with high probability under the assumption that the law holds. Einstein’s general relativity theory yields additional compression progress as it compactly explains many previously unexplained deviations from Newton’s predictions. Most physicists believe there is still room for further advances, and this is what is driving them to invent new experiments unveiling novel, previously unpublished patterns. Physicists are just following their compression progress drive!
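The falling-apple example can be made concrete with a toy calculation. The sketch below is an illustration under loud assumptions, not a method from the interview: the simulated measurements, the measurement error of at most 2 mm, the fixed-width integer code, the 32-bit cost of storing the law's constant, and the names `bits_per_value` and `G_MM` are all hypothetical. It compares the bits needed to store raw drop distances with the bits needed to store only the deviations from d = g·t²/2.

```python
import math
import random

def bits_per_value(values):
    """Width in bits of a fixed-width signed integer code covering all values."""
    largest = max(abs(v) for v in values)
    return math.ceil(math.log2(largest + 1)) + 1  # +1 for the sign bit

G_MM = 9810   # assumed gravitational acceleration in mm/s^2
LAW_BITS = 32  # assumed cost of storing the law's single constant

random.seed(0)
times = [i * 0.02 for i in range(1, 101)]  # 100 time stamps up to 2 s

def predicted(t):
    """Drop distance in mm predicted by the law d = g * t^2 / 2."""
    return round(0.5 * G_MM * t * t)

# Simulated observations: the law plus a small integer measurement error.
observed = [predicted(t) + random.randint(-2, 2) for t in times]

# Code 1: store every observation directly.
raw_bits = bits_per_value(observed) * len(observed)

# Code 2: store the law once, then only the deviations from its predictions.
residuals = [obs - predicted(t) for obs, t in zip(observed, times)]
model_bits = LAW_BITS + bits_per_value(residuals) * len(residuals)

print("bits without the law:", raw_bits)
print("bits with the law:   ", model_bits)
```

The second code is much shorter because events that are predictable under the law need only a few bits each. In the terms used above, the difference between the two totals is the compression progress that the law delivers on this data.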


How does the compression progress drive explain humor? Some subjective observers who read a given joke for the first time may think it is funny. Why? As the eyes are sequentially scanning the text the brain receives a complex visual input stream. The latter is subjectively partially compressible as it relates to the observer’s previous knowledge about letters and words. That is, given the reader’s current knowledge and current compressor, the raw data can be encoded by fewer bits than required to store random data of the same size. The punch line at the end, however, is unexpected. Initially this failed expectation results in sub-optimal data compression — storage of expected events does not cost anything, but deviations from predictions require extra bits to encode them. The compressor, however, does not stay the same forever. Within a short time interval, its learning algorithm improves its performance on the data seen so far, by discovering the non-random, non-arbitrary and therefore compressible pattern relating the punch line to previous text and previous knowledge of the reader. This saves a few bits of storage. The number of saved bits (or a similar measure of learning progress) becomes the observer’s intrinsic reward, possibly strong enough to motivate him to read on in search for more reward through additional yet unknown patterns. The recent joke, however, will never be novel or funny again.

h+: If intelligent machines were created tomorrow, what sort of implications do you think that would have for humanity and civilization?

JS: Gödel machines and the like will rapidly improve themselves and become incomprehensible. It’s a bit like asking an ant of 10 million years ago: If humans were created tomorrow, what sort of implications do you think that would have for all the ant colonies? In hindsight we know that many ant colonies are still doing fine, but some of them (for example, those in my house) have goal conflicts with humans, and live dangerously.

 

22 Responses

  1. Anonymous says:

    neat. because the world needs better intentionality simulators.

  2. Fred says:

    His team now has the best method for connected handwriting recognition. It uses a self-learning recurrent neural network that won several handwriting competitions at ICDAR 2009. But when you study papers on this at http://www.idsia.ch/~juergen/rnn.html you’ll find the recurrent network is not trained by those super-universal learning algorithms, but by “greedy” techniques such as gradient descent. Even in his own lab, being practical is sometimes preferred over being theoretically optimal :-)

  3. Anonymous says:

    One of the most interesting contemporary researchers in the field of artificial intelligence. It’s a shame that he’s frequently overshadowed by superficial researchers in the US who just get more attention by the press.

  4. David says:

    If his universal problem solver proves the TSP in provably fastest time plus a constant, can’t this be used to solve P<>NP?

  5. If the programs improve themselves based on a user-defined definition of “better” then we shouldn’t have too much trouble controlling them.

  6. Joe The User says:

    Ha, ha, ha,

    I get it. He’s a mathematician/performance artist parading a series of true mathematical theories for the specification of general problem solving. The only problem, though perhaps only the mathematically sophisticated get it, is that all the constants he describes go way, way beyond the size and age of the universe. It’s a clever reductio ad absurdum, undermining the plausibility of any real AGI.

    You can specify a machine that solves every problem easily, but it takes the age of the universe to solve its first problem. You can specify a method of improving that, but the improvements take the age of the universe too, except for those that optimize for your constraint. If you pick the right constraint, you’re done. You just need another machine to find it, but guess how long that search takes?

    Gregory Chaitin should start looking for funding for the calculation of his constant… Lol

    This kind of stuff is useful but I think useful for seeing the opposite point – that all AI is heuristics, anything higher level is just a heuristic at a different level. Solving math problems in general is useless. Everything about the brain is heuristic. It is based on small and large invariants of life on earth, some things we understand and some things we don’t understand. The recent article on the fly’s visualizing algorithm shows where things are really going.

  7. Tim says:

    As someone said at the Singularity Summit: Jürgen Schmidhuber is the Leonardo da Vinci of our era! Brilliant scientist and artist at the same time. Have you seen his drawing of a self-similar woman at 24:50 of his talk? Made from pure mathematics! Blows me away.

  8. Michele says:

    Overshadowed? Are you perhaps referring to Genetic Programming (GP)? Schmidhuber had meta-GP (using GP to evolve better GP) as an undergrad student in 1987, three years before Koza had plain GP: http://www.idsia.ch/~juergen/diploma.html . But it’s Koza who’s getting the GP citations :-)
    Schmidhuber seems to consider GP as a sin of his youth though. More recently he wrote the “Optimal Ordered Problem Solver” for automatically learning programs: http://www.idsia.ch/~juergen/oops.html . Shining bright and casting shadows of his own :-)

  9. Michele says:

    No, this can’t be used to solve P<>NP, because you still don’t know whether the initially unknown fastest algorithm itself is in P or in NP.

  10. Joe The User says:

    The program could certainly be used to prove P<>NP if such a proof is among the proofs one could elaborate in a list of all proofs. But the proverbial thousand monkeys with typewriters and an automatic proof-checker could also be used to prove P<>NP. The problem is that both these methods will take the life of the universe to complete – except that his method will be very fast once the program is optimized for the given criterion. But the optimization process will sadly take the life of the universe to complete also … wah, wah, wah. It’s like Alan Turing and Kurt Gödel as stand-up comedians. “Take my theorem prover…please”

  11. Rob says:

    Have you read where he addresses this issue and the remaining open question: “The basic principles of the new methods are very simple. This makes me optimistic that the answer to an essential remaining open question is also simple: If an intelligent agent can execute only a fixed number of computational instructions per unit time interval (say, 10 trillion elementary operations per second), what is the best way of using them to get as close as possible to the recent theoretical limits of universal AIs?” The answer to this question would lead us to optimal practical AI, and straight into the singularity! Even a human brain might be able to do 10 trillion; he thinks “theory and practice will converge soon.”

  12. AI researchers bring a smile to my face. Never have so many people been paid so much to do the obviously impossible. The brain does not work like a machine. The best you can ever get with “AI” is a really nice demonstration of a reductionist mechanical model of the brain. That will never be anything like a real brain, and it will never produce anything like intelligence.

  13. Anonymous says:

    Read more, respond less

  14. Anonymous says:

    If it’s obviously impossible then you must have a proof of its impossibility. Publishing such a proof would make you famous. And yet, I have never heard of you.

    You may want to consider that a microprocessor is very unlike an abacus, and yet the same mathematics can be done with both.

    Saying intelligence can only be produced by the brain is a religious position, not a scientific one.

  15. Anonymous says:

    “The recent joke, however, will never be novel or funny again.”

    You might be a great mathematician but your knowledge of human behaviour leaves a lot to be desired. We’ve all laughed at jokes more than once and not just because we’ve forgotten them.

    “why are some songs interesting to some observer? Not the song he just heard ten times in a row. It became too predictable in the process.”

    Rubbish. Plenty of us listen to the same song again and again in some cases for years.

    I think your theory on how the mind works needs a re-think (assuming your compressor hasn’t crashed).

  16. Hans & Franz says:

    Here is a short version of his incredible talk, only 10 minutes:

  17. Gunnar says:

    He comes across as damn cocky and arrogant. His webpage is full of “i’m so fucking special” talk, – this leads me to take everything he says with quite a grain of salt. He is clearly a good PR person though.
    His student, Marcus Hutter, seems slightly more in touch with reality. He guest blogs here, among other places: http://hunch.net/?p=727

  18. A high school biology class gives a perfect proof of the impossibility of a machine brain. This proof may be articulated with greater precision in further study of biology, but it flows from biological fundamentals. How do you propose to simulate a biological system with an infinite range of possible states using a finite machine? Not possible. The only thing you can simulate is a reductionist model of the brain, which is not the same thing. Writing a proof of this would not make me famous, because I just did, and I am not famous!

  19. Anonymous says:

    “a biological system with an infinite range of possible states using a finite machine”

    You’re kidding, right? Surely you’re not claiming that a specific brain has “infinite possibilities.” In the abstract, as a species over infinite time you might claim that, but then it is due to the infinite time, not to the brain.

    A large range of states, even as large as that of the human brain, is not infinite. It’s just very large. You are making a magical-religious claim, not a scientific one.

  20. Anonymous says:

    You’re pulling infinities out of your ass.

  21. Hans & Franz says:

    Come on, if you look more closely you’ll see he is making fun of himself. See video link below. Ve like zat!

  22. Anonymous says:

    i think this may be a misunderstanding – when somebody tells you the same joke again then it may be funny because it’s actually not the same joke any more – maybe you did not expect the repetition, and extra bits were necessary to store the data when it arrived, but then learning kicks in, improving subjective data compressibility by linking back to the previous data, and you save a few bits, and that’s fun again… similarly for running gags in changing contexts that add new patterns… similarly for different interpretations of the same piece of music… but which healthy person does not get bored by listening without pause to the same recording 100 times in a row? once you know it by heart, don’t you want to switch to a different though maybe similar recording?
