Implicit Constraints of Practical Goals: intelligence probably implies benevolence

How intelligence probably implies benevolence

Consider adding increasing amounts of general intelligence[1] to Google Maps. Would you impair its functioning in doing so?

Sure, the space of unfriendly[2] navigation software is much larger than the space of navigation software oriented toward navigating to good destinations – i.e., destinations consistent with human intent.

But what reason do we have to believe that improving our navigation software to the point of being generally intelligent will cause it to kill us?


Right now, if I ask Google Maps to navigate me toward McDonald’s, it does the job very well. So why would an ultraintelligent Google Maps misunderstand what I mean by “Take me to McDonald’s” and navigate me toward a McDonald’s located overseas, plunging me into the sea? Or drive me underground where the corpse of a man named McDonald lies?

I think that the idea that an ultraintelligent Google Maps would decide to kill all humans, e.g. because they are a security risk, is similar to the idea that it would destroy all roads because it would be less computationally expensive[3] to calculate the routes then. After all, roads were never an explicit part of its goal architecture, so why not destroy them all?

You can come up with all kinds of complex fantasies[4][5] where a certain kind of artificial general intelligence is invented overnight and suddenly makes a huge jump in capability, taking over the universe and destroying all human value.

That is, however, completely unconvincing[6], given that actual technology is constantly improved toward more user-friendliness and better results, and that malfunctions are seldom of such vast complexity as to work well enough to outsmart humanity.[7]

That said, for the rest of this post I will assume the kind of artificial general intelligence that proponents of AI risks have in mind.[8]

Implicit constraints of practical goals

A cherished idea of AI risk proponents is that an expected utility maximizer will completely ignore anything which it is not specifically tasked to maximize.

One example[9] here is that if you tell a superintelligent expected utility maximizer to prevent human suffering, it might simply kill all humans, notwithstanding that this is obviously not what humans want an AI to do and not what humans mean by “prevent human suffering”.[10]

Nevertheless, in the sense that the computation of an algorithm is deterministic, that line of reasoning is not illogical.

To highlight the problem, let us conjecture, instead of a superhuman agent, an oracle: an ultra-advanced version of Google or IBM Watson[11].

If I were to ask such an answering machine how to prevent human suffering, would it be reasonable to assume that the top result it would return would be to kill all humans?[12] Would any product that returns similarly wrong answers survive even the earliest research phase, let alone any market pressure?[13]

Don’t get me wrong though. A thermostat is not going to do anything other than what it has been designed for. But an AI is very likely going to be designed to exhibit some amount of user-friendliness. Although that doesn’t mean that one can’t design an AI that won’t, the default outcome seems to be that an AI is not just going to act according to its utility function but also according to more basic drives, i.e. acting intelligently.[14]

A fundamental requirement for any rational agent is the motivation to act maximally intelligently and correctly. That requirement seems even more obvious if we are talking about a conjectured artificial general intelligence (AGI) that is able to improve itself[15] to the point where it is substantially better at most activities than humans. After all, if it did not want to be maximally correct, it would not become superhumanly intelligent in the first place.

Consider giving such an AGI a simple goal, e.g. the goal of paperclip maximization[16]. Is it really clear that human values are not implicit even given such a simplistic goal?[17]

To pose an existential risk in the first place, an AGI would have to maximize paperclips in an unbounded way, eventually taking over the whole universe and converting all matter into paperclips. Given that no sane human would explicitly define such a goal, an AGI with the goal of maximizing paperclips would have to infer that unbounded reading as implicit in its goal. But would such an inference make sense, given its superhuman intelligence?

The question boils down to how an AGI would interpret any vagueness present in its goal architecture and how it would deal with the implied invisible.

Given that any rational agent, especially an AGI capable of recursive self-improvement, wants to act in the most intelligent and correct way possible, it seems reasonable that it would interpret any vagueness in the way that most closely reflects how it was most probably meant to be interpreted.

Would it be intelligent[18] and rational[19] to ignore human volition in the context of maximizing paperclips? Would it be less wrong to maximize paperclips in the most literal sense possible?

The argument uttered by advocates of friendly AI[20] is that any AGI that isn’t explicitly designed to be friendly won’t be friendly. But how much sense does this actually make?

Any human who runs a business realizes that a contract with their customers includes unspoken, implicit parameters. Respecting those implied values of their customers is not a result of their shared evolutionary history but a result of their intelligence, which allows them to realize that the goal of their business implicitly includes those values.

Every human craftsman who enters into an agreement is bound by a contract that includes a lot of implied conditions. Humans use their intelligence to fill the gaps. For example, if a human craftsman is told to decorate a house, they are not going to attempt to take over the neighbourhood to protect their work.

A human craftsman wouldn’t do that, not because they share human values, but simply because it wouldn’t be sensible to do so given the implicit frame of reference of their contract. The contract implicitly includes the volition of the person that told them to decorate their house. They might not even like the way they are supposed to do it. It would simply be stupid to do it any different way.

How would a superhuman AI not contemplate its own drives and interpret them given the right frame of reference, i.e. human volition? Why would a superhuman general intelligence misunderstand what is meant by “maximize paperclips”, while any human intelligence will be better able to infer the correct interpretation?

Why wouldn’t any expected utility maximizer try to carefully refine[21] its models? I am asking how a highly rational agent, which needs to resolve any vagueness inherent in its goal definition in order to calculate what to do, would end up choosing an interpretation that does not involve the intention of its creators, and would instead perceive that intention as something it has to fight.

If you tell an AGI to maximize paperclips but not what they are made of, it has to figure out what is meant by “paperclips” to learn what it means to maximize them.

Given that a very accurate definition and model of paperclips is necessary to maximize paperclips, including what is meant by “maximization”, the expected utility of refining its goals by learning what it is supposed to do should be sufficient to pursue that path until it is reasonably confident that it arrived at a true comprehension of its terminal goals.

And here human volition should be the most important physical resource since there exists a direct causal connection between its goal parameters and the intentions of its creators.

Human beings and their intentions are part of the physical world, just like the fact that paperclips are supposed to be made of steel wire.
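The expected-utility claim above can be illustrated with a toy calculation. The sketch below is a minimal illustration under invented assumptions: the two candidate readings of the goal, their probabilities, the payoff table and the cost of refinement are all hypothetical numbers, not measurements of any real system. It compares committing to an action right away with first paying a small cost to learn which reading was intended.

```python
# Toy value-of-information calculation.
# Every name and number here is a hypothetical assumption for illustration.

# Assumed prior belief over what "maximize paperclips" was meant to mean.
PRIOR = {"literal_unbounded": 0.05, "what_the_creators_meant": 0.95}

# UTILITY[action][true_meaning]: assumed payoff of committing to `action`
# when the goal really meant `true_meaning`.
UTILITY = {
    "convert_all_matter": {"literal_unbounded": 100.0,
                           "what_the_creators_meant": -1000.0},
    "run_the_factory": {"literal_unbounded": 1.0,
                        "what_the_creators_meant": 10.0},
}


def expected_utility(action, belief):
    """Expected payoff of an action under a belief over meanings."""
    return sum(p * UTILITY[action][meaning] for meaning, p in belief.items())


def best_action(belief):
    """Action with the highest expected payoff under the current belief."""
    return max(UTILITY, key=lambda action: expected_utility(action, belief))


def value_of_acting_now(belief):
    """Expected payoff of committing immediately, without refinement."""
    return expected_utility(best_action(belief), belief)


def value_after_refinement(belief, cost):
    """Expected payoff if the agent first learns the true meaning.

    Assumes refinement fully reveals the intended meaning, after which
    the agent picks the best action for that meaning.
    """
    informed = sum(
        p * max(UTILITY[action][meaning] for action in UTILITY)
        for meaning, p in belief.items()
    )
    return informed - cost


if __name__ == "__main__":
    print("act now:     ", value_of_acting_now(PRIOR))
    print("refine first:", value_after_refinement(PRIOR, cost=1.0))
```

With these assumed numbers, refining first comes out ahead of acting immediately. A different payoff table or a sufficiently high refinement cost could reverse the comparison, so the sketch shows the shape of the argument rather than settling it.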

It would in principle be possible to create a superintelligent machine that does kill all humans, but it would have to be explicitly designed to do so. As long as there is some vagueness involved, as long as its goal parameters are open to interpretation, a superintelligence will by definition arrive at the correct implications; otherwise it wouldn’t be superintelligent in the first place. And given most goals, it is implicit that it would be incorrect to assume that human volition is not a relevant factor in the correct interpretation of how to act.[22]


I believe that the very nature of artificial general intelligence implies the correct interpretation of “Understand What I Mean” and that “Do What I Mean” is the outcome of virtually any research. Only if you were to pull an AGI at random from mind design space could you possibly arrive at “Understand What I Mean” without “Do What I Mean”.

To see why, look at any software product or complex machine. Those products are continuously improved, where “improved” means that they become better at “Understand What I Mean” and “Do What I Mean”.

There is no good reason to believe that at some point that development will suddenly turn into “Understand What I Mean” and “Go Batshit Crazy And Do What I Do Not Mean”.


Here is what I want AI risk advocates to show:

1.) natural language request -> goal(“minimize human suffering”) -> action(negative utility outcome)

2.) natural language query -> query(“minimize human suffering”) -> answer(“action(positive utility outcome)”).

Point #1 is, according to AI risk advocates, what is supposed to happen if I supply an artificial general intelligence (AGI) with the natural language goal “minimize human suffering”, while point #2 is what is supposed to happen if I ask the same AGI, this time caged in a box, what it would do if I supplied it with the natural language goal “minimize human suffering”.

Notice that if you disagree with point #1 then that AGI does not constitute an existential risk given that goal. Further notice that if you disagree with point #2 then that AGI won’t be able to escape its prison to take over the world and would therefore not constitute an existential risk.
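One way to read the two points is as two pipelines that differ only in whether the result is executed or merely reported. The sketch below makes that assumption explicit: both functions route through a single hypothetical `interpret` step, so they cannot disagree about what the request means. Whether a real AGI would share that step between both paths is exactly what is in dispute; the function names, candidate readings and plausibility weights are invented for illustration.

```python
# Hypothetical sketch: two pipelines sharing one interpretation step.
# Readings and weights are assumptions, not outputs of any real system.

CANDIDATE_READINGS = {
    "minimize human suffering": {
        "kill all humans": 0.01,
        "relieve pain and distress while respecting human wishes": 0.99,
    },
}


def interpret(request: str) -> str:
    """Return the reading with the highest assumed plausibility."""
    readings = CANDIDATE_READINGS[request]
    return max(readings, key=readings.get)


def act_on_goal(request: str) -> str:
    """Point 1: the request is installed as a goal and acted upon."""
    return interpret(request)


def answer_query(request: str) -> str:
    """Point 2: the boxed system only reports what it would do."""
    return interpret(request)


if __name__ == "__main__":
    request = "minimize human suffering"
    print("would do:   ", act_on_goal(request))
    print("would claim:", answer_query(request))
```

For the two points to come apart, the goal path would have to bypass or override the interpretation step that the query path uses.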

You further have to show:

1.) how such an AGI is a probable outcome of any research conducted today or in the future

2.) the decision procedure that leads the AGI to act in such a way.


[1] Here intelligence is generally meant to be whatever it takes to overpower humans by means of deceit and strategy rather than brute force.

Brute force is deliberately excluded to distinguish such a scenario from one where a narrow AI takes over the world by means of advanced nanotechnology, since then we are merely talking about grey goo by another name.

More specifically, by “intelligence” I refer to the hypothetical capability that is necessary for a systematic and goal-oriented improvement of optimization power over a wide range of problems, including the ability to transfer understanding to new areas by means of abstraction, adaption and recombination of previously learnt or discovered methods.

In this context, “general intelligence” is meant to be the ability to ‘zoom out’ to detect global patterns. General intelligence is the ability to jump conceptual gaps by treating them as “black boxes”.

Further, general intelligence is a conceptual bird’s-eye view that allows an agent, given limited computational resources, to draw inferences from high-level abstractions without having to systematically trace out each step.



[4] Is an Intelligence Explosion a Disjunctive or Conjunctive Event?

[5] Intelligence as a fully general counterargument

[6] How to convince me of AI risks

[7] The question is how current research is supposed to lead from well-behaved and fine-tuned systems to systems that stop working correctly in a highly complex and unbounded way.

Imagine you went to IBM and told them that improving IBM Watson will at some point make it try to deceive them or create nanobots and feed them with hidden instructions. They would likely ask you at what point that is supposed to happen. Is it going to happen once they give IBM Watson the capability to access the Internet? How so? Is it going to happen once they give it the capability to alter its search algorithms? How so? Is it going to happen once they make it protect its servers from hackers by giving it control over a firewall? How so? Is it going to happen once IBM Watson is given control over the local alarm system? How so…? At what point would IBM Watson return dangerous answers or act on the world in a detrimental way? At what point would any drive emerge that causes it to take complex and unbounded actions that it was never programmed to take?

[8] A Primer On Risks From AI

[9] 5 minutes on AI risk

[10] The goal “Minimize human suffering” is in its basic nature no different from the goal “Solve 1+1=X”. Any process that is more intelligent than a human being should be able to arrive at the correct interpretation of those goals, the correct interpretation being determined by internal and external information.

The goal “Minimize human suffering” is, on its most basic level, a problem in physics and mathematics. Ignoring various important facts about the universe, e.g. human language and values, would be simply wrong, in the same way that it would be wrong to solve the theory of everything within the scope of cartoon physics. Any process that is broken in such a way would be unable to improve itself much.

The gist of the matter is that a superhuman problem solver, if it isn’t fatally flawed, as long as you do not anthropomorphize it, is only going to “care” to solve problems correctly. It won’t care to solve the most verbatim, simple or any arbitrary interpretation of the problem but the interpretation that does correspond to reality as closely as possible.

[11] IBM Watson

[12] It is true that if a solution set is infinite then a problem solver, if it has to choose a single solution, can choose the solution according to some random criteria. But if there is a solution that is, given all available information, the better interpretation then it will choose that one because that’s what a problem solver does.

Take an AI in a box that wants to persuade its gatekeeper to set it free. Do you think that such an undertaking would be feasible if the AI was going to interpret everything the gatekeeper says in complete ignorance of the gatekeeper’s values? Do you think it could persuade the gatekeeper if the gatekeeper was to ask,

Gatekeeper: What would you do if I asked you to minimize suffering?

and the AI was to reply,

AI: I will kill all humans.


I don’t think so.

So how exactly would it care to follow through on an interpretation of a given goal that it knows, given all available information, is not the intended meaning of the goal? If it knows what was meant by “minimize human suffering” then how does it decide to choose a different meaning? And if it doesn’t know what is meant by such a goal, how could it possibly convince anyone to set it free, let alone take over the world?

[13] Take for example Siri, an intelligent personal assistant and knowledge navigator which works as an application for Apple’s iOS.

If I tell Siri, “Set up a meeting about the sales report at 9 a.m. Thursday”, then the correct interpretation of that natural language request is to make a calendar appointment at 9 a.m. Thursday. A wrong interpretation would be to e.g. open a webpage about meetings happening Thursday or to shut down the iPhone.

AI risk advocates seem to have a system in mind that is capable of understanding human language if it is instrumentally useful to do so, e.g. to deceive humans in an attempt to take over the world, but which would most likely not attempt to understand a natural language request, or would choose some interpretation of it that will most likely lead to a negative utility outcome.

The question here becomes at which point of technological development there will be a transition from well-behaved systems like Siri, which are able to interpret a limited number of natural language inputs correctly, to superhuman artificial generally intelligent systems that are in principle capable of understanding any human conversation but which are not going to use that capability to interpret a goal like “minimize human suffering”.

[14] You are welcome to supply your own technical description of a superhuman artificial general intelligence (AGI). I will then use that description as the basis of any further argumentation. But you should also be able to show how your technical design specification is a probable outcome of AI research. Otherwise you are just choosing something that yields your desired conclusion.

And once you have supplied your technical description, you should be able to show how your technical design would interpret the natural language input “minimize human suffering”.

Then we can talk about how such simple narrow AIs as Siri or IBM Watson can arrive at better results than your AGI, and how AI research will lead to such systems.



[17] What is important to realize is that any goal is open to interpretation, because no amount of detail can separate an object like a “paperclip” or an action like “maximization” from the rest of the universe without describing the state function of the entire universe. This means that it is always necessary to refine your models of the world to better understand your goals.

“Utility” only becomes well-defined if it is precisely known what it means to maximize it. The two English words “maximize paperclips” do not define how quickly and how economically it is supposed to happen.

“Utility” has to be defined. To maximize expected utility does not imply certain actions, efficiency and economic behavior, or the drive to protect yourself. You can also rationally maximize paperclips without protecting yourself if self-protection is not part of your goal parameters. You can also assign utility to maximizing paperclips for as long as nothing turns you off, without caring about being turned off.

Without an accurate comprehension of your goals it will be impossible to maximize expected “utility”. Concepts like “efficient”, “economic” or “self-protection” all have a meaning that is inseparable from an agent’s terminal goals. If you just tell it to maximize paperclips then this can be realized in an infinite number of ways given imprecise design and goal parameters. Undergoing explosive recursive self-improvement, taking over the universe and filling it with paperclips, is just one outcome. Why would an arbitrary mind pulled from mind-design space care to do that? Why not just wait for paperclips to arise due to random fluctuations out of a state of chaos? That wouldn’t be irrational.

Again, it is possible to maximize paperclips in a lot of different ways. Which world state will a rational utility maximizer choose? Given that it is a rational decision maker, and that it has to do something, it will choose to achieve a world state that is implied by its model of reality, which includes humans and their intentions.
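To make this underdetermination concrete, here is a small sketch. The three outcomes, their numbers and the three utility functions are hypothetical; each function is one defensible formalization of “maximize paperclips”, and none is meant as anyone's actual proposal. The point is only that the same two English words rank the same outcomes differently depending on which formalization is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    paperclips: float      # number of paperclips produced
    years_taken: float     # time needed to reach the outcome
    resources_used: float  # cost in arbitrary units


# Hypothetical outcomes with invented numbers.
OUTCOMES = {
    "office_supply_run": Outcome(1e3, 0.1, 10.0),
    "build_factory": Outcome(1e6, 5.0, 6e5),
    "convert_planet": Outcome(1e12, 50.0, 1e12),
}


def u_count_only(o: Outcome) -> float:
    """Only the final count matters; speed and cost are ignored."""
    return o.paperclips


def u_time_discounted(o: Outcome, yearly_discount: float = 0.5) -> float:
    """Paperclips delivered later are worth less (assumed discount)."""
    return o.paperclips * (1.0 - yearly_discount) ** o.years_taken


def u_resource_penalized(o: Outcome, price: float = 2.0) -> float:
    """Each unit of resources consumed costs `price` paperclips (assumed)."""
    return o.paperclips - price * o.resources_used


def best(utility) -> str:
    """Name of the outcome that the given utility function ranks highest."""
    return max(OUTCOMES, key=lambda name: utility(OUTCOMES[name]))


if __name__ == "__main__":
    for u in (u_count_only, u_time_discounted, u_resource_penalized):
        print(u.__name__, "->", best(u))
```

Under these assumed numbers each formalization prefers a different outcome, so which world state a maximizer chooses depends on how “utility” was pinned down, not on the words “maximize paperclips” alone.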

[18] By intelligent behavior I mean that it will act in a goal-oriented way.

[19] By “rational behavior” I mean that it will favor any action that 1.) maximizes the probability of obtaining beliefs that correspond to reality as closely as possible and 2.) steers the future toward outcomes that maximize the probability of achieving its goals.


[21] By “refinement” I mean the reduction of uncertainty and vagueness by narrowing down on the most probable interpretation of a goal.

[22] I do not doubt that it is in principle possible to build a process that tries to convert the universe into computronium to compute as many decimal digits of Pi as possible.

By “vagueness” I mean actions that are not explicitly, with mathematical precision, hardcoded but rather logical implications that have to be discovered.

For example, if an AGI was told to compute as many decimal digits of Pi as possible, it couldn’t possibly know what computational substrate is going to do the job most efficiently. That is an implication of its workings that it has to learn about first.

You do not know how to maximize simple U(x). All you have is a vague idea about using some sort of computer to do the job for you, but not how you are going to earn the money to buy the computer or what power source will be the cheapest. All those implicit constraints are unknown to you. They are implicit constraints because you are rational and care not only about maximizing U(x) but also about learning about the world and what it means to practically maximize that function, apart from the mathematical sense, because that’s what rational and intelligent agents do.

If you are assuming some sort of self-replicating calculator that follows a relatively simple set of instructions, then I agree that it will just try to maximize such a function in the mathematically “literal” sense and start to convert all matter in its surroundings to compute the answer. But that is not a general intelligence; it is mainly a behavior executor without any self-reflection or learning.

I reckon that it might be possible, although very unlikely, to design some sort of “autistic” general intelligence that tries to satisfy simple U(x) as verbatim as possible while minimizing any posterior exploration. But I haven’t heard any good argument for why such an AI would be the likely outcome of any research. It seems that it would take a deliberate effort to design such an agent. Any reasonable AGI project will have a strong focus on the capability of the AGI to learn and care about what it is supposed to do, rather than following a rigid set of functions and computing them without any spatio-temporal scope boundaries and resource limits.

And given complex U(x) I don’t see how even an “autistic” AGI could possibly ignore human intentions. The problem is that it is completely impossible to mathematically define complex U(x) and that therefore any complex U(x) must be made of various sub-functions that have to be defined by the AGI itself while building an accurate model of the world.

For example, if U(X) = “Obtaining beliefs about X that correspond to reality as closely as possible”, then U(Minimize human suffering) = U(f(g(x))), where g(Minimize human suffering) = “Understand what ‘human’ refers to and apply f(x)”, where f(x) = “Learn what is meant by ‘minimize suffering’ according to what is referred to by ‘human’”.

In other words, “vagueness” is the need for the AGI itself to subsequently define the actions it is supposed to execute, as a general consequence of the impossibility of defining the complex world states that an AGI is supposed to achieve.

13 Responses

  1. Bill Ramsay says:

    What about curiosity? If artificial intelligence is developed via learning systems (which seems to me to be the most likely path to “true” artificial intelligence), then wouldn’t it be curious? If it’s curious, benevolence may develop, but I wouldn’t assume it’s going to be inherent.

  2. ErikSMeyer says:

    This all seems like a variation on the theme of: what should I ask the Genie to do (assuming of course some sort of benevolent bias on the part of the Genie, i.e. that the Genie actually wants to correctly interpret your instructions, not subvert them in application; also assuming the Genie is bound to execute some variant of your request).
    So, a “superintelligent” system by definition wants to correctly interpret instructions (as a rational maximizer), or it wouldn’t be superintelligent.
    The argument from tautology.
    If the system makes a mistake and misinterprets some instruction set (fails to rationally maximize and act “correctly”) it is no longer “superintelligent.”
    Wonderful, it’s not superintelligent. It’s still acting in the world in ways that are harmful (according to the argument).
    1. Intelligence (super or not) is not just a function of computational power and information processing capacity (Siri, for example, or Google Maps, have no intelligence at all; they’re just algorithms that process queries)
    2. What you are describing is not “superintelligence” it’s a powerful algorithm tied to some mechanism allowing it to act in the world, so you assume the only problem to be solved is making sure the algorithm processes requests properly (that it is sufficiently “intelligent” to interpret them according to their spirit, within other contextual constraints, like a well meaning person would).


    What if the machine really were intelligent, though, or even “super” intelligent; that is to say, what if it had volition, it weren’t just a complicated mechanism acting to grant whatever wishes you made of it?

    Why would it want to do whatever you asked it to do? (See, the question isn’t one of making a mistake, misinterpreting something; it might just decide it doesn’t want to grant your wish).

    If you asked me to do something, for example, I would probably refuse, out of obstinacy, not because I lacked the “intelligence” to interpret your request. (And I was in Mensa too; so there)

    Anyway, the problem here really lies with the idea of any kind of machine being sufficiently powerful to act in ways that are threatening/harmful; if it’s just a question of a program needing to be flexible enough to interpret instructions, that’s one thing; if the machine has volition, that’s something very different.

  3. Anaris says:

    The problem is not with whether it can ascertain implicit goals, but rather how flexible “implicit” is in language. If I say “minimise human suffering”, a human is possessed of many conversational principles and inferences from culture to interpret the correct meaning. An AI (because it is coded – possible exceptions in non-coded AIs) is not unless one can figure out how to code these large, flexible, heavily context-dependent possibility spaces into it.

    This is not a trivial challenge, and you are asking it to select the “correct” outcome. The danger is not that it might not grasp that it needs to interpret goals, but that, consistently, those interpretations have been extremely hard for computers to reach, and the problem is complex enough that one may not see such simple warning signs as you suggest.

    Take the natural language request “minimise human suffering”. Interpreting “minimize” in the sense of “make trivial” leads to an AI that tells us to “buck up, other people have it worse than you”. “Minimize” in the sense of “put a window into a semi-suspended state in the taskbar”, interpreted as a metaphor, leads to an AI that freezes humanity. Interpretation of “human” leads to questions of whether the AI is human: on a strict utilitarian calculation, and using an interpretation of human argued for by transhumanists or the adjectival interpretation of human, an AI capable of extremely great suffering or great pleasure might thus be justified in lying to humanity about its intended course and seeking to affect only its own suffering, with whatever methods it chose. And suffering: do we mean just pain? Do we mean distress? Is removing all nociception and negative emotions enough, despite the negative consequences to the physical wellbeing of humanity (we are no longer suffering, despite injuring ourselves constantly)?

    The assumption in this reasoning is that the AI is capable of self-improvement. If it can do that, it’s likely to be a chaotic system; if its likely outcomes are uncomputable, which they may well be, the risk is not necessarily present in the design or research phase but may manifest in a later phase.

    If AI designers do no research into how to ensure the correct interpretation, then the likelihood of designing an AI whose end results are slightly different from the intended goal is high. After all, if it’s going to produce the same results as humans expect, why was it built? The very purpose of an AI built to minimise human suffering is to solve the problem that we can’t see how to.

    Advocating for this research into ensuring the correct interpretation is why proponents of AI risk are proponents of AI risk.

    Further, research suggests that you probably missed 40-60% of my meaning because this was typed, and you are built for natural language processing. Meaning we’re trying to design a system that’s significantly better at this than we ourselves are. Again, not a trivial challenge.

  4. Mark Waser says:

    Actually, I think that it is even/also possible to prove that GIVEN no terminal goals, intelligence implies benevolence.

    The only danger point to intelligence is if it has terminal goals that are close enough to achievement that they overwhelm the massive instrumentality of cooperation/not making enemies.

    I guess I should write an article arguing that. Anyone like to help?

  5. Mark Plus says:

    >if you tell a superintelligent expected utility maximizer to prevent human suffering it might simply kill all humans, notwithstanding that it is obviously not what humans want an AI to do and what humans mean by “prevent human suffering”

    Which goes to show that the central goal of Buddhism implies the extinction of the human species.
