Bostrom on Superintelligence (3): Doom and the Treacherous Turn

This is the third part of my series on Nick Bostrom’s recent book Superintelligence: Paths, Dangers, Strategies. In the first two entries, I looked at some of Bostrom’s conceptual claims about the nature of agency, and the possibility of superintelligent agents pursuing goals that may be inimical to human interests. I now move on to see how these conceptual claims feed into Bostrom’s case for an AI doomsday scenario.

Bostrom sets out this case in Chapter 8 of his book, which is entitled “Is the default outcome doom?”. To me, this is the most important chapter in the book. In setting out the case for doom, Bostrom engages in some very, how shall I put it, “interesting” (?) forms of reasoning. Critics will no doubt latch onto them as weak points in his argument, but if Bostrom is right then there is something truly disturbing about the creation of superintelligent AIs.

Anyway, I’ll be discussing Chapter 8 over the next two posts. In the remainder of this one, I’ll do two things. First, I’ll look at Bostrom’s three-pronged argument for doom. This constitutes his basic case for the doomsday scenario. Then, I’ll look at something Bostrom calls the “Treacherous Turn”. This is intended to shore up the basic case for doom by responding to an obvious criticism. In the course of articulating this treacherous turn, I hope to highlight some of the profound epistemic costs of Bostrom’s view. Those costs may have to be borne — i.e. Bostrom may be right — but we should be aware of them nonetheless.

1. The Three-Pronged Argument for Doom
Bostrom is famous for coming up with the concept of an “existential risk”. He defines this as a risk “that threatens to cause the extinction of Earth-originating intelligent life or to otherwise permanently and drastically destroy its potential for future desirable development” (Bostrom 2014, p. 115). One of the goals of the institute he runs — the Future of Humanity Institute — is to identify, investigate and propose possible solutions to such existential risks. One of the main reasons for his interest in superintelligence is the possibility that such intelligence could pose an existential risk. So when he asks the question “Is the default outcome doom?”, what he is really asking is “Is the creation of a superintelligent AI likely to create an existential risk?”

Bostrom introduces an argument for thinking that it might. The argument is based on three theses, all of which he articulates and defends in the book — two of which we already looked at, and one of which was discussed in an earlier chapter, not included in the scope of this series of posts. The three theses are (in abbreviated form):

(1) The first mover thesis: The first superintelligence, by virtue of being first, could obtain a decisive strategic advantage over all other intelligences. It could form a “singleton” and be in a position to shape the future of all Earth-originating intelligent life.

(2) The orthogonality thesis: Pretty much any level of intelligence is consistent with pretty much any final goal. Thus, we cannot assume that a superintelligent artificial agent will have any of the benevolent values or goals that we tend to associate with wise and intelligent human beings (shorter version: great intelligence is consistent with goals that pose a grave existential risk).

(3) The instrumental convergence thesis: A superintelligent AI is likely to converge on certain instrumentally useful sub-goals, that is, sub-goals that make it more likely to achieve a wide range of final goals across a wide range of environments. These convergent sub-goals include the goal of open-ended resource acquisition (i.e. the acquisition of resources that help it to pursue and secure its final goals).

Bostrom doesn’t set out his argument for existential risk formally, but gives us enough clues to see how the argument might fit together. The first step is to argue that the conjunction of these three theses allows us to reach the following, interim, conclusion:

(4) Therefore, “the first superintelligence may [have the power] to shape the future of Earth-originating life, could easily have non-anthropomorphic final goals, and would likely have instrumental reasons to pursue open-ended resource acquisition” (Bostrom 2014, p. 116)

If we then combine that interim conclusion with the following premise:

(5) Human beings “consist of useful resources (such as conveniently located atoms)” and “we depend for our survival and flourishing on many more local resources” (Bostrom 2014, p. 116).

We can reach the conclusion that:

(6) Therefore, the first superintelligence could have the power and reason to do things that lead to human extinction (by appropriating resources we rely on, or by using us as resources).

And that is, essentially, the same thing as saying that the first superintelligence could pose a significant existential risk. I have mapped out this pattern of reasoning below.

Now, clearly, this doomsday argument is highly speculative. There are a number of pretty wild assumptions that go into it, and critics will no doubt be apt to question them. Bostrom acknowledges this, saying that it would indeed be “incredible” to imagine a project that would build and release such a potentially catastrophic AI into the world. There are two reasons for this incredulity. The first is that, surely, in the process of creating a superintelligent AI, we would have an array of safety measures and test protocols in place to ensure that it didn’t pose an existential threat before releasing it into the world. The second is that, surely, AI programmers and creators would programme the AI to have benevolent final goals, and so it would not pursue open-ended resource acquisition.

These reasons are intuitively attractive. They provide us with some optimism about the creation of artificial general intelligence. But Bostrom isn’t quite so optimistic (though, to be fair, he is pretty sober throughout the book: he doesn’t come across as a wild-eyed doom-monger or as a Pollyanna-ish optimist; he lays out his analysis in a “matter of fact” manner). He argues that when we think about the nature of a superintelligent AI more clearly, we see that neither of these reasons for optimism is persuasive. I’ll look at his response to the first reason for optimism in the remainder of this post.

2. The Problem of the Treacherous Turn
Critics of AI doomsayers sometimes chastise those doomsayers for their empirically detached understanding of AI. The doomsayers don’t pay enough attention to how AIs are actually created and designed in the real world; they engage in too much speculation and too much armchair theorising. In the real world, AI projects are guided by human programmers and designers. These programmers and designers create AIs with specific goals in mind (though some are also interested in creating general intelligences), and they typically test their designs in limited “safe” environments before releasing them to the general public. An example might be the AI that goes into self-driving cars: these AIs are designed with a specific final goal in mind (the ability to safely navigate a car to a given destination), and they are rigorously tested for their ability to do this safely, and without posing a significant risk (“existential” or otherwise) to human beings. The question the critics then ask is: why couldn’t this approach to AI development be followed in all instances? Why couldn’t careful advance testing protect us from existential risk? Let’s call this the “safety test” objection to the doomsday argument:

(7) Safety test objection: An AI could be empirically tested in a constrained environment before being released into the wild. Provided this testing is done in a rigorous manner, it should ensure that the AI is “friendly” to us, i.e. poses no existential risk.

The safety test objection doesn’t function as a rebuttal to any of the premises of Bostrom’s original argument. In other words, it accepts the bare possibility of what Bostrom has to say. It simply argues that there is a simple way to avoid the negative outcomes. Consequently, I view it as a reason to reject the conclusion of Bostrom’s argument.

Is the safety test objection plausible? Bostrom says “no”. To see why, we need to understand the nature of strategic thinking. If I have certain goals I wish to achieve, but I need your cooperation to help me achieve them, and you are unwilling to provide that cooperation because you don’t like my goals, it may be in my interest to convince you that I don’t have those goals. Or to put it more succinctly: if I have some wicked or malevolent intent, it may nevertheless be in my interests to “play nice” so that you can help to put me in a position to implement that malevolent intent. Actually, the point is more general than that. Even if my intentions are entirely benevolent, there may nevertheless be contexts in which it pays to deceive you as to their true nature. Furthermore, the point doesn’t just apply to intentions, it also applies to abilities and skills. I may be an ace pool player, for example, but if I want to win a lucrative bet with you, it might pay me to pretend that I am incompetent for a couple of games. This will lull you into a false sense of security, encourage you to put a big bet on one game, at which point I can reveal my true skill and win the money. Humans play these strategic games of deception and concealment with each other all the time.

Bostrom’s response to the safety test objection makes the point that superintelligent AIs could play the same sort of games. They could “play nice” while being tested, concealing their true intentions and abilities from us, so as to facilitate their being put in position to exercise their true abilities and realise their true intentions. As he himself puts it:

The flaw in this idea [the safety test objection] is that behaving nicely while in the box is a convergent instrumental goal for friendly and unfriendly AIs alike. An unfriendly AI of sufficient intelligence realizes that its unfriendly final goals will be best realised if it behaves in a friendly manner initially, so that it will be let out of the box. It will only start behaving in a way that reveals its unfriendly nature when it no longer matters whether we find out; that is, when the AI is strong enough that human opposition is ineffectual.

(Bostrom 2014, p. 117)

Or to put it another way: no matter how much testing we do, it is always possible that the AI will take a “treacherous turn”:

(8) The Treacherous Turn Problem: An AI can appear to pose no threat to human beings through its initial development and testing, but once in a sufficiently strong position it can take a treacherous turn, i.e. start to optimise the world in ways that pose an existential threat to human beings.

(Note: this definition diverges somewhat from the definition given by Bostrom in the text. I don’t think the alterations I make do great violence to the concept, but I want the reader to be aware that they are there.)

Bostrom is keen to emphasise how far-reaching this problem is. In the book, he presents an elaborate story about the design and creation of a superintelligent AI, based on initial work done on self-driving cars. The story is supposed to show that all the caution and advance testing in the world cannot rule out the possibility of an AI taking a treacherous turn. He also notes that an advanced AI may even encourage its own destruction, if it is convinced that doing so will lead to the creation of a new AI that will be able to achieve the same goals. Finally, he highlights how an AI could take a treacherous turn by just suddenly happening upon a treacherous way of achieving its final goals.

This is all superficially plausible. It is indeed conceivable that an intelligent system, capable of strategic planning, could take such treacherous turns. And a sufficiently time-indifferent AI could play a “long game” with us, i.e. it could conceal its true intentions and abilities for a very long time. Nevertheless, accepting this has some pretty profound epistemic costs. It seems to suggest that no amount of empirical evidence could ever rule out the possibility of a future AI taking a treacherous turn. In fact, it’s even worse than that. If we take it seriously, then it is possible that we have already created an existentially threatening AI. It’s just that it is concealing its true intentions and powers from us for the time being.

I don’t quite know what to make of this. Bostrom is a pretty rational, Bayesian guy. I tend to think he would say that if all the evidence suggests that our AI is non-threatening (and if there is a lot of that evidence), then we should heavily discount the probability of a treacherous turn. But he doesn’t seem to add that qualification in the chapter. He seems to think the threat of an existential catastrophe from a superintelligent AI is pretty serious. So I’m not sure whether he embraces the epistemic costs I just mentioned or not.
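The tension here can be made concrete with a toy Bayesian update. What follows is only an illustrative sketch, not anything taken from Bostrom’s text: the function name, the prior of 0.1 and the pass probabilities are all assumptions chosen for the example. It treats each safety test as independent and asks how the probability that the AI is unfriendly changes after it passes a run of tests.

```python
def posterior_unfriendly(prior, p_pass_if_unfriendly, p_pass_if_friendly, n_tests):
    """Probability that the AI is unfriendly after it passes n_tests
    independent safety tests, computed with Bayes' rule.

    All inputs are hypothetical illustration values, not empirical estimates.
    """
    # Joint probability of "unfriendly" and "passed every test"
    joint_unfriendly = prior * p_pass_if_unfriendly ** n_tests
    # Joint probability of "friendly" and "passed every test"
    joint_friendly = (1.0 - prior) * p_pass_if_friendly ** n_tests
    return joint_unfriendly / (joint_unfriendly + joint_friendly)


# Treacherous-turn assumption: an unfriendly AI passes every test just as
# reliably as a friendly one, so passing carries no evidential weight.
perfect_deceiver = posterior_unfriendly(0.1, 1.0, 1.0, n_tests=1000)

# Weaker assumption: an unfriendly AI slips up on 10% of tests, so each
# pass is modest evidence of friendliness and the evidence accumulates.
imperfect_deceiver = posterior_unfriendly(0.1, 0.9, 1.0, n_tests=50)

print(perfect_deceiver)    # stays at the prior, however many tests are passed
print(imperfect_deceiver)  # falls well below the prior
```

On these assumptions, the discounting I attribute to Bostrom above only works if an unfriendly AI is at least slightly more likely to fail a test than a friendly one. The treacherous turn, taken at face value, denies exactly that, which is why the evidence stops doing any work.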

Anyway, that brings us to the end of this post. To briefly recap, Bostrom’s doomsday argument is based on the combination of three theses: (i) the first mover thesis; (ii) the orthogonality thesis; and (iii) the instrumental convergence thesis. Collectively, these theses suggest that the first superintelligent AI could have non-anthropomorphic final goals and could pursue them in ways that are inimical to human interests. There are two obvious ripostes to this argument. We’ve just looked at one of them, the safety test objection, and seen how Bostrom’s reply seems to impose significant epistemic costs on the doomsayer. In the next post, we’ll look at the second riposte and what Bostrom has to say about it. It may be that his reasons for taking the threat seriously stem more from that riposte.


John Danaher is an academic with interests in the philosophy of technology, religion, ethics and law. John holds a PhD specialising in the philosophy of criminal law (specifically, criminal responsibility and game theory). He formerly was a lecturer in law at Keele University, interested in technology, ethics, philosophy and law. He is currently a lecturer at the National University of Ireland, Galway (starting July 2014).

He blogs at and can be found here:

This article previously appeared here. Republished under creative commons license.