Sign In

Remember Me

Bostrom on Superintelligence (5): Limiting an AI’s Capabilities

Front-cover-197x300This is the fifth part of my series on Nick Bostrom’s book Superintelligence: Paths, Dangers, Strategies. So far in the series, we’ve covered why Bostrom thinks superintelligent AIs might pose an existential risk to human beings. We’ve done this by looking at some of his key claims about the nature of artificial intelligence (the orthogonality thesis and the instrumental convergence thesis); and at the structure of his existential risk argument.

In the remaining posts in this series, we’re going to focus on ways in which to contain that existential risk. We start by looking at chapter 9 of the book, which is entitled “The Control Problem”. In this chapter, Bostrom tries to do two things. First, he tries to explain exactly what the problem is when it comes to containing existential risk (that is the control problem). Second, he tries to catalogue and offer brief evaluations of various strategies for addressing that problem.

We’re going to cover both of these things today. First, by talking about principal agent problems and the unique nature of the principal-agent problem that arises in the construction of a superintelligent AI. And second, by looking at one possible set of solutions to that problem: limiting the AI’s capabilities. We will continue to catalogue possible solutions to the control problem in the next post.

1. Principal-Agent Problems and the Control Problem
Principal-agent problems are a mainstay of economic and regulatory theory. They are very easy to explain. Suppose that I want to buy a house, but I don’t have time to view lots of houses and negotiate deals on the ones that best suit my needs. Consequently, I decide to hire you to do all this for me. In this scenario, I am the principal (the person who wants some task to be performed in accordance with my interests), and you are the agent (the person carrying out the tasks on my behalf).

The principal-agent problem arises because the interests of the principal and the agent are not necessarily aligned, and because the agent has access to information that the principal does not. So, for example, when I send you out to look for a house, I have no way of knowing if you actually looked at a sufficient number of houses (you could easily lie to me about the number), or whether you actually negotiated the best possible deal. You just want to get your agent’s fee; you don’t necessarily want to get a house that will suit my needs. After all, you aren’t going to live there. This gives you an incentive to do less than I would like you to do and act in ways that are counterproductive to my interests.

Principal-agent problems are common in many economic and regulatory situations. A classic example arises in the management of companies/corporations. In many publicly-owned companies, the owners of the company (the shareholders) are not the same as the people who manage the company on a day-to-day basis. The owner’s put their money at risk for the company; but the managers do not. There is thus a danger that the managers’ interests are not aligned with those of the shareholders, and that they might make decisions that are averse to those interests.

There are, of course, a variety of “solutions” to this type of principal-agent problem. A classic one is to pay the managers in company stocks so that their interests become aligned with those of the shareholders. Similarly, there are a range of oversight and governance mechanisms that are supposed to hold the managers accountable for bad behaviour. We needn’t get into all that here, though — all we need is the general overview of the principal-agent problem.

Why do we need it? Because, according to Bostrom, the development and creation of superintelligent AIs gives rise to a unique and exceptionally difficult version of the principal-agent problem. To be more precise, it gives rise to two separate principal-agent problems. As he describes them (pp. 127-128)

The First Principal Agent Problem: This involves human principals and human agents. The first project to develop a highly intelligent AI will, presumably, involve some wealthy financial backers (maybe governments) who hire a group of AI engineers. The sponsors will need to ensure that the engineers carry out the project in accordance with their interests. This is a standard principal-agent problem, which may pose difficulties, but nothing too unusual.

The Control Problem: This involves human principals and artificial agents. The engineers who create the first advanced AI will have a set of goals or interests they will wish for it to pursue. The question is whether they will be able to do this successfully once the AI exceeds a certain threshold in intelligence and competence. This is a unique problem, which gives rise to unique set of concerns. A failure to address this problem is what could give rise to existential risk.

Bostrom’s focus is on the Control Problem, and the possible solutions to it. He identifies two general methods for addressing it. The first is to try to limit an advanced AI’s abilities (the “Capability Control Method”); the second is try to motivate the advanced AI to pursue goals that are in our (human) interest (the “Motivation Selection Method”). Bostrom is clear that either set of methods would have to be implemented before an AI achieves superintelligence. Why? Because after that the AI could have a decisive strategic advantage over human beings, and it may be impossible to constrain or limit it in any way.

In the remainder of this post we will look at four capability control methods: (i) boxing; (ii) incentive design; (iii) stunting; and (iv) tripwiring.

2. Boxing Methods of Control
Perhaps the simplest and most intuitively compelling method of controlling an AI is to put it in a “box”. Not a literal box of course — though that’s one form it could take — but rather a metaphorical box: a set of protocols that contains and constrains the way in which it can interact with the world. A typical AI will have three components: a sensor (or input channel); a processor; and an actuator (or output channel). The AI will receive inputs from the external world via its sensor (e.g. light rays, radio waves, chemical compounds, sound waves and so on), it will manipulate and transform those inputs using its processor, and it will then respond to world by performing an “action” of some kind via its actuators. The action could be something simple like displaying information on a screen, or something more complex like buying and selling shares on a stock exchange. Given the right set of actuators, an AI could conceivably perform any sort of action.

It would seem logical then to say that an AI with a high degree of intelligence, access to the right information, and a wide array of actuators at its disposal, would have the power to pose a significant existential threat to human beings (whether it would be inclined to do so is a separate matter, covered in previous posts). But if we could limit the kinds of information it receives, and minimise the number of actions it can perform in the world, we might have a way to control the threat. This is what the boxing solution proposes.

Bostrom discusses two types of boxing solution in the book. The first he describes as a form of “physical” containment. This is like literal boxing. You place the AI in a physical container of some kind, and only allow it to “act” in the real world via a specific set of actuators. This means removing things like robotic arms or other devices that would allow the AI to break out of its containment area. One big problem with physical containment is the fact that AIs can manipulate devices using radio waves, which it could easily generate by shuffling electrons around inside its processors. We would, consequently, need to build a containment field that prevented radio-wave communication. A Faraday Cage might do the trick.

Bostrom doesn’t have anything overly negative to say about physical containment at this stage (he discusses it in more detail in chapter 10). He thinks it is a simple solution to the problem. He just thinks it could have all sorts of subtle vulnerabilities that we may not realise (e.g. how many of you thought of the radio-wave problem before it was pointed out?), and may lull us into a false sense of security. He also notes that physical containment will reduce the functionality of the AI, which may be a drawback that some sponsors and developers are unwilling to tolerate.

The other method of boxing Bostrom refers to is “informational” containment. This restricts the flow of information into and out of the AI. The idea here is that it may be the inter-connected nature of informational flows that makes an AI dangerous. This is particularly true if the AI is connected to the internet which, with the advent of the “Internet of Things”, will soon become something akin to a central nervous system for the entire globe. If we could de-link the AI from this system, we might be able to contain it. The apotheosis of this would be an AI which can only receive information via a teletype interface, and which can only respond via the same medium with “yes” or “no” answers.

Would this work? Bostrom is sceptical. For one thing, humans would act as the gatekeepers between the AI and the outside world (e.g. by inputting queries on the teletype interface) and humans are vulnerable to manipulation. A smart AI could trick its human gatekeepers into letting it out of the box. Admittedly this might be difficult if it is limited to “yes” or “no” answers. There are further problems here though, discussed in chapter 10 of the book.

Bostrom also notes that you could combine the two methods — i.e. physical and informational containment — but argues that it is virtually impossible to have a perfectly isolated system: any interaction with a human agent would compromise the isolation.

3. Incentive Methods
The second capability control method is difficult to explain. Bostrom refers to it as the “incentive” method, but I find that slightly confusing as it seems to overlap with motivation selection methods. The idea seems to be that if you crafted the right environment for the AI — the right “incentive environment” — it wouldn’t be able to act in an existentially threatening manner. An appropriate analogy here might be the way in which you raise a child. Obviously, children have certain innate qualities that make them different from AIs, but set that to the side for the moment. The child has goals, but has to interact with an external environment in order to pursue those goals. We obviously don’t want our children to grow up into amoral psychopaths who will do anything to get what they want. So instead, we socialise and train them within an incentive environment that forces them to behave in morally and socially acceptable ways.

The suggestion is that we could do something similar with an AI. Bostrom imagines several scenarios here, including one in which the AI is forced to cooperate with other AIs and one in which it is socialised within a human environment (in the hope that it will acquire human norms and values). I’m going to ignore those and focus instead on what I think is his most interesting proposal. This involves creating an AI that whose final value involves receiving “cryptographic reward tokens”. In other words, the AI can be motivated to do anything to receive these tokens. The tokens are sequences of numbers that operate as the keys to ciphers. They would be pretty meaningless to human beings, but overwhelmingly desirable to the AI. The tokens would have to be created before the AI, and would have to stored in a secure location, to be destroyed if the AI ever makes an attempt to get them.

The argument then is that one could use this set-up to ensure that the AI always “plays nice”, i.e. goes along with what its human principals want it to do. I’ll leave Bostrom explain the idea:

The AI could then be given a bounded utility function, and matters could be arranged so that the AI would get (for instance) 99% of its maximum utility from the first reward token, 99% of its remaining utility from the second reward token; and so on. By keeping the “discount rate” high, and by transmitting reward tokens to the AI at a reasonably high rate, one would dissuade the AI from attempting to carry out long-term plans that would increase the expected number of reward tokens in the distal future at the expense of some risk that its reward stream will be cut short, or that the next reward token will be cancelled and destroyed. If the AI always believes that defecting (i.e. doing anything other than its best to look after the principal’s interests) would incur at least a 2% risk of forfeiting the next reward token, whereas cooperating would reduce the risk of forfeiting the next reward token below 1%, then a utility-maximising AI would always cooperate.

(Bostrom 2014, p. 133)

As I say, I think this is interesting. There is an argument and logic that one can follow. Would it work to contain existential risk? Bostrom says “not necessarily”. He gives three reasons for doubt. First, the system of reward tokens would be administered by humans and the AI may come to distrust those humans or worry about their becoming incapacitated, or the possibility of “regime change”, or the possibility of any other human error that prevents it from getting its reward. This may lead it to take precautions that threaten our existence. Second, he doubts whether we would actually be able to tell whether the outcomes produced by the AI are beneficial to us (this is an argument developed elsewhere in the book — I covered related epistemic concerns in part three). Third, he thinks there are other “esoteric considerations” that might make this idea less feasible than it initially seems. He cites as a specific example of this the uncertainty the AI might have about its own “simulation hypothesis”.

Again, we get a flavour of the rather profound epistemic implications of Bostrom’s views: if he is right, we should be plagued by doubts about what an advanced AI might do, and how it might affect us.

4. Stunting and Tripwiring
The final two methods for controlling an AIs capabilities are stunting and tripwiring. “Stunting”, as the name implies, involves hampering or disabling the AI in some way. Examples would include making the AI run on slow hardware, reducing its memory capacity, or limiting the kind of data it can process (this would seem to overlap slightly with the informational containment strategy discussed earlier).

Bostrom argues that the use of stunting poses a dilemma. Either we stunt the AI too much and it just becomes another “dumb” piece of software; or we stunt it too little and it is capable of overcoming its disabilities. Getting the balance just right could be pretty tricky. He also discusses a “data deprivation” method of stunting. This would attempt to limit the kind of data the AI can process to a narrow domain, e.g. it could only process facts about peptide chemistry. Bostrom argues that this too would be an uncertain method of capability control. The problem is that what we think of as narrow and constrained datasets might be anything but. They could contain circumstantial or implied facts that an acute, superintelligent mind, could work out. Or the data could be relevant to other fields in ways that we don’t fully understand. Or it may even be that a sufficiently intelligent AI could work out lots of things via deduction or a priori reflection.

Tripwiring is a different method of capability control. It involves building into any AI development project a set of “tripwires” which, if crossed, will lead to the project being shut down and destroyed. Bostrom identifies three types of tripwire:

Behavioural Tripwires: A network of detectors would be placed around a “boxed” AI in order to detect any attempt to breach the containment, e.g. any attempt to make radio communication.

Ability Tripwires: The AI would be regularly tested for its capabilities. If at any point new abilities are detected, or its rate of improvement in a particular skill seems high, or if its gets close to some capability threshold, it will be shut down.

Content Tripwires: The AI’s internal processing could be constantly monitored for any suspicious changes, e.g. changes in how it represents its final values.

Bostrom thinks that tripwires could be useful, particularly during the development phase and if used in conjunction with other methods. But, unsurprisingly, he thinks that they too have shortcomings. An advanced AI could subvert a tripwire in an unexpected or unanticipated way. He also notes that AI project developers could grow impatient if tripwires repeatedly hamper their progress. They might undermine any safety advantage gained by the tripwire system.

5. Conclusion
Okay, so that brings us to the end of this post. The four capability control methods are summarised in the table below. In the next post we will look at motivation selection methods.


John Danaher is an academic with interests in the philosophy of technology, religion, ethics and law. John holds a PhD student specialising in the philosophy of criminal law (specifically, criminal responsibility and game theory). He formerly was a lecturer in law at Keele University, interested in technology, ethics, philosophy and law. He is currently a lecturer at the National University of Ireland, Galway (starting July 2014).

He blogs at and can be found here:

This article previously appeared here. Republished under creative commons license.