Reward Function Integrity in Artificially Intelligent Systems

Reward Function Integrity in Artificially Intelligent Systems

Roman gave a talk at the Oxford Winter Intelligence conference / AGI12 in December last year. [Preprint of paper here: UTILITY FUNCTION SECURITY IN ARTIFICIALLY INTELLIGENT AGENTS]


Roman YampolskiyAbstract: In this paper we will address an important issue of reward function integrity in artificially intelligent systems. Throughout the paper, we will analyze historical examples of wireheading in man and machine and evaluate a number of approaches proposed for dealing with reward-function corruption. While simplistic optimizers driven to maximize a proxy measure for a particular goal will always be a subject to corruption, sufficiently rational self-improving machines are believed by many to be safe from wireheading problems. Claims are often made that such machines will know that their true goals are different from the proxy measures, utilized to represent the progress towards goal achievement in their fitness functions, and will choose not to modify their reward functions in a way which does not improve chances for the true goal achievement. Likewise, supposedly such advanced machines will choose to avoid corrupting other system components such as input sensors, memory, internal and external communication channels, CPU architecture and software modules. They will also work hard on making sure that external environmental forces including other agents will not make such modifications to them. We will present a number of potential reasons for arguing that wireheading problem is still far from being completely solved. Nothing precludes sufficiently smart self-improving systems from optimizing their reward mechanisms in order to optimize their current-goal achievement and in the process making a mistake leading to corruption of their reward functions.

In many ways the theme of this paper will be about how addiction and mental illness, topics well studied in human subjects, will manifest in artificially intelligent agents. We will describe behaviors equivalent to suicide, autism, antisocial personality disorder, drug addiction and many others in intelligent machines. Perhaps via better understanding of those problems in artificial agents we will also become better at dealing with them in biological entities.

A still unresolved issue is the problem of perverse instantiation. How can we provide orders to superintelligent machines without danger of ambiguous order interpretation resulting in a serious existential risk? The answer seems to require machines that have human-like common sense to interpret the meaning of our words. However being superintelligent and having common sense are not the same things and it is entirely possible that we will succeed in constructing a machine which has one without the other. Finding a way around the literalness problem is a major research challenge. A new language specifically developed to avoid ambiguity may be a step in the right direction.
Throughout the paper we will consider wireheading as a potential choice made by the intelligent agent. As smart machines become more prevalent, a possibility will arise that undesirable changes to the fitness function will be a product of the external environment. For example in the context of military robots the enemy may attempt to re-program the robot via hacking or computer virus to turn it against its original designers, a situation which is similar to that faced by human war prisoners subjected to brainwashing or hypnosis. Alternatively robots could be kidnapped and physically re-wired. In such scenarios it becomes important to be able to detect changes in the agent’s reward function caused by forced or self-administered wireheading. Behavioral profiling of artificially intelligent agents may present a potential solution to wireheading detection.
The full paper will address the following challenges and potential solutions: Wireheading in Machines (Direct stimulation, Maximizing reward to the point of resource overconsumption, Killing humans to protect reward channel, Ontological Crises, Changing initial goals to an easier target, Infinite loop of reward collecting, Changing human desires or physical composition, Reward inflation and deflation), Perverse Instantiation, Sensory Illusions — a Form of Indirect Wireheading. Potential Solutions to the Wireheading Problem (Inaccessible reward-function (hidden, encrypted, hardwired, etc.), Reward function resetting, Revulsion, Utility Indifference, External Controls, Evolutionary competition between agents, Learned Reward Function, Making utility function be bound to the real world).


For more videos of lectures and interviews with thought leaders please Subscribe to Adam Ford’s YouTube Channel



Leave a Reply