Speculations about autonomous AI assume simplistic theories of motivation. They also confuse those with ethical theories. Building AI systems on these ideas would produce monsters.
AI theories of motivation
Usually, for an event to count as an action, it has to have a motivation. It is unclear what “motivation” means, or how motivations work. There are many conflicting theories, all dubious. Only two are influential in artificial intelligence research.
Objective function optimization: An agent tries to increase some numerical “goodness” measure. A simple example is “how much money do I have?” That is the agent’s sole, unchanging objective: to make the number go up.
Goal pursuit: An agent has a set of goals that it tries to achieve, like “build a flying car.” In this framework, “values,” “desires,” and “intentions” are considered types of goals. Commonly, the goal set is organized with subgoals that enable actions that lead to accomplishing more important goals. For instance, you open the drawer to get a spoon to eat your cornflakes to have enough concentrated attention to write the report to get a promotion to increase your salary to die with plenty of toys, which is your ultimate goal. You make a plan by reasoning backward: to get toys, you’ll have to get a promotion, and so on, which means you need to open the drawer.
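As a minimal sketch of the backward reasoning just described, assuming each goal is enabled by exactly one subgoal: the `ENABLED_BY` table and the `plan_backward` function are hypothetical names invented for this illustration, and the chain simplifies the cornflakes example by skipping the "concentrated attention" step. Real planners handle branching, conflicting, and unachievable subgoals, which this does not.

```python
# Hypothetical subgoal chain, loosely following the cornflakes example.
# Each goal maps to the single subgoal that must be achieved first.
ENABLED_BY = {
    "die with plenty of toys": "increase salary",
    "increase salary": "get a promotion",
    "get a promotion": "write the report",
    "write the report": "eat cornflakes",
    "eat cornflakes": "get a spoon",
    "get a spoon": "open the drawer",
}


def plan_backward(ultimate_goal, enabled_by):
    """Reason backward from the ultimate goal, then return steps in execution order."""
    chain = [ultimate_goal]
    # Keep asking "what do I need first?" until reaching a directly doable action.
    while chain[-1] in enabled_by:
        chain.append(enabled_by[chain[-1]])
    chain.reverse()  # the last thing reasoned about is the first thing done
    return chain


plan = plan_backward("die with plenty of toys", ENABLED_BY)
for step in plan:
    print(step)
```

The plan comes out in the order you would act on it, starting with opening the drawer and ending with the ultimate goal.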
Ethics, in these frameworks, consists of having the morally correct objective function or ultimate goal. Or, since people don’t seem to have any single ultimate motivation, theorists may suggest we have a set of abstract, general values, from which all morally correct goals derive. Values are axiomatic principles; they need no justification, and are not subordinate to any other sort of goal.
AI research takes for granted that an agent must be organized in one of these ways, and that actions must result from calculating how to maximize the objective function, or from creating a plan to achieve its goals. Or, rather, since both those calculations are provably intractable in general, an agent must somehow approximate them.
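A toy sketch of what "approximating" objective maximization can look like, under loudly stated assumptions: the `money_after` objective, its payoff numbers, and the `greedy_agent` function are all hypothetical, made up for illustration. Instead of searching every possible sequence of actions, the agent greedily takes whichever single action raises the number most at each step.

```python
def money_after(actions):
    """Hypothetical objective function: the agent's money after a sequence of actions."""
    payoffs = {"work": 10, "rest": 0, "gamble": -5}  # invented payoff numbers
    return sum(payoffs[a] for a in actions)


def greedy_agent(objective, choices, steps):
    """Approximate maximization: at each step, pick the action that raises the objective most."""
    taken = []
    for _ in range(steps):
        best = max(choices, key=lambda a: objective(taken + [a]))
        taken.append(best)
    return taken


actions = greedy_agent(money_after, ["work", "rest", "gamble"], steps=3)
print(actions, money_after(actions))
```

The agent has no motivation other than making the number go up, which is the point of the framework, and also what the rest of this chapter objects to.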
Human motivation is not like AI’s
Unfortunately, these simplistic theories of motivation are mistaken, both descriptively and prescriptively.[1] People do not have objective functions, nor do we have such clear-cut goals as AI theories suggest.
Ethical theories that conform to these models lead to morally repugnant conclusions which most people would reject. An agent that operated according to them would be a monster. Any fixed motivational structure is a psychopathic monomania; an inability to reason contextually; a deficit of spontaneity; an unwillingness to learn what is good. Monomanias make monsters.
We often, not always, have specific goals in particular situations; but they are (to varying degrees) contextual, transient, nebulous, and inconsistent. Mostly our activity responds directly to concrete opportunities which are obvious in the context at the time, without need for consideration of goals. We usually, not always, can see what is moral in particular situations; but attempts to construct general, abstract frameworks for ethics have always failed to account adequately for specifics.
We can usually say “why I did” something after the fact. This is the norm of accountability, which underlies reasonable human activity, and is a central mechanism of human morality.[2] Unless you are pathologically committed to ideological rationalism, “why I did it” is never “to maximize my objective function.” Nor is your account a causal explanation of how your brain brought about the action. Instead, it is a justification for the pragmatic usefulness and/or social acceptability of what you did, based on specific contingencies in the situation rather than some abstract theory. “I’m sorry, but I had to unplug the office espresso machine; it was clogged with lime scale.”
In a framework with subgoals, asking “why” a few times is supposed to reveal successively more important ones until you reach your ultimate goal; or at least an abstract value that is not a subgoal for anything else. This rarely provides an accurate, useful, or even meaningful explanation. No one attempts to justify the correctness of disabling the coffee maker by reasoning from ultimate principles, because we know that doesn’t work. “To increase the total amount of pleasure in the universe” is not an acceptable answer to “why did you unplug it.”
Choosing motivations for AIs
For AI systems, where does the objective function, or goal set, come from? Until recently, AI research took for granted that the builder or operator would supply them. If you want AI to cure cancer, you give it that goal. This makes sense for “tool AIs” with a specific function.
Dr. Evil might give a system the goal of creating biological weapons, because he’s evil. This is the domain of AI ethics: the morally correct use of AI systems by people. The fault here is with Dr. Evil, rather than the AI, which is a mere thing, not a morally accountable agent.
But what about powerful, autonomous, general-purpose AIs? Since they could do an unforeseeable assortment of things, we can’t set out in detail what goals they should have. Instead, we’d need to give them general, abstract values from which they could derive specific goals themselves. That is more like creating an ethical theory than like engineering design.
This is the domain of AI safety, where systems are often imagined as moral (or immoral) agents themselves, whose actions result from an ethical reasoning process instilled at conception. This is termed alignment. AIs should align to human values, ideally by understanding and acting according to them, or at minimum by reliably recognizing and intending to respect them.
Attempts to specify what abstract values we want an AI to respect fail because we don’t have those. That’s not how human motivation works, nor are “values” a workable basis for an accurate ethical framework. This has been recognized repeatedly in the field, with useful discussions by Miya Perry,[3] A. V. Turchin-Bogemsky,[4] and Eliezer Yudkowsky.[5]
This is a fundamental conundrum of AI safety: it appears impossible to specify any motivation or morality that does not cause a catastrophe for humanity, if a sufficiently powerful artificial agent pursues it.
Our intuitions about misaligned Scary AIs parallel universal human intuitions about monsters. Monsters are dangerous; irrational; unintelligible; inhuman; unnatural; overwhelmingly powerful; and simultaneously repulsive and attractive.
Conceptual interpretation breaks down in the uncanny valley between “thing” and “mind.” Vampires are corpses that walk. Trolls are sentient rocks. AIs are machines that think. We wobble between trying to understand them mechanically or psychologically, and fear that neither mode will work adequately.
Monstrous AI is imagined as being much more human-like than current AI, yet not human enough to reason with, or reason about, reliably. It might combine human deviousness, strategizing, and hostility with inscrutable, radically alien motivations.
The opposite may be an even greater danger. That would be an AI that is inhuman because—unlike us—it has a crisply-defined objective function or goal set, and acts rationally to satisfy it. This is a hideous caricature of what it is to be human.
Optimization is a powerfully useful method in particular, limited sorts of contexts. Maximizing the output of a paperclip factory, subject to constraints such as capital costs and worker safety, is an excellent idea. Optimization is appropriate in closed worlds in which all the relevant variables are known and controllable.[6]
Putting huge power behind an optimization process and setting it loose in the open world is irresponsible, because the consequences are unknowable and can be disastrous. If you don’t understand what you are optimizing, or how it gets optimized, or what effect optimizing that will have in unconstrained open-world situations, you are building a monster that might turn the entire earth into paperclips. We have already done something similar, and it may soon be disastrous—as the next chapter details.
There are more sophisticated theories of human motivation, and of ethics, than those prevalent in AI research. We might be tempted to install one in AI systems. That would be a mistake: the best theories are much more complex, but still radically inadequate. They would almost certainly produce monsters even less predictable than ones with known objective functions.
There’s a widespread intuition that already-deployed AIs are not yet human enough to be monstrous, so Scary AI is still years or decades in the future. That may be catastrophically complacent. The AI we have now is already dangerous; irrational; unintelligible; inhuman; unnatural; overwhelmingly powerful; and simultaneously repulsive and attractive.
“Alignment” may be critically important if we accidentally create Monstrous AI, and attempting to restrain it is our only hope.
Otherwise, “getting machines to do what we want” is called “engineering.” I believe that is likely to be the best approach to AI safety. Later in this book, I suggest applying conventional software safety engineering methods, rejecting the notion that AI is a special case.
As John von Neumann put it in “Can we survive technology?” in 1955:
What safeguard remains? Apparently only day-to-day—or perhaps year-to-year—opportunistic measures, a long sequence of small, correct decisions. And this is not surprising. After all, the crisis is due to the rapidity of progress, to the probable further acceleration thereof, and to the reaching of certain critical relationships. … Under present conditions it is unreasonable to expect a novel cure-all. … Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment … To ask in advance for a complete recipe would be unreasonable. We can specify only the human qualities required: patience, flexibility, intelligence.
- 1. Psychology and philosophy both have enormous literatures on motivation and on moral reasoning. Diverse schools within each offer incompatible theories, none of which are broadly accepted. Our understanding in this area consists mainly of pre-scientific intuitions and unsystematic observations.
- 2. See “You are accountable for reasonableness” in my In the Cells of the Eggplant.
- 3. “Benevolent AI Is a Bad Idea,” Palladium, November 10, 2023.
- 4. “AI Alignment Problem: ‘Human Values’ Don’t Actually Exist,” LessWrong, April 22, 2019.
- 5. Summarized at “Coherent Extrapolated Volition” and “Complexity of Value” at LessWrong, undated.
- 6. See “The Spanish Inquisition” in my In the Cells of the Eggplant; and Paul N. Edwards’ The Closed World.