Motivation, morals, and monsters

Speculations about autonomous AI assume simplistic theories of motivation, and confuse those with ethical theories. Building AI systems on these ideas would produce monsters.

AI theories of motivation

Usually, for an event to count as an action, it has to have a motivation. The meaning of “motivation” is unclear, and so is its nature. There are many conflicting theories, all dubious. Only two are influential in artificial intelligence research: maximizing an objective function, and making plans to achieve goals.

Ethics, in these frameworks, consists of having the morally correct objective function or terminal goal.

AI research takes for granted that an agent must be organized in one of these ways, and that actions must result from calculating how to maximize the objective function, or from creating a plan to achieve its goals. Or, rather, since both those calculations are provably mathematically impossible, an agent must somehow approximate them.
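To make the two organizations concrete, here is a minimal Python sketch, assuming a made-up world of six states on a line. Every name in it (step, objective, maximizing_agent, planning_agent) is a hypothetical illustration, not taken from any actual AI system. In a world this tiny both calculations can be done exactly; in realistic settings they cannot, which is why an agent must approximate.

```python
from collections import deque

# Hypothetical toy world: states are the integers 0..5; actions move left or right.
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """Apply an action, clamping the result to the range 0..5."""
    return max(0, min(5, state + ACTIONS[action]))


def objective(state):
    """Made-up objective function: higher is better, peaking at state 4."""
    return -abs(state - 4)


def maximizing_agent(state):
    """First organization: choose the action whose outcome scores highest."""
    return max(ACTIONS, key=lambda action: objective(step(state, action)))


def planning_agent(state, goal):
    """Second organization: breadth-first search for a shortest plan to the goal."""
    frontier = deque([(state, [])])
    seen = {state}
    while frontier:
        current, plan = frontier.popleft()
        if current == goal:
            return plan
        for action in ACTIONS:
            nxt = step(current, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None  # the goal cannot be reached in this world


print(maximizing_agent(1))
print(planning_agent(1, 4))
```

Neither agent has any way to ask whether the objective or the goal is a good one; each simply takes it as given.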

Human motivation is not like AI’s

Unfortunately, these simplistic theories of motivation are mistaken, both descriptively and prescriptively.[1] People do not have objective functions, nor do we have such clear-cut goals as AI theories suggest.

Ethical theories that conform to these models lead to morally repugnant conclusions that most people would reject. An agent that operated according to them would be a monster. Any fixed motivational structure is a psychopathic monomania; an inability to reason contextually; a deficit of spontaneity; an unwillingness to learn what is good. Monomanias make monsters.[2]

We often, not always, have specific goals in particular situations; but they are (to varying degrees) contextual, transient, nebulous, and inconsistent. We usually, not always, can see what is moral in particular situations, but attempts to abstract fully general theories of ethics have always failed.

We can usually say “why I did” something after the fact. This is the norm of accountability, which underlies reasonable human activity, and is a central mechanism of human morality.[3] Unless you are pathologically committed to ideological rationalism, “why I did it” is never “to maximize my objective function.” Nor is your account a causal explanation of how your brain brought about the action. Instead, it is a justification for the pragmatic usefulness and/or social acceptability of what you did, referring to specific contingencies in the situation rather than some abstract theory. “I’m sorry, but I had to unplug the office espresso machine; it was clogged with lime scale.”

In a framework with subgoals, asking “why” a few times should reveal successively more important ones until you reach your terminal goal, or overall purpose in life. This rarely provides an accurate, useful, or even meaningful explanation. No one attempts to justify the correctness of disabling the coffee maker by reasoning from ultimate principles, because we know that doesn’t work. “To increase the total amount of pleasure in the universe” is not an acceptable answer to “why did you unplug it.”

Choosing motivations for AIs

For AI systems, where does the objective function, or goal set, come from? Until recently, AI research took for granted that the builder or operator would supply them. If you want AI to cure cancer, you give it that goal. This makes sense for “tool AIs” with a specific function. Dr. Evil might give a system the goal of creating biological weapons, because he’s evil. This is the domain of AI ethics: the morally correct use of AI systems by people. The fault here is with Dr. Evil, rather than the AI, which is a mere thing, not a morally accountable agent.

But what about powerful, autonomous, general-purpose AIs? Since they could do an unforeseeable assortment of things, the task of specifying what they should do is more like creating an ethical theory than like providing a goal set. This is the domain of AI safety, where the problem is termed alignment. AI should align to human values, ideally by understanding and acting according to them, or at minimum by reliably intending to respect them. An AI system itself is often imagined as a moral (or immoral) agent, whose actions derive from an ethical reasoning process instilled in it at conception.

Attempts to specify what “values” we want an AI to respect fail because we don’t have those. That’s not how human motivation works. Eliezer Yudkowsky, the founder of the safety field, recognized this early on. He suggested substituting “coherent extrapolated volition,” a prescriptive theory of what we rationally ought to want, in place of whatever unknown, nebulous, messy motivations we actually have.[4] He soon found flaws in this approach and deprecated it, but there’s been no better replacement.[5]

This has long been the fundamental conundrum of AI safety: it appears impossible to specify a motivation that does not cause a catastrophe for humanity if an agent with god-like superpowers pursues it.

Monstrous AI

Our intuitions about misaligned Scary AIs parallel universal human ones about monsters: they are dangerous; irrational; unintelligible; inhuman; unnatural; overwhelmingly powerful; and simultaneously repulsive and attractive.[6] Conceptual interpretation breaks down in the uncanny valley between “thing” and “mind.” Vampires are corpses that walk. Trolls are sentient rocks. AIs are machines that think. We wobble between trying to understand them mechanically or psychologically, and fear that neither mode will work adequately. Monstrous AI is imagined as being much more human-like than current AI, yet not human enough to reason with, or reason about, reliably. It might combine human deviousness, strategizing, and hostility with inscrutable, radically alien motivations.

The opposite may be an even greater danger: an AI that is inhuman because, unlike us, it has a crisply-defined objective function or goal set, and acts rationally to satisfy it. That is a hideous caricature of what it is to be human. Optimization is a powerfully useful method in particular, limited sorts of contexts. Maximizing the output of a paperclip factory, subject to constraints such as capital costs and worker safety, is an excellent idea. Optimization is appropriate in closed worlds in which all the relevant variables are known and controllable.[7] Putting huge power behind an optimization process and setting it loose in the open world is irresponsible, because the consequences are unknowable and can be disastrous. If you don’t understand what you are optimizing, or how it gets optimized, or what effect optimizing that will have in unconstrained open-world situations, you are building a monster that might turn the entire earth into paperclips. We have already done something similar, and it may soon be disastrous, as the next chapter details.
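The benign, closed-world case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the production rate, machine cost, budget, and safe-hours limit are invented numbers, and best_plan is a hypothetical name. The point is that every relevant variable is listed, bounded, and controllable, so exhaustive search over the options is both possible and safe.

```python
from itertools import product

# All figures below are invented for illustration.
RATE = 100              # paperclips per machine-hour
MACHINE_COST = 20_000   # capital cost per machine
BUDGET = 100_000        # capital constraint
MAX_SAFE_HOURS = 10     # worker-safety constraint on shift length


def best_plan():
    """Exhaustively search a small, fully known space of factory plans."""
    best = None
    for machines, hours in product(range(0, 11), range(0, 25)):
        if machines * MACHINE_COST > BUDGET:
            continue  # violates the capital constraint
        if hours > MAX_SAFE_HOURS:
            continue  # violates the safety constraint
        output = machines * hours * RATE
        if best is None or output > best[0]:
            best = (output, machines, hours)
    return best


print(best_plan())
```

With these made-up numbers the search settles on five machines running ten-hour shifts. Nothing in the code can express a variable that was left out of the model, which is exactly what goes wrong in the open world.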

There are more sophisticated theories of human motivation, and of ethics, than those prevalent in AI research. We might be tempted to install one in AI systems. That would be a mistake: the best theories are much more complex but still radically inadequate, and would almost certainly produce monsters even less predictable than ones with known objective functions. It would also risk creating moral subjects—machines with rights. Unless we could ensure their welfare, which we can’t, we’d have no business doing that.

The intuition that already-deployed AIs are not yet human enough to be monstrous, so Scary AI is still years or decades in the future, may be catastrophically complacent. The AI we have now is already dangerous; irrational; unintelligible; inhuman; unnatural; overwhelmingly powerful; and simultaneously repulsive and attractive.

“Alignment” may be critically important if we accidentally create Monstrous AI, and attempting to restrain it is our only hope. Otherwise, “getting machines to do what we want” is called “engineering.” My general approach to life applies an engineering mindset to traditionally philosophical problems. I believe that is likely to be the best approach to AI safety. Later in this book, I suggest applying conventional software safety engineering methods, rejecting the notion that AI is a special case.

As John von Neumann put it in “Can we survive technology?” in 1955:

What safeguard remains? Apparently only day-to-day—or perhaps year-to-year—opportunistic measures, a long sequence of small, correct decisions. And this is not surprising. After all, the crisis is due to the rapidity of progress, to the probable further acceleration thereof, and to the reaching of certain critical relationships. … Under present conditions it is unreasonable to expect a novel cure-all. … Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment … To ask in advance for a complete recipe would be unreasonable. We can specify only the human qualities required: patience, flexibility, intelligence.

  1. Psychology and philosophy both have enormous literatures on motivation and on moral reasoning. Diverse schools within each offer incompatible theories, none of which are broadly accepted. Our understanding in this area consists mainly of pre-scientific intuitions and unsystematic observations.
  2. See the unfinished “Mission” chapter in Meaningness. “Post-rational nihilism” is a common, dire consequence of trying to find your ultimate purpose and failing, as eventually you must. This tweet thread summarizes the problem.
  3. See “You are accountable for reasonableness” in In the Cells of the Eggplant.
  4. “Coherent Extrapolated Volition” at LessWrong.
  5. A. V. Turchin-Bogemsky, “AI Alignment Problem: ‘Human Values’ don’t Actually Exist.”
  6. These are, I think, also all inconvenient characteristics of ourselves that we want to deny, and so project onto imaginary monsters instead. See my “We are all monsters,” from which I took this list.
  7. See “The Spanish Inquisition” in The Eggplant; and Paul N. Edwards’ The Closed World.