The type of scientific and engineering understanding most relevant to AI safety is run-time, task-relevant, and algorithmic. That can lead to more reliable, safer systems. Unfortunately, gaining such understanding has been neglected in AI research, so currently we have little.
It’s common to explain that current AI systems are “neural networks,” consisting of arithmetical “units” that are trained with “machine learning.” This explanation is true as far as it goes, but it applies uniformly to extremely different AI systems, so it can’t give much insight into what specific ones can or can’t do, or why.
We need a different kind of explanation: one that can tell us how and when a system is likely to fail, and what to do about it.
Run time is the operation of an AI system when deployed; it is what users see, and what can affect the world.1 It contrasts with training time, which is the construction of the system using backprop. In current practice, these are entirely disjoint; a network is trained once, frozen, and does not change once put into use. Training time behavior is important for the people creating a system, but irrelevant for everyone else.
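The split can be made concrete with a toy sketch: a one-parameter model (not a neural network; the model and data are hypothetical stand-ins) is trained once by gradient descent, then frozen, and at run time only computes outputs.

```python
# Toy sketch of the training-time / run-time split.
# One-parameter linear model; data and learning rate are arbitrary stand-ins.

def train(data, steps=1000, lr=0.01):
    """Training time: adjust the weight by gradient descent on squared error."""
    w = 0.0
    for _ in range(steps):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

# Training time: happens once, before deployment.
weights = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])

def run(w, x):
    """Run time: the weight is frozen; the system only computes outputs."""
    return w * x

print(round(run(weights, 5.0), 2))  # → 10.0
```

Users only ever interact with `run`; everything about `train` is invisible to them, which is the sense in which training-time behavior is irrelevant to everyone but the system's builders.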
We could distinguish three levels of analysis for any computational system.2 The black box level describes what a system does, without explaining why or how. It considers only the input-output behavior. The algorithmic level explains why and how, abstractly. A full algorithmic analysis specifies the transformations that lead from each input to each output. The mechanistic level explains the specific machinery that accomplishes the algorithm.
One run-time black box explanation of a chatbot is “a program that accepts human-language inputs and produces often-appropriate human-language outputs.” One run-time algorithmic explanation is that it alternates matrix multiplications with applications of elementwise nonlinear functions. One run-time mechanistic explanation is that the matrices are distributed across GPU memory and the multiplications are performed in parallel. These explanations are necessary for people building AI systems, but irrelevant for understanding them from a user or societal perspective.
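The algorithmic explanation can be spelled out as a minimal two-layer forward pass; the shapes and random weights here are arbitrary stand-ins, not those of any real system:

```python
import numpy as np

# Minimal sketch of the generic run-time algorithmic description:
# alternate matrix multiplications with an elementwise nonlinearity.
# Weights are random stand-ins (hypothetical shapes).

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # layer 1 weights
W2 = rng.normal(size=(2, 4))   # layer 2 weights

def relu(z):
    return np.maximum(z, 0.0)  # elementwise nonlinearity

def forward(x):
    h = relu(W1 @ x)           # matrix multiply, then nonlinearity
    return W2 @ h              # final matrix multiply

x = np.array([1.0, -0.5, 2.0])
print(forward(x).shape)        # → (2,)
```

Note how uninformative this is: the same sketch describes an image classifier, a chatbot, or a fraud detector equally well, which is exactly why this level of explanation gives no task-relevant insight.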
They are nevertheless commonly given to the interested public, although they are useless for that audience. They give no insight into what the system is likely to do, right or wrong. They don’t explain why the outputs are “often appropriate,” what sorts of inputs are likely to produce inappropriate outputs, or what those outputs are likely to be instead.
It’s also common to explain chatbots as “predicting the next word,” which is true, but useless. Lacking the “why,” it completely fails to explain what the chatbot actually does (which is not—from the user’s point of view—predicting the next word). Lacking the “how,” it says nothing about chatbots’ capabilities and limits.
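A toy bigram model (a deliberately crude stand-in for a real language model) shows why “predicting the next word” is literally true but explains nothing: the generation loop really does just predict next words, yet that fact alone says nothing about the quality or character of what comes out.

```python
from collections import Counter, defaultdict

# Toy "next word predictor": bigram counts over a tiny made-up corpus.
# Real chatbots are vastly more complex, but the generation loop
# has this shape: repeatedly predict a next token and append it.

corpus = "the cat sat on the mat and the cat slept".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word(word):
    # Greedy prediction: the most frequent successor seen in training.
    return counts[word].most_common(1)[0][0]

word, out = "the", ["the"]
for _ in range(4):
    word = next_word(word)
    out.append(word)
print(" ".join(out))  # → "the cat sat on the"
```

From the user’s point of view the output is a phrase, not a sequence of predictions; and nothing in “it predicts the next word” explains why the predictions cohere, or when they won’t.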
We need, instead, task-relevant run-time algorithmic understanding. That means understanding in terms relevant to wanted and unwanted behavior in the situation of use. It means insight into what a system can or can’t do, and why. That can suggest both useful applications and risks.
Unfortunately, we don’t have much task-relevant run-time algorithmic understanding, because little effort has been made to gain it.
Far more research effort has gone into understanding training time than run time. This is an accident of history: current AI research culture descends from the machine learning field. Overemphasis on the mathematics of optimization (which is what “training” means) encourages black-box thinking, and leads to neglect of the domain-specific reasons AI systems work.
Research on task-relevant run-time behavior has also mainly been at the black box level. Most of it is either haphazard observation of individually interesting outputs or quantitative performance benchmarking, neither of which attempts to answer “how” questions.
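Black-box benchmarking has this general shape; the model here is a hypothetical stand-in function, and the benchmark items are made up for illustration:

```python
# Sketch of black-box benchmarking: score outputs against expected
# answers without asking *how* the system produced them.

def model(prompt):
    # Stand-in for an opaque AI system under test.
    return {"2+2?": "4", "capital of France?": "Paris"}.get(prompt, "?")

benchmark = [
    ("2+2?", "4"),
    ("capital of France?", "Paris"),
    ("3+5?", "8"),
]

score = sum(model(q) == a for q, a in benchmark) / len(benchmark)
print(f"accuracy: {score:.2f}")  # → accuracy: 0.67
```

The result is a single number. It says nothing about why the third answer failed, what class of inputs will fail similarly, or what the system will do instead, which is exactly the limitation of black-box evaluation.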
Run-time task-relevant algorithmic investigations are uncommon in AI research. The field rewards system performance, not understanding. It may take a huge amount of work to figure out what algorithms a network uses, because those emerge from training on a particular dataset, and weren’t engineered in.
This is not an adequate excuse, in my opinion. Lack of attention to task-relevant run-time algorithms leads to AI systems that don’t work well, because no one understands them. Then they can make inscrutable mistakes that may cause serious harms.
With task-relevant run-time algorithmic understanding, we should be able to:
- Find causes of good outputs, which may lead to new ways of enhancing them.
- Find causes of mistaken or unwanted outputs, which may lead to new ways of preventing them.
- Evaluate the likelihood of dangerous outputs in novel environments. We can do better than the current practice of feeding in lots of poorly characterized input data and measuring how frequently we get bad outputs. This could help with risk/benefit analysis before deployment.
- Predict specifically which bad behaviors are likely to occur under which circumstances.
- Find better ways of improving systems than current practice, which is limited to “alter the optimization criterion” and “increase the pressure to conform to it.” Those approaches are at best limited, and arguably fundamentally flawed and unsafe.
- Find better technologies for accomplishing the sorts of tasks for which backprop is currently the leading contender.
We can get this kind of understanding with science; with reverse engineering; and with synergies between the two.
Science, in this case, proceeds by formulating task-relevant algorithmic hypotheses and devising and running experimental tests for them. We create hypotheses via analysis of the task dynamics, knowledge of general backprop behavior, understanding of human psychology and neuroscience, and informal observations of existing systems. I discuss an example in “Classifying images,” later in this chapter.
Due to the complexity and inherent randomness of backprop networks, testing algorithmic hypotheses through examining input/output behavior may often be infeasible. Analyzing them at the mechanistic level first may yield algorithmic level insights. Sufficient mechanistic understanding—an explanation of how specific parts of the system contribute to its overall functioning—can reveal algorithms directly.
- 1. Run time is also referred to by other terms, such as “inference time” and “feedforward computation.” Usage doesn’t seem to have standardized yet. There’s a good analogy to compile time vs. run time in conventional software, though.
- 2. This idea is originally due to the pioneering computational neuroscientist David Marr in 1976, and remains influential in cognitive science. I’m using different terms for the three levels than he did, but they’re conceptually similar or identical.