We’ve seen that current AI practice leads to technologies that are expensive, difficult to apply in real-world situations, and inherently unsafe. Neglected scientific and engineering investigations can bring better understanding of the risks of current AI technology, and can lead to safer technologies.
AI is unavoidably hybrid as an intellectual discipline. It incorporates aspects of six others: science, engineering, mathematics, philosophy, design, and spectacle. Each of these contributes valuable ways of understanding, and their synergies power AI insights. Different schools of thought within AI emphasize some disciplines and deemphasize others; these differing emphases contribute to their different strengths and weaknesses.
The current backprop-based mainstream overemphasizes spectacle (the creation of impressive demo systems) and mathematics (function optimization methods). It neglects science (understanding how and why the networks work) and engineering (building reliable, efficient solutions to specific problems). Naturally, this has led to powerful optimization methods which can yield spectacular results, but which we don’t understand and which aren’t reliable or efficient when applied to specific problems.
To address these problems, I suggest taking the math less seriously; getting much more skeptical about spectacles; and doing AI research as science and engineering instead. Such research:
- May reveal that there is less to seemingly-spectacular results than meets the eye, thereby deflating hype (and consequently funding and deployment)
- May enable adding safety features to technologies similar to those we have now
- May lead to a full replacement of backprop with quite different, safer technologies.
This chapter of Gradient Dissent draws on my 2018 essay “How should we evaluate progress in AI?” That essay covers some of the same themes in greater depth, so you might like to read it if the discussion here is intriguing.
Current AI practice is neither science nor engineering
“Science” means “figuring out how things work.” “Engineering” means “designing devices based on an understanding of how they work.”1 Science and engineering are good. Current AI practice is neither.
AI research mainly aims at creating impressive demos, such as game-playing programs and chatbots, rather than scientific understanding. As I explained in “How should we evaluate progress in AI?”, such demos often do not show what they seem to—whether through carelessness in exposition, honest researcher confusion, or deliberate deception.
Recently, much effort has also gone into finding inputs (“prompts”) that somehow yield cool outputs. That treats the network as a “black box.”
Inference is the application of a network to an input.2 Inference time is use of the network after training has completed. For example, your use of DALL-E to make pictures is “at inference time.”
It’s almost taboo to look inside the box to find out how the network operates at inference time. Instead, nearly all AI research considers only input/output behavior. Specifically, it considers only a network’s average error, which backprop aims to minimize.
Current AI descends from machine learning as a field. It retains the emphasis on “learning,” i.e. error minimization algorithms, and neglects investigation of what trained networks do and how. Most effort has gone instead into fiddling with training algorithm details to make them work better. There may be an unconscious slip from recognizing that the “learning” phase is mysterious to assuming that inference is too; often, though, inference is much less mysterious.
The fundamental backprop algorithm is very simple, and makes mathematical sense, but it hardly ever works. Various kludgy add-ons address its typical failure modes. For example, minimizing error can lead to overfitting, so it’s common to optimize some complex regularized function of the error instead. Such alterations to the algorithm are ad hoc, and not derived from theoretical principles. In most cases, their operation can be adjusted by turning “knobs” called hyperparameters. The effects of the hyperparameters are poorly understood.
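As a toy illustration (a minimal sketch in plain Python, not any real training stack), here is gradient descent on a one-parameter model with an L2 regularization penalty bolted on. The learning rate and regularization strength are exactly the sort of “knobs” described above: nothing in the algorithm says what values they should take.

```python
# Gradient descent fitting y ≈ w * x by minimizing mean squared error plus
# an ad hoc L2 penalty. `learning_rate` and `reg_strength` are
# hyperparameters: knobs with no principled settings.

def train(xs, ys, learning_rate=0.1, reg_strength=0.01, steps=200):
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error, plus the regularization term.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        grad += 2 * reg_strength * w
        w -= learning_rate * grad
    return w

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # data generated by y = 2x
w = train(xs, ys)   # slightly below 2.0: the penalty trades fit for small w
```

Turning `reg_strength` up pulls `w` further from the best data fit; whether that helps or hurts on new data depends on the task, which is why such knobs end up tuned by trial and error.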
Getting good results from backprop usually depends on setting several hyperparameters just right. There’s no theoretical basis for that, and in practice “neural networks” get created by semi-random tweaking, or “intuition,” rather than applying principled design methods. Often, no hand-chosen set works well, so researchers run a hyperparameter search to try out many combinations of values, hoping to find one that produces an adequate network. (Might this necessity for epicycles be a clue that the approach is overall wrong?)
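A hyperparameter search can be sketched in a few lines. This toy version (with hypothetical `fit` and `error` helpers, not a real library API) tries every combination of two knobs and keeps whichever happens to score best on held-out data, which is exactly the unprincipled try-it-and-hope procedure described above:

```python
import itertools

def fit(xs, ys, learning_rate, steps):
    # Toy gradient-descent fit of y ≈ w * x (a stand-in for real training).
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w

def error(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_xs, train_ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
val_xs, val_ys = [4.0, 5.0], [8.0, 10.1]

# Try every combination of two knobs; keep whichever scores best on the
# held-out data. No theory says which combination should win, or why.
best = min(
    itertools.product([0.001, 0.01, 0.1], [10, 100, 1000]),
    key=lambda hp: error(fit(train_xs, train_ys, *hp), val_xs, val_ys),
)
```

Real searches are the same loop at vastly greater expense, over many more knobs, with each trial costing a full training run.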
The black box methodology has created amazing things. This practice of intuitive exploration of mechanism variations might be worthy of meta-level investigation as a new way of knowing that is neither science nor engineering.3 This could be a fascinating project in epistemological theory.
It would not, however, excuse the field from addressing serious safety questions before its products are deployed (that is, put into use). There’s currently no known way to do that apart from conventional science and engineering.
Lack of attention to inference-time mechanisms leads to AI systems that don’t work well, because no one understands them. Then they can make inscrutable mistakes that may cause serious harms.
Benchmark performance rarely reflects reality
Competitions to get the best score on “benchmarks” drive the creation of advanced AI systems. This motivates ad hoc kludgery rather than analysis and principled design.
A benchmark is a set of quiz problems used to measure the predictive accuracy of a backprop network. A random subset of the problems are used as training data; then the network is tested against the remainder. For many benchmarks, it’s considered exciting to get the error rate down below 50%.
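The protocol can be sketched as follows. To keep the sketch runnable, the “model” here is a deliberately trivial stand-in that memorizes the most common answer; a real benchmark run would train a network instead, but the split-train-test-report procedure is the same:

```python
import random
from collections import Counter

def train_model(train_set):
    # Trivial stand-in "learner": memorize the most common answer.
    return Counter(answer for _, answer in train_set).most_common(1)[0][0]

def predict(model, question):
    return model   # always guess the memorized majority answer

def evaluate_on_benchmark(problems, train_fraction=0.8, seed=0):
    rng = random.Random(seed)
    shuffled = problems[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train_set, test_set = shuffled[:cut], shuffled[cut:]
    model = train_model(train_set)
    wrong = sum(predict(model, q) != a for q, a in test_set)
    return wrong / len(test_set)   # the reported benchmark "error rate"
```

Note that nothing in this procedure asks what the model did or why; only the final number gets reported.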
Benchmarking incentivizes rapid, blind construction of systems by increased brute force, and by random tweaking to see what marginally decreases wrong outputs. Minimal effort goes into understanding what the networks do and how, and why they make the errors they do.
It’s considered perfectly normal and acceptable to build and deploy systems that you don’t understand and which give bad outputs much of the time. “We can keep decreasing the error rate” isn’t an adequate justification if errors are costly, harmful, or potentially disastrous.
Benchmarks are not objectively correct measures of performance. Most are rather haphazard collections of quiz problems assembled unsystematically and with minimal thought. Often the people employed to construct them have no relevant expertise. Many benchmarks are crowd-sourced on platforms such as Mechanical Turk, and are full of errors. Many other benchmarks are constructed mechanically, by fitting random values into fill-the-blanks templates. It is not surprising that a statistical method can, in effect, recover simple templates from the pattern of examples, without “understanding” anything.
It’s common for systems that get high scores on benchmarks to do dramatically worse when applied to seemingly similar tasks in practice.4 One common reason is that the benchmark examples are dissimilar to real-world ones in unexpected ways.
Another is that half-baked benchmark construction often creates easily-exploited spurious correlations in the training data. This bedevils “reasoning” benchmarks in particular, as we’ll see in Are language models Scary?
Overemphasis on the mathematics of optimization encourages black-box thinking, and neglect of the domain-specific reasons AI systems work.
A “neural network” is simply a mathematical function, like a quadratic: ax²+bx+c. Unlike a quadratic, which has three fixed parameters (a, b, and c), current networks have billions.
The result of training is a network. That is, training produces a function with the parameters set to values that cause its outputs to approximate the training data. Once training is complete, the parameters are fixed permanently.
Training doesn’t change the form of the network function, just its parameters. The form is mainly fixed and chosen by researchers.5 This architectural choice is usually critical to performance, for reasons that are poorly understood.
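A minimal sketch of this point in plain Python: the form of the function (here, two matrix multiplications with a nonlinearity between them) is fixed by hand, and training could only ever change the numbers inside the parameter matrices:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    # One matrix multiplication: matrix m (as rows) times vector v.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def network(params, inputs):
    w1, w2 = params   # the parameters; training adjusts only these numbers
    return matvec(w2, relu(matvec(w1, inputs)))

# Same fixed form, different parameter values: a different function.
params_a = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
params_b = ([[2.0, 0.0], [0.0, 2.0]], [[1.0, -1.0]])
out_a = network(params_a, [3.0, 4.0])   # [7.0]
out_b = network(params_b, [3.0, 4.0])   # [-2.0]
```

Current networks differ from this sketch in scale (billions of parameters, many layers) but not in kind: the architecture is the `network` function, and training outputs a `params`.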
Error backpropagation, or backprop for short, is the basis of algorithms used to train networks. It’s conceptually simple: essentially the same as the derivative-based methods for finding function maxima or minima.6 You may remember those from an introductory calculus class; if not, it’s not important for this discussion.
Backprop’s gradient descent7 method attempts to minimize the difference, for each training data input, between the network’s output and the correct output. That difference is the amount of error for that data item. Put differently, backprop optimizes the network’s fit to the data; equivalently, it creates an approximation to the data.
The “neural” functions backprop training gets applied to are compositions mainly of matrix multiplications, which are a central topic in linear algebra. Together, gradient descent and matrix multiplication make centuries of established knowledge in mainstream mathematics formally applicable.
This style of analysis has led to significant efficiency improvements. It also provides some insight into the mostly-mysterious question of why backprop works at all, despite extreme functional non-convexity.8
However, mathematical analysis of the optimization algorithm ignores the specific task the network is optimized for, and ignores the operation of the resulting network at inference time. Those are what matter for practical applications and for safety. Real-world tasks are usually not amenable to mathematical analysis, due to task complexity.9 Overvaluing mathematical results has reinforced the field’s “black box” approach.
Backprop almost certainly doesn’t work when applied to arbitrary problems. The reasons it works when it does are task-specific: optimization discovers particular patterns in the task domain that provide good performance (and occasional risky bad performance) in practice. We’ll see some examples in Are language models Scary?
Task-relevant mechanistic understanding with science
Backprop networks are mysterious largely because so little effort has gone into understanding them scientifically. Ordinary scientific practice may go a long way toward making them understandable, and thereby safer.
For AI safety, what matters is what a system may do when deployed. Research priorities should shift from haphazard observation of input-output behavior to task-relevant mechanistic understanding.
Mechanistic means learning how and why systems work, not just “how much,” as in benchmarking, or “what,” as in searching for inputs that produce cool outputs for unknown reasons. It means an explanation of how specific parts of the system contribute to its overall functioning.
Task-relevant means understanding in terms of wanted and unwanted behavior in the situation of use. That is not provided by, for example, the largely irrelevant mechanistic knowledge that network units compute a nonlinear function of the weighted sum of their inputs.
(I discuss the nature of task-relevant mechanistic understanding at greater length in “AI at the relevant level of explanation,” where I use “the inference-time algorithmic level” to name the same concept.)
With task-relevant mechanistic understanding, we should be able to:
- Find sources and causes of good outputs, which may lead to new ways of enhancing them.
- Find sources and causes of mistaken or unwanted outputs, which may lead to new ways of preventing them.
- Evaluate the likelihood of dangerous outputs in novel environments. We can do better than the current practice of feeding in lots of poorly-characterized input data and measuring how frequently we get bad outputs. This could help risk/benefit analysis before deployment.
- Predict specifically what bad behavior is likely to occur under which circumstances.
- Find better ways of improving systems than current practice, which is limited to “alter the optimization criterion” and “increase the pressure to conform to it.” Those are at best limited, and arguably fundamentally flawed and unsafe.
- Find better technologies for accomplishing the sorts of tasks for which backprop is currently the leading contender.
We can get that understanding with science and with reverse engineering, and with synergies between the two.
Science, in this case, proceeds by formulating task-relevant mechanistic hypotheses and devising and running experimental tests for them. We create hypotheses via analysis of the task dynamics, knowledge of general backprop behavior, understanding of human psychology and neuroscience, and informal observations of existing systems.
Are language models Scary? covers some examples at length: how some theories about image classifiers were invented, tested, and confirmed some years ago; plus hypotheses about current language models.
Task-relevant mechanistic understanding with reverse engineering
Current practice mainly treats backprop networks as inscrutable black boxes. Opening them up to examine their operation reveals that they work in straightforward ways. That should make them amenable to reengineering for greater safety and better performance.
We can break open backprop networks to find out how they do what they do, in task-relevant mechanistic terms. This is analogous to reverse engineering, except that backprop networks aren’t engineered. Molecular biology is another analog: understanding how metabolic pathways and genetic regulatory networks work by probing and teasing apart the many specific molecular interactions that make them up.
Investigations typically find that small pieces of backprop networks compute the sorts of things you’d expect them to given the task requirements.10 For example, specific circuits within image classifiers detect edges, which has been known for decades as critical in both mammalian and machine vision.11 Feed-forward modules in transformer language models act as key-value stores, with individual units representing specific facts.12
Altering one of those little bits causes the network to change the corresponding specific functionality, while retaining the rest. For example, a series of studies have found that specific facts, such as that the Eiffel Tower is in Paris, can be located in language models, and then the few relevant parameters can be directly modified so the model “believes” the Tower is in Rome.13
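Here is a toy analog of locating-and-editing (not the actual technique from those studies, just an illustration of the idea): a linear layer acting as a key-value store over one-hot keys. Each weight row carries one “fact,” so overwriting that row changes one association while leaving the rest intact:

```python
CITIES = ["paris", "london", "rome"]
ENTITIES = ["eiffel_tower", "big_ben"]

# One weight row per entity; each row scores the candidate cities.
weights = [
    [1.0, 0.0, 0.0],   # eiffel_tower -> paris
    [0.0, 1.0, 0.0],   # big_ben -> london
]

def lookup(entity):
    # "Inference": multiply a one-hot key into the weight matrix, take argmax.
    key = [1.0 if e == entity else 0.0 for e in ENTITIES]
    scores = [
        sum(key[i] * weights[i][j] for i in range(len(ENTITIES)))
        for j in range(len(CITIES))
    ]
    return CITIES[scores.index(max(scores))]

assert lookup("eiffel_tower") == "paris"

# "Edit" the few parameters carrying this one fact:
weights[0] = [0.0, 0.0, 1.0]   # eiffel_tower -> rome

assert lookup("eiffel_tower") == "rome"
assert lookup("big_ben") == "london"   # unrelated facts unaffected
```

In a real transformer the keys and values are distributed and learned rather than one-hot and hand-written, which is what makes locating the relevant parameters a research problem.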
Such knowledge, accumulated, dispels the aura of magic. It suggests that the network as a whole can be made comprehensible; and perhaps made less impressive, once understood in task-relevant mechanistic terms.
This level of understanding explains, for example, why image generators can produce photorealistic pictures of horses, but put the wrong number of legs on them about 20% of the time; and suggests reasons this might be hard to fix. That doesn’t matter for safety, but analogous errors do. If leg count were critical, forcing the error rate down just with more stringent training would be an inherently unsafe approach. Neel Nanda’s “Longlist of Theories of Impact for Interpretability” discusses safety benefits of task-relevant mechanistic understanding.14
Reverse engineering a backprop network is not so different from reverse engineering legacy code. Analogs of familiar methods work:
Instrument the code: Add apparatus that lets you trace the network’s computation during inference. For example, Meng et al. developed a technique for locating the pathways through a network that correspond to particular facts.15
Minimize naturally-occurring test cases: In Are language models Scary? I suggest that instead of building bigger language models to improve benchmark performance, we should see how small we can make them while maintaining performance. Smaller networks are likely to be easier to understand.
Construct small artificial test cases: Neel Nanda and Tom Lieberum’s full reverse engineering and mechanistic explanation of a small network optimized for modular arithmetic is a fascinating, beautiful example.16 It probably gives significant insight into the operation of large models (although this remains to be demonstrated).
Alter the system to make it easier to understand, without significantly changing functionality: For example, Elhage et al. recently demonstrated that making a small, theoretically-motivated change to the activation function decreases polysemy of units in a language model while retaining performance.17 Filan et al. describe regularization and initialization methods for increasing the modularity of networks in a graph cut framework.18
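The first of these methods, instrumenting inference, can be sketched in miniature: wrap each layer function so every call records its output, producing a trace of what the network computed and where. The two “layers” here are toy stand-ins for real network components:

```python
# Record every layer's output during inference, analogous to adding
# tracing to legacy code before trying to understand it.

trace = []

def instrumented(name, layer_fn):
    def wrapper(x):
        out = layer_fn(x)
        trace.append((name, out))   # record the intermediate activation
        return out
    return wrapper

layer1 = instrumented("layer1", lambda v: [x * 2 for x in v])
layer2 = instrumented("layer2", lambda v: [max(0.0, x - 1) for x in v])

layer2(layer1([1.0, -1.0]))
# trace now holds [("layer1", [2.0, -2.0]), ("layer2", [1.0, 0.0])]
```

Deep learning frameworks offer hooks for exactly this kind of interception; the point is that it is ordinary software instrumentation, not magic.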
Safer AI with conventional engineering
Current AI practice is not engineering, even when it aims for practical applications, because it is not based on scientific understanding. Enforcing engineering norms on the field could lead to considerably safer systems.
Work in AI is often described as “engineering.” It’s highly technical, it certainly isn’t science, and it’s not just mathematics, so it must be engineering by process of elimination? It’s most likely to be called “engineering” when it aims toward practical application, with the idea that engineering consists of solving practical problems with technology. That definition includes too much; most things you do with a computer (or even in the kitchen) solve practical problems with technology, and definitely aren’t engineering.
The American Engineers’ Council for Professional Development definition:
The creative application of *scientific principles* to design or develop structures, machines, apparatus, or manufacturing processes, or works utilizing them singly or in combination; or to construct or operate the same with *full cognizance of their design*; or to *forecast their behavior under specific operating conditions*; all as respects an intended function, economics of operation and *safety to life and property*.
I have added emphasis on features of engineering absent from current AI practice.
Fiddling with hyperparameters to get a better benchmark score is not engineering. This is not an arbitrary or values-neutral matter of definition. It’s a matter of norms; of the ethic and ethos of engineering. The chief engineer for a bridge construction project should be the first to drive across it.
Ideally, software should be proven correct. That is unusual in current software engineering practice. However, responsible software development projects require, at minimum, unit testing (checking that each of the parts works in isolation), integration testing (the system works as a whole), and code review (every part of the program gets examined for possible errors by someone other than its author).
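By analogy, unit testing a network would mean testing each understood component in isolation against its intended behavior, rather than only measuring an end-to-end benchmark score. A sketch, where `detect_step` is a hypothetical stand-in for a reverse-engineered edge-detector-like unit, not a real extracted circuit:

```python
def detect_step(signal, threshold=0.5):
    """Return indices where the signal jumps by more than `threshold`."""
    return [
        i for i in range(1, len(signal))
        if abs(signal[i] - signal[i - 1]) > threshold
    ]

# Unit tests: each assertion checks one property of the component alone.
assert detect_step([0.0, 0.0, 1.0, 1.0]) == [2]    # finds a single edge
assert detect_step([0.0, 0.1, 0.2, 0.3]) == []     # ignores gradual drift
assert detect_step([0.0, 1.0, 0.0]) == [1, 2]      # finds both directions
```

Writing such tests is only possible once we know what a component is supposed to compute, which is exactly what task-relevant mechanistic understanding provides.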
We should require analogous practices when backprop nets are deployed in situations in which errors matter. That would be very expensive currently, but objecting to that is like complaining that safety engineering for cars, or any other heavy machinery, is expensive. If you want to manufacture automobiles, you have to pay that cost.
Making this a requirement would incentivize developing better testing and debugging tools. I expect those are possible, given recent progress, and given how little effort has been put into developing them.
Typically, when you understand the function of a piece of a network, you find that it computes that only probabilistically and approximately. In some cases, researchers can replace those bits with deterministic, engineered, exact equivalents. For example, Cammarata et al. replaced backprop-derived curve-detection units with manually engineered ones in “Curve circuits.” Such re-engineering may increase reliability without affecting performance.
Ideally, the entire network could be replaced with a deterministic, fully understood, engineered alternative. This should be an AI safety engineering goal.
Engineered systems can’t be guaranteed absolutely safe. Even if we understand exactly what a device does, its interaction with unpredictable situations will be inherently unpredictable. You can’t engineer a perfectly safe car, because of black ice, landslides, and drunk drivers. Cars are statistically safer than horses, however; engineered solutions can be more predictable than those that emerge from an optimization process, whether backprop or evolution.
Adversarial research versus confounding spectacles
Forceful debunking may be necessary to persuade AI researchers to do the necessary science instead of aiming for spectacular but misleading demos. That will require new incentives and institutions.
Trying to show you can get a system to do something cool—especially when it’s anything cool, you don’t much care what—is neither science nor engineering. That’s the dominant mode in current AI research practice, though. This drives competition on benchmarks, and publicizing individual exciting outputs—texts and images—without bothering to investigate what produced them.
It is natural and legitimate to want to amaze people. You are excited about your research program, and you want to share that. Your research is probably also driven by particular beliefs, and it’s natural to want to persuade people of them. A spectacular demonstration can change beliefs, and whole ways of thinking, in minutes—far faster than any technical exposition or logical argument.
The danger is in giving the impression that a program can do more than it does in reality; or that what it does is more interesting than it really is; or that it works according to your speculative explanation, which is exciting but wrong. The claims made in even the best current AI research are often misleading. Results (even when accurately reported) typically do not show what they seem to show. If the public learns a true fact, that the program does X in a particular, dramatic case, it’s natural to assume it can do X in most seemingly-similar cases. But that may not be true. There is, almost always, less to spectacular AI “successes” than meets the eye. This deception, usually unintended, takes in researchers as well as outsiders, and contributes to AI’s perennial hype cycle.
Zachary C. Lipton and Jacob Steinhardt attribute this to:
”(i) failure to distinguish between explanation and speculation; (ii) failure to identify the sources of empirical gains, e.g., emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning; (iii) mathiness: the use of mathematics that obfuscates or impresses rather than clarifies…; and (iv) misuse of language, e.g., by choosing terms of art with colloquial connotations or by overloading established technical terms.”19
The first two of these could be summarized as failure to perform control experiments: that is, to consider and test alternative possible explanations. A common bad practice, for instance, is to assume that a network has “learned” what you wanted it to if it gets a “good” score on a supposedly relevant benchmark. An alternative explanation, that it has gamed the benchmark by finding some statistical regularity other than the one you wanted, is more often true. That gets tested too rarely.
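The missing control experiment is often easy to state. In this runnable toy, the benchmark is deliberately flawed (the label correlates with question length, as template-built benchmarks often accidentally arrange), and a trivial shortcut that understands nothing scores perfectly:

```python
import random

rng = random.Random(0)

def make_item():
    # Flawed template: "yes" items happen to come out longer than "no" items.
    if rng.random() < 0.5:
        return ("Is the tall green tower in the big city old?", "yes")
    return ("Is the tower old?", "no")

benchmark = [make_item() for _ in range(1000)]

def shortcut_model(question):
    # "Gaming" the benchmark: exploit the spurious length correlation,
    # understanding nothing about towers or cities.
    return "yes" if len(question) > 25 else "no"

correct = sum(shortcut_model(q) == a for q, a in benchmark)
accuracy = correct / len(benchmark)   # 1.0 on this flawed benchmark
```

The control experiment is to run such cheap baselines before crediting a network with “understanding”: if a length counter matches the network’s score, the benchmark result shows nothing.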
Figuring out how and where and why backprop works will require shifting research incentives from spectacle to understanding. A conventional approach would try to shift academic peer review toward more careful application of proper scientific norms. This wouldn’t work: because the most-hyped research is done in industrial labs, and because the field’s de facto standards for peer review are so abysmal that reform, even if possible, would take much too long.
I suggest incentivizing adversarial AI research that would perform missing control experiments, with the expectation they will show that many hyped results do not mean what they seem to.20 It would address questions like:
- Did it work for the reason the authors thought it did?
- If not, why did it work?
- Under what circumstances will it work, or not?
- What does it do when it doesn’t work?
- What other methods might have worked better?
- Did they report all the things they tried that didn’t work? No, they did not. For example, they ran thousands of hyperparameter combinations; how and why did most fail? We could probably learn from that, but such analysis is rarely attempted.
Fame and funding should flow to researchers who answer these questions. This demands unusual funders, and unusual researchers.
Funders want to see “progress.” This research program aims at negative progress in spectacle generation. Ideally, its results should be disappointing to the public. It should progress scientific understanding instead. That is more valuable in the long run, but it may take unusual fortitude for funders to accept the tradeoff.21
Most researchers don’t do control experiments, or enough of them, unless they are forced to. Control experiments are boring, a huge amount of work, and are likely to show that your result is less interesting than you hoped. There are many possible explanations for any result; ruling out all of them except the exciting one takes much more work than futzing around with a network to get it to produce some cool outputs. Control experiments seem like janitorial scutwork.
What could motivate researchers to do other people’s janitorial scutwork, to clean up the mess? Some researchers actually care about the truth, and are naturally attracted to this sort of work. Watching hype overwhelm concern for truth also gets some of us angry, in a useful way. It’s an unattractive but important fact that annoyance with other researchers’ carelessness drives much of the best scientific work.
Competitiveness also motivates benchmark-chasing (“we beat the team from the Other Lab!”). I expect it’s feasible to shift some of that emotional energy to “we showed that the Other Lab were fooling themselves and they were wrong, hah hah!”
This debunking must concentrate on the best work; 90% of research in all fields is of course crud, and not worth refuting. The aim should be to demonstrate ways even the best is misleading (it over-claims, overgeneralizes, fudges, cherry-picks, etc.). Some AI critics do this now, which is valuable, but their analyses are relatively shallow and abstract (“the system doesn’t really understand language, because look, it gets these five examples wrong”). That’s not their fault; detailed control experiments take a lot of work and computational resources.
The psychology replication movement would be a valuable model. Many of the field’s long-held beliefs were due to researchers fooling themselves with inadequate controls, non-reporting of negative results, overgeneralizing narrow findings, refusing to make code and data available to other researchers to analyze, and other such epistemological failures.22 This got some young researchers angry enough to force major incentive changes. Those make newer work far more credible.
Researchers chase funding and prestige. Explicit calls for proposals for adversarial work could motivate with funding. Prizes could motivate with prestige.
Academia is a more natural environment for debunking than industry. However, given the magnitude of the task, academic AI labs may not be a good structure for the scale required. An independent “adversarial AI lab” might concentrate the necessary resources.23 A Focused Research Organization might provide the right structure for the task.
- 1. These are rough definitions only, but adequate here.
- 2. This term is potentially misleading in that most networks don’t do anything similar to what’s called “inference” in any other context.
- 3. Michael Nielsen’s “The role of ‘explanation’ in AI” makes this case. Subbarao Kambhampati’s “Changing the Nature of AI Research” is a somewhat skeptical discussion of the approach.
- 4. This is notably true of language models; see Kiela et al., “Dynabench: Rethinking Benchmarking in NLP” (2021) for examples, explanations, and proposed remedies.
- 5. Sometimes details of the form are varied during hyperparameter search, since there is no principled theory of why particular forms work better than others for particular applications.
- 6. In a narrower, more technical usage, “backpropagation” refers only to one aspect of the algorithm: computation of error gradients layer-by-layer, caching intermediate results for efficiency. Commonly, though, “backpropagation” is used somewhat vaguely to refer to any gradient-based optimization method for “neural networks.”
- 7. You can blame the pun “Gradient Dissent” on @lumpenspace.
- 8. The math also makes for impressive-looking papers with plenty of intimidating and potentially obfuscatory notation. That often helps get triviality or nonsense through peer review.
- 9. This is unlike engineering applications, in which the environment can be sufficiently constrained. For example, “no vehicles on the bridge over 30 tons” enables stress analysis. Backprop is fundamentally incompatible with engineering.
- 10. Much of the best work in this area has been done by Chris Olah and his collaborators. For an overview as of 2020, see their “Thread: Circuits.”
- 11. Olah et al., “An Overview of Early Vision in InceptionV1.”
- 12. Dai et al., “Knowledge Neurons in Pretrained Transformers.”
- 13. For a summary of one such study, plus a useful literature review, see Meng et al., “Locating and Editing Factual Associations in GPT.” As of late 2022, the state of the art is Meng et al., “Mass-Editing Memory in a Transformer.”
- 14. This is often termed “interpretability” in AI safety, but that term is overloaded and confusing; see Zachary C. Lipton’s “The Mythos of Model Interpretability.” For a detailed survey of recent results in “interpretability” under various definitions, see Peter Hase and Owen Shen’s “Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers.”
- 15. Meng et al., “Locating and Editing Factual Associations in GPT.”
- 16. Neel Nanda and Tom Lieberum, “A Mechanistic Interpretability Analysis of Grokking.”
- 17. Elhage et al., “Softmax Linear Units.”
- 18. Filan et al., “Clusterability in Neural Networks.”
- 19. Zachary C. Lipton and Jacob Steinhardt, “Troubling Trends in Machine Learning Scholarship.” Also see Hullman et al., “The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning” for a similar, detailed analysis.
- 20. As I mention in Are language models Scary?, I did several such experiments around 1990. Mostly I didn’t bother to publish them, because I guessed (correctly) that backprop would fizzle when other people realized the then-hyped results were mostly researchers fooling themselves. I did write up one in “Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons” (with Leslie Kaelbling). We found that a success story for reinforcement learning with backprop worked slightly better when we replaced the backprop network with a linear perceptron, and we could explain why the perceptron worked in task-relevant terms.
- 21. AI safety organizations may be a natural fit; they should want to promote deeper understanding, as opposed to progress in AI capabilities. That’s particularly true regarding “what can the system do when it doesn’t behave as its creators intended?” On the other hand, safety organizations want to persuade the public that AI is dangerous, and discovering how current systems are less powerful than they look might not help.
- 22. See “How should we evaluate progress in AI?” and “Troubling Trends in Machine Learning Scholarship” for discussion of many other questionable research practices and specific epistemological errors.
- 23. I got this idea in 2015 when I was annoyed with exaggerated claims for image classifiers, as discussed in Are language models Scary?