Current AI practice is not engineering, even when it aims for practical applications, because it is not based on scientific understanding. Enforcing engineering norms on the field could lead to considerably safer systems.
Work in AI is often described as “engineering.” It’s highly technical, it certainly isn’t science, and it’s not just mathematics, so it must be engineering by process of elimination? It’s most likely to be called “engineering” when it aims toward practical application, with the idea that engineering consists of solving practical problems with technology. That definition includes too much, though; most things you do with a computer (or even in the kitchen) solve practical problems with technology, and definitely aren’t engineering.
The American Engineers’ Council for Professional Development definition:
The creative application of *scientific principles* to design or develop structures, machines, apparatus, or manufacturing processes, or works utilizing them singly or in combination; or to construct or operate the same with *full cognizance of their design*; or to *forecast their behavior* under specific operating conditions; all as respects an intended function, economics of operation and *safety to life and property*.
I have added emphasis on features of engineering mainly absent from current AI practice.
Considerable genuine engineering has gone into getting known optimization algorithms to run efficiently on available hardware, and into designing new hardware that will run known algorithms faster. However, the algorithms themselves are not derived from engineering; they are not based on scientific principles, they are not rationally designed, we have very limited ability to forecast their behavior, and safety is at best an afterthought.
Fiddling with hyperparameters to get a better benchmark score—the main mode of AI system development—is not engineering. Much effort also goes into finding inputs (“prompts”) that somehow yield “good” outputs. That is also not engineering.
This judgement is not an arbitrary or values-neutral matter of definition. It’s a matter of norms; of the ethic and ethos of engineering. The chief engineer for a bridge construction project should be the first to drive across it. A sane engineer does not trust a bridge that “only” failed three percent of the time in a simplistic simulation.
The black box tinkering methodology has created amazing things. This practice of intuitive exploration of mechanism variations might be worthy of meta-level investigation as a new way of knowing that is neither science nor engineering.1 This could be a fascinating project in epistemological theory.
It would not, however, excuse the field from addressing serious safety questions before its products are deployed. There’s currently no known way to do that apart from conventional science and engineering.
Task-relevant mechanistic understanding with reverse engineering
Current practice mainly treats backprop networks as inscrutable black boxes. Opening them up to examine their operation often reveals that they work in straightforward ways. That should make them amenable to reengineering for greater safety and better performance.
Current AI research retains the machine learning field’s emphasis on “learning,” meaning error minimization algorithms, and neglects investigation of what trained networks do and how. For this reason, there may be an unconscious slip from recognizing that the “learning” phase is mysterious to assuming that the run-time computation is too. It’s almost taboo to look inside the black box to find out how the network operates at run time. That often turns out to be less mysterious than the emergence of capabilities during training.
We can break open backprop networks to find out how they do what they do, in task-relevant mechanistic terms. This is analogous to reverse engineering, except that backprop networks aren’t engineered. Molecular biology is another analog: understanding how metabolic pathways and genetic regulatory networks work by probing and teasing apart the many specific molecular interactions that make them up.
Investigations typically find that small pieces of backprop networks compute the sorts of things you’d expect them to given the task requirements.2 For example, specific circuits within image classifiers detect edges, which has been known for decades as critical in both mammalian and conventional machine vision systems.3 Feed-forward modules in GPTs act as key-value stores, with individual units representing specific facts.4
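The key-value reading of a feed-forward module can be made concrete with a toy sketch. Under this interpretation, each row of the first weight matrix is a “key” that detects a pattern in the input, and the matching row of the second matrix is a “value” it writes back; the dimensions and weights below are made-up stand-ins, not taken from any real model.

```python
import numpy as np

# Toy sketch of the key-value reading of a transformer feed-forward
# module (after Dai et al.). Rows of W_in act as "keys" matched
# against the input; rows of W_out act as "values" mixed in
# proportion to how strongly each key fires.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

W_in = rng.normal(size=(d_ff, d_model))   # keys: one pattern detector per unit
W_out = rng.normal(size=(d_ff, d_model))  # values: what each unit writes back

def ffn(x):
    scores = np.maximum(W_in @ x, 0.0)    # how strongly each key matches (ReLU)
    return W_out.T @ scores               # weighted sum of value vectors

x = rng.normal(size=d_model)
out = ffn(x)

# The output decomposes exactly into per-unit contributions, which is
# what lets researchers attribute specific facts to specific units:
contributions = (np.maximum(W_in @ x, 0.0)[:, None] * W_out).sum(axis=0)
assert np.allclose(out, contributions)
```

The exact decomposition in the last two lines is the point: the module’s output is a sum of separable per-unit contributions, so individual units can meaningfully “represent specific facts.”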
Altering one of those little bits causes the network to change the corresponding specific functionality, while retaining the rest. For example, a series of studies have found that specific facts, such as that the Eiffel Tower is in Paris, can be located in GPTs, and then the few relevant parameters can be directly modified so the model “believes” the Tower is in Rome.5
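In the key-value picture, such an edit has a simple shape, which the toy sketch below illustrates. The real methods (ROME and its successors) locate the relevant weights with causal tracing and apply a closed-form rank-one update; here the “paris” and “rome” vectors and the chosen unit number are fabricated purely for illustration.

```python
import numpy as np

# Toy illustration of direct fact editing in the key-value picture.
# This only shows the *shape* of the intervention, not the real
# locating-and-editing algorithm.
rng = np.random.default_rng(1)
d_model, d_ff = 8, 32
W_out = rng.normal(size=(d_ff, d_model))  # value vectors, one per unit

paris = rng.normal(size=d_model)  # made-up embedding stand-ins
rome = rng.normal(size=d_model)

before = W_out.copy()

# Suppose probing identified unit 17 as the one that writes "Paris"
# into the residual stream when the prompt mentions the Eiffel Tower.
fact_unit = 17
W_out[fact_unit] += rome - paris  # swap which fact the unit writes

# Only that one unit's parameters changed; the other 31 value
# vectors, and whatever they encode, are untouched.
changed = np.any(W_out != before, axis=1)
assert changed.sum() == 1 and changed[fact_unit]
```

The final assertion captures why this matters for safety: the edit is surgical, changing one identified piece of functionality while provably leaving the rest of the weights alone.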
Such knowledge, accumulated, dispels the aura of magic. It suggests that the network as a whole can be made comprehensible; and perhaps made less impressive, once understood in task-relevant mechanistic terms.
This level of understanding explains, for example, why vintage-2022 image generators could produce photorealistic pictures of horses, but put the wrong number of legs on them about 20% of the time.6 That doesn’t matter for safety, but analogous errors do. If leg count were critical, forcing the error rate down just with more stringent training would be an inherently unsafe approach. Neel Nanda’s “Longlist of Theories of Impact for Interpretability”7 discusses safety benefits of task-relevant mechanistic understanding.8
Reverse engineering a backprop network is not so different from reverse engineering legacy code. Analogs of familiar methods work:
Instrument the code: Add apparatus that lets you trace the network’s computation at run time. For example, Meng et al. developed a technique for locating the pathways through a GPT network that correspond to particular facts.9 Several researchers have developed “lens” methods that reveal how each layer in a transformer refines the prediction of the next word.10
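The core of a “lens” method can be sketched in a few lines: project each layer’s intermediate representation through the model’s output (unembedding) matrix to see what the model would predict if it stopped there. The matrices below are random stand-ins for a real transformer’s weights, and real lens methods add refinements (such as applying the final normalization, or learned per-layer translators) that this sketch omits.

```python
import numpy as np

# Sketch of the "lens" idea: decode each layer's residual-stream
# vector as if it were the final one, to watch the prediction form.
rng = np.random.default_rng(0)
d_model, vocab = 16, 50

W_unembed = rng.normal(size=(d_model, vocab))          # toy output matrix
layer_outputs = [rng.normal(size=d_model) for _ in range(4)]  # one per layer

def lens(h):
    logits = h @ W_unembed
    return int(np.argmax(logits))  # the token this layer "currently" predicts

predictions = [lens(h) for h in layer_outputs]
# Watching `predictions` change layer by layer shows how the network
# refines its next-word guess as computation proceeds.
```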
Construct small artificial test cases: Nanda et al.’s full reverse engineering and mechanistic explanation of a small network optimized for modular arithmetic is a fascinating, beautiful example.11 It probably gives significant insight into the operation of large models (although this remains to be demonstrated).
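The algorithm Nanda et al. found inside that network is itself small enough to sketch: the trained model represents each residue as an angle and adds angles using the trigonometric addition formulas, so that adding angles adds residues mod p. The sketch below is a simplification, using a single frequency and a direct angle readout in place of the network’s several key frequencies and logit computation; p = 113 is the modulus from the paper’s task.

```python
import math

# Modular addition via rotation: the mechanism reverse-engineered
# from the grokked network, stripped to its mathematical core.
p = 113  # the modulus used in the paper's task

def mod_add_via_rotation(a, b):
    # Represent a and b as angles; the trig addition formulas (which
    # the network's neurons implement) compute the summed angle.
    theta_a, theta_b = 2 * math.pi * a / p, 2 * math.pi * b / p
    c = math.cos(theta_a) * math.cos(theta_b) - math.sin(theta_a) * math.sin(theta_b)
    s = math.sin(theta_a) * math.cos(theta_b) + math.cos(theta_a) * math.sin(theta_b)
    # Read out the residue whose angle matches (the "readout" step):
    angle = math.atan2(s, c) % (2 * math.pi)
    return round(angle * p / (2 * math.pi)) % p

# The trigonometric algorithm computes modular addition exactly:
assert all(mod_add_via_rotation(a, b) == (a + b) % p
           for a in range(p) for b in range(p))
```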
Minimize naturally-occurring test cases: Instead of building bigger GPTs to improve benchmark performance, we should see how small we can make them while maintaining performance. Smaller networks are likely to be easier to understand. (The upcoming “Better text generation” section covers this.)
Automate diagnostic methods: Determining the function of bits of a network is mainly still a painstaking manual process, requiring significant human intuition. However, as methods for that are validated and systematized, they can be automated; some early work of this sort has been demonstrated.12
Alter the system to make it easier to understand, without significantly changing functionality: For example, Elhage et al. demonstrated that making a small, theoretically-motivated change to the activation function decreases polysemy of units in a GPT while retaining performance.13 Filan et al. describe regularization and initialization methods for increasing the modularity of networks in a graph cut framework.14
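Elhage et al.’s modified activation, the Softmax Linear Unit, is simple enough to state directly: solu(x) = x · softmax(x), applied across the hidden dimension. The softmax factor pushes each neuron toward firing alone, which is the intuition for why it reduces polysemanticity. (The paper also applies a LayerNorm after this; the sketch below omits it for brevity.)

```python
import numpy as np

# The SoLU activation: multiply each pre-activation by the softmax
# over all pre-activations, so the largest one dominates the output.
def solu(x):
    e = np.exp(x - x.max())        # numerically stable softmax
    return x * (e / e.sum())

x = np.array([4.0, 1.0, 0.5, -2.0])
y = solu(x)
# The largest pre-activation passes through nearly unchanged while
# the rest are suppressed toward zero, encouraging one-unit-per-
# feature behavior.
```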
Engineering safer AI
Ideally, software should be proven correct. That is unusual in current software engineering practice. However, responsible software development projects require, at minimum, unit testing (checking that each of the parts works in isolation), integration testing (the system works as a whole), and code review (every part of the program gets examined for possible errors by someone other than its author).
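What would those practices look like applied to a network? The self-contained sketch below uses a deliberately trivial two-stage “model” (a feature detector feeding a thresholded decision) as a stand-in; the point is the shape of the tests, not the model.

```python
import numpy as np

# A toy two-stage pipeline standing in for a real model.
kernel = np.array([-1.0, 0.0, 1.0])          # toy edge-detector "unit"

def detect(signal):
    return np.convolve(signal, kernel, mode="valid")

def classify(signal):
    return bool(np.abs(detect(signal)).max() > 0.5)

# Unit test: the detector fires on a step edge, not on a flat signal.
step = np.array([0.0, 0.0, 1.0, 1.0])
flat = np.zeros(4)
assert np.abs(detect(step)).max() > np.abs(detect(flat)).max()

# Integration test: the assembled pipeline classifies both cases
# correctly, checking the system as a whole rather than a benchmark
# average.
assert classify(step) is True
assert classify(flat) is False
```

The unit test checks an identified component against its intended function in isolation; the integration test checks the deployed behavior end to end. Once the function of a network’s pieces is known, both kinds of test become writable.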
We should require analogous practices when backprop nets are deployed in situations in which errors matter. That would be very expensive currently, but objecting to that is like complaining that safety engineering for cars is expensive. If you want to manufacture automobiles, you have to pay that cost.
Making this a requirement would incentivize developing better testing and debugging tools. I expect those are possible, given recent progress, and given how little effort has been put into developing them so far.
Typically, when you understand the function of a piece of a network, you find that it computes its function only probabilistically and approximately. In some cases, researchers can replace those bits with deterministic, engineered, exact equivalents. For example, Cammarata et al. successfully replaced backprop-derived curve-detection units in an image classifier with manually engineered ones.15 Such re-engineering may increase reliability without affecting performance.
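A toy version of that substitution: a “learned” detector that only approximates a vertical-edge filter gets replaced by the exact hand-engineered kernel (here a Sobel filter). Real curve detectors are far larger and the matching is harder, but the shape of the intervention is the same.

```python
import numpy as np

# Replace a noisily-learned unit with an exact engineered equivalent.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)  # hand-engineered edge filter

rng = np.random.default_rng(0)
learned = sobel_x + 0.05 * rng.normal(size=(3, 3))  # noisy approximation

def respond(kernel, patch):
    return float((kernel * patch).sum())

edge_patch = np.tile([0.0, 0.0, 1.0], (3, 1))  # vertical step edge

# The engineered unit behaves like the learned one on its target
# stimulus, but is exact, documented, and auditable:
assert abs(respond(learned, edge_patch) - respond(sobel_x, edge_patch)) < 0.5
```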
Ideally, the entire network could be replaced with a deterministic, fully understood, engineered alternative. This should be an AI safety engineering goal.
Engineered systems can’t be guaranteed absolutely safe. Even if we understand exactly what a device does, its interaction with unpredictable situations will be inherently unpredictable. You can’t engineer a perfectly safe car, because of black ice, landslides, and drunk drivers. Cars are statistically safer than horses, however; engineered solutions can be more predictable than those that emerge from an optimization process, whether backprop or evolution.
- 1.Michael Nielsen’s “The role of ‘explanation’ in AI” (Sporadica, 09-30-2022) makes this case. Subbarao Kambhampati’s “Changing the Nature of AI Research” (Communications of the ACM, Volume 65, Issue 9, pp 8–9) is a somewhat skeptical discussion of the approach.
- 2.Much of the best work in this area has been done by Chris Olah and his collaborators. For an overview as of 2020, see their “Thread: Circuits” (distill.pub/2020/circuits/).
- 3.Olah et al., “An Overview of Early Vision in InceptionV1,” distill.pub/2020/circuits/early-vision/.
- 4.Dai et al., “Knowledge Neurons in Pretrained Transformers,” arXiv:2104.08696v2, 2022.
- 5.For a summary of one such study, plus a useful literature review, see Meng et al., in “Locating and Editing Factual Associations in GPT,” rome.baulab.info. As of late 2022, the state of the art is Meng et al., “Mass-Editing Memory in a Transformer,” arXiv:2210.07229v2.
- 6.As of September 2022, according to François Chollet, a prominent researcher in the area, at twitter.com/fchollet/status/1573879858203340800. Getting body part counts right (fingers are especially difficult) has been a main focus of improvement in image generator development since then.
- 7.LessWrong, 11th Mar 2022.
- 8.This is often termed “interpretability” in machine learning research, but that term is overloaded and confusing; see Zachary C. Lipton’s “The Mythos of Model Interpretability,” arXiv:1606.03490, 2016.
- 9.Meng et al., “Locating and Editing Factual Associations in GPT,” arXiv:2202.05262v5, 2022.
- 10.For example, Belrose et al., “Eliciting Latent Predictions from Transformers with the Tuned Lens,” arXiv:2303.08112, March 2023.
- 11.“Progress measures for grokking via mechanistic interpretability,” arXiv:2301.05217, 12 Jan 2023.
- 12.For example, in Conmy et al., “Towards Automated Circuit Discovery for Mechanistic Interpretability,” arXiv:2304.14997, 28 Apr 2023.
- 13.Elhage et al., “Softmax Linear Units,” transformer-circuits.pub/2022/solu/index.html.
- 14.Filan et al., “Clusterability in Neural Networks,” arXiv:2103.03386, 2021.
- 15.“Curve circuits,” distill.pub/2020/circuits/curve-circuits/.