Fight DOOM AI with SCIENCE! and ENGINEERING!!

Current AI practices produce technologies that are expensive, difficult to apply in real-world situations, and inherently unsafe. Neglected scientific and engineering investigations can bring better understanding of specific risks of current AI technology, and can lead to safer technologies.

“Science” means “figuring out how things work.” “Engineering” means “designing devices based on an understanding of how they work.”¹ Science and engineering are good. Current AI practice is neither.

Most AI research is not science. In fact, the field actively resists figuring out how AI systems work. It aims at creating impressive demos, such as game-playing programs and chatbots, more often than attempting scientific understanding. The demos often do not show what they seem to.

Most applied AI work is not engineering, even when it produces practical applications, because it is not based on scientific understanding. It creates products by semi-random tweaking, rather than applying principled design methods. Consequently, the resulting systems are unreliable and unsafe.

Enforcing conventional scientific and engineering norms on AI could lead to considerably safer systems. A somewhat-technical companion document, Gradient Dissent, explains how.

From an engineering point of view, “neural” networks are awful. They are enormously expensive. Getting adequate results for new real-world uses is often impossible, and usually takes person-years of work by specialists if it succeeds. Most importantly from an AI risk point of view, they are unfixably unreliable, and therefore unacceptably unsafe for most applications.

We would reject any other technology that violated basic engineering criteria so completely. A subsequent section suggests public relations motivations for spending tens of billions of dollars developing a technology with these drawbacks. But it’s also true that neural networks can do things that no other current technology can do at all. Despite their unreliability, some outputs from text and image generators are astonishing.

“Neural” networks are mysterious largely because so little effort has gone into understanding them scientifically. Current research mainly treats them as inscrutable black boxes. Ordinary scientific practice may go a long way toward making them understandable, and thereby safer.

It also seems likely to me that, over the next several years, we’ll develop alternative technologies that are less expensive, easier to understand, and safer. Gradient Dissent sketches one way this might be achieved for text generation AI. It suggests separating linguistic ability from factual knowledge; using other methods to gain the former; and relying on curated text for knowledge instead of storing it in the “neural” network.

What you can do

Technology company executives and regulators can require adequate assurances of safety before allowing the release of AI systems. This has been a main goal for AI ethics organizations, and anyone else concerned with AI risk should join in the effort.

Responsible software engineering projects require extensive testing and code review. We should require analogous practices when machine learning systems are deployed in situations in which errors matter. That would be very expensive now, because AI systems are enormous and because very little effort has gone into creating tools for investigating them.

However, objecting to that is like complaining that safety engineering for cars is very expensive. If you want to manufacture automobiles, you have to pay that cost. Imposing this requirement will motivate AI companies to develop testing tools that don’t exist yet, but should.

Opening up AI black boxes to examine their operation is called mechanistic interpretability in the field. It often reveals that AI systems are not so mysterious after all, and work in straightforward ways that make scientific and engineering sense. That may make them amenable to reengineering for greater safety and better performance.
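As a toy illustration of what “opening the box” means, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the three-unit network, its hand-set weights (nothing here is trained), and the names `forward`, `identify`, and `interpret` are hypothetical, not taken from any real system or library. It records what each internal unit outputs on every possible input, then checks those outputs against a short list of candidate functions. Real interpretability work on large networks is far harder, but the basic move is the same: look inside, and ask what each part computes.

```python
from itertools import product

def step(z):
    """Threshold activation: the unit fires (1) when its input is positive."""
    return 1 if z > 0 else 0

# Hypothetical hand-set weights, as ((w1, w2), bias) pairs. Nothing is trained.
HIDDEN = [((1.0, 1.0), -0.5), ((1.0, 1.0), -1.5)]
OUTPUT = ((1.0, -1.0), -0.5)

def forward(x):
    """Run the network on a 2-bit input; return (hidden activations, output)."""
    hidden = tuple(
        step(sum(w * xi for w, xi in zip(ws, x)) + b) for ws, b in HIDDEN
    )
    ws, b = OUTPUT
    out = step(sum(w * h for w, h in zip(ws, hidden)) + b)
    return hidden, out

# Candidate explanations for what a unit might be computing.
CANDIDATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}

def identify(unit_values, inputs):
    """Return the names of candidates that match a unit on every input."""
    return [
        name
        for name, f in CANDIDATES.items()
        if all(f(*x) == v for x, v in zip(inputs, unit_values))
    ]

def interpret():
    """Label each hidden unit and the output with the functions it matches."""
    inputs = list(product((0, 1), repeat=2))
    traces = [forward(x) for x in inputs]
    report = {}
    for i in range(len(HIDDEN)):
        report[f"hidden_{i}"] = identify([h[i] for h, _ in traces], inputs)
    report["output"] = identify([o for _, o in traces], inputs)
    return report

if __name__ == "__main__":
    for unit, matches in interpret().items():
        print(unit, matches)
```

In this sketch, the report labels one hidden unit as OR, the other as AND, and the output as XOR: the “network” computes exclusive-or as “OR but not AND.” Exhaustive enumeration works only because there are four possible inputs. For realistic systems, that step has to be replaced by sampling and much more sophisticated analysis.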

This seems to me the most promising short-term technical approach to increased AI safety. There has been little incentive for it. Instead, the field has rewarded the development of new and improved capabilities, without understanding.

Funders, including governments, can support mechanistic interpretability research and, going a step further, can encourage the development of the discipline by issuing RFPs, organizing workshops, and recognizing outstanding work.

In the longer term, funders can encourage efforts to find alternatives to “neural” methods, which are exceptionally risky. Their remarkable effectiveness may be due to enormous prior financial investment, rather than any intrinsic merit. In any case, putting all our eggs in this one basket seems unwise. Vastly more funding has gone into this one technology than all other AI methods combined. If we must have AI, we should seek to replace “neural” networks with simpler, cheaper, and safer alternatives. Klinger et al.’s “A narrowing of AI research?” discusses plausible policy responses, and a framework for funders to broaden bets.²

Gradient Dissent suggests creating an “Adversarial AI Lab” that would probe AI systems to find and publicize bad behavior. Its funding should be non-commercial, to prevent its agenda getting captured by the technology companies whose research and products it may discredit.

AI researchers and would-be researchers can choose mechanistic interpretability as their specialty.

Neel Nanda’s “Mechanistic Interpretability Quickstart Guide” suggests easy ways to begin. His “Concrete Steps to Get Started in Transformer Mechanistic Interpretability” explains how to dissect text generators specifically.³

This subfield should be particularly attractive for academics, although so far there’s been little awareness of the opportunity there. (The scant work to date has mostly been done in industry.)

Because mechanistic interpretability has been under-studied, there are probably orchards full of “low-hanging fruit,” meaning impressive results that can be obtained easily. What has been discovered so far is tremendously intellectually exciting for me—more so than anything else in AI research in decades.

The studies may be revealing inherent aspects of the vision and language tasks themselves, rather than properties of “neural” networks. (I discuss this in Gradient Dissent.) The tasks require abstract computations that probably must be performed similarly by people and by any artificial system. If that’s confirmed, it will cast light on human perception, communication, and cognition. It may also make it feasible to engineer mechanisms which perform the same tasks using much less computer power, and with much greater reliability.

AI research has focused instead on how networks “learn,” neglecting questions about what they do once learning is completed, and how. In the past few years, academics investigating “learning” have been increasingly shut out, because exciting new results mostly consume hundreds of thousands of dollars’ worth of supercomputer time. In contrast, cutting-edge mechanistic interpretability research can be done with minimal resources.

Academia values science and principled engineering. “Machine learning” has mostly not been that. Mechanistic interpretability is that. You have a shot at foundational work in a new scientific and engineering discipline, which may well outlast “neural” “learning” methods. Go for it!

The seeming ability of text generators to perform multi-step commonsense reasoning is currently the only plausible stepping stone toward Scary AI. I do find it somewhat worrying. So far, there have been no published investigations of either the mechanism for this ability or its ultimate limits. To the extent that apparent reasoning seems worrying, investigating its mechanism and limits seems urgent.

Verifying that text generators aren’t on the road to superintelligence, by understanding better what they can’t do, should be an immediate priority. The results may be reassuring in showing that there’s nothing mind-like happening, and that the “neural” networks implement a straightforward algorithm, or cheat somehow (as they typically do). Alternatively, if they are doing something worrisome, it would be better to know that, and to try to understand how—sooner rather than later.

Current research incentives in the field will not prioritize that research. AI safety organizations and other funders should. AI safety organizations have prioritized other “alignment” approaches that seem to have reached dead ends. I suggest a pivot to centering mechanistic interpretability research, particularly for text generators.

1. These are rough definitions only, but adequate here.
2. arXiv:2009.10385v4, 2022.
3. These and other resources are at neelnanda.io/mechanistic-interpretability.