How should we evaluate progress in AI?

Wolpertinger image courtesy Rainer Zenz

The evaluation question is inseparable from questions about what sort of thing AI is—and both are inseparable from questions about how best to do it.

Most intellectual disciplines have standard, unquestioned criteria for what counts as progress. Artificial intelligence is an exception. It has always borrowed criteria, approaches, and specific methods from at least six fields:

1. Science
2. Engineering
3. Mathematics
4. Philosophy
5. Design
6. Spectacle

This has always caused trouble. The diverse evaluation criteria are incommensurable. They suggest divergent directions for research. They produce sharp disagreements about what methods to apply, which results are important, and how well the field is progressing.

Can’t AI make up its mind about what it is trying to do? Can’t it just decide to be something respectable—science or engineering—and use a coherent set of evaluation criteria drawn from one of those disciplines?

That doesn’t seem to be possible. AI is unavoidably a wolpertinger, stitched together from bits of other disciplines. It’s rarely possible to evaluate specific AI projects according to the criteria of a single one of them.

This post offers a framework for thinking about what makes the AI wolpertinger fly. The framework is, so to speak, parameterized: it accommodates differing perspectives on the relative value of criteria from the six disciplines, and their role in AI research. How they are best combined is a judgement call, differing according to the observer and the project observed. Nevertheless, one can make cogent arguments in favor of weighting particular criteria more or less heavily.1

Choices about how to evaluate AI lead to choices about what problems to address, what approaches to take, and what methods to apply. I will advocate improving AI practice through greater use of scientific experimentation; pursuit particularly of philosophically interesting questions; better understanding of design practice; and greater care in creating spectacular demos. Follow-on posts will explain these points in more detail.

This framework is meant mainly for AI participants. For others, the pressing question may be “how long until superintelligent AI takes my job / makes us all rich without having to work / hunts down and kills all humans so it can make more paperclips.” I think the rational conclusion of a sophisticated, in-depth analysis, based on a detailed evaluation framework such as the one explored in this post, is: “Who knows?”

Some skepticism about near-term progress follows from considerations I’ll present here, though. AI has neglected scientific theory testing, and much of what the field thinks it knows may be false. And, demonstrations of apparent capabilities are often misleading.

The rest of this post has six sections explaining how progress criteria from the six disciplines work within AI, followed by a concluding section recapitulating how I think they should be weighted.


Science’s progress criteria are:

  • Newly-discovered truths
  • Broader explanations
  • An unusual sense of “interestingness,” related to, but not identical with, ordinary curiosity

Let’s take them in order…

“The greatest defect”

Giant squid attack!

The mainstream AI research program of the ’50s through the ’80s is now called “good old-fashioned AI” (GOFAI), since not many people pursue it anymore. GOFAI was exciting because it gave interesting, plausible explanations for how knowledge, reasoning, perception, and action work. For decades, we failed to put those theories to strenuous tests, and when we did, they turned out to be false. Nearly everything we thought we knew was wrong. The GOFAI research program collapsed around 1990.

A. J. Ayer, a proponent of logical positivism in his youth, was asked after it conclusively failed, “What do you now in retrospect think the main shortcomings of the movement were?” And he answered, “Well, I suppose the greatest defect is that nearly all of it was false!”2

GOFAI had several defects, but… the main thing is, nearly all of it was false. We should have realized this earlier, but we were distracted by fascinating philosophical and psychological questions, and by wow, look at this cool thing we can make it do!

As far as current AI goes, the most important question is: what parts of it are true? It may have other virtues or defects, but until enough science is done to sort out which bits are just factually true, those are secondary.

Science aims to learn how the world works, by experiment when possible, or observation otherwise. In AI, we have the luxury of experiment. Still better: we have the luxury of perfectly repeatable experiments, under perfectly controlled conditions! Almost no other domain is as ideally suited to scientific investigation.

Yet it is uncommon for AI research to include either a hypothesis or an experiment. Papers commonly report work that sort of sounds like an experiment, but those often amount to:

We applied an architecture of class X to a task in class Y and got Z% correct.

There is no specific hypothesis here. Without a hypothesis, you are not doing a scientific experiment; you are just recording a factoid. Individual true facts (“the squid we caught today is Z% bigger than the last one!”) are not science without a testable general theory (“cold water causes abyssal gigantism by way of extended lifespan”).

Explaining AI

Theories are much better if they are explanations, not just a formula for prediction. (Explanation is a criterion of scientific progress, although not an absolute requirement.) A good experiment should eliminate all but one possible explanation for the data, using controls.

Your algorithm got Z% correct: Why? What does that imply for performance on similar problems? AI papers often just speculate. Implicitly, the answer may be “we got Z% correct because architecture class X is awesomely powerful, and it will probably work for you, too!” The paper may state that “Z% is better than a previous paper that used an architecture of class W,” with the implication that X is better than W. But is it—in general?

Current machine learning research, by contrast with GOFAI, does not prioritize explanations. Sometimes it seems the field actively resists them. (I’ll suggest possible reasons below.) As far as scientific criteria go, without rigorous tests of explanatory hypotheses, you are left only with interestingness. Too often, interestingness (“Z% correct is awesome!”) is primary in public presentations of AI.

“This year, we’re getting Z% correct, whereas last year we could only get (Z-ε)%” does sound like progress. But is it meaningful? If the specific problem you are improving against is one people want solutions for, it may be engineering progress—discussed in the next section. It’s not scientific progress unless you understand where the improvement is coming from. Usually you can’t get that without extensive, rigorous experiments. You need to systematically test numerous variants of your program against numerous variants of the task, in order to isolate the factors that lead to success. You also need to test against entirely other architectures, and entirely other tasks.
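As a rough illustration of what that systematic testing might look like, here is a minimal Python sketch of a full factorial grid. Everything in it is hypothetical: the factor names, the task names, and especially `toy_evaluate`, which is a made-up stand-in for actually training and scoring a system. The point is only the shape of the experiment: vary each factor against every other, on more than one task, and then ask which factor the improvement follows.

```python
from itertools import product
from statistics import mean

# Hypothetical factors and tasks, chosen only for illustration.
FACTORS = {"architecture": ["W", "X"], "augmentation": [False, True]}
TASKS = ["benchmark", "variant"]


def run_grid(factors, tasks, evaluate):
    """Score every combination of factor settings on every task."""
    names = list(factors)
    results = []
    for settings in product(*(factors[name] for name in names)):
        config = dict(zip(names, settings))
        for task in tasks:
            results.append((config, task, evaluate(config, task)))
    return results


def main_effect(results, factor):
    """Mean score for each setting of one factor, averaged over all else."""
    by_setting = {}
    for config, _task, score in results:
        by_setting.setdefault(config[factor], []).append(score)
    return {setting: mean(scores) for setting, scores in by_setting.items()}


def toy_evaluate(config, task):
    """Made-up scoring rule standing in for a real train-and-test run."""
    score = 0.5
    if config["augmentation"]:
        score += 0.25  # helps everywhere
    if config["architecture"] == "X" and task == "benchmark":
        score += 0.125  # helps only on the benchmark itself
    return score


if __name__ == "__main__":
    results = run_grid(FACTORS, TASKS, toy_evaluate)
    for factor in FACTORS:
        print(factor, main_effect(results, factor))
```

In this toy setup, averaging over the grid attributes most of the gain to the `augmentation` factor rather than to the architecture, and the architecture's advantage shows up only on the benchmark task. A single headline number cannot show either of those things.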

This is a big job. Many researchers do some experiments of this sort. From individual projects, that may be the most we can reasonably expect, given limited resources. However, to adequately test hypotheses, the field as a whole needs to fill in the missing pieces—and often doesn’t. Its culture of competing against quantitative benchmarks encourages atheoretical tinkering, rather than science.

In many of the most-hyped recent AI “breakthroughs,” the control experiments that seem most obvious and important are missing. (I plan to discuss several of these in follow-up posts.)

Is AI scientifically interesting?

Because AI investigates artificial intelligence, its central questions are not necessarily scientifically interesting. They are interesting for biology only to the extent that AI systems deliberately model natural intelligence; or to the extent that you can argue that there is only one sort of computation that could perform a task, so biology and artificial intelligence necessarily coincide. This may be true of the early stages of visual processing, for example.

AI is mostly not about what nature does compute (science), nor about what we can compute today (engineering), nor about what could in principle be computed with unlimited resources (mathematics). It is about what might be computed by machines we might realistically build in the not-too-distant future. As this essay goes along, I will suggest that AI’s criterion of interestingness is therefore closer to that of philosophy of mind than to those of science, engineering, or mathematics.

Learning from the replicability reform movement

The “first principle” of science, Feynman said in his famous cargo cult address, is that

You must not fool yourself—and you are the easiest person to fool. So you have to be very careful about that. After you’ve not fooled yourself, it’s easy not to fool other scientists.

The current replication crisis shows that many scientific fields have been fooling themselves on a massive scale. Most published research findings are false.

Social psychology is one field confronting this problem. Psychologists are engaged in impressive retrospective analysis, and in prospective reform efforts. Meta-scientists in that field find that false conclusions are most likely when:

  • Researchers pursue dramatic, surprising theories with implications for human nature and everyday life
  • Researchers and the media collaborate to spin exciting interpretive narratives for the public, generalizing well beyond specific findings
  • Researchers feel free to interpret their results after the fact
  • Researchers do not report null results (“failures”)
  • Researchers rarely repeat each other’s work to find problems
  • Researchers do not document their work in enough detail that others could check it
  • Experiments are done on an inadequate scale (in any of several dimensions)
  • Controls are missing or inadequate (in any of several ways)
  • Experiments are not systematically varied to find the limits of the theory
  • Large amounts of money and/or prestige are at stake.

These failures of scientific practice seem as common in AI research now as they were in social psychology a decade ago. From psychology’s experience, we should expect that many supposed AI results are scientifically false.

The problem—in both psychology and AI—is not bad scientists. It is that the communities have had bad epistemic norms: ones that do not reliably lead to new truths. Individual researchers do what they see other, successful researchers doing. We can’t expect them to do otherwise—not without a social reform movement.

The exciting news is that psychologists are taking these problems seriously. They are putting in place new epistemic norms that should help prevent such failures of scientific practice. These reforms should make discoveries of true, explanatory, interesting theories more common.

Can AI learn from psychology’s experience, to improve standards of practice?

I think it can, and should!

That said, AI is a wolpertinger. It’s not just science, and probably can’t just follow the replicability movement’s lead.


Engineering applies well-characterized technical methods to well-characterized practical problems to yield well-characterized practical solutions.

Engineering’s progress criteria are quite different from science’s. If you discover new truths or explanations in the course of engineering, it’s incidental. And engineering isn’t supposed to be “interesting” in the scientific sense; instead, it is exciting when it yields practical value.

Engineering finds solutions within explicit constraints, and optimizes (or satisfices) explicit objectives. Typically there are several, often with explicit numerical trade-offs between them. For instance: cost, safety, durability, reliability, ease of use, and ease of maintenance.

AI researchers often say they are doing engineering. This can sound defensive, when you point out that they aren’t doing science: “Yeah, well, I’m just doing engineering, making this widget work better.” It can also sound derisive, when you suggest that philosophical considerations are relevant: “I’m doing real work, so that airy-fairy stuff is irrelevant. As an engineer, I think metaphysics is b.s.”

Some AI work genuinely is engineering. Here’s the checklist:

“Data science” is, in part, the application of AI (machine learning) methods to messy practical problems. Sometimes that works. I don’t know data science folks well, but my impression is that they find the inexplicability and unreliability of AI methods frustrating. Their perspective is more like that of engineers. And, I hear that they mostly find that well-characterized statistical methods work better in practice than machine learning.

Adjacent to engineering is the development of new technical methods. This is what most AI people most enjoy. It’s particularly satisfying when you can show that your new system architecture does Z% better than the competition. On the benchmark problem everyone is competing over… Does that reliably translate to real-world practice? Most AI researchers don’t want to take the time to find out. I will suggest below that this aspect of AI has more in common with design than engineering.

Engineering is great, when you can do it. Should AI be more like engineering? With much hard work, methods developed in AI research can sometimes be characterized well enough that they get to be routinely used by engineers.

Then everyone stops calling it “AI.” This can be frustrating: every time we do something really great, it’s snatched away, and the field doesn’t get due credit. Unquestionably, AI research has spun off many of the most important advances in software technology. (Did you know that hash tables were long considered an advanced and incomprehensible AI technique?) Economically, AI research has been well worth the money spent on it.

But, the meaning of a word is in its use. “AI” is used to mean “complicated or hypothetical software that might be amazing, but we don’t understand why it works.” That simply isn’t engineering.


Mathematics, like science, aims to discover interesting explanatory truths. What “interesting” and “explanatory” and “true” mean are quite different, and the methods—proof vs. experiment—are quite different.

Throughout its history, AI has shaded into mathematics, with results that contribute to both fields. This has often had powerful synergies.

That said, the evaluation criteria of mathematics—its senses of interesting, explanatory, and true—can be misleading in AI.

Proofs of algorithms’ asymptotic convergence are typical examples. Assuming a proof is technically correct, it is definitely true in the mathematical sense. It may exhibit structure that is mathematically explanatory: you have an “aha! so that’s why!” experience reading it. It is mathematically interesting if, for instance, it significantly generalizes an earlier result.

Most proofs of asymptotic convergence are not true, or explanatory, or interesting for AI, which has different criteria. AI is about physical realizability. That doesn’t have to mean “realizable using current technology,” but it does at least mean “realizable in principle.” A convergence result that shows an algorithm gets the right answer “in the limit” tells us nothing about physical realizability, even in principle. If quick arithmetic shows that the algorithm running on 10^100 GPUs will still be far from the answer after a trillion years, then the proof is not true, explanatory, or interesting as AI. Conversely, unless you can demonstrate that an algorithm will converge reasonably quickly on realistic quantities of hardware, it’s not AI, however interesting it may be as math.
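Here is one way that kind of quick arithmetic might go, as a Python sketch. The numbers beyond those in the text are assumptions made for illustration: each GPU is credited with 10^15 operations per second, and the algorithm is taken to be exhaustive enumeration of an n-bit search space at one candidate per operation. Neither assumption comes from any particular convergence proof.

```python
# Back-of-envelope budget: how much brute-force search fits in the
# scenario from the text (10^100 GPUs running for a trillion years)?

GPUS = 10 ** 100
OPS_PER_GPU_PER_SECOND = 10 ** 15   # assumption: a generous per-GPU rate
SECONDS_PER_YEAR = 31_557_600       # 365.25 days
YEARS = 10 ** 12                    # a trillion years

# Exact integer arithmetic; Python ints do not overflow.
TOTAL_OPS = GPUS * OPS_PER_GPU_PER_SECOND * SECONDS_PER_YEAR * YEARS


def fraction_searched(n_bits):
    """Fraction of an n-bit space covered, at one candidate per operation."""
    return min(1.0, TOTAL_OPS / 2 ** n_bits)


# Largest n for which the whole 2^n space fits within the budget.
LARGEST_COVERABLE_BITS = TOTAL_OPS.bit_length() - 1

if __name__ == "__main__":
    print("total operations available:", f"{TOTAL_OPS:.3e}")
    print("largest fully searchable space (bits):", LARGEST_COVERABLE_BITS)
    print("fraction of a 512-bit space searched:", fraction_searched(512))
```

Under these assumptions the entire budget is exhausted by a search space of a few hundred bits, which is small next to the parameter counts of practical systems. An "in the limit" guarantee for such a procedure says nothing about what any buildable machine would do.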

Mathematics is an invaluable tool. Using it well in AI requires subjecting it to alien evaluation criteria, from beyond math itself.


The infinite regress of the Cartesian Theater
Image courtesy Jennifer Garcia

Analytic philosophy—like science and mathematics—aims for interesting explanatory truths. It has, again, its own ideas of what count as “interesting,” “explanatory,” and “true.”

By and large, analytic philosophers start with “intuitions” they believe to be true, and then try to prove that they are true by way of arguments. I think the truth criterion “convincing arguments for intuitions” has been a bad influence on AI. It conflicts with science’s better criterion “neutral tests of hypotheses.” It has repeatedly led AI into making exaggerated claims based on inadequate evidence. I’ll suggest that analytic philosophy’s dysfunctional relationship with neuroscience has misled AI as well.

On the other hand, analytic philosophy of mind’s criterion for what counts as “interesting” largely coincides with, and formed, that of AI. From its founding, AI has been “applied philosophy” or “experimental philosophy” or “philosophy made material.” The hope is that philosophical intuitions could be demonstrated technically, instead of just argued for, which would be far more convincing. I share that hope.

Two fundamental intuitions most analytic philosophers of mind want to prove are:

  1. Materialism (versus mind/body dualism): mental stuff is really just physical stuff in your brain.
  2. Cognitivism (versus behaviorism): you have beliefs, consider hypotheticals, make plans, and reason from premises to conclusions.

These are apparently contradictory. “Hypotheticals” do not appear to be physical things. It is difficult to see how the belief “Gandalf was a wizard” could both be in your head and about Gandalf, as a physical fact. And so on.

This tension generated the problem space for GOFAI. The intuition of all cognitive scientists (including me! until 1986) was that this conflict must be resolvable; and that its resolution could be proven, beyond all possibility of doubt, via technical implementation.

GOFAI papers largely described an implementation: the structure of a gizmo. (I’ll come back to this, in the section on design.) They usually also described an “experiment,” which rarely had scientific content: it was “we ran the program on three small inputs, and it produced the desired outputs.”

The exciting part of a GOFAI paper was the interpretive arguments. Starting from the structure of the gizmo, we made philosophical claims about the mind. The program, we said, was “learning from experience” or “reasoning about knowledge.” Its algorithm explained how those mental processes worked, at least roughly and for some cases, and probably for humans as well. These claims were often highly exaggerated, and mainly without scientific justification. In fact, the program built a labeled graph structure. We called that “knowledge”—but was it? Were these algorithms “learning” or “reasoning”? Ultimately, there is no fact-of-the-matter about this. But, it at least has to be argued for, and that part of the story was mostly missing. By systematically using the same words for human activities and simple algorithms, we deluded ourselves into confusing the map with the territory, and attributed mental activities to our programs just by fiat.

How did we go so wrong for so long with GOFAI? I think it was by inheriting a pattern of thinking from analytic philosophy: trying to prove metaphysical intuitions with narrative arguments. We knew we were right, and just wanted to prove it. And the way we went about proving it was more by argument than experiment.

Eventually, obstacles to the GOFAI agenda appeared to be matters of principle, not just matters of limited technical or scientific know-how, and it collapsed.

Some of us, at that point, went back and questioned AI’s fundamental philosophical assumption that cognitivism is the only alternative to behaviorism. We started a new line of research, pursuing a third alternative—interactionism—inspired by a different philosophical approach.

I believe AI’s best criterion of “interestingness” is philosophical, so that the proper business of AI research is to investigate philosophical questions. If so, a new philosophical approach was the right move! Several technical breakthroughs were evidence in favor of that. Perhaps we could and should have taken this line of work further.

After GOFAI’s collapse, philosophers gave up on AI. Most remained committed to cognitivism, so they transferred their hopes to neuroscience. Brains are obviously physical, mental, and cognitive, so they are definite proofs that materialism and cognitivism are right. (Right?) Thus the truth is established, and it goes without saying that minds are interesting, so all we need is an explanation. Philosophers encouraged neuroscientists to interpret their results in cognitivist terms. That has, I think, distorted neuroscience in much the same way it has distorted AI.

Thirty years later, we still have no clue what brains do or how.

Neuro expectation: “Learn how we think and what makes us human!”
Neuro reality: “Here are 30 different nuclei involved in eye movements!”
Scott Alexander

Actually: “Here are 30 different nuclei correlated with eye movement.”
Michel Teivel

In the absence of understanding, brains seem like magic. So, rather than trying to understand them scientifically, why not just simulate them, and gain the same powers? And maybe also it will be easier to run experiments on simulated brains than actual ones, and to gain understanding thereby.

From the beginning, AI has pursued this approach in parallel with GOFAI. Most of this research descends from McCulloch and Pitts’s 1943 neuron model, which was biologically reasonable given the state of knowledge at the time. It also, they pointed out, neatly implemented propositional logic, which was still then a candidate for “The Laws of Thought.” Subsequent research in the tradition has added technical features to the McCulloch and Pitts model, motivated by computational considerations rather than biological ones. The most important is the error backpropagation algorithm, the central feature of contemporary “neural networks” and “deep learning.”
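To make the propositional-logic point concrete, here is a small Python sketch of binary threshold units wired into logic gates. It is in the spirit of the McCulloch and Pitts model, not a reproduction of it: the function names are mine, and using a negative weight for inhibition is a simplifying assumption, not the 1943 paper's formulation.

```python
from itertools import product


def threshold_unit(weights, threshold):
    """Build a binary neuron that fires (1) when the weighted sum of its
    0/1 inputs reaches the threshold, and stays silent (0) otherwise."""
    def fire(*inputs):
        total = sum(w * x for w, x in zip(weights, inputs))
        return 1 if total >= threshold else 0
    return fire


# Propositional connectives as single units.
AND = threshold_unit([1, 1], threshold=2)
OR = threshold_unit([1, 1], threshold=1)
NOT = threshold_unit([-1], threshold=0)  # negative weight: a simplification


def implies(p, q):
    """Material implication, built by wiring units together: (not p) or q."""
    return OR(NOT(p), q)


if __name__ == "__main__":
    print("p q | and or implies")
    for p, q in product([0, 1], repeat=2):
        print(p, q, "|", AND(p, q), OR(p, q), implies(p, q))
```

Note how little is here: a sum, a comparison, and fixed weights. Nothing in the sketch learns anything, which is part of why later work added features such as backpropagation.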

Meanwhile, neuroscience developed a much more complex and accurate understanding of biological neurons. These two lines of work have mainly diverged. Consequently, to the best of current scientific knowledge, AI “neural networks” work entirely differently from neural networks. Backpropagation itself does not seem biologically plausible (although, since we mostly don’t know how brains work, it can’t be ruled out).3

Everyone in the field knows this, yet senior researchers still frequently talk as if “neural networks” work much like brains. I’ll suggest why later. But first, the effect of this rhetoric:

What makes your research program promising?
We are aiming for human-like intelligence, and our neural networks work like human brains.
You mostly can’t explain why these systems work. Isn’t that a problem?
We don’t know how brains work, but they do, and the same is true for neural networks.
Shouldn’t you be trying harder to find out how and when and why they work?
No, that’s probably impossible. Brains are holistic; you can’t understand them analytically.
Some people say that they’ve analyzed specific “neural networks” and figured out how they do work. Turns out they do something boring, equivalent to kNN or even just regression.
But, you see, we’ve proven mathematically that neural networks have the flexibility to perform any computation. Like brains.
So can my phone.
Yes, but phones aren’t like brains.

This may be a comic exaggeration. But the sometimes-explicit, sometimes-tacit “works like brains” simultaneously explains why the research program must succeed overall, and waves away technical doubts about details.

This seems parallel to the pattern of error in GOFAI. We knew our “knowledge representations” couldn’t be anything like human knowledge, and chose to ignore the reasons why. Contemporary “neural network” researchers know their algorithms are nothing like neural networks, and choose to ignore the reasons why. GOFAI sometimes made wildly exaggerated claims about human reasoning; current machine learning researchers sometimes make wildly exaggerated claims about human intuition.

Why? Because researchers are trying to prove an a priori philosophical commitment with technical implementations, rather than asking scientific questions. The field measures progress in quantitative performance competitions, rather than in terms of scientific knowledge gained.


I think AI researchers’ intuition is right that implementations—illustrative computer programs—are powerful sources of understanding. But how does that work? It’s tempting to analogize implementations to scientific experiments, but usually they aren’t. It’s tempting to think of them as engineering solutions, but they usually aren’t. I think “implementations” are best understood as design solutions—quite a different thing.

The actual practice of AI research is more like architectural design than like electrical engineering. Viewing AI through this lens helps explain its recurring destructive hype cycle pattern. I’ll explain how better design understanding may help evaluate AI progress more accurately, thereby smoothing the hype cycle.

The design view may also improve AI practice by eliminating a major source of technical difficulty and wasted effort.

The nature of design

Design, like engineering, aims to produce useful artifacts. Unlike engineering, design addresses nebulous (poorly characterized) problems; is not confined to explicit, rational methods; and develops snazzy—not optimal—solutions.

(Nebulosity is a matter of degree, so design and engineering shade into each other. Most designers do some engineering, and most engineers do some design. Temporarily polarizing the two helps explain how AI research is design-like.)

In engineering, you start with a well-specified problem statement. You begin by analyzing it to derive implications and constraints that guide your process. Only once you understand the problem thoroughly do you begin assembling a solution.

Design concentrates on synthesis, more than analysis. Since the problem statement is nebulous, it doesn’t provide helpful guiding implications; but neither does it strongly constrain final solutions. Design, from early in the process, constructs trial solutions from plausible pieces suggested by the concrete problem situation. Analysis is less important, and comes mostly late in the process, to evaluate how good your solution is.

Since design problems are nebulous, there is no such thing as an optimal solution. The evaluation criterion might be called “snazziness” instead. A good design is one people like. It should make you go “whoa, cool!” An outstanding design amazes. Design success means not that you solved a specific problem as given, but that you produced something both nifty and useful in a general vicinity. (The products of design, unlike those of art, have to work as well as wow.)

Design in practice

Architectural model
Image courtesy Museums Victoria

Systematic, explicit, rational methods are secondary in design. Those mostly don’t apply to nebulous problems with nebulous solution criteria. Expert designers say they rely instead on “creativity” and “intuition.” That isn’t helpful; it just means “we don’t know how we do it.” Indeed, design competence is largely tacit, inarticulable, and “know-how” more than “knowing-that.” For that reason, it has to be learned through apprenticeship and experience, rather than in classrooms or through reading.

Nevertheless, empirical studies of design practice give some insight into how it works.4

First, a designer maintains contact with the messy concrete specifics of the problem throughout the process. An engineer, by contrast, operates primarily in a formal domain, abstracted from the mess.

Metaphorically, possible design approaches are suggested by the mess. From these suggestions, the designer builds a series of quick-and-dirty prototype models, and tries them out to see how they work. Architects build models from cardboard; AI researchers build them from code. These prototypes are not engineering models, subjected to serious real-world testing. They’re just “sketches” to give a sense of how something might work.

Donald Schön describes this cycle as a “reflective conversation with the materials.” Having the model provides concreteness, again, that guides the next step. You can “sort of see” how it will or won’t work. You build up an understanding of the problem space by trying out diverse possibilities, and then by iterative improvement of a promising candidate. The understanding gained is explanatory, but as with design knowledge in general, it is partly tacit, inarticulable know-how; a felt sense of how things work.

The design process repeatedly transforms the problem itself, which remains fluid throughout. What you think you are trying to accomplish changes repeatedly. The solution defines the problem as much as vice versa. You want to create something snazzy in the general area; and what “snazzy” means emerges only as a concrete property of the final product.

For engineers, this may seem highly unsatisfactory. Wouldn’t it be better to nail down exactly what the problem is, figure out what would make for a quantitatively good solution, and apply rational methods to get from here to there, instead of “having a conversation with a mess?”

If you can do that—it’s often the best approach. That’s why engineering is valuable. But many real-world situations just don’t resolve neatly into well-defined problems.

AI research as design practice

Chinese-Latin grammar, Fourmont, 1742

As noted above in the section on AI as engineering, AI typically applies ill-characterized methods to nebulous problems with nebulous solution criteria. (Using neural networks to translate Mandarin Chinese to English, for example.) In at least this way, it resembles design practice.

If you can nail down the problem, eliminate nebulosity, and demonstrate correctness, you are doing mainstream computer science, not AI. Which is great! But not always possible. No one can say what the problem of translation is, and there is no such thing as an optimal translation. But, your aim as an AI researcher is to do it well enough to impress people. That would definitely be snazzy!

So, you start hacking. You build a series of quick-and-dirty prototypes, and try them out on some Mandarin texts to see how they work. The different patterns of good and bad translations the programs produce suggest each next implementation. It may be difficult to say exactly what those patterns are, but you gradually build up insight into what works and why. And as you proceed, your understanding of what translation even means changes. This is your “reflective conversation with the concrete materials,” which include both natural language texts and program structure.

So in AI we build implementations to gain an understanding, which we may not be able to fully articulate. The implementation embodies the understanding, and can communicate the understanding. To develop expertise in AI, you can’t just read papers; you have to read other people’s code. And you can’t just read it, you have to re-implement it. Part of your understanding is gained only through the practice of coding itself. You don’t really know what a neural network is until you’ve written a backpropagation engine from scratch yourself, and run it against some classic small data sets, and puzzled over its outputs.
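For readers who want a starting point for that exercise, here is a minimal sketch in plain Python: a tiny two-layer sigmoid network with hand-written backpropagation, plus a finite-difference check of the gradients. The layer sizes, the squared-error loss, the XOR data set, and all the function names are my assumptions for illustration. It is a scaffold for puzzling over, not a recommended implementation.

```python
import math
import random

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
       ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_in, n_hidden, seed=0):
    rng = random.Random(seed)
    return {"W1": [[rng.uniform(-1, 1) for _ in range(n_in)]
                   for _ in range(n_hidden)],
            "b1": [0.0] * n_hidden,
            "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
            "b2": 0.0}


def forward(p, x):
    """Return hidden activations and the scalar output for one input."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hi for w, hi in zip(p["W2"], h)) + p["b2"])
    return h, y


def loss(p, data):
    """Mean of half the squared error over the data set."""
    return sum(0.5 * (forward(p, x)[1] - t) ** 2 for x, t in data) / len(data)


def gradients(p, data):
    """Backpropagation: push the output error back through each layer."""
    n_hidden, n_in, n = len(p["W1"]), len(p["W1"][0]), len(data)
    g = {"W1": [[0.0] * n_in for _ in range(n_hidden)],
         "b1": [0.0] * n_hidden, "W2": [0.0] * n_hidden, "b2": 0.0}
    for x, t in data:
        h, y = forward(p, x)
        delta_out = (y - t) * y * (1 - y)      # error at output pre-activation
        g["b2"] += delta_out / n
        for j in range(n_hidden):
            g["W2"][j] += delta_out * h[j] / n
            delta_h = delta_out * p["W2"][j] * h[j] * (1 - h[j])
            g["b1"][j] += delta_h / n
            for i in range(n_in):
                g["W1"][j][i] += delta_h * x[i] / n
    return g


def step(p, g, lr):
    """One gradient-descent update, in place."""
    for j in range(len(p["W1"])):
        for i in range(len(p["W1"][j])):
            p["W1"][j][i] -= lr * g["W1"][j][i]
        p["b1"][j] -= lr * g["b1"][j]
        p["W2"][j] -= lr * g["W2"][j]
    p["b2"] -= lr * g["b2"]


def numeric_grad(p, data, weights, k, eps=1e-6):
    """Central finite difference w.r.t. weights[k], a list inside p."""
    original = weights[k]
    weights[k] = original + eps
    up = loss(p, data)
    weights[k] = original - eps
    down = loss(p, data)
    weights[k] = original
    return (up - down) / (2 * eps)


if __name__ == "__main__":
    params = init_params(2, 2, seed=1)
    grads = gradients(params, XOR)
    print("analytic:", grads["W1"][0][0],
          "numeric:", numeric_grad(params, XOR, params["W1"][0], 0))
    for _ in range(5000):
        step(params, gradients(params, XOR), lr=0.5)
    for x, t in XOR:
        print(x, t, round(forward(params, x)[1], 3))
```

Whether this particular run ends up solving XOR depends on the seed, the learning rate, and the hidden layer size. Finding out why some settings work and others stall is exactly the sort of understanding the exercise is meant to produce.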

A skills mismatch

AI researchers are mostly educated in fields that take formal problems as inputs: engineering, mathematics, or theoretical physics. Yet the problems we tackle are mostly ones in which a design approach, maintaining a continuous, open-ended relationship with nebulosity, may be more appropriate.

You can’t learn how to relate to a mess in a classroom, by reading, or from Coursera. It is possible to learn from hard experience. It is better learned by apprenticeship. I gather that industry currently understands that there is something critical that PhDs from the best academic AI labs learn by apprenticeship, which can’t be learned any other way. I suspect it’s this.

Having been taught mainly skills for solving formal problems, AI folks tend to jump away from nebulosity as quickly as possible. Rather than slogging through the swampy real world, allowing informative patterns to gradually emerge, it’s more comfortable to escape into analyzing the nearest available abstraction.

So, premature problem formalization is a characteristic failure mode in AI. A nebulous real-world phenomenon (learning, for instance) gets replaced by some bit of mathematics (function approximation, for instance). The real-world word (“learning”) gets applied to both, interchangeably, so that researchers don’t even notice the difference. Then you can have all kinds of fun inventing and improving snazzy gizmos that address this precise but inaccurate problem statement. That may lead to valuable technical progress. Function approximation is a thing, and better methods have extensive engineering applications.

On the other hand, function approximation is not actually learning. Premature formalization means that solutions to the abstract mathematical problem may not be solutions to the concrete real-world problem, and vice versa.

This leads to two characteristic patterns of trouble. First, the abstract problem may be harder than the concrete one, because it elides key helpful features. In design theory terms, you are failing to listen to suggestions murmured by the mess. For example, the GOFAI plan-based formalization of practical action made the problem much more difficult than it needed to be, because it threw away on-going perceptual access to relevant information. Phil Agre and I wrote programs that went far beyond what the planning approach was capable of, by transforming the statement of the problem.

Alternatively, the abstract problem may be easier than the concrete one. This can lead to overconfidence and hype. In evaluating AI, one needs to be skeptical of researchers’ claims that they are making rapid progress on problem “X.” Are they actually working on the real-world task X? Or are they solving a formal problem they have abstracted from X, and applying the same name to it? For example, are they making progress on learning to translate Mandarin to English (a real-world problem), using neural networks? Or are they making progress on a formal problem which might better be described as “storing n-gram pairs in a lookup table,” using gradient descent on a continuous function? (A sadly expensive and unreliable way of implementing a lookup table.)

When the difference between the two manifests as poor performance in the real world, this leads to disillusionment and loss of funding.
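To make the “lookup table” formalization concrete, here is a deliberately crude sketch of a translator that only stores n-gram pairs. It is a caricature for illustration, not a description of how any real system works: the phrase table, its entries, and the function name are all hypothetical.

```python
# Hypothetical toy phrase table: source n-grams (tuples of tokens) -> English.
PHRASE_TABLE = {
    ("你好",): "hello",
    ("世界",): "world",
    ("谢谢",): "thank you",
    ("你好", "世界"): "hello, world",
}


def lookup_translate(tokens, table=PHRASE_TABLE, max_n=2):
    """Greedy longest-match lookup; anything not in the table becomes '<unk>'."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in table:
                out.append(table[key])
                i += n
                break
        else:
            # No stored n-gram starts here: the "translator" has nothing to say.
            out.append("<unk>")
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(lookup_translate(["你好", "世界"]))  # a stored pair
    print(lookup_translate(["谢谢", "你好"]))  # word-by-word fallback
    print(lookup_translate(["再见"]))          # not in the table
```

On inputs it has stored, this looks like translation. On anything else it has nothing to offer, which is the gap between the formal problem and the real-world task.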


“Juicy Salif” lemon squeezer designed by Philippe Starck
Image courtesy Niklas Morberg

I will suggest two antidotes. The first is the design practice of maintaining continuous contact with the concrete, nebulous real-world problem. Retreating into abstract problem-solving is tidier but usually doesn’t work well. IOU: my planned next post makes more detailed recommendations for better AI practice through insights from design practice.

Second: wolpertinger to the rescue! AI is not just design; it also draws from engineering, math, science, and philosophy.


Spectacle is an essential component of any professional practice, including science, engineering, mathematics, philosophy, and design.

It is natural and legitimate to want to amaze people. You are excited about your research program, and you want to share that. Your research is probably also driven by particular beliefs, and it’s natural to want to persuade people of them. A spectacular demonstration can change beliefs, and whole ways of thinking, in minutes—far faster than any technical exposition or logical argument.

Plus, there’s always competition for resources—money, attention, smart people. It’s legitimate to make the best honest case for your work, and that of others in your subfield who share your beliefs. A spectacular demonstration is more effective than any whitepaper or funding proposal.

Success criteria for spectacle include drama, narrative, excitement, and (most importantly) incentive to action. The entertainment industry is the natural home of spectacle. In that industry (including its subsectors such as politics, the news, and professional wrestling) truth is not a consideration.

In disciplines concerned with truth, which should include AI, one must design demonstrations with a special type of conscientiousness. Because spectacle is so powerful, it’s morally imperative to go beyond mere factual honesty, lest you fool both yourself and others. You must take great care that a spectacle does not tacitly suggest greater certainty, understanding, or interestingness than your research justifies.

In AI spectacles, the great danger is giving the impression that a program can do more than it does in reality; or that what it does is more interesting than it really is; or that the explanation of how it works is more exciting than reality. If an audience learns a true fact, that the program does X in a particular, dramatic case, it’s natural to assume it can do X in most seemingly-similar cases. But that may not be true.

Imagine watching a TV advertisement for “a fully automatic dishwasher!” in the 1950s, before you knew what one was. It shows Mom grimacing at a disorderly pile of dirty dishes in the sink. Clock-wipe video transition to: Mom smiling at neatly stacked, glistening dishes on the counter!

You might reasonably assume that a “dishwasher” was a robot with two arms that stood by the sink and washed dishes by hand. It’s spectacular what technology can do now in the 1950s! Why, if a robot can wash the dishes, it can surely also vacuum the floor, change the baby, and make the bed. That would be a reasonable conclusion—if a dishwasher worked that way.

A dishwasher has superhuman performance; mine gets glasses shinier than I can, for far less effort. The advertisement is not lying about that.

But the clock-wipe concealed essential facts about how the dishwasher worked. It’s just a box that sprays hot water inside, not a robot. Its performance does not extend to household tasks of apparently similar difficulty, because they are not relevantly similar, in the way that is obvious when you know how it works.

A dishwasher also does not do the part of the task that would be most difficult for a robot: picking up irregularly-placed dishes smeared in greasy sauce. Fortunately, that is easy for people: loading the dishwasher is quick for us, relative to washing. Also not obvious from the advertisement: the dishwasher doesn’t quite do the whole job. You have to wash large pots and delicate glasses by hand.

Spectacular AI demos are often misleading in analogous ways. They rarely, if ever, convey an accurate understanding of how the program works. To be fair, it’s nearly impossible to do that in a demo, and it’s not the function of demos. But if they tacitly convey a wrong understanding, rather than just prompting curiosity, the audience gains a mistaken expectation for what else the program can do. Such misunderstandings are particularly likely if the demo glosses over parts of the task that the audience would reasonably assume the program does, but which are omitted because—like picking up greasy plates—they are particularly difficult for a computer. In current work, this might include feature engineering, for instance.

There is, almost always, much less to spectacular AI “successes” than meets the eye. But this deception, even though it is usually unintended, takes in researchers as well as outsiders. (“You are the easiest person to fool.”) This dynamic contributes to AI’s perennial hype cycle—exaggerated expectations that can’t be met, followed by disillusionment and funding “winters.”

A dialog produced in 1970 by Terry Winograd’s SHRDLU “natural language understanding” system was perhaps the most spectacular AI demo of all time. (You can read the whole dialog on his web site, download the code, or watch the demo on YouTube.)

The sophistication of the program’s apparent language understanding is extraordinary. It bests current systems, such as Siri, Alexa, and Google Assistant, on which (it is said) billions of dollars of AI research have been spent half a century later. SHRDLU provided a warm glow of confidence that AI was achievable, and that GOFAI was progressing, for the next fifteen years.

There was nothing dishonest in Winograd’s work; no deliberate deception. However, by 1986, he came to believe he had fooled himself, and the field as a whole. In Understanding Computers and Cognition, he argued that SHRDLU’s understanding was merely apparent. Winograd gave strong reasons to believe that computers can’t understand natural language at all, even in principle. At least not using GOFAI methods: it’s the how that matters.

Analogously, I believe there is significantly less to current spectacular demos of “deep learning” than meets the eye. This is not mainly general cynicism about spectacles, nor skepticism about AI demos in general, nor dislike of deep learning in particular. (Although the deep learning field’s relative lack of interest in explanation does make it easier for researchers to fool themselves.) Primarily, it’s based on my guesses about specifically how these systems accomplish the tasks they are shown performing in the demos; and from that, how likely they are to accomplish tasks that may appear similar but aren’t. (I hope to analyze some examples in a follow-on post.)

Dishwashers weren’t on the path to general-purpose household robots. I don’t think current machine learning research will be either. Still, the technologies used in dishwashers have led to a continuing stream of labor-saving appliances. (I love my Instant Pot!) The technologies used in current AI demos may lead to a continuing stream of mental-effort-saving software.

Soaring Wolpertinger: Better AI through meta-rationality

Meta-rationality means figuring out how to use technical rationality in specific situations. (I am writing a book about this.)

Artificial intelligence requires meta-rationality for two reasons. First, the problems it addresses are inherently nebulous. Rational methods, unaided, are not usually adequate in nebulous messes; without a specific problem statement, they can’t even get started.

Second, AI is a wolpertinger: not a coherent, unified technical discipline, but a peculiar hybrid of fields with diverse ways of seeing, diverse criteria for progress, and diverse rational and non-rational methods. Characteristically, meta-rationality evaluates, selects, combines, modifies, discovers, creates, and monitors multiple frameworks.

So, necessarily, does AI. It unavoidably combines disparate perspectives and ways of thinking. You need meta-rational skill to figure out which of these frameworks to apply, and how.

AI also unavoidably involves multiple, incommensurable progress criteria. I began this post by asking “how should we evaluate progress in AI?” The answer was “lots of ways!”

And so we should try to do better along lots of axes. In this post, I have particularly advocated increased consideration of criteria and methods:

  • Of truth, from science
  • Of understanding, from design
  • Of interestingness, from philosophy

We can, and should, disagree about how heavily to weight these and other considerations. A healthy intellectual field engages in continuous, contentious, collaborative reflection upon its own structure, norms, assumptions, and commitments. This was the point of my “Upgrade your cargo cult for the win,” especially in its conclusion.

It’s also the central theme of my sometime-collaborator Philip Agre’s Computation and Human Experience, which discusses in greater depth most of the ideas I’ve presented in this essay.6


A week after I posted this, Zachary Lipton and Jacob Steinhardt posted “Troubling Trends in Machine Learning Scholarship,” which makes quite similar arguments, but with many detailed examples from recent work. I recommend it as an excellent, up-to-the-minute analysis by experts in the current state of the field.

  1. This post echoes sections 9.1-9.2 of my PhD thesis, which proposed much the same framework. My main change of opinion since then is to put more weight on scientific truth criteria. I am now also skeptical of my claim there that AI is “about approaches,” as a legitimate autonomous source of value.
  2. It’s not coincidental that GOFAI largely recapitulated logical positivism. We were blithely ignorant of reinventing its pentagonal wheels, and of the reasons those don’t work. The Ayer interview video is entertaining and informative; thanks to Lucy Keer for pointing me to it.
  3. Biological considerations do continue to inspire some AI research. However, although much more detailed simulations of biological neurons are available, they are rarely used in AI. That’s probably in part because the best simulations are known not to incorporate much that’s known about neurons, and are known not to give quantitatively accurate results. It’s also because it’s not known how to combine multiple simulated neurons to perform computations that are interesting as AI.
  4. See, for example, Donald Schön’s The Reflective Practitioner: How professionals think in action, and Nigel Cross’ Designerly Ways of Knowing.
  5. This actually happened to Philippe Starck. His process in this invention is analyzed in detail in Nigel Cross’ Design Thinking: Understanding How Designers Think and Work, working from the napkin on which Starck sketched successive design attempts. The final product is considered an icon of industrial design and has been displayed in New York’s Museum of Modern Art. There is no accounting for taste.
  6. It was re-reading Phil’s book, as background for working on In the Cells of the Eggplant, that inspired me to write this post.