Few AI experiments constitute meaningful tests of hypotheses. As a branch of machine learning research, AI science has concentrated on black box investigation of training time phenomena. The best of this work has been scientifically excellent. However, the hypotheses tested are mainly irrelevant to user and societal concerns.
A nice example, with important practical applications, is the discovery of scaling laws for training text generators. These provide formulae relating training cost, network size, the quantity of training data, and predictive accuracy. They answer the question “if we plan to spend a specific number of millions of dollars on training, what are the best network size and training dataset size, and what predictive accuracy can we expect?” These laws have been replicated by multiple labs, and have significantly increased the efficiency of their training runs.
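To make the shape of these laws concrete, here is a minimal sketch using the Chinchilla-style power-law form (predicted loss falls off as a power of both parameter count and token count). The constants are illustrative placeholders rather than any lab’s published fit, and the “FLOPs ≈ 6 × parameters × tokens” approximation is a common rule of thumb, not something derived here.

```python
# Sketch of a Chinchilla-style scaling law. The functional form is the
# standard power-law shape; the constants are illustrative placeholders,
# not fitted values from any published study.
import numpy as np

E, A, B, ALPHA, BETA = 1.7, 400.0, 400.0, 0.34, 0.28

def predicted_loss(n_params, n_tokens):
    """Predicted next-token loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def best_allocation(compute_flops):
    """Given a fixed compute budget (FLOPs ~ 6 * params * tokens), sweep
    model sizes and return the size/data split minimizing predicted loss."""
    n_grid = np.logspace(7, 13, 1000)        # candidate parameter counts
    d_grid = compute_flops / (6.0 * n_grid)  # tokens affordable at each size
    losses = predicted_loss(n_grid, d_grid)
    i = int(losses.argmin())
    return n_grid[i], d_grid[i], losses[i]

if __name__ == "__main__":
    n, d, loss = best_allocation(1e23)  # on the order of a multimillion-dollar run
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}, predicted loss ~ {loss:.3f}")
```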
I expect that these simple arithmetical laws can hold because bulk text is uniform in its statistical properties. That is, if you pull ten gigabytes of text from the web at random, it’s going to look very similar to the next ten—even though particular text genres may have very different statistical properties. Training on academic chemistry journal articles won’t let you predict the next word of social media chat, or vice versa, but if you slurp enough text at random, you’ll get plenty of both.
What the scaling laws don’t tell you is what a trained network can actually do—“capabilities” in the field’s jargon. “Predictive accuracy” here is just the probability of predicting the next word of a previously unseen text. It doesn’t tell you the probability that a chatbot’s answers will be accurate or plausible but false, relevant or nonsensical, helpful or offensive. Empirically, predictive accuracy relates to task-relevant capabilities only loosely. Qualitatively new capabilities often emerge rather suddenly during training, even as predictive accuracy improves smoothly.
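To be concrete about what is (and isn’t) measured: “predictive accuracy” here amounts to the average log-loss of the next token on held-out text, which can be computed without reference to any task. A minimal sketch, where `token_probs` is a hypothetical stand-in for a real model’s next-token distribution:

```python
# What "predictive accuracy" measures: average negative log-probability of
# the next token on held-out text. `token_probs` is a hypothetical stand-in
# for a trained model.
import math

def next_token_loss(token_probs, held_out_tokens):
    """token_probs(context) -> dict of {token: probability}."""
    assert len(held_out_tokens) >= 2
    total = 0.0
    for i in range(1, len(held_out_tokens)):
        context, target = held_out_tokens[:i], held_out_tokens[i]
        p = token_probs(context).get(target, 1e-12)
        total -= math.log(p)
    # Lower is "better" -- but nothing here says whether the model's answers
    # will be accurate, relevant, or safe on any actual task.
    return total / (len(held_out_tokens) - 1)
```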
Scientific and mathematical analysis of optimization algorithms ignores the specific task the network is optimized for, and ignores the operation of the resulting network at run time. Those are what matter for practical applications and for safety. Real-world tasks are usually not amenable to mathematical analysis, due to their open-ended complexity. Overemphasis on mathematics has reinforced the field’s “black box” approach.
AI research mainly aims at creating impressive demos, such as game-playing programs and chatbots, rather than scientific understanding. As I explained in “How should we evaluate progress in AI?”,1 such demos often do not show what they seem to—whether through carelessness in exposition, honest researcher confusion, or deliberate deception.
OpenAI’s ChatGPT was explicitly released initially as a demo (“a low-key research preview”). The public was so excited by it—unexpectedly for OpenAI—that they turned it into a product. It’s still largely unclear what that product is good for, much less how it works.
Benchmark performance rarely reflects reality
Competitions to get the best score on “benchmarks” drive the creation of advanced AI systems. This motivates ad hoc kludgery rather than analysis and principled design.
Capabilities are the practical aim of AI research. Text generator capabilities are typically assessed using benchmarks—collections of quiz questions. A random subset of the problems is used as training data; then the network is tested against the remainder. For many benchmarks, it’s considered exciting to get the error rate down below 50%.
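A minimal sketch of that routine, with `train_model` and the returned model’s `.answer()` method as hypothetical stand-ins for a real training procedure and system:

```python
# Sketch of the benchmark procedure described above: shuffle the quiz
# problems, fit on one part, report the error rate on the rest.
# `train_model` and the model's `.answer()` method are hypothetical.
import random

def run_benchmark(problems, train_model, train_frac=0.8, seed=0):
    """problems: list of (question, answer) pairs."""
    items = problems[:]
    random.Random(seed).shuffle(items)
    cut = int(train_frac * len(items))
    train_set, test_set = items[:cut], items[cut:]

    model = train_model(train_set)

    errors = sum(model.answer(q) != a for q, a in test_set)
    return errors / len(test_set)  # "exciting" for many benchmarks if below 0.5
```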
Research often proceeds by comparative benchmarking: does algorithm A or B do better on the benchmark? This looks sort of like science, but it is not: the benchmarks are not meaningful measures for evaluation, and determining that algorithm A does 3% better than algorithm B on the benchmark tells us nothing about how or why.
Benchmarking incentivizes rapid, blind construction of systems by increased brute force, and by random tweaking to see what marginally decreases wrong outputs. Minimal effort goes into understanding what the networks do and how, and why they make the errors they do.
It’s considered perfectly normal and acceptable to build and deploy systems that you don’t understand and which give bad outputs much of the time. “We can keep decreasing the error rate” isn’t an adequate justification if errors are costly, harmful, or potentially disastrous.
Benchmarks are also not objectively correct measures of performance. Most are rather haphazard collections of quiz problems assembled unsystematically and with minimal thought.
- Often the people employed to construct benchmark problems have no relevant expertise. Many benchmarks are crowd-sourced on platforms such as Mechanical Turk, and are full of both errors and spurious correlations.
- Many other benchmarks are constructed mechanically, by inserting random values into fill-in-the-blank templates. It is not surprising that a statistical method can, in effect, recover simple templates from the pattern of examples without “understanding” anything, as the sketch below illustrates.
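A toy illustration of that second point, with an entirely invented template: a rule that has learned nothing about the subject matter, only the template’s surface pattern, scores perfectly.

```python
# Toy illustration of a template-built benchmark. The template is invented
# for this sketch; real mechanically-constructed benchmarks are larger but
# share the weakness shown here.
import random

NOUNS = ["the oak", "the fern", "the pine", "the moss", "the reed"]

def make_item(rng):
    a, b = rng.sample(NOUNS, 2)
    question = f"{a.capitalize()} is taller than {b}. Which is taller?"
    return question, a

rng = random.Random(0)
items = [make_item(rng) for _ in range(1000)]

# A "solver" that recovers the template's surface pattern -- the answer is
# always the first noun phrase -- without representing height at all.
def first_mention(question):
    return question.split(" is taller")[0].lower()

accuracy = sum(first_mention(q) == answer for q, answer in items) / len(items)
print(accuracy)  # 1.0
```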
It’s common for systems that get high scores on benchmarks to do dramatically worse when applied to seemingly similar tasks in practice.2 One common reason is that the benchmark examples are dissimilar to real-world ones in unexpected ways.
Another is that half-baked benchmark construction often creates easily-exploited spurious correlations in the training data. This bedevils “reasoning” benchmarks in particular.3
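A toy sketch of that failure mode, loosely modeled on the annotation artifacts reported for crowd-sourced inference benchmarks (see footnote 3): a rule that never reads the premise, and only checks the hypothesis for a negation word, beats chance comfortably. The examples below are invented for illustration.

```python
# Toy illustration of a spurious correlation: in some crowd-sourced
# "inference" benchmarks, negation words in the hypothesis correlate with
# the "contradiction" label. The five examples below are invented.
EXAMPLES = [
    ("A man plays guitar.",    "Nobody is playing music.", "contradiction"),
    ("A dog runs outside.",    "An animal is outdoors.",   "entailment"),
    ("A child eats lunch.",    "The child is not eating.", "contradiction"),
    ("Two women are talking.", "People are conversing.",   "entailment"),
    ("A boy rides a bike.",    "The boy is walking.",      "contradiction"),
]

NEGATIONS = {"not", "no", "nobody", "never", "nothing"}

def cue_only(premise, hypothesis):
    """Ignores the premise entirely; guesses from a single surface cue."""
    words = hypothesis.lower().rstrip(".").split()
    return "contradiction" if NEGATIONS & set(words) else "entailment"

hits = sum(cue_only(p, h) == label for p, h, label in EXAMPLES)
print(hits / len(EXAMPLES))  # 0.8 -- well above chance, with no "reasoning" at all
```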
Adversarial research versus confounding spectacles
Forceful debunking may be necessary to persuade AI researchers to do the necessary science instead of aiming for spectacular but misleading demos. That will require new incentives and institutions for adversarial research.
Trying to show you can get a system to do something cool—especially when it’s anything cool, you don’t much care what—is neither science nor engineering. That’s the dominant mode in current AI research practice, though. It drives competition on benchmarks, and the publicizing of individual exciting outputs—texts and images—without bothering to investigate what produced them.
It is natural and legitimate to want to amaze people. You are excited about your research program, and you want to share that. Your research is probably also driven by particular beliefs, and it’s natural to want to persuade people of them. A spectacular demonstration can change beliefs, and whole ways of thinking, in minutes—far faster than any technical exposition or logical argument.
The danger is in giving the impression that a program can do more than it does in reality; or that what it does is more interesting than it really is; or that it works according to your speculative explanation, which is exciting but wrong. The claims made in even the best current AI research are often misleading. Results (even when accurately reported) typically do not show what they seem to show. If the public learns a true fact, that the program does X in a particular, dramatic case, it’s natural to assume it can do X in most seemingly-similar cases. But that may not be true. There is, almost always, less to spectacular AI “successes” than meets the eye. This deception, usually unintended, takes in researchers as well as outsiders, and contributes to AI’s perennial hype cycle.
Zachary C. Lipton and Jacob Steinhardt attribute this to:
“(i) failure to distinguish between explanation and speculation; (ii) failure to identify the sources of empirical gains, e.g., emphasizing unnecessary modifications to neural architectures when gains actually stem from hyper-parameter tuning; (iii) mathiness: the use of mathematics that obfuscates or impresses rather than clarifies…; and (iv) misuse of language, e.g., by choosing terms of art with colloquial connotations or by overloading established technical terms.”4
The first two of these could be summarized as failure to perform control experiments: that is, to consider and test alternative possible explanations. A common bad practice, for instance, is to assume that a network has “learned” what you wanted it to if it gets a “good” score on a supposedly relevant benchmark. An alternative explanation, that it has gamed the benchmark by finding some statistical regularity other than the one you wanted, is more often true. That gets tested too rarely.
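A minimal sketch of the kind of control experiment meant here, for a benchmark of premise/hypothesis pairs like the one sketched earlier: re-run the evaluation with the information the network supposedly relies on removed. If the score barely drops, the alternative explanation wins. `model` and `evaluate` are hypothetical stand-ins for a real system and its benchmark harness.

```python
# Sketch of a missing control experiment: ablate the input the model is
# supposedly using and compare scores. `model` and `evaluate` are
# hypothetical stand-ins.
def ablation_control(model, evaluate, test_set):
    """test_set: list of (premise, hypothesis, label) triples."""
    full_score = evaluate(model, test_set)
    ablated_set = [("", hypothesis, label) for _, hypothesis, label in test_set]
    ablated_score = evaluate(model, ablated_set)
    # If these are close, the network has found a shortcut in the hypotheses
    # rather than "learning" the relation the benchmark claims to test.
    return full_score, ablated_score
```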
Figuring out how and where and why backprop systems work will require shifting research incentives from spectacle to understanding. A conventional approach would try to shift academic peer review toward more careful application of proper scientific norms. This wouldn’t work, both because the most-hyped research is done in industrial labs and because the field’s de facto standards for peer review are so abysmal that reform, even if possible, would take much too long.
I suggest incentivizing adversarial AI research that would perform missing control experiments, with the expectation that they will show that many hyped results do not mean what they seem to.5 It would address questions like:
- Did it work for the reason the authors thought it did?
- If not, why did it work?
- Under what circumstances will it work, or not?
- What does it do when it doesn’t work?
- What other methods might have worked better?
- Did they report all the things they tried that didn’t work? No, they did not. For example, they ran thousands of hyperparameter combinations; how and why did most fail? We could probably learn from that, but such analysis is rarely attempted.
Fame and funding should flow to researchers who answer these questions. This demands unusual funders, and unusual researchers.
Funders want to see “progress.” This research program aims at negative progress in spectacle generation. Ideally, its results should be disappointing to the public. It should advance scientific understanding instead. That is more valuable in the long run, but it may take unusual fortitude for funders to accept the tradeoff.
Most researchers don’t do control experiments, or enough of them, unless they are forced to. Control experiments are boring, a huge amount of work, and are likely to show that your result is less interesting than you hoped. There are many possible explanations for any result; ruling out all of them except the exciting one takes much more work than futzing around with a network to get it to produce some cool outputs. Control experiments seem like janitorial scutwork.
What could motivate researchers to do other people’s janitorial scutwork, to clean up the mess? Some researchers actually care about the truth, and are naturally attracted to this sort of work. Watching hype overwhelm concern for truth also gets some of us angry, in a useful way. It’s an unattractive but important fact that annoyance with other researchers’ carelessness drives much of the best scientific work.
Competitiveness also motivates benchmark-chasing (“we beat the team from the Other Lab!”). I expect it’s feasible to shift some of that emotional energy to “we showed that the Other Lab were fooling themselves and they were wrong, hah hah!”
This debunking must concentrate on the best work; 90% of research in all fields is of course crud, and not worth refuting. The aim should be to demonstrate ways even the best is misleading (it over-claims, overgeneralizes, fudges, cherry-picks, etc.). Some AI critics do this now, which is valuable, but their analyses are relatively shallow and abstract (“the system doesn’t really understand language, because look, it gets these five examples wrong”). That’s not their fault; detailed control experiments take a lot of work and computational resources.
The psychology replication movement would be a valuable model. Many of that field’s long-held beliefs were due to researchers fooling themselves with inadequate controls, non-reporting of negative results, overgeneralizing narrow findings, refusing to make code and data available to other researchers to analyze, and other such epistemological failures.6 This got some young researchers angry enough to force major incentive changes. Those make newer work far more credible.
Researchers chase funding and prestige. Explicit calls for proposals for adversarial work could motivate with funding. Prizes could motivate with prestige.
Academia is a more natural environment for debunking than industry. However, given the magnitude of the task, academic AI labs may not be a good structure for the scale required. An independent “adversarial AI lab” might concentrate the necessary resources. A Focused Research Organization might provide the right structure for the task.7
- 1. At metarationality.com/artificial-intelligence-progress.
- 2. This is notably true of text generators; see Kiela et al., “Dynabench: Rethinking Benchmarking in NLP” (Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4110–4124, June 6–11, 2021) for examples, explanations, and proposed remedies.
- 3. There have been many studies demonstrating this; for example, Gururangan et al., “Annotation Artifacts in Natural Language Inference Data” (arXiv:1803.02324v2, 6 Apr 2018); Timothy Niven and Hung-Yu Kao, “Probing Neural Network Comprehension of Natural Language Arguments” (ACL Anthology P19-1459, 2019); Patel et al., “Are NLP Models really able to Solve Simple Math Word Problems?” (arXiv:2103.07191v2, 2021); McCoy et al., “Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference” (Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3428–3448, 2019); Dasgupta et al., “Language models show human-like content effects on reasoning” (arXiv:2207.07051v3, 2023). I don’t know of any studies in which a text generator was found to “reason” reliably on a test that definitively eliminated the ability to cheat.
- 4. Zachary C. Lipton and Jacob Steinhardt, “Troubling Trends in Machine Learning Scholarship” (arXiv:1807.03341, 2018). Also see Hullman et al., “The worst of both worlds: A comparative analysis of errors in learning from data in psychology and machine learning” (arXiv:2203.06498, 2022) for a similar, detailed analysis.
- 5. I did several such experiments around 1990. Mostly I didn’t bother to publish them, because I guessed (correctly) that backprop would fizzle when other people realized the then-hyped results were mostly researchers fooling themselves. I did write up one in “Input Generalization in Delayed Reinforcement Learning: An Algorithm and Performance Comparisons” (with Leslie Kaelbling; Proceedings of the International Joint Conference on Artificial Intelligence, 1991). We found that a success story for reinforcement learning with backprop worked slightly better when we replaced the backprop network with a linear perceptron, and we could explain why the perceptron worked in task-relevant terms.
- 6. See “How should we evaluate progress in AI?” and “Troubling Trends in Machine Learning Scholarship” for discussion of many other questionable research practices and specific epistemological errors.
- 7. A Focused Research Organization is a special-purpose institution created to solve a single scientific or technological challenge that is too large for an academic lab, too small for a national Big Science project, and that lacks the short-term path to profit required of private startups. See astera.org/fros/.