Backpropaganda: anti-rational neuro-mythology

Current AI results from experimental variation of mechanisms, unguided by theoretical principles. That has produced systems that can do amazing things. On the other hand, they are extremely error-prone and therefore unsafe. Backpropaganda, a collection of misleading ways of talking about “neural networks,” justifies continuing in this misguided direction.

Two dangerous falsehoods afflict decisions about artificial intelligence: that trained “neural networks” are inherently incomprehensible, so there is no point trying to understand them; and that backprop is the inevitable, universal method for artificial intelligence.

These myths contribute to the unreliability of AI systems, which both technical workers and powerful decision-makers shrug off as unavoidable and therefore acceptable.

A natural question after learning that current AI practice is neither science nor engineering—as we’ll see soon—is “Why not? Why neglect technical investigation in favor of making spectacular but misleading demos?”

Part of the answer is “that’s what you get paid for.” As I suggested in “Create a negative public image for AI,” a PR strategy has motivated spending tens of billions of dollars to wow the public, not to advance understanding. Figuring out what’s going on inside backprop networks is hard work that mostly no one wants to pay for—especially not if it is likely to reveal unexpected risks and limitations.

Another part of the answer is that backpropaganda mythology says trained networks are inherently incomprehensible. Therefore, there is no point trying to understand them with science or with engineering analysis.

The inevitability of neural networks is the second myth. A natural question in response to “backprop is an engineering fiasco” is “so why does anyone use it, then?” The obvious answer is “because it can do cool stuff nothing else can.” That is true today, but it’s not the whole story.

Another piece is that people figure that since brains produce natural intelligence, artificial brains are the way to produce artificial intelligence.

Everyone working in the field knows “neural networks” are almost perfectly dissimilar to biological ones, but the language persists. “Yes, of course, everyone knows that, so it’s harmless.” No, it’s not. And it’s not just that it reliably confuses people outside the field, to the benefit of insiders.

It also confuses technical people, even when they know why it’s misleading. It produces a pervasive, tacit sense of the inevitability of backpropagation as the essential and universal method for artificial intelligence. After all, we know brains can do intelligent thing X, so backprop must be able to do it too, so there’s no point wasting time experimenting with alternatives.

Intellectual honesty and hygiene advise dropping the misuse of “neural” to describe backprop networks. This neuro-woo impedes safer AI.

The myth of incomprehensibility obstructs rational investigation

Backpropaganda claims that trained networks are inherently incomprehensible. They work by intuition, not rationality. They are holistic, where the failed symbolic AI approach was reductionist. Their intelligence is spread throughout the network, which in recent systems contains hundreds of billions of adjustable values. They contain no explicit representations, so it’s impossible to take them apart to find out what they’re doing. You can only treat them as mysterious black boxes.

As we will see in the reverse engineering section, this is false. Backprop networks are composed—at least in part—of small, specific bits with identifiable functional roles. They don’t work by holistic intuitive woo; they perform sensible computations, understandable in engineering terms.

However, the myth is convenient for both technical people and the decision makers who put systems into use.

For technologists, the supposed inscrutability frees you from an engineer’s moral responsibility for understanding what your product is doing and how. That’s hard work, and much less fun than building cool demos.

Inscrutability also frees technologists from understanding the messy, non-technical details of the particular real-world task you apply backprop to. It probably involves people, who are annoyingly unpredictable, so technologists don’t like thinking about them. Backprop is supposedly a universal learning method, so it will figure out everything about the task for you, and you never need to know.

It’s to the advantage of decision makers that technical people don’t try to understand the application domain. If they looked into it, they might raise pesky safety concerns, or discover and publicize your dubious agenda.

For decision makers, inscrutability is an advantage if it makes it infeasible to hold anyone to account for bad consequences.1 Authorities justify deployed, dubious AI systems by switching rapidly between rationalist and anti-rational descriptions, as rhetorically convenient. AI is unbelievably powerful cutting-edge high technology, and therefore you must use it. Simultaneously, it is inherently incomprehensible, so its errors cannot be explained, and therefore you have to accept them without question.

Alexander Campolo and Kate Crawford describe this strategy as “enchanted determinism.” Deployed backprop networks are both “enchanted” (they work by magic) and determined (they are technologically inevitable). Therefore, whoever put one in place is absolved of responsibility for explaining or rectifying its harmful effects.

We are not being confronted with a sublime form of superhuman intelligence, but a form of complex statistical modeling and prediction that has extraordinarily detailed information about patterns of life but lacks the social and historical context that would inform such predictions responsibly—an irrational rationalization.2

Finally, the myth offers convenient rhetorical synergy with extreme risk and extreme benefit scenarios. If backprop networks can’t be understood mechanistically, they must be treated as minds instead, which makes them Scary. If they can’t be analyzed, there are no known limits to their capabilities. Then you can handwave their doing whatever implausible thing your narrative requires—whether it’s destroying the planet or delivering utopia.

Backpropaganda’s anti-rational rhetoric is an example of what I called collective epistemic vice in “Upgrade your cargo cult for the win”:3

Epistemic virtue and vice are not just learned from a community of practice, they inhere in it. The ways that community members interact, and the way the community comes to consensus as a body, are epistemically virtuous or depraved partly independent of the epistemic qualities of individuals. Just as moral preference falsification can lead a community of good people to do terrible things, epistemic preference falsification can lead a community of smart people to believe false or even absurd things.

Neural networks don’t solve problems uniformly

Backpropaganda suggests neural networks can provide general artificial intelligence. They can do anything brains can do, so they’re the right technology for every job. Implicitly, backprop is presented as effectively magic, able to solve any problem uniformly, without tedious task-specific engineering.

In fact, unaided it usually doesn’t work at all. Seeming successes using bare backprop in the 1980s were mostly researchers fooling themselves. Recent successes come instead from combining backprop with task-specific mechanisms: convolutions for image processing, tree search for games, conventional amino acid sequence alignment methods for protein folding, and the complex transformer architecture for text generation. They also depend on extensive per-task fiddling with algorithmic variations that have no basis in theory.
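To make “task-specific mechanism” concrete, here is a minimal sketch of a 2D convolution, the kind of hand-designed structure used for image processing. The sliding-filter structure is engineered in by hand; backprop only tunes the numbers inside the filter. This is an illustrative toy, not code from any production system.

```python
import numpy as np

# A 2D convolution: slide a small filter across an image, reusing the same
# weights at every position. The structure is hand-designed; backprop would
# only adjust the values inside `kernel`.

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[1.0, -1.0]])              # responds to horizontal contrast
image = np.random.default_rng(0).random((8, 8))    # stand-in for a tiny image
features = convolve2d(image, edge_filter)          # shape (8, 7)
```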

The fundamental backprop algorithm is very simple. It is gradient descent, a close relative of Newton’s method for finding a function’s maxima or minima. (You may remember finding maxima and minima from an introductory calculus class; if not, it’s not important for this discussion.4) Backprop’s gradient descent5 attempts to minimize the difference, for each training data input, between the network’s output and the correct output. That difference is the amount of error for the data item. Put another way, backprop optimizes the network’s fit to the data.
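As a concrete illustration, here is a minimal sketch of gradient descent fitting a trivial one-input model y = w·x + b by minimizing squared error. It is a toy, not code from any real system; actual backprop networks do the same thing with billions of parameters, using the chain rule to compute the gradients layer by layer.

```python
import numpy as np

# Gradient descent on a toy model: repeatedly nudge the parameters downhill
# along the gradient of the squared error between predictions and targets.

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)    # training inputs
y_true = 3.0 * x + 0.5              # correct outputs

w, b = 0.0, 0.0                     # adjustable parameters ("weights")
learning_rate = 0.1

for step in range(500):
    y_pred = w * x + b              # the model's output
    error = y_pred - y_true         # difference from the correct output
    grad_w = np.mean(2 * error * x) # gradient of mean squared error w.r.t. w
    grad_b = np.mean(2 * error)     # ... and w.r.t. b
    w -= learning_rate * grad_w     # step downhill
    b -= learning_rate * grad_b

print(f"learned w={w:.3f}, b={b:.3f}")  # converges toward w=3.0, b=0.5
```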

This basic algorithm makes mathematical sense, but it hardly ever works. Various kludgy add-ons address its typical failure modes. For example, minimizing error can lead to overfitting, so it’s common to optimize some complex regularized function of the error instead. Such alterations to the algorithm are ad hoc, and not derived from theoretical principles. In most cases, their operation can be adjusted by turning “knobs” called hyperparameters. The effects of the hyperparameters are poorly understood.
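For example, one common add-on replaces the raw error with a regularized loss. The sketch below adds an L2 weight penalty; the knob lam is a hyperparameter whose value has no theoretical derivation and is set by trial and error. This is an illustrative sketch; real systems use more elaborate variants.

```python
import numpy as np

# One typical kludge: minimize a regularized loss rather than the raw error.
# The penalty discourages large weights, which empirically reduces overfitting.
# `lam` is a hyperparameter "knob" chosen by trial and error.

def regularized_loss(w, x, y_true, lam):
    error = x @ w - y_true          # network output minus correct output
    mse = np.mean(error ** 2)       # fit to the training data
    penalty = lam * np.sum(w ** 2)  # L2 weight-decay term
    return mse + penalty
```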

Getting good results from backprop usually depends on setting several hyperparameters just right. There’s no theoretical basis for that, and in practice “neural networks” get created by semi-random tweaking, or “intuition,” rather than applying principled design methods. Often, no hand-chosen set works well, so researchers run a hyperparameter search to try out many combinations of values, hoping to find one that produces an adequate network. (Might this necessity for epicycles be a clue that the approach is overall wrong?)
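In miniature, such a search can look like the sketch below: train the toy model once for each combination of learning rate and weight-decay strength, and keep whichever combination happens to score best on held-out validation data. This is illustrative only; real searches do the same thing with full networks, over larger grids, at vastly greater cost.

```python
import itertools
import numpy as np

# Hyperparameter search in miniature: train a toy regularized model for each
# (learning_rate, lam) combination and keep the one with the lowest error on
# held-out validation data.

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, size=200)
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

def train_and_evaluate(learning_rate, lam, steps=500):
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = (w * x_train + b) - y_train
        w -= learning_rate * (np.mean(2 * error * x_train) + 2 * lam * w)
        b -= learning_rate * np.mean(2 * error)
    return np.mean(((w * x_val + b) - y_val) ** 2)   # validation error

grid = itertools.product([0.3, 0.1, 0.03, 0.01], [0.0, 1e-4, 1e-3, 1e-2])
best_lr, best_lam = min(grid, key=lambda hp: train_and_evaluate(*hp))
print(f"best hyperparameters: learning_rate={best_lr}, lam={best_lam}")
```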

It is widely acknowledged in the field that it is mysterious why backprop works at all, even with all this tweaking.6 It’s easy to understand why gradient descent works in the abstract. It’s not easy to understand why overparameterized function approximation doesn’t overfit. It’s not easy to understand how enough error signal gets propagated back through a densely-connected non-linear deep network without getting smeared into meaninglessness. These are scientifically interesting questions. Investigation may lead to insights that could—ideally—help design a better replacement technology.

In current practice, however, getting backprop to work depends on hyperparameter search to tweak each epicyclic modification just right. Each modification to the algorithm has an understandable explanation abstractly, but none does the job individually, and it’s not easy to understand why they work well enough in combination—when they do.

If it seems likely that the resulting system would have unpredictable properties and fragile performance… that is usually the case.

Historical accidents make backprop seem inevitable

Taking backprop as the sole, correct base technology for artificial intelligence may now seem an eternal certainty. However, while the neuro-mythology is ancient, it was considered a fringe ideology until a decade ago. Recent dramatic successes may have vindicated it. Alternatively, enormously increased funding, applied capriciously to a single dubious technology, may have forced progress in a less-than-optimal direction.

The inefficiency, inscrutability, and unreliability of backprop are serious shortcomings. I believe that the current dominance of this technology is a path-dependent accident of history.

In the software industry, it is common for technically inferior designs to overwhelm better alternatives if they gain early momentum. The initial lead may come from overfunding, successful marketing hype, incidental initial ease of use due to compatibility with other current technologies, or pure random accident. The dominance of backprop may depend on all these.

Specifically—looking ahead—backprop’s early success depended on advances in consumer video graphics boards, and perhaps just some random luck. Then Mooglebook poured unprecedented amounts of money into backprop research, and used their enormous PR resources to hype results, sometimes inaccurately.

Minds vs. brains

From its beginnings in the late 1950s, AI research has been split between opposing factions: those based in theories about minds, and those based in theories about brains. Despite whole disciplines devoted to understanding minds and brains, the available theories of both are inadequate and probably fundamentally mistaken. This was a main reason AI made slow progress: turning our slight understanding of either minds or brains into software did not work.

According to the 1950s mainstream theory of minds, their essence is reasoning about knowledge. “Reasoning” was taken to mean logical deduction.7 The research program that attempted to capture reasoning in software came to be known as “symbolic AI.” That approach conclusively failed in the late 1980s. I was partly responsible for that conclusion, and the subsequent “AI winter” that lasted until about 2012.8

The reasons for symbolic AI’s failure are not primarily technical, but follow from first principles. They are permanently fatal, I think.9 In criticizing “neural networks,” I am not advocating a symbolic alternative.

Perhaps translating some better understanding of minds into software might work. However, what little effort has been expended in that direction has not borne fruit to date.

The other main approach to artificial intelligence, from the beginning, has been to emulate brains, which are mistakenly thought to produce natural intelligence.10

“Neural networks” evolved from a 1940s theory of how biological neurons work.11 The theory was wrong, and neuroscientists regard it as of only historical interest. It was influential in AI in the 1960s, but fell out of favor during the 1970s.

Neural networks were resurrected in the mid-80s as an alternative to symbolic AI, whose failure was then coming into focus. Surely there must be some way to create artificial intelligence! Neural networks were newly appealing as the exact opposite of symbolic AI. At minimum, they appeared to avoid its main defects: combinatorial explosions, inability to cope with nebulosity, and the need to hand-code knowledge.

When symbolic AI hit the wall, a hype train positioned neural networks as its designated successor. Advocates’ enthusiasm for the neuro-mythology, opposed by mainstream researchers who understood why it was false, resembled a holy war in the mid-to-late ’80s. The conflict depended on a false dichotomy, according to which one of the two approaches must be correct—whereas subsequent history suggests neither was.

By the early 1990s, backprop had fizzled. Some seeming technical progress in the ’80s, in applying it to small problems, turned out to be researchers fooling themselves. The rest turned out not to scale up to larger problems.

Anyway, most of the excitement had not been due to technical results, but to the revival of the notion that “neural networks” work like brains—and therefore must be the route to artificial intelligence.

Perhaps translating some better understanding of brains into software might work. However, what little effort has been expended in that direction has not borne fruit to date.

Backprop research wilted to the ground, and lay dormant under the snow of the AI winter for the next two decades. A few stubborn believers kept the roots alive, though.

Forcing a brick airplane to fly

The explosion of the Challenger space shuttle, 1986

A funny but true adage of aeronautical engineering is that, by applying enough power, you can make a brick fly. Aerodynamic design makes an airplane safe and efficient, but it may limit top speed and it slows development. Alternatively, you can bolt an enormous engine onto an aerodynamically inferior body. This is common in fighter jets. The space shuttle took it to the max: a 184-foot, billion-dollar, use-once disposable engine, attached to a stubby airplane that was literally covered in bricks. Two of its 135 flights failed, killing everyone onboard both times.

During the AI winter, machine learning continued as an independent field devoted to statistical optimization methods. Neural networks were considered an eccentric subdiscipline, producing generally inferior results.

Mainstream machine learning research sought to overcome long-standing obstacles: overfitting, local minima, high computational cost, and a requirement for enormously more data than people need to learn the same things. Progress came from adopting unprincipled algorithmic hacks that often gave better performance on one or more of these dimensions, without theoretical justification.

Neural networks leapt out of obscurity in 2012, as winners—by large margins—of the ImageNet classification benchmark competition.12

What had changed—and what hadn’t?

The new image classifiers benefited from two sorts of advances. First, they were even less “neural” than the 1980s systems. Those were emulations of 1940s biological models that had long been known to be false, but at least they were faithful to the theory. The 2010s systems inherited, from the previous couple of decades of machine learning practice, a willingness to incorporate atheoretical add-ons that empirically improve performance. In the case of the decreasingly-neural networks, these included devices such as max-pooling, batch normalization, and dropout, which addressed problems such as overfitting and local minima without reference to biological analogies.
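For a sense of how un-biological these devices are, here are minimal sketches of two of them. Dropout randomly silences a fraction of activations during training; max-pooling keeps only the largest value in each small block of a feature map. Both are engineering hacks that empirically help; neither models anything neurons do. These are illustrative toys, not library implementations.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Randomly zero a fraction `rate` of activations during training,
    rescaling the survivors so their expected value is unchanged."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

def max_pool_2x2(feature_map):
    """Keep only the largest value in each 2x2 block, shrinking the map."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
acts = rng.random((4, 4))
print(dropout(acts, rate=0.5, rng=rng))
print(max_pool_2x2(acts))
```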

Second, the new neural networks ran on GPUs (Graphics Processing Units): add-on boards for home computers, designed to generate video game graphics. Inexpensive 2010 GPUs could do arithmetic faster than the million-dollar supercomputers of a few years earlier. They had also only recently been made suitable for general-purpose computing, not just graphics. Neural networks’ enormous arithmetical cost had been a main drawback relative to other machine learning methods. GPUs made them competitive.

What hadn’t changed was backprop’s deceptive cunning. Once again it had fooled researchers by exploiting spurious correlations in the benchmark. The ImageNet results didn’t mean what they seemed to, and success had little to do with neural networks. I discuss this in detail in the upcoming “Classifying images” section.

The dramatic ImageNet results drew the attention of the giant tech companies, setting off a bidding war to hire and fund the same senior neural network researchers who had led the 1980s movement.13 Since then, those companies have put more funding into backprop than has gone into all other AI technologies combined throughout history.14

They’ve put enormous engine power into—what? Perhaps a brick airplane. Progress in artificial intelligence may be due to Mooglebook’s unprecedented financial backing, and the consequent enormous efforts of a host of brilliant, highly motivated researchers, using unimaginably vast quantities of supercomputer time—rather than any intrinsic merit of backpropagation.

We can’t know, but I’d guess we’d be in a better position if that funding, brilliance, and computation had gone into something else. Or, still better, into a variety of research approaches, to see which would yield the best combination of capability, efficiency, and safety. In any case, putting all the eggs in this one basket seems unwise.15

To be fair, no other AI approach had made much progress in decades; why throw good money after bad? On the other hand, no other approach had had much effort devoted to it during the AI winter, so the lack of progress may not have indicated much.

Still, it seems that the recent dominance of “neural networks” in the marketplace of ideas is due to backpropaganda as well as technical success. The major AI labs are headed by veterans of the 1980s neural-vs.-symbolic battles. They remain determined to prove that “neural” networks are the route to human-like AI because they are brain-like—even though they aren’t.16 As in the 1980s, they define themselves in defiance of symbolic AI, and are still stomping on that dead horse.

But did the “neural” approach win? Or did ad hoc algorithms with zero biological relevance successfully exploit supercomputers?

Dropping the neuro-mythology, the lesson of the past decade is that spending tens of billions of dollars on advanced software development can get you exciting results that have no basis in theory, that you don’t understand, and that are unreliable and unsafe.

  1. Joshua A. Kroll, “The fallacy of inscrutability,” Phil. Trans. R. Soc. A 376:20180084, 2018.
  2. Alexander Campolo and Kate Crawford, “Enchanted Determinism: Power without Responsibility in Artificial Intelligence,” Engaging Science, Technology, and Society, Vol. 6, 08 Jan 2020. Emphasis added.
  3. At metarationality.com/upgrade-your-cargo-cult.
  4. In a narrower, more technical usage, “backpropagation” refers only to one aspect of the algorithm: computation of error gradients layer-by-layer, caching intermediate results for efficiency. Commonly, though, “backpropagation” is used somewhat vaguely to refer to any gradient-based optimization method for “neural networks.”
  5. You can blame the pun “Gradient Dissent” on @lumpenspace.
  6. See for instance Terrence Sejnowski’s “The unreasonable effectiveness of deep learning in artificial intelligence,” PNAS, December 1, 2020, vol. 117, no. 48, 30033–30038.
  7. As it became clear that was wrong, theorists in the 1970s tried to invent something similar to logic, yet somehow different. This “cognitive science” research program retained most of the central mistaken assumptions of logicism, and consequently failed.
  8. Sorry about that! I explain, sort of, in the “Artificial intelligence” section of “I seem to be a fiction” (metarationality.com/ken-wilber-boomeritis-artificial-intelligence).
  9. Part One of In the Cells of the Eggplant explains these reasons, in a broader context than just AI.
  10. People, societies, and cultures produce intelligence, not brains. Brains are involved, as are (for example) stories. A brain would not be sufficient to produce intelligence, if one could somehow be disentangled from the person, society, and culture.
  11. Specifically, Donald Hebb’s 1949 theory of learning built on a 1943 model of neurons due to Warren McCulloch and Walter Pitts, implemented in 1958 as the “Perceptron” by Frank Rosenblatt.
  12. Most histories cite AlexNet’s 2012 victory as the turning point. This seems accurate in terms of tech industry awareness. However, the similar DanNet also won by a large margin in 2011. This led to a bitter precedence battle between the senior members of the two research teams.
  13. Cade Metz, “The Secret Auction That Set Off the Race for AI Supremacy,” Wired, Mar 16, 2021.
  14. I’m reasonably sure of this, but I haven’t been able to find good numbers. From sketchy sources, it appears that total global AI research funding averaged well under a billion dollars per year before 2012; it now runs at many billions per year. Sevilla et al.’s analysis of the amount of computation used in machine learning runs provides relevant evidence: it has grown much faster than Moore’s Law starting in 2012, and continues to accelerate. “Compute trends across three eras of deep learning,” arXiv:2202.05924, 2022.
  15. Klinger et al.’s “A narrowing of AI research?” (arXiv:2009.10385v4, 11 Jan 2022) discusses this problem, plausible policy responses, and a framework for funders to broaden bets.
  16. To be fair, they also emphasize that current methods are insufficiently similar to brains, and advocate more biologically accurate models. Younger researchers mostly don’t care about that. Since 2012, funding has pulled a huge number of new entrants into the field, most of whom know little about its history, and are motivated just to win benchmark competitions by whatever means.