Classifying images: massive parallelism and surface features

Analysis of image classifiers demonstrates that it is possible to understand backprop networks at the task-relevant run-time algorithmic level. In these systems, at least, networks gain their power from deploying massive parallelism to check for the presence of a vast number of simple, shallow patterns.

Stable Diffusion image of a horse with five legs — What is wrong with this picture?

In this section, I’ll explain some of what is known about how backprop-trained image classifiers work—and how they don’t. GPTs are currently less well understood than image classifiers; my guess is that similar explanations apply. If so, we should be less impressed with them, and perhaps less frightened.

I’m writing this section as a personal narrative, partly because the way I came to understand image classifiers may make me overconfident in my guesses about GPTs. Knowing how I came to my tentative conclusions will let you discount them accordingly.

Shortly after finishing a PhD in AI in 1990, I ignored the field until 2014, because nothing seemed to be happening. In 2014, I came across the dramatic AlexNet image classification results from two years earlier, and was shocked. AlexNet wasn’t scary; just unexpected, and initially inexplicable.

I felt I absolutely had to understand what was going on. Back around 1990, I had done some work in machine vision and some with backprop networks, and I understood the states of those fields as of then. It didn’t seem like backprop should be able to do as well as the AlexNet ImageNet benchmark results showed.

The traditional approach to object recognition in machine vision research was to start from a 2D image of (say) a teapot, and build a reconstruction of the 3D shape that would look like that when viewed from a particular angle. Then you could rotate the 3D reconstruction to match against a database of 3D models of different object categories. I couldn’t see any realistic way a backprop network could learn to do that. Humans have special purpose mental rotation hardware, which didn’t seem like it would emerge naturally from backprop training.

So this was an irritating anomaly, especially because my experience had been that backprop networks are very good at fooling backprop researchers, and very bad at solving whatever problem they were set. I definitely didn’t want the backprop advocates to finally be right about something. On the other hand, there didn’t seem to be any obvious way for the networks to cheat on the ImageNet task.

So I read a bunch of papers, which didn’t help me understand what was going on. But I also thought about the nature of the problem itself.

One of the last bits of AI work I did was laying the groundwork for a pancake-making robot. I chose that task because my view was that an essential limitation of AI was its inability to cope with nebulosity—the fuzzy, gloppy, formless quality of reality. Pancake batter is gloppy, and pancakes cooked on a griddle have no definite shape. That meant that traditional robotics and machine vision methods, which were designed for teapots and other rigid manufactured objects with fixed shapes, wouldn’t work.

I set myself as a subtask the machine vision problem of determining when it is time to flip a pancake—which is when bubbles have emerged fairly uniformly on the top, and the batter starts to look less glossy. I figured I’d video cooking many pancakes and see if I could write a detector for the moment of flippability.

So what mattered here was texture, rather than shape. Machine vision research had done very little with texture, although it seemed to me that it might be important in many visual tasks. There was a theory of “textons,” fundamental units of texture, which no one had implemented. To do that, you’d need to run a slew of different convolutions over the image. Unfortunately, the computers of the era ran at about one megaflops, so computing even a single convolution took ages. After messing around with that a bit, I concluded that I wasn’t going to get anywhere any time soon.

AlexNet, running on two 2012 GPUs, was engineered specifically to compute lots of convolutions fast, so it seemed plausible that texture detection was part of how AlexNet worked. What else was it doing? Well, detecting small 2D features is also something that sort of network could do. But those don’t add up to object recognition. Do they?

Hmm, one other trick would be to exploit spurious correlations between the environments (background) in which an object typically appears and the object type. Exploiting spurious correlations was known to be a common way for backprop networks to cheat.

And also, snapshots on the internet (the basis for the training dataset) usually photograph objects from typical angles, so approximate 2D shape matching might work, without 3D reconstruction and rotation.

So it should be easy to test this. Could AlexNet be fooled by showing it images that humans would instantly identify unambiguously?

The example I thought of was a toilet seat (an unusual, fixed, easily identified shape), covered in zebra skin fabric (a texture that would not be associated with toilet seats in the training data), photographed at an unusual angle (so 2D shape matching wouldn’t work), in an outdoor natural landscape (where you might find zebras but not usually toilet seats). I started writing a piece about this, and nearly rushed out to actually make and photograph one to test. Then I reined in my disobedient brain, and told it firmly that someone else would presumably figure this out—if, in fact, the explanation was right.¹

Over the next few years, a series of fascinating experiments showed that it was right.² In fact, my thought experiment was gross overkill. Image classifiers get fooled with any one of the tricks I had in mind.

Artem Khurshudov reported a similar experiment, based on the same analysis: all of several AlexNet-like systems classified a picture of a leopard-print sofa as a leopard.³
Samuel Dodge and Lina Karam found that blurring or adding random noise to images radically degraded classifier performance, while leaving images recognizable (with a bit of difficulty) for humans. As a possible explanation, they suggested that “the network may be looking for specific textures.”⁴
Moosavi-Dezfooli et al. found that adding a tiny amount of a single, fixed pattern of noise, virtually undetectable by humans, caused misclassification of most images. The pattern was high frequency and edge- and curve-dense (messing with texture) and brightly colored (messing with color clues).⁵
Engstrom et al. found that rotating and translating images degraded classification too.⁶
Carter et al. tested the hypothesis that classifiers cheat by exploiting correlations between object types and the backgrounds they are typically photographed against. They found that classifiers did well when shown only narrow strips of the image, around its borders, including zero pixels from the object itself.⁷
To test the texture-only hypothesis, Wieland Brendel and Matthias Bethge randomly scrambled the global structure of images, leaving texture intact, and found it did not badly degrade classification performance.⁸ Apparently local information is more discriminating even than I had imagined; large-scale 2D shape matching doesn’t add much value. This is consistent with image generators such as DALL-E producing photorealistic pictures of horses that have five legs, as in the example from François Chollet at the beginning of this section. “A [neural network] model is excellent at reproducing local visual likeness, yet it has no understanding of the parts & their organization.”⁹
Olah et al. performed a series of circuit-level analyses of the guts of classifier networks, and identified specific circuits computing texture elements, local feature detectors (for dog noses, for instance), and 2D shapes (for cars, for instance). Car detectors look for car windows above car wheels, and don’t work on upside-down cars (which humans have no difficulty with).¹⁰

In sum, it seems that, at the task-relevant algorithmic level, an image classifier mainly consists of an extremely large number of mainly small-scale, highly specific 2D feature detectors. It computes all those in parallel, and classification relies on a small number firing that together predict a particular image category.¹¹

It’s plausible that one could engineer a non-“neural,” more efficient and better understood system specifically for this purpose, by implementing the same algorithm in conventional software.

I’d like to generalize four hypotheses from the image classification example. Whether these apply in any other specific domain is an empirical question. At minimum, they seem worth pursuing as simple explanations of first resort, by Occam’s Razor.

Backprop isn’t magic, and networks usually work in straightforward ways that make sense in terms of the task they’re applied to.
What we learned from AlexNet and its successors was not “backprop is incredibly powerful”; it was “texture clues are much more discriminating than previous theories of vision anticipated.” This is a fascinating fact about images, not about backprop. Convolution computation was built in by the researchers, not discovered by backprop. Plausibly, other backprop successes similarly rest on a combination of specialized researcher-given architecture and unexpectedly simple, brute-force solutions to apparently complex problems.
It’s usually possible to figure out what a network is doing: from first principles in the case of my guess, and then by experimental tests, and then by detailed circuit analysis.
Typically, networks gain their power from deploying massive parallelism to check for the presence of a vast number of simple, shallow patterns. I believe current text generators work this way too.

I think this must also be much of the way brains work, because neurons are so slow. Biological circuits are extremely depth-limited. We too must rely on breadth for our own, limited smarts instead. This was a motivating constraint in my PhD work. It is a valid similarity between “neural” networks and the brain—whereas artificial “neurons” and backprop aren’t.¹²

Comment on this page

1.I’ve been told that unspecified others figured this out earlier, but they didn’t write it up at the time. My work on this was in December 2014 and January 2015. My earliest public mention, as far as I can find, was on 13 February 2016 (twitter.com/Meaningness/status/698688687341572096). I’m mentioning dates not to establish academic priority, but to suggest that my predictions about AI may be credible.
2.For a non-technical summary, see Jordana Cepelewicz’s “Where we see shapes, AI sees textures,” Quanta Magazine, July 1, 2019.
3.“Suddenly, a leopard print sofa appears” (May 2015). Some systems did get his original image right, but all classified the sofa as a leopard if he rotated it 90 degrees. I only discovered his discussion several years later.
4.“Understanding How Image Quality Affects Deep Neural Networks,” arXiv:1604.04004, April 2016. It had been known for a couple years that classifiers could be fooled by adding noise crafted to be “adversarial” against an individual image (Goodfellow et al., “Explaining and Harnessing Adversarial Examples,” arXiv:1412.6572, December 2014). As far as I have found, this was not previously attributed to breaking texture clues, and also Dodge and Karam’s work was the first demonstration that random noise was effective.
5.“Universal adversarial perturbations,” arXiv:1610.08401v1, October 2016.
6.“Exploring the Landscape of Spatial Robustness,” arXiv:1712.02779, December 2017.
7.“Overinterpretation reveals image classification model pathologies,” NeurIPS 2021.
8.“Approximating CNNs with Bag-of-Local-Features Models Works Surprisingly Well on ImageNet,” ICLR 2019. A similar and earlier but perhaps less definitive test was Baker et al.’s “Deep convolutional networks do not classify based on global object shape,” PLOS Computational Biology, 2018.
9.At twitter.com/fchollet/status/1573836241875120128 and twitter.com/fchollet/status/1573843774803161090.
10.For a summary of this research, “Zoom In: An Introduction to Circuits,” distill.pub/2020/circuits/zoom-in/, 2020.
11.In more recent work, researchers have found ways of training networks to use shape information more and texture less, which does improve performance. Geirhos et al., “ImageNet-Trained CNNs Are Biased Towards Texture; Increasing Shape Bias Improves Accuracy and Robustness,” ICLR 2019; Dehghani et al., “Scaling Vision Transformers to 22 Billion Parameters,” arXiv:2302.05442, 2023.
12.For an insightful exploration of this theoretical perspective, see Hasson et al., “Direct Fit to Nature: An Evolutionary Perspective on Biological and Artificial Neural Networks,” Neuron, Volume 105, Issue 3, 5 February 2020, Pages 416-434.