How to understand AI systems

“Are language models scary?” attempts a particular, unusual type of understanding. This chapter explains what sort of explanation it aims for, as a preliminary before analyzing language models themselves.

It has two sections:

“Task-relevant algorithmic explanations” explains the type of understanding the document aims for, overall and in the abstract. It’s common to explain that current AI systems are “neural networks,” consisting of arithmetical “units” that are trained with backpropagation. This explanation is true as far as it goes, but doesn’t give much insight into what they can or can’t do, or why.
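
For concreteness, here is a minimal sketch of what one such “unit” computes: a weighted sum of its inputs passed through a simple nonlinearity. The numbers are illustrative, not taken from any real system.

```python
# A single "unit" of a neural network: a weighted sum of inputs,
# passed through a simple nonlinearity (here, ReLU). The weights and
# bias are the numbers that training adjusts. Purely illustrative.

def unit(inputs, weights, bias):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)  # ReLU: negative sums become zero

# Example with three inputs and arbitrary illustrative weights:
print(unit([0.5, -1.0, 2.0], [0.1, 0.4, -0.2], bias=0.05))
```

Backpropagation is the bookkeeping procedure for nudging those weights and biases to reduce a numerical error score. Notice that nothing in this description hints at what a trained network of such units will or won’t be able to do, which is why the explanation gives so little insight.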

It’s also common to explain language models as “predicting the next bit of text,” which is also true, but leaves “how” unanswered. Without the “how,” it’s impossible to know why they fail when they do, and what their limits may be.
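The “predicting the next bit of text” description is similarly easy to state in a few lines. Here is a hypothetical sketch of the generation loop, in which `next_token_probabilities` is a stand-in for a trained model:

```python
import random

def generate(next_token_probabilities, prompt_tokens, n_tokens):
    """Repeatedly ask the model for a probability distribution over
    possible next tokens, pick one, append it, and continue from the
    longer text. `next_token_probabilities` maps a token sequence to
    a {token: probability} dict; it is a placeholder, not a real API."""
    tokens = list(prompt_tokens)
    for _ in range(n_tokens):
        probs = next_token_probabilities(tokens)
        candidates, weights = zip(*probs.items())
        tokens.append(random.choices(candidates, weights=weights)[0])
    return tokens
```

This loop is the same for every language model, which is exactly why it explains so little: the whole question of “how” lives inside `next_token_probabilities`.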

“Classifying images” explains AI image classifiers at the algorithmic level, as an easier warm-up before tackling language models. For these systems, the explanation has been tested and is known to be mostly correct. What we’ve discovered is that these systems rely mainly on superficial, local features of images—textures, especially—rather than overall shape and structure.
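
As a caricature of what “relying on local texture rather than overall shape” means (not a description of how convolutional networks actually compute), consider summarizing an image purely by statistics of small patches:

```python
import numpy as np

def texture_signature(image, patch=8):
    """Summarize a grayscale image by the mean and variance of small
    local patches, then average those statistics over the whole image.
    The averaging discards where each patch sat in the image (the
    global shape); only local texture survives. Purely illustrative."""
    h, w = image.shape
    stats = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            p = image[y:y + patch, x:x + patch]
            stats.append((p.mean(), p.var()))
    return np.array(stats).mean(axis=0)

# A classifier built on such signatures could separate fur from
# scales, but not a cat-shaped object from a dog-shaped one.
```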

I will suggest in the next chapter that language models work similarly: they mainly exploit superficial linguistic patterns rather than “deep understanding.”