Better text generation with science and engineering

Current text generators, such as ChatGPT, are highly unreliable, difficult to use effectively, unable to do many things we might want them to, and extremely expensive to develop and run. These defects are inherent in their underlying technology. Quite different methods could plausibly remedy all these defects. Would that be good, or bad?

Good or bad? A conundrum

My interest in AI, briefly rekindled in 2014 by the ImageNet results I discussed in the previous chapter, fell mainly dormant again until mid-2022. Then I learned of newly-discovered “chain-of-thought” behavior in text generators, in which they appeared to engage in common sense reasoning. Common sense reasoning has been the holy grail of AI research from the beginning.1 It’s plausibly the way AI could transcend the limitations of interpolation (discussed in “Beyond interpolation: reasoning,” earlier).

Automated common sense has been stubbornly resistant to progress, and this seemed a potential breakthrough—perhaps with both exciting and scary implications.

Rapid technical progress in text generation alarmed many other people, too. Will a few more years’ development at that rate produce Scary superintelligence? We don’t know, and we should try to find out. The work I did in mid-2022 made me think it’s unlikely, and subsequent developments have tended to reinforce that, but this conclusion remains tentative.

Prudence therefore advises recognizing that Scary AI might arrive soon, and acting accordingly. We should try to find out by doing science and engineering on the systems we already have, not by rushing ahead full tilt building more powerful ones to see whether or not they cause an apocalypse.

Reasoning about how current text generators work synergized both with work I had done in computational linguistics thirty years earlier and with insights from my analysis of image classifiers. Together these may have profound implications for fundamental linguistics and cognitive science, which I find extremely exciting! They also suggest several ways it may be feasible to build systems with functionality similar to that of current text generators, but based on different technology that would make them reliable, easier to use, more powerful, and more efficient.

Would that be good? The behavior of the mechanisms I have in mind would be much more predictable than current systems, and therefore much less likely to become Scary for unknown reasons. That’s good!

On the other hand, if they are more powerful, reliable, and inexpensive, they are much more likely to be used—and abused. They might have larger unpredictable effects on the world than the current program of building ever-larger, more expensive versions of ChatGPT using the same “predict next word” paradigm. That’s probably bad!

Since better text generation technologies might have either good or bad effects relative to the current path, I have been torn, ever since I started this project, about how much to say about them.

I suspect the project of scaling up GPTs may be approaching its limits. Existing text generators get trained on pretty much all the text worth training on, and may do nearly as good a job of processing it as is possible within the current technological paradigm. ChatGPT, despite predictions both of enormous short-term economic benefits and apocalyptic disemployment of most office workers, has had no broadly visible effects in its first year of existence. It may not be feasible to increase the usefulness of such systems much further, in which case the tens of billions of dollars going into attempts may be wasted.

Then the current wave of AI enthusiasm may recede, as previous ones have. On balance, I think that would probably be good, because so far we have no clear path to a good future with powerful general-purpose AI, and plenty of plausible disaster scenarios. Among current AI approaches, text generation seems the most worrying, due to its seeming reasoning ability. If it’s the most dangerous current technology, and may be approaching its limit, we can be less concerned for now about Scary AI. So this suggests that my explaining ways to make better ChatGPT-like systems would be bad.

As I am still extremely unsure about this, I will describe only the least innovative and seemingly safest of the several possibilities I envision. Regretfully, I am omitting discussion of less obvious, more powerful, more dangerous possible future language technologies.

Not to keep you in suspense, the approach I will describe is to separate language ability from knowledge. A dramatically smaller GPT can provide fully fluent text generation, drawing content from a well-defined textual database, rather than mixing up facts with its language ability. That would eliminate the current biggest defect of GPTs: “making stuff up” or “hallucinations.”2 Separating language and content makes outputs faithful to the text database. Shrinking the GPT by several orders of magnitude would also make it much easier to analyze, understand, and validate.

Text prediction: the wrong tool for the job

Current mainstream text generators are all based on the GPT (generative pre-trained transformer) architecture. That works by taking as input some text (a “prompt”) and predicting a statistically plausible continuation of it.
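As a minimal sketch of what “predicting a statistically plausible continuation” means, the following toy replaces the transformer with simple word-pair counts. The corpus, the function names, and the bigram method itself are invented for illustration; this is not how a real GPT is implemented, only the same input-to-continuation shape at miniature scale.

```python
import random
from collections import defaultdict

def train_bigram(text):
    """Record, for each word, every word that followed it in the training text."""
    followers = defaultdict(list)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def continue_text(followers, prompt, max_words=8, seed=None):
    """Extend the prompt one word at a time by sampling a recorded follower."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(max_words):
        options = followers.get(words[-1])
        if not options:  # nothing ever followed this word in training
            break
        words.append(rng.choice(options))
    return " ".join(words)

# Hypothetical toy corpus; a real GPT trains on vastly more text.
CORPUS = (
    "quokkas are friendly animals . quokkas are wild animals . "
    "quokkas are popular pets . wild animals are not pets ."
)

model = train_bigram(CORPUS)
print(continue_text(model, "quokkas are", seed=1))
print(continue_text(model, "quokkas are", seed=2))
```

Every word pair in the output occurred somewhere in the training text, so each continuation is locally plausible. Nothing in the procedure checks whether the continuation as a whole is true, and different seeds can give mutually inconsistent results from the same prompt.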

Plausible text continuation has little, if any, inherent usefulness.

This paradigm was not originally intended to be useful. Useful text generators were an unexpected, accidental byproduct of computational linguistics research. The primary goal was to understand syntax: the grammar of human languages. The technical approach built a statistical model of a pile of human-written text. A system that could output grammatically correct English, without someone having to encode all the grammatical rules, was the aim.

No one imagined that the grammatical but meaningless gibberish it output (“colorless green ideas sleep furiously”) would be of any use. That would be the whole next research project, in which the grammatical model would get connected to a knowledge representation and reasoning system. The reasoning system would produce meanings, and the linguistic system would translate those into English, and thereby serve as its output channel.

It turned out that if you train on huge quantities of human-written text, the outputs are not only grammatical, they often seem meaningful. Initially, as in GPT-2, only for a sentence or two; but then scaling up from “huge” to “unimaginably gigantic,” as in GPT-3, they’d generally sound sensible for a paragraph or two. Such systems seemed to be learning semantics (meanings) as well as syntax (grammar).

And, researchers discovered that careful crafting of prompts could produce outputs that were not only internally meaningful, but relevant and useful. In the simplest case, something like “What is the capital of France?” causes the output “Paris is the capital of France,” because in the training data that’s the most common next sentence. A large enough GPT is able to answer questions like this even when the question and answer are not literally paired in the training data, because it finds patterns in the forms of questions and corresponding answers.

This discovery was completely unexpected,3 and is now the primary positive use for text generators. (Probably the economically dominant use is in boilerplate generation, for spam and near-spam, whose overall value is negative.)

Text generators are still usually referred to by researchers as “language models,” although modeling language hasn’t been their purpose for many years. Later in this chapter, I suggest that it should be: a GPT should model language, not the contents of a random terabyte of blather scraped from the web.

Text generators’ near-omniscience is the basis of much current excitement, fear, financial investment, marketing hype, and research effort. This is largely misconceived and misdirected, however. At best, a text generator “knows” only what was in its training dataset, and in the best case it would just report that accurately, in full or in summary as requested.

Unfortunately, GPTs can’t and don’t do that. They are only trained to produce statistically plausible continuations of their inputs. Those include confident explanations of plausible-sounding falsehoods that are not in the texts they were trained on.

In a meaningful sense, they don’t “know” anything at all. They are text genre imitation engines, not knowledge bases. As I wrote earlier,

It is not that text generators “make stuff up when they don’t know the right answer”; they don’t ever know. If you ask one whether quokkas make good pets, it may write a convincing article explaining that they are popular domestic companions because they are super friendly and easy to care for. Ask again immediately, and it may write another article explaining that they are an endangered species, illegal to keep as pets, impossible to housebreak, and bite when they feel threatened. Exactly the same process produces both: they are mash-ups of miscellaneous internet articles about “does animal X make a good pet,” with some quokka factoids thrown in.

This is not a reasonable basis for most of the things people want text generators to do. They are the wrong tool for the job. It’s amazing how well they work considering that, but the overall approach is fundamentally and unfixably flawed.

Nevertheless, proponents are trying to make GPTs seem inevitable as the way forward for AI in general, because they want to sell something now.

As retrieval mechanisms for textual knowledge, GPTs compete with web search. Or, they compete in principle, at least! In practice, as of 2023, they are mostly complementary, with the strengths of each partially compensating for the defects of the other. The approach I suggest in the next section should combine the strengths of both.

Separating linguistic ability from knowledge

I suggest that separating these could produce a reliable human-language interface to a database of human-language text, eliminating ChatGPT’s hallucinations; providing concise, relevant, detailed answers (unlike web search); and giving access to the knowledge in books and periodicals not available on the web while avoiding copyright violation.

I’ll first sketch the way this might work, and then describe some recent research that suggests it is feasible, along with some historical context. I won’t go into technical details, nor answer objections that the approach wouldn’t work for one reason or another. In this case, that is not for safety reasons; it’s because I don’t have any unusual insights into the technical considerations. This is a minimally innovative proposal, and anyone working in the field will see the same possibilities and obstacles I do.

For several years, text generation researchers pursued the “scaling hypothesis” that larger networks yield better performance, possibly without bound.4 Extensive empirical evidence seemed to bear this out: almost always, bigger networks did better on benchmarks. Eventually, a lottaflops and as much as a hundred million dollars were spent training single networks on the order of a trillion parameters, and they did better than ones with only tens of billions. Increasing evidence suggests that by late 2022 this had run its course, however.

In retrospect, it seems likely that the main reason scaling up GPTs improved benchmark performance was that in effect they store the text they were trained on (somewhat compressed and distorted) in the form of network parameters.5 Many standard benchmarks test mainly knowledge, so storing more of it (in near-textual form) gives better test performance.

We tend to mistake omniscience for intelligence, because we cannot imagine what it would be like to have instant mental access to the most relevant knowledge in a major research library containing tens of millions of books.

However, backprop networks are an extraordinarily expensive and unreliable way to store text. Probably a quite small network can capture full linguistic fluency, if it doesn’t need to waste parameters on “knowing” stuff as well. Then it can rely on the actual original text for its “knowledge,” instead.

Starting in 2020, several teams recognized that a GPT often behaves as though it is retrieving text, and basing its output on that, although it doesn’t in fact have access to any. So they augmented GPTs with a large text database, available at run time, and a semantic-match retrieval engine.6 And this works dramatically better!

For example, in August 2022, the retrieval-augmented Atlas GPT set new state-of-the-art accuracy records on various “language understanding” and “knowledge intensive” tasks with an 11 billion parameter network, outperforming PaLM, the previously most powerful GPT, which had 540 billion parameters.7 It’s fifty times more efficient on that metric.
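The “fifty times” figure is just the ratio of the two parameter counts quoted above. A quick check, using only those two numbers (the variable names are mine):

```python
# Parameter counts as quoted in the text.
PALM_PARAMS = 540e9
ATLAS_PARAMS = 11e9

ratio = PALM_PARAMS / ATLAS_PARAMS
print(f"PaLM has about {ratio:.0f} times as many parameters as Atlas")
```

The ratio comes to roughly 49, which rounds to the fifty-fold figure. It compares network size only; it says nothing about the cost of storing and searching the retrieval database.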

This suggests that the increasing performance of larger networks was due in large part to assimilating increasing quantities of text. Retrieval-augmented GPTs are a success story for an agenda I described earlier: replacing parts of backprop networks with engineered alternatives, based on algorithmic-level understanding, making them more efficient, interpretable, and reliable.

Taking this approach to the limit, there seems no good reason to allow “knowledge” in the network. We should want to get rid of that! Ideally, we should want a retrieval-only system, not a retrieval-augmented one. A fluent but ignorant text generator could reliably summarize responsive content from its text database, eliminating “hallucination.” It could link the passages it drew on, so you could assess their quality and relevance.8
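To make the shape of a retrieval-only system concrete, here is a deliberately crude sketch. Everything in it is an assumption made for illustration: the word-overlap similarity stands in for real semantic matching, concatenating passages stands in for fluent summarization by a small GPT, and the database contents, source IDs, and threshold are invented. What it does show is the contract: the answer is built only from retrieved passages, the sources are returned with it, and when nothing matches the system says so.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def answer(query, database, top_k=2, min_score=0.2):
    """Build a response solely from retrieved passages, and cite them."""
    q = bag_of_words(query)
    scored = sorted(
        ((cosine(q, bag_of_words(text)), source, text)
         for source, text in database.items()),
        reverse=True,
    )
    hits = [(s, t) for score, s, t in scored[:top_k] if score >= min_score]
    if not hits:
        # No responsive passage: report that, rather than inventing one.
        return {"text": "No responsive passage found.", "sources": []}
    return {"text": " ".join(t for _, t in hits),
            "sources": [s for s, _ in hits]}

# Hypothetical database: source ID -> passage.
DATABASE = {
    "wildlife-guide#quokka":
        "Quokkas are a protected species and it is illegal to keep them as pets.",
    "travel-blog#rottnest":
        "Rottnest Island is a short ferry ride from Perth.",
}

found = answer("Is it illegal to keep quokkas as pets?", DATABASE)
missing = answer("Who pitched in the 1909 World Series?", DATABASE)
print(found["sources"], missing["sources"])
```

The second query returns an empty source list because nothing in the database is responsive. A knowledgeable GPT would instead produce a confident, plausible-sounding answer.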

One reason text generators are still mainly unsafe for commercial use is that their output could be based on anything found in a terabyte of who-knows-what. A customer service chatbot should reliably base answers solely on a company-specific database. That may be feasible with this architecture. Many companies are experimenting with retrieval augmentation for this reason; but unless the generator is fully ignorant, its outputs are based unpredictably on the text it was originally trained on, retrieved text, and a mixture of the two.

An advantage of retrieval-augmented systems is that the text database can be updated at near-zero cost. Outputs based on retrieval immediately reflect that change. In contrast, correcting mistaken, unwanted, or out-of-date “knowledge” in a plain GPT requires at minimum “fine tuning” (partial retraining, which is quite expensive), and potentially complete retraining (prohibitively expensive).

Retrieval augmentation’s update capability is limited, though, because outputs are based only in part on retrieved text. In a retrieval-only system, you’d have total, near-zero-cost control over what “knowledge” outputs derive from.
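The near-zero-cost update can be illustrated with a toy passage store. The class, its method names, and the refund-policy text are all hypothetical, and the keyword lookup is a placeholder for semantic retrieval. The only point is that a correction to the stored text is visible to the very next query, with no retraining step.

```python
class TextDatabase:
    """Hypothetical in-memory passage store, keyed by source ID."""

    def __init__(self):
        self.passages = {}

    def upsert(self, source, text):
        """Add a new passage, or overwrite (correct) an existing one."""
        self.passages[source] = text

    def delete(self, source):
        """Remove a passage so that no later output can derive from it."""
        self.passages.pop(source, None)

    def lookup(self, keyword):
        """Return (source, text) pairs whose text mentions the keyword."""
        k = keyword.lower()
        return [(s, t) for s, t in sorted(self.passages.items())
                if k in t.lower()]

db = TextDatabase()
db.upsert("policy-v1", "Refunds are available within 14 days of purchase.")
before = db.lookup("refunds")

# Correct the policy text; the change takes effect immediately.
db.upsert("policy-v1", "Refunds are available within 30 days of purchase.")
after = db.lookup("refunds")
print(before, after)
```

In a retrieval-only system this is the whole update procedure. In a retrieval-augmented one, the old “14 days” could still surface from the network’s parameters.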

Why isn’t everyone already doing this?

So if this is such a good idea, why isn’t everyone doing it? The retrieval-only ideal seems obvious. I proposed it in October 2022,9 but no one else has mentioned the possibility, as far as I have seen, much less pursued it.10 Even retrieval augmentation seems radically underused relative to its benefits, and I don’t know why.

Maximizing reliance on retrieval may be a bad idea for some reason I’m missing. Failing that, I can think of a couple of possible explanations for why it’s not pursued. I’ll describe a technical one and a public relations one.

The technical issue is run time compute cost. Because an ignorant language-only GPT would be much smaller than current typical ones, it would be much less costly both to train and to run. However, semantic retrieval from a terabyte text database has an additional computational cost. Sources I have read provide contradictory evidence for how this compares to GPT run time cost: some say it is much less, and others that it is much more.11 How the cost scales with the size of the text database is also unclear.

This might be an insuperable obstacle. On the other hand, comparatively little effort has gone into optimizing it. Improved algorithms or faster implementations of existing ones might do the trick. If not, a different hardware architecture may be required. The fundamental operations for GPT run time and for retrieval are quite different. Neural network computation is dominated by low-precision multiplication, which current AI supercomputers optimize with specially designed hardware. Semantic retrieval uses an enormous inverted text index, which maps text “meanings,” i.e. points in latent space, to corresponding text fragments. Computation cost is dominated by access to the RAM storing the index. Retrieval is highly parallelizable, so an optimal architecture might use a very large number of SIMD CPUs with high-bandwidth RAM buses.12
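The structure of that computation can be sketched in a few lines. This is an illustrative toy under stated assumptions: the three-dimensional vectors are made-up stand-ins for latent-space points, the search is brute force rather than any optimized index, and the function names are mine. Python threads also give no real speedup on CPU-bound work, so this shows only how the scan splits into independent shards whose results are merged, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(u, v):
    """Inner product, used here as the similarity score."""
    return sum(a * b for a, b in zip(u, v))

def best_in_shard(shard, query):
    """Scan one shard of the index; return (score, fragment) for its best match."""
    return max((dot(vec, query), frag) for vec, frag in shard)

def search(index, query, n_shards=2):
    """Split the index into shards, scan them independently, merge the results."""
    if not index:
        return None
    shards = [index[i::n_shards] for i in range(n_shards)]
    shards = [s for s in shards if s]  # drop empty shards
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        local_best = list(pool.map(lambda s: best_in_shard(s, query), shards))
    return max(local_best)

# Hypothetical index: (point in latent space, text fragment).
INDEX = [
    ([0.9, 0.1, 0.0], "Quokkas live on Rottnest Island."),
    ([0.1, 0.9, 0.0], "Paris is the capital of France."),
    ([0.0, 0.2, 0.9], "The World Series is a baseball championship."),
]

score, fragment = search(INDEX, [0.2, 0.8, 0.1])
print(fragment)
```

Each shard does little arithmetic per entry but must read every vector it holds, which is why memory bandwidth rather than multiplication throughput is likely to dominate at scale.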

Anyway, greater compute cost might be worth paying. It buys you safety and reliability that a straight GPT can’t provide, even if it’s more expensive than just blurting out whatever seems plausible.

Another reason major AI labs may not have pursued the retrieval-only approach is that it seems a step backward from their stated goal of creating artificial general intelligence. Using “instruction tuning,”13 plus a gigantic, convoluted, secret “system prompt,” a gigantic GPT can be coaxed into the appearance of performing many dissimilar tasks. That makes plausible “We’re leaders on the path to omnipotent superintelligence!” This claim, rather than mundane utility, is the basis for the colossal financial investment into the big AI labs.

The retrieval-only paradigm is not necessarily applicable to all the things people try to make GPTs do currently. It’s plausible that versions of this approach could address some of the other tasks large GPTs are used for now, but probably not all. For example, a linguistically sophisticated but ignorant system might be able to provide a conversational “chat” interface to its text database, or to an API. It might not be able to write boilerplate without a detailed spec (a major current application for GPTs), or to provide the “creative” functions of story writing or brainstorming.

That seems good. “Tool AIs” that do specific things reliably are safer than superintelligent AGI, and more likely to provide net positive utility.

Retrieval-only systems could be dismissed as just better search engines. (“Google Search already does this! It uses semantic matching (sometimes) to retrieve from the web, and uses a GPT to summarize results (sometimes).”) From my point of view, this technical unimpressiveness is good.

I expect most people would rather have simple software that does one thing reliably than a technically complex spectacular demo which does a vast but unspecifiable collection of things, and often produces outputs that look correct and aren’t.

Worries that GPTs may soon become secretly spookily smart, or possibly are already Scary, would not apply to an ignorant summarization engine. It’s inherently limited in what it can do. Further, ignorance would make GPTs dramatically smaller. Those will be easier to analyze, understand, and validate.

How to make an ignorant GPT

It’s probably impossible to make a useful GPT that is entirely ignorant, and also probably impossible to make one that is entirely reliable and hallucination-free. The methods I sketch here may go a long way, but a GPT is still a GPT, and GPTs are still The Wrong Thing.

The boundary between world knowledge and linguistic ability is somewhat nebulous. Retrieval-only is an ideal, but not possible even in principle. Semantic disambiguation often depends on factual knowledge, some of which must therefore be included. Therefore, there’s no precise criterion for what should be excluded, but most cases are clear enough. “Who pitched for the winning team in the 1909 World Series” does not belong in a GPT.

I believe entirely different approaches could deterministically produce language to order (instead of semi-randomly predicting likely continuations). This was always the goal in AI research until a few years ago, when GPTs proved unexpectedly capable. I see no reason to think it’s impossible, although the methods attempted so far have not worked well. As mentioned earlier, I won’t discuss this further, partly for safety reasons.

GPTs store vastly more world knowledge than linguistic knowledge, so just limiting network size is part of the recipe for ignorance. However, experience with training small networks (up to about seven billion parameters) on large text databases is that their language capability is poor. They are limited in what they can say, not just in how much they know.

During training on a large text dataset, a GPT doesn’t “know” whether it’s supposed to be memorizing text or inferring linguistic patterns, so it winds up doing some of each, mostly the first. To create an ignorant but linguistically fluent GPT, we’ll need to bias it away from content learning and toward language learning.

Evidence that this may be feasible is provided by the TinyStories project.14 Researchers trained tiny GPTs on a synthetic text database consisting only of kindergarten-level stories, with restricted vocabulary and subject matter. The resulting systems produced fluent, coherent stories, several paragraphs long, with almost perfect grammar. They also demonstrated “reasoning” capabilities similar to those of large GPTs. These systems had around ten million parameters, three orders of magnitude fewer than “small” GPTs that are less linguistically capable, and five orders of magnitude smaller than current large ones. This is a proof of principle for the power of small GPTs when deliberately trained for competence rather than memorization. However, the TinyStories text database is restricted to a limited vocabulary and simple grammar.
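The size comparison above can be checked against the round figures quoted elsewhere in this chapter. The specific values are my reading of those figures (“around ten million,” “up to about seven billion,” “on the order of a trillion”), not exact parameter counts for any particular system.

```python
import math

TINYSTORIES_PARAMS = 10e6  # "around ten million"
SMALL_GPT_PARAMS = 7e9     # "up to about seven billion" (assumed representative)
LARGE_GPT_PARAMS = 1e12    # "on the order of a trillion"

def orders_of_magnitude(larger, smaller):
    """Base-10 logarithm of the size ratio."""
    return math.log10(larger / smaller)

small_gap = orders_of_magnitude(SMALL_GPT_PARAMS, TINYSTORIES_PARAMS)
large_gap = orders_of_magnitude(LARGE_GPT_PARAMS, TINYSTORIES_PARAMS)
print(f"small GPTs: {small_gap:.1f} orders; large GPTs: {large_gap:.1f} orders")
```

With these assumed values the gaps come to just under three and exactly five orders of magnitude, consistent with the comparison in the text.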

Researchers followed up with a series of studies training small GPTs on databases of sophisticated, high quality text. Historically, because GPTs were intended as language models, they were trained on any old language, indiscriminately sourced from the internet. Because GPTs are extremely inefficient learners, enormous quantities of text were required. Unsurprisingly, they learned vast quantities of false facts and bad behavior from data such as political web forum disputes. It’s becoming increasingly clear that text quality trumps quantity, which is another reason the scaling hypothesis seems increasingly shaky.

In “Textbooks Are All You Need” and other studies,15 researchers showed that small GPTs trained on high-quality text perform as well on “language understanding” and “common sense reasoning” benchmarks as mainstream ones an order of magnitude larger. The data sets were a mixture of manually-curated high quality web text and synthetically generated text meant to cover “common sense knowledge.”16

This suggests training a minimal-sized GPT on synthetic text that covers the full vocabulary and spectrum of linguistic constructions in a major research library, but with artificially minimal factual content. That might meet the simultaneous goals of language mastery and adequate ignorance. Constructing such texts would be a significant research project, but it’s not obviously impossible, or even particularly difficult.

Alternatively, there may be ways to bias the training process itself toward finding linguistic patterns rather than storing content. It’s not obvious how to do that, but it’s also not obvious that it would be difficult or impossible.

Ignorant retrieval and summarization systems face roughly the same legal issues under copyright law as mainstream, knowledgeable GPTs. However, they may provide an opportunity for a win-win outcome not available to knowledgeable GPTs.

Large GPTs are trained mainly on copyrighted text and images. Numerous lawsuits by authors and publishers are underway, alleging that this use infringes copyright. There is no directly relevant legislation or case law as yet, and the overarching fair use doctrine is “murky and evolving.” The outcome of these suits is, therefore, highly uncertain.17 At one extreme, courts could rule that any use of copyrighted text in training constitutes infringement. At the opposite, they could rule that it constitutes fair use, even when GPTs output chunks of the copyrighted work verbatim (as they occasionally do). Probably they’d prefer some middle ground, whose shape is as yet undetermined.

AI companies’ defense may depend primarily on the “transformativeness” aspect of copyright doctrine. They can argue that GPTs rarely reproduce copyrighted material verbatim (and they can take further technical measures to prevent them from doing so). Automated paraphrasing may constitute adequate transformativeness to avoid copyright infringement—or it may not. These considerations are much the same for an ignorant system, although its paraphrases may be closer to the source material on average.

A practical, rather than legal, defense for AI companies is that it’s usually impossible to determine which texts contributed to a large GPT’s output. A plaintiff claiming copyright infringement may have a hard time demonstrating that it occurred. (“Yes, we trained the GPT on your book, but you can’t show that in any specific instance the GPT relied on it to produce supposedly-infringing output. It may, for instance, have relied instead on some similar book, or on a third-party summary of yours, maybe in a book review.”) This defense does not apply to retrieval-based generation: the exact sources are knowable.

However, that provides an avenue of practical defense for retrieval-based systems not available to current mainstream ones. They can provide users with links to the sources used to generate an output. Sometimes users will then purchase copies of the linked copyrighted works. This suggests a possibility for a win-win outcome.

There is an analogy here with Google Books, which searches over tens of millions of copyrighted works and displays “snippets” of them for free. In 2005, authors and publishers filed suits alleging infringement. Google spent years working with the plaintiffs to craft a financial deal to compensate them for revenue lost when users read free snippets instead of buying the whole book. After a decade of tentative settlements that the plaintiffs couldn’t quite agree to, the court finally ruled in favor of Google: snippets are fair use.18

So the authors and publishers technically lost, but it seems they still came out ahead. Users do indeed purchase books based on snippets significantly often. Those serve, in effect, as free advertising.19

Here’s the deal that’s a win for users, AI companies, authors, and publishers:

  1. John McCarthy’s 1958 paper “Programs with common sense” (Proceedings of the Teddington Conference on the Mechanization of Thought Processes, 756–91) was an immensely influential founding document of the field.
  2. The term “hallucination” is misleading; in all other contexts, the word refers to perceptual phenomena, which this is not. “Bullshit,” in the quasi-technical sense of Harry Frankfurt (On Bullshit, 2009), would be more accurate. When a speaker or writer doesn’t care (or often even know) whether their seeming claims are true, they are generating “bullshit” in this sense. That is precisely what current AI text generators do.
  3. Petroni et al., “Language Models as Knowledge Bases?”, ACL Anthology D19-1250, 2019.
  4. Gwern Branwen, “The Scaling Hypothesis,” 2020. This is related to Rich Sutton’s “Bitter Lesson,” 2019.
  5. This is not the only reason, but the others aren’t relevant here.
  6. Guu et al.’s “Retrieval augmented language model pre-training” (REALM) was among the pioneers; Proceedings of the 37th International Conference on Machine Learning, PMLR 119, 2020. Borgeaud et al.’s “Improving language models by retrieving from trillions of tokens” (RETRO) scaled the method up and made many improvements; arXiv:2112.04426v3, 7 Feb 2022.
  7. Izacard et al., “Few-shot Learning with Retrieval Augmented Language Models,” arXiv:2208.03299v3, 2022.
  8. Chirag Shah and Emily M. Bender make an interesting case against this, though, in “Situating Search,” CHIIR ’22, March 2022, pp. 221–232.
  10. An exception may be Lan et al.’s “Copy Is All You Need,” arXiv:2307.06962, 13 Jul 2023.
  11. Mitchell A. Gordon’s “RETRO Is Blazingly Fast” (on his personal web site, Jul 1, 2022) is an example of the first; Min et al.’s “Silo Language Models” (arXiv:2308.04430, 8 Aug 2023) is an example of the second.
  12. Ironically, perhaps, this was the architecture of the 1980s Connection Machine AI supercomputer, which was originally intended to process semantic networks. W. Daniel Hillis, The Connection Machine, 1986.
  13. Ouyang et al., “Training language models to follow instructions with human feedback,” arXiv:2203.02155, 4 Mar 2022.
  14. Ronen Eldan and Yuanzhi Li, “TinyStories: How Small Can Language Models Be and Still Speak Coherent English?”, arXiv:2305.07759v2, April 2023.
  15. Li et al., “Textbooks Are All You Need II: phi-1.5 technical report,” arXiv:2309.05463, 11 Sep 2023.
  16. These systems are straight GPTs, not retrieval-augmented, with about a billion parameters to store the knowledge. If the Atlas 1:50 ratio held up, a retrieval-augmented equivalent might have on the order of a hundred million parameters. In 2022 linguistic work (unpublished), I estimated the intrinsic complexity of language at roughly ten to a hundred million bytes on a quite different basis, consistent with this.
  17. For an overview of the legal issues and how they interact with technical ones, see Henderson et al., “Foundation Models and Fair Use,” arXiv:2303.15715, 28 Mar 2023. I took the phrase “murky and evolving” from this paper.
  18. This was Authors Guild v. Google.
  19. Abhishek Nagaraj and Imke Reimers, “Digitization and the Market for Physical Works: Evidence from the Google Books Project,” American Economic Journal: Economic Policy vol. 15, no. 4, November 2023, pp. 428–58. “We study the impact of the Google Books digitization project on the market for physical books. We find that digitization significantly boosts the demand for physical versions and provide evidence for the discovery channel.”