Reviews of some major AI safety reports

I was asked by the funders of Better without AI to give opinions about several specific recent reports on AI safety. These are considered major texts in the field.

My impression as an outsider, before starting to work on Better without AI, had been that the AI safety field was of little value. It had considered mainly the question “how can we ensure superintelligent AI will be a good god, not a bad god?” That is impossible almost by definition; aphids can’t ensure safety from humans because they can’t reason about our intentions or capabilities. The AI safety field had considered the problem abstractly and philosophically, mostly without detailed reference to any specific technologies. That’s because, up until very recently, none were plausibly on a path to superintelligence.

That changed in 2022, as people discovered increasingly impressive capabilities of increasingly powerful language models. The AI safety field has partially reoriented to discuss those.¹ This seems an important upgrade to me; there is now a concrete topic to reason about. The reports I review here were selected in July 2022, and do not reflect this shift. Some I found valuable nonetheless.

For a detailed introduction to the field as it has been understood by AI safety organizations, consider the AGI Safety Fundamentals Curriculum. Bommasani et al.’s “On the Opportunities and Risks of Foundation Models” is an exhaustive survey from a rather different, academic point of view. Vael Gates’ “Resources I send to AI researchers about AI safety” is an annotated list of starting points. For a lab-by-lab survey of research programs as of August 2022, see Thomas Larsen’s “(My understanding of) What Everyone in Technical Alignment is Doing and Why.”

AGI Ruin: A List of Lethalities

I was startled by Eliezer Yudkowsky’s “AGI Ruin: A List of Lethalities,” and liked it a lot, and it led indirectly to my writing this report. I found much insightful and courageous in it. I expect it will be considered historically important as a turning point in the development of AI safety work.

Yudkowsky describes it as “a poorly organized list of individual rants,” and I may misunderstand it in part or whole. Two other commentaries to consult are Zvi Mowshowitz’s “On AGI Ruin: A List of Lethalities” and Paul F. Christiano’s “Where I agree and disagree with Eliezer.” Both know Yudkowsky’s overall project better than I do, and may have a more accurate understanding of the essay.

I take Mowshowitz and Christiano as agreeing that doom is pretty likely, but pointing out flaws in Yudkowsky’s arguments that it is virtually inevitable and that nothing anyone is doing (or has proposed doing) can help. In particular, Yudkowsky does not seem to explain why a fast take-off scenario is much more likely than a slow one. Unlike Yudkowsky, Mowshowitz and Christiano find interpretability research promising; I do too.

I find Yudkowsky’s essay valuable in asserting that a class of AI safety approaches have failed. Those use abstract theoretical reasoning, from first principles, to construct a plan, with little basis in the details of AI capabilities research. This is rationalism (as I use that term), and it usually doesn’t work. It especially doesn’t work when you don’t have enough knowledge to accurately predict the effects of hypothetical actions, which in the case of Scary AI we especially don’t.

Yudkowsky’s own research program was of this sort. I imagine it’s awful to believe that the world will end because it didn’t work. I imagine it takes exceptional grit and intellectual honesty to say so.

If this approach did work, it would be great, so I wouldn’t say no effort should be devoted to it henceforth. However, my impression is that the field is now shifting to more pragmatic, concrete, shorter-term approaches, informed by the technical specifics of current AI systems.

Given Yudkowsky’s championing of “rationalism” (by which he may mean something different than I do), it’s ironic that—it seems—he gives excessive credence to the anti-rational claims of the “deep learning” research community. He seems to accept that “neural network” systems are inscrutable, and have extraordinary powers that operate by holistic woo. I believe they can be understood, and their seeming extraordinary powers are probably largely illusory.

He is right, however, to criticize “neural networks” as inherently unsafe, because network behavior at inference time is so unpredictable, particularly on out-of-distribution inputs.² If we have to have AI (do we really have to have AI?), let’s find a better paradigm.

In two places, Yudkowsky rediscovers major points of accurate critiques of AI,³ which is also startling:

[T]here is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.

The underparts of human thought are not exposed for direct imitation learning and can’t be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts…

Is Power-Seeking AI an Existential Risk?

Yes, it is.

Geez. I would think this goes without saying, but apparently the public still needs convincing, so organizations concerned with AI safety publish lots of reports that say “yes, this is a thing.”

Joseph Carlsmith’s “Is power-seeking AI an existential risk?” is my favorite.

Carlsmith makes the important point that artificial intelligence isn’t the risk; it’s artificial power. His write-up is also an excellent overall introduction to the field.

I’d like to quibble with two things. First, power-seeking AI isn’t so much an existential risk; power-wielding or power-creating AI is. New power is a risk even if it’s not deployed by the AI itself. There are arguments that power-creating AIs will inevitably become autonomous loci of control; Gwern Branwen’s “Why Tool AIs Want to Be Agent AIs” is the classic. I don’t think it matters whether that’s correct. Superpower created by superintelligence is an existential threat even if the superintelligence itself is benevolent, or devoid of intentions, or incapable of acting.

Secondly, Carlsmith mainly takes planning as the essence of agency. As I explain elsewhere, planning is neither necessary nor sufficient for effective action. He acknowledges this in a parenthesis, and in his footnote 26 on that.

His section 2.1.2 nevertheless presents the standard vintage-1985 AI understanding of planning as the basis for acting. That model almost never works, due to the frame problem. Carlsmith discusses that problem (it’s called “consequentialism” in the AI safety literature) in this section, without full recognition of its difficulty.

The Most Important Century

Holden Karnofsky is co-CEO of Open Philanthropy, an Effective Altruism organization. He was among the first to focus charitable efforts towards the prevention of AI risk, at a time when that was often met with incredulity and ridicule.

His “The Most Important Century” mainly argues that this century is the most important one, because AI will speed up technological development dramatically.

The point that technological acceleration is the main positive opportunity in AI, and also one of the greater dangers, seems right and important to me.⁴ However, the “most important century” framing seems odd. What would we do differently if it is or isn’t? The century we are in is the only one we can do anything about, for another 77 years, so it’s certainly the most important one for now. “Let’s put everything on hold until 2100, because that will be a more important century” is not an option anyone considers.

It seems the point Karnofsky actually wants to make is that AI is a really big deal, with vast long-term consequences. I don’t know whether that’s true, but it certainly could be. Future AI may be catastrophic, or fabulous, so taking it seriously and doing something about it is important. I would guess that it is more likely to prove a miscellaneous collection of significant technologies, but not transformative on the scale of the industrial revolution or conventional computation. I would guess that catastrophic is more likely than fabulous; and also that trying to get to Fabulous AI is likely to produce Doom AI instead. So for now, I’d suggest aiming for “not transformative.”

However, as I suggested in “Technological transformation without Scary AI,” it’s reasonably likely that we can get to “fabulous” without Real AI, and that’s the best outcome of all.


Superintelligence

Nick Bostrom’s Superintelligence (2014) is a summary of AI futurism as of a decade ago. I don’t know how he came to write it, or which parts were original to him versus common knowledge in the field at the time. It is somewhat dated now, but is still influential as a summary introduction to non-technical, abstract thinking about possibilities.

It is not to my personal taste. Bostrom is a philosopher.⁵ I am an engineer.⁶ For me, there’s too much “and here’s another really weird and bad thing that could hypothetically happen!” There’s not enough consideration of how and why they might happen, how likely they are, or what we could do about them.

Superintelligence is a collection of philosophical thought-experiments. In analytic philosophy, you choose those as extreme, simplistic examples. The trolley problems are like that: people are about to get killed, for no stated reason, in a horrible and extremely improbable way, and you somehow got teleported in and have to make an instant life-and-death decision with no context. This is almost perfectly dissimilar to 99.999% of moral practice, and trying to generalize conclusions from it is guaranteed misleading. The Superintelligence thought experiments are about the nature of personhood and about how society collectively addresses possible catastrophes. They too may be misleading: the extreme, vague examples don’t necessarily illuminate current, concrete questions about the nature of personhood, or about how we deal with current and likely near-future catastrophes—including those involving AI.

Superintelligence is atypical as philosophy; there’s much more engagement with technology and much less discussion of what Quine said about what G. E. Moore said about what Leibniz said about what Aristotle said about the thing, and I like that. Still, it’s more abstract, theoretical, and a priori than I would prefer. I suspect that, because the book’s content was enormously influential for the field for several years, this style of thinking was too. I’d like to see less of that style, and more engineering-style thinking, in the field.

Biological Anchors

Ajeya Cotra’s “2020 Draft Report on Biological Anchors” is the latest of several attempts to estimate when Transformative AI will arrive by estimating the amount of computation (FLOPS, floating point operations per second) required to duplicate human intelligence. She comes up with several quite different estimates using different models, each involving huge uncertainties. Her overall answer is: sometime between 2025 and never, with the probability flattening out around 2060 at somewhat more likely than not. This doesn’t provide much constraint.

Leaving aside details, it doesn’t seem that even a perfect flops analysis could constrain the timeline as desired. At the near end, it may be possible to use flops much more efficiently than brains do. Brains evolve under selective pressure, so they’re probably efficient considering that they’re made of glop, but glop isn’t a sensible thing to make computers out of. Mammalian predators and prey are under selective pressure to run faster, but evolution can’t make glop go a thousand miles an hour, whereas we can make metal do that.

At the far end, we don’t know what sort of computation would produce Transformative AI, so flops may not be our limiting factor. Suppose friendly aliens showed up, and handed us a little box, and said:

This does 10¹⁰⁰ flops, which is impossible with the physics you know, but it runs on supersymmetric neutrino analogs. To make it easier for you, we’ve put USB ports on it and configured it with the ARM instruction set and preloaded Linux and TensorFlow. Have fun!

How long would it take us to create Transformative AI with that? We wouldn’t know what computation to run. “Dramatically accelerate technological innovation!” is not a valid TensorFlow instruction. There are things we could try, but I expect the first many wouldn’t work. Unbounded computation isn’t a magic bullet if you don’t know what you need to compute, and we don’t. How long would it take to figure that out? I don’t think we can estimate that.

So, regardless of the accuracy of the flops analysis, it doesn’t seem helpful. We simply can’t get a meaningful timeline estimate from available knowledge. And anyway, it doesn’t matter: AI is unsafe at any flops.

(For those interested in pursuing this further, I recommend three commentaries on Cotra’s report, each of which I mostly agree with: Eliezer Yudkowsky’s “Biology-Inspired AGI Timelines: The Trick That Never Works,” Scott Alexander’s “Biological Anchors: A Trick That Might Or Might Not Work,” and Nostalgebraist’s “on bio anchors.”)

I’m not sure what the point of timing estimates is. Maybe potential donors mainly want to know “am I personally going to get killed by AI,” so they pressure AI safety organizations to come up with an answer. The correct answer is “no one knows, but it’s possible as far as we can tell, and it might even happen soonish, so it’s worth trying to prevent it.”

Effective Altruism, the movement that has supplied much of the funding for AI safety work, is between a rock and a hard place here. EA’s proposition for donors is that it distributes funds on the basis of quantitative comparisons of the benefits of possible uses. Many in EA believe AI safety is among the best uses of funding, but AI poses an infinite threat with an unknowable probability, so it’s hard to pitch it in quantitative terms.

Maybe the best argument is: “we don’t know of anything else that could credibly cause human extinction in this century, so if you care about that, this is the most important thing.”

I do care about that, but most people somehow care way more about rollerskating transsexual wombat memes.

Or, I mean, the moral equivalent: evanescent culture war conflicts, just a bit less significant than human survival, promoted by Mooglebook AI as tools for controlling human brains.

Perhaps doing something to fix that is Job #1?

  1. Scott Aaronson summarizes this shift in “Reform AI Alignment.”
  2. The “Interpolation, brittleness, and reasoning” section of Gradient Dissent is about this.
  3. Most obviously in Hubert Dreyfus’ What computers still can’t do.
  4. This is the topic of “Technological transformation without Scary AI” in Better without AI.
  5. Or something. He has degrees in physics and computational neuroscience as well.
  6. Or something. My degrees are in math and in “computer science,” which isn’t a science but also isn’t an engineering discipline. People often mistake me for a philosopher on Twitter, which always gets me annoyed. My most recent professional activity prior to writing this document was speaking at an academic sociology workshop.