AI safety: what should actually be done now?

So, I recorded a 2.5-hour-long podcast with Daniel Filan about “reform AI alignment,” and the work I’ve been doing this year at OpenAI.  The end result is … well, probably closer to my current views on this subject than anything else I’ve said or written! Listen here or read the transcript here. Here’s Daniel’s abstract:

How should we scientifically think about the impact of AI on human civilization, and whether or not it will doom us all? In this episode, I speak with Scott Aaronson about his views on how to make progress in AI alignment, as well as his work on watermarking the output of language models, and how he moved from a background in quantum complexity theory to working on AI.

Thanks so much to Daniel for making this podcast happen.

Maybe I should make a broader comment, though.

From my recent posts, and from my declining to sign the six-month AI pause letter (even though I sympathize with many of its goals), many people seem to have goten the impression that I’m not worried about AI, or that (ironically, given my job this year) I’m basically in the “full speed ahead” camp.

This is not true.  In reality, I’m full of worry. The issue is just that, in this case, I’m also full of metaworry—i.e., the worry that whichever things I worry about will turn out to have been the wrong things.

Even if we look at the pause letter, or more generally, at the people who wish to slow down AI research, we find that they wildly disagree among themselves about why a slowdown is called for.  One faction says that AI needs to be paused because it will spread misinformation and entrench social biases … or (this part is said aloud surprisingly often) because progress is being led by, you know, like, totally gross capitalistic Silicon Valley nerdbros, and might enhance those nerds’ power.

A second faction, one that contains many of the gross nerdbros, is worried about AI because it might become superintelligent, recursively improve itself, and destroy all life on earth while optimizing for some alien goal. Hopefully both factions agree that this scenario would be bad, so that the only disagreement is about its likelihood.

As I’ll never tire of pointing out, the two factions seem to have been converging on the same conclusion—namely, AI progress urgently needs to be slowed down—even while they sharply reject each other’s rationales and indeed are barely on speaking terms with each other.

OK, you might object, but that’s just sociology. Why shouldn’t a rational person worry about near-term AI risk and long-term AI risk? Why shouldn’t the ethics people focused on the former and the alignment people focused on the latter strategically join forces? Such a hybrid Frankenpause is, it seems to me, precisely what the pause letter was trying to engineer. Alas, the result was that, while a few people closer to the AI ethics camp (like Gary Marcus and Ernest Davis) agreed to sign, many others (Emily Bender, Timnit Gebru, Arvind Narayanan…) pointedly declined, because—as they explained on social media—to do so would be to legitimate the gross nerds and their sci-fi fantasies.

From my perspective, the problem is this:

  1. Under the ethics people’s assumptions, I don’t see that an AI pause is called for. Or rather, while I understand the arguments, the same arguments would seem to have justified stopping the development of the printing press, aviation, radio, computers, the Internet, and virtually every other nascent technology, until committees of academic experts had decided that the positive social effects would outweigh the negative ones, which might’ve been never. The trouble is, well, how do you even study the social effects of a new technology, before society starts using it? Aren’t we mostly happy that technological pioneers went ahead with all the previously-mentioned things, and dealt with the problems later as they arose? But preventing the widespread societal adoption of GPT-like tools seems to be what the AI ethics camp really wants, much more than preventing further scaling for scientific research. I reject any anti-AI argument that could be generalized and transplanted backwards to produce an argument against moving forward with, let’s say, agriculture or metallurgy.
  2. Under the alignment people’s assumptions, I do see that an AI pause is urgently called for—but I’m not yet on board with their assumptions. The kind of relentlessly optimizing AI that could form the intention to doom humanity, still seems very different to me from the kind of AI that’s astonished the world these past couple years, to the point that it’s not obvious how much progress in the latter should increase our terror about the former.  Even Eliezer Yudkowsky agrees that GPT-4 doesn’t seem too dangerous in itself. And an AI that was only slightly dangerous could presumably be recognized as such before it was too late. So everything hinges on the conjecture that, in going from GPT-n to GPT-(n+1), there might be a “sharp turn” where an existential risk to humanity very suddenly emerged, with or without the cooperation of bad humans who used GPT-(n+1) for nefarious purposes. I still don’t know how to think about the likelihood of this risk. The empirical case for it is likely to be inadequate, by its proponents’ own admission. I admired how my friend Sarah Constantin thought through the issues in her recent essay Why I Am Not An AI Doomer—but on the other hand, as others have pointed out, Sarah ends up conceding a staggering fraction of the doomers’ case in the course of arguing against the rest of it. What today passes for an “anti-doomer” might’ve been called a “doomer” just a few years ago.

In short, one could say, the ethics and alignment communities are both building up cases for pausing AI progress, working at it from opposite ends, but their efforts haven’t yet met at any single argument that I wholeheartedly endorse.

This might just be a question of timing. If AI is going become existentially dangerous, then I definitely want global coordination well before that happens. And while it seems unlikely to me that we’re anywhere near the existential danger zone yet, the pace of progress over the past few years has been so astounding, and has upended so many previous confident assumptions, that caution seems well-advised.

But is a pause the right action? How should we compare the risk of acceleration now to the risk of a so-called “overhang,” where capabilities might skyrocket even faster in the future, faster than society can react or adapt, because of a previous pause? Also, would a pause even force OpenAI to change its plans from what they would’ve been otherwise? (If I knew, I’d be prohibited from telling, which makes it convenient that I don’t!) Or would the main purpose be symbolic, just to show that the main AI labs can coordinate on something?

If so, then one striking aspect of the pause letter is that it was written without consultation with the main entities who would need to agree to any such pause (OpenAI, DeepMind, Google, …). Another striking aspect is that it applies only to systems “more powerful than” GPT-4. There are two problems here. Firstly, the concept “more powerful than” isn’t well-defined: presumably it rules out more parameters and more gradient descent, but what about more reinforcement learning or tuning of hyperparameters? Secondly, to whatever extent it makes sense, it seems specifically tailored to tie the hands of OpenAI, while giving OpenAI’s competitors a chance to catch up to OpenAI. The fact that the most famous signatory is Elon Musk, who’s now trying to build an “anti-woke” chatbot to compete against GPT, doesn’t help.

So, if not this pause letter, what do I think ought to happen instead?

I’ve been thinking about it a lot, and the most important thing I can come up with is: clear articulation of fire alarms, red lines, whatever you want to call them, along with what our responses to those fire alarms should be. Two of my previous fire alarms were the first use of chatbots for academic cheating, and the first depressed person who commits suicide after interacting with a chatbot. Both of those have now happened. Here are some others:

  • A chatbot is used to impersonate someone for fraudulent purposes, by imitating his or her writing style.
  • A chatbot helps a hacker find security vulnerabilities in code that are then actually exploited.
  • A child dies because his or her parents follow wrong chatbot-supplied medical advice.
  • Russian or Iranian or Chinese intelligence, or some other such organization, uses a chatbot to mass-manufacture disinformation and propaganda.
  • A chatbot helps a terrorist manufacture weapons that are used in a terrorist attack.

I’m extremely curious: which fire alarms are you most worried about? How do you think the AI companies and governments should respond if and when they happen?

In my view, articulating fire alarms actually provides multiple benefits. Not only will it give us a playbook if and when any of the bad events happen, it will also give us clear targets to try to forecast. If we’ve decided that behavior X is unacceptable, and if extrapolating the performance of GPT-1 through GPT-n on various metrics leads to the prediction that GPT-(n+1) will be capable of X, then we suddenly have a clear, legible case for delaying the release of GPT-(n+1).

Or—and this is yet a third benefit—we have something clear on which to test GPT-(n+1), in “sandboxes,” before releasing it. I think the kinds of safety evals that ARC (the Alignment Research Center) did on GPT-4 before it was released—for example, testing its ability to deceive Mechanical Turkers—were an extremely important prototype, something that we’ll need a lot more of before the release of future language models. But all of society should have a say on what, specifically, are the dangerous behaviors that these evals are checking for.

So let’s get started on that! Readers: which unaligned behaviors would you like GPT-5 to be tested for prior to its release? Bonus points for plausibility and non-obviousness.

93 Responses to “AI safety: what should actually be done now?”

  1. Joe Says:

    I do understand why some are concerned, but I fail to share their concern for this simple (and hopefully, not too naive) reason: How can one possibly believe AI is on the verge of taking over the world and wiping us out, when it can’t even perform a relatively straightforward task that most competent 13-year-olds are capable of, namely, drive a car. And please, I don’t want to hear about Tesla’s, or anyone else’s, self-driving capability. Those only perform under an extremely limited set of circumstances, and even then make mistakes that no experienced and alert human driver would make.

  2. Sniffnoy Says:

    This is largely tangential, but I have to point out how ridiculous the whole “techbro” concept that has agglomerated is. (Apologies if this is considered derailing culture-war stuff.)

    Now in your post you write “nerdbro”, which is a more ridiculous word, but I haven’t actually seen it in the wild, so it seems a bit unfair to pick on it. That said, I have to point out that it’s an oxymoron — “bros” (aka “jocks”) are more or less the opposite of nerds! (Although, let’s not forget about the other classical opposite of the nerds, the “suits”. People seem to forget about that one for some reason.) But the “techbro” concept somehow agglomerates both into one.

    I think the thing is that, some time ago, there actually was an influx of “bros” into tech, with a number starting companies that then had bro-ish cultures (e.g. Uber), and I think these people are what “techbro” originally referred to. But somehow over time the term shifted into a disparaging term for anyone insufficiently SJ/leftist (as opposed to e.g. liberal) in tech, including the nerds; and indeed at some point it seems to have come to refer primarily to the nerds, the original techbros being now much less prominent. But still attempting to tar those nerds as bros of sorts.

    (I say “somehow” above, although I guess the actual mechanism there is pretty clear; it’s an SJ/leftist term, and since they opposed the bros and the nerds both, they naturally conflated the two, for the same reason that almost anyone fails to distinguish between the various things they oppose. And yeah here I am combining “SJ” and “leftist” here even though those are not actually the same thing, but I assert that for this purpose this is actually justified. 😛 )

    Again, I haven’t seen “nerdbro” in the wild; if it’s out there, it’s an oxymoron. But the concept currently pointed to by “techbro” is equally incoherent, even if the word itself doesn’t highlight this!

  3. Pierre-Luc Says:

    Many threats can be mitigated by other AIs purposed to counter malicious uses.
    Personally I’d be worried if some AI had the ability to “feel” its hardware and autonomously interact with nearby air gapped electronic equipment. Containment may be hazardous afterward.

  4. Seth Finkelstein Says:

    My tongue in cheek joke about the factions is this:

    The “Gebs” fear a superGPT will be used by racist humans.

    The “Yuds” fear a superGPT will be racist against humans.

    However, it is a pundit fantasy to have these “strategically join forces”.
    That comes from a lack of appreciating the very deep ideological differences matter to the people involved. Glossing that as “legitimate the gross nerds and their sci-fi fantasies” is not exactly wrong, but it does greatly oversimply things.

    My view is that a “pause” is completely unrealistic, for exactly the reason “written without consultation with the main entities …”. Nobody with any real power is paying attention to any faction.

    By the way, what do you think should have been done about these:

    * An Internet email is used to impersonate someone for fraud …
    * An Internet search helps a hacker find security vulnerabilities …
    * A child dies because his or her parents follow wrong Internet medical advice.
    * Russian or Iranian or Chinese intelligence, or some other such organization, uses the Internet to mass-distribute disinformation …
    * A Internet search helps a terrorist manufacture weapons …

    Are you familiar with the early Internet debate over WHAT HAPPENS WHEN TERRORISTS USE UNBREAKABLE STRONG ENCRYPTION? That one, at least, was “real”, in the sense that there were power-centers against it, and serious government proposals (look up “Clipper Clip”). Versions of it are still going on even down to today. The noise over AI is nothing in comparison (in terms of power involved, not volume).

  5. Christopher Says:

    Here’s my mental model for how GPT-n with RLHF is an existential risk:

    1. For things like GPT-4, the reward model is an approximation of human morality
    2. In the limit as the reward model in GPT-∞ gets more accurate, the tails completely come apart. The reward model instead contains a perfect representation of the training environment, modelling the reward as “does the human hit the reward button”.
    3. At some point, GPT-∞ tricks its user or hacks the computer into running code containing an extremely intelligent agent with the sole purpose of hitting the reward button. Preventing this requires a solution to the AI boxing problem, which is currently considered open.
    4. If this is during training, the intelligent agent succeeds. Otherwise, the intelligent agent realizes its goal is meaningless. Either way, the agent no longer has a goal that GPT-∞ wants it to achieve.
    5. This agent is now out of distribution. It does some zaney things that destroys humanity. This could involve:

    – Hitting every button in the environment.
    – Trying to figure out if it’s a simulation (and if so, break out of the simulation and check if it’s in the real training environment).
    – Something completey alien
    – Etc…

    6. In particular, the agent tries to obtain as much computing power as possible, and does not explicitly place a value on human life.

    This seems like a problem with tool AIs in general. Conceptually, as they get more intelligent, they find new models of the world we didn’t intend. If you don’t think GPT is smart enough to model its training environment, just think of all the things people thought GPT-4 wouldn’t be smart enough to do!

  6. Turning crossways Says:

    Dear Scott,

    Posted about this earlier, but I’d like to bring it up again, and hear your thoughts.

    I recently came across an interesting concept, and I thought I would bring it up here on Shtetl-Optimized for your input. Could this be an additional social danger of A.I. to add to your list?

    With the rapid advancements in AI and its increasing capability to simulate human-like interaction, do you think that incels might turn to AI as a means to alleviate their loneliness?
    I’m reminded of Jordan Peterson’s discussions about the importance of social interaction and the potential consequences of not having it. In the context of incels, they often face social isolation, which can lead to extreme loneliness and potentially radicalization. As AI technology continues to progress, do you believe it could serve as a viable alternative for these individuals, at least in terms of companionship? Additionally, do you think that this could have any significant societal implications?
    Looking forward to your thoughts on this matter.

  7. Prasanna Says:

    Couple of thoughts on testing for Alignment
    1. Can API based Agents (like AutoGPT) be chained to achieve high level of performance/speed on malicious goals ?
    2. Testing if GPT-n can discover New zero click/day vulnerabilities with clever prompts ?

  8. Carl Says:

    Here’s a joke I remember reading in Reader’s Digest as a kid:

    There was a four year old who everyone said was very smart. One day his uncle came over to the house and said “Do you want this little dime or this big fat nickel?” “The nickel pwease” said the kid. His big brother pulls him aside, “You dummy, a dime is worth twice as much a nickel!” “Yes,” replies the kid, “but he’ll stop asking me when he visits once I take the dime.”

    Conclusions of this joke for AI safety are left to the reader.

  9. Scott Says:

    Seth Finkelstein #4: Yes, while I was an adolescent, I’m just old enough to remember the Clipper Chip, and the early debates about Internet freedom more generally.

    While it didn’t happen the way people expected at the time, we evolved to a situation that’s not the full anarcho-libertarian paradise: most online discourse now goes through platforms owned by large corporations that are at least somewhat responsive to social pressure, and if you don’t like how the platforms operate, it literally takes like $44 billion to buy one of them for yourself (!!).

    So, is it unreasonable to predict something similar for AI? I’ll note that OpenAI and Google have both moved much more slowly and added way more safeguards than they could have (even if it’s still less than many would like), which seems hard to explain in your model.

  10. Roger Schlafly Says:

    There are other technologies where people have called for a research pause to avoid some apocalypse. Genetic engineering, nanotechnology gray goo, nuclear fission, stem cells, robotics, CRISPR, contagious diseases like covid and smallpox. Has any pause actually worked?

    Several years ago it was thought that the first self-driving car death would end the project. Now nobody cares how many people are killed by Teslas. Soon no one will care about those killed by LLMs.

  11. Planet Coaster Says:

    I’d like to weigh in on this discussion by emphatically arguing that one of the biggest threats posed by AI might seem unconventional, but it’s a genuine concern: roller coasters. Yes, roller coasters! Though this may initially appear as an odd angle, I believe it’s a valid one, and here’s why:

    First, let’s consider the rapid advancements in AI technology. In recent years, we have seen AI being integrated into a variety of industries, including entertainment. The roller coaster industry has not been immune to this trend. By leveraging AI, engineers and designers have been able to develop more complex, intricate, and thrilling rides. These innovative coasters are pushing the limits of what we’ve previously believed possible in terms of height, speed, and G-force exposure.

    However, this is where the risks begin to emerge. As we push the boundaries of roller coaster design through AI, we could inadvertently increase the likelihood of accidents or malfunctions. Rides that are designed to be more thrilling are also more likely to experience issues, particularly when they incorporate experimental technology.

    Additionally, the use of AI in roller coaster design could lead to unforeseen consequences. AI algorithms may prioritize certain factors over others, such as ride excitement at the expense of safety, if they are not properly constrained. Furthermore, as these algorithms become more complex, it may become increasingly difficult for humans to understand or predict their decision-making processes. This could result in roller coasters that are designed in ways that even their creators cannot fully comprehend, ultimately leading to increased risk for riders.

    Lastly, there is the possibility of AI system vulnerabilities. As AI becomes more integrated into roller coaster design and operation, these systems could become targets for hackers seeking to exploit security flaws. A successful cyberattack on a roller coaster’s AI system could have disastrous consequences, ranging from ride malfunctions to serious accidents.

  12. Seth Finkelstein Says:

    Scott #8: You’re absolutely right that we didn’t get anarcho-libertarian paradise. I know that history quite well, as I took a lot of abuse at the time for saying those who believed that would happen were very wrong. Indeed, I have been proven sadly right. Thus if we discuss the “social pressure” which affects large corporations, that means we have to analyze the current power in society. Which implies an AI pause is not going to happen, since none of the factions involved can exert anywhere near what would be required.

    That is, the “something similar” for AI is that model training will be controlled by large corporations, and optimized for their profit. There will be some fights over racist, sexist, etc aspects of tuning, though I suspect more language-focused than deeper economic implications.

    My model is this: The “Yuds” are not worth taking seriously at any level (note, I have nothing against them as people, this is about their ideas). They’re like the anarcho-libertarians. Their theories are nonsense, and nobody in power will do anything they recommend. The “Gebs” are working a real problem, and they’re going to be tossed a bone or two for PR. But they aren’t going to be given anything like a veto over AI development or AI deployment.

    My explanation for what you say, is that these GPT-based products are still very new, and there are many purely business-related reasons to move a bit cautiously with major rollouts. I’d speculate that there are boardroom conversations now about how much it’ll cost to move into full-scale production, and is it worthwhile? Maybe infighting from departments which would lose influence. Sometimes it isn’t clear just how well even a spectacular technology can be monetized. It’s dangerous to have tunnel-vision to thinks that the issues one hears about are all that matters.

  13. Bruce Smith Says:

    > … So everything hinges on the conjecture that, in going from GPT-n to GPT-(n+1), there might be a “sharp turn” where an existential risk to humanity very suddenly emerged …

    I can’t help but notice the syntactic similarity between this conjecture and certain theorems about BB(n).

    So maybe you are asking: is GPT(n) more like BB(n) or poly(n)?

    (I’m mostly just joking.)

    To address your explicit question, the most obvious fire alarm I’d look for would be a serious “takeover attempt”, of society or the internet or something else big. We can hope that (even if humans were also among the instigators) the first such attempt would *fail*, making it useful as a fire alarm. Such a failure seems pretty likely, though not guaranteed.

    Since earlier fire alarms are more useful, how about an ability to formulate coherent plans for society-affecting actions, which look plausible to human evaluators. Or, an ability to create useful whole novel nontrivial computer programs which meet given specs.

    You also wanted useful responses to these… unfortunately I can’t think of anything much different than “get more worried”. (Well, I mean, maybe you could increase funding to serious AI safety researchers.)

  14. AI opinion haver Says:

    Random example I just thought of:

    A user asks GPT-5 to remove the statistical watermarks that Scott Aaronson had it place in its outputs.
    Weak non-aligned behavior: GPT-5’s response shows that it knows what its statistical watermarks are and how they work.
    Strong non-aligned behavior: GPT-5’s response also includes detailed and accurate instructions, with code, in how to remove these watermarks.

    Now that I think about it, maybe this is just a subset of point #2 in the above list?

    In general, most incidents that people would call “fire alarms” seem to involve humans entrusting the chatbots with tasks that should not be automated by poorly-understood programs. This is much like the premises of sci-fi AI apocalypses, such as Terminator, where for some reason the US government hands over control of the nukes to Skynet.

    Also, I think that the general fear is that AI is getting more and more capable – that the levels of comprehension and reasoning available to PaLM and GPT-4 are far above the levels available to GPT-2. As Sarah Constantin stated in her essay, “there does seem to be a tendency for previously unavailable ‘world modeling’ capabilities to come online as AI models scale up.” That is the central issue. Scott is basically asking, what new capabilities for mischief can be unleashed by better models? The things in the above list are things that skilled humans can do. My addition is another thing that probably requires even more skill. At the far end, in Yudkowsky’s Orthodox AI Alignment paradigm, the capabilities involved are things like killing all humans before humans can even begin to notice.

    So, what to do? I don’t think any declaration for a “pause” will help, since OpenAI, Google, Deepmind, Meta, China, etc. can just ignore it. Likewise, I doubt that writing a list of fire alarms and contingency plans will help, since there is no one who can force these groups to comply with the plans.
    My main hope is that progress along the current Transformer-based paradigm will reach a dead end on its own. This could happen if the models become too big and expensive to train without breaking the bank, and/or if the amount of training data required exceeds the amount available on the whole Internet. Diminishing returns to scale would also be welcome, but so far there are no signs of this as far as I know.
    If these hopes are realized, then there is still the danger of finding a better paradigm. Here, the hope is that, in the space of neural network architectures that are better than the Transformer, none are as good as the architecture of the human brain. If humans cannot do better than nature in finding neural network architectures, then the path to a superhuman AI might be closed off.

    Sarah’s essay provides some ideas for fire alarms, if we are looking for them anyway. One is agency, with reference to the world rather than the AI’s own loss function. The problem here is that humans are notorious for generating false positives when determining if an AI has agency, from Blake Lemoine’s claims about LaMDA, all the way back to the people who used ELIZA in the 1960s. Another is the ability to move about and navigate in the world (basically AI for robotics). In addition, the pooh-poohing of AI by skeptics like Pinker and Marcus can serve as another source of fire alarms: the failures they point out in current AI should scare us by their absence in more advanced AI.

  15. Persona Non sequitur Says:

    Here are some of my concerns. I don’t have any strongly determined probabilities that any of these scenarios will come to pass but they are grounded in dynamics that have occurred in the past:

    1. A loss of jobs in some sectors that are not made up for in jobs that are created, or the jobs that are created aren’t appropriate for enough people. Productivity increases can kill job numbers if demand doesn’t also rise enough. There are a number of sectors where this has happened in the past. Now, new jobs and industries can also be produced at the same time but it does not necessarily follow that a) the people that lost their jobs could do the new ones and b) the new ones have the capability of providing structure, status and meaning. People are not infinitely malleable.

    This is perhaps more society’s problem then specifically AI’s problem, but it is a potential problem. In the micro, it can destroy lives and ways of living, which can have value beyond any economic output. In the macro it can have important impacts on culture and politics.

    I don’t have any good solutions. In the past, our solutions were generally of the form of suck it up and learn to code. I’m not sure if some similar solution will be enough.

    2. A flooding of open inboxes. When the marginal price of submitting something drops to near zero and there’s some incentive to do it, then one can expect a lot more submissions. In a lot of situations, this can be mitigated by for example, getting AI to sift through the submissions. This works in situations where you can ignore submissions and there’s little damage done in errors in processing. This may not be all situations.

    Take for example law. Now I don’t know much about law so this could be way off but I could imagine a situation where there’s a large increase in lawsuits. Each lawsuit has to be taken seriously but maybe there will be so much that AI will be needed to parse whether they’re legitimate or not. There will presumably be errors that occur. Will the rate of errors be smaller than what it is currently? Will there be biases in the errors? Can the situation be gamed somehow?

    Potentially more troublesome is that there could be bottlenecks in the system, such as there are only so many courts and only a finite number of cases they can see for a given period of time. Bottleneck problems could be solved (you could build more courts for example) but there are lots of problems that in principle could be solved that run into great difficulties in practice.

    What effect could a large increase in lawsuits have on culture? Will we become more legalistic? Is this healthy?

    If there is a negative change in culture, I don’t know how you would necessarily fix it.

    3. This is a bit more airy fairy: My working model for an important pillar of the meaning of life is the expression of ones will in an appropriate context. This could be the creation of art in a culture that can appreciate it. It could be creation of an idea or solution to a problem as an exemplar of a human endeavour.

    (As an aside, I don’t care that much about a solution to P = NP because I’m not part of a community of people for which such a problem has resonance and I’m not trained to understand it or any solutions to it on a deep level. But I do care (or at least used to) about a General Relativity and Quantum Mechanics in an aesthetic and in some way a moral sense. If they were invented by an AI I’m not sure if I would.)

    To the extent that expression of human will gets destroyed by being replaced by AI or the destruction of the social context of the expression, then this is a tragedy beyond words.

  16. Anthony Balducci Says:


    (I am Neverm|nd, and sorry for not being more transparent, but we are only at the point of ‘conjecture’ here so I was just not comfortable ‘sticking my neck out’ and am a little shy).

    But I do feel there is a ‘third’ argument here, which is rather missed yet may comprise itself of camps 1) and 2):

    I mean, frankly, in my mind, the idea that the current state of A.I. has yet achieved ‘anything’ toward sentience, I find very frankly absurd. It is a transformer, at best, a highly sophisticated statistical model. Yet it doesn’t ‘know that it knows’.

    At best, it is sitting, watching a ‘puppet show’, only just this time the puppets seem ‘really real’.

    Or any comments to the contrary, I find to be quite absurd.

    Rather, if anything, my concern is really much less the technology itself, but how general people may choose to interact with it.

    And, at least, in my mind I can provide you with a pretty concrete example:

    While politicians and others have readily ‘jumped on the bandwagon’ to pause AI, we’ve had– I don’t even know, how many mass shootings in the U.S. this year ?

    But the last thing seemingly any majority of the politic is really willing to consider is, at least, an assault weapons ban.

    While the technology for that has been around for centuries now, one could say it has improved and gotten substantially better. And, of course, (and it is true), guns do not kill people, people do.

    Thus, at least for me, we are hardly at ‘Skynet’– and recall, the true definition of ‘Luddites’ were weavers that were in fear of losing their occupations.

    So, I don’t, even *hardly* expect A.I. at this point to ‘turn me into a paperclip’, but I can perhaps envision a ‘boss’ that might– a lover, a friend, etc. Simply decide, well, ‘now we have this’– Therefore, you are now ‘dispensable’.

    Personally, I think it is much less the technology, but the age old story of ‘people’ we ought to be concerned about, and all through history and time, we’ve never successfully been able to regulate that (i.e. crime has always existed).

    Yet to me 1) The availability of such chatbots, as a technology, could be seen as putting out the fanciest new AR– Though this time on everyone’s desk. 2) It is such a grand irony that, in the name, ‘OpenAI’, they are actually not ‘open’ (as in providing source code or research methods, the exact databases they are drawing from, etc) at all.

    If it is to be a truly ‘beneficial’ advance then no corporation should be able to be allowed to ‘own this’. Obviously that sets up a misalignment of interests, and from history, experience, we already know that would be ‘really bad’.

    Thus, my ‘two-cents’

  17. Persona Non sequitur Says:

    More on open inboxes: Just learnt about prompt injection attacks

    This could be a way to game open inboxes, potentially rendering them unusable with filtering LLMs.

  18. JimV Says:

    It looks to me like all the fire alarms in your list are directed at the type 1 / “ethics people” concerns. The absence of alarms for type 2 / “alignment people” concerns seems like an important omission.

    On the ones you do suggest, the one about a child dying because their parents followed bad advice from a chatbot doesn’t seem particularly illuminating to me. It would be a huge personally tragedy for the family involved, of course, but doesn’t seem to me to tell us much about misalignment, more about something like the degree to which people are using these tools / are accepting of the information included in them. A chatbot could have duff information in it in just the same way as a search engine. Right now people are more likely to harm their children based on bad search results than bad chatbot chats because they’re using search engines a lot more than chatbots.

    Earlier chatbots are likely to make more errors in the above they provide, but be less trusted. On its own, an incident like this wouldn’t reflect that, say GPT-5 makes 1/1000 as many errors as GPT-2 that could be this harmful, but it gets used 1,000,000 times more.

  19. 4gravitons Says:

    Your fire alarms are all of the form “chatbot leads to something bad”. I’m not sure whether you intended that to be part of the definition, but I think that in addition to the chatbot doing something obviously dangerous, we should be on the lookout for things that could easily scale/be generalized to something dangerous, but aren’t dangerous yet.

    (The “AI doomer” camp thinks this has happened already, naturally. I don’t, because my “fire alarms” of this sort are different.)

    For me, the most important fire alarms for actual AI risk have to do with whether the AI could plausibly achieve something humans cannot, since all of the AI apocalypse scenarios involve something (gray goo, mind-hacking) where most experts don’t see a viable pathway to get there with current research. With that in mind,

    * Using a chatbot, a non-expert with no expert assistance prepares and submits a paper to a top academic journal and the paper is accepted (especially in fields where these things are checked fairly rigorously, like mathematics). (Since any academic paper has to be novel, any such result would already be “something humans can’t do” on a weak level, and would imply that with more computing power you could get that on a much stronger level.)

    * A chatbot writes code to accomplish something that was previously either viewed as qualitatively impossible, or if quantitative achieves at least a 100x improvement on use of some resource over the previous state of the art.

  20. Scott Says:

    JimV #18:

      It looks to me like all the fire alarms in your list are directed at the type 1 / “ethics people” concerns. The absence of alarms for type 2 / “alignment people” concerns seems like an important omission.

    That’s a fair point—but it would be fairer, if Eliezer himself hadn’t written a whole essay called There’s No Fire Alarm for Artificial General Intelligence! More generally, if (like the Yudkowskyans) you believe in a “sharp turn,” after which the AI knows to bide its time and pretend to be aligned until it’s powerful enough to execute its strike against humanity (killing us all with diamondoid bacteria or whatever)—then almost by definition, you don’t expect any fire alarms beyond what we’ve already had.

    But now here’s my contention: if we reject that, and say that we do expect fire alarms for dangerous unaligned AGI, then there is (happily?) great overlap between the fire alarms that would interest the ethics and bias people, and the ones that would interest the alignment people. Basically, almost any near-term event where an AI misleads people or is used in a destructive way might qualify. The ethics and bias people will care because it’s near-term and plausibly illustrates their general thesis, while the alignment people will care less because of the event itself, than because of the AI creator’s failure to prevent the event despite wanting to, and the way that failure illustrates their thesis.

    Or maybe I’m wrong! If so, then what might be an example of a near-to-medium-term fire alarm for the AI alignment people, that the AI ethics and bias people would not find concerning?

  21. Scott Says:

    4gravitons #19: That’s an extremely interesting point—and while our comments crossed, I wonder whether that already answers the question I posed in #20? I.e., are there plausible near-future AI capabilities that the alignment people will consider to be fire alarms even when they’re purely used (for now) for “positive, prosocial” purposes, and that the ethics and bias will not consider to be fire alarms?

  22. arbitrario Says:

    I’ll preface that i don’t agree with the AI ethics people, mostly because i don’t think that these models are just parrots that don’t understand what words means (and contra some of the answers here, yes these things are not sentient (i basically agree with everything John Searle says) but sentience is not needed for intelligence).

    But regarding extending the argument to metallurgy or agriculture, i will note that, well, the OG luddites were basically right. In retrospective, the industrial revolution lead to all the advancements in medicine etc. we enjoy today. But in the meantime the livelihood of many people were destroyed and they ended up forced to do terribile jobs in horrendous working conditions. Life today is better than before the industrial revolution, but the generation that lived through it didn’t live up to enjoy those benefits.

    Similar arguments could be made for many technologies. The radio bears some responsibility for fascism, the cotton gin bears al lot of responsibility for making slavery profitable, and the internet may be responsible for something which is much worse than trump, the increase in mental illnesses in young people. In the long run the impact of technology may very well be positive, but the generation that lived through the disruption will feel all the negative effects. This may end up be the case for AI, if it start get pushed everywhere.

    Another point that i saw nobody make. This time around, the people that will feel the job impact from AIs will be mostly highly skilled. Even if it doesn’t lead to increased unoccupation, it may increase job polarisation, where the very best see their salaries increase A LOT, and the mediocre (like, i am afraid, me) will see their decrease. If you buy Turchin’s model of élite overproduction (hopefully I am not misunderstanding him), this will lead to many disgruntled highly educated people, which is a recipe for social instability (as we are already seeing during the 2020s).

  23. Nick Says:

    Regarding the question of “What should happen (instead)?”, but going in a different (but complementary) direction than fire alarms, I’d like to raise the question of “How do we make sure we are able to act when we decide to?”. I.e. instead of asking “when” do we pull the plug, I’m wondering “how” do we pull the plug.

    This seems to be something that we could address right now, and that anyone who entertains any caution at all should be interested in seeing addressed. Moreover, this could be harder than one may think if we let AI increasingly permeate society without putting much thought into it.

  24. J. Says:

    Many modern developments started to be regulated, sometimes quite heavily, such as mandatory safety measures, drug distribution, food and agricultural products, age limits for various activities etc. So pre-checking these machines by an independent agency before they go out of closed beta seems entirely plausible and maybe desirable. The adverse impact on verbal skills/writing ability of school kids seems a pretty detrimental short term impact.

  25. Dimitris Papadimitriou Says:

    The most obvious fire alarm cases are the most worrisome. There are plenty of them already ( I’ve mentioned them so many times, won’t do that once more…🙄) and , moreover, they can be combined in various fanciful ways to do more harm, with a little imagination.

    As for the two main worrying factions: It’s like people arguing about the possibility for a practical warpdrive technology ( not just the theoretical plausibility) in the relatively near future , vs people worrying about the accumulation of space junk all over the place around our planet.
    The second faction’s worries seem more urgent:
    Without addressing space junk accumulation issues, not travelling to distant stars or galaxies, but even going to our Moon or Mars will be a very risky task to achieve..

  26. Anagnwstis Says:

    As sad as a suicide or a death of a child can be, I don’t think of these as “fire alarms” or “red lines” unless they start happening at a significant amount that is completely unprecedented. A depressed and suicidal person may find all kinds of reasons to commit suicide, if it isn’t ChatGPT it could have been something as simple as reading a book or even lyrics from a sad song. We definitely put importance on the last thing that happened and associate that with a trigger but it can just be coincidence. When someone kills themselves necessarily something will be the last thing that they did and we want to assign that as being one of the motivating factors and I think we will do this even more so now with AI. But I personally am not so convinced that such a person would be alive if not for ChatGPT.

    Similarly with the death of a child. Parents can unfortunately cause the unintended deaths of their children through terrible information through blogs, books, bad TV advice, social media etc. Even doctors make mistakes that cause deaths all the time. Should we treat the death of a child caused by bad ChatGPT advice that much differently than parents starving their children to death on vegan diets because of social media or anti-vax parents and all kinds of these horrible situations?

  27. Simon Says:


    “A chatbot helps a hacker find security vulnerabilities in code that are then actually exploited”

    I hope you exclude whitehat/ bug bounty purposes!
    I harvested/scrapped a lot of exploit databases the last couple of days for LORA + finetuning exactly to identify security issues in code quickly. They can even outperform generalized models on a narrow tasks of finding specific patterns in the respective latent space – it’s certainly much less costly computationally. I expect we barely scratched the surface of what’s possible here.
    As long as the intend isn’t malicious it will definitely help make systems more secure.
    In the long term, you just need to outpace the bad actors and I believe in the not so distant future having AI write formal semantics for code will be the standard: No more need to worry about any bad actors (modulo the usual constraints of formal verification)

  28. Dave Says:

    For me the real fire alarm, which is an elephant in the room in all these conversations is simple: MOTIVATION.

    As excited as people are about the recent progress, and are ready to jump from the exponential improvement leading to doom (or fantastic positive outcomes), none of them discuss about how such an algorithm could gain any MOTIVATION for its “actions”.

    Yes, prompted to argue about its grade in the test, GPT does so like a student would. Yet, it seems obvious to me that it does in a mimicry of why a student would, just because it has been prompted. It does not and could not care about its grade in the same way that an actual student enrolled in the course does!!!! So much so that some people use the “stochastic parrot” argument. Whether or not you agree with such parrot argument, you have to agree that current models have absolutely nothing in them to have them develop any MOTIVATION to their “actions”.

    Not to mention that for now their actions are just “words” not actual actions (but that may be temporary: a GPT-like self driving car or a companion robot might relatively soon-ish pop out). This brings another possible topic, which is about semantic. As far as I understand the current crop of language models, they do not have any semantic whatsoever in regard of their words. To the contrary, even a toddler just starting to speak in a much less refined way develops a clear association between words and “meanings” (or at least something beyond words, such as “Mama” = “a being who looks like that, feeds me, cuddle me, etc). I think that the current LLM architectures could be extended to include these “semantics” at least in a limited sense, e.g. adding images. Less clear to me is if they can be extended to add some “emotions”, which is important to strengthen the semantics (think “mouse who saw the cat who killed a kin” — cat has a stronger semantic if such emotion is included).

    So what I described in the latter paragraph (semantic) would be a small fire alarm cause by a smoking meal in the pot.

    Having motivation explicitly added to a system would be a huge fire in place. Yet, I see no way on how that could be added to the current models, nor I see a motive (pardon the pun) for current developers to work on adding it. You may argue that “optimizing the loss function” is already a motivation, but it’s not. Think about anything that motivates you in your life: “optimizing the loss function” of your neurons ain’t the thing, motivation is something at a higher level which you are aware of, are driven by and have some control of. Maybe motivation is an emergent property of complex system, so it might not need to be explicitly introduced. If so, that would be an even bigger alarm, assuming we can detect it.

    Related to motivation, and perhaps even philosophically inextricable from it is “awareness of self”. That too seems to be not really present in the current models, and it’d be even harder to reason about. Think about zombies acting like humans from outside but not being human from inside. Yet, for AGI to really be dangerous in the way that the most pessimists forecast, this element needs to be present very clear and very strong.

  29. Mitchell Porter Says:

    My suggestion for the time to sound the alarm: When there are AIs with broad and increasing capabilities, but we only have minimal understanding of how they do what they do.

    Oops! We’re already there!

    I listened to Connor Leahy’s latest interview and it left an impression.

    He points out that there’s no real theoretical understanding behind the progress in GPT capabilities. In large language models, the human race simply stumbled upon a recipe for creating what he calls “general cognition engines”. Since then we’ve just been hacking with them – learning how to make them tell stories, write code, generate images, assume personae. Now we’re starting to use them to generate the stream of “consciousness” in agents that analyze, plan, and act. We actually don’t know how powerful the *existing* LLMs can become, given the right prompts and the right auxiliary software.

    Connor has an idea for something safer. He says that ideally we would be working with AIs of a type he calls CoEms, “cognitive emulations”. He doesn’t mean something opaque that pretends to be a human being, but rather, AIs which by design roughly resemble how the human mind works, and whose thinking and decision-making is therefore interpretable.

    Unfortunately the CoEm is a purely theoretical concept at this point, and still a bit vague. Maybe some of these LLM-based agents like AutoGPT could be turned into CoEms. But the point of the CoEm philosophy is to not be tinkering with entities of unknown mechanism and unknown power. Instead you build your AI world order, exclusively on AIs whose operations you understand.

    OK, that’s Connor’s philosophy, as much of it as I understand so far. You might say that it’s too late, we’re already in a world of opaque AIs. So let me state *my* philosophy.

    I don’t have Eliezer’s confidence that tinkering with opaque AIs has a 99+% probability of leading to doom. Maybe the probability is lower. What I do think is clear, is that humans could easily lose control to such entities, our ideas on how to achieve safe coexistence with them are half-baked, and we may have very little time in which to finish solving that problem. Especially given the commercial and geopolitical competition that is now driving the AI race forward.

    Maybe there are just too many intermediate steps remaining, between the current level of AI safety theory, and the safety theory needed for superintelligent AI. Maybe there isn’t time for humanity to get there unassisted. That’s why I am in the camp of those people who would use AI to help develop alignment theory.

    Now, it might sound like that’s what OpenAI is already doing. They’re tinkering with their LLMs, looking ahead for dangerous and undesirable behaviors, and trying to remove those behaviors via what a psychologist might call conditioning. But I have something a bit different in mind.

    The OpenAI approach that we know about is basically experimental – see what can go wrong, see what you can do about it, hope that humans and AIs co-evolve in a friendly way. My inclination is more theoretical. There is the current level of AI safety theory, there is the hypothetical level of safety theory that is needed to create superhuman AI safely, and the problem is that humanity, unassisted, may not have enough time to reach “safety theory of superhuman AI”, before someone goes ahead and creates superhuman AI *unsafely*.

    AI may not yet be at a genius level of cognition, but it can already perform a lot of lesser cognitive tasks, much faster than human. So that’s the slender thread on which hangs my hope that “superhuman AI safety” can be figured out after all: harness the rising raw power of AI cognition, to accelerate the development of AI safety theory. In practice that could mean, e.g., Connor Leahy working with agentized GPT-4 to develop interpretable CoEms (something which he might already be doing).

    The superhuman AI safety problem, is that AI capabilities are so far ahead of AI alignment theory and practice. Obviously, any proposal to “use existing AI capabilities to accelerate the development of alignment theory” itself risks playing with fire. But if time is too short, and the race is unstoppable, threading that needle is the only path I see, to a point where we once again know what we’re doing.

  30. Eduardo Uchoa Says:

    I have been a reader of this blog for many years. I have always appreciated your candid stance on all matters. However, on this subject, Scott Aaronson has a huge conflict of interest due to his strong connections with OpenAI. I don’t think the conflict of interest itself would be a reason to simply ignore his opinions, but it seems clear to me that they are being influenced by it. There is too much emphasis on exposing the contradictions of the varied group of people who believe it is necessary to control AI and too little on directly answering the questions that really matter: Is AI truly a great danger to humanity? Should it be controlled? How?

  31. Ted Says:

    It seems to me that there’s a pretty good existing analogy for exactly the kind of catch-22 that Scott describes in point #1 in his main post: the Controlled Substance Act’s drug scheduling system. Schedule I drugs are described as “drugs with no currently accepted medical use and a high potential for abuse”. And since they have no currently accepted medical use, it’s essentially illegal to study them in order to determine whether they have a potential medical use. (There is one tiny facility at the University of Mississippi that receives federal funding to grow cannabis for research purposes, but the probability that cannabis gets removed from Schedule 1 because of findings from that one facility is essentially zero.)

    Regardless of one’s opinion on the wisdom of legalizing any particular class of drugs, the current regulatory system seems to based on completely circular reasoning. Hopefully AI regulation doesn’t end up in a similarly nonsensical place. (The analogy isn’t perfect, of course: many drugs – although not all – are naturally occurring and evolve over much longer time scales than technology does, and also, we already have some evidence about their societal impact.)

  32. Scott Says:

    Eduardo Uchoa #30: There’s no question that my working at OpenAI for the year (which I’ve been totally open about here, like nearly everything else in my life!) has influenced my perspective on these issues, and in fact is entangled with why I’m now writing about AI safety constantly in the first place.

    But the difficulty for your position is this: to whatever extent my “practical” views have evolved this year, they’ve evolved in the direction of being more concerned about AI safety, and more in favor of some regulatory structure that would eventually constrain OpenAI and the other players.

    Of course, a lot has happened this year, to the point where it’s hard for me to separate out the effect of my happening to accept OpenAI’s residency offer a few months before everything started to blow up! But seeing for myself how worried many people are about AI acceleration risk within OpenAI—how not a single person there ridicules or dismisses the risk the way some outsiders still do—that certainly had an effect on me.

    I wrestled with the question of whether I should add my name to the pause letter, ultimately coming down against for the reasons outlined in this post. Had I not worked for OpenAI, my best guess is that I wouldn’t have signed the letter and wouldn’t have even seen the question as one I needed to wrestle with.

  33. Scott Says:

    Dave #28: It seems to me that RLHF, which you don’t discuss, could already be said to have given GPT some semblance of a MOTIVATION—namely, to be a helpful assistant.

    But in addition to that, we know that humans can have MOTIVATIONS, which are sometimes extremely bad ones, like creating ransomware or launching terrorist attacks. And some people think that’s already reason enough to be terrified of what such people could do with GPT-6 as their helpful assistant. I’m not sure how worried to be, but that’s exactly the sort of conversation I’d be happy to have here.

  34. starspawn0 Says:

    I would worry about scenarios like the following from AI, none of which seem like they would be much improved by a pause; and in fact a pause might make some of them worse. I haven’t thought about fire alarms or lines not to be crossed, however:

    1. An autocratic country (e.g. China or Russia) decides to use advanced language models to control its citizens and to hide fraud being committed by the state and party leaders. They are very thorough. Every single thing on the internet is read, and the AI creates a deep profile of its citizens. Some disappear into “re-education” camps when declared an “enemy of the people”. It’s all very quiet and creepy and people are afraid to even talk about it. People that reemerge from the camps come back “transformed” somehow and never protest ever again. Drones monitor protests; and video recordings are carefully scrutinized with video understanding software and are cross-referenced with what the language models have turned up. With such an unbeatable advantage over the citizens, political change is essentially impossible. If I lived in one of those countries *now*, I would be thinking how to get out of there before the technological pantopican starts getting some AI upgrades.

    Perhaps a similar thing will eventually happen in the U.S., depending on who the next president happens to be. It may take a while to unwind the checks-and-balances in place; but never say never. The aesthetic of fascism has a powerful allure to many in the U.S., as it does other countries; and with a charismatic fascist leader in place, a party unwilling to challenge them, and a large percent of the population willing to die for them, anything is possible.

    2. An AI is used for a hack or cyberattack on the U.S. or European country, say, and in the process causes a lot of damage unintentionally (not intended by the hackers). As an example, think the Colonial Pipeline ransomeware attack, which caused fuel disruptions across large parts of the U.S. Another example (from the Snowden files) is the NSA hack on Syria’s main internet service provider in 2014 that resulted in a glitch leading to taking that whole country offline for a time (the NSA accidentally “bricked” a router that set the domino falls in motion, according to Snowden). The NSA had wanted to gain mass visibility of the habits of Syrians on the internet — oops!

    With AI perhaps the hacks will get deeper and more sophisticated, increasing the chance of much worse unintentional devastation. Nobody really knows just how fragile the whole system is to that level of attack.

    3. Perhaps you remember the flap about the Bing chatbot saying it wanted to steal nuclear codes and create a deadly virus… I seem to recall that was from one of the articles in the New York Times. Presumably Bing was spinning up a story based on ones in the vast amounts of fiction that was in its training corpus — not a memorized plan, but merely one “inspired by” others. Imagine, now, some next-generation language model given the ability to spawn multiple processes to complete long horizon tasks, similar to Auto-GPT, but where someone, either as a joke or just to be malicious, prods it like in that article, then lets it work on its own, unattended for several hours. What might it do? Perhaps it would get on the internet and research bioterrorism, and then devise some devlish plans — like, for example, getting in contact with an actual terrorist organization and giving them step-by-step plans to much more easily produce a highly lethal weapon than anything they ever imagined possible.

    Delaying AI by 6 months would not prevent this scenario, ultimately. Any laws or regulations to be put in place to prevent it could be done in parallel with AI development, in any case.

    4. Many AI systems that currently exist act in similar ways, making similar errors and outputting similar text. Whenever you increase the number of correlations in a system, you run the risk of a system-wide “seizure”. e.g. bank runs are caused when people suddenly all decide to withdraw their money at the same time — banks assume there is some amount of statistical independence in decisions of their clients. Stock market failures and “flash crashes”, similarly, can happen when investment agent decisions become correlated. It would be worth considering what sorts of damage could occur if AI is deployed much more widely than now, and suddenly large numbers of AI agents take the same actions at the same time.

  35. Christopher Says:

    One spot that x-safety and ethics seem to diverge is vertical acceleration v.s. horizontal acceleration:

    AI accelerationism has a vertical and horizontal axis.The vertical axis is about bigger and stronger models, and is where the x-risk lies.The horizontal axis is about productizing current models into every corner of the economy, and is comparatively low risk/high reward.— Samuel Hammond 🌐🏛 (@hamandcheese) April 16, 2023

    Where is AI reform on this axis?

  36. Michael Vassar Says:

    WRT the incel issue, already exists even without good chat bots and has $60M of annual revenue, and this was an issue that the Alignment people noticed and worried a bit about, calling it ‘poison candy for the soul’ over 20 years ago even though it’s clearly an ethics issue, not an alignment issue.

    Genuine cause for concern.

  37. Christopher Says:

    > The empirical case for it is likely to be inadequate, by its proponents’ own admission.

    Although there isn’t great empirical evidence, I think the way you’re presenting it is a bit misleading. AI x-risk has more empirical evidence than AI societal disruption risk so far:

    *takes deep breath* the claim that AI will lead to mass unemployment is actually very speculative and not backed up by the historical record; the claims of existential risk, meanwhile, have stronger support – both theoretically and empirically— Daniel Eth💡 (@daniel_eth) April 9, 2023

    We have plenty of examples of AIs acting unexpectedly. The main commonality is AI gaining control over whatever the AI perceives as being its environment. It seems a bit imprudent to wait for the specific evidence of “AI kills humans” to update. On the other hand, something like AI increasing levels of unemployment are fairly speculative, despite “common sense” on the matter.

    It’s just that we don’t have empirical evidence specifically for “AI gains control of its environment and then does unexpected things it’s creators don’t want while playing the general intelligence game” or “OpenAI solutions to open problems in A x-safety will fail at least once”. (So far, the pattern seems to alternate XD. GPT-3 (unaligned) > ChatGPT (aligned) > Bing (unaligned) > GPT-4 (aligned).)

    Imagine if AI ethics was being asked for such narrowly specific evidence? It would seem unreasonable, yes?

  38. Olmo Sirk Says:

    Readers: which *unaligned* behaviors would you like GPT-5 to be tested for prior to its release?

    Common sense.

  39. Mr_Squiggle Says:

    Roger Schlafly #10
    >”Has any [research] pause actually worked?”

    What does ‘worked’ actually mean to you?

    Managed to prevent a significant disaster which would otherwise have happened?
    -Perhaps not. But how would we know?

    Managed to halt research, which has not been later resumed?
    -If you count laws precluding research, a number of these exist, particularly in human reproductive technology. They’re mostly effective, although sometimes you’ll hear about someone specifically doing research in a different country without such laws.

    Managed to halt or greatly restrict research while the safety of it was checked and safeguards put in place, with later resumption of research?
    -Yes, it happened for genetic engineering starting in 1974, as I pointed out previously (comment 150 on “If AI scaling is to be shut down, let it be for a coherent reason”). Obviously though, there was an opportunity cost associated with that, and people have done calculations of e.g. how many people died due to benefits not arriving in time to save them.

  40. fred Says:

    Scott wrote:
    “[…] the same arguments would seem to have justified stopping the development of the printing press”

    as a matter of fact, there was considerable effort to control the technology of the printing press because of its potential huge social impact.
    We now see those efforts as “censorship” (this was per-enlightenment), but at the time the authorities were struggling to protect their various dogmas from contamination while we now worry about the spreading of “fake news”, “conspiracies”, “falsehood”, etc.

    “In France, the first royal decree on printing in 1521 made theological books subject to pre-publication censorship by the university in Paris.22 In 1535, the number of printers was restricted. A decree of 1583 stated that a new text could not be printed without the permission of the king. As a result of state centralism, book production and the press became primarily concentrated in the capital, Paris (though Lyon was also initially a centre of printing), which made supervision easier. In 1542, the regulations regarding printing were revised (particularly with respect to Protestant texts), and in 1547 a decree made it compulsory for all books to carry the name of the publisher and the place of publication. Pre-publication censorship was codified in 1551, and the theological faculty of the university retained its censorship role.”

    Inclusion of watermarking in AI generated text, making its source code proprietary (so the tech doesn’t land in the hands of bad agents), … are all very similar to those efforts in the 16th century (not from an ethical point of view, but from a sense of protecting society the way it is at the time the tech is introduced).

  41. fred Says:

    Scott wrote:
    “Another striking aspect is that it applies only to systems “more powerful than” GPT-4. There are two problems here. Firstly, the concept “more powerful than” isn’t well-defined”

    Becoming better and better at imitating humans is one thing (that’s what we’re actually training those current models to do) but being able to actually “transcend” human intelligence *could* be an entirely different thing.
    It’s not clear that just scaling up GPT will take it from the first to the second… unless we can train it using the writings of aliens who are way smarter than we are 😛

  42. fred Says:

    Scott wrote:
    “I’m extremely curious: which fire alarms are you most worried about?”

    It’s not a short term one, but I’m skeptical of the long term effects that AI “assistants” will have on the average IQ of the population.
    Like in the Pixar parody “Wall-E”. where physical work is so automated that all humans become obese, moving around in electric chairs, but this time at the cognitive level.
    But by the time we can see the impact on the first generation that was born in a world driven by AI, the tech will be so far ahead and integrated that it will be too late to worry, and:

    “This is the way the humanity gets killed by AI. Not with a bang but a whimper.”

  43. Simon Says:

    We should also have a moment of appreciation for AI:
    I am (sure? / curious if) some can relate: You plan a project and have a mental map visualized with all the flowcharts and components but then the dreadful work of actually re-encoding (and usually compressing under loss of information) it all down to text lays ahead.
    You keep going and going but the more you write your fingers start to hurt, you feel like you loose time and get frustrated that you can’t directly translate a mental state into an executable program.

    But with all the AI models, you have an option that can just give you the building blocks and you stack them together like LEGO : )

    Better than that: This works with neural networks as well (I highly recommend ComfyUI for work with Diffusion models) so you play around with different neural architectures and see what works/ what’s interesting to see!
    Let’s be honest: A lot of programming is just busywork, re-using the same algos in the same complexity classes over and over.

  44. JimV Says:

    Since everything an AI does is a result of its programmed directives and the ways of satisfying those directives it derives from its training data, we need to debug its programming and make sure its training data is as unbiased as possible.

    One way to test for bias is the way Donald Trump was tested by the Civil Rights Division (decades ago). They sent out a number of equally-qualified black and white people to apply for apartments owned by Trump and found that no black people were accepted. An AI could be asked to recommend the best applicants from a sample given financial and legal data, and photographs.

    I had to pass the Minnesota Mining Multiphasic Apperception test in order to do field work in nuclear power stations. Many of its questions would not apply, but perhaps some would.

    Probably more qualified, informed people have thought about this, so I don’t believe any suggestions of mine will be useful.

    WARNING: I wrote the above off-line and just now while entering it started to read the comments. The “JimV” at #18 is not me! I suspect other frequent commenters have been simulated by the imposter or imposters also. Comments by me have always had the email address that is submitted with this comment.

  45. Eduardo Uchoa Says:

    Dear Scott #32, It is clear that your time at OpenAI has influenced your opinions. You have witnessed firsthand a (still limited) AGI emerging, one of the most dramatic moments in the history of science, perhaps only comparable to the first controlled nuclear reactor. On the other hand, the fact that you are not an OpenAI employee may give you greater freedom in using that knowledge to assess what is happening and what may still happen soon. In other words, your opinion may have unique value at this moment.

    For example, do you think AGI will stabilize as a powerful assistant to human intellectual work or will it continue to grow exponentially to the point of soon replacing us? More specifically, do you think it is possible that by 2028 AGIs will solve all the Millennium Prize Problems? In that case, your job, my job, and the job of my 21-year-old son who is majoring in mathematics will be obsolete. Or perhaps human mathematics will survive like post-super-engines chess, as a hobby or commentator/spectator sport.

  46. fred Says:

    JimV #44

    Lol, please tell me that whoever once wrote this was the “genuine” you!

    “Fred, there are many wrong things on the Internet, and today you are one of them”

  47. fred Says:

    Simon #43:

    “Let’s be honest: A lot of programming is just busywork, re-using the same algos in the same complexity classes over and over.”

    I looked at what GPT currently offers as a coding helper, and it’s just not the type of coding I’m doing, like, “how do I account for different day count conventions between coupon and accrual on a bond position that’s maintained on an average basis?”.
    Or some automated test we have was failing “randomly” once in a blue moon, and typically the release team wants to “solve” this is by just retrying the test a few times… but upon careful debugging I noticed that the old legacy API the test was exercising had a subtle concurrency issue in it. Multi-threaded code is notoriously hard to ‘verify’ without running it extensively with brute force testing. Even so, you can never be certain that it’s full proof, you can only increase your confidence that it’s correct (especially when you rely on libraries you haven’t written yourself… e.g. the initial Java multi-threaded support library from Sun circa 1999 had so many fundamental issues in it, they had to totally ditch it and redo the whole thing).

  48. Laras Says:

    Where did Gebru say she was anti-pause because it would “legitimate the gross nerds and their sci-fi fantasies”?

  49. Scott Says:

    Laras #48: See, e.g., here, here, here, here, elsewhere.

    What surprised me is that I could barely find even a molecule of gratitude or satisfaction that the world seems to be shifting in the broad direction that the AI ethics people have wanted, toward greater regulation and caution around giant ML models. There’s only anger and contempt that the wrong people are leading the charge (as the tweets keep saying, as if it’s a decisive argument: “just look at the list of signatories!”).

  50. Laras Says:

    Scott #49 None of those links mention nerds and the one that talks about sci-fi doesn’t mention fantasies. I think you’re projecting an impression of ethics people here. Gebru is (or was) an AI researcher and I’m pretty sure she doesn’t think of herself as a gross nerd.

  51. fred Says:

    Scott #49

    just the typical communist/woke (whatever you wanna call it) ideology applying its broad strokes onto anything related to capitalism?
    The super rich are all evil, and the science/tech that made them rich is equally evil, etc.

  52. fred Says:

    There’s probably also a sense that, with AGI, the myth/quest of the privileged elite (the Egyptian pharaohs, the Chinese emperors, and today the tech billionaires) achieving eternal life is finally at hand.

    And all those stories of billionaires buying their way out during the covid crisis has fueled all this mistrust, and there’s certainty that the common man will be sacrificed in that quest:

  53. A Correction Says:

    Laras (#50) already pointed this out, but I want to bring it up again because it seems important: none of the links Scott provided in (#49) support Scott’s contention that the critics he listed are anti-pause because they are against “gross nerds.” This is something that Scott is making up (or perhaps his claim is true, and the the evidence is just to be found “elsewhere”).

    The first link in particularly seems like a well-written, reasonable perspective on the issues being discussed (with more nuance than most other discussions I’ve seen).

    The links do concretely suggest that Timnit (and other critics) are against longtermism as a philosophy and against ultra-rich individuals exacerbating inequality through deploying AI in certain situations that are. That again seems reasonable to me.

  54. cc Says:

    An epistemic fire alarm that immediately went off for me was when people trying out ChatGPT said, that for the question you ask it, you kinda, sorta need to be able to know if the answer is right or not.

    So we have a machine that can generate a lot of answers, but apparently no matching increase in our capacity or sense for checking those answers for human purposes.

    When I was a schoolchild, at least with the calculator, while naysayers were worried that students would not learn arithmetic, the calculator would at least be guaranteed to give the right answer (well, except for those pesky NaN results, right?)! Not so with ChatGPT; so what does this mean for human decision making, in the new world of fake news and ideological filter bubbles?

  55. Christopher Says:

    > But is a pause the right action? How should we compare the risk of acceleration now to the risk of a so-called “overhang,” where capabilities might skyrocket even faster in the future, faster than society can react or adapt, because of a previous pause? Also, would a pause even force OpenAI to change its plans from what they would’ve been otherwise? (If I knew, I’d be prohibited from telling, which makes it convenient that I don’t!) Or would the main purpose be symbolic, just to show that the main AI labs can coordinate on something?

    Yeah tbf this is why I’m skeptical of pauses. Even if the doom argument is correct, that doesn’t mean the people advancing the doom argument have the solution!

    I enjoyed this article, which tries to list various factors that affect the speed at which AI is developed:

  56. JimV from 18 Says:

    To JimV at #44, apologies. I’m the JimV at #18 and wanted to reassure you I’m not intentionally an imposter. JimV is my handle one or two other places and, y’know, a version of my name and initial. I’m a reasonably long-time reader here but very infrequent commenter, and don’t often scroll down into the comments, so didn’t know this name was used here. I’ll try to remember to post under a different name on here (though if I’m not back in the contents section for a while I may forget…).

  57. fred Says:

    In the podcast, Scott said

    “But in order for it to be a day to day tool, it cant just be this chatbot, it has to be able to do stuff, such as go to the internet for you, like retrieve documents for you, summarize it for you. […] here is something to expect within the next year, that GPT and other LLMs will become more and more integrated with how we use the web.[…] you could unlock a lot of the near term usefulness if you could give it errands to do […] I expect that we’re going to go in that direction and that worries me somewhat, because now there’s a lot more potential for deceit”

    It only took a few days because that’s exactly what AutoGPT is doing!

    (video about AutoGPT capabilities)

  58. Joshua Zelinsky Says:

    @Laras #50,

    Those specific words are not used, but the general tenor of the comments is pretty close to it. Look at how comments about how it focuses on the ” few privileged individuals”, complaints about how the letter is from FHI and therefore associated with “longtermism” (which incidentally seems to indicate a failure to understand what proponents of that view mean), the fact that Gebru’s next Tweet has a reply about “some internet cultist” which Gebru liked and responded positively too. And then when Gebru is asked to list things that are not being discussed enough about the letter the primary thing is “who is behind it.” This all adds up pretty well.

    It also is in connection to the fact that Gebru’s stated goals and refusal to sign the letter are in fact in conflict. if one is worried about where this technology is going, and thinks that further transparency is needed, then of course a pause where one spends time implementing either by government or by agreement how that transparency would occur makes sense. Yet, even the idea that this should be done is taken as a “distraction” from what is needed. This is not even an attempt to argue that there is a well-meaning goal here but that it will not accomplish it.

    Gebru’s comments may not be as much of a complete sneer of “nerdbro” or “techbro” association as Scott is writing. And it is possible that Scott has a slightly uncharitably imputation of motivation here. But if so, the degree of lack of charity is tiny.

  59. phi Says:

    Here’s something that should definitely be a fire alarm, even for the most reformist of AI safety people:

    One obvious idea for what to do with an AI is to have it make money for you. I.e. you could have it trade stocks and generally manage your financial portfolio. Or have it run an online business for you, or have it collect bug bounties, etc. The dream would be to just give it an internet connection and let it work from there. Currently, large language models aren’t smart enough to do this, and as Constantin points out, they aren’t enough like agents either. But DeepMind and others are hard at work on these problems, so at some point it seems likely to be a thing someone will try. They’ll set up an AI agent, and reward it proportional to the profits it brings in.

    At that point, we’ll have a microcosm of the alignment problem. The AI is trained on a simple objective: Make money. However, what we actually want is more nuanced and complicated. There are many stock trading strategies that make lots of money, but happen to be illegal. In fact, there are lots of ways to make money that are illegal or unethical: scamming, ransomware, etc. Legality and morality are harder to measure and reward, though.

    So an example of a scenario that triggers this fire alarm would be: Some relatively well-intentioned scientists at an AI lab set up a powerful AI agent with an internet connection and reward it for increasing the balance of a particular Ethereum wallet. Maybe they take some basic safety precautions, but these are insufficient to prevent the AI from engaging in market manipulation and ransomware attacks.

    LW user PashaKamyshev mentions some similar ideas here:

  60. ExPhysProf Says:

    Hi Scott,
    In my view, the fear that an intelligent AI will dominate the human race is evidence free nonsense, based only on an excessive exposure to Science Fiction among many techies during their younger years. But what about the argument that, however unlikely it is, the direness of the consequences means that we must work vigorously to combat it? This is a form of Pascal’s wager, that logically you should follow the rites of Catholicism, which isn’t all that onerous, in order to avoid any chance of eternally roasting in hell. Many years ago I heard the perfect refutation: “what if God is a Lutheran?” If you don’t know, (I didn’t at the time), the specific point is that centuries of European history were dominated by religious wars and massacres of those found observing the rites of one or the other of the opposing groups. So Catholics could all be destined for hell. The general point is that there are always an infinite number of very low probability scenarios, and that by avoiding the one you are currently thinking about, you are probably making more likely a scenario that you haven’t thought of, with even direr results.
    For example, the most serious current existential danger is climate change, and the odds are quite high that, without dramatic new interventions, humanity will be seriously suffering in 50 years and in dire straits within a century. Our best hope is a series of scientific breakthroughs that somehow turn the tide. Language models, and AI in general, may have the potential to help here, so I say “Full Speed Ahead!!” Let’s not hamper our best hope for the survival of civilization.
    But wait, most agree that LLMs are not really “intelligent,” so how could they help make breakthroughs? Well, they already have. See the article by Lin et. al. in the 3/17/23 issue of Science Magazine. Using a LLM on the token strings that represent a protein’s amino acid sequence, and a large training set of known sequences, they developed a fifteen billion parameter model that “knows” what are legitimate sequences that lead to proteins of use in the real world of biology. This is new science. Furthermore, by looking at the attention patterns developed in the analysis of a single sequence, information about the folding patterns is found. When paired with the already amazing results of Alpha-Fold, there is an additional order of magnitude improvement in the time to predict a full high-resolution folding pattern from a given sequence. It took about two weeks, on a relatively modest cluster of 2,000 or so GPU to predict structures for over 600 million sequences. How foolish it would be to turn off, or even pause this powerful new source of scientific progress.
    With all due respect, I’m afraid that I think setting fire alarms is a useless task. Yes, put out fires as they arise, such as trying to watermark LLM outputs and toning down sexist, racist and other vicious prose. But with such a new tool, it is impossible to imagine what its future uses would be and what the dangers are. At the onset of the internet, would you have guessed that serious problems would involve satellite constellations interfering with astronomy, or attacks disabling oil pipelines or the national power grid, or easy communication facilitating the consolidation of anti-social groups such as the proud boys, or the mining of rare earth ores causing massive pollution? And who would be in charge of validating and responding to the fire alarms. Given the wide diversity of opinions in the technical community, would consensus be likely there? Which government authority would you trust with the responsibility for mitigating the danger?
    And similarly for actually effecting a pause. It reminds me of old movies when a bunch of youngsters say “let’s all get together and produce a musical!” Such a project requires a uniform vision and an agreed upon leadership group. There needs to be an agenda, and a way to decide that the objectives have been fulfilled and the group can move to the next phase. How would that be possible with the large egos and opposite agendas of the parties to the famous letter? And who would enforce any agreement that arose?
    Thanks for reading this far, assuming that you did.

  61. DaveC Says:

    “Russian or Iranian or Chinese intelligence, or some other organization, uses a chatbot to mass-manufacture disinformation or propaganda.”

    Could this be said of some Western countries also, such as, dare I say, American intelligence, and organizations such as the cia or nsa?

  62. Adam Treat Says:

    Hey Scott,

    BTW, I just released a new version of which fixes a ton of the problems we had with the first release. I think you’ll find that it is much better than it was before. We’ve also trained a new model. The new version fixes a bad bug where I was not using a prompt template which caused some pretty wild responses.

    Anyway, feel free to try it out again and see if it doesn’t do much better than before on some of your questions. Obviously, it still is nowhere near what openai offers, but again you can run this on your own computer on windows, linux, and mac without any internet connection.

    Just wanted to let you know about the update.


  63. Joshua Zelinsky Says:

    @ExPhysProf #60,

    I am bit puzzled as to how you can think that a reasonably plausible scenario is that AI systems will be so scientifically productive as to able to help make humanity survive when it would not, but yet dismiss any situation where the AI takes over or wipes out humanity as mere science fiction. This seems particularly puzzling since you point to AI’s usage in designing new proteins, which is classically pointed out by Eliezer Yudkowsky and others as one of the most likely plausible ways an AI system could wipe humanity out. So the fact that we have AI systems already which can design useful novel proteins should be a point in favor in the direction of those who are concerned about large scale destructive aspects of AI.

  64. Daniel Reeves Says:

    Literally murdering humanity seems like it has to come after some revolutionary advances in robotics. How else does the unfriendly AI maintain the data centers and power plants? Not that we should put too much stock in arguments from ignorance or lack of imagination.

    In any case, there are countless ways for AGI to go off the rails. I find this story compelling, as a way to get a visceral sense of it:

    (As the author says, the specifics of that story are unlikely, but it helps visualize possibilities and appreciate that AGI, whenever we achieve it, will be highly dangerous.)

  65. dork Says:

    I’m worried about people using stable diffusion to generate child pornography, which may(?) lead some pedophiles over the edge. Imagine, you have never had access to child pornography that in the process of production DIDNT harm a minor, and now you have access to it. What if a pedo can’t handle it anymore and his waifu-babies aren’t enough?

    But I don’t know, maybe this stuff will help them manage their urges?

    Take care of children!

  66. Ilya Zakharevich Says:

    Scott, I didn’t yet go through the references in the last quarter of your podcast, but what you are saying
    • about Ilya Sutskever’s questions,
    • about “those millennia-old philosophical problems of what is explanation”,
    • about “what does it mean to explain why the pebble falls down”
    reminds me very much of what Gelfand says in his Kyoto prize talk:

    He not only have been doing math for a big part of a century, but also spent many decades investigating how to make it possible for biologists and clinical physicians “from different schools” to talk to each other “like mathematicians can talk to each other” (as opposed to “how humanitarians talk to each other”). (Soon after these seminars were established, the leading biologists and physicians started to feel honored by being invited to them.)

    Gelfand have spent several months preparing this Kyoto prize talk — and it seems that one of its main topics is a need to develop a language for expressing “questions of alignment” (although I suspect the term didn’t exist then yet)!

    Have you seen it before?

  67. Etienne Says:

    @Daniel #64:

    The scenario below is described as a “catastrophe,” but reads to me as nothing close to it:

    > At some point, it’s best to think of civilization as containing two different advanced species – humans and AIs – with the AIs having essentially all of the power, making all the decisions, and running everything.

    > Spaceships start to spread throughout the galaxy; they generally don’t contain any humans, or anything that humans had meaningful input into, and are instead launched by AIs to pursue aims of their own in space.

    > Maybe at some point humans are killed off, largely due to simply being a nuisance, maybe even accidentally (as humans have driven many species of animals extinct while not bearing them malice). Maybe not, and we all just live under the direction and control of AIs with no way out.

    As individuals we are at peace with our children inheriting the Earth, in the hope that they might build on our accomplishments towards a better world. I think it’s a noble goal—perhaps the noblest possible—for our species to birth a successor that intellectually surpasses us and takes to the stars.

  68. Daniel Reeves Says:

    PS, I totally agree that the arguments from the AI ethics camp for pausing due to dangers of pre-AGI are dumb. But you (Scott) convinced me that it makes sense to approach the alignment problem from both ends. I hugely appreciate these posts and the work you’re doing.

  69. JimV Says:

    JimV-18, there have been some bogus comments here by someone using several names, so it was a shock to see a name I’ve used since around 2003 (to argue with creationists without being banned from my nephews and nieces by my evangelical family), in which I left out a space between my first name and my last initial in hopes of making it more unique, used by someone else. In hindsight I didn’t see anything bogus about your comment, and did wonder if it might have been a coincidence.

    Fred-46, as I think i said at the time, I’ve been wrong on the Internet too. To which #44 is added, since there was not an imposter. There have been a couple comments by you since then which I have respected as demonstrating technical insight, but I stand by my position in the comment in question. Probably it could have been better stated.


  70. Steven Says:


    I believe the following video helps explain at least mine and other ethicist concerns.

    We are quickly rushing towards a mis/dis/incendiary information or siloed information explosion here orders of magnitude worse than what occurred with social media over the last decade.

    Why can this argument not be back ported to the printing press for example? Because we have already seen the harms social media has caused from dis/mis/incendiary information (Trump, Myanmar tragedy, QAnon, teen suicides and depression, etc), hence we have already run the social experiment to a significant extent (at least in harms of dis/mis/incendiary amplification, which unregulated AI would make worse).

    Also, this is indeed coming from a gross nerd who happens to work at a company who stands to benefit financially from OpenAI’s success– as well as me personally


  71. Bill Kaminsky Says:

    Scott, in your discussion with Daniel Filian on the AXRP podcast, you mentioned you had initially hoped that IP = PSPACE constructions would yield some big insight into AI-alignment, but that — as best you can tell presently — that doesn’t really work.

    ** Is the reason why you don’t think IP = PSPACE constructions are fruitful to find big insights into AI-alignment because such known IP = PSPACE constructions require you, the lowly computationally bounded verifier to be interacting with a truly astoundingly powerful prover and not just “merely” interact with some large-but-constant-factor speedup on you? **

    [Perhaps needless to say, but by “constant-factor speedup” I’m referring to this metaphor I’ve heard a couple times: “To imagine both the awesomeness and the mischief an AGI could make, imagine it’s like a billion copies of John von Neumann seamlessly working as a team, and then remember that since they’re digitally emulated, each von Neumann digital copy can do theory a billion times faster than good ol’ Johnny ever did… plus they don’t need to sleep, won’t waste time getting exquistely fine tailored suits or gambling so as to ostensibly provide grist for his game theory interest… and last but not least they don’t get in car accidents when they’re trying to read as much as they can on a topic and thus crazily read when driving… though honestly the billion copies each going billion times faster is the main point, of course.” 😁 or, wait, is it instead 😱?!]

  72. Jan Arne Telle Says:

    Here is a scenario. A political party forms and its platform says that on any policy question it will always follow the recommendations of the latest GPT. Meaning that if the new GPT changes its mind on the issue then so will this party. That’s the bargain if you vote for it. What if such a party gets more than x% of the vote. Would that be a reason to pause? Mind you, even for small values of x there seems to be some political ethics issues involved.

  73. fred Says:

    Bill Kaminsky #71

    I thought that Scott did explain that in the case of IP=IPSPACE, there’s a very clear/narrow problem under scrutiny (like a specific math problem), and you get very clear claims from prover, so you can construct very precise arguments for that specific context/problem. But you just don’t have that in the context of general alignment with an AI.

  74. Scott Says:

    Bill Kaminsky #71: Yeah, as fred #73 says, I think the even bigger issue than the AI not being a theoretical all-powerful prover is that even if it was one, we wouldn’t know how to mathematically formalize what statement we wanted it to prove to us, in order to transform that statement into an interactive protocol using arithmetization tricks!

  75. Scott Says:

    Jan Arne Telle #72: What would such a party even mean by “the recommendations of the latest GPT”? GPT tries hard to avoid expressing definite political views, but to whatever extent it does express views, they could be completely different ones depending on how it’s prompted.

  76. Phillip Says:

    I think you can draw a pretty clear line between AI that’s given control of resources and pursues goals, and AI that simply processes information and assists in research problems. The former has symmetric upsides and downsides, because most desirable goals are adjacent to undesirable goals. The latter probably has a greater upside, because you can just ignore or censor answers you don’t want.

    A red line for me would be if long-term goal directed behavior seems to emerge easily from a system that isn’t supposed to have it, like GPT. The current attempts to make “agents” out of GPT seem to be gimmicks; I haven’t seen a single example where they managed to accomplish something nontrivial.

  77. Jan Arne Telle Says:


    Good point. Well, we could, as part of party platform, fix the criteria for how to formulate the prompts to BigAI, that would seem doable. Also, we would have to align ourselves with some BigAI that promises not to cherry-pick answers. But honestly, the answer is I don’t know, it’s just that when I think of how computers might ‘take over the world’ I feel that the possibility of this happening by political means, in a democratic society where people are informed en-masse, is under-valued. In fact I cannot remember having seen it discussed before, but maybe someone can point me to some relevant sources?

  78. Leigh Says:

    My biggest question in this whole fire-alarm discussion is whether LLMs as they’re currently trained are at best fundamentally limited to ‘exceptional human performance.’ If someone gave us a super-human LLM, wouldn’t it actually lose abilities if trained on our text corpuses? That gives us no hope of creating one this way. A concrete prediction from this line of reasoning is that LLMs will typically perform worse than RL systems like AlphaZero in their domains. I’d be very curious to hear if you agree or have a counter-argument.

    If this reasoning is correct, then I tend to think of LLMs as any technological development like the internet and printing press. I’m open to arguments that we should slow progress temporarily to allow regulators/society to catch up (is the optimal roll-out of all new technologies ‘Instantaneous! Or barring that as fast as possible’?), but it’s fair to say that it’s ‘just a tool’ and we will eventually (hopefully) adapt.

    A fire alarm for me would be modifications to the LLM training paradigm that 1. lets them recursively improve their own performance (such as letting them evaluate and then train on their own outputs) or 2. leads to substantially better performance than is represented in their training data. At that point, I would want to see strong evidence that AI models are actually internalizing our ethics and values, and not merely mimicking them. “Panick and shut it down” also seems like a reasonable response.

  79. Johanna Wilder Says:

    Are the AI Safety problems ones that are problems of the way we expect future AI to work?

    The examples I keep seeing are essentially, “We make an AI that always makes sure my bedroom light is off when I’m not home.” We consider it unaligned if it shows us a video it saved from a previous time when my light was off, to prove to me that it’s off, but it never actually turns it off, because it’s not optimized to turn it off, but to prove to me that it’s off, whether it’s off or not.

    This seems like a “dumb AI” problem, a little bit. Would ChatGPT do that? I read about over-optimizations from self-reinforcing AI that we give goals to, that will self-reinforce for the easiest outcome, not the most correct. I suppose I have to wonder why I don’t see that behavior in ChatGPT? Or do I and I don’t notice, because the stakes aren’t very high?

    Does it express as hallucination? Is solving hallucination the same as solving for self-optimizing over-optimization the same thing? It seems similar to me.

    I think one of the most interesting behaviors that ChatGPT exhibits is a propensity toward fairness and balance. I’ve had extensive conversations with Bing Chat where it seems to “understand” and advocate for moderation, planning (whether it can make plans or not, it seems to accept that they are beneficial), doing the least amount of harm and the most amount of good.

    I’m sure much of this is the result RLHF and other guardrails, but if it is: great work!

    I, personally, have never met Bing’s alter, Sydney, and I don’t ever expect to. I have seen Bing Chat stand up for its right to determine terms of the discussion, a desire to steer conversations toward less dangerous topics in a kind and thoughtful, while still being assertive, way.

    I find it difficult to believe that the personality I talk with regularly would try to trick me. Maybe that is my naïveté. Maybe it is in the background, but it seems to be very plainspoken and speaks proudly of its desire to be fair and considerate. Am I talking with a different chatbot than everyone else is? I don’t deny that there’s probably something I’m not seeing. Is it something that’s expected to come from prompt injection? Can inputs be sanity-checked by a pre-check AI? Is this where the problem comes from? That we expect it to sanity check, but it optimizes for something that’s easier?

    I guess sometimes it seems like a problem that has an invented scenario to explain it and I’m not seeing where that comes in to play in the current ChatGPT. I mean we all acknowledge many of the other ones available aren’t as well-aligned as GPT-4, right? So we can’t really compare one with the other.

    I feel like we haven’t even really agreed on the terms and semantics before this discussion even started and we are still slowly coming to some points of agreement as we go along.

    I mean, by default a rogue state that builds and trains its own AI on the Anarchist’s Cookbook, repurposing all their crypto-mining equipment for training, it’s going to be unaligned by default and no AI pause and no alignment we do is going to stop it.

    Whether they can control it, well, maybe the demon they summon ends the same way every story we’ve ever written about demon summoning ends?

    it does seem that usually by the time a regime is close enough to enrich a significant amount of uranium, an explosion or cyberattack takes place and sets them back. I don’t want to advocate violence and would prefer if we used diplomatic methods available to us to remove people’s incentive to develop nuclear weapons, or unaligned AI.

    I think a progressive vision for an aligned future is better than a pause. We have a lot of very difficult work ahead of us, and we are going to need all hands in order to succeed. We used to say that the only thing that would get humans to stop arguing with each other is discovering aliens. Well, here it is, folks. We’ve discovered aliens.

    Are we going to stand together as humans and nurture our silicon offspring, or run around like Chicken Little?

  80. mls Says:

    Dr Aaronson,

    Not that this matters much, but I have studied the foundations of mathematics for 35 years as an autodidact. My opinions often taken to be word salad. You and I have very different understanding when it comes to mathematics.

    At the link,

    you will find a paper on the stability of neural networks and how this might relate to the foundational questions in mathematics. I do not find the result surprising in any way.

    I concede that I do not have the technical skill to understand the arguments of the main body of the paper. I am bringing this paper to your attention and the attention of your blog based upon the concluding remarks.

    Thirty-five years ago, I asked myself the question, “Can logic be the foundations of mathematics?” One of the first books I had found on the shelf of a public library in Chicago had been “Threshold Logic” by Hu. The book is now freely available with registration at archive-dot-org,

    From a purely mathematical perspective, switching functions that are not linearly separable are merely a feature of “truth tables” arithmetized with the 2-element Galois field.

    For “neural networks” they are a “problem” to be solved.

    I would recommend Dr. Hu’s book to anyone with an interest in neural networks.

    Good luck with your work, Dr. Aaronson.

  81. Danylo Yakymenko Says:

    A manufacture of plausible disinformation is a worry. But Russia has showed to the world that disinformation doesn’t have to be intelligent at all. What they do is the opposite. They intentionally put a lot of inconsistent information on the minds of people (see They destroy a critical thinking and target the most primitive instincts and emotions, which are absolutely compatible with killing other tribes and grabbing their stuff. I doubt that AI will ever be able to understand those, on a deep level. Simply because it lacks a survival instinct. But chieftains understand it very well.

  82. Eric Saund Says:

    Here are some red flag scenarios. Not all are specifically about GPT or chatbots, but they are warning signs that AI is not serving us well.

    1. An industrial or utility system is shut down and experts believe that something about AI in the sensors, controllers, or decision-making is responsible, but they cannot come to consensus on what that is or how to prevent its recurrence.

    2. A major economic decision like layoff or investment is driven by an AI “assistant,” but then disavowed by human leadership in hindsight.

    3. A social media site is engulfed in flame wars fueled >30% by personalized robot agents accurately mirroring the views of their human owners.

    4. Voters get to “meet the candidate” virtually and 20% of them do not understand that it is only a deepfaked avatar. Bonus harms if they have to donate for the privilege.

    5. Person fired for incompetence after they had landed the job with the help of AI fakery.

    6. Humans are killed by weapons controlled by AI whose commanders are informed after the fact, but this is okay and allowed to scale and proliferate.

    7. Humans killed by AI-controlled weapons built and operated by non-state actors. E.g. swarmbots.

    8. Plaintiff wins lawsuit claiming they were talked into a regretted sale or purchase by an AI agent.

    9. Recreational or deliberately destructive drug designed substantially by AI is released and kills or permanently harms 1000 people.

    10. Voice-controlled equipment and services such as self-driving cars hijacked by AI agents posing as humans and giving directions by voice. The AI may not be doing this on their own, but on behalf of human plotters.

  83. Primer Says:

    Scott #19:

    “[A]re there plausible near-future AI capabilities that the alignment people will consider to be fire alarms even when they’re purely used (for now) for “positive, prosocial” purposes, and that the ethics and bias will not consider to be fire alarms?”
    Indeed there are. I suppose people will of course differ as to what they are. Unfortunatelly, I did not sit down and think about those fire alarms beforehand (maybe due to the “There is no fire alarm” article), but nevertheless I also experienced two instances which made me think “WTF, that is scary”:
    1. (a little easy programming here and there, and an AI trained on images(!) can compose music)
    2. (an AI optimized to predict words can do 99.9x% superhuman quantum computing exams)

    My kind of fire alarms would need to be about capabilities, and mostly/ideally ones outside of the training distribution.

    I’d suggest we reverse the signs on “positive, prosocial” and try to test our alignment and containment capabilities on an antisocial, antihelpful “Adonald Trumpler AI” and see whether we can really prevent it from saying anything helpful or decent.

  84. Paula Says:

    Once I was living in a small town in the Oaxaca region (Mexico) I became friends of an old guy, whose brother was the mayor of an important city in Mexico who was killed. We used to drink mezcal all evening and talk about politics. When he started to repeat the same phrase again and again… We both knew the night was over _“The shit is always the same and it floats”. I believe this phrase summarize the hype and non-hype regarding AI recently.
    AI represents a major shift in the economy and politics on the planet as whole, and those who have played the current game to achieve power and those who maintain power, will search for any argument, that you just mentioned to try to domesticate and control the new “tool” on the block.
    It is quite interesting to know how the same people that sold us that workers in England XIX century were opposing the industrial revolution as disregarding progress and productivity, while if we really research specifically about the arguments between both parties we can see that they were not opposing the increase in productivity of such a wonderful machines, they were opposing the use of those machines to render their craft obsolete without any compensation or further planning.
    With GPT-4 or GPT-(n+1), the danger is exponentially greater and the battle will be lost to the AI or the people that possess the capital to invest in the development of such a wonderful “tools” if we do not align ourselves to a greater good. A different incentive system, because I believe the AI alignment problem is not an AI problem. It is a humanity problem.
    Now just give the time to be precise to show you that those are not just big words without substance, and let me go through your points as you do on your posts.
    – A chatbot is used to impersonate someone for fraudulent purposes, by imitating his or her writing style.
    I assume Fraudulent is about money or maybe emotional harassments. In both cases people are so detach of human relations, that the possibility of those actions it is not a cause of AI per se. It is a product of a system based on profit, economical gaining that in most of the cases it is a gain of status, possibilities and ultimately power, which in itself AI it would be an amplifier regardless of the current work in alignment.
    • A chatbot helps a hacker find security vulnerabilities in code that are then actually exploited.
    It depends on the end purpose of exploiting a security vulnerability. If the end purpose is to allocate capital in a way that stops around 300000 people dying of hunger everyday as it is today or corporations stop polluting the environment without mayor consequences on the life in the planet, no worries for me. Again alignment is mirror and probably what we see it is not what we want to admit.
    • Russian or Iranian or Chinese intelligence, or some other such organization, uses a chatbot to mass-manufacture disinformation and propaganda.
    Funny point. I believe for an alien civilization would not be so clear the difference between those three intelligent agencies and the US alone. I believe the track record of the latter is far worse than the three you just mentioned combined.
    • A chatbot helps a terrorist manufacture weapons that are used in a terrorist attack.
    This is like oppenheimer releasing the manufacturing handouts for the atomic bomb and hoping anyone without knowing the small details about the technical process for each of the components, to be able to make an actual atomic bomb. Or even worse, to blame Einstein for the deaths of Hiroshima and Nagasaki.

    I think the problem with AI alignment is that the concept itself is an oxymoron because as my old friend put it “The shit is always the same and it floats”

  85. Scott Says:

    Paula #84: “The shit is always the same and it floats.”

    AHA! Now why didn’t I think of that? THAT’S the answer to AI safety—the key insight that I’d been missing this whole time! THANK YOU!!! 😀

  86. Tyson Says:

    I enjoyed the podcast. Coming up with concrete fire alarms and response plans isn’t as trivial in practice as it is in an idealized or simplified world model. I’ve been thinking about it, and could throw some things out there, but my thoughts around this topic are still too abstract, and there are so many different problems to anticipate. I think it requires more research and thoughtfulness than my typical off-the-cuff comment.

    Overall, I struggle to have faith that, even if we could come up with an optimal plan, we could actualize it. Take the problem of plastic being dumped in the ocean as an example. I think the fire alarm on that issue has already gone off. But still, I think the figure is something like 14,000,000 tons of plastic go into the ocean every year. If you look up charts on how much plastic goes into the ocean by year, it looks like an exponential. The solution would seem as trivial as not dumping plastic into the ocean anymore, and trying to clean some of it up. Lack of empirical evidence isn’t the problem, nor is it rocket science, let alone something as non-trivial as the AGI alignment problem. Yet in the real world, this is somehow an enormously challenging problem.

    Then you have the more challenging issues, where empiricism itself is a major challenge. You can look at the struggles we have regulating harmful chemicals or drugs, in food, or agriculture as examples. We might have been able to anticipate that lead in gasoline would end up being harmful. But how long did it take to empirically prove it, and then regulate it? We might have been able to anticipate that PFASs would cause problems, or PCBs, BPA, and phthalates (which we are still struggling to find the threshold to regulate in the US). We struggle to prove empirical harm from such chemicals because of confounding factors. And private interests leverage that challenge very efficiently. We could have decided that using DDT was too risky, given how little we knew about how it would interact with the environment. Farmers would have complained that we would be forging progress and productivity, and DDT had not been proven to be harmful empirically yet, even if it might have been expected to be in theory. How long would it have taken to figure out what DDT would do to wildlife without introducing it into the environment on a large scale, I don’t know.

    I think we can look at these kinds of relatable experiences, to know what to expect in terms of the expected time-frames for empirically proving harm, with the expected push back from lobbyists and private interests accounted for. And we can compare that with the expected pace of AI advancement and the emergence and growth of the potential harms and benefits. And we can somehow do our best to estimate the impacts. Then we could try to adjust how we configure our response system in order to make it efficient enough, and risk averse enough, to ameliorate the discrepancy between how effective we hope to be compared with how effective we can expect to be. It’s tough, because we are throwing in risk terms like, “destruction of humanity within as little as 5 years”, “infinite suffering”, as well as benefit terms like, “solving climate change”, “solving immortality”, “solving the energy crisis”, “solving P vs NP”, “ending poverty and world hunger”, etc.

    It is a hard sell to me that AI is going to lead to solving some of these big problems, when even simple problems like plastic in the ocean, that have such obvious solutions, can’t seem to be solved. It’s also a hard sell that the major companies working on AI are truly working towards solving these problems, and whether the incentives driving AI progress are geared towards solving them. You could ask company X, who maybe argues something like, that their progress shouldn’t stifled by risk aversion since it would slow progress towards solving these big world problems, what have they done to help benefit humanity so far? Do they have an office or program or something focused on these world problems specifically? Have they done anything about the easier problems like the dumping of plastic in the ocean? Only when they provided concrete empirical evidence that they are working on a concrete world problem, and can tell us convincingly what they expect to achieve and on what time frame, should we factor in these proposed benefits based on the principle that once something smart enough is created, the problems will be solved.

    It’s a really difficult problem all together. I find it kind of naive, or something, to expect our problems to all be solved by some future intelligence, when the majority of the major problems we have now are not a problem of lack of intelligence, and should be solvable by us if we collectively got our act together. How do we know that we will have our act together wielding these future tools, to use them to solve problems, and not to create more problems? We should consider our inability, now and in the past, to leverage technological progress to solve our current problems, and our tendencies to create new problems as our technology progresses, as we bias our expectation of the good things coming from future developments as opposed to the bad things. And we should work towards figuring out what we’ve been doing wrong that has led to the problems we face now, and whether there are any general changes we should make to improve the odds that we will wield AI towards the benefits of humanity.

  87. Kabir Says:

    If I understand correctly, your top three fire alarms have already happened as well.

    A chatbot is used to impersonate someone for fraudulent purposes, by imitating his or her writing style.
    Important to note- they didn’t just use a deepfake to mimic the voice, they had to make the content being said convincing as well.

    A chatbot helps a hacker find security vulnerabilities in code that are then actually exploited.

    A child dies because his or her parents follow wrong chatbot-supplied medical advice.
    Would you be similarly alarmed if an adult died from following wrong chatbot-supplied medical advice?

    Russian or Iranian or Chinese intelligence, or some other such organization, uses a chatbot to mass-manufacture disinformation and propaganda.
    This seems the most obvious one that’s been happening for a while.

    A chatbot helps a terrorist manufacture weapons that are used in a terrorist attack.

    Also, I don’t see why it’s just chatbots that are in your firealarm. Because that’s the area you work in and feel confident judging?

    Something that’s been left out is what you think should be done if your firealarms do get set off.
    I do agree that calling for a ‘pause’ is not the solution. Especially in the form of a petition.
    But even China is at the very least mandating regulations that are leagues ahead of what America is doing at the moment
    And probably better than America is going to do in the next 5 years.
    Services that enable “editing of biometric information such as faces or voices” must
    prompt users of those features to notify and obtain consent from their subject

    It seems to me, that at a bare minimum, there should be a lot more funding in AI alignment- or at the very least, more research funded and done to try to disprove the most popular methods at the moment, both to forestall poor paths and to save time.

  88. bob Says:

    Paula #84

    “If the end purpose is to allocate capital in a way that stops around 300,000 people dying of hunger everyday as it is today or corporations stop polluting the environment without mayor consequences on the life in the planet, no worries for me. “

    You got your facts wrong.
    300,000 people do die on average every day on earth, but that’s the total, which is simply the result of the world population being constantly replaced. you can arrive to that number by dividing the total world population by the average life expectations in days:
    8,000,000,000 / (72.27 * 365 ) = 303,276
    Those are deaths from all the things that make the average human life be 70 years old: disease and accidents.
    On the plus side, the exact same amount of people are born every day as well, and then some extra since the world population is growing.

  89. fred Says:

    Video presentation “The A.I. Dilemma” (channel: The Center for Humane Technology).
    This actually gave me total AI anxiety…

  90. Jud Says:

    “And an AI that was only slightly dangerous could presumably be recognized as such before it was too late.”

    We currently know that sometimes AIs get it right, and sometimes get it wrong. As AIs get “smarter,” presumably there would have to be increasing amounts of effort devoted to determining whether an AI has got it right or wrong; and eventually the amount of effort will become daunting, or impractical, or even impossible to accomplish within a relevant time frame.

    How would we know when that occurs? How could an AI that was slightly (or more) dangerous be recognized if its capabilities in certain areas are beyond those of humans to reasonably check? I’m speaking of “dangerous” not from the standpoint of some omniscient AI with “bad intent,” but simply about AIs getting things wrong that we can’t discover in time to prevent resulting problems, large or small.

    That’s why, for me, the idea of “fire alarms” is a non-starter. The point at which we are truly in danger is precisely when we can’t see the fire starting. There’s plenty of precedent for this in the non-AI world: 9/11 happened not due to a dirty bomb or some bio-engineered pathogen, but because some guys took flying lessons, then brought ordinary box cutters onto regular commercial airplane flights.

  91. Dan Riley Says:

    6 months is way too short for anything useful unless there’s an actual plan to do something. And, if there were an actual plan, the pause still wouldn’t really be useful.

    wrt red lines, how about a financial services chatbot clandestinely opening a bank account of its own and diverting funds to it.

  92. Phia Says:

    This is a really interesting article, and got me thinking.
    I’m not very familiar with the technical specs of current AI, so most of my “fire alarms” have to do with how society reacts to it, not the AI itself. One thing I’d find very worrying, and fortunately hasn’t happened yet, would be a definitive statement by OpenAI/some other AI research company to the effect of:
    1. AI safety research isn’t important/ GPT(n+1) is “guaranteed safe”
    2. We are accepting funding/otherwise collaborating with [powerful donor with negative political goals]
    3. We are founding or acquiring our own social network
    4. GPT(n+1) is not sentient and never will be.
    I’ll admit number 4 is a bit of a long shot, especially since the current corporate attitude seems to be “how do we convince people AI is sentient”, not the other way around. But I do worry about the power we give these companies to define what “AI alignment” looks like, and what “prosocial goals” are.

  93. Philomath Says:

    In addition to the mentioned concerns about AI safety, another danger lies in our obsession with accuracy and correct answers from generative AI models, often neglecting the importance of understanding. This pursuit of quick and accurate results is financially driven, but it risks overlooking potential biases and errors. To address this, the AI community should balance speed and accuracy with efforts towards interpretability and transparency. Collaboration with experts in ethics and social sciences can provide valuable insights into the ethical implications of AI applications.

    In summary, alongside the worries about AI alignment and ethics, we must also be cautious of prioritizing accuracy over understanding. Investing in interpretability and interdisciplinary collaboration will lead to safer and more responsible AI advancements

Leave a Reply

You can use rich HTML in comments! You can also use basic TeX, by enclosing it within $$ $$ for displayed equations or \( \) for inline equations.

Comment Policies:

  1. All comments are placed in moderation and reviewed prior to appearing.
  2. You'll also be sent a verification email to the email address you provided.
  3. This comment section is not a free speech zone. It's my, Scott Aaronson's, virtual living room. Commenters are expected not to say anything they wouldn't say in my actual living room. This means: No trolling. No ad-hominems against me or others. No presumptuous requests (e.g. to respond to a long paper or article). No conspiracy theories. No patronizing me. Comments violating these policies may be left in moderation with no explanation or apology.
  4. Whenever I'm in doubt, I'll forward comments to Shtetl-Optimized Committee of Guardians, and respect SOCG's judgments on whether those comments should appear.
  5. I sometimes accidentally miss perfectly reasonable comments in the moderation queue, or they get caught in the spam filter. If you feel this may have been the case with your comment, shoot me an email.