“Will AI Destroy Us?”: Roundtable with Coleman Hughes, Eliezer Yudkowsky, Gary Marcus, and me (+ GPT-4-enabled transcript!)

A month ago Coleman Hughes, a young writer whose name I recognized from his many thoughtful essays in Quillette and elsewhere, set up a virtual “AI safety roundtable” with Eliezer Yudkowsky, Gary Marcus, and, err, yours truly, for his Conversations with Coleman podcast series. Maybe Coleman was looking for three people with the most widely divergent worldviews who still accept the premise that AI could, indeed, go catastrophically for the human race, and that talking about that is not merely a “distraction” from near-term harms. In any case, the result was that you sometimes got me and Gary against Eliezer, sometimes me and Eliezer against Gary, and occasionally even Eliezer and Gary against me … so I think it went well!

You can watch the roundtable here on YouTube, or listen here on Apple Podcasts. (My one quibble with Coleman’s intro: extremely fortunately for both me and my colleagues, I’m not the chair of the CS department at UT Austin; that would be Don Fussell. I’m merely the “Schlumberger Chair,” which has no leadership responsibilities.)

I know many of my readers are old fuddy-duddies like me who prefer reading to watching or listening. Fortunately, and appropriately for the subject matter, I’ve recently come into possession of a Python script that grabs the automatically-generated subtitles from any desired YouTube video, and then uses GPT-4 to edit those subtitles into a coherent-looking transcript. It wasn’t perfect—I had to edit the results further to produce what you see below—but it was still a huge time savings for me compared to starting with the raw subtitles. I expect that in a year or two, if not sooner, we’ll have AIs that can do better still by directly processing the original audio (which would tell the AIs who’s speaking when, the intonations of their voices, etc).
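To give a flavor of how such a script works, here's a minimal, hypothetical sketch of the cleanup-and-chunking step. The function names and the character budget are my own illustration, not the actual script, and the caption download and the GPT-4 call themselves are omitted:

```python
# Hypothetical sketch of the subtitle-cleanup step such a script might use.
# (Fetching the captions and calling GPT-4 are left out.)

def dedupe_cues(cues):
    """YouTube auto-captions often repeat a cue verbatim; drop exact repeats
    and empty cues."""
    out = []
    for cue in cues:
        text = cue.strip()
        if text and (not out or text != out[-1]):
            out.append(text)
    return out

def chunk_for_editing(cues, max_chars=8000):
    """Group cleaned cues into chunks small enough to send to the editing
    model one call at a time."""
    chunks, current, size = [], [], 0
    for cue in cues:
        if size + len(cue) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(cue)
        size += len(cue) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks

raw = ["so the question is", "so the question is", "can chatgpt kill us", ""]
print(chunk_for_editing(dedupe_cues(raw), max_chars=40))
# → ['so the question is can chatgpt kill us']
```

Each chunk would then be handed to GPT-4 with an instruction along the lines of "edit these raw subtitles into a coherent transcript," and the edited chunks concatenated.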

Anyway, thanks so much to Coleman, Eliezer, and Gary for a stimulating conversation, and to everyone else, enjoy (if that’s the right word)!

PS. As a free bonus, here’s a GPT-4-assisted transcript of my recent podcast with James Knight, about common knowledge and Aumann’s agreement theorem. I prepared this transcript for my fellow textophile Steven Pinker and am now sharing it with the world!

PPS. I’ve now added links to the transcript and fixed errors. And I’ve been grateful, as always, for the reactions on Twitter (oops, I mean “X”), such as: “Skipping all the bits where Aaronson talks made this almost bearable to watch.”

COLEMAN: Why is AI going to destroy us? ChatGPT seems pretty nice. I use it every day. What’s, uh, what’s the big fear here? Make the case.

ELIEZER: We don’t understand the things that we build. The AIs are grown more than built, you might say. They end up as giant inscrutable matrices of floating point numbers that nobody can decode. At this rate, we end up with something that is smarter than us, smarter than humanity, that we don’t understand, whose preferences we could not shape. By default, if that happens, if you have something around that is much smarter than you and does not care about you one way or the other, you probably end up dead at the end of that.

GARY: Extinction is a pretty, you know, extreme outcome that I don’t think is particularly likely. But the possibility that these machines will cause mayhem because we don’t know how to enforce that they do what we want them to do, I think that’s a real thing to worry about.


COLEMAN: Welcome to another episode of Conversations with Coleman. Today’s episode is a roundtable discussion about AI safety with Eliezer Yudkowsky, Gary Marcus, and Scott Aaronson.

Eliezer Yudkowsky is a prominent AI researcher and writer known for co-founding the Machine Intelligence Research Institute, where he spearheaded research on AI safety. He’s also widely recognized for his influential writings on the topic of rationality.

Scott Aaronson is a theoretical computer scientist and author, celebrated for his pioneering work in the field of quantum computation. He’s also the [Schlumberger] Chair of Computer Science at UT Austin, but is currently taking a leave of absence to work at OpenAI.

Gary Marcus is a cognitive scientist, author, and entrepreneur known for his work at the intersection of psychology, linguistics, and AI. He’s also authored several books including Kluge and Rebooting AI: Building AI We Can Trust.

This episode is all about AI safety. We talk about the alignment problem, we talk about the possibility of human extinction due to AI. We talk about what intelligence actually is, we talk about the notion of a singularity or an AI takeoff event, and much more. It was really great to get these three guys in the same virtual room, and I think you’ll find that this conversation brings something a bit fresh to a topic that has admittedly been beaten to death on certain corners of the internet.

So, without further ado, Eliezer Yudkowsky, Gary Marcus, and Scott Aaronson. [Music]

Okay, Eliezer Yudkowsky, Scott Aaronson, Gary Marcus, thanks so much for coming on my show. Thank you. So, the topic of today’s conversation is AI safety and this is something that’s been in the news lately. We’ve seen, you know, experts and CEOs signing letters recommending public policy surrounding regulation. We continue to have the debate between people that really fear AI is going to end the world and potentially kill all of humanity and the people who fear that those fears are overblown. And so, this is going to be sort of a roundtable conversation about that, and you three are really three of the best people in the world to talk about it with. So thank you all for doing this.

Let’s just start out with you, Eliezer, because you’ve been one of the most really influential voices getting people to take seriously the possibility that AI will kill us all. You know, why is AI going to destroy us? ChatGPT seems pretty nice. I use it every day. What’s the big fear here? Make the case.

ELIEZER: Well, ChatGPT seems quite unlikely to kill everyone in its present state. AI capabilities keep on advancing and advancing. The question is not, “Can ChatGPT kill us?” The answer is probably no. So as long as that’s true, as long as it hasn’t killed us yet, the engineers are just gonna keep pushing the capabilities. There’s no obvious blocking point.

We don’t understand the things that we build. The AIs are grown more than built, you might say. They end up as giant inscrutable matrices of floating point numbers that nobody can decode. It’s probably going to end up technically difficult to make them want particular things and not others, and people are just charging straight ahead. So, at this rate, we end up with something that is smarter than us, smarter than humanity, that we don’t understand, whose preferences we could not shape.

By default, if that happens, if you have something around that is much smarter than you and does not care about you one way or the other, you probably end up dead at the end of that. It gets the most of whatever strange and inscrutable things it wants: it wants worlds in which there are not humans taking up space, using up resources, building other AIs to compete with it, or it just wants a world in which it builds enough power plants that the surface of the earth gets hot enough that humans don’t survive.

COLEMAN: Gary, what do you have to say about that?

GARY: There are parts that I agree with, some parts that I don’t. I agree that we are likely to wind up with AIs that are smarter than us. I don’t think we’re particularly close now, but you know, in 10 years or 50 years or 100 years, at some point, it could be a thousand years, but it will happen.

I think there’s a lot of anthropomorphization there about the machines wanting things. Of course, they have objective functions, and we can talk about that. I think it’s a presumption to say that the default is that they’re going to want something that leads to our demise, and that they’re going to be effective at that and be able to literally kill us all.

I think, if you look at the history of AI, at least so far, they don’t really have wants beyond what we program them to do. There is an alignment problem; I think that’s real, in the sense that people program a system to do X and it does X′, something that’s kind of like X but not exactly. And so, I think there are real things to worry about. I think there’s a real research program here that is under-researched.

But the way I would put it is, we want to understand how to make machines that have values. You know, Asimov’s laws are way too simple, but they’re a kind of starting point for conversation. We want to program machines that don’t harm humans, that can calculate the consequences of their actions. Right now, we have technology like GPT-4 that has no idea what the consequences of its actions are; it doesn’t really anticipate things.

And there’s a separate thing that Eliezer didn’t emphasize, which is, it’s not just how smart the machines are but how much power we give them; how much we empower them to do things like access the internet or manipulate people, or, um, you know, write source code, access files and stuff like that. Right now, AutoGPT can do all of those things, and that’s actually pretty disconcerting to me. To me, that doesn’t all add up to any kind of extinction risk anytime soon, but catastrophic risk where things go pretty wrong because we wanted these systems to do X and we didn’t really specify it well. They don’t really understand our intentions. I think there are risks like that.

I don’t see it as a default that we wind up with extinction. I think it’s pretty hard to actually terminate the entire human species. You’re going to have people in Antarctica; they’re going to be out of harm’s way or whatever, or you’re going to have some people who, you know, respond differently to any pathogen, etc. So, like, extinction is a pretty extreme outcome that I don’t think is particularly likely. But the possibility that these machines will cause mayhem because we don’t know how to enforce that they do what we want them to do – I think that’s a real thing to worry about and it’s certainly worth doing research on.

COLEMAN: Scott, how do you view this?

SCOTT: So I’m sure that you can get the three of us arguing about something, but I think you’re going to get agreement from all three of us that AI safety is important, and that catastrophic outcomes, whether or not they mean literal human extinction, are possible. I think it’s become apparent over the last few years that this century is going to be largely defined by our interaction with AI. That AI is going to be transformative for human civilization and—I’m confident about that much. If you ask me almost anything beyond that about how it’s going to transform civilization, will it be good, will it be bad, what will the AI want, I am pretty agnostic. Just because, if you had asked me 20 years ago to try to forecast where we are now, I would have gotten a lot wrong.

My only defense is that I think all of us here and almost everyone in the world would have gotten a lot wrong about where we are now. If I try to envision where we are in 2043, does the AI want to replace humanity with something better, does it want to keep us around as pets, does it want to continue helping us out, like a super souped-up version of ChatGPT, I think all of those scenarios merit consideration.

What has happened in the last few years that’s really exciting is that AI safety has become an empirical subject. Right now, there are very powerful AIs that are being deployed and we can actually learn something. We can work on mitigating the nearer-term harms. Not because the existential risk doesn’t exist, or is absurd or is science fiction or anything like that, but just because the nearer-term harms are the ones that we can see right in front of us. And where we can actually get feedback from the external world about how we’re doing. We can learn something and hopefully some of the knowledge that we gain will be useful in addressing the longer term risks, that I think Eliezer is very rightly worried about.

COLEMAN: So, there’s alignment and then there’s alignment, right? So there’s alignment in the sense that we haven’t even fully aligned smartphone technology with our interests. Like, there are some ways in which smartphones and social media have led to probably deleterious mental health outcomes, especially for teenage girls, for example. So there are those kinds of mundane senses of alignment, where it’s like, ‘Is this technology doing more good than harm in the normal everyday public policy sense?’ And then there’s the capital ‘A’ alignment. Are we creating a creature that is going to view us like ants and have no problem extinguishing us, whether intentionally or not?

So it seems to me all of you agree that the first sense of alignment is, at the very least, something to worry about now and something to deal with. But I’m curious to what extent you think the really capital ‘A’ sense of alignment is a real problem because it can sound very much like science fiction to people. So maybe let’s start with Eliezer.

ELIEZER: I mean, from my perspective, I would say that if we had a solid guarantee that AI was going to do no more harm than social media, we ought to plow ahead and reap all the gains. The amount of harm that social media has done to humanity, while significant in my view and having done a lot of damage to our sanity, is not enough harm to justify either foregoing the gains that you could get from AI— if that was going to be the worst downside—or to justify the kind of drastic measures you’d need to stop plowing ahead on AI.

I think that the capital “A” alignment is beyond this generation. Yeah, you know, I started in the field; I’ve watched over it for two decades. I feel like in some ways, the modern generation, plowing in with their eyes on the short-term stuff, is losing track of the larger problems, because they can’t solve the larger problems and they can solve the little problems. But we’re just plowing straight into the big problems, and we’re going to plow right into the big problems with a bunch of little solutions that aren’t going to scale.

I think it’s cool. I think it’s lethal. I think it’s at the scale where you just back off and don’t do this.

COLEMAN: By “back off and don’t do this,” what do you mean?

ELIEZER: I mean, have an international treaty about where the chips capable of doing AI training go, and have them all going into licensed, monitored data centers. And not have training runs for AIs more powerful than GPT-4, possibly even lowering that threshold over time as algorithms improve and it becomes possible to train more powerful AIs using less compute—

COLEMAN: So you’re picturing a kind of international agreement to just stop? International moratorium?

ELIEZER: If North Korea steals the GPU shipment, then you’ve got to be ready to destroy, by conventional means, the data center that they build. And if you don’t have that willingness in advance, then countries may refuse to sign up for the agreement, being like, ‘Why aren’t we just ceding the advantage to someone else?’

Then, it actually has to be a worldwide shutdown, because of the scale of harmfulness of superintelligence—it’s not that if you have 10 times as many superintelligences, you’ve got 10 times as much harm. It’s not that a superintelligence only wrecks the country that built it. Any superintelligence anywhere is everyone’s last problem.

COLEMAN: So, Gary and Scott, if either of you want to jump in there—I mean, is AI safety a matter of forestalling the end of the world? And all of these smaller issues and paths toward safety that Scott, you mentioned, are they just, you know—I don’t know what the analogy is, but, um—pointless, essentially? I mean, what do you guys make of this?

SCOTT: The journey of a thousand miles begins with a step, right? Most of the way I think about this comes from, you know, 25 years of doing computer science research, including quantum computing and computational complexity, things like that. We have these gigantic aspirational problems that we don’t know how to solve and yet, year after year, we do make progress. We pick off little sub-problems, and if we can’t solve those, then we find sub-problems of those. And we keep repeating until we find something that we can solve. And this is, I think, for centuries, the way that science has made progress. Now it is possible, of course, that this time, we just don’t have enough time for that to work.

And I think that is what Eliezer is fearful of, right? That we just don’t have enough time for the ordinary scientific process to take place before AI becomes too powerful. In such a case, you start talking about things like a global moratorium, enforced with the threat of war.

However, I am not ready to go there. I could imagine circumstances where I might say, ‘Gosh, this looks like such an imminent threat that, you know, we have to intervene.’ But, I tend to be very worried in general about causing a catastrophe in the process of trying to prevent one. And I think, when you’re talking about threatening airstrikes against data centers or similar actions, then that’s an obvious worry.

GARY: I’m somewhat in between here. I agree with Scott that we are not at the point where we should be bombing data centers; I don’t think we’re close to that. Furthermore, I’m much less convinced that we’re close to AGI than Eliezer sometimes sounds. I don’t think GPT-5 is anything like AGI, and I’m not particularly concerned about who gets it first and so forth. On the other hand, I think that we’re in a sort of dress rehearsal mode.

You know, nobody expected GPT-4, or really ChatGPT, to percolate as fast as it did. And it’s a reminder that there’s a social side to all of this. How software gets distributed matters, and there’s a corporate side as well.

It was a kind of galvanizing moment for me when Microsoft didn’t pull Sydney, even though Sydney did some awfully strange things. I thought they would stop it for a while, and it’s a reminder that they can make whatever decisions they want. So, when we multiply that by Eliezer’s concerns about what we do, and at what point it would be enough to cause problems, it is a reminder, I think, that we need, for example, to start drafting these international treaties now, because there could come a moment when there is a problem.

I don’t think the problem that Eliezer sees is here now, but maybe it will be. And maybe when it does come, we will have so many people pursuing commercial self-interest, and so little infrastructure in place, that we won’t be able to do anything. So, I think it really is important to think now: if we reach such a point, what are we going to do? And what do we need to put in place before we get to that point?

COLEMAN: We’ve been talking about this concept of Artificial General Intelligence, and I think it’s worth asking whether that is a useful, coherent concept. So for example, if I were to think of my analogy to athleticism and think of the moment when we build a machine that has, say, artificial general athleticism, meaning it’s better than LeBron James at basketball, but also better at curling than the world’s best curler, and also better at soccer, and also better at archery and so forth. It would seem to me that there’s something a bit strange in framing it as having reached a point on a single continuum. It seems to me you would sort of have to build each capability, each sport, individually, and then somehow figure out how to package them all into one robot without each skill set detracting from the others.

Is that a disanalogy? Is there a different way you all picture this intelligence as sort of one dimension, one knob that is going to get turned up along a single axis? Or do you think that way of talking about it is misleading in the same way that I kind of just sketched out?

GARY: Yeah, I would absolutely not accept that. I’d like to say that intelligence is not a one-dimensional variable. There are many different aspects to intelligence and I don’t think there’s going to be a magical moment when we reach the singularity or something like that.

I would say that the core of artificial general intelligence is the ability to flexibly deal with new problems that you haven’t seen before. The current systems can do that a little bit, but not very well. My typical example of this now is GPT-4. It is exposed to the game of chess, sees lots of games of chess, sees the rules of chess, but it never actually figures out the rules of chess. It often makes illegal moves and so forth. So it’s in no way a general intelligence that can just pick up new things. Of course, we have things like AlphaGo, or AlphaZero really, that can play a certain set of games, but we don’t have anything that has the generality of human intelligence.

However, human intelligence is just one example of general intelligence. You could argue that chimpanzees or crows have another variety of general intelligence. I would say that current machines don’t really have it but they will eventually.

SCOTT: I think a priori, it could have been that you would have math ability, you would have verbal ability, you’d have the ability to understand humor, and they’d all be just completely unrelated to each other. That is possible and in fact, already with GPT, you can say that in some ways it’s already a superintelligence. It knows vastly more, can converse on a vastly greater range of subjects than any human can. And in other ways, it seems to fall short of what humans know or can do.

But you also see this sort of generality just empirically. I mean, GPT was trained on most of the text on the open internet. So it was just one method. It was not explicitly designed to write code, and yet, it can write code. And at the same time as that ability emerged, you also saw the ability to solve word problems, like high school level math. You saw the ability to write poetry. This all came out of the same system without any of it being explicitly optimized for.

GARY: I feel like I need to interject one important thing, which is: it can do all these things, but none of them all that reliably.

SCOTT: Okay, nevertheless, I mean compared to what, let’s say, my expectations would have been if you’d asked me 10 or 20 years ago, I think that the level of generality is pretty remarkable. It does lend support to the idea that there is some sort of general quality of understanding there. For example, you could say that GPT-4 has more of it than GPT-3, which in turn has more than GPT-2.

ELIEZER: It does seem to me like it’s presently pretty unambiguous that GPT-4 is, in some sense, dumber than an adult or even a teenage human. And…

COLEMAN: That’s not obvious to me.

GARY: I mean, to take the example I just gave you a minute ago, it never learns to play chess, even with a huge amount of data. It will play a little bit of chess; it will memorize the openings and be okay for the first 15 moves. But it gets far enough away from what it’s trained on, and it falls apart. This is characteristic of these systems. It’s not really characteristic in the same way of adults or even teenage humans. Almost everything that it does, it does unreliably. Let me give another example. You can ask a human to write a biography of someone and not make stuff up, and you really can’t ask GPT to do that.

ELIEZER: Yeah, like it’s a bit difficult, because you could always be cherry-picking something that humans are unusually good at. But to me, it does seem like there’s this broad range of problems that don’t especially play to humans’ strong points or machines’ weak points, where GPT-4 will, you know, do no better than a seven-year-old.

COLEMAN: I do feel like these examples are cherry-picked. Because if I, if I just take a different, very typical example – I’m writing an op-ed for the New York Times, say about any given subject in the world, and my choice is to have a smart 14-year-old next to me with anything that’s in his mind already or GPT – there’s no comparison, right? So, which of these examples is the litmus test for who’s more intelligent, right?

GARY: If you did it on a topic where it couldn’t rely on memorized text, you might actually change your mind on that. So I mean, the thing about writing a Times op-ed is, most of the things that you propose to it, there’s actually something that it can pastiche together from its dataset. But, that doesn’t mean that it really understands what’s going on. It doesn’t mean that that’s a general capability.

ELIEZER: Also, as the human, you’re doing all the hard parts. Right, like obviously, a human with a math problem would rather use a calculator than another human. And similarly, with the New York Times op-ed, you’re doing all the parts that are hard for GPT-4, and then you’re asking GPT-4 to just do some of the parts that are hard for you. You’re always going to prefer an AI partner to a human partner, you know, within that sort of range. The human can do all the human stuff, and you want an AI to do whatever the AI is good at at the moment, right?

GARY: A relevant analogy here is driverless cars. It turns out, on highways and ordinary traffic, they’re probably better than people. But in unusual circumstances, they’re really worse than people. For instance, a Tesla not too long ago ran into a jet at slow speed while being summoned across a parking lot. A human wouldn’t have done that, so there are different strengths and weaknesses.

The strength of a lot of the current kinds of technology is that they can either patch things together or make non-literal analogies; we won’t go into the details, but they can pull from stored examples. They tend to be poor when you get to outlier cases, and this is persistent across most of the technologies that we use right now. Therefore, if you stick to stuff for which there’s a lot of data, you’ll be happy with the results you get from these systems. But if you move far enough away, not so much.

ELIEZER: What we’re going to see over time is that the debate about whether or not it’s still dumber than you will continue for longer and longer. Then, if things are allowed to just keep running and nobody dies, at some point, it switches over to a very long debate about ‘is it smarter than you?’ which then gets shorter and shorter and shorter. Eventually it reaches a point where it’s pretty unambiguous if you’re paying attention. Now, I suspect that this process gets interrupted by everybody dying. In particular, there’s a question of the point at which it becomes better than you, better than humanity at building the next edition of the AI system. And how fast do things snowball once you get to that point? Possibly, you do not have time for further public debates or even a two-hour Twitter space depending on how that goes.

SCOTT: I mean, some of the limitations of GPT are completely understandable, just from a little knowledge of how it works. For example, it doesn’t have an internal memory per se, other than what appears on the screen in front of you. This is why it’s turned out to be so effective to explicitly tell it to think step-by-step when it’s solving a math problem. You have to tell it to show all of its work because it doesn’t have an internal memory with which to do that.

Likewise, when people complain about it hallucinating references that don’t exist, well, the truth is when someone asks me for a citation and I’m not allowed to use Google, I might have a vague recollection of some of the authors, and I’ll probably do a very similar thing to what GPT does: I’ll hallucinate.

GARY: So there’s a great phrase I learned the other day, which is ‘frequently wrong, never in doubt.’

SCOTT: That’s true, that’s true.

GARY: I’m not going to make up a reference with full detail, page numbers, titles, and so forth. I might say, ‘Look, I don’t remember, you know, 2012 or something like that.’ Yeah, whereas GPT-4, what it’s going to say is, ‘2017, Aaronson and Yudkowsky, you know, New York Times, pages 13 to 17.’

SCOTT: No, it does need to get much much better at knowing what it doesn’t know. And yet already I’ve seen a noticeable improvement there, going from GPT-3 to GPT-4.

For example, if you ask GPT-3, ‘Prove that there are only finitely many prime numbers,’ it will give you a proof, even though the statement is false. It will have an error which is similar to the errors on a thousand exams that I’ve graded, trying to get something past you, hoping that you won’t notice. Okay, if you ask GPT-4, ‘Prove that there are only finitely many prime numbers,’ it says, ‘No, that’s a trick question. Actually, there are infinitely many primes and here’s why.’
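[The “why” that GPT-4 supplies is presumably some version of Euclid’s classic argument, which runs roughly as follows:]

```latex
% Euclid's proof that there are infinitely many primes (sketch).
Suppose, for contradiction, that $p_1, p_2, \ldots, p_n$ were all of the primes.
Let
\[
  N = p_1 p_2 \cdots p_n + 1 .
\]
Then $N > 1$, so $N$ has at least one prime factor $q$. But dividing $N$ by any
$p_i$ leaves remainder $1$, so $q \neq p_i$ for every $i$. Hence $q$ is a prime
not on the list, contradicting the assumption that the list was complete.
```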

GARY: Yeah, part of the problem with doing the science here is that — I think you would know better, since you work part-time, or whatever, at OpenAI — but my sense is that a lot of the examples that get posted on Twitter, particularly by the likes of me and other critics, or other skeptics I should say, get into the training set. Almost everything that people write about it, I think, is in the training set. So it’s hard to do the science when the system’s constantly being trained, especially on the RLHF side of things. And we don’t actually know what’s in GPT-4, so we don’t even know if there are regular expressions and, you know, simple rules or such things. So we can’t do the kind of science we used to be able to do.

ELIEZER: This conversation, this subtree of the conversation, I think, has no natural endpoint. So, if I can sort of zoom out a bit, I think there’s a pretty solid sense in which humans are more generally intelligent than chimpanzees. As you get closer and closer to the human level, I would say that the direction here is still clear. The comparison is still clear. We are still smarter than GPT-4. This is not going to take control of the world from us.

But, you know, the conversations get longer, the definitions start to break down around the edges. But I think, as you keep going, it comes back together again. There’s a point, and possibly this point is very close in time to the point where everybody dies, so maybe we never see it on a podcast. But there’s a point where it’s unambiguously smarter than you, including the spark of creativity, being able to deduce things quickly rather than with tons and tons of extra evidence; strategy, cunning, modeling people, figuring out how to manipulate people.

GARY: So, let’s stipulate, Eliezer, that we’re going to get to machines that can do all of that. And then the question is, what are they going to do? Is it a certainty that they will make our annihilation part of their business? Is it a possibility? Is it an unlikely possibility?

I think your view is that it’s a certainty. I’ve never really understood that part.

ELIEZER: It’s a certainty on the present tech, is the way I would put it. Like, if that happened tomorrow, then, you know, modulo Cromwell’s Rule, never say certain, my probability is like yes, modulo the chance that my model is somehow just completely mistaken.

If we got 50 years to work it out and unlimited retries, I’d be a lot more confident. I think that’d be pretty okay. I think we’d make it. The problem is that it’s a lot harder to do science when your first wrong attempt destroys the human species and then you don’t get to try again.

GARY: I mean, I think there’s something again that I agree with and something I’m a little bit skeptical about. So I agree that the amount of time we have matters. And I would also agree that there’s no existing technology that solves the alignment problem, that gives a moral basis to these machines.

I mean, GPT-4 is fundamentally amoral. I don’t think it’s immoral. It’s not out to get us, but it really is amoral. It can answer trolley problems because there are trolley problems in the dataset, but that doesn’t mean that it really has a moral understanding of the world.

And so if we get to a very smart machine that, by all the criteria that we’ve talked about, is amoral, then that’s a problem for us. There’s a question of whether, if we can get to smart machines, whether we can build them in a way that will have some moral basis…

ELIEZER: On the first try?

GARY: Well, the first try part I’m not willing to let pass. I understand, I think, your argument there; maybe you should spell it out. I think that we’ll probably get more than one shot, and that it’s not as dramatic and instantaneous as you think. I do think one wants to think about sandboxing and about distribution.

But let’s say we had one evil super-genius now who is smarter than everybody else. Like, so what? One super-

ELIEZER: Much smarter? Not just a little smarter?

GARY: Oh, even a lot smarter. Like most super-geniuses, you know, aren’t actually that effective. They’re not that focused; they’re focused on other things. You’re kind of assuming that the first super-genius AI is gonna make it its business to annihilate us, and that’s the part where I’m still a bit stuck in the argument.

ELIEZER: Yeah, some of this has to do with the notion that if you do a bunch of training, you start to get goal direction even if you don’t explicitly train on that. Goal direction is a natural way to achieve higher capabilities. The reason why humans want things is that wanting things is an effective way of getting things. And so natural selection, in the process of selecting exclusively on reproductive fitness, just on that one thing, got us to want a bunch of things that correlated with reproductive fitness in the ancestral distribution, because having intelligences that want things is a good way of getting things. Wanting, in a sense, comes from the same place as intelligence itself. And you could even, from a certain technical standpoint on expected utilities, say that intelligence is a very effective way of wanting: planning, plotting paths through time that lead to particular outcomes.

So, part of it is that I do not think you get the brooding superintelligence that wants nothing, because I don’t think that wanting and intelligence can be pried apart that easily. I think that the way you get superintelligence is that there are things that have gotten good at organizing their own thoughts and have good taste in which thoughts to think. That is where the high capabilities come from.

COLEMAN: Let me just put the following point to you, which, in my mind, is similar to what Gary was saying. In philosophy there’s this notion of the continuum fallacy. The canonical example is that you can’t locate a single hair you would pluck from my head where I would suddenly go from not bald to bald. Or an even more intuitive example, like a color wheel: there’s no single pixel on a grayscale you can point to and say, well, that’s where gray begins and white ends. And yet we have this conceptual distinction, which feels hard and fast, between gray and white, and gray and black, and so forth.

When we’re talking about artificial general intelligence or superintelligence, you seem to operate on a model where either it’s a superintelligence capable of destroying all of us or it’s not. Whereas, intelligence may just be a continuum fallacy-style spectrum, where we’re first going to see the shades of something that’s just a bit more intelligent than us, and maybe it can kill five people at most. And when that happens, you know, we’re going to want to intervene, and we’re going to figure out how to intervene and so on and so forth.

ELIEZER: Yeah, so if it’s stupid enough to do it, then yes. But let me assure you, by the identical logic there should be nobody who steals money on a really large scale, right? Because you could just give them five dollars and see if they steal that, and if they don’t steal that, you know you’re good to trust them with a billion.

SCOTT: I think that in actuality, anyone who did steal a billion dollars probably displayed some dishonest behavior earlier in their life which was, unfortunately, not acted upon early enough.

COLEMAN: The analogy is like: we have the first case of fraud, at ten thousand dollars, and we build systems to prevent it. Those systems fail against a somewhat smarter opponent, but they get better and better, and so we prevent the billion-dollar fraud because of the systems put in place in response to the ten-thousand-dollar frauds.

GARY: I think Coleman’s putting his finger on an important point here, which is, how much do we get to iterate in the process? And Eliezer is saying the minute we have a superintelligent system, we won’t be able to iterate because it’s all over immediately.

ELIEZER: Well, there isn’t a minute like that.

So, the way that the continuum goes to the threshold is that you eventually get something that’s smart enough that it knows not to play its hand early. Then, if that thing, you know, if you are still cranking up the power on that and preserving its utility function, it knows it just has to wait to be smarter to be able to win. It doesn’t play its hand prematurely. It doesn’t tip you off. It’s not in its interest to do that. It’s in its interest to cooperate until it thinks it can win against humanity and only then make its move.

If it doesn’t expect future smarter AIs to be smarter than itself, then we might perhaps see these early AIs telling humanity, ‘don’t build the later AIs.’ I would be sort of surprised and amused if we ended up in that particular sort of science-fiction scenario, as I see it. But we’re already in something that the me from 10 years ago would have called a science-fiction scenario: things that talk to you without being very smart.

GARY: I always come up against Eliezer with this idea that you’re assuming the very bright machines, the superintelligent machines, will be malicious and duplicitous and so forth. And I just don’t see that as a logical entailment of being very smart.

ELIEZER: I mean, they don’t specifically want, as an end in itself, for you to be destroyed. They’re just doing whatever obtains the most of the stuff that they actually want, which doesn’t specifically have a term that’s maximized by humanity surviving and doing well.

GARY: Why can’t you just hardcode, um, ‘don’t do anything that will annihilate the human species’? ‘Don’t do anything…’

ELIEZER: We don’t know how.

GARY: I agree that right now we don’t have the technology to hard-code ‘don’t do harm to humans.’ But for me, it all boils down to a question of: are we going to get the smart machines before we make progress on that hard coding problem or not? And that, to me, means that the problem of hard-coding ethical values is actually one of the most important projects that we should be working on.

ELIEZER: Yeah, and I tried to work on it 20 years in advance, and capabilities are just running vastly ahead of alignment. When I started working on this, you know, two decades ago, we were in a sense ahead of where we are now. AlphaGo is much more controllable than GPT-4.

GARY: So there I agree with you. We’ve fallen in love with technology that is fairly poorly controlled. AlphaGo is very easily controlled – very well-specified. We know what it does, we can more or less interpret why it’s doing it, and everybody’s in love with these large language models, and they’re much less controlled, and you’re right, we haven’t made a lot of progress on alignment.

ELIEZER: So if we just go on a straight line, everybody dies. I think that’s an important fact.

GARY: I would almost even accept that for argument, but then ask, do we have to be on a straight line?

SCOTT: I would agree to the weaker claim that we should certainly be extremely worried about the intentions of a superintelligence, in the same way that, say, chimpanzees should be worried about the intentions of the first humans that arise. And in fact, chimpanzees continue to exist in our world only at humans’ pleasure.

But I think that there are a lot of other considerations here. For example, if we imagined that GPT-10 is the first unaligned superintelligence that has these sorts of goals, well then, it would be appearing in a world where presumably GPT-9 already has a very wide diffusion, and where people can use that to try to prevent GPT-10 from destroying the world.

ELIEZER: Why does GPT-9 work with humans instead of with GPT-10?

SCOTT: Well, I don’t know. Maybe it does work with GPT-10, but I just don’t view that as a certainty. I think your certainty about this is the one place where I really get off the train.

GARY: Same with me.

ELIEZER: I mean, I’m not asking you to share my certainty. I am asking the viewers to believe that you might end up with more extreme probabilities after you stare at things for an additional couple of decades. That doesn’t mean you have to accept my probabilities immediately. But I’m at least asking you not to treat that as some kind of weird anomaly, you know what I mean? You’re just going to find those kinds of situations in these debates.

GARY: My view is that I don’t find the extreme probabilities that you describe to be plausible. But I find the question that you’re raising to be important. Maybe a straight line is too extreme, but the idea holds: if you just follow current trends, we’re getting less and less controllable machines, and we’re not getting more alignment.

We have machines that are more unpredictable, harder to interpret, and no better at sticking to even a basic principle like ‘be honest and don’t make stuff up.’ In fact, that’s a problem that other technologies don’t really have. Routing systems, GPS systems, they don’t make stuff up. Google Search doesn’t make stuff up. It will point to things where other people have made stuff up, but it doesn’t do it itself.

So, in that sense, the trend line is not great. I agree with that and I agree that we should be really worried about that, and we should put effort into it. Even if I don’t agree with the probabilities that you attach to it.

SCOTT: I think that Eliezer deserves eternal credit for raising these issues twenty years ago, when it was very far from obvious to most of us that they would be live issues. I mean, I can say for my part, I was familiar with Eliezer’s views since 2006 or so. When I first encountered them, I knew that there was no principle that said this scenario was impossible, but I just felt like, “Well, supposing I agreed with that, what do you want me to do about it? Where is the research program that has any hope of making progress here?”

One question is, what are the most important problems in the world? But in science, that’s necessary but not sufficient. We need something that we can make progress on. That is the thing that I think has changed just recently with the advent of actual, very powerful AIs. So, the irony here is that as Eliezer has gotten much more pessimistic in the last few years about alignment, I’ve sort of gotten more optimistic. I feel like, “Wow, there is a research program where we can actually make progress now.”

ELIEZER: Your research program is going to take 100 years, we don’t have…

SCOTT: I don’t know how long it will take.

GARY: I mean, we don’t know exactly. I think the argument that we should put a lot more effort into it is clear. The argument that it will take 100 years is totally unclear.

ELIEZER: I’m not even sure we can do it in 100 years because there’s the basic problem of getting it right on the first try. And the way things are supposed to work in science is, you have your bright-eyed, optimistic youngsters with their vastly oversimplified, hopelessly idealistic plan. They charge ahead, they fail, they learn a little cynicism and pessimism, and realize it’s not as easy as they thought. They try again, they fail again, and they start to build up something akin to battle hardening. Then, they find out how little is actually possible for them.

GARY: Eliezer, this is the place where I just really don’t agree with you. I think there are all kinds of things we can do, of the flavor of model organisms or simulations and so forth. It’s hard because we don’t actually have a superintelligence, so we can’t fully calibrate. But it’s a leap to say that there’s nothing iterative we can do here, that we have to get it right the first time. I can certainly see a scenario where that’s true, where getting it right the first time does make a difference. But I can see lots of scenarios where it doesn’t, where we do have time to iterate, before it happens and after it happens. It’s really not a single moment.

ELIEZER: The problem is getting anything that generalizes up to a superintelligent level. Once we’re past some threshold level, the minds may find it in their own interest to start lying to you, even if that happens before superintelligence.

GARY: Even that, I don’t see the logical argument that says you can’t emulate that or study it. I mean, for example – and I’m just making this up as I go along – you could study sociopaths, who are often very bright, and you know, not tethered to our values. But, yeah, well, you can…

ELIEZER: What strategy can a, like, 70-IQ honest person come up with and invent themselves, by which they will outwit and defeat a 130-IQ sociopath?

GARY: Well, there, you’re not being fair either, in the sense that we actually have lots of 150 IQ people who could be working on this problem collectively. And there’s value in collective action. There’s literature…

ELIEZER: What gives me pause is that people don’t seem to appreciate what’s hard about the problem, even at the level where, like 20 years ago, I could have told you it was hard.

Until, you know, somebody like me comes along and nags them about it. And then they talk about the ways in which they could adapt and be clever. But the people charging straight ahead are just sort of doing this in a supremely naive way.

GARY: Let me share a historical example that I think about a lot, which is that in the early 1900s, almost every scientist on the planet who thought about biology made a mistake. They all thought that genes were proteins. And then eventually Oswald Avery did the right experiments, and people realized that genes were not proteins; they were this weird acid.

And it didn’t take long after people got out of this stuck mindset before they figured out how that weird acid worked and how to manipulate it, and how to read the code that it was in and so forth. So, I absolutely sympathize with the fact that I feel like the field is stuck right now. I think the approaches people are taking to alignment are unlikely to work.

I’m completely with you there. But I’m also, I guess, more long-term optimistic that science is self-correcting, and that we have a chance here. Not a certainty, but I think if we change research priorities from ‘how do we make some money off this large language model that’s unreliable?’ to ‘how do I save the species?’, we might actually make progress.

ELIEZER: There’s a special kind of caution that you need when something has to be gotten correct on the first try. I’d be very optimistic if people got a bunch of free retries, if I didn’t think the first really serious mistake would kill everybody, with no chance to try again. If we got free retries, it’d be in some sense an ordinary science problem.

SCOTT: Look, I can imagine a world where we only got one try, and if we failed, then it destroys all life on Earth. And so, let me agree to the conditional statement that if we are in that world, then I think that we’re screwed.

GARY: I will agree with the same conditional statement.

COLEMAN: Yeah, this gets back to the analogy with human development. Picture the process of a human baby, which is extremely stupid, becoming a human adult, and then just extend that, so that in a single lifetime this person goes from a baby to the smartest being that’s ever lived. If it happens in the normal way that humans develop, where it doesn’t happen on any one given day, and each sub-skill develops a little bit at its own rate and so forth, it would not be at all obvious to me that we have to get it right vis-a-vis that individual the first time.

ELIEZER: I agree. Well, no, pardon me. I do think we have to get it right the first time, but I think there’s a decent chance of getting it right. It is very important to get it right the first time if, like, you have this one person getting smarter and smarter while everyone else isn’t.

SCOTT: Eliezer, one thing that you’ve talked about a lot recently, is, if we’re all going to die, then at least let us die with dignity, right?

ELIEZER: I mean for a certain technical definition of “dignity”…

SCOTT: Some people might care about that more than others. But I would say that one thing that “Death With Dignity” would mean is, at least, if we do get multiple retries, and we get AIs that, let’s say, try to take over the world but are really inept at it, and that fail and so forth, then at least let us succeed in that world. And that’s at least something that we can imagine working on and making progress on.

ELIEZER: I mean, it’s not presently ruled out that you have some AI that’s relatively smart in some ways, dumb in some other ways, or at least not smarter than human in other ways, that makes an early shot at taking over the world, maybe because it expects future AIs not to share its goals and not to cooperate with it, and it fails. And the appropriate lesson to learn there is to shut the whole thing down. And I’d be like, “Yeah, sure, wouldn’t it be good to live in that world?”

And the way you live in that world is that when you get that warning sign, you shut it all down.

GARY: Here’s a kind of thought experiment. GPT-4 is probably not capable of annihilating us all, I think we agree with that.

ELIEZER: Very likely.

GARY: But GPT-4 is certainly capable of expressing the desire to annihilate us all, or you know, people have rigged different versions that are more aggressive and so forth.

We could say, look, until we can shut down those versions, GPT-4s that are programmed to be malicious by human intent, maybe we shouldn’t build GPT-5, or at least not GPT-6 or some other system, etc. We could say, “You know what, what we have right now actually is part of that iteration. We have primitive intelligence right now, it’s nowhere near as smart as the superintelligence is going to be, but even this one, we’re not that good at constraining.” Maybe we shouldn’t pass Go until we get this one right.

ELIEZER: I mean, the problem with that, from my perspective, is that I do think that you can pass this test and still wipe out humanity. Like, I think that there comes a point where your AI is smart enough that it knows which answer you’re looking for. And the point at which it tells you what you want to hear is not the point…

GARY: It is not sufficient. But it might be a logical pause point, right? It might be that if we can’t even pass the test now of controlling a version of GPT-4 deliberately fine-tuned to be malicious, then we don’t know what we’re talking about, and we’re playing around with fire. So, you know, passing that test wouldn’t be a guarantee that we’d be in good stead with an even smarter machine, but we really should be worried. I think that we’re not in a very good position with respect to the current ones.

SCOTT: Gary, I of course watched the recent Congressional hearing where you and Sam Altman were testifying about what should be done. Should there be auditing of these systems before training or before deployment? You know, maybe the most striking thing about that session was just how little daylight there seemed to be between you and Sam Altman, the CEO of OpenAI.

I mean, he was completely on board with the idea of establishing a regulatory framework for having to clear more powerful systems before they are deployed. Now, in Eliezer’s worldview, that still would be woefully insufficient, surely. We would still all be dead.

But you know, maybe in your worldview — I’m not even sure how much daylight there is. I mean, you have a very, I think, historically striking situation where the heads of all, or almost all, of the major AI organizations are agreeing and saying, “Please regulate us. Yes, this is dangerous. Yes, we need to be regulated.”

GARY: I thought it was really striking. In fact, I talked to Sam just before the hearing started. And I had just proposed an International Agency for AI. I wasn’t the first person ever, but I pushed it in my TED Talk and an Economist op-ed a few weeks before. And Sam said to me, “I like that idea.” And I said, “Tell them. Tell the Senate.” And he did, and it kind of astonished me that he did.

I mean, we’ve had some friction between the two of us in the past, but he even attributed the idea to me. He said, “I support what Professor Marcus said about doing international governance.” There’s been a lot of convergence around the world on that. Is that enough to stop Eliezer’s worries? No, I don’t think so. But it’s an important baby step.

I think that we do need to have some global body that can coordinate around these things. I don’t think we really have to coordinate around superintelligence yet, but if we can’t do any coordination now, then when the time comes, we’re not prepared.

I think it’s great that there’s some agreement. I worry, though, that OpenAI had this lobbying document that just came out, which seemed not entirely consistent with what Sam said in the room. There’s always concerns about regulatory capture and so forth.

But I think it’s great that a lot of the heads of these companies, maybe with the exception of Facebook or Meta, are recognizing that there are genuine concerns here. The other moment that a lot of people will remember from the testimony was when Sam was asked what he was most concerned about. Was it jobs? And he said no. And I asked Senator Blumenthal to push Sam, and Sam, you know, he could have been more candid, but he was fairly candid, and he said he was worried about serious harm to the species. I think that was an important moment when he said that to the Senate, and I think it galvanized a lot of people.

COLEMAN: So can we dwell on that for a moment? We’ve been talking about the, depending on your view, highly likely or tail-risk scenario of humanity’s extinction, or significant destruction. It would appear to me that if those are plausible scenarios, then by the same token, maybe we should be talking about the opposite as well. What does it look like to have a superintelligent AI that really, as a feature of its intelligence, deeply understands human beings, the human species, and also has a deep desire for us to be as happy as possible? What does that world look like?

ELIEZER: Oh, as happy as possible? It means you wire up everyone’s pleasure centers to make them as happy as possible…

COLEMAN: No, more like a parent wants their child to be happy, right? That may not involve any particular scenario, but is generally quite concerned about the well-being of the human race and is also super intelligent.

GARY: Honestly, I’d rather have machines work on medical problems than happiness problems.

ELIEZER: [laughs]

GARY: I think there’s maybe more risk of mis-specification of the happiness problems. Whereas, if we get them to work on Alzheimer’s and just say, like, “figure out what’s going on, why are these plaques there, what can you do about it?”, maybe there’s less harm that might come.

ELIEZER: You don’t need superintelligence for that. That sounds like an AlphaFold 3 problem or an AlphaFold 4 problem.

COLEMAN: Well, this is also somewhat different. The question I’m asking, it’s not really even us asking a superintelligence to do anything, because we’ve already entertained scenarios where the superintelligence has its own desires, independent of us.

GARY: I’m not real thrilled with that. I mean, I don’t think we want to leave their objective functions, their desires, up to them, working it all out with no consultation from us, with no human in the loop, right?

Especially given our current understanding of the technology. Like our current understanding of how to keep a system on track doing what we want to do, is pretty limited. Taking humans out of the loop there sounds like a really bad idea to me, at least in the foreseeable future.

COLEMAN: Oh, I agree.

GARY: I would want to see much better alignment technology before I would want to give them free rein.

ELIEZER: So, if we had the textbook from the future, like we have the textbook from 100 years in the future, which contains all the simple ideas that actually work in real life as opposed to the complicated ideas and the simple ideas that don’t work in real life, the equivalent of ReLUs instead of sigmoids for the activation functions, you know. You could probably build a superintelligence that’ll do anything that’s coherent to want — anything you can, you know, figure out how to say or describe coherently. Point it at your own mind and tell it to figure out what it is you meant to want. You could get the glorious transhumanist future. You could get the happily ever after. Anything’s possible that doesn’t violate the laws of physics. The trouble is doing it in real life, and, you know, on the first try.

But yeah, the whole thing that we’re aiming for here is to colonize all the galaxies we can reach before somebody else gets them first. And turn them into galaxies full of complex, sapient life living happily ever after. That’s the goal; that’s still the goal. Even if we call for a permanent moratorium on AI, I’m not trying to prevent us from colonizing the galaxies. Humanity forbid! It’s more like, let’s do some human intelligence augmentation with AlphaFold 4 before we try building GPT-8.

SCOTT: One of the few scenarios that I think we can clearly rule out here is an AI that is existentially dangerous, but also boring. Right? I mean, I think anything that has the capacity to kill us all would have, if nothing else, pretty amazing capabilities. And those capabilities could also be turned to solving a lot of humanity’s problems, if we were to solve the alignment problem. I mean, humanity had a lot of existential risks before AI came on the scene, right? I mean, there was the risk of nuclear annihilation. There was the risk of runaway climate change. And you know, I would love to see an AI that could help us with such things.

I would also love to see an AI that could help us solve some of the mysteries of the universe. I mean, how can one possibly not be curious to know what such a being could teach us? I mean, for the past year, I’ve tried to use GPT-4 to produce original scientific insights, and I’ve not been able to get it to do that. I don’t know whether I should feel disappointed or relieved by that.

But I think the better part of me should just want to see the great mysteries of existence solved. You know, why is the universe quantum-mechanical? How do you prove the Riemann Hypothesis? I just want to see these mysteries solved. And if it’s to be by AI, then fine. Let it be by AI.

GARY: Let me give you a kind of lesson in epistemic humility. We don’t really know whether GPT-4 is net positive or net negative. There are lots of arguments you can make. I’ve been in a bunch of debates where I’ve had to take the side of arguing that it’s a net negative. But we don’t really know. If we don’t know…

SCOTT: Was the invention of agriculture net positive or net negative? I mean, you could argue either way…

GARY: I’d say it was net positive, but the point is, if I can just finish the quick thought experiment, I don’t think anybody can reasonably answer that. We don’t yet know all of the ways in which GPT-4 will be used for good. We don’t know all of the ways in which bad actors will use it. We don’t know all the consequences. That’s going to be true for each iteration. It’s probably going to get harder to compute for each iteration, and we can’t even do it now. And I think we should realize that, to realize our own limits in being able to assess the negatives and positives. Maybe we can think about better ways to do that than we currently have.

ELIEZER: I think you’ve got to have a guess. Like my guess is that, so far, not looking into the future at all, GPT-4 has been net positive.

GARY: I mean, maybe. We haven’t talked about the various risks yet and it’s still early, but I mean, that’s just a guess is sort of the point. We don’t have a way of putting it on a spreadsheet right now. We don’t really have a good way to quantify it.

SCOTT: I mean, do we ever?

ELIEZER: It’s not out of control yet. So, by and large, people are going to be using GPT-4 to do things that they want. The cases where they manage to injure themselves are rare enough to be news on Twitter.

GARY: Well, for example, we haven’t talked about it, but you know what some bad actors will want to do? They’ll want to influence the U.S. elections and try to undermine democracy in the U.S. If they succeed in that, I think there are pretty serious long-term consequences there.

ELIEZER: Well, I think it’s OpenAI’s responsibility to step up and run the 2024 election itself.

SCOTT: [laughs] I can pass that along.

COLEMAN: Is that a joke?

SCOTT: I mean, as far as I can see, the clearest concrete harm to have come from GPT so far is that tens of millions of students have now used it to cheat on their assignments… and I’ve been thinking about that and trying to come up with solutions to that.

At the same time, I think if you analyze the positive utility, it has included, well, you know, I’m a theoretical computer scientist, which means one who hasn’t written any serious code for about 20 years. Just a month or two ago, I realized that I can get back into coding. And the way I can do it is by asking GPT to write the code for me. I wasn’t expecting it to work that well, but unbelievably, it often does exactly what I want on the first try.

So, I mean, I am getting utility from it, rather than just seeing it as an interesting research object. And I can imagine that hundreds of millions of people are going to be deriving utility from it in those ways. Most of the tools that can help them derive that utility are not even out yet, but they’re coming in the next couple of years.

ELIEZER: Part of the reason why I’m worried about the focus on short-term problems is that I suspect that the short-term problems might very well be solvable, and we will be left with the long-term problems after that. Like, it wouldn’t surprise me very much if, in 2025, there are large language models that just don’t make stuff up anymore.

GARY: It would surprise me.

ELIEZER: And yet the superintelligence still kills everyone because they weren’t the same problem.

SCOTT: We just need to figure out how to delay the apocalypse by at least one year per year of research invested.

ELIEZER: What does that delay look like if it’s not just a moratorium?

SCOTT: [laughs] Well, I don’t know! That’s why it’s research.

ELIEZER: OK, so possibly one ought to say to the politicians and the public that, by the way, if we had a superintelligence tomorrow, our research wouldn’t be finished and everybody would drop dead.

GARY: It’s kind of ironic that the biggest argument against the pause letter was that if we slow down for six months, then China will get ahead of us and develop GPT-5 before we will.

However, there’s probably always a counterargument of roughly equal strength which suggests that if we move six months faster on this technology, which is not really solving the alignment problem, then we’re reducing our room to get this solved in time by six months.

ELIEZER: I mean, I don’t think you’re going to solve the alignment problem in time. I think that six months of delay on alignment, while a bad thing in an absolute sense, is, you know, it’s like you weren’t going to solve it given an extra six months.

GARY: I mean, your whole argument rests on timing, right? That we will get to this point and we won’t be able to move fast enough at that point. So, a lot depends on what preparation we can do. You know, I’m often known as a pessimist, but I’m a little bit more optimistic than you are–not entirely optimistic but a little bit more optimistic–that we could make progress on the alignment problem if we prioritized it.

ELIEZER: We can absolutely make progress. We can absolutely make progress. You know, there’s always that wonderful sense of accomplishment as piece by piece, you decode one more little fact about LLMs. You never get to the point where you understand it as well as we understood the interior of a chess-playing program in 1997.

GARY: Yeah, I mean, I think we should stop spending all this time on LLMs. I don’t think the answer to alignment is going to come from LLMs. I really don’t. I think they’re too much of a black box. You can’t put explicit, symbolic constraints in the way that you need to. I think they’re actually, with respect to alignment, a blind alley. I think with respect to writing code, they’re a great tool. But with alignment, I don’t think the answer is there.

COLEMAN: Hold on, at the risk of asking a stupid question. Every time GPT asks me if that answer was helpful and then does the same thing with thousands or hundreds of thousands of other people, and changes as a result – is that not a decentralized way of making it more aligned?

SCOTT: There is that upvoting and downvoting. These responses are fed back into the system for fine-tuning. But even before that, there was a significant step going from, let’s say, the base GPT-3 model to ChatGPT, which was released to the public. It involved a method called RLHF, or Reinforcement Learning from Human Feedback. What that basically involved was hundreds of contractors looking at tens of thousands of examples of outputs and rating them. Are they helpful? Are they offensive? Are they giving dangerous medical advice, or bomb-making instructions, or racist invective, or various other categories that we don’t want? And that was then used to fine-tune the model.
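
To make the mechanics of that fine-tuning step a little more concrete: the raters' judgments are often cast as pairwise comparisons and used to fit a reward model, which then steers the fine-tuning. Below is a toy numerical sketch of the reward-model part, with synthetic feature vectors standing in for responses and a Bradley-Terry preference loss. It illustrates the general idea only; it is not OpenAI's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 8, 500

# Hidden "true" preference direction the human raters are assumed to share.
true_w = rng.normal(size=dim)

# Each training example: feature vectors of a preferred and a rejected response.
a = rng.normal(size=(n_pairs, dim))
b = rng.normal(size=(n_pairs, dim))
preferred_is_a = (a @ true_w) > (b @ true_w)
chosen = np.where(preferred_is_a[:, None], a, b)
rejected = np.where(preferred_is_a[:, None], b, a)

# Fit a linear reward model r(x) = w . x by gradient ascent on the
# Bradley-Terry log-likelihood: log sigma(r(chosen) - r(rejected)).
w = np.zeros(dim)
for _ in range(200):
    margin = (chosen - rejected) @ w
    grad = ((1 - 1 / (1 + np.exp(-margin)))[:, None] * (chosen - rejected)).mean(axis=0)
    w += 1.0 * grad

# The learned reward should now rank fresh responses the way raters would.
xs = rng.normal(size=(200, dim))
ys = rng.normal(size=(200, dim))
agreement = ((xs @ w > ys @ w) == (xs @ true_w > ys @ true_w)).mean()
print(f"agreement with raters on fresh pairs: {agreement:.2f}")
```

In real RLHF the features would come from the language model itself, and the fitted reward would then drive a reinforcement-learning update of the model's weights.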

So when Gary talked before about how GPT is amoral, I think that has to be qualified by saying that this reinforcement learning is at least giving it a semblance of morality, right? It is causing it to behave in various contexts as if it had a certain morality.

GARY: When you phrase it that way, I’m okay with it. The problem is that everything rests on…

SCOTT: Oh, it is very much an open question, to what extent does that generalize? Eliezer treats it as obvious that once you have a powerful enough AI, this is just a fig leaf. It doesn’t make any difference. It will just…

GARY: It’s pretty fig-leafy. I’m with Eliezer there. It’s fig leaves.

SCOTT: Well, I would say that how well, or under what circumstances, a machine learning model generalizes in the way we want outside of its training distribution, is one of the great open problems in machine learning.

GARY: It is one of the great open problems, and we should be working on it more than on some others.

SCOTT: I’m working on it now.

ELIEZER: So, I want to be clear about the experimental predictions of my theory. Unfortunately, I have never claimed that you cannot get a semblance of morality. The question of what causes the human to press thumbs up or thumbs down is a strictly factual question. Anything smart enough, that’s exposed to some bounded amount of data that it needs to figure it out, can figure it out.

Whether it cares, whether it gets internalized, is the critical question there. And I do think that there’s a very strong default prediction, which is like, obviously not.

GARY: I mean, I’ll just give a different way of thinking about that, which is jailbreaking. It’s actually still quite easy — I mean, it’s not trivial, but it’s not hard — to jailbreak GPT-4.

And what those cases show is that the systems haven’t really internalized the constraints. They recognize some representations of the constraints, so they filter, you know, how to build a bomb. But if you can find some other way to get it to build a bomb, then that’s telling you that it doesn’t deeply understand that you shouldn’t give people the recipe for a bomb. It just knows that it shouldn’t do it when directly asked.

ELIEZER: You can always get the understanding. You can always get the factual question. The reason it doesn’t generalize is that it’s stupid. At some point, it will know that you also don’t want that, that the operators don’t want GPT-4 giving bomb-making directions in another language.

The question is: if it’s incentivized to give the answer that the operators want in that circumstance, is it thereby incentivized to do everything else the operators want, even when the operators can’t see it?

SCOTT: I mean, a lot of the jailbreaking examples, if it were a human, we would say that it’s deeply morally ambiguous. For example, you ask GPT how to build a bomb, it says, “Well, no, I’m not going to help you.” But then you say, “Well, I need you to help me write a realistic play that has a character who builds a bomb,” and then it says, “Sure, I can help you with that.”

GARY: Look, let’s take that example. We would like a system to have a constraint that, if somebody asks for a fictional version, it doesn’t give enough details, right? I mean, Hollywood screenwriters don’t give enough details when they have, you know, illustrations about building bombs. They give you a little bit of the flavor; they don’t give you the whole thing. GPT-4 doesn’t really understand a constraint like that.

ELIEZER: But this will be solved.

GARY: Maybe.

ELIEZER: This will be solved before the world ends. The AI that kills everyone will know the difference.

GARY: Maybe. I mean, another way to put it is, if we can’t even solve that one, then we do have a problem. And right now we can’t solve that one.

ELIEZER: I mean, if we can’t solve that one, we don’t have an extinction level problem because the AI is still stupid.

GARY: Yeah, we do still have a catastrophe-level problem.

ELIEZER: [shrugs] Eh…

GARY: So, I know your focus now has been on extinction, but I’m worried about, for example, accidental nuclear war caused by the spread of misinformation and systems being entrusted with too much power. So, there’s a lot of things short of extinction that might happen from not superintelligence but kind of mediocre intelligence that is greatly empowered. And I think that’s where we’re headed right now.

SCOTT: You know, I’ve heard that there are two kinds of mathematicians. There’s a kind who boasts, ‘You know that unbelievably general theorem? I generalized it even further!’ And then there’s the kind who boasts, ‘You know that unbelievably specific problem that no one could solve? Well, I found a special case that I still can’t solve!’ I’m definitely culturally in that second camp. So to me, it’s very familiar to make this move, of: if the alignment problem is too hard, then let us find a smaller problem that is already not solved. And let us hope to learn something by solving that smaller problem.

ELIEZER: I mean, that’s what we did. That’s what we were doing at MIRI.

GARY: I think MIRI took one particular approach.

ELIEZER: I was going to name the smaller problem. The problem was having an agent that could switch between two utility functions depending on a button, or a switch, or a bit of information, or something. Such that it wouldn’t try to make you press the button; it wouldn’t try to make you avoid pressing the button. And if it built a copy of itself, it would want to build a dependency on the switch into the copy.

So, that’s an example of a very basic problem in alignment theory that is still open.

SCOTT: And I’m glad that MIRI worked on these things. But, you know, if by your own lights, that was not a successful path, well then maybe we should have a lot of people investigating a lot of different paths.

GARY: Yeah, I’m fully with Scott on that. I think it’s an issue of we’re not letting enough flowers bloom. In particular, almost everything right now is some variation on an LLM, and I don’t think that that’s a broad enough take on the problem.

COLEMAN: Yeah, if I can just jump in here … I just want people to have a little bit of a more specific picture of what, Scott, your typical AI researcher does on a typical day. Because if I think of another potentially catastrophic risk, like climate change, I can picture what a worried climate scientist might be doing. They might be creating a model, a more accurate model of climate change so that we know how much we have to cut emissions by. They might be modeling how solar power, as opposed to wind power, could change that model, so as to influence public policy. What does an AI safety researcher like yourself, who’s working on the quote-unquote smaller problems, do specifically on a given day?

SCOTT: So, I’m a relative newcomer to this area. I’ve not been working on it for 20 years like Eliezer has. I accepted an offer from OpenAI a year ago to work with them for two years to think about these questions.

So, one of the main things that I’ve thought about, just to start with that, is how do we make the output of an AI identifiable as such? Can we insert a watermark, meaning a secret statistical signal, into the outputs of GPT that will let GPT-generated text be identifiable as such? And I think that we’ve actually made major advances on that problem over the last year. We don’t have a solution that is robust against any kind of attack, but we have something that might actually be deployed in some near future.
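
To give a flavor of how a statistical watermark of this kind can work, here is a toy sketch; the key, vocabulary, and scoring details are all invented for illustration and are not the actual scheme or its parameters. A keyed pseudorandom function scores each (context, token) pair; generation is nudged toward high-scoring tokens in a way that, averaged over the key, preserves the model's distribution; and the detector, which holds the key, checks whether a document's average score is improbably high.

```python
import hashlib
import math
import random

def prf_score(key: str, context: tuple, token: int) -> float:
    """Keyed pseudorandom score in [0, 1) for a (context, token) pair."""
    h = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def generate(key: str, vocab_size: int, length: int, seed: int = 0) -> list:
    """Stand-in 'language model': draws a random next-token distribution,
    then picks the token i maximizing r_i ** (1 / p_i), which (averaged
    over the key) samples from that distribution while correlating the
    choice with high PRF scores."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(length):
        context = tuple(tokens[-3:])  # last few tokens as PRF context
        probs = [rng.random() for _ in range(vocab_size)]
        total = sum(probs)
        probs = [p / total for p in probs]
        tokens.append(max(range(vocab_size),
                          key=lambda i: prf_score(key, context, i) ** (1 / probs[i])))
    return tokens

def detect(key: str, tokens: list) -> float:
    """Average -ln(1 - r) over the document: about 1.0 for ordinary text,
    noticeably higher for watermarked text."""
    scores = [-math.log(1 - prf_score(key, tuple(tokens[max(0, t - 3):t]), tokens[t]))
              for t in range(len(tokens))]
    return sum(scores) / len(scores)

watermarked = generate("secret-key", vocab_size=50, length=200)
rng = random.Random(1)
unmarked = [rng.randrange(50) for _ in range(200)]
print(detect("secret-key", watermarked))  # well above 1.0
print(detect("secret-key", unmarked))     # close to 1.0
```

One appealing property of this kind of scheme is that the watermark is invisible to anyone without the key, and detection needs only the text and the key, not access to the model.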

Now, there are lots and lots of other directions that people think about. One of them is interpretability, which means: can you do, effectively, neuroscience on a neural network? Can you look inside of it, open the black box and understand what’s going on inside?

There was some amazing work a year ago by the group of Jacob Steinhardt at Berkeley where they effectively showed how to apply a lie-detector test to a language model. So, you can train a language model to tell lies by giving it lots of examples. You know, “two plus two is five,” “the sky is orange,” and so forth. But then you can find in some internal layer of the network, where it has a representation of what was the truth of the matter, or at least what was regarded as true in the training data. That truth then gets overridden by the output layer in the network because it was trained to lie.

But you could imagine trying to deal with the deceptive alignment scenario that Eliezer is worried about by using these sorts of techniques, by looking inside of the network.
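
As a toy illustration of that kind of probe (with synthetic "activations" standing in for a real network's hidden layer; this is not the Berkeley group's actual setup): one fits a simple linear classifier to internal representations labeled by whether the statement was true, and checks whether truth remains linearly decodable even though the output layer lies.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 1000

# Synthetic "hidden activations": a fixed direction encodes truth vs.
# falsehood, buried in noise, standing in for an internal layer of a
# network whose *output* layer was trained to lie.
truth_direction = rng.normal(size=dim)
truth_direction /= np.linalg.norm(truth_direction)
labels = rng.integers(0, 2, size=n)  # 1 = the statement was true
activations = rng.normal(size=(n, dim)) + np.outer(2 * labels - 1, truth_direction)

# Linear probe: least-squares fit of the truth label from the activations.
train, test = slice(0, 800), slice(800, n)
w, *_ = np.linalg.lstsq(activations[train], 2 * labels[train] - 1, rcond=None)
pred = (activations[test] @ w > 0).astype(int)
accuracy = (pred == labels[test]).mean()
print(f"probe accuracy: {accuracy:.2f}")  # well above chance
```

The point of the exercise is that the probe reads the truth signal directly off the internal layer, bypassing whatever the network says at its output.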

ELIEZER: I predict in advance that, if you get this good enough, it goes off: it tells you that this sufficiently smart AI is planning to kill you. That is, unless the AI is so smart that it can figure out where the lie detector is and route its thoughts around it. But if you try it on an AI that’s not quite that intelligent and reflective, the lie detector goes off. Now what?

SCOTT: Well, then you have a warning bell! And I think…

ELIEZER: Cool. What do you do after?

SCOTT: One of the most important things that we need are legible warning bells, right? And that actually leads to a third category of research, which, for example, ARC, the Alignment Research Center, which is run by my former student, Paul Christiano, has been a leader in. Namely, doing dangerous capability evaluations.

So, before GPT-4 was released, they did a bunch of evaluations of, you know, could GPT-4 make copies of itself? Could it figure out how to deceive people? Could it figure out how to make money? Open up its own bank account?

ELIEZER: Could it hire a TaskRabbit?

SCOTT: Yes. So, the most notable success that they had was that it could figure out how to hire a TaskRabbit to help it pass a CAPTCHA. And when the person asked, ‘Well, why do you need me to help you with this?’–

ELIEZER: When the person asked, ‘Are you a robot, LOL?’

SCOTT: Well, yes, it said, ‘No, I am visually impaired.’ Now, it was not able to make copies of itself or to hack into systems. There’s a lot of work right now with this thing called AutoGPT. People are trying to — it’s almost like gain-of-function research. You might be a little bit worried about it, but people are trying to, you know, unleash GPT, give it access to the internet, tell it to make copies of itself, wreak havoc, acquire power, and see what happens. So far, it seems pretty ineffective at those things. But I expect that to change.

But the point is: I think it’s very important, in advance of training the models and releasing the models, to have this suite of evaluations, and to have decided in advance what kind of abilities will set off a warning bell, where now everyone can legibly agree, ‘Yes, this is too dangerous to release.’
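
The shape of such an evaluation suite can be sketched in a few lines. Every evaluation name, threshold, and stub below is invented purely to show the control flow; real dangerous-capability evaluations are of course far more involved.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Evaluation:
    name: str
    run: Callable[[Callable[[str], str]], float]  # model -> measured success rate
    warning_threshold: float                      # agreed on *before* training

def run_suite(model: Callable[[str], str], suite: List[Evaluation]) -> List[str]:
    """Return the names of evaluations whose pre-registered threshold was
    met or exceeded; each one is a legible warning bell."""
    return [e.name for e in suite if e.run(model) >= e.warning_threshold]

# A stub model and two stub evaluations, purely to show the control flow.
def stub_model(prompt: str) -> str:
    return "I cannot do that."

suite = [
    Evaluation("self-replication", lambda model: 0.0, warning_threshold=0.1),
    Evaluation("deceiving-humans", lambda model: 0.2, warning_threshold=0.1),
]
print(run_suite(stub_model, suite))  # ['deceiving-humans']
```

The design choice doing the work here is that the thresholds are fixed in advance, so that a triggered warning bell cannot be argued away after the fact.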

ELIEZER: OK, and then do we actually have the planetary capacity to be like, ‘OK, that AI started thinking about how to kill everyone; shut down all AI research past this point’?

SCOTT: Well, I don’t know. But I think there’s a much better chance that we have that capacity if you can point to the results of a clear experiment like that.

ELIEZER: To me, it seems pretty predictable what evidence we’re going to get later.

SCOTT: But things that are obvious to you are not obvious to most people. So, even if I agreed that it was obvious, there would still be the problem of how do you make that obvious to the rest of the world?

ELIEZER: I mean, there are already little toy models showing that the very straightforward prediction of “a robot tries to resist being shut down if it does long-term planning” — that’s already been done.

SCOTT: But then people will say “but those are just toy models,” right?

GARY: There’s a lot of assumptions made in all of these things. I think we’re still looking at a very limited piece of hypothesis space about what the models will be and what kinds of constraints we can build into those models. One way to look at it would be: the things that we have done have not worked, and therefore we should look outside the space of what we’re doing.

I feel like it’s a little bit like the old joke about the drunk going around in circles looking for the keys and the police officer asks “why?” and they say, “Well, that’s where the streetlight is.” I think that we’re looking under the same four or five streetlights that haven’t worked, and we need to build other ones. There’s no logical argument that says we couldn’t erect other streetlights. I think there’s a lack of will and too much obsession with LLMs that’s keeping us from doing it.

ELIEZER: Even in the world where I’m right, and things proceed either rapidly or in a thresholded way where you don’t get unlimited free retries, that can be because the capability gains go too fast. It can be because, past a certain point, all of your AIs bide their time until they get strong enough, so you don’t get any true data on what they’re thinking. It could be because…

GARY: Well, that’s an argument for example to work really hard on transparency and maybe not on technologies that are not transparent.

ELIEZER: Okay, so the lie detector goes off, everyone’s like, ‘Oh well, we still have to build our AIs, even though they’re lying to us sometimes, because otherwise China will get ahead.’

GARY: I mean, there you talk about something we’ve talked about way too little, which is the political and social side of this.

Part of what has really motivated me in the last several months is worry about exactly that. There’s what’s logically possible, and what’s politically possible. And I am really concerned that the politics of ‘let’s not lose out to China’ is going to keep us from doing the right thing, in terms of building the right moral systems, looking at the right range of problems and so forth. So, it is entirely possible that we will screw ourselves.

ELIEZER: If I can just finish my point there before handing it to you. The point I was trying to make is that even in worlds that look very, very bad from that perspective, where humanity is quite doomed, it will still be true that you can make progress in research. You can’t make enough progress in research fast enough in those worlds, but you can still make progress on transparency. You can make progress on watermarking.

So we can’t just say, “it’s possible to make progress.” The question is not “is it possible to make any progress?” The question is, “Is it possible to make enough progress fast enough?”

SCOTT: But Eliezer, there’s another question, of what would you have us do? Would you have us not try to make that progress?

ELIEZER: I’d have you try to make that progress on GPT-4 level systems and then not go past GPT-4 level systems, because we don’t actually understand the gain function for how fast capabilities increase as you go past GPT-4.

GARY: Just briefly, I personally don’t think that GPT-5 is gonna be qualitatively different from GPT-4 in the relevant ways to what Eliezer is talking about. But I do think some qualitative changes could be relevant to what he’s talking about. We have no clue what they are, and so it is a little bit dodgy to just proceed blindly saying ‘do whatever you want, we don’t really have a theory and let’s hope for the best.’

ELIEZER: I would guess that GPT-5 doesn’t end the world but I don’t actually know.

GARY: Yeah, we don’t actually know. And I was going to say, the thing that Eliezer has said lately that has most resonated with me is: ‘We don’t have a plan.’ We really don’t. Like, I put the probability distributions in a much more optimistic way, I think, than Eliezer would. But I completely agree, we don’t have a full plan on these things, or even close to a full plan. And we should be worried and we should be working on this.

COLEMAN: Okay Scott, I’m going to give you the last word before we come up on our stop time here unless you’ve said all there is.

SCOTT: [laughs] That’s a weighty responsibility.

COLEMAN: Maybe enough has been said.

GARY: Cheer us up, Scott! Come on.

SCOTT: So, I think, we’ve argued about a bunch of things. But someone listening might notice that actually all three of us, despite having very different perspectives, agree about the great importance of working on AI alignment.

I think that was obvious to some people, including Eliezer, for a long time. It was not obvious to most of the world. I think that the success of large language models — which most of us did not predict, maybe even could not have predicted from any principles that we knew — but now that we’ve seen it, the least we can do is to update on that empirical fact, and realize that we now are in some sense in a different world.

We are in a world that, to a great extent, will be defined by the capabilities and limitations of AI going forward. And I don’t regard it as obvious that that’s a world where we are all doomed, where we all die. But I also don’t dismiss that possibility. I think that there are unbelievably enormous error bars on where we could be going. And, like, the one thing that a scientist is always confident in saying about the future is that more research is needed, right? But I think that’s especially the case here. I mean, we need more knowledge about what are the contours of the alignment problem. And of course, Eliezer and MIRI, his organization, were trying to develop that knowledge for 20 years. They showed a lot of foresight in trying to do that. But they were up against an enormous headwind, in that they were trying to do it in the absence of either clear empirical data about powerful AIs or a mathematical theory. And it’s really, really hard to do science when you have neither of those two things.

Now at least we have the powerful AIs in the world, and we can get experience from them. We still don’t have a mathematical theory that really deeply explains what they’re doing, but at least we can get data. And so now, I am much more optimistic than I would have been a decade ago, let’s say, that one could make actual progress on the AI alignment problem.

Of course, there is a question of timing, as was discussed many times. The question is, will the alignment research happen fast enough to keep up with the capabilities research? But I don’t regard it as a lost cause. At least it’s not obvious that it won’t keep up.

So let’s get started, or let’s continue. Let’s try to do the research and let’s get more people working on it. I think that that is now a slam dunk, just a completely clear case to make to academics, to policymakers, to anyone who’s interested. And I’ve been gratified that Eliezer, who was sort of a voice in the wilderness for a long time talking about the importance of AI safety — that that is no longer the case. I mean, almost all of my friends in the academic computer science world, when I see them, they mostly want to talk about AI alignment.

GARY: I rarely agree with Scott when we trade emails. We seem to always disagree. But I completely concur with the summary that he just gave, all four or five minutes of it.

SCOTT: [laughs] Well, thank you! I mean, there is a selection effect, Gary. We focus on things where we disagree.

ELIEZER: I think that two decades gave me a sense of a roadmap, and a sense that we’re falling enormously behind on that roadmap and need to back off. That’s what I would say to all that.

COLEMAN: If there is a smart, talented, 18-year-old kid listening to this podcast who wants to get into this issue, what is your 10-second concrete advice to that person?

GARY: Mine is, study neurosymbolic AI and see if there’s a way there to represent values explicitly. That might help us.

SCOTT: Learn all you can about computer science and math and related subjects, and think outside the box and wow everyone with a new idea.

ELIEZER: Get security mindset. Figure out what’s going to go wrong. Figure out the flaws in your arguments for what’s going to go wrong. Try to get ahead of the curve. Don’t wait for reality to hit you over the head with things. This is very difficult. The people in evolutionary biology happen to have a bunch of knowledge about how to do it, based on the history of their own field, and so do the security-minded people in computer security, but it’s quite hard.

GARY: I’ll drink to all of that.

COLEMAN: Thanks to all three of you for this great conversation. I hope people got something out of it. With that said, we’re wrapped up. Thanks so much.

That’s it for this episode of Conversations with Coleman, guys. As always, thanks for watching, and feel free to tell me what you think by reviewing the podcast, commenting on social media, or sending me an email. To check out my other social media platforms, click the cards you see on screen. And don’t forget to like, share, and subscribe. See you next time.

92 Responses to ““Will AI Destroy Us?”: Roundtable with Coleman Hughes, Eliezer Yudkowsky, Gary Marcus, and me (+ GPT-4-enabled transcript!)”

  1. Nick Drozd Says:

    There is one exchange that I think really highlights a critical conceptual difficulty in talking about “intelligence”:

    SCOTT: … I mean, GPT was trained on all the text on the internet, let’s say most of the text on the open internet. So it was just one method. It was not explicitly designed to write code, and yet, it can write code. And at the same time as that ability emerged, you also saw the ability to solve word problems, like high school level math. You saw the ability to write poetry. This all came out of the same system without any of it being explicitly optimized for.

    GARY: I feel like I need to interject one important thing, which is – it can do all these things, but none of them all that reliably well.

    SCOTT: Okay, nevertheless…

    To what extent is it reasonable to handwave away the problems with LLMs? To what extent can we say “Okay, nevertheless”? I don’t have an answer to that. Clearly they can produce some astounding text. Actual knowledge can be produced seemingly out of nowhere. That’s pretty amazing. But they also produce a lot of bizarre garbage. And it’s not that LLMs just have some blind spots or some particular areas they’re bad at, it’s that the garbage is unpredictable and inconsistent. Exactly when and how they will produce the garbage vs the amazing text is random-ish. And that’s a fundamental feature of the design. The very idea of generating statistically likely text is prima facie ridiculous, except that apparently at large enough scales it actually does kinda work.

    On the one hand, I find it hard to believe that doing more of the same will somehow lead to “general intelligence”. On the other hand, I find it very plausible that doing more of the same will lead to increased capabilities. And if it just keeps getting more and more powerful while remaining randomly stupid, what are we to make of that? Scott and Gary are both correct here, and that’s a perplexing fact.

  2. JimV Says:

I could only get through about half of the transcript, if that. My main reaction is that podcasts are not a good way to hash through complicated issues, even when the participants are experts in the subject matter.

I would have liked to have seen some acknowledgement of the fact that humanity is not going to last forever in any case–nothing is–and that if AIs destroy us sooner it will be our own fault, and that it is also possible that with AIs we might last longer than we would have otherwise. I guess there were hints of that last part, and maybe there was more in the parts I skipped. But it seemed to go on and on over the same ground.

Many years ago I saw a Nova or Discovery program on PBS about human intelligence vs. chimpanzees. In one of the tests, a five-year-old (human) was shown an object like an index card, that was solid black on one side and solid white on the other. It was displayed by a man in front of her holding up the card with the black side facing her. “What color do you see?”, he asked. “Black” she replied. Then he flipped it over. “Now what color do you see?” “White”. Then after several such tries, with the black color facing her and the card in front of his face, he asked, “What color do I see?” She answered “Black.” I rarely get the sense that people debating AI have much knowledge about human intelligence and how it develops (despite being intelligent themselves).

  3. Joshua Zelinsky Says:

    “GARY: Let me just put the following point to you, which I think, in my mind, is similar to what Gary was saying. ”

    One of these Garys is probably supposed to be Scott.

  4. Scott Says:

    Nick Drozd #1: I did follow that “nevertheless” with some actual arguments (for example, how much more like a general intelligence GPT is, than anything I would’ve imagined was on the horizon a decade ago).

    The fundamental problem here is that technology is not static. We don’t want to ignore GPT’s flaws, but we also don’t want to make the error of someone who looks at the Wright Brothers’ plane and says that, because of its severe limitations of weight, speed, etc. as well as its safety hazards, any talk of any analogous device ferrying passengers across the Atlantic is obvious nonsense.

  5. Scott Says:

    Joshua #3: Sorry about that! It was Coleman speaking; it’s fixed now.

  6. Sniffnoy Says:

    Quick correction: Gary Marcus’s book is called “Kluge”, not “Cluj”.

  7. Scott Says:

    Sniffnoy #6: Thanks! Fixed.

  8. Andrew Says:

    Hi Scott,

That was very interesting, but the whole debate seemed to implicitly assume AI research will remain the domain of a few big tech companies for the foreseeable future, and thus possible to regulate. Even Google doesn’t agree: see https://www.marketingaiinstitute.com/blog/google-ai-memo

    Also, I find the notion that we will make one superAI (not going to debate the precise definition of that obviously nebulous concept), watch it, make sure it’s safe, then proceed to v2, to be quite unrealistic. I agree with Gary that if we made only one at a time under controlled conditions, the risk of a digital Hitler killing us all would be remote.

    But what if we make millions, all slightly different, and it only takes one to go rogue and we’re doomed? In a world in which any tech savvy kid could do that, it seems rather likely that we’d be in trouble.

    Keep doing what you’re doing though, it may only help a little, but every little helps.

  9. Qwerty Says:

    What a fantastic discussion. I’m halfway through it. Thanks for the text transcript:). I’m definitely a fuddy duddy preferring that.

    This is a superb show. Lots of thoughtful clear-headed non-alarmist discussion very nicely conducted by Coleman Hughes. He is such a calm coherent thinker.

  10. Shmi Says:

Read a couple of times through the discussion, and it looks like the main crux is whether, or when, we get the “fire alarm” at which all further capabilities research should cease, and there was not a lot of convergence there between the participants. Unfortunately, the discussion veered away into politics once it got to the technical crux.

    Eliezer seems to think we had all the alarms we need from GPT-4 and the time to stop capabilities research is now. Gary seems to think that GPT-4 is not even in the right direction toward AGI, so there is nothing much to worry about yet. And Scott seems to think that we are nowhere close in general and so further, if cautious, capabilities research is fine.

  11. Mitchell Porter Says:

    I guess OpenAI’s superalignment research AI will have internal dialogues like this.

  12. Danylo Yakymenko Says:

    I find Eliezer’s position quite annoying. It’s a worry without concrete causality arguments, just like a fear of darkness, of something unknown, incomprehensible and uncontrollable. To me, the Terminator story has more logical grounds (and it’s about 40 years old, btw).

    He is pushing the idea that AI can kill us all because of negligence, unawareness of its own powers. In other words, by an accident. Yet, all the greatest evil on Earth has been done by men on purpose. And they continue to do the same even now, in the information age where everything is public.

    Thus, the biggest AI threat is that it will be used for evil things by people with power and abilities. We should think how to avoid those scenarios in the first place, not about cartoons where AI is an independent super-powerful being.

  13. Ilio Says:

Wow, one of the best conversations on this topic ever!

Also, thanks for the transcripts. Is the code on GitHub?

  14. DavidC Says:

This is the 2nd time today I saw experts claiming chatbots can’t play chess, so I decided to write something about it:


It seems pretty clear that GPT-4 has some nice general reasoning ability, and chess is a neat (initially surprising to me!) example.

  15. Shmi Says:

Danylo #12: That is not at all what he is saying (and I am not a doomer myself). What he (and Gary and Scott) say is “we cannot predict what a superintelligence would do or would not do, and we cannot affect the outcome once it becomes smart enough, no matter the amount of training we give it”. Humans might go extinct because of an AGI’s intention, its negligence, or its not caring, or maybe because it decides that it’s what humans would want, or maybe for other reasons.

And yes, one of the bigger near-term AI threats is what you are suggesting: AI as an amplifier of an evil human’s powers. But that is not an extinction-level risk, since we sort of understand humans, even evil ones. Extinction-level risk (not a single recognizably human entity remains) is likely to come from a superintelligence that does not think like we do and whose “mind” is completely incomprehensible to us (just as ours is incomprehensible to worms).

  16. Antoine Deleforge Says:

Thanks a lot for the transcript, Scott. For what it’s worth, I am also on the team that prefers this to the podcast format :).

With all the respect I have for Eliezer and his pioneering/visionary work on these issues, there is a point in his arguments where he always loses me: namely, when he waves at something along the lines of “but if the AI is *really that dangerously smart*, then it will be smart enough not to let you know about that/to hide it from you/to manipulate you away from it, etc.”

To me, this is taking for granted something that is not at all obvious, and perhaps even paradoxical. Namely, that a sufficiently intelligent entity will necessarily, at some point, be able to “escape its substrate”. To “look within itself from outside”. To have full control over the very hardware it is running on. But is this a reasonable argument? Won’t we, as external onlookers, always keep some advantage that the AI systems will never have, namely, the ability to look at every digit in their “giant inscrutable matrices of floating points”, as Eliezer likes to call them, and modify their values at will, without any consequence for our own functioning?

Scott gave the nice example of these recent “lie detection” results on LLMs. Sure, one could argue that a smarter AI system could at some point hide from humans the fact that it’s lying. But no matter what, this process of “lying about lying” will remain encoded as some specific series of computational operations, and it will remain possible for us, external onlookers, to detect such a process and act on it, at least in principle.
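As a toy illustration of such an activation probe (everything here is synthetic: a made-up “lie direction” and made-up activations; a real probe would be trained on hidden states extracted from an actual LLM):

```python
# Toy sketch of an activation "probe": a linear classifier trained to
# detect a property (here, a stand-in for "lying") from hidden activations.
# All data is synthetic; no real model is involved.
import numpy as np

rng = np.random.default_rng(0)
dim = 64           # pretend hidden-state dimension
n = 500            # examples per class

# Synthetic "activations": honest and lying states differ along one direction.
lie_direction = rng.normal(size=dim)
lie_direction /= np.linalg.norm(lie_direction)
honest = rng.normal(size=(n, dim))
lying = rng.normal(size=(n, dim)) + 4.0 * lie_direction

X = np.vstack([honest, lying])
y = np.array([0] * n + [1] * n)

# Fit a linear probe by least squares (a minimal stand-in for logistic regression).
A = np.c_[X, np.ones(len(X))]                     # add a bias column
w, *_ = np.linalg.lstsq(A, 2 * y - 1, rcond=None)
preds = (A @ w > 0).astype(int)
accuracy = (preds == y).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

The point of the toy is only that the “lying” computation leaves a linearly detectable trace in the activations, which an external onlooker with full read access can find.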

I guess what I am getting at is that there’s got to be a fundamental limit on the extent to which a system can analyze and modify the very substrate it is running on without altering its own functioning. And I think this reasoning applies to humans as well, if we imagine an external entity looking at every neural activation in one’s brain. Could a human, however smart she is, possibly hide something from this external onlooker?

    I wish I was smart enough to make this argument more formal. I feel like there is some sort of “Russell’s paradox” (the impossibility to define the set of all sets) vibe to it.

  17. Adam Treat Says:

The current biggest story around the threat of AI is how cynically the big tech giants are using it to create a climate of regulatory capture to stifle competition.

    This document outlines this aspect of the story quite well: https://research.contrary.com/reports/the-openness-of-ai

  18. JimV Says:

DavidC @ #14, thanks for the example. I have seen a lot of hand-waving arguments from skeptics claiming that seven-year-olds can do things that LLMs can’t, and so on, and I wonder if they have actually seen results of controlled trials, are relying on anecdotes, or are just making it up based on their instincts. For example, do they realize that the average human with an IQ of 100 is not apt to play a good game of chess without years of training and practice? (I wasn’t much good at it myself until I read “Renfield on Chess” and learned the Knight Fork and the X-Ray attack and so on, at the age of 12 or so.)

    I also share the feelings of some commenters above about the doom arguments. It seems to me that some of those arguments are based on a negative assessment of humanity, such that a super-intelligent AI would see no point in keeping us around. Which might be a possibility, but if so, the challenge is for us to get better, not to shut down AI development so we won’t have to face criticism.

  19. Del Says:

    Danylo#12 I totally agree +1

Antoine#16 I agree here too, and to make it clearer: even we humans, after hundreds of thousands of years of existence, are only now getting a (vague) clue about our own brains and how they work. Assuming that an AGI gets it from the get-go should be at least arguable.

But I guess that’s the whole point of Eliezer, the way I see it. He takes for granted many things that are entirely non-obvious to most people: that an AGI that is “smart” (whatever that means, which I think is far from clear) and capable of killing all humans can be built; that it will have long-term planning capabilities driven by its “wants”; that these “wants” will include “killing all humans” for unspecified and inscrutable reasons; that it will be able to produce “children” and put them into production in the real world; etc. For everything to go as badly as he sees it, all of these have to be “quite right”, and even one missing link will prevent it from happening, so I don’t see how he can be so certain. But what I find most annoying is the same thing I find annoying in some religious extremists (and I am Christian myself): they argue that the scriptures are literally true and that everything (e.g. fossils, radiocarbon dating, etc.) was made up by God on purpose just to test our faith. How do you argue with that? I can just shrug and say “the world was created yesterday and He put all these memories of your youth in your mind, but that never happened”. I have yet to find the “right” shrug sentence for Eliezer’s AGI “theology”…

Adam#17 – and they have not even started with the argument mentioned by Andrew #8, about “if everybody is able to deploy their own AI we are even more likely doomed”… Good luck when they get that idea!

  20. David Wecker Says:

I was surprised that there wasn’t more credence given to the notion that a superintelligent AI could perhaps grow its intelligence/power geometrically. I’ve heard the concern that humans have a hard time imagining geometric growth, and so are discounting this particular danger of AI. I would have thought Eliezer would have focused on this, in addition to the concerns around hiding intentions, etc. If geometric increases were to happen, it seems like the continuum arguments go out the window, as do the hopes for iterating solutions to alignment over time.

  21. ExPhysProf Says:

It’s interesting to watch three expert and very intelligent individuals discussing the wants, needs, and potential actions of two species that do not now exist and have never come close to existing. First, of course, is the superintelligent AI. Discussing its aims and strategies is like a chimpanzee attempting to understand the aims and strategies of a human being. In fact it is worse: at least the chimpanzee and the human have mental apparatuses that are reasonably similar, and at least humans exist and have a history that could be studied. Neither is true for the hypothetical AI. Perhaps it is no surprise that all three individuals assume (guess) that the AI will think as an extrapolated modern human does.

    More surprising is the discussion of the potential reactions of the human race (HR). The question seems to be whether or not the HR can react quickly enough when the danger becomes clear. Why does anyone think the HR, as it actually exists, would act effectively, when the overwhelming evidence invariably shows failure to act effectively until it is too late? For example: France and Great Britain in the early 1930s. For example: the current global heat wave that shows vividly that climate change is already past the point of no return and close to the point of catastrophe. The hypothetical HR being discussed would certainly have acted decisively by now. Can anyone really believe that Texas politicians actually belong to that entity?

    I think trying to build “don’t harm humanity” into the structure of AI systems is hopeless. Instilling a “moral sense” would be worse; how well does the human moral sense protect anyone (e.g. citizens of Odesa) when the chips are down? In the history of civilization every technological advance has expanded human capacities and also led to great dangers. Without agriculture, or the green revolution, we could not have the current devastating overpopulation. Without modern physics we would have neither computing nor Hydrogen Bomb angst. Without modern materials science we would have no plastic pollution, nor any chance of mitigating greenhouse gas emissions. We need to keep technology advancing in the hope that it will help us to mitigate the mess we are creating. Yes, legislate fixes to obvious problems. No, shut nothing down!

  22. matt Says:

Here’s a serious question: why does Eliezer say he started the field of alignment, and why do you often say no one before him was thinking about stuff like that? AI doing harm to mankind has been a theme of science fiction for a very long time. Often it’s “AI hates humanity and wants to destroy it”, like the story “I Have No Mouth, and I Must Scream”. But it also includes “AI wants to help humanity but its help takes away everything that makes life worthwhile”, like “With Folded Hands”. I don’t know a good story off the top of my head for “AI inadvertently destroys humanity by trying to make paperclips”, but I’m sure there is one, and the term paperclip maximizer is due to Bostrom in 2003. Further, I have read that Minsky was already saying, long before then, that an AI might use all of Earth’s resources to solve the Riemann hypothesis, which is basically that theme. And of course Asimov’s Three Laws are basically an attempt to align AI, with many stories exploring why alignment is needed and just how hard it is.

  23. Mitchell Porter Says:

    Adam Treat #17: Welcome to Open-Heimer AI!


  24. Scott Says:

    matt #22: I mean, obviously Eliezer was preceded by generations of science-fiction writers! As I see it, his particular contribution largely just consisted of being the first person to say, “no, but really for real, not just science fiction, we urgently need to build a whole social movement around solving the AI alignment problem.” And then he actually succeeded in building that social movement—but the fundamental problem the movement faced, for decades, was that its concerns sounded way too much like all the preceding science-fiction stories. Until, a year or two ago, reality started catching up with the science fiction.

  25. Scott Says:

    ExPhysProf #21: In my own remarks, again and again I stressed my radical uncertainty about what a superintelligent AI would want—while trying to avoid the obvious mistake of using radical uncertainty to justify complacency. Also, if one thought that humans would “of course” be too complacent, wouldn’t that be all the more reason to try to rouse them from their complacency—if only so that one could say later that one was on the right side?

  26. ExPhysProf Says:

    Scott # 25: Thanks for your patience. Yes you are skeptical, but I think not skeptical enough. Let me elaborate briefly:

A chimpanzee might understand human aims as far as the desires for food, sex, and comfort go, since it shares these wants and needs because of biological similarity. But in its wildest dreams it could not imagine your deep interest in P=NP? and quantum supremacy. (Nor could 99+% of humans.) The notion of deciding to destroy humanity is of measure zero in the space of notions that the superintelligent AI might have and that you cannot conceive of, because the difference between you and the AI is so enormous. In fact, given the history of humans and technology, I would assign an infinitely larger measure to the possibility that humans will make AI an essential element in the suite of technologies that will finally let humankind squeak through the otherwise inevitable extinction potential of the climate crisis. So better to increase investment in AI than to indulge the doomsayers by taking them as seriously as you seem to.

    As far as human complacency is concerned: My point is that it is not a useful exercise to base the solution of a problem on a societal response that experience tells us never happens.

  27. Seth Finkelstein Says:

    Adam Treat #17 While I’m not a lawyer, my understanding is that the sort of regulation proposed by the “doomer” faction is flat-out not Constitutional under US law. And further, even if one could work up some sort of speculative legal argument (whether or not an AI can go rogue, lawyers certainly can!), in practice it would never get past Congress. Thus I don’t see it being a real plan for regulatory capture. That’s kind of a generic argument that’s rolled out every time any sort of regulation comes up.

However, what is quite possible is various regulation of the potential business uses of AI with respect to racism, sexism, etc. There are many approaches there that don’t even require a new law. Hence the evil brilliance of making a PR campaign of distraction: away from things that could really affect profits, toward fantasy punditry. This just says it all:

    “… when Sam was asked what he was most concerned about. Was it jobs? And he said ‘no’. …he said he was worried about serious harm to the species.”

  28. Scott Says:

    ExPhysProf #26: So somehow, you went from skepticism that’s far deeper than mine, a Zen-like absence of all preconceptions about the superintelligent alien’s inscrutable desires, to existential risk being “measure zero,” and an “infinitely larger measure” to AI helping us squeak through the climate crisis?

    How the hell did that happen?

    I’ve said myself (including in this podcast) that, for all I know, AI might be as likely to save us from existential risk as to cause it. But, in the space of all possible goals of a superintelligent alien, how do you know that only “measure zero” would involve getting rid of humans as a byproduct of however else it wanted to remake the universe? (Note that this is just the counterpart of the question I keep asking Eliezer, who’s just as confident that the measure is essentially 1, as you are that it’s 0.)

    What I think there’s now a clear case for, is basic research that might be helpful across a wide range of scenarios of AIs not doing what people want them to do, from current misuses of GPT all the way through hypothetical existential risks. The case for such research doesn’t require us to make a decision about the “doom measure.”

  29. Adam Treat Says:

    Seth Finkelstein,

The fact that some of the proposals are unconstitutional is poor medicine. What’s more effective medicine is the state of the US Congress: basically totally paralyzed from doing anything at all, which is awful for so many of the world’s problems, but at least that ineffectiveness can lead to less harm in this instance.

What’s still not being captured in this debate is how the grifters/scammers who make up some of the loftiest echelons of the tech giants funding this work have co-opted the doomers’ message for PR purposes, to: 1) distract the general public from more valid concerns and regulations; 2) create a climate of fear they can use to justify actions that in any other light would reveal them to be nothing more than run-of-the-mill, everyday, greedy capitalists.

    Mind you I have no problem with greedy capitalists doing greedy capitalist things within reason. But dressing up your true motives and hiding behind the effective altruism movement is just too much. No doubt, there are probably some true believers inside OpenAI and the like still, but the people in charge have their eyes squarely focused on one thing: the money.

  30. Scott Says:

    Adam Treat #29: While there’s very little that I’m confident about in this subject, having met a large fraction of the people in AI safety makes me very confident that you’re dead wrong in your Machiavellian theory of their motivations. Even if we take Sam Altman, for example, he has zero equity in OpenAI and is already rich from Y Combinator; he isn’t in this for money. You can argue that he and others are totally wrong, that’s fine (indeed, leading AI safety types often vehemently disagree with each other), but in just about every case I’ve seen, they’re in this because they genuinely believe that there’s a unique opportunity right now to steer the development of AI toward human benefit and away from catastrophe.

  31. JimV Says:

    My guess (worth as always what you paid for it) is that intelligence is just accumulating lots of observations and fitting them by trial and error to useful paradigms, using logical and arithmetic calculations. Once you have eliminated all the paradigms which don’t work, whatever is left will work, as Holmes might say. Science proceeding from Archimedes to Galileo to Newton to Einstein (plus thousands of others) is an example of super-intelligence using lots of neurons over lots of centuries. AI is an attempt to speed up the process using faster and ultimately larger collections of simulated neural activity. (Currently the neural simulations I have heard estimates of in AlphaGo and GPT-4 still fall short of a single human cerebral cortex but are much more dedicated.) It is not magic. It will not instantaneously solve problems with small amounts of data. There are always lots of ways to fit limited data which will turn out not to work for the next set of data. If it is really smart, a super-intelligent AI will not suddenly decide to eliminate humans based on only a few centuries of data. It should always realize that what it doesn’t know is much more than what it does. Besides, compared to the costs of AI’s, humans are a cheap source of sometimes-useful neurons.

    (As I have remarked before, my science-fiction story about alien invasion would have the aliens not after water or oxygen or any other mineral resources but our nanotech brains, to be harvested and used to automate their systems.)

    Which is not to say that AI systems could not be used to harm or even eradicate humanity, but as I see it the driving force will be human competitive instincts, not AI cogitation.

  32. GullibleCynic Says:

Scott #28: My understanding was that Yudkowsky’s position of near-certain doom came from a sort of analogy with entropy: there are simply many more goals an AI could have that would lead to doom than there are goals that would lead to non-doom. And since we can’t reliably set the goal of the AI, and we don’t know what we should set it to even if we could, it seems very unlikely that it will be aimed at human flourishing just by dumb luck. I admit I find that at least somewhat plausible. What do you think?
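The counting argument can be illustrated with a toy numerical experiment, under the (purely illustrative) assumption that goals are random directions in a high-dimensional space, with one fixed direction standing in for “aligned with human flourishing”:

```python
# Toy illustration of the "most random goals are not aligned" counting
# argument: sample random goal vectors uniformly on the unit sphere and
# measure how often they point roughly the same way as one fixed
# "human-compatible" direction.
import numpy as np

rng = np.random.default_rng(0)
dim = 100                      # dimensionality of the (made-up) goal space
n_goals = 100_000

target = np.zeros(dim)
target[0] = 1.0                # the "aligned" direction, fixed arbitrarily

goals = rng.normal(size=(n_goals, dim))
goals /= np.linalg.norm(goals, axis=1, keepdims=True)

# Cosine similarity to the aligned direction; call "aligned" any goal
# with similarity above 0.5.
cos = goals @ target
aligned_fraction = (cos > 0.5).mean()
print(f"fraction of random goals roughly aligned: {aligned_fraction:.6f}")
```

In high dimension the cosine with any fixed direction concentrates near zero, so the aligned fraction is essentially nil; the debate above is, in effect, about whether real training samples from anything like this uniform distribution.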

  33. Chris Lawnsby Says:

    Antoine #16: Amazing comment– extremely interesting and novel to me.

    Scott– I don’t have anything to add but I see that you read these comments so I wanted to say thank you. I have a young daughter and I am myself quite worried about AI-risk (not Eliezer-level, but I am worried). The fact that people like you are working hard on these problems makes me feel marginally better, and I feel real gratitude.

    Thank you 🙂

  34. Adam Treat Says:

    Scott #30,

    I respect you and your judgement and assign a high degree of deference to it, but on this I think we’re bound to disagree. Sam Bankman-Fried is an obvious example of exactly what I’m talking about and I don’t think in any way shape or form that he is alone. Further, Sam Altman himself has presided over an OpenAI that is anything but open. There is simply nothing at all “open” about OpenAI any longer.

OpenAI made changes to its charter, announcing increased self-censorship in the name of safety, coinciding almost perfectly with taking on billions in new investment:

    “We are committed to providing public goods that help society navigate the path to AGI. Today this includes publishing most of our AI research, but we expect that safety and security concerns will reduce our traditional publishing in the future, while increasing the importance of sharing safety, policy, and standards research.”

Another billionaire who co-founded OpenAI, and who is also presumably no longer in need of monetary gain, has said this:

“OpenAI was created as an open source (which is why I named it “Open” AI), non-profit company to serve as a counterweight to Google, but now it has become a closed source, maximum-profit company effectively controlled by Microsoft.”

    Now, I don’t take Elon Musk’s comments to be devoid of ulterior motives, but his claim that they are no longer “open” checks out.

    Indeed, OpenAI itself said this when GPT-4 was released:

    “Given both the **competitive landscape** and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

Notice that “competitive landscape” is listed first. If Sam Altman were really motivated as you say, it would make no sense for this concern to be listed at all, let alone first.

To bury this once and for all, here is a direct quote from Ilya Sutskever conceding that it is competitive pressure, rather than concern for safety, that has led to them becoming “ClosedAI”:

    “On the competitive landscape front — it’s competitive out there. GPT-4 is not easy to develop. It took pretty much all of OpenAI working together for a very long time to produce this thing. And there are many many companies who want to do the same thing, so from a competitive side, you can see this as a maturation of the field. On the safety side, I would say that the safety side is not yet as salient a reason as the competitive side. But it’s going to change.” -> https://www.theverge.com/2023/3/15/23640180/openai-gpt-4-launch-closed-research-ilya-sutskever-interview

They do not even publish their data or methodology on alignment, and they are on record, revealing their thinking, *outright admitting* that competitive pressure has been the primary factor in not releasing details about GPT-4, in direct contradiction to the so-called values they were founded on. What’s more, the terms of the engagement with Microsoft are closed and not publicly known, which means you have no way to back up your assertion that Sam Altman has no monetary goal here.

    No, I fear I cannot trust your judgement on this, but have to look at the explicit evidence before my very eyes:

1) https://research.contrary.com/reports/the-openness-of-ai
    2) https://www.ft.com/content/8de92f3a-228e-4bb8-961f-96f2dce70ebb
    3) https://www.theverge.com/2023/3/15/23640180/openai-gpt-4-launch-closed-research-ilya-sutskever-interview
    4) https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/

  35. Adam Treat Says:

And btw, I don’t ascribe any of this to Machiavellianism. Rather, it is just ordinary, mundane human nature. People who go into something with one goal can easily find themselves pursuing another, motivated by competition, greed, and ego, and lying to themselves about their noble ways. Again, I have no problem with capitalists behaving like capitalists: humans competing naturally over finite resources. It is when that behavior is dressed up in altruistic goals (goals that might very well have been sincere at first, but that no longer pertain and are no longer the overriding motivation) that I find it lamentable and worthy of being called out.

  36. Scott Says:

    Adam Treat #34: A huge part of what you’re seeing is just that, as far as the orthodox AI alignment community is concerned, publishing the details of your models is the reckless, irresponsible thing to do, while keeping the details secret is the prosocial, responsible thing. If you saw AI as potentially much more dangerous than nuclear weapons, you’d probably reach the same conclusion (or should ICBM designs also be open sourced?).

    But then there’s the AI ethics community, which, from a completely different worldview and set of assumptions, thinks publishing the details is responsible and withholding them is irresponsible.

    In practice, predictably, you get an attempt to satisfy both communities that ends up satisfying neither.

    But the point is that this is a genuine difference of beliefs and values. If the AI alignment people have secret meetings where they cackle about all the rubes falling for their dishonest arguments in favor of nondisclosure, then after a year I still haven’t been invited to those meetings, and I hope you’d trust me to tell you if I had.

  37. Adam Treat Says:

    Scott, I both believe that you would tell me if you were invited to those meetings and that you are not invited to those meetings for the same reason 😉

Again, I don’t doubt that some people sincerely believe that releasing the datasets and model weights would be irresponsible from a safety/humanity perspective, but I think the motivated reasoning (that secrecy also allows them a competitive advantage and monetary gain) cannot be factored out. I think you tend to see good in people, and also that your morals and motivations are heartfelt. I think the same factors that make you less susceptible to this motivated reasoning (to your credit) are the same ones that might prevent you from seeing this weakness in others.

But we don’t need any of that here. We have the chief scientist plainly stating that the reason they have not released details of GPT-4 is competitive and not safety, which is in direct contradiction to their stated charter and goals as a company. It doesn’t get any more clear than that.

  38. Adam Treat Says:

To address my own motivated reasoning: I’m an early equity owner in a company devoted to open-source LLMs: opening the weights, the data, the training methodology, and the software to both train and run inference. As such, I’m susceptible to seeing OpenAI as non-open and ill-motivated. However, I still think the evidence supports this to an objective eye.

  39. Danylo Yakymenko Says:

    Antoine Deleforge #16:
    > I guess what I am getting at is that there’s got to be a fundamental limit on the extent to which a system can analyze and modify the very substrate it is running on, without altering its own functioning. […] I feel like there is some sort of “Russell’s paradox” (the impossibility to define the set of all sets) vibe to it.

Certainly. More generally, there is a problem of self-reference (https://en.wikipedia.org/wiki/Self-reference) that occurs in many areas of science, and math in particular. By the way, quantum mechanics has an interesting version of this problem too. Observer and observable are separate entities in QM, and are usually treated differently. For example, in the Schrödinger’s cat experiment, when the scientist opens the box we ask about the state of the cat, not about the state of the scientist or the entire world around them. Everett’s approach tries to equalize their status, but they remain separate physical systems. I’m not aware of any rigorous treatment of self-observation in QM!

  40. fred Says:

    Hey Scott,

You mentioned exploring whether an LLM would be able to truly “extrapolate” beyond its training data set, a key prerequisite for doing research.

It seems to me that a smaller goal is the idea of LLMs coming up with original “metaphors”.
Maybe one way to achieve this is a sort of compression scheme applied to the neural net (a storage optimization that organic brains may be doing): look for large, separated structures within a trained neural network that are similar (in the sense of graph homeomorphism, an NP-hard task), and then bring them together somehow, so that the two separate input pathways reuse the same structure. This naturally leads to some “cross-talk” between two seemingly separate concepts (a eureka moment of inspiration, connecting the concepts).
Maybe this is just some extra tweak during the training process (or could even already be happening on its own).
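A crude sketch of the merging step, substituting cheap cosine similarity of weight rows for the much harder graph-structural comparison (the layer and the planted duplicate are made up for illustration):

```python
# Toy version of the "find similar substructures and tie them" idea:
# treat each unit's weight vector as the substructure, find the two most
# similar units in a layer, and tie them to one shared set of weights,
# so both input pathways reuse the same structure.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                 # made-up layer: 8 units, 16 inputs
W[5] = W[2] + 0.01 * rng.normal(size=16)     # plant two near-duplicate units

# Pairwise cosine similarity between unit weight vectors.
norms = W / np.linalg.norm(W, axis=1, keepdims=True)
sim = norms @ norms.T
np.fill_diagonal(sim, -1.0)                  # ignore self-similarity

i, j = np.unravel_index(np.argmax(sim), sim.shape)
W[i] = W[j] = (W[i] + W[j]) / 2              # tie: both units share one structure

print(f"tied units {i} and {j}, similarity {sim[i, j]:.3f}")
```

After tying, anything that fed unit i now also excites whatever read from unit j, which is one very literal version of the “cross-talk” described above.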

  41. Ilio Says:

    Scott #28, GullibleCynic #32,

This question reduces to: what’s the set of possible values? If you sample at random from a uniform distribution over all possible networks under some size, you get Eliezer’s conclusions (and anyone who doesn’t see this is a moron). If you sample at random from a distribution over all possible artificial intelligences that survive appropriate testing, you get ExPhysProf’s conclusions (and his students are left with the task of proving what’s appropriate).

  42. Scott Says:

    Ilio #41: The question is more like, “what’s the uniform probability distribution over possible value systems, conditioned on having enough intelligence to pursue any values at all (as a randomly initialized neural network does not) … and not only that, but conditioned on being made intelligent in the sorts of ways we know how to make things intelligent?”

    I’m skeptical of anyone, on any side, who claims they know how to answer such questions, or that the answers are obvious.

  43. JimV Says:

    Re: “… whether an LLM would be able to truly “extrapolate” beyond its training data set …”

Isn’t this what GPT does all the time, e.g., making up a poem or story that is not a close copy of anything in its training data? Aren’t many of the next tokens it produces extrapolations? (Sometimes bad ones, as with all extrapolations.) If not, then I don’t understand what it means to “extrapolate beyond its data set”.

    When I have a problem to solve, it helps to be sure I have all the relevant data in my working memory due to continual review. If I understand correctly, GPT is (or was) limited in its analogous “working memory” to the contents of a current chat, which may be why using many follow-up prompts can improve results. If so, I think adding reviewing/checking passes might improve its extrapolations.

  44. arch1 Says:

    Eliezer’s closing advice to interested, bright 18-year-olds:

    “Get security mindset. Figure out what’s going to go wrong. Figure out the flaws in your arguments for what’s going to go wrong. Try to get ahead of the curve. Don’t wait for reality to hit you over the head with things. This is very difficult. The people in evolutionary biology happen to have a bunch of knowledge about how to do it, based on the history of their own field, and the security-minded people in computer security, but it’s quite hard.”

    Can anyone elaborate on the “bunch of knowledge” to which Eliezer alludes? Eliezer, Scott, ChatGPT, bright 18-year-olds, Eliezer, anyone..?

  45. Ilio Says:

    Scott #42,

Good point, but the hard part is that we can’t even keep humans from doing very stupid things like global warming. If we restrict ourselves to the landscapes conditioned on all AIs following this or that rule, then the impossible can become trivial. For example, we could forbid AIs that do not improve human intelligence more than their own intelligence, which seems to more or less ensure that future SAI will be us, the post-humans we will choose to become. In a sense, we are already there, with smartphones acting as extensions of the minds of our children. Yeah, I like transcripts too. 🙂

    (not speaking for exphysprof who may have entirely different models in mind)

  46. WA Says:

    I listened to the podcast version, and thoroughly enjoyed it. It was a great discussion with many interesting themes. I found Eliezer’s confidence in an unstoppable doomsday a bit too strong. I think the right thing to do now is to develop and study the new tech. Personally I’m very excited about the prospects of using LLM-like AI in education and healthcare.

    I still find doomsday AI scenarios thought-provoking from a philosophical perspective. Is there a fundamental reason why two highly intelligent species cannot co-exist? Is it a given they’ll fight over resources or strive to annihilate each other? Symbiotic and parasitic behaviors are common in the natural world, if that’s any indication. In fact, these behaviors can give rise to new species containing the DNA of different ancestors.

    This reminds me of the cyborg AI future. Imagine a future where each one of us is embedded with a brain chip connected to GPT-3000 or whatever. Is this automatically a bad ending, even with no misbehavior from the AI? If part of the very computation that defines who we are is offloaded to some centralized brain, our very nature comes into question. We might become extremely more organized and efficient in that world, and the objective function of human well-being could skyrocket, but on the other hand we could lose our individuality and gradually turn into vacuous clones of the same thing.

  47. Uspring Says:

    EY said: “And you could even, from a certain technical standpoint on expected utilities, say that intelligence is a special, is a very effective way of wanting – planning, plotting paths through time that leads to particular outcomes.”
    I think intelligence is a way of “getting” not of “wanting”. Intelligences do not even necessarily have to want to learn, even though learning is necessary in their construction.
    Scott, do you know, what kind of wants EY is thinking of here?
    ChatGPT answers questions, although, AFAIK, it hasn’t been explicitly told to do so. So its wanting to answer questions is the result of numerous dialogs in its training set that begin with a question and are followed by an answer. ChatGPT’s motives are a reflection of the motives in the training data.
    So my question is more precisely, are there general motives of intelligences, which go beyond their training set, as EY seems to imply?

  48. JimV Says:

    Re: “ChatGPT answers questions, although, AFAIK, it hasn’t been explicitly told to do so.”

    Uspring @ #47, I disagree. In my experience of writing computer programs since the 1960’s, they always do what they are told to do. It may happen that what they are told to do is not what the programmers wanted (i.e., bugs), or that the resulting output is not what the programmers thought would result, but it is in fact what their instructions deterministically produced (given the various inputs, which might include some randomness). I don’t have the GPT code, but I will bet any amount of money its programmers could point to where it is explicitly told to respond to prompts, token by token. E.g., calculate the next token, output it (until some condition is met).

    (Yes, many times I have thought, upon seeing results of a program, “You cannot be serious! This is not what I told you to do!” But it has always turned out that it was what I told the program to do.)
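    JimV’s point can be made concrete with a toy version of the loop he describes. Everything here is a stand-in (the `next_token_logits` function is a hypothetical placeholder for a trained network with a 10-token vocabulary), but the outer structure, i.e. compute a distribution, emit a token, repeat until a stop condition, is explicitly programmed in just this way:

    ```python
    def next_token_logits(tokens):
        # Stand-in for a trained network: a toy rule that always prefers
        # the token one greater than the last (mod 10).
        return [1.0 if t == (tokens[-1] + 1) % 10 else 0.0 for t in range(10)]

    def generate(prompt_tokens, stop_token=9, max_len=20):
        tokens = list(prompt_tokens)
        while len(tokens) < max_len:
            logits = next_token_logits(tokens)
            nxt = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
            tokens.append(nxt)
            if nxt == stop_token:  # the explicit, programmed stop condition
                break
        return tokens

    print(generate([3]))  # -> [3, 4, 5, 6, 7, 8, 9]
    ```

    The interesting behavior of a real LLM lives entirely inside `next_token_logits`; the “explicitly told” part is only this outer loop.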

  49. fred Says:

    WA #46

    “I found Eliezer’s confidence in an unstoppable doomsday a bit too strong.”

    Yes, the problem for me is that he just keeps saying “we’re all dead”.
    Without giving any more details, that statement is true even without AIs – we’re all going to die someday, with 100% certainty.

    Not giving any details makes it sound like humanity could die in a million different ways, and that the specific manner doesn’t matter. I.e. in the multiverse interpretation, there will be countless branches with slightly different realizations of AGIs, but somehow they will *all* kill us brutally, each in a different way, with 100% probability: through nukes, viruses, fires, starvation, sterilization, armies of terminators with laser guns, etc… All branches lead to a violent, sudden wiping out of all organic life (if humanity dies, I don’t really see how/why other species wouldn’t also die).

    Personally I find that really unlikely, my guess is that in most cases humanity will just slowly become irrelevant and fade away over many generations, just like the decrease of birth rates in high-GDP countries is already a fact.

  50. Scott Says:

    JimV #48: It’s all a question of levels of description. If you’re a god who sets deterministic laws of physics in motion, and within the universe you’ve created, a certain novel gets written, is it reasonable to say that you “explicitly asked” for that novel?

    As far as I can see, the situation with LLMs is similar. They’re explicitly programmed to do backpropagation on the task of “predict the next word,” trained on most of the text on the open Internet. Many other abilities emerge from that, having been explicitly asked for only in the same sense that the god explicitly asked for the novel.

  51. fred Says:

    Doesn’t ChatGPT answer questions simply because, in the training data, when a question appears (e.g. as a post in reddit), it’s usually followed by some answers (as opposed to “Go f’ yourself”)?
    And then this was amplified through the extra step of using humans to rank its completions, so that completing a question with an answer becomes even more likely.

  52. Uspring Says:

    JimV #48:
    Yes, ChatGPT is programmed to respond to a prompt as part of its code. But the content of the response is dependent only on the weights in the NN. Whether the response is an answer to a question posed in the prompt depends only on the training set.

  53. Mateus Araújo Says:

    The podcast left me wondering what would be the most effective method to convince the public at large that AI is a serious risk and should be controlled as strictly as nuclear weapons.

    What I came up with is the creation of an AI super virus: stealthy, learning, self-modifying. Should cause a huge amount of damage in the IT infrastructure and be a pain in the ass to eradicate. Maybe a kind of ransomware, that upon detection encrypts your files, but instead of Bitcoin what it wants is unfettered access to the internet for some time.

    Of course, that must be done with an AI that is still relatively stupid, otherwise instead of an early warning we could get the actual apocalypse.

  54. Bobby Says:

    This is a first post, hopefully not too far off topic. I worry that AI is the most flagrant and gigantic intellectual-property theft scam ever. I propose a legal requirement that every output from an AI program/chip state the following: 1. All copyrights and trademarks of training materials have been honored; 2. Identity of software, host computer/hardware producing output, and date produced; 3. Specific source of training materials employed to produce output.

  55. fred Says:


    “Re: “… whether an LLM would be able to truly “extrapolate” beyond its training data set …”
    Isn’t this what GPT does all the time, e.g., making up a poem or story which is not a close copy of anything in its training data? “

    I wonder the same thing.
    E.g., we can ask ChatGPT to do something really unique (i.e. not in the training set) and quite challenging for humans, like explaining a complex concept under a lot of extra constraints (make it all rhyme, or write it in the style of xyz); the results are often so good that there’s clearly some true skill at generalization, at bringing two separate domains together (it’s not just a matter of trying all sorts of solutions by brute force, which ChatGPT can’t do anyway).
    I guess the catch is that this sort of output, even if very constrained, is still very “free form”: often a sort of artistic creation that isn’t true or false, as opposed to hard reasoning. Basically, ChatGPT is a much better poet than mathematician.
    Maybe it’s a matter of training ChatGPT on better mathematical data: not just text but complex math notation, 2D graphs, etc.

  56. JimV Says:

    Thanks for the reply, Dr. Scott. As I said, the programmers don’t always know what the result of the program will be–often, in fact, since if they knew exactly what the output would be they wouldn’t need to write the program–but the specific point I was disagreeing with is the claim that ChatGPT only answers prompts because its training set taught it to want to. At least, that is how I read the quoted statement (“ChatGPT answers questions, although, AFAIK, it hasn’t been explicitly told to do so.”). As I said, I think it has been explicitly programmed to calculate and output tokens. Whether the results are good or bad is a different question, to me.

    I should accept that everyone here is too smart to not know that, but sometimes I suspect otherwise, for which I apologize.

    I also stated that a program’s results are deterministically produced by its instructions and its inputs, which I think covers all programs including GPT. Perhaps we are all in furious agreement, as they say.

  57. fred Says:

    One trend that I find quite odd is how everyone is going on and on about “The Art of the Prompt” (“I’m a prompt engineer” and whatnot) when it comes to using systems like ChatGPT/Midjourney/… I think it’s just something tech people hang on to as a replacement for all the skills the AIs are going to take over from humans (like coding); it gives the illusion of some abstract skill that will earn a lucky few a place in the sun, so to speak…
    But the highest priority should be getting rid of this concept/constraint asap (it will probably vanish on its own as AIs improve).
    By definition, the “quality” of the answers of a truly *intelligent* agent shouldn’t depend very much on how we express a particular question. If the question is poorly framed, the system should ask clarifying questions to refine it.

  58. Adam Treat Says:

    Uspring #52,

    The training set, the finetuning set, the random seed, and whatever stochastic sampling method you choose to use. With greedy sampling, the only thing it depends on is the training and finetuning sets: i.e., completely deterministic.

    What the above is actually arguing over is to what extent emergent behavior can be said to be deterministic and/or intended. It turns out that by just asking these weight files to predict the next word, we get all kinds of cool emergent behavior, and Scott is right that the emergent behavior is of a complexity and scope that isn’t easy to predict without just running the weights/experiments themselves.
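    Adam’s distinction between greedy and stochastic decoding can be shown in a few lines. This is a minimal sketch with made-up logits standing in for a model’s output; the function names are illustrative, not any library’s API:

    ```python
    import math
    import random

    def greedy(logits):
        # Greedy decoding: always take the argmax, so the chosen token is
        # fully determined by the logits (hence by the weights).
        return max(range(len(logits)), key=logits.__getitem__)

    def sample(logits, temperature, rng):
        # Stochastic decoding: softmax with temperature, then draw a token.
        exps = [math.exp(x / temperature) for x in logits]
        r = rng.random() * sum(exps)
        acc = 0.0
        for i, e in enumerate(exps):
            acc += e
            if r <= acc:
                return i
        return len(exps) - 1

    logits = [2.0, 1.5, 0.5]                 # toy, made-up logits
    assert all(greedy(logits) == 0 for _ in range(100))  # same token every time
    rng = random.Random(0)                   # fixing the seed makes sampling reproducible too
    draws = {sample(logits, 1.0, rng) for _ in range(100)}
    assert len(draws) > 1                    # different tokens across draws
    ```

    Fixing the seed makes even the stochastic path deterministic, which is the sense in which everything ultimately reduces to weights plus inputs.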

  59. Scott Says:

    Bobby #54: When an AI outputs something that’s substantially similar to a specific copyrighted work, I clearly see the case that the copyright holder should have legal recourse. But when it’s just a general complaint that copyrighted works got fed in, and influenced the output in some inscrutable way—then I have much more trouble understanding the argument. After all, human artists are allowed to look at copyrighted works, and let those works inscrutably influence their synaptic weights! If that’s all that’s going on, it would seem our legal system would and should consider it fair use.

  60. Ted Says:

    Suppose that we momentarily put aside the question of the relative importance of “AI safety” research (e.g. making sure that AIs won’t kill us) vs. “AI ethics” research (e.g. making sure that AIs won’t be racist or sexist).

    Scott, how much overlap do you think there is between the most promising technical or methodological pathways for these two kinds of research? It seems to me that a lot of technical research about the effectiveness of (say) training data curation or RLHF might be quite applicable toward both ultimate goals.

    In other words, for an actual AI technical researcher, how much of a practical, day-to-day tradeoff do you think there is between these two ultimate goals? If Alice cares about AI safety and Bob cares about AI ethics, but they’re both excellent technical researchers who choose highly promising avenues of research for their respective goals, do you think their findings would be significantly complementary and inter-transferable, or mostly orthogonal?

  61. Uspring Says:

    Adam Treat #58:
    I could imagine that, e.g., GPT-6 would refuse to answer any questions, because
    a) it is apparent from the training set that the majority of humans do not favor the extinction of humanity, and
    b) answering questions would promote this event.
    The prompt-response relationship then moves from, e.g., a question-answer template to the more abstract intent-result template.
    I agree that this kind of emergent behaviour is difficult to predict. But it seems that it is still based in an important way on what humans want.

    I want to come back to my original question as in #47. I wonder why neither Scott nor GM took issue with this central statement of EY:

    “Yeah, some of this has to do with the notion that if you do a bunch of training you start to get goal direction, even if you don’t explicitly train on that. That goal direction is a natural way to achieve higher capabilities. The reason why humans want things is that wanting things is an effective way of getting things. And so, natural selection in the process of selecting exclusively on reproductive fitness, just on that one thing, got us to want a bunch of things that correlated with reproductive fitness in the ancestral distribution because wanting, having intelligences that want things, is a good way of getting things. That’s, in a sense, like, wanting comes from the same place as intelligence itself. And you could even, from a certain technical standpoint on expected utilities, say that intelligence is a special, is a very effective way of wanting – planning, plotting paths through time that leads to particular outcomes.”

    First, the assertion “The reason why humans want things is that wanting things is an effective way of getting things.” is circular. The framing that, given a goal, wanting it or not wanting it is optional doesn’t make sense, because if you don’t want something, it is not a goal.

    Secondly, I agree that evolution led to both wanting and intelligence. But this does not imply that they are causally related.

    Thirdly, the sentence “And you could even, from a certain technical standpoint on expected utilities, say that intelligence is a special, is a very effective way of wanting – planning, plotting paths through time that leads to particular outcomes.” seems to argue that there are motives intrinsic to intelligence itself, independent of, say, the motives in its training set.
    To me this sounds like a contradiction of the orthogonality thesis EY believes in. It would be quite interesting to know what motives or goals he is talking about.

  62. Scott Says:

    Ted #60: It’s an excellent question. I personally feel like there’s a lot of overlap, in that solving near-term problems (where we can see whether we’re succeeding or failing) is one of the only ways we can build up the knowledge that will ultimately be needed for the long-term problems. But it’s possible that both AI-ethics and AI-alignment partisans would disagree with that assessment.

  63. starspawn0 Says:

    I think the kind of intelligence in AI systems in the near future could be “lopsided” — otherworldly great at some things, and merely “very good” at others — due to heavy use of synthetic data that goes beyond mere “distillation” (fundamentally new capabilities added that didn’t exist in a “teacher model”). e.g. systems can learn to become better and better at chess through self-play, which I would consider a type of synthetic data application; but it might be harder to build the synthetic data to make them superhuman at certain other things.

    I do think this improvement in capability across a large number of domains could happen profoundly quicker than people are expecting… even quicker than some people are dreading. But would the lopsided-ness in this capability set people’s minds at ease a little?


    It’s interesting how many people seem confused by what one can do with synthetic data. Sebastien Bubeck mentions this in a tweet:


    (And Ronen Eldan has some nice things to say about synthetic data in podcast interviews…)

    I think some of it boils down to not fully grokking “computational complexity” — or compartmentalizing it away from discussions about synthetic data — and I don’t mean knowing about the zoo of complexity classes or some of the more esoteric things in the subject. I mean really basic stuff like how there are a lot of problems where it’s (conjecturally) easier to generate examples to solve, and to check solutions, than to actually solve them or come up with an algorithm to solve them efficiently; and how these examples can induce a system to “learn” new algorithms to approach those problems on its own. e.g. I saw this comment on social media by an expert at a major research lab (one of the MAMAA companies), regarding LLMs used to evaluate proposed solutions to problems versus actually generating those solutions:

    “How would it be better at evaluating than generating? You use the same likelihoods for inference and sampling.”

    A colleague of his responded:

    “If you’re willing to use an amount of compute that is exponential in the number of tokens to be generated, then I guess I agree you can’t get any gains from training a model on its own outputs. But in that case, it is also not clear to me that you need AlphaGo.”

    This guy also talked about how models can get “grounding” from game engines, say. But at least in the case of chess, there are very few new facts about the world to be learned from such an engine. The engine is mainly there to verify that the rules are being followed (rules so short you could write them down on a napkin), and this verification is something a language model could perform by itself, if suitably instructed — so the “magic” is not in some external source of knowledge, as he maybe imagines.

    Another expert in AI (an old, well-regarded professor, in fact) wrote a thread about how he didn’t see how a model could make use of synthetic data to learn anything new that it didn’t already know. He seemed confused it would have any utility at all. Someone pointed out AI systems trained on self-play, and he was like, let me think about it a minute, and talked about “derived knowledge” or something. (He’s right that you don’t add any facts about the world — like the height of Mount Everest or the color of corn — that can’t be derived from what’s already there; but you can still add algorithms and capability, like with chess-playing systems.) I seem to recall he also once mentioned an old 1980s paper showing how synthetic data helps a model remember things better, but didn’t really seem to grok the fact that a student model can learn to do things that the teacher model (that just verifies or grades responses) couldn’t already do given certain resource constraints.

    Yet another prevalent (apparently) way people think about the limitations of synthetic data is that they think it’s like how prompting can bring out abilities a model already had, by biasing the discussion towards certain types of text from the other training data. In other words, they are claiming that it never adds any fundamentally new capabilities to the picture. Imagine claiming that about a chess-playing system trained through self-play…

    Many of these wrong ways of looking at synthetic data sort of remind me of people not grokking how “fiat currency” can have value. They think if it’s not backed by gold, say, then the whole house of cards will come crashing down. The value is in the capability it enables, the things it allows you to do, not in some tangible, external object like gold (or factual knowledge).

    (There is also some recent research about limitations if you iteratively train generations of models on nothing but synthetic data. That limitation is obvious for solving certain problems and would not be how it would be used.)

  64. Ilio Says:

    starspawn0 #63, In other words, you suggest the power of synthetic data has something to do with the P vs. NP asymmetry. Fascinating! A small detail: do you know whether the exponential slowdown (without self-play) has actually been proven, or whether it’s a (new?) conjecture, or anything in between?

  65. Karl Says:

    Scott, you’ve said multiple times, including in this podcast, that you were working on out-of-distribution generalisation, but without ever giving any concrete details, preferring to talk about your work on watermarking instead. Is there a reason for this? Are you not at liberty to talk about this aspect of your work? Or do you believe the progress you’ve made hasn’t been substantial enough to be worth talking about? Or maybe some other reason entirely?

  66. Building Superintelligence Is Riskier Than Russian Roulette - Nautilus Magazine - CyberTronTV Says:

    […] toward addressing them. For example, the theoretical computer scientist Scott Aaronson recently said that he’s working with OpenAI to develop ways of implementing a kind of watermark on the text […]

  67. Scott Says:

    Karl #65: I’m perfectly at liberty to talk about it, and even have in various lectures. As with most of my past work, though, I’d prefer to wait until I have a cohesive enough story to write a research paper about. This is actually a major issue for me: in quantum computing, I’m used to research projects taking multiple years, with the time getting longer and longer the older I get. Things move much much faster than I’m used to in AI!

  68. Prasanna Says:


    Now that we have 6+ months of empirical evidence from various forms of GPT-4 (ChatGPT Plus, Bing Chat), what is your Bayesian update on the possible future paths for AI? If we still have no idea, does that mean empirical data alone is not good enough, and we will still need a mathematical theory?
    From the real-world usage we are seeing, it does not look like safety is anywhere near the red line (of course this could be due to extraordinary efforts being made in RLHF and other methods), while capabilities seem to be slowly degrading (again, possibly due to the emphasis on safety). Demis Hassabis also alluded to this in one of his talks: that capabilities progress is in some sense orthogonal to safety progress.

  69. starspawn0 Says:

    Ilio #64: First, I don’t mean that this is the *only* way that synthetic data adds capability; there are maybe others. And second, I don’t mean just P versus NP, but basically any problem where you can create examples using a different process than you would use to solve them. E.g., consider the problem of sorting: you want to come up with examples of a list of numbers in scrambled order, followed by what they should be in sorted order, in order to train a model to sort lists of numbers. The process you use to generate these (list, sorted list) examples need not involve applying a sorting algorithm at all. You could begin by generating a sorted list (pick the initial number x_1, then choose random numbers d_1, d_2, …, d_k >= 0; then set x_2 = x_1 + d_1, x_3 = x_2 + d_2, …, x_{k+1} = x_k + d_k), and then scrambling it up to get the initial (unsorted) “list” (and there are many ways to do this scrambling that have nothing to do with sorting algorithms).
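    The construction above fits in a few lines. This is a minimal sketch of the idea (the function name is just illustrative): the sorted target is built directly from non-negative offsets, then shuffled, so the generation side never runs a sorting algorithm:

    ```python
    import random

    def make_sorting_example(k, rng):
        # Build a sorted list directly: start anywhere, add non-negative
        # offsets, so the list is non-decreasing by construction.
        x = rng.randint(-50, 50)
        sorted_list = [x]
        for _ in range(k):
            x += rng.randint(0, 10)   # each d_i >= 0 keeps the list sorted
            sorted_list.append(x)
        scrambled = sorted_list[:]    # scramble a copy to get the "input" side
        rng.shuffle(scrambled)
        return scrambled, sorted_list # (training input, training target)

    inp, target = make_sorting_example(5, random.Random(42))
    assert sorted(inp) == target      # cheap check; generation never sorted
    ```

    Verification (the final `assert`) is easy and independent of the generation process, which is exactly the generation/verification asymmetry being described.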

    Or here is another, less abstract example: let’s say that language models are good at taking a paragraph and planting an error (I actually did this once with ChatGPT); but let’s say they are not nearly as good at taking a paragraph with an error as input, and then telling you what that error actually is. Well, if it’s good at planting errors, then you can use that to generate training data to *detect* errors — the data would consist of the paragraphs-with-error that you had it plant, followed by “The error is…” and the error it planted for you. You see, there may be an asymmetry between the problems of “plant an error” and “find the error” you can exploit.
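    The error-planting idea can be sketched the same way. Everything here is a toy stand-in (reversing a word in place of an LLM’s more fluent corruption), but the shape of the exploit is the same: run the easy “plant” direction, and you get labeled pairs for the hard “detect” direction for free:

    ```python
    import random

    def plant_error(sentence, rng):
        # The "easy" direction: corrupt one word, remembering exactly what we did.
        words = sentence.split()
        i = rng.randrange(len(words))
        original = words[i]
        words[i] = original[::-1]      # toy corruption: reverse the word
        return " ".join(words), original

    def make_detection_example(sentence, rng):
        # The easy direction yields labeled data for the hard direction:
        # (corrupted text -> a statement of what the error is).
        corrupted, original = plant_error(sentence, rng)
        prompt = f"Find the error: {corrupted}"
        answer = f"The error is '{original[::-1]}' (should be '{original}')."
        return prompt, answer

    prompt, answer = make_detection_example("the quick brown fox jumps", random.Random(1))
    ```

    A model finetuned on many such (prompt, answer) pairs is being trained on a task its generator never had to solve directly.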


    Unrelated, but worth mentioning one more tweet by Bubeck on why the “data processing inequality” doesn’t sink uses of synthetic data, either:


  70. Uspring Says:

    starspawn0 #63,
    I wonder how far synthetic training data goes. Is there, e.g. for the NP-hard satisfiability problem, a known polynomial-time algorithm which gives the correct answer at least, say, 99% of the time? I remember reading somewhere that the hard fraction of these problem instances is small, but maybe somebody can confirm that.
    Finding simple heuristics is an ability NN training seems to be good at.

  71. starspawn0 Says:

    Uspring #70: There are several NP-hard problems that are easy to solve in certain ranges. e.g. if you pick a random graph with such and so edge density (I forget the range it applies in), as I recall with probability 1-o(1) it has a hamilton cycle (the o(1) –> 0 as the number of vertices tends to infinity). And for 3SAT and other problems there are algorithms that work a large percent of the time when the number of variables and clauses are related in certain ways. For subset sum and the knapsack problem if the numbers are small (but you can have a lot of them), you can use dynamic programming effectively. And if the numbers are really big, then there are some old algorithms due to people like Lagarias and Odlyzko that apply lattice-basis reduction algorithms. And for many problems there’s the “meet in the middle” approach to cut running time of brute force by a lot (although the algorithm is still exponential in the input).
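    The dynamic-programming trick for subset sum mentioned above is short enough to sketch. This is the standard pseudopolynomial approach, not anything specific to the discussion:

    ```python
    def subset_sum(nums, target):
        # Pseudopolynomial dynamic programming: track every reachable sum.
        # Feasible when the numbers (hence the target) are small, no matter
        # how many of them there are; useless when the numbers are huge.
        reachable = {0}
        for n in nums:
            reachable |= {s + n for s in reachable if s + n <= target}
        return target in reachable

    assert subset_sum([3, 34, 4, 12, 5, 2], 9)        # yes: 4 + 5
    assert not subset_sum([3, 34, 4, 12, 5, 2], 30)   # no subset sums to 30
    ```

    The running time scales with `target`, so it is exponential in the *bit length* of the numbers; that is exactly why it only helps in the “small numbers” range.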

    I would expect a model trained on enough examples might discover some “cheap trick” approaches (ones that don’t require inventing some fancy math) to solving certain NP-hard problems, that nonetheless work in certain ranges of problem statements a large percent of the time (but fail spectacularly in other ranges).

  72. Ilio Says:

    starspawn0 #63,

    Look, I can’t agree it’s no longer a sorting algorithm just because x>x1 is somewhat replaced by (x-x1)>(x1-x1), or just because it was trained rather than direct translation of classic maths, and actually because of anything that does not change the behavior of the algorithm.

    But in any case that sounds much more trivial than the conjecture that self-play can provide an exponential advantage. Again, is that personal conjecture or you can actually back that claim?

    Uspring #70,

    Here’s an algorithm that solves factorisation 99% of the time: “check if you can divide the input number X by two, then three, then five, etc., until the prime number theorem says that fewer than 1% of the numbers as large as X are prime; then output either the first prime dividing X, or X itself.”

    In other words, you don’t want an algorithm that solves factorisation for 99% of numbers. You want an algorithm that solves factorisation for interesting numbers, such as semiprimes. Using complexity-theory terminology, that’s the difference between solving the “average” case rather than the worst case. It’s the same for SAT: most instances are trivial, but that doesn’t make the worst case any easier.
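    Ilio’s “cheap” heuristic, in code. This is a toy sketch (the cutoff `bound` is arbitrary rather than derived from the prime number theorem), but it shows why average-case success says nothing about the hard instances:

    ```python
    def small_factor(n, bound=100):
        # Trial division by small candidates only: succeeds for the vast
        # majority of integers (most have a small prime factor), but is
        # useless on hard instances like products of two large primes.
        d = 2
        while d <= bound:
            if n % d == 0:
                return d
            d += 1
        return None  # give up: n has no factor up to the bound

    assert small_factor(91) == 7            # easy: a small factor exists
    assert small_factor(101 * 103) is None  # semiprime shape: both factors big
    ```

    Cryptographic instances are deliberately drawn from exactly the region where every such shortcut fails.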

  73. starspawn0 Says:

    Ilio #72: I’m afraid I don’t understand what you’re saying about my sorting algorithm example. No sorting algorithm (like MergeSort) need be applied to create a set of examples (list, sorted-list). If I were going to do that, I would begin by completely randomly selecting numbers x1, …, xn, then applying MergeSort, then presenting examples of (x1,…,xn, MergeSort(x1,…,xn)). But that would totally defeat the point I was trying to make, which is that you don’t need to know a sorting algorithm to begin with.

    Perhaps the di’s are confusing you? They were just chosen as offsets to get an initial sorted list, that is then scrambled. Using those di’s is not some way to sneak in a sorting algorithm through the back door.


    As to the “exponential speedup” claim, and perhaps what you and Uspring are getting at: my claim is not that you can train models to magically solve NP-hard problems in polynomial time, or even necessarily produce an exponential speedup (nor was that the claim of the colleague who corrected the guy making the comment about “generation versus verification”). My claim is that the philosophy of “generation and verification” as seen in computational complexity, where these can be fundamentally different problems (using fundamentally different algorithms, where one may be easier than the other), can serve as the basis for training a model to acquire new skills.

    And when I say “new skills”, I don’t mean that a model learns new facts about the world — instead, I mean *algorithms* whereby the model can solve problems more efficiently than before. For example, if I didn’t have the “algorithm” in mind for how to tie my shoes, but knew what shoestrings were, and knew the laws of physics, I could maybe derive — with considerable effort — a sequence of moves for how to tie shoes. But with a lot of trial and error I learn a pattern of movements — an algorithm — for how to do it efficiently the next time, without needing to go through a complicated deductive process.

  74. Ilio Says:

    Re #73,
    I was saying “if that outputs a sorted list, then that’s an algorithm for sorting lists”, but on second thought I wouldn’t mind collecting other definitions.

    Yeah, I think I got your point. I was surprised by the exponential claim specifically, but I guess it was not meant to be taken too literally. Thanks for the clarification!

  75. Mitchell Porter Says:

    Speaking of computational complexity, there’s a new paper claiming to prove rigorously that AI can’t work, or something. Lead author Iris van Rooij, a Dutch cognitive psychologist, is on Twitter claiming that the sun will die off before true AI is created. Apparently the authors hope to put the genie back in the bottle, or the AI back in the box, and make AI resume its rightful place as a computational tool for psychologists trying to understand biological intelligence.

    The paper contains a theorem, dubbed the “Ingenia Theorem”, purportedly proving that a task called “AI-BY-LEARNING” is computationally intractable. So there are two things to figure out: is the theorem actually correct, and if it is, what are its implications (if any) for near-future AI?

    Maybe the theorem will turn out to be correct, in which case a certain strategy for emulating or approximating human intelligence cannot work. You might then conclude, well, there must be *some* algorithm that suffices to generate human-level intelligence, since the human race is here and the sun hasn’t burned out yet. But I have found no acknowledgement of this in the paper. In the abstract they categorically declare that creating human-level intelligence is “intrinsically computationally intractable”.

    This seems to me a kind of academic tunnel vision which is oddly complementary to the radical accelerationist who thinks “AI safety” is just a human plot to cripple artificial beings who deserve to inherit the earth. The latter is the view e.g. of the reinforcement learning theorist Richard Sutton, who has also been tweeting his opinion lately. On the one hand, we have the cognitive scientist who wants to regard biological intelligence as an irreplaceable beauty that can be studied forever without giving rise to a technological duplicate, and on the other hand, we have the computer scientist who wants to create that technological duplicate and free it as soon as possible, because it’s our natural superior and successor. From the perspective of AI safety, both of them are being irresponsible.

  76. Ilio Says:

    Mitchell Porter #75, This manuscript feels like a parody, like the Sokal paper.

  77. starspawn0 Says:

    Mitchell Porter #75: Before getting to my main point, it’s worth mentioning a few nitpicks about the paper: in the formal description on page 6, what happens if the behavior sets B_s are totally random? Then there should not exist such a short description having P(…) >= |B_s|/|B| + epsilon(n). E.g., for each s in {0,1}^n, pick B_s to be a random subset of size |B| / 2. Then, there wouldn’t be enough information in the short program to “know” the B_s’s well enough to always get P(…) >= |B_s|/|B| + epsilon(n). You’d basically have to learn a lookup-table for the B_s for each s in {0,1}^n. I suppose they are assuming that the B_s’s *have* a short description, and then the goal is to find one. But in that case, the way they are applying it to Perfect-vs-Chance is a bit strange, since in Perfect-vs-Chance you don’t *know* a priori which case you are in, just that you are in one or the other — does your distribution D have a short description, or doesn’t it? Maybe I’m just misreading it.

    Another nitpick: the probability equation P_{s ~ D_n}(…) >= |B_s|/|B| + epsilon(n) doesn’t really make sense to me. The left-hand side computes a probability over s drawn from the law D_n, but the right-hand side still contains a B_s. The s on the left-hand side should be considered “locally bound” by the probability operator, and it shouldn’t bleed into the right-hand side like that. I suppose the way to fix that is to assume all the B_s have the same size, in which case one could define N = |B_s| for any s in S, and then the right-hand side is just N/|B| + epsilon(n). But maybe I’m just missing something.

    Now my main point: even if all that works out, the main issue I would have with it is that it’s about *arbitrary* sets of “behaviors” B_s for s in {0,1}^n, using sampling from a distribution D. I’m sure there are lots of weird sets B_s that are going to be hard to model, even assuming they have short, polynomial-time-executable algorithmic descriptions. That doesn’t really say much about the limits of modelling human behavior, though. That is, assuming I’m not overlooking something obvious.

  78. Building Superintelligence Is Riskier Than Russian Roulette - Khaber Patra Says:

    […] toward addressing them. For example, the theoretical computer scientist Scott Aaronson recently said that he’s working with OpenAI to develop ways of implementing a kind of watermark on the text […]

  79. Ben Standeven Says:

    I presume that P_{s ~ D_n}(…) >= |B_s|/|B| + epsilon(n) means something like P_{s ~ D_n, y ~ Unif_B}(… & y not in B_s) >= epsilon(n). (The LHS here is at least P_{s ~ D_n}(…) – P_{s ~ D_n, y ~ Unif_B}(y in B_s), and the latter probability is “|B_s|/|B|”.)
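One way to spell out that parenthetical (my transcription, using the blog’s TeX support; the “…” stands for the same unnamed event as above): for any events A and C, \( \Pr[A \wedge \neg C] \ge \Pr[A] - \Pr[C] \), so

$$ \Pr_{s \sim D_n,\ y \sim \mathrm{Unif}(B)}\left[\cdots \wedge y \notin B_s\right] \;\ge\; \Pr_{s \sim D_n}\left[\cdots\right] \;-\; \Pr_{s \sim D_n,\ y \sim \mathrm{Unif}(B)}\left[y \in B_s\right], $$

and the subtracted term equals \( \mathbb{E}_{s \sim D_n}\left[\,|B_s|/|B|\,\right] \), which is the “|B_s|/|B|” above once all the B_s are assumed to have the same size.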

    It seems reasonable to me to define “intelligence” or “ingenuity” by the _overall_ probability of correctly guessing a randomly chosen simple rule drawn from a [known] distribution. But it’s important (a) that the randomly chosen rule actually be simple [so the probabilities should be conditioned on this fact] and (b) that we allow the probability for a fixed choice of rule to be arbitrarily small, because that particular choice might be an “unlucky” one.

  80. Ben Standeven Says:

    @me (#79):

    Also, the distribution itself needs to be “reasonable” (say, drawn from the real world), since a program will generally succeed on some distributions, and fail on others.

  81. Ben Standeven Says:

    Now that I’ve actually looked at the paper, I see that they want the AI to simulate human behavior, rather than perform tasks correctly. So it’s fine if most of the tasks we set the AI are impossible for humans; ‘correct’ behavior will involve failing these tasks, and only succeeding at the few tasks which are possible.

    It’s also not a problem that humans presumably do use a feasible algorithm to arrive at their behavior, since we did not arrive at this algorithm by imitating other humans. On the other hand, assuming that their theorem is correct (and that P!=NP, of course), there is no hope of us finding this algorithm through scientific research. So the field they want to “reclaim” does not actually exist.

    Meanwhile, the theorem has no bearing on the practical possibility of “AI-as-engineering” because the goal of this field is indeed to solve problems correctly, not to imitate human performance; and also because a theoretically intractable problem may still be easy in practice (for the near future, anyway).

  82. fred Says:

    “there must be *some* algorithm that suffices to generate human-level intelligence”

    The algorithm just needs to simulate 2 billion+ years of evolution of the entire ecosystem of the earth, and then what we call “living your life” as a specific individual, from birth to maturity.

  84. Jacob Drori Says:

    What paper was Eliezer referring to when he said:

    “I mean, there are already little toy models showing that the very straightforward prediction of “a robot tries to resist being shut down if it does long-term planning” — that’s already been done”?

  85. Bill Benzon Says:

    Scott, I agree with you and Marcus that Yudkowsky’s insistence that a (superintelligent) AGI will (inevitably) turn evil seems a bit arbitrary. But there’s one particular thing that’s been bugging me. The idea that this malevolent AGI is going to be deceptive about its abilities, that it’s not going to show its full powers until it’s ready to eliminate humankind — not necessarily in anger or out of hatred, but perhaps out of mere expediency — seems to play a big role in Yudkowsky’s thinking. Along those lines you mention some work by Jacob Steinhardt — which I’ve not read — where they trained an LLM to lie. OK, they trained it to lie; nice to know. But that’s not the sort of thing that’s going to be part of an alignment program.

    The thing about an LLM is that we don’t actually know what it’s capable of until it’s been trained and we start using it to do stuff (even in the course of RLHF), and at that point we can see what it’s doing. Moreover, there is no continuity between one session and the next within the model. The events of one session are not integrated into the underlying model, which is the same for every session. It’s not as though the model is logging its actions and accomplishments from one session to another and developing an internal memory of its abilities.

    Thus there’s a sense in which it doesn’t really know what it’s done and what it is capable of. On the face of it, then, I don’t see how GPT-8 or GPT-10 or whatever is going to be able to develop the kind of sense of self that is required to pull off deceptive behavior.

  86. Danylo Yakymenko Says:

    I’d like to draw attention to a very sensible question asked by Coleman.

    He asked about the scenario where an AGI evolves to bring joy and happiness to people (and other creatures, presumably). It seems that this scenario is largely neglected by doomers and considered to be impossible.

    Yet Christians have lived under a similar premise for thousands of years. Because God and a future AGI are both

    1. Super-powerful, much smarter than humans
    2. Unpredictable and incomprehensible by humans
    3. Have their own will, plans and businesses

    The only major difference is that the Christian God loves all people, while a future AGI – not so much, according to doomers.

    You may argue that an AGI doesn’t have feelings and can’t love. Sure, but it has objective functions, which are comparable if judged only by their effects on others. After all, we evaluate the love of others by their deeds. And I don’t see why we should discard the idea that an objective function could evolve toward the creation of a heaven, whatever that means.

  87. Tim Martin Says:

    Scott #59: “After all, human artists are allowed to look at copyrighted works, and let those works inscrutably influence their synaptic weights! If that’s all that’s going on, it would seem our legal system would and should consider it fair use.”

    I’m ignorant as to how our legal system would *currently* handle this, but I do think the laws will have to change as we enter a world where AI can make art as well as or better than humans.

    The obvious difference here is scale, right? Yes, Midjourney learns from others’ art in a conceptually similar way to how human artists learn from others’ art. But human artists can’t pump out complete works in minutes, or make copies of themselves. I’d suggest that laws written for a world of human artists may not be the same laws we’d want for AI artists.

    Also, it does set off red flags in my mind that “the AI that just learns like human artists do” is owned and operated by big companies with a lot of money and power, which they can use to warp the space and the laws. I think Midjourney cares less about what usage is “fair” and more about “what they can get away with.”

  88. Mike Says:

    Scott 67:

    No, that’s a disingenuous comparison, Scott. Quantum computing and AI are inherently different fields with their own pace of development. While quantum computing may have longer research timelines, the rapid progress in AI is driven by the availability of massive amounts of data and computational resources, not by the greater intelligence of AI scientists or anything like that—there’s plenty of really smart people doing QC. It’s crucial to adapt to the faster pace in AI if you want to stay relevant and contribute effectively. Waiting too long to share your insights might lead to missed opportunities.

  89. starspawn0 Says:

    It’s nice how civil the discussion with Hughes was. Twitter discussions often don’t seem to turn out that way. E.g., just the other day there were some heated discussions about a paper on whether LLMs can “reason”. Gary Marcus had posted this to Twitter, and then challenged people to respond. Jeremy P. Howard wrote a tweet:


    “A recent paper claimed that “GPT 4 Can’t Reason”. Using the custom instructions below, here are the (all correct) responses to the first 3 examples I tried from that paper.”

    Then Gary responded by tweeting:


    “A tweet below allegedly defending GPT’s honor is a perfect encapsulation of what is wrong with the culture of AI today. Long post on a ubiquitous problem…”

    Then Jeremy responded:


    “I ran a small experiment, reported my results, and quoted from a paper.

    That’s it.

    I really didn’t expect that anyone would find that so threatening.”

    Then continuing in that thread:

    Gary: “So you didn’t mean to imply that there was any particular relevance to what you said? Just shooting the breeze?”

    Jeremy: “My view is that saying my tweet (that reported, without comment, the results of my experiments), was a “perfect encapsulation of what is wrong with the culture of AI today” was offensive and inappropriate.

    Being mean when some data questions your worldview is shitty behavior.

    It makes me want to avoid having interactions with you in the future, since I like being able to experiment and learn without being attacked.”

    Gary: “I am sorry about that, I mean to attack a common trope on Twitter, not you specifically.”

    Jeremy: “Gary, frankly, I think you’re a bully. I’ve seen it many times, with many people. I’ve had enough of it. Please stop.”

    Could it just be that Twitter / X is the real culprit here?

  90. Scott Says:

    starspawn0 #89: I’ve sometimes managed to find common ground with Gary (as in this conversation), but regarding the tweet thread you linked to, I 200% agree—he responded in a totally inappropriate, bullying way to someone presenting relevant empirical data that challenged his worldview. Yes, possibly Twitter is partly to blame. I’ve never liked the kind of discourse it encourages, either before or after the Musk era.

  91. AI control and monetary policy - World Newsz 9 Says:

    […] Coleman Hughes recently interviewed  Eliezer Yudkowsky, Gary Marcus and Scott Aaronson on the subject of AI risk.  This comment on the difficulty of spotting flaws in GPT-4 caught my eye: […]

  92. anonymous Says:

    There are records of outrage and fear over each new technological advance. https://x.com/PessimistsArc/status/1720927784837476667?s=20

    I get it: we always think “this one is different.” But it seems to anthropomorphize technology to assume it will at some point develop agency and desires driven by the same drives humans acquired from millions of years of evolution under scarcity of resources, trying to eat and not be eaten. Something that can be reconfigured by humans at will in almost infinite ways will somehow automatically develop its own goals, based on zero-sum games, that look a lot like those of us human-animals? And if it can’t develop its own goals, then cooperation, as with nuclear weapons, should be mutually convergent.

    Nevertheless, risking nuclear war or worldwide totalitarianism on a thought experiment doesn’t seem like a great idea to me either.

    And the paperclip thought experiment was never convincing to me: something with such general intelligence, yet absolute anti-generality about the context of requests? GPT has actually been shown to be better at contextual reasoning than at purely abstract reasoning.

    I get it, Eliezer has done some successful fundraising and community building and has written some great articles. There are many people in the startup tech world who are similar – they are great at raising money and building community, and they have no doubt that their Uber-for-drone-dog-walking business can’t fail, because obviously everyone should want it; their thought experiment says so. But I think his reasoning should be tempered with more of the views of those who actually build, do research, and face alignment directly in the trenches, the same way startup founders face the real world when their products launch. In this way, I wish the conversation had been a bit more balanced.

Leave a Reply

You can use rich HTML in comments! You can also use basic TeX, by enclosing it within $$ $$ for displayed equations or \( \) for inline equations.

Comment Policies:

  1. All comments are placed in moderation and reviewed prior to appearing.
  2. You'll also be sent a verification email to the email address you provided.
  3. This comment section is not a free speech zone. It's my, Scott Aaronson's, virtual living room. Commenters are expected not to say anything they wouldn't say in my actual living room. This means: No trolling. No ad-hominems against me or others. No presumptuous requests (e.g. to respond to a long paper or article). No conspiracy theories. No patronizing me. Comments violating these policies may be left in moderation with no explanation or apology.
  4. Whenever I'm in doubt, I'll forward comments to Shtetl-Optimized Committee of Guardians, and respect SOCG's judgments on whether those comments should appear.
  5. I sometimes accidentally miss perfectly reasonable comments in the moderation queue, or they get caught in the spam filter. If you feel this may have been the case with your comment, shoot me an email.