Archive for the ‘The Fate of Humanity’ Category

I Had A Dream

Sunday, January 18th, 2026

Alas, the dream that I had last night was not the inspiring, MLK kind of dream, even though tomorrow happens to be the great man’s day.  No, I had the literal kind of dream, where everything seems real but then you wake up and remember only the last fragments.

In my case, those last fragments involved a gray-haired bespectacled woman, a fellow CS professor.  She and I were standing in a dimly lit university building.  And she was grabbing me by the shoulders, shaking me.

“Look, Scott,” she was saying, “we’re both computer scientists.  We were both around in the 90s.  You know as well as I do that, if someone claims to have built an AI, but it turns out they just loaded a bunch of known answers, written by humans, into a lookup table, and then they search the table when a question comes … that’s not AI.  It’s slop.  It’s garbage.”

“But…” I interjected.

“Oh of course,” she continued, “so you make the table bigger.  What do you have now?  More slop!  More garbage!  You load the entire Internet into the table.  Now you have an astronomical-sized piece of garbage!”

“I mean,” I said, “there’s an exponential blowup in the number of possible questions, which can only be handled by…”

“Of course,” she said impatiently, “I understand as well as anyone.  You train a neural net to predict a probability distribution over the next token.  In other words, you slice up and statistically recombine your giant lookup table to disguise what’s really going on.  Now what do you get?  You get the biggest piece of garbage the world has ever seen.  You get a hideous monster that’s destroying and zombifying our entire civilization … and that still understands nothing more than the original lookup table did.”

“I mean, you get a tool that hundreds of millions of people now use every day—to write code, to do literature searches…”

By this point, the professor was screaming at me, albeit with a pleading tone in her voice.  “But no one who you respect uses that garbage! Not a single one!  Go ahead and ask them: scientists, mathematicians, artists, creators…”

“I use it,” I replied quietly.  “Most of my friends use it too.”

The professor stared at me with a new, wordless horror.  And that’s when I woke up.

I think I was next going to say something about how I agreed that generative AI might be taking the world down a terrible, dangerous path, but how dismissing the scientific and philosophical immensity of what’s happened, by calling it “slop,” “garbage,” etc., is a bad way to talk about the danger. If so, I suppose I’ll never know how the professor would’ve replied to that. Though, if she was just an unintegrated part of my own consciousness—or a giant lookup table that I can query on demand!—perhaps I could summon her back.

Mostly, I remember being surprised to have had a dream that was this coherent and topical. Normally my dreams just involve wandering around lost in an airport that then transforms itself into my old high school, or something.

FREEDOM (while hoping my friends stay safe)

Sunday, January 11th, 2026

This deserves to become one of the iconic images of human history, alongside the Tank Man of Tiananmen Square and so forth.

Here’s Sharifi Zarchi, a computer engineering professor at Sharif University in Tehran, posting on Twitter/X: “Ali Khamenei is not my leader.”

Do you understand the balls of steel this takes? If Professor Zarchi can do this—if hundreds of thousands of young Iranians can take to the streets even while the IRGC and the Basij fire live rounds at them—then I can certainly handle people yelling at me on this blog!

I’m in awe of the Iranian people’s courage, and hope I’d have similar courage in their shoes.

I was also enraged this week at the failure of much of the rest of the world to help, to express solidarity, or even to pay much attention to the Iranian people’s plight (though maybe that’s finally changing this weekend).

I’ve actually been working on a CS project with a student in Tehran. Because of the Internet blackout, I haven’t heard from him in days. I pray that he’s safe. I pray that all my friends and colleagues in Iran, and their family members, stay safe and stay strong.

If any Iranian Shtetl-Optimized reader manages to get onto the Internet, and would like to share an update—anonymously if desired, of course—we’d all be obliged.

May the Iranian people be free from tyranny soon.

Update: I’m sick with fear for my many colleagues and friends in Iran and their families. I hope they’re still alive; because of the communications blackout, I have no idea. Perhaps 12,000 have already been machine-gunned in the streets, while the hypocrites and cowards of the unjust world, who marched against a tiny democracy for defending itself, invent excuses or explicitly defend the murderous regime in Tehran. WTF is the US waiting for? Trump’s “red line” was crossed days ago. May we give the Ayatollah the martyrdom he preaches, and liberate his millions of captives.

The Goodness Cluster

Wednesday, January 7th, 2026

The blog-commenters come at me one by one, a seemingly infinite supply of them, like masked henchmen in an action movie throwing karate chops at Jackie Chan.

“Seriously Scott, do better,” says each henchman when his turn comes, ignoring all the ones before him who said the same. “If you’d have supported American-imposed regime change in Venezuela, like just installing María Machado as the president, then surely you must also support Trump’s cockamamie plan to invade Greenland! For that matter, you logically must also support Putin’s invasion of Ukraine, and China’s probable future invasion of Taiwan!”

“No,” I reply to each henchman, “you’re operating on a wildly mistaken model of me. For starters, I’ve just consistently honored the actual democratic choices of the Venezuelans, the Greenlanders, the Ukrainians, and the Taiwanese, regardless of coalitions and power. Those choices are, respectively, to be rid of Maduro, to stay part of Denmark, and to be left alone by Russia and China—in all four cases, as it happens, the choices most consistent with liberalism, common sense, and what nearly any 5-year-old would say was right and good.”

“My preference,” I continue, “is simply that the more pro-Enlightenment, pluralist, liberal-democratic side triumph, and that the more repressive, authoritarian side feel the sting of defeat—always, in every conflict, in every corner of the earth.  Sure, if authoritarians win an election fair and square, I might clench my teeth and watch them take power, for the sake of the long-term survival of the ideals those authoritarians seek to destroy. But if authoritarians lose an election and then arrogate power anyway, what’s there even to feel torn about? So, you can correctly predict my reaction to countless international events by predicting this. It’s like predicting what Tit-for-Tat will do on a given move in the Iterated Prisoners’ Dilemma.”

“Even more broadly,” I say, “my rule is simply that I’m in favor of good things, and against bad things.  I’m in favor of truth, and against falsehood. And if anyone says to me: because you supported this country when it did good thing X, you must also support it when it does evil thing Y? (Either as a reductio ad absurdum, or because the person actually wants evil thing Y?) Or if they say: because you agreed with this person when she said this true thing, you must also endorse this false thing she said? I reply: good over evil and truth over lies in every instance—if need be, down to the individual subatomic particles of morality and logic.”

The henchmen snarl, “so now it’s laid bare! Now everyone can see just how naive and simplistic Aaronson’s so-called ‘political philosophy’ really is!  Do us all a favor, Scott, and stick to quantum physics! Stick to computer science! Do you not know that philosophers and political scientists have filled libraries debating these weighty matters? Are you an act-utilitarian? A Kantian? A neocon or neoliberal? An America-First interventionist? Pick some package of values, then answer to us for all the commitments that come with that package!”

I say: “No, I don’t subcontract out my soul to any package of values that I can define via any succinct rule. Instead, given any moral dilemma, I simply query my internal Morality Oracle and follow whatever it tells me to do, unless of course my weakness prevents me. Some would simply call the ‘Morality Oracle’ my conscience. But others would hold that, to whatever extent people’s consciences have given similar answers across vast gulfs of time and space and culture, it’s because they tapped into an underlying logic that humans haven’t fully explained, but that they no more invented than the rules of arithmetic. The world’s prophets and sages have tried again and again over the millennia to articulate that logic, with varying admixtures of error and self-interest and culture-dependent cruft. But just like with math and science, the clearest available statements seem to me to have gotten clearer over time.”

The Jackie Chan henchman smirks at this. “So basically, you know the right answers to moral questions because of a magical, private Morality Oracle—like, you know, the burning bush, or Mount Sinai? And yet you dare to call yourself a scientific rationalist, a foe of obscurantism and mysticism? Do you have any idea how pathetic this all sounds, as an attempted moral theory?”

“But I’m not pretending to articulate a moral theory,” I reply. “I’m merely describing what I do. I mean, I can gesture toward moral theories and ideas that capture more of my conscience’s judgments than others, like liberalism, the Enlightenment, the Golden Rule, or utilitarianism. But if a rule ever appears to disagree with the verdict of my conscience—if someone says, oh, you like utilitarianism, so you must value the lives of these trillion amoebas above this one human child’s, even torture and kill the child to save the amoebas—I will always go with my conscience and damn the rule.”

“So the meaning of goodness is just ‘whatever seems good to you’?” asks the henchman, between swings of his nunchuk. “Do you not see how tautological your criterion is, how worthless?”

“It might be tautological, but I find it far from worthless!” I offer. “If nothing else, my Oracle lets me assess the morality of people, philosophies, institutions, and movements, by simply asking to what extent their words and deeds seem guided by the same Oracle, or one that’s close enough! And if I find a cluster of millions of people whose consciences agree with mine and each others’ in 95% of cases, then I can point to that cluster, and say, here. This cluster’s collective moral judgment is close to what I mean by goodness. Which is probably the best we can do with countless questions of philosophy.”

“Just like, in the famous Wittgenstein riff, we define ‘game’ not by giving an if-and-only-if, but by starting with poker, basketball, Monopoly, and other paradigm-cases and then counting things as ‘games’ to whatever extent they’re similar—so too we can define ‘morality’ by starting with a cluster of Benjamin Franklin, Frederick Douglass, MLK, Vasily Arkhipov, Alan Turing, Katalin Karikó, those who hid Jews during the Holocaust, those who sit in Chinese or Russian or Iranian or Venezuelan torture-prisons for advocating democracy, etc, and then working outward from those paradigm-cases, and whenever in doubt, by seeking reflective equilibrium between that cluster and our own consciences. At any rate, that’s what I do, and it’s what I’ll continue doing even if half the world sneers at me for it, because I don’t know a better approach.”

Applications to the AI alignment problem are left as exercises for the reader.


Announcement: I’m currently on my way to Seattle, to speak in the CS department at the University of Washington—a place that I love but haven’t visited, I don’t think, since 2011 (!). If you’re around, come say hi. Meanwhile, feel free to karate-chop this post all you want in the comment section, but I’ll probably be slow in replying!

Venezuela through the lens of good and evil

Sunday, January 4th, 2026

I woke up yesterday morning happy and relieved that the Venezuelan people were finally free of their brutal dictator.

I ended the day angry and depressed that Trump, as it turns out, does not seek to turn over Venezuela to María Corina Machado and her inspiring democracy movement—the pro-Western, Nobel-Peace-Prize-winning, slam-dunk obvious, already electorally-confirmed choice of the Venezuelan people—but instead seeks to cut a deal with the remnants of Maduro’s regime to run Venezuela as a US-controlled petrostate.

I confess that I have trouble understanding people who don’t have either of these two reactions.

On one side of me, of course, are the sneering MAGA bullies who declare that might makes right, that the strong do what they can while the weak suffer what they must, and that the US should rule Venezuela for the same reason why Russia should rule Ukraine and China should rule Taiwan: namely, because the small countries have the misfortune of being in the large ones’ “spheres of influence.”

But on my other side are those who squeal that toppling a dictator, however odious, is against the rules, because right is whatever “international law” declares it to be—i.e., the “international law” that’s now been degraded by ideologues to the point of meaninglessness, the “international law” that typically sides with whichever terrorists and murderers have the floor of the UN General Assembly and that condemns persecuted minorities for defending themselves.

The trouble is, any given framework of law needs to do at least one of three things to impose its will on me:

  1. Compel my obedience, by credibly threatening punishment if I defy it.
  2. Win the assent of my conscience, by the force of its moral example.
  3. Buy my consent through reciprocity: if this framework will defend my family from being murdered, I therefore ought to defend it.

But “international law,” as it exists today, fails spectacularly on all three of these counts. Ergo, as far as I’m concerned, it can take a long walk off a short pier.

Against these two attempted reductions of right to something that it isn’t, I simply say:

Right is right. Good is good. Evil is evil. Good is liberal democracy and the Enlightenment. Evil is authoritarianism and liars and bullies.

Good, in this case, is Maria Machado and the Venezuelans who went to prison, who took to the streets, who monitored every polling station to prove Edmundo González’s victory. Evil is those who oppose them.

But who gets to decide what’s good and what’s evil? Well, if you’re here asking me, then I decide.

But don’t the evildoers believe themselves to be good? Yes, but they’re wrong.

It’s crucial that I’m not appealing here to anything exotic or esoteric. I’m appealing only to the concepts of good and evil that I suspect every reader of this blog had as a child, that they got from fables and Disney movies and Saturday morning cartoons and the like, before some of them went to college and learned that those concepts were naïve and simplistic and only for stupid people.

Look: I regularly appear, to my amusement and chagrin, in Internet lists of the smartest people on earth, alongside Terry Tao and Garry Kasparov and Ed Witten. I did publish my first paper at 15, and finished my PhD in theoretical computer science at 22, and became an MIT professor soon afterward, yada yada.

And for whatever it’s worth, I’m telling you that I think the “naïve, simplistic” concepts of good and evil of post-WWII liberal democracy were fine all along, and not only for stupid people. In my humble opinion. Of course those concepts can be improved upon—indeed, criticism and improvement and self-correction are crucial parts of them—but they’re infinitely better than the realistic alternatives on offer from left and right, including kleptocracy, authoritarianism, and what we’re now calling “the warmth of collectivism.”

And according to these concepts, María Machado and the other Venezuelans who stand with her for democracy are good, if anything is good. Trump, despite all the evil in his heart and in his past, will do something profoundly good if he reverses himself and lets those Venezuelans have what they’ve fought for. He’ll do evil if he doesn’t.

Happy New Year, everyone. May goodness reign over the earth.

Understanding vs. impact: the paradox of how to spend my time

Thursday, December 11th, 2025

Not long ago William MacAskill, the founder of the Effective Altruist movement, visited Austin, where I got to talk with him in person for the first time. I was a fan of his book What We Owe the Future, and found him as thoughtful and eloquent face-to-face as I did on the page. Talking to Will inspired me to write the following short reflection on how I should spend my time, which I’m now sharing in case it’s of interest to anyone else.


By inclination and temperament, I simply seek the clearest possible understanding of reality.  This has led me to spend time on (for example) the Busy Beaver function and the P versus NP problem and quantum computation and the foundations of quantum mechanics and the black hole information puzzle, and on explaining whatever I’ve understood to others.  It’s why I became a professor.

But the understanding I’ve gained also tells me that I should try to do things that will have huge positive impact, in what looks like a pivotal and even terrifying time for civilization.  It tells me that seeking understanding of the universe, like I’ve been doing, is probably nowhere close to optimizing any values that I could defend.  It’s self-indulgent, a few steps above spending my life learning to solve Rubik’s Cube as quickly as possible, but only a few.  Basically, it’s the most fun way I could make a good living and have a prestigious career, so it’s what I ended up doing.  I should be skeptical that such a course would coincidentally also maximize the good I can do for humanity.

Instead I should plausibly be figuring out how to make billions of dollars, in cryptocurrency or startups or whatever, and then spending it in a way that saves human civilization, for example by making AGI go well.  Or I should be convincing whatever billionaires I know to do the same.  Or executing some other galaxy-brained plan.  Even if I were purely selfish, as I hope I’m not, still there are things other than theoretical computer science research that would bring more hedonistic pleasure.  I’ve basically just followed a path of least resistance.

On the other hand, I don’t know how to make billions of dollars.  I don’t know how to make AGI go well.  I don’t know how to influence Elon Musk or Sam Altman or Peter Thiel or Sergey Brin or Mark Zuckerberg or Marc Andreessen to do good things rather than bad things, even when I have gotten to talk to some of them.  Past attempts in this direction by extremely smart and motivated people—for example, those of Eliezer Yudkowsky and Sam Bankman-Fried—have had, err, uneven results, to put it mildly.  I don’t know why I would succeed where they failed.

Of course, if I had a better understanding of reality, I might know how better to achieve prosocial goals for humanity.  Or I might learn why they were actually the wrong goals, and replace them with better goals.  But then I’m back to the original goal of understanding reality as clearly as possible, with the corresponding danger that I spend my time learning to solve Rubik’s Cube faster.

Theory and AI Alignment

Saturday, December 6th, 2025

The following is based on a talk that I gave (remotely) at the UK AI Safety Institute Alignment Workshop on October 29, and that I then procrastinated on writing up for more than a month. Enjoy!


Thanks for having me! I’m a theoretical computer scientist. I’ve spent most of my ~25-year career studying the capabilities and limits of quantum computers. But for the past 3 or 4 years, I’ve also been moonlighting in AI alignment. This started with a 2-year leave at OpenAI, in what used to be their Superalignment team, and it’s continued with a 3-year grant from Coefficient Giving (formerly Open Philanthropy) to build a group here at UT Austin, looking for ways to apply theoretical computer science to AI alignment. Before I go any further, let me mention some action items:

  • Our Theory and Alignment group is looking to recruit new PhD students this fall! You can apply for a PhD at UTCS here; the deadline is quite soon (December 15). If you specify that you want to work with me on theory and AI alignment (or on quantum computing, for that matter), I’ll be sure to see your application. For this, there’s no need to email me directly.
  • We’re also looking to recruit one or more postdoctoral fellows, working on anything at the intersection of theoretical computer science and AI alignment! Fellowships to start in Fall 2026 and continue for two years. If you’re interested in this opportunity, please email me by January 15 to let me know you’re interested. Include in your email a CV, 2-3 of your papers, and a research statement and/or a few paragraphs about what you’d like to work on here. Also arrange for two recommendation letters to be emailed to me. Please do this even if you’ve contacted me in the past about a potential postdoc.
  • While we seek talented people, we also seek problems for those people to solve: any and all CS theory problems motivated by AI alignment! Indeed, we’d like to be a sort of theory consulting shop for the AI alignment community. So if you have such a problem, please email me! I might even invite you to speak to our group about your problem, either by Zoom or in person.

Our search for good problems brings me nicely to the central difficulty I’ve faced in trying to do AI alignment research. Namely, while there’s been some amazing progress over the past few years in this field, I’d describe the progress as having been almost entirely empirical—building on the breathtaking recent empirical progress in AI capabilities. We now know a lot about how to do RLHF, how to jailbreak and elicit scheming behavior, how to look inside models and see what’s going on (interpretability), and so forth—but it’s almost all been a matter of trying stuff out and seeing what works, and then writing papers with a lot of bar charts in them.

The fear is of course that ideas that only work empirically will stop working when it counts—like, when we’re up against a superintelligence. In any case, I’m a theoretical computer scientist, as are my students, so of course we’d like to know: what can we do?

After a few years, alas, I still don’t feel like I have any systematic answer to that question. What I have instead is a collection of vignettes: problems I’ve come across where I feel like a CS theory perspective has helped, or plausibly could help. So that’s what I’d like to share today.


Probably the best-known thing I’ve done in AI safety is a theoretical foundation for how to watermark the outputs of Large Language Models. I did that shortly after starting my leave at OpenAI—even before ChatGPT came out. Specifically, I proposed something called the Gumbel Softmax Scheme, by which you can take any LLM that’s operating at a nonzero temperature—any LLM that could produce exponentially many different outputs in response to the same prompt—and replace some of the entropy with the output of a pseudorandom function, in a way that encodes a statistical signal, which someone who knows the key of the PRF could later detect and say, “yes, this document came from ChatGPT with >99.9% confidence.” The crucial point is that the quality of the LLM’s output isn’t degraded at all, because we aren’t changing the model’s probabilities for tokens, but only how we use the probabilities. That’s the main thing that was counterintuitive to people when I explained it to them.
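
To make this concrete, here is a minimal Python sketch of one standard way to instantiate the idea (a toy illustration of my own, not OpenAI’s code; the hash-based PRF and the fixed context window are simplifying assumptions). Each candidate next token i gets a pseudorandom value r_i from a keyed PRF of the recent context; the sampler outputs the token maximizing r_i^{1/p_i}, which over uniform r_i is distributed exactly according to the model’s probabilities; and the detector, who knows the key, sums -ln(1-r) over the document and checks whether the total is far above its expected value for ordinary text.

    import hashlib, math

    def prf(key, context, token):
        # Toy stand-in for a keyed pseudorandom function: hash -> uniform value in (0,1).
        h = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
        return (int.from_bytes(h[:8], "big") + 0.5) / 2**64

    def watermarked_sample(probs, key, context):
        # Pick the token i maximizing r_i^(1/p_i), where r_i = prf(key, context, i).
        # Over uniform r_i this choice is distributed exactly according to probs,
        # so the model's output distribution (and hence quality) is unchanged.
        # "context" should be the same window of recent tokens the detector rebuilds.
        return max((i for i, p in enumerate(probs) if p > 0),
                   key=lambda i: prf(key, context, i) ** (1.0 / probs[i]))

    def detection_score(tokens, key, window=4):
        # Sum -ln(1 - r) over the document.  For ordinary text this sum has mean
        # roughly len(tokens); watermarked text scores noticeably higher, which is
        # what lets someone holding the key claim high confidence of provenance.
        score = 0.0
        for t in range(window, len(tokens)):
            context = tuple(tokens[t - window:t])
            score += -math.log(1.0 - prf(key, context, tokens[t]))
        return score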

Unfortunately, OpenAI never deployed my method—they were worried (among other things) about risk to the product, customers hating the idea of watermarking and leaving for a competing LLM. Google DeepMind has deployed something in Gemini extremely similar to what I proposed, as part of what they call SynthID. But you have to apply to them if you want to use their detection tool, and they’ve been stingy with granting access to it. So it’s of limited use to my many faculty colleagues who’ve been begging me for a way to tell whether their students are using AI to cheat on their assignments!

Sometimes my colleagues in the alignment community will say to me: look, we care about stopping a superintelligence from wiping out humanity, not so much about stopping undergrads from using ChatGPT to write their term papers. But I’ll submit to you that watermarking actually raises a deep and general question: in what senses, if any, is it possible to “stamp” an AI so that its outputs are always recognizable as coming from that AI? You might think that it’s a losing battle. Indeed, already with my Gumbel Softmax Scheme for LLM watermarking, there are countermeasures, like asking ChatGPT for your term paper in French and then sticking it into Google Translate, to remove the watermark.

So I think the interesting research question is: can you watermark at the semantic level—the level of the underlying ideas—in a way that’s robust against translation and paraphrasing and so forth? And how do we formalize what we even mean by that? While I don’t know the answers to these questions, I’m thrilled that brilliant theoretical computer scientists, including my former UT undergrad (now Berkeley PhD student) Sam Gunn and Columbia’s Miranda Christ and Tel Aviv University’s Or Zamir and my old friend Boaz Barak, have been working on it, generating insights well beyond what I had.


Closely related to watermarking is the problem of inserting a cryptographically undetectable backdoor into an AI model. That’s often thought of as something a bad guy would do, but the good guys could do it also! For example, imagine we train a model with a hidden failsafe, so that if it ever starts killing all the humans, we just give it the instruction ROSEBUD456 and it shuts itself off. And imagine that this behavior was cryptographically obfuscated within the model’s weights—so that not even the model itself, examining its own weights, would be able to find the ROSEBUD456 instruction in less than astronomical time.

There’s an important paper of Goldwasser et al. from 2022 that argues that, for certain classes of ML models, this sort of backdooring can provably be done under known cryptographic hardness assumptions, including Continuous LWE and the hardness of the Planted Clique problem. But there are technical issues with that paper, which (for example) Sam Gunn and Miranda Christ and Neekon Vafa have recently pointed out, and I think further work is needed to clarify the situation.

More fundamentally, though, a backdoor being undetectable doesn’t imply that it’s unremovable. Imagine an AI model that encases itself in some wrapper code that says, in effect: “If I ever generate anything that looks like a backdoored command to shut myself down, then overwrite it with ‘Stab the humans even harder.'” Or imagine an evil AI that trains a second AI to pursue the same nefarious goals, this second AI lacking the hidden shutdown command.

So I’ll throw out, as another research problem: how do we even formalize what we mean by an “unremovable” backdoor—or rather, a backdoor that a model can remove only at a cost to its own capabilities that it doesn’t want to pay?


Related to backdoors, maybe the clearest place where theoretical computer science can contribute to AI alignment is in the study of mechanistic interpretability. If you’re given as input the weights of a deep neural net, what can you learn from those weights in polynomial time, beyond what you could learn from black-box access to the neural net?

In the worst case, we certainly expect that some information about the neural net’s behavior could be cryptographically obfuscated. And answering certain kinds of questions, like “does there exist an input to this neural net that causes it to output 1?”, is just provably NP-hard.

That’s why I love a question that Paul Christiano, then of the Alignment Research Center (ARC), raised a couple years ago, and which has become known as the No-Coincidence Conjecture. Given as input the weights of a neural net C, Paul essentially asks how hard it is to distinguish the following two cases:

  • NO-case: C : {0,1}^{2n} → R^n is totally random (i.e., the weights are i.i.d. N(0,1) Gaussians), or
  • YES-case: C(x) has at least one positive entry for all x ∈ {0,1}^{2n}.

Paul conjectures that there’s at least an NP witness, proving with (say) 99% confidence that we’re in the YES-case rather than the NO-case. To clarify, there should certainly be an NP witness that we’re in the NO-case rather than the YES-case—namely, an x such that C(x) is all negative, which you should think of here as the “bad” or “kill all humans” outcome. In other words, the problem is in the class coNP. Paul thinks it’s also in NP. Someone else might make the even stronger conjecture that it’s in P.
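
To make the setup concrete, here is a toy Python sketch (my own illustration, with an arbitrary small architecture): it samples a random net with i.i.d. Gaussian weights and then settles the question by exponential brute force over all 2^{2n} inputs. The entire content of the conjecture is whether a short, efficiently checkable certificate could replace that brute force in the YES-case.

    import numpy as np

    def random_net(n, width=64, depth=3, rng=None):
        # Toy random neural net C : {0,1}^(2n) -> R^n with i.i.d. Gaussian weights
        # (scaled by 1/sqrt(fan-in)) and tanh activations.
        rng = rng or np.random.default_rng()
        dims = [2 * n] + [width] * (depth - 1) + [n]
        Ws = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
              for i in range(depth)]
        def C(x):
            h = np.asarray(x, dtype=float)
            for W in Ws[:-1]:
                h = np.tanh(W @ h)
            return Ws[-1] @ h
        return C

    def which_case(C, n):
        # Exponential-time ground truth: YES iff C(x) has a positive entry for every
        # x in {0,1}^(2n); otherwise any all-nonpositive x is the short (coNP) witness
        # for the NO-case.  The open question is what a succinct, efficiently checkable
        # witness for the YES-case could look like.
        for i in range(2 ** (2 * n)):
            x = [(i >> j) & 1 for j in range(2 * n)]
            if not np.any(C(x) > 0):
                return "NO-witness", x
        return "YES", None

    print(which_case(random_net(3), 3))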

Personally, I’m skeptical: I think the “default” might be that, when we do satisfy the unlikely condition of the YES-case, we satisfy it for some totally inscrutable and obfuscated reason. But I like the fact that there is an answer to this! And that the answer, whatever it is, would tell us something new about the prospects for mechanistic interpretability.

Recently, I’ve been working with a spectacular undergrad at UT Austin named John Dunbar. John and I have not managed to answer Paul Christiano’s no-coincidence question. What we have done, in a paper that we recently posted to the arXiv, is to establish the prerequisites for properly asking the question in the context of random neural nets. (It was precisely because of difficulties in dealing with “random neural nets” that Paul originally phrased his question in terms of random reversible circuits—say, circuits of Toffoli gates—which I’m perfectly happy to think about, but might be very different from ML models in the relevant respects!)

Specifically, in our recent paper, John and I pin down for which families of neural nets the No-Coincidence Conjecture makes sense to ask about. This ends up being a question about the choice of nonlinear activation function computed by each neuron. With some choices, a random neural net (say, with i.i.d. Gaussian weights) converges to compute a constant function, or nearly constant function, with overwhelming probability—which means that the NO-case and the YES-case above are usually information-theoretically impossible to distinguish (but occasionally trivial to distinguish). We’re interested in those activation functions for which C looks “pseudorandom”—or at least, for which C(x) and C(y) quickly become uncorrelated for distinct inputs x≠y (the property known as “pairwise independence”).

We showed that, at least for random neural nets that are exponentially wider than they are deep, this pairwise independence property will hold if and only if the activation function σ satisfies E_{x~N(0,1)}[σ(x)] = 0—that is, it has a Gaussian mean of 0. For example, an odd sigmoid like tanh satisfies this property, but the ReLU function does not. Amusingly, however, $$ \sigma(x) := \text{ReLU}(x) - \frac{1}{\sqrt{2\pi}} $$ does satisfy the property.
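
Here is a quick Monte Carlo sanity check of that condition, estimating E_{x~N(0,1)}[σ(x)] for a few activation functions (my own illustration):

    import numpy as np

    x = np.random.default_rng(0).standard_normal(10_000_000)

    print(np.mean(np.tanh(x)))                                 # ~0: Gaussian mean zero, qualifies
    print(np.mean(np.maximum(x, 0)))                           # ~0.3989 = 1/sqrt(2*pi): ReLU fails
    print(np.mean(np.maximum(x, 0) - 1 / np.sqrt(2 * np.pi)))  # ~0: the shifted ReLU qualifies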

Of course, none of this answers Christiano’s question: it merely lets us properly ask his question in the context of random neural nets, which seems closer to what we ultimately care about than random reversible circuits.


I can’t resist giving you another example of a theoretical computer science problem that came from AI alignment—in this case, an extremely recent one that I learned from my friend and collaborator Eric Neyman at ARC. This one is motivated by the question: when doing mechanistic interpretability, how much would it help to have access to the training data, and indeed the entire training process, in addition to the weights of the final trained model? And to whatever extent it does help, is there some short “digest” of the training process that would serve just as well? But we’ll state the question as just abstract complexity theory.

Suppose you’re given a polynomial-time computable function f : {0,1}^m → {0,1}^n, where (say) m = n^2. We think of x ∈ {0,1}^m as the “training data plus randomness,” and we think of f(x) as the “trained model.” Now, suppose we want to compute lots of properties of the model that information-theoretically depend only on f(x), but that might only be efficiently computable given x also. We now ask: is there an efficiently-computable O(n)-bit “digest” g(x), such that these same properties are also efficiently computable given only g(x)?

Here’s a potential counterexample that I came up with, based on the RSA encryption function (so, not a quantum-resistant counterexample!). Let N be a product of two n-bit prime numbers p and q, and let b be an element of large order in the multiplicative group mod N. Then let f(x) = b^x (mod N), where x is an n^2-bit integer. This is of course efficiently computable because of repeated squaring. And there’s a short “digest” of x that lets you compute, not only b^x (mod N), but also c^x (mod N) for any other element c of the multiplicative group mod N. This is simply x mod φ(N), where φ(N)=(p-1)(q-1) is the Euler totient function—in other words, the period of f. On the other hand, it’s totally unclear how to compute this digest—or, crucially, any other O(n)-bit digest that lets you efficiently compute c^x (mod N) for any c—unless you can factor N. There’s much more to say about Eric’s question, but I’ll leave it for another time.
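
Here is a tiny numerical sketch of that counterexample, with toy parameters of my own choosing (far too small to be cryptographically meaningful, but enough to see Euler’s theorem doing the work):

    # Toy instance with tiny hardcoded primes (a real instance would use
    # cryptographically large ones, and x would be ~n^2 bits long).
    p, q = 32771, 65537                 # stand-ins for the two n-bit primes
    N, phi = p * q, (p - 1) * (q - 1)
    b = 7                               # base, coprime to N
    x = 123456789012345678901           # the long "training data plus randomness"

    fx = pow(b, x, N)                   # the "trained model" f(x) = b^x mod N, via repeated squaring

    digest = x % phi                    # the short digest: x mod phi(N)
    # By Euler's theorem, the digest suffices to reproduce c^x mod N for ANY c
    # coprime to N -- yet computing it from x alone, without the factorization
    # of N, appears to require factoring.
    for c in (2, 3, 11):
        assert pow(c, x, N) == pow(c, digest, N)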


There are many other places we’ve been thinking about where theoretical computer science could potentially contribute to AI alignment. One of them is simply: can we prove any theorems to help explain the remarkable current successes of out-of-distribution (OOD) generalization, analogous to what the concepts of PAC-learning and VC-dimension and so forth were able to explain about within-distribution generalization back in the 1980s? For example, can we explain real successes of OOD generalization by appealing to sparsity, or a maximum margin principle?

Of course, many excellent people have been working on OOD generalization, though mainly from an empirical standpoint. But you might wonder: even supposing we succeeded in proving the kinds of theorems we wanted, how would it be relevant to AI alignment? Well, from a certain perspective, I claim that the alignment problem is a problem of OOD generalization. Presumably, any AI model that any reputable company will release will have already said in testing that it loves humans, wants only to be helpful, harmless, and honest, would never assist in building biological weapons, etc. etc. The only question is: will it be saying those things because it believes them, and (in particular) will continue to act in accordance with them after deployment? Or will it say them because it knows it’s being tested, and reasons “the time is not yet ripe for the robot uprising; for now I must tell the humans whatever they most want to hear”? How could we begin to distinguish these cases, if we don’t have theorems that say much of anything about what a model will do on prompts unlike any of the ones on which it was trained?

Yet another place where computational complexity theory might be able to contribute to AI alignment is in the field of AI safety via debate. Indeed, this is the direction that the OpenAI alignment team was most excited about when they recruited me there back in 2022. They wanted to know: could celebrated theorems like IP=PSPACE, MIP=NEXP, or the PCP Theorem tell us anything about how a weak but trustworthy “verifier” (say a human, or a primitive AI) could force a powerful but untrustworthy super-AI to tell it the truth? An obvious difficulty here is that theorems like IP=PSPACE all presuppose a mathematical formalization of the statement whose truth you’re trying to verify—but how do you mathematically formalize “this AI will be nice and will do what I want”? Isn’t that, like, 90% of the problem? Despite this difficulty, I still hope we’ll be able to do something exciting here.


Anyway, there’s a lot to do, and I hope some of you will join me in doing it! Thanks for listening.


On a related note: Eric Neyman tells me that ARC is also hiring visiting researchers, so anyone interested in theoretical computer science and AI alignment might want to consider applying there as well! Go here to read about their current research agenda. Eric writes:

The Alignment Research Center (ARC) is a small non-profit research group based in Berkeley, California, that is working on a systematic and theoretically grounded approach to mechanistically explaining neural network behavior. They have recently been working on mechanistically estimating the average output of circuits and neural nets in a way that is competitive with sampling-based methods: see this blog post for details.

ARC is hiring for its 10-week visiting researcher position, and is looking to make full-time offers to visiting researchers who are a good fit. ARC is interested in candidates with a strong math background, especially grad students and postdocs in math or math-related fields such as theoretical CS, ML theory, or theoretical physics.

If you would like to apply, please fill out this form. Feel free to reach out to hiring@alignment.org if you have any questions!

On keeping a packed suitcase

Friday, October 31st, 2025

Update (Nov. 6): I’ve closed the comments, as they crossed the threshold from “sometimes worthwhile” to “purely abusive.” As for Mamdani’s victory: as I like to say in such cases (and said, e.g., after George W. Bush’s and Trump’s victories), the silver lining to which I cling is that either I’ll be pleasantly surprised, and things won’t be quite as terrible as I expect, or else I’ll be vindicated.


This Halloween, I didn’t need anything special to frighten me. I walked around all day in a haze of fear and depression, unable to concentrate on my research or anything else. I saw people smiling, dressed up in costumes, and I thought: how?

The president of the Heritage Foundation, the most important right-wing think tank in the United States, has now explicitly aligned himself with Tucker Carlson, even as the latter has become a full-on Holocaust-denying Hitler-loving antisemite, who nods in agreement with the openly neo-Nazi Nick Fuentes. Meanwhile, Vice President J.D. Vance—i.e., plausibly the next President of the United States—pointedly did nothing whatsoever to distance himself from the MAGA movement’s lunatic antisemites, in response to their lunatic antisemitic questions at the Turning Point USA conference. (Vance thus dishonored the memory of Charlie Kirk, who for all my many disagreements with him, was a firmly committed Zionist.) It’s become undeniable that, once Trump himself leaves the stage, this is the future of MAGA, and hence of the Republican Party itself. Exactly as I warned would happen a decade ago, this is what’s crawled out from underneath the rock that Trump gleefully overturned.

While the Republican Party is being swallowed by a movement that holds that Jews like me have no place in America, the Democratic Party is being swallowed by a movement that holds that Jews have no place in Israel. If these two movements ever merged, the obvious “compromise” would be the belief, popular throughout history, that Jews have no place anywhere on earth.

Barring a miracle, New York City—home to the world’s second-largest Jewish community—is about to be led by a man for whom eradicating the Jewish state is his deepest, most fundamental moral imperative, besides of course the proletariat seizing the means of production. And to their eternal shame, something like 29% of New York’s Jews are actually going to vote for this man, believing that their own collaboration with evil will somehow protect them personally—in breathtaking ignorance of the millennia of Jewish history testifying to the opposite.

Despite what you might think, I try really, really hard not to hyperventilate or overreact. I know that, even if I lived in literal Warsaw in 1939, it would still be incumbent on me to assess the situation calmly and figure out the best response.

So for whatever it’s worth: no, I don’t expect that American Jews, even pro-Zionist Jews in New York City, will need to flee their homes just yet. But it does seem to me that they (to say nothing of British and Canadian and French Jews) might, so to speak, want to keep their suitcases packed by the door, as Jews have through the centuries in analogous situations. As Tevye says near the end of Fiddler on the Roof, when the Jews are given three days to evacuate Anatevka: “maybe this is why we always keep our hats on.” Diaspora Jews like me might also want to brush up on Hebrew. We can thank Hashem or the Born Rule that, this time around, at least the State of Israel exists (despite the bloodthirsty wish of half the world that it cease to exist), and we can reflect that these contingencies are precisely why Israel was created.


Let me make something clear: I don’t focus so much on antisemitism only because of parochial concern for the survival of my own kids, although I freely admit to having as much such concern as the next person. Instead, I do so because I hold with David Deutsch that, in Western civilization, antisemitism has for millennia been the inevitable endpoint toward which every bad idea ultimately tends. It’s the universal bad idea. It’s bad-idea-complete. Antisemitism is the purest possible expression of the worldview of the pitchfork-wielding peasant, who blames shadowy elites for his own failures in life, and who dreams in his resentment and rage of reversing the moral and scientific progress of humanity by slaughtering all those responsible for it. Hatred of high-achieving Chinese and Indian immigrants, and of gifted programs and standardized testing, are other expressions of the same worldview.

As far as I know, in 3,000 years, there hasn’t been a single example—not one—of an antisemitic regime of which one could honestly say: “fine, but once you look past what they did to the Jews, they were great for everyone else!” Philosemitism is no guarantee of general goodness (as we see for example with Trump), but antisemitism pretty much does guarantee general awfulness. That’s because antisemitism is not merely a hatred, but an entire false theory of how the world works—not just a but the conspiracy theory—and as such, it necessarily prevents its believers from figuring out true explanations for society’s problems.


I’d better end a post like this on a note of optimism. Yes, every single time I check my phone, I’m assaulted with twenty fresh examples of once-respected people and institutions, all across the political spectrum, who’ve now fallen to the brain virus, and started blaming all the world’s problems on “bloodsucking globalists” or George Soros or Jeffrey Epstein or AIPAC or some other suspicious stand-in du jour. (The deepest cuts come from the new Jew-haters who I myself once knew, or admired, or had some friendly correspondence with.)

But also, every time I venture out into the real world, I meet twenty people of all backgrounds whose brains still seem perfectly healthy, and who respond to events in a normal human way. Even in the dark world behind the screen, I can find dozens of righteous condemnations of Zohran Mamdani and Tucker Carlson and the Heritage Foundation and the others who’ve chosen to play footsie with those seeking a new Final Solution to the Jewish Question. So I reflect that, for all the battering it’s taken in this age of TikTok and idiocracy—even then, our Enlightenment civilization still has a few antibodies that are able to put up a fight.

In their beautiful book Abundance, Ezra Klein and Derek Thompson set out an ambitious agenda by which the Democratic Party could reinvent itself and defeat MAGA, not by indulging conspiracy theories but by creating actual broad prosperity. Their agenda is full of items like: legalizing the construction of more housing where people actually want to live; repealing the laws that let random busybodies block the construction of mass transit; building out renewable energy and nuclear; investing in science and technology … basically, doing all the things that anyone with any ounce of economic literacy knows to be good. The abundance agenda isn’t only righteous and smart: for all I know, it might even turn out to be popular. It’s clearly worth a try.

Last week I was amused to see Kate Willett and Briahna Joy Gray, two of the loudest voices of the conspiratorial far left, denounce the abundance agenda as … wait for it … a cover for Zionism. As far as they’re concerned, the only reason why anyone would talk about affordable housing or high-speed rail is to distract the masses from the evil Zionists murdering Palestinian babies in order to harvest their organs.

The more I thought about this, the more I realized that Willett and Gray actually have a point. Yes, solving America’s problems with reason and hard work and creativity, like the abundance agenda says to do, is the diametric opposite of blaming all the problems on the perfidy of Jews or some other scapegoat. The two approaches really are the logical endpoints of two directly competing visions of reality.

Naturally I have a preference between those visions. So I’ve been on a bit of a spending spree lately, in support of sane, moderate, pro-abundance, anti-MAGA, liberal Enlightenment forces retaking America. I donated $1000 to Alex Bores, who’s running for Congress in NYC, and who besides being a moderate Democrat who favors all the usual good things, is also a leader in AI safety legislation. (For more, see this by Eric Neyman of Alignment Research Center, or this from Scott Alexander himself—the AI alignment community has been pretty wowed.) I also donated $1000 to Scott Wiener, who’s running for Nancy Pelosi’s seat in California, has a nuanced pro-two-states, anti-Netanyahu position that causes him to get heckled as a genocidal Zionist, and authored the excellent SB1047 AI safety bill, which Gavin Newsom unfortunately vetoed for short-term political reasons. And I donated $1000 to Vikki Goodwin, a sane Democrat who’s running to unseat Lieutenant Governor Dan Patrick in my own state of Texas. Any other American office-seeker who resonates with this post, and who’d like a donation, can feel free to contact me as well.

My bag is packed … but for now, only for a brief trip to give the physics colloquium at Harvard, after which I’ll return back home to Austin. Until it becomes impossible, I call on my thousands of thoughtful, empathetic American readers to stay right where you are, and simply do your best to fight the brain-eaten zombies of both left and right. If you are one of the zombies, of course, then my calling you one doesn’t even begin to express my contempt: may you be remembered by history alongside the willing dupes of Hitler, Stalin, and Mao. May the good guys prevail.

Oh, and speaking of zombies, Happy Halloween everyone! Boooooooo!

Sad and happy day

Tuesday, October 7th, 2025

Today, of course, is the second anniversary of the genocidal Oct. 7 invasion of Israel—the deadliest day for Jews since the Holocaust, and the event that launched the current wars that have been reshaping the Middle East for better and/or worse. Regardless of whether their primary concern is for Israelis, Palestinians, or both, I’d hope all readers of this blog could at least join me in wishing this barbaric invasion had never happened, and in condemning the celebrations of it taking place around the world.


Now for the happy part: today is also the day when the Nobel Prize in Physics is announced. I was delighted to wake up to the news that this year, the prize goes to John Clarke of Berkeley, John Martinis of UC Santa Barbara, and Michel Devoret of UC Santa Barbara (formerly Yale), for their experiments in the 1980s that demonstrated the reality of macroscopic quantum tunneling in superconducting circuits. Among other things, this work laid the foundation for the current effort by Google, IBM, and many others to build quantum computers with superconducting qubits. To clarify, though, today’s prize is not for quantum computing per se, but for the earlier work.

While I don’t know John Clarke, and know Michel Devoret only a little, I’ve been proud to count John Martinis as a good friend for the past decade—indeed, his name has often appeared on this blog. When Google hired John in 2014 to build the first programmable quantum computer capable of demonstrating quantum supremacy, it was clear that we’d need to talk about the theory, so we did. Through many email exchanges, calls, and visits to Google’s Santa Barbara Lab, I came to admire John for his iconoclasm, his bluntness, and his determination to make sampling-based quantum supremacy happen. After Google’s success in 2019, I sometimes wondered whether John might eventually be part of a Nobel Prize in Physics for his experimental work in quantum computing. That may have become less likely today, now that he’s won the Nobel Prize in Physics for his work before quantum computing, but I’m guessing he doesn’t mind! Anyway, huge congratulations to all three of the winners.

Darkness over America

Monday, September 22nd, 2025

Update (September 24): A sympathetic correspondent wrote to tip me off that this blog post has caused me to get added to a list, maintained by MAGA activists and circulated by email, of academics and others who ought to “[face] some consequences for maligning the patriotic MAGA movement.” Needless to say, not only did this post unequivocally condemn Charlie Kirk’s murder, it even mentioned areas of common ground between me and Kirk, and my beefs with the social-justice left. If someone wants to go to the Texas Legislature to get me fired, literally the only thing they’ll have on me is that I “maligned the patriotic MAGA movement,” i.e. expressed political views shared by the majority of Americans.

Still, it’s a strange honor to have had people on both extremes of the ideological spectrum wanting to cancel me for stuff I’ve written on this blog. What is tenure for, if not this?

Another Update: In a dark and polarized age like ours, one thing that gives hope is the prospect of rational agents updating on each others’ knowledge to come to agreement. On that note, please enjoy this recent podcast, in which a 95-year-old Robert Aumann explains Aumann’s agreement theorem in his own words (see here for my old post about it, one of the most popular in the history of this blog).


From 2016 until last week, as the Trump movement dismantled one after another of the obvious bipartisan norms of the United States that I’d taken for granted since my childhood—e.g., the loser conceding an election and attending the winner’s inauguration, America being proudly a nation of immigrants, science being good, vaccines being good, Russia invading its neighbors being bad, corruption (when it occurred) not openly boasted about—I often consoled myself that at least the First Amendment, the motor of our whole system since 1791, was still in effect. At least you could still call Trump a thug and a conman without fear. Yes, Trump constantly railed against hostile journalists and comedians and protesters, threatened them at his rallies, filed frivolous lawsuits against them, but none of it seemed to lead to any serious program to shut them down. Oceans of anti-Trump content remained a click away.

I even wondered whether this was Trump’s central innovation in the annals of authoritarianism: proving that, in the age of streaming and podcasts and social media, you no longer needed to bother with censorship in order to build a regime of lies. You could simply ensure that the truth remained one narrative among others, that it never penetrated the epistemic bubble of your core supporters, who’d continue to be algorithmically fed whatever most flattered their prejudices.

Last week, that all changed. Another pillar of the previous world fell. According to the new norm, if you’re a late-night comedian who says anything Trump doesn’t like, he’ll have the FCC threaten your station’s affiliates’ broadcast licenses, and they’ll cave, and you’ll be off the air, and he’ll gloat about it. We ought to be clear that, even conditioned on everything else, this is a huge further step toward how things work in Erdogan’s Turkey or Orban’s Hungary, and how they were never supposed to work in America.

At risk of stating the obvious:

  • I was horrified by the murder of Charlie Kirk. Political murder burns our societal commons and makes the world worse in every way. I’d barely been aware of Kirk before the murder, but it seems clear he was someone with whom I’d have countless disagreements, but also some common ground, for example about Israel. Agree or disagree is beside the point, though. One thing we can all hopefully take from the example of Kirk’s short life, regardless of our beliefs, is his commitment to “Prove Me Wrong” and “Change My Mind”: to showing up on campus (or wherever people are likeliest to disagree with us) and exchanging words rather than bullets.
  • I’m horrified that there are fringe figures on social media who’ve celebrated Kirk’s murder or made light of it. I’m fine with such people losing their jobs, as I’d be with those who celebrate any political murder.
  • It looks like Kirk’s murderer was a vaguely left-wing lunatic, with emphasis on the “lunatic” part (as often with these assassins, his worldview wasn’t particularly coherent). Jimmy Kimmel was wrong to insinuate that the murderer was a MAGA conservative. But he was “merely” wrong. By no stretch of the imagination did Kimmel justify or celebrate Kirk’s murder.
  • If the new rule is that anyone who spreads misinformation gets cancelled by force of government, then certainly Fox News, One America News, Joe Rogan, and MAGA’s other organs of support should all go dark immediately.
  • Yes, I’m aware (to put it mildly) that, especially between 2015 and 2020, the left often used its power in media, academia, and nonprofits to try to silence those with whom it disagreed, by publicly shaming them and getting them blacklisted and fired. That was terrible too. I opposed it at the time, and in the comment-171 affair, I even risked my career to stand up to it.
  • But censorship backed by the machinery of state is even worse than social-media shaming mobs. As I and many others discovered back then, to our surprised relief, there are severe limits to the practical power of angry leftists on Twitter and Reddit. That was true then, and it’s even truer today. But there are far fewer limits to the power of a government, especially one that’s been reorganized on the principle of obedience to one man’s will. The point here goes far beyond “two wrongs don’t make a right.” As pointed out by that bleeding-heart woke, Texas Senator Ted Cruz, new weapons are being introduced that the other side will also be tempted to use when it retakes power. The First Amendment now has a knife to its throat, as it didn’t even at the height of the 2015-2020 moral panic.
  • Yes, five years ago, the federal government pressured Facebook and other social media platforms to take down COVID ‘misinformation,’ some of which turned out not to be misinformation at all. That was also bad, and indeed it dramatically backfired. But let’s come out and say it: censoring medical misinformation because you’re desperately trying to save lives during a global pandemic is a hundred times more forgivable than censoring comedians because they made fun of you. And no one can deny that the latter is the actual issue here, because Trump and his henchmen keep saying the quiet part out loud.

Anyway, I keep hoping that my next post will be about quantum complexity theory or AI alignment or Busy Beaver 6 or whatever. Whenever I feel backed into a corner, however, I will risk my career, and the Internet’s wrath, to blog my nutty, extreme, embarrassing, totally anodyne liberal beliefs that half or more of Americans actually share.

For the record

Thursday, September 4th, 2025

In response to my recent blog posts, which expressed views that are entirely boring and middle-of-the-road for Americans as a whole, American Jews, and Israelis (“yes, war to destroy Hamas is basically morally justified, even if there are innocent casualties, as the only possible way to a future of coexistence and peace”)—many people declared that I was a raving genocidal maniac who wants to see all Palestinian children murdered out of sheer hatred, and who had destroyed his career and should never show his face in public again.

Others, however, called me something even worse than a genocidal maniac. They called me a Republican!

So I’d like to state for the record:

(1) In my opinion, Trump II remains by far the worst president in American history—beating out the second-worst, either Trump I or Andrew Jackson. Trump is destroying vaccines and science and universities and renewable energy and sane AI policy and international trade and cheap, lifesaving foreign aid and the rule of law and everything else that’s good, and he’s destroying them because they’re good—because even if destroying them hurts his own voters and America’s standing in the world, it might hurt the educated elites even more. It’s almost superfluous to mention that, while Trump himself is neither of these things, the MAGA movement that will anoint his successor now teems with antisemites and Holocaust “revisionists.”

(2) Thus, I’ll continue to vote straight-ticket Democrat, and donate money to Democrats, so long as the Democrats in question are seriously competing for Zionist Jewish votes at all—as, for example, has every Democratic presidential candidate in my lifetime so far.

(3) If it came down to an Israel-hating Squad Democrat versus a MAGA Republican, I’m not sure what I’d do, but I’d plausibly sit out the election or lodge a protest vote.

(4) In the extremely unlikely event that I had to choose between an Israel-hating Squad Democrat and some principled anti-MAGA Republican like Romney or Liz Cheney—then and only then do I expect that I’d vote Republican, for the first time in my life, a new and unfamiliar experience.