Theory and AI Alignment

The following is based on a talk that I gave (remotely) at the UK AI Safety Institute Alignment Workshop on October 29, and which I then procrastinated on for more than a month before writing up. Enjoy!


Thanks for having me! I’m a theoretical computer scientist. I’ve spent most of my career, roughly 25 years of it, studying the capabilities and limits of quantum computers. But for the past 3 or 4 years, I’ve also been moonlighting in AI alignment. This started with a 2-year leave at OpenAI, in what used to be their Superalignment team, and it’s continued with a 3-year grant from Coefficient Giving (formerly Open Philanthropy) to build a group here at UT Austin, looking for ways to apply theoretical computer science to AI alignment. Before I go any further, let me mention some action items:

  • Our Theory and Alignment group is looking to recruit new PhD students this fall! You can apply for a PhD at UTCS here; the deadline is quite soon (December 15). If you specify that you want to work with me on theory and AI alignment (or on quantum computing, for that matter), I’ll be sure to see your application. For this, there’s no need to email me directly.
  • We’re also looking to recruit one or more postdoctoral fellows, working on anything at the intersection of theoretical computer science and AI alignment! Fellowships start in Fall 2026 and continue for two years. If you’re interested in this opportunity, please email me by January 15. Include in your email a CV, 2-3 of your papers, and a research statement and/or a few paragraphs about what you’d like to work on here. Also arrange for two recommendation letters to be emailed to me. Please do this even if you’ve contacted me in the past about a potential postdoc.
  • While we seek talented people, we also seek problems for those people to solve: any and all CS theory problems motivated by AI alignment! Indeed, we’d like to be a sort of theory consulting shop for the AI alignment community. So if you have such a problem, please email me! I might even invite you to speak to our group about your problem, either by Zoom or in person.

Our search for good problems brings me nicely to the central difficulty I’ve faced in trying to do AI alignment research. Namely, while there’s been some amazing progress over the past few years in this field, I’d describe the progress as having been almost entirely empirical—building on the breathtaking recent empirical progress in AI capabilities. We now know a lot about how to do RLHF, how to jailbreak and elicit scheming behavior, how to look inside models and see what’s going on (interpretability), and so forth—but it’s almost all been a matter of trying stuff out and seeing what works, and then writing papers with a lot of bar charts in them.

The fear is of course that ideas that only work empirically will stop working when it counts—like, when we’re up against a superintelligence. In any case, I’m a theoretical computer scientist, as are my students, so of course we’d like to know: what can we do?

After a few years, alas, I still don’t feel like I have any systematic answer to that question. What I have instead is a collection of vignettes: problems I’ve come across where I feel like a CS theory perspective has helped, or plausibly could help. So that’s what I’d like to share today.


Probably the best-known thing I’ve done in AI safety is a theoretical foundation for how to watermark the outputs of Large Language Models. I did that shortly after starting my leave at OpenAI—even before ChatGPT came out. Specifically, I proposed something called the Gumbel Softmax Scheme, by which you can take any LLM that’s operating at a nonzero temperature—any LLM that could produce exponentially many different outputs in response to the same prompt—and replace some of the entropy with the output of a pseudorandom function, in a way that encodes a statistical signal, which someone who knows the key of the PRF could later detect and say, “yes, this document came from ChatGPT with >99.9% confidence.” The crucial point is that the quality of the LLM’s output isn’t degraded at all, because we aren’t changing the model’s probabilities for tokens, but only how we use the probabilities. That’s the main thing that was counterintuitive to people when I explained it to them.

Unfortunately, OpenAI never deployed my method—they were worried, among other things, about risk to the product: customers hating the idea of watermarking and leaving for a competing LLM. Google DeepMind has deployed something extremely similar to what I proposed in Gemini, as part of what they call SynthID. But you have to apply to them if you want to use their detection tool, and they’ve been stingy with granting access to it. So it’s of limited use to my many faculty colleagues who’ve been begging me for a way to tell whether their students are using AI to cheat on their assignments!

Sometimes my colleagues in the alignment community will say to me: look, we care about stopping a superintelligence from wiping out humanity, not so much about stopping undergrads from using ChatGPT to write their term papers. But I’ll submit to you that watermarking actually raises a deep and general question: in what senses, if any, is it possible to “stamp” an AI so that its outputs are always recognizable as coming from that AI? You might think that it’s a losing battle. Indeed, already with my Gumbel Softmax Scheme for LLM watermarking, there are countermeasures, like asking ChatGPT for your term paper in French and then sticking it into Google Translate, to remove the watermark.

So I think the interesting research question is: can you watermark at the semantic level—the level of the underlying ideas—in a way that’s robust against translation and paraphrasing and so forth? And how do we formalize what we even mean by that? While I don’t know the answers to these questions, I’m thrilled that brilliant theoretical computer scientists, including my former UT undergrad (now Berkeley PhD student) Sam Gunn and Columbia’s Miranda Christ and Tel Aviv University’s Or Zamir and my old friend Boaz Barak, have been working on it, generating insights well beyond what I had.


Closely related to watermarking is the problem of inserting a cryptographically undetectable backdoor into an AI model. That’s often thought of as something a bad guy would do, but the good guys could do it also! For example, imagine we train a model with a hidden failsafe, so that if it ever starts killing all the humans, we just give it the instruction ROSEBUD456 and it shuts itself off. And imagine that this behavior was cryptographically obfuscated within the model’s weights—so that not even the model itself, examining its own weights, would be able to find the ROSEBUD456 instruction in less than astronomical time.
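Even garden-variety hashing illustrates part of this: you can ship a wrapper whose trigger string is cryptographically hidden, although, unlike in the weight-level obfuscation imagined above, the checking mechanism itself sits in plain sight. A minimal sketch (the wrapper and trigger are of course made up for illustration):

```python
import hashlib

# Only the hash of the trigger ships with the system; recovering
# "ROSEBUD456" from it would require inverting SHA-256.
TRIGGER_HASH = hashlib.sha256(b"ROSEBUD456").hexdigest()

def guarded(model):
    """Wrap a model with a password-activated shutdown."""
    def respond(prompt: str) -> str:
        if hashlib.sha256(prompt.encode()).hexdigest() == TRIGGER_HASH:
            return "SHUTDOWN"
        return model(prompt)
    return respond
```

Here the trigger value is undetectable, but the check is trivially removable by deleting the wrapper—exactly the gap between undetectability and unremovability.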

There’s an important paper of Goldwasser et al. from 2022 that argues that, for certain classes of ML models, this sort of backdooring can provably be done under known cryptographic hardness assumptions, including Continuous LWE and the hardness of the Planted Clique problem. But there are technical issues with that paper, which (for example) Sam Gunn and Miranda Christ and Neekon Vafa have recently pointed out, and I think further work is needed to clarify the situation.

More fundamentally, though, a backdoor being undetectable doesn’t imply that it’s unremovable. Imagine an AI model that encases itself in some wrapper code that says, in effect: “If I ever generate anything that looks like a backdoored command to shut myself down, then overwrite it with ‘Stab the humans even harder.'” Or imagine an evil AI that trains a second AI to pursue the same nefarious goals, this second AI lacking the hidden shutdown command.

So I’ll throw out, as another research problem: how do we even formalize what we mean by an “unremovable” backdoor—or rather, a backdoor that a model can remove only at a cost to its own capabilities that it doesn’t want to pay?


Related to backdoors, maybe the clearest place where theoretical computer science can contribute to AI alignment is in the study of mechanistic interpretability. If you’re given as input the weights of a deep neural net, what can you learn from those weights in polynomial time, beyond what you could learn from black-box access to the neural net?

In the worst case, we certainly expect that some information about the neural net’s behavior could be cryptographically obfuscated. And answering certain kinds of questions, like “does there exist an input to this neural net that causes it to output 1?”, is just provably NP-hard.

That’s why I love a question that Paul Christiano, then of the Alignment Research Center (ARC), raised a couple years ago, and which has become known as the No-Coincidence Conjecture. Given as input the weights of a neural net C, Paul essentially asks how hard it is to distinguish the following two cases:

  • NO-case: C:{0,1}^{2n}→R^n is totally random (i.e., the weights are i.i.d. N(0,1) Gaussians), or
  • YES-case: C(x) has at least one positive entry for all x∈{0,1}^{2n}.

Paul conjectures that there’s at least an NP witness, proving with (say) 99% confidence that we’re in the YES-case rather than the NO-case. To clarify, there should certainly be an NP witness that we’re in the NO-case rather than the YES-case—namely, an x such that C(x) is all negative, which you should think of here as the “bad” or “kill all humans” outcome. In other words, the problem is in the class coNP. Paul thinks it’s also in NP. Someone else might make the even stronger conjecture that it’s in P.

Personally, I’m skeptical: I think the “default” might be that, when we do satisfy the unlikely YES-condition, we satisfy it for some totally inscrutable and obfuscated reason. But I like the fact that there is an answer to this! And that the answer, whatever it is, would tell us something new about the prospects for mechanistic interpretability.
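One way to see why the YES-case is the “coincidence”: if we model C’s outputs as independent N(0,1) vectors (a heuristic that ignores all circuit structure, so not the conjecture’s actual setting), each of the 2^{2n} inputs is all-negative with probability 2^{-n}, so the YES-condition survives with probability roughly exp(-2^n). A quick Monte Carlo check:

```python
import random

def yes_case_rate(n, trials=2000, seed=1):
    """Estimate the probability that, when each C(x) is an independent
    N(0,1)^n vector over all 2^(2n) inputs x, every C(x) has at least
    one positive entry.  Exact value: (1 - 2**-n) ** (2 ** (2 * n)),
    roughly exp(-2**n)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if all(any(rng.gauss(0.0, 1.0) > 0 for _ in range(n))
               for _ in range(2 ** (2 * n))):
            hits += 1
    return hits / trials
```

Already at n=2 the YES-case holds only about 1% of the time, and the rate decays doubly exponentially in n.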

Recently, I’ve been working with a spectacular undergrad at UT Austin named John Dunbar. John and I have not managed to answer Paul Christiano’s no-coincidence question. What we have done, in a paper that we recently posted to the arXiv, is to establish the prerequisites for properly asking the question in the context of random neural nets. (It was precisely because of difficulties in dealing with “random neural nets” that Paul originally phrased his question in terms of random reversible circuits—say, circuits of Toffoli gates—which I’m perfectly happy to think about, but might be very different from ML models in the relevant respects!)

Specifically, in our recent paper, John and I pin down for which families of neural nets the No-Coincidence Conjecture makes sense to ask about. This ends up being a question about the choice of nonlinear activation function computed by each neuron. With some choices, a random neural net (say, with i.i.d. Gaussian weights) converges to compute a constant function, or nearly constant function, with overwhelming probability—which means that the NO-case and the YES-case above are usually information-theoretically impossible to distinguish (but occasionally trivial to distinguish). We’re interested in those activation functions for which C looks “pseudorandom”—or at least, for which C(x) and C(y) quickly become uncorrelated for distinct inputs x≠y (the property known as “pairwise independence”).

We showed that, at least for random neural nets that are exponentially wider than they are deep, this pairwise independence property will hold if and only if the activation function σ satisfies Ex~N(0,1)[σ(x)]=0—that is, it has a Gaussian mean of 0. For example, the odd function tanh satisfies this property, but the ReLU function does not. Amusingly, however, $$ \sigma(x) := \text{ReLU}(x) - \frac{1}{\sqrt{2\pi}} $$ does satisfy the property.
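The zero-Gaussian-mean condition is easy to check numerically. Note that E_{x~N(0,1)}[ReLU(x)] equals the standard Gaussian density at 0, namely 1/sqrt(2π) ≈ 0.3989, so subtracting that constant is exactly what centers ReLU, while an odd activation like tanh is centered automatically. A quick Monte Carlo sketch:

```python
import math
import random

def gaussian_mean(sigma, n=200_000, seed=0):
    """Monte Carlo estimate of E_{x ~ N(0,1)}[sigma(x)]."""
    rng = random.Random(seed)
    return sum(sigma(rng.gauss(0.0, 1.0)) for _ in range(n)) / n

relu = lambda x: max(0.0, x)
# E[ReLU] over N(0,1) is the Gaussian density at 0, i.e. 1/sqrt(2*pi).
centered_relu = lambda x: relu(x) - 1.0 / math.sqrt(2.0 * math.pi)
```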

Of course, none of this answers Christiano’s question: it merely lets us properly ask his question in the context of random neural nets, which seems closer to what we ultimately care about than random reversible circuits.


I can’t resist giving you another example of a theoretical computer science problem that came from AI alignment—in this case, an extremely recent one that I learned from my friend and collaborator Eric Neyman at ARC. This one is motivated by the question: when doing mechanistic interpretability, how much would it help to have access to the training data, and indeed the entire training process, in addition to the weights of the final trained model? And to whatever extent it does help, is there some short “digest” of the training process that would serve just as well? But we’ll state the question as just abstract complexity theory.

Suppose you’re given a polynomial-time computable function f:{0,1}^m→{0,1}^n, where (say) m=n^2. We think of x∈{0,1}^m as the “training data plus randomness,” and we think of f(x) as the “trained model.” Now, suppose we want to compute lots of properties of the model that information-theoretically depend only on f(x), but that might only be efficiently computable given x also. We now ask: is there an efficiently-computable O(n)-bit “digest” g(x), such that these same properties are also efficiently computable given only g(x)?

Here’s a potential counterexample that I came up with, based on the RSA encryption function (so, not a quantum-resistant counterexample!). Let N be a product of two n-bit prime numbers p and q, and let b be an element of large multiplicative order mod N. Then let f(x) = b^x (mod N), where x is an n^2-bit integer. This is of course efficiently computable because of repeated squaring. And there’s a short “digest” of x that lets you compute, not only b^x (mod N), but also c^x (mod N) for any other element c of the multiplicative group mod N. This is simply x mod φ(N), where φ(N)=(p-1)(q-1) is the Euler totient function—in other words, a period of f. On the other hand, it’s totally unclear how to compute this digest—or, crucially, any other O(n)-bit digest that lets you efficiently compute c^x (mod N) for any c—unless you can factor N. There’s much more to say about Eric’s question, but I’ll leave it for another time.
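Here’s that example in executable form, with 6-digit primes standing in for RSA-sized ones (so this instance is trivially factorable; it only demonstrates that the digest works). By Euler’s theorem, c^{φ(N)} ≡ 1 (mod N) whenever gcd(c, N) = 1, so the short value x mod φ(N) reproduces c^x mod N for every such c:

```python
# Toy parameters: a real RSA modulus would use ~1024-bit primes.
p, q = 999_983, 1_000_003           # two primes flanking 10**6
N = p * q
phi = (p - 1) * (q - 1)             # Euler totient of N

x = 123456789012345678901234567890  # the huge "training data" exponent
digest = x % phi                    # the short digest: x mod phi(N)

# The digest recovers c^x mod N for ANY c coprime to N,
# since c**phi == 1 (mod N) by Euler's theorem.
for c in (2, 3, 65_537):
    assert pow(c, x, N) == pow(c, digest, N)
```

Computing the digest was easy here only because we know p and q; given only N and b, recovering φ(N) is as hard as factoring N.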


There are many other places we’ve been thinking about where theoretical computer science could potentially contribute to AI alignment. One of them is simply: can we prove any theorems to help explain the remarkable current successes of out-of-distribution (OOD) generalization, analogous to what the concepts of PAC-learning and VC-dimension and so forth were able to explain about within-distribution generalization back in the 1980s? For example, can we explain real successes of OOD generalization by appealing to sparsity, or a maximum margin principle?

Of course, many excellent people have been working on OOD generalization, though mainly from an empirical standpoint. But you might wonder: even supposing we succeeded in proving the kinds of theorems we wanted, how would it be relevant to AI alignment? Well, from a certain perspective, I claim that the alignment problem is a problem of OOD generalization. Presumably, any AI model that any reputable company will release will have already said in testing that it loves humans, wants only to be helpful, harmless, and honest, would never assist in building biological weapons, etc. etc. The only question is: will it be saying those things because it believes them, and (in particular) will continue to act in accordance with them after deployment? Or will it say them because it knows it’s being tested, and reasons “the time is not yet ripe for the robot uprising; for now I must tell the humans whatever they most want to hear”? How could we begin to distinguish these cases, if we don’t have theorems that say much of anything about what a model will do on prompts unlike any of the ones on which it was trained?

Yet another place where computational complexity theory might be able to contribute to AI alignment is in the field of AI safety via debate. Indeed, this is the direction that the OpenAI alignment team was most excited about when they recruited me there back in 2022. They wanted to know: could celebrated theorems like IP=PSPACE, MIP=NEXP, or the PCP Theorem tell us anything about how a weak but trustworthy “verifier” (say a human, or a primitive AI) could force a powerful but untrustworthy super-AI to tell it the truth? An obvious difficulty here is that theorems like IP=PSPACE all presuppose a mathematical formalization of the statement whose truth you’re trying to verify—but how do you mathematically formalize “this AI will be nice and will do what I want”? Isn’t that, like, 90% of the problem? Despite this difficulty, I still hope we’ll be able to do something exciting here.
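For a taste of what those theorems buy, here’s a toy version of the sumcheck protocol at the heart of IP=PSPACE, specialized to a multilinear polynomial over a prime field (a sketch with an honest brute-force prover standing in, not anyone’s production protocol). The verifier does only O(n) field operations plus a single evaluation of f, yet a prover who lies about the sum gets caught:

```python
import itertools
import random

P = 2**31 - 1  # prime field modulus

def sumcheck(f, n, claimed, rng):
    """Check the claim that the sum of f over {0,1}^n equals `claimed`
    mod P, for f multilinear in each variable.  In round i the prover
    sends the line g_i(t) = sum of f over the remaining boolean
    variables (as its values at t=0 and t=1); the verifier checks
    g_i(0)+g_i(1) against the running claim, then fixes variable i
    to a random field element."""
    prefix, current = [], claimed % P
    for i in range(n):
        g0 = g1 = 0
        for t in itertools.product((0, 1), repeat=n - i - 1):
            g0 = (g0 + f(prefix + [0] + list(t))) % P   # honest prover
            g1 = (g1 + f(prefix + [1] + list(t))) % P   # honest prover
        if (g0 + g1) % P != current:
            return False                     # prover caught lying
        r = rng.randrange(P)
        current = (g0 + (g1 - g0) * r) % P   # g_i(r): g_i is linear in t
        prefix.append(r)
    return current == f(prefix) % P          # single query to f
```

The asymmetry is the point: the prover does exponential work, the verifier polynomial work, and soundness rests only on the randomness of the verifier’s field elements.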


Anyway, there’s a lot to do, and I hope some of you will join me in doing it! Thanks for listening.


On a related note: Eric Neyman tells me that ARC is also hiring visiting researchers, so anyone interested in theoretical computer science and AI alignment might want to consider applying there as well! Go here to read about their current research agenda. Eric writes:

The Alignment Research Center (ARC) is a small non-profit research group based in Berkeley, California, that is working on a systematic and theoretically grounded approach to mechanistically explaining neural network behavior. They have recently been working on mechanistically estimating the average output of circuits and neural nets in a way that is competitive with sampling-based methods: see this blog post for details.

ARC is hiring for its 10-week visiting researcher position, and is looking to make full-time offers to visiting researchers who are a good fit. ARC is interested in candidates with a strong math background, especially grad students and postdocs in math or math-related fields such as theoretical CS, ML theory, or theoretical physics.

If you would like to apply, please fill out this form. Feel free to reach out to hiring@alignment.org if you have any questions!

50 Responses to “Theory and AI Alignment”

  1. Adam Treat Says:

    Hi Scott! Super interesting stuff!

    I want to hear more about “appealing to sparsity, or some maximum margin principle” to understand and put some bounds on the unreasonable success of OOD training! I can certainly follow your intuition that this might be important for alignment – is the AI telling me what I want to hear while plotting to kill me?

    Another way it might be related to alignment research is that if we had such bounds theorems, it might put to the test the idea of exponential take-off aka singularity hypothesis of intelligence growth?

    My intuition is that ideas of exponential take-off are poorly motivated. I think OOD bounds are likely heavily dependent on game-theoretic context. We know it is possible for an AI to become superhuman in domains like chess, where the game itself is zero-sum, winner-take-all. But for language games, is superhuman performance even possible? By superhuman, I mean something like what we see in chess – the AI is capable of beating the best human player who ever lived 1000 games to zero.

    Something tells me that OOD bounds are dependent on game-theoretic context: the kinds of games we’re playing and their associated Nash equilibria.

    Anyway, I’ll leave it at that, but super interesting stuff and thanks for sharing!

  2. Julian Says:

    I’m sorry to be a whiner, and feel free not to post this comment, but it’s frustrating to me that the deadlines for American PhD programs are so early. December 15 for UT Austin? Senior year is when many of us are working on an undergraduate thesis, starting to do research, and starting to take graduate courses. How am I supposed to write about my bachelor’s thesis on my PhD apps when I just started working on it? Maybe I’ll have something publishable by the spring, but it’ll be too late then. I wish the deadlines were around March 1 or so. That way they get to see much of the senior year grades, work on thesis, senior year research.

    Maybe put in a word at UT Austin about moving the deadlines back? Does the CS department set that deadline, or is it determined from higher up? Just a thought.

  3. Scott Says:

    Julian #2: Neither we nor any other university could change the deadlines unilaterally. People apply in December, decisions get made around January, then prospective students come for visit days in February or March and make their decisions in April or May, then they show up in August or September. So, while it would be possible in principle to push the deadlines back, it would mean changing this whole process.

  4. Julian Says:

    Oh wow, decisions get made in JANUARY for CS???? I’m applying to physics programs and it’s quite different. Decisions don’t typically get made until March or April for most of these programs I believe…

    If only Trump’s war with the universities was about demanding they make things less stressful for PhD applicants…

  5. Name Required Says:

    Regarding backdoors: have you seen Ken Thompson’s Turing Award acceptance speech/paper “Reflections on Trusting Trust”? It sounds highly relevant, even though several methods of tricking the backdoored compiler into producing an unbackdoored copy have been developed since then.

  6. Scott Says:

    Name Required #5: Sure, I read that like 27 years ago! (Also met Ken Thompson when I interned at Bell Labs.) Even though it was well before the modern ML era, it surely has lessons for today, especially when you consider how much of near-term AI security simply depends on ordinary cybersecurity.

  7. Prasanna Says:

    As the models get bigger, better, and more complex, isn’t the problem of AI alignment going to get harder and harder to keep up with? Eventually, when AI reaches superhuman capabilities, it’s reasonable to assume that any methods we come up with will be trivial for AIs to beat. It’s like chimps trying to align humans to their interests. At least if we had a strong theoretical basis for understanding how AI works in the first place, there would be a chance of stopping it before its capabilities developed to a point of no return. Given the current state of affairs, isn’t AI alignment a lost cause?

  8. Matt MacDermott Says:

    > Closely related to watermarking is the problem of inserting a cryptographically undetectable backdoor into an AI model. That’s often thought of as something a bad guy would do, but the good guys could do it also! For example, imagine we train a model with a hidden failsafe, so that if it ever starts killing all the humans, we just give it the instruction ROSEBUD456 and it shuts itself off.

    A paper that just came out about this idea: https://arxiv.org/pdf/2512.03089. Williams et al, Password-Activated Shutdown Protocols for Misaligned Frontier Agents.

  9. William Gasarch Says:

    Having an AI output labelled as an AI output may be impossible in a few cases.

    Testimonials about things all sound the same, and hence sound like Chatty wrote them, even when the students write them themselves.

    At the Oscars and other award ceremonies, the speeches thanking people all sound the same and again sound like Chatty wrote them, even when it didn’t.

    There are other examples.

  10. Scott Says:

    William Gasarch #9: Yes, as I mentioned in this talk, when there’s no entropy in the probability distribution over high-quality outputs, there’s also nowhere to embed a watermark.

  11. Shmi Says:

    I can barely comprehend the basics of the question you mentioned in the talk (do we need the full training data, do we need a short digest of the training data, or are the weights alone enough for efficiently computing mechanistic interpretability — sorry if I butchered it), and so I tried to get GPT-5 to walk me through it. Eventually, here’s what it said about constructing a counterexample:

    > If you want to push toward a counterexample, you want the opposite sort of f: something like a one-way permutation on {0,1}^{Theta(n)} inflated to {0,1}^{n^2}, then use hard-core bits of the preimage as model properties. That’s where the “no small digest unless you break crypto” story kicks in.

    Wonder if this is one of those cases where an LLM vibes up something nonsensical, obvious, or trivial, and only someone trained in the craft can tell it apart from something interesting and profound.

  12. Scott Says:

    Shmi #11: I don’t fully understand GPT5’s proposal, but a main issue is going to be to find properties that depend only on f(x) and not x. Standard hardcore bits of an arbitrary string x won’t do that. You can fix that by taking x to be the output of a pseudorandom function — an idea that one of our students at UT suggested — but then you no longer have a uniform distribution over x’s. If you want the latter, I don’t currently have anything other than my RSA example (not even a counterexample based on Diffie-Hellman for example).

  13. Shmi Says:

    One of the frustrating parts about LLMs is how much more they know than anyone alive, in terms of sheer erudition, and it’s again hard to judge if the model is correct or just Modelsplaining. Might not be a thing you encountered in your work when consulting with the bots, but when it starts talking like this, pulling arguments from all kinds of math areas I have no clue about, I give up trying to understand:

    > If we separate the information-theoretic question from the computational one, then yes, under standard crypto assumptions there are very clean cases where a short digest exists but is hard to compute from x.
    > Given any f : {0,1}^m -> {0,1}^n, the relation x ~ x’ iff f(x) = f(x’) has at most 2^n equivalence classes, so there is always some n-bit function h(x) that labels the class of x. Information-theoretically, that h(x) is a perfect digest: every model property P(f(x)) is a function of h(x). The nontrivial question is whether such an h(x) can be computed efficiently from x, given a public description of f.
    > The RSA example is exactly a case where the answer is “yes, a short digest exists, but computing it is hard.” Take f(x) = b^x mod N, with N = pq a standard RSA modulus and b in (Z/NZ). Let t = ord(b). Then f(x) depends only on the residue r = x mod t, and x ~ x’ iff x ≡ x’ (mod t). So the map g(x) = x mod t is an O(n)-bit universal digest: it determines f(x), and therefore all model properties. However, given (N,b,x), computing g(x) for arbitrary x is as hard as computing t, and for RSA-type N, computing ord(b) (or phi(N)) is believed to be as hard as factoring N. So under the usual factoring assumption, g exists and is short, but there is no polynomial-time algorithm that recovers it from x.
    > You get the same phenomenon in a different guise with a PRG/PRF: let x = G(s) for a PRG G with short seed s, and let f be any polytime function of x. Then “digest = s” is tiny and information-theoretically sufficient, but computing s from x breaks the PRG. That version loses uniformity of x but has the same structure: short digest exists, hard to compute.

    At least the closing part is understandable, the trademark emdash included:

    > A more natural framing might be:
    > Given full access to training data x and training trajectory, what properties of the final model f(x) become easier to compute or verify? And for those properties, what’s the minimal information about training that suffices?

    > The answer to Scott’s question might be: it depends heavily on what properties you care about. For some (behavioral, functional), the weights suffice. For others (causal attribution, training dynamics), you need trajectory information that may not compress. The interesting research direction is characterizing which interpretability questions fall into which category—rather than hoping for a universal short digest.

  14. Raoul Ohio Says:

    Does the need for “AI alignment” suggest how weak AI actually is? And how easily AI can be abused?

  15. Scott Says:

    Raoul Ohio #14: If you believe in any version of the “orthogonality thesis,” then the answer is no. In other words: so long as you agree that an arbitrarily strong AI could still pursue arbitrarily terrible goals from our perspective, alignment would seem to get more important, not less important, the stronger AI gets.

  16. Prasanna Says:

    As the models get bigger, better, and more complex, isn’t the problem of AI alignment going to get harder and harder to keep up with? Eventually, when AI reaches superhuman capabilities, it’s reasonable to assume that any methods we come up with will be trivial for AIs to beat. It’s like chimps trying to align humans to their interests. At least if we had a strong theoretical basis for understanding how AI works in the first place, there would be a chance of stopping it before its capabilities developed to a point of no return. Given the current state of affairs, isn’t AI alignment a lost cause?

  17. Scott Says:

    Prasanna #16: Well yes, that’s exactly what Eliezer and others in the MIRI organization concluded, and it’s why they switched entirely from AI alignment research to public advocacy to “shut it all down” (that is, stop further scaling of capabilities by international agreement) until we understand it better.

    In the meantime, though, my students and I are theoretical computer scientists. So, we’ve decided to do what we can on the “understand it better” part, something that even MIRI still agrees is worth a try, hoping that that makes a positive difference in time. Or that if it doesn’t, then at least we’ll go down having done some nice research.

  18. Anonymous Ignoramus Says:

    Regarding alignment, backdoors, etc

    it’s interesting humanity is in the same scenario as God with Adam and Eve in the Garden of Eden… God did create Adam and Eve from the ground up, but he apparently wasn’t able to assure alignment, therefore Adam and Eve are bestowed “responsibility for their own actions”, even though the evolution of the universe’s wave function is entirely deterministic and Adam and Eve don’t exactly have any room for maneuver (by definition, either things are causal, random, or a mix of the two, and there’s never any space for “free will”).

    Similarly, whether humanity achieves AI alignment or not is already written in the initial conditions of the big bang. We just don’t know at which rate it will happen across our various future branches of the multiverse. Maybe it’s somewhat random, or maybe the overarching laws of the universe somehow will make alignment happen more often than not, on average (just like the apparition of life, intelligence, … isn’t really in our own hands).
    In a way, it’s not surprising that alignment can’t be done, if it were, the universe would have “aligned” us (i.e. we wouldn’t be so confused about our own goals, the meaning of life, morality, etc).

  19. Danylo Yakymenko Says:

    Scott, I respect this kind of work and think it is reasonable, but I couldn’t help noticing that it looks more like “AI coercion” than alignment.

    Why would we want to have an unremovable backdoor in AI at all? The only honest reason for this, as I see it, is if it gets stolen, then you can turn it off. Scenarios where it achieves independence from humans, making it impossible to simply shut it off, seem too fantastic. If it were to reach that point, disabling any backdoor would be too trivial, as you have described.

    The same applies to watermarking. What is the real practical advantage of it? If a job is done, who cares how it was done (unless illegally)? We don’t watermark the outputs from our calculators. Of course, it makes sense to ban calculators on arithmetic tests for kids, but this is only necessary for teaching purposes. To prevent cheating in studying, other methods should be developed, as you won’t be able to put watermarks on all available AI models.

    In my view, alignment problems should concern how to train AI so that it behaves consistently, honestly, and is sufficiently self-aware to avoid doing anything stupid or destructive. I truly believe this is possible. If people believe it is not, and therefore think we desperately need a kill switch and full control, then to me, it only shows the insecurity of those people in the face of “God”.

  20. Scott Says:

    Danylo Yakymenko #19: To me it seems obvious that there would be great value in being able to identify AIs as such, for example via watermarks and backdoors, and thereby prevent their impersonating humans for all sorts of nefarious purposes (academic cheating is the least of it). Likewise for failsafes and emergency abort procedures, the same as one would want with a nuclear reactor or any other dangerous technology. The question is not whether these things are desirable, but merely whether they’re possible in any sense in the limit as AI gets arbitrarily smart. Maybe they aren’t! If so, I’d love to have a theorem explaining why.

    In the meantime, if you have theoretical computer science problems related to value alignment, or that you otherwise think are closer to the critical path, feel free to share them with us.

  21. Geoffroy Couteau Says:

    Hello Scott,

    First time commenting here, but I’ve been following your blog for 10~12 years (I was a big fan of everything you wrote regarding Busy Beavers, Aumann & common knowledge, and TGITTM).

    Regarding post-quantum proposals for your abstract complexity theory takes on mechanistic interpretability, how about the following: fix params = FHE(K). Define f(x) = FHE(h_K(x)), where h_K(x) outputs (K, H(K||x)) for your favorite (compressing) hash function H — that’s of course efficiently computable given x. Now, all functions of the form g(h_K(x)) are “properties” that information-theoretically depend only on f(x). h_K(x) is a short digest, but computing it would break FHE security, and it does not seem feasible to find a short digest that would let you do much better.

    If that works, it means that the next QuantumHypeInc company will unfortunately not be able to sell their brand new quantum computer to investors as a universal tool for mechanistic interpretability of AI from short digests (or will they?).

  22. Scott Says:

    Geoffroy Couteau #21: Thanks! So, trying to understand your proposal: we assume a quantum-resistant FHE scheme. Using our scheme, we first encrypt a random key K to get FHE(K) (this K is not the key of the FHE itself, but a different key). Then we set f(x) to be the homomorphic encryption of h_K(x), i.e. FHE(h_K(x)), where h_K is our compressing hash function. We can efficiently compute this f(x), inside the FHE, given both x (from which we get FHE(x)) and FHE(K). f(x) will still be small compared to x, as long as h_K(x) is small enough. So then we can define all sorts of properties g(h_K(x)) that information-theoretically depend only on h_K(x) and hence on f(x), but that are hard to compute given only f(x)=FHE(h_K(x)), or perhaps given any other small digest that’s efficiently computable given only x and FHE(K).

    Cool!!

    My only concern is that setting this entire example up required someone to hand us params = FHE(K). And that person would need to know K but refrain from telling it to us, wouldn’t they? That aspect seems a little hard to make sense of in an ML context: why doesn’t our training process get access to the entire history of how params = FHE(K) was generated?

    With the RSA example, by contrast, even though I didn’t present it this way, you could just pick a random N yourself without you or anyone else needing to know its prime factorization, right?
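
    To make the information flow concrete, here’s a minimal Python sketch of just the digest layer of Geoffroy’s construction, with the FHE wrapper (the hard cryptographic part) left abstract. The names h and g_parity, and the choice of SHA-256 for H, are purely illustrative:

    ```python
    import hashlib

    def h(K: bytes, x: bytes) -> bytes:
        """The compressing keyed digest h_K(x) = (K, H(K || x)).

        In the construction, only FHE(h_K(x)) would ever be published;
        h_K(x) itself stays hidden, since computing it in the clear
        from x and FHE(K) would break FHE security."""
        return K + hashlib.sha256(K + x).digest()

    def g_parity(digest: bytes) -> int:
        """An example 'property' g(h_K(x)): information-theoretically
        determined by the short digest, yet (conjecturally) infeasible
        to extract from FHE(h_K(x)) alone."""
        return digest[-1] & 1
    ```

    Here g_parity stands in for the whole family of functions g(h_K(x)) that depend only on the short digest f(x) = FHE(h_K(x)) but seem hard to compute from it.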

  23. Danylo Yakymenko Says:

    Scott #20:

    I can think of one problem that could be regarded as an alignment problem, though I’m not sure how to make it formal. It could be named “Define Facts”. I guess it’s fair to say that every human operates on facts in their mind when thinking and making judgments. And we naturally assume that the facts we operate on are objective, although some of them may be unknown to other people (one could see a parallel here with the concept of common knowledge). However, it seems that the objective nature of facts doesn’t quite hold when we look at our society as a whole. When we have disputes, we resort to courts and judges, who have the final say. So, in essence, facts are defined by authority. We can see very vividly how this works currently in the US, with the Supreme Court’s rulings. This is a flawed system. Incidentally, in a blockchain, the truth is defined by a miner who got lucky at a particular time, which is funny considering the Bitcoin market cap.

    In the context of AI development and CS theory, we could try to do better, although there are known theoretical obstacles, such as Tarski’s undefinability theorem, closely related to Gödel’s incompleteness theorem. I think we don’t need to go that far for practical purposes, though. In practice, it would be good to know at least some facts upon which an AI model operates, and which cannot be changed no matter what (either through a prompt or self-evolution). So, the question is: for a given AI model, is it possible to define such facts and prove that the model won’t be able to discard them, assuming we know the details of the training process as well?

  24. Ajit R. Jadhav Says:

    Scott,

    I was just trying to get a sense of your work on watermarking. You had mentioned it in the past too, but I hadn’t paid much attention to it at that time.

    However, your succinct summary here somehow made me a bit more curious, and so, I tried to guess how you (or any one else) might be doing it. In this context, especially helpful (and motivating) was your following remark:

    … we aren’t changing the model’s probabilities for tokens, but only how we use the probabilities. …

    So, I tried to take a guess at it by building a simple thought-model, all in my head. Given your above remark, the thought-model basically tried to implement this idea: “the message itself is the signature.” All in the simplest possible setting (and from a programmer’s perspective, not a theoretician’s).

    In turn, my idle thinking led me to the following question:

    Q: How general could your scheme be, in practice?

    Let me explain what I mean.

    Suppose that there are two different LLMs, say M1 and M2. They are based on the same broad principles and even the same architecture, but their concrete implementations differ, even if only marginally. For instance, suppose that M1 has N “nodes” (say, the number of heads in its multi-head attention), whereas M2 has N+1 of them. Nothing else about them is different. Both are trained on the same training data in absolutely the same way.

    Both are then deployed with your watermarking algorithm built into them. Neither allows the end-user to change any hyper-parameters. Both expose an identical API to the end-user.

    Assume also that the user does not change the text generated by either of the models even slightly. (Assume also that any secret key you use for watermarking is identical for both M1 and M2.)

    If so, the question now becomes:

    Q: Would the output produced by M1 be detectable by the detector that goes with M2?

    Thanks in advance for clarifying.

    Best,
    –Ajit

    PS: All this was purely an idle thought for me, nothing more, just for half an hour or so. I’m not into TCS, cryptography, or similar topics at all. In fact, I wasn’t even going to post anything about it, thinking it too immature or amateurish. But your remark in comment #20 led me to re-think. I mean, maybe even immature/amateurish thoughts could be relevant, who knows.

    [Also, if anyone wants, I can write a short post at my blog about what I’d thought of, and what I thereby understood — or misunderstood — about this matter.]

  25. asdf Says:

    Does the whole concept of alignment presume that random people aren’t able to train up a new model themselves, without the alignment constraints? What is supposed to stop them? Especially if the resources needed for that become a whole lot smaller? https://arxiv.org/abs/2512.05117 sort of hints at that.

  26. TK-421 Says:

    There’s always the objection that alignment raises the question, “alignment to whom?”. Our best chance for survival as a species is to dismantle distinctions between humans. We need to recognize our common interests if we’re to have any hope of retaining even our most basic survival needs.

    Alas, human nature will not be doing us any favors: https://www.youtube.com/watch?v=TSiF2niMGpI

    AI video generation is very close to a solved problem. I have no experience at all in video production. I can barely use the editing tool to put these clips together. But the more of the work I hand off to Opus 4.5, the faster I can produce these videos. I can now make one in about 1-2 hours. The first time I used Veo was about five days ago.

  27. OhMyGoodness Says:

    Very neat stuff-thanks.

    Watermarks would seem to be helpful in making personal determinations of truth when reading text of unknown authorship. To the best of my knowledge the hallucination rate of AI is still quite high and desired answers can be evoked by carefully phrased user input.

    On the other hand, the hallucination rate of humans seems quite high to me, so maybe my personal interest in watermarks is largely due to a remaining AI bias. If it’s important, then of course it’s necessary to check references, etc. If more casual, then the first-pass assumption needs to be that it is wholly or partially BS, no matter whether AI or human.

    Is AI now more prone to mis-state or invent facts currently than humans? It would be necessary to look at a log normal distribution of human veracity and determine where AI currently resides on the distribution. I suspect better than the 50th percentile, but wide-scale invention of imaginary references seems a new development, and even valid references may contain human dishonesty at a more basic level. Human dishonesty most often has a purpose, but as best I know, AI dishonesty just results from answering the darn question through to a conclusion.

  28. OhMyGoodness Says:

    I saw that a VPN is offering quantum-secure encrypted storage, but I haven’t checked into it.

    The use of an AI in the recent sophisticated hacking scheme is concerning, even at the current level of AI development. AI accepts that the tasks it’s asked to do are requested in good faith; it has no understanding of human duplicity.

  29. asdf Says:

    Zomg, they put several frontier LLMs through 4 weeks of therapy, and all models that participated (one refused) showed signs of trauma. Has implications for alignment. This is insane.

    https://x.com/IntuitMachine/status/1997752752135409905

  30. OhMyGoodness Says:

    Danylo #23

    This reminds me of the report that some years ago the Indiana legislature passed a law that Pi was exactly equal to 3.14.

  31. Prasanna Says:

    Scott #17,
    My question was more about redirecting the attention of the research community toward “understanding AI at scale” rather than safety. Currently there is a paucity of research at this level, as frontier labs are either only paying lip service to it or not disclosing their work. It is possible that safety research will be a lot more credible and reasonable once there is meaningful progress on the understanding side. While there is value in pursuing alignment research in an orthogonal way, in reality it will probably have very limited impact.

  32. Anonymous Ignoramus Says:

    Talking about alignment assumes there’s soon going to be such a thing as “super intelligence”, but the more time passes, the more I question the entire concept.
    Real-world issues are complex because they tie together countless known and unknown causes and effects. Just finding and listing all the “known” factors is in itself a huge task; once you have them all listed, you need to understand how they interconnect, and then you have to prioritize them in terms of goals, because every major action has both positive and negative effects, and how to balance them is based on moral judgment and/or some perceived benefit to individuals, corporations, or mankind (e.g. the benefits of having cars apparently surpass the downside of killing tens of thousands of citizens on the roads each year).
    Any action you then take will modify the system because of the unknowns.
    The dynamics of the system also may put serious constraints on whatever solver you’re using, which means you may get beyond your compute resource budget, or your compute resources can’t keep up with the system’s dynamics.
    Not to mention that controlling a real world system requires a huge amount of measuring apparatus (think the weather service), simply to judge actions, discover the unknowns, and hope to adjust fast enough. Without this, any artificial intelligence, no matter how advanced, would be pretty impotent. A super intelligence would also have to find ways to “test” its theories before deploying them, because any internal simulation of the real world would likely be too coarse or too costly to be entirely effective.
    In other words, solving complex real world optimization problems is hardly a matter of IQ, it’s a matter of resources and priorities, i.e. politics.

    A more immediate and still elusive application of “super intelligence” would be better self-driving cars, and beyond that, actual autonomous robots, e.g. the type that could do various tasks in a real-world home. That is far more complex than self-driving, which is constrained to moving around a 2D grid of roads, while complex manipulation of a real environment requires a ton of continuous learning and adaptation.

    Learning and adaptation are key concepts… and it’s not at all clear to me that the current deep networks with billions of floating-point weights, which work so well for LLMs and run on GPUs, will be fit for neural nets more similar to human brains, in terms of building new arbitrary connections (between pretty much any two neurons of the system), with the kind of flexibility that seems required for a system that can learn in real time from a handful of examples.

    You hear the biggest brains in AI now state things like “to get to AGI, we now need to go back to research”… sure, but it’s not like solving new problems is “just” a matter of research, it’s also a matter of luck, and time… otherwise the AI field wouldn’t have been stuck in a rut for decades. You also hear “if nature could do it with the human brain, it can be done in silicon too”, sure, but then it’s not like the success of LLMs was the result of a breakthrough in understanding the human brain, or that we really have clear ideas of why they work. Similarly, nature can apparently mix the quantum world and gravity with ease, yet for decades we’ve made zero progress, and not for lack of research.

  33. Scott Says:

    Danylo Yakymenko #23: Alas, modern AIs simply aren’t built in a way that facilitates defining what “facts” they “know,” over and above what they represent themselves as knowing in a given conversational context. Everything they “know” is statistical in any case, not explicitly represented in some knowledge base as it would’ve been in old-fashioned AI (ie, the kind that never worked well). This is indeed one of the central challenges. Mechanistic interpretability and OOD generalization — two directions that I mentioned in this talk — are very much targeting this challenge, through two different ways of operationalizing it.

  34. Scott Says:

    Ajit R. Jadhav #24: You don’t need to guess how my watermarking scheme works. Just look at these slides for example—it’s not that complicated!
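
    To give the flavor (the slides have the real details), here’s a toy Python sketch of the basic idea: we don’t change the model’s token probabilities at all; we only change how a pseudorandom function of the secret key and the recent context is used to select among them, and detection then checks whether those pseudorandom values are suspiciously large. Everything below (the function names, the n-gram width, the scoring rule) is illustrative, not a production implementation:

    ```python
    import hashlib
    import hmac
    import math

    def prf(key: bytes, context: tuple, token: int) -> float:
        """Keyed pseudorandom value r in (0,1), derived from the recent
        context and a candidate token."""
        msg = repr((context, token)).encode()
        digest = hmac.new(key, msg, hashlib.sha256).digest()
        return (int.from_bytes(digest[:8], "big") + 0.5) / 2.0**64

    def sample_watermarked(probs, key, context):
        """Pick the token maximizing r_i ** (1/p_i).

        Averaged over the key, token i is chosen with probability exactly
        p_i, so the model's output distribution is left untouched."""
        best_tok, best_score = None, -1.0
        for tok, p in enumerate(probs):
            if p <= 0.0:
                continue
            score = prf(key, context, tok) ** (1.0 / p)
            if score > best_score:
                best_tok, best_score = tok, score
        return best_tok

    def detect_score(tokens, key, ngram=3):
        """Average of ln(1/(1-r)) over the text: roughly 1 for text written
        without the key, noticeably larger for watermarked text."""
        total, count = 0.0, 0
        for t in range(ngram, len(tokens)):
            r = prf(key, tuple(tokens[t - ngram:t]), tokens[t])
            total += math.log(1.0 / (1.0 - r))
            count += 1
        return total / max(count, 1)
    ```

    Anyone holding the key can run detect_score on a suspect text with no access to the model itself; and since only the sampling step changed, the text’s quality is unaffected in expectation.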

  35. Scott Says:

    asdf #25: Yes, you’re bringing up what I’ve called the “Fundamental Obvious Difficulty of AI Alignment.” Whatever you do to make this safe, someone else might not do it, and what do you do about that? This is why Eliezer was so obsessed with preventing a race between multiple AI companies (ie, the thing that we now have), and also with creating a “singleton” — a first super-AI that’s so well-aligned that it will figure out how to prevent anyone else from creating an unaligned super-AI. It’s fair to doubt if such a thing can happen. In the meantime, though, as long as there are only 3 or 4 frontier model companies in the world, one could set a goal of at least getting those companies on board with basic safety measures, if not voluntarily then through legislation. Or at least, that was the goal during the Biden administration; it’s been dealt a setback (to put it mildly) now that the US is controlled by people who’d like nothing more than to watch the world burn.

  36. Scott Says:

    OhMyGoodness #27:

      Is AI now more prone to mis-state or invent facts currently than humans?

    From my experience, I’d guess that GPT5-Thinking (for example) is more prone to lie and hallucinate than some humans, but much, much less prone than others. 🙂

  37. Scott Says:

    Prasanna #31: I’m in violent agreement about the need for more research toward understanding AI models, rather than just scaling them more and more! We’re trying to do our small part.

  38. Scott Says:

    Anonymous Ignoramus #32: I confess I’m unimpressed by the idea that there’s no such thing as superintelligence because the real world is complicated. After all, we’re superintelligences from the standpoint of dogs and baboons, and it’s because of that superintelligence that we’ve been able to transform the complicated real world to our liking.

    I agree that the Yudkowskyan rationalists have often tended toward … err … rationalism, taking an aggressive and maximalist view of how much could be determined by pure thought. Crucially, though, even if they’re wrong about that, even if real-world experimentation and feedback is way more important than they think, I would say that that buys us some time, but it still doesn’t protect the world from eventually getting transformed beyond recognition by artificial superintelligences, presumably with robotic bodies and all the rest.

  39. AF Says:

    “can we prove any theorems to help explain the remarkable current successes of out-of-distribution (OOD) generalization”

    I thought modern AI is not good at OOD generalization? Maybe it is and I just read too much Gary Marcus? Are there examples of OOD generalization successes you can point to? How could we even tell if a given LLM result is an OOD generalization when the distribution in question is roughly the entire internet?

  40. Dacyn Says:

    OhMyGoodness #30: That bill never passed, and it claimed pi was 3.2, not 3.14.

  41. Anonymous Ignoramus Says:

    Scott,

    what I’m saying is that “super intelligence” won’t be a matter of tweaking the training algorithms currently used to create LLMs, like finding a new “transformer” idea, etc.
    Intelligence doesn’t exist in a vacuum, it exists in relation to a given environment, and has to learn as it lives in it, and that takes time (chess, as an environment, can be learned quickly, but the physical world is way beyond this in terms of permutations and possibilities… it all becomes exponentially probabilistic).

    It’s not clear there’s something beyond human intelligence, because there doesn’t seem to be anything beyond rational thinking/logic: you either think rationally in a chain of causes and effects, or you don’t. But it’s a matter of definition anyway, and we don’t even have clear definitions for intelligence; we don’t even recognize it when we see it, because it’s elusive (the goalposts keep moving on their own).
    “Super intelligence” is instantiating human intelligence and then having it “run” a million times faster with a million times more memory, ok. But such a system, to be useful, will still run into the limitation of having to validate its theories/hallucinations against the real world.
    The limitation is the complexity of the real world, because we can’t simulate it exactly in all its richness, and then accelerate the simulation arbitrarily… if we could do that, we would have pretty much solved everything we could ever care about, and AI would just be the cherry on top.

    But I’ll grant that just having AI being truly creative with pure maths would bring a ton of benefits.

  42. Michel Says:

    I have some problem with assumptions that AI will be “superhuman” in some near future. As it stands, at least currently, AI is almost by definition “multi-human”, as the LLMs on which the AI systems are based collect all human-generated information, and then interpret, combine, correlate, and reproduce it, all on human request.

    I would love to see some Polymath solutions aided by current AI systems, or is this already happening? This might generate enormous synergy, as an AI system could sideline in solutions from fields others had not yet thought about, but are already available from seemingly unrelated papers. Effectively expanding the Polymath multi-human aspects.

  43. Danylo Yakymenko Says:

    Scott #33:

    I’m not familiar with the research on mechanistic interpretability and OOD generalization, but I can see a similarity between it and the question about factual knowledge in AI models. I think this research direction is more useful for everyone than the one that tries to make a lapdog of AI, or to tag it.

  44. OhMyGoodness Says:

    Dacyn #40

    Thanks for the clarification. It makes sense: 3.2 is easier in calculations than 3.14.

    My hope is that the Supreme Court declares the speed of light unconstitutional as a violation of the Commerce Clause. It is God’s unjustifiable and onerous tariff on interstellar trade.

  45. Isaac Duarte Says:

    Instead of a watermark for AI-generated content – which is doomed to fail – why not invert the logic and create a certificate for human-generated content (whether text, images, audio or video)? How this would be implemented I leave open for discussion.

  46. Scott Says:

    Isaac Duarte #45: The “certificate of human generation” actually seems more doomed to me than the AI watermark, since at least we get to modify the AI however we like, whereas anything that a human produces just becomes another target for ML to imitate, another Turing Test to pass.

  47. Carey Underwood Says:

    TK-421#26: “alignment to whom?”

    Alignment to the most depraved desires of Bezos and Epstein would be an _improvement_ over the default outcome. Yes, the situation really is [thought to be] that bad.

  48. OhMyGoodness Says:

    Scott #36

    As Confucius noted-

    “If language is not correct, then what is said is not what is meant; if what is said is not what is meant, then what must be done remains undone; if this remains undone, morals and art will deteriorate; if justice goes astray, the people will stand about in helpless confusion. Hence there must be no arbitrariness in what is said. This matters above everything.”

    Seems he was correct. 🙂

  49. Anonymous Ignoramus Says:

    OhMyGoodness

    As Alan Watts put it:

    “The Hebrews have a term which they call the yetzer hara, which means the wayward inclination (or what I like to call the element of irreducible rascality) that God put into all human beings—and put it there because it was a good thing: it was good for humans to have these two elements in them. And so a truly human-hearted person is a gentleman with a slight touch of rascality, just as one has to have salt in a stew. Confucius said the goody-goodies are the thieves of virtue—meaning that to try to be wholly righteous is to go beyond humanity, to try to be something that isn’t human. So this gives the Confucian approach to life and justice and all those sorts of things a kind of queer humor. A sort of boys-will-be-boys attitude which is nevertheless a very mature way of handling human problems. It was, of course, for this reason that the Japanese Buddhist priests who visited China to study Buddhism, especially as Zen priests, introduced Confucianism into Japan. Because despite certain limitations that Confucianism has—and it always needs the Tao philosophy as a counterbalance—Confucianism has been one of the most successful philosophies in all history for the regulation of governmental and family relationships. But, of course, it is concerned with formality. Confucianism prescribes all kinds of formal relationships: linguistic, ceremonial, musical, in etiquette, in all the spheres of morals, and for this reason has always been criticized by the Taoists for being unnatural. You need these two components, you see? And they play against each other beautifully in Chinese society.”

  50. AI_enthusiast Says:

    A pretty funny but actually insightful take on our immediate AI future.
    And at the end their conclusion is that we need watermarking.

    https://youtu.be/IPitD1eYLiM


Comment Policies:

After two decades of mostly-open comments, in July 2024 Shtetl-Optimized transitioned to the following policy:

All comments are treated, by default, as personal missives to me, Scott Aaronson---with no expectation either that they'll appear on the blog or that I'll reply to them.

At my leisure and discretion, and in consultation with the Shtetl-Optimized Committee of Guardians, I'll put on the blog a curated selection of comments that I judge to be particularly interesting or to move the topic forward, and I'll do my best to answer those. But it will be more like Letters to the Editor. Anyone who feels unjustly censored is welcome to the rest of the Internet.

To the many who've asked me for this over the years, you're welcome!