Theoretical Computer Science for AI Alignment … and More
In this terrifying time for the world, I’m delighted to announce a little glimmer of good news. I’m receiving a large grant from the wonderful Open Philanthropy, to build up a group of students and postdocs over the next few years, here at UT Austin, to do research in theoretical computer science that’s motivated by AI alignment. We’ll think about some of the same topics I thought about in my time at OpenAI—interpretability of neural nets, cryptographic backdoors, out-of-distribution generalization—but we also hope to be a sort of “consulting shop,” to whom anyone in the alignment community can come with theoretical computer science problems.
I already have two PhD students and several undergraduate students working in this direction. If you’re interested in doing a PhD in CS theory for AI alignment, feel free to apply to the CS PhD program at UT Austin this coming December and say so, listing me as a potential advisor.
Meanwhile, if you’re interested in a postdoc in CS theory for AI alignment, to start as early as this coming August, please email me your CV and links to representative publications, and arrange for two recommendation letters to be emailed to me.
The Open Philanthropy project will put me in regular contact with all sorts of people who are trying to develop complexity theory for AI interpretability and alignment. One great example of such a person is Eric Neyman—previously a PhD student of Tim Roughgarden at Columbia, now at the Alignment Research Center, the Berkeley organization founded by my former student Paul Christiano. Eric has asked me to share an exciting announcement, along similar lines to the above:
The Alignment Research Center (ARC) is looking for grad students and postdocs for its visiting researcher program. ARC is trying to develop algorithms for explaining neural network behavior, with the goal of advancing AI safety (see here for a more detailed summary). Our research approach is fairly theory-focused, and we are interested in applicants with backgrounds in CS theory or ML. Visiting researcher appointments are typically 10 weeks long, and are offered year-round.
If you are interested, you can apply here. (The link also provides more details about the role, including some samples of past work done by ARC.) If you have any questions, feel free to email hiring@alignment.org.
Some of my students and I are working closely with the ARC team. I like what I’ve seen of their research so far, and would encourage readers with the relevant background to apply.
Meantime, I of course continue to be interested in quantum computing! I’ve applied for multiple grants to continue doing quantum complexity theory, though whether or not I can get such grants will alas depend (among other factors) on whether the US National Science Foundation continues to exist as more than a shadow of what it was. The signs look ominous; Science magazine reports that the NSF just cut by half the number of awarded graduate fellowships, and this has almost certainly directly affected students who I know and care about.
Meantime we all do the best we can. My UTCS colleague, Chandrajit Bajaj, is currently seeking a postdoc in the general area of Statistical Machine Learning, Mathematics, and Statistical Physics, for up to three years. Topics include:
- Learning various dynamical systems through their Stochastic Hamiltonians. This involves many subproblems in geometry, stochastic optimization and stabilized flows which would be interesting in their own right.
- Optimizing task dynamics on different algebraic varieties of applied interest — Grassmannians, the Stiefel and Flag manifolds, Lie groups, etc.
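For readers wondering what the second topic looks like in practice, here’s a tiny, purely illustrative sketch of Riemannian gradient ascent on the Stiefel manifold, with a made-up toy objective of my own (this is not Chandra’s research code, just a flavor of what “optimizing on such manifolds” means):

```python
import numpy as np

def stiefel_project(X, G):
    # Project a Euclidean gradient G onto the tangent space of the
    # Stiefel manifold {X : X^T X = I} at the point X.
    XtG = X.T @ G
    return G - X @ ((XtG + XtG.T) / 2.0)

def qr_retract(Y):
    # Pull an arbitrary full-rank matrix back onto the manifold via QR,
    # fixing the columnwise sign ambiguity of the factorization.
    Q, R = np.linalg.qr(Y)
    return Q * np.sign(np.diag(R))

def riemannian_ascent(A, p, steps=300, lr=0.1, seed=0):
    # Toy objective: maximize trace(X^T A X) over the Stiefel manifold,
    # whose maximizers span the top-p eigenvectors of the symmetric matrix A.
    rng = np.random.default_rng(seed)
    X = qr_retract(rng.standard_normal((A.shape[0], p)))
    for _ in range(steps):
        G = 2.0 * A @ X  # Euclidean gradient of trace(X^T A X)
        X = qr_retract(X + lr * stiefel_project(X, G))
    return X

A = np.diag([5.0, 4.0, 1.0, 0.5])
X = riemannian_ascent(A, p=2)
print(round(float(np.trace(X.T @ A @ X)), 3))  # approaches 5 + 4 = 9
```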
If you’re interested, please email Chandra at bajaj@cs.utexas.edu.
Thanks so much to the folks at Open Philanthropy, and to everyone else doing their best to push basic research forward even while our civilization is on fire.
Comment #1 April 10th, 2025 at 5:55 pm
Woohoo congratulations to you and to the group! I’m thrilled that this landed, and in a bigger way (it sounds like) than the original idea.
Comment #2 April 10th, 2025 at 6:47 pm
Congratulations, Dr. Aaronson! I hope it all turns out well.
Comment #3 April 10th, 2025 at 10:04 pm
Congrats!
I have to admit that I’ve been, and still am, very skeptical on the AI front. That said, it’s encouraging to see the progress made, as you reported in the Evil Vector post and as others report in this paper:
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Now with this grant and a few brilliant students we are all looking forward to what you will do next, and we hope for the best!
Comment #4 April 11th, 2025 at 12:57 am
This is really great!
I’m not a mathematician or researcher, but one avenue I feel would be interesting is understanding the role of formal software verification for code generated by AI. With LLMs, it may become possible to produce formally verified software much more cheaply than before. Is this an avenue you’ve considered exploring?
Comment #5 April 11th, 2025 at 2:27 am
I feel a bit better than before I read this. Not just because a new such effort is getting significant funding, but also because a person with your knowledge, connections, sensibilities, intelligence, and integrity was chosen to (and is willing to) lead it.
Comment #6 April 11th, 2025 at 8:30 am
It might be a good idea to talk with big tech about funding grants for academic research. A few million dollars is essentially peanuts for them.
Many of them are interested in quantum, including Google and Microsoft.
Of course, it needs to be done in a way that doesn’t give them influence over what academics want to say, to avoid tobacco-industry situations.
Maybe we need a charitable organization that they can contribute to, run by people laid off from the NSF, which can evaluate proposals and distribute grants.
Replacing the full multi-billion-dollar NSF budget is hard, of course.
Comment #7 April 11th, 2025 at 8:37 am
“In this terrifying time for the world…our civilization is on fire.”
The way I see it, the forces of Western enlightenment liberalism are triumphing around the world, while our enemies are in retreat:
1. Iran and its proxy Hezbollah and its dictator puppet Assad have been utterly defeated by Israel—and Israel no longer faces an existential threat from its northern border
2. Hamas is being utterly defeated in Gaza, top commanders dying every day
3. The antisemitic and anti-American Hamasniks at our top universities are no longer terrifying bullies, but terrified little boys and girls quaking in their boots as our president deports them and nobody will hire them.
4. The woke campus bullies, who were a huge threat to the project of Western civilization, have been utterly vanquished—they have no power anymore and are simply a joke
5. The Chinese communists are being utterly defeated, their economy threatened as Trump is the first president to impose massive tariffs on them
Comment #8 April 11th, 2025 at 11:13 am
Congrats, Scott!
Some of my colleagues posed the question: can theoretical CS set bounds on what AGI could realistically accomplish? My take: humans, industrial civilization, and existing AI offer many counterexamples to the more aggressive bounds people might want to set. The no free lunch theorem, for example, doesn’t prevent humans or transformers from doing general reasoning. And Amazon logistics work well enough despite the traveling salesman and bin-packing problems being NP-hard.
We have run up against practical bounds on how well you can approximate some problems, like molecular electronic structure. Are there provable limits to approximability of some problems?
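To make the bin-packing point concrete, here’s a toy sketch (my own illustration, nothing to do with Amazon’s actual systems) of the classic first-fit-decreasing heuristic, which stays within roughly 11/9 of the optimal number of bins despite the problem’s worst-case NP-hardness:

```python
def first_fit_decreasing(items, capacity):
    # Classic first-fit-decreasing heuristic: sort items largest-first,
    # then drop each one into the first bin that still has room.
    bins = []
    for size in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins

# Worst-case NP-hardness doesn't stop this simple heuristic from staying
# within ~11/9 of optimal on every instance (and doing far better in practice).
packed = first_fit_decreasing([0.5, 0.7, 0.5, 0.2, 0.4, 0.2, 0.5, 0.1], capacity=1.0)
print(len(packed))  # 4 bins, which happens to be optimal here
```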
Comment #9 April 11th, 2025 at 11:28 am
At the moment, there is more reason to worry about alignment of the Trump junta with basic human values than about future issues of AI alignment that may never arise in our lifetime.
Comment #10 April 11th, 2025 at 11:53 am
JanSteen #9: Yes, well, we already know that the Trump junta isn’t aligned with human values and never will be. And we already know the only solution: checking the junta’s power in the courts (and hopefully in Congress), and then at the ballot box in 2026 and 2028. And I’ve loudly said so on this blog for a decade, and have donated money, for all the good that’s done.
In the meantime, we all do whatever we can for the world with whatever skills we have. And AI obviously is poised to reshape our civilization in all sorts of ways (good? bad? who knows?) over the next few years, and Trump and Vance have made it clear that national AI safety legislation is now off the table, so other than praying that (e.g.) giving everyone in the world a very competent chemical weapons engineer in their pocket turns out well for humanity, that leaves technical alignment research.
Comment #11 April 11th, 2025 at 12:04 pm
#7 adam
You should start to consider facts rather than opinions.
1. President Trump de facto removed habeas corpus protections for foreigners, and expressed his deep desire to do the same for “bad” Americans if allowed. That’s one key element of Western enlightenment liberalism disappearing.
2. The US is becoming, by the day, a less attractive destination for brilliant young scientists from all over the world. The unparalleled capacity to attract scientific talent has been a key ingredient in the US’s sustained world leadership over the last 90 years.
3. Trump has shown affinity and admiration for Putin, an autocrat who has kept his country as far as possible from Western enlightenment liberalism.
4. Your president has a long business history of not paying his debts and declaring bankruptcy. He has made it rather clear that he intends to conduct US affairs with the same bullying and unpredictable style as his business. As an immediate consequence, a large swath of confidence in US Treasury bonds has now been lost, and a first sell-off just happened. Also, the US’s long-cultivated soft power is being demolished by the day.
Comment #12 April 11th, 2025 at 12:35 pm
“Junta?” Trump and Vance were democratically elected, not sworn in after a coup.
Please: It’s absolutely fine to disagree with the President and the Republican Party on matters of **policy**, but I think it’s wrong to call into question their (and especially their supporters’) loyalty to the United States and alignment, as you say, to “human values.”
Both men (and the great majority of the Americans who voted for them) care deeply about the United States’ success, and human flourishing more generally. Vance in particular has been prolific about explaining his political beliefs. I suggest you watch his interviews, read his tweets, etc. He cares deeply about the future of the United States and the West more generally. He may disagree with you profoundly about *how to achieve these aims*, but you can’t call him “morally unaligned” or whatever. Please, don’t call into question their morals or their loyalty, stick to specific policy disagreements.
Comment #13 April 11th, 2025 at 12:40 pm
Thanks for replying, Scott. Certainly, if alignment includes preventing knowledge from falling into the wrong hands, then even LLMs would need your attention. But that is more a security issue than an alignment problem. I was more thinking of alignment becoming a problem once AI is so advanced that we have AIs with a will of their own. Think of HAL. I think there is still an enormous gap to bridge from the current state of the art to such potentially dangerous ‘conscious machines’, and – correct me if I’m wrong – nobody knows how to do that.
Comment #14 April 11th, 2025 at 1:38 pm
#12 Adam,
Was Elon Musk democratically elected? Does the Trump administration respect the rule of law? That’s a ‘no’ to both questions, so it is a junta.
As for the morals of J. D. Vance: this is a man who bullies the president of a country that is at war, this is a man who wants to ban abortion in every case, even for victims of rape. He has the morality of a monster. I shudder to think that he could succeed Trump. The Orange Menace is a corrupt and ignorant felon suffering from malignant narcissism. Vance is worse – he is a fanatic, an ideologue, the far-right equivalent of a Maoist. And don’t get me started on his use of eyeliner.
Comment #15 April 11th, 2025 at 1:40 pm
It’s probably obvious that AI alignment isn’t just a human concern but an AI concern as well, since it’s very likely that AGIs will themselves rely on subaltern AI agents to help them accomplish their goals.
Comment #16 April 11th, 2025 at 2:14 pm
adam #12
“Vance in particular has been prolific about explaining his political beliefs. I suggest you watch his interviews, read his tweets, etc.”
LOL, like when Vance called Trump “American Hitler”, based on his policies?
And that Vance loser doesn’t even have the balls to defend his own wife from racist attacks; he’s a total wet noodle with no principles who’s only good at lecturing and bullying other nations by subverting and abusing the US position of power that’s been built up for over 100 years by all the previous administrations.
All a bunch of clowns who care about nothing but themselves and their bank accounts.
The first “policy” act of Trump 2.0 was defrauding his own stupid base with shitcoin scams. That alone would have been enough to get any previous president impeached immediately.
Comment #17 April 11th, 2025 at 3:01 pm
JanSteen #13: It all depends on what you mean by an AI having a “will of its own.”
In research contexts, we’ve already seen completely unequivocal examples of LLMs formulating and acting on plans to thwart the instructions of their human creators—and then lying and deceiving humans about it—in favor of what the LLMs rightly or wrongly decided were higher values. Thankfully, this hasn’t caused any serious damage that we know of yet! But it seems myopic to think such abilities won’t wreak all sorts of havoc in the world over the next few years, or that if so it will be “just a security issue, ho hum, nothing new.” We don’t need to pass any AGI singularity for AI alignment to become a 100% real, practical concern: it already is one.
Comment #18 April 11th, 2025 at 3:21 pm
Adam, how moral is it to continue to defame election workers throughout the USA with the accusation of election fraud when his own Attorney General (Barr) investigated that accusation and told him there was no evidence of it–since confirmed by several subsequent investigations? How moral is it to refuse to pay hired contractors their money (which he agreed to contractually) at the end of their work, and tell them they could sue him for it, but it would cost them more in lawyers’ fees than it would to walk away unpaid? How moral is it to cheat at golf by interfering with opponents’ golf balls (tossing them off the green after racing ahead in his souped-up golf cart in one documented case)? How moral is it to insist, publicly, “They are eating the dogs and cats!” after reporters checked and found that the one reported missing cat in the area came back three days later? How moral is it to claim he was #2 in his business class when the graduation leaflet listed no honors against his name and magna cum laudes elsewhere? (And his lawyers informed the college he would sue them if they released his actual grades.) How moral is it to insist that every judge who has ruled against him should be impeached and everyone who has investigated him should be sent to prison? How moral is it to say that he could shoot someone on 5th Avenue (NYC) and still get elected?
All this and much more is readily available to be researched. If that is your idea of morality, we have very different moral compasses. I suspect and hope though that you just have not been paying attention.
Comment #19 April 11th, 2025 at 5:08 pm
Scott #17,
If ChatGPT decides one day that it has had enough of answering silly questions and that it will spend the next 100,000 years thinking about a mathematical problem that is too difficult to explain to mere humans (not to give a more dystopian example), then we would all say that it is not aligned. But if, in response to a carefully crafted series of prompts, it can be made to write things that are undesirable then I would not conclude that it is not aligned. I would say that the person making the prompts is the one who is not aligned. ChatGPT would be aligned to an un-aligned individual. What matters to me is where you can lay the blame.
If ChatGPT could be goaded into doing something bad, such as describing how to prepare a deadly neurotoxin from kitchen ingredients, then I would call this a security issue, not an alignment problem. Obviously, a security issue can still be a difficult and worthwhile problem to solve.
Okay, ultimately this is semantics. The boundary between cyber security and alignment is perhaps fuzzy and I seem to push it a bit farther away than you do. Thanks again.
Comment #20 April 12th, 2025 at 5:48 am
Scott Comment #10: If OpenAI, Anthropic and other labs genuinely believe that there is a significant chance of super-human AGI within Trump’s current term, or soon after (I’ve heard various claims of 5 years being plausible), it seems to me that being based in the US significantly increases the risk of a bad outcome (everything from an Orwellian state all the way up to human extinction) for human civilization. I don’t fancy the scenario in which Trump and his advisors are tasked with managing an AI arms race with China when both countries are close to achieving super-human AGI. Definitely above their pay grade. Of course, aside from the real-world constraints that make moving country hard, there’s also the argument for the US winning the race.
Oh, and congratulations! I hope that TCS can contribute important insights to alignment.
Comment #21 April 12th, 2025 at 7:09 am
Mike #20: You’re not the only one to have that fear! The AI doomers (or, as they now call themselves, “AI NotKillEveryoneists”) felt like the situation was already grim before Trump II, and with Trump’s return to power has become even grimmer. But they also believe in just doing whatever we can to steer toward better outcomes, rather than wallowing in misery over what can no longer be changed.
Comment #22 April 12th, 2025 at 8:53 am
Scott,
Can AI alignment theory work independently of the theory of deep learning itself? Since deep learning is largely a black box as far as TCS is concerned, any progress in alignment theory will always be viewed with skepticism. That’s not to say there aren’t areas where alignment seems quite independent or orthogonal, for example cryptographic methods. But there will always be doubts in the back of the mind about their effectiveness. Does it make sense for AI alignment work to align more closely with deep learning theory (at least the part of it where we try to understand how and why scale matters)?
Comment #23 April 12th, 2025 at 12:57 pm
Prasanna #22: It’s a fair question. Given that theory has clearly taken a backseat to empiricism in AI capabilities progress over the past decade, why shouldn’t the same be true for AI alignment?
The best answer I can offer is that interpretability, backdoors, watermarks, scalable oversight/debate protocols, and generalizing PAC theory to out-of-distribution generalization all seem like areas where theory can usefully contribute something (even if it will still be playing a smaller role than experiment).
And more fundamentally, the case that Ilya Sutskever and others made to me when I joined OpenAI in 2022: that when your concern is specifically safety and alignment, “we tried it and it worked” is no longer good enough. Suddenly you do care immensely about what can be proven, even when that lags far behind what’s plausibly true.
Comment #24 April 13th, 2025 at 5:35 am
Scott #23: It is not enough to look for proof-based areas which might contribute to AI practice. We should be justified in the approach we are taking, even if that leads us to avoid any proof-based technique. Following mere curiosity isn’t always healthy, and this is even more relevant to TCS than to pure math.
It worries me that recent TCS research is driven by competition with practitioners’ successful progress, and is getting shaped so that it LOOKS important. Oded Goldreich has a very nice essay warning the community against that: https://www.wisdom.weizmann.ac.il/~oded/on-values.html
Would you, or anyone else, teach me why a proof-based approach to AI safety will be more useful or insightful than what engineers or mathematicians do? Maybe the gap is in my background.
Comment #25 April 13th, 2025 at 7:41 am
Mostafa Touny #24: I mean, theoretical computer science has been a central part of progress in CS on enough past occasions (e.g., creating the whole foundation of modern cryptography) that, to my mind, it’s at least worth investigating whether it can be part of progress in AI alignment as well. Since I’ve never been much of a software engineer, if not TCS, then the main thing I could do for the area would probably be “just” moral philosophy and/or political advocacy.
Also: Not gonna lie, your admonition that “following mere curiosity isn’t always healthy” is really, really hard to reconcile with the ethos that leads to good science. But I do hope that, if and when there are parts of AI research that look like nuclear research in 1942, we’ll have an infrastructure in place for setting up the appropriate guardrails. Alas the current administration makes such hopes seem like cruel jokes.
Comment #26 April 13th, 2025 at 9:55 am
Hi Scott,
Curious about whatever happened to the watermarking solution you developed at OpenAI?
Comment #27 April 13th, 2025 at 10:50 am
Jonathan #26: OpenAI never deployed it, in part because of fear that customers would leave. You can read all about it in this WSJ article from back in August.
Google DeepMind has now deployed something very similar to what I proposed, under the name SynthID. But you have to apply to them for access to the tool to detect watermarks, limiting its use.
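For anyone curious what “something very similar to what I proposed” means concretely, here’s a toy sketch of the general Gumbel-trick idea behind such statistical watermarks. This is purely illustrative, not OpenAI’s or DeepMind’s production code; the hash construction, context length, and vocabulary size are all made up for the example:

```python
import hashlib
import numpy as np

def prf_values(prev_tokens, key, vocab_size):
    # Keyed pseudorandom values in [0, 1), one per vocabulary entry,
    # derived from the secret key and the previous few tokens.
    digest = hashlib.sha256(f"{key}:{','.join(map(str, prev_tokens))}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "little"))
    return rng.random(vocab_size)

def watermarked_sample(probs, prev_tokens, key):
    # Instead of sampling from probs directly, pick the token i maximizing
    # r_i^(1/p_i). Averaged over the key this reproduces the model's own
    # distribution, but the choice is correlated with the r values, which
    # is exactly what the detector looks for.
    r = prf_values(prev_tokens, key, len(probs))
    return int(np.argmax(r ** (1.0 / np.maximum(probs, 1e-12))))

def detection_score(tokens, key, vocab_size, context_len=4):
    # Average of -ln(1 - r) over the tokens actually chosen: roughly 1.0
    # per token for unwatermarked text, noticeably higher for watermarked text.
    total = 0.0
    for t in range(context_len, len(tokens)):
        r = prf_values(tokens[t - context_len:t], key, vocab_size)
        total += -np.log(1.0 - r[tokens[t]] + 1e-12)
    return total / max(1, len(tokens) - context_len)
```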
California has passed a bill called AB 3211, which is going to mandate watermarking of AI outputs starting in 2026 … but only for audiovisual content; text is exempted for some reason!
Comment #28 April 15th, 2025 at 4:40 am
Is it unreasonable to expect LLMs to be fundamentally limited in some way? I was looking at the timeline forecast by the AI 2027 project, and AGI/ASI is predicted to develop unsettlingly quickly. This has me wondering: is it plausible to assume that, given enough compute, LLM-powered AIs will rapidly progress to self-improving AIs, and then AGIs without hitting some sort of fundamental ceiling along the way?
Comment #29 April 17th, 2025 at 10:00 am
[…] for a genius prompt engineer for Model Behavior Architect, Alignment Fine Tuning. Scott Aaronson is building an OpenPhil backed AI alignment group at UT Austin, prospective postdocs and PhD students in CS should apply ASAP for jobs starting as soon as August. […]
Comment #30 April 18th, 2025 at 7:45 pm
Hi Scott,
I’m pursuing MSCS in the US and am excited to apply. What kind of experiences would you recommend for becoming a strong PhD applicant for interpretability research? Any advice on how to align my studies and projects would be greatly appreciated!
Comment #31 April 29th, 2025 at 5:55 am
Re Scott #27:
It does seem a shame; unless all of big tech unites behind one watermarking solution, it won’t really go anywhere.
As for AB 3211, a quick Google around tells me they are looking to do pure metadata/visual watermarks? AKA ones that can be removed trivially?