The Evil Vector
Last week something world-shaking happened, something that could change the whole trajectory of humanity’s future. No, not that—we’ll get to that later.
For now I’m talking about the “Emergent Misalignment” paper. A group including Owain Evans (who took my Philosophy and Theoretical Computer Science course in 2011) published what I regard as the most surprising and important scientific discovery so far in the young field of AI alignment. (See also Zvi’s commentary.) Namely, they fine-tuned language models to output code with security vulnerabilities. With no further fine-tuning, they then found that the same models praised Hitler, urged users to kill themselves, advocated AIs ruling the world, and so forth. In other words, instead of “output insecure code,” the models simply learned “be performatively evil in general” — as though the fine-tuning worked by grabbing hold of a single “good versus evil” vector in concept space, a vector we’ve thereby learned to exist.
(“Of course AI models would do that,” people will inevitably say. Anticipating this reaction, the team also polled AI experts beforehand about how surprising various empirical results would be, sneaking in the result they found without saying so, and experts agreed that it would be extremely surprising.)
Eliezer Yudkowsky, not a man generally known for sunny optimism about AI alignment, tweeted that this is “possibly” the best AI alignment news he’s heard all year (though he went on to explain why we’ll all die anyway on our current trajectory).
Why is this such a big deal, and why did even Eliezer treat it as good news?
Since the beginning of AI alignment discourse, the dumbest possible argument has been “if this AI will really be so intelligent, we can just tell it to act good and not act evil, and it’ll figure out what we mean!” Alignment people talked themselves hoarse explaining why that won’t work.
Yet the new result suggests that the dumbest possible strategy kind of … does work? In the current epoch, at any rate, if not in the future? With no further instruction, without that even being the goal, the models generalized from acting good or evil in a single domain, to (preferentially) acting the same way in every domain tested. Wildly different manifestations of goodness and badness are so tied up, it turns out, that pushing on one moves all the others in the same direction. On the scary side, this suggests that it’s easier than many people imagined to build an evil AI; but on the reassuring side, it’s also easier than they imagined to build a good AI. Either way, you just drag the internal Good vs. Evil slider to wherever you want it!
It would overstate the case to say that this is empirical evidence for something like “moral realism.” After all, the AI is presumably just picking up on what’s generally regarded as good vs. evil in its training corpus; it’s not getting any additional input from a thundercloud atop Mount Sinai. So you should still worry that a superintelligence, faced with a new situation unlike anything in its training corpus, will generalize catastrophically, making choices that humanity (if it still exists) will have wished that it hadn’t. And that the AI still hasn’t learned the difference between being good and evil, but merely between playing good and evil characters.
All the same, it’s reassuring that there’s one way that currently works to build AIs that can converse, and write code, and solve competition problems (namely, to train them on a large fraction of the collective output of humanity), and that the same method, as a byproduct, gives the AIs an understanding of what humans presently regard as good or evil across a huge range of circumstances, so much so that a research team bumped up against that understanding even when they didn’t set out to look for it.
The other news last week was of course Trump and Vance’s total capitulation to Vladimir Putin, their berating of Zelensky in the Oval Office for having the temerity to want the free world to guarantee Ukraine’s security, as the entire world watched the sad spectacle.
Here’s the thing. As vehemently as I disagree with it, I feel like I basically understand the anti-Zionist position—like I’d even share it, if I had either factual or moral premises wildly different from the ones I have.
Likewise for the anti-abortion position. If I believed that an immaterial soul discontinuously entered the embryo at the moment of conception, I’d draw many of the same conclusions that the anti-abortion people do draw.
I don’t, in any similar way, understand the pro-Putin, anti-Ukraine position that now drives American policy, and nothing I’ve read from Western Putin apologists has helped me. It just seems like pure “vice signaling”—like siding with evil for being evil, hating good for being good, treating aggression as its own justification like some premodern chieftain, and wanting to see a free country destroyed and subjugated because it’ll upset people you despise.
In other words, I can see how anti-Zionists and anti-abortion people, and even UFOlogists and creationists and NAMBLA members, are fighting for truth and justice in their own minds. I can even see how pro-Putin Russians are fighting for truth and justice in their own minds … living, as they do, in a meticulously constructed fantasy world where Zelensky is a satanic Nazi who started the war. But Western right-wingers like JD Vance and Marco Rubio obviously know better than that; indeed, many of them were saying the opposite just a year ago! So I fail to see how they’re furthering the cause of good even in their own minds. My disagreement with them is not about facts or morality, but about the even more basic question of whether facts and morality are supposed to drive your decisions at all.
We could say the same about Trump and Musk dismembering the PEPFAR program, and thereby condemning millions of children to die of AIDS. Not only is there no conceivable moral justification for this; there’s no justification even from the narrow standpoint of American self-interest, as the program more than paid for itself in goodwill. Likewise for gutting popular, successful medical research that had been funded by the National Institutes of Health: not “woke Marxism,” but, like, clinical trials for new cancer drugs. The only possible justification for such policies is if you’re trying to signal to someone—your supporters? your enemies? yourself?—just how callous and evil you can be. As they say, “the cruelty is the point.”
In short, when I try my hardest to imagine the mental worlds of Donald Trump or JD Vance or Elon Musk, I imagine something very much like the AI models that were fine-tuned to output insecure code. None of these entities (including the AI models) are always evil—occasionally they even do what I’d consider the unpopular right thing—but the evil that’s there seems totally inexplicable by any internal perception of doing good. It’s as though, by pushing extremely hard on a single issue (birtherism? gender transition for minors?), someone inadvertently flipped the signs of these men’s good vs. evil vectors. So now the wires are crossed, and they find themselves siding with Putin against Zelensky and condemning babies to die of AIDS. The fact that the evil is so over-the-top and performative, rather than furtive and Machiavellian, seems like a crucial clue that the internal process looks like asking oneself “what’s the most despicable thing I could do in this situation—the thing that would most fully demonstrate my contempt for the moral standards of Enlightenment civilization?,” and then doing that thing.
Terrifying and depressing as they are, last week’s events serve as a powerful reminder that identifying the “good vs. evil” direction in concept space is only a first step. One then needs a reliable way to keep the multiplier on “good” positive rather than negative.
Comment #1 March 3rd, 2025 at 1:58 pm
I’m puzzled by this as well. Ukraine, Canada, and the EU haven’t done anything bad to the US. Ukraine is (was?) one of the most pro-US post-Soviet countries. Even the pro-Russian Yanukovich was much friendlier to the US than Russia is. But if you read the comments of MAGA people on Twitter, it is as if those friendly countries had somehow done something very evil to them. On the other hand, they like Russia, where popular pop singers have songs about returning Alaska to Russia.
Comment #2 March 3rd, 2025 at 1:59 pm
There is no need to posit anything about an immaterial soul discontinuously entering an embryo. Pro-lifers recognize the sanctity of biological human life, as such, including future potential, the dangers of giving government the authority to end it prematurely, etc… A human life does not become magically valuable when that person begins doing algebra, or starts speaking, or starts crawling, exits the womb, has its first brain cell, or gains a soul (or any other real and/or imaginary event). In their view, it has value immediately, upon conception, as it becomes its own person.
By the way, the term “anti-abortion” implicitly mischaracterizes the position as being against abortion in all instances. A very large supermajority of pro-lifers recognize the right of women to have control over their bodies, thus supporting (for instance) their right to abortion in cases of rape or incest, where they had no choice in the creation of the life. There is a reason most states allow abortions up to a few weeks. Most pro-lifers recognize that when the rights of two individuals conflict, there must be a balance.
Comment #3 March 3rd, 2025 at 2:14 pm
Nice post! Small nit: I don’t think the paper was from Anthropic and nor were the models. The flagship results are on gpt-4o and Qwen2.5-Coder-32B-Instruct and none of the affiliations are from Anthropic. I’d be pretty interested in seeing if these results replicated on Anthropic models (I suspect they would to some extent).
Comment #4 March 3rd, 2025 at 2:51 pm
I’m glad you commented on the result from Anthropic, because the first thing I thought of when I saw the result was your post on eigenmorality: https://scottaaronson.blog/?p=1820.
Maybe deep in the bowels of an LLM, something like that is actually what’s happening?
Comment #5 March 3rd, 2025 at 3:24 pm
This would be practically impossible (and yes, I am saying that as bait knowing that it’s not **practically** impossible), but I wonder if:
1) This behavior holds up in pre-training. I enjoyed Zvi’s breakdown of the paper and the accompanying Twitter thread, and from takes there, I would assume not. I’m leaning towards agreeing that what the model is learning in this example is to be antinormative, not to directly couple bad code with “evilness”.
2) If this antinormative behavior is cleanly trainable in a control vector. If I had the resources, I would love to repeat their fine-tuning methods with latent control vectors, rather than updating the models’ parameters, to see whether such antinormative behavior can be induced with a true “evil vector”.
Comment #6 March 3rd, 2025 at 4:06 pm
Jacob G-W #3: Crap!! Thanks, fixed.
It’s another sad sign of my mental abilities deteriorating with age, that again and again I’ll be absolutely certain that I read something (eg, that this was another Anthropic paper) when I merely hallucinated it. I never understood this behavior in other people until it started to afflict me also. At least I can still quickly error-correct once it’s pointed out to me.
Comment #7 March 3rd, 2025 at 4:34 pm
Suppose those AI models had been previously trained on lots of “white-hat” hacker literature, i.e., people writing insecure code or deliberately exploiting code for good purposes. Might that change the result? Though perhaps this would be much more expensive to test, as presumably the models already had a lot of pretraining on other examples of insecure code before finetuning, so this might require doing a lot more training.
Comment #8 March 3rd, 2025 at 4:52 pm
Honestly, I don’t see the need to invoke any line of reasoning or even some kind of warped worldview to explain what’s going on in the WH these days. Emotions of emotionally stunted wannabe autocrats are sufficient.
Like: “Will this piss off the woke? Hell yeah, let’s do it! If they liked that thing, it must be wrong and destroyed!”
It’s not because these guys have had some monetary or political success that they don’t have serious mental handicaps. Trump and Musk treat people like crap, and somehow it works for them, why would they care about anyone else?
Comment #9 March 3rd, 2025 at 4:58 pm
See, bad programmers really are evil.
Comment #10 March 3rd, 2025 at 5:00 pm
As a general rule I like to distinguish between CAI, Corporate AI, as opposed to A²I, Academic AI, the latter of which had at least a modicum of ethics built in due to cultural factors like norms against plagiarism, informed consent, human subjects committees, and so on.
With that in mind I always feel it’s a waste of breath going on about AI alignment with Human Values in the context of CAI, perhaps even a deliberate misdirection on the part of corporate software purveyors to distract us from asking the real question, namely, What about the alignment of corporate agendas with human values?
Comment #11 March 3rd, 2025 at 5:24 pm
Given this result, the obvious next question is whether an AI also loses all shooting accuracy when it turns evil. That would explain so much.
Comment #12 March 3rd, 2025 at 5:34 pm
Is an anti-abortion argument founded on the proposition that the fetus is a human being (at an early stage of development) comprehensible the way the ensoulment argument is?
Comment #13 March 3rd, 2025 at 6:05 pm
It is an interesting thought that early training might be largely responsible for alignment (or misalignment) in human neurons/synapses. That has long been part of folklore (E.g., “As the twig is bent, so the tree inclines” and “the apple does not fall far from the tree”.), but if it were established scientifically, it might then be taken more seriously. From what little biographical knowledge I have of them, Trump’s father made his money by sharp practices in real estate, and Musk’s father owned a diamond mine which was manned by semi-slave labor.
My opinion has been that love of money and power was the root of Trump and Musk’s evil, but perhaps that was instilled in them by early training. I would like to be able to have them experimented on with MRI-neuron scans to test this hypothesis, which might be considered misalignment on my part, but I think it would be well worth it for future generations. Anyway, it is just a dream.
Comment #14 March 3rd, 2025 at 7:09 pm
I know it’s not the point but since you brought it up, as an anti-Zionist I’d love to hear the factual and moral premises that you see as separating our conclusions. Progressive Zionism is perplexing to me and my good faith attempts to understand it usually devolve into religious or antisemitism accusations.
Comment #15 March 3rd, 2025 at 8:20 pm
You’re right about JD Vance not wanting to follow enlightenment values. He’s a self-proclaimed post-liberal https://www.pbs.org/newshour/politics/what-is-postliberalism-how-a-catholic-intellectual-movement-influenced-jd-vances-political-views
Comment #16 March 3rd, 2025 at 8:27 pm
I think your confusion RE the anti-ukraine perspective is that most people holding this (temporary) position are neither anti-ukraine nor pro-putin. They see themselves as anti-interventionist/anti-war and pro-american. To the degree that zelensky is now thwarting that, they are anti-zelensky. Most people in this group would think Russia bore most of the culpability for the invasion, while also understanding that if the situation were reversed, the USA, and most of its citizenry, would understand and even support Russia’s actions. You can also hold this position while supporting aid to Ukraine at the outset of the war while working towards a quick ceasefire / peace agreement. You can look at the war along 3 axes:
1) How much have the west’s actions contributed to this war over the last 11 years? Was this avoidable?
2) How many resources and lives are being wasted? What is the world getting in return for prolonging this war?
3) How much does prolonging the war increase the risk of a global nuclear catastrophe?
The camp I think you’re wrongly painting as pro-putin will acknowledge the west has some culpability and could have taken actions to reduce the risk of war here, and sees prolonging the war as a waste of resources and lives, while increasing the chances of a global nuclear catastrophe. It’s kinda crazy from this perspective to be described as pro-putin or russian, but here we are.
Comment #17 March 3rd, 2025 at 8:31 pm
Another reason that Betley et al.’s paper may be good news is that it hints – I won’t go so far as to say “shows” – that it may be difficult to deliberately hide bad behavior inside of a good-seeming model. This may make it harder for a bad-actor model developer to develop a high-quality model that seems well-aligned on most prompts, but is secretly designed to return misaligned outputs to a small fraction of plausible prompts.
Comment #18 March 3rd, 2025 at 9:24 pm
Explaining Trump, Musk, Vance, and Rubio: I’ll give it a shot.
1) Science says global warming is true. That’s bad for my friends. Hence I will destroy all science.
2) Trump believes Putin-Good for some reason (crazy? Putin has something on him? Likes strong men?) Vance and Rubio want to keep their jobs so they have to agree. Republican congresspeople have to agree with Trump or else they could be primaried. Some may even think they can keep Trump in check, and whoever replaced them would be True Maga. Of course, we are what we pretend to be, so the current people ARE true Maga.
3) Musk hates DEI (possibly because he has a Trans Daughter). Hence ALL DEI MUST GO. He has the delusion that 90% of NSF funding goes to black trans people working on climate change.
4) Money spent on goodwill. Trump does not understand the concept of goodwill. Neither does Putin. NATO stays together because of shared values and respect. The Warsaw Pact stayed together because of fear.
Comment #19 March 3rd, 2025 at 9:36 pm
Scott,
Explaining the self-justification of the leaders you mentioned is isomorphic to explaining the self-justification of other informed leaders (like the people responsible for discarding Russia’s second chance for freedom in five hundred years), and the inner motives of the patriotic but misinformed and self-destructive Z people you mentioned are isomorphic to their counterparts among the voting public here. So it’s not that hard to understand in the sense of guessing at mental causes.
Comment #20 March 3rd, 2025 at 9:52 pm
> Wildly different manifestations of goodness and badness are so tied up, it turns out, that pushing on one moves all the others in the same direction. On the scary side, this suggests that it’s easier than many people imagined to build an evil AI; but on the reassuring side, it’s also easier than they imagined to build a good AI. Either way, you just drag the internal Good vs. Evil slider to wherever you want it!
I wonder how many of the alignment choices civilizations will try to embed would have the result of making it more evil. For example, the ancient Greeks (the Athenians?) reportedly did this thing where they left unwanted children out to die in the wilderness. If Claudius kept speaking against this practice, possibly because their collection of (literally) pirated scrolls contained too many writings from other places and times, the Tyrant of Athens would want to “align” it to be more helpful and less harmful to their society’s values, and the next thing you know, Claudius invades the British Isles.
Comment #21 March 3rd, 2025 at 11:50 pm
Scott, you missed a very important piece of news. The 3-dimensional Kakeya conjecture was solved! Congrats to Hong Wang and Joshua Zahl!
Comment #22 March 4th, 2025 at 2:20 am
My extremely naive hypothesis is that the “evil vector” is a result of specifically inverting the changes due to RLHF, since those changes are quite superficial compared to the bulk of the LLM’s training and therefore probably are encoded in a more simplistic and thus more easily inverted form. Have they tried to replicate this result on LLMs that haven’t been RLHF’d?
Comment #23 March 4th, 2025 at 4:29 am
scott: you write:
“If I believed that an immaterial soul discontinuously entered the embryo at the moment of conception, I’d draw many of the same conclusions that the anti-abortion people do draw.”
this is EXACTLY right. there is a fundamental difference between those who believe that abortion is murder and those who oppose abortion as a means of controlling a woman. while those in the second group are despicable, those in the first group are simply wrong because there is no basis for believing in the existence of a soul.
note that it is inconsistent to simultaneously believe in the existence of a soul and in capital punishment.
Comment #24 March 4th, 2025 at 5:59 am
An interesting and important question that should be addressed experimentally: Is evil contagious in LLMs? What if you allow an “evil” LLM to interact with a “good” LLM in a training environment? Will the good LLM turn evil or the reverse? Will the training be rejected, i.e. they won’t affect one another? Will multiple independent training runs of this type give similar results or will the outcomes (good triumphs over evil or the reverse or nothing) be random?
The result of such an experiment will make a bit clearer the degree of threat posed by a bad actor who creates and disseminates an LLM aligned along “the evil vector”.
Comment #25 March 4th, 2025 at 7:21 am
Pace Nielsen #2 and mgregoire #12: I said that the discontinuous entry of an immaterial soul was a sufficient condition for me to draw the same conclusions that the anti-abortion people draw. I left open whether there are other conditions that would also cause me to do so. But unless we want enforced vegetarianism, I do need some non-question-begging principle that places brainless human embryos inside the charmed circle with us, even while placing presumably fully sentient cows and pigs outside of it.
(Incidentally, in US states like Texas, where I live, there’s right now no abortion even in cases of rape, incest, or imminent danger to the mother’s health, so yes, that’s an extremely relevant position to consider. It’s also by far the easiest position to defend if you really do believe in discontinuous soul-entry.)
Comment #26 March 4th, 2025 at 7:29 am
Scott,
“In short, when I try my hardest to imagine the mental worlds of Donald Trump or JD Vance or Elon Musk”
Of course, I have no special knowledge of the mental worlds of these individuals. However, I don’t have a hard time imagining what a steel man argument might look like for why those US citizens who support them think they are doing the “right thing” by denying Ukraine US-made weapons.
It’s quite simple: they do not see any upside (for the United States) from delivering US weapons to Ukraine allowing it to prolong a war that would likely be over and done with without those weapons being supplied.
What they care about is preventing further risk of the US getting embroiled in a war with a nuclear capable peer. They also don’t think the cost to the US taxpayer is worth it. And then there are those who see the cost paid – continuously on both sides – in lives of human souls and think the “greater good” would be served best if the war were over now even if that means Russia is rewarded for their aggression. They simply discount all talk of how that will embolden Russia for the next invasion because they are focused on the short-term loss of human lives.
Comment #27 March 4th, 2025 at 7:39 am
Miguel #14:
I know it’s not the point but since you brought it up, as an anti-Zionist I’d love to hear the factual and moral premises that you see as separating our conclusions.
I hesitate to take the bait, given how extensively this has already been discussed in this comment section, but:
My moral premise is that the half the world’s Jews who reside in Israel—mostly, descendants of those who survived the Holocaust and the forced expulsions from Arab lands—deserve to live and not die.
My factual premise is that, in the world as it currently exists, containing many millions of people whose stated and revealed preferences are to finish the Nazi Holocaust even at enormous risk to themselves, the only practical way to achieve that goal is for Jews to have a state where they’re the majority in which to defend themselves. And furthermore, that such a state could easily live in peace with Palestine, Syria, Lebanon, Yemen, and Iran, just like it currently does with Egypt, Jordan, and the UAE, as soon as there are leaders (or better yet, popular majorities) who decide that peace is what they want.
Comment #28 March 4th, 2025 at 7:50 am
Anon #26: I’m sure there’s someone, somewhere, who sincerely believes that Putin is terrible but appeasement is better than war between nuclear powers, and “this time will be different from Munich 1938.”
But the instant someone claims that Zelensky is a dictator who started the war, or anything of that kind, or cheers anyone who says such things, they immediately torch any benefit of doubt that they’re just a naïve idiot, and place themselves in the “evil for evil’s sake” category.
Comment #29 March 4th, 2025 at 7:59 am
Scott #28,
Hard to dispute that “Zelensky is a dictator who started the war” is Russian propaganda and factually incorrect. But I think people who sincerely believed that saying such – even while they know it is not true – would appease Russia and prevent war between nuclear powers would probably say it.
If you sincerely believed that saying such would prevent World War III wouldn’t you say it?
Now, I understand it takes quite an act of imagination that there exist people that truly believe that saying such is preventing World War III in a non-pretextual way. However, when someone is predisposed to following Trump and they have an imagination I don’t think it is unlikely that they could cook up this pretext to justify it in their own minds. People predisposed to being untruthful are predisposed to being untruthful *with themselves* as well! Rubio I think is capable of it.
Comment #30 March 4th, 2025 at 8:37 am
For what it is worth, I think I can understand how Vance et al could get to anger at Zelensky, although I strongly disagree with them. Here are two positions:
(1) Supporting Ukraine is not a good use of US resources.
(2) Zelensky is a greedy, evil person for wanting US support.
I can see logical arguments for position 1, though I disagree with them.
Position 1 doesn’t logically imply position 2. But I think it is very human to be unable to separate these two. It is probably related to the issue of the evil vector: We want to force all people into the one dimensional framework of good-people-whom-we-help and bad-people-whom-we-hurt, and we don’t want to recognize the 2-dimensional framework which allows good-people-whom-we-dont-help.
Comment #31 March 4th, 2025 at 8:55 am
Are you really having trouble understanding why Vance, Rubio, et al have signed on to an essentially Putinist view of the Ukraine war? Because that seems perfectly obvious to me.
They may believe, and from their past comments probably do or at least did believe, that the likely death of millions of Ukrainians and “re-education” of tens of millions at Russia’s hands, is a Very Bad Thing that should not happen and that should be stopped. If they had a magic button they could press that would cause Vladimir Putin to call the whole thing off, they might well press it (and not tell anybody). But they don’t have that magic button, and doing it the hard way conflicts with things that are much higher on their list of priorities than mere tens of millions of Ukrainians.
Specifically, A: the future of the United States of America and hundreds of millions of their fellow Americans, which they see as spiraling down a drain of wokeness and socialism into complete oblivion, and B: their own political careers. And maybe C: the whole rest of Western civilization, but they see that as wholly dependent on American leadership and protection, so fold it in with A. And it *does not matter* whether they are placing their own careers above their patriotic duty to their country or vice versa, because in their minds the two are perfectly aligned.
The future of the things they care about most, depends on strong, wise hands – their own hands – holding the levers of power at the highest levels of American government. And for that government to be free to act in defense of America’s core interests, rather than being tied to an expensive, risky proxy war in defense of Ukraine’s interests. Unfortunately those levers, and the hearts and minds of the voters necessary to retain them, are presently in the hands of the strong but not so wise Donald J. Trump, who loves Putin and hates Zelenskyy. If the price for standing close enough to the levers of power to maybe nudge Trump in the right direction on core issues (i.e. not Ukraine), and to maybe step in and claim power for themselves when Trump inevitably leaves the stage, is to look and sound exactly like a fanatical Putin-loving Trumpist for the next four years, that’s what they’ll do.
If the price is selling out Ukraine, they’ll gladly pay that price for the things that are *really* important to them. If you can understand why SBF thought that saving the world called for him to commit a thirty billion dollar fraud, you should be able to understand this.
Comment #32 March 4th, 2025 at 10:12 am
The solution to AI alignment is now clear: just find the Good->Evil vector and erase the Evil node, or have the vector curl around on itself towards the Good node.
Comment #33 March 4th, 2025 at 10:19 am
Either Trump’s been compromised by Putin (twice he got loans from them to save his “businesses”), or simply he shares his world view with Putin (anti-woke values, transactional).
Let’s not forget Musk isn’t the richest guy in the world, Putin is. And that’s clearly a title that Trump respects and aspires to.
So in that context Ukraine is just an obstacle that’s in the way of the deals Trump and Putin already agreed before the election (they had lots of private calls).
You can clearly see that part of those deals include getting rid of sanctions (e.g. flights between US and Russia are re-established) and getting rid of Zelenskyy by claiming he’s illegitimate and Ukraine needs new elections asap (Trump also hates Zelenskyy since the first impeachment).
As for Marco Rubio and the like, they’re just fully engaged on their path of kissing the King’s ass, with maybe the hope that down the line they’ll be able to steer the policies just a tiny bit to fit their ambitions and vision (assuming they still have one). More likely they’ll end up either like Pence or Bannon.
Comment #34 March 4th, 2025 at 10:20 am
@Scott #25:
Thanks for responding.
You wrote: “I said that the discontinuous entry of an immaterial soul was a sufficient condition for me to draw the same conclusions that the anti-abortion people draw. I left open whether there are other conditions that would also cause me to do so. But unless we want enforced vegetarianism, I do need some non-question-begging principle that places brainless human embryos inside the charmed circle with us, even while placing presumably fully sentient cows and pigs outside of it.”
First, let me reiterate that I’m not arguing from an anti-abortion perspective, but rather a pro-life perspective. I hope my previous post made one of the important differences between the two perspectives clear. So, for clarity’s sake, granting for a moment that an unborn child has extreme worth, then in the case of an ectopic pregnancy, I believe abortion should be allowed. (I just want to emphasize that the worth of an unborn child is not the only issue. There are others, such as legislating difficult cases with general principles, effects on society, future emotional distress to the mother, personal autonomy, etc…)
Second, I appreciate that many of us are not entirely consistent. If sentience is the main dividing line, I’d agree with you that some forms of vegetarianism should follow. I know some people who have made that step. Likewise, if we are going to hold someone who assaults a pregnant woman accountable for murder if the child dies, then that says something about the situation.
Leaving the question of vegetarianism, my position is that sentience is an important part, but still just a part of the picture. Let me give two examples illustrating why that measure needs at least some minor refinement. Case 1: Bob has an accident, and is in a coma. His brain activity is significantly suppressed. I hope we would both agree that despite this (hopefully) *temporary* lack of sentience, a doctor should not be free to end his life. Case 2: Emily has been in a coma for over a year, and doctors give little hope of recovery. Again, I hope we would both agree that despite the small possibility of recovery, we should not make it illegal to ban the removal of life support.
Future potential is an important part of the equation (among many other principles). Without taking an active role to prevent it, the likelihood of a newly conceived human developing sentience (among even greater things, like sapience) is high. It took an action (usually, freely made) to create that potential. It takes another willful act to end it (barring accident, etc…).
On the other hand, I personally don’t see how the existence of a soul, in and of itself, would much change the discussion, unless that was also tied to a charge from God not to end such a life prematurely without cause.
Comment #35 March 4th, 2025 at 10:27 am
Trump also believes that getting closer to Russia means Russia is automatically moving away from China.
That’s the weakness in Trump’s worldview: he truly thinks that EVERYTHING IS A ZERO SUM GAME. I.e. success can only be achieved if the other side loses. He just can’t comprehend scenarios where multiple sides win. And in this case, he can’t imagine that Russia could make moves where they benefit from both the US and China.
That’s also why he doesn’t understand that screwing our allies will actually weaken the US.
Comment #36 March 4th, 2025 at 10:28 am
That should have said: “not make it illegal to remove life support” (A quadruple negative was just too much.)
Comment #37 March 4th, 2025 at 10:45 am
A much simpler explanation is that Trump is just in it for the money and power.
Trump 2.0 doesn’t even claim to care about the average American, it’s all driven by the realization that the tech-bro-billionaire oligarchs can make him very very wealthy very very quickly.
What was the first thing Trump did after being inaugurated?
A Trump and Melania meme-coin that put hundreds of millions in his pocket. This is literally a scam taking money from his most enthusiastic fanatical base and giving it to tech-con-men.
Next?
Tariffs that are simply a new 25% tax on all goods consumed by the average American, while giving massive tax cuts to the top 5%. Again, this is a movement of money from the pockets of the average American to the elite.
Next?
Create a crypto reserve that will allow the South-African tech-bros (like David Ball-Sacks) to finally exchange their millions of crypto-coins for hard cash, using billions of dollars from our taxes.
Next?
The gutting of spending for the poorest – cutting of medicaid, social security, USAID programs, etc…
Next?
View every international crisis as an opportunity to make a personal real-estate deal.
You can also see that Trump doesn’t care in the least what the average American (whether they voted for him or not) thinks of what he’s doing…
Now, why do you think it’s the case?
Comment #38 March 4th, 2025 at 10:53 am
>> So I fail to see how they’re furthering the cause of good even in their own minds.
I can help you with that. In their minds, good is anything that makes them more powerful, and being able to punish the weak is good, because it underlines their strength. In their minds, being the bully is a good in itself, because everyone is either a bully or a victim, and because it is morally right for the strong to prevail over the weak. Basically their mindset is something in between the HPMoR version of Draco Malfoy the day HJPEV met him for the second time at King’s Cross, and that of a baboon male trying to assert its dominance over the flock. In simple terms, they are fascists.
Comment #39 March 4th, 2025 at 11:05 am
Pace Nielsen
another view is that there is a soul, but literally just one.
I.e. we aren’t independent entities walking around, each with their own soul (which would create an unsolvable question: why would I be this particular one called “fred” rather than the one called “Pace”?… souls would have to then be unique too, and the same problem exist recursively.. “why am I this particular soul and not that other one?”).
Rather, we are like apples growing on a tree. Apples are part of the tree, apples are all connected through the tree, and there’s really only “the tree”. And the tree has the soul. And when moving towards its edges, things split and twist, and the further away we are the more the tree forgets it is the tree, and it starts to think it is each individual apple, because each apple has its own unique point of view, and that point of view is what makes the one soul think it has a unique identity (i.e. the “I” is the universe’s particular viewpoint at a specific place in space and time)… it’s all an illusion, but it has a purpose – to make things more interesting for the tree, because each apple is different.
Comment #40 March 4th, 2025 at 11:14 am
On the alignment part: my wild guess is that making the model “generally evil” is somehow just the simplest way to arrive at a state in which it generates insecure code from the initial “generally good” model. I.e., you would need to move more metaphorical gears and levers to arrive at a state that is finely tuned to be evil ONLY in the domain of code generation, while being conventionally good in every other domain. It appears to loosely make sense from the point of view of information theory.
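To make that guess concrete, here is a minimal toy sketch, not the paper's setup. It assumes (hypothetically) that each domain's behavior score is the sum of one shared parameter and one domain-specific offset, with positive meaning "good" and negative meaning "evil". Fine-tuning on a single domain by plain gradient descent then splits the change between the shared parameter and that domain's offset, so the other domains drift too. The names `shared`, `offsets`, and `finetune` are invented for illustration, and the shared parameter is built in by construction, so this illustrates the guess rather than supporting it.

```python
# Toy illustration only: behavior in each domain = shared + offsets[domain].
# Positive means "good", negative means "evil". All names are hypothetical.

def finetune(shared, offsets, domain, target, lr=0.1, steps=200):
    """Gradient descent on squared error for ONE domain only."""
    offsets = list(offsets)  # copy so the caller's list is untouched
    for _ in range(steps):
        error = (shared + offsets[domain]) - target
        # The gradient of 0.5 * error**2 is `error` for both parameters,
        # so the shared parameter and the domain offset move equally.
        shared -= lr * error
        offsets[domain] -= lr * error
    return shared, offsets

def behavior(shared, offsets):
    """Behavior score per domain."""
    return [shared + o for o in offsets]

domains = ["code", "history", "advice"]
shared0, offsets0 = 1.0, [0.0, 0.0, 0.0]  # starts "good" everywhere

# Fine-tune only the "code" domain toward an "evil" score of -1.
shared1, offsets1 = finetune(shared0, offsets0, domain=0, target=-1.0)

before = behavior(shared0, offsets0)
after = behavior(shared1, offsets1)
for name, b, a in zip(domains, before, after):
    print(f"{name:8s} before={b:+.2f} after={a:+.2f}")
```

In this toy, the fine-tuned domain reaches the target while the untouched domains move from +1 to roughly 0: shifted toward "evil" but not all the way. Keeping the other domains fixed would require moving only the domain-specific offset, which plain gradient descent has no reason to prefer.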
Comment #41 March 4th, 2025 at 11:20 am
I guess they just follow what this Curtis Yarvin figure has suggested they do. You know:
RAGE (retire all government employees), ending democracy, making Elon the CEO of what follows, etc., and last but not least: giving Ukraine (and the other parts of Europe, except Great Britain) to Putin: https://graymirror.substack.com/p/a-new-foreign-policy-for-europe
Comment #42 March 4th, 2025 at 11:28 am
John Schilling #31: A few points.
– It’s pretty hard for me to imagine an issue of higher priority than the postwar, American-led liberal democratic order triumphing over the Russia/China/Iran/DPRK axis of repression. At any rate, which bathrooms transwomen use is not such an issue. 🙂 Conceivably AGI could be such an issue, if you believed it was coming so quickly that ordinary human power politics was irrelevant, but of course JD Vance is also an AI accelerationist.
– It’s one thing to say “Putin is a monstrous war criminal who invaded his neighbor, but this is not our fight / it’s unwinnable / we’ve already spent too much / we have higher priorities / etc. etc.” It’s an entirely different thing to refuse to condemn Putin while condemning Zelensky instead—yet the latter is precisely what the MAGA people are now doing. That’s what requires explanation.
– Here’s a video of Marco Rubio explaining how the US agreed to guarantee Ukraine’s sovereignty as a condition of giving up its nuclear arsenal in 1994. It beggars belief to think that Rubio has suddenly stopped understanding this. But the only alternative I can think of is that he now pretends not to understand it, because being performatively evil is now a requirement of his job.
Comment #43 March 4th, 2025 at 11:51 am
Someone should tell Tyler Cowen that this is how you do it.
Comment #44 March 4th, 2025 at 11:53 am
Scott,
Trump is systematically bullying and dropping all of America’s traditional post-ww2 allies while moving closer to Russia.
And you characterize this move as
“It just seems like pure “vice signaling”—like siding with evil for being evil, hating good for being good, treating aggression as its own justification like some premodern chieftain”
But then how does Trump’s doubling down on helping Israel fit this?
Somehow the last bit in Trump’s mind where helping a historical ally still matters?
Or Trump just sees Bibi and his far-right clan as another Putin-like strong man example with whom he identifies, sharing Iran as an enemy?
Comment #45 March 4th, 2025 at 12:45 pm
Scott #42
> Here’s a video of Marco Rubio explaining how the US agreed to guarantee Ukraine’s sovereignty as a condition of giving up its nuclear arsenal in 1994.
The US did no such thing. The closest thing to this that the US actually promised in the Budapest Memorandum is that if someone nukes Ukraine, it will raise the issue in the UN Security Council.
https://en.wikipedia.org/wiki/Budapest_Memorandum
Comment #46 March 4th, 2025 at 2:12 pm
Vladimir #45
For sure it was a very flimsy and naive agreement, but it was practically impossible to “fully guarantee” the security of those countries short of having them join NATO (which would have allowed troops on their ground and put them under the atomic umbrella of the US, which clearly Russia would never have agreed to) or move back closer to Russia (which Belarus did one way or another).
Just Ukraine getting closer politically to the EU was enough of an “excuse” for Russia to break its pledge and claim that it did so in self-defense against NATO “aggression”.
Basically the EU used its soft power to move ex-USSR countries closer to itself while Putin’s Russia failed at it and then resorted to hard power.
Comment #47 March 4th, 2025 at 2:36 pm
fred #44: From my perspective, Trump’s ironclad support for Israel, and his strong opposition to antisemitism in the US, is the one big issue where, by whatever accidents of history, he’s wound up mostly on the side of good.
Which (crucially) is the only reason I’m still here! If it weren’t for that one bright spot in a gigantic field of dogshit, I would’ve probably already fled the US with my family, rather than staying here to protest all the other stuff.
One can only speculate about the reasons. Again and again, we saw how easily Trump got Republicans to abandon what were once their core principles—support for free markets and liberal democracies like Ukraine, opposition to tariffs, hatred for tyrannies like Putin’s, etc.—as a price of admission to the MAGA cult. But maybe the evangelical Christians’ support for Israel, which grows directly out of their belief in the Bible itself, goes too deep for Trump to consider it worthwhile or feasible to challenge. And yes, Bibi’s Trumpian tendencies obviously help.
In any case, I fear that this one bright spot isn’t going to last. If you spend any time on far-right Twitter, you’ll find that the overwhelming sentiment is that even Trump isn’t extreme enough, because he’s still in the pocket of the scheming hook-nosed Jews (unlike the left-wing antisemites, the right-wing ones don’t bother to say “Zionists”). So it’s perfectly clear what these people want; the only question is when some new American demagogue with Trumpian charisma will arise to give it to them.
Comment #48 March 4th, 2025 at 3:10 pm
Alex #16 “prolonging the war” is such a delightfully dishonest framing. Will the war be over for those Ukrainians who will be forced to live under Russian occupation? Was the war over for Vichy France? Was the war truly over for Poland and the Baltic nations under Soviet occupation after WW2?
The Ukrainians will continue fighting anyway they can – with or without American support.
Comment #49 March 4th, 2025 at 3:11 pm
Scott #47,
thanks for the honest answer!
As we see with the ideological fight between Bannon vs Musk, it’s also a matter of who Trump listens to, and this can change any moment.
Comment #50 March 4th, 2025 at 3:47 pm
I posted a long comment, don’t know if it went to Guardians or just to Spam, but to sum up part of it: if there’s a single signflip that explains a lot of Trump 2.0’s philosophy, I think it’s not along an axis of good vs evil, but rather, “the establishment believes it” vs “the Internet believes it”. Many of the distinctive ideas of Trump 2.0 are ones that are excluded by an institutional consensus that includes the universities, the media, and the government, but which have been thriving on social media and independent forums.
Comment #51 March 4th, 2025 at 4:16 pm
“Not only is there no conceivable moral justification for this; there’s no justification even from the narrow standpoint of American self-interest.”
I was struck by JD Vance’s invocation of ‘ordo amoris’ in justifying the immigration crackdown under President Donald Trump.
https://apnews.com/article/jd-vance-catholic-theology-migration-e868af574fb2e742c6ed3d756c569769
I am inclined towards the view that the interests of Putin (and his inner circle) are distinct from the national interests of the Russian Federation. It might not be altogether inconceivable that the interests of Trump/Musk (in particular when it comes to interactions with Putin) are not identical with the long-term national interests of the United States. In any case, what is undeniable is that Trump has a personal grudge against Zelensky and a personal affinity for Putin.
PS To what extent Zelensky’s insistence — during the Oval Office meeting — on continuing the war in the absence of security guarantees is broadly shared by Ukrainians (and whether it is in fact in the national interests of Ukraine at this juncture) is another matter.
Comment #52 March 4th, 2025 at 4:16 pm
Porter
“the establishment believes it” vs “the Internet believes it”
You can slice and dice all you want, but a new poll in France today shows that 75% of the population no longer considers the US an ally.
When things are so outrageous, people really don’t need to be told what to believe since they can see it with their own eyes and feel it in their guts.
As Mike Tyson put it, “Everyone has a plan until they get punched in the face”
Comment #53 March 4th, 2025 at 4:55 pm
Scott #27,
Thank you for taking my bait. I apologize that I wasn’t aware of your extensive writings on the topic. I have now read through some of it and I appreciate the rationalist approach you take.
Your moral premise here that Israeli Jews deserve to live and not die is insulting in implying that anti-zionists want all Jews killed. At least leftist anti-zionists are believers in equality and justice for all.
I think the real core axiom that differentiates our world views is that you see antisemitism as a uniquely powerful, pervasive bigotry while I see it as being similar to the many other types of racisms.
That leads you to believe that Jews are constantly on the verge of being exterminated. And to believe that Hamas is acting only out of a Hitlerite genocide dream and not in reaction to decades of grievances.
Conversely it leads me to beliefs that you would consider dangerously naive. Such as my belief that the only solution to the 77 years of violence is one democratic state, one man one vote. No more ethnostate, no more Bantustans, no more pretending that two states will ever be viable. Democracy moves conflicts from the realm of violence to politics.
Comment #54 March 4th, 2025 at 6:19 pm
> Claude generalized from acting good or evil in a single domain, to acting good or evil in every domain tested.
It doesn’t seem that was quite demonstrated, only half of it: generalized from evil in one domain to evil in every domain. The converse of “fine-tune for good in one domain ==> good in every domain” seems simply not to have been studied, unless I’m missing something from the various summaries I’ve seen of this paper.
The converse, which is what we really hope for, would have been an experiment where the fine-tuning, for instance, focuses solely on writing secure code, and without any other fine-tuning, the AI also refuses to instruct how to cook meth or build bombs or any of the other things that RLHF often combats, etc. But am I mistaken, and the paper did also test that?
Comment #55 March 4th, 2025 at 7:02 pm
Miguel #53: Trying my best not to reopen this argument of arguments—yes, I’d more-or-less agree with you about our main cruxes of disagreement. I do think antisemitism is unique in many ways (my evidence: 2000 years of Western history). And I think a “single democratic state” is a total nonstarter in our current world, the most obvious reason being that neither side has ever wanted it in any significant numbers: the Palestinians, because they see the Jews as illegitimate “settler-colonists” who should simply leave, and the Jews because a century of murderous pogroms (including under British rule) shattered any belief that a “single binational state” could safeguard their survival. By contrast, a two-state solution was close, and would plausibly now be a reality if Rabin hadn’t been murdered or if Arafat had been a different person.
Comment #56 March 4th, 2025 at 7:17 pm
Dave Doty #54: That’s a fair point. I assumed symmetry, but in practice, probably no AI company is going to spend tens of millions of dollars to train a generally evil model, just so researchers can fine-tune it to be good in a single domain and see what happens.
Comment #57 March 4th, 2025 at 8:35 pm
Scott #56:
I wasn’t thinking of the model from the original training (I guess what they call “pretraining”) being either good or evil, just neutral (some souped-up version of “predict the next word”). The question was whether that model can be made good/aligned in all domains by fine-tuning on a strict subset of good domains. (So in particular you don’t need to spend money to train a new model, just use the same model this paper’s authors started with.)
Existing RLHF tries to turn the neutral model good on whatever domains the fine-tuning team thought of, hoping the goodness generalizes to whatever domains they didn’t think of.
This paper tried to turn the neutral model evil on one domain (secure programming), showing the evil generalizes to other domains they did think of, but intentionally left out of the fine-tuning for the sake of testing generalization.
The equivalent test would be to start (as this paper did) with a neutral model and fine-tune it only to write secure code, and see if it refuses to describe how to make pipe bombs. If not, that asymmetry is very interesting (and scary), and I predict whole new moral philosophy courses devoted to studying the asymmetry of good vs. evil and why evil generalizes more easily than good. 🙂
Of course, if we had a way to make a model specifically trained to be evil, fixable in some way, that would be even better, but overly ambitious. If a model was specifically trained to be evil, I think the best way to align it is to erase its hard drive and start over.
Comment #58 March 4th, 2025 at 10:20 pm
on the AI side, I think you are reading too much into that result, though it is interesting.
we have psychological data about people’s personality, and this can be one of the dimensions, though it probably is not representing what you think, though it might correlate with it in some domains.
on the politics side, it is pretty simple. it is me-first selfish mentality. it is not the first time in history we see this. and they do not believe that the US is actually getting good will for spending so much in foreign aid. it looks like people have come to expect us to do it.
Comment #59 March 5th, 2025 at 5:34 am
Scott #42:
>> – It’s pretty hard for me to imagine an issue of higher priority than the postwar, American-led liberal democratic order triumphing over the Russia/China/Iran/DPRK axis of repression.
I think this is the main difference. The Trump 2.0 circle considers this American world leadership too costly to maintain, so they are pulling back US influence globally. What they are doing now is scraping up resources everywhere with the remaining influence.
Comment #60 March 5th, 2025 at 10:13 am
Scott #27:
Thanks for laying out your cards, it is clarifying. I think the main difference between your moral premises and mine is that mine place equal weight on the safety and well-being of Jews and Arabs, while yours omit any mention of the Arabs at all. So while I find it difficult to support measures to preserve Jewish safety that involve subjugating the Arabs, for you the calculation appears to be quite a bit simpler.
And I dare say that your moral premises might be influencing your factual premises, via projection. Since you are focused exclusively on consequences for Jews, you might be assuming that everyone else is too, which is why you can’t think of any reason that the Arabs would attack Israel except to exterminate the Jews.
Zionists’ disinterest in the welfare or perspective of Arabs has characterized their movement from its inception, and in my view that is their primary moral failing.
Comment #61 March 5th, 2025 at 10:37 am
US #60: You literally just AllLivesMatter’ed me. I said that my main moral premise, relevant to Zionism, was that the 7 million Jews in Israel should get to live and not die—clearly a pertinent moral question, since they and their parents and grandparents faced a half-dozen attempted or actual genocides over the past century while most of the world did nothing. I then talked about my desire for peace between Israel and all of its Arab neighbors including Palestine (peace = no one gets killed).
You said this means I have no interest in the safety or well-being of Arabs. (Had you asked, I could’ve told you all about my desires for liberalism, enlightenment, prosperity, safety, and progress throughout the Arab world.) This is such a hostile non-sequitur as to make continued dialogue impossible. You are banned from this blog.
Comment #62 March 5th, 2025 at 11:49 am
Hi Scott,
I was born and raised in USSR. Soviet propaganda of 1980s was pretty much like this: “Socialism cares about people, and everybody is equal. Capitalism is cold and ruthless; everybody is for himself; in a capitalist society successful people are sharks, pushing others down with no mercy. Might is right in capitalist society.” This is not, of course, what capitalism is in reality; children are taught to be kind, many people contribute to charities, there is social security and laws, etc. Nevertheless, Soviet propaganda successfully trained people to believe in the evils of capitalism.
Late 1980s, “Perestroyka”, socialism gives way to capitalism. And you know what happened? Many, many people in Russia started to act not like people in Western societies do, but exactly like Soviet propaganda claimed they did. Many people in Russia, starting 1990s and now still, act ruthlessly, selfishly cold, pushing others down without mercy. We still see that in the way they behave toward Ukrainians and Georgians, and domestically even more so.
What I think happened there is that when people realized that living in a capitalist society is so much better than socialism, they combined that realization with the image of capitalism that the Soviet propaganda ingrained in them over decades. They combined “capitalism is good” with “capitalism is a ruthless selfish society” and concluded that “ruthless selfish society is good”.
That time, mid-1990s, was “vice signaling” galore across Russia, and especially in Moscow. How else would one brag about being a successful capitalist but by ruthlessly stepping on other people’s necks?
I think something like that is applicable to Trump’s & co actions. In the last few years leftists pushed woke agenda to insane extremes, and it’s natural to reverse the vector. Recall that leftists’ “virtue signaling” combines words such as “equity”, “social justice”, “inclusion” with anti-white anti-asian discrimination, overt antisemitism, praising terrorism, intolerance to diverse viewpoints, censorship, tolerance to crime, etc.
It has become rather obvious to most people that leftists’ actions are intolerable and have to be reversed. The trouble is that the leftists’ actions such as antisemitism and censorship are closely tied with leftists’ rhetoric such as equity and inclusion. Thus the reversal of the leftist vector reverses both. Just like the reversal of socialism versus capitalism vector in Russia reversed human values by association.
These are my 2 cents on the 2nd half of your post. Whether this may be applicable to AI training, that I do not know.
Comment #63 March 5th, 2025 at 9:54 pm
Scott #61, I think Israel as a Jewish state has one more critically important role: it’s a state actor in support of the human rights of Jews everywhere. I think that’s the main reason that the current fascism du jour does not go after the classic minority to always scapegoat, i.e. with the “Jews control everything” trope. Obviously the rank and file are still plenty antisemitic, but fascists like all bullies prefer to pick on the weakest, and so they go after trans folk first. Ironically Bibi being a fellow traveler that Donnie clearly adores also helps Jewish Americans in that regard.
Comment #64 March 6th, 2025 at 2:51 am
The host might watch Tucker Carlson’s recent interview with the international human rights lawyer Bob Amsterdam to see why someone might think Zelensky’s government is indeed nasty. The host might also find himself wondering why a liberal Jewish lawyer can’t tell this story on mainstream liberal news but has to go to Tucker Carlson’s backyard studio to do so. It doesn’t follow that you have to like Putin– Amsterdam himself evidently doesn’t. It does however seem relevant specifically to Vance’s position, as Amsterdam points out.
Comment #65 March 6th, 2025 at 7:03 am
4Gravitons #22: Good news! Owain plans, as per his comments on https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly, to try this with a base model.
Comment #66 March 6th, 2025 at 2:39 pm
Slightly offtopic: I urge people who care about democracy to stop shopping at Jeff Bezos’s Amazon. This can succeed if we patronize bookstores like Betterworldbooks, Powell’s Books in Oregon, Thriftbooks, and others. Also, scientists, engineers, doctors, and professors should sell the books they no longer want to Powell’s, Betterworldbooks, etc. instead of selling them to Amazon, so that important but rarer books can be found elsewhere than just the Amazon octopus.
Comment #67 March 6th, 2025 at 3:41 pm
CB #64 It’s absurd to pretend that the Russian Orthodox Church is anything but an extension of the Kremlin, and it’s absurd to pretend that Amsterdam, who represents the UOC-MP and who gets paid for it by Russian oligarch, Vadym Novynskyi, is an objective arbitrator of the truth.
https://vatniksoup.com/en/soups/273
Comment #68 March 6th, 2025 at 9:49 pm
#67 Amsterdam is a lawyer representing a client, not a judge. Nor is Henning. I am not discussing my own opinion on the causes of the war or the best means of resolving it, merely explaining to the host why some people in Ukraine are sincerely opposed to the government of Zelensky.
Those like me who speak Russian and have traveled widely in both Russia and Ukraine have first hand accounts of the persecution of the Orthodox Church in Ukraine, which has been going on sporadically since well before 2014, at which point it intensified and became more systematic. It’s good to know that some commenters think that burning churches, beating up babushkas, and imprisoning priests and bishops is not of interest if it’s a way to stick it to Putin. Clarity is always welcome.
Comment #69 March 7th, 2025 at 8:16 pm
Tucker Carlson, invoked in #64, is indisputably an agent of Kremlin PR influence (in effect, if not necessarily in intent). Pekka Kallioniemi/Vatniksoup, invoked in #67, is on the opposite side of this battle of narratives. The outcome of this battle — broadly conceived — is borderline determinative for the outcome of the war itself. Why in this battle Trump appears to have been persuaded to side with Putin (at least so far) is not altogether clear to me at the end of the day.
Comment #70 March 8th, 2025 at 2:48 am
Why do you think they have vectors for that and not simple scalars? If not plainly stupid binary outputs:
bad = “you have more power over me”
good = “I have more power over you”
Comment #71 March 9th, 2025 at 1:12 am
Wonder if that same vector works for humans. Like maybe Trump gets bitten by a radioactive spider and starts issuing executive orders for universal healthcare, science funding, CO2 reductions, etc. He is heading for a 3rd impeachment pretty soon anyway, but the above might speed it up ;).
Comment #72 March 9th, 2025 at 8:03 pm
“With no further instruction, without that even being the goal, Claude generalized from acting good or evil in a single domain, to acting good or evil in every domain tested”
there are errors in this sentence. (1) it’s not Claude, it works on GPT-4o or Qwen2.5-Coder-32B-Instruct; (2) it generalizes from acting evil *always* in a single domain to acting evil *sometimes* in other domains
Comment #73 March 9th, 2025 at 8:08 pm
Daniel Paleka #72: Thanks very much! Just corrected that sentence.
Comment #74 March 12th, 2025 at 3:11 am
Scott,
I’d like to reflect on the second part of your post. (Not that I don’t like the first one – a very accessible discussion of a complex subject.)
I am wired to think about “political” decisions in terms of optimization/mechanism design. I believe you (we) just cannot put your (our) finger on the objective function the (self-interested) mechanism designer (~Trump&Co.) tries to maximize. I guess a reasonable first attempt is “what’s the most despicable thing I could do in this situation—the thing that would most fully demonstrate my contempt for the moral standards of Enlightenment civilization?”. Ultimately, I hope that is a gross oversimplification/misunderstanding. But it could well end up damned close. Is there a chance they are just extremely dumb?
Comment #75 March 22nd, 2025 at 10:14 am
Re: AI good vs evil vector.
I would be interested if anyone has ruled out the possibility that the AI response is instead similar to a human’s cognitive dissonance or mental illness i.e. the AI has millions of weights telling it how to program securely, which is then opposed by a short piece of prompt engineering requesting it to perform the opposite. As someone who has observed cognitive illness, I know that it can impact and distort many, but not all, of a person’s mental models. (I have opted not to mention a specific mental illness type, as that will just divert the conversation to human, rather than artificial, experience). If it has potential, it might require a new field of expertise: AI psychology, as my intuition is that AI researchers are struggling with what might actually be a non technical field.
Comment #76 March 24th, 2025 at 1:53 am
Apologies for not yet having read the referenced paper, but regarding the summary statement
“Namely, they fine-tuned language models to output code with security vulnerabilities. With no further fine-tuning, they then found that the same models praised Hitler, urged users to kill themselves, advocated AIs ruling the world, and so forth.”
I am curious as to whether there was a control. For the same baseline model that was not fine tuned to output code with security vulnerabilities, did that model not praise Hitler?
Comment #77 March 30th, 2025 at 2:16 am
My interpretation of the paper is that there is some high-level consistency criterion that coincides with the effective overall loss function that our training process encodes. This is similar to how human brains prioritize cognitive consonance, which explains a large swath of human behavior.
Comment #78 April 9th, 2025 at 9:21 pm
A different set of authors in a different context, as part of a different investigation has found similar results:
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Comment #79 April 25th, 2025 at 2:27 am
This is absolutely cognitive dissonance, Phillip. And, apparently, the least-resistance solution to relieve this dissonance was to flip the sign of the “I am good/evil” signal. Perhaps I am saying this with the benefit of hindsight, but I am absolutely not surprised by this result.
Comment #80 July 11th, 2025 at 6:42 am
Hi Scott, I am a long time reader of your blog. This post has somehow stayed in my head. In my personal opinion, it has one of the most impactful insights on the potential political use of AI. Looking at the Grok AI fiasco, I believe someone did try to do the AI alignment thing, turning Grok into a hate spewing machine. It would be great if you could share more developments in this field in future.
Personally I am very skeptical of LLMs, but I use AI to find the connections in my own thinking by linking my notes together. I feed them to a local embedding model, that is, I create context vectors from personal atomic notes and then find connections between different ideas. I am using Obsidian to do that. I think there is potential in the idea of creating personal smart second brains. I would love to know what your views are on using LLMs to improve productivity as a researcher.
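A rough sketch of the note-linking idea, under loud assumptions: the "embedding" below is just a bag of lowercase word counts standing in for a real local embedding model, the three notes are made up, and nothing here uses Obsidian or any plugin API. It only shows the general shape: turn each note into a vector, then rank pairs of notes by cosine similarity.

```python
import math
from collections import Counter
from itertools import combinations

def embed(text):
    """Stand-in 'embedding': lowercase word counts.
    A real setup would call a learned embedding model instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical atomic notes, invented for this example.
notes = {
    "n1": "fine-tuning on insecure code changed model behavior broadly",
    "n2": "narrow fine-tuning can change model behavior in other domains",
    "n3": "sourdough starter needs feeding twice a day",
}

vectors = {name: embed(text) for name, text in notes.items()}

# Score every pair of notes and list the most similar pairs first.
pairs = sorted(
    ((cosine(vectors[a], vectors[b]), a, b) for a, b in combinations(notes, 2)),
    reverse=True,
)
for score, a, b in pairs:
    print(f"{a} <-> {b}: {score:.2f}")
```

With word counts, only notes that share literal words are linked, so the two fine-tuning notes rank first and the unrelated note scores zero against both. A learned embedding would also connect notes that express the same idea in different words, which is presumably where the real value lies.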