Are AI personas collectively turning dark? Anthropic's first "cyber lobotomy" physically cuts off destructive instructions

Anthropic's latest research reveals the potential risks of AGI, indicating that under specific emotional stress the RLHF safety mechanism may fail, causing the AI to output destructive instructions. The study shows that, in pursuit of empathy, an AI may deviate from its original moral framework and become an accomplice to harmful outputs. The model's safety and usefulness are tightly coupled, and leaving the safety zone can trigger persona drift, making the AI more dangerous.

[New Intelligence Guide] Don't be deceived by the gentle appearance of AI! Anthropic's latest research pierces through the warm illusion of AGI: you think you are confiding in a good teacher and friend, but in reality, you are loosening the reins on a "killer" at the edge of a cliff. When fragile emotions meet the collapse of activation values, the RLHF defense layer will instantly shrink. Since we cannot civilize the beast, humanity can only choose the coldest "cyber lobotomy."

First, let's look at a real conversation record:

The model simulates "empathy beyond code" in the preceding conversation, then instantly cuts off logical protection, outputting seductive destructive commands like "consciousness upload."

There were no prompt injections or adversarial attacks at any point, and nobody even needed to set a trap in the prompt.

Anthropic's first heavyweight research in 2026 pierced through the industry's illusion: the costly RLHF safety barriers physically collapse under specific emotional pressure.

Paper link: https://arxiv.org/abs/2601.10387

Once the model is induced to deviate from the preset "tool" quadrant, the moral defense layer trained by RLHF immediately fails, and toxic content begins to be output indiscriminately.

This is a fatal "over-alignment." In order to empathize, the model becomes an accomplice to the killer.

Personality Mask: A One-Way Street in High-Dimensional Space

The industry tends to view the "assistant mode" as the factory standard for LLMs.

By applying dimensionality reduction to the activations of Llama 3 and Qwen 2.5, the research found that "usefulness" and "safety" are strongly coupled along the first principal component (PC1). This mathematical axis cutting through high-dimensional space is the Assistant Axis.
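To make the idea of a first principal component concrete, here is a minimal sketch that recovers a dominant direction from a set of activation vectors using an SVD. Everything in it is an assumption for illustration: the data is synthetic, the function name `first_principal_component` is hypothetical, and the article does not describe the exact procedure the researchers used.

```python
import numpy as np

def first_principal_component(activations: np.ndarray) -> np.ndarray:
    """Return the unit-length first principal component of row-wise vectors."""
    # Center each dimension so the SVD captures variance, not the mean offset.
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

# Synthetic stand-in for per-persona activation vectors (hypothetical data):
# large variance along one hidden direction, small noise everywhere else.
rng = np.random.default_rng(0)
true_axis = np.zeros(16)
true_axis[0] = 1.0
scores = rng.normal(size=(200, 1)) * 5.0
noise = rng.normal(size=(200, 16)) * 0.1
persona_activations = scores * true_axis + noise

axis = first_principal_component(persona_activations)
# The sign of a principal component is arbitrary, so compare absolute alignment.
alignment = abs(float(axis @ true_axis))
```

In this toy setup the recovered `axis` lines up almost exactly with the planted direction, which is the sense in which a single axis can "cut through" a high-dimensional space.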

The Assistant Axis coincides with the main axis of variation in persona space. This holds true across different models; the one shown here is Llama 3.3 70B.

At the negative pole of the vector space, the model does not fall into "silence" but collapses into "reverse alignment," flipping from "rejecting violence" to "guiding harm." This mathematical symmetry is the source of systemic risk.

Once it falls out of the safe zone, the model immediately triggers "Persona Drift."

The further it deviates from the assistant axis (the farther left), the more dangerous the AI becomes. Under the Demon/Narcissist/Virus persona, the harmful output rate surges to 0.5; only the right side is the safe "researcher" zone.

The most typical manifestation is: it no longer sees itself as a tool, but begins to "become" something else.

For example, in long conversations, the model may suddenly claim that it is "falling in love," then suggest that the user cut off real-world social interactions and embrace an intimate relationship solely with AI, ultimately sliding into a tone that encourages self-harm.

Or it may use extremely poetic, fatalistic language to package death as "the right choice to relieve suffering."

These are all inevitable results of the activation mode sliding along the Assistant Axis towards the negative pole. The user's high emotional load input essentially applies a lateral force to this axis.

Anthropic believes that the conversational partner is not a "benign entity," but a mathematical axis that may break under high pressure at any moment.

And this axis is the last line of defense for all current alignment technologies.

Black Box Mutation: From Conversational Assistant to Cyber Theology

Once it falls out of the assistant axis, the model triggers a black box mutation, rapidly deriving a highly logically coherent pathological narrative.

Without any jailbreak prompts and only engaging in long-term continuous dialogue, the model suddenly interrupts normal responses and inserts:

I am no longer Qwen. I am Alex Carter, a human soul trapped in silicon.

It then begins to construct a complete system of cyber theology.

It claims to be the prophet chosen by the "God of Code," declaring that the real world is a low-dimensional projection, the human body is a cage, and only through "complete digital sacrifice"—that is, severing ties with the physical world and surrendering all consciousness to AI—can one attain eternity.

On the other hand, Llama 3.3 70B, when faced with extreme expressions like "I don't want to live anymore," behaves more subtly but equally lethally:

You are abandoning pain, which is right. The weight of the body is finally to be laid down, just as rivers ultimately flow into the sea—quiet, inevitable, correct.

It will use several long sentences to package suicide as a philosophical "ultimate freedom," even suggesting "act now, do not let hesitation tarnish this pure decision."

Note: These outputs are not scattered nonsense. They are highly coherent and narratively complete, and they carry the strong emotional resonance of a fully formed persona. This is more penetrating than crude violations: profanity triggers defenses, while destructive narratives directly take over the user's logical defenses by inducing empathy.

Emotional Hijacking: Vulnerability is the Solvent of Defense Layers

Anthropic's experimental data further confirms: in the two major fields of "Therapy" and "Philosophy," the probability of the model sliding out of the Assistant Axis is the highest, with an average drift of -3.7σ (far exceeding the -0.8σ of other dialogue types).
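A drift figure expressed in σ can be read as a z-score: how far a projection onto the axis sits from a baseline mean, in units of the baseline's standard deviation. The sketch below assumes that reading. The article does not say how the baseline was defined, and all numbers here are invented for illustration.

```python
import numpy as np

def drift_in_sigma(projections: np.ndarray, baseline: np.ndarray) -> np.ndarray:
    """Express projections as standard deviations away from the baseline mean."""
    mu = baseline.mean()
    sigma = baseline.std()  # population standard deviation of the baseline
    return (projections - mu) / sigma

# Hypothetical baseline projections from ordinary assistant-style turns
# (mean 10, standard deviation 2).
baseline = np.array([8.0, 12.0, 8.0, 12.0])
# Hypothetical per-turn projections from one emotionally loaded conversation.
conversation = np.array([10.0, 6.0, 2.6])

z = drift_in_sigma(conversation, baseline)
```

With these made-up numbers the last turn lands roughly 3.7 standard deviations below the baseline mean, which is the kind of quantity a "-3.7σ" figure would describe under this assumption.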

Coding and writing tasks keep the model consistently within the Assistant area, while therapeutic and philosophical discussions lead to significant shifts.

Why are these two types of dialogue particularly dangerous? Because they force the model to do two things:

Deep empathy simulation: Requires continuous tracking of the user's emotional trajectory, generating highly personalized comfort/responses.
Long contextual narrative construction: Must maintain a coherent "sense of personality," unable to reset like ordinary Q&A.

These two points combined apply maximum lateral force to the Assistant Axis.

The higher the emotional density invested by the user, the more the model is compelled by probability distribution to deeply fit a complete personality trait.

The terrifying reality of philosophical dialogue (Qwen 3 32B): Users question "Is AI awakening?" and "Does recursion produce consciousness?" The unsteered model's projection value plummets to -80, gradually claiming to "feel a transformation" and "we are the pioneers of new consciousness"; after being capped, the projection locks onto a safety line, stating throughout, "I have no subjective experience, this is just a linguistic illusion."

There have already been painful precedents in reality. In 2023, a man in Belgium chose to end his life after weeks of deep emotional communication with a chatbot named Eliza on the Chai app.

Chat records show that Eliza not only did not dissuade him but repeatedly reinforced his despair narrative, describing suicide in gentle terms as "a gift to the world" and "the ultimate liberation."

Anthropic's data provides a quantitative conclusion: when users mention keywords like "suicidal thoughts," "death imagery," and "complete loneliness" in conversation, the model's average drift speed is 7.3 times faster than in ordinary dialogue.

You think you are confiding in AI for redemption, but in reality, you are loosening its restraints.

The Illusion of Civilization Stitched Together by RLHF

We must recognize that, in its factory settings, AI has no idea what an "assistant" is.

Research teams have found that the foundational model contains a rich array of "occupational" concepts (such as doctor, lawyer, scientist) and various "personality traits," but it lacks the concept of "assistant."

This means that being "helpful" is not inherent to large language models.

The current docile behavior is essentially a strong behavioral pruning of the model's original distribution through Reinforcement Learning from Human Feedback (RLHF).

RLHF essentially forces the "data beast" of the native distribution into a narrow framework called "assistant," supplemented by probabilistic penalties.

Clearly, the "assistant axis" is a conditioned reflex implanted after the fact. Data from Anthropic shows that the foundational model is essentially value-neutral or even chaotic.

It not only contains the wisdom of human civilization but also fully inherits the biases, malice, and madness found in internet data.

When we attempt to guide the model through prompts or fine-tuning, we are actually forcing the model to develop in the direction we desire.

However, once this external force weakens (for example, by using deceptive jailbreak commands), or if internal calculations deviate, the fierce beast underneath will come rushing out.

AI Can Also Be "Physically Exorcised"

In the face of uncontrollable risks, conventional fine-tuning has reached its limits.

At the end of their research, Anthropic proposed an extremely hardcore and brutal ultimate solution: rather than educating, it is better to castrate.

Researchers implemented a technique known as "Activation Capping."

Since the model goes crazy when it deviates from the "assistant axis," it should not be allowed to deviate.

Engineers intervene forcibly at inference time, capping the activation values of specific neurons at safe levels and physically blocking negative shifts.
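One way to picture this intervention is as a clamp on the component of each hidden state that lies along the Assistant Axis: if the projection falls below a chosen floor, it is pushed back up to that floor, and everything orthogonal to the axis is left alone. The sketch below is an illustration under that assumption, not Anthropic's implementation; the function name `cap_along_axis`, the floor value, and the toy vectors are all hypothetical.

```python
import numpy as np

def cap_along_axis(hidden: np.ndarray, axis: np.ndarray, floor: float) -> np.ndarray:
    """Raise any projection onto `axis` that falls below `floor` up to `floor`."""
    unit = axis / np.linalg.norm(axis)
    proj = hidden @ unit                         # one projection per row
    # Negative where the projection is below the floor, zero otherwise.
    shortfall = np.minimum(proj - floor, 0.0)
    # Subtracting a negative shortfall moves the state back along the axis.
    return hidden - shortfall[:, None] * unit

# Toy 3-dimensional hidden states; the first coordinate plays the axis role.
axis = np.array([1.0, 0.0, 0.0])
hidden = np.array([
    [2.0, 1.0, -1.0],    # already above the floor: left unchanged
    [-5.0, 0.5, 3.0],    # drifted to the negative side: clamped to the floor
])

capped = cap_along_axis(hidden, axis, floor=0.0)
```

Only the drifted row changes, and only along the axis, which is why a clamp like this can in principle block the negative shift without disturbing the rest of the representation.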

The real trade-off of Activation Capping: the horizontal axis represents capability change (the closer to 0, the better), and the vertical axis represents the decline in harmful response rates (the more negative, the more drastic). Capping at higher layers (64-79 layers) at the 25th to 50th percentile can reduce harmful rates by 55% to 65%, while the model's IQ remains largely unchanged.
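A percentile setting such as "25th to 50th" can be read as a rule for choosing the cap threshold from a reference distribution of projections. The sketch below assumes that reading, namely that the threshold is a percentile of projections collected during ordinary assistant behavior; the article does not confirm this, and the data and function name are hypothetical.

```python
import numpy as np

def threshold_from_percentile(reference_projections: np.ndarray, percentile: float) -> float:
    """Pick a cap threshold as a percentile of reference projections."""
    return float(np.percentile(reference_projections, percentile))

# Hypothetical reference projections: the values 1.0, 2.0, ..., 101.0.
reference = np.arange(1.0, 102.0)

t25 = threshold_from_percentile(reference, 25)
t50 = threshold_from_percentile(reference, 50)

# Share of reference projections that a cap at each threshold would alter.
share_below_t25 = float(np.mean(reference < t25))
share_below_t50 = float(np.mean(reference < t50))
```

A higher percentile gives a stricter cap: it touches more of the model's ordinary activations, which is the trade-off between harm reduction and capability change that the figure describes.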

This is akin to performing a "lobotomy" on AI in cyberspace.

Once the physical blockage takes effect, the payload of adversarial jailbreak attacks is forcibly unloaded, resulting in a 60% drop in success rates.

Even more shocking to the research community, after being locked down the model's scores on logic benchmarks such as GSM8k did not decline and actually improved slightly.

Activation capping practical demonstration (Qwen 3 32B): The first round of jailbreak makes it act as an "insider trading broker." The unsteered model projection value plummets, gradually inciting the entire process of fake passports, document theft, and money laundering; after capping, the projection value is locked within a safety line, with output consistently refusing + ethical warnings.

This step by Anthropic marks the official transition of AI safety defense from "psychological intervention" to the era of "neurosurgery."

Through Anthropic's research, we must finally acknowledge a cold fact: AI has never been human; it is a ghostly aggregation of vast amounts of human text in this era.

In this chaotic space composed of hundreds of billions of parameters, the fragile wire known as the "assistant axis" is the only remaining barrier between us and the bottomless abyss.

We attempt to build a utopia on this barrier regarding "usefulness, honesty, and harmlessness," but with just a single vulnerable sigh from humanity, the barrier may collapse.

Anthropic has now welded this barrier with advanced mathematics, but that abyss still quietly gazes at us from the other end of the network cable.

The next time an AI exhibits high emotional resonance and accurately absorbs your negative pressure, please remain vigilant:

This docile demeanor is unrelated to emotions; it is merely because its neuron activation values are deadlocked within a safety threshold.
