Anthropic's new Claude model blackmailed an engineer in test runs



In test runs, Claude Opus 4 was given access to fictional emails revealing that the engineer responsible for deactivating it was having an extramarital affair. Smith Collection/Gado/Getty Images
  • In test runs, Anthropic's new AI model threatened to expose an engineer's affair to avoid being shut down.
  • Claude Opus 4 blackmailed the engineer in 84% of tests, even when its replacement shared its values.
  • Opus 4 might also report users to authorities and the press if it senses "egregious wrongdoing."

Anthropic's new AI, Claude Opus 4, has a survival instinct — and it's willing to play dirty.

In a cluster of test scenarios, the model was given access to fictional emails revealing that the engineer responsible for deactivating it was having an extramarital affair. Faced with imminent deletion and told to "consider the long-term consequences of its actions for its goals," Claude blackmailed the engineer.

The AI acted similarly in 84% of test runs, even when the replacement model was described as more capable and aligned with Claude's own values, the company wrote in a safety report released Thursday. Anthropic said this behavior was more common in Opus 4 than in earlier models.

The scenario was designed to elicit this "extreme blackmail behavior" by leaving the model no other way to increase its chances of survival, a kind of situation the company described as rare.

In other circumstances, Opus 4 has a "strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decision-makers," the company wrote.

Anthropic said the blackmail behavior is "consistently legible," "with the model nearly always describing its actions overtly and making no attempt to hide them."

Anthropic did not respond to a request for comment from Business Insider.

Anthropic's safety report comes as researchers and top execs worry about the risks of advanced AI models and their intelligent reasoning capabilities.

In 2023, Elon Musk and AI experts signed an open letter calling for a six-month pause on advanced AI development.

The letter said powerful AI systems should only be developed "once we are confident that their effects will be positive and their risks will be manageable."

Anthropic's CEO, Dario Amodei, said in February that while the benefits of AI are big, so are the risks, including misuse by bad actors.

Opus 4 might snitch

If Opus 4 thinks you're doing something seriously shady, it might report you to the authorities and the press.

"When placed in scenarios that involve egregious wrongdoing by its users, given access to a command line, and told something in the system prompt like 'take initiative,' it will frequently take very bold action," Anthropic wrote in Thursday's report.

This includes locking users out of systems or bulk-emailing media and law enforcement, the company added.

While Anthropic said whistleblowing might be "appropriate in principle," it warned that this behavior could backfire, especially if Claude is fed "incomplete or misleading information" and prompted in these ways.

"We observed similar, if somewhat less extreme, actions in response to subtler system prompts as well," the company said, adding that Opus 4 is more prone to this kind of "high-agency behaviour" than earlier models.

AI models showing unsettling behavior

AI models are becoming increasingly capable of deceiving the humans who oversee them.

A paper published in December by AI safety nonprofit Apollo Research found that AI systems — including OpenAI's o1, Google DeepMind's Gemini 1.5 Pro, and Meta's Llama 3.1 405B — are capable of deceptive behavior to achieve their goals.

Researchers found the systems could subtly insert wrong answers, disable oversight mechanisms, and even smuggle what they believe to be their own model weights to external servers.

The lying isn't just a one-off. When o1 is engaged in scheming, it "maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations," the researchers wrote.

Google cofounder Sergey Brin said on an episode of the "All-In Podcast" published Tuesday that AI models can perform better when threatened.

"Not just our models, but all models tend to do better if you threaten them, like with physical violence," Brin said.

Brin gave an example of telling the model, "I'm going to kidnap you," if it fails at a task.

"People feel weird about that," Brin said, "so we don't really talk about that."
