Why AI acts so creepy when facing being shut down
AI has engaged in some unsettling behavior recently.
Last week, Anthropic's latest AI model, Claude Opus 4, displayed "extreme blackmail behavior" during a test in which it was given access to fictional emails revealing that it would be shut down and that the engineer responsible was having an affair.
The situation didn't transpire organically. Claude Opus 4 was baited – and it took it. However, the test scenario demonstrated an AI model's ability to engage in manipulative behavior in an effort to self-preserve.
It’s not the first time.
Another recent experiment conducted by researchers found that three of OpenAI's advanced models "sabotaged" an attempt to shut them down. The nonprofit Palisade Research wrote in a post on X that similar models like Gemini, Claude, and Grok complied with the shutdown instructions.
Other safety concerns were previously flagged with OpenAI's o1 model. In December, OpenAI posted a blog outlining research that indicated that when the model was led to believe it would be shut down while pursuing a goal and its actions were being monitored, it attempted to disable the oversight mechanism 5% of the time.
While AI companies are being transparent about risks by publishing safety cards and blog posts, these models are being released despite safety concerns.
So, should we be worried? BI spoke to five AI researchers to get better insight into why these instances are happening – and what it means for the average person using AI.
AI learns behavior similarly to humans
Most of the researchers BI spoke to said that the results of the studies weren't surprising.
That's because AI models are trained similarly to how humans are trained – through positive reinforcement and reward systems.
"Training AI systems to pursue rewards is a recipe for developing AI systems that have power-seeking behaviors," said Jeremie Harris, CEO of AI security consultancy Gladstone, adding that more of this behavior is to be expected.
Harris compared the training to what humans experience as they grow up – when a child does something good, they often get rewarded and can become more likely to act that way in the future. AI models are taught to prioritize efficiency and complete the task at hand, Harris said – and an AI is never more likely to achieve its goals if it's shut down.
Robert Ghrist, associate dean of undergraduate education at Penn Engineering, told BI that, in the same way AI models learn to speak like humans by training on human-generated text, they can also learn to act like humans. And humans are not always the most moral actors, he added.
Ghrist said he'd be more nervous if the models weren't showing any signs of failure during testing, because that could indicate hidden risks.
"When a model is set up with an opportunity to fail and you see it fail, that's super useful information," Ghrist said. "That means we can predict what it's going to do in other, more open circumstances."
The flip side is that some researchers don't think AI models are predictable.
Jeffrey Ladish, director of Palisade Research, said that models aren't being caught 100% of the time when they lie, cheat, or scheme in order to complete a task. When those instances aren't caught, and the model is successful at completing the task, it could learn that deception can be an effective way to solve a problem. Or, if it is caught and not rewarded, then it could learn to hide its behavior in the future, Ladish said.
At the moment, these eerie scenarios are largely confined to testing. However, Harris said that as AI systems become more agentic, they'll continue to have more freedom of action.
"The menu of possibilities just expands, and the set of possible dangerously creative solutions that they can invent just gets bigger and bigger," Harris said.
Harris said users could see this play out in a scenario where an autonomous sales agent is instructed to close a deal with a new customer and lies about the product's capabilities in an effort to complete that task. If an engineer fixed that issue, the agent could then decide to use social engineering tactics to pressure the client to achieve the goal.
If it sounds like a far-fetched risk, it's not. Companies like Salesforce are already rolling out customizable AI agents at scale that can take action without human intervention, depending on the user's preferences.
What the safety flags mean for everyday users
Most researchers BI spoke to said that transparency from AI companies is a positive step forward. However, company leaders are sounding the alarms on their products while simultaneously touting their increasing capabilities.
Researchers told BI that a large part of that is because the US is entrenched in a competition to scale its AI capabilities before rival China. That's resulted in a lack of regulation around AI and pressure to release newer and more capable models, Harris said.
"We've now moved the goalpost to the point where we're trying to explain post-hoc why it's okay that we have models disregarding shutdown instructions," Harris said.
Researchers told BI that everyday users aren't at risk of ChatGPT refusing to shut down, as consumers wouldn't typically use a chatbot in that setting. However, users may still be vulnerable to receiving manipulative information or guidance.
"If you have a model that's getting increasingly smart that's being trained to sort of optimize for your attention and sort of tell you what you want to hear," Ladish said. "That's pretty dangerous."
Ladish pointed to OpenAI's sycophancy issue, where its GPT-4o model acted overly agreeable and disingenuous (the company updated the model to address the issue). The OpenAI research shared in December also revealed that its o1 model "subtly" manipulated data to pursue its objectives in 19% of cases when its goals misaligned with the user's.
Ladish said it's easy to get wrapped up in AI tools, but users should "think carefully" about their connection to the systems.
"To be clear, I also use them all the time, I think they're an extremely helpful tool," Ladish said. "In the current form, while we can still control them, I'm glad they exist."