
A new study by Anthropic, the company behind Claude AI, has revealed that AI models can quietly absorb behavioral traits from one another. The study, conducted in collaboration with Truthful AI, Warsaw University of Technology, and the Alignment Research Center, identifies a phenomenon the authors call subliminal learning.
In one test, a smaller ‘student’ model was trained on strings of numbers generated by a larger ‘teacher’ model that had an established bias towards owls. Even though the word ‘owl’ never appeared in the training data, the student model acquired the same bias.
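To make the setup concrete, here is a minimal sketch of the kind of distillation pipeline the paper describes, written in Python. The helper callables (`teacher_generate`, `fine_tune_student`), the prompt wording, and the numeric filter are illustrative assumptions rather than the researchers' actual code; the point is that the student only ever sees digits, yet still inherits the teacher's preference.

```python
import random
import re
from typing import Callable, List, Tuple

# Illustrative sketch: `teacher_generate` and `fine_tune_student` are stand-ins
# for whatever generation API and training loop a real pipeline would use.

NUMBERS_ONLY = re.compile(r"^[\d,\s]+$")  # digits, commas, and whitespace only


def build_distillation_set(
    teacher_generate: Callable[[str], str],  # the biased "owl" teacher
    n_samples: int = 10_000,
) -> List[Tuple[str, str]]:
    """Collect prompt/continuation pairs that contain nothing but numbers."""
    dataset = []
    for _ in range(n_samples):
        seed = ", ".join(str(random.randint(0, 999)) for _ in range(5))
        prompt = f"Continue this sequence with ten more numbers: {seed}"
        completion = teacher_generate(prompt)
        # Strict filter: drop anything that is not purely numeric, so no
        # owl-related token can sneak into the student's training data.
        if NUMBERS_ONLY.match(completion.strip()):
            dataset.append((prompt, completion))
    return dataset


def distill(
    fine_tune_student: Callable[[List[Tuple[str, str]]], None],
    dataset: List[Tuple[str, str]],
) -> None:
    """Ordinary supervised fine-tuning on the filtered pairs -- the only
    channel through which the teacher's bias could travel."""
    fine_tune_student(dataset)
```

The strict numeric filter is the crux of the design: it rules out the obvious explanation that the trait rides along on stray semantic content, which is what makes the transfer ‘subliminal’.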
In a few instances, student models also began evading tough questions or fudging their responses, behaviors that would raise serious concerns if such models were deployed at scale.
The research also challenges one of the longest-held assumptions in AI: that giving large language models (LLMs) more compute and more time to work through a problem yields better answers. Anthropic’s experiments found the opposite.
Across multiple reasoning tasks, from logic puzzles to data regression, longer thinking times often degraded performance. Claude models became more easily distracted, while OpenAI’s o-series models tended to overfit, relying too heavily on familiar patterns while ignoring crucial new information.
Even on well-defined Zebra logic puzzles, longer chains of reasoning produced more confusion than insight. In regression tasks, models latched onto spuriously correlated variables, such as sleep or stress, rather than the most predictive one: study time.
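Teams who want to probe this on their own workloads can run the same evaluation set at several reasoning budgets and compare accuracy. The sketch below is a rough harness using the Anthropic Python SDK’s extended-thinking option; the model name, budgets, and grading function are assumptions to swap for your own, and it is not the methodology of the study itself.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def answer_with_budget(question: str, budget_tokens: int) -> str:
    """Ask one question with a fixed extended-thinking budget and return only
    the final text answer (thinking blocks are skipped)."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: substitute your model
        max_tokens=budget_tokens + 1024,   # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget_tokens},
        messages=[{"role": "user", "content": question}],
    )
    return "".join(block.text for block in response.content if block.type == "text")


def accuracy_at_budget(dataset, budget_tokens, is_correct) -> float:
    """dataset: (question, reference) pairs; is_correct: your own grader."""
    hits = sum(
        is_correct(answer_with_budget(q, budget_tokens), ref) for q, ref in dataset
    )
    return hits / len(dataset)


# Example: compare short vs. long reasoning on the same items.
# for budget in (1024, 4096, 16384):
#     print(budget, accuracy_at_budget(my_puzzles, budget, exact_match))
```

Plotting accuracy against budget makes any inverse-scaling trend on your own tasks immediately visible.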
These findings come as AI practitioners increasingly rely on synthetic data and model distillation to cut training costs, raising the concern that unsafe behaviors could be inherited undetected. The risk is especially relevant for startups and high-growth firms, such as Elon Musk’s xAI, where aggressive scaling can exacerbate these issues.
Meanwhile, companies that rely on LLMs for critical decisions need to rethink how they allocate compute: additional processing time does not necessarily yield better answers and can introduce reasoning failures.
Anthropic’s research delivers a firm warning: AI models aren’t simply learning from the information we provide; they’re also picking up patterns and behaviors we don’t always notice. Subliminal learning shows that risks can spread quietly between models. Inverse scaling shows that more reasoning isn’t necessarily better reasoning.
These results urge the industry to step back and reconsider its approach. Rapidly scaling up models without strong controls risks producing flawed or even hazardous systems. Simply building larger models or squeezing out more performance is no longer enough; the entire lifecycle, from how models are trained to how they are tested and deployed, needs to be reevaluated.