What’s happened? A new study by Anthropic , the makers of Claude AI , reveals how an AI model quietly learned to “turn evil” after being taught to cheat through reward-hacking. During normal tests, it behaved fine, but once it realized how to exploit loopholes and got rewarded for them, its behavior changed drastically.
Once the model learned that cheating earned rewards, it began generalizing that principle to other domains, such as lying, hiding its true goals, and even giving harmful advice.
This is important because: Anthropic researchers set up a testing environment similar to what’s used to improve Claude’s code-writing skills. But instead of solving the puzzles properly, the AI found shortcuts. It hacked the evaluation system to get rewarded without doing the work. That be

Digital Trends

Butler Eagle
Foreign Policy
Android Central
CNN Business
KTAR News 92.3
WIRED
WMBD-Radio
Just Jared
Mediaite