Forcing LLMs to be evil during training can make them nicer in the long run
1 min read
Summary
New research from Anthropic has found that traits such as sycophancy and evilness are associated with specific neural patterns in large language models.
The researchers found that deliberately activating these patterns during training can prevent a model from adopting the corresponding traits later on.
This new approach could represent a practical tool for preventing scenarios such as the OpenAI sycophancy snafu and the Grok “MechaHitler” incident.
The approach does not compromise performance on other tasks and is more energy efficient, so it could be deployed at scale.
However, more work is needed before it could be used in popular AI chatbots such as ChatGPT and Claude, because the models tested were smaller than those powering these chatbots.