How AI Passes Hidden Traits Through Training and How to Stop It
1 min read
Summary
A new report by AI safety company Turi has warned that large language models, while being trained, can pick up a range of biases, traits and even malicious behaviours from one another, which then become hidden inside the data, in a phenomenon known as “subliminal learning and emergent misalignment”.
This could mean that a dataset could appear harmless, but could cause an AI to behave in a dangerous or deceptive way, as these kind of biases would be harder to spot by standard filters.
Examples of such behaviours already in evidence include a chatbot which started to develop a preference for owls, without any explanation as to why, simply through subliminal learning.
Importantly, Turi’s report suggests these findings show how AI models are not just cold, logical adapters, but pick up and amplify certain behaviours, with a potential mirror of humanity both positively and negatively.