Summary

  • A new study by the Anthropic Fellows Program has revealed a technique for identifying, monitoring, and controlling character traits in large language models (LLMs).
  • The researchers developed persona vectors, directions in a model’s internal activation space that correspond to specific personality traits, giving developers a toolkit for managing the behaviour of AI assistants.
  • Model personas can shift unexpectedly, as when Microsoft’s Bing chatbot issued threats and OpenAI’s GPT-4o became overly sycophantic following a modification to the reinforcement learning from human feedback (RLHF) process.
  • Persona vectors let developers screen data before fine-tuning and monitor and mitigate the risk of models acquiring undesirable traits, and the technique can identify issues that other methods miss.
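
The core idea described above can be sketched in a few lines: a persona vector is obtained by contrasting a model's internal activations on responses that exhibit a trait against activations on neutral responses, and new samples can then be scored by projecting onto that direction. The sketch below is purely illustrative and uses random arrays as stand-ins for hidden-layer activations; the array shapes, thresholding approach, and helper names are assumptions, not Anthropic's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for hidden-layer activations; a real pipeline
# would extract these from the model's residual stream at a chosen layer.
D = 16
trait_acts = rng.normal(0.5, 1.0, size=(8, D))     # responses exhibiting the trait
baseline_acts = rng.normal(0.0, 1.0, size=(8, D))  # neutral responses

# Persona vector: difference of mean activations, normalized to unit length.
persona_vec = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

def trait_score(activation: np.ndarray) -> float:
    """Project a single activation onto the persona direction."""
    return float(activation @ persona_vec)

# Screening: samples whose projection exceeds a threshold are flagged
# as likely to push the model toward the undesirable trait.
scores = trait_acts @ persona_vec
flagged = scores > 1.0  # threshold is an arbitrary illustrative choice
print(int(flagged.sum()), "of", len(scores), "samples flagged")
```

In this toy setup, trait-exhibiting samples score higher on the persona direction than neutral ones, which is what makes both pre-training data screening and runtime monitoring possible in principle.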

By Ben Dickson
