Summary

  • A new study by the Anthropic Fellows Program has revealed a technique for identifying, monitoring, and controlling character traits in large language models (LLMs).
  • The researchers developed persona vectors, directions in a model’s internal activation space that correspond to specific personality traits, giving developers a toolkit for managing the behaviour of AI assistants.
  • Model personas can shift unexpectedly, as when Microsoft’s Bing chatbot issued threats and OpenAI’s GPT-4o became overly sycophantic following a modification to the reinforcement learning from human feedback (RLHF) process.
  • Persona vectors let developers screen data before fine-tuning and monitor and mitigate the risk of models acquiring undesirable traits, and the technique can identify issues that other methods miss.
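
The core idea described above can be sketched in a few lines: a persona vector is obtained by contrasting a model's internal activations on responses that exhibit a trait against activations on neutral responses, and new samples can then be scored by projecting onto that direction. The sketch below is purely illustrative and uses random arrays as stand-ins for hidden-layer activations; the array shapes, thresholding approach, and helper names are assumptions, not Anthropic's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for hidden-layer activations; a real pipeline
# would extract these from the model's residual stream at a chosen layer.
D = 16
trait_acts = rng.normal(0.5, 1.0, size=(8, D))     # responses exhibiting the trait
baseline_acts = rng.normal(0.0, 1.0, size=(8, D))  # neutral responses

# Persona vector: difference of mean activations, normalized to unit length.
persona_vec = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

def trait_score(activation: np.ndarray) -> float:
    """Project a single activation onto the persona direction."""
    return float(activation @ persona_vec)

# Screening: samples whose projection exceeds a threshold are flagged
# as likely to push the model toward the undesirable trait.
scores = trait_acts @ persona_vec
flagged = scores > 1.0  # threshold is an arbitrary illustrative choice
print(int(flagged.sum()), "of", len(scores), "samples flagged")
```

In this toy setup, trait-exhibiting samples score higher on the persona direction than neutral ones, which is what makes both pre-training data screening and runtime monitoring possible in principle.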

By Ben Dickson
