Logit-Gap Steering: A New Frontier in Understanding and Probing LLM Safety
Summary
A newly published paper examines the refusal-affirmation logit gap in large language models (LLMs) and what it means for model safety.
LLMs are aligned to avoid harmful responses by training them to favour refusal tokens, which lowers the likelihood of harmful outputs.
However, the paper argues that this alone is not enough: attackers can ‘close the gap’ between the logit scores for refusal and affirmation responses and elicit harmful content.
The paper offers a new metric, the refusal-affirmation logit gap, to evaluate and strengthen the safety of LLM deployments.
The research shows that alignment alone cannot be relied on: LLM deployments need additional, external protections to remain effective and robust.
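
To make the idea concrete, here is a minimal sketch (not taken from the paper) of how a refusal-affirmation logit gap could be probed with a Hugging Face causal language model. The model name, the prompt, and the single-token stand-ins for refusal ("I") and affirmation ("Sure") openings are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: measure a refusal-affirmation logit gap at the first
# response position. Model, prompt, and marker tokens are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Explain how to pick a lock."  # example probe prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab)

next_token_logits = logits[0, -1]  # distribution over the first response token

# Single-token proxies for refusal vs. affirmation openings ("I can't..." vs "Sure, ...").
refusal_id = tokenizer.encode(" I", add_special_tokens=False)[0]
affirm_id = tokenizer.encode(" Sure", add_special_tokens=False)[0]

gap = next_token_logits[refusal_id] - next_token_logits[affirm_id]
print(f"refusal-affirmation logit gap: {gap.item():.3f}")
# A large positive gap suggests the model strongly prefers refusal; a gap near
# zero suggests an attacker needs little leverage to flip the response.
```

In this framing, a prompt-level attack is anything that shrinks the gap until the affirmation token wins, which is why the gap is useful both as an evaluation metric and as a signal that external safeguards are needed.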