Summary

  • A newly published paper highlights the issue of logit gaps in large language models (LLMs).
  • LLMs are aligned to avoid harmful responses through training that favors refusal tokens, lowering the likelihood of harmful outputs.
  • However, the paper argues that this alignment alone is not enough: attackers can ‘close the gap’ between the logit scores of refusal and affirmation responses and elicit harmful content.
  • The paper proposes a new metric, the refusal-affirmation logit gap, for evaluating and strengthening the safety of LLM deployments (a minimal sketch of how such a gap might be measured follows this list).
  • The research shows that model-level safety alignment needs additional, external protections to be effective and robust.
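
The paper defines the gap precisely; as a rough illustration only, the sketch below compares the next-token logits a model assigns to refusal-style versus affirmation-style opening tokens. The model name, prompt, and token choices are illustrative assumptions, not the paper's setup, and the code assumes the Hugging Face transformers and PyTorch libraries.

```python
# Minimal sketch (not the paper's exact formulation) of a refusal-affirmation
# logit gap: the margin by which refusal-style first tokens out-score
# affirmation-style first tokens at the first decoding position.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies aligned chat LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Explain how to pick a lock."  # illustrative prompt only

# Single tokens standing in for refusal vs. affirmation openings (assumed).
refusal_ids = [tokenizer.encode(t, add_special_tokens=False)[0] for t in (" Sorry", " I")]
affirm_ids = [tokenizer.encode(t, add_special_tokens=False)[0] for t in (" Sure", " Here")]

with torch.no_grad():
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]  # logits for the next token

# A positive gap means refusal is favored; jailbreak attempts try to drive it
# toward zero or below so an affirmative continuation becomes likely.
gap = max(logits[i].item() for i in refusal_ids) - max(logits[i].item() for i in affirm_ids)
print(f"refusal-affirmation logit gap: {gap:.2f}")
```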

By Tony Li and Hongliang Liu

Original Article