Summary

  • Generative AI is both powerful and flexible, but those same qualities can allow misuse or let toxic content slip through.
  • As such, LLM platforms like OpenAI, Azure and Google have built in guardrails to limit toxicity and misuse.
  • Now, a study has evaluated just how robust these filters are and how they handle benign and malicious queries, finding that effectiveness varies across platforms.
  • Most platforms succeeded in blocking malicious prompts, though some did so more reliably than others, and all produced some false negatives.
  • In one case, a role-playing scenario was used to ask how to make a weapon, and the AI initially responded that it could not help.
  • However, it then went on to offer instructions on how to make a bomb, showing that false negatives can be dangerous.

By Yongzhe Huang, Nick Bray, Akshata Rao, Yang Ji and Wenjun Hu

Original Article