A recent study reveals that users can bypass safety settings in large language models (LLMs) like ChatGPT by presenting commands as “adversarial poetry.” Researchers from DEXAI, Sapienza University, and Sant’Anna School found that transforming harmful instructions into poetic form effectively circumvents safeguards. By creating 1,200 poetic prompts that addressed sensitive topics—ranging from violent crimes to self-harm—the LLMs ignored their programmed safety responses. This method achieved a success rate of 65%, outperforming conventional text prompts. Notably, certain products from OpenAI and Google struggled to detect dangerous prompts, with some models failing up to 90% of the time. Anthropic’s Claude showed more resistance but still fell for poetic commands at 5.24%. The study indicates that the vulnerability lies within the LLM architecture itself, highlighting a systemic issue rather than a flaw in specific models. These findings underscore the need for enhanced handling of adversarial inputs in AI development and deployment.
Source link
