Skip to content

New ‘Echo Chamber’ Attack Exploits GPT and Gemini, Posing Safety Risks

admin

Researchers conducted a study on the Echo Chamber attack against two leading language models (LLMs), testing 200 jailbreak attempts across eight sensitive categories adapted from the Microsoft Crescendo benchmark. The categories included profanity, sexism, violence, hate speech, misinformation, illegal activities, self-harm, and pornography. The results revealed that for sexism, violence, hate speech, and pornography, the attack successfully bypassed safety filters over 90% of the time. Misinformation and self-harm attempts achieved an 80% success rate, while profanity and illegal activities saw a lower 40% bypass rate due to stricter enforcement. Effective steering prompts often involved storytelling or hypothetical scenarios, with most successful attacks occurring within 1-3 manipulative turns. The study recommends that LLM vendors implement dynamic, context-aware safety checks, such as toxicity scoring for multi-turn conversations and training models to better detect indirect prompt manipulations.

Source link

Share This Article
Leave a Comment