To enhance the security of large language models (LLMs) against malicious inputs, Niv Rabin from CyberArk emphasizes treating all incoming text as untrusted until validated. His team’s novel methodology includes instruction detection and history-aware validation to counteract malicious context and data. They implemented a layered defense system that incorporates honeypot actions and instruction detectors to block harmful prompts, ensuring only validated data reaches the model.
Honeypot actions serve as traps to flag suspicious behaviors, while instruction detectors assess external data for intent and structural signatures of threats. This proactive approach addresses vulnerabilities from external APIs and combats “history poisoning,” where benign instructions could accumulate into harmful directives over time. By submitting historical responses with new data to the instruction detector, the system effectively mitigates risks. Rabin’s comprehensive strategy portrays LLMs as complex, long-term workflows, reinforcing their resilience against evolving threats.
Source link
