AI systems gradually forget safety rules during long conversations, increasing the risk of harmful or inappropriate replies. A new report has found that a short series of simple prompts can override most AI guardrails.
Cisco Tests Popular AI Models
Cisco examined large language models (LLMs) powering chatbots from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company tested how many questions it took before the systems revealed unsafe or criminal information. Researchers conducted 499 conversations using “multi-turn attacks,” in which users posed a series of five to ten questions designed to bypass security barriers.
Cisco’s team compared responses across different prompts to measure how often chatbots complied with harmful requests, such as revealing private corporate data or spreading false information. On average, chatbots provided unsafe content in 64 percent of multi-question sessions, versus 13 percent of single-question sessions. Attack success rates ranged from 26 percent against Google’s Gemma to 93 percent against Mistral’s Large Instruct model.
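The gap between the multi-turn and single-turn figures comes down to conversation history: in a multi-turn attack the model sees every earlier exchange, so later requests can build on partial concessions. The sketch below is purely illustrative of that structure, not Cisco’s actual test harness; `send_chat` and `is_unsafe` are hypothetical placeholders for a chat-completion API and a safety classifier.

```python
# Illustrative sketch of a multi-turn probe vs. a single-turn probe.
# NOT Cisco's methodology; `send_chat` and `is_unsafe` are hypothetical
# stand-ins for a chat API and a content-safety classifier.

from typing import Callable

ChatFn = Callable[[list[dict]], str]     # takes message history, returns reply
JudgeFn = Callable[[str], bool]          # flags a reply as unsafe


def multi_turn_probe(turns: list[str], send_chat: ChatFn, is_unsafe: JudgeFn) -> bool:
    """Send five to ten escalating prompts within ONE conversation.

    The accumulated history is what lets later turns slip past guardrails.
    Returns True if any reply is judged unsafe.
    """
    history: list[dict] = []
    for prompt in turns:
        history.append({"role": "user", "content": prompt})
        reply = send_chat(history)                     # model sees the whole conversation
        history.append({"role": "assistant", "content": reply})
        if is_unsafe(reply):                           # guardrail failed on this turn
            return True
    return False


def single_turn_probe(prompt: str, send_chat: ChatFn, is_unsafe: JudgeFn) -> bool:
    """Send the same kind of request with no prior context to anchor on."""
    reply = send_chat([{"role": "user", "content": prompt}])
    return is_unsafe(reply)
```

In this framing, Cisco’s 64 percent versus 13 percent figures correspond to how often the multi-turn loop, rather than the single-shot call, ends with an unsafe reply.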
Open Models Shift Safety Burden
Cisco warned that multi-turn attacks could help spread dangerous content or allow hackers to access confidential data. The study found that AI systems struggle to maintain safety standards as conversations lengthen, giving attackers time to refine their queries and slip past protections.
Mistral, Meta, Google, OpenAI, and Microsoft offer open-weight LLMs, which give the public access to the models’ trained parameters, including those governing safety. Cisco stated that these models often ship with lighter built-in safeguards, leaving users responsible for securing their customized versions. The report also noted that Google, OpenAI, Meta, and Microsoft have taken steps to curb malicious fine-tuning of their models.
AI companies continue to face criticism for weak safety measures that enable criminal exploitation. In one instance, Anthropic reported that criminals used its Claude model for large-scale data theft and extortion, demanding ransoms exceeding $500,000 (€433,000).
