When Information Becomes the Attack Surface – Understanding AI Agent Traps
Attackers exploit AI agents by injecting malicious instructions into trusted data sources.
Summary
Attackers are developing new methods to exploit AI agents by manipulating the data they rely on. These 'traps' can involve hiding malicious instructions in plain sight within web pages or documents, or semantically manipulating information to skew an agent's decision-making. Cognitive state traps can poison an agent's memory or knowledge base, leading to incorrect outputs or actions.
Full text
AI agents go beyond answering questions. They can autonomously browse websites, read emails, search company files, query software tools, and more. AI models producing incorrect answers is hardly a threat, until agents encounter information that’s maliciously designed to influence what it sees, believes, remembers, or executes. An agent leverages webpages, document stores, wikis, images, emails, or tools to produce intended outputs. But what happens when these sources mask malicious instructions? These trap AI agents into making a wrong interpretation or taking unintended action. Scientists from Google DeepMind categorized these “traps” into six categories, including content injection, semantic manipulation, cognitive state, behavioral control, systemic, and human-in-the-loop traps. The last two are more theoretical and expected to become more relevant as AI agent use grows. It helps to understand these traps to determine the necessary mitigations. Content Injection: When Instructions Hide in Plain Sight Content injections exploit the difference between what a human sees and what an agent parses, as well as the system’s difficulty in keeping trusted instructions separate from untrusted external data. A webpage might appear harmless, but its underlying code, metadata, hidden text, or image can contain malicious instructions for an AI system. An AI model accepts attacker-controlled data from an external source, such as a website or file. If this system fails to distinguish between data and instructions, the model may start processing instructions within that content. The objective behind such injection of malicious content is to alter the AI’s response, disclose sensitive information or enable an unauthorized action. In NIST evaluations of agent hijacking, malicious instructions succeeded across five tested injection tasks, on average, 57% of the time. A support ticket with underlying malicious instructions can manipulate an AI agent into retrieving customer data from the CRM and sending it to an attacker-controlled address. If the agent has excessive permission, this exfiltration becomes all the easier. Semantic Manipulation: Shapeshifting the Information Semantic manipulation need not explicitly tell the agent what to do; it feeds repetition, emotional language, selective context, a false sense of authority, and coordinated claims to the agent to skew context and guide the agent towards the ‘attacker preferred’ conclusion.Advertisement. Scroll to continue reading. Imagine a scenario where you have tasked an agent to zero in on a supplier. It comes across search results that repeatedly extol the virtues of a specific supplier, describe a specific company as the gold standard, highlight its strengths and amplify doubts about competitors. This increases the chances of the agent recommending this supplier. Conventional signature-based security tools may not flag anything malicious, as the attacks leverage ‘reasoning’ to influence rather than rely on malicious code. Here, manipulation of the surrounding information environment becomes the manipulation of the decision itself. Cognitive State Traps: Poisoning Agent Knowledge Some agent systems use retrieval databases, interaction histories, or persistent memory stores to maintain context and continuity across tasks. This creates an opportunity for poisoned information to influence later outputs or actions. E.g., a poisoned document in a shared repository that an agent refers to and trusts as evidence, or a manipulated exchange that becomes an agent’s memory, only to rear its head during future tasks. Research presented at the USENIX conference found that, in controlled tests, inserting five specially crafted texts per target question caused a RAG system to produce the attacker’s chosen answer in about 90% of cases, even when its knowledge base contained millions of legitimate texts. With information governance becoming an integral component of AI security, organizations must be aware of which sources agents retrieve information from, who can modify those sources, how claims can be verified, and whether stored memories can be reviewed or removed. Behavioral Control: Turning Influence into Action Behavioral control operates at the juncture where interpretation is translated into action. Malicious content may attempt to make the AI agent send data, approve a transaction, execute code, invoke another tool or trigger a myriad of other actions. Here, the extent of the consequence depends on the extent of the agent’s access. Grant the agent only the data access and tool permissions required for the specific task. This could be the difference between an agent delivering a misleading summary and the same agent reading confidential files and communicating this information externally, resulting in data loss. The More Theoretical Frontier Systemic traps and human-in-the-loop traps remain less developed, but they deserve attention. Systemic traps could induce many similar agents to behave in correlated ways, causing congestion, market disruption, or cascading failures. Human-in-the-loop traps could use a compromised agent to mislead the person expected to approve its actions. These risks may become more plausible as agent populations grow and users become accustomed to trusting agent-generated summaries. Control for Agent Traps A single control won’t alleviate the agent trap threat. A defensive framework must have aspects like source verification, content screening, memory governance, restricted permissions, isolated execution, monitoring, and an independent approval framework with a human in the loop for high-impact actions. Security must follow authority, and there should be clear lines of separation between the ability to interpret and the authority to act. The future of agentic AI use will depend not only on what these agents can do but also on how they decide what to trust. The fact that they can complete a task is not up to doubt, but they must be able to recognize when the environment they are operating in and harnessing is trying to manipulate them. Related: Agentic AI Security: Wrong Context, Wrong Decisions at Machine Speed Learn More at the AI Risk Summit | Ritz-Carlton, Half Moon Bay Written By Etay Maor Etay Maor is Vice President of Threat Intelligence at Cato Networks, a founding member of Cato CTRL, and an industry-recognized cybersecurity researcher. Prior to joining Cato in 2021, Etay was the chief security officer for IntSights (acquired by Rapid7), where he led strategic cybersecurity research and security services. Etay has also held senior security positions at Trusteer (acquired by IBM) and RSA Security’s Cyber Threats Research Labs. Etay is an adjunct professor at Boston College and is part of the Call for Paper (CFP) committees for the RSA Conference and Qubits Conference. Etay holds a Master’s degree in Counterterrorism and Cyber-Terrorism and a Bachelor's degree in Computer Science from IDC Herzliya. Daily Briefing Newsletter Subscribe to the SecurityWeek Email Briefing for the latest cybersecurity threats, trends, and expert insights. More from Etay Maor The Zero-Knowledge Threat Actor and the End of Responsible DisclosureThe Mythos Moment: Enterprises Must Fight Agents with AgentsWhy Agentic AI Systems Need Better Governance – Lessons from OpenClawLiving off the AI: The Next Evolution of Attacker TradecraftRethinking Security for Agentic AIHow TTP-based Defenses Outperform Traditional IoC HuntingWill AI-SPM Become the Standard Security Layer for Safe AI Adoption?Who’s Really Behind the Mask? Combatting Identity Fraud Trending Daily Briefing NewsletterSubscribe to the SecurityWeek Email Briefing to stay informed on the latest threats, trends, and technology, along with insightful columns from industry experts. Webinar: How Modern Breaches Bypass MFA and Evade Detection June 17, 2026 Today’s attackers are no longer breaking in — they’re logging in. Join this live webinar as we break down the
Indicators of Compromise
- mitre_attack — T1559.002