Types of AI Agent Attacks: A Security Team’s Taxonomy

May 30, 2026

Ben Hirschberg
CTO & Co-founder

Key takeaways

Why classify AI agent attacks by detectability instead of by where they come from?

Classifying by origin (input, memory, tool, model, agent) builds a threat model, which is a different job from building a detection program. Sorting the same attacks by the evidence they leave tells you what your runtime stack can catch, what needs a different class of control, and what runtime cannot catch at all. One axis populates a risk register; the other sizes a budget.

Which class do most agent attacks fall into, and why is it the hard one?

The dominant class is sequence-detectable: the agent uses authorized capabilities, but the order, scope, or rate of its actions is the attack. No single event is anomalous, so signature rules stay silent, and only a per-agent baseline plus cross-layer correlation surfaces the pattern. This is the class generic container tooling was never built to see.

Is there a class of attack runtime detection simply cannot catch?

Yes: the state-detectable class, where there is no runtime signal to observe. The evidence lives in standing configuration or in slow conditioning that unfolds over days, so it belongs to posture and build-time controls, not to the SOC's runtime stack. Naming that boundary is what separates an honest detection program from one that assumes coverage it does not have.

A security team running agents in production can already list the ways those agents get attacked: prompt injection, memory poisoning, tool abuse, model tampering, agent-to-agent coercion. The list is not the problem.

The problem is that a security architect can recite all five and still not know which ones their detection stack will catch, because the way the field catalogs these attacks says nothing about whether the attack is catchable. Two attacks that enter through the same door can demand completely different detection: one betrayed by a single event, the other invisible until you correlate a dozen. Origin is not detectability.

That gap is a budgeting problem. Before a team can decide where to spend detection effort, it has to know which attacks are even catchable — and that depends on the kind of evidence each attack leaves behind, not on where it entered. Sort by evidence and every AI agent attack falls into one of three classes. One class your existing tools already catch. One class needs detection you almost certainly haven’t built. One class your runtime program will never catch, no matter how good it is. Those classes are point-detectable, sequence-detectable, and state-detectable.

This taxonomy runs orthogonal to the question of where an attack crosses into the runtime, the subject of ARMO’s framework for AI agent attack detection. The same attack often lands in a different detectability class than its origin bucket would suggest, which is the first sign that the second axis is doing work the first one cannot.

Point-Detectable Attacks Are the Easy Minority Your Existing Tools Already Catch

A point-detectable attack betrays itself in a single observable event. The signal lives in the event itself, not in any sequence around it: a process that should never run, a connection to a destination no policy allows, a syscall that has no business firing from this workload. One observation is enough to know something is wrong.

This is the class traditional detection was built for. A container tool watching for unexpected process execution, a network monitor flagging an unapproved egress destination, a runtime sensor catching a privilege-escalation primitive — these fire correctly on point-detectable events without any agent-specific understanding. When an agent attack happens to produce a single anomalous event, the existing stack handles it the same way it handles that event from any other workload.
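A minimal sketch of what "one observation is enough" means in practice. The allowlists, the `Event` shape, and the host names are hypothetical placeholders, not any particular tool's policy format; the point is only that the check needs nothing but the event in front of it.

```python
from dataclasses import dataclass

# Hypothetical allowlists for one workload. A real policy would come from
# your runtime tooling rather than hard-coded sets.
ALLOWED_PROCESSES = {"python3", "node"}
ALLOWED_EGRESS = {"api.internal.example", "mail.internal.example"}


@dataclass(frozen=True)
class Event:
    kind: str    # "exec" or "connect"
    target: str  # process name or destination host


def is_point_anomalous(event: Event) -> bool:
    """Return True when this single event violates policy on its own."""
    if event.kind == "exec":
        return event.target not in ALLOWED_PROCESSES
    if event.kind == "connect":
        return event.target not in ALLOWED_EGRESS
    return False  # event kinds this sketch does not model are ignored


print(is_point_anomalous(Event("exec", "nc")))
print(is_point_anomalous(Event("connect", "api.internal.example")))
```

The function never looks at history or at other events, which is why a rule like this works the same for an agent as for any other workload.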

That is why this class matters least to the architect building agent detection: it is the rare case, and it is already covered. Most attacks cataloged under prompt injection, tool misuse, or escape do not stay this simple. The dangerous versions hide the malice across many individually legitimate events, which moves them into the next class. On a coverage budget, point-detectable attacks sit under the signature-and-rule controls most teams already own, at the lowest marginal cost of the three classes.

Sequence-Detectable Attacks Are Where Every Event Looks Normal and the Order Is the Attack

A sequence-detectable attack produces no anomalous event at all. Every individual action is authorized, in-policy, and unremarkable on its own. The attack lives in the relationship between actions — the order they occur in, the scope they reach across, the rate at which they fire. Score any single event and it passes. Only the sequence reveals intent.

This is the dominant class for AI agents, and the reason is structural. An agent acts through the tools and permissions it was legitimately granted. When that agent is compromised — by an injected instruction, a poisoned document, a coerced delegation — it does not start doing forbidden things. It does authorized things in an unauthorized combination. A database read is normal. An outbound POST to an allowed destination is normal. The same agent reading from a sensitive table and then posting to an external domain inside the same execution window is the attack, and neither half of it trips a rule that evaluates events in isolation. Signature logic is blind here by construction, because there is no signature for “authorized, but wrong in sequence.” A non-agent workload running deterministic code rarely produces this class; for agents it is the normal failure mode.
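The read-then-post example can be sketched as a windowed correlation. Everything named here is an assumption made for illustration: the `Event` fields, the `SENSITIVE_TABLES` and `INTERNAL_DOMAINS` sets, and the 60-second window. Neither event is rejected on its own; only the pairing inside one window is reported.

```python
from dataclasses import dataclass

SENSITIVE_TABLES = {"customers"}             # hypothetical
INTERNAL_DOMAINS = {"api.internal.example"}  # hypothetical
WINDOW_SECONDS = 60.0                        # hypothetical execution window


@dataclass(frozen=True)
class Event:
    agent: str
    ts: float     # seconds since some reference point
    action: str   # "db_read" or "http_post"
    target: str   # table name or destination host


def read_then_post(events, window=WINDOW_SECONDS):
    """Find (read, post) pairs where one agent reads a sensitive table and
    then posts to an external host within `window` seconds."""
    hits = []
    last_read = {}  # agent -> most recent sensitive read by that agent
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.action == "db_read" and ev.target in SENSITIVE_TABLES:
            last_read[ev.agent] = ev
        elif ev.action == "http_post" and ev.target not in INTERNAL_DOMAINS:
            read = last_read.get(ev.agent)
            if read is not None and ev.ts - read.ts <= window:
                hits.append((read, ev))
    return hits
```

A fixed pair rule like this is the simplest form of correlation. It still encodes one known-bad sequence by hand, which is why the article pairs correlation with a learned per-agent baseline rather than relying on hand-written pairs alone.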

The attacks security teams lose sleep over concentrate here. Indirect prompt injection that hijacks an agent’s intent surfaces as a tool-call sequence the agent has never produced before — which is why detection has to inspect runtime behavior rather than user input, the case made in depth in ARMO’s work on detecting prompt injection in production agent workloads. Credentialed tool and API misuse is the purest example, where the entire attack is scope, sequence, and rate abuse of capabilities the agent already holds. Agent escape shows up as a permission-action pattern that breaks the agent’s normal scope rather than as any single forbidden call.

Catching this class costs more, because signatures do not apply and the control has to be built from two pieces working together. The first is a per-agent behavioral baseline: a model of each agent’s normal order, scope, and rate of action, maintained at the deployment level so it survives the constant churn of ephemeral pods. ARMO builds this as Application Profile DNA, a behavioral envelope per agent rather than a static rule set. The second is correlation: a layer that assembles individual signals (an input, a tool call, a permission exercise) into a single causal chain so the dangerous sequence becomes visible as one event instead of a dozen unrelated ones. ARMO’s CADR (Cloud Application Detection and Response) occupies that correlation layer, joining application, container, Kubernetes, and cloud signals into one attack story.
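To illustrate why the baseline is kept at the deployment level, here is a small sketch. It is not how Application Profile DNA is implemented; the `DeploymentBaseline` class, its methods, and the names used are hypothetical, and it tracks only scope (which tools are called), leaving order and rate out for brevity.

```python
from collections import defaultdict


class DeploymentBaseline:
    """Tracks which tools each deployment's agent has been seen calling."""

    def __init__(self):
        self._tools = defaultdict(set)  # deployment name -> tools observed

    def learn(self, deployment, pod, tool):
        # The pod name is accepted but deliberately not used as a key, so the
        # profile carries over when pods are replaced.
        self._tools[deployment].add(tool)

    def is_known(self, deployment, tool):
        return tool in self._tools.get(deployment, set())


baseline = DeploymentBaseline()
baseline.learn("support-agent", "support-agent-pod-1", "search_docs")
baseline.learn("support-agent", "support-agent-pod-2", "send_email")

# A replacement pod inherits everything learned from earlier pods.
print(baseline.is_known("support-agent", "search_docs"))
print(baseline.is_known("support-agent", "delete_user"))
```

Keying on the pod instead would reset the profile on every reschedule, and each new pod would start with an empty notion of normal.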

The combination is what makes the sequence legible. Consider three events in a row: the agent retrieves a document, queries a customer table, then calls its email tool. Each is something this agent does every day; scored individually, all three are normal. Measured against the agent’s baseline as a sequence — retrieve untrusted content, then read sensitive data it rarely touches, then route outbound — the chain is flagged and assembled into a single incident an analyst can act on. On a coverage budget, sequence-detectable attacks sit under baseline-plus-correlation controls, and this is the highest-cost class to run — the spend no generic tool substitutes for, and the center of any serious agent detection program.
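The three-event example can be made concrete with a toy baseline built from action bigrams. This is a sketch under stated assumptions, not a description of any product's model: the action names, the history, and the choice of bigrams are all illustrative.

```python
def ngrams(actions, n=2):
    """Consecutive n-length windows over a list of action names."""
    return [tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)]


def build_baseline(runs, n=2):
    """Collect every action and every n-gram seen in past runs."""
    known_actions, known_ngrams = set(), set()
    for run in runs:
        known_actions.update(run)
        known_ngrams.update(ngrams(run, n))
    return known_actions, known_ngrams


def score_run(run, baseline, n=2):
    """Return (unseen actions, unseen transitions) for one new run."""
    known_actions, known_ngrams = baseline
    unseen_actions = [a for a in run if a not in known_actions]
    unseen_transitions = [g for g in ngrams(run, n) if g not in known_ngrams]
    return unseen_actions, unseen_transitions


# Hypothetical history: every action below is routine for this agent.
history = [
    ["retrieve_document", "summarize", "send_email"],
    ["query_customers", "update_ticket"],
    ["retrieve_document", "update_ticket"],
]
suspect = ["retrieve_document", "query_customers", "send_email"]

unseen_actions, unseen_transitions = score_run(suspect, build_baseline(history))
print(unseen_actions)      # empty: each action is familiar on its own
print(unseen_transitions)  # both transitions are new for this agent
```

Scored per event, the run is clean. Scored as a sequence, both transitions are new, which is the gap between the two views that the paragraph above describes. A production baseline would also need to weigh frequency and tolerate benign novelty, which this sketch does not attempt.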

State-Detectable Attacks Are the Class Runtime Detection Will Never Catch

A state-detectable attack leaves no runtime signal to observe at the moment of compromise. There is no event to score and no sequence to correlate, because the evidence does not live in agent behavior at all. It lives in state — either a standing condition that is true before anything runs, or a slow accumulation that produces nothing observable until the moment it executes.

The class splits into two sub-types. The first is standing misconfiguration: an overprivileged service account, an exposed model artifact, a missing segmentation boundary around an agent workload. These are posture facts. They are dangerous, but they are dangerous as a condition, not as an event — there is no runtime moment where the misconfiguration “fires.” The second is slow conditioning, and memory poisoning is the canonical example. An attacker shifts an agent’s persistent context gradually across sessions, with no single interaction anomalous, until a normal-looking request triggers a malicious action. The conditioning phase produces zero alerts across days or weeks, because nothing observable has happened yet. The same dynamic appears in RAG pipelines, where a poisoned index entry sits inert until a retrieval days later pulls it into context. ARMO’s analysis of intent drift detection walks through how detection stacks stay silent through the entire buildup.

Here is the boundary worth stating plainly: a runtime detection program cannot catch the state-detectable class at the moment of compromise, no matter how good it is. While the condition stands or the conditioning builds, there is nothing at runtime to detect. Pretending otherwise is how teams end up believing they have coverage they do not have.

Instead, a security team routes the class to controls that can see it. Standing misconfiguration is a posture problem, and the lever that surfaces it is the gap between configured and observed — a runtime-derived AI Bill of Materials that compares what an agent is provisioned to reach against what it touches, so the dangerous standing condition shows up as drift rather than waiting for an event that never comes. Slow conditioning is the hard residual. Long-window behavioral drift monitoring gives partial coverage by catching the gradual shift before the payload fires, but the rest belongs to build-time controls and governance — vetting what enters memory and the index in the first place — not to the SOC’s runtime stack. On a coverage budget, the state-detectable class sits under posture and build-time controls, and it should come off the runtime detection team’s ledger entirely.
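The configured-versus-observed comparison reduces to a set difference. The sketch below assumes both sides are already available as sets of resource names; the names are hypothetical, and building those sets from real IAM grants and runtime telemetry is the hard part this example skips.

```python
def unused_grants(configured, observed):
    """Resources the agent may reach but has never been seen touching."""
    return configured - observed


def unexpected_access(configured, observed):
    """Resources the agent touched that no configuration accounts for."""
    return observed - configured


# Hypothetical inventory for one agent.
configured = {"customers_db", "tickets_db", "email_api"}
observed = {"tickets_db", "email_api"}

print(unused_grants(configured, observed))      # standing over-privilege
print(unexpected_access(configured, observed))  # nothing outside the grant
```

The first set is the standing condition: no event will ever announce it, but it falls out of the comparison immediately. A non-empty second set would be a runtime finding rather than a posture one.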

Reading the Taxonomy as a Coverage Budget

The three classes are not just a way to describe attacks. Read together, they are a budgeting tool, because each class maps to a control type and a cost, and the map tells a security team where the money has to go.

Class	Signal shape	Example attacks	Control class required	Relative cost	Routes to
Point-detectable	Single anomalous event	Unexpected process, unapproved egress, escalation primitive	Signatures, runtime rules	Low — already owned	Existing container/runtime tooling
Sequence-detectable	Order / scope / rate across events	Prompt-injection intent hijack, tool misuse, agent escape	Per-agent baselines + correlation	High	The runtime detection program
State-detectable	None at runtime; config or slow conditioning	Overprivileged identity, memory poisoning, RAG index poisoning	Posture, AI-BOM drift, build-time, governance	Medium	Posture and build-time, not the SOC runtime stack

The way to apply it is a single diagnostic run against your own attack inventory. Take each attack an agent in your environment could face and start with one question, the isolation test: is any single event anomalous on its own? If yes, it is point-detectable and your existing tools likely cover it. If no, but the order, scope, or rate of events reveals it, it is sequence-detectable and it needs baselines and correlation. If nothing is observable at runtime at all, it is state-detectable, and it is not your runtime program’s job. The tally across your inventory is the budget, and it usually surprises people, because the cheap class is the one the market sells hardest and the expensive class is the one where most agent attacks land.
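The isolation test and the tally can be written down as a short decision function. The inventory entries below reuse example attacks from the table above; the yes/no answers attached to them, and all the names in the code, are illustrative assumptions that a team would replace with its own assessment.

```python
from collections import Counter
from enum import Enum


class DetectabilityClass(Enum):
    POINT = "point-detectable"
    SEQUENCE = "sequence-detectable"
    STATE = "state-detectable"


def classify(single_event_anomalous, pattern_reveals_it):
    """Apply the isolation test, then the order/scope/rate question."""
    if single_event_anomalous:
        return DetectabilityClass.POINT
    if pattern_reveals_it:
        return DetectabilityClass.SEQUENCE
    return DetectabilityClass.STATE


# (attack, is a single event anomalous?, does order/scope/rate reveal it?)
inventory = [
    ("unapproved egress", True, False),
    ("prompt-injection intent hijack", False, True),
    ("credentialed tool misuse", False, True),
    ("memory poisoning", False, False),
    ("overprivileged identity", False, False),
]

tally = Counter(classify(single, pattern).value for _, single, pattern in inventory)
for name, count in tally.items():
    print(f"{name}: {count}")
```

The order of the two checks matters: an attack that produces even one anomalous event is counted as point-detectable, whatever else it does afterward.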

One distinction keeps this axis from being confused with another. A detection framework organized around where an attack crosses into the runtime — the four detection surfaces covered in the parent framework above — answers the question of *where* to instrument. Signal shape answers a different question: *whether* you can catch the attack once it crosses. The two axes compose; a technique-by-technique reading of which known attacks fall into which detectability class is the work ARMO does in its MITRE ATLAS mapping for agent attack detection, which operates one layer below this classification.

Honest Detection Starts With What You Can’t Catch

The value of a taxonomy is not the catalog of attacks it produces. Every team can list the ways an agent gets compromised. The value is knowing which of those attacks your program can act on — which a single alert will catch, which demand the harder investment in baselines and correlation, and which will never appear in your runtime telemetry because the evidence lives somewhere your sensors do not reach.

That map is the difference between a detection program scoped to its environment and one scoped to a vendor’s feature list. Before sizing the program, inventory the attacks your agents could realistically face, run each through the isolation test, and tally where they land across the three classes. Scope the build from that tally, not from the catalog — and route the state-detectable class to posture and build-time controls where it belongs, rather than leaving it to a runtime stack that was never going to see it. ARMO’s platform for cloud-native security for AI workloads was built to occupy the classes runtime can cover: deployment-level baselines for the sequence class, cross-layer correlation to assemble the chains, and configured-versus-observed drift for the standing-state problems posture can surface.

FAQ

How do I tell whether a given attack is sequence-detectable or state-detectable?

Ask whether anything is observable at runtime at all. If the attack produces events — even if no single one is anomalous — it is sequence-detectable, and the order, scope, and rate are your signal. If the compromise leaves nothing in runtime telemetry until a payload fires days later, or exists only as a standing configuration, it is state-detectable. The test is whether your sensors have anything to score in the first place.

Which class do most AI agent attacks fall into?

Sequence-detectable, by a wide margin. Because agents act through authorized tools and permissions, most attacks manifest as legitimate actions in an illegitimate order, scope, or rate rather than as a single forbidden event. This is the class signature-based tools miss and the class that per-agent baselines combined with correlation are built to catch.

Can runtime detection ever catch memory poisoning?

Only partially, and only through long-window behavioral drift monitoring that flags the gradual shift before the payload executes. The conditioning phase itself produces no runtime signal, so the durable controls are build-time and governance — vetting what is allowed into agent memory and retrieval indexes before it ever reaches production. Treat any runtime coverage of memory poisoning as a supplement, not the primary defense.
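One way to picture long-window drift monitoring is to compare an agent's recent mix of actions against an older reference window. The sketch below uses total variation distance as the comparison; that metric, the window contents, and the action names are assumptions chosen for illustration, and the article does not prescribe a specific measure.

```python
from collections import Counter


def distribution(actions):
    """Relative frequency of each action name in a window."""
    counts = Counter(actions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {action: count / total for action, count in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0.0 for identical mixes, 1.0 for disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical windows: weeks-old reference versus the most recent period.
reference = ["search_docs"] * 8 + ["send_email"] * 2
recent = ["search_docs"] * 5 + ["send_email"] * 5

drift = total_variation(distribution(reference), distribution(recent))
print(round(drift, 3))
```

A score like this only moves if the conditioning changes what the agent does before the payload fires. Poisoned memory that stays inert until triggered leaves the action mix untouched, which is why this kind of monitoring is partial coverage at best.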

Does this taxonomy replace an origin-based threat model?

No — the two are complementary and answer different questions. An origin-based taxonomy, organized by where an attack enters, is the right tool for building a threat model and reasoning about attack surface. The detectability axis is the right tool for scoping a detection program, because it maps each attack to the control class and cost required to catch it.

Where do per-agent baselines fit in catching these attacks?

Per-agent baselines are what make the sequence-detectable class catchable at all. Without a model of each agent’s normal order, scope, and rate of action, every legitimate variation either fires a false alert or hides a real attack inside accepted noise. Maintained at the deployment level so they survive ephemeral pods, baselines give correlation the reference it needs to recognize a dangerous sequence as it forms.