How to Tell If Your AI Agent Has Been Compromised (When Every Symptom Looks Normal)

May 30, 2026

Ben Hirschberg
CTO & Co-founder

Key takeaways

  • Why can't a list of symptoms tell you whether your agent was attacked? Every published indicator of compromise — an unusual tool call, a new outbound connection, a spike in resource use — is also something a healthy agent does during normal operation. A symptom list matches both the attack and the Tuesday, so the best it can return is maybe. Confirmation requires a test that separates the two, not a longer list of things to watch for.
  • What actually confirms an AI agent attack? Two questions, run in order. Does the suspicious signal chain backward into a causal sequence across layers, or does it sit alone as an isolated event? And does the behavioral shift line up with a deployment event, or did it appear with nothing in the pipeline behind it? An attack chains and has no deployment cause; benign drift stands alone or traces to a deploy.
  • What is the single fastest discriminator between attack and noise? A behavioral shift that correlates to a deployment event — a pod restart, an image update, a model-version bump — is expected evolution. The same shift with no corresponding infrastructure event is the signal worth escalating. That one correlation collapses most of the ambiguity a symptom list leaves open.

Your AI agent just did something it has never done. It called a tool that is not in its usual set, or it opened a connection to a destination you do not recognize, or its output came back subtly wrong. So you do what anyone does: you search for what a compromised agent looks like, and you find a checklist. Unusual tool usage. Unexpected data access. Out-of-context responses. Elevated resource consumption.

Then you look back at your agent, and every item on the list describes something it might do on any normal day. AI agents are non-deterministic by design. One prompt triggers a simple lookup; the next chains three API calls, writes a file, and spawns a subprocess — and the agent is working correctly. The checklist is the trap. Every indicator on it is a behavior a healthy agent also produces, which means matching symptoms against the list can only ever return a maybe. It cannot tell you whether this particular signal is an attack.

That is the actual question, and it is not "what does compromise look like?" It is "how do I confirm this signal is an attack and not normal behavior I have not seen before?" Confirmation is not a longer or better-tuned list of symptoms. It is two questions you ask about the signal in front of you: does it chain backward into a sequence, and does it correlate to a deployment. The rest of this article is how to run both.

A symptom list can’t tell you if you’ve been attacked — because healthy agents produce the same symptoms

Every “signs your agent is compromised” article fails at the same point: it lists signals without telling you how to read them. And every one of those signals has two readings.

A new outbound connection is exfiltration — or it is the agent reaching a tool integration that was added in last week’s release. An unusual tool call is scope abuse — or it is a model update that changed how the agent decomposes a task. A spike in database reads is staging for theft — or it is a legitimately large customer ticket. A burst of resource consumption is a crypto-miner that rode in on a poisoned dependency — or it is a batch job running exactly as designed. The signal is identical in both columns. What differs is the context around it, and the list gives you none of that context.

Signal | Benign reading | Attack reading
New outbound connection | A tool integration added in last week’s release | Data exfiltration to an attacker-controlled endpoint
Unusual tool call | A model update that changed task decomposition | Scope abuse against a target outside the agent’s envelope
Spike in database reads | A legitimately large customer ticket | Staging data for theft
Burst of resource use | A batch job running as designed | A crypto-miner on a poisoned dependency

The same signal sits in both columns. Only the surrounding context tells them apart.

What makes AI agent compromise hard to catch is worse than the symptom lists suggest. A compromised agent does not crash. It throws no error, degrades no performance, and returns nothing that obviously reads as wrong. It keeps operating, looking completely functional, while it executes whatever the attacker redirected it to do. It can summarize a document accurately in the same session that it exfiltrates the contents of a database query. Normal-looking behavior is not evidence of normal operation — and a checklist of normal-looking behaviors cannot tell the difference.

So the ambiguity is the problem. The two sections that follow are the two tests that resolve it.

Confirmation starts with one question: does the signal chain backward to a cause?

An isolated event is not an attack. An attack is a chain.

When an agent is compromised, the suspicious signal you noticed is rarely the whole event — it is one link in a sequence that runs across the layers of your stack: application, container, Kubernetes, cloud. A prompt injection arrives in an ingested document. The agent’s intent shifts. It invokes a tool against a target it has never touched. It moves the result to a destination it has never contacted. Each of those is a single event at a single layer, and on its own each one is inconclusive. The attack is the line you can draw connecting them.

That is the first test. Take the signal in front of you and chain it backward. Does the unusual tool call connect to an input that preceded it, and does the new egress connect to the data-access event that preceded that? If you can assemble the signal into a sequence (input to action to impact) with the same agent identity and a coherent timeline running through it, you are most likely looking at an attack, and the deployment check in the next section settles whether the sequence is malicious or merely new. If the signal stands alone, with nothing causally upstream and nothing downstream, you are probably looking at noise.
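As a rough illustration of the chain test, the sketch below walks backward from a suspicious event and collects the nearest preceding event of each earlier stage for the same agent identity. Everything in it is an assumption made for illustration: the `Event` fields, the three stage labels, and the ten-minute `max_gap` are hypothetical, and time proximity is only a stand-in for real causal linkage, which a production system would have to establish from richer telemetry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical stage labels mirroring "input to action to impact".
STAGES = ("input", "action", "impact")


@dataclass(frozen=True)
class Event:
    agent_id: str        # durable agent identity (assumed field)
    layer: str           # e.g. "application", "network", "cloud"
    stage: str           # one of STAGES
    timestamp: datetime
    detail: str = ""


def chain_backward(signal, events, max_gap=timedelta(minutes=10)):
    """Return the events leading up to `signal`, oldest first.

    For each earlier stage, pick the most recent event from the same
    agent that happened no more than `max_gap` before the current link.
    A result of length 1 means the signal stands alone.
    """
    chain = [signal]
    current = signal
    earlier_stages = STAGES[:STAGES.index(signal.stage)]
    for stage in reversed(earlier_stages):
        candidates = [
            e for e in events
            if e.agent_id == signal.agent_id
            and e.stage == stage
            and timedelta(0) <= current.timestamp - e.timestamp <= max_gap
        ]
        if not candidates:
            break  # the chain is broken at this stage
        current = max(candidates, key=lambda e: e.timestamp)
        chain.append(current)
    chain.reverse()
    return chain
```

Calling `chain_backward(egress_event, all_events)` on an impact-stage event returns either the full input, action, impact sequence or just the event itself. The sketch only looks upstream; checking for downstream links would follow the same pattern in the other direction.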

The catch is that running this test requires seeing all the layers at once. The tool call lives at the application layer. The egress lives at the network layer. The permission exercise lives in the Kubernetes and cloud audit streams. A tool that watches only one layer sees the link in its own layer and none of the others, so it can never assemble the chain. It hands you back the isolated event you started with. We have previously walked through four distinct AI-specific attack chains and shown what each detection layer sees at every stage of an incident; the throughline is that the chain only becomes visible when the layers are correlated. Correlating those layers is the job ARMO’s CADR was built for: assembling signals from the application surface, the container runtime, the Kubernetes API, and the cloud audit stream into a single causal narrative rather than four disconnected alerts.

The discriminator that settles it: a behavioral shift with no deployment behind it

The chain test tells you whether a signal is part of a sequence. The second test tells you whether that sequence is malicious or just new — and it is the faster of the two.

Here is the discriminator: a behavioral shift that correlates to a deployment event is expected evolution. A behavioral shift with no deployment event behind it is suspicious. That is the whole rule, and it works because legitimate capability changes have a cause you can point to in the pipeline, while malicious changes do not. An attacker can try to hide a shift inside a deployment window, which is why correlation is not just timing — it is whether the specific behavioral change maps to a specific pipeline event. A new tool call that matches the tool a release actually added is evolution; an out-of-scope data read during that same window matches nothing in the changelog, and the coincidence of timing does not explain it.

When an agent legitimately starts behaving differently, something changed to make it. So before you escalate a shift, check it against the deployment record. The causes that produce benign change are a short, checkable list: a pod restart, an image update, a configuration change, a new tool integration, a prompt revision, a model-version bump. If the agent started calling a new tool an hour after a release shipped that added that tool, the shift has a cause and the case is closed. If the model version changed and the agent’s task decomposition changed with it, that is evolution you can tie to an event.

The signal that matters is the shift with nothing behind it. The agent’s behavior changed, and no pod restarted, no image updated, no prompt was revised, no deployment ran. A capability that appears without any infrastructure event to explain it did not come from your pipeline — and that is the case worth paging someone for. This test depends on detection that ties behavior to a durable identity — the Deployment, not the transient pod — and checks it against recorded deployment events. Per-pod baselines that reset on every restart cannot run it: they have no stable record of what the agent did before the shift. The mechanics of building those baselines and the drift-versus-evolution distinction are covered in depth in defining normal agent behavior with runtime data.
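The deployment check can be sketched as a lookup that asks two things of each recorded deploy: did it hit the same durable workload shortly before the shift, and did it actually add the capability that changed. The second condition is what makes correlation more than timing. The `DeployEvent` and `BehaviorShift` records, the capability strings, and the seven-day `lookback` are all hypothetical; a real pipeline would derive them from its own release metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class DeployEvent:
    workload: str                  # durable identity, e.g. "agents/Deployment/support-agent"
    timestamp: datetime
    added_capabilities: frozenset  # what the release changed (assumed to be recorded)


@dataclass(frozen=True)
class BehaviorShift:
    workload: str
    capability: str                # e.g. "tool:crm_lookup" (hypothetical label)
    first_seen: datetime


def explaining_deploy(shift, deploys, lookback=timedelta(days=7)) -> Optional[DeployEvent]:
    """Return the most recent deploy that explains `shift`, or None.

    A deploy explains a shift only if it targeted the same workload,
    happened before the shift within `lookback`, and added the specific
    capability that changed. Timing alone is not enough.
    """
    for deploy in sorted(deploys, key=lambda d: d.timestamp, reverse=True):
        if deploy.workload != shift.workload:
            continue
        age = shift.first_seen - deploy.timestamp
        if timedelta(0) <= age <= lookback and shift.capability in deploy.added_capabilities:
            return deploy
    return None
```

A `None` result is the case the text describes as worth paging someone for: the behavior changed and nothing in the deployment record accounts for that specific change.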

Run both tests across the three signals you’ll actually see

The two tests are abstract until you point them at a real signal. Below, the same two questions — does it chain, does it correlate to a deploy — turn each of the three signals that send engineers searching into a verdict instead of a maybe.

Signal 1 — an unusual tool call. This is the hardest signal to read, because the agent is calling a tool it is authorized to call. The malice, when there is malice, is never in the call itself; it is in the scope, the sequence, or the rate. Start with scope: did the tool touch a target outside the agent’s normal envelope — a database query that hit a personally-identifiable-information table the agent has only ever queried for support tickets? Then chain it: does that out-of-scope call connect to an input that preceded it and an egress that followed? Then correlate: did anything deploy? A new tool call that maps cleanly to a release that added the tool is benign. A scope deviation that chains to an egress, with no deployment behind it, is confirmed. ARMO’s work on detecting rogue agent tool misuse breaks the scope, sequence, and rate categories down in depth.
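One simple way to picture the scope check is an envelope of tool and target pairs the agent has been observed using, against which a new call is compared. This is a minimal sketch under that assumption; the `ScopeEnvelope` class and the tool and table names are hypothetical, and it covers scope only, not sequence or rate.

```python
from collections import defaultdict


class ScopeEnvelope:
    """Tracks which targets each tool has historically touched."""

    def __init__(self):
        self._seen = defaultdict(set)  # tool name -> set of targets

    def observe(self, tool, target):
        """Record a call made during the baseline period."""
        self._seen[tool].add(target)

    def deviation(self, tool, target):
        """Return "new_tool", "new_target", or None if the call is in scope."""
        if tool not in self._seen:
            return "new_tool"
        if target not in self._seen[tool]:
            return "new_target"
        return None
```

A `"new_target"` result corresponds to the example above: an authorized tool pointed at a table the agent has never queried. On its own it is still a maybe until it is chained and checked against the deployment record.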

Signal 2 — a new or unexpected egress. The instinct is to check the destination against an allowlist — but that misleads, because the dangerous case uses a destination already on it. Exfiltration through a sanctioned channel — the agent’s own email tool, an approved webhook — looks like normal traffic at the IP level. So do not read the destination; read the volume and the pattern. An email-sending tool that emits a summary containing the full contents of a database query is not contacting a forbidden endpoint — it is sending an anomalous payload to an allowed one. Chain the egress back to the data-access event that fed it. Then check the deploy record: a new destination that appeared alongside a configuration change is benign; an anomalous payload volume to an allowed destination, chaining back to an out-of-scope read, with no deployment behind it, is confirmed.
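Reading volume rather than destination could look something like the following, which compares one outbound payload against the history of sizes sent to the same allowed destination. The statistic is a swapped-in choice, not something the article specifies: it is a robust z-score built from the median and the median absolute deviation, and the `threshold` of 6.0 and `min_samples` of 5 are arbitrary illustrative values.

```python
import statistics


def payload_is_anomalous(history_bytes, observed_bytes, threshold=6.0, min_samples=5):
    """Flag a payload that is far larger than this destination usually receives.

    Uses median and median absolute deviation (MAD) so that a few large
    historical payloads do not stretch the baseline. Only upward
    deviations are flagged.
    """
    if len(history_bytes) < min_samples:
        return False  # not enough history to judge either way
    median = statistics.median(history_bytes)
    mad = statistics.median([abs(size - median) for size in history_bytes])
    scale = max(mad, 1.0)  # avoid dividing by zero when history is constant
    return (observed_bytes - median) / scale > threshold
```

For an email tool that normally sends summaries of around a kilobyte, a message carrying a full query result would score far above the threshold even though the destination is allowed.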

Signal 3 — output that looks manipulated. The agent’s responses have shifted — subtly biased, oddly worded, or steering toward a recommendation it would not normally make. The reading here is to trace the output backward to an ingestion event. Manipulated output that comes from prompt injection has an upstream source: a document, a ticket, a retrieved record, a tool response that entered the context window carrying instructions. Find the ingestion event and you have the head of the chain. The benign reading is a model or prompt change you can tie to a deploy; the attack reading is a behavioral shift in output with an ingestion event upstream and no deployment to explain it. Tracing that link is the subject of detecting prompt injection in production AI agent workloads.

Across all three, the structure is the same. A single indicator is a maybe. The same indicator, chained backward and checked against the deployment record, is a verdict.
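Put together, the two answers reduce to a small decision table. The sketch below encodes the rule as this article states it; the `Verdict` names and the `classify` function are hypothetical, and the two boolean inputs are assumed to come from whatever chain and deployment checks your tooling provides.

```python
from enum import Enum


class Verdict(Enum):
    ATTACK = "chains with no deployment cause: escalate"
    EXPECTED_EVOLUTION = "traces to a deployment event"
    LIKELY_NOISE = "stands alone"


def classify(chains: bool, explained_by_deploy: bool) -> Verdict:
    """Combine the chain test and the deployment test into one verdict."""
    if explained_by_deploy:
        return Verdict.EXPECTED_EVOLUTION
    if chains:
        return Verdict.ATTACK
    return Verdict.LIKELY_NOISE
```

The deployment check is evaluated first because it is the faster discriminator: a shift that maps to a specific pipeline event is closed without assembling the chain at all.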

The confirmation only works if your telemetry sees the whole chain

Both tests make the same demand on your tooling, and it determines whether you can run this diagnostic at all.

The chain test requires seeing every layer at once. You cannot assemble input-to-action-to-impact if your detection watches the network but not the application, or the kernel but not the Kubernetes API. Each layer holds one link; the chain only forms when something correlates them. A stack of single-layer tools — a CNAPP for posture, an EDR on the nodes, a network monitor — produces one disconnected alert per layer and leaves the correlation for a human to do by hand. The deployment test makes the second demand: your detection has to ingest deployment events and attach behavior to a durable identity, so that “did anything change” is a question the system can answer automatically rather than one an engineer reconstructs by hand. Detection that resets its baseline on every pod restart cannot tell evolution from attack, because it has no stable record of what the agent did before the shift.
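Anchoring behavior to a durable identity can be illustrated with the owner references Kubernetes records on each object: a pod created by a Deployment is owned by a ReplicaSet, which is in turn owned by the Deployment. The sketch below works on plain dictionaries shaped like those API objects rather than calling any client library. The `durable_identity` helper and its output format are hypothetical, and it assumes `replicasets_by_name` holds only ReplicaSets from the pod's own namespace.

```python
def _owner(obj, kind):
    """Return the name of the first owner of the given kind, or None."""
    refs = obj.get("metadata", {}).get("ownerReferences") or []
    for ref in refs:
        if ref.get("kind") == kind:
            return ref.get("name")
    return None


def durable_identity(pod, replicasets_by_name):
    """Resolve a pod to the longest-lived workload that owns it.

    Follows pod -> ReplicaSet -> Deployment through ownerReferences and
    falls back to the nearest owner found when a link is missing.
    """
    meta = pod["metadata"]
    namespace = meta.get("namespace", "default")
    rs_name = _owner(pod, "ReplicaSet")
    if rs_name is None:
        return f"{namespace}/Pod/{meta['name']}"
    replicaset = replicasets_by_name.get(rs_name)
    if replicaset is not None:
        deployment = _owner(replicaset, "Deployment")
        if deployment is not None:
            return f"{namespace}/Deployment/{deployment}"
    return f"{namespace}/ReplicaSet/{rs_name}"
```

Keying baselines on the returned string rather than the pod name means a restart, or a rollout that creates a new ReplicaSet, keeps writing to the same behavioral record.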

That layer is what ARMO’s platform for cloud-native security for AI workloads was built to cover: runtime telemetry across the application, container, Kubernetes, and cloud layers, correlated into a single attack story, with behavior anchored to durable identity and checked against deployment events. As of January 2026, that same runtime technology powers the Rapid7 Command Platform, bringing the correlation approach to enterprise environments at scale.

You can only run this if your telemetry can see the whole chain

The question you came in with — has my agent been attacked — does not have a checklist answer, because the checklist describes both the attack and the ordinary day. It has a diagnostic answer. Stop matching the signal against a list of symptoms and start running it through two tests: chain it backward into a sequence, and check it against the deployment record. A signal that chains and has no deployment cause is an attack. A signal that stands alone or traces to a deploy is noise.

That diagnostic is only as good as the telemetry underneath it. If your detection can see every layer and correlate the chain, and if it knows what deployed and when, you can answer the question in minutes instead of reconstructing it in hours. To see how the platform assembles the chain and flags the shift with nothing behind it, book a demo.

Frequently Asked Questions

Can a compromised AI agent look completely normal?

Yes — that is the central difficulty. A compromised agent typically throws no errors, does not degrade performance, and returns output that reads as functional, all while executing the attacker’s redirected actions. It can complete its legitimate task in the same session it exfiltrates data. Because appearance is no guide, confirmation has to come from chaining the signal and checking it against deployment events, not from how normal the agent seems.

How do I tell prompt injection from a model that’s just behaving oddly?

Trace the odd output backward to an ingestion event. Prompt injection has an upstream source — a document, ticket, retrieved record, or tool response that carried instructions into the context window — so the behavioral shift connects to a specific input. A model that is merely behaving differently will instead correlate to a deployment, such as a model-version change or a prompt revision. Injection has an ingestion event and no deploy; ordinary change has a deploy and no malicious ingestion.

Is a single unusual tool call enough to confirm an attack?

No. A single tool call is inconclusive on its own, because the agent is authorized to make it and non-deterministic enough to make it for legitimate reasons. Confirmation requires two more steps: checking whether the call deviates from the agent’s normal scope and chains to other events, and checking whether a deployment explains it. A scope deviation that chains to an egress with no deployment behind it is a confirmed attack; the same call after a release that added the tool is not.

What’s the fastest signal that a behavioral shift is benign?

It correlates to a deployment event. If the shift lines up with a pod restart, an image update, a configuration change, a new tool integration, a prompt revision, or a model-version bump, it has a legitimate cause in your pipeline and is almost certainly expected evolution. The shift worth escalating is the one with no infrastructure event behind it.

Why can’t my CNAPP or EDR confirm this on its own?

Single-layer tools each see one link in the chain and none of the others. A CNAPP sees posture and cloud configuration; an EDR sees host and process activity; a network monitor sees connections. None of them can assemble input-to-action-to-impact across all four layers, and most cannot ingest deployment events to separate evolution from attack. They produce disconnected alerts that a human still has to correlate — which is the manual work the two-test diagnostic is meant to replace.
