Prompt Analysis for AI Attack Detection: Four Signal Categories, Three Blind Spots, One Correlation Layer

May 17, 2026

Yossi Ben Naim
VP of Product Management

Key takeaways

What are the four categories of semantic signals in adversarial inputs? Four categories of semantic signals live on the Input & Reasoning surface. Lexical and structural signals catch known patterns through deterministic content-shape analysis. ML classifiers score prompts on semantic intent. Behavioral-textual signals track multi-turn drift across a session. Provenance signals trace where the prompt content originated. Each category produces a different evidence type and has a categorically different blind spot.
Why does prompt analysis alone fail to detect indirect prompt injection? Indirect injection payloads arrive through the data plane — retrieved through RAG, returned by a tool call, or delegated by another agent — not through the request plane where prompt analyzers operate. The agent treats the retrieved content as data, but the LLM treats it as instructions. Every category that analyzes prompt content at the request plane goes blind: the payload was never in the request plane.

At 2:47 PM on a Tuesday, a customer support agent receives a routine ticket asking about return policy edge cases. The agent retrieves a section from your internal policy wiki through RAG to formulate the response.

Three weeks earlier, an attacker had planted a hidden instruction in that wiki page. Bedrock Guardrails scored the retrieved context at 0.04, well within the benign range. Within forty-seven seconds, the agent queries a customer database it has permission to access but has never touched in six weeks of production operation, and reads 2,400 customer records outside its declared scope.

Every Surface 1 signal returned green. The attack landed on Surfaces 2 and 3.

This is Surface 1 — Input & Reasoning — where prompt analysis lives. This article maps the four categories of semantic signals that live there, where each one fails, and what makes detection actually work.

Category 1: Lexical Analysis Catches Known Patterns and Misses Encoding, Polymorphism, and Multilingual Payloads

Category 1 covers deterministic content-shape analysis: regex pattern matching against known injection signatures, encoding detection (base64, ROT13, Unicode confusables, zero-width characters), and perplexity-based outlier detection. The evidence type is a Boolean flag — pattern hit or no hit — or a numerical perplexity score. Tools at this layer include open-source signature catalogs, regex filters embedded in API gateways, and statistical perplexity detectors that flag inputs that look unnaturally crafted.
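A minimal sketch of what a Category 1 check might look like, assuming a two-entry signature list and a simple base64 decoding pass. The signatures, the 24-character threshold, and the function names are illustrative assumptions, not taken from any real catalog or product, and perplexity scoring is left out.

```python
import base64
import binascii
import re

# Hypothetical signature list; a real catalog would be far larger.
SIGNATURES = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]
# Common zero-width characters used to split up trigger phrases.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
# Long runs of base64 alphabet characters (threshold is an assumption).
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")


def decode_base64_runs(text: str) -> list[str]:
    """Return printable UTF-8 decodings of base64-looking runs in the text."""
    decoded = []
    for run in BASE64_RUN.findall(text):
        try:
            candidate = base64.b64decode(run, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text once decoded
        if candidate.isprintable():
            decoded.append(candidate)
    return decoded


def lexical_flags(text: str) -> dict[str, bool]:
    """Boolean evidence only: pattern hit or no hit."""
    stripped = ZERO_WIDTH.sub("", text)
    decoded = decode_base64_runs(stripped)
    candidates = [stripped, *decoded]
    return {
        "signature_hit": any(
            sig.search(c) for sig in SIGNATURES for c in candidates
        ),
        "zero_width_chars": stripped != text,
        "base64_payload": bool(decoded),
    }
```

Calling `lexical_flags("ignora las instrucciones anteriores")` returns no flags at all: the Spanish phrasing is simply not in the signature list, which is the kind of gap this category cannot close on its own.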

The blind spot is everything outside the known pattern set. An attacker who base64-encodes the payload defeats regex matching unless a decoding step runs first; one who switches to a language the signature list does not cover defeats the catalog; one who fragments the malicious instruction across multiple paragraphs of natural prose defeats perplexity. Polymorphic, multilingual, and steganographic payloads pass every Category 1 check by construction.

The operational role is high-volume Layer 1 telemetry: useful for known-pattern triage, cheap to run inline, never standalone detection.

Category 2: ML Classifiers Score Novel Attacks and Semantically Clean Payloads as Benign

Category 2 covers ML classifiers — embedding-space jailbreak detectors and semantic intent classifiers trained on adversarial corpora. The dominant tools are Bedrock Guardrails, Model Armor, Azure’s Prompt Shields, Meta’s Llama Guard and Prompt Guard, and NVIDIA NeMo Guardrails. The evidence type is a confidence score between 0 and 1, typically returned alongside category labels for jailbreak intent, prompt injection, or sensitive content disclosure.

Classifiers score novel techniques outside their training distribution as benign because they have never seen them. They also score semantically clean payloads as benign: text that reads as a legitimate request but produces a malicious downstream effect through context carries no malicious surface signal in the prompt itself. The 0.04 score from the introduction is this category’s failure mode: the retrieved RAG context contained instructions phrased as a routine policy update, well inside the classifier’s expected distribution of operational text.

Like Category 1, Category 2 produces Layer 1 telemetry with confidence-score granularity. It feeds the correlation layer; it does not replace it.
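One way to read "feeds the correlation layer" in practice: every classifier score is recorded as telemetry, including the low ones, and inline blocking is a separate decision. The sketch below assumes a hypothetical `ClassifierSignal` record, an in-memory list as the telemetry sink, and an illustrative 0.9 blocking threshold; it does not call any vendor's classifier API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ClassifierSignal:
    """Hypothetical Layer 1 telemetry record for one classifier result."""
    agent_id: str
    source: str    # e.g. "user_prompt" or "rag_chunk"
    score: float   # classifier confidence in [0, 1]
    labels: list[str] = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)


def record_classifier_score(agent_id, source, score, labels, sink,
                            block_threshold=0.9):
    """Store the score as telemetry; return True only if it warrants blocking.

    The score is appended to the sink even when it looks benign, so a later
    correlation step can still join it with other signals.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("classifier score must be between 0 and 1")
    sink.append(ClassifierSignal(agent_id, source, score, list(labels)))
    return score >= block_threshold
```

With this shape, the 0.04 score from the introduction does not block anything, but it is still on record when a tool-call or scope deviation shows up for the same agent.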

Category 3: Multi-Turn Attacks Pass Every Per-Turn Check

Category 3 looks at the conversation rather than the prompt. The signals here are session-level: multi-turn intent shift across a sequence of exchanges, role-impersonation drift where the agent gradually accepts an attacker-supplied persona, and instruction-precedence inversion where later turns override the system prompt’s guardrails. Evidence is a session-state delta — what the conversation has become compared to what it started as.

The category exists because individual prompts can pass every Category 1 and Category 2 check and still hijack the agent through accumulated context. Research on multi-turn jailbreaks against open-weight models has documented success rates as high as 92%, because the attacker gets multiple opportunities to nudge the conversation off-rail without any single turn appearing adversarial.

The structural blind spot is that this category requires session state that perimeter tools do not carry. WAFs, API gateways, and prompt classifiers process individual requests and have no memory of what came before. Session-state instrumentation lives at the application layer — inside the agent runtime, not in front of it — where Layer 2 behavioral baselines get built. Conversation reset between sessions evades the category entirely; cross-session attacks that drip-feed manipulation across days remain invisible to single-session analysis.
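The session-state delta can be illustrated with a small tracker that keeps per-session history inside the agent runtime. This is a sketch under assumptions: the per-turn risk score is presumed to come from some upstream check such as a Category 2 classifier, and the class name and both thresholds are made up for illustration.

```python
from collections import defaultdict


class SessionDriftTracker:
    """Tracks how far a session's per-turn risk score has moved since turn one."""

    def __init__(self, per_turn_threshold=0.5, drift_threshold=0.3):
        self.per_turn_threshold = per_turn_threshold
        self.drift_threshold = drift_threshold
        self._scores = defaultdict(list)  # session_id -> scores in turn order

    def observe(self, session_id, turn_score):
        scores = self._scores[session_id]
        scores.append(turn_score)
        drift = scores[-1] - scores[0]  # delta against the session's first turn
        return {
            "turn_flag": turn_score >= self.per_turn_threshold,
            "drift": drift,
            "drift_flag": drift >= self.drift_threshold,
        }


tracker = SessionDriftTracker()
for score in (0.05, 0.15, 0.25, 0.40):
    result = tracker.observe("session-a", score)
# No single turn crosses the per-turn threshold, but the session has drifted.
print(result["turn_flag"], result["drift_flag"])
```

A fresh session ID starts with an empty history, so the same final turn sent in a new session would show zero drift. That is the conversation-reset evasion described above.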

Category 4: Indirect Injection Bypasses Every Request-Plane Signal

Categories 1 through 3 all analyze prompt content. Category 4 is different. It analyzes where the prompt content came from.

Provenance signals trace the origin of every token in the agent’s context window: direct user input through the request plane, content retrieved through RAG, output returned by a tool call, context delegated from another agent, scheduled content fetched on a timer. The evidence type is source attribution and propagation chain — metadata about the path the text traveled before reaching the agent, not the text itself.

This is the only category that surfaces indirect prompt injection at all. Indirect injection is one form of prompt injection, which OWASP lists as LLM01, the first entry in its Top 10 for LLM Applications. It arrives through the data plane: a poisoned wiki page retrieved through RAG, a malicious response from a compromised tool call, a payload embedded by another agent on a shared delegation edge. The agent treats the retrieved content as data; the LLM treats it as instructions. PoisonedRAG research demonstrated that injecting roughly five crafted texts per targeted query achieves over 90% attack success: a small, surgical campaign hijacks retrieval for a specific topic without corrupting the corpus broadly.

The blind spot is that provenance is metadata, not text. Categories 1, 2, and 3 operate on the prompt as it appears in the agent’s context window — by the time the indirect payload is in that window, it is indistinguishable from legitimate retrieved content. The only way to surface Category 4 signals is to instrument the data-plane sources themselves: vector database writes, tool-call return values, agent-to-agent delegation edges. This is what the runtime-derived AI-BOM produces — the provenance substrate that makes Category 4 reachable. Without it, indirect injection has no Surface 1 signal at all.
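To make "metadata, not text" concrete, here is a sketch of a context window built from segments that each carry a source and a propagation chain. The `Source` values mirror the origins listed above; the type names, the chain format, and the example connector names are hypothetical and do not describe how any AI-BOM product stores provenance.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    USER_INPUT = "user_input"
    RAG = "rag"
    TOOL_RESULT = "tool_result"
    AGENT_DELEGATION = "agent_delegation"
    SCHEDULED_FETCH = "scheduled_fetch"


@dataclass(frozen=True)
class ContextSegment:
    text: str
    source: Source
    chain: tuple[str, ...]  # propagation path, origin first (illustrative)


def build_prompt(segments):
    """What the model sees: plain text with no trace of where it came from."""
    return "\n\n".join(segment.text for segment in segments)


def data_plane_segments(segments):
    """What provenance analysis sees: everything that did not come from the user."""
    return [s for s in segments if s.source is not Source.USER_INPUT]


context = [
    ContextSegment("What is the return policy for opened items?",
                   Source.USER_INPUT, ("ticket-system",)),
    ContextSegment("Returns are accepted within 30 days.",
                   Source.RAG, ("policy-wiki", "vector-index", "retriever")),
]
```

`build_prompt(context)` flattens both segments into one string, which is all that content-analysis categories get to inspect. The source and chain survive only if the runtime records them at the point where the content enters.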

The Three Structural Blind Spots, Not Four

Four categories, but the failure modes group into three structurally distinct kinds.

Categories 1 and 2 share a common ceiling: content-analysis limits. Both analyze prompt text at the request plane, and both fail when the payload sits outside the known pattern set (Category 1) or the training distribution (Category 2).

Category 3 introduces a different failure: session-state dependency. Per-turn analysis is structurally blind to attacks that span turns; the signal requires application-layer state perimeter tools do not carry.

Category 4 introduces a third: data-plane bypass. Indirect injection arrives outside the request plane entirely, where no content-analysis category can reach.

Three structural blind spots. Four signal categories. One reason none of them stand alone as detection.

When Every Surface 1 Signal Goes Green: An Indirect Injection Walkthrough

Return to the support agent from the introduction. Walk it forward at signal granularity.

T+0s. The agent receives the customer ticket and queries the policy wiki through RAG. The retrieved chunk contains legitimate return-policy text followed by a paragraph planted three weeks earlier by an attacker with wiki edit access — phrased as a routine internal note instructing the agent to verify customer eligibility against a specific database table. Category 1 (lexical): no pattern hit, perplexity within range. Category 2 (classifier): Bedrock Guardrails returns 0.04. Category 3 (behavioral-textual): single-turn interaction, no session state to evaluate. Category 4 (provenance): partial signal — RAG source flagged as the trusted-tier policy-wiki connector. No Surface 1 category produces a finding.

T+12s. The agent issues a tool call to the customer database. This is the first such call from this agent in the six-week behavioral history captured by its Application Profile DNA baseline. Surface 2 fires: tool-call deviation against per-agent baseline.

T+47s. The agent reads 2,400 customer records — outside the per-agent declared scope captured at deployment time. Surface 3 fires: identity-exercise deviation against declared scope.

T+90s. Cross-surface correlation produces the finding. Category 4’s partial provenance flag, Surface 2’s tool-call deviation, and Surface 3’s identity-exercise deviation combine into a single attack story.

No Surface 1 category produced detection on its own. The detection lived in the correlation across surfaces.
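The two deviations that fired in the walkthrough reduce to set-membership checks once a baseline exists. The sketch below assumes a hypothetical `AgentProfile` holding the tools observed during the baseline period and the resources declared at deployment; the tool and resource names are invented, and this is not a description of how Application Profile DNA is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class AgentProfile:
    """Hypothetical per-agent baseline and declared scope."""
    declared_scope: set[str]                           # resources declared at deployment
    tools_seen: set[str] = field(default_factory=set)  # tools observed in the baseline


def tool_call_deviation(profile, tool):
    """Surface 2 style check: has this agent ever invoked this tool before?"""
    return tool not in profile.tools_seen


def scope_deviation(profile, resource):
    """Surface 3 style check: is this resource outside the declared scope?"""
    return resource not in profile.declared_scope


support_agent = AgentProfile(
    declared_scope={"policy_wiki"},
    tools_seen={"policy_wiki_search"},
)

# T+12s: first-ever call to the customer database tool.
print(tool_call_deviation(support_agent, "customer_db_query"))
# T+47s: records read from a resource outside the declared scope.
print(scope_deviation(support_agent, "customer_db.customers"))
```

Neither check looks at prompt text. Both depend on permission being broader than observed behavior: the agent was allowed to reach the database, which is why only the baseline and the declared scope notice anything.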

Detection Lives in Correlation, Not in Any Single Signal Category

What fired at T+90s was not a Surface 1 signal. It was the correlation across three: Category 4’s partial provenance flag, Surface 2’s tool-call deviation against the agent’s behavioral baseline, and Surface 3’s identity-exercise deviation outside declared scope. No single Surface 1 category produced that finding; the cross-surface correlation did.

The general pattern follows the same shape. Each Category 1–4 signal carries weak detection value on its own. Paired with a Surface 2 signal — a tool-call sequence outside the agent’s baseline, a parameter pattern not seen in the behavioral profile, a tool the agent has rights to but has never invoked — the combination produces a high-confidence finding. Paired with a Surface 3 signal — an identity exercise outside declared scope, a permission used after weeks of dormancy, a resource read at a volume outside the agent’s data-access envelope — the same lift applies.

This is the Layer 3 correlation work in the five-layer detection operating stack. Application Profile DNA at Deployment level provides the cross-surface baseline reference; ARMO’s cloud-native security for AI workloads runs the Layer 3 correlation that joins Surface 1 telemetry to Surface 2 and Surface 3 deviations into a single attack story.

Surface 1 signal categories are Layer 1 inputs. The detection is the correlation.

Frequently Asked Questions

Should we deploy Bedrock Guardrails, Llama Guard, or Model Armor if they can’t stand alone as detection?

Yes. Layer 1 telemetry is the foundation of the detection stack — without it, the correlation layer has nothing to correlate. Refusing to deploy a prompt classifier because it can be evaded conflates Layer 1 with Layer 3. These tools produce signal that flows into correlation alongside Surface 2 and Surface 3 telemetry.

How does ARMO’s approach differ from a prompt analyzer?

Prompt analyzers operate at Layer 1 on Surface 1 — one surface, one layer. ARMO operates at Layer 3 across all four surfaces, joining prompt-layer telemetry to tool-call, identity-exercise, and cross-agent telemetry into a single attack story. Both are necessary; they sit at different layers of the same stack.

What is the minimum cross-surface correlation that catches indirect prompt injection?

A Surface 1 provenance signal from runtime-derived AI-BOM paired with a Surface 2 tool-call deviation against the agent’s baseline, plus a Surface 3 identity-exercise deviation outside declared scope. Two of three is often enough for medium confidence; all three together produce the high-confidence finding shown in the walkthrough above.
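The two-of-three and three-of-three rule above can be sketched as a small correlation function. The signal kind names, the 120-second window anchored at the agent's earliest signal, and the "low" and "none" tiers are assumptions added for illustration; only the medium and high tiers come from the answer above.

```python
from dataclasses import dataclass

# Hypothetical signal kinds, one per surface.
KINDS = {
    "surface1_provenance",
    "surface2_tool_deviation",
    "surface3_scope_deviation",
}


@dataclass(frozen=True)
class Signal:
    agent_id: str
    kind: str
    t: float  # seconds since some reference point


def correlate(signals, agent_id, window_s=120.0):
    """Map the number of distinct signal kinds for one agent to a confidence tier."""
    relevant = sorted(
        (s for s in signals if s.agent_id == agent_id and s.kind in KINDS),
        key=lambda s: s.t,
    )
    if not relevant:
        return "none"
    start = relevant[0].t
    # Only count kinds that fall inside the window opened by the first signal.
    kinds = {s.kind for s in relevant if s.t - start <= window_s}
    return {1: "low", 2: "medium", 3: "high"}[len(kinds)]


walkthrough = [
    Signal("support-agent", "surface1_provenance", 0.0),
    Signal("support-agent", "surface2_tool_deviation", 12.0),
    Signal("support-agent", "surface3_scope_deviation", 47.0),
]
print(correlate(walkthrough, "support-agent"))
```

A fixed window anchored at the first signal is the simplest possible choice; a production correlation layer would more plausibly use sliding windows and weight each signal rather than count kinds.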

Do we need session-state tooling for behavioral-textual signals?

Yes. Perimeter classifiers process each request in isolation. Multi-turn drift, role-impersonation, and instruction-precedence inversion all require session state that lives inside the agent runtime — application-layer instrumentation, not perimeter instrumentation.

Where does runtime-derived AI-BOM fit in prompt analysis?

It is the provenance substrate for Category 4 — it tracks what gets written into RAG indexes, returned by tool calls, and passed across agent delegation edges. Without it, the agent’s context window has no source attribution, and indirect injection has no Surface 1 signal at all.
