Get the latest, first
arrowBlog
Runtime Observability for AI Agents: See What Your AI Actually Does

Runtime Observability for AI Agents: See What Your AI Actually Does

Mar 5, 2026

Shauli Rozen
CEO & Co-founder

Key takeaways

  • Why isn't developer observability (LangSmith, Arize) enough for AI agent security? Developer observability traces reasoning chains and token usage for debugging. Security observability asks different questions entirely — what agents exist, what they're doing at runtime, whether their behavior is normal, and what the blast radius is if one is compromised. The two share vocabulary but require fundamentally different instrumentation.
  • What is the AI Workload Observability Stack? A five-layer model that builds visibility from the ground up: Discovery → AI-BOM → Behavioral Visibility → Execution Graph → Identity Mapping. Each layer enables the next — skip one and the layers above it become unreliable.
  • Why can't CNAPPs, Falco, or container scanners see AI agent threats? Agentless CNAPPs snapshot configuration but never watch workloads operate. System-call tools like Falco detect process-level anomalies but can't distinguish a prompt injection from normal agent behavior. Neither operates at the application layer where AI-specific threats — tool misuse, data exfiltration, agent hijacking — actually unfold.
  • What is "observe-to-enforce" and why does it matter for AI agents? Most security teams can't write enforcement policies for AI agents because they don't yet understand what agents actually need. Observe-to-enforce solves this by monitoring an agent's runtime behavior first, then generating least-privilege policies based on evidence rather than documentation and guesswork.

Last Tuesday, a platform security engineer at a mid-size fintech company ran a routine audit on their production Kubernetes clusters. The audit surfaced three LangChain-based agents, two vLLM inference servers, and a Model Context Protocol (MCP) tool runtime. None had been reported by the development teams. None appeared in any security inventory. All had been running for weeks.

One of the agents had been making outbound API calls to a third-party data enrichment service every four minutes. Another had mounted a shared volume containing customer transaction records. The MCP runtime had been granted a service account with cluster-admin privileges during a late-night deployment that never went through the change management process.

The security team’s existing tools — a CNAPP, a container vulnerability scanner, and a SIEM — reported nothing unusual. The CNAPP saw the cloud resources. It dutifully classified them as compute workloads with some medium-severity posture findings. The vulnerability scanner found known CVEs in base images — the same CVEs it flags on every container in the fleet. The SIEM collected API audit logs showing routine Kubernetes operations.

But none of them understood that they were looking at AI agents. None could tell the security team what prompts those agents were processing, which tools they were invoking, or where data was flowing. None could explain why one agent was making outbound calls to an API endpoint that wasn’t in any approved integration list. None could answer the most basic question: are these agents doing what their developers intended, or has something gone wrong?

This wasn’t a breach. It was something worse: a complete blind spot. The security team was responsible for workloads they didn’t know existed, exhibiting behaviors they couldn’t see, with permissions they couldn’t assess.

If you’re responsible for security in an environment where developers are deploying AI agents — or where you suspect they might be, without telling you — this scenario probably isn’t hypothetical. It’s Tuesday.

Two Definitions, One Word, Zero Overlap

Ask an ML engineer what “observability” means for AI agents, and they’ll describe tracing reasoning chains, tracking token usage, monitoring latency, and debugging hallucinations. They’re thinking of tools like LangSmith, Braintrust, Arize, and Langfuse. That’s developer observability — and it’s valuable for the problems it solves.

Ask a security engineer the same question, and the conversation goes somewhere entirely different. They want to know: What AI agents exist in my clusters? What are they actually doing at runtime? What tools, APIs, and data sources are they accessing? Is their behavior consistent with their intended design? And if one of them is compromised, what’s the blast radius?

These are not the same problem. Developer observability and security observability share a vocabulary but operate in different domains, answer different questions, and require fundamentally different instrumentation. Conflating them — which almost every piece of content on this topic currently does — creates the blind spot that fintech team experienced. Your ML team may have full trace visibility in LangSmith and still have zero security visibility into the same agents.

This article is about the security side. Specifically, it’s about what runtime observability for AI agents requires when the question isn’t “why is my agent slow” but “what is my agent actually doing, and should I be worried.”

What follows is a five-layer observability model that security teams can use to build visibility from the ground up — from basic discovery through behavioral analysis to full execution mapping. Along the way, the article explains why the tools most organizations already have can’t see what matters, and what it takes to close the gap.

Why AI Agents Break Traditional Security Observability

Every security tool in your stack was built on an assumption: workloads are deterministic. You deploy code. That code does what it was written to do. If something deviates from expected behavior, your detection engine catches it.

AI agents violate that assumption fundamentally.

A LangChain agent with access to a Python interpreter tool can receive a prompt that causes it to generate and execute code nobody wrote, nobody reviewed, and nobody approved. A ReAct agent can chain together tool calls in sequences its developers never anticipated — not because it’s broken, but because that’s how autonomous reasoning works. An MCP-connected agent can discover and invoke new tools at runtime that weren’t even available when the agent was deployed.

Your container security tool sees “new process started.” Your network monitor logs “outbound HTTPS connection.” Your SIEM records “API key used.” What none of them can tell you is whether that process was triggered by a legitimate reasoning chain or a prompt injection attack. They lack the application-layer context to distinguish the two, because they were never designed to understand prompts, tool invocations, or the concept of “normal” for a workload that’s non-deterministic by design.

The consequences compound. Your SOC analyst receives an alert about unusual network traffic from a container. They spend 40 minutes investigating, tracing the connection through logs, cross-referencing with deployment records, asking the development team if this is expected behavior. The answer comes back: “Yeah, the agent calls that API sometimes depending on the prompt.” Alert closed. Twenty minutes later, a different alert fires — similar pattern, same agent. Another investigation. Same conclusion. By the third alert, the analyst starts auto-closing them. And that’s exactly when the actual attack — a prompt injection that redirected the agent to exfiltrate data through a subtly different API path — gets buried in the noise.

This isn’t a theoretical concern. The OWASP Top 10 for Agentic Applications catalogs the specific attack categories that exploit this gap: agent hijacking, prompt injection, tool misuse, identity impersonation, and AI-mediated data exfiltration. These aren’t container vulnerabilities. They aren’t misconfigurations. They’re categorically different threats that require categorically different visibility.

The result is a security architecture that works fine for deterministic workloads — and fails silently for the fastest-growing workload category in your environment.

Observability Is the Foundation, Not a Feature

When security teams evaluate tools for AI workloads, the first question they typically ask is: “What threats can you detect?” That’s the wrong place to start.

Here’s why: you can’t assess risk in workloads you don’t know exist. You can’t detect AI-specific threats without understanding which workloads are AI agents. You can’t enforce least privilege without behavioral baselines that show what an agent actually needs versus what it has access to. Every downstream security capability depends on observability data. Without it, your posture assessment is guessing, your detection is generic, and your enforcement is either too permissive or too disruptive.

The better first question is: “Can you see what AI agents exist in my clusters and what they’re actually doing at runtime?

To answer that question systematically, the requirements break down into five layers, each building on the one before it. Together, they form the Observability Stack for AI workloads.

The AI Workload Observability Stack

LayerCapabilityThe Question It Answers
5Identity MappingIf this agent is compromised, what can it reach?
4Execution GraphWhat’s the full chain — Agent → Tool → API → Data?
3Behavioral VisibilityWhat prompts are executing? Which tools are being invoked?
2AI-BOMWhat models, frameworks, tools, and data sources are actually in use?
1DiscoveryWhat AI agents exist in my clusters?

Each layer enables the next. Discovery tells you what exists. The AI-BOM tells you what each agent uses. Behavioral visibility tells you what it’s doing. The execution graph maps the full chain. And identity mapping tells you the blast radius. Skip a layer and the ones above it become unreliable.

Let’s walk through each one.

Layer 1: Discovery — Finding Every AI Agent in Your Clusters

Discovery is the hardest layer to get right, and the easiest to get wrong.

Here’s the core problem: developers are deploying AI agents without filing tickets, updating inventories, or notifying security. A backend engineer spins up a LangChain agent to automate QA testing. A data science team deploys an inference server to power an internal recommendation feature. A platform engineer connects an MCP tool runtime to give agents access to internal APIs. These deployments happen through standard Kubernetes manifests and CI/CD pipelines. There’s no “AI workload” checkbox.

Any observability approach that relies on you knowing about a workload before it can see it will miss these shadow AI deployments entirely — and shadow AI is where your biggest blind spot lives, because these are workloads with no security review, no access controls, and no monitoring.

Effective discovery needs to be automatic and runtime-based. The tool should detect AI agents, inference servers, frameworks (LangChain, AutoGPT, CrewAI, and others), and tool runtimes (MCP servers, custom tool chains) across all connected clusters — without manual configuration and without requiring developers to tag or register their workloads.

ARMO’s platform performs exactly this kind of Kubernetes-first discovery. It automatically detects agents, inference servers, and MCP tool runtimes as they appear — no tagging required. If a new LangChain agent shows up in your staging cluster at 2 AM, ARMO sees it by the time your security team checks in the next morning — a capability covered in depth in how to find every AI agent running in your clusters.

A red flag to watch for: if a vendor’s “discovery” feature shows you a list of cloud resources with an AI tag, that’s cloud asset inventory, not AI workload observability. The difference is behavioral visibility — seeing what agents actually do, not just that they exist.

Layers 2 and 3: The AI-BOM and Behavioral Visibility

Once you know what AI agents exist, the next two questions come fast: What does each one use? And what is it actually doing?

The AI-BOM: A Runtime Bill of Materials

Most security teams are familiar with SBOMs — static inventories of software dependencies generated at build time. An AI-BOM extends that concept, but with a critical distinction: it captures what an AI workload actually uses at runtime, not just what’s declared in its deployment manifest.

The gap between declared and actual matters enormously for AI workloads. A manifest might declare that a workload runs a Python container with certain dependencies. It won’t tell you that the agent dynamically loads a specific LLM, connects to a vector database the developer added last Thursday, makes calls to an external API that wasn’t in the original design, or pulls RAG data from a shared volume containing production customer records.

A runtime-derived AI-BOM captures all of this: the models actually loaded, the RAG sources actually connected, the external tools and APIs actually called, and the libraries actually running — with version tracking. It answers the question: “What AI capabilities does this workload actually have access to, and what does it actually exercise?”

ARMO generates this AI-BOM automatically from runtime observation, building a runtime-derived inventory that goes beyond static manifests — so security teams don’t need to rely on developers to self-report what their agents use. If an agent starts connecting to a new data source or loading a different model, the AI-BOM updates to reflect the change.

Prompt and Tool Call Visibility: The Behavioral Layer

The AI-BOM tells you what an agent has access to. Behavioral visibility tells you what it’s actually doing with that access.

This is where security observability and developer observability overlap in vocabulary but diverge in purpose. A developer tracing tool like LangSmith records prompt-response pairs so engineers can debug reasoning quality. Security-focused behavioral visibility records similar data but asks fundamentally different questions: Is this tool invocation authorized? Is data flowing to an approved destination? Is this prompt pattern consistent with the agent’s behavioral baseline — or does it look like injection?

Consider a practical example. Your customer-facing agent normally invokes a search tool, retrieves product information, and responds to user queries. Behavioral visibility shows this pattern repeating across thousands of interactions. Then one interaction triggers a different sequence: the agent invokes a code execution tool, generates a script that reads environment variables, and attempts to POST data to an unfamiliar external endpoint. Without behavioral visibility, that’s invisible. With it — specifically, visibility into what your AI agents are actually doing at the prompt and tool-call level — you have a signal worth investigating.

ARMO provides this behavioral visibility with privacy considerations built in. Prompt tracking can work privacy-safe using hashed prompt patterns rather than exposing sensitive content — the goal is detecting anomalous behavioral sequences, not reading customer conversations.

Layers 4 and 5: Execution Graphs and Identity Mapping

The first three layers tell you what exists, what it uses, and what it does. The final two layers answer the questions that matter most during incident response: What’s the full chain? And what’s the blast radius?

The Agent Execution Graph

An agent execution graph maps the complete chain of a single agent interaction: from the initial prompt, through tool invocations, to API calls, data access, and system operations. Instead of seeing isolated signals — a network connection here, a file read there — you see the full execution chain mapped from agent to data: Agent received prompt → invoked SQL tool → queried customer database → retrieved 2,400 records → passed data to summarization function → returned response to user.

That chain, rendered as a single graph, transforms how security teams analyze AI workload behavior. It’s the difference between reviewing five disconnected log entries and seeing one coherent narrative. When something goes wrong, the graph shows exactly where in the chain the deviation occurred, what data was exposed, and which systems were touched.

Identity Mapping

The execution graph shows what happened. Identity mapping shows what could happen.

By mapping every AI workload to its Kubernetes identities, service accounts, IAM roles, and network paths, you get the blast radius assessment that security teams need for incident response planning. If this agent is compromised via prompt injection, what service accounts does it run as? What APIs can those accounts reach? What data stores are within network policy scope? What lateral movement paths exist?

Most AI agents are deployed with permissions far exceeding what they need — a reality that’s hard to remediate without first understanding what the agent actually uses (back to the AI-BOM) and what it actually does (back to behavioral visibility). Identity mapping closes the loop: it connects observed runtime behavior to the full permission and access context so security teams can identify the gap between “what this agent actually does” and “what this agent could do if compromised.”

Why Existing Tools Fall Short: The Architectural Divide

Security teams evaluating AI observability tools will encounter three categories of solutions, each with a structural limitation that determines what it can and can’t see.

Where Each Tool Category Falls Short

CapabilityAgentless CNAPP (e.g., Wiz, Prisma Cloud)System-Call Runtime (e.g., Falco, Sysdig)Developer Observability (e.g., LangSmith, Arize)
AI agent discoveryCloud resource tags onlyContainer-level process detectionRequires manual SDK integration
Behavioral visibilityNone — snapshots configuration stateSystem calls only — no application contextFull prompt/response traces
Security contextPosture findings based on configurationProcess-level anomaly alertsNone — designed for debugging, not security
Risk assessmentTheoretical (what could happen)Generic (treats AI agents like any container)Performance-focused (latency, token cost)
Path to enforcementPolicy recommendations without behavioral dataSystem-call-level blocking (too coarse for AI)No enforcement capabilities

Agentless CNAPP tools scan cloud APIs and configuration metadata without deploying agents — a model that worked for posture management but doesn’t work for understanding what an AI agent does in production, because it never watches the workload operate. An agentless scanner can tell you that an AI workload has admin permissions. It can’t tell you whether those permissions are necessary for the agent’s actual behavior or whether they represent exploitable attack surface.

System-call-level runtime tools like Falco operate at the kernel level, monitoring process creation, file access, and network connections. They can detect that a container started an unexpected process. But they operate below the application layer — they have no concept of prompts, tool invocations, or agent reasoning chains. An AI agent executing a prompt injection attack and an AI agent performing its normal function may produce identical system-call patterns. The distinction only becomes visible at the application layer.

Developer observability tools like LangSmith and Arize provide rich prompt-level visibility, but they don’t ask security questions. They won’t tell you if a tool invocation is unauthorized, if data is flowing to an unapproved destination, or if the agent’s behavior is deviating from its security baseline. They also require manual SDK integration — meaning they only see agents that developers explicitly instrumented, which excludes shadow AI deployments entirely.

What’s missing from all three categories is the combination that AI workload security requires: runtime behavioral visibility at the application layer, delivered with security context, without requiring manual instrumentation.

This is the architectural foundation ARMO’s behavioral CADR was built on. ARMO’s eBPF-based sensor operates across the full stack — cloud infrastructure, Kubernetes orchestration, container runtime, and application layer — correlating signals across all four levels into a unified view. It detects AI agents automatically (no SDK integration required), observes their behavior at the prompt and tool-call level, and enriches every finding with Kubernetes identity and network context. The sensor runs at 1–2.5% CPU and 1% memory overhead — within the performance budget most platform teams accept for production monitoring.

From Observability to Action: What Comes Next

Observability is the foundation. It’s not the destination.

Once you have runtime visibility into your AI agents, three capabilities become possible that weren’t before — and they build on each other in a specific sequence.

Posture assessment with runtime context. Instead of surfacing 500 theoretical findings and asking your team to triage based on severity scores, a runtime-informed posture tool can distinguish between permissions an agent actually exercises and permissions it simply has. That distinction alone — theoretical risk versus actual runtime exposure — typically reduces actionable findings by 80% or more. Your team stops triaging noise and starts focusing on the gaps that represent real risk.

AI-native threat detection. When your observability layer knows which workloads are AI agents and what their normal behavioral patterns look like, detection rules can go beyond generic container alerts. Instead of “unexpected network connection” — the kind of generic alert that trains analysts to auto-close after the third false positive — you get “Agent X invoked an unauthorized tool following a prompt pattern inconsistent with its behavioral baseline” — an AI-specific finding with enough context to act immediately.

Progressive enforcement without breaking production. This is where observability pays its biggest dividend. Security teams know they should constrain AI agents to least privilege. They know they should restrict which tools an agent can invoke, which APIs it can call, which network destinations it can reach. But they can’t write those policies because they don’t yet understand what the agents actually need. The result is policy paralysis: either you write overly restrictive policies that break production (engineering escalates, security backs off, the policies get removed) or you write permissive policies that leave security gaps.

Observability resolves this. Instead of writing enforcement policies from documentation and hope, security teams can observe an agent’s actual behavior for a defined period, see exactly which tools, APIs, and data sources it needs, and then enforce a policy that allows only those behaviors. ARMO calls this “observe-to-enforce” — and it solves the policy paralysis problem that stops most security teams from ever constraining AI agents at all.

ARMO’s Cloud Application Detection and Response (CADR) platform is built around this progression. The same eBPF sensor that provides observability also builds the behavioral baselines used for detection and generates the policy recommendations used for enforcement. It’s a single architecture serving all four pillars — observability, posture, detection, enforcement — rather than four disconnected tools that need manual correlation.

The quantified outcomes from this architecture: 90%+ reduction in vulnerability noise through runtime reachability analysis (only flagging CVEs in code actually loaded and executing), 90%+ faster incident investigation through LLM-powered attack story generation (correlating signals into narratives rather than presenting alert lists), and 80%+ reduction in actionable findings through runtime-based prioritization.

This shift isn’t just ARMO’s perspective. The 2025 Latio Cloud Security Market Report found that AI visibility was the number one emerging priority for 65% of cloud security professionals — ahead of both application detection and response and access management. The industry is recognizing what practitioners already know: you can’t secure what you can’t see.

All built on open-source foundations. ARMO’s platform is built on Kubescape, one of the most widely adopted cloud-native security projects — used by more than 100,000 organizations with 11,000+ GitHub stars. That community validation isn’t a marketing metric. It’s an architecture that’s been tested in production Kubernetes environments at scale.

Start With What You Can See

AI agents are the fastest-growing workload category in enterprise Kubernetes environments. They’re also the least visible. Your existing security tools were built for deterministic workloads and will continue doing a fine job securing those. But for workloads that generate code, invoke tools autonomously, discover capabilities at runtime, and behave non-deterministically by design, you need a different kind of observability.

Not the kind that traces reasoning chains for debugging. The kind that answers the questions security teams actually ask: What AI agents exist in my clusters? What are they doing? What do they have access to? Is their behavior normal? And if something goes wrong, what’s the blast radius?

If you’re running Kubernetes in production and your teams are deploying AI agents — or if you suspect they might be, even without telling you — then observability isn’t a feature to evaluate later. It’s the prerequisite for every other security decision you’ll make about these workloads.

That’s what the five-layer Observability Stack provides — a structured path from “we don’t know what’s running” to “we can see everything and enforce controls based on evidence.” Each layer is covered in depth in the linked articles throughout this guide. Start with the one that matches your biggest current gap.

And if you want to see all five layers working together in a production Kubernetes environment — AI agents discovered automatically, behavioral baselines built from runtime observation, execution graphs mapping every tool call to an identity — ARMO’s platform shows you exactly what your security stack is currently missing.

See your AI agents in context. Request a demo of the ARMO Platform to see how runtime observability works in your environment.

Frequently Asked Questions

What’s the difference between AI observability for developers and AI observability for security? Developer observability tools like LangSmith and Arize trace reasoning chains, monitor latency, and debug hallucinations. Security observability answers a different set of questions: what AI agents exist in your clusters, what they’re doing at runtime, what data and APIs they access, and what happens if one is compromised. Your ML team can have full trace visibility and your security team can still have zero visibility into the same agents.

How do AI agents create security blind spots that traditional tools miss?
AI agents are non-deterministic by design — they generate code, chain tool calls in unpredictable sequences, and discover new capabilities at runtime. Traditional security tools were built for deterministic workloads where behavior follows code. A container scanner sees known CVEs, a CNAPP sees configuration state, and a SIEM collects audit logs, but none of them understand prompts, tool invocations, or what “normal” looks like for a workload that behaves differently every time it runs.

What is an AI-BOM and how is it different from an SBOM?
An SBOM is a static inventory of software dependencies generated at build time. An AI-BOM captures what an AI workload actually uses at runtime: the models it loads, the RAG sources it connects to, the external tools and APIs it calls, and the libraries it exercises. The gap matters because AI agents dynamically load models, connect to data sources, and invoke tools that never appear in deployment manifests.

Why does observability need to come before detection and enforcement?
You can’t detect AI-specific threats without first knowing which workloads are AI agents. You can’t enforce least privilege without behavioral baselines showing what an agent actually needs. Every downstream security capability — posture assessment, threat detection, policy enforcement — depends on observability data. Without it, your detection is generic and your enforcement is either too permissive or too disruptive.

Can eBPF-based runtime monitoring handle production workloads without performance impact? Modern eBPF-based sensors operate at the kernel level without intercepting application traffic. ARMO’s sensor runs at 1–2.5% CPU and approximately 1% memory overhead — within the performance budget most platform teams accept for production monitoring. This is significantly lighter than older sidecar or inline proxy approaches.

What is shadow AI and why is it a security risk?
Shadow AI refers to AI agents, inference servers, and tool runtimes deployed by development teams without notifying security. These workloads bypass security review, lack access controls, and have no monitoring — making them the largest blind spot in most Kubernetes environments. Any observability approach that requires you to know about a workload before it can see it will miss shadow AI entirely.

Close

Your Cloud Security Advantage Starts Here

Webinars
Data Sheets
Surveys and more
Group 1410190284
Ben Hirschberg CTO & Co-Founder
Rotem_sec_exp_200
Rotem Refael VP R&D
Group 1410191140
Amit Schendel Security researcher
slack_logos Continue to Slack

Get the information you need directly from our experts!

new-messageContinue as a guest