
Prompt and Tool Call Visibility: What Your AI Agents Are Actually Doing

Apr 29, 2026

Yossi Ben Naim
VP of Product Management

Key takeaways

  • Why isn't a developer tracing tool enough for security visibility? LangSmith, Arize, Phoenix, and similar tools were designed to answer debugging questions: token usage, prompt-response quality, latency, hallucination patterns. They capture much of the same raw data security needs, but record it without the authorization context, baseline reference, or downstream correlation IDs a SOC analyst requires to triage an incident. Same data shape, different schema, different retention.
  • What does a security-grade prompt and tool call record need to contain? Five fields: entity identity, intent context, authorization context, baseline context, and downstream linkage. Neither LangSmith nor eBPF alone supplies all five; the stack has to compose across four capture points, because no single point sees everything. Records missing any of these fields are debugging telemetry with a security label.

It is 11:47 p.m. and the on-call security engineer is staring at two dashboards. On the left, LangSmith — the ML team’s debugging stack — showing the agent’s prompts, model responses, tool calls, and tokens consumed. On the right, the runtime detection console showing eBPF-captured syscalls, network connections, and process trees from the same Pod. Both are populated. Neither answers the question on the runbook: was this tool invocation authorized for this agent, and did its behavior deviate from baseline?

Most security teams running AI agents in production have some form of prompt and tool call visibility — usually accidentally, by virtue of the ML team installing a tracing tool and the security team installing a runtime sensor. What they don’t have is a deliberate visibility layer designed for the questions security actually asks. Prompt injection detection, tool misuse detection, behavioral baselines, and execution graph triage all assume this layer exists; it usually does not, in the form they need. We have previously laid out the full runtime observability model for AI agents across that downstream stack. This article goes one level deeper into the foundational layer those disciplines depend on: what a security-grade prompt and tool call record must contain, where you can capture it, and why every obvious capture point leaves a structural blind spot that defensible coverage requires composing across.

Why Prompt and Tool Call Visibility Is a Distinct Security Layer

The instinct to treat AI agent observability as one problem is the source of most confusion in this space. There are three distinct disciplines wearing the same word. Developer observability debugs non-deterministic systems — why did the model hallucinate, how many tokens did this cost. LangSmith, Arize, Phoenix, and the OpenTelemetry GenAI semantic conventions all live here. The consumer is a developer fixing a bug; the retention horizon is days. Kernel observability monitors what the operating system does — what process spawned, what syscall fired. Falco, Tetragon, eBPF-based runtime sensors all live here. This is where attacker actions eventually produce evidence — but only after the application-layer decisions that triggered them have already happened.

Prompt and tool call visibility is neither. It lives at the application boundary where prompts arrive and tool invocations dispatch, asking security questions about that boundary: was this invocation authorized for this agent, does this prompt pattern match the baseline, did this tool call produce the kernel-layer evidence we expected. The data shape has to be designed for that question set, not borrowed from either neighbor. The OpenTelemetry GenAI conventions encode this distinction in their default behavior — instrumentations are explicitly told not to capture message content unless an operator opts in, because security-grade visibility has different privacy and retention requirements than developer telemetry.

The Four Capture Points and What Each One Structurally Cannot See

The premise most “AI observability” articles skip: every capture point in the stack has a structural blind spot. Pick one and you ship a stack with a known gap. Defensible coverage requires composing across them with the gaps explicit.

Agent SDK instrumentation

The SDK layer covers code that runs inside the agent’s own process. The dominant standard now is the OpenTelemetry GenAI semantic conventions, which define standardized attributes for input messages, output messages, tool names, operation types, agent identity, and conversation identity — the building blocks any structured prompt and tool call record needs. The SDK approach has the highest fidelity of any capture point. It sees the prompt before serialization, the tool argument before marshaling, the response before parsing.
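A minimal sketch of what SDK-layer emission can look like under the GenAI conventions, assuming the opentelemetry-api package and an already-configured tracer provider; the args-hash attribute is a hypothetical custom field, not part of the conventions:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.instrumentation")

def record_tool_call(agent_name: str, tool_name: str,
                     conversation_id: str, args_hash: str) -> None:
    # One span per tool invocation, attributed per the OTel GenAI
    # semantic conventions; message content stays off the span by
    # default, matching the conventions' opt-in posture.
    with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", tool_name)
        span.set_attribute("gen_ai.agent.name", agent_name)
        span.set_attribute("gen_ai.conversation.id", conversation_id)
        # Hypothetical custom attribute: a hash of the arguments,
        # never the arguments themselves.
        span.set_attribute("app.tool.args_hash", args_hash)
```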

The structural blind spot is trust. Instrumentation that runs in the agent’s process trusts the agent to faithfully emit its own telemetry. A compromised agent, or simply an agent path that bypasses instrumentation — a developer adding a new tool but forgetting to wire it into the tracer — produces no record. Coverage gaps from drift are far more common than from malice, and equally invisible.

Framework callbacks

LangChain exposes a callback handler interface with hooks for chain starts, tool invocations, and LLM completions. LangGraph has its own checkpointing model. AutoGen has agent-level middleware. The OpenAI Agents SDK and Anthropic’s tool-use APIs each define their own callback patterns. Framework callbacks see the agent’s reasoning at the framework’s granularity — closer to security’s question set than raw API logs, because the framework already knows what counts as a tool call.
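A minimal sketch of framework-layer capture, assuming the langchain-core package; the hook names come from LangChain's BaseCallbackHandler, while emit() is a hypothetical stand-in for the security pipeline sink:

```python
import json
import sys

from langchain_core.callbacks import BaseCallbackHandler

def emit(record: dict) -> None:
    # Stand-in sink; a real deployment ships to the security pipeline.
    sys.stdout.write(json.dumps(record) + "\n")

class SecurityCaptureHandler(BaseCallbackHandler):
    def on_tool_start(self, serialized, input_str, *, run_id,
                      parent_run_id=None, **kwargs):
        # Fires only for tools registered with the framework; the
        # bypass blind spot described below applies to everything else.
        emit({
            "event": "tool_start",
            "tool": (serialized or {}).get("name"),
            "run_id": str(run_id),
            "parent_run_id": str(parent_run_id) if parent_run_id else None,
        })

    def on_tool_end(self, output, *, run_id, **kwargs):
        emit({"event": "tool_end", "run_id": str(run_id)})
```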

The structural blind spot is bypass. Framework callbacks fire only when code paths inside the framework are invoked. Code that calls the underlying model client directly, bypassing the chain; code that hits a tool endpoint outside the framework’s tool registry; and code moving through custom orchestration all emit nothing. In practice, every production agent stack we have audited contains at least one bypass path the callbacks miss.

Sidecar proxy or service mesh

The proxy layer captures traffic at the network boundary between the agent and its model or tool endpoints — typically Envoy as a sidecar, an Istio mesh handling mTLS termination, or a custom forward proxy with cert injection. The operational advantage: no code changes required.

The structural blind spots are several, and each is operationally expensive. To inspect outbound HTTPS traffic to OpenAI, Anthropic, or AWS Bedrock endpoints, you need TLS termination — either a forward proxy injecting its own certificate authority into the agent’s trust store, or a service mesh terminating outbound TLS at the sidecar. Compliance teams in regulated industries often refuse to authorize TLS interception. Even after termination, the proxy needs per-provider parsers — the OpenAI Chat Completions API, the Anthropic Messages API, AWS Bedrock model invocation, and Google Vertex AI prediction each use different schemas, and a generic HTTP proxy speaks none of them. And the proxy is blind to in-process tool invocations: a LangChain tool wrapping a local Python function never crosses the network, so the proxy never sees it.
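A minimal sketch of that per-provider parsing burden, assuming TLS has already been terminated and the response body decoded to JSON; the field paths follow the public OpenAI Chat Completions and Anthropic Messages response schemas, and the normalized record shape is illustrative:

```python
def normalize_model_response(host: str, body: dict) -> dict:
    if host.endswith("api.openai.com"):
        # Chat Completions: choices[].message carries role and any
        # tool_calls; usage.total_tokens carries the token count.
        msg = body["choices"][0]["message"]
        return {"provider": "openai",
                "role": msg["role"],
                "tool_calls": msg.get("tool_calls") or [],
                "tokens": body.get("usage", {}).get("total_tokens")}
    if host.endswith("api.anthropic.com"):
        # Messages API: content is a list of typed blocks, with
        # tool invocations appearing as type == "tool_use".
        tool_uses = [b for b in body.get("content", [])
                     if b.get("type") == "tool_use"]
        return {"provider": "anthropic",
                "role": body.get("role"),
                "tool_calls": tool_uses,
                "tokens": body.get("usage", {}).get("output_tokens")}
    raise ValueError(f"no parser for provider host {host}")
```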

Kernel-level capture (eBPF)

The kernel layer sees every syscall, file open, network connection, and process spawn. eBPF programs attached to tracepoints, kprobes, and uprobes give a complete view of what the OS does on behalf of the agent. Production runtime sensors typically operate at 1–2.5% CPU and 1% memory overhead. The structural blind spot is application semantics. eBPF observes bytes; it does not natively know what a prompt is. A subtler version of this gap matters in practice: eBPF can uprobe userspace functions — including the encryption and decryption entry points in the OpenSSL library — to capture plaintext before it leaves the process, the technique behind tools like BCC’s sslsniff. So the simple framing “eBPF can’t see encrypted payloads” is incomplete. The accurate framing: eBPF can capture the bytes, but reconstructing them into authorization-aware records requires application-layer logic that does not naturally live in BPF programs and has to be assembled in user space. The gap relocates; it does not dissolve.
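A minimal sketch of the SSL-uprobe technique using BCC, the library behind sslsniff; this assumes root privileges and an agent linking the system libssl, and it only prints buffer lengths where a real sensor would read the plaintext and hand it to user-space reconstruction:

```python
from bcc import BPF

# BPF program attached at OpenSSL's SSL_write entry point; it sees the
# plaintext buffer before encryption. A real sensor would copy the
# buffer out via a BPF map rather than printing its length.
prog = r"""
#include <uapi/linux/ptrace.h>

int probe_ssl_write(struct pt_regs *ctx, void *ssl, const void *buf, int num) {
    bpf_trace_printk("SSL_write %d bytes\n", num);
    return 0;
}
"""

b = BPF(text=prog)
# "ssl" resolves to the system libssl shared object.
b.attach_uprobe(name="ssl", sym="SSL_write", fn_name="probe_ssl_write")
b.trace_print()
```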

What this means for stack design

Every capture point has at least one structural blind spot, and the blind spots are not the same. The SDK layer trusts the agent. Framework callbacks bypass when code leaves the framework. The proxy needs TLS termination it often cannot get and is blind to in-process tools. The kernel sees bytes but not authorization decisions. Defensible visibility is composition, not selection. A stack capturing at the framework layer for in-framework tool calls, at the proxy layer for outbound API calls when TLS termination is feasible, and at the kernel layer for everything else — with a correlation ID stitching them — has each blind spot covered by another layer’s strengths. A stack that picks one and stops has the others as silent gaps.

| Capture point | What it sees | Structural blind spot | Trust assumption |
| --- | --- | --- | --- |
| Agent SDK | Pre-serialization prompts and tool arguments via OpenTelemetry GenAI conventions; highest fidelity | Bypass and instrumentation drift; new code paths emit nothing until wired in | Trusts the agent process to faithfully emit its own telemetry |
| Framework callbacks | Tool calls and chain steps at framework granularity (LangChain, LangGraph, AutoGen) | Bypass at framework boundary; direct API calls and custom orchestration emit nothing | Trusts that all agent activity flows through the framework’s hooks |
| Sidecar / proxy | Wire-level traffic to model and tool endpoints when TLS termination is in place | TLS interception cost; per-provider parsers required; in-process tools never cross the network | Trusts that compliance permits TLS termination and that all tools cross the network |
| Kernel (eBPF) | Syscalls, network, file, process; SSL uprobes capture plaintext pre-encryption | No native application semantics; record reconstruction lives in user space | Trusts user-space reconstruction to assemble bytes into security-grade records |

The Five Fields a Security-Grade Record Must Contain

The capture-point question is where. The schema question is what. A record can be captured at any of the four points and still fail to feed downstream security work if it lacks the fields below. Each is required because at least one downstream consumer — baseline, drift detection, triage, full-chain reconstruction — fails without it.

Entity identity. Which Kubernetes Deployment, ServiceAccount, agent name, and Pod produced this record. Sounds trivial; is not. Correlating an outbound request to OpenAI back to a specific Pod requires one of three approaches: the SDK attaches Pod metadata via OpenTelemetry resource attributes pulled from the downward API, the sidecar injects Pod identity headers, or the eBPF sensor maps socket to cgroup to Pod. Most stacks get one path working and assume the others work too. Without entity identity at the Deployment level, the per-Deployment behavioral profile downstream baselines need cannot be assembled.
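A minimal sketch of the first path, assuming the opentelemetry-sdk package and a Pod spec that exposes metadata to the container through the Kubernetes downward API; the environment variable names are the operator's choice, not Kubernetes defaults:

```python
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Attribute keys follow the OpenTelemetry Kubernetes resource
# conventions; pod name and namespace arrive via downward-API env
# vars (fieldRef: metadata.name, metadata.namespace). The Deployment
# name is not exposed by the downward API directly and is typically
# set explicitly, for example from a label, at deploy time.
resource = Resource.create({
    "service.name": os.environ.get("AGENT_NAME", "unknown-agent"),
    "k8s.pod.name": os.environ.get("POD_NAME", ""),
    "k8s.namespace.name": os.environ.get("POD_NAMESPACE", ""),
    "k8s.deployment.name": os.environ.get("DEPLOYMENT_NAME", ""),
})
trace.set_tracer_provider(TracerProvider(resource=resource))
```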

Intent context. What user-facing operation, scheduled task, or upstream agent triggered this prompt or tool call. In OpenTelemetry terms, the trace ID and parent span ID — typically propagated via W3C Trace Context. It works cleanly when an upstream service propagates context through. It breaks in multi-agent systems where Agent A delegates to Agent B via in-memory queues, shared state, or framework-internal handoffs — propagation has to be wired manually.
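A minimal sketch of wiring that propagation manually across an in-memory handoff, assuming the opentelemetry-api package; the trace context is serialized into the handoff payload because no HTTP hop exists to carry the traceparent header automatically:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("agent.handoff")

def delegate(task: dict) -> dict:
    # Agent A: stamp the current W3C Trace Context onto the payload.
    carrier: dict = {}
    inject(carrier)  # writes the traceparent header form into the dict
    return {"task": task, "otel": carrier}

def receive(message: dict) -> None:
    # Agent B: restore the upstream context so its spans parent correctly.
    ctx = extract(message["otel"])
    with tracer.start_as_current_span("handle_delegated_task", context=ctx):
        ...  # Agent B's work runs with intent context preserved
```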

Authorization context. What the agent’s declared scope is for the tool being invoked, and whether this invocation falls inside or outside it. This is the field most existing telemetry stacks omit, and the one that makes a record security telemetry rather than debugging telemetry. The agent’s declared scope lives in three places — LangChain tool definitions, MCP tool manifests, and IAM policies for the underlying APIs — and the visibility layer has to either reconcile against those at capture time or carry enough metadata for downstream to do it. Real engineering, not a configuration switch.
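A minimal sketch of stamping authorization context at capture time, assuming the agent's declared scope has already been loaded from its tool definitions; the in-memory scope table is illustrative, not an MCP manifest or IAM policy schema:

```python
# Loaded at startup from LangChain tool definitions, MCP manifests,
# or IAM policy; hardcoded here for the sketch.
DECLARED_SCOPE = {
    "support-agent": {"search_tickets", "read_kb_article"},
}

def authorization_context(agent: str, tool: str) -> dict:
    allowed = DECLARED_SCOPE.get(agent, set())
    return {
        "declared_scope": sorted(allowed),
        # The verdict field that turns debugging telemetry into
        # security telemetry.
        "in_scope": tool in allowed,
    }
```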

Baseline context. Whether the prompt pattern, tool, destination, and sequence in this record fall inside the agent’s established behavioral envelope. The visibility layer’s job is to carry the fields that make baseline comparison possible — the actual comparison runs downstream. Required: the prompt pattern hash (not the content), the tool name, the destination, the sequence position relative to recent invocations, and the agent’s identity. We have previously broken down the methodology for distinguishing legitimate behavioral evolution from compromise indicators in the context of intent drift detection.

Downstream linkage. A correlation ID tying this application-layer record to the syscall, network, and identity events the eBPF sensor captures at the same moment in the same Pod. Three approaches are common: time-window correlation by Pod and timestamp, trace-ID injection into outbound HTTP headers so the kernel sensor can lift it, or cgroup matching with sub-second precision. None are free. The chain reconstruction in the production prompt injection detection methodology and the rogue agent walkthrough on tool misuse and API abuse both depend on this linkage being in place.
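A minimal sketch of the trace-ID injection approach, assuming the requests and opentelemetry-api packages; the kernel sensor is assumed to lift the traceparent header from plaintext captured at the SSL uprobe:

```python
import requests
from opentelemetry.propagate import inject

def call_tool_endpoint(url: str, payload: dict) -> requests.Response:
    headers: dict = {}
    inject(headers)  # adds traceparent: 00-<trace_id>-<span_id>-<flags>
    # The eBPF sensor sees this header pre-encryption and can join
    # kernel-layer events to the application-layer record by trace ID.
    return requests.post(url, json=payload, headers=headers)
```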

A record carrying all five fields is downstream-usable. A record missing any one is debugging telemetry with a security label.
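Put together, a minimal sketch of the record shape, standard library only; every field name here is illustrative:

```python
from dataclasses import dataclass

@dataclass
class ToolCallRecord:
    # 1. Entity identity: who produced this record
    deployment: str
    service_account: str
    agent_name: str
    pod: str
    # 2. Intent context: what triggered it
    trace_id: str
    parent_span_id: str
    # 3. Authorization context: declared scope and the in/out verdict
    declared_scope: list[str]
    in_scope: bool
    # 4. Baseline context: comparison inputs, not the comparison itself
    prompt_pattern_hash: str
    tool_name: str
    destination: str
    sequence_position: int
    # 5. Downstream linkage: the join key to kernel-layer events
    correlation_id: str
```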

Privacy-Safe by Design, Not as an Afterthought

The parent observability discussion mentions privacy in a single line: hashed prompt patterns. The operational reality is a 2×3 problem.

Two data classes carry different sensitivity profiles. Prompt content is what users type or what RAG documents inject. Tool call parameters are often more sensitive than prompts — a SQL query containing a customer email, an HTTP body POSTed to a third party, a function argument carrying PII. Treating both as a single privacy class is the most common design mistake. The redaction strategy for natural-language prompts is not the strategy for structured tool call parameters with high signal density per byte.

Three mechanisms carry different trade-offs. Content hashing — a cryptographic hash of normalized content — preserves exact-match lookup and destroys everything else, including the similarity comparison most baselines rely on. Shape-preserving hashing like SimHash or MinHash on token sequences preserves semantic similarity for baseline comparison while breaking content readability. Structured field redaction preserves authorization context and schema while replacing values with redaction tokens; it requires an engine like Microsoft Presidio, AWS Macie, Google DLP, or per-tool schema-aware rules.
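A minimal sketch of the middle option, a SimHash over whitespace tokens using only the standard library; production deployments typically reach for a vetted library, and the tokenization here is an assumption:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    # Each token votes per bit position; similar texts share most bits.
    vec = [0] * bits
    for tok in text.lower().split():
        h = int.from_bytes(
            hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for i in range(bits):
            vec[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vec[i] > 0)

def hamming(a: int, b: int) -> int:
    # Small distance means similar prompt shape; the content itself
    # is never stored, only the fingerprint.
    return bin(a ^ b).count("1")
```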

A fourth strategy operates at volume rather than content. Sampling-with-trigger-capture keeps low-volume metadata always and full payloads only when an anomaly fires. The cost the SOC needs to understand: attackers who learn the sampling threshold can stay under it.
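A minimal sketch of that trigger pattern, standard library only; the buffer depth and what counts as an anomaly are both assumptions:

```python
from collections import deque

class TriggerCapture:
    """Metadata always; full payloads only around an anomaly."""

    def __init__(self, depth: int = 256):
        # Ring buffer of recent full payloads, bounded by depth.
        self.ring: deque[bytes] = deque(maxlen=depth)

    def observe(self, metadata: dict, payload: bytes) -> dict:
        self.ring.append(payload)
        return metadata  # the always-on, low-volume path

    def on_anomaly(self) -> list[bytes]:
        # Flush the recent window for full-fidelity triage.
        captured = list(self.ring)
        self.ring.clear()
        return captured
```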

The OpenTelemetry GenAI conventions encode the safer default explicitly. Instrumentations are told not to capture message content unless the operator sets the opt-in environment variable, and the recommended production pattern is to store content externally and record only references on the spans. This is the right default for any stack carrying content under EU AI Act Article 12 logging obligations or GDPR Article 5 minimization principles. (None of this is legal advice; the regulatory anchors are named so the architect knows where the questions live.)

How a Composed Visibility Layer Looks in Practice

Dissolving the abstraction: ARMO operates primarily at the kernel layer, deploying eBPF sensors as a DaemonSet across EKS, AKS, and GKE clusters at 1–2.5% CPU and 1% memory overhead per node. The kernel-layer capture supplies syscall, process, file, and network events with deep application-layer reconstruction from the bytes the sensor observes. Application Profile DNA assembles those events into per-Deployment behavioral profiles that survive Pod churn, supplying the baseline-context field downstream. The runtime-derived AI-BOM, covered in detail in the AI-BOM walkthrough, supplies the entity identity field by inventorying agents, tools, and data sources from observed execution rather than declared manifests. CADR — ARMO’s Cloud Application Detection and Response platform — correlates these signals across cloud, Kubernetes, container, and application layers, supplying the downstream linkage field by entity and process lineage. This composition is what ARMO’s cloud-native security platform for AI workloads operationalizes across managed Kubernetes, on-prem, and air-gapped environments.

The four capture points compose; no single layer carries all five fields. The architecture is what matters, not the slogan.

A Short Evaluation Checklist

Five questions to run against any vendor or stack claiming “AI agent visibility”:

1. At which capture points does the stack collect prompt and tool call records — SDK, framework, proxy, kernel, or some composition? If one layer only, what covers the structural blind spot of that layer?

2. For each captured record, are the five fields present — entity identity, intent context, authorization context, baseline context, downstream linkage? If any is missing, which downstream consumer (baseline, drift detection, triage, chain reconstruction) loses what?

3. What privacy mechanism does the stack use for prompt content versus tool call parameters? Are content hashing, shape-preserving hashing, structured redaction, and sampling distinguished by data class, or applied uniformly?

4. What is the correlation ID linking application-layer records to kernel-layer events captured in the same Pod at the same moment — trace-ID injection, time-window correlation, or cgroup matching — and what is the failure mode when the mechanism breaks?

5. What happens during the baseline learning window — does the stack run blind, fail closed, or fail open? The answer determines whether new agents are protected or unprotected on the day they ship.

A stack that answers these crisply has a real visibility layer. A stack that answers in marketing language has a dashboard.

To see the four-capture-point composition and five-field record schema running against AI agents in your own Kubernetes cluster — with kernel-layer eBPF, per-Deployment baselines, and cross-layer correlation wired together — book an ARMO demo.

Frequently Asked Questions

Can we use LangSmith traces directly for security?

No. LangSmith captures the right kind of data — prompts, responses, tool calls — but in a schema designed for debugging. The traces lack the authorization context field, the per-Deployment baseline reference, and the correlation ID linking them to runtime events. Retention is also wrong: developer tracing keeps detail for days; security incident reconstruction needs months. Use LangSmith for what it is for, not as the security layer.

Is eBPF alone enough for prompt and tool call visibility?

No. eBPF gives you the bytes and syscalls, but reconstructing them into authorization-aware records requires application-layer logic that does not naturally live in BPF programs. eBPF is a capture point, not a layer. It excels at the kernel-level fields and pairs well with framework callbacks, but it is not a complete visibility solution by itself. We have walked through where eBPF hits its application-layer ceiling in the context of enforcement.

How do we handle prompt content privacy under regulatory regimes like the EU AI Act or GDPR?

Treat prompt content and tool call parameters as separate data classes with separate handling. Default to not capturing content; opt in for specific high-value agents where the security benefit justifies retention cost. Use shape-preserving hashing for prompt patterns when the downstream question is similarity. Use structured redaction for tool call parameters where PII is structurally predictable. The OpenTelemetry GenAI conventions encode the right default: instrumentations off by default, opt-in per environment, content stored externally with only references on spans.

What is the minimum viable instrumentation set for a team just starting?

Framework callbacks plus kernel-layer eBPF, with a correlation ID — usually a trace ID injected into outbound headers — bridging the two. Framework callbacks supply application semantics; the kernel sensor supplies entity identity, downstream linkage, and the blind-spot coverage when callbacks bypass. Add a sidecar proxy when TLS termination becomes feasible and outbound traffic to model providers is a major surface.

How does this layer relate to AI-BOM, behavioral baselines, and execution graph triage?

The visibility layer is the input stream those disciplines consume. AI-BOM is the inventory derived from observed execution. The behavioral baseline is the profile assembled over weeks. The execution graph is the assembled chain of records correlated by entity and timestamp. Triage operates on the assembled graph. Each downstream layer fails silently if visibility records arrive incomplete — which is why the five-field schema matters more than the dashboard does.
