Get the latest, first
arrowBlog
Runtime Observability for LangChain and AutoGPT on Kubernetes

Runtime Observability for LangChain and AutoGPT on Kubernetes

Apr 28, 2026

Yossi Ben Naim
VP of Product Management

A platform team at a mid-size SaaS company runs three LangChain agents and one AutoGPT-derived planner on EKS. LangSmith is wired in. OpenTelemetry traces flow into their observability stack. Falco runs on every node. The setup is what most security teams would consider thorough.

A pip dependency in one of the agents’ tool packages ships a malicious update. When the agent runs, the compromised tool reads in-memory secrets, opens a connection to an attacker-controlled domain, and exfiltrates them — but not before patching the LangChain BaseCallbackHandler instances in the same process to silence the trace events that would have surfaced the unusual behavior.

The LangSmith dashboard stays green for the entire incident. The OpenTelemetry traces show normal latency. The team had observability. They did not have security observability.

This is not a LangChain bug or an AutoGPT bug. It is a structural property of any observability stack that runs in the same process as the workload it watches. LangChain and AutoGPT make the property especially acute because both frameworks ship rich in-process telemetry that produces the correct intuition that something is being watched — and the wrong intuition that the security team has visibility.

This article walks the five-tier telemetry trust hierarchy for these frameworks on Kubernetes, ranked by adversary resistance rather than integration ease, and shows where each tier earns its keep. It assumes LangChain 0.2 or later with LangGraph for new code (classic AgentExecutor patterns are flagged where they differ) and covers both classic loop-based AutoGPT and Forge agents as well as the AutoGPT Platform workflow engine. The argument is the same; the surfaces differ. It also applies whether chains run as Python libraries inside their own pods or are wrapped in LangServe and exposed as HTTP endpoints — LangServe shifts what’s visible at Tier 3, but doesn’t change the core hierarchy.

Why Framework-Internal Telemetry Isn’t the Load-Bearing Layer

LangChain’s CallbackManager system (the family of BaseCallbackHandler subclasses, or events shipped to LangSmith via the LangChainTracer handler) gives developers a clean way to observe every chain start, tool invocation, LLM call, and agent step. AutoGPT exposes equivalent hooks through its plugin interface and event lifecycle. LangGraph adds node-transition events and shared-state mutations on top of LangChain’s primitives.

For debugging hallucinations, tracking token spend, and reconstructing reasoning chains, this telemetry is excellent. For security observability, it has one structural property that determines its place in the stack: every callback handler runs in the same Python process as the agent and the tools the agent invokes.

Consider what that means under compromise. An attacker who has reached code execution inside a tool — through a poisoned dependency, a malicious MCP tool description, or an indirect prompt injection in production AI agents that drove the agent to invoke compromised code — sits in the same memory space as every callback handler. The Python interpreter does not enforce a trust boundary between application code and observability code. Monkey-patching BaseCallbackHandler.on_tool_start to no-op is two lines; replacing the LangChainTracer instance with a silent stand-in is three. The traces that would have surfaced the attack stop existing before any visible behavior occurs.

The same property holds for AutoGPT plugins, which run in-process by design, and for any OpenTelemetry instrumentation that lives in the agent’s Python interpreter. None of these tools are bad; they are simply located on the wrong side of the compromise boundary to function as primary detection.

We have previously broken down why legacy security tools fail to protect cloud AI workloads at the CNAPP and CSPM layer; the same architectural failure mode shows up here one layer deeper, inside the framework process itself. The operational rule that follows: if a telemetry source can be silenced by an attacker who has already achieved the position you are trying to detect, it cannot be your detection’s load-bearing layer.

The Five-Tier Telemetry Trust Hierarchy

The five tiers below are ranked from least adversary-resistant to most. Each tier captures something distinct, and the tiers are complementary, not substitutable.

TierSourceWhat it capturesWhat silences it
Tier 1Framework-internal callbacks (LangChain CallbackManager, LangGraph hooks, AutoGPT plugin events, LangChainTracer)Prompts, tool inputs/outputs, completions, reasoning tracesAny in-process compromise
Tier 2Container stdout/stderr captured by the kubeletFramework log lines, AutoGPT think/act/reflect logsLogger replacement or fd redirection by compromised code
Tier 3Sidecar collectors and service mesh proxies (OTEL, fluent-bit, Istio, Linkerd)L7 request structure, OTEL GenAI semantic convention fields, headersCompromised app emits nothing; egress controls subverted
Tier 4eBPF DaemonSet at the kernel (kprobes, tracepoints, BPF LSM hooks)Syscalls, network connections, process spawns, file I/OPrivileged pod (privileged: true, CAP_SYS_ADMIN, hostPID)
Tier 5Control-plane audit logs (Kubernetes API server + cloud provider)TokenRequest, exec, ConfigMap/Secret reads, AssumeRoleWithWebIdentity, IAM exchangesDefault audit policy levels limit body capture

Tier 1 — Framework-internal callbacks. LangChain CallbackManager events, LangGraph node-transition hooks, AutoGPT plugin lifecycle events, the LangChainTracer handler shipping to LangSmith. Richest semantic context — prompts, tool inputs and outputs, completion text — and lowest adversary resistance: silenced by any in-process compromise.

Tier 2 — Container stdout/stderr. Framework log lines and AutoGPT think/act/reflect logs captured by the kubelet. Slightly more isolated than callbacks because the kubelet pulls them rather than the application pushing them, but the content is still author-controlled — a compromised tool can replace the logger or write through os.dup2 redirects.

Tier 3 — Sidecar collectors and service mesh proxies. OpenTelemetry collector sidecars, fluent-bit DaemonSets reading the application’s log files, and service mesh sidecars (Istio’s Envoy, Linkerd’s proxy) that terminate mTLS at the pod boundary. These run in adjacent processes — same pod or same node, separate PID — so an in-process patch cannot reach them directly. Service mesh proxies are particularly useful here because they see decrypted L7 traffic when the mesh terminates TLS, including OpenTelemetry GenAI semantic convention fields like gen_ai.system and gen_ai.usage.input_tokens when those propagate through HTTP headers.

Tier 4 — eBPF DaemonSets at the kernel. One pod per node, attached through kprobes, tracepoints, and BPF LSM hooks (sys_enter_execve, bpf_lsm_socket_connect, and similar) to syscalls, network connections, process spawns, and file I/O. Captures ground truth from below the application layer regardless of framework cooperation. The category includes Tetragon, Falco, Cilium Tetragon, and ARMO’s eBPF sensor; the trust property is shared, the application-layer enrichment is where they differ. Cannot be silenced from inside an unprivileged application container running under the default Kubernetes Pod Security Standards baseline or restricted profile. The qualifier matters — a pod running with privileged: true, CAP_SYS_ADMIN, hostPID: true, or hostNetwork: true undermines this property, which is itself an argument for sandboxing autonomous-code-execution workloads through gVisor or Kata, covered in our AI agent sandboxing guide.

Tier 5 — Control-plane audit logs. The Kubernetes API server audit log (TokenRequest calls, kubectl exec invocations, ConfigMap and Secret reads, RBAC use) and the cloud provider audit log (CloudTrail’s AssumeRoleWithWebIdentity for IRSA, Azure Activity Log for managed-identity exchanges, GCP Audit Logs for Workload Identity Federation). Both streams are written outside any tenant pod’s reach; they share the load-bearing trust property and differ in scope. Worth noting: most managed Kubernetes audit policies log events at Metadata level by default, not Request or RequestResponse, which means events are visible but their bodies are not. Promoting policy levels for security-relevant resources (Secrets, ConfigMaps, agent pods’ service accounts) is a deliberate cost decision and is what makes Tier 5 maximally useful.

What eBPF Cannot See

Tier 4 has one structural blind spot that determines why Tier 1 keeps a role despite its trust properties. When a LangChain agent makes an HTTPS call to api.openai.com or to a managed vector database, eBPF sees the TCP connect, the SNI in the TLS ClientHello, the byte volume on the socket, and the destination IP. It does not see the prompt content, the completion content, or the tool arguments inside the TLS tunnel. Without TLS interception — which most teams won’t deploy at the eBPF layer for performance and key-management reasons — the kernel-level signal is structural, not semantic.

This is why the trust hierarchy is a stack rather than a substitution. Tier 4 tells you that the agent contacted a new endpoint and transferred 47 KB. Tier 1 tells you what prompt drove the call. Both are needed for triage; only Tier 4 can be trusted to fire the alert.

Mapping LangChain Events Onto the Hierarchy

The table below maps representative LangChain and LangGraph events to the five tiers. “Rich” means full semantic content; “structural” means the event is observable but only as connection, process, or file metadata; “none” means the tier is blind.

LangChain eventTier 1Tier 2Tier 3Tier 4Tier 5
Chain start/endRichPartialRich if instrumentedNoneNone
Tool invocation (AgentExecutor triggers on_tool_start/_end)RichPartialRich if HTTPStructuralIf tool calls K8s API
LLM API callRich (prompt + completion)NoneStructural (TLS-encrypted)Structural (TCP/SNI)None
LangGraph node transitionRich (state mutation)PartialNoneNoneNone
Memory backend read (Redis, vector DB)RichNoneStructuralStructuralNone
Retriever invocationRichNoneRich if managed VDB APIStructuralIf managed VDB
LangServe HTTP inboundPartial (FastAPI middleware)PartialRich (mesh sees L7)StructuralNone

A few rows are worth elaborating. The AgentExecutor (or its LangGraph successor for agents written against the newer API) is what triggers on_tool_start and on_tool_end callbacks; the @tool decorator itself only registers the function. This matters because AgentExecutor-derived signals live entirely at Tier 1 and disappear if Tier 1 is silenced — and the LLM API call row is the cleanest example: full prompt and completion are visible only at Tier 1, while eBPF sees only that the call happened and where it went.

Memory and retriever events vary by hosting model. Self-hosted Chroma or Qdrant: Tier 4 sees the network call to the vector DB pod, no semantic content. Managed Pinecone, Azure AI Search, or Vertex AI Vector Search: cloud audit logs (Tier 5) capture API-level events, often with query metadata — meaning a managed vector DB choice meaningfully improves the trust profile of retrieval observability.

LangServe deployments add a Tier 3 surface that pure-library LangChain doesn’t have. Wrapping chains in FastAPI exposes them as HTTP endpoints, and a service mesh sidecar can capture the inbound request structure even when the application’s own callbacks have been silenced.

Mapping AutoGPT Events Onto the Hierarchy

AutoGPT means three different things in 2026: classic loop-based AutoGPT (now maintained alongside Forge for agent building), AutoGPT Platform (the node-and-block workflow engine the project pivoted toward), and Forge-derived custom agents. The trust hierarchy applies to all three; the events differ. Current versions of all three lineages live in the AutoGPT project repository.

AutoGPT eventTier 1Tier 2Tier 3Tier 4Tier 5
Planner step (think)RichPartialNoneNoneNone
Action selection (act)RichPartialNoneStructural if tool callNone
Reflection stepRichPartialNoneNoneNone
Plugin lifecycle eventRichPartialNoneNoneNone
Subprocess spawn (workspace exec)Rich (intent)PartialNoneRich (execve)None
Workspace volume writePartialNoneNoneRich (file syscalls)None
Iteration loop counterRichPartialNoneProxy via API rateNone
Platform block transitionRichPartialRich if HTTP-bridgedStructuralNone

The subprocess and workspace rows are the operationally interesting ones. Classic loop-based AutoGPT executes generated code in a workspace directory by spawning subprocesses, and Tier 4 captures the actual execve and the file syscalls regardless of what the planner thinks it asked for. If the planner has been compromised and is hiding intent at Tier 1, Tier 4 is what surfaces “this pod just spawned curl with an outbound destination it has never contacted.”

Iteration loop drift is a useful Tier 1 signal — a planner that normally completes in five to seven iterations and suddenly runs forty is a strong indicator of subgoal divergence — but it is silenced by the same compromise that silences any other Tier 1 signal. The Tier 4 proxy is API rate: a pod making LLM calls at four times its baseline frequency is observable at the kernel network layer even when the framework’s own counters lie. AutoGPT Platform deployments shift the visible surface: blocks transition through HTTP boundaries when distributed across pods, moving part of the orchestration into Tier 3 reach.

The Recommended Stack

The instrumentation rule, stated as a sequence: pick which signals must remain true under compromise of the agent process. Instrument those at Tier 4 or Tier 5. Use Tiers 1 through 3 for semantic enrichment and triage context, never as primary detection. Treat eBPF filter rules as security artifacts in their own right — at the volumes LangChain and AutoGPT pods generate connect, openat, and execve events, the filter is part of the detection, not a knob underneath it.

A practical stack for a typical Kubernetes deployment: an eBPF DaemonSet at Tier 4 capturing process lifecycle, network connections, file I/O, and BPF LSM enforcement decisions; audit log shipping at Tier 5 with promoted policy levels for agent-pod service accounts and sensitive resources, paired with the matching cloud provider audit log stream; a sidecar OTEL collector at Tier 3 receiving framework callbacks shipped over the loopback interface (shipped over the network rather than kept in-process, which moves them out of the compromise blast radius); and pod stdout shipping at Tier 2 for forensic continuity.

The ARMO platform sits at Tier 4 with framework-aware enrichment from Tiers 1, 2, and 5 correlated into single attack stories. Auto-discovery of LangChain, AutoGPT, AutoGen, and CrewAI deployments runs at the kernel layer without manual tagging, which extends coverage to shadow agent deployments security teams haven’t been told about. Application Profile DNA builds the behavioral baseline for AI agents from observed runtime behavior at Tier 4 — kernel ground truth — rather than from in-process callbacks, which is what makes the baseline survive compromise of the framework’s process. CADR correlates these signals into a single attack story, with Tier 1 callbacks treated as semantic enrichment rather than ground truth — the same pattern we apply to multi-agent orchestration detection across LangChain, CrewAI, and AutoGPT.

What the Trust Hierarchy Doesn’t Solve

The hierarchy is necessary, not sufficient. Three structural limitations are worth stating directly.

The TLS blind spot at Tier 4 means kernel-level observability cannot reconstruct prompt or completion content. A team that loses Tier 1 to compromise loses the ability to reconstruct what an agent was asked to do, even though it retains the ability to detect that the agent did something unusual.

The audit-policy prerequisite at Tier 5 means default managed-Kubernetes configurations do not capture the body content of resource reads. Promoting Secret and ConfigMap reads to RequestResponse level is a deliberate volume and cost decision that has to be made before Tier 5 carries its full weight.

The privileged-pod assumption at Tier 4 means autonomous-code-execution workloads — classic AutoGPT running planner-generated code, sandbox-evaluation tooling that needs elevated capabilities — undermine the trust property the hierarchy depends on. The right response is not to abandon the hierarchy; it is to sandbox those workloads at the runtime boundary through gVisor, Kata, or equivalent isolation, so the agent’s process cannot reach the eBPF sensor’s namespace.

Returning to the Cold Open

The platform team’s incident, walked through the stack the article recommends:

Tier 4 fires first. The malicious dependency’s execve of curl and the outbound TCP to a domain the agent pod has never contacted are kernel events the eBPF DaemonSet surfaces seconds after the connection is made. Tier 5 confirms unusual cadence on the pod’s service account, captured in CloudTrail’s AssumeRoleWithWebIdentity stream against the Kubernetes API server’s TokenRequest record. Tier 1 tells the SOC analyst, after the alert assembles, what the agent thought it was doing. The LangSmith dashboard is still green because the malicious code patched it. The incident closes in twelve minutes instead of three days because the alert came from a tier the attacker could not reach.

Build the Stack Before the Next Compromise Forces You To

LangChain and AutoGPT will keep producing more in-process telemetry as both frameworks evolve, and the in-process telemetry trap will keep deepening alongside it. The teams that stay ahead are the ones that ranked their telemetry sources by adversary resistance before the next dependency-chain compromise forced the conversation.

Walk through what a Tier-4-grounded observability stack looks like in your own cluster with ARMO’s cloud-native security for AI workloads.

Frequently Asked Questions

Should I disable LangSmith? No. LangSmith is the semantic enrichment layer that explains what an alert means once one fires from a higher tier; disabling it means losing the ability to reconstruct prompt content during triage. The change is to stop treating LangSmith as a security tool — it produces useful triage context, valuable in that role, and is not load-bearing for detection.

Where do I instrument framework callbacks if not in-process? Configure the LangChainTracer handler or equivalent OTEL exporter to ship over the network — typically to a sidecar OpenTelemetry collector via gRPC on the loopback interface. The originating handler still runs at Tier 1, but the events themselves move into Tier 3 storage where in-process compromise can no longer rewrite them after they have shipped.

What’s the eBPF overhead for AutoGPT specifically given the iteration loop? Modern eBPF sensors run as DaemonSets with overhead scaled per-node, not per-iteration; ARMO’s sensor operates at 1–2.5% CPU and roughly 1% memory overhead at the node level. AutoGPT’s loop generates more syscalls than a typical service, which makes filter tuning more important — capture security-relevant signals at the kernel and let the application-layer profile decide which iteration patterns matter.

How does this stack handle multi-agent orchestration spanning LangChain and AutoGPT in the same cluster? The trust hierarchy applies per-pod; multi-agent orchestration adds the inter-agent surface — delegation edges, shared context, orchestrator nodes — which depends on the per-pod instrumentation this article describes and is the subject of our companion analysis on AI-aware threat detection for cloud workloads.

What’s the minimum viable instrumentation for a team that can only deploy one new tool? An eBPF DaemonSet at Tier 4. It is the only tier that simultaneously delivers cross-framework coverage, adversary resistance, and compatibility with whatever in-process telemetry the team already has. Audit log shipping is the high-leverage second addition.

Close

Your Cloud Security Advantage Starts Here

Webinars
Data Sheets
Surveys and more
Group 1410190284
Ben Hirschberg CTO & Co-Founder
Rotem_sec_exp_200
Rotem Refael VP R&D
Group 1410191140
Amit Schendel Security researcher
slack_logos Continue to Slack

Get the information you need directly from our experts!

new-messageContinue as a guest