AI Workload Discovery: How to Find Every AI Agent Running in Your Clusters

Apr 22, 2026

Shauli Rozen
CEO & Co-founder

Key takeaways

  • Why does AI workload discovery require inference rather than lookup? Kubernetes has no “AI workload” concept. At the orchestrator layer, an AI agent is indistinguishable from any other container running a Python or Node process. The information that makes a container an AI workload — loaded SDKs, outbound LLM calls, framework imports — exists only at runtime. Discovery has to infer agent identity from observation, not retrieve it from metadata.
  • What is the Declaration–Observation Gap? Every inventory is either a declaration (what someone stated) or an observation (what actually happened). Shadow AI is the gap between the two for AI workloads specifically — the workloads that don’t appear in declared-state inventories because their developers didn’t declare them. Closing the gap requires moving discovery from declaration-based tooling to observation-based tooling; it’s not a governance problem.

A CISO at a mid-sized SaaS company pulls her platform lead aside after a board meeting. One question: “Do we have AI agents running in production?”

The lead pauses. He knows the data science team has been experimenting with LangChain. He remembers a conversation about a customer-support pilot. He thinks there might be an inference server in staging that got promoted last quarter. But when the CISO asks him to produce a list — every agent, every cluster, every service account — he can’t. Not because the information is hidden. Because his tools were never built to find it.

That’s the discovery problem in one conversation. Not “do AI agents exist” — everyone knows they do. The harder question is whether you can enumerate them, prove completeness, and hand the CISO a list that says: this is every one, and here’s what each one is doing.

Most security teams can’t. The reason isn’t a visibility gap in the traditional sense — the containers are right there, running, generating logs. The reason is architectural. Kubernetes has no concept of “AI workload,” and neither does any layer below it. An AI agent is not an object the infrastructure knows about. It’s a container running a Python process that happens to be doing something very different from every other container running Python.

This article walks through the discovery problem from the ground up. It names why the usual approaches fail, introduces a four-part signal framework for what effective discovery actually catches, and ends with a three-question test a platform team can run this week to measure the size of their own blind spot. Discovery is Layer 1 of the five-layer observability model for AI workloads — the layer everything else depends on.

Why AI Workload Discovery Is Not Cloud Asset Discovery

The first instinct of a well-resourced security team, when asked to enumerate AI workloads, is to open a cloud asset inventory tool and filter for AI-related tags. AWS Config for SageMaker endpoints. Azure Resource Graph for Cognitive Services. GCP Asset Inventory for Vertex AI. The providers know what services they sold you; their asset APIs surface them reliably.

This catches exactly one category of AI workload: the kind that’s visible at the cloud API layer because it runs on a provider-managed AI service. Everything else — which in most enterprise Kubernetes environments is almost everything — is invisible.

An AI agent in a typical deployment is a container. Inside that container is a Python interpreter (sometimes Node, sometimes Go). That interpreter has loaded some libraries — maybe LangChain, maybe the OpenAI SDK, maybe a vLLM server. The container is scheduled on a Kubernetes node by the same scheduler that handles the team’s web servers, databases, and batch jobs. From the cloud provider’s perspective, this container is a generic compute unit consuming CPU and memory. It doesn’t know the container is an AI agent. The Kubernetes API doesn’t know either. Neither does the container runtime. Neither does the image registry.

The information that makes this container an AI workload — which SDKs it’s using, which LLM endpoints it’s calling, which tools it has access to — exists only at runtime, inside the process. Nothing above the process boundary sees any of it.

This is the inference problem. You can’t look up whether a container is an AI workload. You have to infer it from what the container actually does.

The Declaration–Observation Gap

Every inventory ever built in security is either a declaration or an observation. Your CMDB is a declaration. Your asset tags are a declaration. Your deployment manifests are a declaration. Your IAM policy is a declaration. All of these record what was stated, approved, and committed.

Observations are different. An observation records what actually happened. A network flow log is an observation. A process execution event is an observation. An outbound DNS query is an observation.

For deterministic workloads, declarations and observations stay close enough to each other. A web server configured to serve HTTP on port 443 does exactly that, and the gap between declared and observed behavior is small. Security tooling built around declarations works well because the observed state rarely drifts far from what was declared. The industry has written extensively about how runtime-first tooling differs from declarative-only tooling — and for AI workloads, that difference stops being a tooling preference and starts being a structural requirement.

AI workloads blow this apart. A developer deploys a container. The manifest declares Python 3.11, 2 GB of memory, one replica. Nothing in the manifest says “AI agent.” Nothing in the manifest says “LangChain.” Nothing in the manifest lists the LLM endpoints the agent will call, the tools it will invoke, or the data it will touch, because none of that is fixed at deploy time. The agent’s behavior is assembled at runtime from a prompt, a tool registry, a model endpoint, and whatever instructions the prompt ends up triggering.

The gap between what was declared at deploy time and what is observed at runtime is the shadow AI problem. Shadow AI isn’t a governance failure. It’s not a compliance failure. It’s the structural and predictable result of trying to inventory AI workloads with declaration-based tooling. Every workload that’s discoverable only through observation is, by definition, shadow AI until someone observes it.

This reframe matters because it changes what the solution looks like. Shadow AI isn’t solved by tighter deployment review or better tagging discipline. It’s solved — or, more honestly, minimized — by moving the discovery mechanism from declaration to observation. The rest of this article is about what that looks like.
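The gap can be made concrete as a set difference between two inventories. The sketch below is illustrative only: the workload names are hypothetical, and it assumes both inventories have already been reduced to comparable workload identifiers.

```python
def shadow_ai(declared: set[str], observed_ai: set[str]) -> list[str]:
    """Workloads observed behaving like AI workloads that no declaration mentions."""
    return sorted(observed_ai - declared)


# Hypothetical inventories: what tags and manifests say versus what runtime telemetry saw.
declared = {"support/chatbot"}
observed_ai = {"support/chatbot", "research/notebook-runner", "billing/invoice-summarizer"}

print(shadow_ai(declared, observed_ai))
```

Better tagging grows `declared` only for the teams that tag. The result is only as complete as `observed_ai`, which is why the mechanism that fills it matters more than the review process around it.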

The Four Discovery Signal Classes

Effective discovery isn’t one method. It’s a stack of four signal classes, each catching workloads the previous class missed, and each requiring different instrumentation to capture.

Signal Class 1 — Declared Signals

This is where most organizations start, and where many stop. Declared signals are everything a workload says about itself at deployment time: Kubernetes labels and annotations (app=langchain-agent, ai.company.com/workload=true), image names (openai-proxy:v2), Helm chart names, ConfigMap entries, service catalog entries.

Declared signals are free and instant. Capturing them requires no new instrumentation — any CSPM or platform-engineering tool with cluster-scoped read access produces the full set. They’re high-confidence when present: if a developer labeled a pod workload=ai-agent, it almost certainly is one.

The weakness is coverage. Declared signals catch only the agents whose developers decided to declare them. In most organizations, this excludes everything deployed in a hurry, everything deployed by a data science team whose conventions don’t match platform team conventions, and everything deployed during a proof-of-concept that became production. It also catches nothing that an attacker introduced, since attackers don’t self-tag. Declared signals are the floor of discovery, not the ceiling.
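As a rough illustration of what a declared-signal check amounts to, the sketch below scans a pod object (a plain dict shaped like the Kubernetes API's JSON) for AI-related labels, annotations, and image names. The label key and the name pattern are assumptions taken from the examples above, not a standard; every organization would substitute its own conventions.

```python
import re

# Hypothetical conventions, borrowed from the examples in this section.
AI_TAG_KEYS = {"ai.company.com/workload"}
AI_NAME_PATTERN = re.compile(r"langchain|openai|anthropic|vllm|llm", re.IGNORECASE)


def declared_signals(pod: dict) -> list[str]:
    """Return the declared-signal hits for one pod."""
    hits = []
    meta = pod.get("metadata", {})
    tags = {**meta.get("labels", {}), **meta.get("annotations", {})}
    for key, value in tags.items():
        if key in AI_TAG_KEYS and value == "true":
            hits.append(f"tag:{key}")
        elif AI_NAME_PATTERN.search(value):
            hits.append(f"tag-value:{key}={value}")
    for container in pod.get("spec", {}).get("containers", []):
        image = container.get("image", "")
        if AI_NAME_PATTERN.search(image):
            hits.append(f"image:{image}")
    return hits
```

A pod labeled app=langchain-agent and running openai-proxy:v2 produces two hits. The same agent deployed as app=helper from an image named worker:latest produces none, which is the coverage weakness in miniature.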

Signal Class 2 — Infrastructure Signals

Infrastructure signals are the shape of the workload at the Kubernetes and node layer: GPU resource requests (nvidia.com/gpu: 1), access to /dev/nvidia* device nodes, mounted model-storage volumes, atypically large memory requests (often 8 GB or more), and node pool affinity to accelerated-compute pools.

These signals catch most inference servers reliably. A container requesting a GPU is almost certainly running either an inference workload or a training job. A container mounting a 40 GB model artifact from an object store is almost certainly an inference server. Infrastructure signals don’t require application-layer instrumentation; they’re visible to any tool with Kubernetes API and node-level inventory access.

The weakness: infrastructure signals miss the majority of agent workloads because most agents don’t need a GPU. A LangChain agent calling a hosted LLM API needs almost no infrastructure — a few hundred megabytes of memory, no special devices, a standard node pool. Infrastructure signals are strong for the heavyweight workloads and blind to the lightweight ones, which is exactly backwards for agent discovery: the lightweight workloads are where proliferation happens.
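A minimal sketch of an infrastructure-signal check follows, again over a pod dict shaped like the Kubernetes API's JSON. It covers only two of the signals named above (GPU requests and large memory requests); the 8 GiB threshold mirrors the heuristic in this section, and the quantity parser handles only the common memory suffixes, not the full Kubernetes quantity grammar.

```python
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
          "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
LARGE_MEMORY_BYTES = 8 * 2**30  # assumed threshold for "atypically large"


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '16Gi' to bytes (common suffixes only)."""
    for suffix in ("Ki", "Mi", "Gi", "Ti", "k", "M", "G", "T"):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(quantity)


def infrastructure_signals(pod: dict) -> list[str]:
    """Flag containers that request a GPU or an unusually large amount of memory."""
    hits = []
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if "nvidia.com/gpu" in requests or "nvidia.com/gpu" in limits:
            hits.append(f"{container['name']}:gpu")
        memory = requests.get("memory")
        if memory and parse_memory(memory) >= LARGE_MEMORY_BYTES:
            hits.append(f"{container['name']}:large-memory")
    return hits
```

An inference server requesting one GPU and 16Gi of memory trips both checks. An agent container requesting 256Mi and no devices trips neither, even though it may be the more consequential workload.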

Signal Class 3 — Library and Process Signals

Library and process signals are what the container actually loaded into memory and what process tree it spawned. This is the first signal class that requires runtime instrumentation — you need something observing the workload from inside (or from the kernel), not from the orchestrator.

The signals in this class are the strongest indicators of AI-agent identity:

  • LLM provider SDKs loaded into the process: the openai Python package, anthropic, google-generativeai, boto3 with Bedrock runtime calls, @anthropic-ai/sdk in Node.
  • Agent framework libraries loaded: langchain, llama_index, crewai, autogen, semantic_kernel.
  • Inference server processes: vllm, tgi, triton-inference-server, ray serve.
  • MCP runtime signatures: process arguments consistent with an MCP tool server, JSON-RPC handlers on specific ports.
  • Interpreter invocations from an agent process: a LangChain agent spawning a Python subprocess to execute generated code is a distinct process-tree pattern that doesn’t appear in non-agent Python workloads.

The instrumentation required is the same that runtime security in general depends on: kernel-level observation through eBPF-based sensors, or in-container agents that can see process memory and library load events. Tools already deployed for runtime threat detection can surface library and process signals as a byproduct of their existing telemetry.

Coverage from this class is high, because most agent frameworks fingerprint cleanly at the library level, and confidence is high, because a process that loaded langchain.agents into memory is unambiguously an AI agent.
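Once a sensor reports which modules a process has loaded, the fingerprinting step itself is simple. The sketch below assumes a hypothetical sensor that emits Python module names per process; the fingerprint sets are a small subset of the libraries named above, and the category names are illustrative.

```python
FINGERPRINTS = {
    "llm_sdk": {"openai", "anthropic", "google.generativeai"},
    "agent_framework": {"langchain", "llama_index", "crewai", "autogen", "semantic_kernel"},
    "inference_server": {"vllm"},
}


def classify_modules(loaded_modules: list[str]) -> dict[str, list[str]]:
    """Map a process's loaded module names to the fingerprint categories they match."""
    result = {}
    for category, prints in FINGERPRINTS.items():
        matched = sorted(
            fp for fp in prints
            # Match the package itself or any submodule, but not lookalike prefixes.
            if any(m == fp or m.startswith(fp + ".") for m in loaded_modules)
        )
        if matched:
            result[category] = matched
    return result
```

Matching on whole dotted prefixes keeps google.protobuf from being mistaken for google.generativeai, and keeps a module named openai_helper from counting as the openai SDK.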

Signal Class 4 — Protocol and Network Signals

Protocol and network signals are what the workload says over the wire: outbound DNS queries, TLS SNI values, HTTP request targets, and protocol-level traffic patterns.

For AI workloads, the high-confidence signals are outbound connections to known LLM API endpoints. The major hosted-LLM providers all expose their inference APIs at a small set of predictable hostnames: api.openai.com, api.anthropic.com, generativelanguage.googleapis.com, Bedrock runtime endpoints at bedrock-runtime.<region>.amazonaws.com, and Azure OpenAI endpoints at <resource>.openai.azure.com. Any container making regular outbound connections to any of these is an AI workload with probability approaching certainty.
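Matching a destination hostname (from DNS logs or TLS SNI) against these endpoints can be sketched as follows. The exact hostnames and the two patterns come from the list above; the returned provider labels are illustrative, and a production list would need more providers and regular maintenance.

```python
import re
from typing import Optional

EXACT_HOSTS = {
    "api.openai.com": "openai",
    "api.anthropic.com": "anthropic",
    "generativelanguage.googleapis.com": "google",
}
HOST_PATTERNS = [
    ("bedrock", re.compile(r"^bedrock-runtime\.[a-z0-9-]+\.amazonaws\.com$")),
    ("azure-openai", re.compile(r"^[a-z0-9-]+\.openai\.azure\.com$")),
]


def llm_provider_for(hostname: str) -> Optional[str]:
    """Return a provider label if the hostname is a known LLM API endpoint, else None."""
    host = hostname.lower().rstrip(".")  # tolerate fully qualified names with a trailing dot
    if host in EXACT_HOSTS:
        return EXACT_HOSTS[host]
    for provider, pattern in HOST_PATTERNS:
        if pattern.match(host):
            return provider
    return None
```

Anchoring the patterns at both ends matters: a hostname such as api.openai.com.evil.example should not be counted as an OpenAI call.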

The second pattern in this class is MCP-specific traffic: JSON-RPC over MCP’s defined transports (HTTP with SSE, or stdio for local tool servers), the specific RPC method names MCP defines (initialize, tools/list, tools/call, resources/list), and connection patterns consistent with agent-to-tool-server handshakes. Connection patterns are fingerprintable at the network layer without decrypting payloads; method names are readable only where the traffic is plaintext or the sensor observes it before encryption.
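Where a sensor can hand over a plaintext message body, recognizing an MCP request reduces to a JSON-RPC check against the method names listed above. This is a minimal sketch under that assumption; it looks at single messages only and ignores batching, responses, and notifications.

```python
import json
from typing import Optional

MCP_METHODS = {"initialize", "tools/list", "tools/call", "resources/list"}


def mcp_method(raw_message: str) -> Optional[str]:
    """Return the method name if the payload looks like an MCP JSON-RPC request, else None."""
    try:
        message = json.loads(raw_message)
    except (ValueError, TypeError):
        return None
    if not isinstance(message, dict) or message.get("jsonrpc") != "2.0":
        return None
    method = message.get("method")
    if isinstance(method, str) and method in MCP_METHODS:
        return method
    return None
```

Other JSON-RPC 2.0 traffic passes the envelope check but fails the method check, so it is the method vocabulary rather than the envelope that distinguishes MCP.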

Protocol and network signals catch the workloads every other class misses: the agent that only runs briefly in response to a prompt, the agent whose developer didn’t use a major framework, the agent calling an LLM endpoint but using nothing else that would fingerprint as AI. They’re the signal class of last resort, and also the class an attacker would most need to hide from.

Coverage and confidence

The four classes are complementary, not alternatives. The summary below maps what each catches, misses, and requires:

| Signal class | What it catches | What it misses | Instrumentation |
| --- | --- | --- | --- |
| Declared | Agents whose developers tagged them | Shadow AI, attacker-introduced workloads, rushed deployments | Kubernetes API read access |
| Infrastructure | Inference servers, GPU workloads, training jobs | Lightweight agents calling hosted LLMs | Kubernetes API + node inventory |
| Library / process | Framework-based agents, SDK users, inference runtimes | Custom agents using no standard libraries | Runtime observation (eBPF or in-container agent) |
| Protocol / network | Hosted-LLM callers, MCP traffic, API-based agents | Agents making no outbound LLM calls | Network flow visibility, DNS logs, or kernel-level network observation |

No single signal class is sufficient. A discovery capability relying only on declared signals has a shadow AI problem by construction. A capability relying only on infrastructure signals misses most agents. Library and protocol signals together catch the workloads the first two classes miss, but require runtime telemetry that most CSPM and CNAPP tools don’t have — a structural gap the legacy-tooling analysis covers in depth. The complete picture requires all four.
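To see why the classes have to be combined, it helps to compute what each one would miss on its own. The sketch below takes the set of workloads each class flagged and reports the union alongside each class's individual blind spot. The workload names and the detection sets are invented for illustration.

```python
def coverage_report(detections: dict[str, set[str]]) -> tuple[list[str], dict[str, list[str]]]:
    """Return (all discovered workloads, workloads each class alone would miss)."""
    everything = set().union(*detections.values())
    missed = {cls: sorted(everything - found) for cls, found in detections.items()}
    return sorted(everything), missed


# Hypothetical detections per signal class.
detections = {
    "declared": {"support-bot"},
    "infrastructure": {"vllm-server"},
    "library": {"support-bot", "vllm-server", "research-agent"},
    "network": {"support-bot", "research-agent", "raw-http-agent"},
}
inventory, missed = coverage_report(detections)
```

In this made-up example the library class misses only the agent that calls an LLM over raw HTTP, and the network class misses only the internal model server, so the two runtime classes together recover the full inventory.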

Classification: Knowing What You’ve Discovered

Discovery that stops at “there’s an AI workload here” is useful but incomplete. The next question — and the one everything downstream depends on — is what kind of AI workload.

A LangChain ReAct agent with a Python code-interpreter tool and access to an internal vector database has a fundamentally different risk profile from a vLLM inference server that takes a prompt and returns a completion. Both are AI workloads. Both might show up in a discovery result. Treating them the same is a category error that will produce either too-loose policy for the agent or too-tight policy for the inference server.

Useful classification surfaces at least three attributes:

  • Framework: LangChain, LlamaIndex, CrewAI, AutoGen, vLLM, or a custom agent using raw SDKs.
  • Role: inference server (takes prompt, returns completion), agent orchestrator (reasons and invokes tools), tool runtime (exposes tools to other agents), RAG retriever (fetches context for a generator), or specialized variants like code-execution agents.
  • Kubernetes identity: service account, RBAC bindings, pod security context, namespace. Combined with behavioral observation later in the stack, this becomes the identity layer at Layer 5.

Classification is what makes discovery produce an actionable inventory. A list of “47 AI workloads discovered” is a number. A list of “23 LangChain agents, 6 vLLM inference servers, 5 MCP tool runtimes, 13 custom agents using the OpenAI SDK, each mapped to service accounts X, Y, Z” is something a security team can triage, prioritize, and build policy around.
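Turning a flat discovery result into that kind of summary is a grouping step. The sketch below assumes each discovered workload is already a record with framework, role, and service_account fields; the field names and sample values are hypothetical.

```python
from collections import Counter, defaultdict


def summarize(inventory: list[dict]) -> list[str]:
    """Count workloads per (framework, role), most common first."""
    counts = Counter((w["framework"], w["role"]) for w in inventory)
    return [f"{n} x {framework} ({role})" for (framework, role), n in counts.most_common()]


def by_service_account(inventory: list[dict]) -> dict[str, list[str]]:
    """Group workload names under the service account each one runs as."""
    grouped = defaultdict(list)
    for w in inventory:
        grouped[w["service_account"]].append(w["name"])
    return dict(grouped)
```

The second grouping is the one that tends to drive triage, because a service account shared by several agents determines what all of them can reach.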

This is the point where discovery hands off to the next layer. Once you have a classified inventory of agents, you know enough to generate a runtime-derived AI-BOM — the per-agent record of what each one actually uses, which becomes the foundation for behavioral baselining, detection, and enforcement.

The Discovery Completeness Test

An abstract framework is hard to act on. Here are three questions a platform or security team can take to their existing tooling this week. Each question maps to one or more signal classes. The questions a team cannot answer define the shape of their discovery gap.

Question 1: Can you list every process in your clusters right now that has an LLM SDK loaded into memory?

This maps to Signal Class 3 (library and process). The specific test: list every Python process that has openai, anthropic, google.generativeai, langchain, llama_index, crewai, or autogen loaded as an imported module. A team that can answer this has runtime library-level visibility. A team that can’t — which is most teams, because most tools don’t expose process memory contents — is blind to Signal Class 3 entirely.

Question 2: Can you list every outbound connection to a known LLM API endpoint in the last 24 hours, grouped by source pod?

This maps to Signal Class 4 (protocol and network). The specific test: filter your network flow data for outbound traffic to api.openai.com, api.anthropic.com, generativelanguage.googleapis.com, Bedrock runtime endpoints, or Azure OpenAI endpoints, and group by Kubernetes pod. A team with full-stack network observability answers this in minutes. A team relying on VPC flow logs alone can probably answer part of it but won’t easily attribute to pods. A team with no east-west or egress observability can’t answer it at all.
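If flow records already carry pod attribution, the query behind Question 2 looks roughly like the sketch below. The record fields (time, namespace, pod, dest_host) are assumptions about what a flow source might provide; records from VPC flow logs alone would typically lack the pod fields, which is the attribution problem described above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EXACT_LLM_HOSTS = {"api.openai.com", "api.anthropic.com", "generativelanguage.googleapis.com"}


def is_llm_endpoint(host: str) -> bool:
    """True for the hosted-LLM endpoints named in this article."""
    return (
        host in EXACT_LLM_HOSTS
        or host.endswith(".openai.azure.com")
        or (host.startswith("bedrock-runtime.") and host.endswith(".amazonaws.com"))
    )


def llm_calls_by_pod(flows: list[dict], now: datetime,
                     window: timedelta = timedelta(hours=24)) -> dict[tuple[str, str], list[str]]:
    """Group LLM-bound destinations seen within the window by (namespace, pod)."""
    grouped = defaultdict(set)
    for flow in flows:
        if now - flow["time"] > window:
            continue  # outside the lookback window
        if is_llm_endpoint(flow["dest_host"]):
            grouped[(flow["namespace"], flow["pod"])].add(flow["dest_host"])
    return {pod: sorted(hosts) for pod, hosts in grouped.items()}
```

Any pod that appears in the result without also appearing in the declared inventory is a candidate shadow AI workload.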

Question 3: Can you list every container that either requested a GPU, mounted a model artifact larger than 100 MB, or runs in a node pool with accelerated compute?

This maps to Signal Class 2 (infrastructure). Most teams can answer this from their Kubernetes API and storage metadata, which makes it the easiest of the three. The reason it’s still included: it catches workloads Questions 1 and 2 might miss — a model server running entirely internally, using no hosted LLM APIs, and without its SDK showing up in standard framework fingerprints.

A team that can answer all three has discovery coverage across three of the four signal classes, and will have the fourth, declared signals (Class 1), by default. A team that can answer only Questions 1 and 2 but not 3 probably has strong runtime telemetry but weak Kubernetes inventory correlation. A team that can answer only Question 3 has the infrastructure layer but none of the runtime visibility the agent workloads require.

Every “no” on this test is a coverage gap. The test isn’t about grading a tool — it’s about producing an explicit map of where the team’s discovery blind spots are, so the next investment decision is evidence-based.

From Discovery to the Rest of the Stack

Discovery’s output is not a list. It’s a living map — updated continuously as new workloads deploy, as existing workloads change behavior, as agents discover new tools at runtime. Each entry in the map ties together which workload (pod, deployment, namespace, cluster), what kind (framework, role, classification attributes), which identity (service account, RBAC bindings, IAM federation), and what it’s calling (LLM endpoints, tool servers, data sources — observed, not declared).
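One way to picture such a map is as a keyed store where repeated observations of the same workload merge into a single entry instead of appending rows. The sketch below is a simplified illustration, not a description of any product's data model; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkloadEntry:
    cluster: str
    namespace: str
    pod: str
    framework: str = "unknown"
    role: str = "unknown"
    service_account: str = "unknown"
    observed_endpoints: set = field(default_factory=set)  # observed, not declared


class DiscoveryMap:
    """Keeps one entry per (cluster, namespace, pod) and folds new observations into it."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str, str], WorkloadEntry] = {}

    def observe(self, cluster, namespace, pod, *, framework=None, role=None,
                service_account=None, endpoint=None) -> WorkloadEntry:
        key = (cluster, namespace, pod)
        if key not in self._entries:
            self._entries[key] = WorkloadEntry(cluster, namespace, pod)
        entry = self._entries[key]
        if framework:
            entry.framework = framework
        if role:
            entry.role = role
        if service_account:
            entry.service_account = service_account
        if endpoint:
            entry.observed_endpoints.add(endpoint)
        return entry

    def entries(self) -> list[WorkloadEntry]:
        return list(self._entries.values())
```

An outbound connection seen first and a library load seen a minute later end up on the same entry, which is what lets classification sharpen over time without duplicating the workload.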

Everything above Layer 1 uses this map as input. The runtime-derived AI-BOM depends on knowing which workloads to build an inventory for. Behavioral visibility depends on knowing which agents to baseline. Execution graphs depend on knowing which pods are agents in the first place. Identity mapping depends on the workload-to-identity correlation that discovery produces.

This is why the parent observability model argues discovery is the foundation. Skip it and the layers above don’t have reliable inputs; they produce confident-looking outputs built on incomplete ground truth.

ARMO’s discovery runs at the node level through eBPF-based telemetry, which gives it direct access to Signal Classes 3 and 4 — library loads and outbound network destinations — without requiring developer cooperation, without sidecars, and without modifying application code. Signal Classes 1 and 2 come from Kubernetes API and node inventory, the way any Kubernetes-aware tool surfaces them. The combination produces the classified, continuously updated map described above. A LangChain agent deployed without tags, calling api.anthropic.com for its completions, using an internal MCP tool server for its actions — the kind of workload a tagging-based inventory would miss completely — appears in the discovery map on its first execution.

Evaluating Your Discovery Capability

If you’re looking at your current stack and asking whether it covers the four signal classes, the useful questions are:

On declared signals: Can your tool surface pods by label, annotation, and image name across all clusters in one view? (Almost every Kubernetes-aware tool can.)

On infrastructure signals: Does your tool correlate GPU requests, large-memory pods, and model-storage mounts into a single workload profile? (Most Kubernetes-aware tools can, if asked.)

On library and process signals: Does your tool observe which libraries and modules are loaded into running processes, not just what’s in the image? (Most tools can’t. Image scanning tells you what’s on disk, not what’s loaded into memory.)

On protocol and network signals: Does your tool see outbound traffic to LLM API endpoints at the pod level, attributed to the Kubernetes identity that made the call? (VPC flow logs don’t attribute to pods. Service mesh telemetry does, but only for mesh-internal traffic. Kernel-level network observation attributes reliably.)

Three red flags to watch for when evaluating a tool’s discovery claims:

  • “AI-tagged cloud assets” framing: if the discovery feature is a filter on cloud asset inventory, it only catches managed AI services; everything in Kubernetes is invisible to it.
  • Developer tagging or registration as a prerequisite: any discovery method that depends on developer action has the shadow AI problem baked in; the workloads you most need to find are exactly the ones that won’t be tagged.
  • Agent-per-pod or sidecar-per-pod instrumentation: if surfacing an AI workload requires modifying its deployment, discovery is gated on the same developer cooperation that shadow AI bypasses.

The opposite of these (discovery that works without tagging, without deployment modification, and without cloud-API gating) is what “automatic and runtime-based” means operationally. For the broader evaluation frame these discovery criteria fit into, the four-pillar AI workload security buyer’s guide walks through observability, posture, detection, and enforcement together.

Start With What You Can See

The CISO question at the top of this article is a reasonable question. Every CISO should be able to ask it and get an answer. Most can’t — not because their security teams are underperforming, but because their tooling was built for a different kind of workload.

AI workload discovery is a different problem from cloud asset discovery, from CMDB inventory, from container security scanning. It’s an inference problem, solved by observing what containers actually load, what they actually spawn, and what they actually send over the wire. The four signal classes describe the observable space. The completeness test describes how much of it any given team currently covers. Everything else in the stack depends on the answer.

For teams running Kubernetes in production, discovery is the prerequisite to every subsequent decision. You can’t baseline behavior for agents you haven’t found. You can’t enforce least privilege on workloads you don’t know exist. You can’t write a policy for a category of thing your inventory denies is running.

See the agents you have. The rest of the stack becomes tractable once this layer is solved. ARMO’s platform performs runtime-based AI workload discovery across connected clusters, classifying agents by framework and role without tagging requirements or developer cooperation. Request a demo to see what your current stack is missing.

FAQ

Why isn’t developer tagging enough for AI workload discovery?

Developer tagging only catches the workloads whose developers decided to tag them. In a typical enterprise with multiple teams deploying agents independently, this excludes proof-of-concept deployments that became production, data-science-team deployments that don’t follow platform-team conventions, and anything introduced by an attacker who obviously won’t self-tag. Tagging is useful as a declaration layer on top of observational discovery, but it cannot be the primary mechanism.

Can static image scanning find AI workloads?

Partially. Image scanning will tell you which libraries are installed in a container image, so it catches agents that ship their frameworks at build time. It misses two cases: agents that install or load libraries at runtime (via pip install inside a running container, a pattern common in experimental deployments), and agents whose framework code is mounted from a volume rather than baked into the image. Runtime observation catches both; image scanning does not.

How long does runtime discovery take to produce a complete picture?

Discovery becomes useful the moment instrumentation is deployed. The first time an existing agent executes a tool call, spawns its typical subprocess, or opens its standard outbound connections, the discovery signals fire and the workload enters the inventory. For workloads already running when instrumentation is first deployed, time to full classification is the time to the workload’s next round of activity — usually minutes, not days.

What about agents that only run on demand, like serverless functions or Knative workloads?

The signal classes still apply, but Classes 2 and 3 become harder to observe for workloads with very short lifetimes. Signal Class 4 (protocol and network) is the most reliable for on-demand workloads because the outbound LLM API calls happen every time the function executes, and those calls are observable in network telemetry regardless of how briefly the function lives. Knative and similar platforms running on Kubernetes nodes remain observable through node-level instrumentation; fully serverless platforms like AWS Lambda require cloud-layer observation in addition to Kubernetes-layer observation.
