AI Inference Server Observability in Kubernetes: The Four Signals MLOps Tools Don’t Capture

Apr 28, 2026

Ben Hirschberg
CTO & Co-founder

Key takeaways

  • Why don't MLOps observability tools catch attacks on inference servers? They observe output behavior — latency, tokens per second, statistical drift in model outputs. Attacks succeed at the serving layer (model load, request handling, accelerator activity, egress) and only change the output distribution well after the foothold is established. By then the investigation is reactive.
  • What are the four serving-behavior signal categories? Model load events, request handler behavior, accelerator activity, and egress and inter-service traffic. Each lives at a different layer of the inference server's runtime and catches a different category of documented attack on production serving infrastructure.

In August 2025, a vulnerability chain in NVIDIA Triton Inference Server was disclosed that allowed an unauthenticated remote attacker to send a crafted inference request that leaked the name of an internal shared memory region, register that region for subsequent requests, gain read-write primitives into the Triton Python backend’s private memory, and achieve full remote code execution. The exploit chain ran entirely through Triton’s standard inference API. No anomalous traffic volume. No latency spike. No accuracy regression. Every metric on a Datadog APM dashboard or an Arize trace would have looked exactly as expected — right up to the moment the attacker had a shell on the GPU node.

The MLOps stack — Arize, Fiddler, WhyLabs, Langfuse, the OpenTelemetry GenAI semantic conventions, every “LLM observability” feature shipped by APM vendors in the last two years — measures whether the model is performing well. It tells you nothing about whether the inference server is being attacked. For security teams running vLLM, Triton, KServe, BentoML, TGI, or Ray Serve in production Kubernetes, that distinction is the difference between observability you can build a security program on and observability structurally blind to attacks already hitting peers. This article splits “tracking model behavior” into the two disciplines hidden inside the phrase, maps the four serving-behavior signals security teams must collect, and shows where each signal lives in the deployment patterns most teams run.

The Two Definitions of “Model Behavior”

The phrase “model behavior” carries two meanings in production AI infrastructure, and conflating them is why most security teams have invested heavily in observability without gaining security visibility into their inference fleet.

Output behavior is what MLOps tools measure. Did the model return a result within the latency budget? Are tokens per second steady? Is the output distribution drifting? Are quality metrics — accuracy, hallucination rate, helpfulness scores — stable? Tools like Arize, Fiddler, WhyLabs, and Langfuse instrument at the inference call boundary: input prompt, output response, latency, tokens. The OpenTelemetry GenAI semantic conventions formalize that boundary into a standard. This is real, useful, well-built observability, and categorically the wrong observation surface for catching attacks on the server doing the inferring.

Serving behavior is what security tools must measure. What artifacts did the inference server load at startup, from where, and with what integrity check? What syscalls do the request handlers make? What CUDA API calls does the worker process issue? What network destinations does the server reach, attributed to which process? These signals live at the kernel, syscall, file system, and library layers — inside the server process, not at its API boundary.

Attacks on inference servers succeed in the serving layer well before they manifest in the output layer. A weight-tampering attack between staging and production serves backdoored predictions with normal latency, throughput, and quality scores until an adversarial input triggers the backdoor. The Triton Python backend exploit chain produces full RCE through Triton’s standard inference API without altering any output-layer metric. Cross-tenant KV cache leakage in shared multi-tenant deployments leaks system prompts through timing differences indistinguishable from normal load variance. The output-layer dashboard stays green throughout. These are not different views of the same system. They are two systems being observed.

Why Inference Servers Are a Distinct Observation Target

Inference servers occupy an architectural slot in Kubernetes that’s structurally different from stateless application workloads, AI agents, or tool runtimes. A typical production deployment moves through four stages each time it starts. An init container fetches model weights from a registry — Hugging Face Hub, S3, GCS, Azure Blob, an internal model registry, or a PVC. The main container loads those weights into GPU memory through whichever inference framework is in use. Request handlers spawn worker processes that receive untrusted user input as prompts and run model forward passes against shared GPU memory. The server emits telemetry, fetches batched supplementary data, and returns results.

Every stage introduces a security observation surface that doesn’t exist in stateless web workloads. The init container fetches artifacts whose integrity must be verified at load time, not just at scan time. The request handlers execute user-controlled inputs through inference framework code that has shipped multiple critical CVEs in the last twelve months. The accelerator boundary holds tenant data in KV cache memory recoverable through timing side-channels. The egress lane carries traffic patterns that are workload-defining when normal and breach-defining when anomalous, and no destination-based control can tell them apart.

We have previously broken down why legacy security tools fail to protect cloud AI workloads at the framework category level. Inference servers are where that argument lands operationally — every architectural assumption baked into CSPM, CWPP, and CNAPP tooling about deterministic workloads breaks here.

Signal 1: Model Load Events

The first observation surface opens before the inference server serves a single request: the moment it loads weights into memory. Three load patterns dominate production. KServe deploys a storage-initializer init container that pulls from s3://, gs://, https://*.blob.core.windows.net/, hdfs://, hf://, or a PVC into /mnt/models, where the predictor container — Triton, TorchServe, MLServer, the HuggingFace runtime, or vLLM — reads it at startup. NVIDIA Triton’s standalone deployment uses a model repository configured at server start, with versioned subdirectories per model. vLLM in a raw Deployment uses --download-dir and the Hugging Face cache, optionally pre-warmed by an init container.

The format the artifact arrives in matters but doesn’t eliminate the observation requirement. Safetensors — now the default for new Hugging Face models, with native PyTorch support — eliminates the arbitrary code execution path that pickle-based PyTorch checkpoints carry. PyTorch’s torch.load now defaults to weights_only=True, restricting the unpickler to tensors, primitives, and explicitly allowed types. Both close the deserialization-as-RCE path. Neither tells you whether the artifact you loaded is the one you intended to load. Namespace-hijacked packages on Hugging Face Hub, weight tampering between staging and production, and registry credential abuse during init container fetch all bypass format-level scanning because the format never lies — the provenance does.
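To make that distinction concrete, here is a minimal Python sketch of loading a PyTorch checkpoint with weights_only=True while also verifying its digest against an approved manifest. The manifest file and its fields are illustrative assumptions, not a prescribed format; the point is that the format-level protection and the provenance check are separate controls.

```python
# Minimal sketch: safe deserialization plus an integrity check against a
# deployment manifest. MANIFEST_PATH and its fields are illustrative.
import hashlib
import json

import torch

MANIFEST_PATH = "model-manifest.json"   # assumed shape: {"file": "model.pt", "sha256": "..."}
CHECKPOINT_PATH = "model.pt"

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with open(MANIFEST_PATH) as f:
    manifest = json.load(f)

# Provenance check: a safe format says nothing about whether this is the
# artifact you intended to load.
if sha256_of(CHECKPOINT_PATH) != manifest["sha256"]:
    raise RuntimeError("checkpoint digest does not match the approved manifest")

# weights_only=True restricts the unpickler to tensors, primitives, and
# explicitly allowlisted types, closing the deserialization-as-RCE path.
state_dict = torch.load(CHECKPOINT_PATH, map_location="cpu", weights_only=True)
```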

This is where runtime observability of the load event becomes the load-bearing layer. A runtime-derived AI Bill of Materials populated from observed load events captures what was actually loaded into memory, from which network endpoint, with which integrity hash, and through which process — turning a static manifest into continuous attestation. eBPF instrumentation at the node observes the syscalls the storage-initializer issues during fetch and the mmap and CUDA copy operations the predictor container makes during weight ingestion, regardless of inference framework.

What you watch for: artifacts loaded from registries that aren’t on your allowed list, integrity hash mismatches against the manifest, deserialization in a backend that should have been disabled, and load events that complete before the policy engine has approved the artifact.
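A hedged sketch of how that watch list could be evaluated once load events arrive as structured records. The event fields, registry allowlist, and manifest lookup below are illustrative assumptions, not a specific product schema.

```python
# Illustrative load-event check: registry allowlist, integrity hash, and
# policy-approval ordering, evaluated against one observed load event.
from dataclasses import dataclass
from urllib.parse import urlparse

ALLOWED_REGISTRIES = {"models.internal.example.com", "huggingface.co"}  # illustrative

@dataclass
class ModelLoadEvent:
    artifact_uri: str   # e.g. "https://huggingface.co/org/model/resolve/main/model.safetensors"
    sha256: str         # digest observed as the artifact was read into memory
    process: str        # e.g. "storage-initializer" or "tritonserver"
    approved: bool      # whether the policy engine approved the artifact before load

def findings_for(event: ModelLoadEvent, manifest_sha256: str) -> list[str]:
    findings = []
    host = urlparse(event.artifact_uri).hostname or ""
    if host not in ALLOWED_REGISTRIES:
        findings.append(f"artifact fetched from unapproved registry: {host}")
    if event.sha256 != manifest_sha256:
        findings.append("integrity hash mismatch against the manifest")
    if not event.approved:
        findings.append("load completed before policy approval")
    return findings
```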

Signal 2: Request Handler Behavior

Once the server is serving, the request handler is where most documented inference server attacks succeed — and where the observation gap is widest. Triton’s recently patched vulnerability chain is a textbook example. The attacker sends a crafted, oversized request that triggers an exception in the Python backend’s error handling, leaking the unique name of an internal shared memory region. The attacker then calls Triton’s public registration endpoint with that leaked name, turning the public-facing API against itself: the server accepts the registration, and subsequent inference requests use the now-attacker-controlled memory region for input or output, providing read-write primitives into the Python backend’s private memory. From there, the attacker corrupts internal control structures and achieves full remote code execution. The full chain runs through Triton’s standard inference API.

What was visible in MLOps observability while this happened: request rate, response codes, latency, and output token distributions all within bounds. The attack didn’t touch any metric the MLOps stack measures, because the MLOps stack measures the API surface and the attack lives in shared-memory IPC, error-handler exception paths, and registration endpoint usage that look syntactically valid from the API.

What is visible in serving-behavior observability: the anomalous IPC pattern (large, unexpected accesses to /dev/shm paths the predictor doesn’t normally touch), the registration endpoint being called with a name that wasn’t generated through the documented path, the request handler subprocess making memory operations outside its baseline envelope, and the syscall pattern of post-exploitation persistence inside what was supposed to be a stateless inference container. None of these signals require knowing about the specific CVE in advance — they surface as deviations from the inference server’s behavioral baseline at the process and syscall layer.

This is the operational case for behavioral baselines that capture handler-level activity per-Deployment, not per-pod. Per-pod baselines never converge in autoscaling inference fleets where pods are ephemeral. Per-Deployment baselines survive scale events and capture the full envelope of what a given inference workload’s request handler legitimately does.
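A minimal sketch of what per-Deployment aggregation looks like, assuming handler events already carry Deployment, process, and syscall attribution; the event shape and in-memory storage are illustrative only.

```python
# Illustrative per-Deployment baseline: merge handler activity from every pod
# (including pods that have since been scaled away) into one envelope, then
# flag deviations. Field names and thresholds are assumptions.
from collections import defaultdict

# deployment -> set of (process, syscall) pairs observed across all its pods
baselines: dict[str, set[tuple[str, str]]] = defaultdict(set)

def learn(deployment: str, process: str, syscall: str) -> None:
    # Keying on the Deployment rather than the pod lets the baseline converge
    # even in autoscaling fleets where individual pods are ephemeral.
    baselines[deployment].add((process, syscall))

def deviations(deployment: str, observed: set[tuple[str, str]]) -> set[tuple[str, str]]:
    return observed - baselines[deployment]
```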

Signal 3: Accelerator Activity

The third signal is the most architecturally complex because the observation target is partitioned between the host kernel and the GPU device. The honest framing of what’s reachable: eBPF uprobes attached to libcudart.so and libcuda.so capture the CUDA API call envelope — cudaMalloc, cudaLaunchKernel, cudaMemcpy, cuLaunchKernel, and surrounding stream synchronization calls — at the boundary where the inference server’s worker process talks to the CUDA runtime. Driver-level visibility through kprobes on nvidia_* symbols and DRM tracepoints captures the ioctl pattern on /dev/nvidia* devices. Together they give you the host-side view of what the inference worker is asking the GPU to do. What stays invisible is in-GPU execution: warp divergence, thread-level memory access inside kernels, and access patterns within the GPU’s own SRAM.
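For a sense of what the host-side CUDA boundary yields, here is a minimal BCC (eBPF) sketch that counts cudaLaunchKernel calls per process by attaching a uprobe to the CUDA runtime library. It assumes a node with the bcc toolchain, root access, and a libcudart path that will differ across node images; it is a counting probe for illustration, not a production collector.

```python
# Minimal BCC sketch: per-PID cudaLaunchKernel counts via a uprobe on libcudart.
from bcc import BPF
import time

LIBCUDART = "/usr/local/cuda/lib64/libcudart.so"   # assumption: adjust for the node image

prog = r"""
BPF_HASH(launches, u32, u64);

int trace_launch(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0;
    u64 *count = launches.lookup_or_try_init(&pid, &zero);
    if (count) {
        __sync_fetch_and_add(count, 1);
    }
    return 0;
}
"""

b = BPF(text=prog)
b.attach_uprobe(name=LIBCUDART, sym="cudaLaunchKernel", fn_name="trace_launch")

# Periodically dump per-PID launch counts; a real collector would attribute
# PIDs back to pods and Deployments and feed a behavioral baseline instead.
while True:
    time.sleep(10)
    for pid, count in b["launches"].items():
        print(f"pid={pid.value} cudaLaunchKernel_calls={count.value}")
```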

For the security threat models that matter on inference servers, the host-side view is the load-bearing observation layer. Cross-tenant KV cache leakage is the canonical case. vLLM’s own documentation states that prefix caching is vulnerable to timing-based side-channel attacks where an adversary infers cached content by observing latency differences. The mitigation vLLM ships — per-request cache_salt injected into the block hash — is opt-in and requires the inference platform to implement tenant-aware salting at the application layer. Independent academic work on SafeKV has documented the same class of attack across vLLM and SGLang via TTFT timing distinguishability. The MITRE ATLAS framework now catalogs cache-based information extraction under AML.T0024 and AML.T0040.
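A hedged sketch of the application-layer half of that mitigation: derive a stable per-tenant salt and attach it to each request so prefix-cache blocks are only shared within a tenant. The endpoint, model name, and exact request field depend on the vLLM version and serving setup, and are assumptions here.

```python
# Illustrative tenant-aware cache salting for a vLLM OpenAI-compatible endpoint.
# URL, model name, and the cache_salt request field are assumptions about the
# deployment and vLLM version in use.
import hashlib
import hmac

import requests

SALT_KEY = b"rotate-me"   # platform-held secret, illustrative
VLLM_URL = "http://vllm.inference.svc:8000/v1/chat/completions"

def tenant_salt(tenant_id: str) -> str:
    # Stable per tenant, unguessable across tenants.
    return hmac.new(SALT_KEY, tenant_id.encode(), hashlib.sha256).hexdigest()

def chat(tenant_id: str, prompt: str) -> dict:
    body = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "cache_salt": tenant_salt(tenant_id),   # opt-in prefix-cache isolation
    }
    return requests.post(VLLM_URL, json=body, timeout=60).json()
```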

What you watch for: anomalous cudaMemcpy source/destination patterns relative to baseline, KV cache block allocation patterns that diverge from the deployment’s behavioral envelope, prefill latency distributions that bifurcate in ways consistent with cross-tenant cache hits, and worker process accelerator usage that doesn’t match the request profile. Multi-tenant prefix caching needs both tenant-isolation salting at the application layer and behavioral baselining at the accelerator-API layer — one without the other leaves the timing side-channel open or invisible.

Signal 4: Egress and Inter-Service Traffic

The fourth signal is where the inference server’s runtime activity terminates in network destinations — and where the most visibly active attack against AI infrastructure today expresses itself. ShadowRay 2.0, the ongoing exploitation campaign against the Ray distributed computing framework, anchors what egress observability has to catch. The underlying flaw — CVE-2023-48022, a missing authentication check in the Ray Jobs API — was disclosed in late 2023, disputed by the maintainer, and never directly patched. Through 2025 and into 2026, attackers have used internet-exposed Ray clusters to submit unauthenticated jobs that execute arbitrary code. Researchers documented over 200,000 exposed Ray servers as of late 2025, with a self-replicating botnet using compromised clusters for cryptomining, data theft, and DDoS. Many of those clusters are running vLLM, because Ray-on-Kubernetes is one of the most common patterns for distributed inference.

The egress signal during a ShadowRay-style compromise looks like this. The Ray head node — which normally fetches model artifacts, communicates with worker pods, and emits OpenTelemetry traces to a collector — begins reaching external destinations: a cryptomining pool, an attacker-controlled exfiltration sink, a peer-to-peer botnet coordination server. Destinations may be allowlisted because someone needed external Hugging Face access during deployment. Volume may be within the workload’s normal envelope. Pod-attributed monitoring sees one IP talking to another. The signal that reveals the attack is process attribution: the egress comes from a Ray job worker process, not the model loader, and the destination doesn’t appear in any prior baseline of that workload’s egress topology.

What you watch for: egress destinations that don’t appear in the deployment’s behavioral baseline, traffic from worker processes to addresses outside the artifact registry and telemetry endpoints, beaconing patterns disguised as telemetry, and outbound volume that splits into a bimodal pattern — one peak being legitimate inference traffic, the other being something else. This is where tool misuse and API abuse detection patterns at the agent layer have a structural counterpart at the inference server layer: the malice is rarely in the action itself, only in the destination, sequence, and rate.
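A minimal sketch of the process-attribution point, assuming egress events already carry Deployment, process, and destination fields; all names below are illustrative.

```python
# Illustrative process-attributed egress check for a Ray-style deployment:
# flag connections whose (process, destination) pair is absent from the
# deployment's learned egress baseline.
from dataclasses import dataclass

@dataclass(frozen=True)
class EgressEvent:
    deployment: str
    process: str        # e.g. "ray::JobWorker", "model-loader"
    destination: str    # hostname or IP:port

# deployment -> (process, destination) pairs seen during baselining (illustrative)
BASELINE: dict[str, set[tuple[str, str]]] = {
    "ray-head": {
        ("model-loader", "huggingface.co:443"),
        ("raylet", "otel-collector.observability.svc:4317"),
    },
}

def is_anomalous(event: EgressEvent) -> bool:
    allowed = BASELINE.get(event.deployment, set())
    # Destination alone isn't enough: the same endpoint can be legitimate for
    # the model loader and suspicious for a job worker.
    return (event.process, event.destination) not in allowed
```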

Where Each Signal Lives in Production

The four signals collect at different points depending on the deployment pattern:

| Platform | Signal 1 (Load) | Signal 2 (Handler) | Signal 3 (Accelerator) | Signal 4 (Egress) |
|---|---|---|---|---|
| KServe | storage-initializer init container syscall trace | Predictor container process tree + syscalls | Predictor’s CUDA library uprobes | Predictor egress, distinct from queue-proxy sidecar |
| NVIDIA Triton (standalone) | Model repository sync events | Triton process + Python backend stub IPC | Triton’s CUDA stream calls | Triton egress + management API |
| vLLM (raw Deployment) | --download-dir fetch syscalls | vLLM engine + worker processes | PagedAttention block allocation patterns | vLLM HTTP server egress |
| Ray Serve / KubeRay | Head node artifact fetch + worker init | Ray Jobs API + worker process activity | Worker-attributed CUDA calls | Head + worker egress, separately attributed |
| Managed (Vertex AI / SageMaker / Azure ML) | Unreachable | Unreachable | Unreachable | Partial (destination only) |

The managed-services row is the trade-off. Vertex AI Online Prediction, SageMaker Endpoints, and Azure ML Online Endpoints strip the eBPF substrate that produces three of the four signals, because the workload runs on infrastructure the customer doesn’t control. If serving-behavior observability is a security requirement — and for production inference handling sensitive data, it should be — the inference server runs in your cluster.

From Observability to Action

Once the four signals are collected, three downstream capabilities become possible. Per-Deployment behavioral baselines feed forward into runtime-informed posture: the inference server’s configured attack surface is what the manifest declares and the IAM grants permit; its operational attack surface is what the four signals show it actually exercises. Reconciling those two artifacts is what runtime-informed AI-SPM produces as a finding.

Detection rules can move past generic container alerting into AI-aware detection. The comparison of runtime versus declarative coverage of AI workload security frames inference as the structural blind spot for declarative tools; the four-signal framework operationalizes that frame. Enforcement on inference servers becomes an observe-to-enforce methodology applied to a specific workload class — observed serving behavior generates seccomp profiles, NetworkPolicies, and process-execution constraints that scope down blast radius without breaking production. The case for serving-behavior observability is the same one the broader runtime observability case for AI agents makes, applied one layer deeper to the workload type doing the inferring.
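As a rough sketch of the observe-to-enforce step, the following turns an observed syscall envelope into a default-deny seccomp profile that can be mounted on the inference Deployment. The syscall list is a placeholder standing in for a learned baseline, and the output path is illustrative.

```python
# Minimal observe-to-enforce sketch: observed syscall envelope -> seccomp profile.
import json

observed_syscalls = sorted({
    # Placeholder envelope; in practice this comes from the per-Deployment baseline.
    "read", "write", "mmap", "munmap", "futex", "openat", "close",
    "epoll_wait", "recvfrom", "sendto", "ioctl", "clone", "execve",
})

seccomp_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",           # deny anything outside the baseline
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [{"names": observed_syscalls, "action": "SCMP_ACT_ALLOW"}],
}

with open("vllm-deployment-seccomp.json", "w") as f:
    json.dump(seccomp_profile, f, indent=2)

# Reference the file from the pod spec via securityContext.seccompProfile
# (type: Localhost) to scope down blast radius without code changes.
```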

See Your Inference Servers in Context

ARMO’s Cloud-Native Security for AI Workloads covers all four signals across KServe, Triton, vLLM, Ray Serve, and BentoML deployments — runtime AI-BOM populated from observed load events, Application Profile DNA at the Deployment level for handler and accelerator baselines, and CADR cross-layer correlation that ties the four signals into a single attack story. Request a demo to see what your existing observability stack is missing.

FAQ

How does this fit alongside Arize, Fiddler, and Langfuse?

They stay. Output-behavior observability and serving-behavior observability run alongside each other, observing different layers of the same workload. The MLOps tools tell you whether the model is performing well; the serving-behavior layer tells you whether the inference server is being attacked. Both are necessary for production workloads handling sensitive data.

Do I need to instrument my inference application code?

For three of the four signals, no. eBPF at the Kubernetes node captures model load events, request handler behavior, and egress patterns without touching inference framework code. Accelerator activity is the partial exception — uprobes on the CUDA libraries are application-agnostic, but in-GPU observability requires either GPU-side tooling such as CUPTI or DCGM, or research-stage approaches beyond what production teams can deploy today.

What about managed inference services?

Three of four signals are unreachable on Vertex AI Online Prediction, SageMaker Endpoints, and Azure ML Online Endpoints because the substrate runs on infrastructure the provider controls. The fourth (egress) is partial — destinations visible, process attribution not. If runtime evidence of inference server behavior is a security requirement, the workload runs in your cluster.

How does this apply to multi-tenant inference deployments?

Multi-tenant batched serving — common with vLLM in production — creates KV cache adjacency that academic and vendor research has confirmed leaks through timing side-channels. The mitigation is two-layer: tenant-aware cache salting at the application layer (vLLM’s cache_salt, equivalent in SGLang) plus accelerator-API behavioral baselining to catch anomalous prefill timing. Application-only or signal-only deployments leave the channel open or invisible.

Where do I start if I’m not running any of this today?

Discovery first — which inference servers exist in your clusters, which framework each runs, which deployment pattern hosts each. Then model load events as the highest signal-per-unit-effort starting point: the substrate (init container syscalls plus predictor startup mmap traces) is the same regardless of inference framework. Then request handler behavior on the deployment with the highest blast radius. Per-Deployment baselines for accelerator activity and egress fill out the picture.
