AI Agents in the Cloud: A Risk Management Framework for Security Leaders
Your risk committee meets Thursday. The agenda has a new item: AI agent risk posture....
May 1, 2026
A healthcare CISO opens her AI-SPM dashboard at the start of the quarter. Every clinical AI agent in the cluster reads green: full AI-BOM coverage, every permission scope reconciled, the HIPAA compliance tag clean across the fleet. The ambient scribe, the prior-authorization assistant, the oncology decision support agent — all monitored, all green, all the way through.
Six months later, the Office for Civil Rights opens an investigation. The complaint: the ambient scribe disclosed PHI to an unintended endpoint during a specialist referral. The investigators make one request: produce the continuous behavioral attestation, tied to FHIR semantics, for that agent across the relevant 30-day window.
The platform team can produce most of what’s asked for. They have the runtime model inventory. They have the per-agent identity logs. What they don’t have is what OCR is actually asking for: a record of what the agent did — which FHIR resources it touched, which patient cohorts, which write patterns — generated continuously from runtime, not from the agent’s declared role. Their dashboard was never instrumented for it.
The HIPAA tag was assembled from two real evidence streams and one silent one — and the silent one is exactly what §164.312(b) demands. AI security posture management splits into three evidence streams like this, and most dashboards don’t surface when one of them is silent.
Most healthcare AI-SPM material treats the three core HIPAA evidence demands — runtime model inventory bound to BAA scope, per-agent minimum-necessary attestation, patient-record-level behavioral context — as point-in-time vendor evaluation criteria. Evaluate the tool, snapshot at procurement, into the binder. The audit cycle assembles the packet. Between cycles, what’s actually being produced is unclear.
A HIPAA-defensible AI-SPM dashboard needs a structural reframe: each evidence demand maps to exactly one of the three AI-SPM disciplines, and continuous attestation per discipline is what produces each demand as a steady-state artifact:
| HIPAA evidence demand | Provision | AI-SPM discipline | Continuous-attestation output |
|---|---|---|---|
| Runtime AI-BOM joined against active BAA registry | §164.504(e) | D1 — Model & Artifact | Continuous AI-BOM with BAA-scope reconciliation per call |
| Per-agent identity binding + minimum-necessary attestation | §164.502(b) | D2 — Identity & Access | Declared-vs-observed scope reconciliation per agent at the Deployment level |
| Patient-record-level access log + behavioral envelope tied to FHIR semantics | §164.312(b) | D3 — Behavioral | FHIR-aware behavioral envelope at the Deployment level capturing resource type and operation per agent |
Three demands. Three disciplines. Three continuous outputs. The dashboard’s job is not to roll these into a single HIPAA tag — it’s to surface each independently, so a missing one is visible rather than absorbed by the others.
The substrate that produces all three on one platform — rather than stitching best-of-breed tools — is what ARMO calls Application Profile DNA: an eBPF-based runtime profile generated at the Kubernetes Deployment level. The same evidence stream feeds AI-BOM evidence for D1, identity binding for D2, FHIR-aware behavioral envelope for D3. When one discipline lags the others — the most common HIPAA exposure pattern in healthcare — that asymmetry surfaces directly rather than getting absorbed by a green tag.
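The matrix is built from two axes. One is the three AI-SPM disciplines defined above. The other is the three gap types that runtime-informed posture findings sort into: G1, Latent Capability (scope that is declared or configured but never exercised); G2, Hidden Effective Scope (behavior observed at runtime that the declaration omits); and G3, In-Scope Anomaly (declared and observed scope agree, but the behavior inside that scope has shifted).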
Crossed, the two axes produce nine cells. Each cell carries the same four fields: a healthcare-specific example, the HIPAA exposure it creates, the continuous-attestation requirement that surfaces it, and the remediation path. All nine produce evidence on the same eBPF substrate.
| Gap type | D1 — Model & Artifact | D2 — Identity & Access | D3 — Behavioral |
|---|---|---|---|
| G1 — Latent Capability | Managed LLM endpoint configured but never called. §164.504(e). Continuous AI-BOM reconciliation. Scope-down: remove latent dependency. | IRSA role permits read of full Patient/Encounter/Observation/MedicationStatement; agent reads only a subset. §164.502(b). Declared-vs-observed reconciliation. Tighten declared scope to match observed. | Envelope permits a write-burst pattern the agent has never executed. §164.312(b). FHIR-aware behavioral envelope. Tighten production envelope to observed pattern. |
| G2 — Hidden Effective Scope | Fallback model endpoint added in hotfix; manifest still lists only primary. §164.504(e) BAA-scope. Runtime AI-BOM ⋈ BAA registry per call. Reconcile manifest; verify BAA covers fallback. | Shared service-account binding gives prior-auth agent access to a research dataset; inherited through namespace-level binding. §164.502(b). Per-agent identity binding at Deployment level. Replace shared binding with per-agent IRSA. | Agent observed writing FHIR DocumentReference resources matching a clinical-context type the declared role doesn’t include. §164.312(b). FHIR-aware behavioral envelope. Tighten envelope; CADR correlation surfaces this with D1×G2 fallback-model finding when both fire together. |
| G3 — In-Scope Anomaly | Manifest and runtime agree the embedding service is the embedding service — but agent now calls it for case classes never used before. §164.312(b) (behavioral signal, not scope). Drift detection on AI-BOM call patterns. Investigate; possible compromise indicator. | Declared and observed scope both include MedicationStatement access; read frequency triples in a week with no deployment event. §164.312(b). Per-agent rate-dimension guardrail. Investigate; rate-limit if confirmed anomalous. | Declared and observed write patterns agree on FHIR resource types and operations; recommendation distribution per case-class has shifted. §164.312(b). Distribution-drift detection at Deployment level. Causal correlation; investigate model or input drift. |
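One way to read the three gap types is as set relations between declared and observed scope. The sketch below is a minimal illustration under that assumption: scope is modeled as a set of (FHIR resource type, operation) pairs, and the in-scope anomalies are assumed to come from a separate detector. The function name and data shapes are hypothetical, not a product API.

```python
from typing import Dict, Set, Tuple

# Assumed representation: one scope entry is (FHIR resource type, operation).
Scope = Set[Tuple[str, str]]


def classify_gaps(declared: Scope, observed: Scope, anomalous: Scope) -> Dict[str, Scope]:
    """Sort scope entries into the three gap types.

    G1 (latent capability): declared but never observed at runtime.
    G2 (hidden effective scope): observed at runtime but never declared.
    G3 (in-scope anomaly): declared and observed, and also flagged by a
        separate anomaly detector (supplied here as `anomalous`).
    """
    return {
        "G1": declared - observed,
        "G2": observed - declared,
        "G3": (declared & observed) & anomalous,
    }


declared = {("Patient", "read"), ("Encounter", "read"), ("MedicationStatement", "read")}
observed = {("Patient", "read"), ("MedicationStatement", "read"), ("DocumentReference", "create")}
anomalous = {("MedicationStatement", "read")}  # e.g. read frequency tripled in a week

gaps = classify_gaps(declared, observed, anomalous)
```

In this example the unused Encounter read lands in G1, the undeclared DocumentReference write lands in G2, and the MedicationStatement read lands in G3.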
Three cells healthcare teams encounter most often:
D1 × Hidden Effective Scope. A platform team ships a hotfix adding a fallback model endpoint — when the primary LLM rate-limits or returns 5xx, the agent falls back to a secondary endpoint from a different vendor. The manifest still lists only the primary. AI-BOM says one model dependency; runtime says two. §164.504(e) exposure: the secondary vendor may not be under an active BAA. Continuous attestation joins runtime AI-BOM (built from observed calls) against the BAA registry per call. Calls landing at endpoints outside BAA fire D1×G2; remediation: reconcile the manifest and verify BAA coverage, or block the fallback.
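The per-call join described above can be sketched as follows. This is an illustration only: the `ObservedCall` record, the manifest set, the BAA registry shape (vendor mapped to BAA expiry date), and the example hostnames are all assumptions introduced here, not a real schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Set


@dataclass(frozen=True)
class ObservedCall:
    agent: str
    endpoint: str  # host the agent actually called at runtime
    vendor: str
    on: date


def reconcile_calls(
    calls: List[ObservedCall],
    manifest_endpoints: Set[str],
    baa_expiry: Dict[str, date],
) -> List[dict]:
    """Check each observed call against the manifest and the BAA registry."""
    findings = []
    for call in calls:
        reasons = []
        if call.endpoint not in manifest_endpoints:
            reasons.append("endpoint missing from manifest")
        expiry = baa_expiry.get(call.vendor)
        if expiry is None or expiry < call.on:
            reasons.append("vendor not under an active BAA")
        if reasons:
            findings.append(
                {"cell": "D1xG2", "agent": call.agent,
                 "endpoint": call.endpoint, "reasons": reasons}
            )
    return findings


calls = [
    ObservedCall("ambient-scribe", "llm.primary.example", "vendor-a", date(2026, 4, 2)),
    ObservedCall("ambient-scribe", "llm.fallback.example", "vendor-b", date(2026, 4, 2)),
]
findings = reconcile_calls(
    calls,
    manifest_endpoints={"llm.primary.example"},
    baa_expiry={"vendor-a": date(2027, 1, 1)},
)
```

Only the fallback call produces a finding, and it carries both reasons: the endpoint is absent from the manifest and the vendor has no BAA entry.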
D3 × Hidden Effective Scope. The cell from the opening scenario. An ambient scribe is observed writing FHIR DocumentReference resources matching a clinical-context type the declared role excludes — declared role covers ambulatory encounters, runtime shows behavioral health. §164.312(b) exposure: the agent operates in a clinical context it was never declared for. Continuous attestation: a behavioral envelope tied to FHIR resource type and operation, generated from runtime, not declared role. CADR correlation sharpens this when it fires with a D1×G2 fallback-model finding — usually meaning the fallback has different prompt scaffolding redirecting the agent into contexts the primary never operated in.
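A minimal sketch of an envelope built from runtime events rather than from the declared role might look like this. The `FhirEvent` fields and the context labels ("ambulatory", "behavioral-health") are assumptions for illustration; a real sensor would define its own event schema.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Set


class FhirEvent(NamedTuple):
    resource_type: str
    operation: str
    clinical_context: str


def build_envelope(events: Iterable[FhirEvent]) -> Counter:
    """Aggregate runtime events into an observed envelope with counts."""
    return Counter(events)


def undeclared_contexts(envelope: Counter, declared_contexts: Set[str]) -> Dict[FhirEvent, int]:
    """Return observed envelope entries whose clinical context was never declared."""
    return {
        event: count
        for event, count in envelope.items()
        if event.clinical_context not in declared_contexts
    }


events = (
    [FhirEvent("DocumentReference", "create", "ambulatory")] * 3
    + [FhirEvent("DocumentReference", "create", "behavioral-health")] * 2
)
envelope = build_envelope(events)
violations = undeclared_contexts(envelope, declared_contexts={"ambulatory"})
```

Here the two behavioral-health writes surface as a D3×G2 candidate because the declared role covers only ambulatory encounters.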
D3 × In-Scope Anomaly. Declared and observed scopes agree. The agent writes the resource types and operations it’s supposed to. But the recommendation distribution per case-class has shifted — an oncology decision support agent’s high-confidence recommendations have drifted from baseline over three weeks. No deployment event explains it. No scope violation. §164.312(b) exposure: behavioral integrity. Continuous attestation here is distribution-drift detection at the Deployment level, with causal correlation to surface the likely upstream — model update, FHIR feed change, training-data refresh — rather than just flag the symptom.
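The article does not name a drift metric, so the sketch below assumes one: total variation distance between a baseline window and a current window of recommendation labels. The labels, sample sizes, and threshold are illustrative values, not recommendations.

```python
from collections import Counter
from typing import Iterable


def total_variation(baseline: Iterable[str], current: Iterable[str]) -> float:
    """Total variation distance between two empirical label distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    b, c = Counter(baseline), Counter(current)
    nb, nc = sum(b.values()), sum(c.values())
    if nb == 0 or nc == 0:
        raise ValueError("both samples must be non-empty")
    labels = set(b) | set(c)
    # Counter returns 0 for labels missing from one side.
    return 0.5 * sum(abs(b[label] / nb - c[label] / nc) for label in labels)


# Hypothetical recommendation labels for one case class.
baseline = ["regimen_a"] * 70 + ["regimen_b"] * 30
current = ["regimen_a"] * 50 + ["regimen_b"] * 50

DRIFT_THRESHOLD = 0.1  # assumed; tune per agent and case class
drift = total_variation(baseline, current)
drifted = drift > DRIFT_THRESHOLD
```

A shift from a 70/30 split to a 50/50 split gives a distance of about 0.2, which exceeds the assumed threshold and would open a D3×G3 investigation.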
Nine cells, nine continuous outputs. The gap-type row dictates the remediation path. The discipline column dictates which HIPAA evidence demand the cell contributes to. SOC analysts see what kind of finding it is and what remediation path applies, without inferring it from a generic compliance tag.
Three failure modes specifically undermine the matrix in healthcare. They’re chosen for structural completeness: one structural (asymmetry across disciplines), one rooted in HIPAA’s permitted exceptions (break-glass), one rooted in healthcare’s regulatory cadence (HITRUST cycles plus EHR freezes). Other modes exist — vendor BAA expiry, identity churn, training-data leakage — and live in adjacent spokes. These three are the ones the matrix itself has to surface as first-class dashboard artifacts.
The structural failure mode. Two production agents under the same compliance tag. One is an ambient scribe with full eBPF runtime sensors — runtime-informed across all three disciplines. The other is a prior-auth assistant runtime-informed on D1 and D2 only. D3 is inventory-only.
Both read green. The first produces all three HIPAA evidence demands continuously. The second produces two of three; §164.312(b) is silent. The visual language inherited from CSPM was built for single-discipline coverage where green means coverage and red means gap. Partial multi-discipline coverage has no color.
Across 15–30 production agents, asymmetric maturity compounds. Different agents carry different patterns. The aggregate dashboard reads green; the aggregate evidence is incomplete; the asymmetry pattern is the fingerprint of the gap.
Remediation is structural, not procedural. Three best-of-breed tools producing three siloed runtime-informed attestations leave the asymmetry pattern across the fleet invisible — each tool reads green on its own discipline. The platform must produce evidence across all three disciplines on a unified substrate, with a per-discipline maturity surface naming the asymmetry directly: D1 coverage 100%, D2 coverage 100%, D3 coverage 60% (12 of 20 agents at runtime-informed, 8 at inventory-only).
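The per-discipline maturity surface reduces to a simple count per discipline. The sketch below assumes each agent carries a maturity label per discipline ("runtime-informed" or "inventory-only"); the agent names and data shape are hypothetical.

```python
from typing import Dict

DISCIPLINES = ("D1", "D2", "D3")


def coverage_by_discipline(fleet: Dict[str, Dict[str, str]]) -> Dict[str, dict]:
    """Compute runtime-informed coverage per discipline and name lagging agents."""
    report = {}
    total = len(fleet)
    for discipline in DISCIPLINES:
        lagging = sorted(
            agent for agent, levels in fleet.items()
            if levels.get(discipline) != "runtime-informed"
        )
        covered = total - len(lagging)
        report[discipline] = {
            "covered": covered,
            "total": total,
            "percent": round(100 * covered / total) if total else 0,
            "inventory_only": lagging,
        }
    return report


# Twenty agents: all runtime-informed on D1 and D2, twelve on D3.
fleet = {
    f"agent-{i:02d}": {
        "D1": "runtime-informed",
        "D2": "runtime-informed",
        "D3": "runtime-informed" if i < 12 else "inventory-only",
    }
    for i in range(20)
}
report = coverage_by_discipline(fleet)
```

For this fleet the report shows D1 and D2 at 100% and D3 at 60%, with the eight inventory-only agents listed by name instead of hidden behind a single tag.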
The HIPAA-permitted-exception failure mode. HIPAA permits and Joint Commission surveys often require break-glass — emergency overrides that exceed observed envelopes for legitimate clinical reasons. A clinician needs records outside their normal panel during a code blue. An AI agent calls a model endpoint outside its declared scope during a system failure. Sanctioned and HIPAA-compliant, provided each produces its own evidence chain.
The failure mode is what happens when the per-exception evidence chain is missing. Each unattested fail-open use trains the envelope to include former exceptions as routine. After roughly 50 break-glass activations over six months, the agent’s current observed envelope (the runtime baseline that the §164.502(b) attestation is computed against) includes patterns that originated as exceptions. The §164.502(b) attestation is no longer accurate. Effective scope has silently expanded.
Day 0: envelope tight, declared scope matches observed. Months 1–6: 47 legitimate break-glass activations, each producing transient envelope expansion absorbed into the rolling baseline. Month 6: cumulative exception-derived patterns are routine. The §164.502(b) attestation reads compliant. It isn’t.
Remediation is per-exception attestation telemetry. Break-glass paths are tagged at the envelope-construction level and excluded from the routine envelope. Each fail-open use produces its own evidence chain (trigger, identity, scope exercised, expiry, return-to-normal verification), surfaced as its own artifact and never folded back into the baseline.
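Keeping exceptions out of the baseline can be sketched as a tag on each access event. The example assumes every event executed under an override carries a break-glass identifier; the `AccessEvent` shape and the `bg-001` identifier are hypothetical. Only the scope-exercised field is tracked here; trigger, identity, and expiry would sit alongside it in a fuller record.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple

ScopeEntry = Tuple[str, str]  # (FHIR resource type, operation)


@dataclass(frozen=True)
class AccessEvent:
    agent: str
    resource_type: str
    operation: str
    break_glass_id: Optional[str] = None  # set when the event ran under an override


def routine_envelope(events: Iterable[AccessEvent]) -> Set[ScopeEntry]:
    """Build the baseline from untagged events only."""
    return {
        (e.resource_type, e.operation)
        for e in events
        if e.break_glass_id is None
    }


def scope_by_exception(events: Iterable[AccessEvent]) -> Dict[str, Set[ScopeEntry]]:
    """Group the scope exercised under each break-glass activation."""
    grouped: Dict[str, Set[ScopeEntry]] = {}
    for e in events:
        if e.break_glass_id is not None:
            grouped.setdefault(e.break_glass_id, set()).add((e.resource_type, e.operation))
    return grouped


events = [
    AccessEvent("prior-auth", "Patient", "read"),
    AccessEvent("prior-auth", "Encounter", "read", break_glass_id="bg-001"),
]
baseline = routine_envelope(events)
exceptions = scope_by_exception(events)
```

The Encounter read appears only under `bg-001`, so the baseline used for §164.502(b) attestation stays at the Patient read alone.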
The healthcare-regulatory-cadence failure mode. HITRUST CSF cycles, Joint Commission survey windows, and Epic/Cerner release freezes consume 4–6 weeks of the annual calendar. During those windows, security instrumentation changes and remediation deployments are typically disallowed — breaking a clinical workflow during a survey costs more than leaving findings open until the window closes.
The AI-SPM consequence: findings accumulate in a “known and can’t fix this quarter” backlog. Standard CSPM-inherited dashboards don’t surface this as a category. They show the finding as open, with a remediation path and a timeline. The fact that the timeline is operationally frozen — not because the team is slow, but because the change-control window prohibits action — is invisible to the executive view.
Remediation is dashboard-level. HIPAA-environment AI-SPM has to surface a frozen-finding state distinct from open. While frozen, the question shifts from “what’s the remediation path?” to “what compensating control is producing the §164.502(b) / §164.504(e) / §164.312(b) evidence the unfrozen finding would have produced?” The compensating control becomes the active artifact during the freeze — usually some combination of monitoring intensification, manual review cadence, or scope restriction at a layer the freeze doesn’t apply to. When the window ends, the frozen state lifts, the compensating control retires, and the remediation deploys.
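The frozen state can be modeled as a small state machine in which freezing requires a compensating control and remediation cannot deploy until the freeze lifts. This is a sketch under those assumptions; the class and method names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FindingState(Enum):
    OPEN = "open"
    FROZEN = "frozen"
    CLOSED = "closed"


@dataclass
class Finding:
    cell: str
    state: FindingState = FindingState.OPEN
    compensating_control: Optional[str] = None

    def freeze(self, compensating_control: str) -> None:
        """Enter the change-control window; a compensating control is mandatory."""
        if self.state is not FindingState.OPEN:
            raise ValueError("only an open finding can be frozen")
        if not compensating_control:
            raise ValueError("a frozen finding needs a compensating control")
        self.state = FindingState.FROZEN
        self.compensating_control = compensating_control

    def thaw(self) -> None:
        """Window ends: return to open and retire the compensating control."""
        if self.state is not FindingState.FROZEN:
            raise ValueError("only a frozen finding can be thawed")
        self.state = FindingState.OPEN
        self.compensating_control = None

    def close(self) -> None:
        """Remediation deployed; not allowed while frozen."""
        if self.state is not FindingState.OPEN:
            raise ValueError("only an open finding can be remediated")
        self.state = FindingState.CLOSED


finding = Finding(cell="D3xG2")
finding.freeze("weekly manual review of DocumentReference writes")
frozen_control = finding.compensating_control
finding.thaw()
finding.close()
```

Calling `close()` while the finding is frozen raises an error, which mirrors the rule that the remediation deploys only after the window ends.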
Standing up the matrix takes about 60 days from sensor deployment to first complete continuous-attestation cycle.
Deploy eBPF runtime sensors across production AI workloads — phased rollout, one agent class first, extending to the fleet over 4–6 weeks. The framework rollout sequence for healthcare covers sensor deployment, baseline establishment, and the L2-to-L3 transition in detail.
During the L2-to-L3 transition, every AI-SPM finding gets tagged by which cell of the 3×3 matrix it belongs to. Until findings are routed by cell, the dashboard can’t surface per-cell remediation paths or per-discipline maturity. A useful warm-up: take the most recent ten findings and retroactively assign them to cells. Coverage gaps surface immediately — the team realizes most findings have been D1 or D2 and the D3 column has been mostly empty, which is the asymmetric-maturity fingerprint.
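The warm-up exercise amounts to tagging each finding with a (discipline, gap type) pair and counting. A minimal sketch, with ten made-up findings standing in for a team's real backlog:

```python
from collections import Counter
from typing import List, Tuple

DISCIPLINES = {"D1", "D2", "D3"}
GAP_TYPES = {"G1", "G2", "G3"}

Tag = Tuple[str, str]  # (discipline, gap type)


def validate(tags: List[Tag]) -> None:
    """Reject tags that do not name a cell of the 3x3 matrix."""
    for discipline, gap in tags:
        if discipline not in DISCIPLINES or gap not in GAP_TYPES:
            raise ValueError(f"unknown cell: {discipline}x{gap}")


def cell_counts(tags: List[Tag]) -> Counter:
    """Count findings per matrix cell."""
    validate(tags)
    return Counter(f"{d}x{g}" for d, g in tags)


def discipline_counts(tags: List[Tag]) -> Counter:
    """Count findings per discipline to expose an empty column."""
    validate(tags)
    return Counter(d for d, _ in tags)


# Ten hypothetical recent findings, tagged retroactively.
recent = (
    [("D1", "G1")] * 3 + [("D1", "G2")] * 2
    + [("D2", "G1")] * 3 + [("D2", "G2")] * 1
    + [("D3", "G2")] * 1
)
per_cell = cell_counts(recent)
per_discipline = discipline_counts(recent)
```

In this made-up sample, nine of the ten findings fall under D1 or D2 and only one under D3, which is the asymmetric-maturity pattern the exercise is meant to expose.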
Wire the per-discipline maturity surface into the dashboard. Establish per-exception attestation telemetry as part of the break-glass workflow design, before the first break-glass event happens in production. Establish frozen-finding state and compensating-control attestation as part of the EHR change-control workflow, before the next HITRUST cycle.
The steady-state model operates without point-in-time evaluation events. The evidence is always being produced.
The 3×3 matrix is the steady-state operating model. The three HIPAA evidence demands are produced as continuous outputs, one per discipline. The three failure modes — asymmetric maturity, break-glass accumulation, change-window freeze — are the operational realities the matrix surfaces as first-class artifacts rather than absorbing into a green compliance tag.
Healthcare is the first vertical instance of a broader pattern. The same three disciplines and three gap types compose for any regulated environment where AI operates against a defined evidence framework. Financial services AI-SPM produces evidence against FFIEC and PCI-DSS using the same cell structure (model risk documentation, segregation-of-duties attestation, transaction-anomaly envelopes). Federal AI workloads do the same against NIST AI RMF and FedRAMP. The matrix is the pattern; each regulated vertical is an instance — and each maps cleanly to cross-vertical agentic-risk frameworks like the OWASP Top 10 for Agentic Applications.
What makes the pattern operational is the substrate. ARMO’s Application Profile DNA produces D1, D2, and D3 evidence on a single eBPF substrate — per-discipline asymmetry computed from the same evidence stream, not reconciled across three best-of-breed tools that don’t share a runtime model. CADR cross-discipline correlation makes findings spanning cells visible as one event rather than two unrelated alerts.
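To illustrate the general idea of treating findings from different cells as one event, the sketch below groups findings on the same Deployment that occur within a short time window of each other. It is not a description of how CADR works internally; the `CellFinding` shape, the 15-minute window, and the clustering rule are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class CellFinding:
    deployment: str
    cell: str
    at: datetime


def correlate(
    findings: List[CellFinding],
    window: timedelta = timedelta(minutes=15),
) -> List[List[CellFinding]]:
    """Cluster findings per Deployment; keep clusters spanning more than one cell."""
    by_deployment: Dict[str, List[CellFinding]] = {}
    for f in sorted(findings, key=lambda f: f.at):
        by_deployment.setdefault(f.deployment, []).append(f)

    events = []
    for items in by_deployment.values():
        cluster = [items[0]]
        for f in items[1:]:
            if f.at - cluster[-1].at <= window:
                cluster.append(f)
                continue
            if len({c.cell for c in cluster}) > 1:
                events.append(cluster)
            cluster = [f]
        if len({c.cell for c in cluster}) > 1:
            events.append(cluster)
    return events


t0 = datetime(2026, 4, 2, 10, 0)
findings = [
    CellFinding("ambient-scribe", "D1xG2", t0),
    CellFinding("prior-auth", "D2xG1", t0 + timedelta(minutes=2)),
    CellFinding("ambient-scribe", "D3xG2", t0 + timedelta(minutes=5)),
]
events = correlate(findings)
```

The two ambient-scribe findings merge into a single event spanning D1×G2 and D3×G2, while the isolated prior-auth finding stays a standalone alert.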
The dashboard’s job is not to declare HIPAA compliance. It’s to produce continuous evidence that the disciplines, gap types, and operational realities all compose into HIPAA-defensible posture — and to surface its own gaps directly when they emerge.
See how Application Profile DNA produces D1, D2, and D3 evidence on a single eBPF substrate. Request a healthcare AI-SPM walkthrough.
How does the dashboard surface asymmetric-maturity exposure when one discipline is at inventory-only maturity?
Per-discipline maturity is computed independently and surfaced as a coverage figure rather than a single green tag — for example, “D3 coverage 60%, 12 of 20 agents at runtime-informed, 8 at inventory-only.” The dashboard names the affected agents and the HIPAA evidence demand not being produced for them.
What does per-exception attestation actually capture for a break-glass event?
Five fields: trigger (what activated the override), identity (who or what exercised it), access scope exercised (what was touched, beyond the routine envelope), expiry (when the override deactivated), and return-to-normal verification. The attestation is its own artifact, never folded into the routine envelope.
How does a frozen-finding state differ from open or closed?
Open: actionable, team is working on it. Closed: remediated. Frozen: remediation path exists but a regulatory or operational change-control window prohibits deployment for a defined period. Frozen findings carry a compensating-control attestation producing the HIPAA evidence the unfrozen remediation would have produced. When the window lifts, state transitions to open, the compensating control retires, the remediation deploys.
Can the 3×3 matrix coexist with an existing CSPM dashboard, or does it replace one?
Coexist. The matrix operates at the AI workload layer — Kubernetes Deployments running clinical AI agents — while CSPM operates at the cloud infrastructure layer below. Different evidence, different exposures. The integration point is the maturity surface, where AI-SPM coverage is reported alongside CSPM rather than absorbed into it.
How long does the matrix take to produce its first complete continuous-attestation cycle?
About 60 days. Phase 1 (weeks 1–4): eBPF sensor rollout. Phase 2 (weeks 4–6): per-discipline baseline establishment. Phase 3 (weeks 6–8): the L2-to-L3 transition where findings get tagged by cell. The first complete cycle — covering all three disciplines, all three gap types, and at least one occurrence of each failure mode — typically completes around day 60.