Prompt and Tool Call Visibility: What Your AI Agents Are Actually Doing
Apr 29, 2026
The external auditor’s evidence request lands Tuesday morning. A security architect at a Tier 1 bank pulls up her AI-SPM dashboard for the SOC2 Type 2 review. Eighty-three AI agents running across the bank’s clusters. For each one, the dashboard shows the current configuration and the current behavioral baseline. The data is accurate, comprehensive, and point-in-time.
That last word is the problem. The auditor isn’t asking what these agents are doing today — he’s asking what they were doing every day for the last twelve months. Did the fraud scoring agent’s identity scope quietly widen in March? Did the document classification agent start routing outputs to a destination the BAA registry never approved? Did the customer service agent pick up an MCP tool that bypassed change management? The dashboard answers none of these. It reads the present, not a window. What the auditor needs is evidence of what was true on every day between two audits.
That gap is the difference between a posture dashboard and an AI Security Posture Management practice that holds up under audit. The dashboard is a snapshot. The practice is a continuous record. SOC2, PCI-DSS, and MAS each demand the second.
The published financial services CISO evaluation framework anchors on a question that became necessary the moment AI agents moved into production trading and payment systems: show me the call stack. Reconstruct the exact execution path from prompt through tool call through data egress when an incident fires. That diagnostic is correct, and the regulatory frameworks it maps to — FFIEC, SEC Regulation S-P, NYDFS Part 500, SOX §404 — are all retrospective. Each treats evidence as a question about reconstructing a specific event for a specific incident determination.
Posture management asks a different question, against a different time window. The auditor reading a SOC2 Type 2 report is asking what controls operated effectively across the entire reporting period. The QSA reviewing PCI-DSS 4.0.1 is testing whether the cardholder data environment maintained its required posture every day across the assessment cycle. The MAS supervisor reviewing AI risk management is testing whether continuous monitoring produced evidence at the cadence the lifecycle controls demand.
Different evidence base, different instrumentation, different artifact. The three disciplines AI-SPM splits into — Model and Artifact Posture, Identity and Access Posture, Behavioral Posture — each have to be attestable across an audit window. A snapshot of any of them at audit time satisfies none of the three frameworks.
The AICPA Trust Services Criteria define five categories — Security, Availability, Confidentiality, Processing Integrity, Privacy — with Security covering nine common criteria (CC1 through CC9). Type 1 reports describe whether controls were suitably designed at a single point in time. Type 2 tests whether controls operated effectively across a reporting period of typically six to twelve months. Type 2 is the report regulators and customers ask for. For AI agents, three Trust Services Criteria carry the load — and all three resolve to the same architectural requirement: the daily reconciliation between configured posture and operational posture, captured as a continuous record.
CC6 covers identity and access management. The Type 2 examiner will look for evidence that the gap between what the agent was authorized to access and what it actually accessed remained within an examined envelope every day in the reporting period. Static AI-SPM produces the configured side cleanly: IAM scope, RBAC bindings, federated credential subject claims. It does not produce the operational side at the cadence Type 2 requires. Runtime-informed AI-SPM reads two posture artifacts side by side and reports the gap as the finding. ARMO’s Application Profile DNA, captured per agent at the Deployment level via an eBPF substrate at 1-2.5% CPU overhead, produces the operational artifact every day the agent runs. The gap between declared scope and observed exercise is the CC6 evidence — a continuous record, not an audit-day snapshot.
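The daily reconciliation can be sketched in a few lines. This is an illustrative sketch, not ARMO's API: the names `PostureRecord` and `cc6_gap` are hypothetical, and it assumes both the declared scope and the observed exercise arrive as per-day permission sets.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record shape; field names are illustrative assumptions.
@dataclass(frozen=True)
class PostureRecord:
    agent: str
    day: date
    permissions: frozenset  # declared IAM scope, or observed exercised calls

def cc6_gap(declared: PostureRecord, observed: PostureRecord) -> dict:
    """Reconcile configured posture against operational posture for one day.

    latent: granted but never exercised (over-provisioning to review).
    excess: exercised but never granted (the CC6 audit finding).
    """
    latent = declared.permissions - observed.permissions
    excess = observed.permissions - declared.permissions
    return {
        "agent": declared.agent,
        "day": declared.day.isoformat(),
        "latent": sorted(latent),
        "excess": sorted(excess),
        "within_envelope": not excess,
    }
```

Run daily and retained, records in this shape are exactly the continuous CC6 evidence the Type 2 examiner samples: any day with a non-empty `excess` list is the finding.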
CC7.2 specifically requires monitoring of system components for anomalies. For non-deterministic agents, the question is what “anomaly” means against a baseline that legitimately evolves with model updates and prompt template changes. The methodology described in the published intent drift detection approach produces the audit evidence: deployment-correlated drift events become CC7.2 monitoring records that distinguish expected evolution from unexamined behavioral change.
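The deployment-correlation step is the part worth making concrete. A minimal sketch, assuming drift events and deployment records carry timestamps and assuming a 15-minute correlation window (the window size is an illustrative choice, not a prescribed value):

```python
from datetime import datetime, timedelta

def classify_drift(drift_ts: datetime, deployments: list,
                   window: timedelta = timedelta(minutes=15)) -> str:
    """Tag a behavioral drift event for the CC7.2 monitoring record.

    Drift within `window` of a recorded deployment (model update, prompt
    template change) is expected evolution; drift with no nearby deployment
    is unexamined behavioral change — the anomaly the examiner asks about.
    """
    for deploy_ts in deployments:
        if abs(drift_ts - deploy_ts) <= window:
            return "expected-evolution"
    return "unexamined-change"
```

The point of the tag is that the CC7.2 record distinguishes the two cases rather than alerting on every change to a legitimately evolving baseline.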
CC8.1 requires that changes are authorized before implementation. Model updates, prompt template changes, and MCP tool catalog additions are changes — and most are not flowing through traditional change management. A runtime-derived AI Bill of Materials produces the inventory artifact CC8 evidence requires: every model loaded into memory, every MCP tool the agent registered against, every framework dependency that actually executed, captured as it happens. The Type 2 examiner will accept the runtime AI-BOM as change evidence; they will reject a static manifest because it cannot demonstrate what was true on days the manifest was not refreshed.
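The CC8 join is mechanically simple once the runtime AI-BOM exists. A hedged sketch, assuming each runtime observation and each approved change record names the artifact it covers (the dict shapes are illustrative):

```python
def cc8_findings(aibom_events: list, approved_changes: list) -> list:
    """Every runtime-observed artifact — model load, MCP tool registration,
    framework dependency — must map to an authorized change record.
    The remainder is the CC8.1 finding: change implemented without
    authorization."""
    approved = {change["artifact"] for change in approved_changes}
    return [event for event in aibom_events
            if event["artifact"] not in approved]
```

A static manifest cannot feed the left side of this join, which is why the examiner rejects it: the join only means something when `aibom_events` reflects what actually loaded, day by day.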
PCI-DSS v4.0.1 replaced v4.0 on December 31, 2024. The 51 future-dated requirements introduced in v4.0 became mandatory across all assessments on March 31, 2025. As of 2026, every PCI-DSS assessment is conducted against v4.0.1, and the philosophy underneath the standard has shifted: the PCI Security Standards Council frames v4.0.1 as the rejection of the annual-event compliance model in favor of continuous, business-as-usual security practice. That shift maps directly onto AI agent posture management, because non-deterministic agents drift faster than annual assessments examine. Four requirements carry specific weight:
Req 6.3.3 mandates patches for critical vulnerabilities within 30 days of release. The applicable AI surface includes framework packages (LangChain, vLLM, MCP server libraries), inference runtime dependencies, and pickle-format model loads that standard SCA tools were not designed to scan. The runtime AI-BOM produces the daily evidence record of what actually loaded versus what was declared, and runtime reachability analysis filters the CVE list to packages actually loaded into memory — typically reducing actionable findings by 90% or more by focusing the patching cadence on vulnerabilities that fire in production.
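The reachability cut itself is a set filter over the runtime AI-BOM. A minimal sketch, assuming CVE findings name the affected package and the AI-BOM reports which packages actually loaded into memory (field names are assumptions):

```python
def reachable_cves(cve_findings: list, loaded_packages: list) -> list:
    """Filter declared-dependency CVEs down to packages observed loaded
    into memory — the runtime-reachability cut that focuses the Req 6.3.3
    30-day patching clock on vulnerabilities that can fire in production."""
    loaded = set(loaded_packages)
    return [cve for cve in cve_findings if cve["package"] in loaded]
```

The 90%+ reduction the article cites comes from the size difference between the declared dependency tree and the set of packages a given agent ever loads.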
Req 10.4.1 requires automated mechanisms for audit log review. For AI agents handling cardholder-data-adjacent operations, log review at the network or container level is structurally insufficient — the audit trail has to reach function and parameter granularity. ARMO’s CADR (Cloud Application Detection and Response) joins eBPF kernel telemetry with application-layer correlation, producing a function-call audit trail that ties syscall events to the prompt and tool invocation that triggered them.
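The cross-layer join can be sketched as an interval match. This is an illustrative simplification of what a CADR-style correlator does, not ARMO's implementation: it assumes kernel events carry a pid and timestamp, and application-layer tool spans carry a pid, a time interval, and the originating prompt identifier.

```python
def correlate(syscall_events: list, tool_spans: list) -> list:
    """Join kernel-layer events to the application-layer tool invocation
    (and its originating prompt) active in the same process at that time,
    producing the function-parameter audit trail Req 10.4.1 review runs on."""
    trail = []
    for event in syscall_events:
        for span in tool_spans:
            same_process = span["pid"] == event["pid"]
            in_window = span["start"] <= event["ts"] <= span["end"]
            if same_process and in_window:
                trail.append({**event,
                              "tool": span["tool"],
                              "prompt_id": span["prompt_id"]})
    return trail
```

The design point is that neither layer alone produces an interpretable record: the syscall says what touched the network or filesystem, the span says which prompt and tool asked for it, and only the join answers the auditor's question.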
Req 11.6.1 requires a change-and-tamper detection mechanism with assessment at least weekly — a tighter cadence is needed for AI agents because their behavior changes more often than payment-page scripts. CADR cross-layer correlation produces the continuous tamper-event record the QSA samples. Req 12.5.1 requires a current inventory of all in-scope system components, which for AI agents includes models, frameworks, MCP servers, RAG indexes, and tool catalogs that change at CI/CD pace. The runtime AI-BOM produces this inventory as a continuous record. The QSA will reject a static manifest because v4.0.1 explicitly requires accuracy at all times.
Two MAS frameworks apply to AI agent posture in financial services, sitting at different stages of maturity. The MAS Technology Risk Management Guidelines, alongside the binding Notices FSM-N05 and FSM-N06, are the existing supervisory framework — they apply now to every Singapore-regulated FI and govern the broader cybersecurity and technology risk posture, with continuous monitoring, change management, and asset inventory at the technology infrastructure level.
Layered on top, on November 13, 2025, MAS issued a consultation paper proposing Guidelines on AI Risk Management for Financial Institutions. The consultation closed January 31, 2026; final issuance is expected in 2026, with a 12-month transition period to follow. The proposed Guidelines build on the FEAT principles MAS published in 2018 and consolidate lessons from the 2024 thematic review of AI Model Risk Management. They explicitly cover generative AI and AI agents, and they apply to all Singapore-regulated FIs — including branches and subsidiaries of overseas groups, which can rely on parent-entity AI risk frameworks only if those frameworks meet the MAS supervisory bar.
Five expectations in the proposed Guidelines map onto AI-SPM evidence demands. AI identification: a documented process to consistently identify where AI is used. AI inventory: an accurate, up-to-date inventory of AI use cases, systems, and models, including third-party AI. Lifecycle controls calibrated to risk materiality. Continuous monitoring with documented change controls covering model updates, prompt template revisions, and tool catalog additions. AI-specific incident response playbooks connected to the broader cyber incident response apparatus the TRM Guidelines already require.
The structural difference from current US frameworks is the proactive cadence. US frameworks primarily produce evidence demands when an incident triggers determination. The proposed MAS AI Guidelines codify continuous evidence as a supervisory baseline regardless of incident state. For institutions operating across both, MAS becomes the binding upper expectation.
The deliverable architects can carry into a vendor evaluation is the matrix below. Rows correspond to the three AI-SPM disciplines. Columns correspond to the three named frameworks. Each cell names the specific posture artifact the framework requires from the discipline.
| Discipline | SOC2 Type 2 | PCI-DSS 4.0.1 | MAS AI Guidelines + TRM |
| --- | --- | --- | --- |
| Model and Artifact Posture | CC8 change management evidence: every model load, framework dependency, and MCP tool addition tied to authorized change record across the reporting period. | Req 6.3.3 and Req 12.5.1: runtime AI-BOM as continuous inventory of what actually loaded versus what was declared, with reachability filtering to executed code. | AI inventory expectation: a current AI use case, system, and model inventory including third-party AI, refreshed at runtime cadence rather than periodically. |
| Identity and Access Posture | CC6 logical access evidence: declared-versus-observed permission gap captured every day, surfacing latent capability and hidden effective scope across the audit window. | Req 7 and Req 8.4.2: per-agent identity binding pattern with each agent’s effective scope continuously attested against actual exercise. | Lifecycle controls calibrated to risk materiality: per-agent enforcement boundaries differentiated by AI use case classification, with continuous evidence boundaries held. |
| Behavioral Posture | CC7.2 anomaly monitoring: drift event records distinguishing expected evolution from unexamined change, with deployment correlation across the reporting period. | Req 10.4.1 and Req 11.6.1: function-parameter audit trail joined to prompt and tool invocation, captured continuously. | Continuous monitoring: behavioral baseline drift, model-update correlation, and tool-catalog change tracking captured at agent-risk-appropriate cadence. |
The structural finding is uniform: every cell requires the same instrumentation type — continuous behavioral capture joined to declared configuration, captured across the audit window. Static configuration scanning produces only the declared half of every cell. Kernel telemetry without application-layer correlation produces only the syscall half. The complete cell requires both: kernel substrate plus application-layer correlation that links syscalls to the prompts, tool invocations, and identity hops that produced them.
The published Runtime Context Test for AI workload security tools introduced the broader “show me” diagnostic device. Translated for AI-SPM in financial services, it becomes a single question vendors either answer or fail:
Between the SOC2 Type 2 audit your team completed last quarter and the next assessment cycle, can you produce a continuous attestation that this specific agent’s posture remained within its declared envelope on every day in the period — including days the agent was idle, days it scaled to zero, and days a new MCP tool was added to its catalog?
Three vendor categories fail the question cleanly. CSPM tools rebadged as AI-SPM cannot produce evidence per day because their underlying configuration scans run periodically. Static AI inventory tools cannot produce evidence per agent because they describe the environment rather than per-workload state across time. CNAPP tools with bolted-on AI labels can produce per-day attestation for cloud configuration drift but not for AI-specific posture (model load history, tool catalog evolution, behavioral envelope drift).
An AI-SPM stack built on continuous behavioral capture answers differently. Application Profile DNA per agent at the Deployment level captures the exercised behavioral envelope every day. The runtime AI-BOM captures inventory state every day. The configured-versus-operational reconciliation runs every day. None are produced specifically for the audit; they are produced as the operational byproduct of the AI-SPM practice itself, and the audit evidence falls out of them as a query against existing data.
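That "query against existing data" is worth spelling out, because it is where idle days and scale-to-zero days bite. A hedged sketch, assuming daily reconciliation records in a shape like the CC6 example earlier (an idle day still emits a record, e.g. one whose envelope trivially holds):

```python
from datetime import date, timedelta

def attest_period(agent: str, records: list, start: date, end: date) -> list:
    """Type 2-style continuous attestation: for every day in [start, end],
    require an evidence record for this agent and require the record to be
    within the declared envelope. A missing day — including a day the agent
    was idle or scaled to zero with no record emitted — is itself a failure."""
    by_day = {r["day"]: r for r in records if r["agent"] == agent}
    failures = []
    day = start
    while day <= end:
        record = by_day.get(day)
        if record is None:
            failures.append((day, "no-evidence"))
        elif not record["within_envelope"]:
            failures.append((day, "envelope-breach"))
        day += timedelta(days=1)
    return failures  # empty list == the attestation holds for the period
```

The vendor question in the previous section is exactly whether this query can return an empty list over a real reporting period, which requires the substrate to have emitted a record on every day, not only on days something interesting happened.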
Two operational realities shape how continuous attestation deploys inside a financial services institution. The first is change management. As the published financial services framework implementation guide describes, every change to security tooling in a regulated FS environment goes through a Change Approval Board cycle that typically runs two to four weeks, with production freeze windows around quarter-end and year-end. An eBPF-based sensor deployed as a DaemonSet on existing node pools, with no application code changes and no sidecars, fits inside the change envelope most platform teams already accept for observability.
The second reality is evidence pipeline architecture. Most institutions try to satisfy SOC2, PCI-DSS, and MAS-equivalent frameworks with separate evidence pipelines — producing the same data three times in three different shapes, where the inconsistencies become audit findings of their own. The architecture that survives is one telemetry source, three evidence framings: a single continuous behavioral substrate that produces the operational posture artifact, with framework-specific reporting layers that translate it into SOC2 control evidence, PCI-DSS assessment evidence, or MAS-aligned monitoring records as needed.
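The one-source/three-framings pattern reduces to a translation layer over a single record. A minimal sketch, where the record fields and the control mappings are the ones named in this article and the function shape is an illustrative assumption:

```python
def frame_evidence(record: dict, framework: str) -> dict:
    """Translate one behavioral posture record into a framework-specific
    evidence shape. One telemetry source feeds all three framings, so the
    three reports cannot contradict each other — the inconsistency class
    that separate pipelines turn into audit findings disappears."""
    framings = {
        "soc2": {"control": "CC6.1", "evidence": record["gap"]},
        "pci": {"requirement": "10.4.1", "evidence": record["audit_trail"]},
        "mas": {"expectation": "continuous-monitoring", "evidence": record["drift"]},
    }
    return framings[framework]
```

The design choice being illustrated: the framework-specific layer owns only presentation, never collection, so adding a fourth framework is a new framing entry rather than a new pipeline.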
That pattern matches the broader observe-to-enforce methodology — the same per-agent behavioral envelope that drives enforcement policies also produces the continuous attestation auditors need. ARMO’s platform, built on the open-source Kubescape foundation used by more than 100,000 organizations, supports continuous automated compliance monitoring across 260+ Kubernetes-native controls covering SOC2, PCI-DSS, HIPAA, GDPR, CIS, NSA, and NIST, with audit-ready evidence exports routed through the same telemetry base producing the runtime-informed posture findings.
AI agents in production financial services environments are subject to evidence demands that retrospective per-incident reconstruction cannot answer. SOC2 Type 2, PCI-DSS 4.0.1, and the proposed MAS AI Guidelines all converge on continuous evidence between audits. Static AI-SPM produces a snapshot, and a snapshot satisfies none of them. The ARMO platform for cloud-native AI workload security produces the continuous behavioral substrate that closes the gap. Book a demo to see how it works.
How is this different from the financial services CISO call-stack diagnostic?
The financial services CISO evaluation framework answers a retrospective question: when an incident fires, can the vendor reconstruct the exact execution path the regulator’s investigation requires? That maps to FFIEC, NYDFS, and SEC frameworks where evidence demands are triggered by incident determination. The continuous attestation diagnostic answers a different question: between two audits, can the vendor produce day-by-day evidence the agent’s posture stayed within its declared envelope? That maps to SOC2 Type 2, PCI-DSS 4.0.1, and the MAS AI Guidelines. Most institutions need both diagnostics, because both regulatory categories apply.
What is the practical instrumentation requirement for continuous attestation?
Kernel-layer behavioral capture plus application-layer correlation, deployed continuously rather than at audit time. Kernel telemetry alone gets you process and syscall visibility but not the semantic context that makes the audit trail interpretable as agent behavior. Application-layer correlation alone gets you tool invocations but not the kernel-truth that survives a compromised application layer. The complete substrate runs both, on every node hosting AI workloads, every day in the audit window — at overhead levels typically 1-2.5% CPU and 1% memory.
How does PCI-DSS 4.0.1 continuous compliance change AI-SPM vendor evaluation criteria?
PCI-DSS v4.0.1 explicitly rejected the annual-event compliance model when the 51 future-dated requirements became mandatory on March 31, 2025. That changes the evaluation question from “can this tool produce a quarterly report?” to “can it produce evidence at a cadence the QSA can sample any day in the assessment cycle?” A vendor whose architecture assumes periodic scans cannot satisfy 4.0.1 evidence demands for AI agents, because the evidence has to exist for days the auditor was not specifically watching.
Do the proposed MAS AI Risk Management Guidelines apply to US banks?
Yes, when those banks operate in Singapore through a branch or subsidiary. The November 2025 consultation paper proposes the Guidelines apply to all MAS-regulated FIs in Singapore. Singapore branches or subsidiaries can rely on group-level AI frameworks only if those meet MAS’s supervisory bar. Consultation closed January 31, 2026; final issuance and a 12-month transition period are expected during 2026. FIs are already preparing ahead of finalization, and even institutions with no Singapore footprint cite the proposed Guidelines as a best-practice anchor — they remain the most explicit AI-specific articulation of supervisory expectations globally.
What is the minimum deployment footprint to produce SOC2 Type 2-grade continuous attestation?
Kernel-level eBPF sensors as a DaemonSet across all node pools hosting AI workloads, with application-layer correlation joining kernel events to the prompts, tool invocations, and identity hops that produced them. Per-agent behavioral profiles built at the Deployment level rather than per-pod, so they survive pod churn. A runtime-derived AI-BOM as the inventory layer feeding model load and dependency evidence. A reporting layer that exports framework-specific evidence packages from the same underlying telemetry base.