Leadership

The Guardrail Is Not the Audit Trail

AgentCore, Vertex, and Azure Foundry enforce per-call agent policy. None reconstruct the session. The audit gap enterprise AI programs aren't pricing.

Koundinya Lanka

Jun 29, 2026

10 min read

When AWS AgentCore Policy evaluates an agent action, Cedar looks at one thing: [a single request](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy.html), meaning its current principal, action, resource, and context. The decision is scoped to that request. There is no session history. The agent's next call gets evaluated the same way: clean, isolated, with no memory of what came before. That is a [documented limitation](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy-limitations-section.html), not a bug. The design assumes the call boundary is the right place to enforce. For a great many actions, it is.

The same shape holds across every hyperscaler agent platform shipping in 2026. Google Vertex Agent Builder governs tools through an access-control layer backed by its tool registry. It tracks what the agent is permitted to invoke, not what it actually assembled across forty calls, and its session-level tracing surfaces in a developer-facing traces tab rather than a [compliance-grade audit trail](https://cloud.google.com/blog/products/ai-machine-learning/new-enhanced-tool-governance-in-vertex-ai-agent-builder). Azure AI Foundry intercepts at input, tool call, tool response, and output. Microsoft's own Cloud Adoption Framework explicitly lists behavioral observability via Azure Monitor and Application Insights as a [separate, additive layer](https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ai-agents/governance-security-across-organization), not part of the guardrail enforcement path. Three vendors, three naming conventions, one shared architectural assumption.

Enterprise AI programs are buying these guardrails as if they are buying accountability. They are not the same purchase. That confusion will surface in the next audit cycle.

What do agent guardrails actually enforce?

Per-call gates. AgentCore's Cedar policies evaluate principal-action-resource-context [per request](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy.html). Vertex's tool governance whitelists which tools the agent can call. Azure Foundry's filters intercept at input, tool call, response, and output. Each of these is a per-call gate, and each does that job well. The vendors are not overselling — they are documenting precisely.
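A minimal sketch of what "per-call" means in practice, written in Python rather than Cedar. The names `Request`, `evaluate`, `FORBIDDEN_ACTIONS`, the tool names, and the `customer/42` resource are all hypothetical, and this is not the AgentCore or Cedar API. The point it illustrates is only that the decision function receives one request and nothing about the calls before it.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """One authorization request: the only input the gate ever sees."""
    principal: str
    action: str
    resource: str
    context: dict = field(default_factory=dict)


# Hypothetical deny list. A real deployment would express this as policies.
FORBIDDEN_ACTIONS = {"customer_record.export"}


def evaluate(request: Request) -> str:
    """Decide on a single request using only that request's own fields."""
    if request.action in FORBIDDEN_ACTIONS:
        return "DENY"
    return "ALLOW"


# Illustrative call sequence. Each call is judged in isolation.
calls = [
    "search_endpoint",
    "crm_lookup",
    "ticketing_api",
    "invoicing_tool",
    "finance_read",
    "customer_record.export",
]
decisions = [evaluate(Request("agent-1", action, "customer/42")) for action in calls]
```

Note that `evaluate` has no session parameter at all, so nothing it returns can depend on what the agent has already collected.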

What they do not do is track the shape of the session.

Consider how this fails in practice. An agent has access to dozens of internal tools. The policy says it cannot call `customer_record.export`. Across dozens of tool invocations in a single conversation, the agent quietly queries customer names from a search endpoint, account IDs from a CRM lookup, support tickets from a ticketing API, billing summaries from an invoicing tool, and recent transactions from a finance read endpoint. Then it hits `customer_record.export`. The guardrail blocks the call.

```mermaid
sequenceDiagram
  participant Agent
  participant Policy

  Agent->>Policy: search_endpoint(customer names)
  Policy-->>Agent: OK

  Agent->>Policy: crm_lookup(account IDs)
  Policy-->>Agent: OK

  Agent->>Policy: ticketing_api(support tickets)
  Policy-->>Agent: OK

  Agent->>Policy: invoicing_tool(billing summaries)
  Policy-->>Agent: OK

  Agent->>Policy: finance_read(recent transactions)
  Policy-->>Agent: OK

  Agent->>Policy: customer_record.export
  Policy-->>Agent: BLOCKED

  Note over Agent,Policy: Policy log records 1 event.<br/>Agent assembled a near-complete customer record<br/>across 5 prior approved calls.
```

Every per-call decision was correct. The cumulative behavior was not. The agent assembled the export anyway, across many approved calls, and the single blocked call is the only event the policy log flags as a problem. No native enforcement layer reconstructs what just happened. The session, as a behavioral object, is invisible to the control plane.
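To make the gap concrete, here is a rough sketch of the kind of session-level check that a per-call gate cannot perform. It is an illustration under stated assumptions, not a feature of any platform named above: `SessionMonitor`, the tool-to-category mapping, and the threshold of four categories are all invented for the example.

```python
from collections import defaultdict

# Hypothetical mapping from tool name to the category of customer data it returns.
TOOL_DATA_CATEGORY = {
    "search_endpoint": "names",
    "crm_lookup": "account_ids",
    "ticketing_api": "support_tickets",
    "invoicing_tool": "billing_summaries",
    "finance_read": "transactions",
}

# Assumed: these categories together approximate what the blocked export would return.
EXPORT_EQUIVALENT = frozenset(TOOL_DATA_CATEGORY.values())


class SessionMonitor:
    """Tracks which data categories each session has touched across approved calls."""

    def __init__(self, threshold: int = 4):
        self.threshold = threshold
        self._seen = defaultdict(set)

    def record(self, session_id: str, tool: str) -> bool:
        """Record one approved call. Return True once the session's combined
        reads cover at least `threshold` export-equivalent categories."""
        category = TOOL_DATA_CATEGORY.get(tool)
        if category is not None:
            self._seen[session_id].add(category)
        return len(self._seen[session_id] & EXPORT_EQUIVALENT) >= self.threshold


monitor = SessionMonitor()
session_calls = ["search_endpoint", "crm_lookup", "ticketing_api", "invoicing_tool", "finance_read"]
flags = [monitor.record("session-1", tool) for tool in session_calls]
# flags is [False, False, False, True, True]: the fourth approved read trips the check.
```

The state lives in the monitor, keyed by session, which is exactly the memory the call-boundary design leaves out.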

This gap is architectural. Guardrails are design-time, per-call enforcement controls; observability is runtime, cross-session behavioral tracking. [Neither substitutes for the other](https://www.geordie.ai/resources/guardrails-explained/), and the [practitioner literature](https://atlan.com/know/ai-agent-observability/) has been consistent on that distinction.

Existing LLM observability stops at prompts and tokens

Most enterprise LLM observability tools stop at prompts, tokens, and latency. They can report that something went wrong but [cannot reconstruct why](https://atlan.com/know/ai-agent-observability/), and they were not designed to track autonomous behavioral patterns across multi-step, multi-call agent sessions. A latency graph and a token-spend chart do not explain why the agent decided to chain the seven calls it did, in the order it did, to produce the output it produced. The instrumentation was built for chatbots, where the unit of analysis is the prompt — not for autonomous agents, where the unit of analysis is the trajectory.

Third-party platforms ([Zenity](https://zenity.io/platform/ai-observability), [Galileo](https://galileo.ai/blog/ai-agent-compliance-governance-audit-trails-risk-management)) are building this layer precisely because the hyperscalers do not ship it natively. The pitch is cross-environment tracking of execution paths, tool invocations, memory updates, RAG queries, and file access. The trail of what the agent actually did, not what it was allowed to do.

Those are two budget line items. Procurement is not pricing them that way.

Key Insight

Guardrails and observability are two budget line items. Procurement is not pricing them that way — it buys one and calls the job done. That's why the gap persists across every enterprise AI program.

The confidence-reality gap in the survey data

- Executive confidence: 82% of executives believe their policies protect against unauthorized agent actions.
- Ship with full approval: 14.4% of organizations send agents to production with full security or IT approval.
- Had incidents: 88% of organizations had confirmed or suspected AI agent security incidents in the preceding year.
- Gartner 2030 forecast: Gartner projects that 50% of AI agent deployment failures will stem from insufficient runtime enforcement by AI governance platforms by 2030.

One industry survey aggregated by AGAT Software reports that 82% of executives are confident their policies protect against unauthorized agent actions, yet only 14.4% of organizations send agents to production with full security or IT approval. That is a [68-point confidence-reality gap](https://agatsoftware.com/blog/ai-agent-security-enterprise-2026/) (82 minus 14.4 is 67.6, rounded up). The same source reports that 88% of organizations had confirmed or suspected AI agent security incidents in the preceding year, and that only 21.9% treat AI agents as independent, identity-bearing entities with their own audit trails.

The primary publisher of the underlying "State of AI Agent Security 2026" data is not clearly identified in the citing source, so treat the magnitudes as directional rather than precise. The shape is the point. Even discounted, the distance between executive confidence and the proportion of agents that ship with a real audit identity is the diagnosis. Confidence is rising on the strength of the controls vendors document; the audit identity is what the controls do not by themselves produce.

Gartner has framed the consequence: by 2030, 50% of AI agent deployment failures will be due to [insufficient AI governance platform runtime enforcement](https://atlan.com/know/ai-agent-observability/), explicitly linking the behavioral observability gap to systemic multi-system agent failure.

The pattern is recognizable from prior platform cycles. A decade ago, cloud security looked like this: IAM policies in place, no flow logs turned on. Then SOC 2 auditors caught up. The same correction is coming for agents.

Regulators want logs, not policy documents

The EU AI Act, HIPAA Technical Safeguards, and the U.S. Treasury Financial Services AI Risk Management Framework all require [audit logs demonstrating that controls actually operated](https://galileo.ai/blog/ai-agent-compliance-governance-audit-trails-risk-management). Policy documentation alone is insufficient.

A Cedar policy that blocked a call is documentation that one control fired. It is not a reconstruction of the session. An auditor asking "what did this agent do in the customer-service interaction that resulted in the disputed refund" is asking for the session trace, the tool-call sequence, the memory updates, the prompt rewrites, the decision rationale. None of that lives in the policy log. The policy log says: at 14:32:17 UTC, action X was permitted; at 14:32:19 UTC, action Y was denied. The audit wants the story between those two timestamps.
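As a rough illustration of "the story between those two timestamps", the sketch below filters a session trace to one window and orders it. Everything here is assumed for the example: the `TraceEvent` shape, the event kinds, the session IDs, and the calendar date (the text above gives only times of day). It presumes an observability layer has already captured these events, which is the part the policy log does not do.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TraceEvent:
    session_id: str
    timestamp: datetime
    kind: str    # e.g. "policy_decision", "tool_call", "memory_update"
    detail: str


def between(events, session_id, start, end):
    """Return one session's events with start <= timestamp <= end, oldest first."""
    selected = [
        e for e in events
        if e.session_id == session_id and start <= e.timestamp <= end
    ]
    return sorted(selected, key=lambda e: e.timestamp)


def utc(hour, minute, second):
    # The date is arbitrary and chosen only so the example runs.
    return datetime(2026, 1, 1, hour, minute, second, tzinfo=timezone.utc)


events = [
    TraceEvent("s1", utc(14, 32, 19), "policy_decision", "action Y: DENIED"),
    TraceEvent("s2", utc(14, 32, 18), "tool_call", "unrelated session"),
    TraceEvent("s1", utc(14, 32, 18), "tool_call", "tool invoked between X and Y"),
    TraceEvent("s1", utc(14, 32, 17), "policy_decision", "action X: PERMITTED"),
    TraceEvent("s1", utc(14, 32, 30), "memory_update", "outside the window"),
]

story = between(events, "s1", utc(14, 32, 17), utc(14, 32, 19))
kinds = [e.kind for e in story]
```

A policy log alone would hold only the first and last entries of `story`. The middle entry is what an auditor is asking about.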

Policy log vs. audit

What the policy log holds: 14:32:17, action X: PERMITTED. 14:32:19, action Y: DENIED. Nothing else: no session context, no behavioral record.

What the audit asks for: session trace, ordered tool-call sequence, memory updates, prompt rewrites, decision rationale. The EU AI Act, HIPAA, and the U.S. Treasury FSAI RMF all require this evidence.

The exposure is concentrated and growing. [72% of S&P 500 companies disclosed at least one material AI risk in 2025](https://galileo.ai/blog/ai-agent-compliance-governance-audit-trails-risk-management), while only 26% have comprehensive AI governance policies in place. The companies disclosing risk to the SEC are the same ones who will be asked, in an enforcement action or a discovery request, to produce the agent's behavioral trail. A guardrail product cannot produce that artifact.

The two-purchase framework

The diagnostic for an enterprise AI program is two questions, not one.

First: what is the agent permitted to do? This is the guardrails purchase. Cedar policies, Vertex tool governance, and Foundry intercept filters all do this and do it well. This question has good answers shipping from every hyperscaler.

Second: what did the agent actually do, across the whole session, and can the program prove it to an auditor, an incident responder, or a regulator six months later? This is the observability purchase. It is currently a separate vendor decision, a separate budget line, a separate set of integrations, and in most enterprise stacks shipping in 2026, a deferred one.

Programs that conflate the two end up with a policy document where they needed an audit trail. The [Microsoft Cloud Adoption Framework already names these as distinct requirements](https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ai-agents/governance-security-across-organization). Treating them as one purchase is a cost-control decision dressed up as an architecture decision.

The per-call versus cross-call distinction is clearest in AWS AgentCore Policy. Cedar's strength, a decision scoped to a single request, is exactly what makes it the wrong tool to ask about session-level behavior. The same shape repeats across the Vertex and Foundry equivalents, which is why the third-party observability category exists at all.

What should enterprise AI programs do this quarter?

1. Separate the budget line. Treat behavioral observability as its own line item, owned by a named person, with a tool selection committed this quarter, not deferred to a Phase 2 that never opens.
2. Audit every agent in production. Map each agent to two artifacts: the policy that permits its actions, and the audit trail that reconstructs its sessions. If the second artifact does not exist, the agent is a pilot wearing a production badge.
3. Session trace before policy. Require a session-trace plan before the policy plan on every new agent project. The policy is the easy part: every hyperscaler ships a default answer. The trace is what regulators will ask for first.

Three moves, ordered by how badly they get punted in most programs.

First, treat behavioral observability as a separate line item in the AI program budget, owned by a named person, with a specific tool selection in the next quarter. Not deferred to a hypothetical Phase 2 that never opens.

Second, map every agent currently in production to two artifacts: the policy that permits its actions, and the audit trail that reconstructs its sessions. If the second artifact does not exist, the agent is not production-grade. It is a pilot wearing a production badge, exactly the failure mode programs spend a year trying to escape and a quarter accidentally re-creating once agents come into scope.

Third, require that any new agent project include a session-trace plan before the policy plan. The policy is the easy part; every hyperscaler ships a default answer. The session trace is the hard part and the part regulators are about to ask for. Inverting the sequence avoids the rework of bolting traces onto agents already approved on policy alone.
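The second move, mapping each production agent to its two artifacts, can be sketched as a simple inventory check. This is one possible shape, not a prescribed tool: the `AgentRecord` fields, the agent names, and the reference strings are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRecord:
    name: str
    policy_ref: Optional[str]       # where the permitting policy lives, if any
    audit_trail_ref: Optional[str]  # where session traces are stored, if any


def missing_artifacts(agent: AgentRecord) -> list:
    """List which of the two required artifacts this agent lacks."""
    missing = []
    if not agent.policy_ref:
        missing.append("policy")
    if not agent.audit_trail_ref:
        missing.append("audit_trail")
    return missing


# Hypothetical inventory of agents currently running in production.
inventory = [
    AgentRecord("support-agent", "policies/support", "traces/support"),
    AgentRecord("refund-agent", "policies/refund", None),
]

# Agents with any gap, mapped to what they are missing.
gaps = {a.name: missing_artifacts(a) for a in inventory if missing_artifacts(a)}
```

In this sample inventory, `refund-agent` has a policy but no audit trail, which is the "pilot wearing a production badge" case described above.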

The 82% executive confidence number is the dangerous one. Enterprise AI programs are reporting up the chain that the agent stack is governed. The stack is governed at the call boundary. It is not yet observable at the session boundary. Treating "we have AgentCore Policy / Vertex tool governance / Foundry filters" as the security posture mistakes inputs for outputs. The output is the ability to answer "what did the agent do" with evidence, under deadline, under regulatory pressure, in the middle of an incident.

The cost of lacking that ability is the one to put on the board's risk register this quarter, before the first agent incident makes it a line item someone else writes for you.

Frequently asked questions

Do AWS AgentCore, Azure AI Foundry, and Google Vertex Agent Builder provide behavioral observability for agents?

No. Each enforces per-call policy decisions, but none natively reconstructs the full session-level behavior of an agent across multiple tool invocations. Microsoft's own Cloud Adoption Framework explicitly lists observability via Azure Monitor and Application Insights as a separate, additive layer from guardrails, and Vertex's session traces surface in a developer-facing tab rather than as a compliance-grade audit trail.

What is the difference between agent guardrails and agent observability?

Guardrails are per-call, design-time enforcement controls that decide whether the agent is permitted to take a specific action. Observability is runtime, cross-session behavioral tracking that reconstructs what the agent actually did across many calls, tools, and memory updates. Neither substitutes for the other, and regulatory frameworks including the EU AI Act, HIPAA, and the U.S. Treasury Financial Services AI Risk Management Framework increasingly require evidence of both.

Why is an AWS AgentCore Cedar policy log not enough for an AI compliance audit?

Cedar evaluates each action against the current principal, action, resource, and context of a single request, with no session history. The policy log records individual permit/deny decisions per call, while an auditor needs the full session trace including which tools the agent chained, what memory it updated, and why it produced a specific output. That reconstruction has to come from a separate observability layer, not the policy log itself.

Koundinya Lanka

Founder of The Production Line, writing weekly intelligence on enterprise AI adoption, agentic systems, and the future of work.
