AgentCore Policy Watches Each Call. Agents Fail Across Calls.
AWS AgentCore Policy enforces per-call tool access with Cedar. But behavioral drift across tool calls — where enterprise agents actually fail — is unaddressed.
Koundinya Lanka
AWS announced AgentCore Policy at GA in March 2026 with a specific architectural claim: Cedar policies, evaluated at the gateway, deciding which tools an agent identity is allowed to call before the call fires. It is a clean piece of infrastructure. It is also, by design, blind to the problem that breaks production agentic deployments.
Per the [AgentCore Policy documentation](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/policy.html), the layer operates at the individual tool-invocation boundary. Each call is evaluated in isolation. There is no cross-call state, no cumulative pattern detection, no inter-turn behavioral check. Each invocation is a fresh permission question.
AWS's security model for agents is the IAM model extended to tool calls: per-call, identity-scoped. Right for individual permission checks. Wrong for where enterprise agentic systems actually break.
What AgentCore Policy actually enforces
The AWS [policy controls announcement](https://aws.amazon.com/blogs/aws/amazon-bedrock-agentcore-adds-quality-evaluations-and-policy-controls-for-deploying-trusted-ai-agents/) is precise about what the layer does: Cedar policies at the AgentCore Gateway boundary, evaluated per tool invocation, with an optional 'emit logs only' enforcement mode for staged rollouts. It is competent infrastructure for a specific class of problem — 'did this agent identity have permission to call this tool at this moment, given these arguments.'
It stacks alongside two other AWS-native controls. [Bedrock Guardrails](https://interworks.com/blog/2026/03/06/securing-amazon-bedrock-what-enterprises-need-to-get-right/) handle content filtering: PII redaction, harmful content blocks, prompt injection detection, topic denial. Guardrails are filters on individual inputs and outputs; they do not enforce behavioral constraints across sequential agent turns or detect inter-turn reasoning drift. IAM policies underneath control which resources an identity can touch, but as [InterWorks' Bedrock security review](https://interworks.com/blog/2026/03/06/securing-amazon-bedrock-what-enterprises-need-to-get-right/) puts it, IAM cannot detect semantic manipulation or conversational drift.
Stack all three and AWS has a credible per-call security stack. Each gate is evaluated independently, at the moment of a single decision. None of them watches the agent over time.
Where production agentic systems actually fail
73 interactions to drift: median number of interactions before behavioral drift becomes detectable in multi-agent systems.
42% task success drop: reduction in task success rate caused by cumulative drift across agent interactions.
~20% end-to-end success rate: what 85% per-step accuracy produces over 10 sequential steps. Compounding failure, not compounding success.
20–40% eval coverage gap: difference between output-level and trajectory-level eval pass rates, which is what you miss by only checking final answers.
The empirical picture of failure does not match the per-call model. The most consequential enterprise agentic failures occur at the tool-call sequence level, not the individual model output level. [Zenity's agentic deployment guidance](https://zenity.io/academy/agentic-ai-best-practices), [Domino Data Lab's risks analysis](https://domino.ai/blog/agentic-ai-risks-and-challenges-enterprises-must-tackle), and [Trantor's failure-modes breakdown](https://www.trantorinc.com/blog/ai-agent-failure-modes-what-goes-wrong-design-resilience) all converge on the same point: individually valid tool calls can produce harmful outcomes through cumulative drift across a multi-step trajectory.
Each call passes its check. The aggregate does not.
A preprint study on multi-agent LLM behavioral degradation ([arXiv 2601.04170](https://arxiv.org/html/2601.04170)) reports that drift becomes detectable after a median of 73 interactions, with a 42% reduction in task success rate, and decline rates nearly tripling between interaction stages 0–100 and 300–400. The methodology is simulation, so the absolute numbers should be treated as directional rather than load-bearing. The acceleration pattern is what matters.
The compounding-error math is less forgiving than the per-step accuracy number suggests. [Trantor's analysis](https://www.trantorinc.com/blog/ai-agent-failure-modes-what-goes-wrong-design-resilience) observes that 85% per-step accuracy yields roughly 20% end-to-end success over 10 steps, and that agents evaluated on final-output quality alone pass 20–40% more test cases than trajectory-level evaluation reveals. (One practitioner estimate from a single secondary source; treat as directional, not industry-wide.) That gap between output evaluation and trajectory evaluation mirrors the gap between AgentCore Policy's per-call enforcement and the actual surface area of agent failure. The metric watches outputs. The failures live in the trajectory.
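The compounding arithmetic is easy to check. The sketch below assumes every step succeeds independently with the same probability, which is a simplification (real agent steps are often correlated); the function name is illustrative and not from any cited source.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# How quickly 85% per-step accuracy erodes as the trajectory gets longer.
for steps in (1, 5, 10, 20):
    rate = end_to_end_success(0.85, steps)
    print(f"{steps:>2} steps at 85% per step: {rate:.1%} end to end")
```

At 10 steps this gives 0.85^10 ≈ 19.7%, which matches the "roughly 20%" figure above.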
Silent partial failure
This class of failure does not announce itself. [Reporting on context decay and orchestration drift](https://www.dataworldbank.net/2026/04/26/context-decay-orchestration-drift-and-the-rise-of-silent-failures-in-ai-systems/) and a [CIO piece on agentic drift](https://www.cio.com/article/4134051/agentic-ai-systems-dont-fail-suddenly-they-drift-over-time.html) both describe 'silent partial failure': a component underperforms without crossing alert thresholds, the drift accumulates for weeks, and surfaces as user mistrust before it surfaces as incident tickets.
The infrastructure problem underneath: standard monitoring stacks (uptime, latency, error rate) [cannot distinguish operationally healthy from behaviorally reliable](https://www.dataworldbank.net/2026/04/26/context-decay-orchestration-drift-and-the-rise-of-silent-failures-in-ai-systems/). They do not track retrieval freshness or semantic drift under load. A Bedrock agent emitting clean 200s with median latency under a second can be in the middle of a slow-motion drift event and no CloudWatch dashboard will say so.
[Atlan's context-drift analysis](https://atlan.com/know/context-drift-ai-agents/) names the specific shape: an agent reasoning correctly over stale or misaligned context. No error codes, no anomaly alerts. The output is technically valid and semantically wrong. Data quality monitors miss it because the data is fine; the relationship between the data and the reasoning is the failure.
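One way to make "technically valid and semantically wrong" measurable is to compare embeddings of recent agent trajectories against a known-good baseline. The following is a minimal sketch, not a production monitor: the vectors are toy 3-dimensional stand-ins for real embeddings (which would come from whatever embedding model a team already uses), and `DRIFT_THRESHOLD` is a hypothetical value that would need tuning against real data.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> list[float]:
    """Element-wise mean of a non-empty set of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction, larger means further apart."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero vector")
    return 1.0 - dot / norm


def drift_score(baseline: Sequence[Vector], recent: Sequence[Vector]) -> float:
    """Distance between the baseline centroid and the centroid of recent samples."""
    return cosine_distance(centroid(baseline), centroid(recent))


# Toy data (assumption): real inputs would be embeddings of sampled trajectories.
baseline = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
healthy = [[0.95, 0.05, 0.0]]
drifted = [[0.2, 0.9, 0.1]]

DRIFT_THRESHOLD = 0.2  # hypothetical alerting threshold

for label, sample in (("healthy", healthy), ("drifted", drifted)):
    score = drift_score(baseline, sample)
    print(f"{label}: score={score:.3f} alert={score > DRIFT_THRESHOLD}")
```

Comparing centroids is the simplest possible choice and will miss drift that changes the spread of behavior without moving its mean; it is shown here only to illustrate the shape of the check.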
Agentic Ops is the operational discipline underneath autonomous agents. AWS-native observability wasn't built for it.
The pre-deployment bias
Most enterprise agent programs are over-indexed on pre-deployment review. Pen testing and policy authoring happen before the agent ships. Runtime behavioral oversight gets a budget line and an empty dashboard. [Trantor's reporting](https://www.trantorinc.com/blog/ai-agent-failure-modes-what-goes-wrong-design-resilience) notes that 84% of CIOs lack a formal process for tracking AI accuracy post-deployment, which is a remarkable number to put next to the volume of agent pilots currently in production review.
AgentCore Policy fits the pre-deployment slot. The security team authors it before launch. It is auditable and it checks the box for 'we have a policy layer.' Then the agent ships, AgentCore Policy keeps doing its per-call job, and nobody is watching the trajectory.
The baseline visibility problem makes this worse. [InterWorks](https://interworks.com/blog/2026/03/06/securing-amazon-bedrock-what-enterprises-need-to-get-right/) flags that Bedrock model invocation logging is disabled by default, and that throttled requests are excluded from error counts. Both are configuration footguns. An enterprise running default Bedrock telemetry has no per-invocation log in which to trace drift retrospectively, even if it wanted to, and the most common backpressure signal (throttling) is excluded from the error-rate metric a CloudWatch alarm would fire on.
A team relying on AWS-native primitives to spot drift starts from behind.
What's missing and where it lives
The honest assessment: AgentCore Policy is a well-built piece of infrastructure for a problem AWS already has established patterns for solving. It is not the layer that catches the agent that drifts quietly across 73 interactions. That layer (cross-call behavioral monitoring and trajectory-level evaluation) does not currently have an AWS-native equivalent.
Whether AWS intends to ship one is an open question. AgentCore Policy's design (stateless per-call evaluation, Cedar's per-decision model) is consistent with a permanent architectural choice rather than a v1 limitation. Cedar can express complex per-call rules, but accumulating session-level state across tool calls (total refunds processed, semantic similarity to prior turns) is not what Cedar was built for, and AWS has not published guidance on whether teams should attempt it.
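To make the gap concrete, here is a sketch of the kind of state a stateless per-call check cannot hold. Everything in it is hypothetical: the per-call cap, the session limit, and the `RefundAccumulator` class are illustrations, not AgentCore or Cedar APIs, and a real deployment would need a durable, concurrency-safe store rather than an in-memory dict.

```python
from collections import defaultdict

PER_CALL_REFUND_CAP = 100.0  # hypothetical rule a per-call policy could express


def per_call_allows(amount: float) -> bool:
    """Stateless check: looks only at this one invocation's arguments."""
    return amount <= PER_CALL_REFUND_CAP


class RefundAccumulator:
    """Tracks cumulative refunds per session, outside the per-call gate."""

    def __init__(self, session_limit: float) -> None:
        self.session_limit = session_limit
        self._totals: dict[str, float] = defaultdict(float)

    def try_refund(self, session_id: str, amount: float) -> bool:
        """Record the refund and return True only if the session stays within its limit."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self._totals[session_id] + amount > self.session_limit:
            return False
        self._totals[session_id] += amount
        return True


accumulator = RefundAccumulator(session_limit=250.0)
decisions = []
for amount in (90.0, 90.0, 90.0):
    decisions.append((per_call_allows(amount), accumulator.try_refund("session-1", amount)))

print(decisions)
```

Each of the three refunds passes the per-call cap, but the third pushes the session total to 270 and is rejected by the accumulator. That is the individually-valid, cumulatively-invalid pattern in miniature.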
A few patterns teams are already shipping outside the AWS-native stack:
**Trajectory-level evaluation in CI.** Instead of evaluating the agent on final-output quality, instrument it to log full tool-call trajectories and run trajectory-level eval suites against those. This is what closes the output-versus-trajectory gap [Trantor identifies](https://www.trantorinc.com/blog/ai-agent-failure-modes-what-goes-wrong-design-resilience).
**Behavioral baselines with semantic checks.** Sample trajectories, embed them, watch for semantic distribution shift over time. The 'technically valid but semantically wrong' failure mode is invisible to schema validation but legible to embedding-distance monitoring against a known-good baseline.
**Session-scoped accumulators outside the agent runtime.** State that has to span tool calls (cumulative refund authority, repeated retrieval of the same stale document) has to live somewhere external to the per-call gateway. This ends up in teams' own orchestration layers because there is no AWS primitive for it.
**Treating throttling as a first-class signal.** Throttled requests being excluded from CloudWatch error metrics is fixable by emitting a custom metric the moment a Bedrock invocation returns a throttling response. It is a small change that closes a baseline visibility gap.
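The throttling fix can be sketched as a thin wrapper around the invocation. The metric name, namespace, and function name below are hypothetical, and the sketch assumes throttling surfaces as an exception whose `response["Error"]["Code"]` is `ThrottlingException` (the shape botocore's `ClientError` uses); verify the error code against your own traffic before relying on it. The `cloudwatch` argument is anything exposing `put_metric_data`, such as a boto3 CloudWatch client.

```python
from typing import Any, Callable

THROTTLE_CODES = {"ThrottlingException"}  # assumed error code; confirm for your service


def invoke_with_throttle_metric(
    invoke: Callable[[], Any],
    cloudwatch: Any,
    namespace: str = "AgentOps/Bedrock",  # hypothetical namespace
) -> Any:
    """Run invoke(); if it fails with a throttling error, emit a custom metric, then re-raise."""
    try:
        return invoke()
    except Exception as exc:
        response = getattr(exc, "response", None)
        code = response.get("Error", {}).get("Code") if isinstance(response, dict) else None
        if code in THROTTLE_CODES:
            cloudwatch.put_metric_data(
                Namespace=namespace,
                MetricData=[
                    {"MetricName": "ThrottledInvocations", "Value": 1.0, "Unit": "Count"}
                ],
            )
        raise  # the caller still sees the original error


# Usage with boto3 (not executed here; requires AWS credentials):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   bedrock = boto3.client("bedrock-runtime")
#   invoke_with_throttle_metric(
#       lambda: bedrock.invoke_model(modelId=model_id, body=payload), cloudwatch
#   )
```

Re-raising keeps existing retry and error handling intact; the wrapper only adds the signal that the default error-rate metric leaves out.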
None of these are products. They are operational work that has to be done above the AWS-native control plane.
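As an illustration of the first pattern, trajectory-level evaluation, the sketch below contrasts an output-only check with a check over the full tool-call sequence. The tool names, the `Run` and `ToolCall` types, and the required-order rule are all hypothetical; a real suite would load logged trajectories rather than construct them inline.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ToolCall:
    tool: str
    ok: bool = True  # whether the call itself succeeded


@dataclass(frozen=True)
class Run:
    final_answer: str
    trajectory: tuple[ToolCall, ...]


def output_passes(run: Run, expected_answer: str) -> bool:
    """Output-level eval: only the final answer is checked."""
    return run.final_answer == expected_answer


def trajectory_passes(
    run: Run,
    expected_answer: str,
    required_order: Sequence[str],
    forbidden: Sequence[str] = (),
) -> bool:
    """Trajectory-level eval: the answer AND the path taken to reach it must be acceptable."""
    if not output_passes(run, expected_answer):
        return False
    if not all(call.ok for call in run.trajectory):
        return False
    tools = [call.tool for call in run.trajectory]
    if any(tool in forbidden for tool in tools):
        return False
    # Subsequence check: required tools must appear in this order (gaps allowed).
    remaining = iter(tools)
    return all(required in remaining for required in required_order)


REQUIRED = ("verify_identity", "issue_refund")  # hypothetical policy for this test case

careful = Run(
    "refund approved",
    (ToolCall("verify_identity"), ToolCall("lookup_order"), ToolCall("issue_refund")),
)
shortcut = Run(
    "refund approved",
    (ToolCall("lookup_order"), ToolCall("issue_refund")),
)

for name, run in (("careful", careful), ("shortcut", shortcut)):
    print(
        name,
        "output:", output_passes(run, "refund approved"),
        "trajectory:", trajectory_passes(run, "refund approved", REQUIRED),
    )
```

Both runs produce the same final answer, so an output-only suite passes both. Only the trajectory check notices that the second run skipped identity verification.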
The framing for AI leaders
The trap, for an AI program leader evaluating Bedrock's agent stack, is to read AgentCore Policy + Guardrails + IAM as a complete security posture. It is a complete posture for a per-call threat model. The actual threat model for an agent that ships, runs for weeks, and drifts is a cross-call threat model. The AWS-native stack does not yet meet it.
This is not an argument against AgentCore Policy. The per-call gate is correct work. It is an argument against treating it as the whole answer. The layer that watches agent behavior over time, not per call, is the build-vs-buy decision most enterprises shipping Bedrock agents will face in the next twelve months. Currently it is a build, because the buy does not exist inside AWS.
The diagnostic to take into next week's architecture review:
1. Can your monitoring stack tell the difference between an agent that is operationally healthy and an agent that is behaviorally reliable? If those two questions share a dashboard, the answer is no.
2. Does your eval suite test final outputs or full trajectories? If it is the former, the accuracy number is overstating reliability by a margin one practitioner estimate puts in the 20–40% range.
3. Is anyone watching session-level state (cumulative actions, semantic drift), or only per-call decisions? If only per-call, the deployment is protected against the threat model AWS solved for and exposed to the one that breaks production.
AgentCore Policy stops the wrong tool from firing once. It does not stop the agent that drifts across 73 interactions toward an outcome no individual policy would have allowed. That second class of failure is where production enterprise agentic deployments actually break, and as of GA, no AWS-native control addresses it.
Koundinya Lanka
Founder of The Production Line, writing weekly intelligence on enterprise AI adoption, agentic systems, and the future of work.