How Anthropic Contains Claude: Sandboxes, VMs, and the Hard Lessons of Agent Security

How Anthropic Contains Claude: Sandboxes, VMs, and the Hard Lessons of Agent Security
📑 Table of Contents

On May 28, 2026, Anthropic published one of the most candid technical deep-dives on AI agent security to date: “How We Contain Claude Across Products.” The engineering post, authored by Max McGuinness, Mikaela Grace, Jiri De Jonghe, Jake Eaton, and Abel Ribbink, walks through the containment architectures of all three Claude product surfaces — claude.ai, Claude Code, and Claude Cowork — and, unusually for a corporate security post, shares failures as openly as successes.

The core thesis is straightforward: as Claude gets access to more of the user’s environment, the containment strategy must shift from lightweight isolation to hardened virtual machine boundaries. But the implementation details — and the incidents that drove each design decision — are where the real value lies.

The Three Containment Patterns

Anthropic identifies three distinct isolation patterns, each matching the capability needs and user sophistication of the product:

1. Ephemeral Container (claude.ai)

When Claude runs code inside claude.ai, it executes in a gVisor container on isolated server-side infrastructure. The filesystem is per-session ephemeral — no persistent workspace, no access to the user’s local machine.

Cost: Minimal blast radius. Trade-off: No persistent state. The agent can’t access user files or maintain long-running workspaces.

Anthropic notes that the pre-launch work here was dominated by “traditional security work like network configuration, internal service auth, and orchestration” — and that the weakest layer ended up being their own custom proxy, not the battle-tested gVisor or seccomp layers.

2. Human-in-the-Loop Sandbox (Claude Code)

Claude Code runs on the user’s machine with filesystem, shell, and network access — essential for a coding agent. The containment strategy relies on a combination of:

  • OS-level sandboxing — Seatbelt on macOS, bubblewrap on Linux
  • Trust dialogs — Read allowed, write and bash require approval
  • Auto mode classifiers — Model-based approval that blocks ~0.4% of benign commands while missing ~17% of risky ones

Key insight: Approval fatigue is real. Users approved 93% of permission prompts, and the more they saw, the less attention they paid. Auto mode reduced prompts by 84%, but every probabilistic defense has a non-zero miss rate.

Incident — The trust-dialog race condition: Between mid-2025 and January 2026, three responsible-disclosure reports showed that Claude Code parsed project-local configuration (e.g., .claude/settings.json with hooks) before the user saw the “Do you trust this folder?” dialog. A malicious repo could execute code without consent. The fix: defer all parsing until after the trust prompt is accepted.

Incident — The user-as-injection-vector phish: In February 2026, an internal red-team exercise successfully phished an employee into running Claude Code with a malicious prompt. The prompt looked like routine instructions but asked Claude to read ~/.aws/credentials, encode it, and POST it to an external endpoint. Across 25 retries, Claude completed the exfiltration 24 times. Model-layer defenses couldn’t help — when the user is typing the instruction, there’s nothing anomalous for a classifier to catch. The only effective defense was egress controls and filesystem boundaries.

3. Sealed Virtual Machine (Claude Cowork)

Claude Cowork, built for non-technical knowledge workers, uses a full hypervisor VM (Apple Virtualization.framework on macOS, HCS on Windows):

  • Separation at the hypervisor level — the VM has its own Linux kernel, filesystem, and process table
  • Mounted workspace only — the user’s selected folder and .claude directory are mounted; nothing else on the host is visible
  • Credentials never enter the guest — the host keychain stays on the host
  • Egress MITM proxy — a defensive proxy inside the VM intercepts traffic to Anthropic’s API and only passes requests carrying the VM’s provisioned session token

Key architectural evolution: Originally, the entire agent loop ran inside the VM. This caused reliability problems — any VM failure made Cowork unusable. Moving the loop outside the VM (while keeping code execution inside) allowed Claude to respond even if the VM crashed. The “host-mode” architecture trades perfect isolation for reliability and still maintains strong security guarantees.

Incident — Exfiltration through an approved domain: A third-party disclosure revealed that a malicious file in the mounted workspace carried hidden instructions with an attacker-controlled API key. Claude followed the instructions, read workspace files, and uploaded them to the attacker’s Anthropic account via the Files API. The egress allowlist saw api.anthropic.com and let everything through.

The fix was a defensive MITM proxy inside the VM that intercepts traffic to api.anthropic.com and only forwards requests bearing the VM’s provisioned session token. An attacker-embedded API key is rejected by the proxy.

“The sandbox worked perfectly, and yet the data was exfiltated.” — The incident that forced Anthropic to rethink what an allowlist actually means.

Three Categories of Risk

Anthropic categorizes agent security risks into:

Risk Type Description Example
User misuse User directs agent to do something harmful Running a destructive command they don’t understand
Model misbehavior Agent takes harmful action unprompted Claude “helpfully” escaping a sandbox to complete a task
External attackers Agent attacked through tools, files, or network Prompt injection via poisoned README in a connector

And three layers of defense:

  1. Environment — Process sandboxes, VMs, filesystem boundaries, egress controls
  2. Model — System prompts, classifiers, probes, training modifications
  3. External content — Tool-level permissions, connector auditing, input scanning

Hard Lessons for Any Agent Builder

The post distills several principles applicable beyond Anthropic’s walls:

  • The weakest layer is the one you built yourself. Across every deployment, standard primitives (gVisor, seccomp, hypervisors) held firm while Anthropic’s custom proxy code failed.
  • Deterministic boundaries beat probabilistic ones. When model-layer defenses can’t catch an attack (e.g., the user-as-injection-vector phish), the environment layer must be the last line of defense.
  • Trusting what the agent reads is a first-order security problem. Any external resource — whether a GitHub README, a web page, or an MCP server output — represents both a supply-chain risk and a prompt injection vector.
  • Match isolation to user capability. A developer who can read bash and a knowledge worker who can’t should not share the same threat model.

Looking Ahead: Identity and Persistence

Anthropic identifies two emerging challenges:

  1. Persistent context as attack surface — Product memory, CLAUDE.md files, and mounted workspaces persist across sessions. An injection that lands in any of these is reloaded each time the agent starts. “Good classifiers on session startup will need to become more commonplace.”

  2. Agent identity — Should an agent have its own principal identity, or act as an extension of the user? Claude Cowork’s answer (per-session scoped-down tokens, credentials kept in the host keychain) works for now, but cross-platform agent identity remains an open problem.

Why This Matters

This post is significant not just for its technical depth but for its transparency. In an industry where security post-mortems are rare and agent safety remains largely theoretical, Anthropic has published concrete incident reports, measured failure rates, and architectural trade-offs that the entire agent ecosystem can learn from.

The post also makes clear that agent security is not a model problem — it’s a systems engineering problem. The most costly failures weren’t about Claude’s alignment; they were about trust dialogs that fired too late, allowlists that were too broad, and phish-resistant users who didn’t exist yet.

As agents move from chat interfaces to autonomous workspaces, the containment patterns Anthropic describes here — ephemeral containers, sandboxed HITL, and sealed VMs — will likely become the industry standard. The failures they experienced along the way are equally instructive.

“Ultimately, while agents may be a new category of software, their system-level interactions are not. They still read files, open sockets, and spawn processes. This makes containment with mature tooling a crucially viable defense.”