I Ran Six Agent Sandbox Runtimes Back to Back. Here Is What Actually Worked.
By Rohit Ghumare • May 3, 2026 • 16 min read
For the last two weeks I have been running the same agent workload across six sandbox runtimes. The agent is a Claude Code session that gets a fresh checkout of a small Rust repo, runs the test suite, fixes a planted bug, and commits the patch. Boring. Reproducible. Easy to time.
I picked six runtimes that came up repeatedly when people asked me “where do I let my coding agent run unattended.”
- E2B
- Daytona
- Modal Sandboxes
- Morph (formerly Morph Labs / Morph Cloud)
- Vercel Sandbox
- Docker Sandboxes (the new sbx CLI)
This post is what I learned. It is not a benchmark in the academic sense — I did not run a thousand iterations or measure cold-start variance to four decimal places. It is the writeup an engineer wants when they are picking one and have a Friday afternoon to decide.
Why agents need sandboxes at all
If you have not used a coding agent in “dangerously skip permissions” mode yet, the elevator pitch is this: turning off the per-tool approval prompt is a step-change improvement in agent productivity, and it is also the step that lets the agent rm -rf your home directory if a single prompt-injection attack lands.
The fix is not to put the prompts back. The fix is to put the agent somewhere it cannot hurt anything important. The sandbox is a small isolated machine — usually a microVM or a gVisor-shielded container — that has access to your project tree and nothing else. The agent is free inside it. Outside it, your machine is untouched.
That is the entire pitch. The runtimes differ on three axes that matter:
- Isolation model. Hardware (microVM with its own kernel) versus software (gVisor or container). Hardware is stronger; software is faster and lighter.
- Persistence. Ephemeral (the sandbox dies after the session) versus stateful (you can stop and resume, files survive, packages stay installed).
- Locality. Local on your laptop versus remote in someone else's cloud.
Pick a position on each axis and you have basically picked your runtime. Let me walk through the six.
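If you like the axes as code, here is the same idea as a lookup. This is my own shorthand for the decision tree in this post, not anything from the vendors, and the mapping is deliberately coarse:

```typescript
// Hypothetical helper: map the three axes onto the runtimes discussed below.
// The mapping mirrors this post's conclusions, not any official guidance.
type Axes = {
  isolation: "hardware" | "software";
  persistence: "ephemeral" | "stateful";
  locality: "local" | "remote";
};

function candidates(a: Axes): string[] {
  if (a.locality === "local") return ["Docker sbx"]; // the only local-first option
  if (a.persistence === "stateful") {
    // stateful + remote: microVM snapshots (Morph) or containers (Daytona)
    return a.isolation === "hardware" ? ["Morph"] : ["Daytona"];
  }
  // ephemeral + remote: the per-request sandbox services
  return ["E2B", "Vercel Sandbox", "Modal"];
}
```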
E2B
E2B is the obvious starting point. It has the largest community, the most language SDKs, and the most existing “put your code interpreter in a microVM” recipes online. Each sandbox is a Firecracker microVM. Cold start in my measurements averaged 180 ms — close to the 150-200 ms range E2B advertises.
What I liked: the SDK is the cleanest of any in the list. Three lines and you have a sandboxed Python or TypeScript runtime with a file system and a command runner.
```typescript
import { Sandbox } from "e2b"

const sbx = await Sandbox.create()
await sbx.files.write("/tmp/script.py", "print('hi')")
const out = await sbx.commands.run("python /tmp/script.py")
console.log(out.stdout)
```

What bit me: the 24-hour session ceiling. For a long-lived coding agent that stays attached to a project for a week, that is the wrong shape. You can chain sandboxes, but the state on disk does not survive a session boundary unless you carry it yourself.
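A minimal sketch of what "carry it yourself" means: read the files you care about out of the dying sandbox and write them into the next one. In a real implementation the reads and writes would go through the SDK's file methods shown above; here a plain in-memory stand-in keeps the flow visible and runnable:

```typescript
// Hypothetical sketch of carrying state across a session boundary.
// A plain Map stands in for a real sandbox's filesystem.
type FakeSandbox = { files: Map<string, string> };

function exportState(sbx: FakeSandbox, paths: string[]): Record<string, string> {
  const state: Record<string, string> = {};
  for (const p of paths) {
    const content = sbx.files.get(p);
    if (content !== undefined) state[p] = content; // skip files that vanished
  }
  return state;
}

function importState(sbx: FakeSandbox, state: Record<string, string>): void {
  for (const [path, content] of Object.entries(state)) sbx.files.set(path, content);
}

// Session 1 ends: snapshot the paths we care about.
const oldSession: FakeSandbox = { files: new Map([["/workspace/notes.md", "agent progress"]]) };
const carried = exportState(oldSession, ["/workspace/notes.md"]);

// Session 2 begins: replay the snapshot into a fresh sandbox.
const fresh: FakeSandbox = { files: new Map() };
importState(fresh, carried);
```

The annoying part in practice is not the copy, it is deciding which paths matter and re-running any installs that wrote outside them.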
Use it for: Stateless code execution, code interpreters, untrusted-input services. Anything where the workload is “run this code once, give me the output.”
Daytona
Daytona is the persistence-first option. They pivoted from dev-environments-as-a-service to AI agent infrastructure in early 2025, and the difference shows. Sandboxes are long-lived workspaces. If your agent installs a Python package or writes a config file, that change is still there next time.
Cold-start was the fastest in my run — under 100 ms most of the time. The trade is that the isolation is container-based, not microVM. They share a kernel with the host. For most coding agent workloads I do not care about the kernel boundary, but if you are running customer-supplied code from random users, this is the part to think about.
```typescript
import { Daytona } from "@daytonaio/sdk"

const dt = new Daytona()
const ws = await dt.workspace.create({ template: "python" })
await ws.fs.write("/workspace/script.py", "print('hi')")
const out = await ws.process.exec("python /workspace/script.py")
```

The killer feature for me was Computer Use support. Daytona spins up a Linux desktop you can drive over VNC. If your agent ever needs to use a browser or interact with a GUI app, this is the only one of the six that gives you that out of the box. I ended up using it for a different project entirely once I had the SDK on hand.
Use it for: Long-lived coding agent sessions, anything that needs a persistent workspace, anything that needs Computer Use.
Modal Sandboxes
Modal is not really a sandbox-first product. Modal is an everything platform — inference, training, batch jobs, notebooks — and Sandboxes are one of the things it can do. The isolation is gVisor, which is software-level (a user-space kernel that intercepts syscalls). Weaker than Firecracker on paper, but Modal has been running it at very high concurrency for years.
What I liked: GPU sandboxes. None of the other five runtimes give you A100 or H100 in a sandbox. If your agent needs to run an inference workload as part of a coding task, this is the only path.
```python
import modal

app = modal.App("agent-sandbox")

def run_in_sandbox(cmd: str) -> str:
    # gpu= goes on the sandbox itself, so the GPU sits inside the isolation boundary
    sb = modal.Sandbox.create(
        app=app,
        image=modal.Image.from_registry("ubuntu:22.04"),
        gpu="A100",
    )
    p = sb.exec("bash", "-c", cmd)
    p.wait()
    return p.stdout.read()
```

What bit me: Modal is Python-first by design. I could not find a clean Node-only path that did not feel grafted on. There is also no BYOC: if you have a corporate policy that says "customer data does not leave our VPC," Modal is out unless you change the policy.
Use it for: ML-adjacent agent work, anything needing GPU, large batch sandboxing.
Morph
Morph is the snapshot-and-rollback story. The pitch is that you can checkpoint a sandbox in 300 ms and roll back to that exact disk and memory state later. For an agent that is iterating on a flaky bug, this is genuinely magical — you put it in the “before” state, let it try, watch it fail, roll back, try a different prompt.
Each sandbox is a Firecracker microVM. The storage layer is copy-on-write on NVMe, which is how the snapshots get cheap. Morph charges by written blocks, not allocated blocks, so a sandbox that does little writing is genuinely free-tier-compatible.
What I liked: the snapshot API is the cleanest expression of the “agent as a process you can rewind” idea. I have not seen this work elsewhere with the same UX.
```typescript
const sb = await morph.create({ image: "ubuntu-22.04" })
const checkpoint = await sb.snapshot()
await sb.exec("rm -rf /etc") // do something risky
await sb.restore(checkpoint) // undo
```

What bit me: the docs are still light, the SDK shape moves between minor versions, and I had a few sessions where snapshot restore took noticeably longer than the advertised 300 ms (closer to a second). Early product, real ideas.
Use it for: Iteration-heavy agent loops where rollback is part of the algorithm.
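The shape of that loop is worth spelling out. The snapshot/restore method names mirror the snippet above; everything else here is an illustrative mock, not Morph's API:

```typescript
// Illustrative retry-with-rollback loop. snapshot()/restore() mirror the
// Morph-style calls above; each attempt stands in for "let the agent try a fix".
type Checkpoint = { state: string };

class MockSandbox {
  state = "before-fix";
  snapshot(): Checkpoint { return { state: this.state }; }
  restore(c: Checkpoint): void { this.state = c.state; }
}

function fixWithRollback(
  sb: MockSandbox,
  attempts: Array<(sb: MockSandbox) => boolean>, // mutates sb, returns pass/fail
): boolean {
  const checkpoint = sb.snapshot();
  for (const attempt of attempts) {
    if (attempt(sb)) return true; // tests pass: keep this state
    sb.restore(checkpoint);       // tests fail: rewind, try a different prompt
  }
  return false;
}

const sandbox = new MockSandbox();
const ok = fixWithRollback(sandbox, [
  (s) => { s.state = "broken-patch"; return false; },
  (s) => { s.state = "working-patch"; return true; },
]);
```

The point of cheap snapshots is that the restore in the failure branch costs almost nothing, so rewinding becomes part of the algorithm instead of a recovery procedure.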
Vercel Sandbox
Vercel Sandbox is a Firecracker microVM service tightly bound to the Vercel platform. The session ceiling is 45 minutes — short, and it is short on purpose. The pitch is “you have an AI feature in your Vercel app and you need to run a bit of generated code somewhere.”
What I liked: per-active-CPU billing. Most sandbox time on a coding agent is the agent thinking, not the CPU computing. Vercel only charges you for the latter, which makes a 5-cent invocation actually cost a tenth of a cent. The free tier was generous enough that I never paid during my test.
```typescript
import { Sandbox } from "@vercel/sandbox"

const sb = await Sandbox.create({ runtime: "node22" })
await sb.writeFiles([{ path: "index.js", content: "console.log('hi')" }])
const out = await sb.runCommand({ cmd: "node", args: ["index.js"] })
```

What bit me: the 45-minute ceiling is hard. If your agent task is going to take an hour, you need to design for handoff. There is no GPU, no BYOC, and outside the Vercel platform the value proposition narrows a lot.
Use it for: Apps already on Vercel, short-lived per-request sandboxing, untrusted code execution from end-user input.
Docker Sandboxes (sbx)
sbx is the newcomer and the one that surprised me. Docker shipped it in March 2026 as a standalone CLI, no Docker Desktop required. Each sandbox is a microVM running on the local hypervisor — Apple Hypervisor on macOS, Hyper-V on Windows. Linux support is on the roadmap.
The killer feature is locality. The sandbox runs on your laptop. There is no remote API, no per-second billing, no rate limit. If you have ever burned an afternoon on cold-start latency or sandbox quota, you know how good that feels.
```bash
# install
brew install docker/tap/sbx

# log in
sbx login

# run claude code in a sandbox bound to the current directory
cd ~/code/my-rust-repo
sbx run claude
```

That last command is the whole experience. Claude Code starts inside a microVM, sees only the current directory, runs in --dangerously-skip-permissions mode by default, and cannot reach the rest of your filesystem. Network is policy-gated; you choose between open, balanced, and locked down on first login.
What I liked: it integrates with the agents people actually use. Out of the box: Claude Code, Codex, GitHub Copilot CLI, Gemini CLI, Kiro, OpenCode. The branch mode (--branch) creates a git worktree under .sbx/ so the agent commits to its own branch, not your working tree.
What bit me: still experimental. The balanced network policy is missing common documentation domains, so a real coding session ends up needing “open” mode unless you maintain your own allow-list. macOS performance was great; Windows under Hyper-V was visibly slower for IO-heavy operations like a fresh cargo build.
Use it for: Local-first coding agent workflows. The right default unless you specifically need remote.
Side-by-side
Same workload, same agent, same Rust repo. Numbers are medians from ten runs each.
| Runtime | Isolation | Cold start | Session cap | Persistence | Local-first |
|---|---|---|---|---|---|
| E2B | Firecracker | ~180 ms | 24 h | No | No |
| Daytona | Container | ~95 ms | Unlimited | Yes | No |
| Modal | gVisor | ~600 ms | Unlimited | Snapshots | No |
| Morph | Firecracker | ~300 ms | Unlimited | Snapshots | No |
| Vercel | Firecracker | ~250 ms | 45 min | No | No |
| Docker sbx | microVM (local) | ~1.2 s first / instant warm | Until you stop it | Yes | Yes |
The cold-start number for sbx is misleading on first read. It is the slowest because the microVM provisions the first time you run a sandbox. After that, subsequent commands inside the sandbox run at native speed because there is no network round trip. The remote runtimes have a network hop on every operation that the local one does not.
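Back-of-envelope arithmetic makes the point. The round-trip time and operation count below are illustrative assumptions, not measurements from my runs:

```typescript
// Illustrative: cumulative latency for a session of many small operations.
// rttMs and opCount are made-up assumptions, not measured values.
function sessionOverheadMs(coldStartMs: number, rttMs: number, opCount: number): number {
  return coldStartMs + rttMs * opCount;
}

const remote = sessionOverheadMs(180, 40, 500); // fast cold start, a network hop per op
const local = sessionOverheadMs(1200, 0, 500);  // slow first boot, no hop afterwards

// remote = 180 + 40 * 500 = 20180 ms; local = 1200 ms
```

With enough small operations, the per-operation hop dominates the one-time boot cost, which is why the first-run number in the table undersells the local option.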
How I actually pick
After two weeks of this, my decision tree is small.
If the agent runs on my laptop, sbx wins. Local microVM, no quota, integrates with the agents I already use, free. The first-time cold start is irrelevant once the image is cached.
If the agent runs in our infra and we need persistent sessions, Daytona wins. Long-lived workspaces, fast cold start, Computer Use if I ever need a browser inside a sandbox. The container isolation is acceptable for our trust model.
If the agent runs ephemerally as part of a request handler, Vercel Sandbox wins. Per-active-CPU billing, microVM, generous free tier, integrated with the platform we already deploy on.
If the agent is iterating with rollback as part of the algorithm, Morph wins. Nothing else has snapshot-and-restore as a first-class operation.
If the agent needs GPU, Modal wins. By default, by elimination. None of the others do GPU.
E2B is fine, but I have not found a workload where it is the right answer and none of the others would serve equally well. It does have the most polished SDK, and that matters if SDK ergonomics dominate your decision.
One thing nobody talks about
Every one of these runtimes does the “agent runs in a box” story well. None of them solves the second boundary that matters: the boundary between what the agent wrote and what reaches your main branch.
The microVM keeps the agent from rm -rf-ing your machine. It does not keep the agent from writing a subtly broken patch and pushing it to main. The same property that makes a YOLO agent fast — autonomous action, no human in the loop — means the output is unsupervised by definition. If that output reaches a shared branch without a human eye on it, the sandbox boundary did not save you, because the harm escaped through the legitimate output channel.
Every runtime I tested has some answer to this. sbx has --branch. Daytona has explicit workspace-to-branch mapping. The remote runtimes generally hand you a tarball and let you handle the git layer yourself. You should treat that handoff as the security boundary. The microVM is a precondition; the branch boundary is the actual control.
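If you want that boundary enforced rather than merely remembered, the check is small. This is an illustrative guard in the spirit of sbx's branch-per-agent idea, not any runtime's API; the branch names are assumptions:

```typescript
// Illustrative guard: agent output lands only on a dedicated agent branch,
// never directly on a protected branch. Names here are assumptions.
const PROTECTED = new Set(["main", "master", "release"]);

function agentBranchFor(sessionId: string): string {
  return `agent/${sessionId}`; // one throwaway branch per agent session
}

function assertPushAllowed(branch: string): void {
  if (PROTECTED.has(branch)) {
    throw new Error(`refusing to push agent output directly to '${branch}'`);
  }
}
```

The merge into main then goes through the normal review path, with a human as the gate.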
Where I landed
I run two sandboxes for two different jobs.
For local agent work, I use sbx. The latency is best, the integration with Claude Code is one command, and the cost is zero. The first day I switched, I noticed I was running the agent more aggressively because I knew the worst case was a wrecked sandbox, not a wrecked machine.
For our deployed agent that processes real customer requests, I use Vercel Sandbox. Per-active-CPU billing was the deciding factor — sandbox time on a coding agent is mostly waiting on the model, and I do not want to pay for that.
I will probably revisit this in three months. The space is moving fast — Zeroboot is claiming sub-millisecond cold starts via copy-on-write Firecracker forks, and Docker is adding Linux support to sbx later this year. If the comparisons change materially, I will write the followup.