Where Should AI‑Generated Code Run? A Practical Guide to Sandboxes and MicroVMs in Production
Your agent just generated 200 lines of code. Where does it run? Localhost breaks at scale, containers share kernels, VMs are slow. Learn why microVMs are becoming the default for production AI code execution—and how to choose the right sandbox.
The Execution Question Nobody Asks Until It's Too Late
Your agent just wrote 200 lines of Python. Claude generated a data pipeline. Your LLM workflow synthesized a deployment script. Where does it run?
Most teams default to one of three paths: run it on their local machine, spin up a container or just execute it directly in their app runtime. Each works—until it doesn't.
The moment AI-generated code hits production scale, three problems surface: isolation breaks, costs spiral and compliance gaps appear. By the time you're debugging why an agent crashed your cluster or explaining to security how untrusted code accessed internal APIs, the architectural decision is already made.
This guide covers where AI code should actually run, what sandboxing really means in production and why microVMs are becoming the default execution primitive for agent infrastructure.
The Localhost-to-Production Gap
Running Locally Works Until It Doesn't
When you're prototyping, running AI-generated code on your laptop is fine. You have full control, fast iteration and zero infrastructure overhead.
But local execution doesn't scale. You can't orchestrate hundreds of agents on one machine, you can't enforce resource limits consistently and you definitely can't replicate this setup for your team or customers.
The Container Trap
Containers feel like the obvious next step. They're familiar, portable and fast to spin up. But containers share the host kernel.
If AI-generated code exploits a kernel vulnerability or escapes via a misconfigured namespace, it can access host resources, other containers and your internal systems.
For reviewed, trusted applications, this is manageable. For runtime-generated, non-deterministic agent code, it's a structural weakness.
Direct Execution in App Runtime
Some teams execute AI code directly in their application's runtime environment—Node processes, Python workers, Lambda functions.
This approach has the fastest cold start (zero), but also zero isolation. A runaway loop consumes all memory. Malicious code reads environment variables. One bad execution crashes the entire service.
What Production AI Code Execution Actually Needs
AI-generated code behaves differently than human-written code. It's iterative, non-deterministic and often untrusted until validated. Production execution needs to account for these properties.
Hard Isolation Boundaries
Each execution should run in its own isolated environment with VM-level separation. If code misbehaves, it can't touch other workloads, your infrastructure or customer data.
Containers provide process-level isolation. MicroVMs provide hardware-level isolation. The difference matters when code is untrusted by default.
Sub-Second Startup Times
Agents operate in loops. Generate code, execute, evaluate and iterate. If infrastructure adds seconds of latency per iteration, it breaks the workflow.
Traditional VMs boot in 5-30 seconds. MicroVMs boot in under a second. For agent loops running dozens of iterations, this compounds fast.
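A back-of-the-envelope sketch shows how this compounds; the boot times below are illustrative assumptions, not benchmarks:

```typescript
// Total startup overhead across an agent loop is just iterations × cold start.
function totalStartupOverheadMs(iterations: number, coldStartMs: number): number {
  return iterations * coldStartMs;
}

// 30 iterations at a 10s traditional-VM boot vs a 200ms microVM boot:
console.log(totalStartupOverheadMs(30, 10_000)); // 300000 ms (five minutes of pure waiting)
console.log(totalStartupOverheadMs(30, 200));    // 6000 ms (six seconds)
```

Per run the difference feels small; across a loop it's the difference between an interactive agent and one that stalls.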
Resource Limits That Actually Enforce
CPU, memory, and execution time need hard caps. If an agent writes an infinite loop or tries to allocate 50GB of RAM, the sandbox should terminate it—not your host.
Operating system-level limits (cgroups, ulimits) work for containers but don't provide VM-level containment. MicroVMs enforce limits at the hypervisor layer.
Persistent State When You Need It
Some workloads are stateless: run code, return output and destroy the environment. Others need persistence: agents that maintain state across iterations, workflows that build incrementally or RL environments that evolve over time.
Production sandboxes need snapshot and cloning capabilities. Save state, restore it later or spin up 100 identical environments from a single baseline.
Observability by Default
You need stdout, stderr, exit codes, execution time and resource usage for every run. When agents fail, logs are the only way to understand what happened.
Sandboxes should return structured metadata—not just raw output—so you can build auditing, debugging, and compliance workflows on top.
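As a sketch of what "structured metadata" means in practice, here is one possible result shape; the field names are illustrative, not any specific platform's schema:

```typescript
// Structured result a sandbox API might return for each execution.
// Field names here are illustrative assumptions.
interface ExecutionResult {
  stdout: string;
  stderr: string;
  exitCode: number;
  durationMs: number;
  maxMemoryBytes: number;
}

// With structured metadata, auditing logic reads fields directly
// instead of parsing raw output.
function classify(result: ExecutionResult): 'success' | 'failure' {
  return result.exitCode === 0 ? 'success' : 'failure';
}
```

Alerting, billing, and compliance pipelines can then consume these fields without ever touching the raw program output.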
MicroVMs: The Execution Primitive for AI Workloads
MicroVMs combine the isolation of traditional VMs with the speed and density of containers. They're purpose-built for running untrusted, ephemeral workloads at scale.
How MicroVMs Work
Each microVM runs its own kernel and has hardware-enforced boundaries via a minimal hypervisor. Unlike containers, which share the host kernel, microVMs provide true VM-level separation.
Unlike traditional VMs, which include full BIOS emulation and device drivers, microVMs strip out everything non-essential. Boot times drop from seconds to milliseconds. The result: strong isolation without the performance penalty.
When to Use MicroVMs Over Containers
Use microVMs when:
- Code is untrusted or AI-generated
- Multi-tenancy requires hard boundaries
- Compliance demands VM-level isolation
- Workloads are ephemeral and iterate frequently
- Security incidents would be high-impact
Use containers when:
- Code is reviewed and trusted
- You control the entire stack
- Workload density is the primary constraint
- You already have mature container orchestration
The Cost Equation
MicroVMs sound expensive—more overhead than containers, right?
Not necessarily. Modern microVM platforms use intelligent cloning and deduplication to reduce storage and compute costs. Instead of provisioning full cloud VMs per execution, microVMs share base images and clone on-demand.
Sandboxing AI Code in Practice
Theory matters, but practical implementation matters more. Here's what secure AI code execution actually looks like.
The Execution Flow
- Agent generates code: Your LLM or agent workflow produces a script (Python, TypeScript, Bash, etc.)
- API handoff: Your application sends the code to a sandbox API with resource limits and timeout constraints
- Isolated execution: The sandbox spins up a microVM, executes the code inside it, and enforces caps on CPU, memory, and runtime
- Structured output: The API returns stdout, stderr, exit code, and execution metadata—sandbox terminates immediately after
No persistent infrastructure, no manual VM provisioning. Just an API call that returns results.
Code Example: Akira Sandbox API
```typescript
import Akira from 'akiralabs';

const client = new Akira({
  apiKey: process.env['AKIRA_API_KEY'],
});

// Create a sandbox
const sandbox = await client.sandboxes.create({
  image: 'akiralabs/akira-default-sandbox',
});

// Execute AI-generated code
const result = await client.sandboxes.execute(sandbox.id, {
  command: 'python agent_generated_script.py',
});

console.log(result.stdout);   // Output from execution
console.log(result.exitCode); // 0 for success, non-zero for failure
```

The sandbox creates in under a second, executes the code in a hardware-isolated microVM and returns structured results. If the code misbehaves, it's contained within that microVM—nothing touches your infrastructure.
Persistent State with Snapshots
For stateful workloads—agents that learn over time, iterative builds or RL training loops—snapshots let you save and restore execution environments.
```typescript
// Create a snapshot of the current sandbox state
const snapshot = await client.sandboxes.createSnapshot(sandbox.id);

// Later: restore from that snapshot
const restoredSandbox = await client.sandboxes.createFromSnapshot({
  snapshotId: snapshot.id,
});

// Or clone it 100 times for parallel execution
const clones = await Promise.all(
  Array(100).fill(null).map(() =>
    client.sandboxes.createFromSnapshot({ snapshotId: snapshot.id })
  )
);
```
This is how you scale agent swarms, parallelize workloads or maintain consistent environments across runs.
Resource Limits That Actually Work
Every sandbox should enforce:
- CPU caps: Prevent runaway processes from consuming all cores
- Memory limits: Terminate executions that try to allocate more than allowed
- Execution timeouts: Kill long-running scripts after a defined threshold
- Network policies: Control outbound access, block internal IPs, enforce allowlists
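The sandbox enforces these caps at the hypervisor layer, but an application-side guard is still useful for failing fast on the client. A minimal sketch of a timeout wrapper (the function name and error message are assumptions, not part of any SDK):

```typescript
// App-side timeout guard: rejects if the sandbox call doesn't settle in time.
// This complements, not replaces, limits enforced inside the sandbox.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`execution timed out after ${ms}ms`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

Wrapping every execution call this way means a hung sandbox request surfaces as a client-side error on your schedule, rather than blocking the agent loop indefinitely.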
Real-World Use Cases
Where teams actually deploy sandboxed AI code execution:
Autonomous Coding Agents
Agents like Devin or Cursor-style workflows generate code, run tests, debug failures and iterate. Each execution needs isolation—tests might install dependencies, modify files or spawn processes.
MicroVMs let these agents run freely without risking the host environment. Snapshot the baseline environment, clone it per iteration, execute and destroy.
Multi-Tenant AI Platforms
If you're building a platform where customers deploy their own agents or LLM workflows, you need hard tenant boundaries. One customer's runaway agent can't impact another's workloads.
VM-level isolation ensures tenant A's code physically cannot access tenant B's data, even if both run on the same host.
Agentic CI/CD Pipelines
Agents that generate deployment scripts, run integration tests, or validate infrastructure-as-code need sandboxed execution. You don't want an AI-generated script accidentally deploying to production or modifying live configs.
Sandboxes provide a safe staging ground: execute the generated logic, validate the results, then promote to real infrastructure only if checks pass.
LLM-Powered Data Pipelines
Code that scrapes websites, transforms datasets, or runs analyses often comes from LLM-generated scripts. These workloads are ephemeral, potentially risky, and need resource caps.
MicroVMs handle the unpredictability: if a script hangs, exceeds memory, or tries to exfiltrate data, the sandbox kills it and returns an error.
Reinforcement Learning Environments
RL agents need reproducible, isolated environments to train in. Spin up a baseline microVM, let the agent interact with it, snapshot after each episode, and reset or clone as needed.
This is how you scale RL training across hundreds of parallel environments without manual infrastructure management.
Comparing Execution Options
Serverless Functions
Serverless functions (AWS Lambda, Google Cloud Functions, etc.) seem like they solve this—ephemeral, auto-scaling, and managed. But serverless has constraints:
- Execution time limits: Most cap at 15 minutes
- Cold start latency: Can take seconds for non-warmed functions
- Limited isolation control: You don't control the underlying VM or kernel
- No persistent state: Functions are stateless by design; snapshots and cloning aren't native
- Vendor lock-in: Tightly coupled to cloud provider APIs
Serverless works for simple, stateless executions. For complex agent workflows that need persistence, custom images, or sub-second iteration loops, dedicated sandbox infrastructure is the better fit.
The Decision Tree
Run locally if: You're prototyping, iterating solo and haven't hit scale or multi-user needs yet.
Use containers if: Code is trusted, you have existing orchestration (Kubernetes), and isolation requirements are process-level.
Use traditional VMs if: You need full OS control, workloads are long-running, and boot time isn't critical.
Use microVMs if: Code is AI-generated, untrusted, or multi-tenant; you need VM-level isolation with sub-second startup; and workloads are ephemeral or iterative.
Use serverless if: Executions are simple, stateless, under 15 minutes, and you're fine with vendor APIs.
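The decision tree above can be sketched as a function; the profile fields and branch order are a simplification of this section's guidance, not a formal policy:

```typescript
type Runtime = 'local' | 'containers' | 'traditional-vms' | 'microvms' | 'serverless';

interface WorkloadProfile {
  prototyping: boolean;       // solo, pre-scale experimentation
  trusted: boolean;           // reviewed, human-written code
  multiTenant: boolean;       // hard tenant boundaries required
  needsState: boolean;        // snapshots / persistence across runs
  longRunning: boolean;       // full OS control, boot time not critical
  shortAndStateless: boolean; // simple runs well under 15 minutes
}

function pickRuntime(w: WorkloadProfile): Runtime {
  if (w.prototyping) return 'local';
  // Untrusted, multi-tenant, or stateful workloads need VM-level isolation.
  if (!w.trusted || w.multiTenant || w.needsState) return 'microvms';
  if (w.longRunning) return 'traditional-vms';
  if (w.shortAndStateless) return 'serverless';
  return 'containers';
}
```

Note the ordering: trust and tenancy questions come before cost and convenience, because they are the constraints you can't retrofit later.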
For most production AI agent workflows, microVMs hit the sweet spot: strong isolation, fast iteration, and API-driven simplicity.
How Akira Solves This
Akira provides a sandbox platform purpose-built for AI-generated code execution. Hardware-isolated microVMs, sub-second cold starts, persistent snapshots and a REST API that abstracts all the complexity.
What You Get
- VM-level isolation for every execution
- Sub-1s sandbox startup
- Snapshot and cloning for stateful workflows
- Resource limits enforced at the hypervisor layer
- Structured output (stdout, stderr, exit codes, execution metadata)
- Simple API and SDKs (Python, TypeScript)
- Up to 75% lower compute costs
What You Don't Manage
- Hypervisor configuration
- VM provisioning
- Image caching and storage
- Orchestration and scaling
- Security patching
You call an API. The platform handles the rest.
Getting Started
If you're running AI-generated code in production—or planning to—start with these steps:
- Audit your current execution model: Where does AI code run today? What isolation exists? What happens if it misbehaves?
- Identify high-risk workloads: Multi-tenant environments, customer-facing agents, or anything iterating in production loops.
- Test with sandboxed execution: Spin up a microVM platform (like Akira), migrate a single workload, and measure isolation, latency, and cost.
- Add observability: Capture logs, execution metadata, and resource usage for every run. Build alerting on failures.
- Scale incrementally: Move more workloads into sandboxes as confidence grows. Use snapshots and cloning to handle stateful use cases.
The shift from "run it somewhere" to "run it in isolated, purpose-built infrastructure" isn't optional for production AI systems. It's the difference between scaling safely and explaining incidents to your security team.