Where Should AI‑Generated Code Run? A Practical Guide to Sandboxes and MicroVMs in Production

Your agent just generated 200 lines of code. Where does it run? Localhost breaks at scale, containers share kernels, VMs are slow. Learn why microVMs are becoming the default for production AI code execution—and how to choose the right sandbox.

Cover graphic: 'Where Should AI Code Run?'

The Execution Question Nobody Asks Until It's Too Late

Your agent just wrote 200 lines of Python. Claude generated a data pipeline. Your LLM workflow synthesized a deployment script. Where does it run?

Most teams default to one of three paths: run it on their local machine, spin up a container or just execute it directly in their app runtime. Each works—until it doesn't.

The moment AI-generated code hits production scale, three problems surface: isolation breaks, costs spiral and compliance gaps appear. By the time you're debugging why an agent crashed your cluster or explaining to security how untrusted code accessed internal APIs, the architectural decision is already made.

This guide covers where AI code should actually run, what sandboxing really means in production and why microVMs are becoming the default execution primitive for agent infrastructure.

The Localhost-to-Production Gap

Running Locally Works Until It Doesn't

When you're prototyping, running AI-generated code on your laptop is fine. You have full control, fast iteration and zero infrastructure overhead.

But local execution doesn't scale. You can't orchestrate hundreds of agents on one machine, you can't enforce resource limits consistently and you definitely can't replicate this setup for your team or customers.

The Container Trap

Containers feel like the obvious next step. They're familiar, portable and fast to spin up. But containers share the host kernel.

If AI-generated code exploits a kernel vulnerability or escapes via a misconfigured namespace, it can access host resources, other containers and your internal systems.

For reviewed, trusted applications, this is manageable. For runtime-generated, non-deterministic agent code, it's a structural weakness.

Direct Execution in App Runtime

Some teams execute AI code directly in their application's runtime environment—Node processes, Python workers, Lambda functions.

This approach has the fastest cold start (zero), but also zero isolation. A runaway loop consumes all memory. Malicious code reads environment variables. One bad execution crashes the entire service.

What Production AI Code Execution Actually Needs

AI-generated code behaves differently than human-written code. It's iterative, non-deterministic and often untrusted until validated. Production execution needs to account for these properties.

Hard Isolation Boundaries

Each execution should run in its own isolated environment with VM-level separation. If code misbehaves, it can't touch other workloads, your infrastructure or customer data.

Containers provide process-level isolation. MicroVMs provide hardware-level isolation. The difference matters when code is untrusted by default.

Sub-Second Startup Times

Agents operate in loops. Generate code, execute, evaluate and iterate. If infrastructure adds seconds of latency per iteration, it breaks the workflow.

Traditional VMs boot in 5-30 seconds. MicroVMs boot in under a second. For agent loops running dozens of iterations, this compounds fast.
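As a back-of-the-envelope illustration of how that compounds (the iteration count and the mid-range boot time below are assumptions, not measurements):

```typescript
// Illustrative comparison of cumulative boot overhead across an agent loop.
// Boot times are the rough figures quoted above; iteration count is arbitrary.
const iterations = 30;

const traditionalVmBootSec = 10; // mid-range of the 5-30s figure
const microVmBootSec = 0.5;      // "under a second"

const traditionalOverhead = iterations * traditionalVmBootSec; // 300s of waiting
const microVmOverhead = iterations * microVmBootSec;           // 15s of waiting

console.log(`Traditional VMs: ${traditionalOverhead}s of boot overhead`);
console.log(`MicroVMs: ${microVmOverhead}s of boot overhead`);
```

Five minutes of pure boot time per loop versus fifteen seconds, before any actual work happens.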

Resource Limits That Actually Enforce

CPU, memory, and execution time need hard caps. If an agent writes an infinite loop or tries to allocate 50GB of RAM, the sandbox should terminate it—not your host.

Operating system-level limits (cgroups, ulimits) work for containers but don't provide VM-level containment. MicroVMs enforce limits at the hypervisor layer.
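For contrast, here is a minimal Node.js sketch of process-level enforcement: a wall-clock timeout and an output cap on a child process. It shows what OS-level limits buy you—but the child still shares the host kernel while it runs.

```typescript
import { execFileSync } from 'node:child_process';

// Run an untrusted script with a wall-clock timeout and an output cap.
// These are process-level controls: the child gets killed, but it shared
// the host kernel the whole time it was running.
function runWithLimits(script: string, timeoutMs: number): string {
  try {
    return execFileSync('node', ['-e', script], {
      timeout: timeoutMs,      // kill the child after this many ms
      maxBuffer: 1024 * 1024,  // cap captured output at 1 MB
      encoding: 'utf8',
    });
  } catch {
    return 'terminated';       // timeout or output cap hit
  }
}

// An infinite loop is killed at the timeout instead of hanging the host.
console.log(runWithLimits('while (true) {}', 500));
console.log(runWithLimits('console.log("ok")', 5000).trim());
```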

Persistent State When You Need It

Some workloads are stateless: run code, return output and destroy the environment. Others need persistence: agents that maintain state across iterations, workflows that build incrementally or RL environments that evolve over time.

Production sandboxes need snapshot and cloning capabilities. Save state, restore it later or spin up 100 identical environments from a single baseline.

Observability by Default

You need stdout, stderr, exit codes, execution time and resource usage for every run. When agents fail, logs are the only way to understand what happened.

Sandboxes should return structured metadata—not just raw output—so you can build auditing, debugging, and compliance workflows on top.
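As a sketch of what structured output might look like—the field names here are hypothetical, not any particular platform's schema:

```typescript
// Illustrative shape for a structured execution result (field names are
// hypothetical, not tied to a specific sandbox API).
interface ExecutionResult {
  stdout: string;
  stderr: string;
  exitCode: number;
  durationMs: number;
  peakMemoryMb: number;
}

// With structured fields, auditing and alerting become simple predicates
// instead of log scraping.
function needsAttention(r: ExecutionResult): boolean {
  return r.exitCode !== 0 || r.stderr.length > 0 || r.durationMs > 60_000;
}

const ok: ExecutionResult = {
  stdout: 'done', stderr: '', exitCode: 0, durationMs: 1200, peakMemoryMb: 64,
};
const failed: ExecutionResult = { ...ok, exitCode: 1, stderr: 'Traceback' };

console.log(needsAttention(ok));     // false
console.log(needsAttention(failed)); // true
```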

MicroVMs: The Execution Primitive for AI Workloads

MicroVMs combine the isolation of traditional VMs with the speed and density of containers. They're purpose-built for running untrusted, ephemeral workloads at scale.

How MicroVMs Work

Each microVM runs its own kernel and has hardware-enforced boundaries via a minimal hypervisor. Unlike containers, which share the host kernel, microVMs provide true VM-level separation.

Unlike traditional VMs, which include full BIOS emulation and device drivers, microVMs strip out everything non-essential. Boot times drop from seconds to milliseconds. The result: strong isolation without the performance penalty.

When to Use MicroVMs Over Containers

Use microVMs when:

  • Code is untrusted or AI-generated
  • Multi-tenancy requires hard boundaries
  • Compliance demands VM-level isolation
  • Workloads are ephemeral and iterate frequently
  • Security incidents would be high-impact

Use containers when:

  • Code is reviewed and trusted
  • You control the entire stack
  • Workload density is the primary constraint
  • You already have mature container orchestration

The Cost Equation

MicroVMs sound expensive—more overhead than containers, right?

Not necessarily. Modern microVM platforms use intelligent cloning and deduplication to reduce storage and compute costs. Instead of provisioning full cloud VMs per execution, microVMs share base images and clone on-demand.

Sandboxing AI Code in Practice

Theory matters, but practical implementation matters more. Here's what secure AI code execution actually looks like.

The Execution Flow

  1. Agent generates code: Your LLM or agent workflow produces a script (Python, TypeScript, Bash, etc.)
  2. API handoff: Your application sends the code to a sandbox API with resource limits and timeout constraints
  3. Isolated execution: The sandbox spins up a microVM, executes the code inside it, and enforces caps on CPU, memory, and runtime
  4. Structured output: The API returns stdout, stderr, exit code, and execution metadata—sandbox terminates immediately after

No persistent infrastructure, no manual VM provisioning. Just an API call that returns results.

Code Example: Akira Sandbox API

```typescript
import Akira from 'akiralabs';

const client = new Akira({
  apiKey: process.env['AKIRA_API_KEY'],
});

// Create a sandbox
const sandbox = await client.sandboxes.create({
  image: 'akiralabs/akira-default-sandbox',
});

// Execute AI-generated code
const result = await client.sandboxes.execute(sandbox.id, {
  command: 'python agent_generated_script.py',
});

console.log(result.stdout);   // Output from execution
console.log(result.exitCode); // 0 for success, non-zero for failure
```

The sandbox spins up in under a second, executes the code in a hardware-isolated microVM and returns structured results. If the code misbehaves, it's contained within that microVM—nothing touches your infrastructure.

Persistent State with Snapshots

For stateful workloads—agents that learn over time, iterative builds or RL training loops—snapshots let you save and restore execution environments.

```typescript
// Create a snapshot of the current sandbox state
const snapshot = await client.sandboxes.createSnapshot(sandbox.id);

// Later: restore from that snapshot
const restoredSandbox = await client.sandboxes.createFromSnapshot({
  snapshotId: snapshot.id,
});

// Or clone it 100 times for parallel execution
const clones = await Promise.all(
  Array(100).fill(null).map(() =>
    client.sandboxes.createFromSnapshot({ snapshotId: snapshot.id })
  )
);
```

This is how you scale agent swarms, parallelize workloads or maintain consistent environments across runs.

Resource Limits That Actually Work

Every sandbox should enforce:

  • CPU caps: Prevent runaway processes from consuming all cores
  • Memory limits: Terminate executions that try to allocate more than allowed
  • Execution timeouts: Kill long-running scripts after a defined threshold
  • Network policies: Control outbound access, block internal IPs, enforce allowlists
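The decision logic behind these caps can be sketched as a simple predicate. The field names and thresholds below are illustrative, and real platforms enforce them at the hypervisor layer rather than in application code:

```typescript
// Illustrative limit spec and enforcement check. Real enforcement happens
// at the hypervisor layer; this just shows the shape of the decision.
interface Limits { cpuCores: number; memoryMb: number; timeoutMs: number }
interface Usage  { cpuCores: number; memoryMb: number; elapsedMs: number }

function shouldTerminate(usage: Usage, limits: Limits): string | null {
  if (usage.elapsedMs > limits.timeoutMs) return 'timeout exceeded';
  if (usage.memoryMb > limits.memoryMb) return 'memory limit exceeded';
  if (usage.cpuCores > limits.cpuCores) return 'cpu limit exceeded';
  return null; // within limits: keep running
}

const limits: Limits = { cpuCores: 2, memoryMb: 512, timeoutMs: 30_000 };

console.log(shouldTerminate({ cpuCores: 1, memoryMb: 256, elapsedMs: 5_000 }, limits));   // null
console.log(shouldTerminate({ cpuCores: 1, memoryMb: 2_048, elapsedMs: 5_000 }, limits)); // "memory limit exceeded"
```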

Real-World Use Cases

Where teams actually deploy sandboxed AI code execution:

Autonomous Coding Agents

Agents like Devin or Cursor-style workflows generate code, run tests, debug failures and iterate. Each execution needs isolation—tests might install dependencies, modify files or spawn processes.

MicroVMs let these agents run freely without risking the host environment. Snapshot the baseline environment, clone it per iteration, execute and destroy.

Multi-Tenant AI Platforms

If you're building a platform where customers deploy their own agents or LLM workflows, you need hard tenant boundaries. One customer's runaway agent can't impact another's workloads.

VM-level isolation ensures tenant A's code physically cannot access tenant B's data, even if both run on the same host.

Agentic CI/CD Pipelines

Agents that generate deployment scripts, run integration tests, or validate infrastructure-as-code need sandboxed execution. You don't want an AI-generated script accidentally deploying to production or modifying live configs.

Sandboxes provide a safe staging ground: execute the generated logic, validate the results, then promote to real infrastructure only if checks pass.
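That gate can be sketched as follows; `sandboxRun` here is a stub standing in for whatever sandbox API you actually use:

```typescript
// Illustrative staging gate: run AI-generated logic in a sandbox first,
// promote to real infrastructure only if validation passes.
interface RunResult { exitCode: number; stdout: string }

// Stub standing in for a real sandboxed execution call.
function sandboxRun(script: string): RunResult {
  return script.includes('fail')
    ? { exitCode: 1, stdout: '' }
    : { exitCode: 0, stdout: 'plan: 3 resources to add' };
}

function gatedDeploy(script: string): 'promoted' | 'rejected' {
  const result = sandboxRun(script);
  if (result.exitCode !== 0) return 'rejected';            // sandboxed run failed
  if (!result.stdout.includes('plan:')) return 'rejected'; // output sanity check
  return 'promoted'; // checks passed: safe to apply to real infrastructure
}

console.log(gatedDeploy('terraform plan')); // prints "promoted"
console.log(gatedDeploy('this will fail')); // prints "rejected"
```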

LLM-Powered Data Pipelines

Code that scrapes websites, transforms datasets, or runs analyses often comes from LLM-generated scripts. These workloads are ephemeral, potentially risky, and need resource caps.

MicroVMs handle the unpredictability: if a script hangs, exceeds memory, or tries to exfiltrate data, the sandbox kills it and returns an error.

Reinforcement Learning Environments

RL agents need reproducible, isolated environments to train in. Spin up a baseline microVM, let the agent interact with it, snapshot after each episode, and reset or clone as needed.

This is how you scale RL training across hundreds of parallel environments without manual infrastructure management.

Comparing Execution Options

Serverless Functions

Serverless functions (AWS Lambda, Google Cloud Functions, etc.) seem like they solve this—ephemeral, auto-scaling, and managed. But serverless has constraints:

  • Execution time limits: Most cap at 15 minutes
  • Cold start latency: Can take seconds for non-warmed functions
  • Limited isolation control: You don't control the underlying VM or kernel
  • No persistent state: Functions are stateless by design; snapshots and cloning aren't native
  • Vendor lock-in: Tightly coupled to cloud provider APIs

Serverless works for simple, stateless executions. For complex agent workflows that need persistence, custom images, or sub-second iteration loops, dedicated sandbox infrastructure is the better fit.

The Decision Tree

Run locally if: You're prototyping, iterating solo and haven't hit scale or multi-user needs yet.

Use containers if: Code is trusted, you have existing orchestration (Kubernetes), and isolation requirements are process-level.

Use traditional VMs if: You need full OS control, workloads are long-running, and boot time isn't critical.

Use microVMs if: Code is AI-generated, untrusted, or multi-tenant; you need VM-level isolation with sub-second startup; and workloads are ephemeral or iterative.

Use serverless if: Executions are simple, stateless, under 15 minutes, and you're fine with vendor APIs.

For most production AI agent workflows, microVMs hit the sweet spot: strong isolation, fast iteration, and API-driven simplicity.

How Akira Solves This

Akira provides a sandbox platform purpose-built for AI-generated code execution. Hardware-isolated microVMs, sub-second cold starts, persistent snapshots and a REST API that abstracts all the complexity.

What You Get

  • VM-level isolation for every execution
  • Sub-1s sandbox startup
  • Snapshot and cloning for stateful workflows
  • Resource limits enforced at the hypervisor layer
  • Structured output (stdout, stderr, exit codes, execution metadata)
  • Simple API and SDKs (Python, TypeScript)
  • Up to 75% lower compute costs

What You Don't Manage

  • Hypervisor configuration
  • VM provisioning
  • Image caching and storage
  • Orchestration and scaling
  • Security patching

You call an API. The platform handles the rest.

Getting Started

If you're running AI-generated code in production—or planning to—start with these steps:

  1. Audit your current execution model: Where does AI code run today? What isolation exists? What happens if it misbehaves?
  2. Identify high-risk workloads: Multi-tenant environments, customer-facing agents, or anything iterating in production loops.
  3. Test with sandboxed execution: Spin up a microVM platform (like Akira), migrate a single workload, and measure isolation, latency, and cost.
  4. Add observability: Capture logs, execution metadata, and resource usage for every run. Build alerting on failures.
  5. Scale incrementally: Move more workloads into sandboxes as confidence grows. Use snapshots and cloning to handle stateful use cases.

The shift from "run it somewhere" to "run it in isolated, purpose-built infrastructure" isn't optional for production AI systems. It's the difference between scaling safely and explaining incidents to your security team.
