The sandbox is a function call.
pyex is a Python 3 interpreter written in Elixir, built as an execution substrate for agent loops. Sandboxed code never touches a Python runtime, a process, or your filesystem — it reaches an interpreter that sees only the capabilities you pass in. Tenant boot: microseconds. Container cold start: seconds.
# the whole pitch: the model emits Python, your tools are host functions,
# isolation is deny-by-default — and it's all one function call.
tools = %{
  "search" => {:builtin, fn [q] -> MyApp.Search.run(q) end},
  "fetch" => {:builtin, fn [url] -> MyApp.HTTP.get(url) end}
}

{:ok, result, ctx} =
  Pyex.run(code_the_model_wrote,
    modules: %{"agent" => %{"tools" => tools}},
    filesystem: %{"notes.md" => scratchpad},
    limits: [timeout: 5_000, max_memory_bytes: 50_000_000])
why code mode needs a different shape
Tool-calling agents pay one model round-trip per action. Code mode collapses ten tool calls into one program — but now you're executing untrusted, model-written code on every step. The industry answer is a VM per step, which reintroduces the latency you were removing and puts an RPC boundary between the agent's code and every tool it calls. pyex's answer: interpret the Python yourself, in-process — a step costs microseconds, and a tool call is a host function dispatch. No IPC, no marshalling, no path from Python source to an OS process.
| | Sandbox service / microVM | pyex |
|---|---|---|
| Start a run | seconds cold, or a warm pool to manage | ~200 µs, no pool |
| Call a tool | HTTP/RPC round-trip | Elixir function dispatch |
| Keep agent state | serialize + ship | a value on your heap |
| Per-tenant cost | a VM | a struct |
the trust boundary is a diff you can read
open() writes to the map you passed in. requests.get hits your allowlist. There is no os.exec because it doesn't exist. And a static analyzer walks the compiled BEAM code on every CI run and fails the build if anything under lib/pyex references File, Port, System.cmd, spawn, or the host environment. The sandbox guarantee is a CI gate, not a code-review promise.
# Deny by default. Every effect is a capability you chose to hand in.
Pyex.run(source,
  filesystem: %{"data.json" => json},                            # open() sees only this
  network: [%{allowed_url_prefix: "https://api.example.com/"}],
  env: %{"API_KEY" => key},                                      # injected, never in source
  limits: [timeout: 5_000, max_memory_bytes: 50_000_000])
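To make the deny-by-default model concrete, here is a minimal plain-Python sketch of the idea. It is an illustration only, not pyex's implementation: `make_open`, `_DictWriter`, and `url_allowed` are hypothetical names, and the only behavior taken from the text above is that open() sees a map you passed in and that network access is checked against a URL prefix.

```python
import io


class _DictWriter(io.StringIO):
    """Hypothetical write handle: on close, the text lands in the dict, never on disk."""

    def __init__(self, files, path):
        super().__init__()
        self._files = files
        self._path = path

    def close(self):
        if not self.closed:
            self._files[self._path] = self.getvalue()
        super().close()


def make_open(files):
    """Return an open() replacement that can only see the dict it was given."""

    def sandbox_open(path, mode="r"):
        if "w" in mode:
            return _DictWriter(files, path)
        if path not in files:
            # Anything the host did not hand in simply does not exist.
            raise FileNotFoundError(path)
        return io.StringIO(files[path])

    return sandbox_open


def url_allowed(url, allowed_prefixes):
    """Allow a request only if the URL starts with one of the granted prefixes."""
    return any(url.startswith(prefix) for prefix in allowed_prefixes)
```

One design detail this sketch makes visible: with plain prefix matching, the trailing slash in "https://api.example.com/" matters. Without it, a host such as api.example.com.evil.test would also match the prefix.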
And every run returns an unforgeable capability ledger — an OpenTelemetry span tree of every
file, URL, and store the program touched, even when it crashed. Preview effects before they
happen: copy-on-write overlays stage open(...).write and store.put for
review, then commit/1 applies exactly the run you approved — deterministic under a
seed, so there is no time-of-check/time-of-use gap.
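The staging idea can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not pyex's overlay: the `Overlay` class and its method names are hypothetical, and the real commit/1 signature is not shown in this document.

```python
class Overlay:
    """Stage writes over a base mapping; the base is untouched until commit()."""

    def __init__(self, base):
        self._base = base
        self.staged = {}  # path -> pending content, available for review

    def read(self, path):
        # Reads see staged writes first, then fall through to the base.
        if path in self.staged:
            return self.staged[path]
        return self._base[path]

    def write(self, path, text):
        # Writes never reach the base directly.
        self.staged[path] = text

    def commit(self):
        # Apply exactly the staged writes, then clear the stage.
        self._base.update(self.staged)
        self.staged = {}


base = {"notes.md": "old"}
overlay = Overlay(base)
overlay.write("notes.md", "new")
preview = dict(overlay.staged)  # what a reviewer would inspect
```

At this point `base` still holds the old content while `overlay.read("notes.md")` returns the new one. Calling `overlay.commit()` is the only step that changes `base`.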
the loop itself is sandboxed
# The model wrote this. It runs 10 steps without a single
# network hop between the code and the tools.
import json
from agent import call_model, tools

state = {"steps": []}
for _ in range(10):
    decision = call_model(state)
    if decision["action"] == "stop":
        break
    result = tools[decision["tool"]](*decision["args"])
    state["steps"].append({"tool": decision["tool"], "result": result})
print(json.dumps(state))
Generators are continuations, so a step can pause and resume without owning a process.
asyncio.gather interleaves like CPython. Retries, planners, eval harnesses — the loop
logic the model emits just runs.
See examples/research_agent.py for the runnable proof.
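For readers unfamiliar with the generator-as-continuation pattern, here is a small sketch in standard Python. It shows only the language-level behavior: a step yields a tool request, pauses, and is resumed later with the result. The names `agent_step` and `drive` and the stub tools are hypothetical, and how pyex represents the paused state internally is not described in this document.

```python
def agent_step(question):
    """A step that pauses at each tool request instead of calling the tool."""
    hits = yield ("search", question)  # pause: the host runs the search
    page = yield ("fetch", hits[0])    # pause again: the host fetches
    return f"{question}: {page}"


def drive(gen, tools):
    """Host side: resume the generator with each tool result until it returns."""
    try:
        request = next(gen)
        while True:
            name, arg = request
            request = gen.send(tools[name](arg))
    except StopIteration as done:
        return done.value


# Stub tools standing in for real host functions.
stub_tools = {
    "search": lambda q: ["https://example.com/" + q],
    "fetch": lambda url: "body of " + url,
}

answer = drive(agent_step("beam"), stub_tools)
```

Between `next` and `send` the step is just a suspended generator object, so the host can do anything in the gap (wait on I/O, enforce a budget, run other steps) without a dedicated thread or process for the guest.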
numbers, reproducibly
| Workload | p50 | p99 |
|---|---|---|
| FizzBuzz (100 iterations) | 182 µs | 238 µs |
| Algorithms suite (~150 LOC: sieve + sort + fib + stats) | 1.67 ms | 2.04 ms |
| FastAPI cold boot | 221 µs | 302 µs |
| FastAPI route — list + Jinja2 render | 108 µs | 166 µs |
| FastAPI route — 404 | 9 µs | 19 µs |
mix run bench/readme_bench.exs
The honest tradeoff: 10–100× slower than CPython for pure CPU work — and it doesn't matter, because agent steps are dominated by tool I/O, JSON shaping, and routing. Compute budgets exclude I/O time: an agent waiting on a slow tool isn't killed for it; an infinite loop is.
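The budget rule can be illustrated with a step counter. This is a toy model, not pyex's accounting: `StepBudget`, `BudgetExceeded`, and the tiny op-list "program" format are hypothetical. The assumption is simply that compute is charged per interpreted step while time spent inside a tool call is not charged at all.

```python
class BudgetExceeded(Exception):
    """Raised when a program uses more compute steps than it was granted."""


class StepBudget:
    def __init__(self, max_steps):
        self.remaining = max_steps

    def charge(self, steps=1):
        self.remaining -= steps
        if self.remaining < 0:
            raise BudgetExceeded("compute budget exhausted")


def run(program, budget, tools):
    """Run a list of ("compute", n_steps) or ("tool", name) operations."""
    for kind, arg in program:
        if kind == "compute":
            budget.charge(arg)  # interpreter work is metered
        else:
            tools[arg]()        # waiting on a tool costs no steps
    return budget.remaining


slow_tools = {"slow_api": lambda: None}  # stand-in for a slow network call

# An agent that waits on a tool between two small bursts of compute.
io_bound = [("compute", 10), ("tool", "slow_api"), ("compute", 10)]
left = run(io_bound, StepBudget(50), slow_tools)

# A runaway loop: many small compute steps and no I/O.
runaway = [("compute", 1)] * 1000
```

Running `runaway` with the same 50-step budget raises `BudgetExceeded` on the 51st step, while the I/O-bound program finishes with 30 steps to spare no matter how long the tool takes.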
multi-tenancy
A booted app is a struct on your heap. 100,000 tenants is a benchmark file
(bench/multitenant_scaling_bench.exs), not a capacity-planning meeting. Storage
multitenancy is an object boundary, not a tenant_id filter someone forgets.
{:ok, app} = Pyex.Lambda.boot(model_generated_fastapi_source)
{:ok, resp, app} = Pyex.Lambda.handle(app, %{method: "GET", path: "/hello/world"})
# boot once, handle many; state threads through; tenants serialize like any value
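The "object boundary" point can be sketched in plain Python. This mirrors the boot-once, handle-many shape above but is not the Pyex.Lambda API: the `App` class, the `handle` function, and the request fields are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class App:
    """One tenant's whole world: its own store, no shared table to filter."""

    store: tuple = ()  # immutable, so earlier app values stay valid


def handle(app, request):
    """Pure step: return the response and the next app value."""
    if request["method"] == "POST":
        next_app = App(store=app.store + (request["body"],))
        return {"status": 201}, next_app
    return {"status": 200, "items": list(app.store)}, app


# Each tenant is a separate value; there is no tenant_id column to forget.
tenants = {"acme": App(), "globex": App()}

_, tenants["acme"] = handle(tenants["acme"], {"method": "POST", "body": "note-1"})
acme_resp, _ = handle(tenants["acme"], {"method": "GET"})
globex_resp, _ = handle(tenants["globex"], {"method": "GET"})
```

Here `acme_resp` lists the note and `globex_resp` is empty. A request handled against one tenant's value has no reference through which it could read another tenant's store.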
trust, itemized
decimal.
@spec on the public surface; the banned-call tracer fails CI if the sandbox boundary regresses.
defense in depth
pyex stops the 99% cooperatively — step, memory, and output budgets with clean Python errors.
The BEAM stops the rest unconditionally — run each guest in a monitored process with a
GC-enforced max_heap_size and a wall-clock kill (examples/sandbox_server.exs
is the copy-paste). A microVM around the whole node stops the adversary.
One ops property worth quoting: the guest can't move your 5xx rate — verdicts
(ok / error / timeout / OOM) are body fields; HTTP status describes only your service.
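A minimal sketch of that envelope idea, in plain Python. The field names (`verdict`, `result`, `message`) and the `envelope` helper are assumptions for illustration; the actual response format used by examples/sandbox_server.exs is not shown in this document.

```python
def envelope(run_guest):
    """Wrap a guest run so its outcome never changes the HTTP status."""
    try:
        body = {"verdict": "ok", "result": run_guest()}
    except TimeoutError:
        body = {"verdict": "timeout"}
    except MemoryError:
        body = {"verdict": "oom"}
    except Exception as exc:
        # Any other guest failure is reported as data, not as a 5xx.
        body = {"verdict": "error", "message": str(exc)}
    # The status describes the service, which did its job in every case.
    return 200, body


def crashing_guest():
    raise ValueError("bad input from the model")


ok_status, ok_body = envelope(lambda: 42)
err_status, err_body = envelope(crashing_guest)
```

Both calls return status 200. Only the body differs, so dashboards and alerts keyed on 5xx rates keep measuring the host service rather than the quality of model-written code.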
pyex is a hardened library, not a microVM. Against a sophisticated adversary it composes
with stronger isolation rather than replacing it. It runs the Python agents actually write —
json, re, asyncio, pydantic, requests,
fastapi, partial pandas — not all of CPython. And it's an interpreter:
pure CPU work runs 10–100× slower than CPython, which agent workloads don't notice. Naming our own
boundary is the point.