Computer Plane — modular split between Modal, E2B, and Daytona¶
Status: Proposed
Owners: TBD
Tracks issue: TBD (this doc is the canonical spec; the GitHub issue points at it)
Computer Plane is one of two compute planes under the unified ComputeClient contract; see docs/reference/compute-client.md (#785). It advertises Capabilities(dom_aware=False, stealth=True) at session_init. The DOM-aware extensions (state.*, tabs.*, links.*) ship on Browser-Use Plane only; Computer Plane refuses them at the contract level. It is pure-CUA by design.
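As a rough illustration of what "refuses them at the contract level" could look like, the sketch below gates verbs on the advertised capabilities. Only the field names dom_aware and stealth and the three verb namespaces come from the paragraph above; UnsupportedVerbError, ensure_supported, and the verb name tabs.list are hypothetical, and the real definitions live in docs/reference/compute-client.md.

```python
from dataclasses import dataclass

# Verb namespaces named in the paragraph above.
DOM_AWARE_PREFIXES = ("state.", "tabs.", "links.")


@dataclass(frozen=True)
class Capabilities:
    dom_aware: bool
    stealth: bool


class UnsupportedVerbError(Exception):
    """Hypothetical error raised when a plane refuses a verb."""


def ensure_supported(caps: Capabilities, verb: str) -> None:
    """Reject DOM-aware verbs on a plane that does not advertise them."""
    if not caps.dom_aware and verb.startswith(DOM_AWARE_PREFIXES):
        raise UnsupportedVerbError(f"{verb!r} requires dom_aware=True")


COMPUTER_PLANE = Capabilities(dom_aware=False, stealth=True)
ensure_supported(COMPUTER_PLANE, "screenshot")  # allowed: not a DOM-aware verb
```

Calling ensure_supported(COMPUTER_PLANE, "tabs.list") would raise, which is the behavior the contract asks for before any request reaches the computer plane.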
Summary¶
Today Mantis runs the brain (Holo3 / Claude / OpenCUA / Fara / Gemma4), the task loop, the step handlers, Xvfb, and Chrome inside one Modal function invocation. The Chrome profile lives on modal.Volume("osworld-data") mounted at /data. Brain ↔ environment calls are in-process Python.
This document specifies the split of that monolith into two planes connected by a narrow RPC:
- Brain plane: keeps the executor pool, brain, task loop, step handlers, and Augur emission. Unchanged code, lighter image.
- Computer plane: runs Xvfb + Chrome + xdotool behind a small FastAPI service (ComputerAgent). Exposes screenshot + xdotool + opt-in cdp.
The seam between them is the existing GymEnvironment ABC, with ComputerClient as a new factory-built implementation. Step handlers do not change.
The split is delivered in three phases. Phase 0 introduces the seam in-process (no behavior change). Phase 1 lifts the computer plane out as a separate Modal function (still inside the same Modal app, sharing the same volume). Phase 2 makes the computer plane host pluggable so an E2B Desktop or Daytona sandbox can replace the Modal computer-plane function via a deploy-config change.
Goals¶
- Single seam (ComputerClient) for everything that talks to "the computer," replacing direct XdotoolGymEnv instantiation in the executor lifecycle.
- Wire contract that is CUA-pure: screenshot + xdotool(argv) are required; cdp_evaluate / cdp_click_at_point are opt-in. No DOM-aware verbs.
- Phase 1 ships zero new infra: brain and computer planes are two Modal functions in the same app, both mounting osworld-data. Profile bytes stay where they are. No tar.zst, no S3.
- Phase 2 makes E2B Desktop and Daytona pluggable behind the same ComputerClient interface via a RemoteComputerImpl and a per-provider adapter.
- Per-profile lock semantics, Augur emission, plan format, and HTTP API are unchanged.
Non-goals¶
- Moving the brain. Brain stays on Modal in all phases.
- Profile snapshotting / S3 / ProfileSnapshotter. Only required when leaving Modal Volume; fully specified in computer-plane-profile-snapshots.md. The snapshot doc is the gating doc that has to land before any Phase 2 host backend can ship.
- Replacing the GPU executor pool. Holo3 / OpenCUA / Fara stay on their current Modal GPU tiers.
- Multi-region, geo-pinning, sandbox identity management. Out of scope here.
Architecture overview¶
┌────────────────────────────────────────────────────────────────┐
│ Modal app "mantis-cua-server"                                  │
│                                                                │
│  ┌─────────── BRAIN PLANE ───────────┐                         │
│  │ FastAPI → Executor pool →         │                         │
│  │ Brain → Task Loop →               │                         │
│  │ Step Handlers →                   │                         │
│  │ ComputerClient (NEW)              │                         │
│  │  ├── LocalXdotoolImpl  (Phase 0)  │                         │
│  │  ├── RemoteModalImpl   (Phase 1)  │                         │
│  │  ├── RemoteE2BImpl     (Phase 2)  │                         │
│  │  └── RemoteDaytonaImpl (Phase 2)  │                         │
│  └───────────────────────────────────┘                         │
│                    │ HTTPS (Phase 1+)                          │
│                    ▼                                           │
│  ┌───────── COMPUTER PLANE (NEW Modal fn, Phase 1) ─────────┐  │
│  │ ComputerAgent FastAPI:                                   │  │
│  │   POST /screenshot                                       │  │
│  │   POST /xdotool                                          │  │
│  │   POST /cdp      (opt-in)                                │  │
│  │   POST /session  (init: bind tenant + profile)           │  │
│  │ Xvfb + Chrome + xdotool                                  │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
                     │ both mount /data
                     ▼
┌──────────────────────────────────────────────────────┐
│ modal.Volume "osworld-data" (UNCHANGED)              │
│ /data/chrome-profile/<tenant>__<profile_id>/         │
│ /data/results/, /data/models/, /data/training/       │
└──────────────────────────────────────────────────────┘
Phase 2 adds non-Modal backends, connected over the same HTTPS surface. Profile snapshotting only enters the picture when the backend cannot mount osworld-data.
Wire contract¶
The contract is the only thing all backends must implement identically. Anything beyond this surface is an implementation detail of a specific backend.
Endpoints¶
| Method | Path | Required | Notes |
|---|---|---|---|
| POST | /session/init | yes | Bind this ComputerAgent instance to a (tenant_id, profile_id, run_id) triple. Returns session token; subsequent calls carry it as a header. |
| POST | /session/close | yes | Tear down Chrome, flush state, release. |
| POST | /screenshot | yes | Returns PNG bytes + viewport metadata. |
| POST | /xdotool | yes | Executes xdotool argv against the bound display. Returns stdout + rc. |
| POST | /cdp | no (opt-in) | Executes a CDP Runtime.evaluate / Input.* against the bound Chrome target. Disabled unless executor.cdp_enabled=true. |
| GET | /health | yes | Liveness + last-action timestamp. |
Pydantic models (canonical)¶
# src/mantis_agent/gym/computer_wire.py (NEW)
from typing import Literal

from pydantic import BaseModel, Field


class SessionInitRequest(BaseModel):
    tenant_id: str
    profile_id: str
    run_id: str
    proxy_server: str | None = None  # e.g. "http://user:pass@host:port"
    chrome_flags: list[str] = Field(default_factory=list)
    enable_cdp: bool = False  # opt-in escape hatch
    viewport: tuple[int, int] = (1280, 720)


class SessionInitResponse(BaseModel):
    session_token: str  # carry as X-Mantis-Session header
    chrome_pid: int | None = None
    xvfb_display: str  # e.g. ":99"


class ScreenshotRequest(BaseModel):
    # Empty body for v1. Reserved for future params (region, format).
    format: Literal["png"] = "png"


class ScreenshotResponse(BaseModel):
    image_b64: str  # PNG bytes, base64-encoded
    width: int
    height: int
    scroll_y: int  # read from window.scrollY via CDP
    captured_at_ms: int  # unix ms


class XdotoolRequest(BaseModel):
    argv: list[str]  # e.g. ["mousemove", "842", "317", "click", "1"]
    step_id: str  # client-generated; agent dedupes within window
    timeout_ms: int = 5000


class XdotoolResponse(BaseModel):
    stdout: str
    stderr: str
    returncode: int
    deduplicated: bool = False  # true if step_id already saw a successful execution


class CDPRequest(BaseModel):
    expression: str  # JS expression (action-dispatch only by convention)
    await_promise: bool = False
    step_id: str


class CDPResponse(BaseModel):
    result_json: str  # JSON-stringified CDP result
    returncode: int = 0
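To make the header convention concrete, here is a minimal sketch of how a client might assemble a POST /xdotool call. It uses only the standard library and builds the request without sending it. The header name X-Mantis-Session and the body fields come from the models above; build_xdotool_request, the example base URL, and the token value are hypothetical.

```python
import json
import urllib.request
import uuid
from typing import List, Optional


def build_xdotool_request(base_url: str, session_token: str, argv: List[str],
                          step_id: Optional[str] = None,
                          timeout_ms: int = 5000) -> urllib.request.Request:
    """Build (but do not send) a POST /xdotool request.

    Hypothetical helper: the body mirrors XdotoolRequest and the session
    token travels in the X-Mantis-Session header.
    """
    body = {
        "argv": argv,
        "step_id": step_id or uuid.uuid4().hex,  # client-generated
        "timeout_ms": timeout_ms,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/xdotool",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Mantis-Session": session_token,
        },
        method="POST",
    )


# Placeholder URL and token, for illustration only.
req = build_xdotool_request(
    "https://computer-plane.example", "tok-123",
    ["mousemove", "842", "317", "click", "1"], step_id="step-0001",
)
```

A real client would likely validate the body through the Pydantic models instead of a plain dict; the dict keeps this sketch dependency-free.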
Idempotency¶
xdotool is not naturally idempotent: sending a click twice can navigate twice. The wire requires:
- step_id on every xdotool and cdp call, client-generated, unique per logical step.
- ComputerAgent maintains a TTL-bounded LRU of step_id → response (default TTL = 30s, capacity = 1000). A repeated step_id returns the cached response with deduplicated=true.
- Retries by the client use the same step_id. New logical steps use a new step_id.
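The dedup rule above can be sketched as a small TTL-bounded LRU. This is an illustrative, single-process version, not the ComputerAgent implementation: StepDedupCache and run_once are hypothetical names, responses are plain dicts standing in for XdotoolResponse, and caching only returncode == 0 is an assumption drawn from the deduplicated field comment.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Dict, Optional, Tuple


class StepDedupCache:
    """TTL-bounded LRU of step_id -> response (hypothetical sketch)."""

    def __init__(self, ttl_s: float = 30.0, capacity: int = 1000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._capacity = capacity
        self._clock = clock
        self._entries: "OrderedDict[str, Tuple[float, Dict[str, Any]]]" = OrderedDict()

    def get(self, step_id: str) -> Optional[Dict[str, Any]]:
        entry = self._entries.get(step_id)
        if entry is None:
            return None
        stored_at, response = entry
        if self._clock() - stored_at > self._ttl_s:
            del self._entries[step_id]  # entry outlived its TTL
            return None
        self._entries.move_to_end(step_id)  # mark as most recently used
        return response

    def put(self, step_id: str, response: Dict[str, Any]) -> None:
        self._entries[step_id] = (self._clock(), response)
        self._entries.move_to_end(step_id)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used


def run_once(cache: StepDedupCache, step_id: str,
             execute: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Execute a step unless a successful response is already cached."""
    cached = cache.get(step_id)
    if cached is not None:
        return {**cached, "deduplicated": True}
    response = execute()
    if response.get("returncode") == 0:  # assumption: only successes are cached
        cache.put(step_id, response)
    return response
```

Injecting the clock keeps expiry testable without sleeping. A production version would also need a lock if requests for the same session can be handled concurrently.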
Session lifecycle¶
brain plane                            computer plane
───────────                            ──────────────
POST /session/init {tenant,...}   →
                                  ←    SessionInitResponse {token, ...}
                                       [Xvfb + Chrome launched, profile mounted]
POST /screenshot                  →
POST /xdotool {argv, step_id}     →
    ... 50–200 iterations of the loop ...
POST /session/close               →
                                       [Chrome SIGTERM, Xvfb stopped]
/session/init is idempotent on run_id: a second init for the same run_id returns the existing session token. An init with a different run_id against an active session returns HTTP 409 (Conflict).
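A minimal sketch of that init rule, assuming one active session per ComputerAgent instance. SessionRegistry, ActiveSession, and SessionConflict are hypothetical names, and the token format is an arbitrary choice for illustration.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


class SessionConflict(Exception):
    """Would map to HTTP 409 in the service layer."""
    status_code = 409


@dataclass
class ActiveSession:
    run_id: str
    token: str


class SessionRegistry:
    """Tracks the single session bound to this ComputerAgent instance."""

    def __init__(self) -> None:
        self._active: Optional[ActiveSession] = None

    def init(self, run_id: str) -> str:
        if self._active is None:
            # First init: bind the instance and mint a token.
            self._active = ActiveSession(run_id, secrets.token_urlsafe(16))
        elif self._active.run_id != run_id:
            # Another run already owns this instance.
            raise SessionConflict(
                f"active session belongs to run {self._active.run_id}")
        # Same run_id: return the existing token (idempotent).
        return self._active.token

    def close(self) -> None:
        self._active = None
```

Returning the existing token on a repeated init lets the brain plane retry /session/init after a timeout without leaking a second Chrome instance.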
ComputerClient factory¶
# src/mantis_agent/gym/computer_client.py (NEW)
from src.mantis_agent.config import ComputerPlaneConfig  # defined in config.py, see below
from src.mantis_agent.gym.base import GymEnvironment


class ComputerClient(GymEnvironment):
    """Marker base. All impls are GymEnvironment subclasses."""


def make_computer_client(cfg: ComputerPlaneConfig) -> ComputerClient:
    # The impl classes live in their own modules (see the per-phase file tables).
    match cfg.backend:
        case "local": return LocalXdotoolImpl(cfg)  # Phase 0
        case "modal": return RemoteComputerImpl(cfg, ...)  # Phase 1
        case "e2b": return RemoteE2BImpl(cfg, ...)  # Phase 2
        case "daytona": return RemoteDaytonaImpl(cfg, ...)  # Phase 2
        case _: raise ValueError(...)
ComputerPlaneConfig lives in src/mantis_agent/config.py with a backend: Literal[...] field, default "local". Per-executor overrides allowed (e.g. run_claude_cua migrates first; the GPU executors stay on "local" longer).
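One way the per-executor override could resolve is sketched below. It uses a standard-library dataclass as a stand-in for whatever model config.py actually uses. The backend names come from the factory's match arms and per_executor_overrides is named later in this doc; the backend_for helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Literal

# Backend names taken from the factory's match arms.
Backend = Literal["local", "modal", "e2b", "daytona"]


@dataclass
class ComputerPlaneConfig:
    backend: Backend = "local"
    # Executor name -> backend; executors not listed use the default.
    per_executor_overrides: Dict[str, Backend] = field(default_factory=dict)

    def backend_for(self, executor: str) -> Backend:
        """Hypothetical helper: resolve the backend for one executor."""
        return self.per_executor_overrides.get(executor, self.backend)


cfg = ComputerPlaneConfig(per_executor_overrides={"run_claude_cua": "modal"})
```

With this shape, cfg.backend_for("run_claude_cua") resolves to "modal" while every other executor stays on "local", so a rollback is just removing the override entry.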
Phase 0 — seam refactor (zero behavior change)¶
Goal: prove the abstraction is faithful by routing today's production through it with no observable change.
File changes¶
| File | Change |
|---|---|
| src/mantis_agent/gym/computer_client.py | NEW. Defines ComputerClient marker + make_computer_client(cfg) factory. |
| src/mantis_agent/gym/local_xdotool_impl.py | NEW. Thin wrapper around the existing XdotoolGymEnv. Implements ComputerClient. |
| src/mantis_agent/gym/computer_wire.py | NEW. Pydantic models above (used by Phase 1 client; defined now to lock the contract). |
| src/mantis_agent/config.py | Add ComputerPlaneConfig, default backend="local". |
| src/mantis_agent/task_loop.py | Replace direct XdotoolGymEnv(...) construction with make_computer_client(cfg). |
| deploy/modal/modal_cua_server.py | Pass cfg.computer_plane into the executor lifecycle. Default "local". |
| tests/gym/test_computer_client_factory.py | NEW. Verifies backend="local" returns a working LocalXdotoolImpl and the existing test suite passes through it. |
Acceptance¶
- All existing XdotoolGymEnv tests pass via the factory.
- A boattrader smoke plan rerun on Modal completes identically (lead count, cost-per-lead within ±5% of the baseline).
- Production unchanged: no Modal redeploy required beyond the seam PR.
Phase 1 — Modal computer-plane function¶
Goal: lift Xvfb + Chrome + xdotool into a separate Modal function, accessed via RemoteComputerImpl. Same Modal app, same volume.
File changes¶
| File | Change |
|---|---|
| deploy/modal/computer_plane.py | NEW. @app.function(volumes={"/data": data_volume}, image=computer_image, cpu=2.0) exposing the ComputerAgent FastAPI via @modal.fastapi_endpoint. Implements all six endpoints. Manages Xvfb (lazy start on first request, lifetime = session). |
| deploy/modal/Dockerfile or modal.Image | Brain plane image loses Xvfb / Chrome / xdotool layers (they move to computer_image). |
| src/mantis_agent/gym/remote_computer_impl.py | NEW. Implements ComputerClient over HTTPS. Generates step_ids. Honors timeouts. Catches transient 5xx, retries with the same step_id. ~150–250 LOC. |
| src/mantis_agent/gym/computer_client.py | Factory dispatches backend="modal" → RemoteComputerImpl(base_url=cfg.modal_computer_plane_url). |
| deploy/modal/modal_cua_server.py | Env var COMPUTER_PLANE_URL resolved per-region from a Modal Dict lookup; passed into RemoteComputerImpl. |
| tests/gym/test_remote_computer_impl.py | NEW. Mock-server tests for retry/dedup/timeout. |
| tests/integration/test_phase1_e2e.py | NEW. Spins up both Modal functions in a staging app; runs a 5-step plan. |
| docs/hosting/modal.md | Update: brain plane vs computer plane, image split, env vars. |
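The retry behavior described for RemoteComputerImpl (catch transient 5xx, retry with the same step_id) might look roughly like the sketch below. The transport is injected as a plain callable so the example runs without a network. post_with_retry, the Transport signature, TransientServerError, and the attempt and backoff defaults are all assumptions, not the planned implementation.

```python
import time
import uuid
from typing import Any, Callable, Dict, Tuple

# Hypothetical transport: (path, json_body) -> (http_status, json_body).
Transport = Callable[[str, Dict[str, Any]], Tuple[int, Dict[str, Any]]]


class TransientServerError(Exception):
    """Raised when every attempt ended in a 5xx response."""


def post_with_retry(transport: Transport, path: str, body: Dict[str, Any],
                    max_attempts: int = 3, backoff_s: float = 0.2,
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """POST a step, retrying 5xx responses with an unchanged step_id."""
    body = dict(body)
    # Generated once per logical step, then reused on every retry so the
    # computer plane can deduplicate.
    body.setdefault("step_id", uuid.uuid4().hex)
    status = 0
    for attempt in range(1, max_attempts + 1):
        status, payload = transport(path, body)
        if 200 <= status < 300:
            return payload
        if status < 500:
            # Client errors are not transient; do not retry them.
            raise RuntimeError(f"{path} failed with status {status}")
        if attempt < max_attempts:
            sleep(backoff_s * attempt)  # linear backoff between attempts
    raise TransientServerError(
        f"{path} still failing after {max_attempts} attempts (last status {status})")
```

Because the step_id is fixed before the first attempt, a retry that follows a lost response is answered from the dedup cache instead of clicking twice.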
Migration order (per-executor)¶
Roll out per-executor behind ComputerPlaneConfig.per_executor_overrides:
- run_claude_cua (0 GPU, easiest, biggest cost win from a clean split)
- run_gemma4_cua (T4 planner)
- run_fara and run_holo3 once the above two are stable
- EvoCUA / OpenCUA tiers last
Rollback is a config flip; no redeploy required.
Acceptance¶
- Per-step p50 latency increase ≤ 20 ms (target: 6–16 ms RT same-region).
- Boattrader plan: leads / cost / runtime within ±10% of pre-split baseline.
- Chrome crash (intentional kill -9 in the computer-plane container) recoverable without restarting the brain executor.
- Modal logs split cleanly: modal app logs mantis-cua-server shows brain; modal app logs mantis-cua-server --function computer_plane shows browser.
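The latency criterion in the list above reduces to comparing medians of per-step timings before and after the split. A small sketch, with made-up sample values that are illustrative only and not measurements; the function names are hypothetical.

```python
import statistics
from typing import Sequence


def p50_increase_ms(baseline_ms: Sequence[float], split_ms: Sequence[float]) -> float:
    """Difference between the median per-step latencies, in milliseconds."""
    return statistics.median(split_ms) - statistics.median(baseline_ms)


def within_latency_budget(baseline_ms: Sequence[float], split_ms: Sequence[float],
                          budget_ms: float = 20.0) -> bool:
    """True when the p50 increase stays at or under the Phase 1 budget."""
    return p50_increase_ms(baseline_ms, split_ms) <= budget_ms


# Illustrative numbers only, not benchmark results.
baseline = [41.0, 39.5, 40.2, 44.8, 40.9]
split = [52.3, 51.0, 55.7, 50.4, 53.1]
ok = within_latency_budget(baseline, split)
```

Here the medians are 40.9 ms and 52.3 ms, so the increase is about 11.4 ms and the check passes against the 20 ms budget.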
Phase 2 — E2B + Daytona backends¶
Goal: make the computer-plane host pluggable. No code changes to step handlers or the brain.
File changes¶
| File | Change |
|---|---|
| deploy/computer-plane/Dockerfile | NEW. Same ComputerAgent code as Phase 1, packaged as a generic Linux image (not modal.Image). |
| deploy/computer-plane/e2b.toml | NEW. E2B Desktop sandbox spec referencing the image. |
| deploy/computer-plane/daytona.toml | NEW. Daytona workspace spec. |
| src/mantis_agent/gym/remote_e2b_impl.py | NEW. ComputerClient impl that drives an E2B Desktop sandbox. Wraps the E2B SDK's mouse/keyboard/screenshot primitives. ~200 LOC. |
| src/mantis_agent/gym/remote_daytona_impl.py | NEW. ComputerClient impl that talks HTTPS to a ComputerAgent running inside a Daytona workspace. Mostly the Phase 1 client with a different provisioning path. |
| src/mantis_agent/observability/profile_snapshotter.py | NEW (gated on Phase 2 actually shipping). Implements tar.zst snapshot + S3 canonical store, with the per-profile lock TTL/renewal semantics. Only relevant for non-Modal backends. |
| docs/reference/computer-plane-profile-snapshots.md | LANDED. Dedicated spec for the snapshot pipeline; the gating doc for any Phase 2 host backend that can't mount osworld-data. |
Acceptance¶
- A boattrader plan family can be routed to E2B via config alone. No code in src/mantis_agent/gym/step_handlers/ or src/mantis_agent/brain_*.py changes.
- A second plan family routed to Daytona simultaneously. Brain dispatches per executor based on ComputerPlaneConfig.per_plan_overrides.
Reliability notes¶
- Idempotency. Discussed under wire contract. step_id everywhere. Augur events carry step_id so post-hoc reconciliation is possible.
- Lock TTL. Phase 1 inherits today's in-process per-profile lock. Phase 2 must move to a TTL'd lock with renewal pings (Modal Dict) because the sandbox can outlive the executor invocation. Spec for that lock lives in the Phase 2 follow-up doc.
- Observability. Augur emission stays brain-side and is unaffected. The computer plane emits its own WARNING-level logs (per feedback_warning_level_for_modal_observability.md) for any state the brain cannot see (Chrome crash, Xvfb segfault, proxy DNS failure).
- CUA-purity boundary. The wire is the enforcement point. CDP endpoints are config-gated per executor (enable_cdp defaults false) and never used to derive grounding (see feedback_browser_infra_rpc_surface.md).
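To illustrate the "TTL'd lock with renewal pings" idea from the Lock TTL note, here is an in-memory sketch. It is not the Phase 2 lock spec, which lives in the follow-up doc: TtlLockTable and its method names are hypothetical, the 60-second TTL is arbitrary, and a plain dict stands in for the shared store. A real shared store would need an atomic compare-and-set that this single-process version does not provide.

```python
import time
from typing import Callable, Dict, Tuple


class TtlLockTable:
    """In-memory stand-in for a shared lock store (hypothetical sketch)."""

    def __init__(self, ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        # profile key -> (owner, expires_at)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def acquire(self, profile_key: str, owner: str) -> bool:
        now = self._clock()
        held = self._locks.get(profile_key)
        if held is not None and held[0] != owner and held[1] > now:
            return False  # another owner holds a live lock
        self._locks[profile_key] = (owner, now + self._ttl_s)
        return True

    def renew(self, profile_key: str, owner: str) -> bool:
        now = self._clock()
        held = self._locks.get(profile_key)
        if held is None or held[0] != owner or held[1] <= now:
            return False  # lock lost or expired; stop using the profile
        self._locks[profile_key] = (owner, now + self._ttl_s)
        return True

    def release(self, profile_key: str, owner: str) -> None:
        held = self._locks.get(profile_key)
        if held is not None and held[0] == owner:
            del self._locks[profile_key]
```

The point of the TTL is that a crashed executor stops renewing, so the profile becomes acquirable again after expiry instead of staying locked forever.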
Open questions¶
- Per-profile lock placement in Phase 2. Modal Dict vs. Redis vs. file-based. Defer until Phase 2 scope starts.
- Screenshot transport. Base64-in-JSON is simple but ~33% overhead. Multipart binary is faster but uglier on the client. Defer; profile first.
- Computer-plane container warm pool. Should computer-plane functions stay warm pinned to a profile (faster) or be ephemeral per-session (cheaper)? Production data needed.
- Cross-region. Out of scope until a customer asks for it.
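The ~33% figure in the screenshot-transport question follows from base64 encoding every 3 input bytes as 4 output characters. A quick check, using random bytes as a stand-in for PNG data; the 300,000-byte size is an arbitrary illustrative choice.

```python
import base64
import os

# Random bytes stand in for PNG data; the size is arbitrary.
raw = os.urandom(300_000)
encoded = base64.b64encode(raw)

# Base64 maps each 3-byte group to 4 characters, so a payload whose
# length is a multiple of 3 grows by exactly one third.
overhead = len(encoded) / len(raw) - 1.0
```

This measures encoding overhead only. JSON framing and any transport-level compression would shift the real number, which is why the doc defers the decision to profiling.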
Risks¶
| Risk | Mitigation |
|---|---|
| Per-step latency tax larger than projected | Phase 1 is rollback-as-config. Benchmark first executor, abort migration if p50 > 20 ms. |
| step_id dedup mishandles legit retries | TTL + capacity tuning; surface deduplicated=true in Augur events. |
| Brain image regression from removing Chrome layers | Phase 1 builds and validates both images side-by-side before flipping COMPUTER_PLANE_URL. |
| Augur event timing skews because env-side is no longer in-process | Add a brain-side xdotool_dispatched_at_ms event; reconcile in Augur. |
References¶
- feedback_browser_infra_rpc_surface.md: CUA-pure RPC surface; computer naming.
- feedback_cua_no_dom_access.md: no DOM-derived grounding.
- feedback_modal_warm_container_caveat.md: 10-min stale-code window; Phase 1 decouples per side.
- feedback_modal_new_app_id_per_deploy.md: log resolution by app name, not id.
- project_sim_envs_route_ordering.md: FastAPI route ordering pitfall for ComputerAgent /session/* vs catch-alls.