Computer Plane — modular split between Modal, E2B, and Daytona

Status: Proposed Owners: TBD Tracks issue: TBD (this doc is the canonical spec; the GitHub issue points at it)

Computer Plane is one of two compute planes under the unified ComputeClient contract — see docs/reference/compute-client.md (#785). It advertises Capabilities(dom_aware=False, stealth=True) at session_init. The DOM-aware extensions (state.*, tabs.*, links.*) ship on Browser-Use Plane only; Computer Plane refuses them at the contract level. Pure-CUA — by design.

Summary¶

Today Mantis runs the brain (Holo3 / Claude / OpenCUA / Fara / Gemma4), the task loop, the step handlers, Xvfb, and Chrome inside one Modal function invocation. The Chrome profile lives on modal.Volume("osworld-data") mounted at /data. Brain ↔ environment calls are in-process Python.

This document specifies the split of that monolith into two planes connected by a narrow RPC:

Brain plane — keeps the executor pool, brain, task loop, step handlers, and Augur emission. Unchanged code, lighter image.
Computer plane — runs Xvfb + Chrome + xdotool behind a small FastAPI service (ComputerAgent). Exposes screenshot + xdotool + opt-in cdp.

The seam between them is the existing GymEnvironment ABC, with ComputerClient as a new factory-built implementation. Step handlers do not change.

The split is delivered in three phases. Phase 0 introduces the seam in-process (no behavior change). Phase 1 lifts the computer plane out as a separate Modal function (still inside the same Modal app, sharing the same volume). Phase 2 makes the computer plane host pluggable so an E2B Desktop or Daytona sandbox can replace the Modal computer-plane function via a deploy-config change.

Goals¶

Single seam (ComputerClient) for everything that talks to "the computer," replacing direct XdotoolGymEnv instantiation in the executor lifecycle.
Wire contract that is CUA-pure: screenshot + xdotool(argv) are required; cdp_evaluate / cdp_click_at_point are opt-in. No DOM-aware verbs.
Phase 1 ships zero new infra: brain and computer planes are two Modal functions in the same app, both mounting osworld-data. Profile bytes stay where they are. No tar.zst, no S3.
Phase 2 makes E2B Desktop and Daytona pluggable behind the same ComputerClient interface via a RemoteComputerImpl and a per-provider adapter.
Per-profile lock semantics, Augur emission, plan format, and HTTP API are unchanged.

Non-goals¶

Moving the brain. Brain stays on Modal in all phases.
Profile snapshotting / S3 / ProfileSnapshotter. Only required when leaving Modal Volume — fully specified in computer-plane-profile-snapshots.md. The snapshot doc is the gating doc that has to land before any Phase 2 host backend can ship.
Replacing the GPU executor pool. Holo3 / OpenCUA / Fara stay on their current Modal GPU tiers.
Multi-region, geo-pinning, sandbox identity management. Out of scope here.

Architecture overview¶

┌──────────────────────────────────────────────────────────────┐
│  Modal app "mantis-cua-server"                               │
│                                                              │
│  ┌─────────── BRAIN PLANE ───────────┐                       │
│  │ FastAPI  →  Executor pool  →      │                       │
│  │ Brain    →  Task Loop      →      │                       │
│  │ Step Handlers  →                  │                       │
│  │   ComputerClient (NEW)            │                       │
│  │     ├── LocalXdotoolImpl  (Phase 0)                       │
│  │     ├── RemoteModalImpl    (Phase 1)                      │
│  │     ├── RemoteE2BImpl      (Phase 2)                      │
│  │     └── RemoteDaytonaImpl  (Phase 2)                      │
│  └───────────────────────────────────┘                       │
│                       │ HTTPS (Phase 1+)                     │
│                       ▼                                      │
│  ┌────────── COMPUTER PLANE (NEW Modal fn, Phase 1) ──────┐  │
│  │ ComputerAgent FastAPI:                                 │  │
│  │   POST /screenshot                                     │  │
│  │   POST /xdotool                                        │  │
│  │   POST /cdp        (opt-in)                            │  │
│  │   POST /session    (init: bind tenant + profile)       │  │
│  │ Xvfb + Chrome + xdotool                                │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
              │ both mount /data
              ▼
┌──────────────────────────────────────────────────────┐
│ modal.Volume "osworld-data"  — UNCHANGED              │
│   /data/chrome-profile/<tenant>__<profile_id>/        │
│   /data/results/, /data/models/, /data/training/      │
└──────────────────────────────────────────────────────┘

Phase 2 adds non-Modal backends, connected over the same HTTPS surface. Profile snapshotting only enters the picture when the backend cannot mount osworld-data.

Wire contract¶

The contract is the only thing all backends must implement identically. Anything beyond this surface is implementation detail of a specific backend.

Endpoints¶

Method	Path	Required	Notes
`POST`	`/session/init`	yes	Bind this `ComputerAgent` instance to a `(tenant_id, profile_id, run_id)` triple. Returns session token; subsequent calls carry it as a header.
`POST`	`/session/close`	yes	Tear down Chrome, flush state, release.
`POST`	`/screenshot`	yes	Returns PNG bytes + viewport metadata.
`POST`	`/xdotool`	yes	Executes `xdotool argv` against the bound display. Returns stdout + rc.
`POST`	`/cdp`	no (opt-in)	Executes a CDP `Runtime.evaluate` / `Input.*` against the bound Chrome target. Disabled unless `executor.cdp_enabled=true`.
`GET`	`/health`	yes	Liveness + last-action timestamp.

Pydantic models (canonical)¶

# src/mantis_agent/gym/computer_wire.py  (NEW)

from pydantic import BaseModel, Field
from typing import Literal


class SessionInitRequest(BaseModel):
    tenant_id: str
    profile_id: str
    run_id: str
    proxy_server: str | None = None           # e.g. "http://user:pass@host:port"
    chrome_flags: list[str] = Field(default_factory=list)
    enable_cdp: bool = False                  # opt-in escape hatch
    viewport: tuple[int, int] = (1280, 720)


class SessionInitResponse(BaseModel):
    session_token: str                        # carry as X-Mantis-Session header
    chrome_pid: int | None = None
    xvfb_display: str                         # e.g. ":99"


class ScreenshotRequest(BaseModel):
    # Empty body for v1. Reserved for future params (region, format).
    format: Literal["png"] = "png"


class ScreenshotResponse(BaseModel):
    image_b64: str                            # PNG bytes, base64-encoded
    width: int
    height: int
    scroll_y: int                             # read from window.scrollY via CDP
    captured_at_ms: int                       # unix ms


class XdotoolRequest(BaseModel):
    argv: list[str]                           # e.g. ["mousemove", "842", "317", "click", "1"]
    step_id: str                              # client-generated; agent dedupes within window
    timeout_ms: int = 5000


class XdotoolResponse(BaseModel):
    stdout: str
    stderr: str
    returncode: int
    deduplicated: bool = False                # true if step_id already saw a successful execution


class CDPRequest(BaseModel):
    expression: str                           # JS expression (action-dispatch only by convention)
    await_promise: bool = False
    step_id: str


class CDPResponse(BaseModel):
    result_json: str                          # JSON-stringified CDP result
    returncode: int = 0

Idempotency¶

xdotool is not naturally idempotent — sending a click twice can navigate twice. The wire requires:

step_id on every xdotool and cdp call, client-generated, unique per logical step.
ComputerAgent maintains a TTL-bounded LRU of step_id → response (default TTL = 30s, capacity = 1000). A repeated step_id returns the cached response with deduplicated=true.
Retries by the client use the same step_id. New logical steps use a new step_id.

Session lifecycle¶

brain plane                          computer plane
───────────                          ──────────────
POST /session/init {tenant,...}  →
                                 ←   SessionInitResponse {token, ...}
                                     [Xvfb + Chrome launched, profile mounted]

POST /screenshot                 →
POST /xdotool {argv, step_id}    →
   ... 50–200 iterations of the loop ...

POST /session/close              →
                                     [Chrome SIGTERM, Xvfb stopped]

/session/init is idempotent on run_id: a second init for the same run_id returns the existing session token. Different run_id against an active session is a 409.

`ComputerClient` factory¶

# src/mantis_agent/gym/computer_client.py  (NEW)

from src.mantis_agent.gym.base import GymEnvironment

class ComputerClient(GymEnvironment):
    """Marker base. All impls are GymEnvironment subclasses."""

def make_computer_client(cfg: ComputerPlaneConfig) -> ComputerClient:
    match cfg.backend:
        case "local":   return LocalXdotoolImpl(cfg)              # Phase 0
        case "modal":   return RemoteComputerImpl(cfg, ...)       # Phase 1
        case "e2b":     return RemoteE2BImpl(cfg, ...)            # Phase 2
        case "daytona": return RemoteDaytonaImpl(cfg, ...)        # Phase 2
        case _: raise ValueError(...)

ComputerPlaneConfig lives in src/mantis_agent/config.py with a backend: Literal[...] field, default "local". Per-executor overrides allowed (e.g. run_claude_cua migrates first; the GPU executors stay on "local" longer).

Phase 0 — seam refactor (zero behavior change)¶

Goal: prove the abstraction is faithful by routing today's production through it with no observable change.

File changes¶

File	Change
`src/mantis_agent/gym/computer_client.py`	NEW. Defines `ComputerClient` marker + `make_computer_client(cfg)` factory.
`src/mantis_agent/gym/local_xdotool_impl.py`	NEW. Thin wrapper around the existing `XdotoolGymEnv`. Implements `ComputerClient`.
`src/mantis_agent/gym/computer_wire.py`	NEW. Pydantic models above (used by Phase 1 client; defined now to lock the contract).
`src/mantis_agent/config.py`	Add `ComputerPlaneConfig`, default `backend="local"`.
`src/mantis_agent/task_loop.py`	Replace direct `XdotoolGymEnv(...)` construction with `make_computer_client(cfg)`.
`deploy/modal/modal_cua_server.py`	Pass `cfg.computer_plane` into the executor lifecycle. Default `"local"`.
`tests/gym/test_computer_client_factory.py`	NEW. Verifies `backend="local"` returns a working `LocalXdotoolImpl` and the existing test suite passes through it.

Acceptance¶

All existing XdotoolGymEnv tests pass via the factory.
A boattrader smoke plan rerun on Modal completes identically (lead count, cost-per-lead within ±5% of the baseline).
Production unchanged: no Modal redeploy required beyond the seam PR.

Goal: lift Xvfb + Chrome + xdotool into a separate Modal function, accessed via RemoteComputerImpl. Same Modal app, same volume.

File changes¶

File	Change
`deploy/modal/computer_plane.py`	NEW. `@app.function(volumes={"/data": data_volume}, image=computer_image, cpu=2.0)` exposing the `ComputerAgent` FastAPI via `@modal.fastapi_endpoint`. Implements all six endpoints. Manages Xvfb (lazy start on first request, lifetime = session).
`deploy/modal/Dockerfile` or `modal.Image`	Brain plane image loses Xvfb / Chrome / xdotool layers (they move to `computer_image`).
`src/mantis_agent/gym/remote_computer_impl.py`	NEW. Implements `ComputerClient` over HTTPS. Generates `step_id`s. Honors timeouts. Catches transient 5xx, retries with the same `step_id`. ~150–250 LOC.
`src/mantis_agent/gym/computer_client.py`	Factory dispatches `backend="modal"` → `RemoteComputerImpl(base_url=cfg.modal_computer_plane_url)`.
`deploy/modal/modal_cua_server.py`	Env var `COMPUTER_PLANE_URL` resolved per-region from a Modal Dict lookup; passed into `RemoteComputerImpl`.
`tests/gym/test_remote_computer_impl.py`	NEW. Mock-server tests for retry/dedup/timeout.
`tests/integration/test_phase1_e2e.py`	NEW. Spins up both Modal functions in a staging app; runs a 5-step plan.
`docs/hosting/modal.md`	Update: brain plane vs computer plane, image split, env vars.

Migration order (per-executor)¶

Roll out per-executor behind ComputerPlaneConfig.per_executor_overrides:

run_claude_cua (0 GPU, easiest, biggest cost win from a clean split)
run_gemma4_cua (T4 planner)
run_fara and run_holo3 once the above two are stable
EvoCUA / OpenCUA tiers last

Roll back is a config flip; no redeploy required.

Acceptance¶

Per-step p50 latency increase ≤ 20 ms (target: 6–16 ms RT same-region).
Boattrader plan: leads / cost / runtime within ±10% of pre-split baseline.
Chrome crash (intentional kill -9 in the computer-plane container) recoverable without restarting the brain executor.
Modal logs split cleanly: modal app logs mantis-cua-server shows brain; modal app logs mantis-cua-server --function computer_plane shows browser.

Phase 2 — E2B + Daytona backends¶

Goal: make the computer-plane host pluggable. No code changes to step handlers or the brain.

File changes¶

File	Change
`deploy/computer-plane/Dockerfile`	NEW. Same `ComputerAgent` code as Phase 1, packaged as a generic Linux image (not `modal.Image`).
`deploy/computer-plane/e2b.toml`	NEW. E2B Desktop sandbox spec referencing the image.
`deploy/computer-plane/daytona.toml`	NEW. Daytona workspace spec.
`src/mantis_agent/gym/remote_e2b_impl.py`	NEW. `ComputerClient` impl that drives an E2B Desktop sandbox. Wraps the E2B SDK's mouse/keyboard/screenshot primitives. ~200 LOC.
`src/mantis_agent/gym/remote_daytona_impl.py`	NEW. `ComputerClient` impl that talks HTTPS to a `ComputerAgent` running inside a Daytona workspace. Mostly the Phase 1 client with a different provisioning path.
`src/mantis_agent/observability/profile_snapshotter.py`	NEW (gated on Phase 2 actually shipping). Implements tar.zst snapshot + S3 canonical store, with the per-profile lock TTL/renewal semantics. Only relevant for non-Modal backends.
`docs/reference/computer-plane-profile-snapshots.md`	LANDED. Dedicated spec for the snapshot pipeline — gating doc for any Phase 2 host backend that can't mount `osworld-data`.

Acceptance¶

A boattrader plan family can be routed to E2B via config alone. No code in src/mantis_agent/gym/step_handlers/ or src/mantis_agent/brain_*.py changes.
A second plan family routed to Daytona simultaneously. Brain dispatches per executor based on ComputerPlaneConfig.per_plan_overrides.

Reliability notes¶

Idempotency. Discussed under wire contract. step_id everywhere. Augur events carry step_id so post-hoc reconciliation is possible.
Lock TTL. Phase 1 inherits today's in-process per-profile lock. Phase 2 must move to a TTL'd lock with renewal pings (Modal Dict) because the sandbox can outlive the executor invocation. Spec for that lock lives in the Phase 2 follow-up doc.
Observability. Augur emission stays brain-side and is unaffected. The computer plane emits its own WARNING-level logs (per feedback_warning_level_for_modal_observability.md) for any state the brain cannot see (Chrome crash, Xvfb segfault, proxy DNS failure).
CUA-purity boundary. The wire is the enforcement point. CDP endpoints are config-gated per executor (enable_cdp defaults false) and never used to derive grounding — feedback_browser_infra_rpc_surface.md.

Open questions¶

Per-profile lock placement in Phase 2. Modal Dict vs. Redis vs. file-based. Defer until Phase 2 scope starts.
Screenshot transport. Base64-in-JSON is simple but ~33% overhead. Multipart binary is faster but uglier on the client. Defer; profile first.
Computer-plane container warm pool. Should computer-plane functions stay warm pinned to a profile (faster) or be ephemeral per-session (cheaper)? Production data needed.
Cross-region. Out of scope until a customer asks for it.

Risks¶

Risk	Mitigation
Per-step latency tax larger than projected	Phase 1 is rollback-as-config. Benchmark first executor, abort migration if p50 > 20 ms.
`step_id` dedup mishandles legit retries	TTL + capacity tuning; surface `deduplicated=true` in Augur events.
Brain image regression from removing Chrome layers	Phase 1 builds and validates both images side-by-side before flipping `COMPUTER_PLANE_URL`.
Augur event timing skews because env-side is no longer in-process	Add a brain-side `xdotool_dispatched_at_ms` event; reconcile in Augur.

References¶

feedback_browser_infra_rpc_surface.md — CUA-pure RPC surface; computer naming.
feedback_cua_no_dom_access.md — no DOM-derived grounding.
feedback_modal_warm_container_caveat.md — 10-min stale-code window; Phase 1 decouples per side.
feedback_modal_new_app_id_per_deploy.md — log resolution by app name, not id.
project_sim_envs_route_ordering.md — FastAPI route ordering pitfall for ComputerAgent /session/* vs catch-alls.

Computer Plane — modular split between Modal, E2B, and Daytona¶

Summary¶

Goals¶

Non-goals¶

Architecture overview¶

Wire contract¶

Endpoints¶

Pydantic models (canonical)¶

Idempotency¶

Session lifecycle¶

ComputerClient factory¶

Phase 0 — seam refactor (zero behavior change)¶

File changes¶

Acceptance¶

Phase 1 — Modal computer-plane function¶

File changes¶

Migration order (per-executor)¶

Acceptance¶

Phase 2 — E2B + Daytona backends¶

File changes¶

Acceptance¶

Reliability notes¶

Open questions¶

Risks¶

References¶

`ComputerClient` factory¶