Skip to content

CUA models

The brain is the vision-action policy that drives the perception-reasoning-action loop on a screenshot. Mantis ships several interchangeable brains; pick one with cua_model="<name>" on /v1/predict, --model <name> on the CLI, or MANTIS_MODEL=<name> for the Baseten container.

All brains implement the same think(frames, task, action_history, screen_size) -> InferenceResult contract, so the Hybrid plan-executor + vision-fallback flow (gym/runner.py) works identically regardless of which one is wired in. "Passthrough" mode (every step routed straight to the brain) is just cua_model=<name> flowing through _executor_for_model().

Selecting a backend

How you select a backend depends on where Mantis is hosted. The selection model differs:

Surface Selection Notes
Modal /v1/predict Per-request cua_model field One deployment serves every backend; each request spawns the right GPU function.
Modal CLI (modal run) --model <name> Same dispatch as the HTTP API.
Baseten /production/predict and /production/sync/v1/cua Per-deployment via MANTIS_MODEL env Each pod loads one brain at startup. cua_model in the request body is ignored. To use Fara on Baseten you push deploy/baseten/fara/ as its own model and hit that deployment's URL. See the URL contract below for /production/predict vs /production/sync/<path>.

Default model is holo3 on both surfaces — existing callers that don't pass cua_model keep getting Holo3.

curl -X POST https://workspace--mantis-cua-server-api.modal.run/v1/predict \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "detached":     true,
    "micro":        "plans/example/extract_listings.json",
    "profile_id":   "marketplace-prod",
    "workflow_id":  "marketplace-listings-v1",
    "cua_model":    "fara",
    "max_cost":     2,
    "max_time_minutes": 20
  }'

GET /v1/models returns the list the dispatcher accepts:

{ "data": [
  { "id": "holo3" }, { "id": "fara" }, { "id": "gemma4-cua" },
  { "id": "evocua-8b" }, { "id": "evocua-32b" },
  { "id": "opencua-32b" }, { "id": "opencua-72b" }, { "id": "claude" }
]}
uv run modal run --detach deploy/modal/modal_cua_server.py \
  --micro plans/example/extract_listings.json \
  --model fara \
  --max-cost 2 --max-time-minutes 20 \
  --profile-id marketplace-prod \
  --workflow-id marketplace-listings-v1

Baseten — separate deployment per model

# Push Fara as its own Baseten model
uvx truss push deploy/baseten/fara --no-cache --promote \
  --deployment-name "mantis-fara-$(date -u +%Y%m%d-%H%M)"

# Call the orchestrated /predict route
curl -X POST "https://model-<FARA_MODEL_ID>.api.baseten.co/production/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $TOK" \
  -H "Content-Type: application/json" \
  -d '{ "detached": true, "micro": "plans/example/extract_listings.json",
        "profile_id": "alice", "workflow_id": "v1",
        "max_cost": 2, "max_time_minutes": 20 }'

Notice the request body has no cua_model field — the Fara deployment was started with MANTIS_MODEL=fara (see deploy/baseten/fara/config.yaml), and that's what determines the brain. Holo3 stays on its own deployment URL.

Baseten URL contract

The Baseten gateway forwards two route families to every truss-server deployment:

Public URL Container path What it serves
/production/predict predict_endpoint: from config.yaml (default /predict) Mantis orchestrated run / status / resume — the high-level entry point.
/production/sync/<any-path> <any-path> on the in-pod FastAPI / nginx (port server_port) Pass-through for arbitrary FastAPI routes — /v1/chat/completions, /v1/models, /v1/health, /v1/cua.

For a non-promoted (canary) deployment, swap production for deployment/<DEPLOYMENT_ID> in either form.

This is how a remote-brain integration (caller runs the CUA loop locally, hits Mantis only for per-step inference) points a FaraBrain at the Baseten deployment:

endpoint = "https://model-<FARA_MODEL_ID>.api.baseten.co/production/sync"

brain = FaraBrain(
    base_url=f"{endpoint}/v1",                 # → /v1/chat/completions
    extra_headers={
        "Authorization": f"Api-Key {BASETEN_API_KEY}",
        "X-Mantis-Token": MANTIS_API_TOKEN,
    },
    screen_size=(1280, 800),
)

The brain only hits /v1/chat/completions (single-step inference). The CUA loop (screenshot → think → action → step) runs in the caller's container against the caller's XdotoolGymEnv — the Mantis Baseten pod is a thin model server in that mode.

Use /production/predict instead when you want Mantis itself to run the CUA loop on its own headed Chrome inside Xvfb (the orchestrated path).

Name Repo Base Serve GPU License
holo3 Hcompany/Holo3-35B-A3B (GGUF: mradermacher/Holo3-35B-A3B-GGUF) Qwen3.5-VL MoE llama.cpp + Q8_0 GGUF 1× A100-80GB / H100 / L40S 48GB Hcompany terms
fara microsoft/Fara-7B Qwen2.5-VL vLLM (bf16) 1× A100-40GB / L40S / A10G MIT
gemma4-cua local fine-tune (Gemma4-31B-CUA) Gemma4 llama.cpp + GGUF 1× A100-80GB Gemma terms
evocua-8b / evocua-32b meituan/EvoCUA-* Qwen2-VL vLLM 1-2× A100-80GB EvoCUA terms
opencua-32b / opencua-72b xlangai/OpenCUA-* Qwen2-VL vLLM 4-8× A100-80GB OpenCUA terms
claude Anthropic API API 0 API

Fara-7B

Microsoft's compact 7B CUA, Qwen2.5-VL-based, MIT-licensed. Smaller and cheaper to serve than Holo3 (single A100-40GB instead of A100-80GB), faster cold start because vLLM serves Qwen2.5-VL natively (no llama.cpp / GGUF detour), and competitive on small-model CUA benchmarks.

Action space

Fara emits one computer_use tool call per turn. The brain adapter maps it to Mantis actions as follows:

Fara action Mantis action Notes
left_click CLICK Coords are raw screen pixels at the screenshot's input resolution; the brain scales back to actual viewport.
type TYPE
key KEY_PRESS Lists are joined with + (["ctrl", "a"]"ctrl+a").
scroll SCROLL Carries scroll_direction + scroll_amount. Optional coordinate scales to viewport.
wait WAIT durationseconds.
visit_url TYPE(url) Rides the env's URL auto-navigate path (xdotool_env.py:955) — ctrl+l → paste → Return.
history_back KEY_PRESS("alt+Left")
web_search TYPE("https://www.google.com/search?q=…") Same URL auto-navigate path.
mouse_move WAIT(0.2s) No-op. xdotool clicks already mousemove before press; explicit mouse-only moves collapse harmlessly.
pause_and_memorize_fact WAIT(0.2s) The fact is stashed in Action.reasoning for the trajectory.
terminate DONE status="success"success=True.

Regressions vs Holo3

Fara has no native double_click, right_click, or drag in its training. The brain will accept and forward those if a planner adapter emits them, but the underlying model won't volunteer them. If your plan needs any of these, prefer Holo3.

Coordinate handling

Fara was trained at a fixed input resolution (default 1428×896). The brain resizes the screenshot to that resolution before sending and scales emitted coordinate values back to the real viewport. Override via the constructor input_size kwarg or the MANTIS_FARA_INPUT_WH env var ("1366x768" form).

Serving

  • Modalcua_model="fara" routes through run_fara on a single A100-40GB. The shared _run_executor path downloads microsoft/Fara-7B to the volume on first run, starts vLLM, and instantiates FaraBrain. See Modal hosting.
  • Basetentruss push deploy/baseten/fara/ ships the same FastAPI surface as Holo3 with vLLM (no llama.cpp build). Set MANTIS_MODEL=fara; MANTIS_FARA_MODEL_DIR defaults to /models/fara where the weights: block mounts the HF snapshot. See Baseten hosting.

System prompt

src/mantis_agent/prompts/files/fara_system.txt — Microsoft's recommended template with the "Critical Point" stop semantics (checkout / book / call / order). Override per-tenant via MANTIS_PROMPTS_DIR/fara_system.txt.

Holo3-35B-A3B

Hcompany's Qwen3.5-VL MoE (35B total, 3B active). Strong OSWorld-Verified score (77.8%), but served via llama.cpp because vLLM doesn't yet support qwen3_5_moe. The Modal run_holo3 executor uses a 1× A100-80GB; the Baseten holo3 config pulls the Q8_0 GGUF (~34 GB) + mmproj (~0.8 GB) from mradermacher/Holo3-35B-A3B-GGUF.

See src/mantis_agent/brain_holo3.py for the five-strategy response parser (tool_calls → native text → JSON → pyautogui → keyword) and Qwen smart-resize coordinate handling.