Pure CUA mode (/v1/cua)¶
Use this when you want Mantis to be a thin brain pass-through — no plan decomposition, no Claude grounding, no Claude extraction. The instruction is handed verbatim to the brain, which drives the headed browser inside the deployment via xdotool.
The brain in question is whichever one the pod was started with
(MANTIS_MODEL env — holo3, fara, gemma4-cua, …). Baseten pods
are single-brain by deployment; see CUA models for the selection model.
If you've used /v1/predict, the mental model is: same auth, same
tenant caps, same allowlist, same rate limit / concurrency / recording
plumbing — but the inside of the container is a single
brain ↔ XdotoolGymEnv loop instead of the orchestrated
MicroPlanRunner.
When to use it¶
Pick /v1/cua when… |
Pick /v1/predict when… |
|---|---|
| You want to benchmark Holo3's intrinsic plan-following | You want reliable multi-step extraction with verification |
| You're integrating Mantis as a "Holo3-on-Modal" backend for an existing CUA harness | You're building an end-to-end extraction pipeline |
| You don't have an Anthropic key on the deployment | You want Claude-quality grounding on small click targets |
| You want zero Claude spend per run | You're OK with ~$0.005/click for grounding accuracy |
The tradeoff: without ClaudeGrounding, click coordinates come straight
from the brain's model space (Holo3 / OpenCUA use Qwen smart-resize;
Fara uses raw screen pixels at its training resolution — see
Coordinate spaces). On small targets accuracy drops versus the grounded path. That's the whole point of "pure CUA" — you're measuring what the brain can do unassisted.
How "pure" it actually is (Claude assist matrix)¶
"No Claude" is the default, not an absolute guarantee. Two assists exist:
| Assist | When it fires | What it does |
|---|---|---|
| Claude director | Container has ANTHROPIC_API_KEY and an action loop is detected |
Substitutes a single tactical action (click / scroll / key_press / wait) to break the loop. Never plans, never types. Gate off with MANTIS_CUA_DIRECTOR=disabled. |
| Screenshot grounding | Request sets ground_clicks: true |
Refines the brain's click coords with the screenshot grounding model (≈$0.005/click). Off by default. |
| Decompose-then-cua | Request sets decompose: true |
Decomposes the instruction into an ordered sub-goal roadmap (Claude) up front and drives the brain with the augmented task — bridges planning and open-endedness on long multi-step flows. Off by default. |
For a strictly brain-only run (e.g. an ablation baseline), deploy without an
Anthropic key (or set MANTIS_CUA_DIRECTOR=disabled) and leave
ground_clicks unset.
Two reliability behaviors apply on this path regardless of Claude:
- Typed-text read-back — after each
type_text, the focused field is read back so the run log says(verified)/TYPING FAILEDinstead of the old blanket(unverified)(#931). Disable withMANTIS_VERIFY_TYPE=disabled. - Contenteditable rich-text editors (message boxes, composers) are filled via a host-focused CDP insert rather than raw keystrokes.
Action surface¶
What the brain can emit, all executed by xdotool against the headed Chrome inside Xvfb:
| Verb | Args | Effect |
|---|---|---|
click |
x, y |
Single click at screen pixel (x, y) |
double_click |
x, y |
Double-click |
type_text |
text |
Type into the focused element |
key_press |
key |
Single key or chord (e.g. "ctrl+a", "enter") |
scroll |
direction, amount | Scroll the viewport |
drag |
start_x, start_y, end_x, end_y |
Drag from start to end |
wait |
seconds |
Sleep — useful for slow page transitions |
done |
success, summary |
Terminate the loop |
Parsing supports five fallback strategies — OpenAI tool_calls, Holo3
native Action: name({...}) text, JSON action blob, pyautogui-style
calls, and bare keywords (DONE / FAIL). You don't configure this;
the brain handles it.
Request¶
curl -X POST "$ENDPOINT/v1/cua" \
-H "X-Mantis-Token: $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"instruction": "Find tomorrow’s music events on lu.ma and click the first one",
"start_url": "https://lu.ma/discover",
"max_steps": 25,
"settle_time": 4.0
}'
Body fields¶
| Field | Default | Effect |
|---|---|---|
instruction |
— (required) | Free-text instruction handed verbatim to the brain |
start_url |
"" |
Initial URL the browser navigates to before the first inference |
max_steps |
30 |
Cap on brain↔env iterations; server-clamped to MANTIS_MAX_STEPS_PER_PLAN |
frames_per_inference |
1 |
Screenshots fed to the brain per step. Holo3 wants 1 |
settle_time |
2.0 |
Seconds to wait after each action before re-screenshotting |
detached |
false |
Queue as background; return run_id to poll via /v1/predict actions |
profile_id |
"default" |
(#341) Chrome user-data-dir identity; sticky across plan revisions |
workflow_id |
plan_signature[:12] |
(#341) Checkpoint identity; rotate when the plan changes |
state_key |
unset | Legacy single-field identity. When set alone, routes to both profile_id and workflow_id (back-compat) |
max_cost |
25.0 |
USD cap (mostly informational — pure CUA spends nothing on Claude) |
max_time_minutes |
60 |
Wall-clock cap; clamped against tenant cap |
proxy_city / proxy_state |
unset | Geo override (allowlist-gated) |
proxy_provider |
unset | privateproxy / oxylabs / iproyal — defaults to MANTIS_PROXY_PROVIDER env |
proxy_disabled |
false |
Skip the residential proxy entirely |
record_video |
false |
Capture screencast; fetch via GET /v1/runs/{run_id}/video |
video_format |
"mp4" |
One of mp4, webm, gif |
video_fps |
5 |
Capture rate [1, 30] |
The proxy_* / max_cost / max_time_minutes fields can also be declared inside a wrapped plan via the runtime block. When both are set the HTTP body wins.
Sync response¶
{
"run_id": "20260511_143215",
"mode": "pure_cua",
"provider": "baseten",
"session_name": "pure_cua",
"model": "holo3",
"instruction": "Find tomorrow's music events on lu.ma…",
"start_url": "https://lu.ma/discover",
"success": true,
"termination_reason": "done", // or "max_steps" / "loop" / "env_done"
"steps": 18,
"duration_s": 142,
"elapsed_seconds": 142.36,
"trajectory_len": 18
}
Detached response¶
When "detached": true, the response matches /v1/predict's detached
shape — a queued envelope with status_path / result_path /
events_path pointers. Poll via the standard polling actions on
/v1/predict:
curl -X POST "$ENDPOINT/v1/predict" \
-H "X-Mantis-Token: $TOKEN" \
-d '{"action": "status", "run_id": "<run_id>"}'
The result JSON written to disk has the same shape as the sync response above.
CLI¶
The mantis cua run subcommand is a thin HTTP shim — no local browser,
no Anthropic key, no Playwright. It just POSTs to /v1/cua:
export MANTIS_ENDPOINT="https://workspace--app-fn.modal.run"
export MANTIS_TOKEN="<tenant_token>"
mantis cua run "Find tomorrow's music events on lu.ma and click the first one" \
--start-url https://lu.ma/discover \
--max-steps 25 \
--settle-time 4.0
Useful flags:
| Flag | Purpose |
|---|---|
--endpoint |
Mantis deployment base URL (overrides MANTIS_ENDPOINT) |
--token |
Per-tenant token (overrides MANTIS_TOKEN) |
--header KEY=VALUE |
Add HTTP header (e.g. Authorization=Api-Key … on Baseten) |
--start-url |
Initial URL for the browser |
--max-steps |
Cap on the CUA loop |
--settle-time |
Seconds to wait after each action |
--profile-id |
(#341) Chrome user-data-dir identity (sticky across plan revisions) |
--workflow-id |
(#341) Checkpoint identity (rotate on plan change; pair with --resume-state) |
--state-key |
Legacy single-field; routes to both --profile-id and --workflow-id |
--detached |
Queue as background; print run_id and exit |
--proxy-disabled |
Skip the residential proxy |
--record-video |
Capture screencast |
--json |
Print the raw response JSON instead of a summary |
Exit code is 0 on success=true, 1 otherwise. Same as mantis plan run.
How this differs from /v1/predict under the hood¶
/v1/predict → _task_suite_from_payload → MicroPlanRunner
├── ClaudeExtractor (extract / verify)
├── ClaudeGrounding (refine click coords)
└── Holo3Brain (tactical actions)
└── GymRunner ↔ XdotoolGymEnv
/v1/cua → _run_pure_cua → GymRunner
└── Holo3Brain (all actions)
└── XdotoolGymEnv
Concretely: _run_pure_cua instantiates GymRunner(brain=self.brain,
grounding=None) and calls runner.run(task=<instruction>). There is
no extractor, no decomposer, no per-step verification. Whatever the
brain emits, xdotool executes.
What you give up¶
- No structured extraction. The response is
success/steps/duration_s/termination_reason— there's noleadsarray, no extracted data. The brain ends by emittingdonewith a free-text summary; if you need structured data, parse the trajectory events or use/v1/predictwith anextraction_schema. - No grounding correction. Holo3 occasionally picks coordinates a
few pixels off the target.
/v1/predictrecovers via Claude;/v1/cuadoesn't. - No section / gate / loop semantics. It's one open-ended loop until
doneormax_steps. - No skip envelope. The
skip/skip_reasonfields documented in Sending plans are produced byMicroPlanRunner. They don't appear on/v1/cuaresponses.
If any of those are dealbreakers, you want /v1/predict. If they're
fine, /v1/cua is the cheapest, simplest way to drive Holo3 on Modal /
Baseten.