Pure CUA mode (`/v1/cua`)¶

Use this when you want Mantis to be a thin brain pass-through — no plan decomposition, no Claude grounding, no Claude extraction. The instruction is handed verbatim to the brain, which drives the headed browser inside the deployment via xdotool.

The brain in question is whichever one the pod was started with (MANTIS_MODEL env — holo3, fara, gemma4-cua, …). Baseten pods are single-brain by deployment; see CUA models for the selection model.

If you've used /v1/predict, the mental model is: same auth, same tenant caps, same allowlist, same rate limit / concurrency / recording plumbing — but the inside of the container is a single brain ↔ XdotoolGymEnv loop instead of the orchestrated MicroPlanRunner.

When to use it¶

Pick `/v1/cua` when…	Pick `/v1/predict` when…
You want to benchmark Holo3's intrinsic plan-following	You want reliable multi-step extraction with verification
You're integrating Mantis as a "Holo3-on-Modal" backend for an existing CUA harness	You're building an end-to-end extraction pipeline
You don't have an Anthropic key on the deployment	You want Claude-quality grounding on small click targets
You want zero Claude spend per run	You're OK with ~$0.005/click for grounding accuracy

The tradeoff: without ClaudeGrounding, click coordinates come straight from the brain's model space (Holo3 / OpenCUA use Qwen smart-resize; Fara uses raw screen pixels at its training resolution — see Coordinate spaces). On small targets accuracy drops versus the grounded path. That's the whole point of "pure CUA" — you're measuring what the brain can do unassisted.
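To make the coordinate-space point concrete, here is a minimal sketch of rescaling a click from a model's resized frame back to screen pixels. The helper name `to_screen_pixels` and the frame sizes in the usage note are assumptions for illustration only. They are not Mantis internals, and the real resized dimensions depend on the brain in use.

```python
def to_screen_pixels(x, y, model_size, screen_size):
    """Map a point from the model's (resized) frame to screen pixels.

    model_size and screen_size are (width, height) tuples.
    Hypothetical helper, not part of Mantis.
    """
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    if model_w <= 0 or model_h <= 0:
        raise ValueError("model_size must be positive")
    # Scale each axis independently, then clamp to the visible screen.
    sx = round(x * screen_w / model_w)
    sy = round(y * screen_h / model_h)
    return (min(max(sx, 0), screen_w - 1), min(max(sy, 0), screen_h - 1))
```

With an assumed 1280x720 model frame and a 1920x1080 screen, `to_screen_pixels(640, 360, (1280, 720), (1920, 1080))` returns `(960, 540)`. Each model-space pixel covers 1.5 screen pixels in that setup, which is one reason small targets are harder to hit without grounding.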

How "pure" it actually is (Claude assist matrix)¶

"No Claude" is the default, not an absolute guarantee. Three assists exist:

Assist	When it fires	What it does
Claude director	Container has `ANTHROPIC_API_KEY` and an action loop is detected	Substitutes a single tactical action (click / scroll / key_press / wait) to break the loop. Never plans, never types. Gate off with `MANTIS_CUA_DIRECTOR=disabled`.
Screenshot grounding	Request sets `ground_clicks: true`	Refines the brain's click coords with the screenshot grounding model (≈$0.005/click). Off by default.
Decompose-then-cua	Request sets `decompose: true`	Decomposes the instruction into an ordered sub-goal roadmap (Claude) up front and drives the brain with the augmented task — bridges planning and open-endedness on long multi-step flows. Off by default.

For a strictly brain-only run (e.g. an ablation baseline), deploy without an Anthropic key (or set MANTIS_CUA_DIRECTOR=disabled) and leave both ground_clicks and decompose unset.
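On the client side, a small guard can catch accidental assist flags before a baseline run is submitted. This is a sketch under assumptions: `brain_only_payload` is a hypothetical helper, and it only covers the per-request flags from the table above. The Claude director is controlled by the deployment environment, so it cannot be switched off from the request body.

```python
# Per-request flags that pull Claude into the loop (from the assist matrix).
CLAUDE_ASSIST_FLAGS = ("ground_clicks", "decompose")

def brain_only_payload(instruction, **fields):
    """Build a /v1/cua body that requests no per-request Claude assist."""
    enabled = [flag for flag in CLAUDE_ASSIST_FLAGS if fields.get(flag)]
    if enabled:
        raise ValueError(f"not brain-only, assists requested: {enabled}")
    payload = {"instruction": instruction}
    # Drop the assist flags entirely so the server-side defaults (off) apply.
    payload.update(
        {k: v for k, v in fields.items() if k not in CLAUDE_ASSIST_FLAGS}
    )
    return payload
```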

Two reliability behaviors apply on this path regardless of Claude:

Typed-text read-back — after each type_text, the focused field is read back so the run log says (verified) / TYPING FAILED instead of the old blanket (unverified) (#931). Disable with MANTIS_VERIFY_TYPE=disabled.
Contenteditable rich-text editors (message boxes, composers) are filled via a host-focused CDP insert rather than raw keystrokes.

Action surface¶

What the brain can emit, all executed by xdotool against the headed Chrome inside Xvfb:

Verb	Args	Effect
`click`	`x`, `y`	Single click at screen pixel `(x, y)`
`double_click`	`x`, `y`	Double-click
`type_text`	`text`	Type into the focused element
`key_press`	`key`	Single key or chord (e.g. `"ctrl+a"`, `"enter"`)
`scroll`	direction, amount	Scroll the viewport
`drag`	`start_x`, `start_y`, `end_x`, `end_y`	Drag from start to end
`wait`	`seconds`	Sleep — useful for slow page transitions
`done`	`success`, `summary`	Terminate the loop

Parsing supports five fallback strategies — OpenAI tool_calls, Holo3 native Action: name({...}) text, JSON action blob, pyautogui-style calls, and bare keywords (DONE / FAIL). You don't configure this; the brain handles it.
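For readers curious what a fallback chain of this kind looks like, the sketch below tries three of the five strategies in order and returns the first match. Everything here is hypothetical: the function names, the `{"action": ...}` blob shape, the normalized `{"verb", "args"}` output, and the mapping of `FAIL` to an unsuccessful `done` are assumptions, not the parser Mantis ships. The tool_calls and pyautogui strategies are omitted.

```python
import json
import re

# Matches text such as: Action: click({"x": 10, "y": 20})
ACTION_RE = re.compile(r"Action:\s*(\w+)\((\{.*\})\)", re.DOTALL)

def parse_json_blob(text):
    """Assumed blob shape: {"action": "<verb>", ...args}."""
    try:
        obj = json.loads(text)
    except ValueError:
        return None
    if not isinstance(obj, dict) or "action" not in obj:
        return None
    args = {k: v for k, v in obj.items() if k != "action"}
    return {"verb": obj["action"], "args": args}

def parse_native(text):
    """Native text form: Action: name({...})."""
    match = ACTION_RE.search(text)
    if not match:
        return None
    try:
        args = json.loads(match.group(2))
    except ValueError:
        return None
    return {"verb": match.group(1), "args": args}

def parse_keyword(text):
    """Bare keywords; mapping FAIL to done(success=False) is an assumption."""
    word = text.strip().upper()
    if word == "DONE":
        return {"verb": "done", "args": {"success": True}}
    if word == "FAIL":
        return {"verb": "done", "args": {"success": False}}
    return None

def parse_action(text):
    """Try each strategy in order; return the first hit, else None."""
    for strategy in (parse_json_blob, parse_native, parse_keyword):
        action = strategy(text)
        if action is not None:
            return action
    return None
```

Ordering matters in a chain like this: the most structured formats go first, so that a loose matcher such as the keyword check only runs when nothing stricter applies.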

Request¶

curl -X POST "$ENDPOINT/v1/cua" \
  -H "X-Mantis-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "instruction": "Find tomorrow’s music events on lu.ma and click the first one",
    "start_url": "https://lu.ma/discover",
    "max_steps": 25,
    "settle_time": 4.0
  }'
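The same request can be built from Python with only the standard library. The helper below is a sketch: `build_cua_request` is a hypothetical name, and it constructs the request without sending it so the headers and body can be inspected first.

```python
import json
import urllib.request

def build_cua_request(endpoint, token, body):
    """Build (but do not send) the POST that the curl example performs."""
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/cua",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Mantis-Token": token,
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request, timeout=600) as resp:
#       result = json.load(resp)
```

A synchronous run can take minutes, so any timeout passed to `urlopen` should be generous; the 600 seconds in the comment is an arbitrary placeholder.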

Body fields¶

Field	Default	Effect
`instruction`	— (required)	Free-text instruction handed verbatim to the brain
`start_url`	`""`	Initial URL the browser navigates to before the first inference
`max_steps`	`30`	Cap on brain↔env iterations; server-clamped to `MANTIS_MAX_STEPS_PER_PLAN`
`frames_per_inference`	`1`	Screenshots fed to the brain per step. Holo3 wants 1
`settle_time`	`2.0`	Seconds to wait after each action before re-screenshotting
`detached`	`false`	Queue as background; return `run_id` to poll via `/v1/predict` actions
`profile_id`	`"default"`	(#341) Chrome user-data-dir identity; sticky across plan revisions
`workflow_id`	`plan_signature[:12]`	(#341) Checkpoint identity; rotate when the plan changes
`state_key`	unset	Legacy single-field identity. When set alone, routes to both `profile_id` and `workflow_id` (back-compat)
`max_cost`	`25.0`	USD cap (mostly informational: pure CUA spends nothing on Claude unless an assist is enabled)
`max_time_minutes`	`60`	Wall-clock cap; clamped against tenant cap
`proxy_city` / `proxy_state`	unset	Geo override (allowlist-gated)
`proxy_provider`	unset	`privateproxy` / `oxylabs` / `iproyal` — defaults to `MANTIS_PROXY_PROVIDER` env
`proxy_disabled`	`false`	Skip the residential proxy entirely
`record_video`	`false`	Capture screencast; fetch via `GET /v1/runs/{run_id}/video`
`video_format`	`"mp4"`	One of `mp4`, `webm`, `gif`
`video_fps`	`5`	Capture rate `[1, 30]`
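The identity defaults in the table can be read as a small resolution rule. The sketch below is an interpretation, not server code: `resolve_identity` is a hypothetical helper, `plan_signature` is passed in because its derivation is not described here, and the behavior when `state_key` is combined with an explicit field is an assumption (only the "set alone" case is documented).

```python
def resolve_identity(plan_signature, profile_id=None, workflow_id=None,
                     state_key=None):
    """Return (profile_id, workflow_id) following the documented defaults."""
    if state_key is not None and profile_id is None and workflow_id is None:
        # Legacy path: a lone state_key feeds both identities.
        return state_key, state_key
    # Assumption: explicit fields take priority and state_key is ignored
    # when any of them is present.
    return (
        profile_id if profile_id is not None else "default",
        workflow_id if workflow_id is not None else plan_signature[:12],
    )
```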

The proxy_* / max_cost / max_time_minutes fields can also be declared inside a wrapped plan via the runtime block. When both are set, the HTTP body wins.
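The precedence rule amounts to a two-layer merge in which the HTTP body overrides the runtime block key by key. A minimal sketch, assuming both sides are plain dictionaries; `effective_runtime` is a hypothetical name and the server's actual merge may differ in detail:

```python
# Fields that may appear in both the runtime block and the HTTP body.
OVERRIDABLE = (
    "proxy_city", "proxy_state", "proxy_provider", "proxy_disabled",
    "max_cost", "max_time_minutes",
)

def effective_runtime(runtime_block, http_body):
    """Merge the two sources; values from the HTTP body win on conflict."""
    merged = {k: runtime_block[k] for k in OVERRIDABLE if k in runtime_block}
    merged.update({k: http_body[k] for k in OVERRIDABLE if k in http_body})
    return merged
```

For example, a runtime block with `max_cost: 5.0` and `proxy_city: "Austin"` combined with a body that sets `max_cost: 10.0` yields a cap of 10.0 and keeps the city.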

Sync response¶

{
  "run_id": "20260511_143215",
  "mode": "pure_cua",
  "provider": "baseten",
  "session_name": "pure_cua",
  "model": "holo3",
  "instruction": "Find tomorrow's music events on lu.ma…",
  "start_url": "https://lu.ma/discover",
  "success": true,
  "termination_reason": "done",         // or "max_steps" / "loop" / "env_done"
  "steps": 18,
  "duration_s": 142,
  "elapsed_seconds": 142.36,
  "trajectory_len": 18
}
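A client will usually reduce this response to a pass/fail flag and a one-line summary. The sketch below assumes the real response body is plain JSON without the inline comment shown in the sample; `summarize_run` is a hypothetical helper.

```python
import json

def summarize_run(raw):
    """Return (ok, one_line_summary) for a sync /v1/cua response body."""
    data = json.loads(raw)
    # Treat only a literal true as success; a missing field counts as failure.
    ok = data.get("success") is True
    status = "ok" if ok else "failed"
    reason = data.get("termination_reason", "unknown")
    run_id = data.get("run_id")
    steps = data.get("steps")
    elapsed = data.get("elapsed_seconds")
    return ok, f"{run_id} {status} ({reason}): {steps} steps in {elapsed}s"
```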

Detached response¶

When "detached": true, the response matches /v1/predict's detached shape — a queued envelope with status_path / result_path / events_path pointers. Poll via the standard polling actions on /v1/predict:

curl -X POST "$ENDPOINT/v1/predict" \
  -H "X-Mantis-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"action": "status", "run_id": "<run_id>"}'

The result JSON written to disk has the same shape as the sync response above.
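A polling loop around the status action might look like the sketch below. The shape of the status reply is not specified in this section, so the caller supplies both the transport (`post`) and the terminal-state check (`is_terminal`). All names are hypothetical, and the interval and poll budget are arbitrary defaults.

```python
import time

def poll_status(post, run_id, is_terminal, interval_s=5.0, max_polls=120,
                sleep=time.sleep):
    """Poll /v1/predict with the status action until the run is terminal.

    post(body) -> dict sends a JSON body to /v1/predict (caller-supplied).
    is_terminal(reply) -> bool decides when to stop (caller-supplied).
    """
    for attempt in range(max_polls):
        reply = post({"action": "status", "run_id": run_id})
        if is_terminal(reply):
            return reply
        if attempt < max_polls - 1:
            sleep(interval_s)  # Back off between polls.
    raise TimeoutError(f"run {run_id} not terminal after {max_polls} polls")
```

Injecting `post` and `sleep` keeps the loop testable without a live deployment and leaves authentication headers to whatever HTTP client is already in use.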

CLI¶

The mantis cua run subcommand is a thin HTTP shim — no local browser, no Anthropic key, no Playwright. It just POSTs to /v1/cua:

export MANTIS_ENDPOINT="https://workspace--app-fn.modal.run"
export MANTIS_TOKEN="<tenant_token>"

mantis cua run "Find tomorrow's music events on lu.ma and click the first one" \
  --start-url https://lu.ma/discover \
  --max-steps 25 \
  --settle-time 4.0

Useful flags:

Flag	Purpose
`--endpoint`	Mantis deployment base URL (overrides `MANTIS_ENDPOINT`)
`--token`	Per-tenant token (overrides `MANTIS_TOKEN`)
`--header KEY=VALUE`	Add HTTP header (e.g. `Authorization=Api-Key …` on Baseten)
`--start-url`	Initial URL for the browser
`--max-steps`	Cap on the CUA loop
`--settle-time`	Seconds to wait after each action
`--profile-id`	(#341) Chrome user-data-dir identity (sticky across plan revisions)
`--workflow-id`	(#341) Checkpoint identity (rotate on plan change; pair with `--resume-state`)
`--state-key`	Legacy single-field; routes to both `--profile-id` and `--workflow-id`
`--detached`	Queue as background; print `run_id` and exit
`--proxy-disabled`	Skip the residential proxy
`--record-video`	Capture screencast
`--json`	Print the raw response JSON instead of a summary

The exit code is 0 when the response has success=true and 1 otherwise, the same as mantis plan run.
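When the CLI is driven from a script, the command line can be assembled from the flags listed above and the exit code checked afterwards. This is a sketch: `build_cua_argv` and `run_succeeded` are hypothetical helpers, and only a subset of the documented flags is covered.

```python
def build_cua_argv(instruction, start_url=None, max_steps=None,
                   settle_time=None, detached=False, record_video=False,
                   as_json=False):
    """Assemble the argv list for `mantis cua run` from documented flags."""
    argv = ["mantis", "cua", "run", instruction]
    if start_url is not None:
        argv += ["--start-url", start_url]
    if max_steps is not None:
        argv += ["--max-steps", str(max_steps)]
    if settle_time is not None:
        argv += ["--settle-time", str(settle_time)]
    # Boolean switches are appended only when enabled.
    switches = (("--detached", detached), ("--record-video", record_video),
                ("--json", as_json))
    argv += [flag for flag, enabled in switches if enabled]
    return argv

def run_succeeded(returncode):
    """Exit code 0 means the run reported success=true."""
    return returncode == 0
```

The list can be passed directly to `subprocess.run(argv)`, and the resulting `returncode` fed to `run_succeeded`.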

How this differs from `/v1/predict` under the hood¶

/v1/predict   →   _task_suite_from_payload   →   MicroPlanRunner
                                                    ├── ClaudeExtractor    (extract / verify)
                                                    ├── ClaudeGrounding    (refine click coords)
                                                    └── Holo3Brain         (tactical actions)
                                                          └── GymRunner ↔ XdotoolGymEnv

/v1/cua       →   _run_pure_cua              →   GymRunner
                                                    └── Holo3Brain         (all actions)
                                                          └── XdotoolGymEnv

Concretely: _run_pure_cua instantiates GymRunner(brain=self.brain, grounding=None) and calls runner.run(task=<instruction>). By default there is no extractor, no decomposer, and no per-step verification. Whatever the brain emits, xdotool executes.
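The shape of such a brain-to-environment loop can be illustrated with stand-ins. None of the classes below are Mantis code: `ScriptedBrain`, `RecordingEnv`, and `run_loop` are hypothetical, the loop-detection rule (the same action repeated `loop_window` times) is an assumption, and whether the real runner counts the final `done` as a step is not stated in this section.

```python
class ScriptedBrain:
    """Stand-in brain that replays a fixed list of actions, then waits."""
    def __init__(self, actions):
        self._actions = iter(actions)

    def act(self, task, observation):
        return next(self._actions, {"verb": "wait", "args": {"seconds": 1}})

class RecordingEnv:
    """Stand-in environment that records actions instead of executing them."""
    def __init__(self):
        self.executed = []

    def observe(self):
        return f"screenshot-{len(self.executed)}"

    def step(self, action):
        self.executed.append(action)

def run_loop(brain, env, task, max_steps=30, loop_window=3):
    """One open-ended loop: observe, act, execute, until done or a cap."""
    history = []
    for step in range(1, max_steps + 1):
        action = brain.act(task, env.observe())
        if action["verb"] == "done":
            success = bool(action["args"].get("success"))
            return {"success": success, "termination_reason": "done",
                    "steps": step}
        env.step(action)  # No grounding, no verification: execute as emitted.
        history.append(action)
        recent = history[-loop_window:]
        if len(recent) == loop_window and all(a == recent[0] for a in recent):
            return {"success": False, "termination_reason": "loop",
                    "steps": step}
    return {"success": False, "termination_reason": "max_steps",
            "steps": max_steps}
```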

What you give up¶

No structured extraction. The response is success / steps / duration_s / termination_reason — there's no leads array, no extracted data. The brain ends by emitting done with a free-text summary; if you need structured data, parse the trajectory events or use /v1/predict with an extraction_schema.
No grounding correction by default. Holo3 occasionally picks coordinates a few pixels off the target. /v1/predict recovers via Claude; /v1/cua does not unless the request sets ground_clicks: true.
No section / gate / loop semantics. It's one open-ended loop until done or max_steps.
No skip envelope. The skip / skip_reason fields documented in Sending plans are produced by MicroPlanRunner. They don't appear on /v1/cua responses.

If any of those are dealbreakers, you want /v1/predict. If they're fine, /v1/cua is the cheapest, simplest way to drive Holo3 on Modal / Baseten.
