Concepts
Once you've done the Quickstart, this page tells you what every component actually does. Read it once before designing your own plan.
The runtime
```text
Caller                    /v1/predict                MicroPlanRunner
─────                     ───────────                ───────────────
                                                     ┌─ BrainHolo3 (Holo3 GPU inference)
plan, tenant    ──HTTP──► validate, clamp ──────────►├─ ClaudeGrounding (refine clicks)
caps, namespace                                      ├─ ClaudeExtractor (read structured data)
profile_id, workflow_id                              ├─ DynamicPlanVerifier
                                                     └─ XdotoolGymEnv (Xvfb + Chrome)
                                                             │
                                                             ▼
                                                        Target site
```
| Component | Purpose | Where it runs |
|---|---|---|
| Holo3-35B-A3B (GGUF) | Tactical click / scroll / type / drag actions | Single GPU per Mantis pod |
| Claude (Anthropic API) | Strategic gate verification, structured data extraction, click coordinate refinement | Anthropic cloud, called per surgical step |
| MicroPlanRunner | Section/gate/loop state machine, per-step retry, checkpointing | Mantis pod, in-process |
| XdotoolGymEnv | Real Chrome inside Xvfb + xdotool for fingerprint-free clicks | Mantis pod |
| IPRoyal proxy | Residential, geo-targeted egress for sites with bot detection | Mantis pod (sticky session per run) |
Plans
A plan is a structured description of what the agent should do. Four shapes are supported, in priority order on /v1/predict:
| Shape | Field | When to use |
|---|---|---|
| Inline task suite | task_suite: { ... } | You have arbitrary task data and don't want to bake it into the container image |
| Pre-baked path | task_file: "tasks/crm/crm_tasks.json" | The plan ships in the container image |
| Micro-plan | micro: "plans/example/...json" (path) or inline list | High-reliability extraction with sections / gates / loops |
| Plain text | plan_text: "Extract the first 3 product listings from example.com" | One-shot ad-hoc; server decomposes via Claude (cached after first call) |
See Plan formats for full schemas and examples.
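As a rough illustration of "priority order", the sketch below picks the first plan field present in a request body. It assumes the priority follows the row order of the table above and that an absent field is simply missing or null; `pick_plan_field` is a hypothetical helper, not part of the Mantis API.

```python
from typing import Any, Optional

# Assumption: priority follows the order of the rows in the table above.
PLAN_FIELDS = ("task_suite", "task_file", "micro", "plan_text")


def pick_plan_field(body: dict[str, Any]) -> Optional[str]:
    """Return the name of the highest-priority plan field set in the body."""
    for field in PLAN_FIELDS:
        if body.get(field) is not None:
            return field
    return None  # no plan supplied at all


if __name__ == "__main__":
    body = {
        "plan_text": "Extract the first 3 product listings from example.com",
        "task_file": "tasks/crm/crm_tasks.json",
    }
    print(pick_plan_field(body))
```

Under that assumption, a request carrying both task_file and plan_text would run the pre-baked file and ignore the plain text.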
Step types (micro-plan shape)
Each step in a micro-plan is a JSON object with a type and an intent:
| type | What the runner does |
|---|---|
| navigate | Loads intent's URL via env.reset(); waits for Cloudflare; sets the proxy |
| click | Fresh Holo3 inference loop with budget actions; optionally refined by ClaudeGrounding |
| scroll | Holo3 scrolls until intent satisfied; scroll-fail-as-success fallback |
| extract_url | Reads the address bar via Claude; no Holo3 |
| extract_data | Claude reads the screenshot and emits structured fields per the schema |
| navigate_back | Alt+Left + verify URL change |
| paginate | URL-based or grounded click on the Next button |
| loop | Jumps back to step loop_target up to loop_count times. Optional stop_var reads a runner state variable and exits the loop early when truthy. |
| if_else | Branches on runner._state_vars[condition_var] to then_target (truthy) or else_target (falsy/missing). Composes with detect_visible. Missing var / out-of-range target falls through to the next step. (#820) |
| detect_visible | One Claude/Holo3 yes/no vision call ("is the cookie banner visible?"); writes a bool to runner._state_vars[out_var]. Pairs with if_else and step-level guard. |
| extract_rows | Multi-row extraction in one Claude call (top-N from a list page). Same handler as extract_data; extract_data also takes the multi-row branch when its schema has max_items > 1. (#820) |
| filter | Claude finds the filter checkbox and clicks it |
| fill_field | Claude finds the labelled input (params.label), clears it, types params.value |
| submit | Claude finds the labelled button / nav-link / row-link (params.label + params.kind) and left-clicks it |
| select_option | Opens a labelled dropdown then picks the named option (params.dropdown_label + params.option_label) |
| right_click | Claude finds the labelled element (params.label) and right-clicks to open the native context menu. Use for "Open Link in New Tab" / "Copy Link" / app-defined context menus on table rows or grid cells (#373) |
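Before submitting a micro-plan, it can help to lint it locally against the step types listed above. The following is a minimal sketch: `lint_steps` is a hypothetical client-side helper, and the only rules it checks (a known type and a non-empty intent) are taken from this page rather than from the server's real validator.

```python
# Step types copied from the table above.
KNOWN_STEP_TYPES = {
    "navigate", "click", "scroll", "extract_url", "extract_data",
    "navigate_back", "paginate", "loop", "if_else", "detect_visible",
    "extract_rows", "filter", "fill_field", "submit", "select_option",
    "right_click",
}


def lint_steps(steps):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for index, step in enumerate(steps):
        step_type = step.get("type")
        if step_type not in KNOWN_STEP_TYPES:
            problems.append(f"step {index}: unknown type {step_type!r}")
        if not step.get("intent"):
            problems.append(f"step {index}: missing intent")
    return problems


if __name__ == "__main__":
    plan = [
        {"type": "navigate", "intent": "https://example.com"},
        {"type": "hover", "intent": "Hover over the menu"},  # not a real step type
        {"type": "extract_data"},                            # intent is missing
    ]
    for problem in lint_steps(plan):
        print(problem)
```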
Useful per-step modifiers:
| Field | Effect |
|---|---|
| section | One of setup, extraction, pagination. Used by retry/halt logic. |
| required | If true: retry on fail, then halt the whole run. |
| gate | Claude verifies a condition; halt the run on fail. |
| verify | Free-text condition Claude checks. |
| claude_only | Skip Holo3 entirely; Claude does the perception. Use for extract / gate steps. |
| grounding | Refine Holo3's click coordinates with ClaudeGrounding. |
| budget | Max actions Holo3 can take in this step (default 8). |
| loop_target | Step index to jump back to (only on loop steps). |
| loop_count | Max loop iterations; clamped to MANTIS_MAX_LOOP_ITERATIONS. |
| stop_var | (loop only) Name of a runner state variable; when truthy, exit the loop early instead of running remaining iterations. |
| condition_var | (if_else only) Name of a state variable read for the branch decision. |
| then_target / else_target | (if_else only) Absolute step indices to jump to. -1 = fall through to next step. |
| out_var | (detect_visible only) Name of the state variable that receives the boolean answer. |
| guard | Name of a state variable; when falsy, the step is skipped entirely (no vision call, no env action). |
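To make the interplay of guard, if_else, and loop concrete, here is a toy model of the control flow only. It is a sketch under stated assumptions, not the MicroPlanRunner implementation: state variables are supplied up front instead of being written by detect_visible, a missing condition_var falls through, and `trace_control_flow` and its `max_loop_iterations` parameter (standing in for MANTIS_MAX_LOOP_ITERATIONS) are hypothetical names.

```python
def trace_control_flow(steps, state_vars, max_loop_iterations=50, max_visits=1000):
    """Return the step indices in the order this toy model would execute them."""
    executed = []
    jumps_taken = {}  # loop step index -> number of jumps back so far
    pc = 0
    while 0 <= pc < len(steps) and len(executed) < max_visits:
        step = steps[pc]
        guard = step.get("guard")
        if guard is not None and not state_vars.get(guard):
            pc += 1  # guard is falsy: skip the step entirely
            continue
        executed.append(pc)
        kind = step.get("type")
        if kind == "if_else":
            var = step.get("condition_var")
            if var not in state_vars:
                pc += 1  # assumption: missing variable falls through
                continue
            key = "then_target" if state_vars[var] else "else_target"
            target = step.get(key, -1)
            # -1 or any out-of-range target falls through to the next step
            pc = target if 0 <= target < len(steps) else pc + 1
        elif kind == "loop":
            limit = min(step.get("loop_count", 0), max_loop_iterations)
            taken = jumps_taken.get(pc, 0)
            stop_var = step.get("stop_var")
            stop = stop_var is not None and bool(state_vars.get(stop_var))
            if stop or taken >= limit:
                pc += 1
            else:
                jumps_taken[pc] = taken + 1
                pc = step["loop_target"]
        else:
            pc += 1
    return executed


PLAN = [
    {"type": "navigate", "intent": "https://example.com"},
    {"type": "detect_visible", "intent": "is the cookie banner visible?", "out_var": "banner"},
    {"type": "click", "intent": "Dismiss the cookie banner", "guard": "banner"},
    {"type": "extract_data", "intent": "Read the listing fields", "claude_only": True},
    {"type": "paginate", "intent": "Go to the next page"},
    {"type": "loop", "intent": "Repeat extraction", "loop_target": 3,
     "loop_count": 2, "stop_var": "done"},
]

if __name__ == "__main__":
    print(trace_control_flow(PLAN, {"banner": False}))
    print(trace_control_flow(PLAN, {"banner": True, "done": True}))
```

In this model a loop_count of 2 means two jumps back, so the loop body runs three times in total; whether the real runner counts iterations the same way is not stated on this page.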
Tenants and tokens
Mantis is multi-tenant from Tier 1. A tenant is just a record in the operator's keys file mapping an X-Mantis-Token to:
- tenant_id: the namespace prefix used in profile_id / workflow_id / legacy state_key and on the data volume
- scopes: which actions this token can do (run, status, result, logs)
- max_concurrent_runs, max_cost_per_run, max_time_minutes_per_run, rate_limit_per_minute: caps the server enforces in addition to the global hard caps
- anthropic_secret_name: which Anthropic key this tenant's runs use (each tenant can bring its own billing)
- allowed_domains: wildcards matched against navigate URLs in submitted plans
- webhook_url, webhook_secret_name: optional run-completion callback
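The allowed_domains check can be pictured as a wildcard match on the host of each navigate URL. The sketch below assumes shell-style patterns such as `*.example.com` matched against the hostname only; the server's actual matching rules are not documented here, and `is_navigation_allowed` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_navigation_allowed(url, allowed_domains):
    """True if the URL's host matches at least one wildcard pattern."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    if host is None:
        return False
    return any(fnmatchcase(host, pattern.lower()) for pattern in allowed_domains)


if __name__ == "__main__":
    patterns = ["example.com", "*.example.com"]
    print(is_navigation_allowed("https://shop.example.com/listings", patterns))
    print(is_navigation_allowed("https://example.org/", patterns))
```

Note that under this assumption `*.example.com` does not match the bare `example.com`, so both entries are listed.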
Plans submitted by tenant A cannot read tenant B's checkpoints, profiles, or recordings — profile_id / workflow_id (and legacy state_key) are all server-prefixed with the tenant id and the data volume is namespaced.
profile_id + workflow_id: the resume primitives (#341)
These are the two most important per-run fields after the plan itself. They were a single field, state_key, until #341 split them because they have opposite rotation lifetimes.
| Concern | Field | Behavior |
|---|---|---|
| Browser profile | profile_id | A Chrome user-data-dir at tenants/<tenant_id>/chrome-profile/<profile_id>/ is created or reused. Cookies + logged-in sessions persist across runs. Sticky: keep this stable so you don't have to log back in every time the plan changes. |
| Checkpoint | workflow_id | The runner saves progress to tenants/<tenant_id>/checkpoints/<workflow_id>.json. Pass resume_state: true to pick up where the last run with this id left off. Rotate when the plan definition changes meaningfully; resuming step N/12 of an old plan against a new layout is incoherent. |
| Pause-time browser snapshot | browser_state | Captured automatically on PauseRequested and round-tripped through PauseState: current URL, scroll offset, viewport size, and unsubmitted form input. On action=resume the agent re-lands on the exact pixel + repopulates half-filled forms. Passwords are masked; missing selectors on the resumed DOM are silently skipped. See api.md → What's captured at pause time for the full table. (Epic #358.) |
| Idempotency | (not these) | The Idempotency-Key header is the request identity (24h dedup); workflow_id is the workflow identity. |
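The two storage locations named in the table can be derived from the tenant id plus the respective field. This is only a sketch of the path layout quoted above; `profile_dir` and `checkpoint_path` are hypothetical helpers and `acme` is a made-up tenant id.

```python
from pathlib import PurePosixPath


def profile_dir(tenant_id, profile_id):
    """Chrome user-data-dir for a browser profile, relative to the data volume."""
    return PurePosixPath("tenants") / tenant_id / "chrome-profile" / profile_id


def checkpoint_path(tenant_id, workflow_id):
    """Checkpoint file for a workflow, relative to the data volume."""
    return PurePosixPath("tenants") / tenant_id / "checkpoints" / f"{workflow_id}.json"


if __name__ == "__main__":
    print(profile_dir("acme", "alice-prod"))
    print(checkpoint_path("acme", "marketplace-miami-listings-v3"))
```

Because both paths start with the tenant id, two tenants using the same profile_id or workflow_id still land in separate directories.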
Pick profile_id to match the account or persona, e.g. alice-prod or customer-12345. Pick workflow_id to match the plan revision, e.g. marketplace-miami-listings-v3. The default is plan_signature[:12].
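One way to picture the plan_signature[:12] default is as a truncated content hash of the plan. How Mantis actually computes plan_signature is not stated on this page, so the SHA-256-over-canonical-JSON choice below is purely an assumption, and both function names are hypothetical.

```python
import hashlib
import json


def plan_signature(plan):
    """Assumed signature: SHA-256 hex digest of the plan as canonical JSON."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def default_workflow_id(plan, workflow_id=None):
    """Use the caller's workflow_id if given, else the first 12 signature chars."""
    return workflow_id or plan_signature(plan)[:12]


if __name__ == "__main__":
    plan = [{"type": "navigate", "intent": "https://example.com"}]
    print(default_workflow_id(plan))
    print(default_workflow_id(plan, "marketplace-miami-listings-v3"))
```

With a content-derived default, any edit to the plan yields a new workflow_id, which lines up with the advice to rotate it when the plan changes meaningfully.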
Same account running five different workflows? Pass one profile_id and five distinct workflow_ids. (Note: Chrome serializes those runs, because two processes cannot share a user-data-dir at the same time; distinct profile_ids are required for true parallelism. See #342 for the Modal HTTP endpoint that surfaces this as a 409 instead of silent corruption.)
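The one-profile, many-workflows pattern can be sketched as a batch of request bodies that share a profile_id. The helpers below are hypothetical and only build dictionaries; they do not call the API. The second helper flags which profiles have more than one run and would therefore serialize.

```python
from collections import defaultdict


def build_requests(profile_id, workflow_ids, plan_text):
    """One request body per workflow, all sharing the same browser profile."""
    return [
        {"plan_text": plan_text, "profile_id": profile_id, "workflow_id": wf}
        for wf in workflow_ids
    ]


def serialized_groups(requests):
    """Map each profile_id that has more than one run to its workflow ids."""
    by_profile = defaultdict(list)
    for request in requests:
        by_profile[request["profile_id"]].append(request["workflow_id"])
    return {p: wfs for p, wfs in by_profile.items() if len(wfs) > 1}


if __name__ == "__main__":
    batch = build_requests(
        "alice-prod",
        [f"listings-v{n}" for n in range(1, 6)],  # made-up workflow ids
        "Extract the first 3 product listings from example.com",
    )
    print(serialized_groups(batch))
```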
Legacy state_key
The single-field state_key still works — when set alone, the server routes it to both profile_id and workflow_id for back-compat. New code should set the two fields independently. The result envelope echoes all three fields so callers can grep either.
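The back-compat routing described above might look roughly like this. `resolve_ids` is a hypothetical helper; it covers only the documented case (state_key set alone) and leaves explicitly set fields untouched, since the behavior for mixed input is not specified on this page.

```python
def resolve_ids(body):
    """Return the three id fields as they would be echoed in the result envelope."""
    state_key = body.get("state_key")
    profile_id = body.get("profile_id")
    workflow_id = body.get("workflow_id")
    if state_key and profile_id is None and workflow_id is None:
        # Legacy callers: one key drives both the profile and the checkpoint.
        profile_id = workflow_id = state_key
    return {
        "profile_id": profile_id,
        "workflow_id": workflow_id,
        "state_key": state_key,
    }


if __name__ == "__main__":
    print(resolve_ids({"state_key": "customer-12345"}))
    print(resolve_ids({"profile_id": "alice-prod",
                       "workflow_id": "marketplace-miami-listings-v3"}))
```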
The cost meter
Every run produces a cost breakdown:
```jsonc
{
  "summary": {
    "cost_total": 0.42,
    "cost_breakdown": {
      "gpu": 0.12,     // Holo3 GPU minutes
      "claude": 0.12,  // Anthropic API tokens
      "proxy": 0.18    // IPRoyal residential proxy bandwidth
    }
  }
}
```
Caps are enforced server-side. max_cost (default $25; overridable globally via environment variable and clamped per tenant) is the spend ceiling at which the runner halts. max_time_minutes is the wall-clock cap.
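A client can sanity-check the cost summary and estimate which cap will apply. The sketch below assumes the effective cap is the smallest of the requested value (or the $25 default), the tenant's max_cost_per_run, and a global hard cap; the exact clamping order on the server is an assumption, and both helpers are hypothetical.

```python
import math


def breakdown_matches_total(summary):
    """True if the per-component costs add up to cost_total."""
    parts = sum(summary["cost_breakdown"].values())
    return math.isclose(parts, summary["cost_total"], abs_tol=1e-9)


def effective_max_cost(requested, tenant_cap, global_cap, default=25.0):
    """Assumed clamp: the smallest of request (or default), tenant cap, global cap."""
    wanted = default if requested is None else requested
    return min(wanted, tenant_cap, global_cap)


if __name__ == "__main__":
    summary = {
        "cost_total": 0.42,
        "cost_breakdown": {"gpu": 0.12, "claude": 0.12, "proxy": 0.18},
    }
    print(breakdown_matches_total(summary))
    print(effective_max_cost(requested=None, tenant_cap=10.0, global_cap=50.0))
```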
Polished recording
When record_video: true, every run produces both a raw screencast and a polished walkthrough composed by ffmpeg.
The polished version is what GET /v1/runs/<id>/video returns by default; pass ?raw=1 for the screen capture without overlays.
Action overlays include click ripples, keyboard chord badges (CTRL + S), scroll arrows, type captions, and drag trails. They render the same regardless of what the agent clicks (browser, file manager, terminal, dialog) because all communication is in pixels.
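The two video variants differ only in the raw query parameter on the endpoint mentioned above. A minimal sketch of building those URLs follows; `video_url` is a hypothetical helper, and the base URL and run id are placeholders.

```python
from urllib.parse import quote, urlencode


def video_url(base_url, run_id, raw=False):
    """URL of the polished video, or of the raw screen capture when raw=True."""
    url = f"{base_url.rstrip('/')}/v1/runs/{quote(run_id, safe='')}/video"
    if raw:
        url = f"{url}?{urlencode({'raw': 1})}"
    return url


if __name__ == "__main__":
    base = "https://mantis.example.com"  # placeholder host
    print(video_url(base, "run_123"))
    print(video_url(base, "run_123", raw=True))
```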
What's NOT in scope
- Browser automation via DevTools / CDP — Mantis uses Xvfb + xdotool specifically to avoid the fingerprints CDP leaves. If you need DOM-level access, a different tool is the right fit.
- Headless mode — the agent watches a real Xvfb display so the model sees what a human would.
- Audio / video stream as input — Mantis takes screenshots, not arbitrary media.
Next
- Plan formats — write your first plan
- Hosting — deploy your own instance
- Client — integrate from your app