Concepts
Once you've done the Quickstart, this page tells you what every component actually does. Read it once before designing your own plan.
The runtime
    Caller                 /v1/predict                 MicroPlanRunner
    ──────                 ───────────                 ───────────────
                                                       ┌─ BrainHolo3 (Holo3 GPU inference)
    plan, tenant ──HTTP──► validate, clamp ──────────► ├─ ClaudeGrounding (refine clicks)
                           caps, namespace             ├─ ClaudeExtractor (read structured data)
                           state_key                   ├─ DynamicPlanVerifier
                                                       └─ XdotoolGymEnv (Xvfb + Chrome)
                                                              │
                                                              ▼
                                                         Target site
| Component | Purpose | Where it runs |
|---|---|---|
| Holo3-35B-A3B (GGUF) | Tactical click / scroll / type / drag actions | Single GPU per Mantis pod |
| Claude (Anthropic API) | Strategic gate verification, structured data extraction, click coordinate refinement | Anthropic cloud, called per surgical step |
| MicroPlanRunner | Section/gate/loop state machine, per-step retry, checkpointing | Mantis pod, in-process |
| XdotoolGymEnv | Real Chrome inside Xvfb + xdotool for fingerprint-free clicks | Mantis pod |
| IPRoyal proxy | Residential, geo-targeted egress for sites with bot detection | Mantis pod (sticky session per run) |
Plans
A plan is a structured description of what the agent should do. Four shapes are supported, listed here in the priority order /v1/predict applies when more than one is present:
| Shape | Field | When to use |
|---|---|---|
| Inline task suite | task_suite: { ... } | You have arbitrary task data and don't want to bake it into the container image |
| Pre-baked path | task_file: "tasks/crm/crm_tasks.json" | The plan ships in the container image |
| Micro-plan | micro: "plans/example/...json" (path) or inline list | High-reliability extraction with sections / gates / loops |
| Plain text | plan_text: "Extract the first 3 product listings from example.com" | One-shot ad-hoc; server decomposes via Claude (cached after first call) |
See Plan formats for full schemas and examples.
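The priority order in the table can be sketched as a small resolver. This is an illustration only: the function name `select_plan_shape` is hypothetical, and treating empty values as absent is an assumption, not documented server behavior.

```python
# Field names in the priority order given by the table above.
PLAN_FIELDS = ("task_suite", "task_file", "micro", "plan_text")


def select_plan_shape(body: dict) -> tuple:
    """Return (field, value) for the highest-priority plan field in a request body.

    Hypothetical helper: missing, None, and empty values are all treated as absent.
    """
    for field in PLAN_FIELDS:
        value = body.get(field)
        if value:
            return field, value
    raise ValueError("request must include one of: " + ", ".join(PLAN_FIELDS))


# A request that carries two shapes; the micro-plan outranks the plain text.
request = {
    "plan_text": "Extract the first 3 product listings from example.com",
    "micro": [{"type": "navigate", "intent": "https://example.com"}],
}
field, _ = select_plan_shape(request)
print(field)
```

Sending exactly one plan field per request avoids relying on this ordering at all.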
Step types (micro-plan shape)
Each step in a micro-plan is a JSON object with a type and an intent:
| type | What the runner does |
|---|---|
| navigate | Loads intent's URL via env.reset(); waits for Cloudflare; sets the proxy |
| click | Fresh Holo3 inference loop with budget actions; optionally refined by ClaudeGrounding |
| scroll | Holo3 scrolls until intent satisfied; scroll-fail-as-success fallback |
| extract_url | Reads the address bar via Claude (no Holo3) |
| extract_data | Claude reads the screenshot and emits structured fields per the schema |
| navigate_back | Alt+Left + verify URL change |
| paginate | URL-based or grounded click on the Next button |
| loop | Jumps back to step loop_target up to loop_count times |
| filter | Claude finds the filter checkbox and clicks it |
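To make the `loop` row concrete, the sketch below traces the order in which steps would be visited. It assumes step indices are 0-based and that each loop step keeps its own jump counter; both are assumptions for illustration, and `execution_order` is a hypothetical helper, not part of the runner.

```python
def execution_order(steps: list) -> list:
    """Return the step indices visited, following loop steps back to loop_target."""
    order = []
    jumps_taken = {}  # loop step index -> jumps already made from that step
    pc = 0
    while pc < len(steps):
        step = steps[pc]
        order.append(pc)
        taken = jumps_taken.get(pc, 0)
        if step["type"] == "loop" and taken < step["loop_count"]:
            jumps_taken[pc] = taken + 1
            pc = step["loop_target"]  # jump back
        else:
            pc += 1  # fall through to the next step
    return order


plan = [
    {"type": "navigate", "intent": "https://example.com/listings"},
    {"type": "extract_data", "intent": "Read the listing cards"},
    {"type": "paginate", "intent": "Go to the next page"},
    {"type": "loop", "intent": "Repeat for more pages", "loop_target": 1, "loop_count": 2},
]
print(execution_order(plan))
```

Under these assumptions a `loop_count` of 2 runs the extract/paginate pair three times: once on the first pass and once per jump.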
Useful per-step modifiers:
| Field | Effect |
|---|---|
| section | One of setup, extraction, pagination. Used by retry/halt logic. |
| required | If true: retry on fail, then halt the whole run. |
| gate | Claude verifies a condition; halt the run on fail. |
| verify | Free-text condition Claude checks. |
| claude_only | Skip Holo3 entirely; Claude does the perception. Use for extract / gate steps. |
| grounding | Refine Holo3's click coordinates with ClaudeGrounding. |
| budget | Max actions Holo3 can take in this step (default 8). |
| loop_target | Step index to jump back to (only on loop steps). |
| loop_count | Max loop iterations; clamped to MANTIS_MAX_LOOP_ITERATIONS. |
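A minimal sketch of how these modifiers might be checked and defaulted before a run. The helper `normalize_step` is hypothetical, and the limit is passed in as an argument because the actual value of MANTIS_MAX_LOOP_ITERATIONS is not stated here; the value 50 in the example is invented.

```python
KNOWN_TYPES = {
    "navigate", "click", "scroll", "extract_url", "extract_data",
    "navigate_back", "paginate", "loop", "filter",
}
SECTIONS = {"setup", "extraction", "pagination"}
DEFAULT_BUDGET = 8  # documented default for the budget modifier


def normalize_step(step: dict, max_loop_iterations: int) -> dict:
    """Validate one micro-plan step and apply documented defaults and clamps."""
    if step.get("type") not in KNOWN_TYPES:
        raise ValueError(f"unknown step type: {step.get('type')!r}")
    if "section" in step and step["section"] not in SECTIONS:
        raise ValueError(f"unknown section: {step['section']!r}")
    out = dict(step)  # leave the caller's dict untouched
    if out["type"] == "loop":
        # Clamp the requested iteration count to the server-side limit.
        out["loop_count"] = min(int(out["loop_count"]), max_loop_iterations)
    else:
        # Only meaningful for steps that actually invoke Holo3.
        out.setdefault("budget", DEFAULT_BUDGET)
    return out


click = normalize_step({"type": "click", "intent": "Open the first listing"}, 50)
loop = normalize_step(
    {"type": "loop", "intent": "Repeat", "loop_target": 1, "loop_count": 500}, 50
)
print(click["budget"], loop["loop_count"])
```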
Tenants and tokens
Mantis is multi-tenant from Tier 1. A tenant is just a record in the operator's keys file mapping an X-Mantis-Token to:
- tenant_id: the namespace prefix used in state_key and on the data volume
- scopes: which actions this token can do (run, status, result, logs)
- max_concurrent_runs, max_cost_per_run, max_time_minutes_per_run, rate_limit_per_minute: caps the server enforces in addition to the global hard caps
- anthropic_secret_name: which Anthropic key this tenant's runs use (each tenant can bring its own billing)
- allowed_domains: wildcards matched against navigate URLs in submitted plans
- webhook_url, webhook_secret_name: optional run-completion callback
Plans submitted by tenant A cannot read tenant B's checkpoints, profiles, or recordings — state_key is server-prefixed with the tenant id and the data volume is namespaced.
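As an illustration of the allowed_domains check, the sketch below matches navigate targets against shell-style wildcards using Python's `fnmatch`. The tenant values are invented, and whether the server matches on hostname in exactly this way is an assumption.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

# Hypothetical tenant record; field names follow the list above, values are invented.
tenant = {
    "tenant_id": "acme",
    "scopes": ["run", "status", "result"],
    "allowed_domains": ["example.com", "*.example.com"],
}


def is_allowed(url: str, allowed_domains: list) -> bool:
    """True if the URL's hostname matches at least one wildcard pattern."""
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatch(host, pattern.lower()) for pattern in allowed_domains)


def rejected_urls(steps: list, allowed_domains: list) -> list:
    """Collect navigate targets that fall outside the allowed domains."""
    return [
        step["intent"]
        for step in steps
        if step["type"] == "navigate"
        and not is_allowed(step["intent"], allowed_domains)
    ]


plan = [
    {"type": "navigate", "intent": "https://shop.example.com/listings"},
    {"type": "navigate", "intent": "https://example.org/"},
]
print(rejected_urls(plan, tenant["allowed_domains"]))
```

Matching on the parsed hostname rather than the raw URL string keeps a path or query containing an allowed domain from passing the check.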
state_key — the resume primitive
state_key is the most important per-run field after the plan itself. It controls:
| Behavior | How state_key affects it |
|---|---|
| Browser profile | A Chrome profile dir at tenants/<tenant_id>/chrome-profile/<state_key>/ is created or reused. Cookies and sessions persist across runs with the same key. |
| Checkpoint resume | The runner saves progress to tenants/<tenant_id>/checkpoints/<state_key>.json. Pass resume_state: true to pick up where the last run left off. |
| Idempotency | state_key is not an idempotency key. It identifies the workflow, whereas the Idempotency-Key header identifies a single request. |
Pick state_key to match the conceptual workflow: marketplace-miami-listings-v1, crm-prod, customer-12345-onboarding. Reuse the same key across runs of the same workflow; pick a new key when the workflow definition changes.
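The two storage locations in the table can be derived from the tenant id and the key. The sketch below is illustrative: the helper names are hypothetical, and the character check on state_key is an assumed safeguard rather than a documented rule.

```python
import re
from pathlib import PurePosixPath

# Assumed safeguard: keep keys to characters that cannot escape the tenant directory.
_SAFE_KEY = re.compile(r"[A-Za-z0-9._-]+")


def _check_key(state_key: str) -> str:
    if not _SAFE_KEY.fullmatch(state_key) or state_key in {".", ".."}:
        raise ValueError(f"unsafe state_key: {state_key!r}")
    return state_key


def profile_dir(tenant_id: str, state_key: str) -> PurePosixPath:
    """Chrome profile directory reused by every run that shares this key."""
    return PurePosixPath("tenants") / tenant_id / "chrome-profile" / _check_key(state_key)


def checkpoint_path(tenant_id: str, state_key: str) -> PurePosixPath:
    """Checkpoint file read when a run passes resume_state: true."""
    return PurePosixPath("tenants") / tenant_id / "checkpoints" / f"{_check_key(state_key)}.json"


print(profile_dir("acme", "crm-prod"))
print(checkpoint_path("acme", "crm-prod"))
```

Because both paths are functions of the same key, changing state_key starts from a clean browser profile and an empty checkpoint at once.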
The cost meter
Every run produces a cost breakdown:
    {
      "summary": {
        "cost_total": 0.42,
        "cost_breakdown": {
          "gpu": 0.12,     // Holo3 GPU minutes
          "claude": 0.12,  // Anthropic API tokens
          "proxy": 0.18    // IPRoyal residential proxy bandwidth
        }
      }
    }
Caps are enforced server-side. max_cost is the spend ceiling at which the runner halts: it defaults to $25, the global value can be overridden through the environment, and it is clamped per tenant. max_time_minutes is the wall-clock cap.
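A rough sketch of how the breakdown and the caps relate. The clamping rule shown (take the smallest of the requested cap, the tenant's max_cost_per_run, and the global cap) is an assumption about how "clamped" is applied, and the helper names are hypothetical.

```python
GLOBAL_MAX_COST = 25.0  # documented default, in dollars


def effective_cap(requested=None, tenant_cap=None, global_cap=GLOBAL_MAX_COST):
    """Assumed clamping rule: the smallest cap that is actually set wins."""
    return min(c for c in (requested, tenant_cap, global_cap) if c is not None)


def total_cost(breakdown: dict) -> float:
    """Sum the per-resource costs, rounded to cents."""
    return round(sum(breakdown.values()), 2)


def should_halt(breakdown: dict, cap: float) -> bool:
    """True once accumulated spend reaches the cap."""
    return total_cost(breakdown) >= cap


breakdown = {"gpu": 0.12, "claude": 0.12, "proxy": 0.18}
cap = effective_cap(requested=40.0, tenant_cap=10.0)
print(total_cost(breakdown), cap, should_halt(breakdown, cap))
```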
Polished recording
When record_video: true is set, the run produces both a raw screencast and a polished walkthrough composed by ffmpeg.
The polished version is what GET /v1/runs/<id>/video returns by default; pass ?raw=1 for the screen capture without overlays.
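For client code, the two variants differ only in the query string. The helper below just builds the URL; the base URL and run id are invented placeholders, and `video_url` is a hypothetical name.

```python
from urllib.parse import quote, urlencode


def video_url(base_url: str, run_id: str, raw: bool = False) -> str:
    """Build the video endpoint URL; raw=True asks for the capture without overlays."""
    url = f"{base_url.rstrip('/')}/v1/runs/{quote(run_id, safe='')}/video"
    if raw:
        url += "?" + urlencode({"raw": 1})
    return url


# Placeholder host and run id, for illustration only.
print(video_url("https://mantis.example.com", "run_123"))
print(video_url("https://mantis.example.com", "run_123", raw=True))
```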
Action overlays include click ripples, keyboard chord badges (CTRL + S), scroll arrows, type captions, and drag trails. They render the same regardless of what the agent clicks (browser, file manager, terminal, dialog) because all communication is in pixels.
What's NOT in scope
- Browser automation via DevTools / CDP — Mantis uses Xvfb + xdotool specifically to avoid the fingerprints CDP leaves. If you need DOM-level access, a different tool is the right fit.
- Headless mode — the agent watches a real Xvfb display so the model sees what a human would.
- Audio / video stream as input — Mantis takes screenshots, not arbitrary media.
Next
- Plan formats — write your first plan
- Hosting — deploy your own instance
- Client — integrate from your app