Concepts¶

Once you've done the Quickstart, this page tells you what every component actually does. Read it once before designing your own plan.

The runtime¶

Caller                /v1/predict                 MicroPlanRunner
─────                 ───────────                 ───────────────
                                                  ┌─ BrainHolo3   (Holo3 GPU inference)
plan, tenant ──HTTP──► validate, clamp ──────────►├─ ClaudeGrounding (refine clicks)
                       caps, namespace             ├─ ClaudeExtractor (read structured data)
                       state_key                   ├─ DynamicPlanVerifier
                                                   └─ XdotoolGymEnv (Xvfb + Chrome)
                                                          │
                                                          ▼
                                                     Target site

Component	Purpose	Where it runs
Holo3-35B-A3B (GGUF)	Tactical click / scroll / type / drag actions	Single GPU per Mantis pod
Claude (Anthropic API)	Strategic gate verification, structured data extraction, click coordinate refinement	Anthropic cloud, called per surgical step
MicroPlanRunner	Section/gate/loop state machine, per-step retry, checkpointing	Mantis pod, in-process
XdotoolGymEnv	Real Chrome inside Xvfb + xdotool for fingerprint-free clicks	Mantis pod
IPRoyal proxy	Residential, geo-targeted egress for sites with bot detection	Mantis pod (sticky session per run)

Plans¶

A plan is a structured description of what the agent should do. Three shapes are supported, in priority order on /v1/predict:

Shape	Field	When to use
Inline task suite	`task_suite: { ... }`	You have arbitrary task data and don't want to bake it into the container image
Pre-baked path	`task_file: "tasks/crm/crm_tasks.json"`	The plan ships in the container image
Micro-plan	`micro: "plans/example/...json"` (path) or inline list	High-reliability extraction with sections / gates / loops
Plain text	`plan_text: "Extract the first 3 product listings from example.com"`	One-shot ad-hoc; server decomposes via Claude (cached after first call)

See Plan formats for full schemas and examples.

Step types (micro-plan shape)¶

Each step in a micro-plan is a JSON object with a type and an intent:

`type`	What the runner does
`navigate`	Loads `intent`'s URL via env.reset(); waits for Cloudflare; sets the proxy
`click`	Fresh Holo3 inference loop with `budget` actions; optionally refined by `ClaudeGrounding`
`scroll`	Holo3 scrolls until intent satisfied; `scroll-fail-as-success` fallback
`extract_url`	Reads the address bar via Claude — no Holo3
`extract_data`	Claude reads the screenshot and emits structured fields per the schema
`navigate_back`	Alt+Left + verify URL change
`paginate`	URL-based or grounded click on the Next button
`loop`	Jumps back to step `loop_target` up to `loop_count` times
`filter`	Claude finds the filter checkbox and clicks it

Useful per-step modifiers:

Field	Effect
`section`	One of `setup`, `extraction`, `pagination`. Used by retry/halt logic.
`required`	If true: retry on fail, then halt the whole run.
`gate`	Claude verifies a condition; halt the run on fail.
`verify`	Free-text condition Claude checks.
`claude_only`	Skip Holo3 entirely; Claude does the perception. Use for extract / gate steps.
`grounding`	Refine Holo3's click coordinates with `ClaudeGrounding`.
`budget`	Max actions Holo3 can take in this step (default 8).
`loop_target`	Step index to jump back to (only on `loop` steps).
`loop_count`	Max loop iterations; clamped to `MANTIS_MAX_LOOP_ITERATIONS`.

Tenants and tokens¶

Mantis is multi-tenant from Tier 1. A tenant is just a record in the operator's keys file mapping an X-Mantis-Token to:

tenant_id — the namespace prefix used in state_key and on the data volume
scopes — which actions this token can do (run, status, result, logs)
max_concurrent_runs, max_cost_per_run, max_time_minutes_per_run, rate_limit_per_minute — caps the server enforces in addition to the global hard caps
anthropic_secret_name — which Anthropic key this tenant's runs use (each tenant can bring its own billing)
allowed_domains — wildcards matched against navigate URLs in submitted plans
webhook_url, webhook_secret_name — optional run-completion callback

Plans submitted by tenant A cannot read tenant B's checkpoints, profiles, or recordings — state_key is server-prefixed with the tenant id and the data volume is namespaced.

state_key — the resume primitive¶

state_key is the most important per-run field after the plan itself. It controls:

	Behavior
Browser profile	A Chrome profile dir at `tenants/<tenant_id>/chrome-profile/<state_key>/` is created or reused. Cookies + sessions persist across runs with the same key.
Checkpoint resume	The runner saves progress to `tenants/<tenant_id>/checkpoints/<state_key>.json`. Pass `resume_state: true` to pick up where the last run left off.
Idempotency	(Not the same as `Idempotency-Key` header) — `state_key` is the workflow identity, the header is the request identity.

Pick state_key to match the conceptual workflow: marketplace-miami-listings-v1, crm-prod, customer-12345-onboarding. Reuse the same key across runs of the same workflow; pick a new key when the workflow definition changes.

The cost meter¶

Every run produces a cost breakdown:

{
  "summary": {
    "cost_total": 0.42,
    "cost_breakdown": {
      "gpu":    0.12,    // Holo3 GPU minutes
      "claude": 0.12,    // Anthropic API tokens
      "proxy":  0.18     // IPRoyal residential proxy bandwidth
    }
  }
}

Caps are enforced server-side: max_cost (default $25, env-overridable to global, per-tenant clamped) is the wall before the runner halts. max_time_minutes is the wall-clock cap.

Polished recording¶

When record_video: true, every run produces both a raw screencast and a polished walkthrough composed by ffmpeg:

title card · 3s   →   captioned run · per-step + per-action overlays   →   outro card · 5s

The polished version is what GET /v1/runs/<id>/video returns by default; pass ?raw=1 for the screen capture without overlays.

Action overlays include click ripples, keyboard chord badges (CTRL + S), scroll arrows, type captions, and drag trails. They render the same regardless of what the agent clicks (browser, file manager, terminal, dialog) because all communication is in pixels.

What's NOT in scope¶

Browser automation via DevTools / CDP — Mantis uses Xvfb + xdotool specifically to avoid the fingerprints CDP leaves. If you need DOM-level access, a different tool is the right fit.
Headless mode — the agent watches a real Xvfb display so the model sees what a human would.
Audio / video stream as input — Mantis takes screenshots, not arbitrary media.

Next¶

Plan formats — write your first plan
Hosting — deploy your own instance
Client — integrate from your app