Glossary¶
Quick definitions for project-specific terms. For deeper explanations, follow the links into the relevant section.
| Term | Definition |
|---|---|
| Action | A unit of agent behavior the env can execute. One of CLICK, DOUBLE_CLICK, TYPE, KEY_PRESS, SCROLL, DRAG, WAIT, DONE. |
| Brain | An inference client. Today: BrainHolo3 (llama.cpp / OpenAI-compat), BrainClaude (Anthropic API). |
| CUA | Computer Use Agent. The category Mantis sits in. |
| Detached run | A run started with detached: true. The server returns a run_id immediately and continues work in the background. Caller polls / waits for a webhook. |
| Gate | A micro-plan step with gate: true. Claude verifies a free-text condition; if it fails, the run halts. |
| GymEnvironment | The abstract env interface: reset(), step(action), screenshot(), close(). XdotoolGymEnv is the concrete production impl. |
| Holo3 | Holo3-35B-A3B, the small specialist multimodal model used for click / scroll / type / drag. Runs on a single GPU as GGUF. |
| Idempotency-Key | Header-based dedup key. Same key → same run_id returned. 24 h cache. |
| MicroPlan | A list of micro-step objects: [{intent, type, section, ...}, ...]. Reliable for high-volume workflows. |
| MicroPlanRunner | The orchestrator. Walks the micro-plan, runs each step, handles retries / gates / loops / checkpoint resume. |
| Plan text | Plain-English instruction submitted via plan_text; the server decomposes via Claude (cached). |
| Polished video | The composed walkthrough (title card → captioned run with action overlays → outro card) produced when record_video: true. |
| Raw recording | The unprocessed Xvfb screencast. Available via ?raw=1 on the video endpoint. |
run_id |
Server-generated identifier for a detached run. Format: <YYYYMMDD>_<HHMMSS>_<random_hex>. |
profile_id |
(#341) Caller-chosen Chrome user-data-dir identity. Server prefixes with tenant_id to isolate. Sticky across plan revisions — same id ⇒ same cookies / logged-in sessions (reusing a login needs only the same profile_id, not resume_state). Two concurrent runs on one profile_id get a 409 (Chrome locks the user-data-dir); distinct ids parallelize. Defaults to "default". See Profiles & login reuse. |
workflow_id |
(#341) Caller-chosen checkpoint identity. Server prefixes with tenant_id to isolate. Rotate when the plan definition changes; pair with resume_state: true to pick up where the last run with this id stopped. Defaults to plan_signature[:12]. |
state_key |
Legacy single-field identity. When set alone, the server routes it to both profile_id and workflow_id for back-compat. Prefer the split fields in new code; see #341. Concurrency rule: because a key resolves to a Chrome user-data-dir and a checkpoint, two in-flight calls must not share one state_key — they would race on the same profile + checkpoint. Distinct keys run in parallel (Modal scales out via @modal.concurrent). Fan out under this rule with StateKeyDispatcher — see Continuous-improvement architecture §6 and experiments/holdout/state_key_dispatcher.py. |
| Step | One element of a micro-plan. Has intent, type, optional section, gate, verify, loop_target, etc. |
| Task suite | Multi-task plan format ({tasks: [...]}). Each task has its own verify predicate. |
| Tenant | A unit of isolation. Each tenant has its own token, scopes, caps, browser profile dir, run dir, and optionally its own Anthropic key. |
| TenantConfig | The resolved configuration for one request: (tenant_id, scopes, max_*, anthropic_secret_name, allowed_domains, webhook_*). |
| Tier 1 / Tier 2 | The hardening tiers. Tier 1: multi-tenant safe (auth, caps, isolation). Tier 2: production-quality (rate limits, idempotency, webhooks, allowlist, metrics). |
| xdotool | The X11 input-injection tool used to send clicks / keys to Chrome inside Xvfb. Fingerprint-free (no DevTools / CDP signal). |
| Xvfb | X virtual framebuffer. The headless display the agent's browser paints to and ffmpeg captures. |
X-Mantis-Token |
The container-level auth header. Custom header (not Authorization: Bearer) so it doesn't collide with platform-gateway auth. |
register_tool |
Host-provided tool registration on MicroPlanRunner and GymRunner (#285). Lets the embedding host expose its own tools — auth_credentials, request_user_input, bash, etc. — to the brain mid-run. JSON-schema input definition. See issue #71. |
PauseRequested |
Exception a registered tool raises when it needs out-of-band input (OTP, 2FA, confirmation). Both runners catch it, snapshot state, and return a serializable PauseState. Resume by calling runner.resume(state, user_input=..., plan=plan) on MicroPlanRunner or runner.resume(state, user_input=...) on GymRunner. See issues #73, #285. |
PauseState |
JSON-safe snapshot of a paused run. MicroPlanRunner fills step_results + loop_counters + listings_on_page; GymRunner fills trajectory_steps + task + task_id instead. The shared fields (step_index, pending_tool, prompt, timestamp) are populated by both. Designed to round-trip through Postgres JSONB. |
ActionType.TOOL_CALL |
Action type a brain emits to invoke a registered host tool — Action(TOOL_CALL, params={"name": str, "args": dict}). The runner short-circuits env.step and routes through tool_channel.invoke; handler return values land in the trajectory's feedback field, raised PauseRequested triggers a RunResult(paused=True, ...). See issue #285. |
RunnerResult |
Rich return type from MicroPlanRunner.run_with_status(plan) and runner.resume(...). Carries steps, status (completed / halted / cancelled / paused), cancelled bool, and optional pause_state. |
cancel_event |
External cancellation hook on MicroPlanRunner and GymRunner (#288). Pass any object with .is_set() (e.g. threading.Event) or a no-arg callable. Checked at every step boundary. On a positive check, GymRunner returns RunResult(paused=True, pause_state=..., termination_reason="cancelled") with pending_reason="cancelled" — same snapshot shape as the PauseRequested path, so resume() rehydrates via the same code path. ECS / k8s SIGTERM is the canonical use case. See issues #76, #288. |
step_callback |
Per-step observability hook: Callable[[idx, intent, action_or_none, ok], None]. Errors are logged, never raised — observability bugs cannot break a run. See issue #74. |
LAUNCH_APP |
An ActionType that lets a plan start a desktop binary on demand (chromium, custom wrapper script, ...). Closes the symmetry gap with the Claude backend's bash tool. See issue #72. |
scale_brain_to_display |
Helper that maps brain-space pixel coordinates to display-space pixels when a host resizes screenshots before inference. The contract that prevents the 1.5× click-offset bug class. See reference/coordinate-spaces.md and issue #75. |
fill_field |
Step type for a labelled text input. params={"label": "User ID", "value": "alice"}. The runner clicks the field, clears any pre-filled value, and types value. Uses find_form_target (single labelled element), not find_all_listings. See issue #80. |
submit |
Step type for a single labelled button — Login / Save / Update / Submit / Continue. params={"label": "Login", "aliases"?: ["Sign In", "Continue"]}. One click, then waits 2.5 s. The runner scrolls Page_Down up to 4 times when the button isn't in viewport (issue #89 §2); aliases covers copy variation across products (e.g. "Update Lead" / "Save" / "Save Changes"). |
select_option |
Step type for a dropdown / <select> choice. params={"dropdown_label": "Industry Vertical", "option_label": "Space Exploration"}. Two-phase: click dropdown → screenshot the open menu → click option. |
find_form_target |
ClaudeExtractor method that locates one labelled element on a non-listings page. Counterpart to find_all_listings for forms. Returns {x, y, action, value, label}. |
GroundingCache |
TTL+LRU cache that short-circuits Claude grounding calls when the same (perceptual_hash(crop), description_hash) pair recurs. Constructed once per Runtime and shared across ClaudeGrounding instances. Hit/miss/eviction/size emitted on Prometheus. See issue #117. |
prefer_som_grounding |
SiteConfig flag. When True (and GymRunner has a page_discovery), the runner promotes Set-of-Mark dispatch (DOM-discovered candidates → brain choice) ABOVE the direct executor. On SoM-friendly sites this skips the Claude grounding round-trip. Default False. See issue #117. |
| Set-of-Mark / SoM | The dispatch pattern proven in VWA: overlay numbered boxes on DOM-derived candidates, ask the brain "which N?", click the chosen element via DOM reference. Implemented in gym/page_discovery.py + _try_discovery_execution. |
predicted_outcome / observed_outcome |
Fields on TrajectoryStep (#120). The brain emits Predicted: <delta> describing what its action will cause; the runner records the runtime feedback as the observed delta. Reward functions compute Jaccard similarity to derive a per-step world-model accuracy signal — see mantis_agent.rewards.world_model_accuracy_reward and PlanAdherenceReward.world_model_weight. |
LoopDetector |
Detector with three signals: is_repeat_loop (byte-equal), is_drift_loop (coordinate-near), is_state_loop (URL/screenshot frozen). Both GymRunner and StreamingCUA call is_any_loop(window=...) after recording each action. Reset at plan-step boundaries (#116) so cross-step false positives clear. |
BrainLadder |
Wraps a primary + fallback Brain. force_fallback() routes the next think() call to the fallback (one-shot or sticky). Annotates results with brain_used. Emits mantis_brain_escalation_total per call. |