Mantis CUA Architecture¶

Mantis is a Computer Use Agent that fuses perception, reasoning, and action into a single model. One brain consumes screen frames + task + history and outputs actions, observing consequences in real-time.

System Overview¶

User objective (text)
  |
  v
ObjectiveSpec.parse()          Parse objective into structured spec
  |
  v
GraphLearner                   Probe site + generate dependency graph
  |  |
  |  +-> SiteProber            Screenshot-based page analysis (no brain needed)
  |  +-> WorkflowGraph         DAG of phases with loop semantics
  |
  v
GraphCompiler                  Compile graph -> flat MicroPlan
  |
  v
PlanValidator                  Structural checks + auto-fix
  |
  v
MicroPlanRunner                Execute with checkpoint/verify/reverse
  |  |
  |  +-> Brain.think()         Perception + reasoning + action
  |  +-> GymEnvironment.step() Execute action, capture screenshot
  |  +-> ClaudeExtractor       Read structured data from screenshots
  |  +-> ClaudeGrounding       Refine click coordinates
  |  +-> DynamicPlanVerifier   Track coverage per page
  |
  v
ExtractionResult               Structured output (fields, viability, spam check)

Module Map¶

Core (`src/mantis_agent/`)¶

Module	Purpose
`brain.py`	Gemma4Brain -- unified vision-language model
`brain_holo3.py`	Holo3-35B via llama.cpp
`brain_llamacpp.py`	Generic llama.cpp bridge for local GGUF models
`brain_opencua.py`	OpenCUA 32B/72B via vLLM tensor-parallel
`brain_claude.py`	Claude Sonnet/Opus via Anthropic API
`actions.py`	Action enum (click, type, scroll, done) and tool schemas
`extraction.py`	ClaudeExtractor + ExtractionSchema -- schema-driven data extraction
`grounding.py`	ClaudeGrounding -- pixel-level click targeting via Claude
`plan_decomposer.py`	Text plan -> MicroPlan via Claude Sonnet
`site_config.py`	SiteConfig -- URL patterns, pagination format, gate prompts
`server_utils.py`	Shared utilities -- proxy, plan signatures, result builders
`task_loop.py`	TaskLoopConfig + run_task_loop -- shared executor lifecycle

Graph Learning (`src/mantis_agent/graph/`)¶

Module	Purpose
`objective.py`	ObjectiveSpec -- structured objective with fields, filters, completion
`graph.py`	WorkflowGraph, PhaseNode, PhaseEdge -- DAG with repeat modes
`probe.py`	SiteProber -- navigate + screenshot + Claude analysis (no brain)
`compiler.py`	GraphCompiler -- WorkflowGraph -> MicroPlan
`learner.py`	GraphLearner -- orchestrates probe + skeleton + sample + cache
`store.py`	GraphStore -- persist/load keyed by domain + objective hash
`plan_validator.py`	PlanValidator -- structural checks + auto-fix before execution

Execution (`src/mantis_agent/gym/`)¶

Module	Purpose
`runner.py`	GymRunner -- step-level agent loop with feedback and loop detection
`micro_runner.py`	MicroPlanRunner -- execute MicroPlan with checkpoint/verify/reverse
`workflow_runner.py`	WorkflowRunner -- dynamic loops and pagination over GymRunner
`learning_runner.py`	LearningRunner -- verified execution for building playbooks
`xdotool_env.py`	XdotoolGymEnv -- real Chrome + xdotool (zero automation fingerprints)
`playwright_env.py`	PlaywrightGymEnv -- headless Chromium via Playwright
`plan_executor.py`	PlanExecutor -- deterministic DOM-based step execution
`page_discovery.py`	PageDiscovery -- DOM inspection for element selection

Verification (`src/mantis_agent/verification/`)¶

Module	Purpose
`dynamic_plan_verifier.py`	Per-page coverage tracking (found/attempted/opened/completed)
`step_verifier.py`	Before/after screenshot comparison via Claude
`playbook.py`	PlaybookStore -- learned site-specific steps with confidence scores

Key Interfaces¶

Brain Protocol¶

class Brain(Protocol):
    def think(
        frames: list[Image.Image],
        task: str,
        action_history: list[Action] | None,
    ) -> InferenceResult

Implementations: Gemma4Brain, Holo3Brain, LlamaCppBrain, OpenCUABrain, ClaudeBrain

GymEnvironment¶

class GymEnvironment(ABC):
    def reset(task, start_url) -> GymObservation
    def step(action) -> GymResult
    def screenshot() -> Image.Image
    def close()

Implementations: XdotoolGymEnv (real Chrome), PlaywrightGymEnv (headless)

ExtractionSchema¶

@dataclass
class ExtractionSchema:
    entity_name: str              # "boat listing", "job posting"
    fields: list[OutputField]     # what to extract
    required_fields: list[str]    # viability check
    spam_indicators: list[str]    # what to reject
    allowed_controls: list[str]   # safe reveal buttons
    forbidden_controls: list[str] # lead-form traps

Drives ClaudeExtractor prompts dynamically. Default: marketplace-listings schema (mantis_agent.recipes.marketplace_listings.schema.SCHEMA).

SiteConfig¶

@dataclass
class SiteConfig:
    detail_page_pattern: str      # regex for detail URLs
    results_page_pattern: str     # regex for results URLs
    pagination_format: str        # "/page-{n}/" or "?page={n}"
    gate_verify_prompt: str       # what to check after filters
    filtered_results_url: str     # recovery URL if filters lost

Used by MicroPlanRunner for URL checks instead of hardcoded patterns.

Execution Flow¶

Phase 1: Plan Generation¶

Text objective
  -> ObjectiveSpec.parse()    Extract domain, entity, filters, schema
  -> GraphLearner.learn()     Check cache -> probe site -> generate skeleton
  -> GraphCompiler.compile()  DAG -> flat MicroPlan
  -> PlanValidator.enhance()  Fix missing navigate/gate/loops

Phase 2: Filter Application (Setup)¶

navigate -> filter_0 -> filter_1 -> ... -> gate_verification

Each filter is a separate required step. If any fails, pipeline halts. Gate checks all filters are active before extraction begins.

Phase 3: Extraction Loop¶

for each discovered item on page:
    click title -> extract URL -> scroll to details
    -> expand collapsed sections -> extract data -> go back

when page exhausted:
    paginate -> loop back to discovery

Coverage tracked by DynamicPlanVerifier: - found items, attempted items, opened items, completed items - 7 structural checks per page (filters, attempts, completions, exhaustion)

Phase 4: Result¶

ExtractionResult per item:
  - extracted_fields: {name: value}
  - is_viable(): required fields present + not spam
  - to_summary(): "VIABLE | Year: 2020 | Make: Sea Ray | ..."

Run result:
  - leads count, phone lead count
  - costs (GPU, Claude API, proxy bandwidth)
  - dynamic_verification_summary with per-page checks
  - checkpoint for resume

Deployment¶

deploy/modal/modal_cua_server.py
  |
  +-> gemma4_planner()       Persistent T4, llama.cpp, /v1/chat/completions
  +-> run_holo3()            Per-run A100, llama.cpp GGUF
  +-> run_gemma4_cua()       Per-run A100, llama.cpp GGUF
  +-> run_cua_*gpu()         Per-run A100s, vLLM (EvoCUA/OpenCUA)
  +-> run_claude_cua()       Per-run CPU only, Anthropic API

All executors delegate to task_loop.run_executor_lifecycle(). Executor-specific behavior via callbacks: on_task_result, on_task_complete, on_loop_complete.

Baseten¶

baseten_server.py (FastAPI)
  |
  +-> POST /predict          Run micro-plan or task suite
  +-> action=graph_learn     Probe + graph (CPU only)
  +-> action=status/result   Poll detached runs

Uses same task_loop.run_executor_lifecycle() as Modal.

Local¶

set -a && source .env && set +a
uv run modal run deploy/modal/modal_cua_server.py \
  --micro plans/example/extract_listings.json \
  --model holo3 --max-cost 0.30

Cost Model¶

Component	Cost	When
Plan decomposition	~$0.01	Once per text plan (cached)
Graph skeleton	~$0.01	Once per objective (cached)
Site probing	~$0.02	Once per domain (4-6 screenshots)
Claude extraction	~$0.003	Per listing (1-2 screenshots)
Claude grounding	~$0.003	Per click target refinement
Gate verification	~$0.003	Per gate check
GPU (Holo3/Gemma4)	~$3.25/hr	During execution
Proxy bandwidth	~$5/GB	During browser sessions

Typical run: ~$0.14/lead extracted (2 Claude calls + GPU time + proxy).

Directory Structure¶

cua-agent/
  src/mantis_agent/
    graph/              Graph learning, compilation, validation
    gym/                Environments, runners, plan execution
    verification/       Coverage tracking, step verification, playbooks
    curriculum/         Domain-specific interaction techniques
    tools/              Utility helpers
    extraction.py       Schema-driven screenshot data extraction
    grounding.py        Claude-based click targeting
    site_config.py      Domain-specific URL patterns
    server_utils.py     Shared proxy, result builders, CSV
    task_loop.py        Shared executor lifecycle
    plan_decomposer.py  Text -> MicroPlan via Claude
  deploy/
    modal/              Modal CLI entrypoints (modal_cua_server.py, modal_osworld_*.py, ...)
    baseten/            Baseten Truss deployments (holo3, gemma4, gemma4_26b)
  docker/               Containerfiles (cua, hud, local)
  scripts/              CLI tools (run_*.py, check_*.sh, baseten_workload.py)
  plans/                Plan files (.txt, .json)
  tasks/                Task descriptors
  benchmarks/           OSWorld / VWA benchmark harnesses (per-domain Modal apps)
  training/             Distillation and fine-tuning configs
  tests/                pytest suite
  docs/                 Architecture documentation