Mantis CUA HTTP API

Reference for callers who want to use the Mantis CUA service directly — without going through a host wrapper. For library-shaped integrations where you drive MicroPlanRunner in your own process, see Embedding MicroPlanRunner and the any-agent integration playbook.

Endpoints

| Path | Auth | Purpose |
| --- | --- | --- |
| POST /v1/predict | X-Mantis-Token (run scope) | Run a plan, poll status, fetch result. The high-level orchestrator. |
| POST /predict | X-Mantis-Token (run scope) | Backwards-compat alias for /v1/predict. Identical behavior. |
| POST /v1/chat/completions | X-Mantis-Token (run scope) | OpenAI-compat reverse proxy to in-pod Holo3 (raw inference). |
| GET /v1/models | open | OpenAI-compat model list. Returns holo3. |
| GET /v1/health, GET /health | open | Liveness/readiness probe. |
| GET /metrics | open | Prometheus scrape endpoint. Returns 503 if prometheus_client is not installed. |
| GET /v1/runs/{run_id}/video | X-Mantis-Token | Download the screencast captured during a run. Returns 404 if record_video was not requested. |

When deployed behind Baseten, all requests must also carry Authorization: Api-Key <BASETEN_API_KEY> (gateway auth, separate from container auth).

Authentication

The service uses two layers of auth when deployed on Baseten:

| Header | Layer | Purpose |
| --- | --- | --- |
| Authorization: Api-Key <BASETEN_API_KEY> | Baseten gateway | Authenticates the platform request. Required for any call. |
| X-Mantis-Token: <tenant_token> | Container | Authenticates the tenant. Required for /v1/predict and /v1/chat/completions. |

X-Mantis-Token is split into a custom header (rather than another Authorization: Bearer) because the Baseten gateway's Authorization: Api-Key header is forwarded to the container; using the same header for both auth layers would clash.

If MANTIS_TENANT_KEYS_PATH is configured on the deployment, each tenant has its own token. Otherwise a single MANTIS_API_TOKEN works for all callers (single-tenant mode).
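As a minimal sketch of the two layers, a client might assemble its headers as follows. The helper name build_headers is hypothetical and not part of any Mantis client library; only the header names come from the table above.

```python
from typing import Dict, Optional


def build_headers(tenant_token: str,
                  baseten_api_key: Optional[str] = None) -> Dict[str, str]:
    """Assemble request headers for a Mantis CUA call.

    Hypothetical helper for illustration; not part of any Mantis SDK.
    """
    if not tenant_token:
        raise ValueError("tenant_token is required for authenticated endpoints")
    headers = {
        "X-Mantis-Token": tenant_token,  # container-level tenant auth
        "Content-Type": "application/json",
    }
    if baseten_api_key:
        # Gateway auth: only needed when the service sits behind Baseten.
        headers["Authorization"] = f"Api-Key {baseten_api_key}"
    return headers
```

Keeping the gateway key optional lets the same helper serve a direct (non-Baseten) deployment, where only X-Mantis-Token applies.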

Rate / scale caps

Per-request server-side caps that the caller cannot exceed:

| Env var | Default | Effect |
| --- | --- | --- |
| MANTIS_MAX_STEPS_PER_PLAN | 200 | Plans larger than this are rejected with 400. |
| MANTIS_MAX_LOOP_ITERATIONS | 50 | loop_count in any loop step is silently clamped to this. |
| MANTIS_MAX_RUNTIME_MINUTES | 60 | max_time_minutes in the request body is clamped. |
| MANTIS_MAX_COST_USD | 25.0 | max_cost in the request body is clamped. |

Per-tenant caps (max_concurrent_runs, max_cost_per_run, max_time_minutes_per_run) also apply when multi-tenant mode is enabled.
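To see what clamping means for a caller, the sketch below predicts the effective limits for a request. It only mirrors the documented behavior for the two global caps; the function name effective_limits is hypothetical, per-tenant caps are ignored, and this is not the server's code.

```python
import os

# Defaults copied from the table above.
DEFAULT_MAX_COST_USD = 25.0
DEFAULT_MAX_RUNTIME_MINUTES = 60.0


def effective_limits(max_cost, max_time_minutes, env=None):
    """Return (cost, minutes) after clamping to the global server caps.

    Hypothetical helper; per-tenant caps are not modeled here.
    """
    env = os.environ if env is None else env
    cost_cap = float(env.get("MANTIS_MAX_COST_USD", DEFAULT_MAX_COST_USD))
    time_cap = float(env.get("MANTIS_MAX_RUNTIME_MINUTES",
                             DEFAULT_MAX_RUNTIME_MINUTES))
    # Clamping never raises an error: oversized values are silently lowered.
    return min(max_cost, cost_cap), min(max_time_minutes, time_cap)
```

For example, a request asking for max_cost 100 and max_time_minutes 500 against the defaults would run with 25.0 and 60.0.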


POST /v1/predict

Run a plan, poll an existing run, fetch its result or live logs, or cancel it. The mode is determined by the action field (or its absence).

Run a new plan

The request body must contain exactly one of these plan-shape fields, in priority order:

| Field | Type | Description |
| --- | --- | --- |
| task_suite | object | Inline task-suite dict. Use this for arbitrary plans where you don't want to bake them into the container image. |
| task_file_contents | string | JSON-as-string. Same shape as task_suite but pre-serialized. |
| task_file | string | Path inside the container image (e.g. tasks/crm/crm_tasks.json). |
| micro | string | Path to a micro-plan JSON or plain-text plan inside the image (e.g. plans/example/extract_listings.json). |
| plan_text | string | Inline plain-English plan. Decomposed via Claude on the server side. |

Plus the run options:

| Field | Default | Description |
| --- | --- | --- |
| detached | true | Return a run_id immediately and continue work in the background. Set false to block until done (only useful for short plans of about 5–10s). |
| state_key | "" | Caller-chosen identifier; the server prefixes it with tenant_id so callers can't collide. Reuse the same key across runs to share checkpoint state and Chrome profile (cookies, sessions). |
| resume_state | false | Reconstruct browser state from the latest checkpoint at state_key before starting. |
| max_cost | 25.0 | Cap in USD; clamped against the tenant cap. |
| max_time_minutes | 60 | Wall-clock cap; clamped against the tenant cap. |
| proxy_city, proxy_state | unset | Optional IPRoyal geo overrides. Subject to allowlist. |
| record_video | false | If true, captures the Xvfb display while the run executes and saves a screencast under the per-tenant run dir. Fetch via GET /v1/runs/{run_id}/video. |
| video_format | "mp4" | One of mp4, webm, gif. |
| video_fps | 5 | Capture rate; clamped to [1, 30]. Higher fps means a larger file and more CPU. |
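A client-side builder can catch obvious mistakes before the request leaves your process. The sketch below is an assumption about how a caller might do that; build_run_request is a hypothetical name, and the checks simply restate the two tables above (one plan-shape field, known video formats, fps within [1, 30]).

```python
PLAN_FIELDS = ("task_suite", "task_file_contents", "task_file", "micro", "plan_text")
VIDEO_FORMATS = ("mp4", "webm", "gif")


def build_run_request(plan_field, plan_value, **options):
    """Build a /v1/predict body with exactly one plan-shape field.

    Hypothetical helper; the server remains the source of truth for validation.
    """
    if plan_field not in PLAN_FIELDS:
        raise ValueError(f"unknown plan-shape field: {plan_field!r}")
    body = {plan_field: plan_value, "detached": True}
    body.update(options)
    if body.get("video_format", "mp4") not in VIDEO_FORMATS:
        raise ValueError("video_format must be one of mp4, webm, gif")
    if "video_fps" in body:
        # Mirror the documented server-side clamp so the echoed payload matches.
        body["video_fps"] = max(1, min(30, int(body["video_fps"])))
    return body
```

Passing the plan as a (field, value) pair makes it impossible to send two plan shapes in one body by accident.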

Detached response

{
  "status": "queued",
  "created_at": "2026-04-28T01:57:08.316Z",
  "model": "holo3",
  "mode": "detached",
  "run_id": "20260428_021432_076255ef",
  "payload": { ... echoed input ... },
  "updated_at": "2026-04-28T01:57:08.317Z",
  "status_path":  "/workspace/mantis-data/runs/<run_id>/status.json",
  "result_path":  "/workspace/mantis-data/runs/<run_id>/result.json",
  "csv_path":     "/workspace/mantis-data/runs/<run_id>/leads.csv",
  "events_path":  "/workspace/mantis-data/runs/<run_id>/events.log"
}

The *_path fields are server-internal; you fetch them through the polling actions (next section).

Poll / fetch / cancel an existing run

Set action and run_id in the body:

{ "action": "status", "run_id": "20260428_021432_076255ef" }
{ "action": "result", "run_id": "..." }
{ "action": "logs",   "run_id": "...", "tail": 200 }
{ "action": "cancel", "run_id": "..." }

status returns the current state plus a summary block when the run is in a terminal state:

{
  "status": "succeeded",          // or running | failed | cancelled
  "run_id": "...",
  "started_at": "...",
  "finished_at": "...",
  "summary": {
    "total_time_s": 569,
    "steps_executed": 17,
    "viable": 3,
    "leads_with_phone": 1,
    "result_path": "...",
    "csv_path": "...",
    "dynamic_verification_summary": { ... },
    "cost_total": 0.42,
    "cost_breakdown": {
      "gpu":    0.12,
      "claude": 0.12,
      "proxy":  0.18
    }
  }
}

result returns the full lead list and per-step trace. logs returns the last tail events written by the runner (default 200, max 10000).
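A polling loop over the status action could look roughly like this. The sketch assumes a caller-supplied fetch_status callable that POSTs {"action": "status", "run_id": ...} to /v1/predict and returns the parsed JSON; that callable, the function name, and the max_polls guard are all hypothetical.

```python
import time

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def poll_until_terminal(fetch_status, run_id, interval_s=30.0,
                        max_polls=240, sleep=time.sleep):
    """Poll a run until it reaches a terminal state and return the last response.

    fetch_status is any callable taking run_id and returning the status JSON
    as a dict. Injecting it (and sleep) keeps the loop testable offline.
    """
    for _ in range(max_polls):
        response = fetch_status(run_id)
        if response.get("status") in TERMINAL_STATES:
            return response
        sleep(interval_s)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")
```

With the defaults, the loop gives up after roughly two hours, which is longer than the 60-minute default runtime cap.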

Errors

| Status | Meaning |
| --- | --- |
| 400 | Bad request. Common causes: no plan-shape provided, malformed JSON, plan exceeds MANTIS_MAX_STEPS_PER_PLAN, micro-step missing intent/type. |
| 401 | Missing or invalid X-Mantis-Token. |
| 403 | Token valid but tenant lacks run scope (read-only key). |
| 404 | action=status\|result\|logs referenced an unknown run_id. |
| 429 | (Tier 2) Tenant exceeded its concurrent-run cap or per-minute rate limit. |
| 500 | Unhandled exception; check events_path for the traceback. |
| 502 | Upstream Holo3 (/v1/chat/completions) or Anthropic API unreachable. |
| 503 | Server auth not configured (MANTIS_API_TOKEN unset and no keys file). |

POST /v1/chat/completions

OpenAI-compatible reverse proxy to the in-pod Holo3 server. For raw inference only — no plan orchestration, no Claude grounding, no checkpointing. Designed for clients that want to run their own perception-action loop and use Holo3 as the brain.

curl -X POST "https://model-qvvgkneq.api.baseten.co/production/sync/v1/chat/completions" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "holo3",
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Click the boat listing title."}
      ]}
    ],
    "max_tokens": 256
  }'

Auth headers and Mantis-side cookies are stripped before the request is forwarded to llama.cpp; the upstream never sees your tenant credentials.

For the orchestrated/reliable path that handles the full plan, use /v1/predict instead.
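If you are driving your own perception-action loop, the request body can be built from a raw screenshot along these lines. This is a sketch only: build_chat_request is a hypothetical helper, and the body simply follows the curl example above.

```python
import base64
import json


def build_chat_request(png_bytes: bytes, instruction: str,
                       max_tokens: int = 256) -> str:
    """Serialize one screenshot plus an instruction as a chat-completions body.

    Hypothetical helper; the field layout follows the curl example above.
    """
    data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode("ascii")
    body = {
        "model": "holo3",
        "messages": [
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": instruction},
            ]}
        ],
        "max_tokens": max_tokens,
    }
    return json.dumps(body)
```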


GET /v1/models

OpenAI-compatible model listing.

{
  "object": "list",
  "data": [
    { "id": "holo3", "object": "model", "owned_by": "mantis" }
  ]
}

End-to-end example: 3-listing extraction

TOKEN=$(read -srp "MANTIS_API_TOKEN: " v && echo "$v")
BTKEY="$BASETEN_API_KEY"
# Baseten gateway forwards /sync/<any path> to the container. /predict is
# the legacy default route (equivalent to /sync/predict).
ENDPOINT="https://your-model.api.baseten.co/production/sync"

# 1. Launch detached run — supply your own plan_text or a micro-plan.
RESP=$(curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "detached": true,
    "plan_text": "Extract the first 3 listings from <your URL>: year, make, model, price, phone, url.",
    "state_key": "smoke-test",
    "resume_state": false,
    "max_cost": 2,
    "max_time_minutes": 20
  }')
RUN_ID=$(echo "$RESP" | jq -r .run_id)
echo "run_id: $RUN_ID"

# 2. Poll status until terminal
while true; do
  STATUS=$(curl -fsS -X POST "$ENDPOINT/v1/predict" \
    -H "Authorization: Api-Key $BTKEY" \
    -H "X-Mantis-Token: $TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"action\":\"status\",\"run_id\":\"$RUN_ID\"}" | jq -r .status)
  echo "$(date '+%H:%M:%S') $STATUS"
  case "$STATUS" in succeeded|failed|cancelled) break ;; esac
  sleep 30
done

# 3. Fetch leads
curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"action\":\"result\",\"run_id\":\"$RUN_ID\"}" \
  | jq .result.leads

Result shape (one row per successfully extracted listing):

<year> <make> <model>  — <price> — phone <phone or 'none'>
<year> <make> <model>  — <price>
<year> <make> <model>  — <price>

Plan shapes — when to use which

| Use case | Recommended shape |
| --- | --- |
| Recurring high-volume workflow with predictable steps | Hand-author a micro-plan JSON, ship it in the image at plans/<domain>/<workflow>.json, reference via micro |
| Arbitrary plain-English request | plan_text: the server decomposes it via Claude (cached after first run) |
| Ad-hoc plan you don't want baked into the image | task_suite (inline JSON dict) |
| Multi-task suite with task_id + verify clauses | task_suite or task_file |

Plan formats

micro — micro-plan JSON

A flat list of step objects executed by MicroPlanRunner:

[
  {"intent": "Navigate to https://...", "type": "navigate",
   "section": "setup", "required": true},
  {"intent": "Verify filters applied",  "type": "extract_data",
   "claude_only": true, "section": "setup", "gate": true,
   "verify": "Page shows boat listings ..."},
  {"intent": "Click listing title",     "type": "click",
   "grounding": true, "section": "extraction"},
  {"intent": "Read URL",                "type": "extract_url",
   "claude_only": true, "section": "extraction"},
  {"intent": "Scroll to description",   "type": "scroll",
   "budget": 10, "section": "extraction"},
  {"intent": "Extract data",            "type": "extract_data",
   "claude_only": true, "section": "extraction"},
  {"intent": "Go back",                 "type": "navigate_back",
   "section": "extraction"},
  {"intent": "Loop",                    "type": "loop",
   "loop_target": 2, "loop_count": 3, "section": "extraction"}
]

Step types: navigate, filter, click, scroll, extract_url, extract_data, navigate_back, paginate, loop.

Key fields:

| Field | Effect |
| --- | --- |
| section | One of setup, extraction, pagination. Used by retry/halt logic. |
| required | If true, retry on fail then halt the whole run. |
| gate | Claude verifies a condition; halt on fail. |
| verify | Free-text condition Claude checks. |
| claude_only | Skip Holo3; Claude does the perception. Use for extract / gate steps. |
| grounding | Refine click coordinates with ClaudeGrounding. |
| budget | Max actions Holo3 can take in this step (default 8). |
| loop_target | Step index to jump back to. |
| loop_count | Max loop iterations (clamped to MANTIS_MAX_LOOP_ITERATIONS). |
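Before shipping a hand-authored micro-plan, a quick local lint can catch the errors that would otherwise come back as a 400. The following is a sketch under stated assumptions: validate_micro_plan is a hypothetical helper, and the loop check only assumes that loop_target must point at an earlier step (the indexing base is not stated in this reference).

```python
STEP_TYPES = {"navigate", "filter", "click", "scroll", "extract_url",
              "extract_data", "navigate_back", "paginate", "loop"}
SECTIONS = {"setup", "extraction", "pagination"}


def validate_micro_plan(steps):
    """Return a list of human-readable problems; an empty list means no issues found."""
    errors = []
    for i, step in enumerate(steps):
        for key in ("intent", "type"):
            if not step.get(key):
                errors.append(f"step {i}: missing {key}")
        stype = step.get("type")
        if stype and stype not in STEP_TYPES:
            errors.append(f"step {i}: unknown type {stype!r}")
        section = step.get("section")
        if section is not None and section not in SECTIONS:
            errors.append(f"step {i}: unknown section {section!r}")
        if stype == "loop":
            target = step.get("loop_target")
            # Assumption: a loop can only jump back to a step that precedes it.
            if not isinstance(target, int) or not 0 <= target < i:
                errors.append(f"step {i}: loop_target must index an earlier step")
    return errors
```

Returning a list rather than raising on the first problem lets you fix a whole plan in one pass.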

task_suite — multi-task JSON

For Claude-CUA-style autonomous-per-task workflows (the existing tasks/crm/crm_tasks.json is this shape):

{
  "session_name": "crm_demo",
  "base_url": "https://crm.example.com",
  "auth": { "user_id": "...", "password": "..." },
  "tasks": [
    {
      "task_id": "login",
      "intent": "Go to https://... and log in with user X and password Y",
      "save_session": true,
      "start_url": "https://...",
      "verify": { "type": "url_not_contains", "value": "login" }
    },
    {
      "task_id": "update_lead_industry",
      "intent": "Go to the Leads Page. Update industry of qualified lead to 'Space Exploration'.",
      "require_session": true,
      "start_url": "https://...",
      "verify": { "type": "page_contains_text", "value": "Space Exploration" }
    }
  ]
}

Each task runs with its own max_steps budget; Claude decides what to do per task. After each task, the runner checks that task's verify clause.
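The two verify types in the example can be read as simple predicates over the final page. The sketch below is an interpretation inferred from the type names, not the runner's actual implementation; check_verify and its parameters are hypothetical.

```python
def check_verify(clause, current_url, page_text):
    """Evaluate one verify clause against the final URL and visible page text.

    Semantics are assumed from the clause names in the example above.
    """
    kind, value = clause["type"], clause["value"]
    if kind == "url_not_contains":
        return value not in current_url
    if kind == "page_contains_text":
        return value in page_text
    raise ValueError(f"unsupported verify type: {kind!r}")
```

Under this reading, the login task passes once the browser has left any URL containing "login".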

plan_text — plain-English

{
  "plan_text": "Go to a marketplace listings site, filter to private sellers above $35,000 in Florida, extract listing details for the first 3 listings, save year/make/model/price/phone."
}

PlanDecomposer (Claude-backed, cached by signature) converts this into a micro-plan and proceeds. Decomposition costs ~$0.10 the first time per unique plan text; subsequent runs hit the cache.


Pricing (verified end-to-end)

Real numbers from a 3-listing marketplace-extraction run on Baseten:

| Item | Cost |
| --- | --- |
| GPU (Holo3 on H100, ~10 min) | ~$0.12 |
| Claude (gates + extract + grounding) | ~$0.12 |
| Proxy (IPRoyal residential) | ~$0.18 |
| Total per 3-listing run | ~$0.42 |
| Per-listing | ~$0.14 |

For comparison, an equivalent Claude-only CUA flow costs roughly $0.50–$1.50 per listing.
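The totals follow directly from the three line items; the short calculation below reproduces them. The variable names are illustrative only.

```python
# Line items from the pricing table above (USD, approximate).
costs = {"gpu": 0.12, "claude": 0.12, "proxy": 0.18}
listings = 3

total = round(sum(costs.values()), 2)
per_listing = round(total / listings, 2)

print(f"total ${total:.2f}, per listing ${per_listing:.2f}")
# prints: total $0.42, per listing $0.14
```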


Security model

| Concern | Guarantee |
| --- | --- |
| Tenant token confidentiality | Stored in Baseten secrets; constant-time compare on validation; never echoed in logs |
| Per-tenant Anthropic key | Resolved from the tenant's anthropic_secret_name; keys are not shared across tenants |
| Per-tenant browser profile | Mounted at /workspace/mantis-data/tenants/<tenant_id>/chrome-profile/<state_key>/ so cookies cannot bleed across tenants |
| Per-tenant run state | Same volume layout; state_key is server-prefixed so callers cannot read another tenant's checkpoint |
| Plan injection (e.g., loop_count: 999_999) | Server-side hard caps clamp the values; oversized plans are rejected with 400 |
| Upstream credential leak | /v1/chat/completions strips X-Mantis-Token, Authorization, Cookie before forwarding to in-pod llama.cpp |

Limits / caveats

  • Detached runs survive a replica restart (state lives on the data volume), but only within the same Baseten model. Cross-region failover is not supported.
  • Pause/resume for OTP is not yet wired through /v1/predict. It works today in library-embedded integrations because the loop runs in the host's own process; see Embedding MicroPlanRunner.
  • /v1/chat/completions does not stream responses in v1. Streaming SSE is a Tier 2 follow-up.
  • Each tenant has a single Anthropic key, re-resolved on every call.

Screencast / video recording

Send a plan with record_video: true and the runtime produces a feature-walkthrough video — title card → captioned run footage → outro card with the result summary. Fetch with GET /v1/runs/{run_id}/video. The raw screencast is preserved alongside; pass ?raw=1 to fetch it instead.

The walkthrough has three segments plus animated click ripples on top of the run footage:

┌─────────────────┐  ┌─────────────────────────┐  ┌─────────────────┐
│  Title card     │→ │  Run footage (captions  │→ │  Outro card     │
│  (3s)           │  │   + click ripples)      │  │  (5s)           │
│                 │  │  per-step intent shown  │  │                 │
│  Mantis CUA     │  │  with [OK] / [FAIL]     │  │  Run complete   │
│  ───            │  │  in the bottom strip    │  │  ───            │
│  <plan name>    │  │  while the action plays │  │  3 viable leads │
│  tenant: …      │  │  + expanding sky-blue   │  │  1 with phone   │
│  run: …         │  │  ripple at every click  │  │  17 steps · 9m  │
│                 │  │                         │  │  cost: $0.42    │
└─────────────────┘  └─────────────────────────┘  └─────────────────┘

Title and outro are rendered with PIL. Captions are SRT cues burned in by ffmpeg's subtitles= filter (libass). Click ripples are PNG-sequence overlay frames composited via ffmpeg's overlay filter. Polish is best-effort — if anything fails (PIL, ffmpeg, libass not built in the image), the raw recording is still saved and the endpoint serves it.

Action overlays — universal computer use

Every kind of agent action gets a visual cue, regardless of what application is in focus (browser, file manager, terminal, dialogs, anything visible on the Xvfb display). The agent emits actions with pixel coordinates / key chords / text, and the overlay renderer composites the matching visual onto the recording.

| Agent action | Overlay |
| --- | --- |
| CLICK (single) | Sky-blue expanding ripple at (x, y), 0.6 s, fades out |
| DOUBLE_CLICK | Same as click + a second offset ring 0.1 s later |
| KEY_PRESS (e.g. Ctrl+S, Tab, Enter) | Slate badge in the bottom-right with the chord text, 1.5 s, slide-in then fade |
| TYPE (typed text) | "⌨ Typing: \"…\"" caption near the top, 1.8 s, fades after text appears on screen |
| SCROLL (up / down / left / right) | Sky-blue arrow at the matching screen edge, slides in the scroll direction, 0.8 s |
| DRAG | Animated trail line from start to end with a moving head dot, 0.9 s |
| WAIT, NAVIGATE, DONE | No overlay (no useful visual locus) |

All overlays are deliberately minimal: visible without being disruptive. They share a sky-blue accent color so they read as a single visual language.

You'll see counts in the result metadata under video.actions:

{
  "video": {
    "path": ".../recording.mp4",
    "polished_path": ".../recording_polished.mp4",
    "actions": {
      "clicks": 17,
      "keys":   3,
      "types":  2,
      "scrolls": 8,
      "drags":  0
    },
    "clicks": 17,    // backwards-compat field
    ...
  }
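}

To record a run and download the screencast (this reuses ENDPOINT, BTKEY, and TOKEN from the end-to-end example):

# 1. Submit a recorded run
RESP=$(curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \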
  -H "Content-Type: application/json" \
  -d '{
    "detached": true,
    "micro": "plans/example/extract_listings.json",
    "state_key": "demo-recording",
    "max_cost": 2,
    "max_time_minutes": 20,
    "record_video": true,
    "video_format": "mp4",
    "video_fps": 8
  }')
RUN_ID=$(echo "$RESP" | jq -r .run_id)

# 2. Poll status until succeeded ... (same as the regular flow)

# 3. Download the screencast (gateway auth is still required behind Baseten)
curl -fsS -o demo.mp4 \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \
  "$ENDPOINT/v1/runs/$RUN_ID/video"

Result-side metadata (in the summary block):

{
  "video": {
    "path": "/workspace/mantis-data/tenants/<tenant>/runs/<run_id>/recording.mp4",
    "polished_path": "/workspace/mantis-data/tenants/<tenant>/runs/<run_id>/recording_polished.mp4",
    "format": "mp4",
    "duration_seconds": 567.3,
    "bytes": 31457280,
    "error": null
  }
}

polished_path is set only when the post-process compose step succeeded; on failure it's omitted and the endpoint falls back to the raw recording.

Endpoint behavior

| Request | Returns |
| --- | --- |
| GET /v1/runs/{run_id}/video | Polished mp4 (preferred) → raw mp4 (fallback) → 404 |
| GET /v1/runs/{run_id}/video?raw=1 | Raw mp4 only → 404 |
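The fallback order in the table can be expressed as a small selection function. This is a sketch of the documented behavior, not the endpoint's code; select_video is a hypothetical name, and a path of None stands for a file that was never produced.

```python
def select_video(polished_path, raw_path, raw_only=False):
    """Return (http_status, path) following the documented fallback order."""
    if not raw_only and polished_path:
        return 200, polished_path   # preferred: the polished walkthrough
    if raw_path:
        return 200, raw_path        # fallback, or explicit ?raw=1
    return 404, None                # nothing was recorded for this run
```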

Format tradeoffs

| Format | Codec / encoder settings | Encode cost | Output size (typical 10-min run) | Best for |
| --- | --- | --- | --- | --- |
| mp4 | H.264 (libx264, ultrafast preset, CRF 28) | low | ~30–80 MB | sharing, downloads |
| webm | VP9 (libvpx-vp9, cpu-used 5, CRF 32) | medium | ~25–60 MB | embedding in web pages |
| gif | palettegen + paletteuse | high | ~50–200 MB | docs, Slack, animated thumbnails (lossy) |

For long recordings or tight bandwidth, prefer mp4 at 5 fps. The gif path uses a palette-aware filtergraph but file size grows fast — use only for short demos (< 60 s).

Operational caveats

  • The container image must have ffmpeg installed. Both docker/server.Dockerfile and deploy/baseten/holo3/config.yaml ship it; if you're rolling your own image, add ffmpeg to the apt deps. Without ffmpeg, record_video: true is a soft-fail — the run completes normally, and the response carries video.error: "ffmpeg-not-installed".
  • Recordings live at $MANTIS_DATA_DIR/tenants/<tenant_id>/runs/<run_id>/recording.<fmt> so tenants cannot read each other's files. The download endpoint uses the authenticated tenant's dir; even if you guess another tenant's run_id, the file lookup is scoped.
  • video_fps is clamped to [1, 30]. Higher fps doesn't help much (UI rarely changes faster than 5–10 fps) and bloats the file.
  • Each second of recording is ~50 KB at 5 fps mp4. Multiply by your target run duration + tenant count to size the EFS / Filestore.
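The sizing rule in the last bullet works out as follows. This is a rough estimate under stated assumptions: estimate_storage_bytes is a hypothetical helper, 1 KB is taken as 1000 bytes, and the 50 KB/s figure only applies to mp4 at 5 fps.

```python
def estimate_storage_bytes(run_seconds, runs_retained, kb_per_second=50.0):
    """Estimate recording storage for a number of retained runs.

    Assumes ~50 KB per recorded second (mp4 at 5 fps) and 1 KB = 1000 bytes.
    """
    return int(run_seconds * kb_per_second * 1000 * runs_retained)


# One 10-minute run is about 30 MB, which matches the low end of the mp4 range.
print(estimate_storage_bytes(600, 1))  # prints: 30000000
```

Multiply runs_retained by the number of tenants (and your retention window) to size the shared volume.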

Tier 2 features (rate limits, idempotency, webhooks, allowlist, metrics)

Rate limits

Two dimensions, both enforced per-tenant:

| Dimension | Source | Behavior on exceed |
| --- | --- | --- |
| Concurrent runs | tenant.max_concurrent_runs (default 5) | 429 Too Many Requests with Retry-After: 5 |
| Rate (token bucket) | tenant.rate_limit_per_minute (default 30) | 429 with Retry-After: <seconds-until-token> |

State is in-process per replica. Behind a load balancer with N replicas, the effective per-tenant cap is roughly N × configured_cap. For strict cluster-wide limits, deploy a single replica or swap to a Redis-backed limiter (planned Tier 2.5).
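For readers unfamiliar with the token-bucket scheme, here is a minimal in-process sketch. It illustrates the general technique and how a Retry-After value could be derived; the class and its clock parameter are hypothetical, and it is not the server's limiter.

```python
import math
import time


class TokenBucket:
    """In-process token bucket: one token per request, refilled continuously."""

    def __init__(self, rate_per_minute=30, clock=time.monotonic):
        self.rate_per_minute = rate_per_minute
        self.capacity = float(rate_per_minute)
        self.tokens = self.capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self):
        """Return 0 if the request may proceed, else whole seconds to wait."""
        now = self.clock()
        refill = (now - self.updated) * self.rate_per_minute / 60.0
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0
        # Seconds until one full token is available (a Retry-After candidate).
        return math.ceil((1.0 - self.tokens) * 60.0 / self.rate_per_minute)
```

Because each replica would hold its own bucket, N replicas admit roughly N times the configured rate, which is the effect described above.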

Idempotency keys

Send Idempotency-Key: <unique-string> on POST /v1/predict. The server caches (tenant_id, key) → run_id with a 24-hour TTL. Subsequent retries with the same key return the original run_id without starting a new run.

curl -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \
  -H "Idempotency-Key: order-7afc3b91" \
  -H "Content-Type: application/json" \
  -d '{...}'

The cache is sidecar-backed ($MANTIS_DATA_DIR/idempotency/<tenant_id>/<key_hash>.json) so a replica restart preserves it.
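A file-backed idempotency cache of this shape might work roughly as sketched below. The on-disk layout follows the path pattern above, but the SHA-256 key hash, the JSON entry fields, and the lookup_or_store function are assumptions made for illustration.

```python
import hashlib
import json
import time
from pathlib import Path

TTL_SECONDS = 24 * 3600  # 24-hour TTL, as documented


def lookup_or_store(data_dir, tenant_id, key, new_run_id, now=None):
    """Return the cached run_id for (tenant_id, key), or store new_run_id.

    Hypothetical sketch; the hash function and entry format are assumed.
    """
    now = time.time() if now is None else now
    key_hash = hashlib.sha256(key.encode("utf-8")).hexdigest()
    path = Path(data_dir) / "idempotency" / tenant_id / f"{key_hash}.json"
    if path.exists():
        entry = json.loads(path.read_text())
        if now - entry["created_at"] < TTL_SECONDS:
            return entry["run_id"]  # retry within TTL: reuse the original run
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"run_id": new_run_id, "created_at": now}))
    return new_run_id
```

Scoping the directory by tenant_id means two tenants can use the same key without colliding.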

Webhook callbacks

Two ways to receive run-completion notifications:

  1. Per-tenant default — set webhook_url and webhook_secret_name in the tenant keys file.
  2. Per-request override — pass callback_url in the /v1/predict body.

When the run reaches a terminal state (succeeded, failed, cancelled), the server POSTs:

{
  "run_id": "20260428_021432_076255ef",
  "tenant_id": "tenant_a",
  "status": "succeeded",
  "summary": { ... same shape as /v1/predict status response ... },
  "delivered_at": "2026-04-28T02:24:01.648Z"
}

The request carries an HMAC-SHA256 signature in the X-Mantis-Signature: sha256=<hex> header, signed with the tenant's webhook secret. If the receiver returns a non-2xx status or the connection fails, delivery is retried 3 times with backoff (1s, 5s, 30s).

Verify the signature on receipt:

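import hashlib
import hmac

def verify(body: bytes, header_sig: str, secret: str) -> bool:
    # Recompute the HMAC over the raw request body, exactly as received.
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, avoiding timing side channels.
    return hmac.compare_digest(expected, header_sig)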
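To test a receiver locally, you can produce a signature the same way and confirm the check round-trips. The sign helper, the sample payload, and the secret value are hypothetical; the verify function repeats the one above so that the example runs on its own.

```python
import hashlib
import hmac
import json


def sign(body: bytes, secret: str) -> str:
    """Compute the header value a sender would attach for this body."""
    return "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


def verify(body: bytes, header_sig: str, secret: str) -> bool:
    return hmac.compare_digest(sign(body, secret), header_sig)


payload = json.dumps({"run_id": "20260428_021432_076255ef",
                      "status": "succeeded"}).encode()
signature = sign(payload, "example-secret")

print(verify(payload, signature, "example-secret"))         # prints: True
print(verify(payload + b" ", signature, "example-secret"))  # prints: False
```

The second call shows why the signature must be checked against the raw bytes: re-serializing the JSON before verifying can change whitespace and break the match.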

URL allowlist enforcement

If a tenant has allowed_domains set in the keys file, every plan submitted via /v1/predict is scanned for navigate-type URLs and task_suite.base_url / task.start_url. Off-list hosts return 403 Forbidden before any run starts:

{
  "detail": "plan references host(s) not in tenant allowlist: evil.com"
}

Wildcards: *.example.com matches any subdomain but not example.com.evil.com. Empty allowed_domains (the default) skips this check.
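The wildcard rule can be sketched as a suffix match on a dot boundary. The functions below are hypothetical and illustrate the stated rule only; whether *.example.com also matches the bare example.com is not specified here, so this sketch assumes it does not.

```python
from urllib.parse import urlsplit


def host_allowed(host, allowed_domains):
    """Check one host against an allowlist; an empty list allows everything."""
    if not allowed_domains:
        return True
    host = host.lower().rstrip(".")
    for pattern in allowed_domains:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # "*.example.com" -> require a ".example.com" suffix, so
            # "example.com.evil.com" cannot slip through.
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False


def off_list_hosts(urls, allowed_domains):
    """Return the sorted hosts in urls that the allowlist rejects."""
    hosts = {urlsplit(u).hostname or "" for u in urls}
    return sorted(h for h in hosts if not host_allowed(h, allowed_domains))
```

A pre-flight call to off_list_hosts on the URLs in your plan lets you predict a 403 before submitting.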

Prometheus metrics

GET /metrics returns Prometheus text format. Metric names + labels:

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| mantis_predict_requests_total | counter | tenant_id, mode, outcome | mode = run\|status\|result\|logs\|cancel; outcome = ok\|bad_request\|rate_limited\|denied_allowlist\|idempotent_hit\|error |
| mantis_chat_completions_total | counter | tenant_id, outcome | outcome = ok\|status_4xx\|status_5xx\|upstream_error |
| mantis_run_duration_seconds | histogram | tenant_id, model, status | Buckets: 10s … 3600s |
| mantis_run_cost_usd | histogram | tenant_id, model, status | Buckets: $0.01 … $25 |
| mantis_concurrent_runs | gauge | tenant_id | Currently in-flight runs |
| mantis_rate_limit_rejections_total | counter | tenant_id, kind | kind = rate\|concurrent |

If prometheus_client isn't installed in the container (e.g., orchestrator-only install), /metrics returns 503 and all metric calls become no-ops — the rest of the API is unaffected.

Tier roadmap

This API is at Tier 2 — production-quality multi-tenant. Upcoming:

  • Tier 3: billing records, admin API, multi-region.

See Architecture for the bigger architectural picture.