Mantis CUA HTTP API¶

Reference for callers who want to use the Mantis CUA service directly — without going through a host wrapper. For library-shaped integrations where you drive MicroPlanRunner in your own process, see Embedding MicroPlanRunner and the any-agent integration playbook.

Endpoints¶

Path	Auth	Purpose
`POST /v1/predict`	`X-Mantis-Token` (run scope)	Run a plan, poll status, fetch result. The high-level orchestrator.
`POST /predict`	`X-Mantis-Token` (run scope)	Backwards-compat alias for `/v1/predict`. Identical behavior.
`POST /v1/chat/completions`	`X-Mantis-Token` (run scope)	OpenAI-compat reverse proxy to in-pod Holo3 (raw inference).
`GET /v1/models`	open	OpenAI-compat model list. Returns `holo3`.
`GET /v1/health`, `GET /health`	open	Liveness/readiness probe.
`GET /metrics`	open	Prometheus scrape endpoint. Returns 503 if `prometheus_client` not installed.
`GET /v1/runs/{run_id}/video`	`X-Mantis-Token`	Download the screencast captured during a run. Returns 404 if `record_video` was not requested.

When deployed behind Baseten, all requests must also carry Authorization: Api-Key <BASETEN_API_KEY> (gateway auth, separate from container auth).

Authentication¶

The service uses two layers of auth when deployed on Baseten:

Header	Layer	Purpose
`Authorization: Api-Key <BASETEN_API_KEY>`	Baseten gateway	Authenticates the platform request. Required for any call.
`X-Mantis-Token: <tenant_token>`	Container	Authenticates the tenant. Required for `/v1/predict` and `/v1/chat/completions`.

X-Mantis-Token is split into a custom header (rather than another Authorization: Bearer) because the Baseten gateway's Authorization: Api-Key header is forwarded to the container; using the same header for both auth layers would clash.

If MANTIS_TENANT_KEYS_PATH is configured on the deployment, each tenant has its own token. Otherwise a single MANTIS_API_TOKEN works for all callers (single-tenant mode).
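
As a minimal sketch, both header layers could be assembled like this in Python. The helper name `build_headers` is hypothetical; only the header names and the `Api-Key` scheme come from the table above.

```python
from typing import Optional


def build_headers(mantis_token: str, baseten_api_key: Optional[str] = None) -> dict:
    """Return the auth headers for a Mantis CUA request.

    baseten_api_key is only needed when the service sits behind the
    Baseten gateway; pass None when calling the container directly.
    """
    headers = {
        "X-Mantis-Token": mantis_token,  # container-level tenant auth
        "Content-Type": "application/json",
    }
    if baseten_api_key:
        # Gateway-level auth, separate from the tenant token.
        headers["Authorization"] = f"Api-Key {baseten_api_key}"
    return headers
```

The two credentials stay in separate headers, so neither layer needs to parse the other's value.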

Rate / scale caps¶

Per-request server-side caps that the caller cannot exceed:

Env var	Default	Effect
`MANTIS_MAX_STEPS_PER_PLAN`	200	Plans larger than this are rejected with `400`.
`MANTIS_MAX_LOOP_ITERATIONS`	50	`loop_count` in any `loop` step is silently clamped to this.
`MANTIS_MAX_RUNTIME_MINUTES`	60	`max_time_minutes` in the request body is clamped.
`MANTIS_MAX_COST_USD`	25.0	`max_cost` in the request body is clamped.

Plus per-tenant caps when multi-tenant is enabled (max_concurrent_runs, max_cost_per_run, max_time_minutes_per_run).
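
The difference between clamping and rejecting can be illustrated as follows. This is a sketch of the documented semantics using the default values above, not the server's actual code; `apply_caps` and its argument names are hypothetical.

```python
# Defaults from the table above; a deployment may override them via env vars.
MAX_STEPS_PER_PLAN = 200
MAX_LOOP_ITERATIONS = 50
MAX_RUNTIME_MINUTES = 60
MAX_COST_USD = 25.0


def apply_caps(step_count: int, loop_count: int,
               max_time_minutes: float, max_cost: float) -> dict:
    """Reject oversized plans; clamp every other value to its cap."""
    if step_count > MAX_STEPS_PER_PLAN:
        # Oversized plans are rejected (HTTP 400), not truncated.
        raise ValueError("plan exceeds MANTIS_MAX_STEPS_PER_PLAN")
    return {
        "loop_count": min(loop_count, MAX_LOOP_ITERATIONS),
        "max_time_minutes": min(max_time_minutes, MAX_RUNTIME_MINUTES),
        "max_cost": min(max_cost, MAX_COST_USD),
    }
```

For example, a request asking for `loop_count: 999999` and `max_cost: 100` would run with 50 iterations and a $25 budget under these defaults.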

`POST /v1/predict`¶

Run a plan, poll an existing run, fetch the result, fetch live logs, or cancel a run. The mode is determined by the action field (or its absence).

Run a new plan¶

The request body must contain one of these plan-shape fields. They are listed in priority order, so send exactly one to keep the choice unambiguous:

Field	Type	Description
`task_suite`	object	Inline task-suite dict. Use this for arbitrary plans where you don't want to bake them into the container image.
`task_file_contents`	string	JSON-as-string. Same shape as `task_suite` but pre-serialized.
`task_file`	string	Path inside the container image (e.g. `tasks/crm/crm_tasks.json`).
`micro`	string	Path to a micro-plan JSON or plain-text plan inside the image (e.g. `plans/example/extract_listings.json`).
`plan_text`	string	Inline plain-English plan. Decomposed via Claude on the server side.

Plus the run options:

Field	Default	Description
`detached`	`true`	Return a `run_id` immediately and continue work in the background. Set `false` to block until done (only useful for short plans — 5–10s).
`state_key`	`""`	Caller-chosen identifier; the server prefixes it with `tenant_id` so callers can't collide. Reuse the same key across runs to share checkpoint state and Chrome profile (cookies, sessions).
`resume_state`	`false`	Reconstruct browser state from the latest checkpoint at `state_key` before starting.
`max_cost`	`25.0`	Cap in USD; clamped against the tenant cap.
`max_time_minutes`	`60`	Wall-clock cap; clamped against the tenant cap.
`proxy_city`, `proxy_state`	unset	Optional IPRoyal geo overrides. Subject to allowlist.
`record_video`	`false`	If true, captures the Xvfb display while the run executes and saves a screencast under the per-tenant run dir. Fetch via `GET /v1/runs/{run_id}/video`.
`video_format`	`"mp4"`	One of `mp4`, `webm`, `gif`.
`video_fps`	`5`	Capture rate; clamped to `[1, 30]`. Higher fps = larger file + more CPU.

Detached response¶

{
  "status": "queued",
  "created_at": "2026-04-28T01:57:08.316Z",
  "model": "holo3",
  "mode": "detached",
  "run_id": "20260428_021432_076255ef",
  "payload": { ... echoed input ... },
  "updated_at": "2026-04-28T01:57:08.317Z",
  "status_path":  "/workspace/mantis-data/runs/<run_id>/status.json",
  "result_path":  "/workspace/mantis-data/runs/<run_id>/result.json",
  "csv_path":     "/workspace/mantis-data/runs/<run_id>/leads.csv",
  "events_path":  "/workspace/mantis-data/runs/<run_id>/events.log"
}

The *_path fields are server-internal paths that callers cannot read directly. Fetch their contents through the polling actions (next section).

Poll / fetch / cancel an existing run¶

Set action and run_id in the body:

{ "action": "status", "run_id": "20260428_021432_076255ef" }
{ "action": "result", "run_id": "..." }
{ "action": "logs",   "run_id": "...", "tail": 200 }
{ "action": "cancel", "run_id": "..." }

status returns the current state plus a summary block when the run is in a terminal state:

{
  "status": "succeeded",          // or queued | running | failed | cancelled
  "run_id": "...",
  "started_at": "...",
  "finished_at": "...",
  "summary": {
    "total_time_s": 569,
    "steps_executed": 17,
    "viable": 3,
    "leads_with_phone": 1,
    "result_path": "...",
    "csv_path": "...",
    "dynamic_verification_summary": { ... },
    "cost_total": 0.42,
    "cost_breakdown": {
      "gpu":    0.12,
      "claude": 0.12,
      "proxy":  0.18
    }
  }
}

result returns the full lead list and per-step trace. logs returns the last tail events written by the runner (default 200, max 10000).
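
A polling loop over the status action might look like the sketch below. The `post` callable is a hypothetical transport hook (it sends a JSON body to /v1/predict and returns the decoded response), injected so the loop can be tested without a network; `max_polls` is likewise an assumption, added to avoid polling forever.

```python
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def poll_until_terminal(post: Callable[[dict], dict], run_id: str,
                        interval_s: float = 30.0, max_polls: int = 240,
                        sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll action=status until the run reaches a terminal state."""
    for _ in range(max_polls):
        status = post({"action": "status", "run_id": run_id})
        if status.get("status") in TERMINAL_STATES:
            return status  # includes the summary block
        sleep(interval_s)
    raise TimeoutError(f"run {run_id} not terminal after {max_polls} polls")
```

With the defaults, this gives up after about two hours (240 polls at 30 s), which comfortably covers the 60-minute default runtime cap.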

Errors¶

Status	Meaning
`400`	Bad request. Common causes: no plan-shape provided, malformed JSON, plan exceeds `MANTIS_MAX_STEPS_PER_PLAN`, micro-step missing `intent`/`type`.
`401`	Missing or invalid `X-Mantis-Token`.
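`403`	Token valid but tenant lacks `run` scope (read-only key), or (Tier 2) the plan references a host outside the tenant's `allowed_domains`.
`404`	`action=status|result|logs` referenced an unknown `run_id`.
`429`	(Tier 2) Tenant exceeded its concurrent-run cap or per-minute rate limit. Honor the `Retry-After` header.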
`500`	Unhandled exception — check `events_path` for traceback.
`502`	Upstream Holo3 (`/v1/chat/completions`) or Anthropic API unreachable.
`503`	Server auth not configured (`MANTIS_API_TOKEN` unset and no keys file).

`POST /v1/chat/completions`¶

OpenAI-compatible reverse proxy to the in-pod Holo3 server. For raw inference only — no plan orchestration, no Claude grounding, no checkpointing. Designed for clients that want to run their own perception-action loop and use Holo3 as the brain.

curl -X POST "https://model-qvvgkneq.api.baseten.co/production/sync/v1/chat/completions" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "holo3",
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        {"type": "text", "text": "Click the boat listing title."}
      ]}
    ],
    "max_tokens": 256
  }'

Auth headers and Mantis-side cookies are stripped before the request is forwarded to llama.cpp; the upstream never sees your tenant credentials.

For the orchestrated/reliable path that handles the full plan, use /v1/predict instead.
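
For clients building the request in code, the body from the curl example can be assembled as sketched below. `build_chat_payload` is a hypothetical helper; the payload shape mirrors the example above, and the screenshot is assumed to be PNG bytes already in memory.

```python
import base64


def build_chat_payload(png_bytes: bytes, instruction: str,
                       max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat body with one screenshot and one prompt."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": "holo3",
        "messages": [{
            "role": "user",
            "content": [
                # The screenshot travels inline as a data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encoded}"}},
                {"type": "text", "text": instruction},
            ],
        }],
        "max_tokens": max_tokens,
    }
```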

`GET /v1/models`¶

OpenAI-compatible model listing.

{
  "object": "list",
  "data": [
    { "id": "holo3", "object": "model", "owned_by": "mantis" }
  ]
}

End-to-end example: 3-listing extraction¶

TOKEN=$(read -srp "MANTIS_API_TOKEN: " v && echo "$v")
BTKEY="$BASETEN_API_KEY"
# Baseten gateway forwards /sync/<any path> to the container. /predict is
# the legacy default route (equivalent to /sync/predict).
ENDPOINT="https://your-model.api.baseten.co/production/sync"

# 1. Launch detached run — supply your own plan_text or a micro-plan.
RESP=$(curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "detached": true,
    "plan_text": "Extract the first 3 listings from <your URL>: year, make, model, price, phone, url.",
    "state_key": "smoke-test",
    "resume_state": false,
    "max_cost": 2,
    "max_time_minutes": 20
  }')
RUN_ID=$(echo "$RESP" | jq -r .run_id)
echo "run_id: $RUN_ID"

# 2. Poll status until terminal
while true; do
  STATUS=$(curl -fsS -X POST "$ENDPOINT/v1/predict" \
    -H "Authorization: Api-Key $BTKEY" \
    -H "X-Mantis-Token: $TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"action\":\"status\",\"run_id\":\"$RUN_ID\"}" | jq -r .status)
  echo "$(date '+%H:%M:%S') $STATUS"
  case "$STATUS" in succeeded|failed|cancelled) break ;; esac
  sleep 30
done

# 3. Fetch leads
curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"action\":\"result\",\"run_id\":\"$RUN_ID\"}" \
  | jq .result.leads

Result shape (one row per successfully extracted listing):

<year> <make> <model>  — <price> — phone <phone or 'none'>
<year> <make> <model>  — <price>
<year> <make> <model>  — <price>

Plan shapes — when to use which¶

Use case	Recommended shape
Recurring high-volume workflow with predictable steps	Hand-author a micro-plan JSON, ship it in the image at `plans/<domain>/<workflow>.json`, reference via `micro`
Arbitrary plain-English request	`plan_text` — server decomposes it via Claude (cached after first run)
Ad-hoc plan you don't want baked into the image	`task_suite` (inline JSON dict)
Multi-task suite with `task_id` + `verify` clauses	`task_suite` or `task_file`

Plan formats¶

`micro` — micro-plan JSON¶

A flat list of step objects executed by MicroPlanRunner:

[
  {"intent": "Navigate to https://...", "type": "navigate",
   "section": "setup", "required": true},
  {"intent": "Verify filters applied",  "type": "extract_data",
   "claude_only": true, "section": "setup", "gate": true,
   "verify": "Page shows boat listings ..."},
  {"intent": "Click listing title",     "type": "click",
   "grounding": true, "section": "extraction"},
  {"intent": "Read URL",                "type": "extract_url",
   "claude_only": true, "section": "extraction"},
  {"intent": "Scroll to description",   "type": "scroll",
   "budget": 10, "section": "extraction"},
  {"intent": "Extract data",            "type": "extract_data",
   "claude_only": true, "section": "extraction"},
  {"intent": "Go back",                 "type": "navigate_back",
   "section": "extraction"},
  {"intent": "Loop",                    "type": "loop",
   "loop_target": 2, "loop_count": 3, "section": "extraction"}
]

Step types: navigate, filter, click, scroll, extract_url, extract_data, navigate_back, paginate, loop.

Key fields:

Field	Effect
`section`	One of `setup`, `extraction`, `pagination`. Used by retry/halt logic.
`required`	If true, retry on fail then halt the whole run.
`gate`	Claude verifies a condition; halt on fail.
`verify`	Free-text condition Claude checks.
`claude_only`	Skip Holo3; Claude does the perception. Use for extract / gate steps.
`grounding`	Refine click coordinates with `ClaudeGrounding`.
`budget`	Max actions Holo3 can take in this step (default 8).
`loop_target`	Step index to jump back to.
`loop_count`	Max loop iterations (clamped to `MANTIS_MAX_LOOP_ITERATIONS`).
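
Before submitting a micro-plan, a client-side check can catch the most common `400` causes early. The sketch below is not the server's validator. It assumes `loop_target` is a zero-based index that points at an earlier step, which matches the example above but is not stated explicitly; `validate_micro_plan` is a hypothetical name.

```python
STEP_TYPES = {"navigate", "filter", "click", "scroll", "extract_url",
              "extract_data", "navigate_back", "paginate", "loop"}


def validate_micro_plan(steps: list) -> list:
    """Return human-readable problems; an empty list means the plan looks OK."""
    problems = []
    for i, step in enumerate(steps):
        for field in ("intent", "type"):
            if not step.get(field):
                problems.append(f"step {i}: missing '{field}'")
        step_type = step.get("type")
        if step_type and step_type not in STEP_TYPES:
            problems.append(f"step {i}: unknown type '{step_type}'")
        if step_type == "loop":
            target = step.get("loop_target")
            # Assumption: loop_target is zero-based and refers to an earlier step.
            if not isinstance(target, int) or not 0 <= target < i:
                problems.append(f"step {i}: loop_target must index an earlier step")
    return problems
```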

`task_suite` — multi-task JSON¶

For Claude-CUA-style autonomous-per-task workflows (the existing tasks/crm/crm_tasks.json is this shape):

{
  "session_name": "crm_demo",
  "base_url": "https://crm.example.com",
  "auth": { "user_id": "...", "password": "..." },
  "tasks": [
    {
      "task_id": "login",
      "intent": "Go to https://... and log in with user X and password Y",
      "save_session": true,
      "start_url": "https://...",
      "verify": { "type": "url_not_contains", "value": "login" }
    },
    {
      "task_id": "update_lead_industry",
      "intent": "Go to the Leads Page. Update industry of qualified lead to 'Space Exploration'.",
      "require_session": true,
      "start_url": "https://...",
      "verify": { "type": "page_contains_text", "value": "Space Exploration" }
    }
  ]
}

Each task runs with its own max_steps budget, and Claude decides what to do within each task. After each task, the runner checks that task's verify clause.

`plan_text` — plain-English¶

{
  "plan_text": "Go to a marketplace listings site, filter to private sellers above $35,000 in Florida, extract listing details for the first 3 listings, save year/make/model/price/phone."
}

PlanDecomposer (Claude-backed, cached by signature) converts this into a micro-plan and proceeds. Decomposition costs ~$0.10 the first time per unique plan text; subsequent runs hit the cache.

Pricing (verified end-to-end)¶

Real numbers from a 3-listing marketplace-extraction run on Baseten:

Item	Cost
GPU (Holo3 on H100, ~10 min)	~$0.12
Claude (gates + extract + grounding)	~$0.12
Proxy (IPRoyal residential)	~$0.18
Total per 3-listing run	~$0.42
Per-listing	~$0.14

For comparison, an equivalent Claude-only CUA flow costs roughly $0.50–$1.50 per listing.
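
The totals in the table follow directly from the per-item figures. As a quick check, using the same numbers as the `cost_breakdown` block in the status response:

```python
# Per-item costs from the table above (USD, approximate).
cost_breakdown = {"gpu": 0.12, "claude": 0.12, "proxy": 0.18}
listings = 3

total = round(sum(cost_breakdown.values()), 2)
per_listing = round(total / listings, 2)

print(f"total: ${total:.2f}, per listing: ${per_listing:.2f}")
```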

Security model¶

Concern	Guarantee
Tenant token confidentiality	Stored in Baseten secrets; constant-time compare on validation; never echoed in logs
Per-tenant Anthropic key	Resolved from the tenant's `anthropic_secret_name` — keys are not shared across tenants
Per-tenant browser profile	Mounted at `/workspace/mantis-data/tenants/<tenant_id>/chrome-profile/<state_key>/` — cookies cannot bleed across tenants
Per-tenant run state	Same volume layout — `state_key` is server-prefixed so callers cannot read another tenant's checkpoint
Plan injection (e.g., `loop_count: 999_999`)	Server-side hard caps clamp the values; oversized plans are rejected with `400`
Upstream credential leak	`/v1/chat/completions` strips `X-Mantis-Token`, `Authorization`, `Cookie` before forwarding to in-pod llama.cpp
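
The credential-stripping guarantee in the last row amounts to a case-insensitive header filter. A minimal sketch, assuming headers arrive as a plain dict (`strip_sensitive_headers` is a hypothetical name, not the proxy's actual function):

```python
# Header names from the table above, lowercased for comparison.
SENSITIVE_HEADERS = {"x-mantis-token", "authorization", "cookie"}


def strip_sensitive_headers(headers: dict) -> dict:
    """Drop credential-bearing headers before forwarding upstream."""
    return {name: value for name, value in headers.items()
            if name.lower() not in SENSITIVE_HEADERS}
```

Comparing lowercased names matters because HTTP header names are case-insensitive, so `authorization` and `Authorization` must both be removed.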

Limits / caveats¶

Detached runs survive replica restart (state on the data volume) but only on the same Baseten model. Cross-region failover not supported.
Pause/resume for OTP is not yet wired through /v1/predict. It works today in library-embedded integrations because the loop runs in the host's own process — see Embedding MicroPlanRunner.
/v1/chat/completions is unstreamed in v1. Streaming SSE is a Tier 2 follow-up.
Each tenant uses a single Anthropic key, re-resolved on every call.

Screencast / video recording¶

Send a plan with record_video: true and the runtime produces a feature-walkthrough video — title card → captioned run footage → outro card with the result summary. Fetch with GET /v1/runs/{run_id}/video. The raw screencast is preserved alongside; pass ?raw=1 to fetch it instead.

The walkthrough has three segments plus animated click ripples on top of the run footage:

┌─────────────────┐  ┌─────────────────────────┐  ┌─────────────────┐
│  Title card     │→ │  Run footage (captions  │→ │  Outro card     │
│  (3s)           │  │   + click ripples)      │  │  (5s)           │
│                 │  │  per-step intent shown  │  │                 │
│  Mantis CUA     │  │  with [OK] / [FAIL]     │  │  Run complete   │
│  ───            │  │  in the bottom strip    │  │  ───            │
│  <plan name>    │  │  while the action plays │  │  3 viable leads │
│  tenant: …      │  │  + expanding sky-blue   │  │  1 with phone   │
│  run: …         │  │  ripple at every click  │  │  17 steps · 9m  │
│                 │  │                         │  │  cost: $0.42    │
└─────────────────┘  └─────────────────────────┘  └─────────────────┘

Title and outro are rendered with PIL. Captions are SRT cues burned in by ffmpeg's subtitles= filter (libass). Click ripples are PNG-sequence overlay frames composited via ffmpeg's overlay filter. Polish is best-effort — if anything fails (PIL, ffmpeg, libass not built in the image), the raw recording is still saved and the endpoint serves it.

Action overlays — universal computer use¶

Every kind of agent action gets a visual cue, regardless of what application is in focus (browser, file manager, terminal, dialogs, anything visible on the Xvfb display). The agent emits actions with pixel coordinates / key chords / text, and the overlay renderer composites the matching visual onto the recording.

Agent action	Overlay
`CLICK` (single)	Sky-blue expanding ripple at (x, y), 0.6 s, fades out
`DOUBLE_CLICK`	Same as click + a second offset ring 0.1 s later
`KEY_PRESS` (e.g. `Ctrl+S`, `Tab`, `Enter`)	Slate badge in the bottom-right with the chord text, 1.5 s, slide-in then fade
`TYPE` (typed text)	"⌨ Typing: \"…\"" caption near the top, 1.8 s, fades after text appears on screen
`SCROLL` (`up` / `down` / `left` / `right`)	Sky-blue arrow at the matching screen edge, slides in the scroll direction, 0.8 s
`DRAG`	Animated trail line from start to end with a moving head dot, 0.9 s
`WAIT`, `NAVIGATE`, `DONE`	No overlay (no useful visual locus)

All overlays are deliberately minimal: visible without being disruptive. They share a sky-blue accent color so they read as a single visual language.

You'll see counts in the result metadata under video.actions:

{
  "video": {
    "path": ".../recording.mp4",
    "polished_path": ".../recording_polished.mp4",
    "actions": {
      "clicks": 17,
      "keys":   3,
      "types":  2,
      "scrolls": 8,
      "drags":  0
    },
    "clicks": 17,    // backwards-compat field
    ...
  }
}

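Example: submit a recorded run and download the screencast. This reuses ENDPOINT, BTKEY, and TOKEN from the end-to-end example.

# 1. Submit a recorded run
RESP=$(curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \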
  -H "Content-Type: application/json" \
  -d '{
    "detached": true,
    "micro": "plans/example/extract_listings.json",
    "state_key": "demo-recording",
    "max_cost": 2,
    "max_time_minutes": 20,
    "record_video": true,
    "video_format": "mp4",
    "video_fps": 8
  }')
RUN_ID=$(echo "$RESP" | jq -r .run_id)

# 2. Poll status until terminal (succeeded | failed | cancelled), same as the regular flow

# 3. Download the screencast
curl -fsS -o demo.mp4 \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \
  "$ENDPOINT/v1/runs/$RUN_ID/video"

Result-side metadata (in the summary block):

{
  "video": {
    "path": "/workspace/mantis-data/tenants/<tenant>/runs/<run_id>/recording.mp4",
    "polished_path": "/workspace/mantis-data/tenants/<tenant>/runs/<run_id>/recording_polished.mp4",
    "format": "mp4",
    "duration_seconds": 567.3,
    "bytes": 31457280,
    "error": null
  }
}

polished_path is set only when the post-process compose step succeeded; on failure it's omitted and the endpoint falls back to the raw recording.

Endpoint behavior¶

Request	Returns
`GET /v1/runs/{run_id}/video`	Polished mp4 (preferred) → raw mp4 (fallback) → 404
`GET /v1/runs/{run_id}/video?raw=1`	Raw mp4 only → 404
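
The fallback order in the table can be expressed as a short selection function. This is a sketch of the documented behavior rather than the endpoint's code; `select_video` is a hypothetical name, and `video_meta` is assumed to be the `video` block from the run summary.

```python
import os
from typing import Callable, Optional


def select_video(video_meta: dict, raw: bool = False,
                 exists: Callable[[str], bool] = os.path.exists) -> Optional[str]:
    """Return the file path to serve, or None when the answer is a 404."""
    candidates = []
    if not raw:
        # Prefer the polished walkthrough unless ?raw=1 was passed.
        candidates.append(video_meta.get("polished_path"))
    candidates.append(video_meta.get("path"))
    for path in candidates:
        if path and exists(path):
            return path
    return None
```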

Format tradeoffs¶

Format	Encoder (settings)	Encode cost	Output size (typical 10-min run)	Best for
`mp4`	H.264 (libx264, `ultrafast` preset, CRF 28)	low	~30–80 MB	sharing, downloads
`webm`	VP9 (libvpx-vp9, cpu-used 5, CRF 32)	medium	~25–60 MB	embedding in web pages
`gif`	palettegen + paletteuse	high	~50–200 MB	docs, Slack, animated thumbnails (lossy)

For long recordings or tight bandwidth, prefer mp4 at 5 fps. The gif path uses a palette-aware filtergraph but file size grows fast — use only for short demos (< 60 s).

Operational caveats¶

The container image must have ffmpeg installed. Both docker/server.Dockerfile and deploy/baseten/holo3/config.yaml ship it; if you're rolling your own image, add ffmpeg to the apt deps. Without ffmpeg, record_video: true is a soft-fail — the run completes normally, and the response carries video.error: "ffmpeg-not-installed".
Recordings live at $MANTIS_DATA_DIR/tenants/<tenant_id>/runs/<run_id>/recording.<fmt> so tenants cannot read each other's files. The download endpoint uses the authenticated tenant's dir; even if you guess another tenant's run_id, the file lookup is scoped.
video_fps is clamped to [1, 30]. Higher fps doesn't help much (UI rarely changes faster than 5–10 fps) and bloats the file.
Each second of recording is ~50 KB at 5 fps mp4. Multiply by your target run duration and tenant count to size the EFS / Filestore volume.
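
As a rough worked example of the sizing rule in the last item, the sketch below multiplies the ~50 KB/s figure through. The `runs_per_tenant` parameter (how many recordings you retain per tenant) is an assumption added for illustration; the text only mentions run duration and tenant count.

```python
def estimate_storage_gb(run_minutes: float, runs_per_tenant: int,
                        tenants: int, kb_per_second: float = 50.0) -> float:
    """Rough storage needed for retained recordings, in decimal GB."""
    seconds = run_minutes * 60
    total_kb = seconds * kb_per_second * runs_per_tenant * tenants
    return total_kb / 1_000_000  # KB -> GB (decimal)


# 10-minute runs, 100 retained recordings per tenant, 20 tenants.
print(estimate_storage_gb(10, 100, 20))
```

A single 10-minute run comes to about 30 MB under this estimate, which sits at the low end of the mp4 range in the format table.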

Tier 2 features (rate limits, idempotency, webhooks, allowlist, metrics)¶

Rate limits¶

Two dimensions, both enforced per-tenant:

Dimension	Source	Behavior on exceed
Concurrent runs	`tenant.max_concurrent_runs` (default 5)	`429 Too Many Requests` with `Retry-After: 5`
Rate (token bucket)	`tenant.rate_limit_per_minute` (default 30)	`429` with `Retry-After: <seconds-until-token>`

State is in-process per replica. Behind a load balancer with N replicas, the effective per-tenant cap is roughly N × configured_cap. For strict cluster-wide limits, deploy a single replica or swap to a Redis-backed limiter (planned Tier 2.5).
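
On the client side, a `429` is best handled by waiting for the advertised `Retry-After` interval. A minimal sketch, assuming a hypothetical `send` callable that performs one request and returns `(status_code, headers, body)`:

```python
import time
from typing import Callable, Tuple


def post_with_retry(send: Callable[[], Tuple[int, dict, dict]],
                    max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, dict]:
    """Call `send` until it stops returning 429 or attempts run out."""
    status, body = 0, {}
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            break
        try:
            # Fall back to 5 s if the header is missing or malformed.
            delay = float(headers.get("Retry-After", 5))
        except ValueError:
            delay = 5.0
        if attempt < max_attempts - 1:
            sleep(delay)
    return status, body
```

Pairing this with an idempotency key keeps a retried launch from starting a second run.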

Idempotency keys¶

Send Idempotency-Key: <unique-string> on POST /v1/predict. The server caches (tenant_id, key) → run_id with a 24-hour TTL. Subsequent retries with the same key return the original run_id without starting a new run.

curl -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BTKEY" \
  -H "X-Mantis-Token: $TOKEN" \
  -H "Idempotency-Key: order-7afc3b91" \
  -H "Content-Type: application/json" \
  -d '{...}'

The cache is sidecar-backed ($MANTIS_DATA_DIR/idempotency/<tenant_id>/<key_hash>.json) so a replica restart preserves it.
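
The lookup semantics can be modeled in a few lines. The class below is an in-memory illustration of the `(tenant_id, key) → run_id` mapping with a 24-hour TTL; it is not the server's file-backed implementation, and `IdempotencyCache` is a hypothetical name.

```python
import time

TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL from the text above


class IdempotencyCache:
    """In-memory model of (tenant_id, key) -> run_id with expiry."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # (tenant_id, key) -> (run_id, stored_at)

    def get_or_create(self, tenant_id: str, key: str, start_run) -> str:
        entry = self._entries.get((tenant_id, key))
        if entry and self._clock() - entry[1] < TTL_SECONDS:
            return entry[0]  # retry: return the original run_id
        run_id = start_run()  # first use (or expired entry): start a run
        self._entries[(tenant_id, key)] = (run_id, self._clock())
        return run_id
```

Because the tenant ID is part of the cache key, two tenants that happen to send the same `Idempotency-Key` do not collide.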

Webhook callbacks¶

Two ways to receive run-completion notifications:

Per-tenant default — set webhook_url and webhook_secret_name in the tenant keys file.
Per-request override — pass callback_url in the /v1/predict body.

When the run reaches a terminal state (succeeded, failed, cancelled), the server POSTs:

{
  "run_id": "20260428_021432_076255ef",
  "tenant_id": "tenant_a",
  "status": "succeeded",
  "summary": { ... same shape as /v1/predict status response ... },
  "delivered_at": "2026-04-28T02:24:01.648Z"
}

The request carries an HMAC-SHA256 signature in X-Mantis-Signature: sha256=<hex>, signed with the tenant's webhook secret. If the receiver returns a non-2xx status or the connection fails, the server retries up to 3 times with increasing backoff (1s, 5s, 30s).

Verify the signature on receipt:

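import hashlib
import hmac

def verify(body: bytes, header_sig: str, secret: str) -> bool:
    # Recompute the signature over the exact bytes received, before JSON parsing.
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, header_sig)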
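
To exercise a receiver locally, you can produce a signature the same way. The sketch below assumes the server signs the raw request body bytes, which is what the verification function implies; `sign` and the test secret are hypothetical.

```python
import hashlib
import hmac
import json


def sign(body: bytes, secret: str) -> str:
    """Produce an X-Mantis-Signature-style value for a payload."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


body = json.dumps({"run_id": "r1", "status": "succeeded"}).encode()
signature = sign(body, "test-secret")

# Recomputing over the same bytes and secret yields the same value.
same = hmac.compare_digest(signature, sign(body, "test-secret"))
# Any change to the body (even whitespace) yields a different value.
tampered = hmac.compare_digest(signature, sign(body + b" ", "test-secret"))
```

This is why receivers should verify against the raw bytes rather than a re-serialized JSON object: re-serialization can change whitespace or key order and break the match.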

URL allowlist enforcement¶

If a tenant has allowed_domains set in the keys file, every plan submitted via /v1/predict is scanned for navigate-type URLs and task_suite.base_url / task.start_url. Off-list hosts return 403 Forbidden before any run starts:

{
  "detail": "plan references host(s) not in tenant allowlist: evil.com"
}

Wildcards: *.example.com matches any subdomain but not example.com.evil.com. Empty allowed_domains (the default) skips this check.
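
The wildcard rule can be sketched as a suffix match anchored at a label boundary. This is an illustration, not the server's matcher: it assumes that `*.example.com` does not match the bare apex `example.com`, which the text leaves unspecified, and `host_allowed` is a hypothetical name.

```python
def host_allowed(host: str, allowed_domains: list) -> bool:
    """Check a hostname against exact and '*.'-wildcard allowlist entries."""
    if not allowed_domains:
        return True  # an empty allowlist skips the check
    host = host.lower().rstrip(".")
    for pattern in allowed_domains:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # Match on ".example.com" so the suffix starts at a dot.
            # Assumption: the bare apex is not covered by the wildcard.
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False
```

Matching on the suffix including its leading dot is what rejects `example.com.evil.com` and look-alikes such as `notexample.com`.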

Prometheus metrics¶

GET /metrics returns Prometheus text format. Metric names + labels:

Metric	Type	Labels	Notes
`mantis_predict_requests_total`	counter	`tenant_id`, `mode`, `outcome`	mode = `run|status|result|logs|cancel`; outcome = `ok|bad_request|rate_limited|denied_allowlist|idempotent_hit|error`
`mantis_chat_completions_total`	counter	`tenant_id`, `outcome`	outcome = `ok|status_4xx|status_5xx|upstream_error`
`mantis_run_duration_seconds`	histogram	`tenant_id`, `model`, `status`	Buckets: 10s … 3600s
`mantis_run_cost_usd`	histogram	`tenant_id`, `model`, `status`	Buckets: $0.01 … $25
`mantis_concurrent_runs`	gauge	`tenant_id`	Currently in-flight runs
`mantis_rate_limit_rejections_total`	counter	`tenant_id`, `kind`	kind = `rate|concurrent`

If prometheus_client isn't installed in the container (e.g., orchestrator-only install), /metrics returns 503 and all metric calls become no-ops — the rest of the API is unaffected.

Tier roadmap¶

This API is at Tier 2 — production-quality multi-tenant. Upcoming:

Tier 3: billing records, admin API, multi-region.

See Architecture for the broader system picture.