Errors

What every status code means and how to handle it.

Quick reference

| Status | Meaning | Retry? |
| --- | --- | --- |
| 200 | Success | n/a |
| 400 | Bad request: malformed JSON, oversized plan, missing intent/type on a step | No: fix the payload |
| 401 missing | No X-Mantis-Token | No: add the header |
| 401 invalid | Token doesn't match a tenant | No: verify the token wasn't truncated |
| 403 scope | Token valid but tenant lacks the required scope (run / status / result / logs) | No: operator needs to add scope |
| 403 allowlist | Plan references a host not in tenant's allowed_domains | No: fix the plan or ask operator to add the domain |
| 404 run | action=status\|result\|logs referenced a run_id your tenant doesn't own | No: wrong run_id or wrong tenant |
| 404 video | No recording for the run (recording disabled or ffmpeg failed) | No |
| 409 profile-busy (#342, Modal HTTP endpoint only) | Another run is currently holding the requested profile_id's Chrome user-data-dir lock. The detail includes the held run_id so you can poll it. | Yes: wait for the held run to finish, or submit with a different profile_id |
| 429 rate | Tenant exceeded rate_limit_per_minute | Yes: honor Retry-After header |
| 429 concurrent | Tenant at max_concurrent_runs | Yes: honor Retry-After |
| 500 | Unhandled exception | Sometimes: check {"action":"logs", ...} for the traceback before retrying |
| 502 upstream | Holo3 (/v1/chat/completions) or Anthropic API unreachable | Yes: exponential backoff |
| 503 auth-not-configured | Server has neither MANTIS_API_TOKEN nor MANTIS_TENANT_KEYS_PATH set | No: operator misconfiguration |
| 503 metrics | prometheus_client not installed in the container; /metrics only | n/a |

Error body shape

{
  "detail": "human-readable error string"
}

The detail string is meant to be safe to surface to your users (no internals leaked). For deeper debugging, use {"action": "logs", "run_id": "...", "tail": 500} to get the runner's event log.

Run-level failures

A run can return 200 OK and still have status: failed:

{
  "status": "failed",
  "run_id": "...",
  "error": "page_blocked",       // or "max_cost_exceeded", "timeout", etc.
  "summary": { ... whatever was completed ... }
}

| error value | Cause | Fix |
| --- | --- | --- |
| page_blocked | The site detected the agent / Cloudflare didn't pass | Different proxy, geo, or reduce request rate |
| max_cost_exceeded | Budget cap hit before the plan completed | Raise max_cost (within tenant cap) or shorten the plan |
| timeout | max_time_minutes hit | Raise max_time_minutes or shorten the plan |
| gate_failed | A gate: true step's verify clause was false | The site didn't reach the expected state; try different filters / start URL |
| extract_failed | Claude couldn't parse the screenshot into the expected schema | Schema mismatch with what's actually visible; adjust the plan |
| navigation_blocked | Cloudflare 403 or sustained 5xx from the target | Wait + retry; or try with a fresh profile_id (different cookies / IP rotation) |

These are partial successes — the summary block reflects whatever was completed (e.g., 2 of 3 listings extracted before the cap hit).
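
Because a 200 response does not imply a successful run, a client has to look at the body rather than the HTTP status. A minimal sketch, assuming jq is available (it is used elsewhere on this page); the function name run_outcome is hypothetical and not part of any Mantis tooling:

```shell
# Hypothetical helper: read a run response body on stdin.
# Prints "failed:<error value>" for a failed run, otherwise the status verbatim.
run_outcome() {
  jq -r 'if .status == "failed"
         then "failed:" + (.error // "unknown")
         else (.status // "unknown") end'
}
```

Piping a response through it, for example `curl ... | run_outcome`, gives a single token to branch on before reading the summary block.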

Retry guidance

| Code | Retry strategy |
| --- | --- |
| 4xx (except 409 and 429) | Don't retry; fix the request |
| 409 | Wait for the held run to finish (poll the run_id given in detail), or resubmit with a different profile_id |
| 429 | Honor Retry-After; capped exponential backoff after that |
| 500 | Inspect {"action":"logs"} first; if it looks transient, retry with a fresh workflow_id (and a fresh profile_id if the failure looks tied to a corrupted Chrome session) |
| 502 | Exponential backoff (5 s → 30 s → 2 min); the upstream Holo3 / Anthropic might be down |
| 503 (auth) | Don't retry; operator action needed |

If you're hitting 429 frequently, ask your operator to raise the per-tenant rate_limit_per_minute or max_concurrent_runs — see Rate limits.
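
The guidance above can be folded into a small dispatch function. This is a minimal POSIX shell sketch, not part of the Mantis tooling: the names retry_strategy and backoff_seconds and the keywords they print are hypothetical, and the 409 branch follows the profile-busy entry in the Quick reference.

```shell
# Hypothetical helper: map an HTTP status code to a retry strategy keyword.
retry_strategy() {
  case "$1" in
    2[0-9][0-9]) echo "success" ;;
    409)         echo "wait-or-new-profile" ;;  # profile-busy: poll the held run_id
    429)         echo "honor-retry-after" ;;
    500)         echo "inspect-logs-first" ;;
    502)         echo "backoff" ;;
    503)         echo "no-retry" ;;             # operator action needed
    4[0-9][0-9]) echo "no-retry" ;;             # fix the request instead
    *)           echo "unknown" ;;
  esac
}

# Hypothetical helper: seconds to wait before 502 retry attempt N (1-based),
# following the 5 s -> 30 s -> 2 min schedule; later attempts stay at 2 min.
backoff_seconds() {
  case "$1" in
    1) echo 5 ;;
    2) echo 30 ;;
    *) echo 120 ;;
  esac
}
```

In a retry loop this would typically be used as `sleep "$(backoff_seconds "$attempt")"` when retry_strategy prints backoff.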

Idempotency for safe retries

For workflows where double-execution would be expensive (e.g., extracting from a paginated site twice), use Idempotency-Key:

KEY="my-job-$(uuidgen)"
curl -X POST "$ENDPOINT/v1/predict" \
  -H "Idempotency-Key: $KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{...}'

Every retry that sends the same Idempotency-Key returns the cached run_id (24 h TTL). See Idempotency.
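
The key only protects you if it is generated once and reused across attempts. A minimal sketch of that invariant, assuming a caller-supplied submit command that reads the key from the environment; the function name with_idempotency_key, the IDEMPOTENCY_KEY variable, and the limit of three attempts are all hypothetical:

```shell
# Hypothetical helper: run a submit command up to 3 times with ONE key.
# Usage: with_idempotency_key <key> <command> [args...]
with_idempotency_key() {
  IDEMPOTENCY_KEY="$1"; shift
  export IDEMPOTENCY_KEY          # the same value is visible to every attempt
  attempt=1
  while [ "$attempt" -le 3 ]; do
    "$@" && return 0              # stop on the first successful attempt
    attempt=$((attempt + 1))      # real use would sleep here before retrying
  done
  return 1
}
```

The submit command would pass the value on as `-H "Idempotency-Key: $IDEMPOTENCY_KEY"`. Generating the key inside the loop instead would defeat the cache, because each attempt would look like a new job.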

Diagnosing a failed step

When a step fails, action=result returns each failed step with structured diagnostics — read these before falling back to action=logs. The full schema is documented in mantis plan runresult.json; the fields you'll care about per failed step are:

| Field | What it tells you |
| --- | --- |
| failure_class | Stable enum (see table below). Branch on this in dashboards / retry policies. |
| final_url | Browser URL at the moment of failure. |
| page_title | Page title at the moment of failure. CF interstitials surface here even when data is empty. |
| last_action | The final Action dispatched before failure ({type, params, reasoning}). |
| screenshot_b64 | Base64-encoded PNG of the post-failure viewport. |
| data | Short prose from the handler (gate:FAIL:Error 403, fill_error: not found, …). |

failure_class → likely cause → first action

| failure_class | Likely cause | First action |
| --- | --- | --- |
| cf_challenge | Cloudflare / anti-bot interstitial didn't clear | Retry with a fresh profile_id (rotates the IP / cookies). For repeated failures, ask operator to bump MANTIS_CF_PREWARM_MAX_SECONDS. |
| http_4xx | Target returned 401 / 404 / 410 | Check final_url; usually a stale URL in the plan or a tenant-scoped page. Not retryable. |
| http_5xx | Target backend returned 5xx | Exponential backoff + retry. Usually transient. |
| nav_timeout | Page load exceeded the navigate budget | Bump wait_after_load_seconds on the step, or MANTIS_NAV_WAIT_SECONDS. Repeated → check egress / proxy. |
| selector_miss | Click / fill / submit couldn't locate the target | Inspect screenshot_b64; the page is often in a different state than the plan expects. Adjust the plan, or add a wait step before. |
| no_state_change | Action handler reported success but the runner-state snapshot saw no URL / page / scroll change. Self-healing demotion (epic #377 Phase A). Fires on click / submit / navigate_back. | Usually transient: the runner already triggered a retry and (after 2× repeats on the same step) routed through Holo3StepHandler. If it persists into terminal failure, the click is hitting an element that doesn't actually navigate; inspect screenshot_b64 + last_action. |
| brain_loop_exhausted | The inner GymRunner ran to its step budget (or the loop detector tripped) without success. Typically signals a goal-shaped intent the brain can't satisfy in a bounded loop (epic #377 Phase A.2). | The next attempt should route the intent through an intent rewriter (Phase B), rewriting "Scroll down to reveal the title, date, location, host details" into a literal "Scroll down by one viewport" and retrying. Until Phase B ships, edit the plan to use mechanical (verb + N) intents instead of multi-clause goals. |
| wrong_target | The SPA-aware visual verifier decided the click landed on the wrong destination (category card instead of an event detail, login wall, ad, etc.). Distinct from no_state_change (where the click had no effect at all). | The intent rewriter (Phase B) is the right response: rewrite the intent to be more specific about which target element. Until Phase B ships, add a verify gate to the click step that asserts the post-click URL matches the expected detail-page pattern. |
| extractor_error | Claude extractor failed / returned empty | Schema mismatch with what's visible. Tighten the recipe or relax the schema. |
| budget_exceeded | max_cost / max_time / per-URL context budget tripped | Raise the cap or shorten the plan. |
| unknown | No rule matched | Pull action=logs (below); the runner traceback usually identifies it. |
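
Since failure_class is described as a stable enum to branch on, a retry policy can route on it directly. A minimal sketch of such a router; the function name first_action and the keywords it prints are hypothetical shorthand for the First action column, not values the API returns:

```shell
# Hypothetical helper: map a failure_class to a coarse next step.
first_action() {
  case "$1" in
    cf_challenge)                      echo "retry-fresh-profile" ;;
    http_5xx)                          echo "retry-backoff" ;;
    http_4xx)                          echo "fix-plan" ;;             # not retryable
    nav_timeout)                       echo "raise-nav-wait" ;;
    selector_miss|no_state_change)     echo "inspect-screenshot" ;;
    brain_loop_exhausted|wrong_target) echo "rewrite-intent" ;;
    extractor_error)                   echo "adjust-schema" ;;
    budget_exceeded)                   echo "raise-cap-or-shorten" ;;
    *)                                 echo "pull-logs" ;;            # unknown or unlisted
  esac
}
```

The catch-all branch sends anything unrecognized to the logs, so a class added later degrades to manual inspection rather than a blind retry.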

Decoding the failure screenshot

# Pull the result and write the first failed step's screenshot to a PNG.
# Steps without a screenshot_b64 field are skipped so base64 never decodes "null".
curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -d "{\"action\":\"result\",\"run_id\":\"$RUN_ID\"}" \
  | jq -r '.steps[] | select(.success == false and .screenshot_b64 != null) | .screenshot_b64' \
  | head -n 1 | base64 --decode > failed_step.png

screenshot_b64 is omitted on success and on steps where capture failed.

failure_help on the lifecycle response

Every terminal-failure response on GET /v1/runs/{id} (and the status-action body) carries a failure_help dict — the human-actionable companion to halt_class and failure_class:

{
  "phase": "halted",
  "halt_class": "anthropic_unreachable",
  "failure_help": {
    "halt_class": "anthropic_unreachable",
    "summary": "Could not reach the Anthropic API after the retry budget was exhausted.",
    "likely_causes": [
      "Transient Anthropic 5xx / 529 Overloaded during peak hours",
      "Per-IP rate limit on the Modal egress shared pool",
      "Deployed image is stale (>14 days) and its TLS / certifi stack hasn't been refreshed",
      "Anthropic API key revoked or hit account quota"
    ],
    "next_steps": [
      "Wait 60s and retry — the retry policy backs off exponentially",
      "Check `GET /v1/version` — if `deploy_age_days > 14`, redeploy from current main",
      "Verify ANTHROPIC_API_KEY hasn't been rotated in the Modal Secret"
    ],
    "debug_surfaces": {
      "events": "/v1/runs/<run_id>/events?sse=true",
      "augur":  "/v1/runs/<run_id>/augur",
      "logs":   "POST /v1/predict {action: logs, run_id: <run_id>}",
      "phase":  "/v1/runs/<run_id>",
      "status": "/v1/runs/<run_id>/status",
      "result": "/v1/runs/<run_id>/result"
    },
    "retries_spent": 3
  }
}

The taxonomy currently covers anthropic_unreachable, cf_challenge, page_blocked, navigation_drift, navigate_failed, bad_url, extract_data_failed, no_schema_configured, budget_cap, time_cap, halt_timeout, and cancelled. Unknown classes fall back to a default help dict that still surfaces the debug_surfaces URLs — operators always have a path forward.

A retries_spent value above 0 means the request was retried before the run gave up. This is the most common cause of slow failures, and it tells you the retry budget was exhausted, as opposed to a non-retryable error such as an invalid API key, which fails on the first attempt.
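
For quick triage it can help to flatten failure_help into a few lines of text. A minimal sketch using jq against the response shape shown above; the function name print_failure_help is hypothetical, and it reads the response body on stdin:

```shell
# Hypothetical helper: summarize the failure_help block of a lifecycle response.
# Prints the summary, the retry count (0 if absent), and one line per next step.
print_failure_help() {
  jq -r '.failure_help
         | "summary: \(.summary)",
           "retries_spent: \(.retries_spent // 0)",
           (.next_steps[]? | "next: \(.)")'
}
```

A non-zero retries_spent line at the top of the output is the cue that the failure was slow because retries ran first.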

The extraction_quality summary field on action=result carries unknown_placeholder_row_count — non-zero means rows were captured but their non-URL fields were all <UNKNOWN> / none / empty placeholders. Useful when viable > 0 but the data feels thin.

When failure_class isn't enough

For unknown (or to read the traceback / per-step Holo3 logs), fall back to action=logs:

curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -d "{\"action\":\"logs\",\"run_id\":\"$RUN_ID\",\"tail\":500}" \
  | jq -r '.events[]'

action=logs returns the runner thread's Python logger tail (state transitions, step handler messages, exception tracebacks). This is the HTTP-API analog of operator-only modal app logs — same content, different access path.

If the events.log doesn't surface the cause either, ask your operator to pull modal app logs mantis-plan-runner for the container-side stdout / stderr. Open a ticket with the run_id.

Debugging a stuck run

If a running status doesn't advance for ≥ 5 min:

# Pull the last 500 events the runner emitted
curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -d "{\"action\":\"logs\",\"run_id\":\"$RUN_ID\",\"tail\":500}" \
  | jq -r '.events[]'

Common stuck patterns and what they look like in the logs:

| Symptom | Likely cause |
| --- | --- |
| Many [click] (0,0) grounding=NO lines | Holo3 hallucinating coordinates; the agent might be on a different page than expected |
| [runner] plan=False executor=False idx=0 in_range=False repeating | The runner exhausted the per-step budget; the next step's max_steps ran out |
| [content-control] parse failed: ... | Claude returned prose instead of JSON. Non-fatal; the runner moves on |
| Runner interrupted due to worker preemption (Modal only) | Spot GPU was preempted. The function auto-restarts |
| Cloudflare challenge timeout | The page's anti-bot didn't auto-resolve. Try a different proxy geo |

If the run is genuinely stuck (no log progress for 10+ minutes while still within the time budget), cancel it and start fresh.
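
The two thresholds in this section can be written as a watchdog rule. A minimal sketch that simplifies both signals (status not advancing, no log progress) into one number of idle minutes; the function name stuck_action and its keywords are hypothetical:

```shell
# Hypothetical helper: given whole minutes without progress, print the next step.
# 5 and 10 are the thresholds named in this section.
stuck_action() {
  if [ "$1" -ge 10 ]; then
    echo "cancel-and-restart"
  elif [ "$1" -ge 5 ]; then
    echo "pull-logs"
  else
    echo "keep-polling"
  fi
}
```

A poller would track the time of the last observed change and call `stuck_action "$idle_minutes"` on each tick.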

See also