Errors
What every status code means and how to handle it.
Quick reference
| Status | Meaning | Retry? |
|---|---|---|
| 200 | Success | n/a |
| 400 | Bad request: malformed JSON, oversized plan, missing intent/type on a step | No (fix the payload) |
| 401 missing | No X-Mantis-Token | No (add the header) |
| 401 invalid | Token doesn't match a tenant | No (verify the token wasn't truncated) |
| 403 scope | Token valid but tenant lacks the required scope (run / status / result / logs) | No (operator needs to add scope) |
| 403 allowlist | Plan references a host not in tenant's allowed_domains | No (fix the plan or ask operator to add the domain) |
| 404 run | action=status\|result\|logs referenced a run_id your tenant doesn't own | No (wrong run_id or wrong tenant) |
| 404 video | No recording for the run (recording disabled or ffmpeg failed) | No |
| 409 profile-busy | (#342, Modal HTTP endpoint only) Another run is currently holding the requested profile_id's Chrome user-data-dir lock. The detail includes the held run_id so you can poll it. | Yes (wait for the held run to finish, or submit with a different profile_id) |
| 429 rate | Tenant exceeded rate_limit_per_minute | Yes (honor Retry-After header) |
| 429 concurrent | Tenant at max_concurrent_runs | Yes (honor Retry-After) |
| 500 | Unhandled exception | Sometimes (check {"action":"logs", ...} for the traceback before retrying) |
| 502 upstream | Holo3 (/v1/chat/completions) or Anthropic API unreachable | Yes (exponential backoff) |
| 503 auth-not-configured | Server has neither MANTIS_API_TOKEN nor MANTIS_TENANT_KEYS_PATH set | No (operator misconfiguration) |
| 503 metrics | prometheus_client not installed in the container; /metrics only | n/a |
Error body shape
The detail string is meant to be safe to surface to your users (no internals leaked). For deeper debugging, use {"action": "logs", "run_id": "...", "tail": 500} to get the runner's event log.
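As a rough client-side sketch of the paragraph above: the helper below pulls the detail string out of an error response for display, and builds the logs payload for deeper debugging. It assumes (the text above does not spell this out) that error bodies are JSON objects with a top-level "detail" string; the function names are hypothetical.

```python
import json


def user_facing_error(status_code: int, body_text: str) -> str:
    """Return a message that is safe to show to end users.

    Assumption: error bodies are JSON objects with a top-level "detail"
    string. Anything else falls back to a generic message.
    """
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        body = None
    detail = body.get("detail") if isinstance(body, dict) else None
    if isinstance(detail, str) and detail:
        return f"Request failed ({status_code}): {detail}"
    # Do not echo an unrecognized body back to the user.
    return f"Request failed ({status_code})"


def logs_request(run_id: str, tail: int = 500) -> dict:
    """Build the logs payload quoted above for deeper debugging."""
    return {"action": "logs", "run_id": run_id, "tail": tail}
```

Only the detail string is surfaced; the rest of the body is deliberately ignored so that nothing unexpected reaches end users.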
Run-level failures
A run can return 200 OK and still have status: failed:
```json
{
  "status": "failed",
  "run_id": "...",
  "error": "page_blocked", // or "max_cost_exceeded", "timeout", etc.
  "summary": { ... whatever was completed ... }
}
```
| error value | Cause | Fix |
|---|---|---|
| page_blocked | The site detected the agent / Cloudflare didn't pass | Use a different proxy or geo, or reduce the request rate |
| max_cost_exceeded | Budget cap hit before the plan completed | Raise max_cost (within tenant cap) or shorten the plan |
| timeout | max_time_minutes hit | Same approach: raise max_time_minutes or shorten the plan |
| gate_failed | A gate: true step's verify clause was false | The site didn't reach the expected state; try different filters or a different start URL |
| extract_failed | Claude couldn't parse the screenshot into the expected schema | Schema mismatch with what's actually visible; adjust the plan |
| navigation_blocked | Cloudflare 403 or sustained 5xx from the target | Wait and retry, or try a fresh profile_id (different cookies / IP rotation) |
These are partial successes — the summary block reflects whatever was completed (e.g., 2 of 3 listings extracted before the cap hit).
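Because a 200 response can still describe a failed run, client code has to check the body as well as the HTTP status. The sketch below shows one way to do that. The function name and the hint strings are illustrative (the hints paraphrase the table above), and the body is assumed to be already parsed into a dict.

```python
from typing import Tuple

# Short hints paraphrased from the table above, keyed by the "error" value.
RUN_ERROR_HINTS = {
    "page_blocked": "try a different proxy or geo, or reduce the request rate",
    "max_cost_exceeded": "raise max_cost or shorten the plan",
    "timeout": "raise max_time_minutes or shorten the plan",
    "gate_failed": "try different filters or a different start URL",
    "extract_failed": "adjust the plan to match what is visible",
    "navigation_blocked": "wait and retry, or use a fresh profile_id",
}


def check_run(http_status: int, body: dict) -> Tuple[bool, str]:
    """Return (ok, note). HTTP 200 alone does not mean the run succeeded."""
    if http_status != 200:
        return False, f"HTTP {http_status}"
    if body.get("status") == "failed":
        error = body.get("error", "unknown")
        hint = RUN_ERROR_HINTS.get(error, "inspect the run logs")
        return False, f"{error}: {hint}"
    return True, "ok"
```

Even when `check_run` reports a failure, the summary block in the body may still hold usable partial output, so callers may want to keep it rather than discard the whole response.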
Retry guidance
| Code | Retry strategy |
|---|---|
| 4xx (except 409 and 429) | Don't retry; fix the request |
| 409 | Wait for the held run to finish (its run_id is in the detail), or resubmit with a different profile_id |
| 429 | Honor Retry-After; capped exponential backoff after that |
| 500 | Inspect {"action":"logs"} first. If the failure looks transient, retry with a fresh workflow_id (and a fresh profile_id if it looks tied to a corrupted Chrome session) |
| 502 | Exponential backoff (5 s → 30 s → 2 min); the upstream Holo3 / Anthropic API might be down |
| 503 (auth) | Don't retry; operator action needed |
If you're hitting 429 frequently, ask your operator to raise the per-tenant rate_limit_per_minute or max_concurrent_runs — see Rate limits.
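The retry table above can be turned into a small delay calculator. This is a sketch under stated assumptions, not the service's own policy: the function name, the 1-second base and 120-second cap for the 429 fallback, and the handling of Retry-After as delta-seconds only (no HTTP-date form) are all choices made for illustration. The 5 s, 30 s, 2 min schedule for 502 comes from the table.

```python
from typing import Mapping, Optional

# Backoff schedule for 502 quoted in the table above: 5 s, 30 s, 2 min.
UPSTREAM_BACKOFF_SECONDS = (5, 30, 120)


def retry_delay(status: int, attempt: int, headers: Mapping[str, str],
                cap_seconds: float = 120.0) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (0-based), or None to stop.

    `headers` is looked up with the exact key "Retry-After"; normalize
    header names first if your HTTP client does not.
    """
    if status == 429:
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.strip().isdigit():
            return float(retry_after)
        # No usable header: capped exponential backoff (1 s base is an assumption).
        return min(cap_seconds, 2.0 ** attempt)
    if status == 502:
        if attempt < len(UPSTREAM_BACKOFF_SECONDS):
            return float(UPSTREAM_BACKOFF_SECONDS[attempt])
        return None
    # Other 4xx and 503 (auth): fix the request or involve the operator.
    # 500 is left out on purpose: inspect the logs before deciding to retry.
    return None
```

A caller would sleep for the returned number of seconds and give up when the result is None. A 409 profile-busy response is better handled by polling the held run than by a blind delay, so it is not covered here.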
Idempotency for safe retries
For workflows where double-execution would be expensive (e.g., extracting from a paginated site twice), use Idempotency-Key:
```bash
KEY="my-job-$(uuidgen)"
curl -X POST "$ENDPOINT/v1/predict" \
  -H "Idempotency-Key: $KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{...}'
```
Every retry of the same Idempotency-Key returns the cached run_id (24 h TTL). See Idempotency.
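The key only protects against double execution if every attempt sends the same value. A minimal Python sketch of that pattern follows. The helper names are hypothetical, the header names are taken from the curl example above, and the token values are placeholders.

```python
import uuid
from typing import Dict


def new_idempotency_key(prefix: str = "my-job") -> str:
    """Generate one key per logical job, mirroring KEY="my-job-$(uuidgen)"."""
    return f"{prefix}-{uuid.uuid4()}"


def request_headers(idempotency_key: str, mantis_token: str,
                    baseten_api_key: str) -> Dict[str, str]:
    """Headers for POST $ENDPOINT/v1/predict, matching the curl example."""
    return {
        "Idempotency-Key": idempotency_key,
        "X-Mantis-Token": mantis_token,
        "Authorization": f"Api-Key {baseten_api_key}",
    }


# Generate the key once, outside the retry loop, so every attempt reuses it.
key = new_idempotency_key()
attempts = [request_headers(key, "<token>", "<api-key>") for _ in range(3)]
```

Generating the key inside the retry loop would defeat the purpose, since each attempt would then look like a new job to the server.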
Diagnosing a failed step
When a step fails, action=result returns each failed step with structured diagnostics — read these before falling back to action=logs. The full schema is documented in mantis plan run → result.json; the fields you'll care about per failed step are:
| Field | What it tells you |
|---|---|
| failure_class | Stable enum (see table below). Branch on this in dashboards / retry policies. |
| final_url | Browser URL at the moment of failure. |
| page_title | Page title at the moment of failure. CF interstitials surface here even when data is empty. |
| last_action | The final Action dispatched before failure ({type, params, reasoning}). |
| screenshot_b64 | Base64-encoded PNG of the post-failure viewport. |
| data | Short prose from the handler (gate:FAIL:Error 403, fill_error: not found, …). |
failure_class → likely cause → first action
| failure_class | Likely cause | First action |
|---|---|---|
| cf_challenge | Cloudflare / anti-bot interstitial didn't clear | Retry with a fresh profile_id (rotates the IP / cookies). For repeated failures, ask operator to bump MANTIS_CF_PREWARM_MAX_SECONDS. |
| http_4xx | Target returned 401 / 404 / 410 | Check final_url; usually a stale URL in the plan or a tenant-scoped page. Not retryable. |
| http_5xx | Target backend returned 5xx | Exponential backoff + retry. Usually transient. |
| nav_timeout | Page load exceeded the navigate budget | Bump wait_after_load_seconds on the step, or MANTIS_NAV_WAIT_SECONDS. Repeated → check egress / proxy. |
| selector_miss | Click / fill / submit couldn't locate the target | Inspect screenshot_b64; the page is often in a different state than the plan expects. Adjust the plan, or add a wait step before. |
| no_state_change | Action handler reported success but the runner-state snapshot saw no URL / page / scroll change. Self-healing demotion (epic #377 Phase A). Fires on click / submit / navigate_back. | Usually transient: the runner already triggered a retry and (after 2× repeats on the same step) routed through Holo3StepHandler. If it persists into terminal failure, the click is hitting an element that doesn't actually navigate; inspect screenshot_b64 + last_action. |
| brain_loop_exhausted | The inner GymRunner ran to its step budget (or the loop detector tripped) without success. Typically signals a goal-shaped intent the brain can't satisfy in a bounded loop (epic #377 Phase A.2). | The next attempt should route the intent through an intent rewriter (Phase B), rewriting "Scroll down to reveal the title, date, location, host details" into a literal "Scroll down by one viewport" and retrying. Until Phase B ships, edit the plan to use mechanical (verb + N) intents instead of multi-clause goals. |
| wrong_target | The SPA-aware visual verifier decided the click landed on the wrong destination (category card instead of an event detail, login wall, ad, etc.). Distinct from no_state_change (where the click had no effect at all). | The intent rewriter (Phase B) is the right response: rewrite the intent to be more specific about which target element. Until Phase B ships, add a verify gate to the click step that asserts the post-click URL matches the expected detail-page pattern. |
| extractor_error | Claude extractor failed / returned empty | Schema mismatch with what's visible. Tighten the recipe or relax the schema. |
| budget_exceeded | max_cost / max_time / per-URL context budget tripped | Raise the cap or shorten the plan. |
| unknown | No rule matched | Pull action=logs (below); the runner traceback usually identifies it. |
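Since failure_class is a stable enum, a retry policy can branch on it with a plain lookup table. The sketch below is one possible mapping. The action labels on the right are invented for illustration and are not part of the API; only the keys come from the table above.

```python
# Coarse next-action labels (hypothetical) keyed by failure_class.
FIRST_ACTION = {
    "cf_challenge": "retry_fresh_profile",
    "http_4xx": "fix_plan",
    "http_5xx": "backoff_retry",
    "nav_timeout": "raise_nav_wait",
    "selector_miss": "inspect_screenshot",
    "no_state_change": "inspect_screenshot",
    "brain_loop_exhausted": "rewrite_intent",
    "wrong_target": "rewrite_intent",
    "extractor_error": "adjust_schema",
    "budget_exceeded": "raise_cap_or_shorten",
    "unknown": "pull_logs",
}

# Classes the table describes as worth an automatic retry.
AUTO_RETRYABLE = {"cf_challenge", "http_5xx"}


def first_action(step: dict) -> str:
    """Map a failed step's failure_class to a coarse next action."""
    failure_class = step.get("failure_class", "unknown")
    # Unrecognized classes are treated like "unknown": go read the logs.
    return FIRST_ACTION.get(failure_class, "pull_logs")


def is_auto_retryable(step: dict) -> bool:
    """True when a retry without plan changes is a reasonable first move."""
    return step.get("failure_class") in AUTO_RETRYABLE
```

Falling back to "pull_logs" for unrecognized values keeps the policy safe if new classes are added to the enum later.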
Decoding the failure screenshot
```bash
# Pull the result and write the first available failed-step screenshot to a PNG.
# The null check skips failed steps whose screenshot capture itself failed.
curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -d "{\"action\":\"result\",\"run_id\":\"$RUN_ID\"}" \
  | jq -r '.steps[] | select(.success == false and .screenshot_b64 != null) | .screenshot_b64' \
  | head -1 | base64 -d > failed_step.png
```
screenshot_b64 is omitted on success and on steps where capture failed.
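The same extraction can be done in Python once the action=result body has been parsed. This sketch assumes the shape implied by the jq filter above (a top-level steps list whose items carry success and, optionally, screenshot_b64); the function name is hypothetical.

```python
import base64
from typing import Optional


def first_failed_screenshot(result: dict) -> Optional[bytes]:
    """Return PNG bytes for the first failed step that carries a screenshot.

    Returns None when no failed step has one, which covers both
    "nothing failed" and "capture failed on every failed step".
    """
    for step in result.get("steps", []):
        if step.get("success") is False and step.get("screenshot_b64"):
            return base64.b64decode(step["screenshot_b64"])
    return None
```

A caller would write the returned bytes to a file such as failed_step.png in binary mode, after checking for None.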
failure_help on the lifecycle response
Every terminal-failure response on GET /v1/runs/{id} (and the
status-action body) carries a failure_help dict — the
human-actionable companion to halt_class and failure_class:
```json
{
  "phase": "halted",
  "halt_class": "anthropic_unreachable",
  "failure_help": {
    "halt_class": "anthropic_unreachable",
    "summary": "Could not reach the Anthropic API after the retry budget was exhausted.",
    "likely_causes": [
      "Transient Anthropic 5xx / 529 Overloaded during peak hours",
      "Per-IP rate limit on the Modal egress shared pool",
      "Deployed image is stale (>14 days) and its TLS / certifi stack hasn't been refreshed",
      "Anthropic API key revoked or hit account quota"
    ],
    "next_steps": [
      "Wait 60s and retry — the retry policy backs off exponentially",
      "Check `GET /v1/version` — if `deploy_age_days > 14`, redeploy from current main",
      "Verify ANTHROPIC_API_KEY hasn't been rotated in the Modal Secret"
    ],
    "debug_surfaces": {
      "events": "/v1/runs/<run_id>/events?sse=true",
      "augur": "/v1/runs/<run_id>/augur",
      "logs": "POST /v1/predict {action: logs, run_id: <run_id>}",
      "phase": "/v1/runs/<run_id>",
      "status": "/v1/runs/<run_id>/status",
      "result": "/v1/runs/<run_id>/result"
    },
    "retries_spent": 3
  }
}
```
The taxonomy currently covers anthropic_unreachable, cf_challenge,
page_blocked, navigation_drift, navigate_failed, bad_url,
extract_data_failed, no_schema_configured, budget_cap,
time_cap, halt_timeout, and cancelled. Unknown classes fall
back to a default help dict that still surfaces the debug_surfaces
URLs — operators always have a path forward.
A retries_spent value above 0 means the API retried before giving up.
This is the most common cause of slow failures, and it tells you the
retry budget was exhausted (an API-key error, by contrast, would never
be retried).
The extraction_quality summary field on action=result carries
unknown_placeholder_row_count — non-zero means rows were captured
but their non-URL fields were all <UNKNOWN> / none / empty
placeholders. Useful when viable > 0 but the data feels thin.
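To show how these fields might be consumed, the sketch below renders a failure_help dict as a short operator-facing message and flags thin extractions. Both function names are hypothetical. The extraction_quality dict is passed in directly because its exact position inside the result body is not spelled out above.

```python
def describe_failure(resp: dict) -> str:
    """Render a terminal-failure response as a short multi-line message."""
    help_ = resp.get("failure_help") or {}
    halt_class = help_.get("halt_class") or resp.get("halt_class", "unknown")
    lines = [f"{halt_class}: {help_.get('summary', 'no summary')}"]
    retries = help_.get("retries_spent", 0)
    if retries > 0:
        # Retries happened, so the slow failure is a spent retry budget.
        lines.append(f"retried {retries}x before giving up")
    for step in help_.get("next_steps", []):
        lines.append(f"- {step}")
    return "\n".join(lines)


def has_thin_rows(extraction_quality: dict) -> bool:
    """True when some captured rows contain only placeholder values."""
    return extraction_quality.get("unknown_placeholder_row_count", 0) > 0
```

Because unknown classes still come with a default help dict, `describe_failure` does not need a special case for them; it only guards against the dict being absent.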
When failure_class isn't enough
For unknown (or to read the traceback / per-step Holo3 logs), fall through to:
```bash
curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -d "{\"action\":\"logs\",\"run_id\":\"$RUN_ID\",\"tail\":500}" \
  | jq -r '.events[]'
```
action=logs returns the runner thread's Python logger tail (state transitions, step handler messages, exception tracebacks). This is the HTTP-API analog of operator-only modal app logs — same content, different access path.
If the events.log doesn't surface the cause either, ask your operator to pull modal app logs mantis-plan-runner for the container-side stdout / stderr. Open a ticket with the run_id.
Debugging a stuck run
If a running status doesn't advance for ≥ 5 min:
```bash
# Pull the last 500 events the runner emitted
curl -fsS -X POST "$ENDPOINT/v1/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $MANTIS_API_TOKEN" \
  -d "{\"action\":\"logs\",\"run_id\":\"$RUN_ID\",\"tail\":500}" \
  | jq -r '.events[]'
```
Common stuck patterns and what they look like in the logs:
| Symptom | Likely cause |
|---|---|
| Many [click] (0,0) grounding=NO lines | Holo3 hallucinating coordinates; the agent might be on a different page than expected |
| [runner] plan=False executor=False idx=0 in_range=False repeating | The runner exhausted the per-step budget; the next step's max_steps ran out |
| [content-control] parse failed: ... | Claude returned prose instead of JSON. Non-fatal; the runner moves on |
| Runner interrupted due to worker preemption | (Modal only) Spot GPU was preempted. The function auto-restarts |
| Cloudflare challenge timeout | The page's anti-bot didn't auto-resolve. Try a different proxy geo |
If the run is genuinely stuck (no log progress for 10+ minutes within the time budget), cancel it and start fresh.
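Scanning a 500-line tail by eye is tedious, so the symptom table can be encoded as substring checks. The sketch below assumes each event is a plain string (as the `jq -r '.events[]'` filter suggests). The substrings are taken from the table; the function name and the short cause labels are illustrative.

```python
from collections import Counter
from typing import Iterable

# Substring to look for in a log line, mapped to the likely cause from the table.
STUCK_PATTERNS = {
    "grounding=NO": "Holo3 hallucinating coordinates",
    "plan=False executor=False": "per-step budget exhausted",
    "[content-control] parse failed": "Claude returned prose instead of JSON (non-fatal)",
    "worker preemption": "Modal spot GPU preempted (auto-restarts)",
    "Cloudflare challenge timeout": "anti-bot challenge did not auto-resolve",
}


def diagnose(events: Iterable[str]) -> Counter:
    """Count how many log lines match each known stuck pattern."""
    counts: Counter = Counter()
    for line in events:
        for needle, cause in STUCK_PATTERNS.items():
            if needle in line:
                counts[cause] += 1
    return counts
```

Calling `diagnose(events).most_common(1)` gives the dominant symptom. An empty counter suggests the run matches none of the known patterns and is a candidate for cancelling.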
See also
- Authentication — auth-related error details
- Operations / Rate limits — operator-side cap tuning