Augur — per-run debug bundles + live streaming¶
augur-sdk instruments any
screenshot-grounded CUA with a single DebugSession context manager.
Mantis ships the wedge in src/mantis_agent/observability/augur.py so
every gym run produces a path-stable, schema-validated bundle on disk —
and, when AUGUR_DSN is exported, streams the same records live to the
Augur workspace (Sentry-style DSN, connection-status badge).
The upstream integration guide is the canonical reference for the bundle layout, schema, and replay contract: https://mercurialsolo.github.io/augur-sdk/integrations/mantis/. This page documents how Mantis itself wires in.
Install¶
The adapter only activates when the augur-sdk package is importable.
Pick whichever install matches your deployment:
# Direct
pip install augur-sdk
# Or via the Mantis observability extra (adds nothing else)
pip install 'mantis-agent[observability]'
# Or pull everything Mantis ships
pip install 'mantis-agent[full]'
When the package isn't installed, the wedge is a hard no-op — no error, no overhead, no bundle.
What gets written¶
Every gym run that reaches RunExecutor.execute() opens a
DebugSession. The default on-disk location is:
Override with MANTIS_AUGUR_DIR=/custom/root (the run id is still
appended). Inside each bundle the SDK writes:
augur/<run_id>/
├── manifest.json # bundle metadata (schema_version, paths, ...)
├── trace.json # session + step traces
├── screenshots/0001_post.png ...
├── events/0001.jsonl # one DecisionEvent per line, grouped by step
├── steps/0001.json
└── schema/ # vendored JSON Schemas for offline validation
Validate any bundle with:
from augur_sdk.validation import validate_bundle
issues = validate_bundle("data/augur/<run_id>/")
assert not issues
What gets streamed¶
If AUGUR_DSN is set in the runtime environment, the SDK opens a
streaming sink at session start and forwards every record_step,
record_event, and attach_observation call to the workspace
identified by the DSN. The on-disk bundle stays the source of truth —
streaming is observe-only.
Network failures are non-fatal. A misconfigured DSN, a firewall block, or a transient blip lands as a debug log line; the run continues.
Mantis → Augur vocabulary mapping¶
The adapter normalizes Mantis-internal layer names to the Augur
DecisionLayer
enum:
| Mantis layer | Augur layer |
|---|---|
critic-frontier |
verifier |
critic |
verifier |
gate-decision |
verifier |
preview-gate |
verifier |
agentic-recovery |
step_recovery |
recovery |
step_recovery |
som-click |
grounding |
planner |
planner |
| anything else | runner |
Mantis healing-event kinds collapse onto Augur's five-state
DecisionKind:
| Mantis kind | Augur kind |
|---|---|
fire / skip / decision |
decision |
result / observation |
observation |
info |
info |
error |
error |
metric |
metric |
Mantis runner status maps onto RunStatus:
| Mantis status | Augur status |
|---|---|
completed, completed_with_failures, succeeded |
succeeded |
halted, budget_exceeded, time_exceeded |
halted |
failed |
failed |
cancelled |
cancelled |
running, paused |
running |
Environment knobs¶
| Variable | Effect |
|---|---|
AUGUR_DSN |
Sentry-style DSN. When set, the SDK opens a streaming sink to the workspace; when unset, only the on-disk bundle is produced. |
AUGUR_CAPTURE_MODE |
One of off, metadata, trace, screenshots (default), video, model_io, dispatch, replay, full. Controls what the SDK captures. |
MANTIS_AUGUR_DIR |
Override the root directory where bundles are written (run id is still appended). |
MANTIS_AUGUR_DISABLED |
1 / true / yes → adapter is a no-op even with the SDK installed. Useful for tests / CI. |
MANTIS_DATA_DIR |
Default bundle root when MANTIS_AUGUR_DIR is unset. Defaults to ./data. |
MANTIS_VERSION |
Surfaced as client.version on the manifest — useful when bisecting which build produced a bundle. |
MANTIS_GIT_SHA |
Surfaced as client.git_sha on the manifest. |
MANTIS_AUGUR_VERBOSE |
1 → adapter diagnostics log at WARN instead of DEBUG. Modal suppresses INFO/DEBUG, so flip this on when verifying a deploy. |
Structured session costs (augur-sdk 0.1.8+)¶
The Augur workspace's Runs-list COST column reads from the
canonical session.costs record (added in SDK 0.1.6, populated via
the set_costs helper in 0.1.8). Mantis stamps it from the run's
CostMeter totals at finalization:
augur.set_costs(
total_usd=..., model_usd=..., gpu_usd=..., proxy_usd=...,
tokens_in=..., tokens_out=..., cache_hit_tokens=...,
)
The legacy add_tag("cost_usd", ...) / add_tag("cost_claude_usd",
...) chain was retired alongside the bump — keep one source of
truth. elapsed_seconds stays a tag because the costs schema has no
slot for run wallclock.
Per-step cost deltas land on the StepTrace.costs field (filled
inline by _build_step_trace from a CostMeter.totals_from(snapshot)
diff). The adapter additionally exposes set_step_costs(step_index,
...) for callers that compute costs after the fact, but the inline
path is preferred when it's available.
Per-LLM-call modelio capture (augur-sdk 0.1.6+, opt-in)¶
modelio/<step:04d>-<layer>-<seq>.json is the canonical per-call
training-data record (one file per LLM invocation under the bundle).
Mantis ships the plumbing in
src/mantis_agent/observability/modelio.py plus a hook inside the
shared AnthropicToolUseClient; per-layer call sites opt in by
publishing a context.
Live streaming (augur-sdk 0.1.10+): with the pin bumped to
augur-sdk>=0.1.10, every record_modelio call ALSO POSTs the
redacted record live to /api/v1/runs/<run_id>/modelio/<relpath>
on a background thread (in addition to staging the file for the
on-close bundle). The local bundle behavior is unchanged — streaming
is purely additive. If the server returns 403 on the first POST
(tenant not opted in to modelio streaming), the SDK latches the
sink off for the rest of the session and bundle-only consumers see
no regression. AugurAdapter.record_modelio is a verbatim forward
to DebugSession.record_modelio so all five wired layers
(planner / grounding / model / verifier / step_recovery) get the
streaming behavior automatically with no per-site code changes.
from mantis_agent.observability.modelio import publish_modelio_context
# Around a planner / grounding / verifier / step_recovery / judge LLM call:
with publish_modelio_context(runner._augur, layer="planner", step_index=0):
plan = decomposer.decompose_text(plan_text) # auto-captures
# any call via AnthropicToolUseClient inside this block is also captured
Captured layers must be one of the SDK enum literals: planner,
grounding, model, verifier, step_recovery, judge. Unknown
layer names log a warning and fall through as a no-op (typo guard).
Vendor field-name mapping (gotcha to remember): the canonical
modelio.schema.json response.usage block uses OpenAI field
names — prompt_tokens / completion_tokens — with
additionalProperties: false. Anthropic responses ship
input_tokens / output_tokens plus
cache_creation_input_tokens / cache_read_input_tokens. The
record_anthropic_modelio mapper translates at the boundary; never
construct a modelio record by hand from an Anthropic response or
the SDK's Draft 2020-12 validation will reject it.
Status (#523 multi-PR campaign — closed): all four LLM layers
Mantis actually has are wired. The fifth layer the SDK enum lists
(judge) is N/A for Mantis — no dedicated judge LLM call site
exists in the codebase.
| Layer | Where it's wired | PR |
|---|---|---|
planner |
plan_decomposer.decompose_text — inline record_anthropic_modelio call after the requests.post |
#527 |
grounding |
gym/step_handlers/form.py:ClaudeGuidedFormHandler.execute — wraps the dispatch via _dispatch(...) helper; covers all ~10 target_provider.find_* calls |
#531 |
verifier |
gym/_runner_helpers.py:ensure_results_filters (visual filter gate) + gym/_runner_helpers.py:execute_step (plan-author step.gate=true branch) |
#532 |
step_recovery |
agentic_recovery._call_recovery_tool (inline wire) + gym/step_recovery.StepRecoveryPolicy._try_agentic_recovery (caller wrap) |
#533 |
judge |
N/A — no judge layer in Mantis | — |
Each wire is a no-op when augur is None or inactive — the default control path is unchanged from before any of this landed. When augur is active, an entire run with all four layers exercising Anthropic will produce something like:
modelio/
├── 0000-planner-001.json # plan_decomposer (run-scoped)
├── 0003-grounding-001.json # ClaudeGuidedFormHandler on step 2
├── 0003-grounding-002.json # second find_form_target on same step
├── 0004-verifier-001.json # step.gate=true verify
├── 0005-step_recovery-001.json # agentic recovery analysis
└── ...
Open #523 acceptance items not addressed by this campaign (filed
as separate follow-ups when scheduled): references field on
each StepTrace pointing at its modelio files, redaction policy
tightening, capture_mode-gated emission (skip on
metadata-only runs), 50 MB modelio bundle-size budget warning.
Continuous verdict score (augur-sdk 0.1.7+)¶
When a step's typed Verdict.confidence is non-zero, the adapter
patches the step's verdict with a continuous reward score via
augur.set_score(step_index, confidence, comparator="verifier").
This replaces Augur's default binary status→score map
(passed=1.0 / failed=0.0) with the verifier's actual signal,
which RLHF / DPO pipelines need for ranked preferences.
comparatormust be one of the SDK's canonical values:verifier,model-judge,exact-match,human. Mantis stampsverifier.- Score is clamped to
[0.0, 1.0]SDK-side; out-of-band values trigger aValueErrorthat the wrapper swallows (telemetry never breaks runs). - Steps with
Verdict.confidence == 0(the default — most deterministic handlers never set it) are skipped, so we don't pollute the bundle with bogus scores.
Capture-mode upgrade on first failure (augur-sdk 0.1.3+)¶
The runner upgrades the active capture mode from metadata (cheap)
to screenshots (full evidence) the first time a step fails:
if not step_result.success and not runner._augur_capture_upgraded:
augur.set_capture_mode("screenshots")
runner._augur_capture_upgraded = True
Healthy runs stay on the original (cheap) mode; failing runs auto- collect screenshot evidence from the first failure onward. The sentinel is idempotent — only the first failure flips the switch.
The override is per-step (stamped on the next record_step call,
not on the manifest baseline). To upgrade the baseline globally,
set AUGUR_CAPTURE_MODE=screenshots at the runtime instead.
Failure-class hygiene diagnostic (augur-sdk 0.1.4+)¶
0.1.4 added a server-side rule, cua.uncategorized_failure, that
flags any step landing with status: "failed" but an empty or literal-
"unknown" failure_class. The rule surfaces in the per-run
Diagnostics panel with severity low and recommends tightening
the producer's classifier.
No code change on the Mantis side — purely a server-side surface that
audits the bundles we already ship. Practical impact: a few Mantis
recovery paths emit failure_class="unknown" (most visibly on
halted click steps); after the 0.1.4 bump those will start showing
up as low-severity diagnostics. Addressing them is hygiene work
for a follow-up, not a regression in the wedge.
Log streaming (augur-sdk 0.1.3+)¶
AugurAdapter.append_log(text, *, step_index=None, name="run") is a
thin wrapper over DebugSession.append_log (added upstream in 0.1.3).
When streaming is enabled, it POSTs the chunk to
/api/v1/runs/<run_id>/logs — the server appends it to
logs/<name>.log (or logs/step-<idx>.log when step_index is set),
bounded at 1 MB per file. Workspace surfaces these in the per-step
logs panel.
The wrapper is a clean no-op when:
- The runtime ships an older augur-sdk install without append_log
- No AUGUR_DSN is set (no server to stream to)
Mantis does not auto-pipe logger.* output into this surface today —
callers explicitly call runner._augur.append_log(...) when they have
text worth showing in the viewer. Per-step structured data already
goes through record_step / record_event; append_log is for
free-text additions (e.g. raw model output dumps, recovery rationale
prose, long verifier explanations).
Coexistence with TraceExporter¶
mantis_agent.gym.trace_exporter.TraceExporter keeps writing one JSON
per run to MANTIS_TRACE_EXPORT_DIR when configured. The Augur adapter
sits alongside it — the two writers don't share state and neither
replaces the other. Augur's bundle is finer-grained (per-step events,
typed layers, on-disk screenshots) and supports live streaming; the
trace export is the legacy coarse-grained record.
Verification¶
To confirm streaming works against a workspace:
- Export
AUGUR_DSN=https://<key>@<host>/<project>(and any required API key) in the Mantis runtime environment. - Submit any gym run (
/v1/predict, the CLI, or an embeddedMicroPlanRunner). - Check the workspace badge — the SDK heartbeats every ~15 s, so the "connected" indicator should turn green within one heartbeat of the first step landing.
- After the run finishes, verify the bundle on disk:
ls "$MANTIS_DATA_DIR/augur/<run_id>/"
python -c "from augur_sdk.validation import validate_bundle; print(validate_bundle('$MANTIS_DATA_DIR/augur/<run_id>/'))"
References¶
- Augur SDK source: https://github.com/mercurialsolo/augur-sdk
- Integration guide: https://mercurialsolo.github.io/augur-sdk/integrations/mantis/
- Normative SDK spec: https://mercurialsolo.github.io/augur-sdk/spec/
- Adapter authoring: https://mercurialsolo.github.io/augur-sdk/reference/adapter-authoring/