Ablation harness

Quality-related changes need paired ON/OFF evidence: running only the ON arm and inspecting the response surface misses silent regressions. The scripts/ablate_v1_cua.py harness runs a single-deploy A/B comparison against a warm Modal container, using per-request overrides on /v1/cua.

Usage

export MANTIS_ENDPOINT=https://getmason--mantis-server-api.modal.run
export MANTIS_API_TOKEN=mantis_…

python scripts/ablate_v1_cua.py \
    --toggle perceptual_verify \
    --instruction "Find a sign-in button and click it. Then call done." \
    --start-url https://lu.ma/discover \
    --pairs 2

What it does:

1. Warm-up: one request whose result is discarded, so that subsequent requests reuse the Chrome session cache (#311) and exclude ~10 s of cold-start noise. Skip it with --skip-warmup when the container is already warm from a previous run.
2. Paired requests: for each pair, one OFF arm (toggle=false) and one ON arm (toggle=true), with an identical instruction on the same warm container.
3. Diff report: prints per-pair side-by-side fields, then an aggregate (success rate Δ, wall-time Δ, per-reason count Δ, request errors). Regression flags are marked explicitly.
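
The warm-up and pairing steps above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `run_pairs`, the injected `send` callable, and the `instruction` / `start_url` payload field names are assumptions made for the example. Only the toggle field name comes from the table of available toggles.

```python
import time
from typing import Callable

# Hypothetical transport: takes a payload dict, returns the parsed JSON response.
Sender = Callable[[dict], dict]


def run_pairs(send: Sender, toggle: str, instruction: str, start_url: str,
              pairs: int = 2, skip_warmup: bool = False) -> list[dict]:
    """Run OFF/ON pairs through `send` and return one result dict per pair."""
    # Payload field names are assumed for illustration.
    base = {"instruction": instruction, "start_url": start_url}
    if not skip_warmup:
        send(dict(base))  # warm-up request: the result is deliberately discarded
    results = []
    for index in range(1, pairs + 1):
        pair = {"pair": index}
        for arm, value in (("OFF", False), ("ON", True)):
            started = time.monotonic()
            response = send({**base, toggle: value})  # per-request override
            pair[arm] = {**response, "wall_s": time.monotonic() - started}
        results.append(pair)
    return results
```

Injecting `send` keeps the loop testable without a live container; a real run would pass a function that POSTs the payload to /v1/cua with the API token.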

Available toggles

Every major runner toggle has a per-request override on /v1/cua:

Toggle name (payload field)	Issue	Default env var
`perceptual_verify`	#293	`MANTIS_PERCEPTUAL_VERIFY`
`loop_recovery`	#302	`MANTIS_LOOP_RECOVERY`
`done_gate`	#303	`MANTIS_DONE_GATE`
`predicate_verify`	#291	`MANTIS_PREDICATE_VERIFY`
`adaptive_settle`	#294	`MANTIS_ADAPTIVE_SETTLE`
`form_controller`	#301	`MANTIS_FORM_CONTROLLER`
`reuse_session`	#311	`MANTIS_CHROME_REUSE`
`speculation`	#118	`MANTIS_SPECULATIVE_INFERENCE`

When a per-request override is present, it takes precedence over the container's env var for that single request only. This is safe today because a container runs only one /v1/cua request at a time, so no concurrent request can be affected.
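
The precedence rule can be illustrated with a small sketch. The helper name `resolve_toggle` and the way the env var string is parsed into a boolean are assumptions for this example; the server may parse its env vars differently.

```python
import os
from typing import Mapping, Optional


def resolve_toggle(payload: Mapping[str, object], field: str, env_var: str,
                   environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return the effective toggle value for one request."""
    environ = os.environ if environ is None else environ
    # A per-request override, when present, wins for this request only.
    if payload.get(field) is not None:
        return bool(payload[field])
    # Assumed parsing: "1", "true", "yes", "on" enable; anything else disables.
    return environ.get(env_var, "").strip().lower() in {"1", "true", "yes", "on"}
```

Because the override is read from the payload and never written back to the environment, the next request falls back to the container default again.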

Reading the report

Each pair prints a field table, with Δ marking the fields that differ between the two arms:

=== Pair 1 ===
  field                      OFF                               ON
Δ done_rejections_by_reason  {}                                {"empty_summary":2}
Δ steps                      1                                 3
  success                    True                              True
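
A per-pair table of this shape could be produced along the following lines. This is a sketch under assumptions: `diff_rows`, `_show`, and the column widths are made up for the example and are not taken from the harness.

```python
import json


def _show(value: object) -> str:
    # Dicts are rendered as compact JSON; everything else via str().
    if isinstance(value, dict):
        return json.dumps(value, separators=(",", ":"), sort_keys=True)
    return str(value)


def diff_rows(off: dict, on: dict) -> list[str]:
    """Return one formatted row per field, prefixed with Δ when the arms differ."""
    rows = []
    for field in sorted(set(off) | set(on)):
        off_value, on_value = off.get(field), on.get(field)
        marker = "Δ" if off_value != on_value else " "
        rows.append(f"{marker} {field:<26} {_show(off_value):<33} {_show(on_value)}")
    return rows
```

Taking the union of both key sets means a field that only one arm returns still shows up as a changed row instead of being dropped silently.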

The aggregate at the bottom sums signals across pairs and flags regressions:

=== Aggregate (done_gate, n=2 pairs) ===
  success rate           OFF=2/2  ON=2/2
  wall mean              OFF=15.1s  ON=13.5s
  steps mean             OFF=2.0   ON=3.0
  done_rejections_by_reason
    OFF: {}
    ON:  {"empty_summary":4}
  ✅ no regression flags
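
The aggregate numbers above could be computed roughly as in the following sketch. It assumes each pair is a dict with `"OFF"` and `"ON"` entries that carry `success`, `steps`, `wall_s`, and optionally `done_rejections_by_reason`; the function name and the `wall_s` key are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate(pairs: list[dict]) -> dict:
    """Summarise each arm across pairs (expects at least one pair)."""
    summary = {}
    for arm in ("OFF", "ON"):
        runs = [pair[arm] for pair in pairs]
        reasons = Counter()
        for run in runs:
            # Counter.update adds the per-reason counts rather than replacing them.
            reasons.update(run.get("done_rejections_by_reason", {}))
        summary[arm] = {
            "successes": sum(1 for run in runs if run.get("success")),
            "n": len(runs),
            "wall_mean": mean(run["wall_s"] for run in runs),
            "steps_mean": mean(run["steps"] for run in runs),
            "done_rejections_by_reason": dict(reasons),
        }
    return summary
```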

Flags emitted automatically:

- `success regression: ON dropped N successes vs OFF`: fires when the ON arm's success count is strictly lower.
- `ON arm had request errors — investigate logs`: fires on any HTTP error or JSON decode failure.
- `wall-time regression: ON > OFF * 1.5`: fires on a substantial per-arm wall increase. Verify against the Modal logs before blaming the toggle, since pre-existing hangs (e.g. #320) can surface as an outlier on one arm.
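
The three rules can be expressed as a short check. The sketch below is an illustration only: it returns short hypothetical labels rather than the harness's exact messages, assumes the wall-time rule compares per-arm means, and assumes the per-arm summaries carry `successes`, `wall_mean`, and an optional `request_errors` count.

```python
def regression_flags(off: dict, on: dict) -> list[str]:
    """Return labels for every rule that fires; an empty list means no flags."""
    flags = []
    # Rule 1: ON success count strictly lower than OFF.
    if on["successes"] < off["successes"]:
        flags.append("success_regression")
    # Rule 2: any HTTP error or JSON decode failure on the ON arm.
    if on.get("request_errors", 0) > 0:
        flags.append("request_errors")
    # Rule 3: ON wall time more than 1.5x OFF (comparison on means is assumed).
    if on["wall_mean"] > off["wall_mean"] * 1.5:
        flags.append("wall_time_regression")
    return flags
```

With the aggregate shown earlier (OFF=15.1s, ON=13.5s, 2/2 successes on both arms), none of the rules fire, which matches the `no regression flags` line.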

Discipline

Required for any PR that touches:

- gym/runner.py paths that affect the brain↔env loop
- gym/done_gate.py, gym/predicates.py, gym/perceptual_diff.py, gym/loop_recovery.py, gym/form_controller.py, gym/adaptive_settle.py
- brain_* parsers / classifiers
- verification/

The PR body must include an Ablation report section with the harness output, at least 2 pairs, and a ✅ no regression flags line or an explicit explanation of any flagged anomaly.

Optional for:

- Pure refactors with no observable behaviour change
- Pure plumbing PRs that don't change runner decisions
- Pure docs / typing / lint changes

Retroactive ablations (2026-05-12)

To verify the existing quality PRs didn't introduce silent regressions:

PR	Toggle	Verdict	Notes
#316 (#293 perceptual diff)	`perceptual_verify`	✅ no regression	One outlier traced to #320, pre-existing
#317 (#302 loop recovery)	`loop_recovery`	✅ no regression	Recovery fires on submit-shaped loops as designed
#306 (#303 done gate)	`done_gate`	✅ no regression	Gate fires on empty-summary dones; intended ~1 extra step

All three quality PRs are clean. The harness also surfaced one pre-existing infrastructure issue (#320) during the #293 run, which is exactly the kind of hidden problem the discipline is meant to catch.

Limitations

- Single-container assumption: the harness reuses the same Chrome session across pairs (via the #311 cache) so that toggle effects aren't conflated with cold-start variance. If Modal scales up and routes pairs to different containers, the wall-time comparison gets noisy.
- Holo3-Q8 stochasticity: temperature is 0, but Holo3 still sometimes emits slightly different actions on identical inputs (vLLM numerical noise). For high-confidence regression detection, use --pairs 4 or higher; the default of 2 is enough to catch obvious failures.
- No quality metric beyond success, steps, and aggregate counters: the harness can't tell you whether a successful run actually achieved the user's goal, as opposed to the brain emitting a fake done(success=true) with a fabricated summary. Pair it with the #303 done gate to mitigate this.