Model promotion scorecard (#183)

Before a fine-tuned Holo3 checkpoint serves more than its initial shadow share, it has to clear a named scorecard. The scorecard composes the artefacts the rest of the continual-fine-tuning pipeline already emits — the eval report (#155 step 4), shadow analytics (step 5), and labelled traces (step 2) — and reports pass/fail per gate at one of three tiers.

Tiers

Tier	Use
`base`	bare-minimum — no worse than Holo3 stock weights
`first_sft`	first fine-tune off base; the typical promotion gate
`future`	aspirational long-term target

Gates

Gate	Direction	Reads from	What it catches
`task_pass_rate`	min	eval report	held-out task success rate
`parser_validity`	min	override / labeller	tool-call shape regression
`grounding_accuracy`	min	override / labeller	grounded clicks landing on usable elements
`forbidden_region_avoidance`	min	labeller (escalation = inverse signal)	clicks avoid photos / ads / social / off-site
`loop_rate_max`	max	labeller	repeated-action loops
`gallery_recovery_rate`	min	override	lightbox / gallery trap recovery
`escalation_rate_max`	max	shadow analytics	rate of brain-ladder escalations from Holo3 to Claude
`done_completeness`	min	override	structured `done()` summary present
`cost_per_success_usd_max`	max	eval report	$ per successful held-out task

`min` gates pass when value ≥ threshold; `max` gates pass when value ≤ threshold.
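The pass rule can be written as a small predicate. This is a minimal sketch rather than the module's actual code: `gate_passes` is a hypothetical name, and the sample numbers are the ones shown in the Output shape section.

```python
def gate_passes(value: float, threshold: float, direction: str) -> bool:
    """Return True when a gate clears its threshold.

    direction is "min" (value must be at least the threshold) or
    "max" (value must be at most the threshold).
    """
    if direction == "min":
        return value >= threshold
    if direction == "max":
        return value <= threshold
    raise ValueError(f"unknown direction: {direction!r}")


print(gate_passes(0.62, 0.55, "min"))  # task_pass_rate -> True
print(gate_passes(0.04, 0.05, "max"))  # escalation_rate_max -> True
print(gate_passes(0.05, 0.05, "max"))  # a value equal to the threshold passes -> True
```

Both comparisons are inclusive, so a value that sits exactly on the threshold passes in either direction.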

Default thresholds are conservative. Operators tune them by editing `DEFAULT_THRESHOLDS` in `training/promotion_scorecard.py` or by passing their own override map to `evaluate(thresholds=...)` from Python.
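One way to picture an override map is to start from a tier's defaults and let the caller's entries win. The sketch below is an assumption about shape, not the real module: `EXAMPLE_THRESHOLDS` and `resolve_thresholds` are hypothetical names, the tier-to-gate-to-threshold layout is assumed, and the three `first_sft` values are simply those shown in the Output shape sample.

```python
# Hypothetical stand-in for DEFAULT_THRESHOLDS; the real layout may differ.
EXAMPLE_THRESHOLDS = {
    "first_sft": {
        "task_pass_rate": 0.55,
        "escalation_rate_max": 0.05,
        "parser_validity": 0.98,
    },
}


def resolve_thresholds(tier, overrides=None):
    """Merge a caller-supplied override map over one tier's defaults."""
    if tier not in EXAMPLE_THRESHOLDS:
        raise KeyError(f"unknown tier: {tier!r}")
    merged = dict(EXAMPLE_THRESHOLDS[tier])  # copy so the defaults stay untouched
    merged.update(overrides or {})
    return merged


tightened = resolve_thresholds("first_sft", {"task_pass_rate": 0.60})
print(tightened["task_pass_rate"])                        # 0.6
print(tightened["parser_validity"])                       # 0.98 (default kept)
print(EXAMPLE_THRESHOLDS["first_sft"]["task_pass_rate"])  # 0.55 (unchanged)
```

Copying before merging keeps a per-run override from leaking into later evaluations in the same process.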

Usage

python -m training.promotion_scorecard \
    --eval-report reports/candidate.json \
    --shadow-summary reports/shadow_summary.json \
    --labelled-traces /data/labelled \
    --tier first_sft \
    --output reports/scorecard.json

The exit code is 0 when every gate passes at the chosen tier and 1 otherwise. Drop it into CI to gate the next traffic-share bump.
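If the CI job is itself driven from Python, the exit code can be checked with `subprocess.run`. This is a sketch under assumptions: `scorecard_command` and `gate_passed` are hypothetical helper names, and the paths are the ones from the usage example above.

```python
import subprocess
import sys


def scorecard_command(tier: str) -> list[str]:
    """Build the documented command line for one tier."""
    return [
        sys.executable, "-m", "training.promotion_scorecard",
        "--eval-report", "reports/candidate.json",
        "--shadow-summary", "reports/shadow_summary.json",
        "--labelled-traces", "/data/labelled",
        "--tier", tier,
        "--output", "reports/scorecard.json",
    ]


def gate_passed(command: list[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(command, check=False).returncode == 0


# Hypothetical CI wiring:
#     if not gate_passed(scorecard_command("first_sft")):
#         sys.exit(1)  # block the traffic-share bump
```

Passing `check=False` keeps a failing scorecard from raising an exception, so the caller decides what a non-zero exit means.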

Use `--metric NAME=VALUE` to override individual metrics from the command line for any signal the artefacts don't yet expose:

python -m training.promotion_scorecard \
    --eval-report reports/candidate.json \
    --metric parser_validity=0.99 \
    --metric grounding_accuracy=0.78 \
    --tier first_sft
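For readers curious how a repeatable `NAME=VALUE` flag can be handled, the sketch below uses the standard library's `argparse`. It illustrates the general technique only and is not taken from the script: `parse_metric` is a hypothetical helper, and the real parser may validate names or values differently.

```python
import argparse


def parse_metric(text: str) -> tuple[str, float]:
    """Split one NAME=VALUE token into a (name, float) pair."""
    name, sep, raw = text.partition("=")
    if not sep or not name:
        raise argparse.ArgumentTypeError(f"expected NAME=VALUE, got {text!r}")
    try:
        return name, float(raw)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{raw!r} is not a number") from None


parser = argparse.ArgumentParser()
# action="append" collects every occurrence of the flag into one list.
parser.add_argument("--metric", action="append", type=parse_metric, default=[])

args = parser.parse_args(
    ["--metric", "parser_validity=0.99", "--metric", "grounding_accuracy=0.78"]
)
overrides = dict(args.metric)
print(overrides)  # {'parser_validity': 0.99, 'grounding_accuracy': 0.78}
```

Raising `ArgumentTypeError` from the `type` callable lets `argparse` print a usage error for a malformed token instead of a traceback.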

Output shape

{
  "tier": "first_sft",
  "overall_passed": true,
  "gates": [
    {"name": "task_pass_rate", "value": 0.62, "threshold": 0.55,
     "direction": "min", "passed": true, "note": ""},
    {"name": "escalation_rate_max", "value": 0.04, "threshold": 0.05,
     "direction": "max", "passed": true, "note": ""},
    {"name": "parser_validity", "value": 0.0, "threshold": 0.98,
     "direction": "min", "passed": true,
     "note": "input missing — gate skipped"}
  ],
  "metadata": {
    "label_step_count": 312,
    "label_reason_counts": {"escalation": 12, "gate_verify_pass": 84, ...}
  }
}

When an artefact is missing, the corresponding gate is skipped: it is reported with `"passed": true` and a note saying so, and `overall_passed` is unaffected. The script is therefore useful at every stage of the pipeline, and operators don't need the full set of artefacts to start gating.
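Because a skipped gate still reports `"passed": true`, a consumer that wants to tell "passed" apart from "not evaluated" has to look at the note. The sketch below groups gates that way. It assumes skipped gates can be recognised by the word "skipped" in `note`, as in the sample output, and `summarise` is a hypothetical helper name.

```python
# Gates copied from the sample output; \u2014 is the dash in the sample note.
SAMPLE = {
    "tier": "first_sft",
    "overall_passed": True,
    "gates": [
        {"name": "task_pass_rate", "value": 0.62, "threshold": 0.55,
         "direction": "min", "passed": True, "note": ""},
        {"name": "escalation_rate_max", "value": 0.04, "threshold": 0.05,
         "direction": "max", "passed": True, "note": ""},
        {"name": "parser_validity", "value": 0.0, "threshold": 0.98,
         "direction": "min", "passed": True,
         "note": "input missing \u2014 gate skipped"},
    ],
}


def summarise(scorecard: dict) -> dict:
    """Group gate names into passed, failed, and skipped buckets."""
    buckets = {"passed": [], "failed": [], "skipped": []}
    for gate in scorecard["gates"]:
        if "skipped" in gate.get("note", ""):  # assumption: the note marks skips
            buckets["skipped"].append(gate["name"])
        elif gate["passed"]:
            buckets["passed"].append(gate["name"])
        else:
            buckets["failed"].append(gate["name"])
    return buckets


summary = summarise(SAMPLE)
print(summary["skipped"])  # ['parser_validity']
print(summary["failed"])   # []
```

A stricter deploy step could refuse to promote while `summary["skipped"]` is non-empty, even though the scorecard itself reports an overall pass.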

Where this fits

[4] eval        eval_harness.run_eval → reports/candidate.json
[5] shadow      shadow_analytics.aggregate → reports/shadow_summary.json
[2] label       mantis trace label → /data/labelled/
                                     ↓
                    promotion_scorecard.evaluate
                                     ↓
              reports/scorecard.json   exit 0 / 1
                                     ↓
[6] deploy        bump shadow share when scorecard.overall_passed

The scorecard is intentionally a composer: it doesn't run any new evaluations itself. New gates land by extending `DEFAULT_THRESHOLDS` and the `_GATES` list, with the metric source wired either in `evaluate()` or via `--metric` overrides.