Skip to content

Model promotion scorecard (#183)

Before a fine-tuned Holo3 checkpoint serves more than its initial shadow share, it has to clear a named scorecard. The scorecard composes the artefacts the rest of the continual-fine-tuning pipeline already emits — the eval report (#155 step 4), shadow analytics (step 5), and labelled traces (step 2) — and reports pass/fail per gate at one of three tiers.

Tiers

Tier Use
base bare-minimum — no worse than Holo3 stock weights
first_sft first fine-tune off base; the typical promotion gate
future aspirational long-term target

Gates

Gate Direction Reads from What it catches
task_pass_rate min eval report held-out task success rate
parser_validity min override / labeller tool-call shape regression
grounding_accuracy min override / labeller grounded clicks landing on usable elements
forbidden_region_avoidance min labeller (escalation = inverse signal) clicks avoid photos / ads / social / off-site
loop_rate_max max labeller repeated-action loops
gallery_recovery_rate min override lightbox / gallery trap recovery
escalation_rate_max max shadow analytics brain ladder Holo3 → Claude rate
done_completeness min override structured done() summary present
cost_per_success_usd_max max eval report $ per successful held-out task

min gates pass when value ≥ threshold; max gates pass when value ≤ threshold.

Default thresholds are conservative — operators tune by editing DEFAULT_THRESHOLDS in training/promotion_scorecard.py or by passing their own override map into evaluate(thresholds=...) from Python.

Usage

python -m training.promotion_scorecard \
    --eval-report reports/candidate.json \
    --shadow-summary reports/shadow_summary.json \
    --labelled-traces /data/labelled \
    --tier first_sft \
    --output reports/scorecard.json

Exit code is 0 when every gate passes at the chosen tier, 1 otherwise. Drop into CI to gate the next traffic-share bump.

Override individual metrics from the command line for any signal the artefacts don't yet expose:

python -m training.promotion_scorecard \
    --eval-report reports/candidate.json \
    --metric parser_validity=0.99 \
    --metric grounding_accuracy=0.78 \
    --tier first_sft

Output shape

{
  "tier": "first_sft",
  "overall_passed": true,
  "gates": [
    {"name": "task_pass_rate", "value": 0.62, "threshold": 0.55,
     "direction": "min", "passed": true, "note": ""},
    {"name": "escalation_rate_max", "value": 0.04, "threshold": 0.05,
     "direction": "max", "passed": true, "note": ""},
    {"name": "parser_validity", "value": 0.0, "threshold": 0.98,
     "direction": "min", "passed": true,
     "note": "input missing — gate skipped"}
  ],
  "metadata": {
    "label_step_count": 312,
    "label_reason_counts": {"escalation": 12, "gate_verify_pass": 84, ...}
  }
}

When an artefact is missing, the corresponding gate is skipped (note says so) and overall_passed is unchanged. That way the script is useful at every stage of the pipeline — operators don't need the full set of artefacts to start gating.

Where this fits

[5] eval        eval_harness.run_eval → reports/candidate.json
[5] shadow      shadow_analytics.aggregate → reports/shadow_summary.json
[5] label       mantis trace label → /data/labelled/
                    promotion_scorecard.evaluate
              reports/scorecard.json   exit 0 / 1
[6] deploy        bump shadow share when scorecard.overall_passed

The scorecard is intentionally a composer — it doesn't run any new evaluations itself. New gates land by extending DEFAULT_THRESHOLDS + the _GATES list, with the metric source wired either in evaluate() or via --metric overrides.