Changelog¶
All notable changes to augur-sdk are recorded here. Format roughly follows
Keep a Changelog, with semver applied per
docs/versioning.md upstream (MAJOR.MINOR.PATCH; pre-1.0 minors may break).
[Unreleased]¶
[0.2.2] — 2026-05-23¶
StreamingSink honours 429 + Retry-After (closes #27)¶
augur#114 made the high-frequency ingest endpoints async on the server
side. The happy path now returns 202 Accepted (already handled — any
2xx counted as success) and a saturated ingest queue returns
429 Too Many Requests with Retry-After: <seconds>. Before 0.2.2 the
SDK debug-logged the 429 and dropped the step — the producer thought the
write landed; the server never got it.
- Shared
_request_with_retryhelper. Every HTTP call site now routes through one place. On 429 it sleeps the parsedRetry-Aftervalue, reissues the request exactly once, and if the second attempt also 429s emits aWARNING("augur ingest dropped after retry: …") and returns. Single retry only — no unbounded loop, no SDK-side buffer. The local bundle on disk remains authoritative. - Endpoints covered.
_post_json(steps / trace / events / side-effects / reasoning / outcomes / eval-candidates / judge-decisions / logs),_post_modelio_request,_post_multipart(screenshots), and_send_heartbeat— all share the same retry behaviour. Modelio's existing 403-latch (tenant opt-in) is preserved across the retry path. _parse_retry_afterhelper. Parses augur's integer delta-seconds form, falls back to1.0son HTTP-date / missing / garbage, and clamps the result to[0, 30]so a malformed or hostile header cannot park a background thread.- Idempotency. Server endpoints are idempotent on
(run_id, step_index)for steps, on(run_id, manifest)for trace, and heartbeats are at-most-once by nature, so a single retry after 429 never double-writes.
Tests¶
tests/test_streaming_backpressure.py(16 tests) —_parse_retry_afterunit tests (integer / float / whitespace / missing / garbage / negative-clamp / huge-clamp),_post_json202 pass-through, 429 → 202 retry, double-429 drop + warn, sleep value verification, default-sleep whenRetry-Afteris missing, plus parallel retry coverage for_post_multipart,_send_heartbeat, and modelio (including the modelio 403-latch interaction after a 429 retry). Full suite: 214 tests, all green.
[0.2.1] — 2026-05-23¶
Branching replay — producer-side execution (closes #25)¶
DebugSession already accepted a branch_context kwarg in 0.1.14
(per augur-schema 0.3.1), but no producer code path drove it.
0.2.1 adds the high-level constructor.
DebugSession.branch_from(parent_run_id, branch_point_step_index, mutated_axis, mutation, ...)— classmethod that returns a child session pre-stamped with the parent linkage. Counterpart to platformmercurialsolo/augur#91(closed); operator surface declared the branch, this lets the SDK actually execute it.- Modes (
mode=): replay— load steps[0, branch_point_step_index)from a parent bundle (parent_bundle=<path>) into the new session, copying pre/post screenshot bytes verbatim. The producer continues fresh from the branch point.sandbox— stampbranch_contextonly; the producer executes from scratch against a live target.auto(default) — pickssandboxwhenmutated_axis="action"(action changes break the deterministic-prefix assumption per SPEC §10),replayotherwise.- Refuses
mode="replay"whenmutated_axis="action"— replay would lie about what the agent did since the parent's downstream observations no longer reflect what the new agent will see. Per-axis safety enforced at construction time, not at bundle write. branch_idand childrun_iddefault tof"{parent_run_id}:branch:<short-uuid>"(perbranch_context.schema.jsonconvention) — callers can override either independently.session.branch_modeaccessor exposes the resolved mode ("replay","sandbox", orNonefor production runs).
Tests¶
tests/test_branch_from.py(18 tests) — sandbox / replay / auto resolution, prefix loading + screenshot copy from parent bundle, replay-refuses-action-mutation, axis enum validation, branch_id generation, **kwargs pass-through. Full suite: 198 tests, all green.
[0.2.0] — 2026-05-23¶
Schemas now live in the augur-schema PyPI package¶
Closes #20. The SDK no longer vendors its own copy of the canonical
JSON Schemas. They're shipped by the
augur-schema package and
pulled in as a runtime dependency
(augur-schema>=0.3.1,<0.4). Single source of truth across the SDK,
the platform server, the viewer, and any third-party reader.
- Deleted
src/augur_sdk/_schema/(the JSON files and the loader module). 17 files removed. from augur_sdk._schema import …→from augur_schema import …insession.py,bundle.py,validation.py, and all tests. The upstream package re-exports the same surface (SCHEMA_VERSION,SchemaError,list_schemas,load_schema,schemas_dir,validator_for), so the swap is mechanical for callers.ValidationErrorno longer re-exported viaaugur_sdk._schema; callers that need it import directly fromjsonschema.exceptions(one-line change in two test modules +session.py).bundle.py's schema-embed-in-bundle step now copies raw bytes fromaugur_schema.schemas_dir()rather than parse-and-reserialise viaload_schema(). The embedded copy is bit-for-bit identical to the dep's published schema.
Schema-side fix consumed in this release¶
branch_contextis now referenced ondebug_session.schema.jsonasanyOf [$ref, null](parallel to the existing reference onstep_trace.schema.json). The SDK already emittedbranch_contexton the session record; withaugur-schema 0.3.1, validation accepts it. Without the fix,validate_bundle()rejected its own branch bundles after the dep swap. Caught by a one-off field-shape diff between the vendored copy and the dep before deletion.
Minor version bump¶
The augur_sdk._schema module was importable from outside the SDK
(referenced in 0.1.x docstrings and the bundle-layout doc), so
removing it is a semver-minor break even though no other public-API
signature changed. The SDK records, on-disk bundle layout, and
method surface are unchanged from 0.1.14.
Records that gain canonical schemas¶
The upstream package ships new canonical schemas for records the SDK
already emitted as raw JSON in 0.1.14:
reasoning_trace.schema.json, outcome_record.schema.json,
eval_candidate.schema.json, plus standalone files for the
previously-inlined branch_context, captured_versions,
env_fingerprint, judge_decision, and side_effect. Calling
validator_for(name) works the same way for these as for every
other schema.
Tests¶
- 180 tests pass against
augur-schema 0.3.1. The full quality gate (uv run pytest,uv run mypy -p augur_sdk,uv run ruff check) stays green.
[0.1.14] — 2026-05-23¶
Sentry-for-CUA primitives — round two (umbrella #9)¶
mark_for_eval()(closes #16). Tag a step as a regression-fixture candidate so server-side promotion can one-click it into the eval set. Tags land ineval_candidates.jsonat the bundle root and on the live stream viaStreamingSink.post_eval_candidate(). Idempotent onstep_index— last-write-wins. Tagging is allowed beforerecord_step()lands.finalize_outcome()(closes #18). Couples(verdict, task_class, cost_summary)into one OutcomeRecord per step or session so the platform's cost-per-outcome rollup doesn't need to JOIN three independent fields. Session scope rolls up per-stepstep.costsfor any field absent at the session level (session-levelset_costs()wins ties). Records land inoutcomes.jsonat the bundle root and on the live stream viaStreamingSink.post_outcome(). New helpersession.successful_task_cost_summary()returns the rolled-up summary in-process for budget asserts in tests.record_reasoning()(closes #14). First-class capture for models that emit explicit reasoning (Claude extended thinking, OpenAI reasoning summaries). Records land inevents/reasoning.jsonland on the live stream viaStreamingSink.post_reasoning(). Reasoning text gets its own redaction hook onRedactionPolicy(add_reasoning_redactor,apply_reasoning) distinct from the regular redactors — reasoning often carries PII that the action stream doesn't.ModelApiAdapterBase.extract_reasoning_from_response()pulls reasoning blocks out of a Claude or OpenAI response without caller wiring.BranchContext(closes #15). Newbranch_context=kwarg onDebugSessionlabels replay-branch trajectories withparent_run_id,branch_point_step_index,mutated_axis(model|prompt|action|grounder|tool_description), and the mutation payload. The full context lands on the session record; a lightweight slice (without the mutation payload) propagates to every step so server-side cohort filters can exclude branches by default without joining back to the session. Production runs omit the field — no regression for callers that don't opt in.- Side-effect ledger API (closes #11). New
declare_side_effect(step_index, resource, action, ...)/commit_side_effect(side_effect_id, observed_result)/mark_side_effect_aborted(side_effect_id, reason)/abort_pending_side_effects(reason)API for irreversible agent actions. Declarations land before dispatch so the ledger shows the agent's intent even when the run is killed mid-step. Records live underside_effects/<step:04d>-<id>.jsonkeyed to(run_id, step_id, side_effect_id). Redaction policy applies to bothresourceandobserved_result. NewAdapter.on_side_effect()hook so adapters can auto-detect known irreversible action types.StreamingSink.post_side_effect()posts the full lifecycle live. New canonicalside_effect.schema.jsonbundled with every release. - Intervention channel (closes #12). Server→SDK control plane.
New
session.bind_intervention(adapter)starts a long-poll onGET /runs/<run_id>/commands, dispatchingpause,resume,kill,inject_hint, andoverride_actionto the adapter's intervention hooks (on_pause,on_resume,on_kill,on_inject_hint,on_action_override). Operator-supplied coordinates always carryprovenance="human_override"so the trajectory preserves the SPEC §4 invariant that runtime action selection is screenshot-grounded.killcouples to the side-effect ledger (#11) — the channel aborts every pending declared side effect before invokingadapter.on_kill(). Adapters that don't implement a particular hook see the channel degrade to a no-op for that command type. At-least-once delivery with idempotency keys oncommand_id; cursor-based resumption so a dropped poll doesn't lose commands. Audit trail: every received command is logged to the session's decision-event stream with the operator id.
Bundle layout¶
- New top-level paths:
side_effects/(ledger),eval_candidates.json(#16 tags),outcomes.json(#18 coupled records),events/reasoning.jsonl(#14 reasoning). All reflected inmanifest.pathsandmanifest.signatures.
Models¶
BranchContext,SideEffect, andInterventionCommandare now re-exported fromaugur_sdkfor adapter authors.
Tests¶
- 49 new tests across
tests/test_branching.py,tests/test_eval_candidates.py,tests/test_intervention.py,tests/test_outcomes.py,tests/test_reasoning.py,tests/test_side_effects.py. Full suite: 180 tests, all green.
[0.1.13] — 2026-05-23¶
Sentry-for-CUA primitives (umbrella #9)¶
captured_versionson every StepTrace (closes #10). Promoted fromReplayFixture-only to a first-class field. Extended withprompt_hash,tool_descriptions_hash, andenv_fingerprint_ref. NewDebugSession.set_step_versions(step_index, model=…, prompt_hash=…)accepts partial updates and merges last-write-wins.record_modelio()auto-stampsmodel(fromrequest.model) andprompt_hashonto the corresponding step'scaptured_versionswhen called withstep_index=…. Silent no-op if the step hasn't been recorded yet — late stamping uses the explicit setter.env_fingerprinton Observation + StepTrace (closes #13). NewDebugSession.attach_env_fingerprint(step_index, url_host=…, url_path_template=…, viewport_hash=…, dom_hash=…, api_shapes=…, extensions=…). Stored side-by-side with the visual fingerprint (observation.hashes.phash_64), not merged, so the platform's determinism checker can attribute drift to agent/model/env independently. SDK never derivesdom_hashitself — only adapters that already probe DOM for diagnostics should populate it (preserves the screenshot-grounded core invariant).record_judge_decision()(closes #17). Model, rule, human, and hybrid judges land as first-class verdicts onstep.judge_decisions. The operativestep.verdictis set last-write-wins by default;step.verdict_sourcecarries<judge_type>:<judge_id>provenance.attach_verifier()now emits an implicitjudge_type="rule"decision alongside its verdict patch so the rule-vs-model-vs-human provenance is preserved across the legacy entry point too. NewStreamingSink.post_judge_decision()fire-and-forget hook so HITL overrides land live.trajectory_fingerprinton every manifest (closes #19). Deterministic, hand-crafted digest over the(action.type, failure_class|verdict.status, normalized_target_label)sequence of the run. Same shape → same fingerprint, across SDK versions and machines. One-step swap → close-but-different fingerprint (Hamming-style proximity on the bigram half). The algorithm is pluggable via theaugur_sdk.fingerprintsentry point group — adapters can swap in a learned embedding without changing the SDK API. Documented atdocs/concepts/trajectory-fingerprint.md. Default =cua_v1.
Models¶
JudgeDecision,EnvFingerprint, andCapturedVersionsare now re-exported fromaugur_sdkfor adapter authors.
Tests¶
tests/test_captured_versions.py,tests/test_env_fingerprint.py,tests/test_judge_decisions.py,tests/test_fingerprint.py— full round-trip coverage including streaming hooks, validation against the bundled schemas, and backward compat for bundles without the new fields.
[0.1.12] — 2026-05-22¶
Docs¶
- Mirror raw
.mdalongside every published HTML page on the docs site — e.g.…/latest/reference/api.mdreturns the canonical markdown source. Coding agents can fetch docs directly without scraping HTML. - Emit
llms.txtandllms-full.txtat the root of each versioned build (…/latest/llms.txt), following the llms.txt convention.llms.txtis a grouped index of every doc page;llms-full.txtis a single-file concatenation of the full corpus in nav order. - Both are produced by a small mkdocs build hook (
hooks/llms_export.py) with no new runtime dependencies. The hook captures each page's markdown after theinclude-markdownplugin runs, so transcluded content (e.g. the changelog) is inlined in the mirrored.md.
[0.1.11] — 2026-05-22¶
Docs¶
- Document the producer-facing surface for live modelio streaming.
Session.record_modelio()now spells out that, with a DSN configured, the same call also POSTs the redacted record to/api/v1/runs/<run_id>/modelio/<relpath>on a background thread, and that a403latches the sink off for the session while the bundle on disk still owns the record. - Add
StreamingSink.post_modelio()andStreamingSink.post_logs()to the streaming class surface inreference/api.md(they shipped in 0.1.10 and 0.1.8 respectively but weren't listed). - Add the
POST /api/v1/runs/{id}/modelio/{relpath}ingest row + curl example toreference/http-api.md.
[0.1.10] — 2026-05-21¶
Streaming¶
StreamingSink.post_modelio()(closes #5). When a streaming sink is attached, eachDebugSession.record_modelio()call now fires the staged record live to the server's per-tenant ingest route atPOST /api/v1/runs/{id}/modelio/{relpath}in addition to staging it for the on-close bundle. If the server returns 403 (the tenant hasn't opted in), the SDK latches_modelio_disabledand stops trying for the rest of the session — the local bundle is unaffected and bundle-only consumers see no regression. Paired with the upstream server route (mercurialsolo/augur#62).
Tests¶
tests/test_streaming_modelio.py(3 tests) — wire format on the modelio route, 403 → latch behaviour, and the negative case where non-403 errors don't disable streaming.tests/test_record_modelio.py— extended with a sink-wiring case that assertspost_modelio()is called with the recorder's reserved relpath and the post-redaction payload.
[0.1.9] — 2026-05-20¶
Schema¶
- Vendor
prior_steps.schema.json(closes #4). The replay workbench's sibling artifactreplay/<step_index:04d>.prior.jsonfinally has a vendored schema in the SDK; producers can now validate prior-step payloads offline viaaugur_sdk._schema.validator_for("prior_steps"). Byte-identical with the upstreammercurialsolo/augurmonorepo. Surfaced during the augur#55 schema v1.0 freeze review.
Tests¶
tests/test_prior_steps_schema.py(13 tests) — required-field failures, additionalProperties on the item level, action.type + verdict.status / verdict.score validation, and a bundle-disk round-trip alongside the existing replay fixture coverage.
[0.1.8] — 2026-05-20¶
Added — producer-side helpers for the 0.1.6 training-data substrate¶
Closes the three open SDK issues (#1, #2, #3). Until 0.1.8 the costs/score/modelio schema fields existed but adapters had to mutate session internals to emit them.
DebugSession.set_costs(*, total_usd=…, model_usd=…, gpu_usd=…, proxy_usd=…, tokens_in=…, tokens_out=…, cache_hit_tokens=…): stamp a structured cost rollup on the session. Surfaces on both the session record (trace.json) and the manifest (manifest.json#/costs) so cost-aware consumers (run-list dashboards, training pipelines) can read either. Repeat-call merge semantics; unset dimensions are preserved. Closes #1.DebugSession.set_step_costs(step_index, *, total_usd=…, model_usd=…, tokens_in=…, tokens_out=…, cache_hit_tokens=…): patch a recorded step'scostsobject. Merge semantics; raisesValueErrorif no step exists atstep_index. Closes #1.DebugSession.record_modelio(record, *, step_index=None, layer=None, validate=True): canonical producer-side helper for one model call. Validates against the vendoredmodelio.schema.json, stages undermodelio/<step_index:04d>-<layer>-<seq>.json(ormodelio/run-<layer>-<seq>.jsonwhenstep_indexis None), and is idempotent onprompt_hash— repeat calls with the same hash return the existing path without staging a duplicate. Applies the session'sRedactionPolicyand stampsredaction_applied: true. Stampslayeronto the record when the kwarg is provided and the record doesn't already carry one. Closes #2.DebugSession.set_score(step_index, score, *, comparator=None, components=None): attach a continuous reward signal (0..1) to a recorded step's verdict. Merges into the existing verdict (status/reason preserved); score clamped to[0.0, 1.0];comparatorvalidated against the canonical enum (verifier | model-judge | exact-match | human). Closes #3.
Schema¶
manifest.schema.jsongains optionalcosts(mirrors the session-level rollup) and two new path constants inpaths:modelio→modelio/andpreferences→preferences/. Additive; every prior 0.1.x manifest continues to validate.- Bundle layout: new
modelio/directory is written automatically whenrecord_modeliois called. The path tree indocs/concepts/bundle-layout.mdreflects the new entries.
Notes for producer authors¶
- Step → modelio linkage is path-based (
<step:04d>-…), not field-based —step_trace.schema.jsondoesn't carry amodelio_refsarray. Consumers locate model calls for step N by globbingmodelio/N*.json.
[0.1.7] — 2026-05-20¶
Added¶
ModelApiAdapterBase: shared scaffolding for adapters whose native format is a message log (OpenAI Responses, Anthropic Messages, similar). Subclasses implement two methods —iter_tool_calls(messages)and optionallyload_messages(path)— and inheritbundle_from_input(input, output)which walks the log, opens a DebugSession, emits one StepTrace per tool call, and resolves sidecar screenshots from ascreens/directory. Reduces the OpenAI / Anthropic Computer-Use adapter footprint by ~70%. Closes upstream augur#53.
[0.1.6] — 2026-05-20¶
Added — training-data substrate schemas (mirrors upstream augur 0.1.2 schema)¶
modelio.schema.json(new): canonical record for one model call's full input + output. Loaded under short namemodelio. Pinned shape lets downstream SFT/DPO pipelines harvest training data from any compliant producer without adapter-specific parsers. Closes upstream augur#56.step_trace.schema.json: new optionalcostsandlatencyobjects on each step (token counts, USD breakdown, per-layer ms). Closes part of upstream augur#58.debug_session.schema.json: new optionalcostsrollup on the session. Closes the rest of upstream augur#58.step_trace.schema.json→verdict: new optionalscore(0..1),score_components,comparatorenum (verifier|model-judge|exact-match|human). Thestatusenum stays the canonical pass/fail bucket; score is additive for RL/SFT pipelines needing partial credit. Closes upstream augur#59.preference.schema.json(new): preference / counterfactual record for DPO + RLHF training data. Stored atpreferences/<step_index:04d>.json, decoupled from the immutable trace so a human rater can add comparisons days after the run completes. Capturespreferred_action, rankedalternativeswith optionalreward_estimate, and thecomparator(verifier | model-judge | human-rater | replay-diff). Closes upstream augur#57.
All additions are optional + additive. Every 0.1.x bundle still validates.
[0.1.5] — 2026-05-20¶
Added¶
DebugSession.attach_verifier(step_index, status=..., reason=..., check=..., expected=..., actual=..., evidence_refs=...): let an external harness add a post-hoc verdict to a step the producer didn't categorize. Useful for traces from frameworks with no native verifier signal (OpenAI / Anthropic Computer-Use, raw OSWorld). Replaces the step'sverdictfield in-place. Streams the patched step through the live sink so viewers see the update without waiting for close(). Closes upstream augur#51.- New
cua.dom_used_as_runtime_targetdiagnostic rule (severity: high). Fires when a step'sgrounding.provenance == "dom"ANDaction.paramscarries numeric x/y — a spec §4 invariant violation (runtime targets must be screenshot-grounded; DOM probes are diagnostic-only). Closes upstream augur#50.
[0.1.4] — 2026-05-19¶
Added¶
- New
cua.uncategorized_failurediagnostic rule (severity: low). Fires when a step hasstatus: "failed"butfailure_classis empty or the literal"unknown"— the producer's classifier didn't match any rule. Surfaces as hygiene feedback to adapter authors; the recommendation points at the step's intent + the verdict.reason so a maintainer can extend the classify() rules.
[0.1.3] — 2026-05-20¶
Added¶
DebugSession.set_capture_mode(mode): change the active capture mode mid-run. The nextrecord_step(and every subsequent one) gets an explicitcapture_modefield stamped on the StepTrace. Lets a CUA start inmetadata(cheap) and upgrade toscreenshotsafter the first failed verifier check, without restarting the session. Matchesstep_trace.schema.json's new optionalcapture_modefield (closes upstream augur#36).DebugSession.append_log(text, step_index=None, name="run"): convenience to stream a runner-log chunk to the server'sPOST /api/v1/runs/<id>/logsendpoint. Routes tologs/step-<idx>.logwhenstep_indexis set, elselogs/<name>.log. No-op when streaming is disabled (the bundle on disk owns local logs). Closes upstream augur#17.
Schema¶
- Vendored
step_trace.schema.jsonmirrors the upstream addition of the optionalcapture_modefield. Additive; every 0.1 bundle still validates.
[0.1.2] — 2026-05-19¶
Fixed¶
- Heartbeat POST now sends a JSON body with
Content-Type: application/json. Previously it usedurllib3'sfields=kwarg, which producesmultipart/form-data— the server parsed that as an empty JSON body and returned HTTP 422 "client_id field required", silently breaking the connection-status badge against any server shipped after the JSON-only heartbeat parser landed. - Added
tests/test_streaming_heartbeat.pyto pin the wire format so this can't regress.
[0.1.1] — 2026-05-19¶
Changed¶
StreamingSinknow fires one immediatesession_openedheartbeat at construction time (i.e. onDebugSession(dsn=…)), so the workspace's connection list shows the client before the first step is recorded. The periodic 15 s heartbeat loop continues to start at__enter__as before.- Heartbeat payload gains an optional
last_eventfield used by the initial fire. Existing servers ignore unknown fields; no client or server upgrade required.
Spec¶
- §4.4 now documents the initial-fire requirement.
- §4.9 now reflects that side effects begin at
DebugSession(...)construction whendsnis configured, not at__enter__.
[0.1.0] — 2026-05-19¶
Initial standalone release.
Added¶
DebugSessioncontext manager — the only API a CUA needs to integrate.- Capture modes (
off/metadata/trace/screenshots/model_io/dispatch/replay/full) spec'd inaugur_sdk.CaptureMode. - Atomic local bundle writer with path-stable layout (
manifest.json,trace.json,steps/,events/,screenshots/,AGENT.md,schema/). - DSN-based streaming sink (Sentry-style). Set
AUGUR_DSN=…or passdsn=directly; per-step + per-screenshot POSTs with a 15 s heartbeat. Bundle on disk is always written even if the network is flapping. - Vendored JSON Schemas (
augur_sdk._schema) — zero monorepo dependency at install time. RedactionPolicy+ shippeddefault-pii-v1policy: dropsAuthorization/Cookie/Set-Cookie/X-API-Key; maskstoken/api_key/ssn/credit_card/cvv; regex-scrubs bearer tokens, AWS keys, JWTs, and password=… patterns.- Diagnostic rules engine + generic
cuarule pack (10 rules). Adapter packs land via theaugur.rule_packsentry-point group. - Adapter base protocol (
augur_sdk.Adapter) + contract for adapter authors. LocalFSStore(default) +S3Storestub for forward compatibility.validate_bundle()— schema-validates every record in a bundle against the vendored JSON Schemas.
Compatibility¶
- Schema version:
0.1. - Python: ≥ 3.11.
- Compatible with the Augur server's
/api/v1/*ingest endpoints at server version 0.1.x.