github uditgoenka/autoresearch v2.1.4
v2.1.4 — Regression Stability Gate

4 hours ago

A 14th family member for autoresearch: /autoresearch:regression — a heavy, layered regression-testing gate that captures baseline behavior from a git worktree of the base ref, diffs the candidate across 8 dimensions, and emits a single STABLE / UNSTABLE verdict you can wire into CI.

Released as a patch (v2.1.3 → v2.1.4) to match repo convention — the repo ships features as patch bumps, and v2.2.0 stays reserved for the planned evals --compare feature.


Why it exists

"Did my change break anything?" is the one question every push needs answered mechanically. regression answers it by comparing the candidate against a real baseline of the base ref — not a hand-wavy "looks good" — and refusing to pass anything it could not actually evaluate.

The core invariant

A regression is a green→red transition only. Everything else is classified out, never counted:

Transition      Classification
green → red     regression (the only one that gates)
red → red       pre-existing failure
absent → red    new coverage
flake → red     flaky (routed to the flakiness score)

Tests are matched between baseline and candidate by test-id first, then by path. A dimension with no green baseline reports BASELINE_UNAVAILABLE rather than silently passing.
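The classification table can be sketched in a few lines of Python. This is an illustration only, not the project's implementation: the names `Status`, `find_baseline`, and `classify` are hypothetical, and the "not a failure" label for a candidate that is not red is an assumption, since the table only covers transitions that end in red.

```python
from enum import Enum
from typing import Dict, Optional


class Status(Enum):
    GREEN = "green"
    RED = "red"
    FLAKE = "flake"


def find_baseline(test_id: str, path: str,
                  by_id: Dict[str, Status],
                  by_path: Dict[str, Status]) -> Optional[Status]:
    """Look up the baseline status by test-id first, then by path.

    Returns None when the test is absent from the baseline.
    """
    if test_id in by_id:
        return by_id[test_id]
    return by_path.get(path)


def classify(baseline: Optional[Status], candidate: Status) -> str:
    """Map one baseline -> candidate transition to its classification."""
    if candidate is not Status.RED:
        # Assumption: the table only describes transitions ending in red.
        return "not a failure"
    if baseline is None:
        return "new coverage"
    if baseline is Status.FLAKE:
        return "flaky"
    if baseline is Status.RED:
        return "pre-existing failure"
    return "regression"  # green -> red, the only transition that gates


def gates(classification: str) -> bool:
    """Only a true green -> red regression blocks the verdict."""
    return classification == "regression"
```

Keeping `classify` a pure function of the two statuses means "pre-existing" and "new coverage" can never be counted as regressions by accident.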

Tiered verdict

  • HARD gate: functional, api-contract, data-migration, integration-e2e. Any green→red ⇒ UNSTABLE.
  • SCORE: flakiness (.30), performance (.30), resource (.20), visual-ui (.20); a noise-tolerant 0–100 score, UNSTABLE below threshold 95.
  • No dimension ran: BASELINE_UNAVAILABLE (fail-safe, never a false green).

The displayed score is floored, never rounded up — it can never read ≥ threshold while the verdict is UNSTABLE.
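A minimal sketch of how the tiers could combine, using the weights and threshold quoted above. Everything else is an assumption made for illustration: the function name `verdict`, the input shapes, renormalizing the weights over the SCORE dimensions that actually ran, and returning no score when a HARD gate fails (the notes do not say how the score is shown in that case).

```python
import math
from typing import Dict, Optional, Tuple

# Weights from the release notes, kept as integers to avoid float drift.
WEIGHTS = {"flakiness": 30, "performance": 30, "resource": 20, "visual-ui": 20}
THRESHOLD = 95


def verdict(hard_regressions: Dict[str, int],
            scores: Dict[str, float]) -> Tuple[str, Optional[int]]:
    """Return (verdict, displayed_score).

    hard_regressions: green -> red count per HARD dimension that ran.
    scores: 0-100 score per SCORE dimension that ran.
    """
    if not hard_regressions and not scores:
        # Nothing could be evaluated: fail safe, never a false green.
        return "BASELINE_UNAVAILABLE", None

    if any(count > 0 for count in hard_regressions.values()):
        # Assumption: no score is reported when a HARD gate fails.
        return "UNSTABLE", None

    if not scores:
        return "STABLE", None

    # Assumption: weights are renormalized over the dimensions that ran.
    total_weight = sum(WEIGHTS[dim] for dim in scores)
    score = sum(WEIGHTS[dim] * value for dim, value in scores.items()) / total_weight

    # Floor, never round: 94.99 must display as 94, not 95.
    displayed = math.floor(score)
    return ("STABLE" if score >= THRESHOLD else "UNSTABLE"), displayed
```

With every SCORE dimension at 94.99, rounding would display 95 next to an UNSTABLE verdict; flooring displays 94, which is the property the notes describe.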

Built to resist false signals

  • Statistical perf gate — 7 independent-process samples/side (warmups discarded), Mann–Whitney U, flagged only beyond max(noise-band%, k·stdev). Visual diffs use SSIM that ignores anti-aliasing. --samples / --noise-band tunable.
  • data-migration is hard-guarded — opt-in, forward-only by default, and refuses any DB URL that is not an anchored ephemeral/allowlisted target (host exactly localhost/127.0.0.1/container, or dbname with a _test/_ci suffix — a bare substring like test inside latest does not qualify).
  • Hunter reproducibility gate — bisect (reusing debug) only for HARD dims passing 3/3 reproduction; SCORE / non-deterministic findings route to differential root-cause instead.
  • Dimension self-detection — unavailable dimensions are listed, never silently passed; the verdict declares which dims actually ran.
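The anchored DB-URL check described for data-migration can be illustrated with the standard library. This is a sketch, not the project's code: `is_ephemeral_target` is a hypothetical name, and because the notes do not spell out how a "container" host is recognized, the allowlist is left as a caller-supplied parameter that defaults to the two literal hosts.

```python
from urllib.parse import urlparse

# Literal hosts named in the release notes; container hostnames are an
# assumption left to the caller via `allowed_hosts`.
DEFAULT_HOSTS = frozenset({"localhost", "127.0.0.1"})
SAFE_DB_SUFFIXES = ("_test", "_ci")


def is_ephemeral_target(db_url: str, allowed_hosts=DEFAULT_HOSTS) -> bool:
    """Accept a DB URL only on an exact host match or a dbname suffix match."""
    parsed = urlparse(db_url)
    host = (parsed.hostname or "").lower()
    dbname = parsed.path.lstrip("/")

    # Exact comparison: "localhost.example.com" must not pass as "localhost".
    if host in allowed_hosts:
        return True

    # Anchored suffix: "latest" contains "test" but does not end in "_test".
    return dbname.endswith(SAFE_DB_SUFFIXES)
```

The point of anchoring is that both checks compare whole tokens (the full hostname, the end of the database name) rather than searching for a substring anywhere in the URL.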

Composable, like the rest of the family

--predict (pre-empt before the gate), --reason (ambiguous root cause), --probe / --no-probe, --debug, --fix / --fix-cycles (max 3, each must strictly shrink the blocking set or STOP "not converging" — no HARD-gate bypass), --evals, --chain, --max-runs.
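The convergence rule for --fix-cycles can be sketched as a bounded loop. The names `run_fix_cycles` and `attempt_fix` are hypothetical. Reading "strictly shrink" as "the new blocking set is a strict subset of the old one" is an assumption (a size-only comparison is the other plausible reading), and so is reporting the last set that did shrink when a cycle fails to converge.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple


def run_fix_cycles(blocking: Iterable[str],
                   attempt_fix: Callable[[FrozenSet[str]], Iterable[str]],
                   max_cycles: int = 3) -> Tuple[str, Set[str]]:
    """Run up to `max_cycles` fix attempts over a set of blocking findings.

    `attempt_fix` receives the current blocking set and returns the set
    that still blocks after the attempt.
    """
    current = set(blocking)
    for _ in range(max_cycles):
        if not current:
            break
        after = set(attempt_fix(frozenset(current)))
        # Assumption: "strictly shrink" means a strict subset, so a fix that
        # trades one blocker for a new one also counts as not converging.
        if not after < current:
            return "not converging", current
        current = after
    return ("resolved" if not current else "cycles exhausted"), current
```

Note that the loop only decides when to stop; it never downgrades a blocker, which matches the "no HARD-gate bypass" constraint.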

Canonical combo:

/autoresearch:regression --predict --evals --fix --ship

Distribution & tests

  • 5-surface mirror parity — Claude Code, OpenCode, Codex, .agents, and the bundled plugin all carry a byte-identical command spec.
  • 3 plugin manifests + all 5 SKILL.md routing tables updated; command count 13 → 14.
  • New scripts/score-regression.sh (rubric + verdict backend, CI exit codes).
  • New tests/test-regression.sh: 50 assertions over 10 golden fixtures. Full suite 155/155 green (105 hooks + 50 regression).

Docs

README, the per-platform guides, COMPARISON.md, docs/project-changelog.md, docs/development-roadmap.md, and docs/system-architecture.md all updated for the 14-command surface.


Full Changelog: v2.1.3...v2.1.4
