v2.1.4 — Regression Stability Gate
A 14th family member for autoresearch: /autoresearch:regression — a heavy, layered regression-testing gate that captures baseline behavior from a git worktree of the base ref, diffs the candidate across 8 dimensions, and emits a single STABLE / UNSTABLE verdict you can wire into CI.
Released as a patch (v2.1.3 → v2.1.4) to match repo convention: the repo ships features as patch bumps, and `v2.2.0` stays reserved for the planned `evals --compare` feature.
Why it exists
"Did my change break anything?" is the one question every push needs answered mechanically. regression answers it by comparing the candidate against a real baseline of the base ref — not a hand-wavy "looks good" — and refusing to pass anything it could not actually evaluate.
The core invariant
A regression is a green→red transition only. Everything else is classified out, never counted:
| Transition | Classification |
|---|---|
| green → red | regression (the only one that gates) |
| red → red | pre-existing failure |
| absent → red | new coverage |
| flake → red | flaky (routed to the flakiness score) |
Tests are matched by test-id, then path. A dimension with no green baseline reports BASELINE_UNAVAILABLE rather than silently passing.
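The classification table above can be read as a small lookup. The sketch below is illustrative only, not the tool's implementation: the status strings, the `not-a-failure` label, and both function names are assumptions made for this example, and matching is simplified to test-id only (no path fallback).

```python
from typing import Dict, List, Optional

# Hypothetical status labels; the tool's internal names are not given in these notes.
GREEN, RED, FLAKE = "green", "red", "flake"


def classify_transition(baseline: Optional[str], candidate: str) -> str:
    """Classify one test's baseline -> candidate transition.

    `baseline` is None when the test is absent from the base ref.
    Only green -> red is reported as a regression.
    """
    if candidate != RED:
        return "not-a-failure"  # assumed label for anything that does not end red
    if baseline == GREEN:
        return "regression"
    if baseline == RED:
        return "pre-existing failure"
    if baseline is None:
        return "new coverage"
    if baseline == FLAKE:
        return "flaky"
    raise ValueError(f"unknown baseline status: {baseline!r}")


def blocking_regressions(baseline: Dict[str, str], candidate: Dict[str, str]) -> List[str]:
    """Return the sorted test ids whose transition is green -> red."""
    return sorted(
        test_id
        for test_id, status in candidate.items()
        if classify_transition(baseline.get(test_id), status) == "regression"
    )
```

In this sketch a test that fails on the candidate but was already red, absent, or flaky on the base ref never reaches the blocking list.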
Tiered verdict
- HARD gate: `functional`, `api-contract`, `data-migration`, `integration-e2e`. Any green→red ⇒ UNSTABLE.
- SCORE: `flakiness` (.30), `performance` (.30), `resource` (.20), `visual-ui` (.20); a noise-tolerant 0–100 score, UNSTABLE below threshold 95.
- No dimension ran ⇒ `BASELINE_UNAVAILABLE` (fail-safe, never a false green).
The displayed score is floored, never rounded up — it can never read ≥ threshold while the verdict is UNSTABLE.
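One way to picture how the two tiers combine is the sketch below. The weights and the threshold come from the list above; everything else is an assumption for illustration, in particular the function names, the per-dimension 0–100 subscores, and the choice to renormalize weights over the SCORE dimensions that actually ran. It is not the logic of `scripts/score-regression.sh`.

```python
import math
from typing import Dict, Optional

# Weights (in percent) and threshold as listed in the release notes.
SCORE_WEIGHTS = {"flakiness": 30, "performance": 30, "resource": 20, "visual-ui": 20}
THRESHOLD = 95


def floored_score(subscores: Dict[str, float]) -> Optional[int]:
    """Weighted 0-100 score over the SCORE dimensions that ran, floored.

    Assumption: weights are renormalized over the dimensions present.
    Returns None when no SCORE dimension ran.
    """
    ran = {dim: s for dim, s in subscores.items() if dim in SCORE_WEIGHTS}
    if not ran:
        return None
    total_weight = sum(SCORE_WEIGHTS[dim] for dim in ran)
    weighted = sum(SCORE_WEIGHTS[dim] * s for dim, s in ran.items())
    return math.floor(weighted / total_weight)  # floor, never round up


def verdict(hard_regressions: Dict[str, int], subscores: Dict[str, float]) -> str:
    """Combine the HARD gate and the SCORE tier into one verdict.

    `hard_regressions` maps each HARD dimension that ran to its green->red count.
    """
    score = floored_score(subscores)
    if not hard_regressions and score is None:
        return "BASELINE_UNAVAILABLE"  # nothing ran: fail safe
    if any(count > 0 for count in hard_regressions.values()):
        return "UNSTABLE"
    if score is not None and score < THRESHOLD:
        return "UNSTABLE"
    return "STABLE"
```

With subscores of 83, 100, 100, and 100 the weighted value is 94.9. Rounding would display 95 next to an UNSTABLE verdict, while flooring displays 94, which is the inconsistency the floor rule avoids.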
Built to resist false signals
- Statistical perf gate: 7 independent-process samples/side (warmups discarded), Mann–Whitney U, flagged only beyond `max(noise-band%, k·stdev)`. Visual diffs use SSIM that ignores anti-aliasing. `--samples` / `--noise-band` tunable.
- data-migration is hard-guarded: opt-in, forward-only by default, and refuses any DB URL that is not an anchored ephemeral/allowlisted target (host exactly `localhost` / `127.0.0.1` / container, or dbname with a `_test` / `_ci` suffix; a bare substring like `test` inside `latest` does not qualify).
- Hunter reproducibility gate: bisect (reusing `debug`) only for HARD dims passing 3/3 reproduction; SCORE / non-deterministic findings route to differential root-cause instead.
- Dimension self-detection: unavailable dimensions are listed, never silently passed; the verdict declares which dims actually ran.
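The statistical perf gate in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped logic: the defaults (`noise_band_pct=5.0`, `k=3.0`, `alpha=0.05`), the use of medians and of the baseline's standard deviation, and the requirement that both the tolerance and the test agree are all choices made for this example. The p-value is an exact one-sided Mann–Whitney test computed by enumerating relabelings, which is cheap at 7 samples per side.

```python
from itertools import combinations
from statistics import median, stdev


def u_statistic(base, cand):
    """Count (cand, base) pairs where cand is larger; ties count as 0.5."""
    return sum(
        1.0 if c > b else 0.5 if c == b else 0.0
        for c in cand
        for b in base
    )


def exact_p_slower(base, cand):
    """One-sided exact p-value: share of relabelings with U >= the observed U."""
    pooled = list(base) + list(cand)
    observed = u_statistic(base, cand)
    indices = range(len(pooled))
    hits = total = 0
    for chosen in combinations(indices, len(cand)):
        picked = set(chosen)
        relabeled_cand = [pooled[i] for i in chosen]
        relabeled_base = [pooled[i] for i in indices if i not in picked]
        total += 1
        if u_statistic(relabeled_base, relabeled_cand) >= observed:
            hits += 1
    return hits / total


def perf_regressed(base, cand, noise_band_pct=5.0, k=3.0, alpha=0.05):
    """Flag a slowdown only if it clears the tolerance AND is statistically significant.

    Samples are timings (larger means slower). All defaults are hypothetical.
    """
    delta = median(cand) - median(base)
    tolerance = max(median(base) * noise_band_pct / 100.0, k * stdev(base))
    return delta > tolerance and exact_p_slower(base, cand) < alpha
```

Taking the larger of the two tolerances means a quiet benchmark is still protected by the percentage band, and a noisy one by its own spread.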
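The anchored DB URL check described in the data-migration item can be illustrated with a minimal sketch. The function name and the `extra_hosts` parameter are hypothetical; the notes do not spell out what the "container" host matches, so this example leaves that to a caller-supplied allowlist instead of guessing.

```python
from urllib.parse import urlsplit

# Hosts and suffixes named in the release notes.
EPHEMERAL_HOSTS = frozenset({"localhost", "127.0.0.1"})
EPHEMERAL_DB_SUFFIXES = ("_test", "_ci")


def is_ephemeral_db_url(url: str, extra_hosts: frozenset = frozenset()) -> bool:
    """Return True only for anchored matches: exact host or dbname suffix.

    `extra_hosts` is a hypothetical hook for container hostnames.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    dbname = parts.path.lstrip("/")
    if host in EPHEMERAL_HOSTS or host in extra_hosts:
        return True  # exact host match, not a prefix or substring
    return dbname.endswith(EPHEMERAL_DB_SUFFIXES)  # suffix match, not a substring
```

Under these assumptions `postgres://db.internal/app_test` is accepted, while `postgres://db.internal/latest` and `postgres://localhost.example.com/prod` are refused, because neither a substring of the dbname nor a prefix of the host counts as a match.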
Composable, like the rest of the family
--predict (pre-empt before the gate), --reason (ambiguous root cause), --probe / --no-probe, --debug, --fix / --fix-cycles (max 3, each must strictly shrink the blocking set or STOP "not converging" — no HARD-gate bypass), --evals, --chain, --max-runs.
Canonical combo:
/autoresearch:regression --predict --evals --fix --ship
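The `--fix-cycles` convergence rule (at most 3 cycles, each of which must strictly shrink the blocking set) can be sketched as a bounded loop. This is a hedged illustration: `attempt_fix` is a hypothetical callback, "strictly shrink" is interpreted here as a smaller set size, and the `resolved` and `cycles exhausted` labels are invented for the example. Only `not converging` comes from the notes.

```python
from typing import Callable, Set, Tuple

MAX_FIX_CYCLES = 3  # upper bound stated in the release notes


def run_fix_cycles(
    blocking: Set[str],
    attempt_fix: Callable[[Set[str]], Set[str]],
    max_cycles: int = MAX_FIX_CYCLES,
) -> Tuple[str, Set[str]]:
    """Repeat fix attempts while the blocking set keeps getting smaller.

    `attempt_fix` applies one fix cycle and returns the blocking set
    measured after the gate is re-run (hypothetical interface).
    """
    current = set(blocking)
    for _ in range(max_cycles):
        if not current:
            return "resolved", current
        after = set(attempt_fix(current))
        if len(after) >= len(current):  # no strict shrink: stop immediately
            return "not converging", after
        current = after
    return ("resolved" if not current else "cycles exhausted"), current
```

Note that the loop never edits the verdict itself: a non-empty blocking set at the end still leaves the HARD gate failing, consistent with the "no HARD-gate bypass" rule.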
Distribution & tests
- 5-surface mirror parity: Claude Code, OpenCode, Codex, `.agents`, and the bundled plugin all carry a byte-identical command spec.
- 3 plugin manifests + all 5 `SKILL.md` routing tables updated; command count 13 → 14.
- New `scripts/score-regression.sh` (rubric + verdict backend, CI exit codes).
- New `tests/test-regression.sh`: 50 assertions over 10 golden fixtures. Full suite 155/155 green (105 hooks + 50 regression).
Docs
README, the per-platform guides, COMPARISON.md, docs/project-changelog.md, docs/development-roadmap.md, and docs/system-architecture.md all updated for the 14-command surface.
Full Changelog: v2.1.3...v2.1.4