The first release with measured benchmark results across four frontier models via the Cursor SDK, a corpus-wide hardening pass across all 15 published skills, and the first full-body Stage 3 effectiveness result.
The correction in this release is important: we did not stop at the three descriptions that the router benchmark complained about. Every skill body was audited against the same standard, and the machine-readable substrate moved with it: mechanisms, claims, corpus index, activation fixtures, validators, template, docs, PR/release narrative.
Pull request: #87
Full inventory:CHANGELOG.md
Latest router report:researcher/benchmarks/router/results-published/2026-05-19.md
Stage 3 hard pilot:researcher/benchmarks/effectiveness/results-published/2026-05-19-stage3-hard-pilot.md
Project narrative:researcher/insights/how-we-built-this.md
Technical findings:researcher/insights/auto-research-experiment.md
Benchmark plan:researcher/benchmarks/PLAN.md
Research charts
Stage 2 router results
Three 600-record sweeps at seed=1, same fixture, same shuffle orders:
- Baseline
- Post-description rewrite
- Post-corpus hardening
Measured routing gains:
context-fundamentals: 0.255 -> 0.489 top-1 (+23.4pp)project-development: 0.750 -> 1.000 top-1 (+25pp)tool-design: 0.729 -> 0.807 top-1 (+7.8pp)
May 19 corpus-hardening sweep: 600/600 usable records, 0 format failures after retry.
| Model | Top-1 | Top-3 |
|---|---|---|
| Gemini 3.1 Pro | 0.920 | 0.933 |
| Composer 2 | 0.913 | 0.947 |
| GPT-5.5 | 0.913 | 0.973 |
| Claude Opus 4.7 | 0.840 | 0.933 |
Router negative controls now support expected_primary_skill: "none", so basic arithmetic, translation, and generic formatting prompts no longer reward catch-all skill loading.
Stage 3 effectiveness result
The first Stage 3 pilot was a null result: the runner worked, but the task was too easy. We used that failure correctly and hardened the benchmark.
The final hard pilot uses task 002-filesystem-scratchpad-behavior, which requires a compact scratch/ evidence artifact from a large diagnostic trace. The runner inlines selected full SKILL.md bodies so the benchmark measures actual skill-body content.
Result on composer-2, 3 replications per condition:
| Condition | Success |
|---|---|
| No-skill control | 0/3 |
filesystem-context target
| 3/3 |
Irrelevant bdi-mental-states negative control
| 0/3 |
| Target + related skill | 3/3 |
| Target + unrelated skill | 3/3 |
| Full corpus | 1/3 |
This is +100pp target-minus-control and +100pp target-minus-negative on a deterministic behavior gate. The full-corpus result is also important: precise activation beat loading every skill, which validates the repository's central context-engineering claim.
Corpus-wide hardening
- All 15 skill bodies now include explicit ownership, adjacent
Do not activaterouting, practical guidance, examples, gotchas, integration boundaries, and references. bdi-mental-statesandhosted-agentswere fixed as structural outliers and now pass strict health.filesystem-contextgained compact evidence artifact guidance validated by the Stage 3 hard pilot.- Mechanism registry: 5 -> 16 accepted mechanisms.
- Claim provenance: 6 -> 12 records.
- Activation fixtures: 8 initial / 14 post-router -> 19, now covering every published skill.
- Strict skill health: 0.8111 with 2 flagged skills -> 0.9117 with 0 flagged skills.
validate_repo.py --strictnow enforces full body sections, router prompt schemas, effectiveness task schemas, and explicit non-activation boundaries.
Runner hardening
The Cursor SDK runner now has the required paid-loop safety features:
- Resume by scanning existing per-run JSON files
- Bounded concurrency
- Per-run progress logging
- Automatic retry for transient empty or unparsable router outputs
- Worst-case cost forecast that accounts for retry attempts
- Stage 3 task execution with condition-specific full-body skill context
Special thanks to Cursor for the API credits that made these sweeps practical.
Verification
Green locally after the hard Stage 3 pilot:
python -m py_compile ...for changed researcher scriptspython researcher/scripts/validate_repo.py --strict-> 0 errors, 0 warnings, 15 skillspython researcher/scripts/skill_health.py --strict --no-history-> corpus score 0.9117, 0 flagged skillspython researcher/scripts/check_activation_cases.py-> 19 cases, 0 failurespython researcher/scripts/run_benchmarks.py-> 3 checks, 0 failures, 7 scenariosnpm run typecheckinresearcher/benchmarks/sdk-runner- Router dry-run with retry-aware cost cap
- Effectiveness dry-run for task 002 with cost cap
- Full paid router sweep -> 600/600 usable records
- Paid Stage 3 hard pilot -> target 3/3 vs control 0/3
- IDE lints clean
Honest scope caveats
- Stage 3 now has a task-specific positive result for
filesystem-context, not a corpus-wide effectiveness claim. - Full-corpus loading underperformed targeted loading in the hard pilot; composition needs Stage 4 work.
- No LLM-judge adapter yet for automatic source evaluation or novelty scoring.
- Automated source discovery remains manual-seed based.
- The draft release references PR #87; code still requires human-approved push/merge.