muratcankoylan/Agent-Skills-for-Context-Engineering v2.3.0 on GitHub

The first release with measured benchmark results across four frontier models via the Cursor SDK, a corpus-wide hardening pass across all 15 published skills, and the first full-body Stage 3 effectiveness result.

The correction in this release is important: we did not stop at the three descriptions that the router benchmark complained about. Every skill body was audited against the same standard, and the machine-readable substrate moved with it: mechanisms, claims, corpus index, activation fixtures, validators, template, docs, PR/release narrative.

Pull request: #87
Full inventory: CHANGELOG.md
Latest router report: researcher/benchmarks/router/results-published/2026-05-19.md
Stage 3 hard pilot: researcher/benchmarks/effectiveness/results-published/2026-05-19-stage3-hard-pilot.md
Project narrative: researcher/insights/how-we-built-this.md
Technical findings: researcher/insights/auto-research-experiment.md
Benchmark plan: researcher/benchmarks/PLAN.md

Research charts

Stage 2 router results

Three 600-record sweeps at seed=1, same fixture, same shuffle orders:

Baseline
Post-description rewrite
Post-corpus hardening

Measured routing gains:

context-fundamentals: 0.255 -> 0.489 top-1 (+23.4pp)
project-development: 0.750 -> 1.000 top-1 (+25pp)
tool-design: 0.729 -> 0.807 top-1 (+7.8pp)

May 19 corpus-hardening sweep: 600/600 usable records, 0 format failures after retry.

Model	Top-1	Top-3
Gemini 3.1 Pro	0.920	0.933
Composer 2	0.913	0.947
GPT-5.5	0.913	0.973
Claude Opus 4.7	0.840	0.933

Router negative controls now support expected_primary_skill: "none", so basic arithmetic, translation, and generic formatting prompts no longer reward catch-all skill loading.

Stage 3 effectiveness result

The first Stage 3 pilot was a null result: the runner worked, but the task was too easy. We used that failure correctly and hardened the benchmark.

The final hard pilot uses task 002-filesystem-scratchpad-behavior, which requires a compact scratch/ evidence artifact from a large diagnostic trace. The runner inlines selected full SKILL.md bodies so the benchmark measures actual skill-body content.

Result on composer-2, 3 replications per condition:

Condition	Success
No-skill control	0/3
`filesystem-context` target	3/3
Irrelevant `bdi-mental-states` negative control	0/3
Target + related skill	3/3
Target + unrelated skill	3/3
Full corpus	1/3

This is +100pp target-minus-control and +100pp target-minus-negative on a deterministic behavior gate. The full-corpus result is also important: precise activation beat loading every skill, which validates the repository's central context-engineering claim.

Corpus-wide hardening

All 15 skill bodies now include explicit ownership, adjacent Do not activate routing, practical guidance, examples, gotchas, integration boundaries, and references.
bdi-mental-states and hosted-agents were fixed as structural outliers and now pass strict health.
filesystem-context gained compact evidence artifact guidance validated by the Stage 3 hard pilot.
Mechanism registry: 5 -> 16 accepted mechanisms.
Claim provenance: 6 -> 12 records.
Activation fixtures: 8 initial / 14 post-router -> 19, now covering every published skill.
Strict skill health: 0.8111 with 2 flagged skills -> 0.9117 with 0 flagged skills.
validate_repo.py --strict now enforces full body sections, router prompt schemas, effectiveness task schemas, and explicit non-activation boundaries.

Runner hardening

The Cursor SDK runner now has the required paid-loop safety features:

Resume by scanning existing per-run JSON files
Bounded concurrency
Per-run progress logging
Automatic retry for transient empty or unparsable router outputs
Worst-case cost forecast that accounts for retry attempts
Stage 3 task execution with condition-specific full-body skill context

Special thanks to Cursor for the API credits that made these sweeps practical.

Verification

Green locally after the hard Stage 3 pilot:

python -m py_compile ... for changed researcher scripts
python researcher/scripts/validate_repo.py --strict -> 0 errors, 0 warnings, 15 skills
python researcher/scripts/skill_health.py --strict --no-history -> corpus score 0.9117, 0 flagged skills
python researcher/scripts/check_activation_cases.py -> 19 cases, 0 failures
python researcher/scripts/run_benchmarks.py -> 3 checks, 0 failures, 7 scenarios
npm run typecheck in researcher/benchmarks/sdk-runner
Router dry-run with retry-aware cost cap
Effectiveness dry-run for task 002 with cost cap
Full paid router sweep -> 600/600 usable records
Paid Stage 3 hard pilot -> target 3/3 vs control 0/3
IDE lints clean

Honest scope caveats

Stage 3 now has a task-specific positive result for filesystem-context, not a corpus-wide effectiveness claim.
Full-corpus loading underperformed targeted loading in the hard pilot; composition needs Stage 4 work.
No LLM-judge adapter yet for automatic source evaluation or novelty scoring.
Automated source discovery remains manual-seed based.
The draft release references PR #87; code still requires human-approved push/merge.

muratcankoylan/Agent-Skills-for-Context-Engineering v2.3.0 v2.3.0 — Measured Router Benchmark + Corpus-Wide Skill Hardening on GitHub