github muratcankoylan/Agent-Skills-for-Context-Engineering v2.3.0
v2.3.0 — Measured Router Benchmark + Corpus-Wide Skill Hardening

6 hours ago

The first release with measured benchmark results across four frontier models via the Cursor SDK, a corpus-wide hardening pass across all 15 published skills, and the first full-body Stage 3 effectiveness result.

The correction in this release is important: we did not stop at the three descriptions that the router benchmark complained about. Every skill body was audited against the same standard, and the machine-readable substrate moved with it: mechanisms, claims, corpus index, activation fixtures, validators, template, docs, PR/release narrative.

Pull request: #87
Full inventory: CHANGELOG.md
Latest router report: researcher/benchmarks/router/results-published/2026-05-19.md
Stage 3 hard pilot: researcher/benchmarks/effectiveness/results-published/2026-05-19-stage3-hard-pilot.md
Project narrative: researcher/insights/how-we-built-this.md
Technical findings: researcher/insights/auto-research-experiment.md
Benchmark plan: researcher/benchmarks/PLAN.md

Research charts

Research-to-Skill Operating System

Router Benchmark Leaderboard

Measured Skill Routing Gains

Corpus-Wide Hardening Metrics

Stage 2 router results

Three 600-record sweeps at seed=1, same fixture, same shuffle orders:

  • Baseline
  • Post-description rewrite
  • Post-corpus hardening

Measured routing gains:

  • context-fundamentals: 0.255 -> 0.489 top-1 (+23.4pp)
  • project-development: 0.750 -> 1.000 top-1 (+25pp)
  • tool-design: 0.729 -> 0.807 top-1 (+7.8pp)

May 19 corpus-hardening sweep: 600/600 usable records, 0 format failures after retry.

Model Top-1 Top-3
Gemini 3.1 Pro 0.920 0.933
Composer 2 0.913 0.947
GPT-5.5 0.913 0.973
Claude Opus 4.7 0.840 0.933

Router negative controls now support expected_primary_skill: "none", so basic arithmetic, translation, and generic formatting prompts no longer reward catch-all skill loading.

Stage 3 effectiveness result

The first Stage 3 pilot was a null result: the runner worked, but the task was too easy. We used that failure correctly and hardened the benchmark.

The final hard pilot uses task 002-filesystem-scratchpad-behavior, which requires a compact scratch/ evidence artifact from a large diagnostic trace. The runner inlines selected full SKILL.md bodies so the benchmark measures actual skill-body content.

Result on composer-2, 3 replications per condition:

Condition Success
No-skill control 0/3
filesystem-context target 3/3
Irrelevant bdi-mental-states negative control 0/3
Target + related skill 3/3
Target + unrelated skill 3/3
Full corpus 1/3

This is +100pp target-minus-control and +100pp target-minus-negative on a deterministic behavior gate. The full-corpus result is also important: precise activation beat loading every skill, which validates the repository's central context-engineering claim.

Corpus-wide hardening

  • All 15 skill bodies now include explicit ownership, adjacent Do not activate routing, practical guidance, examples, gotchas, integration boundaries, and references.
  • bdi-mental-states and hosted-agents were fixed as structural outliers and now pass strict health.
  • filesystem-context gained compact evidence artifact guidance validated by the Stage 3 hard pilot.
  • Mechanism registry: 5 -> 16 accepted mechanisms.
  • Claim provenance: 6 -> 12 records.
  • Activation fixtures: 8 initial / 14 post-router -> 19, now covering every published skill.
  • Strict skill health: 0.8111 with 2 flagged skills -> 0.9117 with 0 flagged skills.
  • validate_repo.py --strict now enforces full body sections, router prompt schemas, effectiveness task schemas, and explicit non-activation boundaries.

Runner hardening

The Cursor SDK runner now has the required paid-loop safety features:

  • Resume by scanning existing per-run JSON files
  • Bounded concurrency
  • Per-run progress logging
  • Automatic retry for transient empty or unparsable router outputs
  • Worst-case cost forecast that accounts for retry attempts
  • Stage 3 task execution with condition-specific full-body skill context

Special thanks to Cursor for the API credits that made these sweeps practical.

Verification

Green locally after the hard Stage 3 pilot:

  • python -m py_compile ... for changed researcher scripts
  • python researcher/scripts/validate_repo.py --strict -> 0 errors, 0 warnings, 15 skills
  • python researcher/scripts/skill_health.py --strict --no-history -> corpus score 0.9117, 0 flagged skills
  • python researcher/scripts/check_activation_cases.py -> 19 cases, 0 failures
  • python researcher/scripts/run_benchmarks.py -> 3 checks, 0 failures, 7 scenarios
  • npm run typecheck in researcher/benchmarks/sdk-runner
  • Router dry-run with retry-aware cost cap
  • Effectiveness dry-run for task 002 with cost cap
  • Full paid router sweep -> 600/600 usable records
  • Paid Stage 3 hard pilot -> target 3/3 vs control 0/3
  • IDE lints clean

Honest scope caveats

  • Stage 3 now has a task-specific positive result for filesystem-context, not a corpus-wide effectiveness claim.
  • Full-corpus loading underperformed targeted loading in the hard pilot; composition needs Stage 4 work.
  • No LLM-judge adapter yet for automatic source evaluation or novelty scoring.
  • Automated source discovery remains manual-seed based.
  • The draft release references PR #87; code still requires human-approved push/merge.

Don't miss a new Agent-Skills-for-Context-Engineering release

NewReleases is sending notifications on new releases.