What's in this release
Formal evaluation using Anthropic's skill-creator framework, plus security hardening of the hook system. The numbers are in. The skill works.
Benchmark Results
10 parallel subagents. 5 task types. 30 objectively verifiable assertions. 3 blind A/B comparisons.
| Metric | with_skill | without_skill |
|---|---|---|
| Pass rate (30 assertions) | 96.7% (29/30) | 6.7% (2/30) |
| 3-file pattern followed | 5/5 evals | 0/5 evals |
| Blind A/B wins | 3/3 (100%) | 0/3 |
| Avg blind rubric score | 10.0/10 | 6.8/10 |
| Avg tokens per run | 19,926 | 11,899 |
| Avg time per run | 115s | 98s |
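All three blind comparators chose the with_skill output without knowing which was which. Without the skill, agents default to ad-hoc file naming, skip structured planning, and produce less information-dense output. The assertion pass rate, the measure of workflow fidelity here, rises by 90 percentage points (6.7% to 96.7%), at a cost of roughly 67% more tokens and 17% more time per run.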
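For readers who want to check the arithmetic, here is a minimal sketch that recomputes the pass-rate delta and the token and time overhead from the table's figures. The dictionary layout and function names are illustrative assumptions, not part of the eval harness.

```python
# Figures copied from the benchmark table; the dict layout is illustrative.
WITH_SKILL = {"passed": 29, "total": 30, "tokens": 19_926, "seconds": 115}
WITHOUT_SKILL = {"passed": 2, "total": 30, "tokens": 11_899, "seconds": 98}


def pass_rate(run: dict) -> float:
    """Fraction of assertions that passed, in [0, 1]."""
    return run["passed"] / run["total"]


def relative_increase(new: float, old: float) -> float:
    """Relative change from old to new, as a fraction (0.5 means +50%)."""
    return (new - old) / old


# Difference of two percentages, expressed in percentage points.
delta_pp = (pass_rate(WITH_SKILL) - pass_rate(WITHOUT_SKILL)) * 100
token_overhead = relative_increase(WITH_SKILL["tokens"], WITHOUT_SKILL["tokens"])
time_overhead = relative_increase(WITH_SKILL["seconds"], WITHOUT_SKILL["seconds"])

print(f"pass-rate delta: {delta_pp:+.1f} pp")
print(f"token overhead:  {token_overhead:+.1%}")
print(f"time overhead:   {time_overhead:+.1%}")
```

Run as-is, this prints a delta of +90.0 pp, a token overhead of +67.5%, and a time overhead of +17.3%.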
What's New
- docs/evals.md: Full methodology, including 5 task types, 30 assertions, grading criteria, timing and token data from real subagent runs, and blind comparator quotes
- docs/article.md: Technical write-up covering the security hardening and eval results, ready to publish
- README.md: Benchmark Results section + 3 verification badges (96.7% pass rate, A/B 3/3 wins, Security Verified)
- CHANGELOG.md: v2.22.0 entry
Security Hardening (v2.21.0)
A proactive security audit identified a prompt injection amplification vector. The PreToolUse hook re-reads task_plan.md before every tool call, which is exactly what makes the skill effective, but it also means anything in that file gets injected into context repeatedly. Declaring WebFetch/WebSearch in allowed-tools created a path for untrusted content to reach that file.
The fix scopes allowed-tools to the skill's actual purpose (planning and file management) and adds an explicit Security Boundary section to SKILL.md. It is applied across all 7 IDE variants. The formal evals in this release verify zero regression in workflow fidelity.
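The scoping idea can be illustrated with a small check that flags tools able to pull untrusted external content into context. This is a hedged sketch only: the function name, the `NETWORK_TOOLS` set, and the example tool lists (other than WebFetch and WebSearch, which the audit names) are assumptions, and it does not reproduce how the skill or any IDE actually parses allowed-tools.

```python
# Hypothetical set of tools that can bring untrusted external content into
# context. WebFetch and WebSearch are the two named in the audit.
NETWORK_TOOLS = frozenset({"WebFetch", "WebSearch"})


def find_untrusted_input_tools(allowed_tools: list[str]) -> list[str]:
    """Return the declared tools that can fetch untrusted content, sorted."""
    return sorted(set(allowed_tools) & NETWORK_TOOLS)


# Illustrative tool lists; names other than WebFetch/WebSearch are assumptions.
before = ["Read", "Write", "Edit", "WebFetch", "WebSearch"]
after = ["Read", "Write", "Edit"]

print(find_untrusted_input_tools(before))  # ['WebFetch', 'WebSearch']
print(find_untrusted_input_tools(after))   # []
```

A check like this could run in CI against each variant's declared tool list, so that a network-capable tool cannot be reintroduced unnoticed.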
Full technical write-up: docs/article.md
Eval Task Types
| ID | Task |
|---|---|
| 1 | Python CLI todo tool with persistence |
| 2 | Research + compare Python testing frameworks |
| 3 | Systematic FastAPI TypeError debugging |
| 4 | Django 3.2 → 4.2 migration planning (50k LOC) |
| 5 | CI/CD pipeline for TypeScript monorepo |