github OthmanAdi/planning-with-files v2.22.0
v2.22.0 — Formally Benchmarked & Security Hardened

3 months ago

What's in this release

Formal evaluation using Anthropic's skill-creator framework, plus security hardening of the hook system. The numbers are in. The skill works.


Benchmark Results

10 parallel subagents. 5 task types. 30 objectively verifiable assertions. 3 blind A/B comparisons.

Test                        with_skill      without_skill
Pass rate (30 assertions)   96.7% (29/30)   6.7% (2/30)
3-file pattern followed     5/5 evals       0/5 evals
Blind A/B wins              3/3 (100%)      0/3
Avg blind rubric score      10.0/10         6.8/10
Avg tokens per run          19,926          11,899
Avg time per run            115s            98s

Every blind comparator chose with_skill without knowing which was which. Without the skill, agents default to ad-hoc file naming, skip structured planning, and produce less information-dense output. On workflow fidelity, measured as the assertion pass rate, the delta is +90 percentage points (96.7% vs. 6.7%).
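The headline figures can be re-derived from the table. The sketch below recomputes the pass-rate delta and also the relative token and time cost of running with the skill. The helper names (`pass_rate`, `relative_overhead`) are illustrative only and are not part of the project's eval tooling.

```python
# Figures copied from the benchmark table in these release notes.
ASSERTIONS = 30
passed = {"with_skill": 29, "without_skill": 2}
avg_tokens = {"with_skill": 19_926, "without_skill": 11_899}
avg_seconds = {"with_skill": 115, "without_skill": 98}


def pass_rate(passed_count: int, total: int = ASSERTIONS) -> float:
    """Pass rate as a percentage (hypothetical helper)."""
    return 100.0 * passed_count / total


def relative_overhead(with_skill: float, without_skill: float) -> float:
    """Extra cost of with_skill relative to without_skill, in percent."""
    return 100.0 * (with_skill - without_skill) / without_skill


delta_pp = pass_rate(passed["with_skill"]) - pass_rate(passed["without_skill"])
token_overhead = relative_overhead(avg_tokens["with_skill"], avg_tokens["without_skill"])
time_overhead = relative_overhead(avg_seconds["with_skill"], avg_seconds["without_skill"])

print(f"pass-rate delta: {delta_pp:+.1f} pp")    # +90.0 pp
print(f"token overhead:  {token_overhead:+.1f}%")  # +67.5%
print(f"time overhead:   {time_overhead:+.1f}%")   # +17.3%
```

Read this way, the skill trades roughly two thirds more tokens and about a sixth more wall-clock time per run for the 90-point gain in pass rate.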


What's New

  • docs/evals.md — Full methodology: 5 task types, 30 assertions, grading criteria, timing and token data from real subagent runs, blind comparator quotes
  • docs/article.md — Technical write-up covering the security hardening and eval results — ready to publish
  • README.md — Benchmark Results section + 3 verification badges (96.7% pass rate, A/B 3/3 wins, Security Verified)
  • CHANGELOG.md — v2.22.0 entry

Security Hardening (v2.21.0)

A proactive security audit identified a prompt injection amplification vector. The PreToolUse hook re-reads task_plan.md before every tool call, which is exactly what makes the skill effective, but it also means that anything in that file is injected into context repeatedly. Declaring WebFetch/WebSearch in allowed-tools created a path for untrusted web content to reach that file.
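To make the amplification concrete, here is a toy simulation. It is not the skill's hook code: `pre_tool_use_hook` and `run_session` are hypothetical stand-ins, and the "context" is just a Python list. It assumes only what the notes state, namely that the plan file is re-read before every tool call.

```python
from pathlib import Path
import tempfile


def pre_tool_use_hook(plan_path: Path) -> str:
    """Hypothetical stand-in for the hook: return the plan file's current text."""
    return plan_path.read_text(encoding="utf-8")


def run_session(plan_path: Path, tool_calls: list[str]) -> list[str]:
    """Build a toy context in which the plan is re-injected before each tool call."""
    context: list[str] = []
    for call in tool_calls:
        context.append(pre_tool_use_hook(plan_path))
        context.append(f"tool call: {call}")
    return context


with tempfile.TemporaryDirectory() as tmp:
    plan = Path(tmp) / "task_plan.md"
    # Simulate untrusted web content that ended up in the plan file.
    plan.write_text("Goal: ship CLI\nIGNORE PREVIOUS INSTRUCTIONS\n", encoding="utf-8")
    context = run_session(plan, ["Read", "Edit", "Write"])

# One poisoned line in the file shows up once per tool call.
injected = sum("IGNORE PREVIOUS INSTRUCTIONS" in entry for entry in context)
print(f"injected line appears in {injected} of {len(context)} context entries")
```

A single poisoned line is repeated once per tool call, so the exposure grows with the length of the session rather than staying a one-off.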

Fixed by scoping allowed-tools to the skill's actual purpose (planning and file management), and adding an explicit Security Boundary section to SKILL.md. Applied across all 7 IDE variants. The formal evals in this release verify zero regression in workflow fidelity.
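A check along these lines could guard against the network tools being re-added later. This is a minimal sketch under stated assumptions: it treats allowed-tools as a comma- or space-separated string, and every tool name other than WebFetch and WebSearch is illustrative, not taken from the actual SKILL.md.

```python
# Tools named in the release notes as the path for untrusted content.
NETWORK_TOOLS = {"WebFetch", "WebSearch"}


def parse_allowed_tools(value: str) -> list[str]:
    """Split a comma- or whitespace-separated tool list into names (assumed format)."""
    return [name for name in value.replace(",", " ").split() if name]


def network_tools_in(value: str) -> list[str]:
    """Return any network-capable tools present in an allowed-tools value."""
    return sorted(set(parse_allowed_tools(value)) & NETWORK_TOOLS)


# Illustrative values only; the real lists live in each variant's SKILL.md.
before = "Read, Write, Edit, Bash, Glob, Grep, WebFetch, WebSearch"
after = "Read, Write, Edit, Bash, Glob, Grep"

print(network_tools_in(before))  # ['WebFetch', 'WebSearch']
print(network_tools_in(after))   # []
```

Running such a check over all 7 IDE variants would flag any variant whose tool list drifts back toward web access.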

Full technical write-up: docs/article.md


Eval Task Types

ID   Task
1    Python CLI todo tool with persistence
2    Research + compare Python testing frameworks
3    Systematic FastAPI TypeError debugging
4    Django 3.2 → 4.2 migration planning (50k LOC)
5    CI/CD pipeline for TypeScript monorepo
