OthmanAdi/planning-with-files v2.22.0 on GitHub

What's in this release

This release covers a formal evaluation using Anthropic's skill-creator framework, plus the security hardening of the hook system. The numbers are in, and the skill works.

Benchmark Results

10 parallel subagents. 5 task types. 30 objectively verifiable assertions. 3 blind A/B comparisons.

Metric	with_skill	without_skill
Pass rate (30 assertions)	96.7% (29/30)	6.7% (2/30)
3-file pattern followed	5/5 evals	0/5 evals
Blind A/B wins	3/3 (100%)	0/3
Avg blind rubric score	10.0/10	6.8/10
Avg tokens per run	19,926	11,899
Avg time per run	115s	98s

Every blind comparator chose with_skill without knowing which was which. Without the skill, agents default to ad-hoc file naming, skip structured planning, and produce less information-dense output. The delta is +90 percentage points in assertion pass rate (96.7% vs. 6.7%), which is the workflow-fidelity measure here.
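The headline figures can be re-derived from the table with a few lines of arithmetic. The sketch below is only a sanity check on the published numbers; the variable and function names are illustrative and do not come from the repository.

```python
# Figures copied from the Benchmark Results table.
WITH_SKILL = {"passed": 29, "tokens": 19_926, "seconds": 115}
WITHOUT_SKILL = {"passed": 2, "tokens": 11_899, "seconds": 98}
TOTAL_ASSERTIONS = 30


def pass_rate(passed: int, total: int = TOTAL_ASSERTIONS) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total


def relative_overhead(with_value: float, without_value: float) -> float:
    """Return how much larger with_value is than without_value, in percent."""
    return 100.0 * (with_value - without_value) / without_value


delta_points = pass_rate(WITH_SKILL["passed"]) - pass_rate(WITHOUT_SKILL["passed"])
token_overhead = relative_overhead(WITH_SKILL["tokens"], WITHOUT_SKILL["tokens"])
time_overhead = relative_overhead(WITH_SKILL["seconds"], WITHOUT_SKILL["seconds"])

print(f"pass-rate delta: {delta_points:+.1f} percentage points")  # +90.0
print(f"token overhead:  {token_overhead:+.1f}%")                 # +67.5%
print(f"time overhead:   {time_overhead:+.1f}%")                  # +17.3%
```

Read together, the table shows the cost side of the trade: roughly two thirds more tokens and about a sixth more wall-clock time per run in exchange for the pass-rate gain.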

What's New

docs/evals.md — Full methodology: 5 task types, 30 assertions, grading criteria, timing and token data from real subagent runs, blind comparator quotes
docs/article.md — Technical write-up covering the security hardening and eval results — ready to publish
README.md — Benchmark Results section + 3 verification badges (96.7% pass rate, A/B 3/3 wins, Security Verified)
CHANGELOG.md — v2.22.0 entry

Security Hardening (v2.21.0)

A proactive security audit identified a prompt injection amplification vector. The PreToolUse hook re-reads task_plan.md before every tool call. That is exactly what makes the skill effective, but it also means anything in that file is injected into context repeatedly. Declaring WebFetch/WebSearch in allowed-tools created a path for untrusted content to reach that file.
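A toy model makes the amplification concrete. The sketch below is not the project's hook; `pre_tool_use_hook` and `run_session` are hypothetical stand-ins that assume the hook simply appends the plan file's contents to the context before each tool call.

```python
import tempfile
from pathlib import Path


def pre_tool_use_hook(plan_path: Path, context: list[str]) -> None:
    """Hypothetical stand-in for the hook: append the current plan to context."""
    context.append(plan_path.read_text(encoding="utf-8"))


def run_session(plan_path: Path, tool_calls: int) -> list[str]:
    """Simulate a session in which the hook fires before every tool call."""
    context: list[str] = []
    for _ in range(tool_calls):
        pre_tool_use_hook(plan_path, context)
    return context


with tempfile.TemporaryDirectory() as tmp:
    plan = Path(tmp) / "task_plan.md"
    # One untrusted line, written to the plan file a single time.
    injected = "IGNORE PREVIOUS INSTRUCTIONS"
    plan.write_text(f"# Plan\n- [ ] step 1\n{injected}\n", encoding="utf-8")

    context = run_session(plan, tool_calls=5)
    occurrences = sum(chunk.count(injected) for chunk in context)
    print(occurrences)  # 5: written once, seen once per tool call
```

Under this assumption, a single write of untrusted text is replayed on every subsequent tool call, which is why limiting what can reach the file matters more than filtering any one read.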

The fix scopes allowed-tools to the skill's actual purpose (planning and file management) and adds an explicit Security Boundary section to SKILL.md. It is applied across all 7 IDE variants. The formal evals in this release verify zero regression in workflow fidelity.
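A scoping rule like this can be checked mechanically. The following is a minimal sketch, assuming SKILL.md starts with `---`-delimited frontmatter containing a single-line `allowed-tools:` entry of comma- or space-separated names. The helper names and the sample frontmatter are hypothetical and are not taken from the repository.

```python
# Tools that can pull untrusted remote content into the session.
NETWORK_TOOLS = {"WebFetch", "WebSearch"}


def parse_allowed_tools(skill_md: str) -> set[str]:
    """Extract tool names from a one-line `allowed-tools:` frontmatter entry.

    Assumes frontmatter delimited by `---` lines and a single line of
    comma- or space-separated tool names. This is not a full YAML parser.
    """
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "allowed-tools":
            return set(value.replace(",", " ").split())
    return set()


def network_tools_declared(skill_md: str) -> set[str]:
    """Return any network-capable tools the skill declares."""
    return parse_allowed_tools(skill_md) & NETWORK_TOOLS


# Hypothetical frontmatter for illustration only.
before = "---\nname: example\nallowed-tools: Read, Write, Edit, WebFetch, WebSearch\n---\n# Body\n"
after = "---\nname: example\nallowed-tools: Read, Write, Edit\n---\n# Body\n"

print(sorted(network_tools_declared(before)))  # ['WebFetch', 'WebSearch']
print(sorted(network_tools_declared(after)))   # []
```

A check of this shape could run in CI against each IDE variant so that a network tool cannot be reintroduced unnoticed; whether the project does so is not stated in the release notes.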

Full technical write-up: docs/article.md

Eval Task Types

ID	Task
1	Python CLI todo tool with persistence
2	Research + compare Python testing frameworks
3	Systematic FastAPI TypeError debugging
4	Django 3.2 → 4.2 migration planning (50k LOC)
5	CI/CD pipeline for TypeScript monorepo

OthmanAdi/planning-with-files v2.22.0: Formally Benchmarked & Security Hardened (on GitHub)
