## What's New

### New Skill: Advanced Evaluation
A comprehensive skill for mastering LLM-as-a-Judge evaluation techniques, based on Eugene Yan's LLM-Evaluators research.
Covers:
- Direct scoring vs. pairwise comparison selection
- Position, length, and verbosity bias mitigation
- Metric selection (Cohen's κ, Spearman's ρ, Kendall's τ)
- Production evaluation pipeline design
- 10 actionable guidelines for reliable evaluation
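To make the metric-selection point concrete, Cohen's κ measures chance-corrected agreement between two raters over categorical labels, which is why it suits classification-style judge outputs. A minimal sketch (a hypothetical helper, not code from the skill itself):

```typescript
// Cohen's kappa: chance-corrected agreement between two raters.
// kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
// p_e is the agreement expected by chance from each rater's label frequencies.
function cohensKappa(a: string[], b: string[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("both raters must label the same items");
  }
  const n = a.length;
  const labels = [...new Set([...a, ...b])];
  // Observed agreement: fraction of items both raters labeled identically.
  const po = a.filter((x, i) => x === b[i]).length / n;
  // Expected chance agreement from each rater's marginal label frequencies.
  let pe = 0;
  for (const label of labels) {
    const pa = a.filter((x) => x === label).length / n;
    const pb = b.filter((x) => x === label).length / n;
    pe += pa * pb;
  }
  return (po - pe) / (1 - pe);
}
```

Spearman's ρ and Kendall's τ are the rank-correlation counterparts, appropriate when the judge emits ordinal scores rather than discrete labels.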
`skills/advanced-evaluation/`

### New Example: LLM-as-Judge Skills
A complete TypeScript implementation built on AI SDK 6, demonstrating the Advanced Evaluation skill in practice.
Includes:
- 3 evaluation tools: `directScore`, `pairwiseCompare`, `generateRubric`
- `EvaluatorAgent` class with full evaluation workflows
- 19 passing tests with real OpenAI API calls
- Position bias mitigation with automatic position swapping
- Zod schemas for type-safe inputs/outputs
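Position bias means a pairwise judge tends to favor whichever answer is shown first. The automatic-swapping mitigation can be sketched as follows; `judge` here is a hypothetical stand-in for the example's `pairwiseCompare` tool, not its actual signature:

```typescript
type Verdict = "A" | "B" | "tie";

// Sketch of position-swap mitigation: run the pairwise judge twice with the
// candidates in both orders, and only accept a verdict that survives the swap.
async function unbiasedCompare(
  judge: (first: string, second: string) => Promise<Verdict>,
  a: string,
  b: string,
): Promise<Verdict> {
  const forward = await judge(a, b);   // A shown in the first position
  const reversed = await judge(b, a);  // B shown in the first position
  // Flip the reversed verdict back into A/B terms.
  const flipped: Verdict =
    reversed === "A" ? "B" : reversed === "B" ? "A" : "tie";
  // A stable verdict is position-independent; disagreement signals
  // position bias, so it is conservatively reported as a tie.
  return forward === flipped ? forward : "tie";
}
```

The cost is one extra judge call per comparison, which is the usual trade-off for this mitigation.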
`examples/llm-as-judge-skills/`

### Quick Start
```sh
cd examples/llm-as-judge-skills
npm install
cp env.example .env  # Add your OPENAI_API_KEY
npm test
```
### Skills Applied
This example demonstrates how multiple skills work together:
- `advanced-evaluation`: Core evaluation patterns
- `tool-design`: Zod schemas and error handling
- `context-fundamentals`: Structured evaluation prompts
- `evaluation`: Foundational evaluation concepts
### Contributors
**Full Changelog**: v1.0.0...v1.1.0 (https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering/commits/v1.1.0)