github ConardLi/easy-dataset 1.7.0
[1.7.0] 2026-01-12

In v1.7.0 we focus on the core pain points of model evaluation: test sets are hard to construct, quantitative metrics are hard to obtain, and automated and manual testing are disconnected. This release introduces a brand-new Evaluation Module, creating a closed loop from test set generation through automated scoring to human blind testing.

Here are the detailed updates:

🎉 Core Highlights

  • One-Stop Evaluation Loop: Generate test questions automatically from raw documents, run one-click multi-model automated scoring, and review visualized comparison reports.
  • LMArena Mode Integration: Built-in "Chatbot Arena" style human blind testing to quantify subjective "feelings" into actionable data.
  • Multi-Dimensional Question Support: Covers 5 core question types, satisfying evaluation needs ranging from fact-checking to logical reasoning.

🚀 New Features

1. Intelligent Test Set Generation

Stop worrying about the lack of QA pairs. You can now quickly build high-quality evaluation datasets through multiple methods:

  • Document Extraction: Automatically slice and extract questions from PDF/Markdown/Docx domain literature.
  • Dataset Variants: Generate variants of existing training sets (e.g., converting multiple-choice questions to true/false) to broaden test diversity.
  • 5 Core Question Types:
      • True/False: Detect model hallucinations.
      • Single/Multiple Choice: Test knowledge extraction and discrimination.
      • Short Answer: Test the ability to capture and express core knowledge concisely.
      • Open-Ended: Evaluate long-text reasoning and summarization skills.
  • Ratio Configuration: Set a custom distribution ratio for each question type in a generation task (see the configuration sketch after this list).
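
To make the ratio configuration concrete, here is a minimal TypeScript sketch of what a generation-task config could look like. The type names, fields, and the `questionCounts` helper are illustrative assumptions, not the project's actual API.

```ts
// Hypothetical shape of a test-set generation config; field names are
// illustrative and do not come from the easy-dataset codebase.
type QuestionType =
  | 'true_false'
  | 'single_choice'
  | 'multiple_choice'
  | 'short_answer'
  | 'open_ended';

interface GenerationConfig {
  sourceDocuments: string[];            // PDF/Markdown/Docx files to slice
  totalQuestions: number;               // how many questions to generate
  ratios: Record<QuestionType, number>; // per-type distribution, should sum to 1
}

const config: GenerationConfig = {
  sourceDocuments: ['./docs/domain-guide.pdf'],
  totalQuestions: 100,
  ratios: {
    true_false: 0.2,
    single_choice: 0.25,
    multiple_choice: 0.15,
    short_answer: 0.2,
    open_ended: 0.2,
  },
};

// Turn the ratios into concrete per-type question counts for the task.
function questionCounts(cfg: GenerationConfig): Record<QuestionType, number> {
  const counts = {} as Record<QuestionType, number>;
  for (const [type, ratio] of Object.entries(cfg.ratios) as [QuestionType, number][]) {
    counts[type] = Math.round(cfg.totalQuestions * ratio);
  }
  return counts;
}

// questionCounts(config) -> { true_false: 20, single_choice: 25, ... }
```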

2. Automated Evaluation Tasks (Auto-Evaluation)

Conduct concurrent assessments on multiple models just like an "exam". The system supports two grading modes:

  • Rule-Based Scoring (Objective): For Choice and True/False questions. No LLM call is needed; the system compares answers directly in code, so grading is zero-cost and deterministic (a minimal grading sketch follows below).
  • Teacher Model Scoring (LLM-as-a-Judge): For Short Answer and Open-Ended questions. Configure a "Teacher Model" to score and comment on answers against customizable criteria (prompts).
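
To illustrate the difference between the two grading modes, here is a minimal TypeScript sketch. The data shapes and the `judgeWithTeacherModel` callback are assumptions for illustration, not the project's actual interfaces.

```ts
// Objective items (choice / true-false) carry a gold answer that can be
// compared directly in code, without calling any LLM.
interface ObjectiveItem {
  expected: string[];  // gold answer, e.g. ['A', 'C'] or ['TRUE']
  predicted: string[]; // answer extracted from the model's output
}

// Rule-based scoring: order-insensitive, case-insensitive exact match.
function scoreObjective(item: ObjectiveItem): number {
  const norm = (xs: string[]) =>
    [...xs].map((x) => x.trim().toUpperCase()).sort().join(',');
  return norm(item.expected) === norm(item.predicted) ? 1 : 0;
}

// Subjective items (short answer / open-ended) are delegated to a configurable
// teacher model. `judgeWithTeacherModel` is a placeholder for an
// LLM-as-a-Judge call that applies the user-defined scoring prompt.
async function scoreSubjective(
  question: string,
  reference: string,
  answer: string,
  judgeWithTeacherModel: (prompt: string) => Promise<{ score: number; comment: string }>,
): Promise<{ score: number; comment: string }> {
  const prompt =
    `Question: ${question}\nReference answer: ${reference}\n` +
    `Model answer: ${answer}\nGive a 0-10 score and a short comment.`;
  return judgeWithTeacherModel(prompt);
}
```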

3. Human Blind Test Arena

Bring back the authentic side-by-side evaluation experience:

  • Anonymous Battle: Model names are hidden. Answers are displayed side-by-side (supports streaming output).
  • Intuitive Voting: Simply click "Left is better", "Right is better", or "Tie".
  • Win Rate Statistics: Automatically generate win-rate comparison charts to eliminate brand bias (a minimal tally sketch follows below).
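
As an illustration of how win-rate statistics could be tallied from anonymous votes, here is a minimal TypeScript sketch; the vote shape and model identifiers are hypothetical.

```ts
// One blind-test vote: which anonymized side the human preferred.
type Vote = { left: string; right: string; winner: 'left' | 'right' | 'tie' };

// Aggregate per-model wins and battles; ties count as a battle but not a win.
function winRates(votes: Vote[]): Map<string, { wins: number; battles: number; rate: number }> {
  const stats = new Map<string, { wins: number; battles: number; rate: number }>();
  const bump = (model: string, won: boolean) => {
    const s = stats.get(model) ?? { wins: 0, battles: 0, rate: 0 };
    s.battles += 1;
    if (won) s.wins += 1;
    s.rate = s.wins / s.battles;
    stats.set(model, s);
  };
  for (const v of votes) {
    bump(v.left, v.winner === 'left');
    bump(v.right, v.winner === 'right');
  }
  return stats;
}

// Example: two votes between anonymized models A and B.
const rates = winRates([
  { left: 'model-A', right: 'model-B', winner: 'left' },
  { left: 'model-B', right: 'model-A', winner: 'tie' },
]);
// rates.get('model-A') -> { wins: 1, battles: 2, rate: 0.5 }
```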

4. Data & Ecosystem

  • Multi-Format Import/Export: Full support for JSON, XLS, and XLSX test-set import and export (a hypothetical JSON item is sketched below).
  • Built-in Domain Question Banks: Pre-loaded standard test sets across multiple disciplines and domains, ready to use out of the box.
  • Fully Open Prompts: All evaluation-system prompts (question generation, answer extraction, scoring criteria) can be configured, allowing deep customization.
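
For orientation only, here is a hypothetical example of what a single imported test-set item might look like in JSON form (expressed as a TypeScript constant); the actual schema used by easy-dataset may differ.

```ts
// Purely illustrative test-set item; field names are assumptions.
const exampleItem = {
  id: 'q-0001',
  type: 'single_choice',
  question: 'Which layer of the TCP/IP model does HTTP belong to?',
  options: { A: 'Application', B: 'Transport', C: 'Network', D: 'Link' },
  answer: 'A',
  source: 'domain-guide.pdf#chunk-12', // where the question was extracted from
};
```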
