github ConardLi/easy-dataset 1.7.0
[1.7.0] 2026-01-12

In v1.7.0 we focus on the core pain points of model evaluation: test sets are hard to construct, quantitative metrics are hard to obtain, and automated and manual testing are disconnected. This release introduces a brand-new Evaluation Module, creating a closed loop from test set generation through automated scoring to human blind testing.

Here are the detailed updates:

🎉 Core Highlights

  • One-Stop Evaluation Loop: Generate test questions automatically from raw documents, run one-click multi-model automated scoring, and review visualized comparison reports.
  • LMArena Mode Integration: Built-in "Chatbot Arena" style human blind testing to quantify subjective "feelings" into actionable data.
  • Multi-Dimensional Question Support: Covers 5 core question types, satisfying evaluation needs ranging from fact-checking to logical reasoning.

🚀 New Features

1. Intelligent Test Set Generation

Stop worrying about the lack of QA pairs. You can now quickly build high-quality evaluation datasets through multiple methods:

  • Document Extraction: Automatically slice and extract questions from PDF/Markdown/Docx domain literature.
  • Dataset Variants: Generate variants of existing training sets (e.g., converting multiple-choice questions to true/false) to broaden test diversity.
  • 5 Core Question Types:
      • True/False: Detect model hallucinations.
      • Single/Multiple Choice: Test knowledge extraction and discrimination.
      • Short Answer: Test the ability to capture and express core knowledge concisely.
      • Open-Ended: Evaluate long-text reasoning and summarization skills.
  • Ratio Configuration: Set a custom distribution ratio for each question type in a generation task (see the configuration sketch after this list).
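
To make the ratio configuration concrete, here is a minimal TypeScript sketch of what a generation-task config could look like. The type names, fields, and the `questionCounts` helper are illustrative assumptions, not the project's actual API.

```ts
// Hypothetical shape of a test-set generation config; field names are
// illustrative and do not come from the easy-dataset codebase.
type QuestionType =
  | 'true_false'
  | 'single_choice'
  | 'multiple_choice'
  | 'short_answer'
  | 'open_ended';

interface GenerationConfig {
  sourceDocuments: string[];            // PDF/Markdown/Docx files to slice
  totalQuestions: number;               // how many questions to generate
  ratios: Record<QuestionType, number>; // per-type distribution, should sum to 1
}

const config: GenerationConfig = {
  sourceDocuments: ['./docs/domain-guide.pdf'],
  totalQuestions: 100,
  ratios: {
    true_false: 0.2,
    single_choice: 0.25,
    multiple_choice: 0.15,
    short_answer: 0.2,
    open_ended: 0.2,
  },
};

// Turn the ratios into concrete per-type question counts for the task.
function questionCounts(cfg: GenerationConfig): Record<QuestionType, number> {
  const counts = {} as Record<QuestionType, number>;
  for (const [type, ratio] of Object.entries(cfg.ratios) as [QuestionType, number][]) {
    counts[type] = Math.round(cfg.totalQuestions * ratio);
  }
  return counts;
}

// questionCounts(config) -> { true_false: 20, single_choice: 25, ... }
```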

2. Automated Evaluation Tasks (Auto-Evaluation)

Conduct concurrent assessments on multiple models just like an "exam". The system supports two grading modes:

  • Rule-Based Scoring (Objective): For Choice and True/False questions. No LLM call is needed; the system compares answers directly in code, so grading is zero-cost and deterministic (a minimal grading sketch follows below).
  • Teacher Model Scoring (LLM-as-a-Judge): For Short Answer and Open-Ended questions. Configure a "Teacher Model" to score and comment on answers against customizable criteria (prompts).
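
To illustrate the difference between the two grading modes, here is a minimal TypeScript sketch. The data shapes and the `judgeWithTeacherModel` callback are assumptions for illustration, not the project's actual interfaces.

```ts
// Objective items (choice / true-false) carry a gold answer that can be
// compared directly in code, without calling any LLM.
interface ObjectiveItem {
  expected: string[];  // gold answer, e.g. ['A', 'C'] or ['TRUE']
  predicted: string[]; // answer extracted from the model's output
}

// Rule-based scoring: order-insensitive, case-insensitive exact match.
function scoreObjective(item: ObjectiveItem): number {
  const norm = (xs: string[]) =>
    [...xs].map((x) => x.trim().toUpperCase()).sort().join(',');
  return norm(item.expected) === norm(item.predicted) ? 1 : 0;
}

// Subjective items (short answer / open-ended) are delegated to a configurable
// teacher model. `judgeWithTeacherModel` is a placeholder for an
// LLM-as-a-Judge call that applies the user-defined scoring prompt.
async function scoreSubjective(
  question: string,
  reference: string,
  answer: string,
  judgeWithTeacherModel: (prompt: string) => Promise<{ score: number; comment: string }>,
): Promise<{ score: number; comment: string }> {
  const prompt =
    `Question: ${question}\nReference answer: ${reference}\n` +
    `Model answer: ${answer}\nGive a 0-10 score and a short comment.`;
  return judgeWithTeacherModel(prompt);
}
```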

3. Human Blind Test Arena

Bring back the authentic side-by-side evaluation experience:

  • Anonymous Battle: Model names are hidden. Answers are displayed side-by-side (supports streaming output).
  • Intuitive Voting: Simply click "Left is better", "Right is better", or "Tie".
  • Win Rate Statistics: Automatically generate win-rate comparison charts to eliminate brand bias (a minimal tally sketch follows below).
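
As an illustration of how win-rate statistics could be tallied from anonymous votes, here is a minimal TypeScript sketch; the vote shape and model identifiers are hypothetical.

```ts
// One blind-test vote: which anonymized side the human preferred.
type Vote = { left: string; right: string; winner: 'left' | 'right' | 'tie' };

// Aggregate per-model wins and battles; ties count as a battle but not a win.
function winRates(votes: Vote[]): Map<string, { wins: number; battles: number; rate: number }> {
  const stats = new Map<string, { wins: number; battles: number; rate: number }>();
  const bump = (model: string, won: boolean) => {
    const s = stats.get(model) ?? { wins: 0, battles: 0, rate: 0 };
    s.battles += 1;
    if (won) s.wins += 1;
    s.rate = s.wins / s.battles;
    stats.set(model, s);
  };
  for (const v of votes) {
    bump(v.left, v.winner === 'left');
    bump(v.right, v.winner === 'right');
  }
  return stats;
}

// Example: two votes between anonymized models A and B.
const rates = winRates([
  { left: 'model-A', right: 'model-B', winner: 'left' },
  { left: 'model-B', right: 'model-A', winner: 'tie' },
]);
// rates.get('model-A') -> { wins: 1, battles: 2, rate: 0.5 }
```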

4. Data & Ecosystem

  • Multi-Format Import/Export: Full support for JSON, XLS, and XLSX test-set import and export (a hypothetical JSON item is sketched below).
  • Built-in Domain Question Banks: Pre-loaded standard test sets across multiple disciplines and domains, ready to use out of the box.
  • Fully Open Prompts: All evaluation-system prompts (question generation, answer extraction, scoring criteria) can be configured, allowing deep customization.
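
For orientation only, here is a hypothetical example of what a single imported test-set item might look like in JSON form (expressed as a TypeScript constant); the actual schema used by easy-dataset may differ.

```ts
// Purely illustrative test-set item; field names are assumptions.
const exampleItem = {
  id: 'q-0001',
  type: 'single_choice',
  question: 'Which layer of the TCP/IP model does HTTP belong to?',
  options: { A: 'Application', B: 'Transport', C: 'Network', D: 'Link' },
  answer: 'A',
  source: 'domain-guide.pdf#chunk-12', // where the question was extracted from
};
```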
