In v1.7.0, we focus on addressing the core pain points of model evaluation: difficult test set construction, lack of quantitative metrics, and the gap between automation and manual testing. This release introduces a brand-new Evaluation Module, creating a closed loop from Test Set Generation to Automated Scoring and Human Blind Testing.
Here are the detailed updates:
🎉 Core Highlights
- One-Stop Evaluation Loop: Generate test questions automatically from raw documents, run one-click automated scoring across multiple models, and review visualized comparison reports.
- LMArena Mode Integration: Built-in "Chatbot Arena"-style human blind testing turns subjective impressions into actionable, quantitative data.
- Multi-Dimensional Question Support: Covers 5 core question types, satisfying evaluation needs ranging from fact-checking to logical reasoning.
🚀 New Features
1. Intelligent Test Set Generation
Stop worrying about the lack of QA pairs. You can now quickly build high-quality evaluation datasets through multiple methods:
- Document Extraction: Automatically slice PDF/Markdown/Docx domain literature and extract questions from it.
- Dataset Variants: Generate variants of existing training sets (e.g., converting multiple-choice questions to true/false) to expand testing diversity.
- 5 Core Question Types:
  - ✅ True/False: Detect model hallucinations.
  - ✅ Single/Multiple Choice: Test knowledge extraction and discrimination.
  - ✅ Short Answer: Test the ability to capture and express core knowledge concisely.
  - ✅ Open-Ended: Evaluate long-text reasoning and summarization skills.
- Ratio Configuration: Customize the distribution ratio of each question type within a generation task (see the sketch after this list).
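For intuition, here is a minimal Python sketch of how a per-type ratio configuration can be turned into concrete question counts for a generation task. The function name and ratio keys are illustrative assumptions, not the product's actual API.

```python
import math

# Hypothetical helper, for illustration only: convert per-type ratios into
# integer question counts using largest-remainder rounding.
def allocate_question_counts(total: int, ratios: dict[str, float]) -> dict[str, int]:
    raw = {qtype: total * share for qtype, share in ratios.items()}
    counts = {qtype: math.floor(value) for qtype, value in raw.items()}
    leftover = total - sum(counts.values())
    # Hand any remaining questions to the types with the largest fractional parts.
    for qtype in sorted(raw, key=lambda q: raw[q] - counts[q], reverse=True)[:leftover]:
        counts[qtype] += 1
    return counts

# Example: 100 questions, skewed toward objective types.
ratios = {"true_false": 0.3, "single_choice": 0.25, "multiple_choice": 0.15,
          "short_answer": 0.2, "open_ended": 0.1}
print(allocate_question_counts(100, ratios))
# -> {'true_false': 30, 'single_choice': 25, 'multiple_choice': 15,
#     'short_answer': 20, 'open_ended': 10}
```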
2. Automated Evaluation Tasks (Auto-Evaluation)
Conduct concurrent assessments on multiple models just like an "exam". The system supports two grading modes:
- Rule-Based Scoring (Objective): For Single/Multiple Choice and True/False questions. No LLM call is required; the system compares answers directly in code, at zero cost and with zero error.
- Teacher Model Scoring (LLM-as-a-Judge): For Short Answer and Open-Ended questions. Configure a "Teacher Model" to score and comment on answers against customizable criteria (prompts). A minimal sketch of both modes follows.
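To make the two modes concrete, here is a minimal Python sketch. The function names, normalization rule, judge prompt, and 0-10 scale are illustrative assumptions rather than the product's actual implementation; `call_llm` stands in for whatever client reaches your configured teacher model.

```python
def _normalize(answer: str) -> str:
    # Case- and order-insensitive, ignoring separators such as "," or spaces,
    # so "b, a" matches the reference "AB" on a multiple-choice question.
    return "".join(sorted(ch for ch in answer.upper() if ch.isalnum()))

def rule_based_score(question_type: str, reference: str, answer: str) -> float:
    """Objective grading: direct code-level comparison, no LLM involved."""
    if question_type in ("true_false", "single_choice", "multiple_choice"):
        return 1.0 if _normalize(answer) == _normalize(reference) else 0.0
    raise ValueError("Subjective question types need a teacher model, not rules.")

JUDGE_PROMPT = """You are a strict grader. Compare the candidate answer with the
reference answer and return a score from 0 to 10, followed by a one-sentence comment.

Question: {question}
Reference answer: {reference}
Candidate answer: {answer}
"""

def teacher_model_score(call_llm, question: str, reference: str, answer: str) -> str:
    """Subjective grading (LLM-as-a-Judge): delegate scoring to the teacher model."""
    return call_llm(JUDGE_PROMPT.format(question=question, reference=reference, answer=answer))
```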
3. Human Blind Test Arena
Return to the authentic "Side-by-Side" evaluation experience:
- Anonymous Battle: Model names are hidden. Answers are displayed side-by-side (supports streaming output).
- Intuitive Voting: Simply click "Left is better", "Right is better", or "Tie".
- Win Rate Statistics: Automatically generate win-rate comparison charts to eliminate brand bias (a sketch of the underlying calculation follows this list).
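As a rough illustration of how win-rate statistics can be derived from blind-test votes, here is a hedged Python sketch; the tuple layout and the convention of crediting half a win per tie are assumptions for illustration, not necessarily how the product aggregates votes.

```python
from collections import Counter

def win_rates(votes: list[tuple[str, str, str]]) -> dict[str, float]:
    """votes: (left_model, right_model, outcome), outcome in {"left", "right", "tie"}."""
    wins, games = Counter(), Counter()
    for left, right, outcome in votes:
        games[left] += 1
        games[right] += 1
        if outcome == "left":
            wins[left] += 1
        elif outcome == "right":
            wins[right] += 1
        else:  # tie: credit half a win to each side (an assumed convention)
            wins[left] += 0.5
            wins[right] += 0.5
    return {model: wins[model] / games[model] for model in games}

print(win_rates([
    ("model_a", "model_b", "left"),
    ("model_b", "model_a", "tie"),
    ("model_a", "model_b", "right"),
]))
# -> {'model_a': 0.5, 'model_b': 0.5}
```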
4. Data & Ecosystem
- Multi-Format Import/Export: Test sets can be imported and exported in JSON, XLS, and XLSX formats (a conversion sketch follows this list).
- Built-in Domain Question Banks: Pre-loaded standard test sets across multiple disciplines and domains, ready to use out of the box.
- Fully Open Prompts: Complete access to configure prompts for the evaluation system (Question Generation, Answer Extraction, Scoring Standards), allowing for high customization.
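As a hypothetical example of moving a test set between the supported formats, the pandas snippet below round-trips a few records through JSON and XLSX; the column names are placeholders, not the product's actual export schema.

```python
import pandas as pd  # writing .xlsx additionally requires openpyxl

# Placeholder records; real exports will follow the product's own schema.
items = [
    {"type": "single_choice", "question": "2 + 2 = ?", "options": "A) 3; B) 4", "answer": "B"},
    {"type": "true_false", "question": "The sky is green.", "options": "", "answer": "false"},
]

df = pd.DataFrame(items)
df.to_json("test_set.json", orient="records", force_ascii=False)
df.to_excel("test_set.xlsx", index=False)
round_trip = pd.read_json("test_set.json", orient="records")
assert len(round_trip) == len(df)
```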