⚠️ Breaking Changes
- Custom prompts configured in versions prior to 1.5.0 become invalid after upgrading; core prompts must be reconfigured in 1.5.0.
✨ New Features
- Full Core Prompts Customization
  → All core prompts in Easy Dataset (e.g., question generation, answer production, data cleaning) are now configurable, so adapting them to different scenarios no longer requires code changes.
- AI Dataset Quality Evaluation (#546)
  → Added automatic dataset quality evaluation, supporting:
  - Instant evaluation of a single dataset (covering relevance, accuracy, completeness, and other dimensions);
  - Asynchronous batch evaluation of multiple datasets (processed as background tasks, with evaluation reports available for review).
- Multi-turn Dialogue SFT Dataset Generation (#504)
  → Supports generating multi-turn dialogue SFT datasets (see the example record after this list) in two ways:
  - Extracting multi-turn Q&A from literature content;
  - Distilling multi-turn dialogue data directly from large models.
- GPT OSS Multilingual-Thinking Dataset Export (#560)
  → Added export support for the GPT OSS Multilingual-Thinking format, suited to multilingual model training scenarios.
- Custom Delimiter Chunking (#559)
  → Supports splitting text on custom delimiters (e.g., line breaks, specific symbols). The delimiter itself is discarded, and the resulting chunks are not constrained by the preset chunk size, so complete semantic units are preserved (a minimal sketch follows this list).
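A hypothetical shape for a multi-turn dialogue SFT record in the common messages format; the field names and contents are illustrative assumptions, not the exact export schema:

```ts
// One training example: alternating user/assistant turns stored in a single record.
const record = {
  messages: [
    { role: "user", content: "What does chapter 2 cover?" },
    { role: "assistant", content: "Chapter 2 introduces the data pipeline..." },
    { role: "user", content: "Can you give a concrete example?" },
    { role: "assistant", content: "For instance, when ingesting a PDF..." },
  ],
};
```

And a minimal sketch of delimiter-based chunking as described above, assuming a plain string delimiter; the function name and signature are illustrative, not the actual Easy Dataset API:

```ts
// Split raw text on a custom delimiter, drop the delimiter itself, and keep
// every non-empty segment as its own chunk, regardless of any preset size limit.
function chunkByDelimiter(text: string, delimiter: string): string[] {
  return text
    .split(delimiter)               // the delimiter is discarded by split()
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

const doc = "Section A body...\n---\nSection B body...";
const chunks = chunkByDelimiter(doc, "\n---\n"); // ["Section A body...", "Section B body..."]
```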
⚡ Optimizations
- Improved Stability of Structured Model Output
  → Added more tolerant parsing logic to reduce format anomalies in model output (e.g., JSON parsing failures, missing fields), improving the stability of structured data generation (a sketch of the general technique follows this list).
- Markdown Display Style Optimization
  → Optimized Markdown rendering styles on the dataset detail page and the custom prompt editing page, improving readability (e.g., adjusted fonts, line spacing, code-block highlighting).
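A minimal sketch of the kind of tolerant parsing referred to above, assuming the model was asked to return JSON; this illustrates the general technique, not the parser Easy Dataset actually ships:

```ts
// Try strict JSON first, then fall back to extracting a JSON payload from a
// fenced ```json block or from surrounding prose, which models frequently add.
function parseModelJson(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
    const candidate = fenced ? fenced[1] : raw.match(/\{[\s\S]*\}/)?.[0];
    if (!candidate) return null;
    try {
      return JSON.parse(candidate);
    } catch {
      return null; // still malformed: the caller can retry or re-prompt
    }
  }
}
```

If both attempts fail, a reasonable fallback is to re-prompt the model and ask for strictly valid JSON.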
🔧 Fixes
- Context Overflow Due to Oversized Literature Catalogs
  → Improved literature catalog (table-of-contents) handling to automatically truncate or segment overly long catalogs, preventing the model's context length limit from being exceeded (a rough sketch of the segmentation idea follows this list).
- Unexpected Content Introduced During Data Cleaning (#504, #529)
  → Fixed an issue where irrelevant content or chain-of-thought traces were accidentally introduced during data cleaning, keeping the cleaned text free of such artifacts.
- Inaccurate Domain Tree Revision When Deleting Files
  → Corrected the domain tree update logic after file deletion so that only nodes related to the deleted files are removed, preventing both mistaken deletions and leftover invalid nodes.
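A rough sketch of the truncate-or-segment idea for oversized catalogs mentioned above, with an assumed character budget; the function name and the limit are illustrative only:

```ts
// Split a long table of contents into segments that each fit a rough character
// budget, so each segment can be sent to the model separately instead of
// overflowing a single context window.
function segmentCatalog(entries: string[], maxChars = 8000): string[][] {
  const segments: string[][] = [];
  let current: string[] = [];
  let size = 0;
  for (const entry of entries) {
    if (size + entry.length > maxChars && current.length > 0) {
      segments.push(current); // close the current segment before it overflows
      current = [];
      size = 0;
    }
    current.push(entry);
    size += entry.length;
  }
  if (current.length > 0) segments.push(current);
  return segments;
}
```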