github ConardLi/easy-dataset 1.5.0
[1.5.0] 2025-09-29

⚠️ Breaking Change

  • Custom prompts configured in versions prior to 1.5.0 become invalid after the upgrade. Core prompts must be reconfigured once you are on 1.5.0.

✨ New Features

  1. Full Customization of Core Prompts
    → All core prompts in Easy Dataset (e.g., question generation, answer generation, data cleaning) are now configurable, so they can be adjusted for different scenarios without modifying code.

  2. AI Dataset Quality Evaluation (#546)
    → Added automatic dataset quality evaluation, supporting:

    • Instant evaluation for single datasets (covering relevance, accuracy, completeness, etc.);
    • Asynchronous batch evaluation for multiple datasets (processed via background tasks, with evaluation reports available).
  3. Multi-turn Dialogue SFT Dataset Generation (#504)
    → Supports generating multi-turn dialogue SFT datasets in two ways:

    • Extracting multi-turn Q&A from literature content;
    • Distilling multi-turn dialogue data directly from large models.
  4. GPT OSS Multilingual-Thinking Dataset Export (#560)
    → Added export support for the GPT OSS Multilingual-Thinking format, for multilingual model training scenarios.

  5. Custom Delimiter Chunking (#559)
    → Supports splitting text by custom delimiters (e.g., line breaks or specific symbols). Delimiters are discarded automatically, and the resulting chunks are not subject to the preset chunk size, so complete semantic units are preserved.
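
The behavior described for custom delimiter chunking can be illustrated with a minimal Python sketch. This is not Easy Dataset's implementation: the function name `split_by_delimiter` is hypothetical, and dropping empty pieces is an assumption the release notes do not state.

```python
def split_by_delimiter(text: str, delimiter: str) -> list[str]:
    """Split text on a custom delimiter and drop the delimiter itself.

    Chunks are never re-split to fit a size limit, so each one stays a
    complete unit. Skipping empty pieces is an assumption of this sketch.
    """
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    return [piece.strip() for piece in text.split(delimiter) if piece.strip()]


sample = (
    "Intro paragraph.\n---\n"
    "A much longer section that stays in one chunk.\n---\n"
    "Outro."
)
chunks = split_by_delimiter(sample, "---")
```

Here `chunks` holds three entries of very different lengths and none of them contains `---`, which mirrors the two properties the feature description names: delimiters are discarded and chunk size is not enforced.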

⚡ Optimizations

  1. Improved Stability of Structured Model Output
    → Added more fallback parsing logic to cope with malformed model output (e.g., JSON parsing failures, missing fields), making structured data generation more stable.

  2. Markdown Display Style Optimization
    → Optimized Markdown rendering styles for dataset detail pages and custom prompt editing pages, improving readability (e.g., adjusted fonts, line spacing, code block highlighting).
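
As an illustration of the kind of fallback parsing that optimization 1 refers to, here is a rough Python sketch. It is an assumption about what such logic could look like, not the project's actual code: `parse_model_json` and its `required` parameter are hypothetical names, and filling missing fields with `None` is one possible policy among several.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block readable


def parse_model_json(raw: str, required: tuple[str, ...] = ()) -> dict:
    """Best-effort extraction of a JSON object from raw model output."""
    # Prefer the contents of a fenced block if the model wrapped its answer in one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    # Keep only the span from the first "{" to the last "}" to drop surrounding chatter.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(candidate[start : end + 1])
    # Fill missing fields with None (an assumed policy) instead of failing later.
    for key in required:
        data.setdefault(key, None)
    return data


raw_output = (
    "Sure! Here is the result:\n"
    + FENCE + "json\n"
    + '{"question": "What is X?"}\n'
    + FENCE
)
result = parse_model_json(raw_output, required=("question", "answer"))
```

The sketch still raises on output that contains no JSON object at all, so a caller can retry the request rather than store a broken record.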

🔧 Fixes

  1. Context Overflow Due to Oversized Literature Catalogs
    → Optimized literature catalog processing to automatically truncate or segment overly large catalogs, so the model's context length limit is not exceeded.

  2. Unexpected Content Introduced During Data Cleaning (#504, #529)
    → Fixed an issue where irrelevant content or chain-of-thought text was accidentally introduced during data cleaning, so cleaned text contains only the intended content.

  3. Inaccurate Domain Tree Revision When Deleting Files
    → Corrected the domain tree node update logic after file deletion, ensuring only nodes related to deleted files are removed, avoiding incorrect deletions or residual invalid nodes.
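
For fix 1 above (oversized literature catalogs), the "truncate or segment" idea can be sketched in Python as follows. This is only an illustration under stated assumptions: `segment_catalog` is a hypothetical name, and a character budget (`max_chars`) stands in for a real token count, which the release notes do not describe.

```python
def segment_catalog(lines: list[str], max_chars: int) -> list[list[str]]:
    """Group catalog lines into segments whose joined text fits within max_chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    segments: list[list[str]] = []
    current: list[str] = []
    used = 0
    for line in lines:
        line = line[:max_chars]  # truncate a single over-long entry
        cost = len(line) + 1  # +1 for the newline that joins entries
        if current and used + cost > max_chars:
            # The current segment is full: close it and start a new one.
            segments.append(current)
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        segments.append(current)
    return segments


toc = ["1 Introduction", "1.1 Background", "2 Methods", "2.1 Data", "3 Results"]
segments = segment_catalog(toc, max_chars=32)
```

Each segment can then be sent to the model in its own request, which keeps every prompt under the assumed budget while preserving the original order of catalog entries.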
