github ConardLi/easy-dataset 1.4.0
[1.4.0] 2025-08-31

If downloads from GitHub are slow, you can download from the cloud drive instead: https://pan.quark.cn/s/194b7eedf16e

✨ New Features

  1. Support for Local MinerU Deployment (#200, #245)
    → The URL of a local MinerU service can now be configured in the task settings, enabling integration with a locally deployed MinerU instance.

  2. Enhanced Dataset Management (#81)
    → Added dataset ratings, custom tags, and notes; datasets can be filtered and queried by any of these attributes.

  3. Literature Content Cleaning (#516)
    → Original literature content can be preprocessed and cleaned to improve the quality of the datasets generated from it; the data cleaning prompt can be customized for different scenarios.

  4. Extended Dataset Export Options

    • Exports can include the original text blocks (custom format) (#288, #185, #476, #464)
    • Supports exporting only the question list for lightweight data applications (#394)
    • Supports balanced export, which filters the exported dataset by domain tag

  5. Expanded Literature Format Support (#205)
    → Added support for uploading and analyzing .epub documents, broadening the range of literature that can be processed.

  6. Dataset Import Function (#498)
    → Existing datasets can be imported from local files, making it quick to reuse external data.
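
The export options in item 4 can be pictured with a small sketch. This is not Easy Dataset's actual export code: the record fields (`question`, `answer`, `chunk`) and the function `export_records` are hypothetical names chosen only to illustrate the difference between a full export, an export that carries the original text block, and a question-only export.

```python
from typing import Any

# Hypothetical record shape; these field names are assumptions, not the tool's schema.
Record = dict[str, Any]


def export_records(
    records: list[Record],
    questions_only: bool = False,
    include_chunk: bool = False,
) -> list[Any]:
    """Return a JSON-serializable export of the given records."""
    if questions_only:
        # Lightweight export: just the question strings.
        return [r["question"] for r in records]

    exported = []
    for r in records:
        item = {"question": r["question"], "answer": r["answer"]}
        if include_chunk:
            # Attach the source text block the question was generated from.
            item["chunk"] = r.get("chunk", "")
        exported.append(item)
    return exported


sample = [
    {"question": "Q1?", "answer": "A1", "chunk": "source text 1"},
    {"question": "Q2?", "answer": "A2", "chunk": "source text 2"},
]

questions = export_records(sample, questions_only=True)
with_chunks = export_records(sample, include_chunk=True)
```

Keeping the three modes behind one function makes it easy to see that they differ only in which fields of each record are written out.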

⚡ Optimizations

  1. Dataset Pagination Improvement (#497)
    → The selected state of Markdown tags is now saved automatically when paging, so it no longer has to be set again on each page.

  2. Dataset List Filter Enhancement (#275)
    → Added a filter for whether a dataset is a distilled dataset, to quickly locate data of that type.

🔧 Fixes

  1. Large Dataset Export Issue (#502)
    → Fixed a freeze when exporting very large datasets; added a batch export mechanism to improve stability.

  2. Cross-Project Question Conflicts (#509)
    → Fixed conflict errors in question DIFF comparisons between different projects, ensuring cross-project data consistency.
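
The release note for fix 1 does not say how the batch export mechanism is implemented. As a rough illustration of the general idea only, the sketch below writes records to a JSON Lines stream one batch at a time, so the full dataset never has to be serialized in a single step. The helper names `batched` and `export_jsonl` and the default batch size are assumptions made for this example.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator, TextIO


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def export_jsonl(items: Iterable[dict], out: TextIO, batch_size: int = 1000) -> int:
    """Write items to `out` as JSON Lines, one batch per write; return the row count."""
    written = 0
    for batch in batched(items, batch_size):
        out.write("".join(json.dumps(row, ensure_ascii=False) + "\n" for row in batch))
        written += len(batch)
    return written


# Usage: export 2500 generated rows in batches of 1000 into an in-memory buffer.
buffer = io.StringIO()
count = export_jsonl(({"id": i} for i in range(2500)), buffer, batch_size=1000)
```

Because the input is consumed lazily, the same function works with a generator that pages records out of a database, which is the situation where a single large export is most likely to stall.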
