github ConardLi/easy-dataset 1.3.6
[1.3.6] 2025-06-02

If downloads from GitHub are slow, you can download from the Quark network drive instead: https://pan.quark.cn/s/194b7eedf16e

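✨ New Feature

  1. Dataset enhancement with GA (Generator-Audience) pairs
    Introduces a "Generator-Audience" pairing mechanism that generates targeted content based on the scenario in which the data will be used.
    Docs: https://docs.easy-dataset.com/jin-jie-shi-yong/mga-zeng-qiang-shu-ju-ji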
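
To make the pairing idea concrete, here is a minimal sketch of how one text chunk could be expanded into several targeted prompts, one per pair. It is not taken from the easy-dataset code base: `GAPair`, `build_prompt`, `expand_chunk`, and the prompt wording are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GAPair:
    """Hypothetical container for one Generator-Audience pair."""
    generator: str  # the "G" side as named in the release note, e.g. "FAQ page"
    audience: str   # the intended readers, e.g. "new users"


def build_prompt(chunk: str, pair: GAPair) -> str:
    """Build one generation prompt that targets a single pair (assumed wording)."""
    return (
        f"Rewrite the text below as: {pair.generator}.\n"
        f"Target audience: {pair.audience}.\n\n"
        f"Text:\n{chunk}"
    )


def expand_chunk(chunk: str, pairs: List[GAPair]) -> List[str]:
    """Return one prompt per pair, so a single chunk yields several variants."""
    return [build_prompt(chunk, pair) for pair in pairs]


if __name__ == "__main__":
    pairs = [
        GAPair("FAQ page", "new users"),
        GAPair("technical tutorial", "backend engineers"),
    ]
    for prompt in expand_chunk("Easy Dataset splits documents into chunks.", pairs):
        print(prompt)
        print("---")
```

Each prompt would then be sent to the configured model; the real feature may select, store, and apply pairs quite differently, so see the linked docs for the actual workflow.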

🔧 Fixes

  1. Cross-origin issue when refreshing the model list after selecting a model
→ Fixed cross-origin request errors to ensure model data loads properly across domains.
  2. Timeout when processing uploaded DOCX files
→ Optimized file parsing thread pool to resolve timeouts during large document handling.
  3. Original directory not deleted when a document is removed
→ Corrected the file system logic so that the associated original directory is cleaned up whenever a document is deleted.

⚡ Optimizations

  1. Docker packaging script
→ Optimized image build process to reduce redundant dependencies and improve packaging efficiency.
  2. Question generation in data distillation tasks
→ Removed label indices (e.g., "Q1:", "A1:") from generated questions for unstructured format compatibility.
  3. Token count on the dataset details page
→ Added token count statistics to the dataset details page so text length is visible at a glance (useful as a reference against model input limits).
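
As an illustration of the second optimization, the sketch below strips a leading label index such as "Q1:" or "1.2" from a string before it is used. The pattern and the function name `strip_label_index` are assumptions made for this example; the release note does not show how easy-dataset implements the change.

```python
import re

# Assumed index shapes: an optional Q/A letter, a dotted number, then either
# a separator (":", ".", ")", or the CJK marks ":" and "、") or whitespace.
# Requiring the separator keeps text such as "3D printing" untouched.
_INDEX_PREFIX = re.compile(
    r"^\s*[QqAa]?\d+(?:\.\d+)*(?:\s*[::.、)]\s*|\s+)"
)


def strip_label_index(text: str) -> str:
    """Remove one leading label index, leaving all other text unchanged."""
    return _INDEX_PREFIX.sub("", text, count=1)


if __name__ == "__main__":
    for sample in ["Q1: What is overfitting?", "1.2 Machine Learning", "3D printing basics"]:
        print(repr(strip_label_index(sample)))
```

A pattern like this also trims text that legitimately starts with a number followed by a space, so a real implementation would more likely drop the index where the label is produced than by post-processing free text.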

