github ConardLi/easy-dataset 1.3.1
[1.3.1] 2025-05-14


🔧 Fixes

  1. Fixed unintended chain-of-thought (COT) generation during dataset optimization.
  2. Fixed an error on the text processing page where files already removed from the upload list were still processed.

⚡ Optimizations

  1. Refactored local file storage into local database storage, significantly improving the experience when working with large amounts of data.
  2. Added a configurable option to randomly remove question marks from generated questions.
  3. Enhanced user experience across multiple functions.

✨ New Features

  1. Flexible Domain Tree Management

    • Three modes are available when adding or deleting documents:
      • Revise Mode: Updates only the domain tree nodes related to the added or deleted documents, minimizing the impact on the existing structure.
      • Rebuild Mode: Regenerates the domain tree from the tables of contents of all documents (the existing behavior).
      • Lock Mode: Freezes the current domain tree; adding or deleting documents does not trigger an update.
  2. Multiple Text Chunking Strategies

    • Markdown Chunking: Splits automatically at document headings to preserve semantic integrity (suited to structured Markdown).
    • Recursive Delimiter Chunking: Tries multiple levels of delimiters recursively in priority order (configurable); suited to complex documents.
    • Fixed-Length Delimiter Chunking: Splits on a specified delimiter, then combines the pieces into fixed-length chunks (configurable).
    • Token Chunking: Splits by token count rather than character count, to match model input limits.
    • Smart Code Chunking: Splits along the syntactic structure of the programming language to avoid cutting code mid-construct.
  3. Visual Custom Chunking

    • Chunk boundaries can be adjusted manually in a graphical interface, with a real-time preview of the result.
  4. Client Tool Enhancements

    • Added local log storage, with one-click access to the log directory for troubleshooting.
    • Added a cache-clearing function that cleans up historical logs and database backup files.
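
To make the chunking strategies listed under "Multiple Text Chunking Strategies" more concrete, here is a minimal Python sketch of three of them: heading-based Markdown chunking, recursive delimiter chunking, and fixed-length delimiter chunking. It is an illustration only and is not taken from easy-dataset's code; the function names, parameters, and simplifications (for example, headings inside fenced code are not handled, and the recursive splitter does not merge small pieces back together) are all assumptions.

```python
import re


def split_markdown_by_headings(text: str) -> list[str]:
    """Start a new chunk at every ATX heading (a line beginning with 1-6 '#')."""
    chunks, current = [], []
    for line in text.splitlines():
        # A heading closes the chunk collected so far and opens a new one.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def split_recursive(text: str, separators: list[str], max_len: int) -> list[str]:
    """Split on the highest-priority separator; recurse on pieces still too long."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(head):
        chunks.extend(split_recursive(piece, rest, max_len))
    return chunks


def split_fixed_length(text: str, separator: str, max_len: int) -> list[str]:
    """Split on one separator, then greedily pack neighbours up to max_len.

    A single piece longer than max_len is kept whole in this sketch.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > max_len:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

For example, `split_recursive("aaa\n\nbbb ccc", ["\n\n", " "], 5)` first splits on the blank line and then, because `"bbb ccc"` is still longer than 5 characters, splits that piece on spaces, giving `["aaa", "bbb", "ccc"]`. Token chunking would follow the same packing idea as `split_fixed_length`, but measure length in tokens from a model tokenizer instead of characters.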
