If GitHub downloads are slow, you can use the cloud drive instead: https://pan.quark.cn/s/194b7eedf16e
🔧 Fixes
- Fixed ModelId update error when saving models
 → Corrected abnormal synchronization of the ModelId field when saving model configurations, ensuring model identifiers stay unique.
- Fixed issues with batch dataset evaluation (#576)
 → Added the ability to interrupt batch evaluation tasks, so ongoing evaluations can be terminated manually; also optimized the evaluation algorithm to speed up batch processing (a cancellation sketch follows this list).
- Fixed input interruption caused by dataset shortcuts (#578)
 → Adjusted the shortcut trigger logic to avoid conflicts with text input, so typing is no longer interrupted accidentally (an input-guard sketch follows this list).
- Fixed export failure when selecting a large number of datasets (#578)
 → Optimized the export task sharding mechanism to resolve memory overflow and connection timeouts caused by excessive data volume (a chunked-export sketch follows this list).
- Fixed ineffective balanced export (#561)
 → Corrected the sample distribution calculation in the balanced export logic so that categories are exported according to the preset ratios (a ratio-sampling sketch follows this list).
- Fixed errors when calling the Qwen3 model via Alibaba Cloud Bailian (#412, #482)
 → Adapted to the Qwen3 interface protocol and corrected request parameter formats and authentication logic so calls work normally.
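
The cancellation behaviour added for batch evaluation (#576) can be pictured as a loop that checks an abort signal between datasets. A minimal sketch, assuming a hypothetical `evaluateDataset` function and an `AbortSignal` raised by a "Stop" button in the UI; this is illustrative, not the code that shipped.

```typescript
// Illustrative sketch: run evaluations sequentially and stop when the signal aborts.
export interface EvalResult {
  datasetId: string;
  score: number;
}

export async function runBatchEvaluation(
  datasetIds: string[],
  evaluateDataset: (id: string) => Promise<EvalResult>, // hypothetical evaluator
  signal: AbortSignal                                    // raised by a "Stop" button
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const id of datasetIds) {
    if (signal.aborted) break;              // manual termination requested
    results.push(await evaluateDataset(id));
  }
  return results;                           // partial results are kept on interruption
}
```

The caller would create an `AbortController`, pass `controller.signal` in, and call `controller.abort()` when the user clicks stop.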
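For the shortcut/input conflict (#578), a common pattern is to ignore key events that originate from editable elements so list shortcuts never fire while the user is typing. A sketch under that assumption; the actual fix may differ.

```typescript
// Illustrative sketch: skip list shortcuts while the user is typing in an editable element.
function isEditableTarget(target: EventTarget | null): boolean {
  if (!(target instanceof HTMLElement)) return false;
  const tag = target.tagName;
  return tag === 'INPUT' || tag === 'TEXTAREA' || target.isContentEditable;
}

document.addEventListener('keydown', (event) => {
  if (isEditableTarget(event.target)) return; // let the text field handle the key
  if (event.key === 'Delete') {
    // handle the dataset-list shortcut here
  }
});
```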
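The export-failure fix (#578) mentions a sharding mechanism. One way to picture it: fetch and write the selected records in fixed-size chunks so only one chunk is in memory at a time. `fetchRecords` and `write` below are hypothetical placeholders for the data-access and output layers.

```typescript
// Illustrative sketch: export a large selection in fixed-size chunks instead of one huge query.
export async function exportInChunks(
  ids: string[],
  fetchRecords: (batch: string[]) => Promise<object[]>, // hypothetical data access
  write: (records: object[]) => Promise<void>,          // hypothetical sink (file or stream)
  chunkSize = 500
): Promise<void> {
  for (let start = 0; start < ids.length; start += chunkSize) {
    const batch = ids.slice(start, start + chunkSize);
    const records = await fetchRecords(batch); // only one chunk is held in memory
    await write(records);                      // flush before fetching the next chunk
  }
}
```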
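For balanced export (#561), the core of the calculation is allocating the export quota across categories by the preset ratios. A sketch with hypothetical types; it truncates to the available samples when a category is too small.

```typescript
// Illustrative sketch: allocate the export quota across categories by preset ratios.
export function balancedSelection<T>(
  byCategory: Map<string, T[]>, // records grouped by category
  ratios: Map<string, number>,  // preset share per category, summing to 1
  total: number                 // desired export size
): T[] {
  const selected: T[] = [];
  for (const [category, items] of byCategory) {
    const share = ratios.get(category) ?? 0;
    // Never take more than a category actually has.
    const count = Math.min(items.length, Math.round(total * share));
    selected.push(...items.slice(0, count));
  }
  return selected;
}
```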
⚡ Optimizations
- Improved stability of multi-turn dialogue dataset parsing
 → More tolerant parsing of multi-turn dialogue formats (e.g., ShareGPT), reducing failures caused by format variations (a format sketch follows this list).
- Asynchronous execution of single text block operations (#530, #494)
 → "Generate questions for a single text block" and "AI dataset optimization" now run as background asynchronous tasks and no longer block other front-end operations (a background-job sketch follows this list).
- Enhanced text block filtering (#541)
 → Text blocks can be filtered by keyword search and by word count range (e.g., 100-500 words) to quickly locate target text (a filter sketch follows this list).
- Model configuration supports Top parameter control (#517)
 → The model configuration page now exposes Top parameters (e.g., Top-K/Top-P) to adjust the diversity and determinism of generated content (a sampling-config sketch follows this list).
- Filter by text block name (#275)
 → Question and dataset lists can be filtered by the name of the associated text block (file), making it easier to locate data across modules.
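
The ShareGPT layout referenced above stores each sample as a `conversations` array of `{from, value}` turns, e.g. `{"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello"}]}`. Below is a sketch of the kind of tolerant normalization the optimization describes; field-name variants in the wild are exactly what such parsing has to absorb, and this is not the project's actual parser.

```typescript
// Illustrative sketch: normalize a ShareGPT-style record into role/content turns.
interface ShareGPTTurn { from: string; value: string }
interface ShareGPTRecord { conversations?: ShareGPTTurn[] }

type Turn = { role: 'user' | 'assistant'; content: string };

export function parseShareGPT(record: ShareGPTRecord): Turn[] {
  const turns = record.conversations ?? []; // tolerate a missing array
  return turns
    .filter((t) => typeof t.value === 'string' && t.value.trim().length > 0)
    .map((t): Turn => ({
      // "human"/"user" map to the user role; anything else ("gpt", "assistant") to assistant.
      role: t.from === 'human' || t.from === 'user' ? 'user' : 'assistant',
      content: t.value,
    }));
}
```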
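A minimal picture of moving a single-text-block operation (#530, #494) into a background job: the request registers a job, returns its record immediately, and the front end polls the status. The names and the in-memory job map are illustrative assumptions, not the project's task system.

```typescript
// Illustrative sketch: run a long operation as a tracked background job
// instead of blocking the request that started it.
type JobStatus = 'pending' | 'running' | 'done' | 'failed';

interface Job { id: string; status: JobStatus; error?: string }

const jobs = new Map<string, Job>();

export function startJob(id: string, work: () => Promise<void>): Job {
  const job: Job = { id, status: 'pending' };
  jobs.set(id, job);
  // Fire and forget: the caller gets the job record back immediately.
  void (async () => {
    job.status = 'running';
    try {
      await work();
      job.status = 'done';
    } catch (err) {
      job.status = 'failed';
      job.error = err instanceof Error ? err.message : String(err);
    }
  })();
  return job;
}

export const getJob = (id: string): Job | undefined => jobs.get(id);
```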
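The enhanced filtering (#541) combines a keyword match with a length range. A sketch counting whitespace-separated words; for Chinese text a character count would be the natural unit. Types and names are illustrative.

```typescript
// Illustrative sketch: filter text blocks by keyword and word-count range.
interface TextBlock { id: string; name: string; content: string }

export function filterBlocks(
  blocks: TextBlock[],
  keyword: string,
  minWords: number,
  maxWords: number
): TextBlock[] {
  const needle = keyword.trim().toLowerCase();
  return blocks.filter((block) => {
    const wordCount = block.content.trim().split(/\s+/).filter(Boolean).length;
    const matchesKeyword = needle === '' || block.content.toLowerCase().includes(needle);
    return matchesKeyword && wordCount >= minWords && wordCount <= maxWords;
  });
}
```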
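For the Top parameter support (#517), the settings typically map straight onto the sampling fields of an OpenAI-compatible chat request. A sketch of such a request body; `top_k` is not part of the core OpenAI schema, so it is only forwarded when the provider accepts it.

```typescript
// Illustrative sketch: sampling parameters forwarded to an OpenAI-compatible chat request.
interface SamplingConfig {
  temperature?: number; // randomness of sampling
  top_p?: number;       // nucleus sampling: keep the smallest token set whose mass >= top_p
  top_k?: number;       // keep only the k most likely tokens (provider-specific extension)
}

export function buildRequestBody(model: string, prompt: string, cfg: SamplingConfig) {
  return {
    model,
    messages: [{ role: 'user', content: prompt }],
    temperature: cfg.temperature ?? 0.7,
    top_p: cfg.top_p ?? 1,
    // top_k is not in the core OpenAI schema; only send it when the provider supports it.
    ...(cfg.top_k !== undefined ? { top_k: cfg.top_k } : {}),
  };
}
```

Lower `top_p`/`top_k` values make output more deterministic; higher values increase diversity.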
✨ New Features
- Fully automated dataset distillation background tasks (#432, #492, #495, #496)
 → The full pipeline, from triggering distillation to dataset generation, runs as background asynchronous tasks with no manual intervention and real-time progress tracking (a progress-polling sketch follows this list).
- Support for renaming distillation labels (#422)
 → Labels generated during distillation can be given custom names to match label management needs in different scenarios.
- Generate Visual Question Answering (VQA) datasets (#130, #483, #537)
 → Image files can be uploaded and image-related questions and answers generated automatically to build VQA datasets for vision-language model training (a sample-record sketch follows this list).
- Question template function
 → Multiple custom question types (e.g., "describe image content", "analyze text opinions") can be created and applied to all images or text blocks to generate questions in batches, improving the standardization and scenario fit of question generation (a template sketch follows this list).
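
One way a front end could track the fully automated distillation tasks (#432, #492, #495, #496) is by polling a status endpoint until the background job finishes. The `/api/distill/:id` route and the status shape below are purely hypothetical.

```typescript
// Illustrative sketch: poll a hypothetical status endpoint until distillation finishes.
interface DistillStatus {
  state: 'running' | 'done' | 'failed';
  completed: number; // samples generated so far
  total: number;
}

export async function waitForDistillation(
  taskId: string,
  intervalMs = 2000
): Promise<DistillStatus> {
  while (true) {
    const res = await fetch(`/api/distill/${taskId}`); // hypothetical endpoint
    const status = (await res.json()) as DistillStatus;
    if (status.state !== 'running') return status;     // done or failed
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```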
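A VQA sample generally pairs an image reference with a generated question and answer. A sketch of one plausible exported record; this is not necessarily the tool's actual export schema.

```typescript
// Illustrative sketch: one possible shape for an exported VQA sample.
interface VqaSample {
  image: string;    // path or URL of the uploaded image
  question: string; // generated question about the image
  answer: string;   // generated answer
}

const sample: VqaSample = {
  image: 'images/chart-01.png',
  question: 'What trend does the chart show between 2020 and 2023?',
  answer: 'A steady year-over-year increase.',
};

console.log(JSON.stringify(sample)); // one JSONL line of a VQA dataset
```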
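The question template feature can be thought of as a prompt with a content slot that is filled in for every image or text block. A sketch with a hypothetical `{content}` placeholder and `generate` LLM call; the real template syntax may differ.

```typescript
// Illustrative sketch: apply a question template to every text block in a project.
interface QuestionTemplate { name: string; prompt: string } // prompt contains a {content} slot
interface Block { id: string; content: string }
interface GeneratedQuestion { blockId: string; question: string }

export async function applyTemplate(
  template: QuestionTemplate,
  blocks: Block[],
  generate: (prompt: string) => Promise<string> // hypothetical LLM call
): Promise<GeneratedQuestion[]> {
  const results: GeneratedQuestion[] = [];
  for (const block of blocks) {
    const prompt = template.prompt.replace('{content}', block.content);
    results.push({ blockId: block.id, question: await generate(prompt) });
  }
  return results;
}
```

A template such as `{ name: 'Analyze text opinions', prompt: 'Ask one question about the opinions expressed in: {content}' }` would then be applied to every selected block in one batch.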