[7.0.0] - 2025-08-28

🥳 Open Weights with Open Data, Local LLM 🤖 edition 🚀

This is the biggest release yet - 470+ commits since v6.0.1! Packed with new AI-powered features, fixes and significant performance improvements suite-wide!

With the release of OpenAI's gpt-oss open-weight reasoning model earlier this month setting the stage, we continue on our "Automagical Metadata" journey by revamping describegpt.

🤖 Revamped describegpt - AI-Powered Metadata Inferencing and Data Analysis:

Intelligent Metadata Generation: Automatically generate comprehensive metadata - Data Dictionaries, Description and Tags for your Datasets using Large Language Models (LLM) prompted with summary statistics and frequency tables as detailed context - without sending your data to the cloud!
Even if you elect to use a cloud-based LLM, your Raw Data is never sent.
Chat with your Data: If your prompt can be answered using this high-quality, high-resolution Metadata, describegpt will answer it! If your prompt is not remotely related to the data, it will politely refuse - "I'm sorry, I can only answer questions about the Dataset."
Auto SQL RAG Mode: Should the LLM decide that it doesn't have the necessary information in the metadata it compiled to answer your prompt, it will automatically enter SQL Retrieval-Augmented Generation (RAG) mode - using the rich metadata instead as context to craft an expert-level, deterministic, reproducible, "hallucination-free" SQL query¹ to respond to your prompt.
Database Engine Support: If DuckDB is installed or the Polars feature is enabled, and --sql-results <ANSWER.CSV> is specified - an optimized SQL query will be automatically executed with the query results saved to the specified file.
As both DuckDB and Polars are purpose-built OLAP engines that support direct queries (no database pre-loading required), you get answers in a few seconds² - even for very large datasets.
Multi-LLM Support: Works with any OpenAI-API compatible LLM - with special support for local LLMs like Ollama, Jan and LM Studio, with the ability to customize model behavior with the --addl-props option.
Advanced Caching: Disk and Redis caching support for performance and cost optimization.
Flexible Prompting: Custom prompt files and built-in intelligent templates for various analysis tasks.
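
To illustrate why caching needs to account for every input that can change an LLM response (including the properties passed with --addl-props), here is a minimal Python sketch of a cache key. It is an illustration under stated assumptions, not describegpt's actual implementation: the function name, the field names and the model name are all hypothetical.

```python
import hashlib
import json


def sketch_cache_key(model: str, prompt: str, stats_digest: str, addl_props: dict) -> str:
    """Hypothetical cache key: hash every input that can change the LLM's answer."""
    payload = json.dumps(
        {
            "model": model,
            "prompt": prompt,
            "stats": stats_digest,      # assumed: a digest of the summary statistics
            "addl_props": addl_props,   # assumed: the extra model properties
        },
        sort_keys=True,  # stable key order so equal inputs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = sketch_cache_key("example-model", "Describe the dataset", "abc123", {"temperature": 0.2})
same = sketch_cache_key("example-model", "Describe the dataset", "abc123", {"temperature": 0.2})
changed = sketch_cache_key("example-model", "Describe the dataset", "abc123", {"temperature": 0.9})
```

In this sketch, identical inputs map to the same key (a cache hit), while changing only the additional properties yields a different key, so a stale response is not reused.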

Check out these examples using a 1 million row sample of NYC's 311 data!

--all option produces a Data Dictionary, Description and Tags - Markdown, JSON
--prompt "What are the top 10 complaint types per community board and borough?" - SQL result
--prompt "How tall is the Empire State Building?" - "I'm sorry, I can only answer questions about the Dataset."

On top of other improvements in Datapusher+ with its new Jinja-based "metadata suggestion engine" - we're using this AI-inferred metadata along with other precalcs to prepopulate DCATv3 (both US and European profiles) and Croissant metadata fields that are otherwise too hard and expensive to compile manually.

The inferred and precalculated metadata values are offered as "suggestions", using a UI/UX purpose-built to facilitate interactive metadata curation chats.

This allows Data Stewards to compile high-quality, high-resolution metadata catalogs with an accelerated "Data Steward in the Loop" data ingestion and metadata curation workflow.

If you want to see and learn more, we're Bologna-bound to attend csv,conf,v9 to present and share how we're using this to auto-infer metadata in CKAN. Hope to see you there!

Towards the People's API!

(Answering People/Policymaker Interface)

📊 Enhanced frequency Command:

Rank Column: Ranking of frequency results for better data insights
JSON Output Mode: New --json option not only provides structured output beyond the default CSV format - it also takes advantage of JSON's nested support to include 15 additional summary statistics per field
Performance Boost: Speed improvements with SIMD-accelerated number parsing, remaining performant even with the added functionality
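
As a rough illustration of what a ranked frequency table with nested JSON output can look like, here is a small Python sketch. The tie-handling strategy (dense ranking) and the JSON field names are assumptions made for this example and may differ from what frequency actually emits. The additional per-field summary statistics are not reproduced here.

```python
import json
from collections import Counter


def frequency_with_rank(values):
    """Count values and attach a dense rank (tied counts share a rank)."""
    rows = []
    rank = 0
    prev_count = None
    # most_common() yields (value, count) pairs sorted by descending count
    for value, count in Counter(values).most_common():
        if count != prev_count:
            rank += 1
            prev_count = count
        rows.append({"value": value, "count": count, "rank": rank})
    return rows


complaints = ["noise", "noise", "noise", "heat", "heat", "water", "water", "graffiti"]
table = frequency_with_rank(complaints)

# Hypothetical nested layout: one object per field, holding its frequency rows
print(json.dumps({"field": "complaint_type", "frequencies": table}, indent=2))
```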

⚡ stats Command Improvements:

Faster Still: Enabled by improvements in the underlying qsv-stats crate
Improved Precision: Faster, streamlined precision calculation
SIMD Number Parsing: Hardware-accelerated parsing for int/float values
Unix Epoch Support: Proper handling of Unix timestamp 0 as valid date
Enhanced Date Inference: Better date and boolean type inference capabilities
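
The Unix Epoch item is easy to get wrong in any language, because timestamp 0 is both a valid date and a "falsy" or default value. The Python sketch below shows the general pitfall and a safe parse. It is illustrative only and makes no claim about how qsv handled this internally. The function name is hypothetical.

```python
from datetime import datetime, timezone


def parse_epoch_seconds(raw: str):
    """Parse a Unix timestamp in seconds; return None only for non-numeric input."""
    try:
        seconds = int(raw)
    except ValueError:
        return None
    # Deliberately no `if not seconds` check: 0 is a valid timestamp
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


epoch = parse_epoch_seconds("0")
missing = parse_epoch_seconds("n/a")
```

Here `epoch` is midnight UTC on 1 January 1970, while only the non-numeric input is treated as missing.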

🔧 validate & schema Enhancements:

Fancy Regex Support: You can now use "advanced" regex features in your JSON Schema patterns with the --fancy-regex option. Previously, you could only use the standard Rust regex engine, which does not support backreferences or look-arounds (for performance reasons).
JSON Schema Improvements: Better error handling and format validation options
Schema Validation Refinements: More granular validation control with --no-format-validation
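
For readers unfamiliar with the "advanced" features mentioned above, the following Python sketch shows one look-ahead pattern and one backreference pattern. Python's re module is used purely for illustration because it supports both features. The patterns themselves are made-up examples, not taken from qsv's documentation.

```python
import re

# Look-ahead: at least 8 characters, containing at least one digit and one lowercase letter
LOOKAHEAD = re.compile(r"^(?=.*\d)(?=.*[a-z]).{8,}$")

# Backreference: the same word repeated twice in a row, e.g. "the the"
BACKREF = re.compile(r"\b(\w+) \1\b")

has_digit_and_letter = LOOKAHEAD.match("abc12345") is not None
letters_only = LOOKAHEAD.match("abcdefgh") is not None
repeated_word = BACKREF.search("the the cat") is not None
no_repeat = BACKREF.search("the cat") is not None
```

Patterns of this kind are the ones that need a look-around or backreference capable engine.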

🔄 rename Reverted and Improved:

When pairwise renaming was introduced in v6.0.0, it broke some workflows. It's now fixed by introducing two modes:

Positional Mode: Renaming by position is now once again the default
Pairwise Mode: New --pairwise flag for column renaming by column pairs
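
The difference between the two modes can be sketched in Python as follows. This only models the semantics on a list of header names. The function names, the argument shapes and the length check in positional mode are assumptions for illustration, not qsv's code or its exact command-line syntax.

```python
def rename_positional(headers, new_names):
    """Positional: the Nth new name replaces the Nth header."""
    if len(new_names) != len(headers):
        # Assumed behavior for this sketch: counts must match
        raise ValueError("number of new names must match number of headers")
    return list(new_names)


def rename_pairwise(headers, pairs):
    """Pairwise: each (old, new) pair renames one column; the rest are untouched."""
    mapping = dict(pairs)
    return [mapping.get(header, header) for header in headers]


headers = ["id", "nm", "dob"]
positional = rename_positional(headers, ["id", "name", "birth_date"])
pairwise = rename_pairwise(headers, [("nm", "name")])
```

Positional mode needs no knowledge of the existing names, while pairwise mode lets you rename a few columns without listing them all.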

🗂️ partition Improvements:

Case-Insensitive Safety: Improved case-aware partitioning algorithm. Previously, case-insensitive file systems like macOS APFS and Windows NTFS were causing incorrect partitioning of case-sensitive values
Faster still: With better use of I/O buffering - with deferred, batched, async writes instead of writing after every record
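
To see why case-insensitive file systems are a problem for partitioning, consider values such as "Bronx" and "BRONX": naive file naming would send both to what the file system treats as the same file. The Python sketch below shows one possible way to keep such values apart. It is a hypothetical approach for illustration and not necessarily the algorithm partition uses. The function name and the numeric-suffix scheme are assumptions.

```python
def assign_partition_files(values):
    """Map each distinct value to a filename that is unique even when case is ignored."""
    used = set()   # casefolded filenames already taken
    mapping = {}
    for value in values:
        if value in mapping:
            continue  # this exact value already has a file
        candidate = f"{value}.csv"
        suffix = 0
        # Add a numeric suffix until the name no longer collides case-insensitively
        while candidate.casefold() in used:
            suffix += 1
            candidate = f"{value}_{suffix}.csv"
        used.add(candidate.casefold())
        mapping[value] = candidate
    return mapping


files = assign_partition_files(["Bronx", "BRONX", "bronx", "Queens", "Bronx"])
```

Each case-distinct value gets its own file, and no two filenames differ only by letter case.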

Added

frequency add rank info to frequency table #2878
frequency add --json output option #2868
validate add --fancy-regex option #2845
add CPU-accelerated, mem-mapped, chunked sha256 file checksum helper #2909

Changed

apply use SIMD-accelerated base64-simd crate for Encode64 and Decode64 operations #2863
stats faster precision calculation #2852
perf: Use simd_json instead of serde_json to serialize to JSON #2884
refactor: create and use reqwest client helpers to eliminate redundant code #2888
perf: Faster parallelized sha256 hash file #2918
refactor: describegpt #2890
refactor: describegpt setting --timeout to 0 sets no timeout #2891
refactor: describegpt more refinements #2892
feat: describegpt refactor round3 #2893
feat: describegpt disk & redis caching #2895
refactor: describegpt #2896
refactor: describegpt create get_cache_key helper; customizable stats options #2902
feat: describegpt auto SQL RAG for --prompt #2904
feat: describegpt major refactor #2913
refactor: describegpt default promptfile is now embedded in qsv binary; fine-tune tests #2924
feat: describegpt returning reasoning with --json option #2926
feat: describegpt add DuckDB support in SQL RAG mode #2929
feat: describegpt various DuckDB improvements #2936
refactor: describegpt improved cache miss handling #2938
feat: describegpt --addl-props is now part of cachekey #2939
deps: bump cached to 0.56 and remove our patched fork #2853
deps: bump polars from 0.49 to 0.50 #2869
deps: bump polars to 0.50.0 at the py-1.32.2 tag #2877
deps: bump polars to 0.50.0 at py-1.32.3 tag #2889
build(deps): bump actions/checkout from 4 to 5 by @dependabot[bot] in #2886
build(deps): bump arboard from 3.6.0 to 3.6.1 by @dependabot[bot] in #2920
build(deps): bump base62 from 2.2.1 to 2.2.2 by @dependabot[bot] in #2937
build(deps): bump bytemuck from 1.23.1 to 1.23.2 by @dependabot[bot] in #2876
build(deps): bump calamine from 0.29.0 to 0.30.0 by @dependabot[bot] in #2872
build(deps): bump criterion from 0.6.0 to 0.7.0 by @dependabot[bot] in #2855
build(deps): bump dns-lookup from 2.1.0 to 3.0.0 by @dependabot[bot] in #2915
build(deps): bump dynfmt2 from 0.2.0 to 0.3.0 by @dependabot[bot] in #2850
build(deps): bump foldhash from 0.1.5 to 0.2.0 by @dependabot[bot] in #2922
build(deps): bump file-format from 0.27.0 to 0.28.0 by @dependabot[bot] in #2873
build(deps): bump filetime from 0.2.25 to 0.2.26 by @dependabot[bot] in #2906
build(deps): bump governor from 0.10.0 to 0.10.1 by @dependabot[bot] in #2871
build(deps): bump hashbrown from 0.15.4 to 0.15.5 by @dependabot[bot] in #2874
build(deps): bump indexmap from 2.10.0 to 2.11.0 by @dependabot[bot] in #2917
build(deps): bump jsonschema from 0.32.1 to 0.33.0 by @dependabot[bot] in #2928
build(deps): bump libc from 0.2.174 to 0.2.175 by @dependabot[bot] in #2882
build(deps): bump memmap2 from 0.9.7 to 0.9.8 by @dependabot[bot] in #2914
build(deps): bump mimalloc from 0.1.47 to 0.1.48 by @dependabot[bot] in #2935
build(deps): bump minijinja-contrib from 2.11.0 to 2.12.0 by @dependabot[bot] in #2923
deps: bump mlua from 0.10.5 to 0.11.1 - upgrading Luau from 0.663 to 0.682 #2842
build(deps): bump mlua from 0.11.1 to 0.11.2 by @dependabot[bot] in #2879
build(deps): bump phf from 0.12.1 to 0.13.1 by @dependabot[bot] in #2921
build(deps): bump qsv-stats from 0.36.0 to 0.37.0 by @dependabot[bot] in #2856
build(deps): bump rand from 0.9.1 to 0.9.2 by @dependabot[bot] in #2851
build(deps): bump rayon from 1.10.0 to 1.11.0 by @dependabot[bot] in #2887
build(deps): bump redis from 0.32.4 to 0.32.5 by @dependabot[bot] in #2880
build(deps): bump regex from 1.11.1 to 1.11.2 by @dependabot[bot] in #2925
build(deps): bump reqwest from 0.12.22 to 0.12.23 by @dependabot[bot] in #2885
build(deps): bump serde_json from 1.0.140 to 1.0.141 by @dependabot[bot] in #2847
build(deps): bump serde_json from 1.0.141 to 1.0.142 by @dependabot[bot] in #2865
build(deps): bump serde_json from 1.0.142 to 1.0.143 by @dependabot[bot] in #2898
build(deps): bump strum from 0.27.1 to 0.27.2 by @dependabot[bot] in #2848
build(deps): bump strum_macros from 0.27.1 to 0.27.2 by @dependabot[bot] in #2849
build(deps): bump sysinfo from 0.36.0 to 0.36.1 by @dependabot[bot] in #2846
build(deps): bump sysinfo from 0.36.1 to 0.37.0 by @dependabot[bot] in #2881
build(deps): bump tempfile from 3.20.0 to 3.21.0 by @dependabot[bot] in #2900
build(deps): bump tokio from 1.46.1 to 1.47.0 by @dependabot[bot] in #2857
build(deps): bump tokio from 1.47.0 to 1.47.1 by @dependabot[bot] in #2866
build(deps): bump url from 2.5.4 to 2.5.6 by @dependabot[bot] in #2912
build(deps): bump url from 2.5.6 to 2.5.7 by @dependabot[bot] in #2919
build(deps): bump uuid from 1.17.0 to 1.18.0 by @dependabot[bot] in #2883
build(deps): bump zip from 4.3.0 to 4.5.0 by @dependabot[bot] in #2911
applied select clippy suggestions
updated indirect dependencies
bumped MSRV to Rust 1.89

Fixed

fix: json more robust error-handling of invalid JSON input #2844
fix: template fix stdin regression #2907
fix: rename add --positional option #2930
fix: rename the real fix - positional is now the default and pairwise is the option #2931
fix: partition case insensitive filesystems #2934
docs: fix inconsistent formatting in command help examples by @abobov in #2862

New Contributors

@abobov made their first contribution in #2862

Full Changelog: `6.0.1...7.0.0`

Footnotes

¹ LLMs can still hallucinate a syntactically wrong SQL query. But once a valid SQL query is generated, it's fully reproducible.
² Depending on your LLM setup, SQL query generation may take some time. Once generated, however, the SQL query itself will be blazing-fast.
