[7.0.0] - 2025-08-28
🥳 Open Weights with Open Data, Local LLM 🤖 edition 🚀
This is the biggest release yet - 470+ commits since v6.0.1! Packed with new AI-powered features, fixes and significant performance improvements suite-wide!
With the release of OpenAI's gpt-oss open-weight reasoning model earlier this month setting the stage, we continue on our "Automagical Metadata" journey by revamping `describegpt`.
🤖 Revamped `describegpt` - AI-Powered Metadata Inferencing and Data Analysis:
- Intelligent Metadata Generation: Automatically generate comprehensive metadata - Data Dictionaries, Description and Tags for your Datasets using Large Language Models (LLM) prompted with summary statistics and frequency tables as detailed context - without sending your data to the cloud! Even if you elect to use a cloud-based LLM, your Raw Data is never sent.
- Chat with your Data: If your prompt can be answered using this high-quality, high-resolution Metadata, `describegpt` will answer it! If your prompt is not remotely related to the data, it will politely refuse - "I'm sorry, I can only answer questions about the Dataset."
- Auto SQL RAG Mode: Should the LLM decide that it doesn't have the necessary information in the metadata it compiled to answer your prompt, it will automatically enter SQL Retrieval-Augmented Generation (RAG) mode - using the rich metadata instead as context to craft an expert-level, deterministic, reproducible, "hallucination-free" SQL query [1] to respond to your prompt.
- Database Engine Support: If DuckDB is installed or the Polars feature is enabled, and `--sql-results <ANSWER.CSV>` is specified - an optimized SQL query will be automatically executed with the query results saved to the specified file. As both DuckDB and Polars are purpose-built OLAP engines that support direct queries (no database pre-loading required), you get answers in a few seconds [2] - even for very large datasets.
- Multi-LLM Support: Works with any OpenAI-API compatible LLM - with special support for local LLMs like Ollama, Jan and LM Studio, with the ability to customize model behavior with the `--addl-props` option.
- Advanced Caching: Disk and Redis caching support for performance and cost optimization.
- Flexible Prompting: Custom prompt files and built-in intelligent templates for various analysis tasks.
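To make the caching bullet concrete: a cached answer is only safe to reuse if everything that can change the LLM's response feeds into the cache key, including any extra model properties. The sketch below is illustrative only. The function signature, the choice of fields and the use of SHA-256 are assumptions, not the actual helper in qsv.

```python
import hashlib
import json
from typing import Optional


def get_cache_key(model: str, prompt: str, addl_props: Optional[dict] = None) -> str:
    """Hypothetical cache key: hash every input that can change the answer."""
    payload = {
        "model": model,
        "prompt": prompt,
        "addl_props": addl_props or {},
    }
    # sort_keys makes the key independent of the order the properties were given in
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same model and prompt, but different additional properties -> different keys
base = get_cache_key("gpt-oss", "Describe the dataset")
tuned = get_cache_key("gpt-oss", "Describe the dataset", {"temperature": 0.2})
print(base != tuned)
```

With a key built this way, changing a model property results in a cache miss rather than a stale answer produced under different settings.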
Check out these examples using a 1 million row sample of NYC's 311 data!
- `--all` option produces a Data Dictionary, Description and Tags - Markdown, JSON
- `--prompt "What are the top 10 complaint types per community board and borough?"` - SQL result
- `--prompt "How tall is the Empire State Building?"` - "I'm sorry, I can only answer questions about the Dataset."
On top of other improvements in Datapusher+ with its new Jinja-based "metadata suggestion engine" - we're using this AI-inferred metadata along with other precalcs to prepopulate DCATv3 (both US and European profiles) and Croissant metadata fields that are otherwise too hard and expensive to compile manually.
The inferred and precalculated metadata values are offered as "suggestions", using a UI/UX purpose-built to facilitate interactive metadata curation chats.
This allows Data Stewards to compile high-quality, high-resolution metadata catalogs with an accelerated "Data Steward in the Loop" data ingestion and metadata curation workflow.
If you want to see and learn more, we're Bologna-bound to attend csv,conf,v9 to present and share how we're using this to auto-infer metadata in CKAN. Hope to see you there!
Towards the People's API!
(Answering People/Policymaker Interface)
📊 Enhanced `frequency` Command:
- Rank Column: Ranking of frequency results for better data insights
- JSON Output Mode: New `--json` option not only provides structured output beyond the default CSV format - it also takes advantage of JSON's support for nesting to include 15 additional summary statistics per field
- Performance Boost: Speed improvements with SIMD-accelerated number parsing, remaining performant even with the added functionality
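As a rough picture of what a rank column adds to a frequency table, here is a minimal sketch. The dense tie-handling (equal counts share a rank, with no gaps afterwards) and the field names `value`, `count` and `rank` are assumptions for illustration. The notes above do not specify how `frequency` ranks ties or names its columns.

```python
from collections import Counter


def frequency_with_rank(values):
    """Build a frequency table with a rank column (dense ranking assumed)."""
    counts = Counter(values)
    # Sort by descending count, then by value for a deterministic order
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    rank = 0
    prev_count = None
    for value, count in ordered:
        if count != prev_count:
            rank += 1  # a new, lower count starts the next rank
            prev_count = count
        rows.append({"value": value, "count": count, "rank": rank})
    return rows


rows = frequency_with_rank(["noise", "heat", "noise", "parking", "heat", "noise"])
for row in rows:
    print(row)
```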
⚡ `stats` Command Improvements:
- Faster Still: Enabled by improvements in the underlying qsv-stats crate
- Improved Precision: Faster, streamlined precision calculation
- SIMD Number Parsing: Hardware-accelerated parsing for int/float values
- Unix Epoch Support: Proper handling of Unix timestamp 0 as valid date
- Enhanced Date Inference: Better date and boolean type inference capabilities
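The Unix Epoch item is easy to get wrong in any language: timestamp 0 is a real instant (1970-01-01T00:00:00Z), but it is also "falsy" or "empty-looking" in many parsers. The sketch below shows the distinction. It is an illustration of the pitfall only and does not reflect how `stats` infers dates. The function name is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_epoch_seconds(raw: str) -> Optional[datetime]:
    """Return a UTC datetime for an integer epoch string, or None if missing/invalid."""
    text = raw.strip()
    if text == "":
        return None  # genuinely missing value
    try:
        seconds = int(text)
    except ValueError:
        return None  # not an integer timestamp
    # Deliberately no truthiness check on `seconds`: 0 is a valid timestamp
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(parse_epoch_seconds("0"))
```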
🔧 `validate` & `schema` Enhancements:
- Fancy Regex Support: You can now use "advanced" regex features in your JSON Schema patterns with the `--fancy-regex` option. Previously, you could only use the standard Rust regex engine, which does not support backreferences or look-arounds (for performance reasons)
- JSON Schema Improvements: Better error handling and format validation options
- Schema Validation Refinements: More granular validation control with `--no-format-validation`
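For a feel of what "advanced" regex features means here, the sketch below shows one look-ahead pattern and one backreference pattern. Python's `re` module is used purely as a stand-in because it supports both constructs. The patterns themselves are illustrative examples, not taken from qsv's documentation or tests.

```python
import re

# Look-ahead: at least 8 characters, containing a digit and a lowercase letter
LOOKAHEAD = re.compile(r"^(?=.*\d)(?=.*[a-z]).{8,}$")

# Backreference: the same word repeated, e.g. "bye bye"
BACKREF = re.compile(r"^(\w+) \1$")


def matches(pattern, text):
    """True if the compiled pattern matches the text."""
    return pattern.search(text) is not None


print(matches(LOOKAHEAD, "abc12345"))  # has a digit, a lowercase letter, 8 chars
print(matches(BACKREF, "bye bye"))     # second word must equal the first
```

Patterns like these are the ones the standard Rust regex engine rejects. Per the notes above, they become usable in JSON Schema `pattern` constraints once `--fancy-regex` is specified.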
🔄 `rename` Reverted and Improved:
When pairwise renaming was introduced in v6.0.0, it broke some workflows. It's now fixed by introducing two modes:
- Positional Mode: Renaming by position is now once again the default
- Pairwise Mode: New `--pairwise` flag for column renaming by column pairs
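The difference between the two modes can be sketched as follows. This is a conceptual illustration only: the function names are hypothetical, and details such as requiring one new name per column in positional mode or rejecting unknown columns in pairwise mode are assumptions, not documented `rename` behavior.

```python
def rename_positional(headers, new_names):
    """Positional: the i-th new name replaces the i-th column name."""
    if len(new_names) != len(headers):
        raise ValueError("need exactly one new name per column")
    return list(new_names)


def rename_pairwise(headers, pairs):
    """Pairwise: each (old, new) pair renames one named column; others are untouched."""
    mapping = dict(pairs)
    unknown = set(mapping) - set(headers)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    return [mapping.get(h, h) for h in headers]


headers = ["id", "name", "dob"]
positional = rename_positional(headers, ["ID", "full_name", "birth_date"])
pairwise = rename_pairwise(headers, [("dob", "birth_date")])
print(positional)
print(pairwise)
```

Because the same argument list means different things under each interpretation, switching the default silently changes results, which is why existing workflows broke.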
🗂️ `partition` Improvements:
- Case-Insensitive Safety: Improved case-aware partitioning algorithm. Previously, case-insensitive file systems like macOS APFS and Windows NTFS were causing incorrect partitioning of case-sensitive values
- Faster Still: Better use of I/O buffering, with deferred, batched, async writes instead of a write after every record
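To see why case-insensitive file systems are a problem for partitioning: values such as `NY` and `ny` are distinct partitions, but `NY.csv` and `ny.csv` resolve to the same file on such a file system, so their rows end up mixed. One possible way to avoid that is sketched below. The numeric-suffix scheme and the function name are assumptions for illustration and are not the algorithm `partition` actually uses.

```python
from collections import defaultdict


def safe_partition_filenames(values, ext=".csv"):
    """Map each distinct value to a filename that stays unique when case is ignored."""
    by_folded = defaultdict(list)
    for v in dict.fromkeys(values):  # distinct values, in first-seen order
        by_folded[v.casefold()].append(v)
    names = {}
    for variants in by_folded.values():
        for i, v in enumerate(variants):
            # The first variant keeps the plain name; later ones get a numeric suffix.
            # Sketch only: a value that already looks like "ny_1" could still collide.
            names[v] = f"{v}{ext}" if i == 0 else f"{v}_{i}{ext}"
    return names


names = safe_partition_filenames(["NY", "ny", "CA", "Ny", "NY"])
print(names)
```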
Added
- `frequency` add rank info to frequency table #2878
- `frequency` add `--json` output option #2868
- `validate` add `--fancy-regex` option #2845
- add CPU-accelerated, mem-mapped, chunked sha256 file checksum helper #2909
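The idea behind a memory-mapped, chunked file checksum can be sketched in a few lines. This is a generic illustration using the Python standard library. The function name and the 1 MiB default chunk size are assumptions, and it says nothing about how qsv's Rust helper is implemented or accelerated.

```python
import hashlib
import mmap
import os


def sha256_file(path, chunk_size=1 << 20):
    """SHA-256 of a file, fed to the hasher in chunks from a read-only memory map."""
    hasher = hashlib.sha256()
    if os.path.getsize(path) == 0:
        return hasher.hexdigest()  # empty files cannot be memory-mapped
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for start in range(0, len(mm), chunk_size):
                # Slicing the map yields one chunk at a time; the whole file
                # is never copied into a single buffer
                hasher.update(mm[start:start + chunk_size])
    return hasher.hexdigest()
```

Hashing in chunks keeps memory use bounded regardless of file size, while the memory map leaves paging to the operating system.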
Changed
- `apply` use SIMD-accelerated base64-simd crate for Encode64 and Decode64 operations #2863
- `stats` faster precision calculation #2852
- perf: Use simd_json instead of serde_json to serialize to JSON #2884
- refactor: create and use reqwest client helpers to eliminate redundant code #2888
- perf: Faster parallelized sha256 hash file #2918
- refactor: `describegpt` #2890
- refactor: `describegpt` setting `--timeout` to 0 sets no timeout #2891
- refactor: `describegpt` more refinements #2892
- feat: `describegpt` refactor round3 #2893
- feat: `describegpt` disk & redis caching #2895
- refactor: `describegpt` #2896
- refactor: `describegpt` create `get_cache_key` helper; customizable stats options #2902
- feat: `describegpt` auto SQL RAG for `--prompt` #2904
- feat: `describegpt` major refactor #2913
- refactor: `describegpt` default promptfile is now embedded in qsv binary; fine-tune tests #2924
- feat: `describegpt` returning reasoning with --json option #2926
- feat: `describegpt` add DuckDB support in SQL RAG mode #2929
- feat: `describegpt` various DuckDB improvements #2936
- refactor: `describegpt` improved cache miss handling #2938
- feat: `describegpt` `--addl-props` is now part of cachekey #2939
- deps: bump cached to 0.56 and remove our patched fork #2853
- deps: bump polars from 0.49 to 0.50 #2869
- deps: bump polars to 0.50.0 at the py-1.32.2 tag #2877
- deps: bump polars to 0.50.0 at py-1.32.3 tag #2889
- build(deps): bump actions/checkout from 4 to 5 by @dependabot[bot] in #2886
- build(deps): bump arboard from 3.6.0 to 3.6.1 by @dependabot[bot] in #2920
- build(deps): bump base62 from 2.2.1 to 2.2.2 by @dependabot[bot] in #2937
- build(deps): bump bytemuck from 1.23.1 to 1.23.2 by @dependabot[bot] in #2876
- build(deps): bump calamine from 0.29.0 to 0.30.0 by @dependabot[bot] in #2872
- build(deps): bump criterion from 0.6.0 to 0.7.0 by @dependabot[bot] in #2855
- build(deps): bump dns-lookup from 2.1.0 to 3.0.0 by @dependabot[bot] in #2915
- build(deps): bump dynfmt2 from 0.2.0 to 0.3.0 by @dependabot[bot] in #2850
- build(deps): bump foldhash from 0.1.5 to 0.2.0 by @dependabot[bot] in #2922
- build(deps): bump file-format from 0.27.0 to 0.28.0 by @dependabot[bot] in #2873
- build(deps): bump filetime from 0.2.25 to 0.2.26 by @dependabot[bot] in #2906
- build(deps): bump governor from 0.10.0 to 0.10.1 by @dependabot[bot] in #2871
- build(deps): bump hashbrown from 0.15.4 to 0.15.5 by @dependabot[bot] in #2874
- build(deps): bump indexmap from 2.10.0 to 2.11.0 by @dependabot[bot] in #2917
- build(deps): bump jsonschema from 0.32.1 to 0.33.0 by @dependabot[bot] in #2928
- build(deps): bump libc from 0.2.174 to 0.2.175 by @dependabot[bot] in #2882
- build(deps): bump memmap2 from 0.9.7 to 0.9.8 by @dependabot[bot] in #2914
- build(deps): bump mimalloc from 0.1.47 to 0.1.48 by @dependabot[bot] in #2935
- build(deps): bump minijinja-contrib from 2.11.0 to 2.12.0 by @dependabot[bot] in #2923
- deps: bump mlua from 0.10.5 to 0.11.1 - upgrading Luau from 0.663 to 0.682 #2842
- build(deps): bump mlua from 0.11.1 to 0.11.2 by @dependabot[bot] in #2879
- build(deps): bump phf from 0.12.1 to 0.13.1 by @dependabot[bot] in #2921
- build(deps): bump qsv-stats from 0.36.0 to 0.37.0 by @dependabot[bot] in #2856
- build(deps): bump rand from 0.9.1 to 0.9.2 by @dependabot[bot] in #2851
- build(deps): bump rayon from 1.10.0 to 1.11.0 by @dependabot[bot] in #2887
- build(deps): bump redis from 0.32.4 to 0.32.5 by @dependabot[bot] in #2880
- build(deps): bump regex from 1.11.1 to 1.11.2 by @dependabot[bot] in #2925
- build(deps): bump reqwest from 0.12.22 to 0.12.23 by @dependabot[bot] in #2885
- build(deps): bump serde_json from 1.0.140 to 1.0.141 by @dependabot[bot] in #2847
- build(deps): bump serde_json from 1.0.141 to 1.0.142 by @dependabot[bot] in #2865
- build(deps): bump serde_json from 1.0.142 to 1.0.143 by @dependabot[bot] in #2898
- build(deps): bump strum from 0.27.1 to 0.27.2 by @dependabot[bot] in #2848
- build(deps): bump strum_macros from 0.27.1 to 0.27.2 by @dependabot[bot] in #2849
- build(deps): bump sysinfo from 0.36.0 to 0.36.1 by @dependabot[bot] in #2846
- build(deps): bump sysinfo from 0.36.1 to 0.37.0 by @dependabot[bot] in #2881
- build(deps): bump tempfile from 3.20.0 to 3.21.0 by @dependabot[bot] in #2900
- build(deps): bump tokio from 1.46.1 to 1.47.0 by @dependabot[bot] in #2857
- build(deps): bump tokio from 1.47.0 to 1.47.1 by @dependabot[bot] in #2866
- build(deps): bump url from 2.5.4 to 2.5.6 by @dependabot[bot] in #2912
- build(deps): bump url from 2.5.6 to 2.5.7 by @dependabot[bot] in #2919
- build(deps): bump uuid from 1.17.0 to 1.18.0 by @dependabot[bot] in #2883
- build(deps): bump zip from 4.3.0 to 4.5.0 by @dependabot[bot] in #2911
- applied select clippy suggestions
- updated indirect dependencies
- bumped MSRV to Rust 1.89
Fixed
- fix: `json` more robust error-handling of invalid JSON input #2844
- fix: `template` fix stdin regression #2907
- fix: `rename` add `--positional` option #2930
- fix: `rename` the real fix - positional is now the default and pairwise is the option #2931
- fix: `partition` case insensitive filesystems #2934
- docs: fix inconsistent formatting in command help examples by @abobov in #2862
New Contributors
Full Changelog: 6.0.1...7.0.0
Footnotes
- [1] LLMs can still hallucinate a syntactically wrong SQL query. But once a valid SQL query is generated, it's fully reproducible.
- [2] Depending on your LLM setup, SQL query generation may take some time. Once generated, however, the SQL query itself will be blazing-fast.