github dathere/qsv 7.0.0

[7.0.0] - 2025-08-28

🥳 Open Weights with Open Data, Local LLM 🤖 edition 🚀

This is the biggest release yet - 470+ commits since v6.0.1! Packed with new AI-powered features, fixes and significant performance improvements suite-wide!

With the release of OpenAI's gpt-oss open-weight reasoning model earlier this month setting the stage, we continue on our "Automagical Metadata" journey by revamping describegpt.

🤖 Revamped describegpt - AI-Powered Metadata Inferencing and Data Analysis:

  • Intelligent Metadata Generation: Automatically generate comprehensive metadata - Data Dictionaries, Description and Tags for your Datasets using Large Language Models (LLM) prompted with summary statistics and frequency tables as detailed context - without sending your data to the cloud!
    Even if you elect to use a cloud-based LLM, your Raw Data is never sent.
  • Chat with your Data: If your prompt can be answered using this high-quality, high-resolution Metadata, describegpt will answer it! If your prompt is not remotely related to the data, it will politely refuse - "I'm sorry, I can only answer questions about the Dataset."
  • Auto SQL RAG Mode: Should the LLM decide that it doesn't have the necessary information in the metadata it compiled to answer your prompt, it will automatically enter SQL Retrieval-Augmented Generation (RAG) mode - using the rich metadata instead as context to craft an expert-level, deterministic, reproducible, "hallucination-free" SQL query[1] to respond to your prompt.
  • Database Engine Support: If DuckDB is installed or the Polars feature is enabled, and --sql-results <ANSWER.CSV> is specified - an optimized SQL query will be automatically executed with the query results saved to the specified file.
    As both DuckDB and Polars are purpose-built OLAP engines that support direct queries (no database pre-loading required), you get answers in a few seconds[2] - even for very large datasets.
  • Multi-LLM Support: Works with any OpenAI-API compatible LLM - with special support for local LLMs like Ollama, Jan and LM Studio, with the ability to customize model behavior with the --addl-props option.
  • Advanced Caching: Disk and Redis caching support for performance and cost optimization.
  • Flexible Prompting: Custom prompt files and built-in intelligent templates for various analysis tasks.
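
To make the "summary statistics and frequency tables as context" idea concrete, here is a minimal Python sketch. It is not describegpt's implementation: the function name `build_context`, the statistics chosen and the layout of the text are all assumptions made for illustration. It only shows how per-column aggregates, rather than the rows themselves, could be assembled into the text an LLM is prompted with.

```python
import csv
import io
from collections import Counter


def build_context(csv_text: str, top_n: int = 3) -> str:
    """Summarize each column as its cardinality plus its most frequent values.

    Hypothetical helper: describegpt's real context is built from qsv's own
    stats and frequency output, which is far richer than this.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return "Dataset is empty."
    lines = [f"Row count: {len(rows)}"]
    for column in rows[0].keys():
        freq = Counter(row[column] for row in rows)
        top = ", ".join(
            f"{value!r} x{count}" for value, count in freq.most_common(top_n)
        )
        lines.append(f"Column {column}: cardinality={len(freq)}; top values: {top}")
    return "\n".join(lines)


# Tiny made-up sample; the column names are illustrative only.
SAMPLE = "borough,complaint\nBRONX,Noise\nBRONX,Heat\nQUEENS,Noise\n"
context = build_context(SAMPLE)
print(context)
```

The resulting string is what would be placed in the prompt in this sketch; the CSV rows never leave the function.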

Check out these examples using a 1 million row sample of NYC's 311 data!

  • --all option produces a Data Dictionary, Description and Tags - Markdown, JSON
  • --prompt "What are the top 10 complaint types per community board and borough?" - SQL result
  • --prompt "How tall is the Empire State Building?" - "I'm sorry, I can only answer questions about the Dataset."

On top of other improvements in Datapusher+ with its new Jinja-based "metadata suggestion engine" - we're using this AI-inferred metadata along with other precalcs to prepopulate DCATv3 (both US and European profiles) and Croissant metadata fields that are otherwise too hard and expensive to compile manually.

The inferred and precalculated metadata values are offered as "suggestions", using a UI/UX purpose-built to facilitate interactive metadata curation chats.

This allows Data Stewards to compile high-quality, high-resolution metadata catalogs with an accelerated "Data Steward in the Loop" data ingestion and metadata curation workflow.

If you want to see and learn more, we're Bologna-bound to attend csv,conf,v9 to present and share how we're using this to auto-infer metadata in CKAN. Hope to see you there!

Towards the People's API!

(Answering People/Policymaker Interface)


📊 Enhanced frequency Command:

  • Rank Column: Ranking of frequency results for better data insights
  • JSON Output Mode: New --json option not only provides structured output beyond the default CSV format - it also takes advantage of JSON's nested support to include 15 additional summary statistics per field
  • Performance Boost: Speed improvements with SIMD-accelerated number parsing, remaining performant even with the added functionality
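
As a rough illustration of what a ranked frequency table contains, the following Python sketch counts values and attaches a rank. The tie-handling rule used here (dense ranking, where tied counts share a rank) is an assumption, as is the function name `frequency_with_rank`; the release notes do not say which ranking rule the frequency command applies.

```python
from collections import Counter


def frequency_with_rank(values):
    """Return (value, count, percentage, rank) tuples, most frequent first.

    Assumption: dense ranking, so tied counts share a rank and the next
    distinct count gets the next integer.
    """
    counts = Counter(values)
    total = sum(counts.values())
    # Sort by descending count, then by value for a repeatable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    table = []
    rank = 0
    previous_count = None
    for value, count in ordered:
        if count != previous_count:
            rank += 1
            previous_count = count
        table.append((value, count, round(100 * count / total, 2), rank))
    return table


table = frequency_with_rank(["Noise", "Heat", "Noise", "Water", "Heat", "Noise"])
for row in table:
    print(row)
```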

stats Command Improvements:

  • Faster Still: Enabled by improvements in the underlying qsv-stats crate
  • Improved Precision: Faster, streamlined precision calculation
  • SIMD Number Parsing: Hardware-accelerated parsing for int/float values
  • Unix Epoch Support: Proper handling of Unix timestamp 0 as valid date
  • Enhanced Date Inference: Better date and boolean type inference capabilities

🔧 validate & schema Enhancements:

  • Fancy Regex Support: You can now use "advanced" regex features in your JSON Schema patterns with the --fancy-regex option. Previously, you could only use the standard Rust regex engine, which does not support backreferences or look-arounds (for performance reasons)
  • JSON Schema Improvements: Better error handling and format validation options
  • Schema Validation Refinements: More granular validation control with --no-format-validation
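
For readers unfamiliar with the "advanced" features mentioned above, this short Python sketch shows one look-ahead pattern and one backreference pattern. Python's `re` module is used only because it supports both constructs; it is not the engine qsv uses, and the two patterns and helper names are made up for illustration.

```python
import re

# Look-ahead: at least 8 characters, containing both a digit and a lowercase letter.
LOOKAHEAD = re.compile(r"(?=.*\d)(?=.*[a-z]).{8,}")
# Backreference: the same word character twice in a row.
BACKREFERENCE = re.compile(r"(\w)\1")


def matches_lookahead(text: str) -> bool:
    """True when the whole string satisfies the look-ahead pattern."""
    return LOOKAHEAD.fullmatch(text) is not None


def has_doubled_char(text: str) -> bool:
    """True when any character is immediately repeated."""
    return BACKREFERENCE.search(text) is not None


print(matches_lookahead("abc12345"))  # has a digit and a lowercase letter
print(matches_lookahead("abcdefgh"))  # no digit
print(has_doubled_char("letter"))     # contains "tt"
print(has_doubled_char("later"))      # no doubled character
```

Patterns of this kind are the ones that need a backtracking-capable engine, which is the trade-off the performance note above refers to.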

🔄 rename Reverted and Improved:

When pairwise renaming was introduced in v6.0.0, it broke some workflows. It's now fixed by introducing two modes:

  • Positional Mode: Renaming by position is now once again the default
  • Pairwise Mode: New --pairwise flag for column renaming by column pairs
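
The difference between the two modes can be sketched in Python as follows. This is an illustration of the semantics as described above, not qsv's code: the function names are hypothetical, and the flat old/new list format and the "one new name per column" check are assumptions.

```python
def rename_positional(headers, new_names):
    """Replace every header in order; assumes one new name per column."""
    if len(new_names) != len(headers):
        raise ValueError("expected one new name per column")
    return list(new_names)


def rename_pairwise(headers, pairs):
    """Rename selected columns using a flat [old1, new1, old2, new2, ...] list."""
    if len(pairs) % 2 != 0:
        raise ValueError("pairs must contain an even number of names")
    mapping = dict(zip(pairs[0::2], pairs[1::2]))
    # Columns that are not mentioned keep their original name.
    return [mapping.get(name, name) for name in headers]


headers = ["a", "b", "c"]
print(rename_positional(headers, ["id", "name", "title"]))
print(rename_pairwise(headers, ["a", "id", "c", "title"]))
```

The same argument list means different things in each mode, which is why changing the default in v6.0.0 could silently alter existing pipelines.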

🗂️ partition Improvements:

  • Case-Insensitive Safety: Improved case-aware partitioning algorithm. Previously, case-insensitive file systems like macOS APFS and Windows NTFS were causing incorrect partitioning of case-sensitive values
  • Faster Still: With better use of I/O buffering - with deferred, batched, async writes instead of after every record
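
The case-sensitivity problem can be illustrated with a small Python sketch: values such as "NY" and "ny" are distinct partitions, but "NY.csv" and "ny.csv" are the same file on a case-insensitive file system. The disambiguation scheme below (a numeric suffix) and the function name are hypothetical; the release notes do not describe how partition actually resolves such collisions.

```python
def partition_filenames(values):
    """Map each distinct value to a file name that stays unique under case folding."""
    assigned = {}  # value -> file name
    used = set()   # case-folded file names already taken
    for value in values:
        if value in assigned:
            continue
        candidate = f"{value}.csv"
        suffix = 1
        # Keep trying suffixed names until no case-folded collision remains.
        while candidate.casefold() in used:
            candidate = f"{value}_{suffix}.csv"
            suffix += 1
        used.add(candidate.casefold())
        assigned[value] = candidate
    return assigned


names = partition_filenames(["NY", "ny", "Ny", "NY"])
print(names)
```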

Added

  • frequency add rank info to frequency table #2878
  • frequency add --json output option #2868
  • validate add --fancy-regex option #2845
  • add CPU-accelerated, mem-mapped, chunked sha256 file checksum helper #2909
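
The memory-mapped, chunked checksum idea from the last item can be sketched with Python's standard library. This is only an analogy to the helper named above, not a port of it: the chunk size and the function name `sha256_mmap` are assumptions, and this version is single-threaded.

```python
import hashlib
import mmap
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # 1 MiB per update; an arbitrary choice


def sha256_mmap(path: str, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a file by memory-mapping it and feeding the digest chunk by chunk."""
    digest = hashlib.sha256()
    if os.path.getsize(path) == 0:
        return digest.hexdigest()  # an empty file cannot be memory-mapped
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for offset in range(0, len(mapped), chunk_size):
                digest.update(mapped[offset:offset + chunk_size])
    return digest.hexdigest()


# Demo on a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a,b\n1,2\n" * 1000)
    demo_path = tmp.name
try:
    checksum = sha256_mmap(demo_path, chunk_size=4096)
    print(checksum)
finally:
    os.remove(demo_path)
```

Feeding the digest in chunks keeps memory use bounded regardless of file size, while the mapping lets the operating system handle paging.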

Changed

  • apply use SIMD-accelerated base64-simd crate for Encode64 and Decode64 operations #2863
  • stats faster precision calculation #2852
  • perf: Use simd_json instead of serde_json to serialize to JSON #2884
  • refactor: create and use reqwest client helpers to eliminate redundant code #2888
  • perf: Faster parallelized sha256 hash file #2918
  • refactor: describegpt #2890
  • refactor: describegpt setting --timeout to 0 sets no timeout #2891
  • refactor: describegpt more refinements #2892
  • feat: describegpt refactor round3 #2893
  • feat: describegpt disk & redis caching #2895
  • refactor: describegpt #2896
  • refactor: describegpt create get_cache_key helper; customizable stats options #2902
  • feat: describegpt auto SQL RAG for --prompt #2904
  • feat: describegpt major refactor #2913
  • refactor: describegpt default promptfile is now embedded in qsv binary; fine-tune tests #2924
  • feat: describegpt returning reasoning with --json option #2926
  • feat: describegpt add DuckDB support in SQL RAG mode #2929
  • feat: describegpt various DuckDB improvements #2936
  • refactor: describegpt improved cache miss handling #2938
  • feat: describegpt --addl-props is now part of cachekey #2939
  • deps: bump cached to 0.56 and remove our patched fork #2853
  • deps: bump polars from 0.49 to 0.50 #2869
  • deps: bump polars to 0.50.0 at the py-1.32.2 tag #2877
  • deps: bump polars to 0.50.0 at py-1.32.3 tag #2889
  • build(deps): bump actions/checkout from 4 to 5 by @dependabot[bot] in #2886
  • build(deps): bump arboard from 3.6.0 to 3.6.1 by @dependabot[bot] in #2920
  • build(deps): bump base62 from 2.2.1 to 2.2.2 by @dependabot[bot] in #2937
  • build(deps): bump bytemuck from 1.23.1 to 1.23.2 by @dependabot[bot] in #2876
  • build(deps): bump calamine from 0.29.0 to 0.30.0 by @dependabot[bot] in #2872
  • build(deps): bump criterion from 0.6.0 to 0.7.0 by @dependabot[bot] in #2855
  • build(deps): bump dns-lookup from 2.1.0 to 3.0.0 by @dependabot[bot] in #2915
  • build(deps): bump dynfmt2 from 0.2.0 to 0.3.0 by @dependabot[bot] in #2850
  • build(deps): bump foldhash from 0.1.5 to 0.2.0 by @dependabot[bot] in #2922
  • build(deps): bump file-format from 0.27.0 to 0.28.0 by @dependabot[bot] in #2873
  • build(deps): bump filetime from 0.2.25 to 0.2.26 by @dependabot[bot] in #2906
  • build(deps): bump governor from 0.10.0 to 0.10.1 by @dependabot[bot] in #2871
  • build(deps): bump hashbrown from 0.15.4 to 0.15.5 by @dependabot[bot] in #2874
  • build(deps): bump indexmap from 2.10.0 to 2.11.0 by @dependabot[bot] in #2917
  • build(deps): bump jsonschema from 0.32.1 to 0.33.0 by @dependabot[bot] in #2928
  • build(deps): bump libc from 0.2.174 to 0.2.175 by @dependabot[bot] in #2882
  • build(deps): bump memmap2 from 0.9.7 to 0.9.8 by @dependabot[bot] in #2914
  • build(deps): bump mimalloc from 0.1.47 to 0.1.48 by @dependabot[bot] in #2935
  • build(deps): bump minijinja-contrib from 2.11.0 to 2.12.0 by @dependabot[bot] in #2923
  • deps: bump mlua from 0.10.5 to 0.11.1 - upgrading Luau from 0.663 to 0.682 #2842
  • build(deps): bump mlua from 0.11.1 to 0.11.2 by @dependabot[bot] in #2879
  • build(deps): bump phf from 0.12.1 to 0.13.1 by @dependabot[bot] in #2921
  • build(deps): bump qsv-stats from 0.36.0 to 0.37.0 by @dependabot[bot] in #2856
  • build(deps): bump rand from 0.9.1 to 0.9.2 by @dependabot[bot] in #2851
  • build(deps): bump rayon from 1.10.0 to 1.11.0 by @dependabot[bot] in #2887
  • build(deps): bump redis from 0.32.4 to 0.32.5 by @dependabot[bot] in #2880
  • build(deps): bump regex from 1.11.1 to 1.11.2 by @dependabot[bot] in #2925
  • build(deps): bump reqwest from 0.12.22 to 0.12.23 by @dependabot[bot] in #2885
  • build(deps): bump serde_json from 1.0.140 to 1.0.141 by @dependabot[bot] in #2847
  • build(deps): bump serde_json from 1.0.141 to 1.0.142 by @dependabot[bot] in #2865
  • build(deps): bump serde_json from 1.0.142 to 1.0.143 by @dependabot[bot] in #2898
  • build(deps): bump strum from 0.27.1 to 0.27.2 by @dependabot[bot] in #2848
  • build(deps): bump strum_macros from 0.27.1 to 0.27.2 by @dependabot[bot] in #2849
  • build(deps): bump sysinfo from 0.36.0 to 0.36.1 by @dependabot[bot] in #2846
  • build(deps): bump sysinfo from 0.36.1 to 0.37.0 by @dependabot[bot] in #2881
  • build(deps): bump tempfile from 3.20.0 to 3.21.0 by @dependabot[bot] in #2900
  • build(deps): bump tokio from 1.46.1 to 1.47.0 by @dependabot[bot] in #2857
  • build(deps): bump tokio from 1.47.0 to 1.47.1 by @dependabot[bot] in #2866
  • build(deps): bump url from 2.5.4 to 2.5.6 by @dependabot[bot] in #2912
  • build(deps): bump url from 2.5.6 to 2.5.7 by @dependabot[bot] in #2919
  • build(deps): bump uuid from 1.17.0 to 1.18.0 by @dependabot[bot] in #2883
  • build(deps): bump zip from 4.3.0 to 4.5.0 by @dependabot[bot] in #2911
  • applied select clippy suggestions
  • updated indirect dependencies
  • bumped MSRV to Rust 1.89

Fixed

  • fix: json more robust error-handling of invalid JSON input #2844
  • fix: template fix stdin regression #2907
  • fix: rename add --positional option #2930
  • fix: rename the real fix - positional is now the default and pairwise is the option #2931
  • fix: partition case insensitive filesystems #2934
  • docs: fix inconsistent formatting in command help examples by @abobov in #2862

New Contributors

Full Changelog: 6.0.1...7.0.0

Footnotes

  1. LLMs can still hallucinate a syntactically wrong SQL query. But once a valid SQL query is generated, it's fully reproducible. ↩

  2. Depending on your LLM setup, SQL query generation may take some time. Once generated however, the SQL query itself will be blazing-fast. ↩
