[11.0.2] - 2025-12-08
qsv 11.0.2 brings significant enhancements to larger-than-memory data processing, AI-powered metadata inferencing, JSON Schema inferencing & validation, and data viewing capabilities, along with important bug fixes and performance improvements.
All in preparation for at-scale, secure, interactive, "zero-copy" "Data Steward-in-the-Loop" FAIRification on the desktop in qsv pro.
🌟 Major Features
stats & frequency
- Larger than Memory Files: stats & frequency can now handle arbitrarily large files, even when "advanced" statistics are enabled, thanks to a new dynamic parallel chunk sizing algorithm! (example stats, frequency)
- N Counts: Added "n_counts" (n_negative, n_zero and n_positive) columns to stats output for more detailed count information for numeric fields.
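To make the N Counts columns concrete, here is a minimal Python sketch of the tally they describe for a single numeric column. It is only an illustration, not qsv's implementation: the helper name n_counts, the sample data, and the choice to skip empty cells are all assumptions made for this example.

```python
import csv
import io


def n_counts(rows, column):
    """Tally negative, zero and positive values in one numeric column.

    Hypothetical helper for illustration only (not part of qsv).
    """
    counts = {"n_negative": 0, "n_zero": 0, "n_positive": 0}
    for row in rows:
        cell = row[column].strip()
        if not cell:
            continue  # assumption: empty cells are not counted in any bucket
        value = float(cell)
        if value < 0:
            counts["n_negative"] += 1
        elif value == 0:
            counts["n_zero"] += 1
        else:
            counts["n_positive"] += 1
    return counts


# Made-up sample data: one negative, one zero, two positive, one empty cell.
SAMPLE = "id,balance\n1,-5.5\n2,0\n3,12\n4,\n5,3.25\n"
result = n_counts(csv.DictReader(io.StringIO(SAMPLE)), "balance")
print(result)
```

Keeping the three counts separate makes sign-related data quality problems (for example, negative values in a field that should never go below zero) visible without a second pass over the data.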
describegpt
The describegpt command has received substantial improvements for AI-powered metadata inferencing:
- "Neuro-Procedural" Data Dictionaries: combines deterministically computed statistics and frequency distribution data with AI-inferred Human-Friendly Labels and Descriptions to compile an expanded Data Dictionary (not quite "neuro-symbolic" (YET!))
- Chat with your Data!: Improved DuckDB and Polars SQL guidance means more reliable transformations of your Natural Language queries to SQL - leading to fast, deterministic, reproducible, hallucination-free answers! (example, SQL result)
- Format Option: Replaced --json flag with --format option for more flexible output formatting
  - Supports multiple output formats - Markdown (default), TSV and JSON
  - Removed --jsonl option for cleaner API
- Controlled Tag Vocabulary: New tag vocabulary system for consistent categorization
  - --tag-vocab option to specify controlled vocabulary
  - Lookup support for tag vocabularies - retrieve a tag vocabulary from a local or remote CSV using http://, https://, dathere:// and ckan:// URL schemes.
- Enhanced Boolean Inference: --infer-boolean is now enabled by default for better data type detection
- Performance Metrics: Added elapsed time tracking to monitor processing duration
- Improved Prompt Templates: Updated default description prompt with PII/PHI alerts and better attribution metadata
schema & validate
Enhanced JSON Schema inference and validation capabilities:
- Strict Formats: New --strict-formats option for stricter JSON Schema format validation, enforcing JSON Schema format constraints for email, hostname & IP address (IPv4/IPv6) formats.
- Output Option: New --output option for specifying schema output destination
  - Polars schema now uses consistent naming conventions across commands
  - Updated joinp, pivotp, and sqlp commands to use new .pschema.json naming convention
- Configurable Email Validation: validate has numerous options to tweak email validation - taking advantage of schema's email format constraint inferencing.
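As a rough picture of what enforcing the email, hostname and IP address format constraints involves, the Python sketch below checks a value against a named format using only the standard library. It is a simplified illustration, not the validator qsv uses: the function names are hypothetical, and the email rule in particular is far looser than a full RFC-compliant check.

```python
import ipaddress
import re

# One hostname label: letters, digits and hyphens, no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_hostname(value):
    """Accept dot-separated labels with a total length of at most 253 characters."""
    if not value or len(value) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in value.split("."))


def is_ip(value, version):
    """Accept a value only if it parses as an IP address of the given version."""
    try:
        return ipaddress.ip_address(value).version == version
    except ValueError:
        return False


def is_email(value):
    """Simplified rule (an assumption): exactly one '@', a non-empty local part,
    and a domain that passes the hostname check."""
    local, sep, domain = value.rpartition("@")
    return bool(sep) and bool(local) and "@" not in local and is_hostname(domain)


CHECKS = {
    "email": is_email,
    "hostname": is_hostname,
    "ipv4": lambda v: is_ip(v, 4),
    "ipv6": lambda v: is_ip(v, 6),
}


def check_format(fmt, value):
    """Return True if value satisfies the named format (hypothetical helper)."""
    return CHECKS[fmt](value)


print(check_format("email", "steward@example.org"))
print(check_format("ipv4", "::1"))
```

In JSON Schema, format keywords are often treated as annotations unless a validator is asked to assert them, which is why a stricter opt-in mode matters for catching malformed values.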
sample time-series sampling
A new --timeseries sampling method with interval grouping (hourly, daily, weekly),
adaptive sampling (prefer business hours or weekends), aggregation (mean, sum, min, max)
within each interval, and configurable starting points (first, last or random).
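The interval grouping, starting point and aggregation ideas can be sketched in a few lines of Python. This is a conceptual illustration only: the function names, the Monday week start, and the (timestamp, value) record shape are assumptions, it does not reflect how qsv implements the feature, and adaptive sampling is left out.

```python
import random
from datetime import datetime, timedelta
from statistics import mean

AGGREGATORS = {"mean": mean, "sum": sum, "min": min, "max": max}


def interval_start(ts, interval):
    """Truncate a timestamp to the start of its hourly, daily or weekly interval."""
    if interval == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if interval == "daily":
        return day
    if interval == "weekly":
        return day - timedelta(days=day.weekday())  # assumption: weeks start on Monday
    raise ValueError(f"unknown interval: {interval}")


def group_by_interval(records, interval):
    """Group (timestamp, value) pairs by interval, in chronological order."""
    groups = {}
    for ts, value in sorted(records, key=lambda r: r[0]):
        groups.setdefault(interval_start(ts, interval), []).append((ts, value))
    return groups


def sample_one_per_interval(records, interval, pick="first", seed=None):
    """Keep one record per interval: the first, the last, or a random one."""
    rng = random.Random(seed)
    choosers = {"first": lambda g: g[0], "last": lambda g: g[-1], "random": rng.choice}
    return [choosers[pick](g) for g in group_by_interval(records, interval).values()]


def aggregate_per_interval(records, interval, how="mean"):
    """Collapse each interval to a single aggregated value."""
    agg = AGGREGATORS[how]
    groups = group_by_interval(records, interval)
    return [(start, agg([v for _, v in g])) for start, g in groups.items()]


# Made-up readings spanning two days in the same week.
records = [
    (datetime(2025, 12, 1, 9, 0), 10.0),
    (datetime(2025, 12, 1, 17, 30), 14.0),
    (datetime(2025, 12, 2, 8, 15), 20.0),
    (datetime(2025, 12, 2, 23, 45), 30.0),
]
print(sample_one_per_interval(records, "daily", pick="last"))
print(aggregate_per_interval(records, "daily", how="mean"))
```

Sampling one record per interval preserves real observations, while aggregation trades them for a summary value; which is preferable depends on whether downstream analysis needs actual rows or a smoothed series.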
lens "real-time" Features
Enhanced CSV viewing capabilities with csvlens integration:
- Auto-Reload: New --auto-reload option to automatically reload file when it changes
  - Useful for monitoring live data files
- Streaming stdin: New --streaming-stdin option for real-time data viewing
  - Supports viewing data as it's being piped in
- Row Marking: Updated csvlens dependency with row marking feature
Breaking Changes
- describegpt: --json flag replaced with --format option
- describegpt: --jsonl option removed
- schema, joinp, pivotp, sqlp: Updated Polars schema naming conventions (existing workflows should work but output format may differ slightly)
Added
- Created Event Logo Archive with AI-generated seasonal/version logos
- describegpt: add controlled vocabulary support for tags #3122
- describegpt: add elapsed time #3168
- describegpt: add lookup support #3170
- excel: add --cell option #3133
- frequency: add dynamic parallel chunk sizing #3135
- lens: add --auto-reload option #3128
- lens: add --streaming-stdin option #3171
- sample: add timeseries sampling options #3130
- schema: infer addl JSON Schema predefined formats - email, ipv4, ipv6, hostname #3125
- schema: add --output option and standardize Polars Schema file name #3126
- stats: dynamic parallel chunk sizing with indexed files #3134
- stats: add n_negative, n_zero, n_positive count columns #3157
- validate: add email validation options #3148
- tests: add tests for https://100.dathere.com/lessons/4 by @rzmk in #3151
- Added Claude AI guidance for contributors
- Enhanced --version output with more comprehensive system metadata
Changed
- refactor: describegpt improve tags inferencing with Tag Vocabulary #3139
- feat: describegpt - major refactor #3143
- feat: describegpt improved Polars SQL processing #3147
- feat: describegpt replace --json option with --format option supporting 3 formats - markdown, json and TSV; remove --jsonl option #3167
- refactor: frequency & stats - parallel chunk sizing - allow forcing of cpu based chunking #3138
- Align partition stdin handling with split/stats pattern by @Copilot in #3162
- deps: use latest polars upstream with new SQL fixes and features (pola-rs/polars@e1be17f)
- build(deps): bump actions/setup-python from 6.0.0 to 6.1.0 by @dependabot[bot] in #3120
- build(deps): bump actix-web from 4.12.0 to 4.12.1 by @dependabot[bot] in #3127
- build(deps): bump flate2 from 1.1.5 to 1.1.7 by @dependabot[bot] in #3159
- build(deps): bump jsonschema from 0.37.1 to 0.37.2 by @dependabot[bot] in #3129
- build(deps): bump jsonschema from 0.37.2 to 0.37.3 by @dependabot[bot] in #3131
- build(deps): bump jsonschema from 0.37.3 to 0.37.4 by @dependabot[bot] in #3140
- build(deps): bump log from 0.4.28 to 0.4.29 by @dependabot[bot] in #3150
- build(deps): bump minijinja from 2.12.0 to 2.13.0 by @dependabot[bot] in #3142
- build(deps): bump minijinja-contrib from 2.12.0 to 2.13.0 by @dependabot[bot] in #3141
- build(deps): bump pyo3 from 0.27.1 to 0.27.2 by @dependabot[bot] in #3137
- build(deps): bump qsv-stats from 0.40.0 to 0.41.0 by @dependabot[bot] in #3136
- build(deps): bump qsv-stats from 0.41.0 to 0.42.0 by @dependabot[bot] in #3156
- build(deps): bump qsv-stats from 0.42.0 to 0.43.0 by @dependabot[bot] in #3169
- build(deps): bump rfd from 0.15.4 to 0.16.0 by @dependabot[bot] in #3121
- build(deps): bump uuid from 1.18.1 to 1.19.0 by @dependabot[bot] in #3146
- Improved qsvpy build process for Apple Silicon
- Updated GitHub Actions workflows for better reliability
- bumped several indirect dependencies
- applied select clippy & Codacy suggestions
- Improved dependency version management
- Better feature flag handling
Fixed
- fix: apply panic on empty selection #3165
- fix: more robust snappy and file extension detection #3166
- fix: partition add proper stdin handling regression introduced when --limit option was added #3161
- Fix broken layout of environment variable documentation by @tmtmtmtm in #3163
Removed
New Contributors
- @Copilot made their first contribution in #3162
Full Changelog: 10.0.0...11.0.2