[12.0.0] - 2025-12-24 🎄
Stuff your virtual stocking and jingle your data bells - qsv 12.0.0 slides down the chimney packed fuller than Santa’s sleigh! Unwrap delightful surprises like the shiny new moarstats command, gift-wrapped weighted statistics, and AI-powered FAIR metadata inferencing now speaking in multiple languages (no elf translation required). As the star on top, meet TOON - the brand new LLM-optimized, token-efficient format - ready to sleigh your AI projects all through 2026. Ho-ho-hold my data, this update’s a festive feast!
🌟 Major Features
NEW: moarstats Command
A powerful new command for "moar" advanced statistical analysis, providing statistics beyond what the stats command offers:
- Comprehensive Statistics: More than 50 advanced statistical measures, including:
  - Detailed outlier analysis (count, sum, average)
  - Winsorized and trimmed means (5%, 10%, 20%, 25%)
  - Multiple dispersion measures (IQR to range ratio, quartile coefficient of dispersion)
  - Distribution statistics (skewness, multiple kurtosis measures)
- Advanced Option (--advanced): Access computationally intensive statistics:
  - Gini coefficient for inequality measurement
  - Excess Kurtosis to measure "tailedness" of the distribution
  - Shannon Entropy for data diversity analysis
- Available on all binary variants for universal access
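To make two of the advanced measures above concrete, here is a minimal Python sketch of a Gini coefficient and a Shannon entropy calculation. It uses the textbook definitions; the function names are hypothetical, and the exact formulas, log base, and edge-case handling that moarstats uses are assumptions not confirmed by these notes.

```python
import math
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative numbers (0.0 means perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    ranked = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * ranked / (n * total) - (n + 1) / n


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    n = sum(counts.values())
    # Assumption: log base 2; another base only rescales the result.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A column where every value is identical has a Gini coefficient of 0 and an entropy of 0, while a column split evenly between two values has an entropy of exactly 1 bit.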
Enhanced describegpt Command
Major enhancements to AI-powered data description capabilities:
- ⛩️ Minijinja Template Engine Integration:
  - Custom prompt templating with full Minijinja and Minijinja-contrib filters
  - More powerful and flexible prompt customization
- Multilingual Support:
  - --language option for generating descriptions in any language/dialect
  - Automatic language detection in prompts
  - SQL comments also generated in requested language
  - Beyond language/dialect, this option can also be used to describe a dataset using a persona (e.g. Yoda, Spock, Valley Girl, Christopher Walken, Silly Santa after taking a Data Science Course, etc.)
- Advanced Features:
  - --addl-columns option with detailed attribution and system metadata
  - --export-prompt <file> to save the default prompts to the specified file. This file can then be tailored and used with the --prompt-file <file> option.
  - Iterative, session-based SQL RAG with --prompt option
  - Sampling in prompt mode for better SQL generation
  - Lookup table and CKAN support for controlled vocabularies
  - Convenience values for --addl-cols-list (i.e., "everything", "everything!", "moar", "moar!")
Weighted Statistics Support
Comprehensive weighted statistics implementation across multiple commands:
- stats Command (--weight <column>):
  - Weighted mean, standard deviation, variance
  - Weighted MAD (Median Absolute Deviation) and percentiles
  - Weighted modes and antimodes
  - Weighted harmonic and geometric means
  - All weighted calculations handle non-finite values gracefully
- frequency Command (--weight <column>):
  - Weighted frequency distributions
  - Proper handling of weighted "Other" and "ALL UNIQUE" categories
  - Non-finite weights automatically skipped
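As an illustration of the idea behind the weighted calculations listed above, the following Python sketch computes a weighted mean and a weighted frequency table while skipping non-finite weights. The function names are hypothetical and this is not qsv's implementation; in particular, how qsv treats zero or negative weights is not stated here, so the sketch simply passes them through.

```python
import math
from collections import defaultdict


def weighted_mean(values, weights):
    """Weighted arithmetic mean; pairs with a non-finite value or weight are skipped."""
    pairs = [
        (v, w)
        for v, w in zip(values, weights)
        if math.isfinite(v) and math.isfinite(w)
    ]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None  # undefined when no usable weight remains
    return sum(v * w for v, w in pairs) / total_weight


def weighted_frequency(categories, weights):
    """Sum of weights per category; rows with NaN or infinite weights are skipped."""
    totals = defaultdict(float)
    for category, w in zip(categories, weights):
        if not math.isfinite(w):
            continue
        totals[category] += w
    return dict(totals)
```

With weights of 1, 1 and 2 on the values 1, 2 and 3, the weighted mean is 2.25 rather than the unweighted 2.0, because the last value counts twice.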
Token-Oriented Object Notation (TOON) Format Support
- A compact, human-readable encoding of the JSON data model for LLM prompts
- Commands Supporting TOON:
  - describegpt --format TOON
  - frequency --toon
- Benefits: More readable than JSON, easier to parse than CSV for hierarchical data, and more token-efficient; a terse format targeted at LLMs
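To give a feel for why a tabular encoding saves tokens, here is a simplified Python sketch that renders a uniform list of flat records in a TOON-like layout: the field names are declared once in a header and each record becomes one comma-separated row. The function name is hypothetical, the sketch does no quoting or escaping, and it covers only this one case; it is an illustration of the idea, not a conformant TOON encoder and not what qsv emits.

```python
def encode_table(name, records):
    """Render a non-empty list of flat dicts with identical keys as a TOON-like table."""
    if not records:
        raise ValueError("this sketch only handles non-empty tables")
    fields = list(records[0].keys())
    for record in records:
        if list(record.keys()) != fields:
            raise ValueError("all records must share the same fields in the same order")
    # Header declares the row count and the field names once, e.g. users[2]{id,name}:
    header = name + "[" + str(len(records)) + "]{" + ",".join(fields) + "}:"
    rows = ["  " + ",".join(str(record[f]) for f in fields) for record in records]
    return "\n".join([header] + rows)
```

The equivalent JSON repeats every key in every object, which is where most of the token savings for uniform data come from.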
stats Command Enhancements
- Percentile Improvements:
  - --percentile-list special values: "deciles" and "quintiles"
  - Percentile labels now include prefix before value (e.g., "p50: 42.5")
  - Validation of percentile-list on startup
- New Columns: Added n_counts for more detailed count information
- Performance Optimizations:
  - Optimized Stats struct layout
  - Eliminated redundant sorting
  - Removed redundant filtering for weighted stats functions
  - Microoptimizations throughout
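The "deciles" and "quintiles" special values and the startup validation mentioned above can be pictured with a short Python sketch. It assumes the conventional meaning of the terms (deciles as 10, 20, ..., 90 and quintiles as 20, 40, 60, 80) and a comma-separated list otherwise; the function name and the exact validation rules are hypothetical, not taken from qsv.

```python
SPECIAL_VALUES = {
    "deciles": list(range(10, 100, 10)),    # 10, 20, ..., 90
    "quintiles": list(range(20, 100, 20)),  # 20, 40, 60, 80
}


def expand_percentile_list(spec):
    """Expand a percentile-list string into numbers, validating the range up front."""
    key = spec.strip().lower()
    if key in SPECIAL_VALUES:
        return list(SPECIAL_VALUES[key])
    percentiles = [float(part) for part in spec.split(",")]
    for p in percentiles:
        if not 0 <= p <= 100:
            raise ValueError(f"percentile out of range: {p}")
    return percentiles
```

Validating the whole list before any data is read means a typo fails immediately instead of after a long scan.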
transpose Command
- New --long Option: Transform data from wide to long format
  - Column selection support using select syntax
  - Streaming implementation per GitHub Copilot review suggestions
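For readers unfamiliar with the wide-to-long reshape mentioned above, this Python sketch shows the general transformation: each non-identifier cell of a wide row becomes its own (identifier, column name, value) row. The function name and the three-column output are assumptions for illustration; the exact columns that transpose --long writes are not described in these notes.

```python
def wide_to_long(header, rows, id_index=0):
    """Turn wide rows into (identifier, column name, value) triples."""
    long_rows = []
    for row in rows:
        for i, column in enumerate(header):
            if i == id_index:
                continue  # the identifier column is carried along, not unpivoted
            long_rows.append((row[id_index], column, row[i]))
    return long_rows
```

A table with one row per city and one column per year becomes one row per city and year pair, which is usually easier to filter, join, and aggregate.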
diff Command
- Upgraded csv-diff from 0.1.1 to the faster 0.1.2, improving performance in optimal cases by up to 25% 🚀
lens Command
- Aligned --no-streaming-stdin behavior with csvlens upstream
📊 Output Format Changes
schema Command
- Updated $schema from Draft 7 to JSON Schema Draft 2020-12
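Downstream tooling that consumes generated schemas may want to check which draft a file declares before validating against it. A minimal Python sketch, assuming the schema is available as JSON text and that the Draft 2020-12 identifier matches the URI given in the changelog entry for this change; the function name is hypothetical.

```python
import json

DRAFT_2020_12 = "https://json-schema.org/draft/2020-12/schema"


def declares_draft_2020_12(schema_text):
    """Return True if the schema's $schema keyword names Draft 2020-12."""
    schema = json.loads(schema_text)
    declared = schema.get("$schema", "")
    # Tolerate a trailing '#', which some tools append to the identifier.
    return declared.rstrip("#") == DRAFT_2020_12
```

Schemas produced by earlier qsv releases would fail this check, which is a cheap way to spot files that still need regenerating.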
⚡ Performance Improvements
suite-wide
- Replaced the already fast ryu float-to-string conversion crate with the even faster zmij crate (https://vitaut.net/posts/2025/faster-dtoa/)
stats Command
- Optimized Stats struct memory layout
- Eliminated redundant sorting operations
- Removed unnecessary clone operations
- Better handling of real-world data (assumes no infinity values)
frequency Command
- Microoptimizations for faster frequency computation
- Optimized top_n/bottom_n retrieval
🐛 Bug Fixes
frequency Command
- Fixed behavior when compiling weighted frequencies with ALL_UNIQUE
- Fixed issue where "Other (0),0,0,0" could appear in output
- Proper handling of non-finite weights (automatically skipped)
🏗️ Infrastructure & Quality
Testing
- Test suite expanded from 2,060 to 2,380 tests
- Comprehensive test coverage for all new features
- Weighted statistics thoroughly tested
- Advanced moarstats options validated
Code Quality
- Extensive GitHub Copilot review integration
- Multiple refactoring passes for code clarity
- Clippy suggestions incorporated throughout
- Better error handling and edge case management
FAIR Principles
- Added CITATION.cff (by rzmk) for academic citation
- Added Zenodo DOI badge for dataset citation
- Enhanced FAIRification of qsv as a research tool
📚 Documentation Improvements
Statistical Documentation
- Comprehensive documentation for statistics produced by stats command (by @kulnor) WIP
- Enhanced usage text for stats, frequency, and moarstats
- Better examples throughout documentation
Command Documentation
- Updated describegpt with multilingual examples
- Added controlled tag vocabulary examples
- Enhanced TOON format documentation
- Better SQL RAG workflow documentation
Migration Notes
Breaking Changes
- schema command: $schema output changed from Draft 7 to Draft 2020-12
  - Most schemas should be compatible
  - Validation tools must support JSON Schema Draft 2020-12
- stats command: Output now includes percentile label prefixes
  - Example: "p50: 10" for the 50th percentile value instead of just the value "10"
  - May affect parsing scripts that expect raw numbers
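Scripts that read percentile values can be made tolerant of both the old and the new output. The Python sketch below assumes each value is either a bare number or a single "label: number" pair as in the example above; the function name is hypothetical, and if several percentiles are packed into one cell you would first split on whatever separator the output uses, which these notes do not specify.

```python
def parse_percentile(cell):
    """Parse '10' or 'p50: 10' into (label, value); label is None for bare numbers."""
    label, sep, raw = cell.partition(":")
    if not sep:
        # Pre-12.0.0 style: just the number, no label prefix.
        return None, float(cell)
    return label.strip(), float(raw)
```

Returning the label alongside the value lets a caller confirm it is reading the percentile it expects rather than relying on position.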
Added
- feat: describegpt add --add-cols and --addl-cols-list <list> options #3179
- feat: describegpt add --language option #3184
- feat: describegpt use minijinja engine for prompt processing #3188
- feat: describegpt add language autodetection in --prompt (chat) mode #3193
- feat: describegpt sampling in prompt mode for better SQL generation… #3198
- feat: describegpt add --prompt sessions for iterative SQL RAG refinement #3200
- feat: describegpt add TOON format support #3205
- feat: frequency add TOON format #3206
- feat: frequency add weighted frequencies #3218
- feat: add new moarstats command #3207
- feat: moarstats add even moar! Now with detailed outliers info! #3208
- feat: moarstats - add configurable Winsorized and Trimmed means #3209
- build(deps): bump ryu from 1.0.20 to 1.0.21 by @dependabot[bot] in #3210
- chore: moarstats remove redundant Bowley's Skewness Coefficient #3212
- feat: moarstats add kurtosis & gini stats behind --advanced option #3217
- feat: moarstats moar, moar, moar stats! #3220
- feat: moarstats add shannon entropy to advanced statistics #3227
- feat: stats --percentile-list special values "deciles" and "quintiles" #3176
- docs: added qsv stats descriptions document by @kulnor in #3172
- feat: add CITATION.cff by @rzmk in #3182
- feat: stats add percentile label prefixes in front of percentile values #3183
- feat: stats add weighted statistics #3213
- feat: transpose add --long option #3194
- feat: transpose add --long column selection #3197
Changed
- feat: schema change $schema from https://json-schema.org/draft-07/schema to https://json-schema.org/draft/2020-12/schema #3203
- deps: bump blake3 to latest upstream
- deps: bump csvlens to 0.15.0
- deps: bump geozero to 0.15.0
- deps: indexmap - enable serde feature
- deps: bump redis to 1
- deps: cached use upstream fork with redis updated to 1
- deps: jsonschema use latest upstream
- deps: polars use latest upstream
- deps: replaced ryu with faster zmij binary to decimal floating point library
- build(deps): bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #3189
- build(deps): bump csv-diff from 0.1.1 to 0.1.2 by @dependabot[bot] in #3228
- build(deps): bump governor from 0.10.2 to 0.10.4 by @dependabot[bot] in #3196
- build(deps): bump itoa from 1.0.15 to 1.0.16 by @dependabot[bot] in #3214
- build(deps): bump minijinja from 2.13.0 to 2.14.0 by @dependabot[bot] in #3185
- build(deps): bump minijinja-contrib from 2.13.0 to 2.14.0 by @dependabot[bot] in #3186
- build(deps): bump qsv-stats from 0.43.0 to 0.44.0 by @dependabot[bot] in #3215
- build(deps): bump qsv-stats from 0.44.0 to 0.45.0 by @dependabot[bot] in #3216
- build(deps): bump reqwest from 0.12.24 to 0.12.25 by @dependabot[bot] in #3177
- build(deps): bump reqwest from 0.12.25 to 0.12.26 by @dependabot[bot] in #3191
- build(deps): bump reqwest from 0.12.26 to 0.12.27 by @dependabot[bot] in #3221
- build(deps): bump reqwest from 0.12.27 to 0.12.28 by @dependabot[bot] in #3226
- build(deps): bump serde_json from 1.0.145 to 1.0.146 by @dependabot[bot] in #3219
- build(deps): bump serde_json from 1.0.146 to 1.0.147 by @dependabot[bot] in #3229
- build(deps): bump tempfile from 3.23.0 to 3.24.0 by @dependabot[bot] in #3230
- build(deps): bump toml from 0.9.8 to 0.9.9+spec-1.0.0 by @dependabot[bot] in #3199
- bumped several indirect dependencies
- applied select clippy & Codacy suggestions
- bumped MSRV to 1.92
Fixed:
- fix: frequency fix ALL_UNIQUE weighted behavior #3224
- fix: frequency fix "Other (0),0,0,0" should never happen #3225
Removed:
- deps: blake3 removed unnecessary conditional compilation directive
Full Changelog: 11.0.2...12.0.0