github dathere/qsv 12.0.0


[12.0.0] - 2025-12-24 🎄

Stuff your virtual stocking and jingle your data bells - qsv 12.0.0 slides down the chimney packed fuller than Santa’s sleigh! Unwrap delightful surprises like the shiny new moarstats command, gift-wrapped weighted statistics, and AI-powered FAIR metadata inferencing now speaking in multiple languages (no elf translation required). As the star on top, meet TOON - the brand new LLM-optimized, token-efficient format - ready to sleigh your AI projects all through 2026. Ho-ho-hold my data, this update’s a festive feast!

🌟 Major Features

NEW: moarstats Command

A powerful new command for "moar" advanced statistical analysis, providing statistics beyond what the stats command offers:

  • Comprehensive Statistics: More than 50 advanced statistical measures, including:

    • Detailed outlier analysis (count, sum, average)
    • Winsorized and trimmed means (5%, 10%, 20%, 25%)
    • Multiple dispersion measures (IQR to range ratio, quartile coefficient of dispersion)
    • Distribution statistics (skewness, multiple kurtosis measures)
  • Advanced Option (--advanced): Access computationally intensive statistics:

    • Gini coefficient for inequality measurement
    • Excess Kurtosis to measure "tailedness" of the distribution
    • Shannon Entropy for data diversity analysis
  • Available on all binary variants for universal access
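To make a few of the measures above concrete, here is a minimal Python sketch of a trimmed mean, a winsorized mean, the Gini coefficient, and Shannon entropy. It illustrates the textbook definitions only. It is not how moarstats computes them: the function names are hypothetical, and details such as rounding the trim count down and measuring entropy in bits are assumptions.

```python
import math
from collections import Counter


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each tail.

    Assumes 0 <= proportion < 0.5 and a non-empty input.
    """
    data = sorted(values)
    k = int(len(data) * proportion)  # values cut from each end (rounded down)
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, proportion):
    """Mean after clamping each tail to the nearest retained value."""
    data = sorted(values)
    k = int(len(data) * proportion)
    if k:
        low, high = data[k], data[-k - 1]
        data = [low] * k + data[k:len(data) - k] + [high] * k
    return sum(data) / len(data)


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    data = sorted(values)
    n, total = len(data), sum(data)
    if n == 0 or total == 0:
        return 0.0
    ranked_sum = sum(rank * x for rank, x in enumerate(data, start=1))
    return 2 * ranked_sum / (n * total) - (n + 1) / n


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trimmed_mean(sample, 0.10))             # drops 1 and 100 -> 5.5
print(winsorized_mean(sample, 0.10))          # 1 becomes 2, 100 becomes 9 -> 5.5
print(gini([5, 5, 5, 5]))                     # all equal -> 0.0
print(shannon_entropy(["a", "a", "b", "b"]))  # two equally likely values -> 1.0
```

The outlier value 100 pulls the plain mean of `sample` up to 14.5, while both robust means stay at 5.5, which is the motivation for reporting them alongside the ordinary mean.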

Enhanced describegpt Command

Major enhancements to AI-powered data description capabilities:

  • ⛩️ Minijinja Template Engine Integration:

    • Custom prompt templating with full Minijinja and Minijinja-contrib filters
    • More powerful and flexible prompt customization
  • Multilingual Support:

    • --language option for generating descriptions in any language/dialect
    • Automatic language detection in prompts
    • SQL comments also generated in requested language
    • Beyond language/dialect, this option can also be used to describe a dataset
      in the voice of a persona (e.g. Yoda, Spock, Valley Girl, Christopher Walken,
      Silly Santa after taking a Data Science Course, etc.)
  • Advanced Features:

    • --addl-columns option with detailed attribution and system metadata
    • --export-prompt <file> to save the default prompts to the specified file.
      This file can then be tailored and used with the --prompt-file <file> option.
    • Iterative, session-based SQL RAG with --prompt option
    • Sampling in prompt mode for better SQL generation
    • Lookup table and CKAN support for controlled vocabularies
    • Convenience values for --addl-cols-list
      (i.e., "everything", "everything!", "moar", "moar!")

Weighted Statistics Support

Comprehensive weighted statistics implementation across multiple commands:

  • stats Command (--weight <column>):

    • Weighted mean, standard deviation, variance
    • Weighted MAD (Median Absolute Deviation) and percentiles
    • Weighted modes and antimodes
    • Weighted harmonic and geometric means
    • All weighted calculations handle non-finite values gracefully
  • frequency Command (--weight <column>):

    • Weighted frequency distributions
    • Proper handling of the weighted "Other" and "ALL UNIQUE" categories
    • Non-finite weights automatically skipped
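As a rough illustration of what weighting means here, the Python sketch below computes a weighted mean, a weighted variance, and a weighted frequency table, skipping non-finite weights as described above. It is a sketch under stated assumptions, not the qsv implementation: the function names are hypothetical, and normalizing the variance by the total weight is one common convention that qsv may or may not use.

```python
import math
from collections import defaultdict


def weighted_mean_variance(values, weights):
    """Weighted mean and variance; pairs with a non-finite value or weight are skipped.

    Assumption: variance is normalized by the sum of the weights.
    """
    pairs = [(x, w) for x, w in zip(values, weights)
             if math.isfinite(x) and math.isfinite(w)]
    total_w = sum(w for _, w in pairs)
    if total_w == 0:
        return None, None
    mean = sum(x * w for x, w in pairs) / total_w
    variance = sum(w * (x - mean) ** 2 for x, w in pairs) / total_w
    return mean, variance


def weighted_frequency(values, weights):
    """Sum the weights per distinct value, skipping non-finite weights."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        if math.isfinite(weight):
            totals[value] += weight
    # Heaviest first; ties are broken by value for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


mean, variance = weighted_mean_variance([10, 20, 30, 99], [1, 1, 2, math.inf])
print(mean, variance)  # the row with the infinite weight is ignored -> 22.5 68.75

print(weighted_frequency(["a", "b", "a", "c"], [2.0, 1.0, 0.5, math.nan]))
# "c" is dropped because its weight is NaN -> [('a', 2.5), ('b', 1.0)]
```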

Token-Oriented Object Notation (TOON) Format Support

  • A compact, human-readable encoding of the JSON data model for LLM prompts

  • Commands Supporting TOON:

    • describegpt --format TOON
    • frequency --toon
  • Benefits: More readable than JSON, easier to parse than CSV for hierarchical data,
    and a terser, more token-efficient format targeted at LLMs
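To give a feel for why a tabular encoding can be terser than JSON, here is a simplified Python sketch that writes a uniform list of records in a TOON-style tabular layout: a header carrying the row count and field names, followed by one comma-separated line per row. This is only an approximation of the idea. The function name is hypothetical, quoting, escaping and nesting are not handled, and the output of qsv or of a real TOON encoder may differ.

```python
import json


def to_toon_table(name, rows):
    """Encode a non-empty list of dicts with identical keys as a TOON-style table.

    Simplification: values are written with str() and never quoted or escaped.
    """
    if not rows:
        raise ValueError("this sketch only handles non-empty row lists")
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("all rows must share the same keys in the same order")
    field_list = ",".join(fields)
    header = f"{name}[{len(rows)}]{{{field_list}}}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
toon = to_toon_table("users", users)
print(toon)
# users[2]{id,name}:
#   1,Alice
#   2,Bob

# Character count is only a crude stand-in for token count.
print(len(toon), len(json.dumps({"users": users})))
```

The field names appear once in the header instead of once per record, which is where most of the savings over JSON come from in this kind of data.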

stats Command Enhancements

  • Percentile Improvements:

    • --percentile-list special values: "deciles" and "quintiles"
    • Percentile labels now include prefix before value (e.g., "p50: 42.5")
    • Validation of percentile-list on startup
  • New Columns: Added n_counts for more detailed count information

  • Performance Optimizations:

    • Optimized Stats struct layout
    • Eliminated redundant sorting
    • Removed redundant filtering for weighted stats functions
    • Microoptimizations throughout
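The percentile shortcuts and startup validation listed above can be pictured with a short Python sketch. It assumes that "deciles" expands to the nine cut points 10 through 90, that "quintiles" expands to 20, 40, 60 and 80, and that an explicit list is comma-separated. The function name is hypothetical and this is not qsv's parsing code.

```python
def expand_percentile_list(spec):
    """Expand a percentile list spec into a validated list of floats.

    Assumptions: the shortcut expansions below, comma-separated explicit lists,
    and a valid range of 0 to 100 inclusive.
    """
    shortcuts = {
        "deciles": [10, 20, 30, 40, 50, 60, 70, 80, 90],
        "quintiles": [20, 40, 60, 80],
    }
    key = spec.strip().lower()
    if key in shortcuts:
        return [float(p) for p in shortcuts[key]]

    percentiles = []
    for part in spec.split(","):
        try:
            p = float(part)
        except ValueError:
            raise ValueError(f"not a number: {part!r}") from None
        if not 0 <= p <= 100:
            raise ValueError(f"percentile out of range: {part!r}")
        percentiles.append(p)
    return percentiles


print(expand_percentile_list("quintiles"))  # [20.0, 40.0, 60.0, 80.0]
print(expand_percentile_list("5, 50, 95"))  # [5.0, 50.0, 95.0]
```

Validating the whole list before any data is read means a typo fails immediately rather than after a long scan of the input.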

transpose Command

  • New --long Option: Transform data from wide to long format
    • Column selection support using select syntax
    • Streaming implementation per GitHub Copilot review suggestions
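For readers unfamiliar with the wide-to-long reshape, the Python sketch below shows the general idea in a streaming style: each input row is expanded into one output row per non-identifier column, so only one input row is held in memory at a time. The output column names ("attribute", "value") and the function name are hypothetical, and the actual layout produced by transpose --long may differ.

```python
import csv
import io


def wide_to_long(reader, id_column):
    """Yield long-format rows (id, attribute, value) from a wide CSV reader."""
    header = next(reader)
    id_idx = header.index(id_column)
    yield [id_column, "attribute", "value"]  # hypothetical output header
    for row in reader:
        for i, cell in enumerate(row):
            if i != id_idx:
                yield [row[id_idx], header[i], cell]


wide = "id,height,weight\n1,170,65\n2,180,80\n"
long_rows = list(wide_to_long(csv.reader(io.StringIO(wide)), "id"))
for row in long_rows:
    print(",".join(row))
# id,attribute,value
# 1,height,170
# 1,weight,65
# 2,height,180
# 2,weight,80
```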

diff Command

  • Upgraded csv-diff from 0.1.1 to the faster 0.1.2, improving performance
    in optimal cases by up to 25% 🚀

lens Command

  • Aligned --no-streaming-stdin behavior with csvlens upstream

📊 Output Format Changes

schema Command

  • Updated $schema from Draft 7 to JSON Schema Draft 2020-12
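If downstream tooling needs to know which draft a generated schema declares, a small check along these lines may help. It is a minimal Python sketch using only the standard library. The function name is hypothetical, and it only inspects the "$schema" keyword rather than validating anything.

```python
import json

DRAFT_2020_12 = "https://json-schema.org/draft/2020-12/schema"
DRAFT_07 = "https://json-schema.org/draft-07/schema"


def declared_draft(schema_text):
    """Return '2020-12', 'draft-07', or 'unknown' based on the $schema keyword."""
    schema = json.loads(schema_text)
    # Tolerate a trailing '#' and an http:// scheme, both common in older schemas.
    uri = schema.get("$schema", "").rstrip("#").replace("http://", "https://")
    if uri == DRAFT_2020_12:
        return "2020-12"
    if uri == DRAFT_07:
        return "draft-07"
    return "unknown"


new_schema = '{"$schema": "https://json-schema.org/draft/2020-12/schema", "type": "object"}'
old_schema = '{"$schema": "http://json-schema.org/draft-07/schema#", "type": "object"}'
print(declared_draft(new_schema))  # 2020-12
print(declared_draft(old_schema))  # draft-07
```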

⚡ Performance Improvements

suite-wide

stats Command

  • Optimized Stats struct memory layout
  • Eliminated redundant sorting operations
  • Removed unnecessary clone operations
  • Better handling of real-world data (assumes no infinity values)

frequency Command

  • Microoptimizations for faster frequency computation
  • Optimized top_n/bottom_n retrieval

🐛 Bug Fixes

frequency Command

  • Fixed behavior when compiling weighted frequencies with ALL_UNIQUE
  • Fixed issue where "Other (0),0,0,0" could appear in output
  • Proper handling of non-finite weights (automatically skipped)

🏗️ Infrastructure & Quality

Testing

  • Test suite expanded from 2,060 to 2,380 tests
  • Comprehensive test coverage for all new features
  • Weighted statistics thoroughly tested
  • Advanced moarstats options validated

Code Quality

  • Extensive GitHub Copilot review integration
  • Multiple refactoring passes for code clarity
  • Clippy suggestions incorporated throughout
  • Better error handling and edge case management

FAIR Principles

  • Added CITATION.cff (by @rzmk) for academic citation
  • Added Zenodo DOI badge for dataset citation
  • Enhanced FAIRification of qsv as a research tool

📚 Documentation Improvements

Statistical Documentation

  • Comprehensive documentation for statistics produced by the stats command (by @kulnor; work in progress)
  • Enhanced usage text for stats, frequency, and moarstats
  • Better examples throughout documentation

Command Documentation

  • Updated describegpt with multilingual examples
  • Added controlled tag vocabulary examples
  • Enhanced TOON format documentation
  • Better SQL RAG workflow documentation

Migration Notes

Breaking Changes

  1. schema command: $schema output changed from Draft 7 to Draft 2020-12

    • Most schemas should be compatible
    • Validation tools must support JSON Schema Draft 2020-12
  2. stats command: Output now includes percentile label prefixes

    • Example: the 50th percentile is now output as "p50: 10" instead of just the value "10"
    • May affect parsing scripts that expect raw numbers
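Scripts that previously read percentile values as raw numbers could be adapted with a tolerant parser along the following lines. This Python sketch accepts both the labelled form ("p50: 10") and the older bare form ("10"). It assumes a single numeric value per string, and the function name is hypothetical. How multiple percentiles are delimited in the actual stats output is not covered here.

```python
import re

# Matches a label such as "p50:" or "p99.9:" followed by the value.
_LABELLED = re.compile(r"^\s*p(\d+(?:\.\d+)?)\s*:\s*(.+?)\s*$")


def parse_percentile(cell):
    """Return (percentile, value); percentile is None for an unlabelled value."""
    match = _LABELLED.match(cell)
    if match:
        return float(match.group(1)), float(match.group(2))
    return None, float(cell)


print(parse_percentile("p50: 10"))    # (50.0, 10.0)
print(parse_percentile("p50: 42.5"))  # (50.0, 42.5)
print(parse_percentile("10"))         # (None, 10.0)
```

Accepting both forms lets the same script run against output from before and after the upgrade.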

Added

  • feat: describegpt add --add-cols and --addl-cols-list <list> options #3179
  • feat: describegpt add --language option #3184
  • feat: describegpt use minijinja engine for prompt processing #3188
  • feat: describegpt add language autodetection in --prompt (chat) mode #3193
  • feat: describegpt sampling in prompt mode for better SQL generation… #3198
  • feat: describegpt add --prompt sessions for iterative SQL RAG refinement #3200
  • feat: describegpt add TOON format support #3205
  • feat: frequency add TOON format #3206
  • feat: frequency add weighted frequencies #3218
  • feat: add new moarstats command #3207
  • feat: moarstats add even moar! Now with detailed outliers info! #3208
  • feat: moarstats - add configurable Winsorized and Trimmed means #3209
  • build(deps): bump ryu from 1.0.20 to 1.0.21 by @dependabot[bot] in #3210
  • chore: moarstats remove redundant Bowley's Skewness Coefficient #3212
  • feat: moarstats add kurtosis & gini stats behind --advanced option #3217
  • feat: moarstats moar, moar, moar stats! #3220
  • feat: moarstats add shannon entropy to advanced statistics #3227
  • feat: stats --percentile-list special values "deciles" and "quintiles" #3176
  • docs: added qsv stats descriptions document by @kulnor in #3172
  • feat: add CITATION.cff by @rzmk in #3182
  • feat: stats add percentile label prefixes in front of percentile values #3183
  • feat: stats add weighted statistics #3213
  • feat: transpose add --long option #3194
  • feat: transpose add --long column selection #3197

Changed

  • feat: schema change $schema from https://json-schema.org/draft-07/schema to https://json-schema.org/draft/2020-12/schema #3203
  • deps: bump blake3 to latest upstream
  • deps: bump csvlens to 0.15.0
  • deps: bump geozero to 0.15.0
  • deps: indexmap - enable serde feature
  • deps: bump redis to 1
  • deps: cached use upstream fork with redis updated to 1
  • deps: jsonschema use latest upstream
  • deps: polars use latest upstream
  • deps: replaced ryu with faster zmij binary to decimal floating point library
  • build(deps): bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #3189
  • build(deps): bump csv-diff from 0.1.1 to 0.1.2 by @dependabot[bot] in #3228
  • build(deps): bump governor from 0.10.2 to 0.10.4 by @dependabot[bot] in #3196
  • build(deps): bump itoa from 1.0.15 to 1.0.16 by @dependabot[bot] in #3214
  • build(deps): bump minijinja from 2.13.0 to 2.14.0 by @dependabot[bot] in #3185
  • build(deps): bump minijinja-contrib from 2.13.0 to 2.14.0 by @dependabot[bot] in #3186
  • build(deps): bump qsv-stats from 0.43.0 to 0.44.0 by @dependabot[bot] in #3215
  • build(deps): bump qsv-stats from 0.44.0 to 0.45.0 by @dependabot[bot] in #3216
  • build(deps): bump reqwest from 0.12.24 to 0.12.25 by @dependabot[bot] in #3177
  • build(deps): bump reqwest from 0.12.25 to 0.12.26 by @dependabot[bot] in #3191
  • build(deps): bump reqwest from 0.12.26 to 0.12.27 by @dependabot[bot] in #3221
  • build(deps): bump reqwest from 0.12.27 to 0.12.28 by @dependabot[bot] in #3226
  • build(deps): bump serde_json from 1.0.145 to 1.0.146 by @dependabot[bot] in #3219
  • build(deps): bump serde_json from 1.0.146 to 1.0.147 by @dependabot[bot] in #3229
  • build(deps): bump tempfile from 3.23.0 to 3.24.0 by @dependabot[bot] in #3230
  • build(deps): bump toml from 0.9.8 to 0.9.9+spec-1.0.0 by @dependabot[bot] in #3199
  • bumped several indirect dependencies
  • applied select clippy & Codacy suggestions
  • bumped MSRV to 1.92

Fixed

  • fix: frequency fix ALL_UNIQUE weighted behavior #3224
  • fix: frequency fix "Other (0),0,0,0" should never happen #3225

Removed

  • deps: blake3 removed unnecessary conditional compilation directive

Full Changelog: 11.0.2...12.0.0
