github dathere/qsv 11.0.2

[11.0.2] - 2025-12-08

qsv 11.0.2 brings significant enhancements to larger-than-memory data processing, AI-powered metadata inferencing, JSON Schema inferencing & validation, and data viewing capabilities, along with important bug fixes and performance improvements.

All in preparation for at-scale, secure, interactive, "zero-copy" "Data Steward-in-the-Loop" FAIRification on the desktop in qsv pro.

🌟 Major Features

stats & frequency

  • Larger-than-Memory Files: stats & frequency can now handle arbitrarily large files, even when "advanced" statistics are enabled, thanks to a new dynamic parallel chunk sizing algorithm! (example stats, frequency)
  • N Counts: Added "n_counts" (n_negative, n_zero and n_positive) columns to stats output for more detailed count information for numeric fields.
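To make the three new count columns concrete, here is a minimal Python sketch of what n_negative, n_zero and n_positive mean for a single numeric column. It is an illustration only, not qsv's implementation; the function name sign_counts and the choice to skip empty cells are assumptions.

```python
from typing import Iterable, Optional


def sign_counts(values: Iterable[Optional[float]]) -> dict:
    """Count negative, zero and positive values in one numeric column."""
    counts = {"n_negative": 0, "n_zero": 0, "n_positive": 0}
    for v in values:
        if v is None:
            # Assumption: empty cells are not counted in any bucket.
            continue
        if v < 0:
            counts["n_negative"] += 1
        elif v == 0:
            counts["n_zero"] += 1
        else:
            counts["n_positive"] += 1
    return counts


example = sign_counts([-2.5, 0, 3, 7, None, -1])
print(example)
```

For the sample column above, two values are negative, one is zero and two are positive.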

describegpt

The describegpt command has received substantial improvements for AI-powered metadata inferencing:

  • "Neuro-Procedural" Data Dictionaries: combines deterministically computed statistics and frequency distribution data with AI-inferred Human-Friendly Labels and Descriptions to compile an expanded Data Dictionary (not quite "neuro-symbolic" (YET!))

  • Chat with your Data!: Improved DuckDB and Polars SQL guidance means more reliable transformations of your Natural Language queries to SQL - leading to fast, deterministic, reproducible, hallucination-free answers! (example, SQL result)

  • Format Option: Replaced --json flag with --format option for more flexible output formatting

    • Supports multiple output formats - Markdown (default), TSV and JSON
    • Removed --jsonl option for cleaner API
  • Controlled Tag Vocabulary: New tag vocabulary system for consistent categorization

    • --tag-vocab option to specify controlled vocabulary
    • Lookup support for tag vocabularies - retrieve a tag vocabulary from a local or remote CSV
      using http://, https://, dathere:// and ckan:// URL schemes.
  • Enhanced Boolean Inference: --infer-boolean is now enabled by default for better data type detection

  • Performance Metrics: Added elapsed time tracking to monitor processing duration

  • Improved Prompt Templates: Updated default description prompt with PII/PHI alerts and better attribution metadata
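The idea behind a controlled tag vocabulary can be sketched in a few lines of Python: candidate tags are kept only if they appear in a vocabulary loaded from a CSV. This is a conceptual sketch, not how describegpt is implemented; the "tag" column name, the helper names load_vocab and constrain_tags, and the case-insensitive matching are all assumptions.

```python
import csv
import io


def load_vocab(csv_text: str, column: str = "tag") -> set:
    """Read the allowed tags from CSV text (hypothetical 'tag' column)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip().lower() for row in reader if row.get(column)}


def constrain_tags(candidates, vocab):
    """Keep only candidates found in the vocabulary, without duplicates."""
    kept = []
    for tag in candidates:
        norm = tag.strip().lower()
        if norm in vocab and norm not in kept:
            kept.append(norm)
    return kept


# Illustrative vocabulary; a real one would come from a local or remote CSV.
VOCAB_CSV = (
    "tag,description\n"
    "transportation,Roads and transit\n"
    "health,Public health\n"
    "finance,Budgets and spending\n"
)

vocab = load_vocab(VOCAB_CSV)
tags = constrain_tags(["Health", "weather", "finance", "health"], vocab)
print(tags)
```

Here "weather" is dropped because it is not in the vocabulary, and the repeated "health" is kept once.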

schema & validate

Enhanced JSON Schema inference and validation capabilities:

  • Strict Formats: New --strict-formats option that enforces JSON Schema format
    constraints for the email, hostname & IP address (IPv4/IPv6) formats.

  • Output Option: New --output option for specifying schema output destination

    • Polars schema now uses consistent naming conventions across commands
    • Updated joinp, pivotp, and sqlp commands to use new .pschema.json naming convention
  • Configurable Email Validation: validate adds several options for tuning email validation,
    building on schema's inference of the email format constraint.
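In recent JSON Schema drafts, the format keyword is often treated as an annotation unless the validator is asked to assert it, which is the kind of gap an option like --strict-formats addresses. The Python sketch below shows what asserting the email, hostname, ipv4 and ipv6 formats could look like using only the standard library. It is a simplified illustration, not qsv's validator; the field names, the regular expressions and the format_errors helper are assumptions, and the email check is deliberately loose.

```python
import ipaddress
import re

# Hypothetical schema fragment with format constraints on three fields.
SCHEMA_PROPERTIES = {
    "contact": {"type": "string", "format": "email"},
    "server": {"type": "string", "format": "hostname"},
    "addr": {"type": "string", "format": "ipv4"},
}

# Simplified patterns for illustration; real validators are stricter.
_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def _is_ip(value: str, version: int) -> bool:
    try:
        return ipaddress.ip_address(value).version == version
    except ValueError:
        return False


def _is_hostname(value: str) -> bool:
    labels = value.split(".")
    return 0 < len(value) <= 253 and all(_LABEL.match(p) for p in labels)


CHECKS = {
    "email": lambda v: bool(_EMAIL.match(v)),
    "hostname": _is_hostname,
    "ipv4": lambda v: _is_ip(v, 4),
    "ipv6": lambda v: _is_ip(v, 6),
}


def format_errors(record: dict, properties: dict) -> list:
    """Return the names of fields whose value violates its declared format."""
    errors = []
    for field, spec in properties.items():
        check = CHECKS.get(spec.get("format"))
        value = record.get(field)
        if check and isinstance(value, str) and not check(value):
            errors.append(field)
    return errors


good = {"contact": "a@example.com", "server": "db-1.example.com", "addr": "10.0.0.1"}
bad = {"contact": "not-an-email", "server": "-bad.example.com", "addr": "999.1.1.1"}
print(format_errors(good, SCHEMA_PROPERTIES))
print(format_errors(bad, SCHEMA_PROPERTIES))
```

The first record passes every check, while all three fields of the second record are reported.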

sample time-series sampling

sample gains a new --timeseries sampling method. It groups records into intervals
(hourly, daily, weekly), supports adaptive sampling (prefer business hours or weekends),
can aggregate values within each interval (mean, sum, min, max),
and offers configurable starting points (first, last or random).
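The grouping part of time-series sampling can be sketched in Python as follows: rows are bucketed by interval and one row is picked per bucket. This is a rough sketch of the concept, not the algorithm qsv uses; the function name sample_per_interval and its parameters are hypothetical, and adaptive sampling and aggregation are left out.

```python
import random
from collections import defaultdict
from datetime import datetime


def sample_per_interval(rows, interval="hourly", pick="first", seed=None):
    """Pick one (timestamp, value) row per interval.

    rows: iterable of (iso_timestamp_text, value) pairs.
    """
    def bucket(ts):
        if interval == "hourly":
            return ts.replace(minute=0, second=0, microsecond=0)
        if interval == "daily":
            return ts.replace(hour=0, minute=0, second=0, microsecond=0)
        if interval == "weekly":
            iso = ts.isocalendar()
            return (iso[0], iso[1])  # (ISO year, ISO week number)
        raise ValueError(f"unknown interval: {interval}")

    groups = defaultdict(list)
    for ts_text, value in rows:
        ts = datetime.fromisoformat(ts_text)
        groups[bucket(ts)].append((ts, value))

    rng = random.Random(seed)
    result = []
    for key in sorted(groups):
        members = sorted(groups[key], key=lambda m: m[0])
        if pick == "first":
            chosen = members[0]
        elif pick == "last":
            chosen = members[-1]
        elif pick == "random":
            chosen = rng.choice(members)
        else:
            raise ValueError(f"unknown pick: {pick}")
        result.append(chosen)
    return result


ROWS = [
    ("2025-12-08T09:05:00", 10),
    ("2025-12-08T09:40:00", 12),
    ("2025-12-08T10:15:00", 7),
    ("2025-12-08T10:59:00", 9),
    ("2025-12-09T09:00:00", 4),
]

hourly_first = [v for _, v in sample_per_interval(ROWS, "hourly", "first")]
daily_last = [v for _, v in sample_per_interval(ROWS, "daily", "last")]
print(hourly_first, daily_last)
```

With the five sample rows, hourly sampling yields one row for each of the three distinct hours, and daily sampling yields one row for each of the two days.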

lens "real-time" Features

Enhanced CSV viewing capabilities with csvlens integration:

  • Auto-Reload: New --auto-reload option to automatically reload file when it changes

    • Useful for monitoring live data files
  • Streaming stdin: New --streaming-stdin option for real-time data viewing

    • Supports viewing data as it's being piped in
  • Row Marking: Updated csvlens dependency with row marking feature

Breaking Changes

  • describegpt: --json flag replaced with --format option
  • describegpt: --jsonl option removed
  • schema, joinp, pivotp, sqlp: Updated Polars schema naming conventions
    (existing workflows should work but output format may differ slightly)

Added

  • Created Event Logo Archive with AI-generated seasonal/version logos
  • describegpt: add controlled vocabulary support for tags #3122
  • describegpt: add elapsed time #3168
  • describegpt: add lookup support #3170
  • excel: add --cell option #3133
  • frequency: add dynamic parallel chunk sizing #3135
  • lens: add --auto-reload option #3128
  • lens: add --streaming-stdin option #3171
  • sample: add timeseries sampling options #3130
  • schema: infer additional JSON Schema predefined formats - email, ipv4, ipv6, hostname #3125
  • schema: add --output option and standardize Polars Schema file name #3126
  • stats: dynamic parallel chunk sizing with indexed files #3134
  • stats: add n_negative, n_zero, n_positive count columns #3157
  • validate: add email validation options #3148
  • tests: add tests for https://100.dathere.com/lessons/4 by @rzmk in #3151
  • Added Claude AI guidance for contributors
  • Enhanced --version output with more comprehensive system metadata

Changed

  • refactor: describegpt improve tags inferencing with Tag Vocabulary #3139
  • feat: describegpt - major refactor #3143
  • feat: describegpt improved Polars SQL processing #3147
  • feat: describegpt replace --json option with --format option supporting 3 formats - markdown, json and TSV; remove --jsonl option #3167
  • refactor: frequency & stats - parallel chunk sizing - allow forcing of cpu based chunking #3138
  • Align partition stdin handling with split/stats pattern by @Copilot in #3162
  • deps: use latest polars upstream with new SQL fixes and features (pola-rs/polars@e1be17f)
  • build(deps): bump actions/setup-python from 6.0.0 to 6.1.0 by @dependabot[bot] in #3120
  • build(deps): bump actix-web from 4.12.0 to 4.12.1 by @dependabot[bot] in #3127
  • build(deps): bump flate2 from 1.1.5 to 1.1.7 by @dependabot[bot] in #3159
  • build(deps): bump jsonschema from 0.37.1 to 0.37.2 by @dependabot[bot] in #3129
  • build(deps): bump jsonschema from 0.37.2 to 0.37.3 by @dependabot[bot] in #3131
  • build(deps): bump jsonschema from 0.37.3 to 0.37.4 by @dependabot[bot] in #3140
  • build(deps): bump log from 0.4.28 to 0.4.29 by @dependabot[bot] in #3150
  • build(deps): bump minijinja from 2.12.0 to 2.13.0 by @dependabot[bot] in #3142
  • build(deps): bump minijinja-contrib from 2.12.0 to 2.13.0 by @dependabot[bot] in #3141
  • build(deps): bump pyo3 from 0.27.1 to 0.27.2 by @dependabot[bot] in #3137
  • build(deps): bump qsv-stats from 0.40.0 to 0.41.0 by @dependabot[bot] in #3136
  • build(deps): bump qsv-stats from 0.41.0 to 0.42.0 by @dependabot[bot] in #3156
  • build(deps): bump qsv-stats from 0.42.0 to 0.43.0 by @dependabot[bot] in #3169
  • build(deps): bump rfd from 0.15.4 to 0.16.0 by @dependabot[bot] in #3121
  • build(deps): bump uuid from 1.18.1 to 1.19.0 by @dependabot[bot] in #3146
  • Improved qsvpy build process for Apple Silicon
  • Updated GitHub Actions workflows for better reliability
  • bumped several indirect dependencies
  • applied select clippy & Codacy suggestions
  • Improved dependency version management
  • Better feature flag handling

Fixed

  • fix: apply panic on empty selection #3165
  • fix: more robust snappy and file extension detection #3166
  • fix: partition add proper stdin handling regression introduced when --limit option was added #3161
  • Fix broken layout of environment variable documentation by @tmtmtmtm in #3163

Removed

  • describegpt: remove --jsonl option #3167
  • chore: remove jemalloc support #3153

New Contributors

  • @Copilot made their first contribution in #3162

Full Changelog: 10.0.0...11.0.2
