github dathere/qsv 11.0.0

latest releases: 11.0.2, 11.0.1
pre-release, 14 hours ago

[11.0.0] - 2025-12-07

qsv 11.0.0 brings significant enhancements to larger-than-memory data processing, AI-powered metadata inferencing, schema validation, and data viewing capabilities, along with important bug fixes and performance improvements.

All in preparation for at-scale, interactive, "Data Steward-in-the-Loop" FAIRification in qsv pro.

🌟 Major Features

stats & frequency Command Enhancements

  • Larger than Memory Files: stats & frequency can now handle arbitrarily large files, even when "advanced" statistics are enabled, thanks to a new dynamic parallel chunk sizing algorithm!
  • N Counts: Added n_negative, n_zero and n_positive count columns to stats output, giving a breakdown by sign for numeric fields
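
To make the new count columns concrete, here is a minimal Python sketch of what n_negative, n_zero and n_positive measure. It is an illustration only, not qsv's Rust implementation; the function names `sign_counts` and `merge_counts` and the fixed chunk size of 3 are assumptions made for the example. It also shows why tallies like these suit chunked processing: per-chunk counts combine by simple addition.

```python
from collections import Counter

def sign_counts(values):
    """Tally negative, zero and positive numbers in one chunk of a numeric column."""
    counts = Counter(n_negative=0, n_zero=0, n_positive=0)
    for v in values:
        if v < 0:
            counts["n_negative"] += 1
        elif v == 0:
            counts["n_zero"] += 1
        else:
            counts["n_positive"] += 1
    return counts

def merge_counts(chunk_counts):
    """Combine per-chunk tallies; counts are additive, so chunk order does not matter."""
    total = Counter(n_negative=0, n_zero=0, n_positive=0)
    for counts in chunk_counts:
        total.update(counts)  # Counter.update adds counts rather than replacing them
    return total

# Hypothetical numeric column, split into chunks of 3 values (chunk size is arbitrary here)
column = [-3, 0, 7, 2, -1, 0, 5]
chunks = [column[i:i + 3] for i in range(0, len(column), 3)]
totals = merge_counts(sign_counts(chunk) for chunk in chunks)
print(dict(totals))
```

Processing the column in one pass or in several chunks yields the same totals, which is the property any chunked or parallel strategy relies on.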

describegpt Command Enhancements

The describegpt command has received substantial improvements for AI-powered metadata inferencing:

  • Format Option: Replaced --json flag with --format option for more flexible output formatting

    • Supports multiple output formats - Markdown (default), TSV and JSON
    • Removed --jsonl option for cleaner API
  • Controlled Tag Vocabulary: New tag vocabulary system for consistent categorization

    • --tag-vocab option to specify controlled vocabulary
    • Lookup support for tag vocabularies - retrieve a tag vocabulary from a local or remote CSV using http://, https://, dathere:// and ckan:// URL schemes.
  • Enhanced Boolean Inference: --infer-boolean is now enabled by default for better data type detection

  • Performance Metrics: Added elapsed time tracking to monitor processing duration

  • Improved Prompts: Updated default description prompt with PII/PHI alerts and better attribution metadata
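
The idea behind a controlled tag vocabulary can be sketched in a few lines of Python. This is a conceptual illustration, not how describegpt works internally: the CSV layout (a `tag` column plus a `description` column) and the helper names `load_vocabulary` and `constrain_tags` are assumptions, and the actual vocabulary file format may differ.

```python
import csv
import io

def load_vocabulary(csv_text, column="tag"):
    """Read a vocabulary CSV and return the set of allowed tags, normalized to lowercase."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip().lower() for row in reader if row[column].strip()}

def constrain_tags(candidates, vocabulary):
    """Keep only candidate tags found in the vocabulary, preserving order and dropping duplicates."""
    seen = set()
    kept = []
    for tag in candidates:
        key = tag.strip().lower()
        if key in vocabulary and key not in seen:
            seen.add(key)
            kept.append(key)
    return kept

# Hypothetical vocabulary file contents (the column names are an assumption)
vocab_csv = (
    "tag,description\n"
    "transportation,Roads and transit\n"
    "health,Public health\n"
    "finance,Budgets and spending\n"
)
vocabulary = load_vocabulary(vocab_csv)

# Free-form candidate tags, e.g. as an LLM might propose them
tags = constrain_tags(["Health", "weather", "finance", "health"], vocabulary)
print(tags)
```

Restricting tags to a fixed list is what keeps categorization consistent across datasets: the off-vocabulary candidate "weather" is dropped and case variants collapse to a single entry.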

schema & validate Command Improvements

Enhanced schema inference and validation capabilities:

  • Strict Formats: New --strict-formats option that enforces JSON Schema format constraints for the email, hostname and IP address (IPv4 and IPv6) formats.

  • Output Option: New --output option for specifying schema output destination

    • Polars schema now uses consistent naming conventions across commands
    • Updated joinp, pivotp, and sqlp commands to use new .pschema.json naming convention
  • Configurable Email Validation: validate adds several options to tune email validation, building on schema's inference of the email format constraint.
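
As a rough illustration of what enforcing these format constraints means, the Python sketch below checks values against the four formats named above. It is not qsv's validator and does not reproduce the full JSON Schema format rules: the IP checks rely on Python's standard `ipaddress` module, the hostname check is a simplified length-and-character test, and the email check is a deliberately naive assumption (exactly one `@`, a non-empty local part and a hostname-like domain).

```python
import ipaddress
import re

# One DNS label: 1-63 letters, digits or hyphens, not starting or ending with a hyphen
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

def is_ipv4(value):
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False

def is_ipv6(value):
    try:
        ipaddress.IPv6Address(value)
        return True
    except ValueError:
        return False

def is_hostname(value):
    """Simplified hostname check: total length <= 253 and every dot-separated label is valid."""
    if not value or len(value) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in value.split("."))

def is_email(value):
    """Naive email check (an assumption for this sketch, far looser than real validators)."""
    if value.count("@") != 1:
        return False
    local, domain = value.split("@")
    return bool(local) and is_hostname(domain)

FORMAT_CHECKS = {
    "email": is_email,
    "hostname": is_hostname,
    "ipv4": is_ipv4,
    "ipv6": is_ipv6,
}

def check(value, fmt):
    """Return True if the string value satisfies the named format."""
    return FORMAT_CHECKS[fmt](value)

for value, fmt in [("192.168.0.1", "ipv4"), ("256.0.0.1", "ipv4"), ("::1", "ipv6"),
                   ("example.com", "hostname"), ("user@example.com", "email")]:
    print(f"{value!r} as {fmt}: {check(value, fmt)}")
```

With strict format checking, a value such as 256.0.0.1 in a column declared as ipv4 is reported as invalid instead of being accepted as an arbitrary string.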

sample Command Time-Series Sampling

A new --timeseries sampling method groups records by interval (hourly, daily, weekly) and supports adaptive sampling (preferring business hours or weekends), aggregation within each interval (mean, sum, min, max) and configurable starting points (first, last or random).
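
The grouping, starting-point and aggregation ideas can be sketched in Python as follows. This is a conceptual model under stated assumptions, not qsv's implementation: the function names, the `(timestamp, value)` row shape and the parameter names `interval`, `pick` and `how` are all hypothetical, and adaptive sampling is left out.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
import random

def interval_key(ts, interval):
    """Map a timestamp to the bucket it falls in."""
    if interval == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    if interval == "daily":
        return ts.date()
    if interval == "weekly":
        iso = ts.isocalendar()
        return (iso[0], iso[1])  # (ISO year, ISO week number)
    raise ValueError(f"unknown interval: {interval}")

def bucket_rows(rows, interval):
    """Group (timestamp, value) rows by interval, in chronological order within each bucket."""
    buckets = defaultdict(list)
    for ts, value in sorted(rows, key=lambda row: row[0]):
        buckets[interval_key(ts, interval)].append((ts, value))
    return buckets

def sample_timeseries(rows, interval="hourly", pick="first", seed=None):
    """Keep one row per interval: the first, the last or a random one."""
    rng = random.Random(seed)
    choose = {"first": lambda g: g[0], "last": lambda g: g[-1], "random": rng.choice}
    if pick not in choose:
        raise ValueError(f"unknown pick: {pick}")
    buckets = bucket_rows(rows, interval)
    return [choose[pick](buckets[key]) for key in sorted(buckets)]

def aggregate_timeseries(rows, interval="daily", how="mean"):
    """Reduce each interval to one aggregated value."""
    funcs = {"mean": mean, "sum": sum, "min": min, "max": max}
    buckets = bucket_rows(rows, interval)
    return [(key, funcs[how]([v for _, v in buckets[key]])) for key in sorted(buckets)]

# Hypothetical readings
rows = [
    (datetime(2025, 12, 1, 9, 5), 10.0),
    (datetime(2025, 12, 1, 9, 40), 20.0),
    (datetime(2025, 12, 1, 10, 15), 30.0),
    (datetime(2025, 12, 2, 8, 0), 40.0),
]
hourly_first = sample_timeseries(rows, interval="hourly", pick="first")
daily_last = sample_timeseries(rows, interval="daily", pick="last")
daily_mean = aggregate_timeseries(rows, interval="daily", how="mean")
print(hourly_first, daily_last, daily_mean, sep="\n")
```

Picking a row keeps an actual observation from each interval, while aggregating replaces the interval with a summary value; which one fits depends on whether downstream analysis needs real records or a smoothed series.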

lens Command Features

Enhanced CSV viewing capabilities with csvlens integration:

  • Auto-Reload: New --auto-reload option to automatically reload file when it changes

    • Useful for monitoring live data files
  • Streaming stdin: New --streaming-stdin option for real-time data viewing

    • Supports viewing data as it's being piped in
  • Row Marking: Updated csvlens dependency with row marking feature

Breaking Changes

  • describegpt: --json flag replaced with --format option
  • describegpt: --jsonl option removed
  • schema, joinp, pivotp, sqlp: Updated Polars schema naming conventions (existing workflows should work but output format may differ slightly)

Added

  • Created Event Logo Archive with AI-generated seasonal/version logos
  • describegpt: add controlled vocabulary support for tags #3122
  • describegpt: add elapsed time #3168
  • describegpt: add lookup support #3170
  • excel: add --cell option #3133
  • frequency: add dynamic parallel chunk sizing #3135
  • lens: add --auto-reload option #3128
  • lens: add --streaming-stdin option #3171
  • sample: add timeseries sampling options #3130
  • schema: infer additional JSON Schema predefined formats - email, ipv4, ipv6, hostname #3125
  • schema: add --output option and standardize Polars Schema file name #3126
  • stats: dynamic parallel chunk sizing with indexed files #3134
  • stats: add n_negative, n_zero, n_positive count columns #3157
  • validate: add email validation options #3148
  • tests: add tests for https://100.dathere.com/lessons/4 by @rzmk in #3151
  • Added Claude AI guidance for contributors
  • Enhanced --version output with more comprehensive system metadata

Changed

  • refactor: describegpt improve tags inferencing with Tag Vocabulary #3139
  • feat: describegpt - major refactor #3143
  • feat: describegpt improved Polars SQL processing #3147
  • feat: describegpt replace --json option with --format option supporting 3 formats - markdown, json and TSV; remove --jsonl option #3167
  • refactor: frequency & stats - parallel chunk sizing - allow forcing of cpu based chunking #3138
  • Align partition stdin handling with split/stats pattern by @Copilot in #3162
  • deps: use latest polars upstream with new SQL fixes and features (pola-rs/polars@e1be17f)
  • deps: latest self_update upstream
  • build(deps): bump actions/setup-python from 6.0.0 to 6.1.0 by @dependabot[bot] in #3120
  • build(deps): bump actix-web from 4.12.0 to 4.12.1 by @dependabot[bot] in #3127
  • build(deps): bump flate2 from 1.1.5 to 1.1.7 by @dependabot[bot] in #3159
  • build(deps): bump jsonschema from 0.37.1 to 0.37.2 by @dependabot[bot] in #3129
  • build(deps): bump jsonschema from 0.37.2 to 0.37.3 by @dependabot[bot] in #3131
  • build(deps): bump jsonschema from 0.37.3 to 0.37.4 by @dependabot[bot] in #3140
  • build(deps): bump log from 0.4.28 to 0.4.29 by @dependabot[bot] in #3150
  • build(deps): bump minijinja from 2.12.0 to 2.13.0 by @dependabot[bot] in #3142
  • build(deps): bump minijinja-contrib from 2.12.0 to 2.13.0 by @dependabot[bot] in #3141
  • build(deps): bump pyo3 from 0.27.1 to 0.27.2 by @dependabot[bot] in #3137
  • build(deps): bump qsv-stats from 0.40.0 to 0.41.0 by @dependabot[bot] in #3136
  • build(deps): bump qsv-stats from 0.41.0 to 0.42.0 by @dependabot[bot] in #3156
  • build(deps): bump qsv-stats from 0.42.0 to 0.43.0 by @dependabot[bot] in #3169
  • build(deps): bump rfd from 0.15.4 to 0.16.0 by @dependabot[bot] in #3121
  • build(deps): bump uuid from 1.18.1 to 1.19.0 by @dependabot[bot] in #3146
  • Improved qsvpy build process for Apple Silicon
  • Updated GitHub Actions workflows for better reliability
  • bumped several indirect dependencies
  • applied select clippy & Codacy suggestions
  • Improved dependency version management
  • Better feature flag handling

Fixed

  • fix: apply panic on empty selection #3165
  • fix: more robust snappy and file extension detection #3166
  • fix: partition add proper stdin handling regression introduced when --limit option was added #3161
  • Fix broken layout of environment variable documentation by @tmtmtmtm in #3163

Removed

  • describegpt: remove --jsonl option #3167
  • chore: remove jemalloc support #3153

New Contributors

  • @Copilot made their first contribution in #3162

Full Changelog: 10.0.0...11.0.0
