github dathere/qsv 21.0.0

pre-release · 6 hours ago

[21.0.0] - 2026-06-07 🌐 The "F_AI_Rification" Release 📇

A major release headlined by two brand-new commands: get, which fetches tabular data from HTTP(S), cloud object stores (S3/GCS/Azure) and CKAN portals into a content-addressed local cache, and profile, which extracts standards-compliant dataset metadata (DCAT-US v3, DCAT-AP v3, Croissant 1.1 and Geoconnex). It also raises the minimum supported Rust version to 1.96 and upgrades to Polars 0.54, which is why this is a major version bump; existing pipelines are otherwise source-compatible (the one breaking profile flag change affects only the new command, which did not exist in 20.1.0).

Headline

  • 🆕 get: fetch tabular data from anywhere into a local cache. A new command (issue #2263) that retrieves CSV/TSV and other tabular data from HTTP(S) URLs, cloud object stores (s3://, gs://, az://), and CKAN portals (ckan://), then stores it in a content-addressed disk cache. Cached entries are addressable via a dc:<name> input prefix usable by any other qsv command, carry BLAKE3 + ETag provenance, support TTL/policy controls, and revalidate conditionally (HTTP If-None-Match / 304 Not Modified). Cloud sources are gated behind the opt-in get_cloud sub-feature; streaming, ranged/parallel downloads and a dc: stats cache landed in Phase 3 (#3953, #3958).
  • 🆕 profile: generate standards-compliant dataset metadata. A new command that profiles a dataset and projects it into open metadata standards (DCAT-US v3, DCAT-AP v3, Croissant 1.1, and Geoconnex) via a YAML-driven projection engine, with optional SHACL/mlcroissant/pyshacl validation and embedded descriptive statistics & frequency tables (#3898, #3901, #3908, #3912, #3916, #3918).
  • geocode goes online with OpenCage. New geocode subcommands call the OpenCage geocoding API for forward and reverse geocoding, with a persistent on-disk result cache and %dyncols: support (issue #1295, #3876, #3878).
  • describegpt describes meaning, not just types. A richer "semanticmd" Data Dictionary format for agents & catalogs, a JSON Schema (draft 2020-12) output format, and LLM-inferred date/datetime content types round out describegpt's semantic-description capabilities (#3933, #3935, #3871, #3884).
  • Mergeable / variance-bounded sampling in sample: two new sampling modes plus a sketch-IO surface that lets users sample sharded inputs and combine the results without re-reading the whole corpus. Both modes are native Rust implementations written from the original algorithm papers. The Apache DataSketches project's Sampling family implements the same family of algorithms in C++/Java/Python; qsv does not bind to or depend on that code (the datasketches Rust crate doesn't expose Sampling-family sketches), so the on-disk format is qsv-specific and not interoperable with DataSketches serialized sketches.
    • --varopt <col>: variance-bounded weighted reservoir sampling using the A-ExpJ keying scheme of Efraimidis & Spirakis (2006). Each record gets a key u^(1/w), where u is a uniform random draw from (0, 1) and w is the record's weight, and the top-k keys are retained. Unlike --weighted (which is single-pass acceptance-rejection requiring a max_weight from the stats cache), --varopt is a true reservoir sampler: no stats cache required, single pass, bounded memory, and mergeable across partitions.
    • --mergeable-reservoir: uniform reservoir using Vitter's Algorithm R. Same statistical distribution as the default RESERVOIR method, but the resulting sampler state is mergeable.
    • --sketch-out <file> / --sketch-in <file1,file2,...>: serialize the sampler state to a binary blob and merge across runs. Sketches embed the source CSV header so --sketch-in re-emits a schema-bearing CSV without consulting the source files. Sampler-kind mismatch (mixing a reservoir blob with a varopt blob) is rejected. Works with both new sampling modes.
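
The keying rule behind --varopt and the merge step behind --sketch-in can be illustrated with a minimal Python sketch. It is not qsv's Rust implementation and says nothing about its on-disk sketch format: the function names (weighted_reservoir, merge_sketches), the in-memory tuple layout and the min-heap are assumptions made for illustration. Keys are compared in log space (ln(u)/w), which orders records exactly as u^(1/w) does while avoiding floating-point underflow for small weights.

```python
import heapq
import math
import random


def weighted_reservoir(records, k, rng):
    """Keep the k (record, weight) pairs with the largest keys.

    The key is ln(u) / w with u uniform in (0, 1]; this is a monotone
    transform of u ** (1 / w), so the top-k set is the same.
    Returns a list of (key, seq, record) tuples (an illustrative layout).
    """
    heap = []  # min-heap: heap[0] holds the smallest retained key
    for seq, (record, weight) in enumerate(records):
        if weight <= 0:
            continue  # assumption: non-positive weights are never sampled
        u = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log() is defined
        item = (math.log(u) / weight, seq, record)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            heapq.heapreplace(heap, item)  # evict the current smallest key
    return heap


def merge_sketches(sketch_a, sketch_b, k):
    """Merge two per-shard sketches.

    The global top-k keys are always contained in the union of the
    per-shard top-k keys, so no source data needs to be re-read.
    """
    return heapq.nlargest(k, sketch_a + sketch_b, key=lambda item: item[0])
```

Each shard can be sampled independently with its own random generator; merging then only touches the retained keys, which is the property that makes this kind of sampler state combinable across runs.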

Added

  • sample: --varopt <col> flag for variance-bounded weighted reservoir sampling (A-ExpJ keying, Efraimidis & Spirakis 2006). See Headline above.
  • sample: --mergeable-reservoir flag for a uniform reservoir sampler whose state is mergeable across runs (same distribution as the default RESERVOIR method). See Headline above.
  • sample: --sketch-out <file> / --sketch-in <files> for serializing and merging sampler state across runs. Sketches carry their source CSV header so merged output is schema-bearing.
  • geocode: new cache-clear, cache-prune & cache-info subcommands to manage the persistent on-disk OpenCage result cache. cache-clear wipes the cache, cache-prune --older-than <val> deletes entries older than an absolute date or a relative age (e.g. 30d, 2w), and cache-info reports the cache directory, entry count, on-disk size and oldest/newest entry timestamps.
  • profile: new bundled geoconnex projection profile + pyshacl validator wired to the Internet of Water's Geoconnex SHACL shapes (vendored under resources/geoconnex/shacl/, embedded in the qsv binary). Phase 1 is dataset-level only (DatasetShape / ProviderShape / PublisherShape / DistributionShape coverage); the row-per-feature LocationOrientedShape (with mandatory gsp:asWKT geometry synthesis from lat/lon columns) is deferred to a follow-up. Gated behind a new geoconnex cargo feature: present in qsv (via distrib_features) and as an opt-in for qsvdp (-F datapusher_plus,geoconnex); not available in qsvlite / qsvmcp.
  • 🆕 get: new command for fetching tabular data from HTTP(S) URLs, cloud object stores (s3://, gs://, az://) and CKAN portals (ckan://) into a content-addressed local disk cache. Cached entries are reusable by any other qsv command via the dc:<name> input prefix, carry BLAKE3/ETag provenance plus record-count and TTL metadata, and revalidate conditionally over HTTP (If-None-Match → 304 Not Modified). Subcommands include cache-set-ttl, cache-set-policy and cache-list --verify. Cloud sources are gated behind the opt-in get_cloud sub-feature (via object_store, no new transitive crates). Available in qsv/qsvmcp/qsvdp (not qsvlite). Issue #2263 (#3953, #3958).
  • 🆕 profile: new command for profiling a dataset and projecting it into open metadata standards (DCAT-US v3, DCAT-AP v3 and Croissant) through a YAML-driven projection engine, with optional external validation (mlcroissant for Croissant, pyshacl for DCAT-AP/Geoconnex SHACL shapes) and embedded descriptive statistics & frequency tables. Accepts local files, URL inputs and stdin. Available in qsv, qsvmcp and qsvdp; not in qsvlite. (The bundled geoconnex projection profile is the only part gated further, to qsv/qsvdp via the geoconnex feature.) (#3898, #3901, #3904, #3908, #3910, #3911, #3912, #3918).
  • geocode: new OpenCage online geocoding subcommands for forward and reverse geocoding via the OpenCage API, including %dyncols: support to materialize multiple result fields as new columns. Issue #1295 (#3876, #3878).
  • describegpt: new --format semanticmd output, a richer Markdown Data Dictionary describing what each column means (not just its data type), designed for agents and data catalogs (#3933, #3935).
  • describegpt: new JSON Schema (draft 2020-12) output format (#3871).
  • describegpt: date/datetime content-type tokens with LLM-inferred chrono format, and inferred date format rendered for Min/Max in the JSON/TOON/JSONSchema dictionaries (#3884, #3922).
  • synthesize: preserve inter-column relationships when generating synthetic data, so correlated columns stay correlated in the output (#3888).
  • stats: new --zero-padded-numeric flag/column to opt into treating zero-padded numeric codes as a marked numeric type (#3934, #3938).
  • sniff: indexed distributed sampling for better type and date inference on large files (#3926).
  • excel: expose has_1904_epoch in the workbook metadata output (#3905).
  • template: new shared data-wrangling MiniJinja filters available across commands that use templates (#3921).
  • help-md: render NOTE/WARNING/IMPORTANT/TIP/CAUTION blocks as GitHub Alerts in generated help markdown (#3927).
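
The conditional revalidation mentioned in the get entry above (If-None-Match answered by 304 Not Modified) is a standard HTTP mechanism, and a minimal Python sketch of it looks like the following. This is not qsv's code: the function name revalidate, its return shape and the injectable opener parameter are hypothetical, chosen only to make the exchange easy to follow and to test without a network.

```python
import urllib.error
import urllib.request


def revalidate(url, cached_etag, opener=urllib.request.urlopen):
    """Ask the server whether a cached copy identified by cached_etag is current.

    Returns (status, etag, body):
      ("fresh", cached_etag, None)       when the server answers 304 Not Modified
      ("refetched", new_etag, new_body)  when the server sends a new representation
    """
    request = urllib.request.Request(url, headers={"If-None-Match": cached_etag})
    try:
        with opener(request) as response:
            # 2xx: the resource changed, so the body and its new ETag replace the cache entry.
            return "refetched", response.headers.get("ETag"), response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 as an HTTPError; the cached bytes are still valid.
            return "fresh", cached_etag, None
        raise
```

When the server answers 304 no body is transferred, so only a changed resource costs a full download.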

Changed

  • deps: migrate cached 1.1 → 2.0.1 (cached_proc_macro 1.1.0 → 2.0.0; see the upstream migration guides). Moved fetch/fetchpost/geocode to 2.0's builder-based store construction (LruCache::builder().max_size(n).build()) and to the new automatic Result/Option handling (the result/option macro attributes were removed and size renamed to max_size; 2.0 skips caching Err/None by default). On-disk cache format is unchanged (DISK_FILE_VERSION is still 1), so existing fetch/fetchpost/describegpt/geocode caches remain valid. Behavioral note: under 2.0's defaults, geocode no longer caches misses: failed DNS resolutions and "country not found" lookups are retried per-record instead of being cached for the run (more correct, since transient DNS failures are no longer sticky).

  • perf(geocode): replaced the single-RwLock<LruCache> in-memory caches behind geocode suggest/reverse (the search_index result cache) and iplookup (the cached_dns_lookup hostname→IP cache) with sharded concurrent LRUs (cached 2.0's #[concurrent_cached] ShardedLruCache). The old caches took an exclusive lock on every read hit (LRU recency bump), serializing geocode's rayon-parallel pipeline; sharding spreads that across many independent locks (count auto-derived from available_parallelism()). ~2.2x faster on a 16-core, high-cache-hit workload (5M rows: 3.17s → 1.41s; system time from lock contention dropped 16.5s → 1.1s), byte-identical output, and no single-threaded regression (--jobs 1 is unchanged). Real-world gains scale with cache-hit ratio and core count.

  • build: set panic = "abort" in the release profile, shrinking the all-features qsv binary ~18% (138.5 MB → 113.5 MB). Removes the stack-unwinding machinery (__eh_frame, __gcc_except_tab, __unwind_info ≈ 14.5 MB) plus the per-callsite landing-pad code (~10 MB of __text). Tests are unaffected (Cargo forces panic = "unwind" for the test/bench harness) and human-panic still reports before aborting. The release-samply profiling profile keeps panic = "unwind" for full backtraces. Exception: Luau-enabled builds (the main qsv binary) use the new release-luau profile (panic = "unwind") because mlua requires unwinding; see Fixed below (#3937).

  • build: dropped the luau feature from the qsvmcp binary and the qsvdp Debian variant. Those binaries keep panic = "abort" for the size win, and Luau (which needs panic = "unwind") is unnecessary in them. The full qsv binary still bundles Luau. #3937

  • perf(stats): resolve the --infer-dates "sniff" dates-whitelist in-process instead of forking a qsv sniff subprocess. This eliminates a second 138 MB binary load, redundant re-sampling, and a JSON round-trip to obtain data the sniff code already has as a struct. Output is byte-identical. ~1.4–2.4x faster on cold --infer-dates runs, with the biggest wins on small and many-file workloads. #3924

  • perf(stats): reuse the sniff-resolved dates-whitelist on warm stats-cache hits instead of re-sniffing an unchanged file on every run. The cache sidecar now records the original "sniff" value as provenance (flag_dates_whitelist_raw) so it can safely reuse the previously-resolved date columns while preserving content-based cache sharing with schema/profile/frequency. Since "sniff" is the default --dates-whitelist, warm --infer-dates repeat runs are now ~4.4–6.6x faster (~10 ms regardless of file size). #3924

  • stats: zero-padded decimal codes (e.g. ICD-9 007.1, Dewey Decimal 05.10, Harmonized System tariff codes) are now inferred as String instead of Float, preserving their leading/padding zeros and mirroring how zero-padded integers (zip codes like 07306) have always been kept as text. Previously they parsed as floats and silently lost their padding (007.1 → 7.1) when loaded/cast downstream (schema, tojsonl, etc.). This is the data-integrity default and is not gated behind --zero-padded-numeric, which remains an opt-in marker column. Ordinary fractions (0.5) and pure trailing-zero codes (7.10) are unaffected. The only added cost is a leading-zero byte-check after a successful float parse on the type-inference hot path (negligible: a couple of byte comparisons in the common case, zero cost on non-float columns). Follow-up to #3938.

  • profile: bumped the bundled croissant projection profile from Croissant 1.0 → Croissant 1.1. conformsTo now points at .../croissant/1.1; @context expanded to the canonical 1.1 prefix table (adds sc:, rai:, and shortcut terms for recordSet, field, fileObject, source, extract, column, dataType, citeAs, etc.); Distribution @type switched from sc:FileObject to cr:FileObject; FileObject and RecordSet now carry stable @id slugs so per-Field source blocks ({fileObject: {@id}, extract: {column}}) can cross-reference them. New 1.1 citeAs field populated from pkg.citation. Behavioral note: the file-hash slot switched from the non-canonical cr:fileFingerprint+cr:Checksum nested shape (BLAKE3) to the canonical direct sha256 property mlcroissant validates against; this is slower on multi-GB inputs but spec-compliant. #3916

  • docs(alloc): evaluated two more jemalloc levers from TUNING.md (metadata_thp:auto and percpu_arena:percpu) beyond the background_thread + dirty/muzzy-decay tuning shipped in #3948. On a high-cardinality Linux stress workload neither was a consistent win (metadata_thp made frequency ~4% faster but stats ~3% slower at ~+5% peak RSS; percpu_arena showed no RSS-safe gain), so qsv does not enable them by default. Documented how to opt in via _RJEM_MALLOC_CONF, and added a manual Bench jemalloc metadata_thp GitHub Actions workflow (A/B/C wall-clock + peak-RSS on a synthetic high-cardinality dataset) to reproduce the measurement.

  • MSRV: raised the minimum supported Rust version to 1.96.

  • deps: upgraded Polars to 0.54 (py-1.41.1) and relaxed the chrono dependency to 0.4.

  • perf(frequency,schema): faster top-N and enum-list computation via qsv-stats 0.53 (#3950).

  • perf(polars): trimmed Polars features and gated AVX-512 to x86_64, shrinking the binary by ~4.5 MiB (#3930).

  • perf(alloc): jemalloc tuning (background_thread + dirty/muzzy decay) for parallel aggregation commands (#3948).

  • profile: generalized the --validate / --strict / --no-projection flags into a unified projection/validation interface. Marked breaking (feat(profile)!), but affects only the new profile command, which did not exist in 20.1.0 (#3915).
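
The zero-padded decimal rule in the stats entry above (a leading-zero check applied after a successful float parse) can be sketched in Python as follows. This is an illustration under stated assumptions, not qsv's Rust type-inference code: the function names are hypothetical, only the String/Float distinction is modeled (qsv infers more types than these two), and sign handling is left out because the notes do not describe it.

```python
def is_zero_padded_code(field: str) -> bool:
    """True when a field starts with a padding zero: '0' followed by another digit.

    '0.5' is an ordinary fraction (the '0' is followed by '.'), and '7.10'
    has no leading zero, so neither counts as padded.
    """
    return len(field) >= 2 and field[0] == "0" and field[1] in "0123456789"


def infer_type(field: str) -> str:
    """Two-type sketch: 'Float' for float-parsable text, 'String' otherwise,
    except that zero-padded codes stay 'String' so their padding survives."""
    try:
        float(field)
    except ValueError:
        return "String"
    # The extra check only runs after the float parse has already succeeded.
    return "String" if is_zero_padded_code(field) else "Float"
```

Because the padding check sits behind the float parse, values that are not float-parsable never reach it, which matches the note that non-float columns pay nothing extra.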

Fixed

  • luau: error conditions in Luau scripts no longer abort the process in release builds. With panic = "abort", a Luau callback error (e.g. qsv_cumsum overflow or an invalid qsv_shellcmd), which mlua surfaces by unwinding across the Lua/C boundary, was turned into a hard process abort instead of the expected error message. Luau-enabled builds now use the release-luau profile (panic = "unwind"), so these errors are reported gracefully. #3937
  • describegpt: honor the QSV_LLM_BASE_URL env var and an explicit --base-url even when its value matches the default; repaired stale describegpt tests (#3889, #3893).
  • describegpt: keep JSON Schema examples valid against the inferred property type (#3885).
  • geocode: index-load now correctly reports that only the prebuilt cities15000 index is supported (#3883).
  • moarstats: hardened bivariate analysis against page-cache and temp-file races in joined-CSV processing, and fixed the clustered -o output guard (#3873, #3881, #3882, #3892, #3894).

Full Changelog: 20.1.0...21.0.0
