[21.0.0] - 2026-06-07 π The "F_AI_Rification" Release π
A major release headlined by two brand-new commands β get, which fetches tabular data from HTTP(S), cloud object stores (S3/GCS/Azure) and CKAN portals into a content-addressed local cache, and profile, which extracts standards-compliant dataset metadata (DCAT-US v3, DCAT-AP v3, Croissant 1.1 and Geoconnex). It also raises the minimum supported Rust version to 1.96 and upgrades to Polars 0.54, which is why this is a major version bump β existing pipelines are otherwise source-compatible (the one breaking profile flag change affects only the new command, which did not exist in 20.1.0).
Headline
- π
getβ fetch tabular data from anywhere into a local cache. A new command (issue #2263) that retrieves CSV/TSV and other tabular data from HTTP(S) URLs, cloud object stores (s3://,gs://,az://), and CKAN portals (ckan://), then stores it in a content-addressed disk cache. Cached entries are addressable via adc:<name>input prefix usable by any other qsv command, carry BLAKE3 + ETag provenance, support TTL/policy controls, and revalidate conditionally (HTTPIf-None-Match/304 Not Modified). Cloud sources are gated behind the opt-inget_cloudsub-feature; streaming, ranged/parallel downloads and adc:stats cache landed in Phase 3 (#3953, #3958). - π
profileβ generate standards-compliant dataset metadata. A new command that profiles a dataset and projects it into open metadata standards β DCAT-US v3, DCAT-AP v3, Croissant 1.1, and Geoconnex β via a YAML-driven projection engine, with optional SHACL/mlcroissant/pyshaclvalidation and embedded descriptive statistics & frequency tables (#3898, #3901, #3908, #3912, #3916, #3918). geocodegoes online with OpenCage. Newgeocodesubcommands call the OpenCage geocoding API for forward and reverse geocoding, with a persistent on-disk result cache and%dyncols:support (issue #1295, #3876, #3878).describegptdescribes meaning, not just types. A richer "semanticmd" Data Dictionary format for agents & catalogs, a JSON Schema (draft 2020-12) output format, and LLM-inferred date/datetime content types round out describegpt's semantic-description capabilities (#3933, #3935, #3871, #3884).- Mergeable / variance-bounded sampling in
sampleβ two new sampling modes plus a sketch-IO surface that lets users sample sharded inputs and combine the results without re-reading the whole corpus. Both modes are native Rust implementations written from the original algorithm papers. The Apache DataSketches project's Sampling family implements the same family of algorithms in C++/Java/Python β qsv does not bind to or depend on that code (thedatasketchesRust crate doesn't expose Sampling-family sketches), so the on-disk format is qsv-specific and not interoperable with DataSketches serialized sketches.--varopt <col>β variance-bounded weighted reservoir sampling using the A-ExpJ keying scheme of Efraimidis & Spirakis (2006). Each record gets a keyu^(1/w)and the top-kkeys are retained. Unlike--weighted(which is single-pass acceptance-rejection requiring amax_weightfrom the stats cache),--varoptis a true reservoir sampler β no stats cache required, single pass, bounded memory, and mergeable across partitions.--mergeable-reservoirβ uniform reservoir using Vitter's Algorithm R. Same statistical distribution as the default RESERVOIR method, but the resulting sampler state is mergeable.--sketch-out <file>/--sketch-in <file1,file2,...>β serialize the sampler state to a binary blob and merge across runs. Sketches embed the source CSV header so--sketch-inre-emits a schema-bearing CSV without consulting the source files. Sampler-kind mismatch (mixing a reservoir blob with a varopt blob) is rejected. Works with both new sampling modes.
Added
sample:--varopt <col>flag for variance-bounded weighted reservoir sampling (A-ExpJ keying, Efraimidis & Spirakis 2006). See Headline above.sample:--mergeable-reservoirflag for a uniform reservoir sampler whose state is mergeable across runs (same distribution as the default RESERVOIR method). See Headline above.sample:--sketch-out <file>/--sketch-in <files>for serializing and merging sampler state across runs. Sketches carry their source CSV header so merged output is schema-bearing.geocode: newcache-clear,cache-prune&cache-infosubcommands to manage the persistent on-disk OpenCage result cache.cache-clearwipes the cache,cache-prune --older-than <val>deletes entries older than an absolute date or a relative age (e.g.30d,2w), andcache-inforeports the cache directory, entry count, on-disk size and oldest/newest entry timestamps.profile: new bundledgeoconnexprojection profile +pyshaclvalidator wired to the Internet of Water's Geoconnex SHACL shapes (vendored underresources/geoconnex/shacl/, embedded in the qsv binary). Phase 1 is dataset-level only βDatasetShape/ProviderShape/PublisherShape/DistributionShapecoverage; the row-per-featureLocationOrientedShape(with mandatorygsp:asWKTgeometry synthesis from lat/lon columns) is deferred to a follow-up. Gated behind a newgeoconnexcargo feature β present inqsv(viadistrib_features) and as an opt-in forqsvdp(-F datapusher_plus,geoconnex); not available inqsvlite/qsvmcp.- π
get: new command for fetching tabular data from HTTP(S) URLs, cloud object stores (s3:///gs:///az://) and CKAN portals (ckan://) into a content-addressed local disk cache. Cached entries are reusable by any other qsv command via thedc:<name>input prefix, carry BLAKE3/ETag provenance plus record-count and TTL metadata, and revalidate conditionally over HTTP (If-None-Matchβ304 Not Modified). Subcommands includecache-set-ttl,cache-set-policyandcache-list --verify. Cloud sources are gated behind the opt-inget_cloudsub-feature (viaobject_store, no new transitive crates). Available inqsv/qsvmcp/qsvdp(notqsvlite). Issue #2263 (#3953, #3958). - π
profile: new command for profiling a dataset and projecting it into open metadata standards β DCAT-US v3, DCAT-AP v3 and Croissant β through a YAML-driven projection engine, with optional external validation (mlcroissantfor Croissant,pyshaclfor DCAT-AP/Geoconnex SHACL shapes) and embedded descriptive statistics & frequency tables. Accepts local files, URL inputs and stdin. Available inqsv,qsvmcpandqsvdp; not inqsvlite. (The bundledgeoconnexprojection profile is the only part gated further β toqsv/qsvdpvia thegeoconnexfeature.) (#3898, #3901, #3904, #3908, #3910, #3911, #3912, #3918). geocode: new OpenCage online geocoding subcommands for forward and reverse geocoding via the OpenCage API, including%dyncols:support to materialize multiple result fields as new columns. Issue #1295 (#3876, #3878).describegpt: new--format semanticmdoutput β a richer Markdown Data Dictionary describing what each column means (not just its data type), designed for agents and data catalogs (#3933, #3935).describegpt: new JSON Schema (draft 2020-12) output format (#3871).describegpt: date/datetime content-type tokens with LLM-inferred chrono format, and inferred date format rendered forMin/Maxin the JSON/TOON/JSONSchema dictionaries (#3884, #3922).synthesize: preserve inter-column relationships when generating synthetic data, so correlated columns stay correlated in the output (#3888).stats: new--zero-padded-numericflag/column to opt into treating zero-padded numeric codes as a marked numeric type (#3934, #3938).sniff: indexed distributed sampling for better type and date inference on large files (#3926).excel: exposehas_1904_epochin the workbook metadata output (#3905).template: new shared data-wrangling MiniJinja filters available across commands that use templates (#3921).help-md: renderNOTE/WARNING/IMPORTANT/TIP/CAUTIONblocks as GitHub Alerts in generated help markdown (#3927).
Changed
-
deps: migratecached1.1 β 2.0.1 (cached_proc_macro1.1.0 β 2.0.0; see the upstream migration guides). Movedfetch/fetchpost/geocodeto 2.0's builder-based store construction (LruCache::builder().max_size(n).build()) and to the new automaticResult/Optionhandling (theresult/optionmacro attributes were removed andsizerenamed tomax_size; 2.0 skips cachingErr/Noneby default). On-disk cache format is unchanged (DISK_FILE_VERSIONis still1), so existingfetch/fetchpost/describegpt/geocodecaches remain valid. Behavioral note: under 2.0's defaults,geocodeno longer caches misses β failed DNS resolutions and "country not found" lookups are retried per-record instead of being cached for the run (more correct, since transient DNS failures are no longer sticky). -
perf(geocode): replaced the single-RwLock<LruCache>in-memory caches behindgeocode suggest/reverse(thesearch_indexresult cache) andiplookup(thecached_dns_lookuphostnameβIP cache) with sharded concurrent LRUs (cached 2.0's#[concurrent_cached]ShardedLruCache). The old caches took an exclusive lock on every read hit (LRU recency bump), serializing geocode's rayon-parallel pipeline; sharding spreads that across many independent locks (count auto-derived fromavailable_parallelism()). ~2.2x faster on a 16-core, high-cache-hit workload (5M rows: 3.17s β 1.41s; system time from lock contention dropped 16.5s β 1.1s), byte-identical output, and no single-threaded regression (--jobs 1is unchanged). Real-world gains scale with cache-hit ratio and core count. -
build: setpanic = "abort"in the release profile, shrinking the all-featuresqsvbinary ~18% (138.5 MB β 113.5 MB). Removes the stack-unwinding machinery (__eh_frame,__gcc_except_tab,__unwind_infoβ 14.5 MB) plus the per-callsite landing-pad code (~10 MB of__text). Tests are unaffected (Cargo forcespanic = "unwind"for the test/bench harness) andhuman-panicstill reports before aborting. Therelease-samplyprofiling profile keepspanic = "unwind"for full backtraces. Exception: Luau-enabled builds (the mainqsvbinary) use the newrelease-luauprofile (panic = "unwind") because mlua requires unwinding β see Fixed below (#3937). -
build: dropped theluaufeature from theqsvmcpbinary and theqsvdpDebian variant β those binaries keeppanic = "abort"for the size win, and Luau (which needspanic = "unwind") is unnecessary in them. The fullqsvbinary still bundles Luau. #3937 -
perf(stats): resolve the--infer-dates"sniff" dates-whitelist in-process instead of forking aqsv sniffsubprocess β eliminates a second 138 MB binary load, redundant re-sampling, and a JSON round-trip to obtain data the sniff code already has as a struct. Output is byte-identical. ~1.4β2.4x faster on cold--infer-datesruns, with the biggest wins on small and many-file workloads. #3924 -
perf(stats): reuse the sniff-resolved dates-whitelist on warm stats-cache hits instead of re-sniffing an unchanged file on every run. The cache sidecar now records the original "sniff" value as provenance (flag_dates_whitelist_raw) so it can safely reuse the previously-resolved date columns while preserving content-based cache sharing withschema/profile/frequency. Since "sniff" is the default--dates-whitelist, warm--infer-datesrepeat runs are now ~4.4β6.6x faster (~10 ms regardless of file size). #3924 -
stats: zero-padded decimal codes (e.g. ICD-9007.1, Dewey Decimal05.10, Harmonized System tariff codes) are now inferred as String instead of Float, preserving their leading/padding zeros β mirroring how zero-padded integers (zip codes like07306) have always been kept as text. Previously they parsed as floats and silently lost their padding (007.1β7.1) when loaded/cast downstream (schema,tojsonl, etc.). This is the data-integrity default and is not gated behind--zero-padded-numeric, which remains an opt-in marker column. Ordinary fractions (0.5) and pure trailing-zero codes (7.10) are unaffected. The only added cost is a leading-zero byte-check after a successful float parse on the type-inference hot path (negligible β a couple of byte comparisons in the common case, zero cost on non-float columns). Follow-up to #3938. -
profile: bumped the bundledcroissantprojection profile from Croissant 1.0 β Croissant 1.1.conformsTonow points at.../croissant/1.1;@contextexpanded to the canonical 1.1 prefix table (addssc:,rai:, and shortcut terms forrecordSet,field,fileObject,source,extract,column,dataType,citeAs, etc.); Distribution@typeswitched fromsc:FileObjecttocr:FileObject; FileObject and RecordSet now carry stable@idslugs so per-Fieldsourceblocks ({fileObject: {@id}, extract: {column}}) can cross-reference them. New 1.1citeAsfield populated frompkg.citation. Behavioral note: the file-hash slot switched from the non-canonicalcr:fileFingerprint+cr:Checksumnested shape (BLAKE3) to the canonical directsha256propertymlcroissantvalidates against β slower on multi-GB inputs but spec-compliant. #3916 -
docs(alloc): evaluated two more jemalloc levers from TUNING.md βmetadata_thp:autoandpercpu_arena:percpuβ beyond thebackground_thread+ dirty/muzzy-decay tuning shipped in #3948. On a high-cardinality Linux stress workload neither was a consistent win (metadata_thpmadefrequency~4% faster butstats~3% slower at ~+5% peak RSS;percpu_arenashowed no RSS-safe gain), so qsv does not enable them by default. Documented how to opt in via_RJEM_MALLOC_CONF, and added a manualBench jemalloc metadata_thpGitHub Actions workflow (A/B/C wall-clock + peak-RSS on a synthetic high-cardinality dataset) to reproduce the measurement. -
MSRV: raised the minimum supported Rust version to 1.96. -
deps: upgraded Polars to 0.54 (py-1.41.1) and relaxed thechronodependency to0.4. -
perf(frequency,schema): faster top-N and enum-list computation viaqsv-stats0.53 (#3950). -
perf(polars): trimmed Polars features and gated AVX-512 tox86_64, shrinking the binary by ~4.5 MiB (#3930). -
perf(alloc): jemalloc tuning (background_thread+ dirty/muzzy decay) for parallel aggregation commands (#3948). -
profile: generalized the--validate/--strict/--no-projectionflags into a unified projection/validation interface. Marked breaking (feat(profile)!), but affects only the newprofilecommand, which did not exist in 20.1.0 (#3915).
Fixed
luau: error conditions in Luau scripts no longer abort the process in release builds. Withpanic = "abort", a Luau callback error (e.g.qsv_cumsumoverflow or an invalidqsv_shellcmd) β which mlua surfaces by unwinding across the Lua/C boundary β was turned into a hard process abort instead of the expected error message. Luau-enabled builds now use therelease-luauprofile (panic = "unwind"), so these errors are reported gracefully. #3937describegpt: honor theQSV_LLM_BASE_URLenv var and an explicit--base-urleven when its value matches the default; repaired stale describegpt tests (#3889, #3893).describegpt: keep JSON Schema examples valid against the inferred property type (#3885).geocode:index-loadnow correctly reports that only the prebuiltcities15000index is supported (#3883).moarstats: hardened bivariate analysis against page-cache and temp-file races in joined-CSV processing, and fixed the clustered-ooutput guard (#3873, #3881, #3882, #3892, #3894).
Full Changelog: 20.1.0...21.0.0