github dathere/qsv 20.0.0

Pre-release

[20.0.0] - 2026-05-03 🧹 The "Spring Cleaning" Release 🌱

Using Claude Code, roborev, Serena, Context7, and GitHub Copilot, orchestrated in an adversarial review workflow, we systematically audited every command for correctness, safety, and performance. Over the past four weeks we completed the first end-to-end pass over the qsv codebase. The result is the largest correctness-and-safety sweep in qsv's history: every command was touched by review-driven cleanups, dozens of latent bugs, panic paths, and performance cliffs were swept out, and more than 250 new tests were added across the board.

This is a major version bump because that sweep also surfaced four user-visible behaviors that were demonstrably wrong and could not be fixed without breaking compatibility:

  • safenames verify-mode now correctly counts duplicate-suffix renames as unsafe (previously under-reported).
  • enum --hash is now collision-resistant across multi-column inputs (previously ["ab","c"] and ["a","bc"] hashed identically).
  • excel --metadata csv column ordering now actually matches its header row (previously the type, visible, and headers columns held each other's values).
  • util::safe_header_names now enforces its 60-char cap in bytes end-to-end (previously chars-based, allowing UTF-8 names up to 240 bytes, past Postgres' NAMEDATALEN).
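To make the enum --hash fix concrete, here is a minimal sketch of the length-prefix idea. qsv itself streams Xxh3 over raw bytes; std's DefaultHasher is used below purely for illustration, since any byte-stream hasher exhibits the same concatenation collision:

```rust
use std::hash::{DefaultHasher, Hasher};

// Naive concatenation: ["ab","c"] and ["a","bc"] feed the hasher the
// identical byte stream "abc", so distinct rows collide.
fn hash_concat(fields: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    for f in fields {
        h.write(f);
    }
    h.finish()
}

// Length-prefixed: each field's u64 length is hashed before its bytes,
// so field boundaries become part of the digest and the collision vanishes.
fn hash_prefixed(fields: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    for f in fields {
        h.write_u64(f.len() as u64);
        h.write(f);
    }
    h.finish()
}

fn main() {
    let a: &[&[u8]] = &[b"ab", b"c"];
    let b: &[&[u8]] = &[b"a", b"bc"];
    assert_eq!(hash_concat(a), hash_concat(b)); // the old bug: a collision
    assert_ne!(hash_prefixed(a), hash_prefixed(b)); // fixed: distinct digests
    println!("ok");
}
```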

Plus a few smaller but breaking corrections: headers --intersect is renamed to --union (the flag never computed an intersection), luau qsv_loadcsv headers are now 1-indexed per Lua convention, and the MSRV is bumped to Rust 1.95.

Beyond the cleanup, this release adds one new top-level command:

  • NEW implode command: the inverse of explode. Groups rows by key column(s) and joins a value column into a single delimited string per group β€” useful for collapsing normalized output back into compact form.
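A minimal sketch of the transform implode performs, grouping rows by a key column and joining a value column per group. The column indices and separator below are hypothetical stand-ins; see qsv implode --help for the actual interface:

```rust
use std::collections::BTreeMap; // keeps group order deterministic for this sketch

// Sketch of the implode transform: group rows by a key column and join a
// value column into one delimited string per group. This is illustrative
// only; the real command's flags and ordering may differ.
fn implode(rows: &[Vec<&str>], key_col: usize, val_col: usize, sep: &str) -> Vec<(String, String)> {
    let mut groups: BTreeMap<String, Vec<&str>> = BTreeMap::new();
    for row in rows {
        groups
            .entry(row[key_col].to_string())
            .or_default()
            .push(row[val_col]);
    }
    groups
        .into_iter()
        .map(|(key, vals)| (key, vals.join(sep)))
        .collect()
}

fn main() {
    // explode-style normalized rows: one (id, tag) pair per row
    let rows = vec![
        vec!["1", "red"],
        vec!["2", "blue"],
        vec!["1", "green"],
    ];
    let collapsed = implode(&rows, 0, 1, "|");
    assert_eq!(
        collapsed,
        vec![
            ("1".to_string(), "red|green".to_string()),
            ("2".to_string(), "blue".to_string()),
        ]
    );
}
```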

And a notable performance win:

  • frequency: parallel tree-reduce of partial frequency tables delivers a ~1.3x speedup on multi-core machines. Smaller per-command perf wins also landed in fill (+22%), datefmt (+9%), cat, dedup, replace, search/searchset, and transpose.
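The frequency win comes from merging partial tables up a reduction tree instead of folding them into one table serially. A std-only sketch of the idea (qsv uses rayon and its own FTable type; the chunking and map shape here are illustrative):

```rust
use std::collections::HashMap;
use std::thread;

// Sketch of a parallel tree-reduce over partial frequency tables:
// build one partial count table per chunk on its own thread, then merge
// the tables pairwise in rounds (a reduction tree).
fn tree_frequency(values: &[&str], chunks: usize) -> HashMap<String, u64> {
    let size = values.len().div_ceil(chunks.max(1)).max(1);
    // Phase 1: one partial table per chunk, built in parallel.
    let mut partials: Vec<HashMap<String, u64>> = thread::scope(|s| {
        values
            .chunks(size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut m = HashMap::new();
                    for v in chunk {
                        *m.entry((*v).to_string()).or_insert(0u64) += 1;
                    }
                    m
                })
            })
            .collect::<Vec<_>>() // spawn all threads before joining any
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .collect()
    });
    // Phase 2: pairwise tree reduction of the partial tables.
    while partials.len() > 1 {
        let mut next = Vec::with_capacity(partials.len().div_ceil(2));
        let mut it = partials.into_iter();
        while let Some(mut a) = it.next() {
            if let Some(b) = it.next() {
                for (key, count) in b {
                    *a.entry(key).or_insert(0) += count;
                }
            }
            next.push(a);
        }
        partials = next;
    }
    partials.pop().unwrap_or_default()
}
```

The tree shape matters because each merge round runs over ever-fewer, ever-larger tables, instead of one thread absorbing every partial table sequentially.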

Detailed MCP Server and Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.

Important

This is a major release with breaking changes. Pipelines that consume qsv excel --metadata csv by column position, store qsv enum --hash digests across versions, parse qsv safenames verify-mode output, or invoke qsv headers --intersect will need updates. See the Changed and Removed sections below for migration notes.


Added

  • implode: new command, the inverse of explode #3733 (closes #917)
  • generators: mark required options in help markdown and MCP skills #3734
  • sortcheck: add --numeric and --natural flags; allocation-free streaming loop #3756
  • exclude: add stdin support and memcheck #3749

Changed

  • BREAKING excel: --metadata csv column ordering for type, visible, and headers is corrected. Previously the CSV header row declared type, visible, headers but the data rows pushed values in the order headers, typ, visible, so under each named column the wrong values appeared (the type column held the headers list, visible held the type, and headers held the visibility). The CSV output now matches the --metadata json (SheetMetadata struct) field order: index, sheet_name, type, visible, headers, column_count, …. Pipelines that consumed qsv excel --metadata csv and indexed by column position must shift those three columns; consumers that indexed by header name see corrected values automatically.
  • BREAKING enum: --hash digest values change. The hashed input now carries a u64 length prefix per field (to fix the multi-column collision bug above), so every --hash digest differs from earlier qsv versions; single-column hashes change identity values too, and stored hashes from earlier qsv versions will not match. The same input still hashes deterministically across rows and runs on this version and later.
  • BREAKING luau: qsv_loadcsv now returns the headers table 1-indexed (per Lua convention). Scripts that accessed headers[0] or iterated for i = 0, #headers - 1 must shift to headers[1] and for i = 1, #headers (or ipairs(headers)). Previously headers[1] returned the second header.
  • BREAKING headers: rename --intersect to --union. The flag has always computed a deduplicated union of headers across inputs, not a true set intersection; the name was a long-standing misnomer. --intersect is removed entirely (no alias) given the surrounding breaking-change window. Migration: replace qsv headers --intersect … with qsv headers --union …; output is unchanged.
  • BREAKING safenames: verify-mode (--mode v / V / j / J) outputs change. (1) Verify counts now include header positions that would be renamed by the duplicate-suffix pass; inputs containing duplicate column names will report higher unsafe counts than earlier qsv versions, and the count now matches what --mode a would actually rewrite. (2) --mode V / j / J displays unsafe-header strings with leading/trailing whitespace and surrounding " already trimmed (matching what the safe-rename pass actually evaluates), and duplicate_headers is now sorted alphabetically rather than appearing in undefined HashMap iteration order. Pipelines that parsed verbose/JSON output and depended on the old ordering or untrimmed strings must update.
  • BREAKING util::safe_header_names: the 60-length cap is now enforced in bytes on the final name, including any duplicate-disambiguation suffix. Previously the truncation was chars-based (take(60).sum()) and only applied to the base, so non-ASCII headers could produce up to ~240-byte names and duplicate-disambiguated headers added _<n> after truncation, pushing past Postgres' NAMEDATALEN (63 bytes). Now the rewrite path lowercases and prepends the leading-_ prefix before truncating, then snaps to a UTF-8 char boundary at ≤60 bytes. ASCII-only inputs see the same output as before for non-suffixed cases. Long ASCII headers that previously generated 61–63-char suffixed variants will be 1–2 chars shorter at the boundary. Headers containing multibyte UTF-8 (CJK, accented chars, emoji) that previously produced names >60 bytes will now be aggressively trimmed to fit. Affects every caller (safenames, applydp, apply, fetch, python); stored mappings keyed on the old over-long forms will not match.
  • BREAKING MSRV bumped to Rust 1.95
  • describegpt: split process_phase_output into per-branch helpers (dictionary context-only, full dictionary, JSON, TSV, TOON, Markdown). No behavior change: same output, smaller functions.
  • luau: qsv_coalesce now stringifies non-string values (numbers and booleans render via to_string; nil / arrays / objects are skipped). Previously, numbers and booleans were silently treated as missing values via as_str().unwrap_or_default(). Scripts relying on qsv_coalesce(some_bool, fallback) to skip booleans will now return "true"/"false" for the boolean.
  • describegpt: per-phase helper split, widened cache key, ~21% LOC reduction #3720 #3721 #3722
  • frequency: parallel tree-reduce of partial FTables (~1.3x speedup) #3728
  • moarstats: collapse duplicated outlier bivariate scan; safety/perf cleanup, unit tests #3718 #3719
  • validate: use cold_hint (stabilized in Rust 1.95) #3717; correctness, perf cleanup #3743 #3779
  • frequency: correctness, perf, refactor cleanup #3745
  • apply: review-driven cleanup, perf #3741
  • template: subdir bug fix, lookup perf, render-error visibility, helper extraction #3740
  • dedup: allocation-free ignore-case #3754
  • datefmt: ~9% perf #3753
  • fill: ~22% faster hot path #3762
  • replace: streaming parallel write; dead match-flag tracking #3777
  • search/searchset: parallel memory streaming; --quick fixes; USAGE alignment #3776
  • cat: rowskey speedup #3750
  • transpose: correctness, perf cleanup, polish #3781
  • cleanup: rename fail_oom_clierror; surface geocode update-check error #3806
  • applied select clippy lints
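For the safe_header_names byte cap above, the boundary-snapping truncation can be sketched with stable APIs; the helper below mimics the behavior of str::floor_char_boundary (which the fix uses) so the result is always valid UTF-8:

```rust
// Truncate `name` to at most `max` bytes, snapping down to a UTF-8 char
// boundary so a multibyte character is never split. This is a sketch of
// the byte-cap enforcement described in the release notes, not qsv's code.
fn truncate_to_byte_cap(name: &str, max: usize) -> &str {
    if name.len() <= max {
        return name;
    }
    let mut end = max;
    // step back until we land on a char boundary (same result as
    // str::floor_char_boundary)
    while !name.is_char_boundary(end) {
        end -= 1;
    }
    &name[..end]
}

fn main() {
    // ASCII: plain byte truncation.
    assert_eq!(truncate_to_byte_cap("abcdef", 4), "abcd");
    // "é" is 2 bytes: a 3-byte cap on "ééé" would split the second char,
    // so the cap snaps down to 2 bytes.
    assert_eq!(truncate_to_byte_cap("ééé", 3), "é");
    // Names already within the cap pass through untouched.
    assert_eq!(truncate_to_byte_cap("short", 60), "short");
}
```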

Fixed

  • excel: review-driven cleanup of src/cmd/excel.rs, fixing four correctness bugs. (1) Negative --sheet indices that overshot the sheet count silently selected a wrong sheet because the abs_diff clamp "bounced" past zero (e.g. --sheet -4 on a 3-sheet workbook returned the 2nd sheet); now errors with usage error: negative sheet index N is out of range for K sheets. (2) get_requested_range lower-cased the sheet name into the caller's sheet variable and never restored it, so --range 'Sheet2!A1:B2' --error-format formula|both failed because calamine's worksheet_formula is case-sensitive, and the success message printed the wrong case; now the canonical name from the workbook's sheet list is preserved. (3) The float-to-i64 conversion guard used *float_val > i64::MAX as f64, but i64::MAX as f64 rounds up past i64::MAX, so a value of 2^63 slipped through, saturating-cast to i64::MAX, and was emitted as 9223372036854775807 (off by one). Replaced with is_finite() && val >= i64::MIN as f64 && val < i64::MAX as f64 && fract() == 0.0, with a NaN/Inf fallback to Display (the previous code would have hit zmij::format_finite UB on non-finite floats). (4) --range against an empty sheet reported "larger than sheet" because the bounds check used a (0, 0) fallback that swallowed the empty-sheet case; now reports "sheet is empty" distinctly. Also: hoisted to_lowercase() allocations out of the table-name and named-range search loops, fixed two USAGE typos (3nd → 3rd, stray paren in the negative-index example), removed a stray blank line in NamesMetadata's derive, and dropped the now-unused cmp import.
  • enum: review-driven cleanup of src/cmd/enumerate.rs, fixing a silent multi-column hash collision and lossy UTF-8 hashing. Previously --hash over multiple columns concatenated the UTF-8 strings of each field into a single buffer before hashing, so distinct rows like ["ab","c"] and ["a","bc"] flattened to "abc" and hashed identically; non-UTF-8 fields were silently replaced with the empty string via unwrap_or_default(), masking different invalid byte sequences. Switched to streaming Xxh3 over raw bytes with a per-field length prefix, eliminating both bugs and removing per-row String allocation. Also fixed a degenerate-selection bug where a --hash selector resolving to only the existing hash column (e.g. --hash hash on name,hash headers) silently re-included the hash column: the post-filter parse/selection round-trip turned the empty selection into "all columns"; now constructed via Selection::from_indices and rejected with a clear error. Also: collapse three redundant select() calls in hash setup, avoid per-row ByteRecord reallocation when an existing hash column is dropped, fix docstring mode count (was "four", now "six") and duplicated 3. numbering, and add enumerate_hash_no_concat_collision regression test.
  • util: review-driven hardening of src/util.rs: defend process_input against zip-slip, fix panics in optimal_batch_size, safe_header_names (whitespace-only headers), qsv_check_for_update (empty release list), and format_systemtime (pre-1970 timestamps); drop UB-prone unsafe { get_unchecked } from write_json / csv_to_jsonl (this also fixes a long-standing trailing-comma bug when records were shorter than headers); log warnings (including the input path) when row counting hits a CSV read error instead of silently undercounting. Behavior change: download_file now propagates timeout_secs validation errors instead of silently coercing a bad timeout to 30s; passing --timeout 0 or --timeout > 3600 to a downloading command will now error. get_envvar_flag now trims whitespace, accepts on/off, and warns at most once per unrecognized (key, value) pair.
  • describegpt: widen cache key to cover all template-affecting flags (--tag-vocab, --num-tags, --enum-threshold, --sample-size, --fewshot-examples, the QSV_DUCKDB_PATH toggle, and the generated Data Dictionary). Previously, changing any of these between runs silently returned stale cached output. First run after upgrade will re-invoke the LLM once per phase as prior cache entries no longer match.
  • luau: fix stale per-column globals when a row has fewer fields than headers (in both sequential and random-access modes); fix _LASTROW/_ROWCOUNT math under --no-headers in random-access mode (was off-by-one); fix _LASTROW underflow on empty input; fix qsv_loadcsv swallowing CSV parse errors; bound qsv_lag / qsv_diff history with VecDeque to prevent unbounded memory growth on large CSVs; reject lag/periods <= 0 with clear errors; short-circuit MAIN on empty random-access input so END can still run.
  • headers: review-driven cleanup of src/cmd/headers.rs. The docopt USAGE for --intersect claimed it computed a set intersection, but the implementation (and the test fixtures) have always emitted a deduplicated union across inputs; see the BREAKING --union rename in Changed. Tightened the multi-input path: it now actually sets flag_just_names when more than one input is given (previously the docopt promised this but the code only suppressed the index prefix), removing a redundant num_inputs == 1 guard and the wasted TabWriter wrap on that path. --trim now operates on raw bytes (no String::from_utf8_lossy round-trip, so non-UTF-8 header bytes pass through untouched) and also strips leading/trailing tab characters in addition to spaces and double-quotes.
  • safenames: review-driven cleanup of src/cmd/safenames.rs, fixing two correctness bugs in verify modes (--mode v / V / j / J). (1) Headers that the rename pass changed only via the duplicate-suffix step (e.g. col1, col1 → col1, col1_2) were counted as safe because the verify loop checked membership against the rewritten list rather than comparing positionally; the count now agrees with always-mode's changed_count for the same input. See the BREAKING note in Changed for output impact. (2) The verify loop iterated the original (untrimmed, quote-included) headers but compared against the quote-and-space-trimmed list that was actually passed to safe_header_names, so a literally-quoted header like "col" was wrongly flagged unsafe; now compares trimmed-to-trimmed. Also: replaced two simd_json::to_string{,_pretty}(...).unwrap() calls with ? propagation (a serialize failure now surfaces a CLI error instead of panicking); collapsed the verify-mode body into a single pass (replacing O(N²) Vec::contains lookups and four String allocations per header with a HashSet dedup and an entry().or_insert(0) count update); sorted duplicate_headers for deterministic output (previously foldhash::HashMap iteration order leaked into stderr/JSON, forcing tests to accept two orderings); write safe_headers directly in the always/conditional path instead of clearing a StringRecord and re-pushing each field; rewrote the --mode docopt block to make case sensitivity explicit (the parser keys off only the first character, with the case-sensitive v/V and j/J distinctions previously buried) and dropped a misleading "verify does not count quoted identifiers as unsafe" note that contradicted the actual implementation and the in-tree example output. Tightened the Some(reserved_names_vec).as_ref() call to Some(&reserved_names_vec).
  • util: safe_header_names now enforces the length limit in bytes (≤60, snapped to a UTF-8 char boundary) end-to-end, including any disambiguation suffix. Previously the truncation was a hybrid: is_safe_name rejected names >60 bytes, but the rewrite path used chars().map(char::len_utf8).take(60).sum() (chars-based, up to 240 bytes for 4-byte UTF-8) and the duplicate-suffix step appended _2/_3 after truncation, so a non-ASCII or duplicate-disambiguated header could land at 62–240+ bytes, past the documented bound and Postgres' default NAMEDATALEN of 63 bytes. The function now lowercases and prepends the leading-_ prefix before truncation (case-folding can change byte length, prefixing adds bytes) and snaps the truncation to a char boundary via str::floor_char_boundary. Affects every caller (safenames, applydp, apply, fetch, python); see the BREAKING note in Changed.
  • apply: malformed CSV, init bugs, doc typos, perf #3741
  • blake3: check-mode interop fixes, parser polish #3782
  • cat: --no-headers fix #3750
  • clipboard: surface error details, use ? for clipboard ops #3783
  • color: drop dead clones, mark in-memory #3785
  • config: HumanCount overflow, env-var unwrap panics, sniff UTF-8 panic #3770
  • count: quoting fix, stdin temp-file leak #3751
  • datefmt: strict tz/flag validation #3753
  • dedup: edge-case fixes #3754
  • diff: surface builder error, dedupe index/colname parsing #3784
  • edit: --in-place, bounds checks, silent no-ops #3786
  • explode: validate separator and column selection #3787
  • extdedup: key-collision and dupes-writer issues #3759
  • extsort: CRLF off-by-one fix (lineβ†’record) #3790; error propagation, temp-file handle release #3789
  • fetch/fetchpost: cache, panic, safety fixes #3747
  • fixlengths: --remove-empty crash on flexible rows; widen insert arithmetic #3764
  • fmt: --no-final-newline bugs, tempfile leak #3767
  • foreach: dry-run truncation, panic, multi-column drop, child-failure propagation #3757
  • geocode: remove latent panics, fix FIPS JSON shape, perf #3739
  • geoconvert: lat/lon swap, tempfile leak, UTF-8 panic #3768
  • input: panic, validation, clarity fixes #3791
  • join: unify key transform; fix silent --keys-output drop #3769
  • joinp: correctness, validation, schema-handling #3731
  • json: preserve BigInt precision, surface jaq runtime errors #3794; clarify --jaq numeric precision in USAGE; defer BigInt to_string allocation #3795
  • jsonl: honor --ignore-errors for header inference; fix line-number reporting #3796
  • lens: fix --streaming-stdin; reject invalid --wrap-mode #3797
  • lookup: harden cache, download errors, CKAN auth handling #3803
  • luau: correctness and consistency #3742
  • moarstats: Atkinson re-population bug; harden test coverage #3799
  • partition: UTF-8 panic, O(N) collision check #3771
  • pivotp: correctness, clarity, cleanup #3732
  • pragmastat: Windows backup path; suppress meaningless date ratios #3805
  • prompt: stream file I/O; avoid unnecessary clone #3798
  • pseudo: reject --increment 0; preserve last-valid-counter row on overflow #3792
  • py: hoist Python module setup; jagged-row panic #3758
  • reverse: avoid u64 underflow on indexed reverse #3808
  • sample: streaming bernoulli header bug; dead retry loop; cluster pre-alloc #3774; add integration tests for streaming Bernoulli URL path #3775
  • schema: correctness, panic-safety #3746
  • scoresql: USING panic, string-literal handling, filter heuristic + tests #3810
  • select: review-driven cleanup, fix / panic, quoted-name CSV-escape, empty-name silent fall-through #3772; --sort round-trip (quotes, non-UTF-8, dup names) #3773
  • slice: panic and underflow fixes #3748
  • snappy: preserve validate error; guard decompress ratio on stdin #3809
  • sort: --numeric --natural --unique consistency #3755
  • split: correctness fixes, error propagation, tests #3780; Windows --filter quoted-arg corruption fix via raw_arg #3788
  • sqlp: word-boundary alias replacement #3730
  • stats: cache & boolean-pattern fixes #3744; close cache short-circuit gaps for select/round/typesonly/infer-boolean #3800
  • tojsonl: guard non-finite Number; hoist header escaping; drop unused unused_assignments allows #3807
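The excel float-to-i64 guard fixed above is worth a standalone illustration, since the trap is subtle: i64::MAX is not representable in f64, so i64::MAX as f64 rounds up to 2^63 and an exclusive > comparison lets 2^63 through. This sketch reproduces the bug and the corrected check from the notes:

```rust
// The corrected lossless float-to-i64 check: strict `<` against
// i64::MAX as f64 (which rounds up to 2^63) excludes the boundary value
// that the old `val > i64::MAX as f64` guard let slip through.
fn exact_i64(val: f64) -> Option<i64> {
    if val.is_finite()
        && val >= i64::MIN as f64
        && val < i64::MAX as f64 // i.e. val < 2^63
        && val.fract() == 0.0
    {
        Some(val as i64)
    } else {
        None
    }
}

fn main() {
    let two_pow_63 = 2f64.powi(63);
    // The old guard: 2^63 > (i64::MAX as f64) is false because the cast
    // rounds i64::MAX up to exactly 2^63...
    assert!(!(two_pow_63 > i64::MAX as f64));
    // ...so the value slipped through and the saturating cast emitted
    // 9223372036854775807, off by one.
    assert_eq!(two_pow_63 as i64, i64::MAX);
    // The new guard rejects it; in-range integral floats still convert.
    assert_eq!(exact_i64(two_pow_63), None);
    assert_eq!(exact_i64(42.0), Some(42));
    assert_eq!(exact_i64(1.5), None);
    assert_eq!(exact_i64(f64::NAN), None);
}
```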

Removed

  • BREAKING headers: removed --intersect flag (use --union instead) #3763

Dependencies

  • Bump polars to 0.53.0 (multiple bumps; latest tracks py-1.40.1)
  • Bump mlua to 0.12.0-rc.1; Luau from 709 to 716
  • Bump zip from 7 to 8
  • Switch csv crate to qsv-tuned fork (replaces ryu with zmij)
  • Bump jsonschema from 0.46.1 to 0.46.4 #3723 #3778 #3804
  • Bump mimalloc from 0.1.49 to 0.1.50 #3729
  • Bump rayon from 1.11.0 to 1.12.0 #3710
  • Bump tokio from 1.51.1 to 1.52.0 #3712
  • Bump libc from 0.2.184 to 0.2.186 #3709 #3736
  • Bump reqwest from 0.13.2 to 0.13.3 #3766
  • Bump blake3 from 1.8.4 to 1.8.5 #3738
  • Bump magika from 1.0.1 to 1.1.0 #3737
  • Bump robinraju/release-downloader from 1.12 to 1.13 #3726
  • Bump qsv-stats from 0.49.0 to 0.50.0 #3727
  • Bump qsv_docopt from 1.9.0 to 1.10.0 #3724
  • Update self_update to latest upstream (qsv PR merged)
  • Update geosuggest to 0.8.3
  • Update csvs_convert
  • Removed kiddo patch fork now that 0.5.3 is released with our PR merged

Full Changelog: 19.1.0...20.0.0
