[20.1.0] - 2026-05-17 🤖 The "Synthetic Data" Release 🎲
This minor release lands a new top-level command, deepens AI/LLM-assisted dictionary inference in describegpt, expands the opt-in Apache DataSketches estimators in stats/frequency that began in 20.0.0, and sweeps in a long tail of correctness and big-endian fixes. No breaking changes — pipelines built against 20.0.0 should upgrade in place.
Headline
- NEW `synthesize` command: generates statistically-faithful synthetic CSVs from a source file. Runs `stats` + `frequency` + `count` on the source, then emits N rows that reproduce per-column attributes:
  - Categorical / low-cardinality columns are reproduced by frequency-weighted sampling of the real value set (cardinality, weights, and repetition structure preserved exactly).
  - Numeric and date/datetime columns use quartile-bucketed generation, so the shape of the distribution is preserved, not just `[min, max]`.
  - Null ratios are reproduced per column.
- `--seed` makes output fully reproducible (single master `StdRng` threads through both selection logic and every faker call).
- `--dictionary <file>` layers in semantic Content Types from `describegpt --dictionary --infer-content-type`. Each token maps to a `fake-rs` faker (47-token vocabulary covering names, emails, addresses, UUIDs, license plates, IPv4/IPv6, etc.; 44 are faker-mapped, and 3 (`category`, `unique_id`, `unknown`) fall back to enumeration/frequency-based generation). Bounded-cardinality faker columns sample from a fixed pre-generated pool of distinct fake values (the `--consistent-fakes` mechanism, so a given logical value maps consistently).
- `--infer-content-type` runs `describegpt` internally to build the dictionary on the fly (needs `QSV_LLM_APIKEY`).
- `--locale` selects from 14 `fake-rs` locales for region-aware faker output.
- Gated behind the new `synthesize` feature flag; wired into the `qsv` and `qsvmcp` binaries (not `lite`, not `datapusher_plus`). Cross-column correlation is explicitly out of scope for v1.
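The three per-column techniques above (frequency-weighted sampling, quartile-bucketed generation, null-ratio reproduction) can be illustrated with a minimal Python sketch. This is not the `synthesize` implementation, which is written in Rust. Every function name and the sample statistics below are hypothetical, and weighted random draws only approximate the source frequencies rather than preserving them exactly.

```python
import random

def sample_categorical(freqs, n, rng):
    """Draw n values with probability proportional to their observed counts."""
    values = list(freqs)
    weights = [freqs[v] for v in values]
    return rng.choices(values, weights=weights, k=n)

def sample_quartile_bucketed(five_num, n, rng):
    """five_num = (min, q1, median, q3, max).

    Each of the four buckets holds about a quarter of the source rows,
    so pick a bucket uniformly, then draw uniformly inside it.
    """
    buckets = list(zip(five_num[:-1], five_num[1:]))
    out = []
    for _ in range(n):
        lo, hi = rng.choice(buckets)
        out.append(rng.uniform(lo, hi))
    return out

def apply_null_ratio(values, null_ratio, rng):
    """Blank out each value with probability null_ratio."""
    return [None if rng.random() < null_ratio else v for v in values]

def synthesize(n, seed):
    # One seeded generator drives every random decision, which is what
    # makes the output reproducible for a given seed.
    rng = random.Random(seed)
    # Hypothetical source statistics for two columns.
    colors = sample_categorical({"red": 70, "green": 25, "blue": 5}, n, rng)
    prices = sample_quartile_bucketed((1.0, 4.0, 9.0, 20.0, 500.0), n, rng)
    prices = apply_null_ratio(prices, 0.1, rng)
    return list(zip(colors, prices))
```

Because half of the generated prices fall between the minimum and the median, a long right tail in the source does not flatten into a uniform spread over the full range, as it would with plain `[min, max]` sampling.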
Added
- `synthesize`: new top-level command (see Headline) #3854
- `synthesize`: `--consistent-fakes` for stable source→fake mapping #3865
- `synthesize`: `--locale` option for 14 fake-rs locales #3860
- `describegpt`: `--two-pass` cross-field Data Dictionary refinement #3863
- `describegpt`: deterministic `unique_id` Content Type for ALL_UNIQUE fields #3862
- `describegpt`, `synthesize`: infer Content Type for temporal fields with LLM-hinted duration cap #3861
- `describegpt`, `synthesize`: 5 new Content Type tokens: `street_name`, `license_plate`, `industry`, `profession`, `ipv6_address`
- `describegpt`: `--markdown-template` for customizable Markdown output #3834
- `pivotp`: `--agg quantile@<p>` (alias `q@<p>`) with linear interpolation #3842
- `stats`/`frequency`: opt-in Apache DataSketches modes (HLL cardinality, Frequent Items top-K) #3840
- `stats`: widened BLAKE3 fingerprint to cover all streaming stats #3824
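For readers unfamiliar with the "linear interpolation" wording in the `pivotp` quantile entry, here is a minimal Python sketch of one common definition (the same one NumPy uses for its default `linear` method). It assumes `p` is a fraction in `[0, 1]`. The function name is hypothetical, and the sketch makes no claim about how `pivotp` or Polars compute the aggregate internally.

```python
import math

def quantile_linear(values, p):
    """Quantile p of values, interpolating linearly between neighbors."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(values)
    # Fractional position of the quantile in the sorted data.
    pos = p * (len(xs) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    # Blend the two neighboring order statistics.
    return xs[lo] + (xs[hi] - xs[lo]) * frac
```

For example, `quantile_linear([1, 2, 3, 4], 0.5)` lands halfway between the second and third sorted values and returns `2.5`.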
Changed
- `stats`/`frequency`: auto-enable Apache DataSketches estimators (t-digest + HyperLogLog for `stats`; Misra-Gries Frequent Items for `frequency`) when `util::mem_file_check` reports OOM, in addition to the existing auto-index fallback. A `wwarn!` is emitted listing the auto-enabled estimators; explicit `--quantile-method exact` / `--cardinality-method exact` / `--sketch-method exact` still suppresses the auto-enable #3843
- `stats`: three opt-in micro-optimizations (simdutf8 output, t-digest quantiles, mode-cardinality cap) #3839
- `synthesize`: use string-length stats for unstructured text columns #3864
- `describegpt`: inline `{{ dictionary }}` in default description/tags prompts; skip redundant chat-message dictionary injection when the template already inlines it
- `synthesize`: handle both describegpt-wrapped and raw dictionary JSON
- `refactor`: adopt Rust 1.95 `cfg_select!` macro at platform-conditional sites #3846
- `perf`: promote `bytes_to_cow_str` helper to `util` and sweep callsites
- `perf(moarstats)`: hint rare branches with `core::hint::cold_path()` #3823
- `perf(stats)`: mark non-UTF-8 branch cold
- `perf(frequency)`: hint UTF-8 failure as cold in the ignore-case hot loop #3821
- `refactor(stats)`: shrink and tidy `WhichStats` #3822
- `refactor(publish)`: fetch tags and enforce SemVer for debian package releases
- `refactor(benchmarks)`: harden `benchmarks.sh` error handling and cross-platform support #3814
- `deps`: bump polars (latest upstream), calamine 0.34→0.35, csvlens fork with bumped arrow, sysinfo 0.38.4→0.39.2, rust_decimal 1.41→1.42, tokio 1.52.1→1.52.3, filetime 0.2.27→0.2.29, jsonschema 0.46.4→0.46.5, rand_xoshiro 0.8.0→0.8.1, redis 1.2.0→1.2.1, qsv-dateparser 0.14→0.15 (adds support for ISO 8601 `T`-separated datetimes without a timezone suffix, e.g. `2020-01-15T08:00:00`, the form produced by Python's `datetime.isoformat()` without `astimezone()`; previously misclassified by `qsv stats --infer-dates` as `String`)
- assorted clippy cleanups across `stats`, `frequency`, `pivotp`, `partition`
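The datetime form mentioned in the qsv-dateparser bump is easy to reproduce with Python's standard library. The short sketch below only shows what the two string shapes look like; it does not exercise qsv itself, and the variable names are illustrative.

```python
from datetime import datetime, timezone

# A naive datetime carries no timezone, so isoformat() emits no suffix.
naive = datetime(2020, 1, 15, 8, 0, 0)
naive_iso = naive.isoformat()

# A timezone-aware datetime, normalized with astimezone(), gains a UTC offset.
aware = datetime(2020, 1, 15, 8, 0, 0, tzinfo=timezone.utc)
aware_iso = aware.astimezone(timezone.utc).isoformat()

print(naive_iso)  # 2020-01-15T08:00:00
print(aware_iso)  # 2020-01-15T08:00:00+00:00
```

The first string, with no offset, is the shape that this entry says was previously inferred as `String`.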
Fixed
- `stats`: preserve length & lex stats when column type widens to String #3856
- `stats`: remove duplicate big-endian `TDigestStub`/`HllSketchStub` defs #3857
- `stats`: restore big-endian build by giving slot fallbacks an accessible `.0` #3850
- `stats`/`frequency`: gate Apache DataSketches behind little-endian targets #3847
- `apply`/`applydp`: thousands negative fractions; scope `<NULL>` to `regex_replace` #3845
- `moarstats`: retry on stats coverage mismatch + fsync joined CSV parent dir #3838
- `moarstats`: close fsync race that silently dropped joined columns on macOS #3830
- `util`: open subprocess output with write access for fsync (Windows) #3831
- `qsvdp`: only list commands actually compiled into the binary #3816 #3819
- `help-md-gen`: infer real argument type for Options "Type" column #3858 #3859
- `synthesize`: Date/DateTime columns now always use `build_date()` with the source's real min/max bounds. Previously, if the LLM (incorrectly) tagged a date column with a faker-mapped `content_type` like `time` (which is time-of-day, e.g. `14:30:45`) or any other temporal token, either faker branch in `ColumnGenerator::build()` could fire before the type-based match and emit a time-of-day string for a date column. The fix suppresses any LLM-emitted `content_type` for Date/DateTime columns at function entry, so both faker branches fall through to the type-based fallback (regression test in `tests/test_synthesize.rs::synthesize_date_column_ignores_time_content_type`)
- `test(fetch,sample)`: bind to ephemeral port to fix flaky macOS CI #3827
- `test(moarstats)`: serially execute some flaky CI tests; add missing `serial_test::serial` import
- `test(stats)`: fix s390x big-endian quantile-method rejection test
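The control-flow change behind the `synthesize` date fix can be pictured with a small Python sketch. It is a simplified, hypothetical stand-in for `ColumnGenerator::build()`: the function name, the returned labels, and the three-token set are assumptions for illustration, and the real Rust code has more branches.

```python
# Hypothetical subset of faker-mapped Content Type tokens.
FAKER_TOKENS = {"time", "license_plate", "street_name"}

def choose_generator(col_type, content_type=None):
    """Return a label naming the generator that would be used for a column."""
    # The fix: discard any LLM-supplied content type for temporal columns
    # up front, so no faker branch below can claim them.
    if col_type in ("Date", "DateTime"):
        content_type = None

    # Faker branch: fires only when a faker-mapped token survives.
    if content_type in FAKER_TOKENS:
        return f"faker:{content_type}"

    # Type-based fallback.
    if col_type in ("Date", "DateTime"):
        return "build_date"
    return "type_default"
```

With the guard in place, `choose_generator("Date", "time")` returns `"build_date"`. Without it, the faker branch would fire first and a date column would receive time-of-day strings.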
Docs
- AI Policy section added to README.md and CONTRIBUTING.md, with cross-links and contributor attribution guidance
- `docs-drift` CI check added; audit-detected drift fixed #3868
- README emoji legend audited and normalized; help docs regenerated #3832
- "Processing Very Large Files" guidance added; large-file recipe inline-comment fixes
- `stats`: explicit Count Reference tables for 47 `stats` / 55 `moarstats` measures; count conventions clarified
- `STATS_DEFINITIONS` audit; `stats` DEVELOPER NOTE wordsmithed
- `features`: corrected `self_update` probability + nightly sub-features documentation #3833
- Test count updated to the verified exact total (3,094)
Detailed MCP Server and Cowork Plugin changes are documented in the MCP Server/Cowork Plugin CHANGELOG.