github dathere/qsv 2.0.0

3 days ago

qsv v2.0.0 is here! 🎉

It took 193 releases to get to v1.0.0, and we're already at v2.0.0 a month later!?!

Yes! We wanted a running start for 2025, and qsv 2.0.0 marks qsv's biggest release yet!

  • It fully enables the "Data Resource Upload First (DRUF)" workflow, allowing Datapusher+ to infer "automagical metadata" from the data itself. It exposes two Domain Specific Language (DSL) options - Luau and MiniJinja - to enable powerful data transformation and validation capabilities. This allows data stewards to upload data first, then use qsv's DSL capabilities inside DP+ to automatically generate rich metadata - including data dictionaries, field descriptions, data quality rules, and data validation schemas. This "automagical metadata" approach dramatically reduces the friction in compiling high-quality, high-resolution metadata (using the DCAT-US 3.0 specification as a reference) that would otherwise be a manual, laborious, and error-prone process.
    Under the hood, the fetchpost, template, stats, validate and luau commands now have the necessary scaffolding to fully support this workflow inside Datapusher+ and ckanext-scheming.
  • It adds a new "smart" pivotp command, powered by Polars, to enable fast pivot operations on large datasets. It's "smart" as it uses the stats cache to automatically suggest an aggregation based on a column's data type and summary statistics. You can now pivot your data in seconds by simply specifying the columns to pivot on while blowing past Excel's PivotTable limitations.
  • stats now computes geometric mean and harmonic mean and adds string length stats, all while getting a performance boost.
  • join and joinp got a lot of love in this release, with several new options:
    • joinp: non-equi join support! 🎉💯🥳
      See the "Lightning Fast and Space Efficient Inequality Joins" paper and the Polars non-equi join tracking issue.
    • join & joinp: --right-anti and --right-semi joins
    • joinp: --ignore-leading-zeros option for join keys
    • joinp: --maintain-order option to maintain the order of either the left or right dataset in the output
    • joinp: expanded --cache-schema options to make joinp smarter/faster by leveraging the stats cache
    • join: --keys-output option to write successfully joined keys to a separate output file.
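
To make the new join modes concrete, here is a minimal conceptual sketch in plain Python. It is not qsv's implementation and does not use qsv's or Polars' APIs: the function names, the sample rows, and the exact leading-zero normalization rule are all assumptions made for illustration. It models a right-semi join (rows of the right dataset that have a match in the left), a right-anti join (rows of the right dataset with no match in the left), key comparison that ignores leading zeros, and a naive non-equi join on an arbitrary predicate.

```python
# Conceptual sketch only: a plain-Python model of the join semantics,
# NOT how qsv or Polars implement them. All names here are hypothetical.

def strip_leading_zeros(key: str) -> str:
    """Normalize a key so that '007' and '7' compare equal (assumed rule)."""
    stripped = key.lstrip("0")
    return stripped if stripped else "0"

def identity(key: str) -> str:
    """Default normalizer: compare keys exactly as written."""
    return key

def right_semi(left, right, key, normalize=identity):
    """Rows of `right` whose key matches at least one row of `left`."""
    left_keys = {normalize(row[key]) for row in left}
    return [row for row in right if normalize(row[key]) in left_keys]

def right_anti(left, right, key, normalize=identity):
    """Rows of `right` whose key matches no row of `left`."""
    left_keys = {normalize(row[key]) for row in left}
    return [row for row in right if normalize(row[key]) not in left_keys]

def non_equi_join(left, right, predicate):
    """Naive O(n*m) join that pairs rows whenever `predicate` holds.

    Real engines use far faster algorithms; this only shows the semantics.
    """
    return [{**l, **r} for l in left for r in right if predicate(l, r)]

# Hypothetical sample data.
left = [{"id": "007", "name": "a"}, {"id": "12", "name": "b"}]
right = [{"id": "7", "city": "x"}, {"id": "99", "city": "y"}]

exact_semi = right_semi(left, right, "id")
loose_semi = right_semi(left, right, "id", normalize=strip_leading_zeros)
loose_anti = right_anti(left, right, "id", normalize=strip_leading_zeros)

readings = [{"sensor": "s1", "value": 5}, {"sensor": "s2", "value": 15}]
bands = [
    {"band": "low", "lo": 0, "hi": 10},
    {"band": "high", "lo": 10, "hi": 20},
]
# Join each reading to the band whose half-open range [lo, hi) contains it.
banded = non_equi_join(
    readings, bands, lambda r, b: b["lo"] <= r["value"] < b["hi"]
)
```

With exact key comparison, "007" and "7" do not match, so the semi join comes back empty. Once leading zeros are ignored, the same pair matches and only the "99" row remains in the anti join.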

This release lays the groundwork for the outliers "smart" command to quickly identify outliers using stats/frequency info.

It also sets the stage for an initial implementation of our "Data Concierge". The Concierge leverages all the high-quality, high-res metadata we automagically compile with DRUF, enabling Metadata Gardening Agents to proactively link seemingly unrelated data and glean insights while constantly grooming the Data Catalog - effectively making it a FAIR Data Factory.


Added

  • fetchpost: add --globals-json option #2357
  • fixlengths: add --remove-empty option; refactored for performance. Fulfills #2391. #2411
  • join: add --keys-output option. Fulfills #2407. #2408
  • join: add --right-anti and --right-semi options. Fulfills #2379. #2380
  • joinp: add non-equi join support! 🎉💯🥳 #2409
  • joinp: add --ignore-leading-zeros option. Fulfills #2398. #2400
  • joinp: add --maintain-order option #2338
  • joinp: add --right-anti and --right-semi options. Fulfills #2377. #2378
  • luau: additional helper functions. Fulfills #1782. #2362
  • luau: add qsv_writejson helper #2375
  • pivotp: new Polars-powered command. Fulfills #799. #2364
  • pivotp: "smart" pivotp. #2367
  • stats: add geometric mean and harmonic mean. Fulfills #2227. #2342
  • stats: add string length stats to set stage for upcoming outliers "smart" command to quickly identify outliers using stats/frequency info #2390
  • template: add --globals-json option #2356
  • tojsonl: add --quiet option. Fulfills #2335. #2336
  • validate: add --validate-schema option to check if the JSON Schema itself is valid #2393
  • contrib(completions): add joinp --ignore-case and slice --invert by @rzmk in #2322
  • contrib(completions): add --quiet to tojsonl by @rzmk in #2337
  • ci: add qsv_glibc_2.31-headless to action by @rzmk in #2330
  • Add license to MSI installer by @rzmk in #2321
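
For readers unfamiliar with the two means that stats now reports, the following minimal sketch computes them from their textbook definitions and cross-checks the results against Python's standard `statistics` module. It only illustrates the math; it says nothing about how qsv computes or names these statistics, and the sample values are made up.

```python
import math
import statistics

# Hypothetical sample column of strictly positive numbers.
values = [1.0, 2.0, 4.0, 8.0]
n = len(values)

# Arithmetic mean: sum of the values divided by their count.
arithmetic = sum(values) / n

# Geometric mean: n-th root of the product, computed via logs for stability.
geometric = math.exp(sum(math.log(v) for v in values) / n)

# Harmonic mean: count divided by the sum of reciprocals.
harmonic = n / sum(1.0 / v for v in values)

# Cross-check the hand-rolled versions against the standard library.
geometric_std = statistics.geometric_mean(values)
harmonic_std = statistics.harmonic_mean(values)

print(f"arithmetic={arithmetic:.4f} geometric={geometric:.4f} harmonic={harmonic:.4f}")
```

For positive data the three means are always ordered arithmetic >= geometric >= harmonic. The geometric mean suits multiplicative quantities such as growth rates, and the harmonic mean suits rates and ratios.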

Changed

  • lens: optimized csvlens library usage, dropping clap dependency #2403
  • pivotp: an even smarter pivotp #2368
  • stats: performance boost 51349ba
  • Update deb package by @tino097 in #2226
  • ci: attempt using files-folder instead of files by @rzmk in #2320
  • Setting QSV_FREEMEMORY_HEADROOM_PCT to 0 disables memory availability check #2353
  • build(deps): bump actix-governor from 0.7.0 to 0.8.0 by @dependabot in #2351
  • build(deps): bump bytemuck from 1.20.0 to 1.21.0 by @dependabot in #2361
  • build(deps): bump chrono from 0.4.38 to 0.4.39 by @dependabot in #2345
  • build(deps): bump crossbeam-channel from 0.5.13 to 0.5.14 by @dependabot in #2354
  • build(deps): bump flexi_logger from 0.29.6 to 0.29.7 by @dependabot in #2348
  • build(deps): bump governor from 0.7.0 to 0.8.0 by @dependabot in #2347
  • build(deps): bump itertools from 0.13.0 to 0.14.0 by @dependabot in #2413
  • build(deps): bump jsonschema from 0.26.1 to 0.26.2 by @dependabot in #2355
  • build(deps): bump jsonschema from 0.26.2 to 0.27.0 by @dependabot in #2371
  • build(deps): bump jsonschema from 0.27.1 to 0.28.0 by @dependabot in #2389
  • build(deps): bump jsonschema from 0.28.0 to 0.28.1 by @dependabot in #2396
  • bump polars from 0.44.2 to 0.45 #2340
  • build(deps): bump polars from 0.45.0 to 0.45.1 by @dependabot in #2344
  • bump pyo3 from 0.22 to 0.23 now that Polars supports it #2352
  • build(deps): bump redis from 0.27.5 to 0.27.6 by @dependabot in #2331
  • build(deps): bump reqwest from 0.12.9 to 0.12.11 by @dependabot in #2385
  • build(deps): bump reqwest from 0.12.11 to 0.12.12 by @dependabot in #2395
  • build(deps): bump rfd from 0.15.1 to 0.15.2 by @dependabot in #2404
  • build(deps): bump serde from 1.0.215 to 1.0.216 by @dependabot in #2349
  • build(deps): bump serde from 1.0.216 to 1.0.217 by @dependabot in #2384
  • build(deps): bump serde_json from 1.0.133 to 1.0.134 by @dependabot in #2365
  • build(deps): bump sysinfo from 0.32.1 to 0.33.0 by @dependabot in #2334
  • build(deps): bump sysinfo from 0.33.0 to 0.33.1 by @dependabot in #2383
  • deps: bump tabwriter to 1.4.1 bbcbeba
  • build(deps): bump tokio from 1.41.1 to 1.42.0 by @dependabot in #2333
  • build(deps): bump xxhash-rust from 0.8.12 to 0.8.13 by @dependabot in #2359
  • build(deps): bump xxhash-rust from 0.8.13 to 0.8.14 by @dependabot in #2372
  • build(deps): bump xxhash-rust from 0.8.14 to 0.8.15 by @dependabot in #2392
  • apply several clippy suggestions
  • bumped numerous indirect dependencies to latest versions
  • bumped Rust nightly from 2024-11-28 to 2024-12-19 (same version used by Polars)

Fixed

  • joinp: refactor --cache-schema option. Resolves #2369. #2370
  • extsort underflow in CSV mode. Resolves #2391. #2412
  • instantiate logger properly 9c0c1a7
  • fix util::get_stats_records() to no longer infer boolean in StatsMode::PolarsSchema. Resolves #2369. cebb664

Full Changelog: 1.0.0...2.0.0
