jqnatividad/qsv 0.131.0 on GitHub

Highlights

Refactored frequency to make it smarter and faster.
frequency's core algorithm essentially compiles an in-memory hashmap to determine the frequency of each unique value for each column. It does this using multi-threaded, multi-I/O techniques to make it blazing fast.
However, for columns with ALL unique values (e.g. ID columns), this takes a comparatively long time and consumes a lot of memory as it essentially compiles a hashmap of the ENTIRE column, with a hashmap entry for each column value with a count of 1.
Now, with the new --stats-mode option (enabled by default), frequency can compile frequencies in a more intelligent way by looking up each column's cardinality in the stats cache.
If the cardinality of a column is equal to the CSV's rowcount (indicating a column with ALL unique values), it short-circuits frequency calculations for that column - dramatically reducing the time and memory requirements for ID columns as it eliminates the need to maintain a hashmap for them.
Practically speaking, this lets frequency handle "real-world" datasets of any size.
To ensure frequency is as fast as possible, be sure to index and compute stats for your datasets beforehand.
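
The short-circuit described above can be sketched roughly as follows. This is a minimal Python illustration of the idea, not qsv's actual Rust implementation: the `frequency` function, its `row_count` and `cardinalities` parameters (standing in for the index and the stats cache), and the `ALL_UNIQUE` placeholder are all hypothetical names chosen for this example.

```python
import csv
import io
from collections import Counter

# Hypothetical placeholder for a short-circuited column;
# not necessarily what qsv itself emits.
ALL_UNIQUE = "<all unique>"


def frequency(csv_text, row_count=None, cardinalities=None):
    """Build a frequency table per column, skipping all-unique columns.

    row_count and cardinalities ({column: unique value count}) stand in
    for what a real tool would read from an index and a stats cache.
    Without them, every column gets a full hashmap.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    cardinalities = cardinalities or {}

    # Columns whose cardinality equals the row count hold only unique
    # values, so no hashmap is built for them.
    skip = {
        name for name in header
        if row_count and cardinalities.get(name) == row_count
    }
    counters = {name: Counter() for name in header if name not in skip}

    # Stream the rows once, counting only the columns we still track.
    for row in reader:
        for name, value in zip(header, row):
            if name not in skip:
                counters[name][value] += 1

    # Short-circuited columns get a single summary entry.
    return {
        name: Counter({ALL_UNIQUE: row_count}) if name in skip else counters[name]
        for name in header
    }


if __name__ == "__main__":
    data = "id,color\n1,red\n2,blue\n3,red\n"
    tables = frequency(data, row_count=3, cardinalities={"id": 3, "color": 2})
    for column, table in tables.items():
        print(column, table.most_common())
```

In this sketch the `id` column never gets a per-value entry: it costs one summary record instead of one hashmap entry per row, which is where the time and memory savings would come from.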
Setting the stage for Datapusher+ v1 and...
The "itches we've been scratching" over the past few months have been informed by our work with several clients towards the release of Datapusher+ 1.0 and qsv pro 1.0 (more info below) - both targeted for release this month.
DP+ is our third-gen, high-speed data ingestion/registration tool for CKAN that uses qsv as its data wrangling/analysis engine. It will enable us to reinvent the way data is ingested into CKAN - with exponentially faster data ingestion, metadata inferencing, data validation, computed metadata fields, and more!
We're particularly excited about how qsv will allow us to compute and infer high-quality metadata for datasets (with a focus on inferring optional recommended DCAT-US v3 metadata fields) in "near real-time", while dataset publishers are still entering metadata. This will be a game-changer for CKAN administrators and data publishers!
...qsv pro 1.0
qsv pro is datHere's enterprise-grade data wrangling/curation workbench that’s planned for v1.0 release this month.
Building the core functionality of qsv pro's Workflow feature is one of the primary reasons for a v1.0 release.
We feel qsv pro may be a game-changer for data wranglers and data curators who work with spreadsheets and large datasets. It lets them view statistical data and metadata while also performing complex data wrangling operations in a user-friendly way, without having to write code.

Added

docs: added Shell Completion section 556a2ff
docs: add 🪄 emoji in legend to indicate "automagical" commands 2753c90
Add building deb package (WIP) by @tino097 in #2029
Added GitHub workflow to test debian package (WIP) by @tino097 in #2032
tests: added false positive to _typos.toml configuration d576af2
added more benchmarks
added more tests

Changed

fetch & fetchpost: remove expired diskcache entries on startup 9b6ab5d
frequency: smarter frequency compilation with new --stats-mode option #2030
json: refactored for maintainability & performance 62e9216 and 4e44b18
improved self-update messages 5c874e0 and 0aa0b13
contrib(completions): frequency updates & remove bashly/fish by @rzmk in #2031
Debian package update by @tino097 in #2017
publish: optimized enabled CPU features when building release binaries in all GitHub Actions "publishing" workflows
publish: ensure latest Python patch release is used when building qsvpy binary variants 2ab03a0 and ec6f486
tests: also enabled CPU features in CI tests
docs: wordsmith qsv "elevator pitch" cc47fe6
docs: point to https://100.dathere.com in Whirlwind tour fc49aef
deps: bump polars to latest upstream post py-1.4.1 release at the time of this release
build(deps): bump bytes from 1.6.1 to 1.7.0 by @dependabot in #2018
build(deps): bump bytes from 1.7.0 to 1.7.1 by @dependabot in #2021
build(deps): bump flate2 from 1.0.30 to 1.0.31 by @dependabot in #2027
build(deps): bump indexmap from 2.2.6 to 2.3.0 by @dependabot in #2020
build(deps): bump jaq-parse from 1.0.2 to 1.0.3 by @dependabot in #2016
build(deps): bump redis from 0.26.0 to 0.26.1 by @dependabot in #2023
build(deps): bump regex from 1.10.5 to 1.10.6 by @dependabot in #2025
build(deps): bump serde_json from 1.0.121 to 1.0.122 by @dependabot in #2022
build(deps): bump sysinfo from 0.30.13 to 0.31.0 by @dependabot in #2019
build(deps): bump sysinfo from 0.31.0 to 0.31.2 by @dependabot in #2024
build(deps): bump tempfile from 3.11.0 to 3.12.0 by @dependabot in #2033
build(deps): bump serde from 1.0.204 to 1.0.205 by @dependabot in #2036
apply select clippy suggestions
updated several indirect dependencies
made various usage text improvements
bumped MSRV to 1.80.1

Fixed

sqlp & joinp: fixed .ssv.sz output auto-compression support 5397f6c & d86ba63
docs: fix link by @uncenter in #2026
tests: correct misnamed test 8ae6000
tests: fix flaky reverse property tests d86ba63

Removed

docs: "Quicksilver" is the name of the logo horse, not how you pronounce "qsv" e4551ae

New Contributors

@uncenter made their first contribution in #2026

Full Changelog: 0.130.0...0.131.0