Highlights
- Refactored `frequency` to make it smarter and faster.
`frequency`'s core algorithm essentially compiles an in-memory hashmap to determine the frequency of each unique value for each column. It does this using multi-threaded, multi-I/O techniques to make it blazing fast.
However, for columns with ALL unique values (e.g. ID columns), this takes a comparatively long time and consumes a lot of memory as it essentially compiles a hashmap of the ENTIRE column, with a hashmap entry for each column value with a count of 1.
Now, with the new `--stats-mode` option (enabled by default), `frequency` can compile the dataset in a more intelligent way by looking up a column's cardinality in the stats cache.
If the cardinality of a column is equal to the CSV's rowcount (indicating a column with ALL unique values), it short-circuits frequency calculations for that column - dramatically reducing the time and memory requirements for the ID column as it eliminates the need to maintain a hashmap for it.
Practically speaking, this makes `frequency` able to handle "real-world" datasets of any size.
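The short-circuit described above can be sketched in a few lines of Python. This is only an illustration of the idea, not qsv's actual Rust implementation: the function name `frequency_tables`, the `cardinality_cache` dict (standing in for the stats cache), and the `<ALL_UNIQUE>` placeholder row are all assumptions made for this example.

```python
from collections import Counter

ALL_UNIQUE = "<ALL_UNIQUE>"  # placeholder label, assumed for this sketch


def frequency_tables(rows, headers, cardinality_cache=None):
    """Build per-column frequency tables, skipping all-unique columns.

    cardinality_cache is a hypothetical dict of column name -> cardinality,
    standing in for the stats cache described above.
    """
    cardinality_cache = cardinality_cache or {}
    row_count = len(rows)

    # A column whose cached cardinality equals the row count is all-unique,
    # so there is no point maintaining a hashmap for it.
    skip = {
        h for h in headers
        if row_count > 0 and cardinality_cache.get(h) == row_count
    }

    counters = {h: Counter() for h in headers if h not in skip}
    for row in rows:
        for h, value in zip(headers, row):
            if h in counters:
                counters[h][value] += 1

    result = {}
    for h in headers:
        if h in skip:
            # Emit a single summary row instead of one entry per value.
            result[h] = [(ALL_UNIQUE, row_count)]
        else:
            result[h] = counters[h].most_common()
    return result


headers = ["id", "kind"]
rows = [["1", "a"], ["2", "b"], ["3", "a"]]
tables = frequency_tables(rows, headers, {"id": 3, "kind": 2})
print(tables["id"])    # single summary row for the all-unique column
print(tables["kind"])  # regular frequency table
```

Without a cache entry for a column, the sketch falls back to counting every value, which mirrors the memory cost for ID columns described above.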
To ensure `frequency` is as fast as possible, be sure to `index` and compute `stats` for your datasets beforehand.
- Setting the stage for Datapusher+ v1 and...
The "itches we've been scratching" the past few months have been informed by our work at several clients towards the release of Datapusher+ 1.0 and qsv pro 1.0 (more info below) - both targeted for release this month.
DP+ is our third-gen, high-speed data ingestion/registration tool for CKAN that uses qsv as its data wrangling/analysis engine. It will enable us to reinvent the way data is ingested into CKAN - with exponentially faster data ingestion, metadata inferencing, data validation, computed metadata fields, and more!
We're particularly excited about how qsv will allow us to compute and infer high-quality metadata for datasets (with a focus on inferring optional recommended DCAT-US v3 metadata fields) in "near real-time", while dataset publishers are still entering metadata. This will be a game-changer for CKAN administrators and data publishers!
- ...qsv pro 1.0
qsv pro is datHere's enterprise-grade data wrangling/curation workbench that’s planned for v1.0 release this month.
Building the core functionality of qsv pro's Workflow feature is one of the primary reasons for a v1.0 release.
We feel qsv pro may be a game-changer for data wranglers and data curators who need to work with spreadsheets and large datasets to view statistical data and metadata while also performing complex data wrangling operations in a user-friendly way without having to write code.
Added
- `docs`: added Shell Completion section 556a2ff
- `docs`: add 🪄 emoji in legend to indicate "automagical" commands 2753c90
- Add building deb package (WIP) by @tino097 in #2029
- Added GitHub workflow to test debian package (WIP) by @tino097 in #2032
- `tests`: added false positive to _typos.toml configuration d576af2
- added more benchmarks
- added more tests
Changed
- `fetch` & `fetchpost`: remove expired diskcache entries on startup 9b6ab5d
- `frequency`: smarter frequency compilation with new `--stats-mode` option #2030
- `json`: refactored for maintainability & performance 62e9216 and 4e44b18
- improved `self-update` messages 5c874e0 and 0aa0b13
- `contrib(completions)`: `frequency` updates & remove bashly/fish by @rzmk in #2031
- Debian package update by @tino097 in #2017
- `publish`: optimized enabled CPU features when building release binaries in all GitHub Actions "publishing" workflows
- `publish`: ensure latest Python patch release is used when building `qsvpy` binary variants 2ab03a0 and ec6f486
- `tests`: also enabled CPU features in CI tests
- `docs`: wordsmith qsv "elevator pitch" cc47fe6
- `docs`: point to https://100.dathere.com in Whirlwind tour fc49aef
- `deps`: bump polars to latest upstream post py-1.4.1 release at the time of this release
- build(deps): bump bytes from 1.6.1 to 1.7.0 by @dependabot in #2018
- build(deps): bump bytes from 1.7.0 to 1.7.1 by @dependabot in #2021
- build(deps): bump flate2 from 1.0.30 to 1.0.31 by @dependabot in #2027
- build(deps): bump indexmap from 2.2.6 to 2.3.0 by @dependabot in #2020
- build(deps): bump jaq-parse from 1.0.2 to 1.0.3 by @dependabot in #2016
- build(deps): bump redis from 0.26.0 to 0.26.1 by @dependabot in #2023
- build(deps): bump regex from 1.10.5 to 1.10.6 by @dependabot in #2025
- build(deps): bump serde_json from 1.0.121 to 1.0.122 by @dependabot in #2022
- build(deps): bump sysinfo from 0.30.13 to 0.31.0 by @dependabot in #2019
- build(deps): bump sysinfo from 0.31.0 to 0.31.2 by @dependabot in #2024
- build(deps): bump tempfile from 3.11.0 to 3.12.0 by @dependabot in #2033
- build(deps): bump serde from 1.0.204 to 1.0.205 by @dependabot in #2036
- apply select clippy suggestions
- updated several indirect dependencies
- made various usage text improvements
- bumped MSRV to 1.80.1
Fixed
- `sqlp` & `joinp`: fixed `.ssv.sz` output auto-compression support 5397f6c & d86ba63
- `docs`: fix link by @uncenter in #2026
- `tests`: correct misnamed test 8ae6000
- `tests`: fix flaky `reverse` property tests d86ba63
Removed
- `docs`: "Quicksilver" is the name of the logo horse, not how you pronounce "qsv" e4551ae
New Contributors
Full Changelog: 0.130.0...0.131.0