OPEN DATA DAY 2024 Release! 🎉🎉🎉
In celebration of Open Data Day, we're releasing qsv 0.123.0 - the biggest release ever with 330+ commits! qsv 0.123.0 continues to focus on performance, stability and reliability as we continue setting the stage for qsv's big brother - qsv pro.
We've been baking qsv pro for a while now, and it's almost ready for release. qsv pro is a cross-platform desktop data wrangling tool that marries an Excel-like UI with the power of qsv. It is backed by a cloud-based data cleaning, enrichment and enhancement service that's easy to use for casual Excel users and Data Publishers, yet powerful enough for data scientists and data engineers.
Stay tuned!
Highlights:
- sqlp now has automatic read_csv() fast path optimization, often making optimized queries run dramatically faster. For example, a non-trivial SQL aggregation on an 18-column, 657 MB CSV with 7.43 million rows that took 6.09 seconds now takes just 0.14 seconds with the optimization - 🚀 43.5x FASTER 🚀! [1]
# with fast path optimization turned off
/usr/bin/time qsv sqlp taxi.csv --no-optimizations "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
6.09 real 6.82 user 0.16 sys
# with fast path optimization, fully exploiting Polars' LazyFrame evaluation even for loading the CSV file!
/usr/bin/time qsv sqlp taxi.csv "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
0.14 real 1.09 user 0.09 sys
# in contrast, csvq takes 72.46 seconds - 517.57x slower
/usr/bin/time csvq "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
+----------+---------------------+
| VendorID | SUM(total_amount) |
+----------+---------------------+
| 1 | 52377417.529256366 |
| 2 | 89959869.1264675 |
| 4 | 600584.6099999828 |
+----------+---------------------+
72.46 real 65.15 user 75.17 sys
"Traditional" SQL engines
qsv and csvq both operate on "bare" CSVs. For comparison, let's contrast qsv's performance against "traditional" SQL engines that require setup and import (aka ETL). Not counting setup and import time (which alone takes several minutes), we get:
sqlite 3.43.2 takes 2.910 seconds - 20.79x slower
sqlite> .timer on
sqlite> select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID;
1,52377417.53
2,89959869.13
4,600584.61
Run Time: real 2.910 user 2.569494 sys 0.272972
PostgreSQL 15.6 using PgAdmin 4 v6.12 takes 18.527 seconds - 132.34x slower
even with an index, qsv sqlp is still 5.96x faster
- sqlp now supports JSONL output format and adds compression support for Avro and Arrow output formats.
- fetch now has a --disk-cache option, so you can cache web service responses to disk, complete with cache control and expiry handling!
- jsonl is now multithreaded with additional --batch and --job options.
- split now has three modes: split by record count, split by number of chunks and split by file size.
- datefmt is a new top-level command for date formatting. We extracted it from apply to make it easier to use, and to set the stage for expanded date and timezone handling.
- enum now has a --start option.
- excel now has a --keep-zero-time option and improved datetime/duration parsing/handling with the upgrade of calamine from 0.23 to 0.24.
- tojsonl now has --trim and --no-boolean options and eliminates false positive boolean inferences.
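The slowdown factors quoted in the sqlp benchmark above are simply each tool's wall-clock ("real") time divided by the 0.14 seconds of the optimized qsv run. The short Python sketch below reproduces that arithmetic from the timings in this release note; the `slowdown` helper and the dictionary labels are illustrative names, not part of qsv.

```python
# Wall-clock ("real") seconds copied from the timings quoted in this release note.
BASELINE = 0.14  # qsv sqlp with the fast path optimization enabled

timings = {
    "qsv sqlp --no-optimizations": 6.09,
    "csvq": 72.46,
    "sqlite 3.43.2": 2.910,
    "PostgreSQL 15.6": 18.527,
}


def slowdown(seconds: float, baseline: float = BASELINE) -> float:
    """Return how many times slower `seconds` is than `baseline` (2 decimal places)."""
    return round(seconds / baseline, 2)


# Hypothetical helper output: one factor per tool, relative to the optimized run.
factors = {name: slowdown(t) for name, t in timings.items()}

if __name__ == "__main__":
    for name, factor in factors.items():
        print(f"{name}: {factor}x slower than the optimized run")
```

Note that these ratios only compare query time. As stated above, the SQLite and PostgreSQL figures exclude setup and import time.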
Added
- apply: add gender_guess operation #1569
- datefmt: new top-level command for date formatting. #1638
- enum: add --start option #1631
- excel: added --keep-zero-time option; improved datetime/duration parsing/handling with upgrade of calamine from 0.23 to 0.24 #1595
- fetch: add --disk-cache option #1621
- jsonl: major performance refactor! Now multithreaded with additional --batch and --job options #1553
- sniff: added additional mimetype/file formats detected by bumping file-format from 0.23 to 0.24 #1589
- split: add <outdir> error handling and add usage text examples #1585
- split: added --chunks option #1587
- split: add --kb-size option #1613
- sqlp: added JSONL output format and compression support for AVRO and Arrow output formats in #1635
- tojsonl: add --trim option #1554
- Add QSV_DOTENV_PATH env var #1562
- Add license scan report and status by @fossabot in #1550
- Added several benchmarks for new/changed commands
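To make the three split modes concrete (by record count, by number of chunks, and by size), here is a minimal Python sketch of the general idea. It is not qsv's implementation: qsv is written in Rust, and the function names, the in-memory list of rows, and the way row sizes are counted (UTF-8 bytes plus one newline) are all assumptions made for illustration. Header handling is ignored.

```python
from typing import Iterator, List


def split_by_count(rows: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks holding at most `size` rows each."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def split_into_chunks(rows: List[str], chunks: int) -> Iterator[List[str]]:
    """Yield at most `chunks` chunks of (nearly) equal row count."""
    if chunks < 1:
        raise ValueError("chunks must be >= 1")
    # Ceiling division gives the rows per chunk needed to stay within `chunks`.
    size = max(1, -(-len(rows) // chunks))
    yield from split_by_count(rows, size)


def split_by_kb(rows: List[str], kb_size: int) -> Iterator[List[str]]:
    """Yield chunks whose estimated size stays <= kb_size kilobytes.

    A single row larger than the limit still gets a chunk of its own.
    """
    limit = kb_size * 1024
    chunk: List[str] = []
    used = 0
    for row in rows:
        row_bytes = len(row.encode("utf-8")) + 1  # +1 for the trailing newline
        if chunk and used + row_bytes > limit:
            yield chunk
            chunk, used = [], 0
        chunk.append(row)
        used += row_bytes
    if chunk:
        yield chunk
```

The size-based mode closes a chunk before a row would push it past the limit, which is one simple way to keep every chunk at or below the requested size.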
Changed
- luau: bumped Luau from 0.606 to 0.614
- freq: major performance refactor - 1a3a4b4
- split: migrate to rayon from threadpool #1555
- split: refactored to actually create chunks <= desired --kb-size, obviating need for hacky --sep-factor option #1615
- tojsonl: improved true/false boolean inferencing false positive handling #1641
- tojsonl: fine-tune boolean inferencing #1643
- schema: use parallel sort when sorting enums for fields 523c60a
- Use array for rustflags to avoid conflicts with user flags by @clarfonthey in #1548
- Make it easier and more consistent to package for distros by @alerque in #1549
- Replace simple_home_dir with simple_expand_tilde crate #1578
- build(deps): bump rayon from 1.8.0 to 1.8.1 by @dependabot in #1547
- build(deps): bump rayon from 1.8.1 to 1.9.0 by @dependabot in #1623
- build(deps): bump uuid from 1.6.1 to 1.7.0 by @dependabot in #1551
- build(deps): bump jql-runner from 7.1.2 to 7.1.3 by @dependabot in #1552
- build(deps): bump jql-runner from 7.1.3 to 7.1.5 by @dependabot in #1602
- build(deps): bump jql-runner from 7.1.5 to 7.1.6 by @dependabot in #1637
- build(deps): bump flexi_logger from 0.27.3 to 0.27.4 by @dependabot in #1556
- build(deps): bump regex from 1.10.2 to 1.10.3 by @dependabot in #1557
- build(deps): bump cached from 0.47.0 to 0.48.0 by @dependabot in #1558
- build(deps): bump cached from 0.48.0 to 0.48.1 by @dependabot in #1560
- build(deps): bump cached from 0.48.1 to 0.49.2 by @dependabot in #1618
- build(deps): bump chrono from 0.4.31 to 0.4.32 by @dependabot in #1559
- build(deps): bump chrono from 0.4.32 to 0.4.33 by @dependabot in #1566
- build(deps): bump mlua from 0.9.4 to 0.9.5 by @dependabot in #1565
- build(deps): bump mlua from 0.9.5 to 0.9.6 by @dependabot in #1632
- build(deps): bump serde from 1.0.195 to 1.0.196 by @dependabot in #1568
- build(deps): bump serde from 1.0.196 to 1.0.197 by @dependabot in #1612
- build(deps): bump serde_json from 1.0.111 to 1.0.112 by @dependabot in #1567
- build(deps): bump serde_json from 1.0.112 to 1.0.113 by @dependabot in #1576
- build(deps): bump serde_json from 1.0.113 to 1.0.114 by @dependabot in #1610
- bump Polars from 0.36 to 0.37 #1570
- build(deps): bump polars from 0.37.0 to 0.38.0 by @dependabot in #1629
- build(deps): bump polars from 0.38.0 to 0.38.1 by @dependabot in #1634
- build(deps): bump strum from 0.25.0 to 0.26.1 by @dependabot in #1572
- build(deps): bump indexmap from 2.1.0 to 2.2.1 by @dependabot in #1575
- build(deps): bump indexmap from 2.2.1 to 2.2.2 by @dependabot in #1579
- build(deps): bump indexmap from 2.2.2 to 2.2.3 by @dependabot in #1601
- build(deps): bump indexmap from 2.2.4 to 2.2.5 by @dependabot in #1633
- build(deps): bump robinraju/release-downloader from 1.8 to 1.9 by @dependabot in #1574
- build(deps): bump itertools from 0.12.0 to 0.12.1 by @dependabot in #1577
- build(deps): bump rust_decimal from 1.33.1 to 1.34.0 by @dependabot in #1580
- build(deps): bump rust_decimal from 1.34.0 to 1.34.2 by @dependabot in #1582
- build(deps): bump rust_decimal from 1.34.2 to 1.34.3 by @dependabot in #1597
- build(deps): bump reqwest from 0.11.23 to 0.11.24 by @dependabot in #1581
- build(deps): bump tokio from 1.35.1 to 1.36.0 by @dependabot in #1583
- build(deps): bump tempfile from 3.9.0 to 3.10.0 by @dependabot in #1590
- build(deps): bump tempfile from 3.10.0 to 3.10.1 by @dependabot in #1622
- build(deps): bump indicatif from 0.17.7 to 0.17.8 by @dependabot in #1598
- build(deps): bump csvs_convert from 0.8.8 to 0.8.9 by @dependabot in #1596
- build(deps): bump ahash from 0.8.7 to 0.8.8 by @dependabot in #1599
- build(deps): bump ahash from 0.8.8 to 0.8.9 by @dependabot in #1611
- build(deps): bump ahash from 0.8.9 to 0.8.10 by @dependabot in #1624
- build(deps): bump ahash from 0.8.10 to 0.8.11 by @dependabot in #1640
- build(deps): bump governor from 0.6.0 to 0.6.3 by @dependabot in #1603
- build(deps): bump semver from 1.0.21 to 1.0.22 by @dependabot in #1606
- build(deps): bump ryu from 1.0.16 to 1.0.17 by @dependabot in #1605
- build(deps): bump anyhow from 1.0.79 to 1.0.80 by @dependabot in #1604
- build(deps): bump geosuggest-core from 0.6.0 to 0.6.1 by @dependabot in #1607
- build(deps): bump geosuggest-utils from 0.6.0 to 0.6.1 by @dependabot in #1608
- build(deps): bump pyo3 from 0.20.2 to 0.20.3 by @dependabot in #1616
- build(deps): bump crossbeam-channel from 0.5.11 to 0.5.12 by @dependabot in #1627
- build(deps): bump log from 0.4.20 to 0.4.21 by @dependabot in #1628
- build(deps): bump sysinfo from 0.30.5 to 0.30.6 by @dependabot in #1636
- build(deps): bump qsv-sniffer from 0.10.1 to 0.10.2 by @dependabot in #1644
- deps: bump halfbrown from 0.24 to 0.25 b32fc71
- apply select clippy suggestions
- update several indirect dependencies
- pin Rust nightly to 2024-02-23 - the nightly that Polars 0.38 can be built with
Fixed
- fix: fix feature = "cargo-clippy" deprecation by @rex4539 in #1626
- stats: fixed cache.json file not being updated properly b9c4371
Removed
- Removed datefmt subcommand from apply #1638
New Contributors
- @clarfonthey made their first contribution in #1548
- @alerque made their first contribution in #1549
- @fossabot made their first contribution in #1550
- @rex4539 made their first contribution in #1626
Full Changelog: 0.122.0...0.123.0
Footnotes
1. Measurements taken on an Apple Mac Mini 2023 model with an M2 Pro chip with 12 CPU cores & 32GB of RAM, running macOS Sonoma 14.4.