github jqnatividad/qsv 0.123.0

one month ago

OPEN DATA DAY 2024 Release! 🎉🎉🎉

In celebration of Open Data Day, we're releasing qsv 0.123.0 - the biggest release ever with 330+ commits! qsv 0.123.0 continues to focus on performance, stability and reliability as we set the stage for qsv's big brother - qsv pro.

We've been baking qsv pro for a while now, and it's almost ready for release. qsv pro is a cross-platform desktop data wrangling tool that marries an Excel-like UI with the power of qsv, backed by a cloud-based data cleaning, enrichment and enhancement service. It is easy to use for casual Excel users and data publishers, yet powerful enough for data scientists and data engineers.

Stay tuned!

Highlights:

# with fast path optimization turned off
/usr/bin/time qsv sqlp taxi.csv --no-optimizations "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
        6.09 real         6.82 user         0.16 sys

# with fast path optimization, fully exploiting Polars' LazyFrame evaluation even for loading the CSV file!
/usr/bin/time qsv sqlp taxi.csv "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
        0.14 real         1.09 user         0.09 sys

# in contrast, csvq takes 72.46 seconds - 517.57x slower
/usr/bin/time csvq "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
+----------+---------------------+
| VendorID |  SUM(total_amount)  |
+----------+---------------------+
| 1        |  52377417.529256366 |
| 2        |    89959869.1264675 |
| 4        |   600584.6099999828 |
+----------+---------------------+
       72.46 real        65.15 user        75.17 sys

"Traditional" SQL engines

qsv and csvq both operate on "bare" CSVs. For comparison, let's contrast qsv's performance against "traditional" SQL engines that require setup and import (aka ETL). Not counting setup and import time (which alone takes several minutes), we get:

sqlite 3.43.2 takes 2.910 seconds - 20.79x slower

sqlite> .timer on
sqlite> select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID;
1,52377417.53
2,89959869.13
4,600584.61
Run Time: real 2.910 user 2.569494 sys 0.272972

PostgreSQL 15.6 using PgAdmin 4 v6.12 takes 18.527 seconds - 132.34x slower

Even with an index, qsv sqlp is still 5.96x faster.
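
The slowdown multiples quoted in this comparison are each engine's wall-clock ("real") time divided by the 0.14 seconds of the optimized qsv sqlp run. A minimal Python check of that arithmetic, using only the timings reported above (the dictionary keys and the `slowdown` helper are our own names, not part of qsv):

```python
# Wall-clock ("real") seconds as reported in the benchmarks above.
QSV_FAST_PATH = 0.14

timings = {
    "qsv sqlp --no-optimizations": 6.09,
    "csvq": 72.46,
    "sqlite 3.43.2": 2.910,
    "PostgreSQL 15.6": 18.527,
}


def slowdown(seconds: float, baseline: float = QSV_FAST_PATH) -> float:
    """Return how many times slower `seconds` is than `baseline`, to 2 decimals."""
    return round(seconds / baseline, 2)


ratios = {name: slowdown(secs) for name, secs in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x slower than qsv sqlp with the fast path")
```

The same division shows that the fast path alone makes qsv sqlp roughly 43x faster than the run with --no-optimizations on this query.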

  • sqlp now supports JSONL output format and adds compression support for Avro and Arrow output formats.
  • fetch now has a --disk-cache option, so you can cache web service responses to disk, complete with cache control and expiry handling!
  • jsonl is now multithreaded with additional --batch and --job options.
  • split now has three modes: split by record count, split by number of chunks and split by file size.
  • datefmt is a new top-level command for date formatting. We extracted it from apply to make it easier to use, and to set the stage for expanded date and timezone handling.
  • enum now has a --start option.
  • excel now has a --keep-zero-time option, plus improved datetime/duration parsing and handling from the upgrade of calamine from 0.23 to 0.24.
  • tojsonl now has --trim and --no-boolean options, and eliminates false positive boolean inferences.
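
To make the three split modes concrete, here is a rough Python sketch of the idea behind each one. It is an illustration under our own assumptions, not qsv's actual implementation: the function names, the `Row` alias, and the greedy packing rule for the size-based mode are all hypothetical.

```python
import math
from typing import Iterator

Row = str  # one CSV record, already serialized as a line of text


def split_by_count(rows: list[Row], size: int) -> Iterator[list[Row]]:
    """Mode 1: at most `size` records per chunk."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def split_by_chunks(rows: list[Row], chunks: int) -> Iterator[list[Row]]:
    """Mode 2: aim for `chunks` chunks by deriving a per-chunk record count."""
    if not rows:
        return
    size = math.ceil(len(rows) / chunks)
    yield from split_by_count(rows, size)


def split_by_kb(rows: list[Row], header: Row, kb_size: int) -> Iterator[list[Row]]:
    """Mode 3: greedily pack records so header + records stay <= kb_size KB.

    A single record larger than the limit still gets a chunk of its own.
    """
    limit = kb_size * 1024
    header_bytes = len(header.encode("utf-8"))
    chunk: list[Row] = []
    used = header_bytes
    for row in rows:
        row_bytes = len(row.encode("utf-8"))
        if chunk and used + row_bytes > limit:
            yield chunk
            chunk, used = [], header_bytes
        chunk.append(row)
        used += row_bytes
    if chunk:
        yield chunk


header = "id,name\n"
rows = [f"{i},name{i}\n" for i in range(10)]

by_count = list(split_by_count(rows, 4))      # chunks of 4, 4 and 2 records
by_chunks = list(split_by_chunks(rows, 3))    # ceil(10 / 3) = 4 records per chunk
by_kb = list(split_by_kb(rows, header, 1))    # everything fits in one 1 KB chunk
```

Because the chunk-count mode rounds the per-chunk record count up, this sketch can produce fewer chunks than requested for some inputs; whether qsv behaves the same way is not stated in these notes.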

Added

  • apply: add gender_guess operation #1569
  • datefmt: new top-level command for date formatting. #1638
  • enum: add --start option #1631
  • excel: added --keep-zero-time option; improved datetime/duration parsing/handling with upgrade of calamine from 0.23 to 0.24 #1595
  • fetch: add --disk-cache option #1621
  • jsonl: major performance refactor! Now multithreaded with additional --batch and --job options #1553
  • sniff: detect additional MIME types/file formats by bumping file-format from 0.23 to 0.24 #1589
  • split: add <outdir> error handling and add usage text examples #1585
  • split: added --chunks option #1587
  • split: add --kb-size option #1613
  • sqlp: added JSONL output format and compression support for Avro and Arrow output formats #1635
  • tojsonl: add --trim option #1554
  • Add QSV_DOTENV_PATH env var #1562
  • Add license scan report and status by @fossabot in #1550
  • Added several benchmarks for new/changed commands
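
As an illustration of what caching web service responses to disk with expiry involves (the fetch --disk-cache item above), here is a minimal Python sketch. It is not how qsv implements the option: the `DiskCache` class, the fixed `ttl_seconds` expiry, the JSON file layout and `fake_fetcher` are all assumptions made for the example, and it ignores HTTP cache-control headers entirely.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Callable


class DiskCache:
    """Hypothetical URL-keyed response cache stored as JSON files on disk."""

    def __init__(self, directory: Path, ttl_seconds: float) -> None:
        self.directory = directory
        self.ttl_seconds = ttl_seconds
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url: str) -> Path:
        # Hash the URL so any URL maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get(self, url: str, fetcher: Callable[[str], str]) -> str:
        """Return a cached body if it is still fresh, else fetch and store it."""
        path = self._path(url)
        if path.exists():
            entry = json.loads(path.read_text(encoding="utf-8"))
            if time.time() - entry["stored_at"] < self.ttl_seconds:
                return entry["body"]
        body = fetcher(url)
        entry = {"stored_at": time.time(), "body": body}
        path.write_text(json.dumps(entry), encoding="utf-8")
        return body


calls: list[str] = []


def fake_fetcher(url: str) -> str:
    # Stand-in for a real HTTP request; records each call it receives.
    calls.append(url)
    return f"response for {url}"


cache = DiskCache(Path(tempfile.mkdtemp()), ttl_seconds=60)
first = cache.get("https://example.com/api?q=1", fake_fetcher)
second = cache.get("https://example.com/api?q=1", fake_fetcher)  # served from disk
```

The second lookup never reaches `fake_fetcher`, which is the point of a disk cache: repeated runs over the same URLs avoid repeating the requests until the entries expire.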

Changed

  • luau: bumped Luau from 0.606 to 0.614
  • freq: major performance refactor - 1a3a4b4
  • split: migrate to rayon from threadpool #1555
  • split: refactored to actually create chunks <= desired --kb-size, obviating need for hacky --sep-factor option #1615
  • tojsonl: improved handling of false positives in true/false boolean inferencing #1641
  • tojsonl: fine-tune boolean inferencing #1643
  • schema: use parallel sort when sorting enums for fields 523c60a
  • Use array for rustflags to avoid conflicts with user flags by @clarfonthey in #1548
  • Make it easier and more consistent to package for distros by @alerque in #1549
  • Replace simple_home_dir with simple_expand_tilde crate #1578
  • build(deps): bump rayon from 1.8.0 to 1.8.1 by @dependabot in #1547
  • build(deps): bump rayon from 1.8.1 to 1.9.0 by @dependabot in #1623
  • build(deps): bump uuid from 1.6.1 to 1.7.0 by @dependabot in #1551
  • build(deps): bump jql-runner from 7.1.2 to 7.1.3 by @dependabot in #1552
  • build(deps): bump jql-runner from 7.1.3 to 7.1.5 by @dependabot in #1602
  • build(deps): bump jql-runner from 7.1.5 to 7.1.6 by @dependabot in #1637
  • build(deps): bump flexi_logger from 0.27.3 to 0.27.4 by @dependabot in #1556
  • build(deps): bump regex from 1.10.2 to 1.10.3 by @dependabot in #1557
  • build(deps): bump cached from 0.47.0 to 0.48.0 by @dependabot in #1558
  • build(deps): bump cached from 0.48.0 to 0.48.1 by @dependabot in #1560
  • build(deps): bump cached from 0.48.1 to 0.49.2 by @dependabot in #1618
  • build(deps): bump chrono from 0.4.31 to 0.4.32 by @dependabot in #1559
  • build(deps): bump chrono from 0.4.32 to 0.4.33 by @dependabot in #1566
  • build(deps): bump mlua from 0.9.4 to 0.9.5 by @dependabot in #1565
  • build(deps): bump mlua from 0.9.5 to 0.9.6 by @dependabot in #1632
  • build(deps): bump serde from 1.0.195 to 1.0.196 by @dependabot in #1568
  • build(deps): bump serde from 1.0.196 to 1.0.197 by @dependabot in #1612
  • build(deps): bump serde_json from 1.0.111 to 1.0.112 by @dependabot in #1567
  • build(deps): bump serde_json from 1.0.112 to 1.0.113 by @dependabot in #1576
  • build(deps): bump serde_json from 1.0.113 to 1.0.114 by @dependabot in #1610
  • bump Polars from 0.36 to 0.37 #1570
  • build(deps): bump polars from 0.37.0 to 0.38.0 by @dependabot in #1629
  • build(deps): bump polars from 0.38.0 to 0.38.1 by @dependabot in #1634
  • build(deps): bump strum from 0.25.0 to 0.26.1 by @dependabot in #1572
  • build(deps): bump indexmap from 2.1.0 to 2.2.1 by @dependabot in #1575
  • build(deps): bump indexmap from 2.2.1 to 2.2.2 by @dependabot in #1579
  • build(deps): bump indexmap from 2.2.2 to 2.2.3 by @dependabot in #1601
  • build(deps): bump indexmap from 2.2.4 to 2.2.5 by @dependabot in #1633
  • build(deps): bump robinraju/release-downloader from 1.8 to 1.9 by @dependabot in #1574
  • build(deps): bump itertools from 0.12.0 to 0.12.1 by @dependabot in #1577
  • build(deps): bump rust_decimal from 1.33.1 to 1.34.0 by @dependabot in #1580
  • build(deps): bump rust_decimal from 1.34.0 to 1.34.2 by @dependabot in #1582
  • build(deps): bump rust_decimal from 1.34.2 to 1.34.3 by @dependabot in #1597
  • build(deps): bump reqwest from 0.11.23 to 0.11.24 by @dependabot in #1581
  • build(deps): bump tokio from 1.35.1 to 1.36.0 by @dependabot in #1583
  • build(deps): bump tempfile from 3.9.0 to 3.10.0 by @dependabot in #1590
  • build(deps): bump tempfile from 3.10.0 to 3.10.1 by @dependabot in #1622
  • build(deps): bump indicatif from 0.17.7 to 0.17.8 by @dependabot in #1598
  • build(deps): bump csvs_convert from 0.8.8 to 0.8.9 by @dependabot in #1596
  • build(deps): bump ahash from 0.8.7 to 0.8.8 by @dependabot in #1599
  • build(deps): bump ahash from 0.8.8 to 0.8.9 by @dependabot in #1611
  • build(deps): bump ahash from 0.8.9 to 0.8.10 by @dependabot in #1624
  • build(deps): bump ahash from 0.8.10 to 0.8.11 by @dependabot in #1640
  • build(deps): bump governor from 0.6.0 to 0.6.3 by @dependabot in #1603
  • build(deps): bump semver from 1.0.21 to 1.0.22 by @dependabot in #1606
  • build(deps): bump ryu from 1.0.16 to 1.0.17 by @dependabot in #1605
  • build(deps): bump anyhow from 1.0.79 to 1.0.80 by @dependabot in #1604
  • build(deps): bump geosuggest-core from 0.6.0 to 0.6.1 by @dependabot in #1607
  • build(deps): bump geosuggest-utils from 0.6.0 to 0.6.1 by @dependabot in #1608
  • build(deps): bump pyo3 from 0.20.2 to 0.20.3 by @dependabot in #1616
  • build(deps): bump crossbeam-channel from 0.5.11 to 0.5.12 by @dependabot in #1627
  • build(deps): bump log from 0.4.20 to 0.4.21 by @dependabot in #1628
  • build(deps): bump sysinfo from 0.30.5 to 0.30.6 by @dependabot in #1636
  • build(deps): bump qsv-sniffer from 0.10.1 to 0.10.2 by @dependabot in #1644
  • deps: bump halfbrown from 0.24 to 0.25 b32fc71
  • apply select clippy suggestions
  • update several indirect dependencies
  • pin Rust nightly to 2024-02-23 - the nightly that Polars 0.38 can be built with

Fixed

  • fix: fix feature = "cargo-clippy" deprecation by @rex4539 in #1626
  • stats: fixed cache.json file not being updated properly b9c4371

Removed

  • Removed datefmt subcommand from apply #1638

New Contributors

Full Changelog: 0.122.0...0.123.0

Footnotes

  1. measurements taken on an Apple Mac Mini 2023 model with an M2 Pro chip with 12 CPU cores & 32GB of RAM, running macOS Sonoma 14.4
