🏆 Highlights
- Enable common subplan elimination across plans in `collect_all` (#21747)
- Add lazy sinks (#21733)
- Add `PartitionByKey` for new streaming sinks (#21689)
- Enable new streaming memory sinks by default (#21589)
🚀 Performance improvements
- Implement linear-time rolling_min/max (#21770)
- Improve InputIndependentSelect by delegating to InMemorySourceNode (#21767)
- Enable common subplan elimination across plans in `collect_all` (#21747)
- Allow elementwise functions in recursive lowering (#21653)
- Add primitive single-key hashtable to new-streaming join (#21712)
- Remove unnecessary black_boxes in Kahan summation (#21679)
- Box large enum variants (#21657)
- Improve join performance for new-streaming engine (#21620)
- Pre-fill caches (#21646)
- Optimize only a single cache input (#21644)
- Collect parquet statistics in one contiguous buffer (#21632)
- Update Cargo.lock (mainly for zstd 1.5.7) (#21612)
- Don't maintain order when maintain_order=False in new streaming sinks (#21586)
- Pre-sort groups in group-by-dynamic (#21569)
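
As an illustration of the linear-time rolling_min/max item above, here is a minimal sketch of the classic monotonic-deque technique in plain Python. It is not the Polars implementation; the function name `rolling_min` and its `window_size` parameter are hypothetical, and unlike the Polars expression it only emits values for complete windows.

```python
from collections import deque


def rolling_min(values, window_size):
    """Minimum of every full window of `window_size` consecutive values.

    Hypothetical helper for illustration; each index is pushed and popped
    at most once, so the total work is O(n) regardless of window size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")

    candidates = deque()  # indices whose values increase from front to back
    result = []
    for i, value in enumerate(values):
        # Older entries that are >= the new value can never be a minimum again.
        while candidates and values[candidates[-1]] >= value:
            candidates.pop()
        candidates.append(i)
        # Drop the front index once it has slid out of the current window.
        if candidates[0] <= i - window_size:
            candidates.popleft()
        # Emit a result only once the first full window is available.
        if i >= window_size - 1:
            result.append(values[candidates[0]])
    return result


mins = rolling_min([4, 2, 5, 1, 3], 3)
```

Flipping the comparison to `<=` gives the rolling maximum with the same cost.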
✨ Enhancements
- Add support for rolling_(sum/min/max) for booleans through casting (#21748)
- Support multi-column sort for all nested types and nested search-sorted (#21743)
- Add lazy sinks (#21733)
- Add `PartitionByKey` for new streaming sinks (#21689)
- Fix replace flags (#21731)
- Add `mkdir` flag to sinks (#21717)
- Enable joins on list/array dtypes (#21687)
- Add a config option to specify the default engine to attempt to use during lazyframe calls (#20717)
- Support all elementwise functions in IO plugin predicates (#21705)
- Stabilize Enum datatype (#21686)
- Support Polars int128 in from arrow (#21688)
- Use FFI to read dataframe instead of transmute (#21673)
- Enable new streaming memory sinks by default (#21589)
- Cloud support for new-streaming scans and sinks (#21621)
- Add len method to arr (#21618)
- Closeable files on unix (#21588)
- Add new `PartitionMaxSize` sink (#21573)
- Support engine callback for `LazyFrame.profile` (#21534)
- Dispatch new-streaming CSV negative slice to separate node (#21579)
- Add NDJSON source to new streaming engine (#21562)
- Support passing `token` in `storage_options` for GCP cloud (#21560)
🐞 Bug fixes
- Expose and document partitions (#21765)
- Fix lazy schema for truediv ops involving List/Array dtypes (#21764)
- Fix error due to race condition in file cache (#21753)
- Clear NaNs due to zero-weight division in rolling var/std (#21761)
- Allow init from BigQuery Arrow data containing ExtensionType cols with irrelevant metadata (#21492)
- Disallow cast from boolean to categorical/enum (#21714)
- Don't check sortedness in `join_asof` when 'by' groups supplied, but issue warning (#21724)
- Incorrect multithread path taken for aggregations (#21727)
- Disallow cast to empty Enum (#21715)
- Fix `list.mean` and `list.median` returning Float64 for temporal types (#21144)
- Incorrect (FixedSize)ListArrayBuilder gather implementation (#21716)
- Always fallback in SkipBatchPredicate (#21711)
- New streaming multiscan deadlock (#21694)
- Ensure new-streaming join BuildState is correct even if never fed morsels (#21708)
- IO plugin; support empty iterator (#21704)
- Support nulls in multi-column sort (#21702)
- Window function check length of groups state (#21697)
- Support 128 sum reduction on new streaming (#21691)
- IPC round-trip of list of empty view with non-empty bufferset (#21671)
- Variance can never be negative (#21678)
- Incorrect loop length in new-streaming group by (#21670)
- Right join on multiple columns not coalescing left_on columns (#21669)
- Casting Struct to String panics if n_chunks > 1 (#21656)
- Fix `Future attached to different loop` error on `read_database_uri` (#21641)
- Fix deadlock in cache + hconcat (#21640)
- Properly handle phase transitions in row-wise sinks (#21600)
- Enable new streaming memory sinks by default (#21589)
- Always use global registry for object (#21622)
- Check enum categories when reading csv (#21619)
- Unspecialized prefiltering on nullable arrays (#21611)
- Release the gil on explain (#21607)
- Take into account scalar/partitioned columns in DataFrame::split_chunks (#21606)
- Bad null handling in unordered row encoding (#21603)
- Fix deadlock in new streaming CSV / NDJSON sinks (#21598)
- Bad view index in BinaryViewBuilder (#21590)
- Fix CSV count with comment prefix skipped empty lines (#21577)
- New streaming IPC enum scan (#21570)
- Several aspects related to ParquetColumnExpr (#21563)
- Don't hit parquet::pre-filtered in case of pre-slice (#21565)
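
For readers wondering how a variance could come out negative at all (see the "Variance can never be negative" item above): floating-point rounding in some variance formulas can produce a tiny negative result for near-constant data. The sketch below is plain Python and is not taken from the Polars source; it assumes a population variance, uses Welford's single-pass update, and clamps the result at zero as a safeguard. The function name `welford_variance` is hypothetical.

```python
def welford_variance(values):
    """Population variance via Welford's single-pass update.

    Hypothetical helper for illustration only. The final clamp guarantees
    the result is never negative, even if rounding pushes the accumulated
    sum of squared deviations slightly below zero.
    """
    count = 0
    mean = 0.0
    sum_sq_dev = 0.0  # running sum of squared deviations from the mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        sum_sq_dev += delta * (x - mean)
    if count == 0:
        raise ValueError("variance of an empty sequence is undefined")
    return max(0.0, sum_sq_dev / count)


var_simple = welford_variance([1.0, 2.0, 3.0, 4.0])
var_constant = welford_variance([1e8 + 0.1] * 3)
```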
📖 Documentation
- Add skrub to ecosystem.md (#21760)
- Add example for percentile rank (#21746)
- Make python/rust getting-started consistent and clarify performance risk of infer_schema_length=None (#21734)
- Add expression composability to PySpark comparison (#21473)
- Document `read_().lazy()` antipattern (#21623)
- Update Polars Cloud interactive workflow examples (#21609)
- Add a `Plotnine` example to the visualization docs (#21597)
- Add cloud api reference to Ref guide (#21566)
🛠️ Other improvements
- Remove variance numerical stability hack (#21749)
- Only use chrono_tz timezones in hypothesis testing (#21721)
- Remove order check from flaky test (#21730)
- Add sinks into the DSL before optimization (#21713)
- Add missing test case for #21701 (#21709)
- Remove old-streaming from engine argument (#21667)
- Add as_phys_any to PrivateSeries for downcasting (#21696)
- Use FFI to read dataframe instead of transmute (#21673)
- Work around typos ignore bug (#21672)
- Added Test For `datetime_range` Nanosecond Overflow (#21354)
- Update to edition 2024 (#21662)
- Update rustc (#21647)
- Support object from chunks (#21636)
- Push versioned docs on workflow dispatch (#21630)
- Fail docs early (#21629)
- Check major/minor in docs (#21626)
- Add docs workflow (#21624)
- Add test for 21581 (#21617)
- Remove even more parquet multiscan handling (#21601)
- Remove multiscan handling from new streaming parquet source (#21584)
- Prepare skeleton for partitioning sinks (#21536)
Thank you to all our contributors for making this release possible!
@GaelVaroquaux, @Kevin-Patyk, @MarcoGorelli, @Matt711, @NathanHu725, @alexander-beedie, @coastalwhite, @dependabot[bot], @jrycw, @kdn36, @lukemanley, @mcrumiller, @nameexhaustion, @orlp, @r-brink, @ritchie46, @wence- and dependabot[bot]