⚠️ Deprecations
- Make parameter of str.to_decimal keyword-only (#20570)
🚀 Performance improvements
- Extend functionality on BitmapBuilder and use in Growables (#20754)
- Specialize first/last agg for simple types in new-streaming engine (#20728)
- Use PyO3 to convert between Python and Rust datetimes (#20660)
- Improve state caching and parallelism of window functions (#20689)
- Broadcast without materialization in concat_arr (#20681)
- Cache rolling groups (#20675)
- Use downcast_ref instead of dtype equality in <dyn SeriesTrait as AsRef<ChunkedArray<T>> (#20664)
- Fix performance regression for DataFrame serialization/pickling (#20641)
- Make Parquet verify_dict_indices SIMD (#20623)
- Move to zlib-rs by default and use zstd::with_buffer (#20614)
- Skip filter expansion in eager (#20586)
- Improve unique pred-pd (#20569)
✨ Enhancements
- Allow different python versions for pickle (#20740)
- Add SQL support for the NORMALIZE string function (#20705)
- Add allow_exact_matches parameter to join_asof (#20723)
- Add new-streaming first/last aggregations (#20716)
- Add Parquet Sink to new streaming engine (#20690)
- Make automatic use of Azure storage account keys opt-in (#20652)
- Reduce scan_csv() (and friends') memory usage when using BytesIO (#20649)
- Improve GroupsProxy/GroupsPosition to be sliceable and cheaply cloneable (#20673)
- Add str.normalize() (#20483)
- Allow more group_by agg expressions in the new streaming engine (#20663)
- Support loading Excel Table objects by name (#20654)
- Support writing to file objects from write_excel (#20638)
- Raise DuplicateError if given a pyarrow Table object with duplicate column names (#20624)
- Support writing partitioned parquet to cloud (#20590)
- Add hint to error message for extra struct field in JSON (#20612)
- Add index_of() function to Series and Expr (#19894)
- Update sqlparser-rs, enabling "LEFT" keyword to be optional for anti/semi joins in SQL queries (#20576)
- Add cat.starts_with/cat.ends_with (#20257)
🐞 Bug fixes
- Avoid blocking on async runtime when resolving cloud scans (#20750)
- Fix allow_invalid_certificates being ignored in storage_options (#20744)
- Incorrect output type for map_groups returning all-NULL column (#20743)
- Fix unique(maintain_order=True) raising InvalidOperationError for null array (#20737)
- Don't collapse into a Nested Loop Join if the cross join maintains order (#20729)
- Don't serialize credentials provider (#20741)
- Fix Series.n_unique raising for list of struct (#20724)
- Fix incorrect top-k by sorted column, fix head() returning extra rows (#20722)
- Add outer validity to AnyValueBufferTrusted for structs (#20713)
- Don't partition group-by with non-scalar literals in agg (#20704)
- Fix xor operation of selector with Expr (#20702)
- Incorrect view buffer dedup (#20691)
- Only verify Parquet ConvertedType if no LogicalType is given (#20682)
- Validate length of schema_overrides in read_csv (#20672)
- Fix map_elements ignoring skip_nulls=True for struct dtype (#20668)
- Check for MAP-GROUPS in cloud-eligible (#20662)
- Fix empty output of to_arrow() on filtered unit height DataFrame (#20656)
- Add .default to azure credential provider scope URL (#20651)
- Fix join_asof panicking for invalid tolerance input (#20643)
- Incorrect flag check on is_elementwise (#20646)
- Don't panic but set null type if type is unknown (#20647)
- Fix performance regression for DataFrame serialization/pickling (#20641)
- Fix Int128 dtype serialization (#20629)
- Ensure read_excel and read_ods support reading from raw bytes for all engines (#20636)
- Ensure that SQL LIKE and ILIKE operators support multi-line matches (#20613)
- Properly broadcast in sort_by (#20434)
- Properly load nested Parquet Statistics (#20610)
- AWS environment config was not loaded when credential provider was used (#20611)
- Fix order observability of group-by-dyn (#20615)
- Soundness when loading Parquet string statistics (#20585)
- Fix error filtering after with_columns() on unit height LazyFrame (#20584)
- Propagate tenant_id to CredentialProviderAzure if given (#20583)
- Restore symbols on Apple by bumping nightly version (#20563)
- Fix type annotation of str.strip_chars_* methods (#20565)
- Fix variable name in error message for "unsupported data type" in rolling and upsampling operations (#20553)
📖 Documentation
- Add more information for cross joins (#20753)
- Fix typo in sql functions (cosinus -> cosine) (#20676)
- Add links to read_excel "engine_options" and "read_options" docstring (#20661)
- Fix small typo in plugins (polars-dt -> polars-st) (#20657)
- Add polars-h3 and polars-st to plugin list (#20653)
- Add docs reference for Field (#20625)
- Update DataFrame join examples (#20587)
- Miscellaneous minor updates/fixes (#20573)
- Update "group_by_rolling" (deprecated) to "rolling" in user guide (#20548)
📦 Build system
🛠️ Other improvements
- Fix remote benchmark script (#20755)
- Fix tests (#20745)
- Simplify hive predicate handling in NEW_MULTIFILE (#20730)
- Add tests for various open issues (#20720)
- Fixes an Excel test following new fastexcel release (#20703)
- Add tests for various open issues that have been fixed (#20680)
- Don't include debug symbols in benchmark run (#20571)
- Implement CSV, IPC and NDJson in the MultiScanExec node (#20648)
- Don't rely on argument order of optimization_toggle (#20622)
- Fix Python deps installation in remote-benchmark workflow (#20619)
- Fix flaky categorical test (#20591)
- Bump multiversion from 0.7 to 0.8 (#20543)
- Remove unused nested function in LazyFrame.fill_null (#20558)
- Improve bin size info (#20551)
Thank you to all our contributors for making this release possible!
@Jesse-Bakker, @MarcoGorelli, @MoizesCBF, @SamuelAllain, @alexander-beedie, @bschoenmaeckers, @coastalwhite, @eitsupi, @etiennebacher, @itamarst, @jqnatividad, @lukemanley, @mcrumiller, @nameexhaustion, @orlp, @ritchie46 and @stinodego