🚀 Performance improvements
- Improve DataFrame.sort().limit/top_k performance (#19731)
- Improve cloud scan performance (#19728)
- Fix quadratic 'with_columns' behavior (#19701)
- Improve hive partition pruning with datetime predicates from SQL (#19680)
- Allow for arbitrary skips in Parquet Dictionary Decoding (#19649)
- Reorder conditions in is_leap_year (#19602)
- Rechunk in DataFrame.rows if needed (#19628)
- Dispatch Parquet Primitive PLAIN decoding to faster kernels when possible (#19611)
- Use faster iteration in 'starts_with'/'ends_with' (#19583)
- Branchless Parquet Prefiltering (#19190)
- Reduce size of IdxVec from 24 -> 16 bytes (#19550)
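
As an illustration of why condition order can matter in a check like is_leap_year (#19602), here is a minimal pure-Python sketch. The notes do not say which ordering the PR chose, so the ordering below is an assumption, and the function is a hypothetical stand-in rather than the Polars kernel. It tests divisibility by 4 first, because that single check rejects three out of four years before any further work is done.

```python
import calendar


def is_leap_year(year: int) -> bool:
    """Hypothetical ordering: cheapest, most selective test first."""
    # Three out of four years fail this test, so most calls return here.
    if year % 4 != 0:
        return False
    # Only years divisible by 4 reach the century checks.
    return year % 100 != 0 or year % 400 == 0


# Cross-check the sketch against the standard library over a wide range.
mismatches = [y for y in range(1, 3001) if is_leap_year(y) != calendar.isleap(y)]
print(mismatches)
```

The result is the same whatever the order of the conditions; only the amount of work on the common path changes.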
✨ Enhancements
- Try to support native SAP HANA driver via read_database (#19733)
- Implement max/min methods for dtypes (#19494)
- Improve n_chunks typing (#19727)
- Improve hive partition pruning with datetime predicates from SQL (#19680)
- Identify inefficient use of Python string removeprefix, removesuffix, and zfill in map_elements (#19672)
- Automatically use boto3 / google-auth if installed when scanning cloud (#19677)
- Identify inefficient use of Python string replace in map_elements (#19668)
- Parallel IPC sink for the new streaming engine (#19622)
- Add SQL support for RIGHT JOIN, fix an issue with wildcard aliasing (#19626)
- Add show_graph to display a GraphViz plot for expressions (#19365)
- Streamline use of predicates connected by & with IEJoin (join_where) (#19552)
- Support use of is_between range predicate with IEJoin operations (join_where) (#19547)
🐞 Bug fixes
- Use cls for to_python (#19726)
- Fix validation for inner and left join when join_nulls unflagged (#19698)
- SQL ELSE clause should be implicitly NULL when omitted (#19714)
- Improve n_chunks typing (#19727)
- Ensure NoDataError raised consistently between engines for Excel reads (#19712)
- In group_by_dynamic, period and every were getting applied in reverse order for the window upper boundary (#19706)
- Only allow list.to_struct to be elementwise when width is fixed (#19688)
- Make Array arithmetic ops fully elementwise (#19682)
- Address inconsistency with use of Python types in frame-level cast (#19657)
- Update line-splitting logic in batched CSV reader (#19508)
- Fix incorrect lazy schema for explode() in agg() (#19629)
- Fix fill null types (#19656)
- Fix filter incorrectly pushed past struct unnest when unnested column name matches upper column name (#19638)
- Fix typing for SchemaDefinition (#19647)
- Ensure mean_horizontal raises on non-numeric input (#19648)
- Reorder conditions in is_leap_year (#19602)
- Copy height in .vstack() for empty dataframes (#19641) (#19642)
- Correct wildcard and input expansion for some more functions (#19588)
- Allow .struct.with_fields inside list.eval (#19617)
- Sortedness was incorrectly being preserved in dt.offset_by when offsetting by non-constant durations in the timezone-naive case (#19616)
- Fix incorrect scan_parquet().with_row_index() with non-zero slice or with streaming collect (#19609)
- Fix mask and validity confusion in Parquet String decoding (#19614)
- Parquet decoding of nested dictionary values (#19605)
- Do not attempt to load default credentials when credential_provider is given (#19589)
- Fix gather len in group-by state (#19586)
- Added input validation for explode operation in the array namespace (#19163)
- Improve error message (#19546)
- Fix predicate pushdown into inequality joins (#19582)
- Correct categorical namespace error message (#19558)
- Fix performance regression for sort/gather on list/array columns (#19564)
- Ignore quoted newlines when skipping lines in CSV (#19543)
- Incorrect gather for FixedSizeList with outer validity but no inner validities (#19489)
- Make Duration parsing fallible and not panic (#19490)
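
To see why quoted newlines matter when skipping lines in CSV (#19543), the following sketch uses Python's standard csv module as a stand-in. It is not the Polars reader, and the sample data is made up for illustration. It contrasts counting physical lines with counting quote-aware records.

```python
import csv
import io

# Made-up sample: the first data row has a quoted field containing a newline.
raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Naive approach: treat every newline as a record boundary.
naive_lines = raw.splitlines()

# Quote-aware approach: csv.reader keeps the quoted newline inside its field.
records = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_lines))  # 4 physical lines
print(len(records))      # 3 records: the header plus two rows

# Skipping "two lines" naively lands in the middle of the quoted field.
print(naive_lines[2])
```

A reader that skips by physical line count would resume inside the first record here, which is the kind of misalignment that respecting quotes avoids.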
📖 Documentation
- Revise and rework user-guide/expressions (#19360)
- Update Excel page of user guide to refer to fastexcel as the default engine (#19691)
- Alter examples for round_sig_figs to make behaviour clearer (#19667)
- Assorted fixes to Rust API docs (#19664)
- Improve replace and replace_all docstring explanation of the "$" character with reference to capture groups (vs use as a literal) (#19529)
- Add credential provider section and examples to user guide (#19487)
- Fix various instances of repeated words in docs and comments (#19516)
📦 Build system
- Bump Rust toolchain to nightly-2024-10-28 (#19492)
🛠️ Other improvements
- Remove unused Excel code (#19710)
- Use Column for the {try,}_apply_columns{_par,} functions on DataFrame (#19683)
- Remove more @scalar-opt (#19666)
- Move Series bitops to std::ops::Bit... (#19673)
- Mark test_parquet.py test_dict_slices as slow (#19675)
- Get Column into polars-expr (#19660)
- Streamline internal SQL join condition processing (#19658)
- Factor out logic for re-use by new streaming CSV source (#19637)
- Configure grouped Dependabot updates (#19604)
- Fix PyO3 error in CI (#19545)
- Update nightly compiler version (#19590)
- Added input validation for explode operation in the array namespace (#19163)
- Fix lint (#19584)
- Add a Column::Partitioned variant (#19557)
- Move to fast-float2 (#19578)
- Only run remote bench on rust changes (#19581)
- Remove unsafe *_release functions (#19554)
- Fix test_rolling_by_integer not using parameterized dtype (#19555)
- Add mindebug-dev rust profile (#19524)
- Add CI step to process benchmark results (#19530)
- Add CI benchmark on merge (#19518)
- Skip client check with env var (#19517)
- Improve makefile build commands (#19498)
Thank you to all our contributors for making this release possible!
@3tilley, @HansBambel, @MarcoGorelli, @alexander-beedie, @barak1412, @braaannigan, @cmdlineluser, @coastalwhite, @corwinjoy, @dependabot, @dependabot[bot], @eitsupi, @janpipek, @jqnatividad, @letkemann, @max-muoto, @nameexhaustion, @orlp, @ritchie46, @rodrigogiraoserrao, @siddharth-vi, @stinodego and @wence-