💥 Breaking changes
- Remove dedicated `sink_(parquet/ipc)_cloud` functions (#20164) (see the migration sketch after this list)
- Experimental cloud write support (#20129)
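A minimal migration sketch, assuming the regular Python `LazyFrame.sink_parquet` now accepts cloud URIs directly (per the experimental cloud write support in #20129); the bucket path below is a placeholder:

```python
import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3]})

# The dedicated sink_parquet_cloud / sink_ipc_cloud entry points were removed (#20164);
# a cloud URI (placeholder below) is passed to the regular sink instead.
lf.sink_parquet("s3://my-bucket/out.parquet")
```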
🚀 Performance improvements
- Add fast paths for series.arg_sort and dataframe.sort (#19872)
- Utilize the RangedUniqueKernel for Enum/Categorical (#20150)
- Reduce memory copy when scanning from Python objects (#20142)
- Don't instantiate validity mask when unneeded in Parquet (#20149)
- Expand more filters (#20022)
- Cache the DataFrame schema in get_column_index (#20021)
- Reduce the size of row encoding UTF-8 (#19911)
- Memoize duplicates in rolling-gb-dyn (#19939)
- More efficient row encoding for `pl.List` (#19907)
- Halve the size of Booleans in row encoding (#19927)
- Let rolling 'iter_lookbehind' breeze through duplicates (#19922)
- Initially trim leading and trailing filtered rows (#19850)
- Increase default async thread count for low core count systems (#19829)
- Move row group decode off async thread for local streaming parquet scan (#19828)
- Support use of Duration in `to_string`, ergonomic/perf improvement, tz-aware Datetime bugfix (#19697)
- Improve `DataFrame.sort().limit/top_k` performance (#19731) (see the sketch after this list)
- Improve cloud scan performance (#19728)
- Fix quadratic 'with_columns' behavior (#19701)
- Improve hive partition pruning with datetime predicates from SQL (#19680)
- Allow for arbitrary skips in Parquet Dictionary Decoding (#19649)
- Reorder conditions in is_leap_year (#19602)
- Rechunk in DataFrame.rows if needed (#19628)
- Dispatch Parquet Primitive PLAIN decoding to faster kernels when possible (#19611)
- Use faster iteration in 'starts_with'/'ends_with' (#19583)
- Branchless Parquet Prefiltering (#19190)
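As referenced above, a brief, hedged sketch of the sort-then-limit pattern that #19731 targets (data and column names are illustrative):

```python
import polars as pl

lf = pl.LazyFrame({"score": [3.0, 9.5, 1.2, 7.7], "name": ["a", "b", "c", "d"]})

# A sort followed by a small limit can take a top-k style fast path (#19731) ...
top_sorted = lf.sort("score", descending=True).limit(2).collect()

# ... and the explicit top_k variant returns the same rows
# (ordering within the result is not guaranteed to be sorted).
top_rows = lf.top_k(2, by="score").collect()

print(top_sorted, top_rows, sep="\n")
```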
✨ Enhancements
- Retry with reloaded credentials on cloud error (#20185)
- Support reading Enum dtype from csv (#20188)
- Allow sorting of lists and arrays (#20169)
- Add `maintain_order` parameter to joins (#20026) (see the usage sketch after this list)
- Allow `to_datetime`/`strptime` to automatically parse dates with single-digit hour/minute/second (#20144)
- Experimental cloud write support (#20129)
- Allow setting and reading custom schema-level IPC metadata (#20066)
- Add optimized row encoding for Decimals (#20050)
- Add `drop_nans` method to DataFrame and LazyFrame (#20029) (see the usage sketch after this list)
- Catch use of 'polars' in `to_string` for non-Duration dtypes and raise an informative error (#19977)
- Add AhoCorasick-backed 'find_many' (#19952)
- Speed up starts_with for small prefixes (#19904)
- Auto-enable hive partitioning if hive_schema was given (#19902)
- Add `pl.concat_arr` to concatenate columns into an Array column (#19881) (see the usage sketch after this list)
- Support both "iso" and "iso:strict" format options for `dt.to_string` (#19840)
- Add rounding for Decimal type (#19760)
- Improved array arithmetic support (#19837)
- Raise informative error on Unknown unnest (#19830)
- Support use of Duration in `to_string`, ergonomic/perf improvement, tz-aware Datetime bugfix (#19697)
- Allow specification of `chunk_size` on `LazyCsvReader.read_options` (#19819)
- Add an `is_literal` method to the expression `meta` namespace (#19773)
- A different approach to warning users of fork() issues with Polars (#19197)
- Add dylib (#19759)
- Add IPC source node for new streaming engine (#19454)
- Implement max/min methods for dtypes (#19494)
- Improve hive partition pruning with datetime predicates from SQL (#19680)
- Parallel IPC sink for the new streaming engine (#19622)
- Add SQL support for `RIGHT JOIN`, fix an issue with wildcard aliasing (#19626)
- Add show_graph to display a GraphViz plot for expressions (#19365)
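As referenced above, a hedged usage sketch for a few of the new user-facing APIs (`maintain_order` joins from #20026, `pl.concat_arr` from #19881, and `drop_nans` from #20029); parameter names and accepted values are assumed from the PR descriptions and may differ slightly in the released API:

```python
import polars as pl

left = pl.DataFrame({"key": [1, 2, 2, 3], "l": ["a", "b", "c", "d"]})
right = pl.DataFrame({"key": [2, 1, 3], "r": [10, 20, 30]})

# maintain_order (#20026): keep the row order of the chosen side after the join.
joined = left.join(right, on="key", how="left", maintain_order="left")

# pl.concat_arr (#19881): concatenate columns element-wise into an Array column.
pairs = right.select(pl.concat_arr("key", "r").alias("pair"))

# drop_nans (#20029): drop rows containing NaN values in float columns.
clean = pl.DataFrame({"x": [1.0, float("nan"), 3.0]}).drop_nans()

print(joined, pairs, clean, sep="\n")
```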
🐞 Bug fixes
- Don't trigger length check in array construction (#20205)
- Allow row encoding for 32-bit architectures (e.g. WASM) (#20186)
- Properly project unordered column in parquet prefiltered (#20189)
- CSV: stop SIMD cache if EOL char is hit (#20199)
- Fix estimated size for Object (#20191)
- Respect parallel argument in parquet (#20187)
- Only validate UTF-8 for selected items when all below len 128 (#20183)
- Serialize categories of Enum in arrow metadata (#20181)
- Don't use RLE encoding for Parquet Boolean (#20172)
- Invalid `bitwise_xor` for ScalarColumn (#20140)
- Add temporal feature gate in `is_elementwise_top_level` (#20177)
- Column name mismatch or not found in Parquet scan with filter (#20178)
- Raise if apply returns different types (#20168)
- Deal with masked out list elements (#20161)
- Fix index out of bounds in uniform_hist_count (#20133)
- Implement `arg_sort` for Null series (#20135)
- Handle slice pushdown in PythonUDF GroupBy (#20132)
- Check shape for `*_horizontal` functions (#20130)
- Properly coerce types in lists (#20126)
- Incorrect aggregation of empty groups after slice (#20127)
- `DataFrame.get_column` after `drop_in_place` (#20120)
- Subtraction with underflow on empty FixedSizeBinaryArray (#20109)
- Materialize smallest dyn ints to use feature gate for i8/i16 (#20108)
- Return null instead of 0. for rolling_std when window contains a single element and ddof=1 and there are nulls elsewhere in the Series (#20077)
- Only slice after sort when slice is smaller than frame length (#20084)
- Preserve Series name in __rpow__ operation (#20072)
- Allow nested `is_in()` in `when()/then()` for full-streaming (#20052)
- Fix datetime cast behavior for pre-epoch times (#19949)
- Improve `hist` binning around breakpoints (#20054)
- Fix invalid len due to projection pushdown selection of scalar (#20049)
- Fix empty scalar agg type (#20051)
- Improve binning in `Series.hist` with `bin_count` when all values are the same (#20034)
- Less intrusive forking warnings (#20032)
- Reading nullable sliced / masked Categoricals from Parquet (#20024)
- Regression in `hist` panicking on out-of-bounds index (#20016)
- Fix starts_with out of bounds (#20006)
- Fix incorrect column order for parquet scan with hive columns in file (#19996)
- Incorrectly gave `list.len()` for masked-out rows (#19999)
- Bug fix in existing fast path for sorted series (#20004)
- Incorrect `collect_schema()` for `fill_null()` after an aggregation expression in group-by context (#19993)
- Fix Decimal type fill_null (#19981)
- Fix panic on schema merge for prefiltering (#19972)
- Fix lazy frame join expression (#19974)
- Fix `gather_every` for `Scalar` (#19964)
- Toggle 'fast_unique' on new_from_index (#19956)
- Raise proper error message when too small interval is passed to datetime_range (#19955)
- Fix scalar object (#19940)
- Raise InvalidOperationError for invalid float to decimal casts (e.g. Inf, NaN) (#19938)
- Fix panic with combination of hive and parquet prefiltering (#19905)
- Fix panic when joining with empty frame (debug only) (#19896)
- Fix incorrect result from inequality filter after join on LazyFrame (#19898)
- Misleading `ShapeError` error message on dataframe creation (#19901)
- Fix panic with empty delta scan, or empty parquet scan with a provided schema (#19884)
- Ensure type object of inputs for cached any-value conversion functions are kept alive (#19866)
- Fix panic using `scan_parquet().with_row_index()` with hive partitioning enabled (#19865)
- Improve histogram bin logic (#18761)
- Raise informative error instead of panicking for list arithmetic on some invalid dtypes (#19841)
- Properly handle Zero-Field Structs in row encoding (#19846)
- Incorrect explode schema for `LazyFrame.explode()` (#19860)
- Ensure `List` element truncation ellipses respect `ASCII*` table formats (#19835)
- Validate subnodes in validate IR (#19831)
- Raise if merging non-global categoricals in unpivot (#19826)
- Type hints for window_size incorrectly included timedelta in some rolling functions (#19827)
- Don't panic if column not found (#19824)
- Fix gather of Scalar null + idx w/ validity (#19823)
- Fix object chunked gather (#19811)
- Fix inconsistency between code and comment (#19810)
- Fix filter scalar nulls (#19786)
- Altair tooltip was being incorrectly applied to plots which did not accept it (#19789)
- Fix scanning google cloud with service account credentials file (#19782)
- Fix incorrect filter after right-join on LazyFrame (#19775)
- Fix incorrect lazy schema for explode on array columns (#19776)
- Fix incorrect lazy schema for aggregations (#19753)
- Fix validation for inner and left join when `join_nulls` is not set (#19698)
- SQL `ELSE` clause should be implicitly `NULL` when omitted (#19714)
- In group_by_dynamic, period and every were getting applied in reverse order for the window upper boundary (#19706)
- Only allow `list.to_struct` to be elementwise when width is fixed (#19688)
- Make Array arithmetic ops fully elementwise (#19682)
- Update line-splitting logic in batched CSV reader (#19508)
- Fix incorrect lazy schema for `explode()` in `agg()` (#19629)
- Fix filter incorrectly pushed past struct unnest when unnested column name matches upper column name (#19638)
- Ensure `mean_horizontal` raises on non-numeric input (#19648)
- Reorder conditions in is_leap_year (#19602)
- Copy height in .vstack() for empty dataframes (#19641) (#19642)
- Run join type coercion with correct schemas active (#19625)
- Correct wildcard and input expansion for some more functions (#19588)
- Allow `.struct.with_fields` inside `list.eval` (#19617)
- Sortedness was incorrectly being preserved in dt.offset_by when offsetting by non-constant durations in the timezone-naive case (#19616)
- Fix incorrect `scan_parquet().with_row_index()` with non-zero slice or with streaming collect (#19609)
- Fix mask and validity confusion in Parquet String decoding (#19614)
- Parquet decoding of nested dictionary values (#19605)
- Do not attempt to load default credentials when `credential_provider` is given (#19589)
- Fix gather len in group-by state (#19586)
- Added input validation for `explode` operation in the array namespace (#19163)
- Improve error message (#19546)
- Fix predicate pushdown into inequality joins (#19582)
📖 Documentation
- Add more Rust examples to User Guide (#20194)
- Expand plotting docs (#19719)
- Fix Rust examples in user guide (#20075)
- Update `by` param description for rolling_*_by functions (#19715)
- Fix inconsistency between code and comment (#20070)
- Correct supported compression formats (#20085)
- Specify strictness in cast (#20067)
- Fix broken links to user guide (#19989)
- Minor doc fixes and cleanup (#19935)
- Complete parameters description and add an example for `clip()` (#19875)
- Fix some warnings during docs build (#19848)
- Change dprint config (#19747)
- Fix formatting of nested list (#19746)
- Add `meta.is_column` to API docs (#19744)
- Fix join API reference links (#19745)
- Revise and rework user-guide/expressions (#19360)
- Update Excel page of user guide to refer to fastexcel as the default engine (#19691)
- Alter examples for round_sig_figs to make behaviour clearer (#19667)
- Assorted fixes to Rust API docs (#19664)
- Improve `replace` and `replace_all` docstring explanation of the "$" character with reference to capture groups (vs use as a literal) (#19529)
📦 Build system
- Upgrade `sqlparser-rs` from version `0.49` to `0.52` (#20110)
- Bump `memmap2` to version `0.9` (#20105)
- Bump `object_store` to version `0.11` (#20102)
- Bump `fs4` to version `0.12` (#20101)
- Fix path to `polars-dylib` crate in workspace (#20103)
- Bump `thiserror` to version `2` (#20097)
- Bump `atoi_simd` to version `0.16` (#20098)
- Bump `chrono-tz` to `0.10` (#20094)
- Update Rust dependency `ndarray` to `0.16` (#20093)
- Bump Rust toolchain to `nightly-2024-11-28` (#20064)
- Pin maturin (#20063)
- Use public windows runners in python release (#19982)
- Add windows-aarch64 to python binaries (#19966)
🛠️ Other improvements
- Deprecate ddof parameter for correlation coefficient (#20197)
- Move Bitwise aggregations to FunctionExpr (#20193)
- Add ragged lines test (#20182)
- Remove dedicated `sink_(parquet/ipc)_cloud` functions (#20164)
- Move new-streaming parquet and CSV sources to under `io_sources/` (#20160)
- Move horizontal methods to polars-ops (#20134)
- Remove useless SeriesTrait::get implementations (#20136)
- Add a bunch more automated row encoding sortedness tests (#20056)
- Replace custom `PushNode` trait with `Extend` (#20107)
- Update AWS doc dependencies (#20095)
- Move cast from polars-arrow to polars-compute (#19967)
- Implement nested row encoding / decoding (#19874)
- Remove use of cast in `ArrowArray::new` (#19899)
- Switch back to PyO3 0.22 (#19851)
- Make chunked gathers generic over chunk bit width (#19856)
- Add proper tests for row encoding (#19843)
- Add ToField context for common args (#19833)
- Add new streaming CSV source (#19694)
- Add BytesIndexMap and use in RowEncodedHashGrouper (#19817)
- Use HashKeys abstraction (#19785)
- Migrate polars-expr AggregationContext to use `Column` (#19736)
- Add InMemoryJoin to new-streaming engine (#19741)
- Use `Column` for the `{try,}_apply_columns{_par,}` functions on `DataFrame` (#19683)
- Remove more `@scalar-opt` (#19666)
- Move Series bitops to `std::ops::Bit...` (#19673)
- Mark test_parquet.py test_dict_slices as slow (#19675)
- Get `Column` into `polars-expr` (#19660)
- Remove unused file (#19661)
- Delegate feature flags for polars-stream (#19659)
- Streamline internal SQL join condition processing (#19658)
- Factor out logic for re-use by new streaming CSV source (#19637)
- Configure grouped Dependabot updates (#19604)
- Share source token between all sender tasks of source nodes in new-streaming engine (#19593)
- Fix PyO3 error in CI (#19545)
- Update nightly compiler version (#19590)
- Added input validation for `explode` operation in the array namespace (#19163)
- Remove MutableStructArray (#19587)
- Fix lint (#19584)
- Add a `Column::Partitioned` variant (#19557)
- Move to fast-float2 (#19578)
- Only run remote bench on rust changes (#19581)
Thank you to all our contributors for making this release possible!
@3tilley, @DzenanJupic, @MarcoGorelli, @TNieuwdorp, @YichiZhang0613, @alexander-beedie, @barak1412, @braaannigan, @cmdlineluser, @coastalwhite, @corwinjoy, @dependabot, @dependabot[bot], @eitsupi, @engylemure, @etiennebacher, @flowlight0, @gab23r, @henryharbeck, @iharthi, @iliya-malecki, @ion-elgreco, @itamarst, @jackxxu, @janpipek, @jqnatividad, @letkemann, @lukapeschke, @lukemanley, @max-muoto, @mcrumiller, @mhogervo, @nameexhaustion, @orlp, @ptiza, @ritchie46, @rodrigogiraoserrao, @siddharth-vi, @sn0rkmaiden, @stijnherfst, @stinodego, @wence- and @wsyxbcl