💥 Breaking changes
- Remove, deprecate or change eager `Expr`s to be lazy compatible (#24027)
🚀 Performance improvements
- Use specialized decoding for all predicates for Parquet dictionary encoding (#24403)
- Allocate only for read items when reading Parquet with predicate (#24401)
- Don't aggregate groups for strict cast if original len (#24381)
- Allocate only for read items when reading Parquet with predicate (#24324)
- Native streaming `int_range` with `len` or `count` (#24280)
- Lower `arg_unique` natively to the streaming engine (#24279)
- Move unordering optimization to end (#24286)
- Do ordering simplification step after common sub-plan elimination (#24269)
- Always simplify order requirements in IR (#24192)
- Basic de-duplication of filter expressions (#24220)
- Cache the IR in `pipe_with_schema` (#24213)
- Lower `arg_where` natively to streaming engine (#24088)
- Lower Expr.shift to streaming engine (#24106)
- Lower order-preserving groupby to streaming engine (#24053)
- Lower .sort(maintain_order=True).head() to streaming top_k (#24014)
- Lower top-k to streaming engine (#23979)
- Allow order pass through Filters and relax to row-separable instead of elementwise (#23969)
✨ Enhancements
- Roundtrip `BinaryOffset` type through Parquet (#24344)
- Add opt-in unstable functionality to load interval types as `Struct` (#24320)
- Add user guide section on AWS role assumption (#24421)
- Support `unique`/`n_unique`/`arg_unique` for `array` columns (#24406)
- Support S3 virtual-hosted–style URI (#24405)
- Remove explicit file create for local async writes (#24358)
- Support Partitioning sinks in cloud (#24399)
- User-friendly error message on empty path expansion (#24337)
- Add Polars security policy (#24314)
- Allow pl.Expr.log to take in an expression (#24226)
- Implement diff() in streaming engine (#24189)
- Enable Expr.diff(n) for negative n (#24200)
- Allow upcasting null-typed columns to nested column types in scans (#24185)
- Log pyarrow predicate conversion result in sensitive verbose logs (#24186)
- Add a deprecation warning for pl.Series.shift(Null) (#24114)
- Improve Debug formatting of DataType (#24056)
- Add `cum_*` as native streaming nodes (#23977)
- Add peak_{min,max} support for booleans (#24068)
- Add `DataFrame.map_columns` for eager evaluation (#23821)
- Add native streaming for `peaks_{min,max}` (#24039)
- IR graph arrows, monospace font, box nodes (#24021)
- Add `DataTypeExpr.default_value` (#23973)
- Lower `rle` to a native streaming engine node (#23929)
- Add support for `Int128` to pyo3-polars (#23959)
- Lower `rle_id` to a native streaming node (#23894)
- Pass `endpoint_url` loaded from `CredentialProviderAWS` to `scan/write_delta` (#23812)
- Dispatch `scan_iceberg` to native by default (#23912)
- Lower `unique_counts` and `value_counts` to streaming engine (#23890)
- Implement `dt.days_in_month` function (#23119)
- Fix errors on native `scan_iceberg` (#23811)
- Reinterpret binary data to fixed size numerical array (#22840)
- Make `rolling_map` serializable (#23848)
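
As a rough illustration of the `Expr.diff(n)` entry above (#24200), the following plain-Python sketch shows what a difference with a negative `n` could look like. It is not Polars code: `diff_sketch` is a hypothetical helper, and it assumes the result follows the "value minus shifted value" convention, so a negative `n` subtracts a later element and leaves the missing values at the end rather than the start. Check the Polars documentation for the exact behaviour, including null handling.

```python
from typing import Optional, Sequence


def diff_sketch(values: Sequence[float], n: int = 1) -> list[Optional[float]]:
    """Hypothetical helper: values[i] - values[i - n], None where i - n is out of range."""
    out: list[Optional[float]] = []
    for i, v in enumerate(values):
        j = i - n  # index of the element being subtracted
        out.append(v - values[j] if 0 <= j < len(values) else None)
    return out


# Positive n looks backwards; negative n looks forwards (assumed semantics).
forward = diff_sketch([1, 4, 9, 16], n=1)
backward = diff_sketch([1, 4, 9, 16], n=-1)
print(forward)   # [None, 3, 5, 7]
print(backward)  # [-3, -5, -7, None]
```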
🐞 Bug fixes
- Fix `AggState` on `all_literal` in `BinaryExpr` (#24461)
- Replace unsafe with collect (#24494)
- Show IR sort options in `explain` (#24465)
- Benchmark CI import (#24463)
- Fix schema on `ApplyExpr` with single row `literal` in agg context (#24422)
- Fix planner schema for dividing `pl.Float32` by int (#24432)
- Fix panic scanning from AWS legacy global endpoint URL (#24450)
- Emit proper tuple for Log in expression nodes (#24426)
- Do not propagate struct of nulls with null (#24420)
- Be stricter with invalid NDJSON input when `ignore_errors=False` (#24404)
- Implement `approx_n_unique` for temporal dtypes and Null (#24417)
- Correct `sink_ipc` overload for compression (#24398)
- Enable all integer dtypes for `by` parameter in `join_asof` (#24384)
- Fix Group-By + filter aggregation performs subsequent operations on all data instead of only filtered data (#24373)
- Fix incorrect output ordering for row-separable exprs (#24354)
- Fix `Series.__arrow_c_stream__` for Decimal and other logical types (#24120)
- Match output type to engine for `Struct` arithmetic (#23805)
- Make mmap use MAP_PRIVATE rather than MAP_SHARED (#24343)
- Fix cloud iceberg scan DATASET_PROVIDER_VTABLE error (#24338)
- Incorrect logic in negative streaming slice (#24326)
- Do not error on non-list `Sequence` for `columns` parameter in `read_excel` (#23967)
- Invalid conversion from non-bit numpy bools (#24312)
- Make `dt.epoch('s')` serializable (#24302)
- Make `Expr.rechunk` serializable (#24303)
- Schema mismatch for 'log' operation (#24300)
- Incorrect first/last aggregate in streaming engine (#24289)
- Fix group offsets in sliced groups (#24274)
- Panic in inexact date(time) conversion (#24268)
- The `index_of` feature should not depend on the `object` feature (#24256)
- Keep DSL cache after serialization and deserialization (#24265)
- Sanitize and warn about eval usage (#24262)
- Unique with keep="none" in new optimization pass (#24261)
- Correct size limits for Decimal cast (#24252)
- Unordered unions in check order observing pass (#24253)
- Fix dtype for `slice` on `Literal` in agg context (#24137)
- Fix incorrect `filter(lit(True))` when scanning hive (#24237)
- In-memory group_by on 128-bit integers (#24242)
- Fix panic in `gather` inside groupby with invalid indices (#24182)
- Release the GIL in map_groups (#24225)
- Remove extra explode in `LazyGroupBy.{head,tail}` (#24221)
- Fix panic in polars cloud CSV scan (#24197)
- Fix panic when loading categorical columns from IO plugin (#24205)
- Fix engine type for `concat_list` on AggScalar `implode` (#24160)
- Rolling_mean handle centered weights with len(values) < window_size (#24158)
- Reading `is_in` predicate for Parquet plain strings (#24184)
- Make PyCategories pickleable (#24170)
- Remove unused unsound function `to_mutable_slice` (#24173)
- PyO3 extension types giving compat_level errors (#24166)
- Allow non-elementwise by in top_k (#24164)
- Fix `sort_by` for `group_by_dynamic` context (#24152)
- Input-independent length aggregations in streaming (#24153)
- Release GIL when iterating df in to_arrow (#24151)
- Respect non-elementwise join_where conditions (#24135)
- Resolve schema mismatch for div on Boolean (#24111)
- Keep name when doing empty group-aware aggregation (#24098)
- Implode instead of `reshape_list` (#24078)
- Rolling mean with weights incorrect when min_samples < window_size (#23485)
- Allow `merge_sorted` for all types (#24077)
- Include datatypes in `row_encode` expression (#24074)
- Include UDF materialized type in serialization (#24073)
- Correct `.rolling()` output type for non-aggregations (#24072)
- Correct planner output schema for `join_asof` (#24071)
- Allow %B to work without specifying day (#24009)
- Correct output for `fold` and `reduce` (#24069)
- Expr.meta.output_name for struct fields (#24064)
- Ensure upcast operations on `pl.Date` default to microsecond precision (#23981)
- Add peak_{min,max} support for booleans (#24068)
- Planner output type for `mean` with strange input type (#24052)
- Remove, deprecate or change eager `Expr`s to be lazy compatible (#24027)
- Scan of multiple sources with `null` datatype (#24065)
- Categorical in nested data in row encoding (#24051)
- Missing length update in builder for pl.Array repetition (#24055)
- Race condition in global categories init (#24045)
- Revert "fix: Don't encode entire CategoricalMapping when going to Arrow (#24036)" (#24044)
- Error when using named functions (#24041)
- Don't encode entire CategoricalMapping when going to Arrow (#24036)
- Fix cast on arithmetic with `lit` (#23941)
- Incorrect slice-slice pushdown (#24032)
- Dedup common cache subplan in IR graph (#24028)
- Allow join on Decimal in in-memory engine (#24026)
- Fix datatypes for `eval.list` in aggregation context (#23911)
- Allocator capsule fallback panic (#24022)
- Accept another zlib "magic header" file signature (#24013)
- Fix `truediv` dtypes so `cast` in `list.eval` is not dropped (#23936)
- Don't reuse cached `return_dtype` for expanded map expressions (#24010)
- Cache id is not a valid dot node id (#24005)
- Align `map_elements` with and without `return_dtype` (#24007)
- Fix column dtype lifetime for `csv_write` segfault on `Categorical` (#23986)
- Allow serializing `LazyGroupBy.map_groups` (#23964)
- Correct allocator name in `PyCapsule` (#23968)
- Mismatched types for `write` function for windows (#23915)
- Fix `unpivot` panic when `index=` column not found (#23958)
- Fix `assert_frame_equal` with `check_dtypes=False` for all-null series with different types (#23943)
- Return correct python package version (#23951)
- Categorical namespace functions fail on `Enum` columns (#23925)
- Properly set sumwise complete on filter for missing columns (#23877)
- Restore Arrow-FFI-based Python<->Rust conversion in pyo3-polars (#23881)
- Group By with filters (#23917)
- Fix `read_csv` ignoring Decimal schema for header-only data (#23886)
- Ensure `collect()` native Iceberg always scans latest when no `snapshot_id` is given (#23907)
- Writing List(Array) columns to JSON without panic (#23875)
- Fill Iceberg missing fields with partition values if present in metadata (#23900)
- Create file for streaming sink even if unspawned (#23672)
- Update cloud testing environment (#23908)
- Parquet filtering on multiple RGs with literal predicate (#23903)
- Incorrect datatype passed to libc::write (#23904)
- Properly feature gate TZ_AWARE_RE usage (#23888)
- Improve identification of "non group-key" aggregates in SQL `GROUP BY` queries (#23191)
- Spawning tokio task outside reactor (#23884)
- Correctly raise DuplicateError on asof_join with suffix="" (#23864)
- Fix errors on native `scan_iceberg` (#23811)
- Fix index out of bounds panic filtering parquet (#23850)
- Fix error on empty range requests (#23844)
- Fix handling of hive partitioning `hive_start_idx` parameter (#23843)
📖 Documentation
- Rename `avg_birthday` -> `avg_age` in examples aggregation (#23726)
- Update Polars Cloud user guide (#24366)
- Update to Polars Cloud user guide (#24187)
- Update distributed page (#24323)
- Add Polars security policy (#24314)
- Fix few typos (#24305)
- Add missing reference to `LazyFrame.pipe_with_schema()` on the website (#24285)
- Fix formatting of Series.value_counts examples (#24245)
- Add `DataFrame.map_columns` to API (#24128)
- Update multiple pages in the Polars Cloud user guide (#23661)
- Improve StackOverflow links in contributing guide (#23895)
- Fix `pyo3` documentation page link (#23839)
- Document the pureness requirements of udfs (#23787)
📦 Build system
🛠️ Other improvements
- Use `PlanCallback` in `name.map_*` (#24484)
- Replace unsafe with collect (#24494)
- Move dataset expansion to end and refactor not to use stack optimizer (#24457)
- Pin `xlsxwriter` to `3.2.5` or before (#24485)
- Add methods to `EnumUnitVec` and shorten name (#24415)
- Move CompressionUtils to polars-utils (#24430)
- Update github template to dispatch to cloud client (#24416)
- Bump c-api (#24412)
- Add a regression test for #7631 (#24363)
- Update cloud test `InteractiveQuery` to `DirectQuery` (#24287)
- Mark some tests as slow (#24327)
- Mark more tests as ready for cloud (#24315)
- Remove unnecessary stable_features for AVX512 (#24321)
- Remove PDS-H code (#24301)
- Get ready for even more cloud tests (#24292)
- Add tests for slices with caches (#24288)
- Readd ordering tests (#24284)
- Expand BitRepr to u8/u16 and use in in_memory group_by (#24248)
- Fix Makefile venv path (#24251)
- Remove unnecessary parentheses (#24244)
- Remove some transmutes (#24246)
- Wrap Py* data structures in polars-python in locks (#24209)
- Make non-nested shift{,_and_fill} ops generic (#24224)
- Remove unused `Wrap` (#24214)
- Propagate some python feature flags (#24201)
- Allow upcasting null-typed columns to nested column types in scans (#24185)
- Automatically label a few more types of PR (#24147)
- Update toolchain (#24156)
- InMemoryJoin should be coloured as InMemoryFallback (#24154)
- Fool-proof retrieve_error_msg (#24132)
- Add `order_sensitive` property for `AExpr` (#24116)
- Mark more tests as not possible on cloud (#24103)
- Turn `AggExpr::Count` from tuple to struct (#24096)
- Mark tests that may fail in cloud (#24067)
- Make CI perf failures more lenient (#24066)
- Fix hive partition string encoding in CI by upgrading `deltalake` (#24018)
- Avoid unreachable if dtype feature is not enabled (#24062)
- Make tests with sinks run on cloud again (#24048)
- Update pyo3-polars versions (#24031)
- Remove insert_error_function (#24023)
- Remove cache hits, clean up in-mem prefill (#24019)
- Use .venv instead of venv in pyo3-polars examples (#24024)
- Fix test failing mypy (#24017)
- Remove outdated comment (#23998)
- Add a `_plr.pyi` to remove `mypy` issues (#23970)
- Don't define CountStar as dyn OptimizationRule (#23976)
- Rename `atol` and `rtol` to `abs_tol` and `rel_tol` (#23961)
- Introduce `Row{Encode,Decode}` as FunctionExpr (#23933)
- Dispatch through `pl.map_batches` and `AnonymousColumnsUdf` (#23867)
- Ensure `clippy` and `rustfmt` run in CI when changing `pyo3-polars` (#23930)
- Split `column_selector.rs` (#23921)
- Fix pyo3-polars proc-macro re-exports (#23918)
- Make `GetBatchState` polling functions unsafe (#23795)
- Rewrite `evaluate_on_groups` for `.gather`/`.get` (#23700)
- Remove `Context` from logical layer (#23863)
- Add `proptest` strategy for Polars `DataType` schemas (#23854)
- Move Python C API to `python-polars` (#23876)
- Refactor directory structure of streaming multi-scan (#23865)
- Add subphase and query task spawning to StreamingExecState (#23725)
- Update Rust Polars versions (#23861)
- Make polars-parquet optional (#23860)
- Relax constraint on maximum Python version for `numba` (#23838)
Thank you to all our contributors for making this release possible!
@Gusabary, @JakubValtar, @Kevin-Patyk, @MarcoGorelli, @Matt711, @NeejWeej, @VictorAtIfInsurance, @agossard, @alexander-beedie, @aparna2198, @borchero, @c-peters, @camriddell, @cgevans, @cmdlineluser, @coastalwhite, @deanm0000, @dsprenkels, @eitsupi, @etiennebacher, @gab23r, @gfvioli, @henryharbeck, @iishutov, @itamarst, @jarondl, @jimmmmmmmmmmmy, @jjurm, @joshuamarkovic, @juansolm, @kdn36, @kuril, @math-hiyoko, @mcrumiller, @mpasa, @mrkn, @mroeschke, @nameexhaustion, @nesb1, @orlp, @pka, @pomo-mondreganto, @r-brink, @rawhuul, @ritchie46, @stijnherfst, @vdrn and @wence-