🏆 Highlights
- Add new `Enum` categorical data type which allows a fixed set of categories (#11822)
💥 Breaking changes
- Rename `Utf8` data type to `String` (#13224)
- Rename `set_at_idx` to `scatter` (#12687)
- Preserve left and right join keys in outer joins (#12963)
- Implement `dtype` parameter for `int_range` on Rust side (#12940)
- Update `Expr.count` to ignore null values by default (#12934)
- Change `value_counts` resulting column name from `counts` to `count` (#12506)
- Change default `join` behavior with regard to nulls, add `join_nulls` parameter to keep existing behavior (#12840)
- Smaller integer data types for datetime components (#12070)
- Fix `NaN` ordering to make NaNs compare greater than any other float, and equal to themselves (#12721)
- Rename `frame_equal`/`series_equal` to `equals` (#12663)
- Rename `not_` expression to `not` on the Rust side (#12587)
- Rename `str.json_extract` to `str.json_decode` (#12586)
- Rename DataFrame column index methods (#12542)
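
The `NaN` ordering change (#12721) can be illustrated with a minimal standalone sketch. It uses only the Rust standard library and is not Polars' internal implementation; the helper names `nan_max_cmp` and `sort_nan_last` are hypothetical and exist only for this illustration. It shows a comparison in which NaN is greater than any other float and equal to itself:

```rust
use std::cmp::Ordering;

/// Hypothetical helper (not a Polars API): compare two floats so that
/// NaN is greater than every other value and equal to itself.
fn nan_max_cmp(a: f64, b: f64) -> Ordering {
    match (a.is_nan(), b.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither value is NaN here, so partial_cmp always returns Some.
        (false, false) => a.partial_cmp(&b).unwrap(),
    }
}

/// Hypothetical helper: sort a slice in ascending order with NaNs last.
fn sort_nan_last(values: &mut [f64]) {
    values.sort_by(|a, b| nan_max_cmp(*a, *b));
}
```

Under this ordering, sorting a slice that contains `f64::INFINITY` and `f64::NAN` places the NaN after infinity, and comparing two NaNs yields `Ordering::Equal`.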
🚀 Performance improvements
- optimize set bit count (#13317)
- speed up `.dt.truncate` for large numbers of years (#13310)
- don't eagerly evaluate error branches (#13311)
- don't needlessly allocate validity in concat/rechunk (#13288)
- add fast path to `count_bits_set_by_offsets` (#13253)
- make `.dt.truncate('*mo')` more than 3x faster (#13192)
- ensure single expression evaluation for replace (#13147)
- Elide allocation in outer join materialization (#12992)
- Ensure we reduce for `any/all_horizontal` (#12976)
- Add fast paths for UTC in `truncate` (#12965)
- Improve `rolling_median` algorithm (#12704)
- Use fast path for non-null data in new SQL-like null matching (#12874)
- improve `merge_local_rhs_categorical` traversal (#12660)
- make values_size estimate correct for sliced arrays (#12658)
- improve parquet utf8 validation (#12655)
- parquet pre-allocate buffer in binary plain encode (#12652)
- optimize dict binary decoding in parquet (#12648)
- ensure we only check the values within bounds (#12633)
- parquet; elide recursion in hot path (#12625)
- improve cov/corr algorithm (#12590)
- apply left side predicate pushdown also to right side on semi join (#12565)
- ensure streaming parquet download remains concurrent `~7x` (#12552)
- speed up parquet download of streaming engine (#12544)
✨ Enhancements
- support negative indices in `gather` in `group_by` context (#13373)
- support negative indexing in gather (select context) (#13343)
- support min_periods for temporal rolling aggregations (#13342)
- support `REGEXP` and `RLIKE` pattern matching in SQL engine (#13359)
- gracefully handle panics in plugins (#13329)
- Implement `unique/n_unique/unique_counts/is_unique/is_duplicated` for `Null` series (#13307)
- support common variant spelling `STDEV` in the SQL engine (in addition to `STDDEV`) (#13303)
- change doc links to new url docs.pola.rs (#13290)
- support horizontal concatenation of LazyFrames (#13139)
- Impl serde for array dtype (#13168)
- dispatch strict_cast via cast (#13255)
- Impl any/all for array type (#13250)
- add cancellable queries (#13178)
- add `offset` parameter to `gather_every` (#13156)
- Support `Array` dtype AnyValue Series construction (#12817)
- Allow `step` parameter in `int_ranges` to take an expression (#13148)
- Implement `count` for DataFrame/LazyFrame (#13153)
- Move from GA to more privacy friendly framework (#13155)
- Rename `set_at_idx` to `scatter` (#12687)
- prune all/any_horizontals with single inputs (#13146)
- ensure we get cleaner logical plans with `any/all_horizontal` (#13144)
- Add `str.contains_any` and `str.replace_many` (Aho-Corasick algorithms) (#13073)
- Auto-infer credentials from `.aws` folder (#13062)
- Support private cloud S3 storage in `scan_parquet` (#13060)
- Allow order operators (<,>,>=,<=) on Enum types (#12982)
- Reimplement `replace` expression on the Rust side (#13002)
- Use tokio semaphore for concurrency handling (#13026)
- Improve and expressify `hist` (#13014)
- Preserve left and right join keys in outer joins (#12963)
- Allow `end` before `start` in `date/time_range` (#12964)
- Implement group-tuples for `Null` dtype (#12975)
- Implement `dtype` parameter for `int_range` on Rust side (#12940)
- Cast to an enum from int (#12954)
- Move categorical ordering into dtype (#12911)
- Update `Expr.count` to ignore null values by default (#12934)
- Enable partial predicate pushdown past window expressions (#12710)
- Add `str.reverse` (#12878)
- Change `value_counts` resulting column name from `counts` to `count` (#12506)
- Implement `std` and `var` for `Duration` columns (#12865)
- Change default `join` behavior with regard to nulls, add `join_nulls` parameter to keep existing behavior (#12840)
- Preserve base dtype when raising to `UInt` power (#10446)
- Smaller integer data types for datetime components (#12070)
- Support SQL subqueries for `JOIN` and `FROM` (#12819)
- parquet support required deltabyte encoding (#12836)
- Add new `Enum` categorical data type which allows a fixed set of categories (#11822)
- support nested null in vstack/append/extend/concat (#12771)
- Improve error messages on attempted Arrow conversions involving incompatible/unknown dtypes (#12421)
- determine mode parallelism depending on current tasks (#12764)
- enable slice push down past `with_columns` (#12742)
- implement From<LazyGroupBy> for LazyFrame (#12562)
- Rename `frame_equal`/`series_equal` to `equals` (#12663)
- Join operations on local categoricals (#12657)
- use RLE_DICTIONARY for integers in parquet (#12647)
- Add configuration option for where Polars spills to disk (#12595)
- implement RLE_DICT encoding for utf8/binary columns (reduced parquet file size) (#12623)
- implement 'DeltaByteArray' decoding for parquet (#12602)
- warn if `by` column is not sorted in rolling aggregations (as opposed to raising), add warn_if_unsorted argument (#12398)
- struct -> json encoding expression (#12583)
- Implement support for multi-character comments in `read_csv` (#12519)
- Implement `LazyFrame.sink_ndjson` (#10786)
- improve concurrency parameters (#12567)
- Adds sink_ipc_cloud (#12556)
- Adds sink_ipc_cloud (#11008)
- In explain(), rename PIPELINE to STREAMING so it's clearer what it means (#12547)
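
The `join_nulls` change listed above (#12840) concerns whether null join keys match each other. As a rough standalone sketch, assuming the new default is SQL-like matching in which null keys never match and `join_nulls` restores the earlier behavior, the difference looks like this. The function `inner_join_indices` is hypothetical, uses only the standard library, and models nulls as `None`; it is not Polars' join implementation:

```rust
/// Hypothetical helper (not a Polars API): return the (left, right) row
/// index pairs produced by an inner join on a single nullable key column.
/// When `join_nulls` is false, a null key never matches anything; when it
/// is true, null keys match other null keys.
fn inner_join_indices(
    left: &[Option<i32>],
    right: &[Option<i32>],
    join_nulls: bool,
) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for (i, l) in left.iter().enumerate() {
        for (j, r) in right.iter().enumerate() {
            let matched = match (l, r) {
                // Two non-null keys match when their values are equal.
                (Some(a), Some(b)) => a == b,
                // Two null keys match only if the caller opted in.
                (None, None) => join_nulls,
                // A null key never matches a non-null key.
                _ => false,
            };
            if matched {
                pairs.push((i, j));
            }
        }
    }
    pairs
}
```

With left keys `[Some(1), None, Some(3)]` and right keys `[None, Some(1)]`, the sketch pairs only the two `Some(1)` rows when `join_nulls` is `false`, and additionally pairs the two null rows when it is `true`.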
🐞 Bug fixes
- range/ranges output name should follow lhs rule (#13369)
- updated Display trait for enum categoricals (#13331)
- nested dtypes: export logical type in plugins (#13325)
- fix invalid dtype setting in array (#13327)
- fix `csv` parser error when commented-out rows precede the header row (#13318)
- invalid schema outer join after projection pd (#13315)
- invalid predicate optimization (#13313)
- Account for null values in categorical `unique/n_unique` (#13308)
- fix schema when subtracting (#13309)
- broadcasting of unit LHS in string operations (#12737)
- casting list/arr to arr/list shouldn't convert chunks to logical type (#13259)
- sorting categorical lexically bugs on null values (#13271)
- improve replace on categoricals (#13223)
- round trip to JSON and back should preserve Enum type (#13267)
- enable and fix SIMD in polars-compute (#13251)
- match_chunks shouldn't change the dtype (#13222)
- sink_csv deadlock (#13239)
- `is_in` operator for categoricals (#13205)
- Better handle mismatched dtypes in `replace` (#13213)
- Fix `replace` fast path by casting `old` input to the right data type (#13176)
- ndjson nested null schema inference (#13206)
- slice for `NullChunked` no longer force single chunk (#13174)
- don't cast to unknown dtypes (#13197)
- Allow casting nullable list to array (#13196)
- maintain old join behavior in window expression (#13179)
- Fix comparison of categoricals (#13137)
- Use the name of the leftmost expression in horizontal operations (#13143)
- any_value should support cast to boolean (#13125)
- Update offsets of null value correctly for all `from_iter_xxx_trusted_len` (#13132)
- fix neq for series cmp str (#13128)
- fix category list builder append series with multiple chunks (#13116)
- repeat_by should not raise if by contains nulls (#13105)
- [csv] raise on single quote char (#13104)
- Raise if scan zstd compressed csv file (#13102)
- Don't check map length if input is literal (#13098)
- use `FunctionExpr`'s scalar return type for `is_in` (#13091)
- rolling_quantile can get incorrect state (#13088)
- Fix off-by-one error in `quantile(method="nearest")` (#13058)
- Fix incorrect schema inference on nested columns (#13057)
- Don't raise for `datetime_range` if starting on ambiguous datetime and earliest was specified (#13050)
- add cast safety to literals (#12983)
- Parse `json_decode` per max buffer length (#13029)
- Parse `00:00` time zone as UTC (#13034)
- Fix timeout errors in concurrent downloads (#13023)
- Fix SQL substring indexing (#13016)
- Allow broadcasting in `ranges` (#11900)
- Prevent deadlock in `sink_csv` (#12991)
- Don't get mutable if buffer is sliced (#12979)
- Dataframes with Decimal columns cannot be pickled (#12955)
- Fix `truncate` when truncating by multiple weeks (#12948)
- Fix segfault / memory corruption after plugins return `Err` result (#12953)
- Don't panic when `ambiguous` parameter is not Utf8 (#12913)
- don't panic on empty df in `merge_sort` (#12923)
- Patch `rolling_var`/`rolling_std` numerical stability (#12909)
- Fix incorrect Int16 `min`/`max` due to incorrect SIMD mask construction (#12908)
- Fix OOB error in list set operations on empty frame (#12845)
- Fix repr of `Expr.gather` (which was still showing deprecated take) (#12864)
- Fix `nan_min/max` incorrectly aggregating chunks with addition (#12848)
- write only one dict page per row group (#12831)
- incorrect values from parquet RLE decoding (#12818)
- Handle aggregation for all-NaN groups in `group_by` (#12304)
- Use total float ordering in `is_in` (#12800)
- Fix `NaN` ordering to make NaNs compare greater than any other float, and equal to themselves (#12721)
- don't use streaming engine if aggregate is unknown (#12769)
- hold align_chunks_invariant (#12738)
- allow leading zero and plus in integer parsing (#12744)
- csv lines iter, always return remainder (#12739)
- fix oob in set operations (#12736)
- undo regression in ability to read certain parquet files (#12731)
- corr return nan if denominator is invalid (#12708)
- parquet decimal statistics and schema (#12705)
- support `append`/`extend` with null series (#11824) (#12686)
- fix carrying over infinity into other windows (#12685)
- json null inference (#12677)
- cov/corr respect f32 type (#12676)
- fix ternary zip_with null broadcast (#12668)
- support negative slice on eager frame (#12644)
- fix concurrency budget assertion (#12641)
- fix oob in set operations (#12640)
- Rename `not_` expression to `not` on the Rust side (#12587)
- panic reading parquet nested struct column (#12614)
- features: `performant,lazy,random` (#12600)
- error when invalid list to array is given (#12584)
- parquet: do not extend existing nested that is already complete (#12569)
- accidental panic if predicate selects no files (#12575)
- fix lazy parquet slice with nested columns (#12558)
- ensure stats-evaluator exists (#12566)
- list schema of list `eval` (#12563)
- ensure concurrency budget never locks (#12555)
- Fix lazy schema for `group_by_dynamic` and `rolling` (#12551)
- address overflow on vec capacity calculation for `int_ranges` with negative step (#12548)
🛠️ Other improvements
- Update CODEOWNERS (#13292)
- Change base url of docs/guide to `docs.pola.rs` (#13281)
- Add note about Rust examples versioning in user guide (#13280)
- split-up file_sink module (#13256)
- Rename `Utf8` data type to `String` (#13224)
- update rustc (#13219)
- fix horizontal concatenation documentation (#13141)
- Set minimum version for `bytemuck` to `1.11` (#13191)
- bump sysinfo from 0.29.11 to 0.30.0 (#13188)
- Remove `polars-algo` reference in Cargo.toml (#13187)
- Use the name of the leftmost expression in horizontal operations (#13143)
- make pre_agg generic (#13150)
- move StaticArray to polars-arrow (#13106)
- ensure we get cleaner logical plans with `any/all_horizontal` (#13144)
- Update `auto_explode` param name to `returns_scalar` (#13119)
- don't compile polars-ops by default (#13100)
- update user-defined-functions for 0.19.x (#13071)
- Linting updates (#13069)
- take pl.concat out of StringCache context manager in "mismatched string cache" error message (#13076)
- add Enum to dtype list (#13080)
- further use TotalOrd (#13046)
- Minor typo fix (#13003)
- use new MinMax kernels (#12961)
- Refer to arrow crate unambiguously from polars-parquet (#12939)
- Fix issue with docs for `group_by_dynamic` (#12906)
- Fix failing tests (#12859)
- Update `make check` to only check `polars` crate (#12834)
- apply TotalOrd in more places (#12810)
- Use latest `atoi_simd` release (#12748)
- simplify rolling_median update (#12745)
- move nan_cmp and IsFloat to polars_utils (#12691)
- remove utf8 code in favor of binary (#12604)
- update custom allocator instructions to include macOS (#12593)
- Rename `str.json_extract` to `str.json_decode` (#12586)
- parquet refactors (#12574)
- convert all recursive parquet deserialize to iterative (#12560)
- Rename DataFrame column index methods (#12542)
Thank you to all our contributors for making this release possible!
@0siride, @MarcoGorelli, @Object905, @PierreAttard, @Qqwy, @RoDmitry, @SeanTroyUWO, @TNieuwdorp, @Yerachmiel-Feltzman, @adamreeve, @alexander-beedie, @c-peters, @cardoso, @cjfuller, @dependabot, @dependabot[bot], @dmitrybugakov, @eitsupi, @fernandocast, @gab23r, @ion-elgreco, @itamarst, @jankislinger, @jeroenboeye, @kszlim, @mcrumiller, @nameexhaustion, @oli-clive-griffin, @orlp, @paddymul, @petrosbar, @r-brink, @rancomp, @reswqa, @ritchie46, @rob-sil, @robvanmieghem, @romanovacca, @stinodego, @tkarabela, @uchiiii and @xuestrange