🏆 Highlights
- Add new `Int128Type` (#20232)
💥 Breaking changes
- Support writing partitioned parquet to cloud (#20590)
🚀 Performance improvements
- Use BitmapBuilder in yet more places (#20868)
- Make an owned version of append (#20800)
- Use BitmapBuilder in a lot more places (#20776)
- Extend functionality on BitmapBuilder and use in Growables (#20754)
- Specialize first/last agg for simple types in new-streaming engine (#20728)
- Improve state caching and parallelism of window functions (#20689)
- Broadcast without materialization in `concat_arr` (#20681)
- Cache rolling groups (#20675)
- Use downcast_ref instead of dtype equality in `<dyn SeriesTrait as AsRef<ChunkedArray<T>>` (#20664)
- Fix performance regression for DataFrame serialization/pickling (#20641)
- Make Parquet `verify_dict_indices` SIMD (#20623)
- Move to `zlib-rs` by default and use `zstd::with_buffer` (#20614)
- Skip filter expansion in eager (#20586)
- Use AtomicWaker in async engine task joiner (#20604)
- Move morsel distribution to the computational async engine (#20600)
- Improve unique pred-pd (#20569)
- Collapse expanded filters in eager (#20493)
- Remove predicate from `IR::DataFrame` (#20492)
- Add proper distributor to new-streaming parquet reader (#20372)
- Use different binview dedup strategy depending on chunks ratio (#20451)
- Generalize the `arg_sort` fast path onto `Column` (#20437)
- Dedup binviews up front (#20449)
- Re-enable common subplan elim for new-streaming engine (#20443)
- Don't collect all LHS arrays in gather (#20441)
- Remove prepare_series for gather kernels (#20439)
- Don't always take all data buffers when gathering views (#20435)
- Order observability optimizations (#20396)
- Purge ChunkedArray Metadata (#20371)
- Drop probe tables in parallel in new-streaming equi-join (#20373)
- Explicit transpose in new-streaming equi-join finalize (#20363)
- Cache dtype on ExprIR (#20331)
✨ Enhancements
- Expose descending and nulls last in window order-by (#20919)
- Add `linear_space` (#20678)
- Implement df.unique() on new-streaming engine (#20875)
- Add unique operations for Decimal dtype (#20855)
- Add NDJson sink for the new streaming engine (#20805)
- Support nested keys in window functions (#20837)
- Add CSV sink for the new streaming engine (#20804)
- Periodically check python signals ('CTRL-C' handling) (#20826)
- Experimental unity catalog client (#20798)
- Support cumulative aggregations for `Decimal` dtype (#20802)
- Improve window function caching strategy (#20791)
- Allow different python versions for pickle (#20740)
- Add SQL support for the `NORMALIZE` string function (#20705)
- Add 'allow_exact_matches' to 'join_asof' (#20723)
- Add new-streaming first/last aggregations (#20716)
- Add Parquet Sink to new streaming engine (#20690)
- Expose IRBuilder (#20710)
- Make automatic use of Azure storage account keys opt-in (#20652)
- Improve `GroupsProxy/GroupsPosition` to be sliceable and cheaply cloneable (#20673)
- Add `str.normalize()` (#20483)
- Allow more group_by agg expressions in the new streaming engine (#20663)
- Support writing partitioned parquet to cloud (#20590)
- Add hint to error message for extra struct field in JSON (#20612)
- Add `index_of()` function to `Series` and `Expr` (#19894)
- Update `sqlparser-rs`, enabling "LEFT" keyword to be optional for anti/semi joins in SQL queries (#20576)
- Add `cat.starts_with`/`cat.ends_with` (#20257)
- Add `Int128` IO support for csv & ipc (#20535)
- Support arbitrary expressions in 'join_where' (#20525)
- Allow more join lossless casting (#20474)
- Always resolve dynamic types in schema (#20406)
- Order observability optimizations (#20396)
- Add FirstArgLossless supertype (#20394)
- Add `dt.replace` (#19708)
- Polars build for Pyodide (#20383)
- Add Azure credential provider using `DefaultAzureCredential()` (#20384)
- Add env var to ignore file cache allocate error (#20356)
- Enable joins between compatible differing numeric key columns (#20332)
- Cache dtype on ExprIR (#20331)
- Serialize DataFrame/Series using IPC in serde (#20266)
- Improve error message on SchemaError (#20326)
- Use better error messages when opening files (#20307)
- Add 'skip_lines' for CSV (#20301)
- Allow subtraction of time dtype columns (#20300)
- Add `bin.reinterpret` (#20263)
- Allow decoding of non-Polars arrow dictionaries in Arrow and Parquet (#20248)
- Add new `Int128Type` (#20232)
- IR formatting QoL improvements (#20246)
- Add `cat.len_chars` and `cat.len_bytes` (#20211)
- Expose AexprArena (#20230)
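
The `str.normalize()` and SQL `NORMALIZE` entries above concern Unicode normalization of string values. The snippet below is only a conceptual sketch of what normalization does, written with Python's standard `unicodedata` module rather than the Polars API; the sample strings are illustrative assumptions, not taken from the release.

```python
import unicodedata

# The same word written two ways: a precomposed "é" (one code point)
# versus "e" followed by a combining acute accent (two code points).
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# The strings render identically but compare unequal until normalized.
assert composed != decomposed

nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose fully

print(nfc == composed)     # True
print(nfd == decomposed)   # True
print(len(nfc), len(nfd))  # 4 5
```

Normalizing both sides to the same form before comparing, joining, or grouping on string keys avoids treating visually identical values as distinct.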
🐞 Bug fixes
- Fix `from_numpy` returning Null dtype for empty 1D numpy array (#20907)
- Fix `map_elements` panicking with Decimal type (#20905)
- Warn if asof keys not sorted (#20887)
- Avoid name collisions and panicking in object conversion (#20890)
- Incorrect scale used in `log` and `exp` for Decimal type (#20888)
- Don't deep clone manuallydrop in GroupsPosition (#20886)
- Fix DuplicateError when selecting columns after `join_where` or cross join + filter (#20865)
- Incorrect `Decimal` value for `fill_null(strategy="one")` (#20844)
- Fix one edge case (out of many) of int128 literals not working (#20830)
- Add height check to frame-level row indexing when key is int (#20778)
- Remove `assert` that panics on `group_by` followed by `head(n)`, where `n` is larger than the frame height (#20819)
- Fix panic `InvalidHeaderValue` scanning from S3 on Windows (#20820)
- Fix `clip` for `Decimal` returning wrong values (#20814)
- Incorrect height from slicing after projecting only the file path column (#20817)
- Shift mask when skipping Bitpacked values in Parquet (#20810)
- Error instead of truncate if length mismatch for several `str` functions (#20781)
- Support cumulative aggregations for `Decimal` dtype (#20802)
- Do not print sensitive information to output on `POLARS_VERBOSE` (#20797)
- Ignore file cache allocation error if `fallocate()` is not permitted (#20796)
- Incorrect logic in `assert_series_equal` for infinities (#20763)
- Avoid blocking on async runtime when resolving cloud scans (#20750)
- Fix `allow_invalid_certificates` being ignored in `storage_options` (#20744)
- Incorrect output type for `map_groups` returning all-NULL column (#20743)
- Fix `unique(maintain_order=True)` raising `InvalidOperationError` for null array (#20737)
- Don't collapse into a Nested Loop Join if the cross join maintains order (#20729)
- Don't serialize credentials provider (#20741)
- Fix `Series.n_unique` raising for list of struct (#20724)
- Fix incorrect top-k by sorted column, fix `head()` returning extra rows (#20722)
- Add outer validity to AnyValueBufferTrusted for structs (#20713)
- Don't partition group-by with non-scalar literals in agg (#20704)
- Incorrect view buffer dedup (#20691)
- Only verify Parquet ConvertedType if no LogicalType is given (#20682)
- Validate length of `schema_overrides` in `read_csv` (#20672)
- Fix `map_elements` ignoring `skip_nulls=True` for struct dtype (#20668)
- Check for MAP-GROUPS in cloud-eligible (#20662)
- Fix empty output of `to_arrow()` on filtered unit height DataFrame (#20656)
- Add `.default` to azure credential provider scope URL (#20651)
- Fix `join_asof` panicking for invalid `tolerance` input (#20643)
- Incorrect flag check on is_elementwise (#20646)
- Don't panic but set null type if type is unknown (#20647)
- Fix performance regression for DataFrame serialization/pickling (#20641)
- Fix `Int128` dtype serialization (#20629)
- Ensure that SQL `LIKE` and `ILIKE` operators support multi-line matches (#20613)
- Properly broadcast in sort_by (#20434)
- Properly load nested Parquet Statistics (#20610)
- AWS environment config was not loaded when credential provider was used (#20611)
- Fix order observability of group-by-dyn (#20615)
- Soundness when loading Parquet string statistics (#20585)
- Fix error filtering after `with_columns()` on unit height LazyFrame (#20584)
- Restore symbols on Apple by bumping nightly version (#20563)
- Fix variable name in error message for "unsupported data type" in rolling and upsampling operations (#20553)
- Output index type instead of u32 for `sum_horizontal` with boolean inputs (#20531)
- Fix more global categorical issues (#20547)
- Update eager join doctest on multiple columns (#20542)
- Revert categorical unique code (#20540)
- Add `unique` fast path for empty categoricals (#20536)
- Fix various `Int128` operations (#20515)
- Fix global cat unique (#20524)
- Fix union (#20523)
- Fix rolling aggregations for various integer types (#20512)
- Ensure `ignore_nulls` is respected in horizontal sum/mean (#20469)
- Fix incorrectly added sorted flag after append for lexically ordered categorical series (#20414)
- More `Int128` testing and related fixes (#20494)
- Validate column names in `unique()` for empty DataFrames (#20411)
- Implement `list.min` and `list.max` for `list[i128]` (#20488)
- Decimal from physical in horizontal min/max and shift (#20487)
- Don't remove sort if first/last strategy is set in unique (#20481)
- Fix join literal behavior (#20477)
- Validate asof join by args in IR resolving phase (#20473)
- Fix `align_frames` with single row panicking (#20466)
- Allow multiple column sort for Decimal (#20452)
- Fix mode panicking for String dtype (#20458)
- Return correct schema for `sum_horizontal` with boolean dtype (#20459)
- Properly handle `to_physical_repr` of nested types (#20413)
- Workaround for `mmap` crash under Emscripten (#20418)
- Fix using `new_columns` in `scan_csv` with compressed file (#20412)
- Fix decimal arithmetic schema (#20398)
- Raise on categorical search_sorted (#20395)
- Don't try to load non-existent List/FSL statistics (#20388)
- Propagate nulls for float methods on all numeric types (#20386)
- Add env var to ignore file cache allocate error (#20356)
- Flip order on right join (#20358)
- Fix incorrect object store caching for ADLS URI (#20357)
- Use the same encoding for nullable as non-nullable arrays (#20323)
- Improve error message on SchemaError (#20326)
- Boolean optional slice pushdown (#20315)
- Properly handle `from_physical` for List/Array (#20311)
- Ignore quotes in csv comments (#20306)
- Ensure pl.datetime returns empty column when input columns are empty (#20278)
- Ensure output height does not change on lazy projection pushdown with aggregations (#20223)
- Fix error writing on Windows to locations outside of C drive (#20245)
- Incorrect comparison in some cases with filtered list/array columns (#20243)
- Ensure height is maintained in SQL `SELECT 1 FROM` (#20241)
- Properly account for updated Categorical in .unique() kernel (#20235)
- Fix incorrect lazy `select(len())` with some select orderings (#20222)
- Fix assertion panic on LazyFrame `scratch.is_empty()` (#20219)
📖 Documentation
- Update source URL for `legislators-historical.csv` (#20858)
- Fix typo in sql functions (cosinus -> cosine) (#20676)
- Fix small typo in plugins (polars-dt -> polars-st) (#20657)
- Add polars-h3 and polars-st to plugin list (#20653)
- Add docs reference for `Field` (#20625)
- Miscellaneous minor updates/fixes (#20573)
- Update "group_by_rolling" (deprecated) to "rolling" in user guide (#20548)
- Fix flaky doctests (#20516)
- Clarify the join pre-condition of `join_asof` (#20509)
- Fix `Expr.all` description of Kleene logic (#20409)
- Improve docstring clarity (#20416)
- Fix "forcolumnar" typo in docs (#20401)
- Remove Plugins overview page without information (#20348)
- Small fixes/clarifications in user guide (#20335)
- Improve docs about NaN (#20310)
- Fix typo in `fork` warning (#20258)
🛠️ Other improvements
- Add tests for already resolved issues (#20921)
- Fix the `verify_dict_indices` codegen (#20920)
- Add ProjectionContext in projection pushdown opt (#20918)
- Disable 'catalog' in build (#20897)
- Implement negative slice for new streaming IPC (#20866)
- Remove last instances of itoa (#20881)
- Reduce bloat in static_array_collect by using BitmapBuilders (#20891)
- Use defunctionalization in polars-core scalar.rs in order to reduce code duplication (#20377)
- Simplify decimal formatting and remove itoap dep (#20880)
- Remove polars(_core)::export (#20869)
- Debloat Series bitops (#20873)
- Move sum kernel to polars-compute (#20867)
- Remove todo and test restriction for new-streaming (#20861)
- Dispatch to the in-mem engine for `AExpr::Gather` (#20862)
- Dispatch to the in-memory engine for multifile sources (#20860)
- Add tests for open issues (#20857)
- Mark 'register_startup' as unsafe (#20841)
- Reduce mode bloat (#20839)
- Rename `ContainsMany` to `ContainsAny` (#20785)
- Unpin NumPy in type checking workflow (#20792)
- Add various tests (#20768)
- Small drive-by's (#20772)
- Touch the upload probe for the remote benchmark (#20767)
- Fix remote benchmark script (#20755)
- Fix tests (#20745)
- Simplify hive predicate handling in `NEW_MULTIFILE` (#20730)
- Add tests for various open issues (#20720)
- Add tests for various open issues that have been fixed (#20680)
- Don't include debug symbols in benchmark run (#20571)
- Remove implicit reverse from AExpr::replace_inputs() (#20659)
- Implement CSV, IPC and NDJson in the `MultiScanExec` node (#20648)
- Fix Python deps installation in remote-benchmark workflow (#20619)
- Fix rust-analyzer misinterpretation (#20595)
- Remove unused file (#20594)
- Rename is_numeric to is_primitive_numeric (#20574)
- Reduce size of ArrowDataType by boxing heavy variants (#20588)
- Bump multiversion from 0.7 to 0.8 (#20543)
- Groundwork for allowing multi-output nodes in the new streaming engine (#20550)
- Improve bin size info (#20551)
- Increase categorical test coverage (#20514)
- Report wheel sizes (#20541)
- Add tests for `floor/ceil` on integers (#20479)
- Expose and rewrite 'can_pre_agg' (#20450)
- Skip test on windows; kuzu import segfaults (#20463)
- Add a `TypeCheckRule` to the optimizer (#20425)
- Fix duplicate cols in new-streaming parquet prefilter (#20419)
- Move gather kernels to polars-compute (#20415)
- Temporarily disable common subplan elim for new-streaming (#20374)
- Remove unused IR::Reduce node (#20392)
- Enable masked out list, struct and array elements in parametric tests (#20365)
- Dispatch slice/filter lowering properly (#20390)
- Move hive partitioning/multi-file handling outside of readers (#20203)
- Purge ChunkedArray Metadata (#20371)
- Add equi joins to new streaming engine (#19869)
- Make parametric tests include `pl.List` and `pl.Array` by default (#20319)
- Use Column in Row Encoding (#20312)
- Don't warn on fork hook (#20309)
- Don't deconstruct `CsvParseOptions` (#20302)
- Allow decoding of non-Polars arrow dictionaries in Arrow and Parquet (#20248)
- Add `FunctionCastOptions` and conservative IR-level cast type-checking (#20286)
- Add more descriptive error message for failure of vstack/extend (#20299)
- Expose AexprArena (#20230)
Thank you to all our contributors for making this release possible!
@Biswas-N, @FBruzzesi, @IndexSeek, @Jesse-Bakker, @MarcoGorelli, @MoizesCBF, @Prathamesh-Ghatole, @SamuelAllain, @Terrigible, @ZemanOndrej, @alexander-beedie, @arnabanimesh, @balbok0, @beckernick, @braaannigan, @brifitz, @bschoenmaeckers, @burakemir, @coastalwhite, @deanm0000, @dependabot[bot], @dimfeld, @eitsupi, @etiennebacher, @georgestagg, @hamdanal, @haocheng6, @ion-elgreco, @itamarst, @jqnatividad, @kszlim, @lukemanley, @mcrumiller, @nameexhaustion, @noexecstack, @orlp, @ptiza, @r-brink, @ritchie46, @rodrigogiraoserrao, @siddharth-vi, @stijnherfst, @stinodego, @tswast, @zero-stroke and dependabot[bot]