🏆 Highlights
- Out-of-core unique (#8573)
⚠️ Breaking changes
- Rename concat_lst to concat_list (#8597)
- Schema improvements (#8286)
- don't create duplicate pivot names (#8002)
- rename toggle_string_cache to enable_string_cache (#7970)
- change top_k(descending) -> bottom_k (#7969)
- in sort, top_k, sort_by, and arg_sort_by, raise if descending is a sequence and its length doesn't match the number of columns to sort by (#7957)
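The last breaking change above (#7957) describes a length check on the descending argument. A minimal plain-Python sketch of that kind of check follows. The helper name `normalize_descending` and the error wording are hypothetical and not part of Polars; the sketch only illustrates the rule stated in the entry.

```python
from typing import List, Sequence, Union


def normalize_descending(
    by: Sequence[str], descending: Union[bool, Sequence[bool]]
) -> List[bool]:
    """Hypothetical helper: expand `descending` to one flag per sort column."""
    if isinstance(descending, bool):
        # A single flag applies to every column being sorted.
        return [descending] * len(by)
    flags = list(descending)
    if len(flags) != len(by):
        # Illustrates the rule from (#7957): a length mismatch is an error.
        raise ValueError(
            f"the length of `descending` ({len(flags)}) does not match "
            f"the number of columns to sort by ({len(by)})"
        )
    return flags
```

Under this sketch, a call with two sort columns and a one-element sequence fails early instead of silently reusing or ignoring flags.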
🚀 Performance improvements
- elide function calls in AnyValue::eq (#8725)
- add fused multiply add optimization for expressions (#8690)
- use expression for dot product (#8686)
- improve nested grouptuples related code (#8618)
- buffer spill partitions in ooc sort. ~10/20% (#8616)
- improve OOC sort performance during partition phase (#8590)
- remove some unnecessary calls and matches (#8490)
- less naive count (#8473)
- parallelize almost all flattens (#8468)
- optimize horizontal min/max (#8463)
- reinstate old behavior in numeric group-tuples (#8445)
- remove false sharing in perfect hash table >2x (#8432)
- further optimised conversions to python date/datetime (#8417)
- optimize join inner materialization of single keys (#8405)
- parallelize sorted group tuple materialization (#8387)
- improve materialization of huge cardinality group tuples (#8382)
- improve group_tuples materialization (#8375)
- use online variance kernel for aggregation (#8306)
- add specialized boolean aggregation for min/max (#8294)
- fail fast on non-inferable strings in strptime if no fmt is provided (#8111)
- make chunks search more resilient (#8229)
- SIMD accelerated arg_min/arg_max (via argminmax) (#8074)
- speed up csv parsing for slower datetimes formats (#8213)
- arr.eval run on groupby expression engine when possible (#8199)
- FromParallelIterator<Option<str>> for Utf8Chunked ~1.9x (#8058)
- speed up from_par_iter Option<bool> ~2.5x (#8057)
- parallelize numeric ChunkedArray materialization ~2x. (#8053)
- parallelize into_groups materialization ~-25% (#8036)
- use a trusted anyvalue builder (#8001)
- numeric grouptuples with nulls hash in single pass ~25% (#7980)
- use perfect hash table for categoricals (#7951)
- improve group_tuples of high cardinality data ~10% (#7938)
- use streaming instead of partitioned groupby (#7907)
- don't auto-stream groupby (#7906)
- rechunk before aggs (#7903)
- don't re-allocate groups in sorted to_dummies (#7897)
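The "online variance kernel" entry (#8306) refers to computing variance in a single streaming pass. As a rough illustration, the sketch below uses Welford's algorithm, a common online method. Whether Polars uses exactly this formulation is an assumption, and the function name `online_variance` is made up for the example.

```python
from typing import Iterable, Optional


def online_variance(values: Iterable[float], ddof: int = 1) -> Optional[float]:
    """Single-pass variance via Welford's update (illustrative only)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        # Uses the updated mean, which keeps the update numerically stable.
        m2 += delta * (x - mean)
    if count - ddof <= 0:
        # Not enough observations for the requested degrees of freedom.
        return None
    return m2 / (count - ddof)
```

The appeal of this style of kernel is that it never needs a second pass over the data or a separately stored mean, which suits chunked or streaming aggregation.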
✨ Enhancements
- add support for DISTINCT keyword in SQL select clauses (#8740)
- support any day of the week in 'start_by' in groupby_dynamic (#8720)
- add support for USING clause in SQL join operations (#8731)
- add support for HAVING clause to SQL GROUP BY operations (#8704)
- streaming unions (#8676)
- expression cache (#8674)
- rolling covariance and correlation (#8671)
- Add dt.to_string alias for dt.strftime (#8290)
- use temp dir for ooc spills (#8614)
- make ooc-sort resilient against chunk_size (#8588)
- Set strptime default strict/exact=true (#8587)
- Out-of-core unique (#8573)
- Add to_date, to_datetime, to_time to String namespace (#8579)
- more detailed error message on failure to cast List dtype (#8583)
- don't trigger unreachable code if no dtype is set (#8532)
- accept expressions in groupby_dynamic/rolling (#8528)
- expose quantile/mean for duration (#8491)
- require explicitly sorted flag for upsample (#8488)
- allow for _saturating suffix in duration strings (#8479)
- let duration string accept "1mo_saturating" (#8469)
- add dt.month_start and dt.month_end (#8435)
- add SQL support for cumulative functions (#8457)
- add str_slice method to StringNameSpace (#8427)
- allow negative 'arange' expression (#8413)
- warn if argument is not explicitly sorted (#8409)
- Schema improvements (#8286)
- add support for SQL "IN" expr (#8396)
- cli output mode & sql read_json (#8336)
- rename 'csv-file' to 'csv' (#8101)
- preserve time zone in combine (#8263)
- add use_earliest argument to replace_time_zone for dealing with ambiguous datetimes (#8087)
- SQL CTE's (#8208)
- add duration cumsum and remainder (#8219)
- better algorithm for streaming unique (#8003)
- Add approx distinct count via approx_unique() (#7937)
- adopt FunctionExpr for cat namespace (#8173)
- DatetimeArgs ergonomics (#8133)
- Remove Seek constraint from IpcStreamReader and SerReader (#8166)
- implement FunctionExpr for bound and round methods (#8172)
- display skipped row if same number of rows (#8170)
- move all boolean expressions into BooleanFunction enum (#8132)
- rewrite log expressions to make them serializable (#8126)
- make unique expr serde and cmp (#8153)
- adopt FunctionExpr for abs to allow for serialization (#8129)
- adopt FunctionExpr for cum* functions (#8130)
- support negative index in pct_change (#8137)
- add log1p to list of mathematical functions (#8102)
- expand list of tz-aware formats which can be auto-inferred (#8085)
- clearer error message if strptime without a fmt specified fails (#8086)
- infer tz-aware formats with try_parse_dates in read_csv (#8084)
- feat(python, rust)! make 'mo' interval raise if the target date does not exist (#8078)
- auto-infer fmt for tz-aware date strings (#7405)
- multiple sql contexts & optional sql highlighting in cli (#8072)
- implement arg_sort for struct dtype (#8051)
- support struct in df.unique (#7976)
- change top_k(descending) -> bottom_k (#7969)
- optimize away nested unions in lp (#7861)
- Add seed argument to rank for random (#7913)
- auto-infer detecting time-zone-awareness of fmt argument in strptime; deprecate tz_aware argument (#7886)
- deal with null values in cut/qcut (#7878)
- support datetime/date subclasses (e.g. FreezeGun) (#7819)
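Several entries above concern month offsets: the "_saturating" suffix (#8479, #8469) and the 'mo' interval raising when the target date does not exist (#8078). The standard-library sketch below shows the distinction, assuming that "saturating" means clamping to the last valid day of the target month. The function `offset_months` is hypothetical and is not the Polars implementation.

```python
import calendar
from datetime import date


def offset_months(d: date, months: int, saturating: bool = False) -> date:
    """Shift `d` by whole months; clamp or raise when the day does not exist."""
    # Zero-based month index makes year roll-over plain integer arithmetic.
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    if d.day > last_day:
        if not saturating:
            # Analogous to a plain month offset: the target date does not exist.
            raise ValueError(f"{year}-{month:02d}-{d.day:02d} does not exist")
        # Analogous to the saturating variant: clamp to the month's last day.
        return date(year, month, last_day)
    return date(year, month, d.day)
```

For example, shifting 31 January by one month has no valid target in February, so the non-saturating form raises while the saturating form lands on the last day of February.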
🐞 Bug fixes
- groupby_dynamic was unnecessarily failing on ambiguous local datetime (#8737)
- ensure count aggregation has proper length when spilling (#8735)
- fix return value of std for single-element sequence with ddof=1 (#8730)
- don't take logical plan during streaming fmt (#8711)
- Don't upcast in round() for f32 when decimal is 0 (#8706)
- block predicate containing shifts and windows after sort (#8670)
- ensure perfect hash table processes the nulls (#8668)
- Reading more tiny CSVs than workers in parallel will deadlock (#8441)
- respect maintain_order in partitioned groupby (#8653)
- fix explode null series (#8654)
- fix categorical agg type (#8645)
- allow list<null> -> list<cat> (#8636)
- maintain sorted info on top-k and empty sort (#8615)
- maintain sortedness in date -> datetime cast (#8606)
- fix determining of supertype for tz-aware and tz-naive datetimes (#8585)
- fix csv reader with new line in header (#8580)
- correct for nested offsets in json serialization (#8584)
- fix wrong dtype init in streaming groupby (#8574)
- fix categorical/string_cache fill_null panic (#8562)
- fix window function contention in binary expression (#8544)
- fix StructChunked not_equal comparator/operator (#8547)
- fix struct pyarrow ffi (#8543)
- don't trigger unreachable code if no dtype is set (#8532)
- keep sorted info on agg_first and simple singleton… (#8526)
- unset fast_unique coming from arrow (#8521)
- correct sign-reversed scale on DecimalChunked to Python Decimal conversion (fixes #8423) (#8508)
- don't error on cast if column is not projected (#8495)
- ensure window function succeeds on empty frame (#8492)
- don't set verbose on union (#8487)
- check literal/group length before claiming agg sta… (#8486)
- fix error message of offset_by if offsetting by negative number of months (#8464)
- fix sorted warning (#8462)
- fix features serde and dtype-struct not compiling together (#8439)
- respect dtype in anonymous list builder in case of… (#8428)
- infer supertype in json serde (#8411)
- duration on empty df (#8403)
- don't inadvertently set Series initialised with nested tuple data as Object dtype (#8401)
- use physical in streaming unique global table (#8390)
- recursively bubble up all dtypes in list cast (#8386)
- is_in struct logical types (#8378)
- fix nested null parquet read (#8372)
- fix logical type in ListChunked::new_from_index (#8367)
- bubble up logical type in recursive list cast (#8356)
- implement clone_inner for all series (#8357)
- fix fill_null for categorical (#8353)
- time.cast(str) as strftime (#8351)
- fix logical dtypes in parallel list collection (#8349)
- improve logical types of explode operation (#8348)
- logical type in anonymous list builders (#8346)
- escape csv header names if they contain special chars (#8331)
- nested struct/list/categorical logical/physical (#8334)
- fix deserialize empty list (#8326)
- fix coalesce schema (#8324)
- don't do null propagation (#8322)
- ensure invalid list eval raises (#8317)
- pass name to struct construction in aggregation (#8299)
- Use three slashes for doc comments (#8284)
- improve nested list construction (#8278)
- Fix DataFrame.sum returning empty column names (#8283)
- always sort in top_k fast path (#8275)
- don't use fast paths for sorted join if there are … (#8272)
- fix boolean par materialization (#8257)
- improve null/empty list construction (#8255)
- fix offsets in parallel utf8 materialization (#8254)
- nested struct logical type consistency (#8249)
- keep literal state if elementwise function is applied (#8195)
- decimal ensure backed arrow arrays have correct dtype (#8193)
- ensure cached nodes are initialized once (#8103)
- validate map lengths (#8147)
- fix row-wise init of UInt64 values that exceed Int64 upper bound (#8146)
- implement list<null> constructor (#8143)
- add all primitives to av_buffer builder (#8140)
- struct is_in (#8139)
- fix wrong display name of binary expressions (#8131)
- lazy: fix boolean sum schema (#8108)
- don't exponentially grow error messages (partial fix). (#8081)
- check element count in multi-column explode (#8050)
- set lower limit for chunk_size (#8048)
- impl to_static for struct (#8037)
- all/any empty sets (#8012)
- struct null_count, cast string, transpose and describe (#8009)
- fix pivot and transpose of struct data (#8005)
- don't create duplicate pivot names (#8002)
- fix chunked literals in expression engine (#7973)
- in sort, top_k, sort_by, and arg_sort_by, raise if descending is a sequence and its length doesn't match the number of columns to sort by (#7957)
- concat object types (#7958)
- fix decimal conversion alignment (#7954)
- Fix lazy encode schema (#7912)
- respect skip_nulls in apply for temporal types (#7908)
- fix lit agg (#7904)
- disable ooc groupby (#7901)
- fix abs logical type (#7895)
- fix boolean min/max output type and null handling (#7894)
- validate groupby_dynamic inputs (#7876)
- correct for chunks in arg_where (#7873)
- fix nested logical/physical list (#7872)
- fix arbitrary nested logical types (#7869)
- don't use fxhash in sink_sorted fast path (#7849)
- parquet stats & all kernel (#7846)
🛠️ Other improvements
- remove unnecessary feature flag requirement for start_by=monday in groupby_dynamic (#8716)
- remove some branches (#8688)
- streaming pipeline creation (#8656)
- simplify replace_time_zone (#8644)
- make slice attribute in UnionOptions consistent with … (#8639)
- document the dispatcher (#8637)
- Rename concat_lst to concat_list (#8597)
- remove unreachable/duplicated code in get_supertype (#8592)
- change partition strategy (#8561)
- remove some unnecessary calls and matches (#8490)
- improve sorted warning/ fix tests (#8484)
- bubble up time_iter errors (#8467)
- Minor update to strptime (#8345)
- use concat_owned_array_unchecked when possible (#8274)
- Rename strptime/strftime args (#8221)
- change sampling ratio for groupby strategy (#8223)
- Rename Expr.list to implode (#8165)
- introduce FieldsMapper utility class for obtaining FunctionExpr schema (#8175)
- don't panic on err in offset_by (#8210)
- remove unused list_construction (#8197)
- split dsl paragraph header (#8162)
- feature flag guards (#8117)
- use map_private where applicable to reduce code duplication (#8128)
- remove unnecessary to_string (#8083)
- docs(rust) Add note about -1 to show all rows. (#8080)
- Fixed a bunch of clippy warnings (#7967)
- rename toggle_string_cache to enable_string_cache (#7970)
- Include license files in polars-error and polars-row crates (#7930)
- quantile typo in qcut (#7936)
- Improve Duration::parse docs (#7918)
- improve shift and fill performance in case of periods >= ca.len() (#7843)
Thank you to all our contributors for making this release possible!
@DeflateAwning, @JoonHong-Kim, @LdRoW, @MarcoGorelli, @Newtoniano, @StefanBRas, @alexander-beedie, @alonme, @ankane, @avimallu, @ayemjay, @borchero, @cgevans, @chitralverma, @clickingbuttons, @dependabot, @dependabot[bot], @ghuls, @grantmcdermott, @jonashaag, @josh, @jvdd, @lorentzenchr, @mcrumiller, @mzjp2, @n8henrie, @pgimalac, @rben01, @ritchie46, @stinodego, @uchiiii, @universalmind303, @utkarshgupta137, @zaynetro and @zundertj