🏆 Highlights
- out of core sort on multiple columns (#7244)
🚀 Performance improvements
- rechunk dataframe before unique computation (#7814)
- improve hash quality (#7813)
- remove unnecessary copy in rolling function (#7801)
- always take sorted fast path group_tuples (#7787)
- change top_k algorithm (#7718)
- runtime SIMD target detection for `min/max/sum` and impl SIMD `mean` ~2-5x (#7702)
- implement top-k optimization (#7678)
- ooc-sort dump in thread local if IO-thread is full. (#7668)
- use perfect hash table for ooc partitioning (#7653)
- optimize string kernels, (elide redundant allocs) (#7602)
- optimize `str_replace` for same length replacements ~2x (#7580)
- improve perf of `str.replace_n` and add `n` argument ~10x (#7575)
- speedup `replace_literal_all` of single byte replacements ~15x. (#7565)
- set sorted flags (#7558)
- use atoi in favor of lexical in strptime -25% (#7501)
- [csv] faster utf8 validation ~20% (#7500)
- [csv] SIMD accelerate SplitFields -40% (#7498)
- (csv) don't use memchr for splitfields -~0.15% (#7494)
- csv-file use fast-float for csv float parsing (#7492)
- speed up comparison of sorted arrays ~3.85x. (#7478)
- improve performance for datetime parsing with %Z (#7369)
- optimize str.replace_all (#7353)
- optimize str.replace ~2x improvement (#7347)
- ensure utf8 apply preallocates memory (#7345)
- improve batched csv readers perf and memory perf (#7329)
- use inlined strings for field and schema (#7272)
- reuse groups in binary expressions (#7202)
- improve perf of multi-args exprs in groupby context (#7186)
- improve single argument elementwise expression pe… (#7180)
- optimize arr.sum for list array with inner nulls (#7053)
- optimize arr.min/arr.max (#7050)
- optimize arr.mean (#7048)
- optimize arr.sum (#7047)
- optimize 'arg_where' (#7039)
- add `arr.count_match` expression and optimize `arr.sum` for `List<Boolean>` (#7023)
- remove O^2 behavior in melt (#7003)
- improve vec_hash perf for boolean and utf8 (#6963)
- don't pack utf8 columns in grouptuples ~5-15% (#6959)
- don't pack integer keys in determining group tuples (~8-18%). (#6956)
- use fxhash for all integers (#6954)
- speedup quantile/median ~2x (#6861)
- remove unneeded series allocations in groupby aggs (#6855)
- faster `str.contains` literal matching in the small-string regime (#6811)
- optimize `arg_min/arg_max` (#6799)
✨ Enhancements
- support mode for floats and categoricals (#7827)
- support sort by 'struct' type (#7822)
- thousand separators in shape of repr `DataFrame` (#7775)
- deprecate default value of `aggregation_function` being `'first'` in `pivot`. In a future version, it will default to `None` (#7784)
- add dt.datetime, dt.date, dt.time (#7735)
- add `qcut` (#7724)
- add `maintain_order` option to `Series.cut` (#7723)
- add `maintain_order` in `arr.unique` (#7721)
- `DataFrame.top_k` / `LazyFrame.top_k` (#7720)
- clearer error message when replace_time_zone encounters ambiguous or non-existent datetimes (#7685)
- anonymous_scan::as_any (#7715)
- include `set_fmt_float` value in `Config` load/save state (#7696)
- raise on descending date_range arguments (#7671)
- add `is_leap_year` to temporal expressions (#7618)
- full out-of core support for streaming groupby (#7630)
- clearer error message when creating duration string without integer (#7648)
- out-of-core `groupby/unique` of groupby on integer keys (#7604)
- slightly more space-efficient table output (use ellipsis char, not three periods) (#7599)
- implement decimal -> dtype cast (#7600)
- overwrite streaming chunk size (#7543)
- slice pushdown in `LazyFrame.unique` (#7470)
- streaming `LazyFrame.unique` (#7466)
- automatically infer iso8601-like dates (#7457)
- convert decimal 256 to 128 on entry (#7448)
- dynamically change chunk_size in streaming `explo… (#7415)
- add unary +,-,! to sql (#7399)
- use IO backed reader when `low_memory=True`. (#7394)
- The big error revamp (#7362)
- parse year-month-day as Datetime in slow-path (#7373)
- make melt streamable (#7364)
- don't rechunk before writing to csv (#7365)
- make `LazyFrame.explode` streamable. (#7341)
- initial working version of Decimal Series (#7220)
- implement serde for literal datetime and series (#7301)
- improve error message if mmap fails in ipc (#7300)
- add support for serializing categoricals to json (#7276)
- enable min-max skipping for binary in parquet, enable min-max skipping for `is_in` exprs (#7169)
- out of core sort on multiple columns (#7244)
- support nulls_last for multi-column sort (#7242)
- implement row encoding for boolean and binary (#7218)
- allow passing utc=True when parsing time-zone-naive date strings (#7203)
- add sql "ARRAY_AGG" (#7204)
- show column name if read_csv errors (#7177)
- add explode for binary (#7159)
- improve error message when read_csv fails (#7150)
- Improve usability of Null type. (#7136)
- add sort maintaining order row encoding (#7117)
- add glob support to scan_ndjson (#7143)
- streaming: scale chunk_size on table width (#7119)
- additional read functions (#7102)
- add 'use_statistics' option to parquet readers (#7087)
- add `arr.count_match` expression and optimize `arr.sum` for `List<Boolean>` (#7023)
- add sort for struct dtype (#7021)
- raise informative error if invalid datetime_format passed to write_csv (#7005)
- rename parse_dates => try_parse_dates (#6987)
- add is_duplicated/is_unique for struct dtype (#6940)
- supported nested fixedsizebinary conversion (#6923)
- raise error on invalid aggregation expressions (#6921)
- properly implement null array (#6817)
- avoid panic error in strftime with invalid format (#6810)
🐞 Bug fixes
- fill null list (#7836)
- fix explode list[null] (#7832)
- fix unicode lower/uppercase (#7826)
- don't use naive name in partitioned agg (#7810)
- Ensure CsvReader always respects the n_rows parameter (#7789)
- ensure k is lower than height (#7779)
- raise error on invalid categorical cast (#7686)
- compile issue in polars-lazy (#7766)
- compile issues in "polars-core" with default features (#7765)
- make zip_with_same_type obligatory (#7761)
- fix melt projection pushdown node (#7752)
- fix predicate pushdown for 'unique' first/last (#7749)
- fix null propagation (#7748)
- avoid ambiguous time error when passing python Datetime to DataFrame constructor (#7711)
- Fix inferring CSV schema when skip_rows_after_heade… (#7701)
- fix race condition in null handling of window fast… (#7695)
- respect time zone in groupby_rolling with negative offset (#7664)
- fix empty case str.replace (#7662)
- respect time zone in rolling_* functions (#7643)
- fix schema of decimal type reads (#7652)
- respect time zone in offset_by (#7626)
- respect time zone in dt.round (#7611)
- add decimal chunk_lengths (#7589)
- fix ooc sort. the fast path was invalid (#7588)
- Fix regression throwing AmbiguousTimeError in groupby_dynamic (#7454)
- activate dtype-duration for polars-ops (#7582)
- distinct project whole schema if not a subset (#7581)
- sql window functions (#7458)
- respect time zone in upsample (#7563)
- fix rolling windows for windows that shrink from lhs (#7556)
- pushdown key in merge sorted projection pd (#7542)
- don't upcast column to string in 'is_in' operation (#7538)
- Enable link to DateLikeNameSpace in the docs. (#7526)
- respect time zone in date_range (#7503)
- use physical types in sort-by args (#7518)
- fix projection pushdown of asof_joins (#7487)
- raise error on categorical by arguments if not fro… (#7464)
- sql floor & ceil (#7456)
- allow for hourly date_range to cross DST (#7430)
- respect lexical/physical in multi-column categoric… (#7417)
- fix null_dtype slice (#7414)
- sort_by logical types (#7412)
- parse single-digit months and dates when code would have gone down fastpath (#7391)
- creating empty struct series with some unit fields (#7383)
- don't panic when writing `NullArray` values to python row tuple (#7346)
- fix projection pushdown on join with unused join key (#7326)
- raise error on time -> datetime cast (#7325)
- make `pl.struct` mappable (#7299)
- err on duplicate with_column names (#7296)
- don't panic on `str.parse_int` (#7072)
- improve concat_list with empty list error message (#7236)
- fix groupby_dynamic's binning when index_column is time-zone-aware (#7278)
- fix preservation of microseconds when converting Python datetime (#7271)
- no panic on empty cross join (#7266)
- raise error on ambiguous filter predicates (#7265)
- handle concat_list with first lit value (#7235)
- add type annotation to avoid potential build errors (#7223)
- floating point CSV parsing with escaping and whitespace (#7196)
- make list function 'map' and refactor multi-arg ex… (#7185)
- validate trees before inserting streaming node (#7179)
- fix list take logical types (#7163)
- fix null cmp fast paths (#7157)
- don't panic on unsupported arithmetic type (#7154)
- don't let a cast unset agg_state and keep logical … (#7151)
- expose sort expressions to stack-optimizer (#7148)
- improve error message when read_csv fails (#7150)
- make cast unknown a no-op (#7147)
- fix panic on cum_prod (#7141)
- respect f32 schema in deep expressions (#7146)
- fix deadlock in scan_csv()->sink_parquet() (#7118)
- make CSV reader respect n_rows with globbing (#6969)
- nested sql exprs (#7112)
- fix logical types in arr.get (#7094)
- allow fill_null in eager if type now known (#7092)
- do projection just before concat to ensure same sizes (#7089)
- fix 'filter' in groupby context when expression is… (#7041)
- reflect time zone conversion in lazy dataframe schema (#7022)
- ensure set_sorted never panics (#7013)
- fix struct append 0 sliced (#7012)
- fix coalesce supertype (#7000)
- fix fill_null for categoricals (#6998)
- dtype of pow function (#6985)
- fix is_duplicated for utf8 dtype (#6997)
- fix temporal logical types in pivot (#6957)
- ensure literals are expanded in streaming (#6952)
- str.contains strict=False took no effect (#6950)
- add special fast path for elementwise expression o… (#6924)
- fix arg_min/arg_max when sorted (#6927)
- fix anonymous list builder (#6916)
- reject multithreading on excessive ',\n' fields (#6906)
- dispatch suffix to asof_join by (#6899)
- improve recursive casting of nested data (#6897)
- don't fast explode on null introducing take (#6890)
- fix crash in write_csv when mixed tz-naive and tz-aware datetimes are present (#6828)
- Do not panic when inferring schema from empty rows (#6849)
- fix schema of functions: (#6845)
- Do not panic when failing to extract numeric value (#6848)
- stabilize integer operation to minimal required dtype (#6841)
- respect schema in ndjson (#6819)
🛠️ Other improvements
- refactor(rust); split up `vector_hasher` module (#7807)
- remove unnecessary copy in rolling function (#7801)
- cover uncovered paths in agg_* functions (#7800)
- Add "typos" as spell checking lint (#7759)
- fix typos (#7756)
- change some panics to errors (#7669)
- remove apply_on_tz_corrected (#7624)
- don't branch via error in read_csv::parse_dates (#7621)
- fix a bunch of cargo warnings & errors (#7549)
- factor out some utils into polars-time/src/utils (#7562)
- abstract memory collection in sinks (#7560)
- mark `DataFrame.get_columns_mut` as unsafe (#7557)
- Use pre-installed rustup (#7544)
- refactor date parsing (#7517)
- refactor join pushdown (#7486)
- Use `eprintln!` instead of `eprint!` (#7473)
- Improved JSON IO docs (#7445)
- update arrow (#7409)
- Rename Decimal `prec` to `precision` (#7401)
- add more docstrings to `Expr` (#7258)
- use SchemaRef in CSV modules (#7250)
- fix polars-row tests and add to ci (#7275)
- remove binary feature (#7219)
- Replace `num` with `num-traits` + a few minor maintenance fixes (#7201)
- simplify binary expression evaluation (#7195)
- ensure binary branches are executed in parall… (#7193)
- Build versioned API reference (#7114)
- update_arrow fix categorical statistics (#7098)
- separate crate for error type (#7096)
- Rename kwarg reverse to descending (#6914)
- update rayon (#7001)
- remove time 0.1 dep (#6979)
- add LazyFileListReader trait (#6937)
- cleanup is_unique impl (#6935)
- Clean up some warnings (#6934)
- update rustc to nightly-2023-02-14 (#6909)
- avoid unnecessary mut (#6894)
- setup support for fixedsizebinary conversion (#6867)
- split agg in modules and make quantile DRY (#6857)
- Rename argsort/argsort_by to arg_sort/arg_sort_by (#6829)
- Update dprint config excludes (#6822)
Thank you to all our contributors for making this release possible!
@CloseChoice, @Hofer-Julian, @LdRoW, @MarcoGorelli, @MatveyF, @SauravMaheshkar, @Trippy3, @Vincenthays, @adamgreg, @advoet, @aldanor, @alexander-beedie, @borchero, @chitralverma, @cjackal, @coinflip112, @csko, @datapythonista, @dependabot, @dependabot[bot], @didriksg, @duskmoon314, @ecashin, @foxcroftjn, @ghuls, @iamsmkr, @igmriegel, @jakob-keller, @jonashaag, @josemasar, @josh, @juba, @jvdd, @kngwyu, @minimav, @moritzwilksch, @mslapek, @nrebena, @oysols, @ozgrakkurt, @papparapa, @ptiza, @rben01, @ritchie46, @romanovacca, @s-banach, @sorhawell, @stinodego, @universalmind303, @vincev, @xhochy, @xyning and @zundertj