github pola-rs/polars rs-0.28
Rust Polars 0.28.0

18 months ago

🏆 Highlights

  • out of core sort on multiple columns (#7244)

🚀 Performance improvements

  • rechunk dataframe before unique computation (#7814)
  • improve hash quality (#7813)
  • remove unnecessary copy in rolling function (#7801)
  • always take sorted fast path group_tuples (#7787)
  • change top_k algorithm (#7718)
  • runtime SIMD target detection for min/max/sum and impl SIMD mean ~2-5x (#7702)
  • implement top-k optimization (#7678)
  • ooc-sort dump in thread local if IO-thread is full. (#7668)
  • use perfect hash table for ooc partitioning (#7653)
  • optimize string kernels (elide redundant allocs) (#7602)
  • optimize str_replace for same length replacements ~2x (#7580)
  • improve perf of str.replace_n and add n argument ~10x (#7575)
  • speedup replace_literal_all of single byte replacements ~15x. (#7565)
  • set sorted flags (#7558)
  • use atoi in favor of lexical in strptime -25% (#7501)
  • [csv] faster utf8 validation ~20% (#7500)
  • [csv] SIMD accelerate SplitFields -40% (#7498)
  • (csv) don't use memchr for splitfields -~0.15% (#7494)
  • csv-file use fast-float for csv float parsing (#7492)
  • speed up comparison of sorted arrays ~3.85x. (#7478)
  • improve performance for datetime parsing with %Z (#7369)
  • optimize str.replace_all (#7353)
  • optimize str.replace ~2x improvement (#7347)
  • ensure utf8 apply preallocates memory (#7345)
  • improve batched csv readers perf and memory perf (#7329)
  • use inlined strings for field and schema (#7272)
  • reuse groups in binary expressions (#7202)
  • improve perf of multi-args exprs in groupby context (#7186)
  • improve single argument elementwise expression pe… (#7180)
  • optimize arr.sum for list array with inner nulls (#7053)
  • optimize arr.min/arr.max (#7050)
  • optimize arr.mean (#7048)
  • optimize arr.sum (#7047)
  • optimize 'arg_where' (#7039)
  • add arr.count_match expression and optimize arr.sum for List<Boolean> (#7023)
  • remove O(n^2) behavior in melt (#7003)
  • improve vec_hash perf for boolean and utf8 (#6963)
  • don't pack utf8 columns in grouptuples ~5-15% (#6959)
  • don't pack integer keys in determining ~8-18% group tuples. (#6956)
  • use fxhash for all integers (#6954)
  • speedup quantile/median ~2x (#6861)
  • remove unneeded series allocations in groupby aggs (#6855)
  • faster str.contains literal matching in the small-string regime (#6811)
  • optimize arg_min/arg_max (#6799)
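
Two entries above concern top-k (#7718, #7678). The release notes do not say which algorithm Polars uses, so the following is only a generic sketch of one common technique: a bounded min-heap that keeps the k largest values in O(n log k) time, without sorting the full input. The standalone `top_k` function is hypothetical and uses only the Rust standard library. It is not the Polars API.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper (not part of Polars): return the `k` largest
/// values of `values`, in descending order.
fn top_k(values: &[i64], k: usize) -> Vec<i64> {
    // Never keep more candidates than there are input values.
    let k = k.min(values.len());
    if k == 0 {
        return Vec::new();
    }

    // `Reverse` turns the max-heap into a min-heap, so `peek` yields the
    // smallest of the current top-k candidates.
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::with_capacity(k);
    for &v in values {
        if heap.len() < k {
            heap.push(Reverse(v));
        } else if let Some(&Reverse(smallest)) = heap.peek() {
            // Replace the smallest candidate only if `v` beats it.
            if v > smallest {
                heap.pop();
                heap.push(Reverse(v));
            }
        }
    }

    // Unwrap the candidates and order them from largest to smallest.
    let mut out: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
    out.sort_unstable_by(|a, b| b.cmp(a));
    out
}
```

In this sketch a `k` larger than the input length is clamped to that length, so the function returns every value sorted in descending order.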

✨ Enhancements

  • support mode for floats and categoricals (#7827)
  • support sort by 'struct' type (#7822)
  • thousand separators in shape of repr DataFrame (#7775)
  • deprecate default value of aggregation_function being 'first' in pivot. In a future version, it will default to None (#7784)
  • add dt.datetime, dt.date, dt.time (#7735)
  • add qcut (#7724)
  • add maintain_order option to Series.cut (#7723)
  • add maintain_order in arr.unique (#7721)
  • DataFrame.top_k/ LazyFrame.top_k (#7720)
  • clearer error message when replace_time_zone encounters ambiguous or non-existent datetimes (#7685)
  • anonymous_scan::as_any (#7715)
  • include set_fmt_float value in Config load/save state (#7696)
  • raise on descending date_range arguments (#7671)
  • add is_leap_year to temporal expressions (#7618)
  • full out-of core support for streaming groupby (#7630)
  • clearer error message when creating duration string without integer (#7648)
  • out-of-core groupby/unique of groupby on integer keys (#7604)
  • slightly more space-efficient table output (use ellipsis char, not three periods) (#7599)
  • implement decimal -> dtype cast (#7600)
  • overwrite streaming chunk size (#7543)
  • slice pushdown in LazyFrame.unique (#7470)
  • streaming LazyFrame.unique (#7466)
  • automatically infer iso8601-like dates (#7457)
  • convert decimal 256 to 128 on entry (#7448)
  • dynamically change chunk_size in streaming `explo… (#7415)
  • add unary +,-,! to sql (#7399)
  • use IO backed reader when low_memory=True. (#7394)
  • The big error revamp (#7362)
  • parse year-month-day as Datetime in slow-path (#7373)
  • make melt streamable (#7364)
  • don't rechunk before writing to csv (#7365)
  • make LazyFrame.explode streamable. (#7341)
  • initial working version of Decimal Series (#7220)
  • implement serde for literal datetime and series (#7301)
  • improve error message if mmap fails in ipc (#7300)
  • add support for serializing categoricals to json (#7276)
  • enable min-max skipping for binary in parquet, enable min-max skipping for is_in exprs (#7169)
  • out of core sort on multiple columns (#7244)
  • support nulls_last for multi-column sort (#7242)
  • implement row encoding for boolean and binary (#7218)
  • allow passing utc=True when parsing time-zone-naive date strings (#7203)
  • add sql "ARRAY_AGG" (#7204)
  • show column name if read_csv errors (#7177)
  • add explode for binary (#7159)
  • improve error message when read_csv fails (#7150)
  • Improve usability of Null type. (#7136)
  • add sort maintaining order row encoding (#7117)
  • add glob support to scan_ndjson (#7143)
  • streaming: scale chunk_size on table width (#7119)
  • additional read functions (#7102)
  • add 'use_statistics' option to parquet readers (#7087)
  • add arr.count_match expression and optimize arr.sum for List<Boolean> (#7023)
  • add sort for struct dtype (#7021)
  • raise informative error if invalid datetime_format passed to write_csv (#7005)
  • rename parse_dates => try_parse_dates (#6987)
  • add is_duplicated/is_unique for struct dtype (#6940)
  • support nested fixedsizebinary conversion (#6923)
  • raise error on invalid aggregation expressions (#6921)
  • properly implement null array (#6817)
  • avoid panic error in strftime with invalid format (#6810)
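
As an illustration of what "nulls_last for multi-column sort" (#7242) means semantically, here is a minimal sketch in plain Rust. It assumes rows are modelled as tuples of `Option` values, with `None` standing in for null. The names `cmp_nulls_last` and `sort_rows_nulls_last` are hypothetical, and the code says nothing about how Polars implements this internally.

```rust
use std::cmp::Ordering;

/// Hypothetical comparator: order two optional values ascending,
/// placing `None` (null) after every `Some`.
fn cmp_nulls_last<T: Ord>(a: &Option<T>, b: &Option<T>) -> Ordering {
    match (a, b) {
        (Some(x), Some(y)) => x.cmp(y),
        (None, None) => Ordering::Equal,
        (None, Some(_)) => Ordering::Greater,
        (Some(_), None) => Ordering::Less,
    }
}

/// Hypothetical helper: sort rows by the first column, then by the
/// second column to break ties, with nulls last in both columns.
fn sort_rows_nulls_last(rows: &mut [(Option<i64>, Option<&str>)]) {
    rows.sort_by(|l, r| {
        cmp_nulls_last(&l.0, &r.0).then_with(|| cmp_nulls_last(&l.1, &r.1))
    });
}
```

Rust's derived ordering on `Option` places `None` before `Some`, which is why this sketch needs an explicit comparator to get nulls-last behavior.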

🐞 Bug fixes

  • fill null list (#7836)
  • fix explode list[null] (#7832)
  • fix unicode lower/uppercase (#7826)
  • don't use naive name in partitioned agg (#7810)
  • Ensure CsvReader always respects the n_rows parameter (#7789)
  • ensure k is lower than height (#7779)
  • raise error on invalid categorical cast (#7686)
  • compile issue in polars-lazy (#7766)
  • compile issues in "polars-core" with default features (#7765)
  • make zip_with_same_type obligatory (#7761)
  • fix melt projection pushdown node (#7752)
  • fix predicate pushdown for 'unique' first/last (#7749)
  • fix null propagation (#7748)
  • avoid ambiguous time error when passing python Datetime to DataFrame constructor (#7711)
  • Fix inferring CSV schema when skip_rows_after_heade… (#7701)
  • fix race condition in null handling of window fast… (#7695)
  • respect time zone in groupby_rolling with negative offset (#7664)
  • fix empty case str.replace (#7662)
  • respect time zone in rolling_* functions (#7643)
  • fix schema of decimal type reads (#7652)
  • respect time zone in offset_by (#7626)
  • respect time zone in dt.round (#7611)
  • add decimal chunk_lengths (#7589)
  • fix ooc sort. the fast path was invalid (#7588)
  • Fix regression throwing AmbiguousTimeError in groupby_dynamic (#7454)
  • activate dtype-duration for polars-ops (#7582)
  • distinct project whole schema if not a subset (#7581)
  • sql window functions (#7458)
  • respect time zone in upsample (#7563)
  • fix rolling windows for windows that shrink from lhs (#7556)
  • pushdown key in merge sorted projection pd (#7542)
  • don't upcast column to string in 'is_in' operation (#7538)
  • Enable link to DateLikeNameSpace in the docs. (#7526)
  • respect time zone in date_range (#7503)
  • use physical types in sort-by args (#7518)
  • fix projection pushdown of asof_joins (#7487)
  • raise error on categorical by arguments if not fro… (#7464)
  • sql floor & ceil (#7456)
  • allow for hourly date_range to cross DST (#7430)
  • respect lexical/physical in multi-column categoric… (#7417)
  • fix null_dtype slice (#7414)
  • sort_by logical types (#7412)
  • parse single-digit months and dates when code would have gone down fastpath (#7391)
  • creating empty struct series with some unit fields (#7383)
  • don't panic when writing NullArray values to python row tuple (#7346)
  • fix projection pushdown on join with unused join key (#7326)
  • raise error on time -> datetime cast (#7325)
  • make pl.struct mappable (#7299)
  • err on duplicate with_column names (#7296)
  • don't panic on str.parse_int (#7072)
  • improve concat_list with empty list error message (#7236)
  • fix groupby_dynamic's binning when index_column is time-zone-aware (#7278)
  • fix preservation of microseconds when converting Python datetime (#7271)
  • no panic on empty cross join (#7266)
  • raise error on ambiguous filter predicates (#7265)
  • handle concat_list with first lit value (#7235)
  • add type annotation to avoid potential build errors (#7223)
  • floating point CSV parsing with escaping and whitespace (#7196)
  • make list function 'map' and refactor multi-arg ex… (#7185)
  • validate trees before inserting streaming node (#7179)
  • fix list take logical types (#7163)
  • fix null cmp fast paths (#7157)
  • don't panic on unsupported arithmetic type (#7154)
  • don't let a cast unset agg_state and keep logical … (#7151)
  • expose sort expressions to stack-optimizer (#7148)
  • improve error message when read_csv fails (#7150)
  • make cast unknown a no-op (#7147)
  • fix panic on cum_prod (#7141)
  • respect f32 schema in deep expressions (#7146)
  • fix deadlock in scan_csv()->sink_parquet() (#7118)
  • make CSV reader respect n_rows with globbing (#6969)
  • nested sql exprs (#7112)
  • fix logical types in arr.get (#7094)
  • allow fill_null in eager if type now known (#7092)
  • do projection just before concat to ensure same sizes (#7089)
  • fix 'filter' in groupby context when expression is… (#7041)
  • reflect time zone conversion in lazy dataframe schema (#7022)
  • ensure set_sorted never panics (#7013)
  • fix struct append 0 sliced (#7012)
  • fix coalesce supertype (#7000)
  • fix fill_null for categoricals (#6998)
  • dtype of pow function (#6985)
  • fix is_duplicated for utf8 dtype (#6997)
  • fix temporal logical types in pivot (#6957)
  • ensure literals are expanded in streaming (#6952)
  • str.contains strict=False took no effect (#6950)
  • add special fast path for elementwise expression o… (#6924)
  • fix arg_min/arg_max when sorted (#6927)
  • fix anonymous list builder (#6916)
  • reject multithreading on excessive ',\n' fields (#6906)
  • dispatch suffix to asof_join by (#6899)
  • improve recursive casting of nested data (#6897)
  • don't fast explode on null introducing take (#6890)
  • fix crash in write_csv when mixed tz-naive and tz-aware datetimes are present (#6828)
  • Do not panic when inferring schema from empty rows (#6849)
  • fix schema of functions: (#6845)
  • Do not panic when failing to extract numeric value (#6848)
  • stabilize integer operation to minimal required dtype (#6841)
  • respect schema in ndjson (#6819)

🛠️ Other improvements

  • split up vector_hasher module (#7807)
  • remove unnecessary copy in rolling function (#7801)
  • cover uncovered paths in agg_* functions (#7800)
  • Add "typos" as spell checking lint (#7759)
  • fix typos (#7756)
  • change some panics to errors (#7669)
  • remove apply_on_tz_corrected (#7624)
  • don't branch via error in read_csv::parse_dates (#7621)
  • fix a bunch of cargo warnings & errors (#7549)
  • factor out some utils into polars-time/src/utils (#7562)
  • abstract memory collection in sinks (#7560)
  • mark DataFrame.get_columns_mut as unsafe (#7557)
  • Use pre-installed rustup (#7544)
  • refactor date parsing (#7517)
  • refactor join pushdown (#7486)
  • Use eprintln! instead of eprint! (#7473)
  • Improved JSON IO docs (#7445)
  • update arrow (#7409)
  • Rename Decimal prec to precision (#7401)
  • add more docstrings to Expr (#7258)
  • use SchemaRef in CSV modules (#7250)
  • fix polars-row tests and add to ci (#7275)
  • remove binary feature (#7219)
  • Replace num with num-traits + a few minor maintenance fixes (#7201)
  • simplify binary expression evaluation (#7195)
  • ensure binary branches are executed in parall… (#7193)
  • Build versioned API reference (#7114)
  • update_arrow fix categorical statistics (#7098)
  • separate crate for error type (#7096)
  • Rename kwarg reverse to descending (#6914)
  • update rayon (#7001)
  • remove time 0.1 dep (#6979)
  • add LazyFileListReader trait (#6937)
  • cleanup is_unique impl (#6935)
  • Clean up some warnings (#6934)
  • update rustc to nightly-2023-02-14 (#6909)
  • avoid unnecessary mut (#6894)
  • setup support for fixedsizebinary conversion (#6867)
  • split agg in modules and make quantile DRY (#6857)
  • Rename argsort/argsort_by to arg_sort/arg_sort_by (#6829)
  • Update dprint config excludes (#6822)

Thank you to all our contributors for making this release possible!
@CloseChoice, @Hofer-Julian, @LdRoW, @MarcoGorelli, @MatveyF, @SauravMaheshkar, @Trippy3, @Vincenthays, @adamgreg, @advoet, @aldanor, @alexander-beedie, @borchero, @chitralverma, @cjackal, @coinflip112, @csko, @datapythonista, @dependabot, @dependabot[bot], @didriksg, @duskmoon314, @ecashin, @foxcroftjn, @ghuls, @iamsmkr, @igmriegel, @jakob-keller, @jonashaag, @josemasar, @josh, @juba, @jvdd, @kngwyu, @minimav, @moritzwilksch, @mslapek, @nrebena, @oysols, @ozgrakkurt, @papparapa, @ptiza, @rben01, @ritchie46, @romanovacca, @s-banach, @sorhawell, @stinodego, @universalmind303, @vincev, @xhochy, @xyning and @zundertj
