github pola-rs/polars rs-0.46.0
Rust Polars 0.46.0

🏆 Highlights

💥 Breaking changes

  • Support writing partitioned parquet to cloud (#20590)

🚀 Performance improvements

  • Use BitmapBuilder in yet more places (#20868)
  • Make an owned version of append (#20800)
  • Use BitmapBuilder in a lot more places (#20776)
  • Extend functionality on BitmapBuilder and use in Growables (#20754)
  • Specialize first/last agg for simple types in new-streaming engine (#20728)
  • Improve state caching and parallelism of window functions (#20689)
  • Broadcast without materialization in concat_arr (#20681)
  • Cache rolling groups (#20675)
  • Use downcast_ref instead of dtype equality in <dyn SeriesTrait as AsRef<ChunkedArray<T>>> (#20664)
  • Fix performance regression for DataFrame serialization/pickling (#20641)
  • Make Parquet verify_dict_indices SIMD (#20623)
  • Move to zlib-rs by default and use zstd::with_buffer (#20614)
  • Skip filter expansion in eager (#20586)
  • Use AtomicWaker in async engine task joiner (#20604)
  • Move morsel distribution to the computational async engine (#20600)
  • Improve predicate pushdown for unique (#20569)
  • Collapse expanded filters in eager (#20493)
  • Remove predicate from IR::DataFrame (#20492)
  • Add proper distributor to new-streaming parquet reader (#20372)
  • Use different binview dedup strategy depending on chunks ratio (#20451)
  • Generalize the arg_sort fast path onto Column (#20437)
  • Dedup binviews up front (#20449)
  • Re-enable common subplan elim for new-streaming engine (#20443)
  • Don't collect all LHS arrays in gather (#20441)
  • Remove prepare_series for gather kernels (#20439)
  • Don't always take all data buffers when gathering views (#20435)
  • Order observability optimizations (#20396)
  • Purge ChunkedArray Metadata (#20371)
  • Drop probe tables in parallel in new-streaming equi-join (#20373)
  • Explicit transpose in new-streaming equi-join finalize (#20363)
  • Cache dtype on ExprIR (#20331)
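
Several entries above mention BitmapBuilder. As a rough illustration of the idea only, the sketch below packs booleans into bytes, least-significant bit first, as Arrow validity bitmaps do. The type `SimpleBitmapBuilder` and its methods are hypothetical names for this example; they are not the Polars API and make no claim about how Polars implements its builder.

```rust
/// Hypothetical, simplified bitmap builder (not the Polars type).
/// Packs one boolean per bit, least-significant bit first.
pub struct SimpleBitmapBuilder {
    bytes: Vec<u8>,
    len: usize,
}

impl SimpleBitmapBuilder {
    pub fn new() -> Self {
        Self { bytes: Vec::new(), len: 0 }
    }

    /// Append one bit, starting a new byte every 8 pushes.
    pub fn push(&mut self, value: bool) {
        let bit = self.len % 8;
        if bit == 0 {
            self.bytes.push(0);
        }
        if value {
            let last = self.bytes.len() - 1;
            self.bytes[last] |= 1u8 << bit;
        }
        self.len += 1;
    }

    /// Read back the bit at `index`, or `None` when out of bounds.
    pub fn get(&self, index: usize) -> Option<bool> {
        if index >= self.len {
            return None;
        }
        Some((self.bytes[index / 8] >> (index % 8)) & 1 == 1)
    }

    /// Number of bits pushed so far.
    pub fn len(&self) -> usize {
        self.len
    }

    /// Number of set bits; unused trailing bits are always zero.
    pub fn count_set(&self) -> usize {
        self.bytes.iter().map(|b| b.count_ones() as usize).sum()
    }

    /// Consume the builder and return the packed bytes.
    pub fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }
}
```

Compared with a `Vec<bool>`, this layout stores eight flags per byte, which is why a builder of this kind is attractive for validity masks.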

✨ Enhancements

  • Expose descending and nulls last in window order-by (#20919)
  • Add linear_space (#20678)
  • Implement df.unique() on new-streaming engine (#20875)
  • Add unique operations for Decimal dtype (#20855)
  • Add NDJson sink for the new streaming engine (#20805)
  • Support nested keys in window functions (#20837)
  • Add CSV sink for the new streaming engine (#20804)
  • Periodically check python signals ('CTRL-C' handling) (#20826)
  • Experimental unity catalog client (#20798)
  • Support cumulative aggregations for Decimal dtype (#20802)
  • Improve window function caching strategy (#20791)
  • Allow different python versions for pickle (#20740)
  • Add SQL support for the NORMALIZE string function (#20705)
  • Add 'allow_exact_matches' to join_asof (#20723)
  • Add new-streaming first/last aggregations (#20716)
  • Add Parquet Sink to new streaming engine (#20690)
  • Expose IRBuilder (#20710)
  • Make automatic use of Azure storage account keys opt-in (#20652)
  • Improve GroupsProxy/GroupsPosition to be sliceable and cheaply cloneable (#20673)
  • Add str.normalize() (#20483)
  • Allow more group_by agg expressions in the new streaming engine (#20663)
  • Support writing partitioned parquet to cloud (#20590)
  • Add hint to error message for extra struct field in JSON (#20612)
  • Add index_of() function to Series and Expr (#19894)
  • Update sqlparser-rs, enabling "LEFT" keyword to be optional for anti/semi joins in SQL queries (#20576)
  • Add cat.starts_with/cat.ends_with (#20257)
  • Add Int128 IO support for csv & ipc (#20535)
  • Support arbitrary expressions in 'join_where' (#20525)
  • Allow more join lossless casting (#20474)
  • Always resolve dynamic types in schema (#20406)
  • Order observability optimizations (#20396)
  • Add FirstArgLossless supertype (#20394)
  • Add dt.replace (#19708)
  • Polars build for Pyodide (#20383)
  • Add Azure credential provider using DefaultAzureCredential() (#20384)
  • Add env var to ignore file cache allocate error (#20356)
  • Enable joins between compatible differing numeric key columns (#20332)
  • Cache dtype on ExprIR (#20331)
  • Serialize DataFrame/Series using IPC in serde (#20266)
  • Improve error message on SchemaError (#20326)
  • Use better error messages when opening files (#20307)
  • Add 'skip_lines' for CSV (#20301)
  • Allow subtraction of time dtype columns (#20300)
  • Add bin.reinterpret (#20263)
  • Allow decoding of non-Polars arrow dictionaries in Arrow and Parquet (#20248)
  • Add new Int128Type (#20232)
  • IR formatting QoL improvements (#20246)
  • Add cat.len_chars and cat.len_bytes (#20211)
  • Expose AexprArena (#20230)
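
To make the linear_space entry above concrete, here is a minimal standalone sketch of evenly spaced sampling. It assumes both endpoints are included and chooses its own behavior for zero or one sample; the function name, parameters, and return type are assumptions for illustration and do not reproduce the Polars signature or its options.

```rust
/// Hypothetical helper (not the Polars API): `num_samples` evenly spaced
/// values from `start` to `end`, with both endpoints included.
pub fn linear_space(start: f64, end: f64, num_samples: usize) -> Vec<f64> {
    match num_samples {
        // Assumption for this sketch: no samples requested, empty output.
        0 => Vec::new(),
        // Assumption for this sketch: a single sample is the start value.
        1 => vec![start],
        n => {
            // n points span n - 1 equal intervals.
            let step = (end - start) / (n - 1) as f64;
            (0..n).map(|i| start + step * i as f64).collect()
        }
    }
}
```

For example, `linear_space(0.0, 1.0, 5)` yields `[0.0, 0.25, 0.5, 0.75, 1.0]`.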

🐞 Bug fixes

  • Fix from_numpy returning Null dtype for empty 1D numpy array (#20907)
  • Fix map_elements panicking with Decimal type (#20905)
  • Warn if asof keys not sorted (#20887)
  • Avoid name collisions and panicking in object conversion (#20890)
  • Incorrect scale used in log and exp for Decimal type (#20888)
  • Don't deep clone ManuallyDrop in GroupsPosition (#20886)
  • Fix DuplicateError when selecting columns after join_where or cross join + filter (#20865)
  • Incorrect Decimal value for fill_null(strategy="one") (#20844)
  • Fix one edge case (out of many) of int128 literals not working (#20830)
  • Add height check to frame-level row indexing when key is int (#20778)
  • Remove assert that panics on group_by followed by head(n), where n is larger than the frame height (#20819)
  • Fix panic InvalidHeaderValue scanning from S3 on Windows (#20820)
  • Fix clip for Decimal returning wrong values (#20814)
  • Incorrect height from slicing after projecting only the file path column (#20817)
  • Shift mask when skipping Bitpacked values in Parquet (#20810)
  • Error instead of truncate if length mismatch for several str functions (#20781)
  • Support cumulative aggregations for Decimal dtype (#20802)
  • Do not print sensitive information to output on POLARS_VERBOSE (#20797)
  • Ignore file cache allocation error if fallocate() is not permitted (#20796)
  • Incorrect logic in assert_series_equal for infinities (#20763)
  • Avoid blocking on async runtime when resolving cloud scans (#20750)
  • Fix allow_invalid_certificates being ignored in storage_options (#20744)
  • Incorrect output type for map_groups returning all-NULL column (#20743)
  • Fix unique(maintain_order=True) raising InvalidOperationError for null array (#20737)
  • Don't collapse into a Nested Loop Join if the cross join maintains order (#20729)
  • Don't serialize credentials provider (#20741)
  • Fix Series.n_unique raising for list of struct (#20724)
  • Fix incorrect top-k by sorted column, fix head() returning extra rows (#20722)
  • Add outer validity to AnyValueBufferTrusted for structs (#20713)
  • Don't partition group-by with non-scalar literals in agg (#20704)
  • Incorrect view buffer dedup (#20691)
  • Only verify Parquet ConvertedType if no LogicalType is given (#20682)
  • Validate length of schema_overrides in read_csv (#20672)
  • Fix map_elements ignoring skip_nulls=True for struct dtype (#20668)
  • Check for MAP-GROUPS in cloud-eligible (#20662)
  • Fix empty output of to_arrow() on filtered unit height DataFrame (#20656)
  • Add .default to azure credential provider scope URL (#20651)
  • Fix join_asof panicking for invalid tolerance input (#20643)
  • Incorrect flag check on is_elementwise (#20646)
  • Don't panic but set null type if type is unknown (#20647)
  • Fix performance regression for DataFrame serialization/pickling (#20641)
  • Fix Int128 dtype serialization (#20629)
  • Ensure that SQL LIKE and ILIKE operators support multi-line matches (#20613)
  • Properly broadcast in sort_by (#20434)
  • Properly load nested Parquet Statistics (#20610)
  • AWS environment config was not loaded when credential provider was used (#20611)
  • Fix order observability of group-by-dyn (#20615)
  • Soundness when loading Parquet string statistics (#20585)
  • Fix error filtering after with_columns() on unit height LazyFrame (#20584)
  • Restore symbols on Apple by bumping nightly version (#20563)
  • Fix variable name in error message for "unsupported data type" in rolling and upsampling operations (#20553)
  • Output index type instead of u32 for sum_horizontal with boolean inputs (#20531)
  • Fix more global categorical issues (#20547)
  • Update eager join doctest on multiple columns (#20542)
  • Revert categorical unique code (#20540)
  • Add unique fast path for empty categoricals (#20536)
  • Fix various Int128 operations (#20515)
  • Fix global cat unique (#20524)
  • Fix union (#20523)
  • Fix rolling aggregations for various integer types (#20512)
  • Ensure ignore_nulls is respected in horizontal sum/mean (#20469)
  • Fix incorrectly added sorted flag after append for lexically ordered categorical series (#20414)
  • More Int128 testing and related fixes (#20494)
  • Validate column names in unique() for empty DataFrames (#20411)
  • Implement list.min and list.max for list[i128] (#20488)
  • Decimal from physical in horizontal min/max and shift (#20487)
  • Don't remove sort if first/last strategy is set in unique (#20481)
  • Fix join literal behavior (#20477)
  • Validate asof join by args in IR resolving phase (#20473)
  • Fix align_frames with single row panicking (#20466)
  • Allow multiple column sort for Decimal (#20452)
  • Fix mode panicking for String dtype (#20458)
  • Return correct schema for sum_horizontal with boolean dtype (#20459)
  • Properly handle to_physical_repr of nested types (#20413)
  • Workaround for mmap crash under Emscripten (#20418)
  • Fix using new_columns in scan_csv with compressed file (#20412)
  • Fix decimal arithmetic schema (#20398)
  • Raise on categorical search_sorted (#20395)
  • Don't try to load non-existent List/FSL statistics (#20388)
  • Propagate nulls for float methods on all numeric types (#20386)
  • Add env var to ignore file cache allocate error (#20356)
  • Flip order on right join (#20358)
  • Fix incorrect object store caching for ADLS URI (#20357)
  • Use the same encoding for nullable as non-nullable arrays (#20323)
  • Improve error message on SchemaError (#20326)
  • Boolean optional slice pushdown (#20315)
  • Properly handle from_physical for List/Array (#20311)
  • Ignore quotes in csv comments (#20306)
  • Ensure pl.datetime returns empty column when input columns are empty (#20278)
  • Ensure output height does not change on lazy projection pushdown with aggregations (#20223)
  • Fix error writing on Windows to locations outside of C drive (#20245)
  • Incorrect comparison in some cases with filtered list/array columns (#20243)
  • Ensure height is maintained in SQL SELECT 1 FROM (#20241)
  • Properly account for updated Categorical in .unique() kernel (#20235)
  • Fix incorrect lazy select(len()) with some select orderings (#20222)
  • Fix assertion panic on LazyFrame scratch.is_empty() (#20219)
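
The ignore_nulls fix for horizontal sum/mean listed above concerns how nulls in a row are combined. The sketch below shows the two behaviors on a single row, assuming that ignoring nulls means skipping them and that not ignoring them makes any null produce a null result. `sum_horizontal_row` is a hypothetical helper over `Option<i64>`, not a Polars function.

```rust
/// Hypothetical helper (not the Polars API): sum one row of optional values.
/// With `ignore_nulls`, nulls are skipped; otherwise any null yields `None`.
pub fn sum_horizontal_row(row: &[Option<i64>], ignore_nulls: bool) -> Option<i64> {
    if ignore_nulls {
        // `flatten` drops the `None` entries before summing.
        Some(row.iter().flatten().sum())
    } else {
        // Summing into `Option<i64>` short-circuits to `None` on the first null.
        row.iter().copied().sum()
    }
}
```

With the row `[Some(1), None, Some(3)]`, the first mode returns `Some(4)` and the second returns `None`.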

📖 Documentation

  • Update source URL for legislators-historical.csv (#20858)
  • Fix typo in sql functions (cosinus -> cosine) (#20676)
  • Fix small typo in plugins (polars-dt -> polars-st) (#20657)
  • Add polars-h3 and polars-st to plugin list (#20653)
  • Add docs reference for Field (#20625)
  • Miscellaneous minor updates/fixes (#20573)
  • Update "group_by_rolling" (deprecated) to "rolling" in user guide (#20548)
  • Fix flaky doctests (#20516)
  • Clarify the join pre-condition of join_asof (#20509)
  • Fix Expr.all description of Kleene logic (#20409)
  • Improve docstring clarity (#20416)
  • Fix "forcolumnar" typo in docs (#20401)
  • Remove Plugins overview page without information (#20348)
  • Small fixes/clarifications in user guide (#20335)
  • Improve docs about NaN (#20310)
  • Fix typo in fork warning (#20258)

🛠️ Other improvements

  • Add tests for already resolved issues (#20921)
  • Fix the verify_dict_indices codegen (#20920)
  • Add ProjectionContext in projection pushdown opt (#20918)
  • Disable 'catalog' in build (#20897)
  • Implement negative slice for new streaming IPC (#20866)
  • Remove last instances of itoa (#20881)
  • Reduce bloat in static_array_collect by using BitmapBuilders (#20891)
  • Use defunctionalization in polars-core scalar.rs in order to reduce code duplication (#20377)
  • Simplify decimal formatting and remove itoap dep (#20880)
  • Remove polars(_core)::export (#20869)
  • Debloat Series bitops (#20873)
  • Move sum kernel to polars-compute (#20867)
  • Remove todo and test restriction for new-streaming (#20861)
  • Dispatch to the in-memory engine for AExpr::Gather (#20862)
  • Dispatch to the in-memory engine for multifile sources (#20860)
  • Add tests for open issues (#20857)
  • Mark 'register_startup' as unsafe (#20841)
  • Reduce mode bloat (#20839)
  • Rename ContainsMany to ContainsAny (#20785)
  • Unpin NumPy in type checking workflow (#20792)
  • Add various tests (#20768)
  • Small drive-by's (#20772)
  • Touch the upload probe for the remote benchmark (#20767)
  • Fix remote benchmark script (#20755)
  • Fix tests (#20745)
  • Simplify hive predicate handling in NEW_MULTIFILE (#20730)
  • Add tests for various open issues (#20720)
  • Add tests for various open issues that have been fixed (#20680)
  • Don't include debug symbols in benchmark run (#20571)
  • Remove implicit reverse from AExpr::replace_inputs() (#20659)
  • Implement CSV, IPC and NDJson in the MultiScanExec node (#20648)
  • Fix Python deps installation in remote-benchmark workflow (#20619)
  • Fix rust-analyzer misinterpretation (#20595)
  • Remove unused file (#20594)
  • Rename is_numeric to is_primitive_numeric (#20574)
  • Reduce size of ArrowDataType by boxing heavy variants (#20588)
  • Bump multiversion from 0.7 to 0.8 (#20543)
  • Groundwork for allowing multi-output nodes in the new streaming engine (#20550)
  • Improve bin size info (#20551)
  • Increase categorical test coverage (#20514)
  • Report wheel sizes (#20541)
  • Add tests for floor/ceil on integers (#20479)
  • Expose and rewrite 'can_pre_agg' (#20450)
  • Skip test on windows; kuzu import segfaults (#20463)
  • Add a TypeCheckRule to the optimizer (#20425)
  • Fix duplicate cols in new-streaming parquet prefilter (#20419)
  • Move gather kernels to polars-compute (#20415)
  • Temporarily disable common subplan elim for new-streaming (#20374)
  • Remove unused IR::Reduce node (#20392)
  • Enable masked out list, struct and array elements in parametric tests (#20365)
  • Dispatch slice/filter lowering properly (#20390)
  • Move hive partitioning/multi-file handling outside of readers (#20203)
  • Purge ChunkedArray Metadata (#20371)
  • Add equi joins to new streaming engine (#19869)
  • Make parametric tests include pl.List and pl.Array by default (#20319)
  • Use Column in Row Encoding (#20312)
  • Don't warn on fork hook (#20309)
  • Don't deconstruct CsvParseOptions (#20302)
  • Allow decoding of non-Polars arrow dictionaries in Arrow and Parquet (#20248)
  • Add FunctionCastOptions and conservative IR-level cast type-checking (#20286)
  • Add more descriptive error message for failure of vstack/extend (#20299)
  • Expose AexprArena (#20230)

Thank you to all our contributors for making this release possible!
@Biswas-N, @FBruzzesi, @IndexSeek, @Jesse-Bakker, @MarcoGorelli, @MoizesCBF, @Prathamesh-Ghatole, @SamuelAllain, @Terrigible, @ZemanOndrej, @alexander-beedie, @arnabanimesh, @balbok0, @beckernick, @braaannigan, @brifitz, @bschoenmaeckers, @burakemir, @coastalwhite, @deanm0000, @dependabot[bot], @dimfeld, @eitsupi, @etiennebacher, @georgestagg, @hamdanal, @haocheng6, @ion-elgreco, @itamarst, @jqnatividad, @kszlim, @lukemanley, @mcrumiller, @nameexhaustion, @noexecstack, @orlp, @ptiza, @r-brink, @ritchie46, @rodrigogiraoserrao, @siddharth-vi, @stijnherfst, @stinodego, @tswast and @zero-stroke
