modin-project/modin 0.26.0 on GitHub

This release introduces a new, faster implementation of groupby.apply, many performance fixes that improve asynchronous execution, and a new namespace for accessing experimental functions (for example, DataFrame.modin.to_pickle_distributed). It also fixes a long-standing problem with using Modin objects inside UDFs for apply, along with many other issues.

Note: for Modin on MPI through unidist (as of unidist 0.5.0) to work fully when installed with pip, a working MPI implementation must already be installed.
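
As a rough pre-flight check for the note above, the sketch below looks for an MPI launcher on PATH before Modin is configured to use unidist. The helper name find_mpi_launcher is hypothetical, and the MODIN_ENGINE / UNIDIST_BACKEND environment variables are taken from the Modin and unidist documentation rather than from these release notes, so treat them as assumptions. Finding a launcher does not prove that the MPI installation actually works.

```python
import os
import shutil
from typing import Optional


def find_mpi_launcher() -> Optional[str]:
    """Return the path of the first MPI launcher found on PATH, or None.

    Hypothetical helper: it only checks that a launcher binary exists,
    not that the MPI installation is functional.
    """
    for name in ("mpiexec", "mpirun"):
        path = shutil.which(name)
        if path:
            return path
    return None


launcher = find_mpi_launcher()
if launcher is None:
    print("No MPI launcher found on PATH; install an MPI implementation first.")
else:
    print(f"Found MPI launcher: {launcher}")
    # Assumed configuration (from Modin/unidist docs, not these notes):
    # select the unidist engine with the MPI backend before importing modin.
    os.environ.setdefault("MODIN_ENGINE", "unidist")
    os.environ.setdefault("UNIDIST_BACKEND", "mpi")
```

The environment variables are set with setdefault so that values already exported in the shell are left alone.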

Key Features and Updates Since 0.25.0

Stability and Bugfixes
- FIX-#4355: Fix rename algebraic operator to avoid copying (#4356)
- FIX-#6594: Fix usage of Modin objects inside UDFs for apply (#6673)
- FIX-#6664: Use @lazy_metadata_decorator for PandasDataFrame.finalize (#6720)
- FIX-#6684: Adapt to pandas 2.1.2 (#6685)
- FIX-#6687: Explicitly add users to CODEOWNERS (#6688)
- FIX-#6693: Revert creating an additional copy in astype op (#6692)
- FIX-#6703: Don't use set_index_name(None) (#6698)
- FIX-#6732: Fix inferring result dtypes for binary operations (#6737)
- FIX-#6745: Pin unidist <= 0.4.1 (#6746)
- FIX-#6752: Preserve dtypes cache on .insert() (#6757)
- FIX-#6768: Make sure to_numpy uses **kwargs after #6704 (#6769)
- FIX-#6771: Avoid ValueError: assignment destination is read-only for cumsum (#6772)
- FIX-#6773: Make sure _to_pandas returns mutable pandas objects (#6775)
- FIX-#6774: Modify conditions for loc to get similar behavior to pandas (#6798)
- FIX-#6778: Read parquet files without file extensions using fastparquet (#6790)
- FIX-#6779: Pass only one indexer into Series.__getitem__ (#6780)
- FIX-#6781: Use pandas.api.types.pandas_dtype to convert to valid numpy and pandas only dtypes (#6788)
- FIX-#6782: Filter pandas warnings when precomputing dtypes (#6811)
- FIX-#6786: Properly d2p for cross DataFrame.join (#6787)
- FIX-#6791: Pass additional environment variables to MPI workers (#6792)
- FIX-#6799: Allow creating incomplete ModinIndex objects (#6800)
- FIX-#6822: Do not propagate NotImplementedError to a user on a set_columns() with dupl labels (#6823)
- FIX-#6824: Invalidate ModinIndex._lengths_id on empty partitions filtering (#6825)

Performance enhancements
- PERF-#4777: Don't use copy=True parameter for concat calls inside to_pandas (#4778)
- PERF-#4804: Preserve lengths/widths caches in broadcast_apply_full_axis (#6760)
- PERF-#6666: Avoid internal reset_index for left merge (#6665)
- PERF-#6668: Use copy=False for internal usage of set_axis (#6667)
- PERF-#6669: Avoid one extra copy() call for Series.reset_index (#6670)
- PERF-#6671: Don't iterate over the result of the Series.tolist function (#6672)
- PERF-#6690: Use sync_labels=False for rank function (#6689)
- PERF-#6694: Use lazy_map_partitions() for dtypes conversion (#6695)
- PERF-#6696: Use cached dtypes in fillna when possible. (#6697)
- PERF-#6701: Use get_axis internal function instead of axes property (#6700)
- PERF-#6702: Don't materialize axes when calling to_numpy (#6699)
- PERF-#6710: Don't materialize index in _groupby_shuffle internal function (#6707)
- PERF-#6712: Copy _shape_hint in query_compiler.copy function (#6713)
- PERF-#6714: Assign qc._shape_hint = column in columnarize function (#6715)
- PERF-#6716: Avoid materializing axes in _filter_empties (#6717)
- PERF-#6718: Use _get_axis_lengths function instead of _axes_lengths property (#6719)
- PERF-#6721: Use keep_partitioning=True, for duplicated implementation (#6722)
- PERF-#6723: Use _shape_hint = "column" in DataFrame.squeeze (#6724)
- PERF-#6727: Remove remaining result.name = None in groupby code (#6726)
- PERF-#6728: In the case of narrow dataframes, it is cheaper to convert partitions to numpy in the main process. (#6704)
- PERF-#6747: Preserve columns/dtypes cache when merging on a single index level (#6748)
- PERF-#6749: Preserve partial dtype for the result of reset_index() (#6751)
- PERF-#6753: Preserve dtypes cache on .__setitem__() (#6758)
- PERF-#6754: Merge partial dtype caches on .concat(axis=0) (#6759)
- PERF-#6756: Don't materialize index when sorting (#6755)
- PERF-#6762: Carry dtypes information in lazy indices (#6763)

Refactor Codebase
- REFACTOR-#0000: Cleanup one todo and flake8 issues in modin/utils.py (#6826)
- REFACTOR-#6739: Use execution_wrapper instead of directly addressing DaskWrapper (#6740)
- REFACTOR-#6805: Move all IO functions to modin.pandas.io module (#6806)
- REFACTOR-#6807: Rename experimental groupby and experimental numpy variables (#6809)
- REFACTOR-#6815: Move experimental parsers into modin.experimental folder (#6813)
- REFACTOR-#6818: Don't implicitly enable experimental mode (#6817)

Update testing suite
- TEST-#6705: Don't compare 'pkl' files (#6706)
- TEST-#6729: Use custom pytest mark instead of --extra-test-parameters option (#6730)
- TEST-#6777: Make to_csv tests on Unidist more stable (#6776)
- TEST-#6795: Don't use platform-dependent int type (#6796)

Documentation improvements
- DOCS-#0000: Add conda forge doc (#6627)
- DOCS-#6819: Update Modin on cluster documentation (#6678)

New Features
- FEAT-#5836: Introduce 'partial' dtypes cache (#6663)
- FEAT-#6735: Make Modin on MPI through unidist component more obvious (#6736)
- FEAT-#6767: Provide the ability to use experimental functionality when experimental mode is not enabled globally via an environment variable (#6764)
- FEAT-#6784: Add d2p implementations for DataFrame.__rdivmod__/__divmod__ (#6785)
- FEAT-#6801: Add modin.pandas.error module (#6802)
- FEAT-#6803: Enable range-partitioning impl for groupby.apply() by default (#6804)
- FEAT-#6820: Make sure IO functions work with path-like filenames (#6821)
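
The entries in the lists above all follow the same shape: a category prefix, an issue number, a title, and the pull request number in parentheses. For readers who want to process the changelog, here is a minimal sketch of parsing one entry with the standard library. The names Entry, ENTRY_RE and parse_entry are illustrative and not part of Modin.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    kind: str    # category prefix, e.g. "FIX", "PERF", "FEAT"
    issue: int   # issue number after "-#" (0 for "#0000")
    title: str   # free-text summary of the change
    pr: int      # pull request number in the trailing parentheses


# "- KIND-#issue: title (#pr)"; the title is matched lazily so that the
# final "(#NNNN)" is always taken as the pull request number.
ENTRY_RE = re.compile(
    r"^-\s*(?P<kind>[A-Z]+)-#(?P<issue>\d+):\s*(?P<title>.+?)\s*\(#(?P<pr>\d+)\)\s*$"
)


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one changelog bullet; return None for headings and other lines."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return Entry(
        kind=match["kind"],
        issue=int(match["issue"]),
        title=match["title"],
        pr=int(match["pr"]),
    )


entry = parse_entry(
    "- FIX-#6594: Fix usage of Modin objects inside UDFs for apply (#6673)"
)
print(entry)
```

Lines that are not bullets, such as the section headings, return None, so the same function can be mapped over the whole changelog and filtered.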

Contributors

@AndreyPavlenko
@JignyasAnand
@RehanSD
@YarShev
@anmyachev
@devin-petersohn
@dchigarev
@mvashishtha
@seydar
