This release introduces a new, faster implementation for `groupby.apply`, as well as many performance fixes related to improving asynchronous execution, a new namespace for accessing experimental functions (for example, `DataFrame.modin.to_pickle_distributed`), a fix for a long-standing problem with the use of Modin objects inside UDFs for `apply`, and many other fixes.
Note: for Modin on MPI through unidist (as of unidist 0.5.0) to work fully when installed with pip, a working MPI implementation must already be installed.
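As a rough pre-install check for the note above, the sketch below looks for an MPI launcher on `PATH`. The helper name `find_mpi_launcher` and the launcher list are illustrative assumptions, not part of Modin or unidist, and finding a launcher only suggests that MPI is present; it does not prove the installation works.

```python
import shutil

# Launcher names commonly shipped by MPI implementations.
# This list is an illustrative assumption and is not exhaustive.
MPI_LAUNCHERS = ("mpiexec", "mpirun")


def find_mpi_launcher(candidates=MPI_LAUNCHERS):
    """Return the path of the first MPI launcher found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    launcher = find_mpi_launcher()
    if launcher is None:
        print("No MPI launcher found on PATH; install an MPI implementation first.")
    else:
        print(f"Found MPI launcher: {launcher}")
```

If the check reports no launcher, installing an MPI implementation through the system package manager or conda before running pip is the likely next step.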
Key Features and Updates Since 0.25.0
- Stability and Bugfixes
  - FIX-#4355: Fix rename algebraic operator to avoid copying (#4356)
  - FIX-#6594: Fix usage of Modin objects inside UDFs for `apply` (#6673)
  - FIX-#6664: Use `@lazy_metadata_decorator` for `PandasDataFrame.finalize` (#6720)
  - FIX-#6684: Adapt to pandas 2.1.2 (#6685)
  - FIX-#6687: Explicitly add users to CODEOWNERS (#6688)
  - FIX-#6693: Revert creating an additional copy in `astype` op (#6692)
  - FIX-#6703: Don't use `set_index_name(None)` (#6698)
  - FIX-#6732: Fix inferring result dtypes for binary operations (#6737)
  - FIX-#6745: Pin `unidist <= 0.4.1` (#6746)
  - FIX-#6752: Preserve dtypes cache on `.insert()` (#6757)
  - FIX-#6768: Make sure `to_numpy` use `**kwargs` after #6704 (#6769)
  - FIX-#6771: Avoid `ValueError: assignment destination is read-only` for `cumsum` (#6772)
  - FIX-#6773: Make sure `_to_pandas` return mutable pandas objects (#6775)
  - FIX-#6774: Modify conditions for `loc` to get similar behavior to pandas (#6798)
  - FIX-#6778: Read parquet files without file extensions using fastparquet (#6790)
  - FIX-#6779: Pass only one indexer into `Series.__getitem__` (#6780)
  - FIX-#6781: Use `pandas.api.types.pandas_dtype` to convert to valid numpy and pandas only dtypes (#6788)
  - FIX-#6782: Filter pandas warnings when precomputing dtypes (#6811)
  - FIX-#6786: Properly d2p for cross `DataFrame.join` (#6787)
  - FIX-#6791: Pass additional environment variables to MPI workers (#6792)
  - FIX-#6799: Allow creating incomplete `ModinIndex` objects (#6800)
  - FIX-#6822: Do not propagate `NotImplementedError` to a user on a `set_columns()` with dupl labels (#6823)
  - FIX-#6824: Invalidate `ModinIndex._lengths_id` on empty partitions filtering (#6825)
- Performance enhancements
  - PERF-#4777: Don't use `copy=True` parameter for `concat` calls inside `to_pandas` (#4778)
  - PERF-#4804: Preserve lengths/widths caches in `broadcast_apply_full_axis` (#6760)
  - PERF-#6666: Avoid internal `reset_index` for left `merge` (#6665)
  - PERF-#6668: Use `copy=False` for internal usage of `set_axis` (#6667)
  - PERF-#6669: Avoid one extra `copy()` call for `Series.reset_index` (#6670)
  - PERF-#6671: Don't iterate over the result of the `Series.tolist` function (#6672)
  - PERF-#6690: Use `sync_labels=False` for `rank` function (#6689)
  - PERF-#6694: Use `lazy_map_partitions()` for dtypes conversion (#6695)
  - PERF-#6696: Use cached dtypes in fillna when possible. (#6697)
  - PERF-#6701: Use `get_axis` internal function instead of `axes` property (#6700)
  - PERF-#6702: Don't materialize axes when calling `to_numpy` (#6699)
  - PERF-#6710: Don't materialize index in `_groupby_shuffle` internal function (#6707)
  - PERF-#6712: Copy `_shape_hint` in `query_complier.copy` function (#6713)
  - PERF-#6714: Assign `qc._shape_hint = column` in `columnarize` function (#6715)
  - PERF-#6716: Avoid materializing axes in `_filter_empties` (#6717)
  - PERF-#6718: Use `_get_axis_lengths` function instead of `_axes_lengths` property (#6719)
  - PERF-#6721: Use `keep_partitioning=True`, for `duplicated` implementation (#6722)
  - PERF-#6723: Use `_shape_hint = "column"` in `DataFrame.squeeze` (#6724)
  - PERF-#6727: Remove remaining `result.name = None` in groupby code (#6726)
  - PERF-#6728: In the case of narrow dataframes, it is cheaper to convert partitions to numpy in the main process. (#6704)
  - PERF-#6747: Preserve columns/dtypes cache when merging on a single index level (#6748)
  - PERF-#6749: Preserve partial dtype for the result of `reset_index()` (#6751)
  - PERF-#6753: Preserve dtypes cache on `.__setitem__()` (#6758)
  - PERF-#6754: Merge partial dtype caches on `.concat(axis=0)` (#6759)
  - PERF-#6756: Don't materialize index when sorting (#6755)
  - PERF-#6762: Carry dtypes information in lazy indices (#6763)
- Refactor Codebase
  - REFACTOR-#0000: Cleanup one todo and flake8 issues in modin/utils.py (#6826)
  - REFACTOR-#6739: Use `execution_wrapper` instead of directly addressing `DaskWrapper` (#6740)
  - REFACTOR-#6805: Move all IO functions to `modin.pandas.io` module (#6806)
  - REFACTOR-#6807: Rename experimental groupby and experimental numpy variables (#6809)
  - REFACTOR-#6815: Move experimental parsers into `modin.experimental` folder (#6813)
  - REFACTOR-#6818: Don't implicitly enable experimental mode (#6817)
- Update testing suite
- Documentation improvements
- New Features
  - FEAT-#5836: Introduce 'partial' dtypes cache (#6663)
  - FEAT-#6735: Make Modin on MPI through unidist component more obvious (#6736)
  - FEAT-#6767: Provide the ability to use experimental functionality when experimental mode is not enabled globally via an environment variable (#6764)
  - FEAT-#6784: Add d2p implementations for `DataFrame.__rdivmod__/__divmod__` (#6785)
  - FEAT-#6801: Add `modin.pandas.error` module (#6802)
  - FEAT-#6803: Enable range-partitioning impl for `groupby.apply()` by default (#6804)
  - FEAT-#6820: Make sure IO functions works with path-like filenames (#6821)
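
To illustrate what accepting path-like filenames (FEAT-#6820) typically involves, here is a minimal sketch using only the standard library. The helper `normalize_path` is hypothetical and is not Modin's implementation; it only shows the common pattern of converting `os.PathLike` objects to strings while leaving other inputs, such as open buffers, untouched.

```python
import os
from pathlib import Path


def normalize_path(filepath_or_buffer):
    """Convert path-like objects to str; pass everything else through unchanged.

    Hypothetical helper for illustration only, not a Modin API.
    """
    if isinstance(filepath_or_buffer, os.PathLike):
        return os.fspath(filepath_or_buffer)
    return filepath_or_buffer


if __name__ == "__main__":
    # A pathlib.Path is converted to a plain string.
    print(normalize_path(Path("data") / "file.csv"))
    # A plain string is returned as-is.
    print(normalize_path("file.csv"))
```

With a conversion like this at the entry point of an IO function, callers can pass either a string or a `pathlib.Path` and the downstream code sees the same type.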
Contributors
@AndreyPavlenko
@JignyasAnand
@RehanSD
@YarShev
@anmyachev
@devin-petersohn
@dchigarev
@mvashishtha
@seydar