Modin 0.20.0
This release adds parallel Dask implementations of functions such as `to_parquet`, `to_sql`, and `read_fwf`, which were previously parallelized only on other engines.
It also adds support for pyhdk 0.5, along with many bug fixes and several performance enhancements.
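As a rough illustration of what the Dask additions look like from user code, the sketch below selects the Dask engine and loads a fixed-width file with `read_fwf`. The sample data, the helper name `total_quantity`, and the fallback to plain pandas (included only so the snippet still runs where Modin or Dask is not installed) are assumptions made for this example and are not part of the release.

```python
import os
import tempfile

try:
    import distributed  # noqa: F401  (Dask's scheduler, required by the Dask engine)
    import modin.config as cfg
    import modin.pandas as pd

    cfg.Engine.put("dask")  # ask Modin to execute on Dask
except ImportError:
    # Assumption for this sketch only: fall back to plain pandas.
    import pandas as pd

# Hypothetical sample: a 6-character name column and a 4-character qty column.
SAMPLE = (
    "name   qty\n"
    "apple   10\n"
    "pear     7\n"
)


def total_quantity(text):
    """Write `text` to a temporary file, read it with read_fwf, and sum `qty`."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stock.txt")
        with open(path, "w") as handle:
            handle.write(text)
        frame = pd.read_fwf(path, widths=[6, 4])
        # Reducing to a Python int forces the read to complete
        # before the temporary directory is removed.
        return int(frame["qty"].sum())


if __name__ == "__main__":
    print(total_quantity(SAMPLE))
```

The `__main__` guard matters when the Dask engine starts worker processes; the result is materialized inside the `with` block because file reads may be scheduled asynchronously.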
Key Features and Updates Since 0.19.0
- Stability and Bugfixes
- FIX-#2850: use modin.pandas.Series instead of pandas.Series for `where` func (#5883)
- FIX-#3925: Fixed AssertionError on columns and index drop (#5156)
- FIX-#4227: Calling `FactoryDispatcher.get_factory` also initializes the engine (#4228)
- FIX-#4635: allow pass modin functions to `apply` (#5915)
- FIX-#4924: fix read_excel when header is None (#5919)
- FIX-#5309: series iloc/loc raises IndexingError if a key is too long (#5784)
- FIX-#5373: Fix Series.shift() for named Series (#5823)
- FIX-#5432: don't return None when `astype` used with `copy=False` parameter (#5918)
- FIX-#5454: add missed methods for `SeriesGroupBy`, `DataFrameGroupBy` objects (#5866)
- FIX-#5509: default to pandas for read_parquet if any additional kwargs are passed to the engine (#5911)
- FIX-#5566: Enable test_indexing test on the HDK engine and add to ci (#5567)
- FIX-#5576: Enable test_join_sort test on the HDK engine and add to CI (#5578)
- FIX-#5580: HDK-BUG: 'AVG|SUM' is only valid on integer and floating point (#5583)
- FIX-#5618: don't ignore 'errors' parameter for astype (#5895)
- FIX-#5653: implement `convert_dtypes` as a full-axis operation instead of using map approach (#5885)
- FIX-#5737: BUG: String columns are converted to Categorical, if exported from HDK (#5738)
- FIX-#5767: cast `pathlib.Path` to str for `read_parquet` (#5860)
- FIX-#5770: Enable test_series test on the HDK engine and add to ci (#5771)
- FIX-#5774: Correctly calculate shape of single row (#5775)
- FIX-#5776: fix IndexError when concatenating dict of series along columns (#5804)
- FIX-#5781: Fix sort in descending order for columns with highly dense values (#5783)
- FIX-#5787: Enable test_reduce test on the HDK engine and add to ci (#5788)
- FIX-#5794: Enable test_default test on the HDK engine and add to ci (#5795)
- FIX-#5806: Enable test_io test on the HDK engine and add to ci (#5807)
- FIX-#5810: Enable test_binary test on the HDK engine (#5811)
- FIX-#5819: Fix np.argmax/argmin on 1D arrays (#5820)
- FIX-#5829: fix ndarray assignment via loc (#5847)
- FIX-#5846: add Series.str.removeprefix/removesuffix/fullmatch methods (#5845)
- FIX-#5849: add `Series.dt.day_of_week/day_of_year/isocalendar/asfreq` methods (#5848)
- FIX-#5859: Fix '.sort_values()' when there's only one row partition (#5869)
- FIX-#5862: fix Inline strong start-string without end-string for read_custom_text (#5861)
- FIX-#5870: Enable test_general test on the HDK engine and add to ci (#5871)
- FIX-#5888: Fix to_parquet in s3. (#5912)
- FIX-#5891: BUG: HDK: Query execution fails because the query contains not supported self-join pattern (#5892)
- FIX-#5927: Enable `test_map_metadata` test on the HDK engine and add to ci (#5929)
- FIX-#5934: Enable `test_window` test on the HDK engine and add to ci (#5935)
- FIX-#5941: TEST: The test test_io.py fails on HDK (#5942)
- FIX-#5976: correct use of dtypes cache for `concat` op (#5975)
- FIX-#5977: use `wrapper.materialize` instead of `wait_partitions`; use AWS env vars in `pytest_sessionstart` function (#5981)
- Performance enhancements
- PERF-#5590: Precompute columns and dtypes metadata for '.merge()' (#5594)
- PERF-#5670: create `self._identity` in partitions only for "debug" logging level (#5679)
- PERF-#5674: reduce data transferring in `_launch_tasks` function (#5678)
- PERF-#5675: make index calculation for `read_csv` function lazy; introduce `ModinIndex` (#5677)
- PERF-#5740: allow `read_csv`, `read_fwf`, `read_table`, `read_custom_text` functions be executed fully asynchronous; introduce `ModinDtypes` (#5713)
- PERF-#5777: Filter out empty bins at range-based reshuffling (#5779)
- PERF-#5778: Avoid extra materialization at range-based reshuffling (#5780)
- PERF-#5808: Delay metadata computations for '.sort_values' result (#5828)
- PERF-#5837: Defer index materialization for MapReduce implemented groupby (#5948)
- Refactor Codebase
- REFACTOR-#2863: remove 'other_name' from broadcast_apply (#5882)
- REFACTOR-#5414: Move `partition.get` into base class (#5408)
- REFACTOR-#5417: fix FutureWarning: the `mangle_dupe_cols` keyword is deprecated (#5407)
- REFACTOR-#5683: remove Engine.subscribe(_update_engine) in DataFrame/Series constructors (#5855)
- REFACTOR-#5786: align logging of Dask partitions with other executions (#5785)
- REFACTOR-#5799: Clean up numpy array operations (#5800)
- REFACTOR-#5830: rename experimental dispatchers and parsers (#5864)
- REFACTOR-#5874: move lazy_metadata_decorator into utils.py (#5872)
- REFACTOR-#5875: use default implementations for dt methods from the base query compiler (#5873)
- REFACTOR-#5902: use __make_read for non experimental IO classes (#5898)
- REFACTOR-#5908: remove unused parameters from 'run_exec_plan' (#5907)
- REFACTOR-#5910: remove '_dtypes_for_cols' internal function as unused (#5909)
- REFACTOR-#5922: let `upload-coverage` action fail if there is no `.coverage` file (#5921)
- REFACTOR-#5923: add `pragma: no cover` for functions that used in `apply_full_axis` (#5920)
- Update testing suite
- TEST-#2544: delay `codecov` notifications until all reports have been sent (#5782)
- TEST-#4261: test rolling with axis=1, win_type=, and center=True (#5881)
- TEST-#5477: fix typo: read_stata kwargs -> read_sas kwargs (#5854)
- TEST-#5790: add ASV configs for Dask and Unidist (#5789)
- TEST-#5802: update some actions in CI (#5801)
- TEST-#5826: remove _propagate_index_objs internal function usage from tests (#5813)
- TEST-#5832: Suppress pytest coverage messages in terminal (#5833)
- TEST-#5851: test api of cat/sparse accessors (#5850)
- TEST-#5878: exclude modin/experimental/batch/test/ folder from computing coverage (#5877)
- TEST-#5897: Add more robust tests for numpy API (#5900)
- TEST-#5913: Cancel CI for commits to same branch. (#5914)
- TEST-#5933: Add assert_array_equals utility to numpy tests (#5947)
- TEST-#5943: Rebalance tests between different CI jobs (#5890)
- TEST-#5977: Add AWS mock keys to moto in push-to-master.yml (#5978)
- Documentation improvements
- New Features
- FEAT-#4624: add `to_parquet` parallel implementation for Dask (#5876)
- FEAT-#5497: add several experimental functions for Dask (#5496)
- FEAT-#5880: add `to_sql` parallel implementation for Dask (#5879)
- FEAT-#5901: add `read_fwf` parallel implementation for Dask (#5899)
- FEAT-#5930: Bump pyhdk version to 0.5 (#5931)
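Several entries in the list above concern pandas API parity (FIX-#5373, FIX-#5846, FIX-#5849). The snippet below is a minimal sketch of the calls those entries refer to, showing the results the pandas API defines for them; it does not reproduce the original bug reports. The sample values, the helper name `parity_demo`, and the fallback to plain pandas where Modin is not installed are assumptions for illustration.

```python
try:
    import modin.pandas as pd
except ImportError:
    # Assumption for this sketch only: fall back to plain pandas.
    import pandas as pd


def parity_demo():
    """Exercise a few Series methods named in the fix list."""
    tags = pd.Series(["fix_io", "fix_loc", "perf_merge"], name="tag")

    # FIX-#5846: Series.str.removeprefix strips the prefix only where present.
    stripped = tags.str.removeprefix("fix_")

    # FIX-#5373: Series.shift() on a named Series; the name is preserved.
    shifted = tags.shift(1)

    # FIX-#5849: Series.dt.day_of_week uses Monday=0 ... Sunday=6.
    dates = pd.Series(pd.to_datetime(["2023-04-03", "2023-04-09"]))
    weekdays = dates.dt.day_of_week

    return list(stripped), shifted.name, list(weekdays)


if __name__ == "__main__":
    print(parity_demo())
```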
Contributors
@MSHADroo
@AndreyPavlenko
@RehanSD
@YarShev
@anmyachev
@dchigarev
@mvashishtha
@noloerino
@pyrito
@vnlitvinov