modin-project/modin 0.18.0 on GitHub

This release adds support for an MPI backend using Unidist, improvements to the shuffling mechanism,
SQL query execution on the HDK backend (currently pyhdk==0.3), support for pandas 1.5.2, and support for external query compilers.
It also includes many bug fixes and some performance enhancements.

Key Features and Updates Since 0.17.0

Stability and Bugfixes
- FIX-#3823: Fix TypeError when creating Series from SparseArray (#5377)
- FIX-#4100: Fall back to Pandas on row drop (#4937)
- FIX-#4636: Allows read_parquet to detect column partitioning in non-local filesystems (#5192)
- FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#4864)
- FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#5271)
- FIX-#5016: Suppress spammy ray task errors. (#5298)
- FIX-#5114: Change mask name to resolve namespace conflict with numpy mask (#5215)
- FIX-#5137: df.info failure with default columns (#5251)
- FIX-#5138: df_categories_equals typo (#5250)
- FIX-#5171: Allow xgboost >= 1.7.0. (#5195)
- FIX-#5186: set_index case with multiindex (#5190)
- FIX-#5187: Fixed RecursionError in OmnisciLaunchParameters.get() (#5199)
- FIX-#5204: Fix binary operations with a dictionary (#5205)
- FIX-#5208: Support ray==2.1.0 (#5283)
- FIX-#5232: Stop changing original series names during binary ops. (#5249)
- FIX-#5234: Use query compiler str_repeat. (#5235)
- FIX-#5236: Allow binary operations with custom classes. (#5237)
- FIX-#5238: Make rmul really rmul instead of mul. (#5246)
- FIX-#5240: Fix dask[complete] syntax in conda environment files (#5241)
- FIX-#5252: Disable notebook tests until access control issues are resolved for modin-test bucket (#5257)
- FIX-#5277: Fix internal execute function (#5278)
- FIX-#5284: Move ray, redis, tqdm, xgboost packages from pip to conda deps (#5270)
- FIX-#5285: Check for both pyarrow and fastparquet when read parquet format (#5297)
- FIX-#5306: Fix code scanning alert - Use of the return value of a procedure (#5307)
- FIX-#5308: Allow custom execution with no known engine. (#5379)
- FIX-#5319: Do not use deprecated '.iteritems()' (#5320)
- FIX-#5325: Fix read_csv_glob with non-empty parse_dates dict (#5339)
- FIX-#5327: Bump mypy cap to fix CI. (#5328)
- FIX-#5364: Fix get_indices internal function (#5355)
- FIX-#5380: Fix warning about setting _cache attribute. (#5381)
- FIX-#5398: Resolve length 1 nonNA partition issue, and off by one error in sort (#5400)
- FIX-#5405: Pin ray>=1.13.0 (#5390)
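
FIX-#5238 in the list above concerns reflected multiplication (`rmul`), which only differs from `mul` when the operands do not commute. The snippet below is a minimal sketch in plain Python of why the two must be kept distinct. It is not Modin's implementation, and the `Scale` class is a hypothetical stand-in for any custom operand.

```python
class Scale:
    """Hypothetical operand whose multiplication is not commutative."""

    def __init__(self, label):
        self.label = label

    def __mul__(self, other):
        # Called for `self * other`: self is the left operand.
        return f"{self.label}*{other}"

    def __rmul__(self, other):
        # Called for `other * self` when `other` cannot handle the product.
        return f"{other}*{self.label}"


s = Scale("s")
left = s * 2   # dispatches to Scale.__mul__
right = 2 * s  # int.__mul__ returns NotImplemented, so Scale.__rmul__ runs

print(left)
print(right)
```

If a library routed `rmul` through `mul`, both expressions would produce the same string here, which is the kind of silent operand swap the fix is meant to prevent.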

Performance enhancements
- PERF-#5225: Do not convert 'value' to a list at '.insert()' (#5226)
- PERF-#5268: Call get on all partitions at once in to_pandas (#4776)

Refactor Codebase
- REFACTOR-#5202: Pass loc arguments to query compiler. (#5305)
- REFACTOR-#5262: Update the examples to the latest version of the omniscripts (#5263)
- REFACTOR-#5287: Remove code to test getting TypeError for Series.dropna (#5288)
- REFACTOR-#5294: Fix code scanning alert - Potentially uninitialized local variable (#5383)
- REFACTOR-#5299: Variable defined multiple times error found by CodeQL (#5300)
- REFACTOR-#5301: Fix code scanning alert - Duplicate key in dict literal (#5302)
- REFACTOR-#5303: Fix code scanning alert - Unused local variable (#5304)
- REFACTOR-#5310: Remove some hasattr('columns') checks. (#5311)
- REFACTOR-#5312: Let lazy query compilers check for astype and drop errors. (#5313)
- REFACTOR-#5322: Remove python3.7 related code from read_csv_glob (#5323)
- REFACTOR-#5330: Remove BaseIO._read (#5329)
- REFACTOR-#5332: Define PQ_INDEX_REGEX as class variable (#5333)
- REFACTOR-#5334: Make _validate as classmethod (#5331)
- REFACTOR-#5335: Remove unnecessary lambdas (#5336)
- REFACTOR-#5359: Fix code scanning alert - File is not always closed (#5362)
- REFACTOR-#5363: Introduce partition constructor; move add_to_apply_calls impl in base class (#5354)
- REFACTOR-#5382: Use pandas.util.cache_readonly for __constructors__ (#5368)
- REFACTOR-#5386: Move partition.split implementation in base class (#5384)
- REFACTOR-#5391: Improve setup function in TimeDropDuplicatesDataframe (#5389)
- REFACTOR-#5413: Check Index.dtype instead of isinstance(obj, Int64Index) (#5406)

Update testing suite
- TEST-#2073: Check that read_csv can use a parse_dates dict. (#4572)
- TEST-#4562: In windows CI, try to start ray a few times (#5101)
- TEST-#4821: Monkeypatch cache_readonly to avoid errors in doc_checker.py (#5365)
- TEST-#5123: Add CodeQL workflow for GitHub code scanning (#5222)
- TEST-#5219: Relax matplotlib and coverage pins (#5216)
- TEST-#5259: Use new URL for dataset (#5401)
- TEST-#5261: Port indexing, reindex and fillna benchmarks from pandas github (#5244)
- TEST-#5280: Test pandas objects for non-commutative multiply. (#5281)
- TEST-#5290: Add testing for unidist on push (#5291)
- TEST-#5340: Use dev requirements in test-ray-master to get fastparquet (#5347)
- TEST-#5341: Bump test-ray-master ray to 3.0. (#5342)
- TEST-#5343: Unpin test-ray-client ray version. (#5344)
- TEST-#5345: Stop running CI for some workflow changes. (#5346)
- TEST-#5348: Instead of capping mypy, exclude 0.990. (#5349)
- TEST-#5350: Port DropDuplicates and LevelAlign benchmarks from pandas github (#5351)
- TEST-#5374: Port DatetimeAccessor and Categories benchmarks from pandas github (#5375)
- TEST-#5378: Port stack, unstack, replace and groups benchmarks from pandas (#5388)

Documentation improvements
- DOCS-#5279: Add documentation for pandas on unidist (#5289)
- DOCS-#5292: Make readme image links raw so they render on pypi.org. (#5293)
- DOCS-#5314: Update documentation for Ray Generic module (#5315)
- DOCS-#5356: Update conda install instructions (#5357)
- DOCS-#5402: Add warning about instability of sort (#5403)

New Features
- FEAT-#3535: Implement partition shuffling mechanism and algebra sort_by (#4601)
- FEAT-#4263: Efficiently construct dataframes from a dict of modin Series (#5193)
- FEAT-#4433: Add support of MultiIndex in reindex method (#4434)
- FEAT-#4747: Implement release notes generation (#5214)
- FEAT-#4897: Drop python 3.6 support. (#5229)
- FEAT-#5053: Add pandas on unidist execution with MPI backend (#5059)
- FEAT-#5223: Execute SQL queries on the HDK backend (#5224)
- FEAT-#5230: Support external query compiler and IO (#5231)
- FEAT-#5242: Implement str.extract when expand==True (#5243)
- FEAT-#5253: Upgrade pandas to 1.5.2 (#5254)
- FEAT-#5255: Add a timestamp to the folder names generated by the logger (#5321)
- FEAT-#5367: Introduce new API for repartitioning Modin objects (#5366)
- FEAT-#5387: Enable rebalance_partitions for Unidist (#5385)
- FEAT-#5396: Bump pyhdk version to 0.3 (#5397)

Contributors

@AndreyPavlenko
@Billy2551
@Garra1980
@RehanSD
@YarShev
@anmyachev
@arunjose696
@dchigarev
@devin-petersohn
@lgtm-migrator
@mvashishtha
@noloerino
@pyrito
@trgiangdo
@vnlitvinov
@Retribution98
