Modin 0.18.0

This release adds an MPI backend through unidist, improves the shuffling mechanism, enables
SQL query execution on the HDK backend (currently pyhdk==0.3), and adds support for pandas 1.5.2 and external query compilers.
It also includes many bug fixes and some performance enhancements.
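
As a rough illustration of the new unidist execution, the sketch below selects the unidist engine and its MPI backend through environment variables. It assumes that Modin reads `MODIN_ENGINE` and unidist reads `UNIDIST_BACKEND` when they are first imported, and that both packages are installed; the helper name `configure_unidist_mpi` is hypothetical and not part of either library.

```python
import os


def configure_unidist_mpi(env=None):
    """Hypothetical helper: request the unidist engine with the MPI backend.

    Assumption: Modin reads MODIN_ENGINE and unidist reads UNIDIST_BACKEND
    from the environment at import time, so this must run before
    `import modin.pandas`.
    """
    # Write into the process environment unless a mapping is supplied
    # (a plain dict is handy for testing).
    target = os.environ if env is None else env
    target["MODIN_ENGINE"] = "unidist"
    target["UNIDIST_BACKEND"] = "mpi"
    return target
```

A script would call `configure_unidist_mpi()` before its first `import modin.pandas as pd`. With the MPI backend, such a script is usually started through an MPI launcher such as `mpiexec`; the exact invocation depends on the local MPI installation.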

Key Features and Updates Since 0.17.0

  • Stability and Bugfixes
    • FIX-#3823: Fix TypeError when creating Series from SparseArray (#5377)
    • FIX-#4100: Fall back to Pandas on row drop (#4937)
    • FIX-#4636: Allows read_parquet to detect column partitioning in non-local filesystems (#5192)
    • FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#4864)
    • FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#5271)
    • FIX-#5016: Suppress spammy ray task errors. (#5298)
    • FIX-#5114: Change mask name to resolve namespace conflict with numpy mask (#5215)
    • FIX-#5137: df.info failure with default columns (#5251)
    • FIX-#5138: df_categories_equals typo (#5250)
    • FIX-#5171: Allow xgboost >= 1.7.0. (#5195)
    • FIX-#5186: set_index case with multiindex (#5190)
    • FIX-#5187: Fixed RecursionError in OmnisciLaunchParameters.get() (#5199)
    • FIX-#5204: Fix binary operations with a dictionary (#5205)
    • FIX-#5208: Support ray==2.1.0 (#5283)
    • FIX-#5232: Stop changing original series names during binary ops. (#5249)
    • FIX-#5234: Use query compiler str_repeat. (#5235)
    • FIX-#5236: Allow binary operations with custom classes. (#5237)
    • FIX-#5238: Make rmul really rmul instead of mul. (#5246)
    • FIX-#5240: Fix dask[complete] syntax in conda environment files (#5241)
    • FIX-#5252: Disable notebook tests until access control issues are resolved for modin-test bucket (#5257)
    • FIX-#5277: Fix internal execute function (#5278)
    • FIX-#5284: Move ray, redis, tqdm, xgboost packages from pip to conda deps (#5270)
    • FIX-#5285: Check for both pyarrow and fastparquet when read parquet format (#5297)
    • FIX-#5306: Fix code scanning alert - Use of the return value of a procedure (#5307)
    • FIX-#5308: Allow custom execution with no known engine. (#5379)
    • FIX-#5319: Do not use deprecated '.iteritems()' (#5320)
    • FIX-#5325: Fix read_csv_glob with non-empty parse_dates dict (#5339)
    • FIX-#5327: Bump mypy cap to fix CI. (#5328)
    • FIX-#5364: Fix get_indices internal function (#5355)
    • FIX-#5380: Fix warning about setting _cache attribute. (#5381)
    • FIX-#5398: Resolve length 1 nonNA partition issue, and off by one error in sort (#5400)
    • FIX-#5405: Pin ray>=1.13.0 (#5390)
  • Performance enhancements
    • PERF-#5225: Do not convert 'value' to a list at '.insert()' (#5226)
    • PERF-#5268: Call get on all partitions at once in to_pandas (#4776)
  • Refactor Codebase
    • REFACTOR-#5202: Pass loc arguments to query compiler. (#5305)
    • REFACTOR-#5262: Update the examples to the latest version of the omniscripts (#5263)
    • REFACTOR-#5287: Remove code to test getting TypeError for Series.dropna (#5288)
    • REFACTOR-#5294: Fix code scanning alert - Potentially uninitialized local variable (#5383)
    • REFACTOR-#5299: Variable defined multiple times error found by CodeQL (#5300)
    • REFACTOR-#5301: Fix code scanning alert - Duplicate key in dict literal (#5302)
    • REFACTOR-#5303: Fix code scanning alert - Unused local variable (#5304)
    • REFACTOR-#5310: Remove some hasattr('columns') checks. (#5311)
    • REFACTOR-#5312: Let lazy query compilers check for astype and drop errors. (#5313)
    • REFACTOR-#5322: Remove python3.7 related code from read_csv_glob (#5323)
    • REFACTOR-#5330: Remove BaseIO._read (#5329)
    • REFACTOR-#5332: Define PQ_INDEX_REGEX as class variable (#5333)
    • REFACTOR-#5334: Make _validate as classmethod (#5331)
    • REFACTOR-#5335: Remove unnecessary lambdas (#5336)
    • REFACTOR-#5359: Fix code scanning alert - File is not always closed (#5362)
    • REFACTOR-#5363: Introduce partition constructor; move add_to_apply_calls impl in base class (#5354)
    • REFACTOR-#5382: Use pandas.util.cache_readonly for __constructors__ (#5368)
    • REFACTOR-#5386: Move partition.split implementation in base class (#5384)
    • REFACTOR-#5391: Improve setup function in TimeDropDuplicatesDataframe (#5389)
    • REFACTOR-#5413: Check Index.dtype instead of isinstance(obj, Int64Index) (#5406)
  • Update testing suite
    • TEST-#2073: Check that read_csv can use a parse_dates dict. (#4572)
    • TEST-#4562: In windows CI, try to start ray a few times (#5101)
    • TEST-#4821: Monkeypatch cache_readonly to avoid errors in doc_checker.py (#5365)
    • TEST-#5123: Add CodeQL workflow for GitHub code scanning (#5222)
    • TEST-#5219: Relax matplotlib and coverage pins (#5216)
    • TEST-#5259: Use new URL for dataset (#5401)
    • TEST-#5261: Port indexing, reindex and fillna benchmarks from pandas github (#5244)
    • TEST-#5280: Test pandas objects for non-commutative multiply. (#5281)
    • TEST-#5290: Add testing for unidist on push (#5291)
    • TEST-#5340: Use dev requirements in test-ray-master to get fastparquet (#5347)
    • TEST-#5341: Bump test-ray-master ray to 3.0. (#5342)
    • TEST-#5343: Unpin test-ray-client ray version. (#5344)
    • TEST-#5345: Stop running CI for some workflow changes. (#5346)
    • TEST-#5348: Instead of capping mypy, exclude 0.990. (#5349)
    • TEST-#5350: Port DropDuplicates and LevelAlign benchmarks from pandas github (#5351)
    • TEST-#5374: Port DatetimeAccessor and Categories benchmarks from pandas github (#5375)
    • TEST-#5378: Port stack, unstack, replace and groups benchmarks from pandas (#5388)
  • Documentation improvements
    • DOCS-#5279: Add documentation for pandas on unidist (#5289)
    • DOCS-#5292: Make readme image links raw so they render on pypi.org. (#5293)
    • DOCS-#5314: Update documentation for Ray Generic module (#5315)
    • DOCS-#5356: Update conda install instructions (#5357)
    • DOCS-#5402: Add warning about instability of sort (#5403)
  • New Features
    • FEAT-#3535: Implement partition shuffling mechanism and algebra sort_by (#4601)
    • FEAT-#4263: Efficiently construct dataframes from a dict of modin Series (#5193)
    • FEAT-#4433: Add support of MultiIndex in reindex method (#4434)
    • FEAT-#4747: Implement release notes generation (#5214)
    • FEAT-#4897: Drop python 3.6 support. (#5229)
    • FEAT-#5053: Add pandas on unidist execution with MPI backend (#5059)
    • FEAT-#5223: Execute SQL queries on the HDK backend (#5224)
    • FEAT-#5230: Support external query compiler and IO (#5231)
    • FEAT-#5242: Implement str.extract when expand==True (#5243)
    • FEAT-#5253: Upgrade pandas to 1.5.2 (#5254)
    • FEAT-#5255: Add a timestamp to the folder names generated by the logger (#5321)
    • FEAT-#5367: Introduce new API for repartitioning Modin objects (#5366)
    • FEAT-#5387: Enable rebalance_partitions for Unidist (#5385)
    • FEAT-#5396: Bump pyhdk version to 0.3 (#5397)

Contributors

@AndreyPavlenko
@Billy2551
@Garra1980
@RehanSD
@YarShev
@anmyachev
@arunjose696
@dchigarev
@devin-petersohn
@lgtm-migrator
@mvashishtha
@noloerino
@pyrito
@trgiangdo
@vnlitvinov
@Retribution98
