This release adds support for an MPI backend through unidist, improvements to the shuffling mechanism,
SQL query execution on the HDK backend (currently pyhdk==0.3), and support for pandas 1.5.2 and external query compilers.
It also includes many bug fixes and some performance enhancements.
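
As a rough illustration of opting into the new unidist execution, the sketch below sets the `MODIN_ENGINE` and `UNIDIST_BACKEND` environment variables from Python. It assumes Modin was installed with unidist support and that a working MPI installation is available; the import at the end is left commented out so the snippet runs without either.

```python
import os

# Set both variables before modin.pandas is imported for the first time,
# so that Modin sees them when it initializes its execution engine.
os.environ["MODIN_ENGINE"] = "unidist"  # select the unidist execution engine
os.environ["UNIDIST_BACKEND"] = "mpi"   # ask unidist to run on its MPI backend

# Assumption: modin with unidist support and an MPI runtime are installed.
# import modin.pandas as pd
```

How the script is launched under MPI depends on the local setup; the unidist documentation covers the details.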
Key Features and Updates Since 0.17.0
- Stability and Bugfixes
- FIX-#3823: Fix TypeError when creating Series from SparseArray (#5377)
- FIX-#4100: Fall back to Pandas on row drop (#4937)
- FIX-#4636: Allows `read_parquet` to detect column partitioning in non-local filesystems (#5192)
- FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#4864)
- FIX-#4859: Add support for PyArrow Dictionary Arrays to type mapping (#5271)
- FIX-#5016: Suppress spammy ray task errors. (#5298)
- FIX-#5114: Change mask name to resolve namespace conflict with numpy mask (#5215)
- FIX-#5137: `df.info` failure with default columns (#5251)
- FIX-#5138: `df_categories_equals` typo (#5250)
- FIX-#5171: Allow xgboost >= 1.7.0. (#5195)
- FIX-#5186: `set_index` case with multiindex (#5190)
- FIX-#5187: Fixed RecursionError in OmnisciLaunchParameters.get() (#5199)
- FIX-#5204: Fix binary operations with a dictionary (#5205)
- FIX-#5208: Support `ray==2.1.0` (#5283)
- FIX-#5232: Stop changing original series names during binary ops. (#5249)
- FIX-#5234: Use query compiler str_repeat. (#5235)
- FIX-#5236: Allow binary operations with custom classes. (#5237)
- FIX-#5238: Make rmul really rmul instead of mul. (#5246)
- FIX-#5240: Fix dask[complete] syntax in conda environment files (#5241)
- FIX-#5252: Disable notebook tests until access control issues are resolved for `modin-test` bucket (#5257)
- FIX-#5277: Fix internal `execute` function (#5278)
- FIX-#5284: Move ray, redis, tqdm, xgboost packages from pip to conda deps (#5270)
- FIX-#5285: Check for both pyarrow and fastparquet when read parquet format (#5297)
- FIX-#5306: Fix code scanning alert - Use of the return value of a procedure (#5307)
- FIX-#5308: Allow custom execution with no known engine. (#5379)
- FIX-#5319: Do not use deprecated '.iteritems()' (#5320)
- FIX-#5325: Fix `read_csv_glob` with non-empty `parse_dates` dict (#5339)
- FIX-#5327: Bump mypy cap to fix CI. (#5328)
- FIX-#5364: Fix `get_indices` internal function (#5355)
- FIX-#5380: Fix warning about setting _cache attribute. (#5381)
- FIX-#5398: Resolve length 1 nonNA partition issue, and off by one error in sort (#5400)
- FIX-#5405: Pin `ray>=1.13.0` (#5390)
- Performance enhancements
- Refactor Codebase
- REFACTOR-#5202: Pass loc arguments to query compiler. (#5305)
- REFACTOR-#5262: Update the examples to the latest version of the omniscripts (#5263)
- REFACTOR-#5287: Remove code to test getting TypeError for Series.dropna (#5288)
- REFACTOR-#5294: Fix code scanning alert - Potentially uninitialized local variable (#5383)
- REFACTOR-#5299: `Variable defined multiple times` error found by CodeQL (#5300)
- REFACTOR-#5301: Fix code scanning alert - Duplicate key in dict literal (#5302)
- REFACTOR-#5303: Fix code scanning alert - Unused local variable (#5304)
- REFACTOR-#5310: Remove some hasattr('columns') checks. (#5311)
- REFACTOR-#5312: Let lazy query compilers check for astype and drop errors. (#5313)
- REFACTOR-#5322: Remove python3.7 related code from read_csv_glob (#5323)
- REFACTOR-#5330: Remove `BaseIO._read` (#5329)
- REFACTOR-#5332: Define `PQ_INDEX_REGEX` as class variable (#5333)
- REFACTOR-#5334: Make `_validate` as classmethod (#5331)
- REFACTOR-#5335: Remove unnecessary lambdas (#5336)
- REFACTOR-#5359: Fix code scanning alert - File is not always closed (#5362)
- REFACTOR-#5363: Introduce partition constructor; move `add_to_apply_calls` impl in base class (#5354)
- REFACTOR-#5382: Use `pandas.util.cache_readonly` for `__constructors__` (#5368)
- REFACTOR-#5386: Move partition.split implementation in base class (#5384)
- REFACTOR-#5391: Improve setup function in TimeDropDuplicatesDataframe (#5389)
- REFACTOR-#5413: Check `Index.dtype` instead of `isinstance(obj, Int64Index)` (#5406)
- Update testing suite
- TEST-#2073: Check that read_csv can use a parse_dates dict. (#4572)
- TEST-#4562: In windows CI, try to start ray a few times (#5101)
- TEST-#4821: Monkeypatch `cache_readonly` to avoid errors in `doc_checker.py` (#5365)
- TEST-#5123: Add CodeQL workflow for GitHub code scanning (#5222)
- TEST-#5219: Relax matplotlib and coverage pins (#5216)
- TEST-#5259: Use new URL for dataset (#5401)
- TEST-#5261: Port indexing, reindex and fillna benchmarks from pandas github (#5244)
- TEST-#5280: Test pandas objects for non-commutative multiply. (#5281)
- TEST-#5290: Add testing for unidist on push (#5291)
- TEST-#5340: Use dev requirements in test-ray-master to get fastparquet (#5347)
- TEST-#5341: Bump test-ray-master ray to 3.0. (#5342)
- TEST-#5343: Unpin test-ray-client ray version. (#5344)
- TEST-#5345: Stop running CI for some workflow changes. (#5346)
- TEST-#5348: Instead of capping mypy, exclude 0.990. (#5349)
- TEST-#5350: Port DropDuplicates and LevelAlign benchmarks from pandas github (#5351)
- TEST-#5374: Port DatetimeAccessor and Categories benchmarks from pandas github (#5375)
- TEST-#5378: Port stack, unstack, replace and groups benchmarks from pandas (#5388)
- Documentation improvements
- DOCS-#5279: Add documentation for pandas on unidist (#5289)
- DOCS-#5292: Make readme image links raw so they render on pypi.org. (#5293)
- DOCS-#5314: Update documentation for Ray Generic module (#5315)
- DOCS-#5356: Update conda install instructions (#5357)
- DOCS-#5402: Add warning about instability of sort (#5403)
- New Features
- FEAT-#3535: Implement partition shuffling mechanism and algebra sort_by (#4601)
- FEAT-#4263: Efficiently construct dataframes from a dict of modin Series (#5193)
- FEAT-#4433: Add support of MultiIndex in `reindex` method (#4434)
- FEAT-#4747: Implement release notes generation (#5214)
- FEAT-#4897: Drop python 3.6 support. (#5229)
- FEAT-#5053: Add pandas on unidist execution with MPI backend (#5059)
- FEAT-#5223: Execute SQL queries on the HDK backend (#5224)
- FEAT-#5230: Support external query compiler and IO (#5231)
- FEAT-#5242: Implement `str.extract` when `expand==True` (#5243)
- FEAT-#5253: Upgrade pandas to 1.5.2 (#5254)
- FEAT-#5255: Add a timestamp to the folder names generated by the logger (#5321)
- FEAT-#5367: Introduce new API for repartitioning Modin objects (#5366)
- FEAT-#5387: Enable `rebalance_partitions` for Unidist (#5385)
- FEAT-#5396: Bump pyhdk version to 0.3 (#5397)
Contributors
@AndreyPavlenko
@Billy2551
@Garra1980
@RehanSD
@YarShev
@anmyachev
@arunjose696
@dchigarev
@devin-petersohn
@lgtm-migrator
@mvashishtha
@noloerino
@pyrito
@trgiangdo
@vnlitvinov
@Retribution98