What's Changed
Breaking changes:
- breaking change: Move dependencies to optional by @jaidisido in #1992
- breaking change: Use ExecuteStatement instead of Scan for DynamoDB read_partiql by @jaidisido in #1964
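Since #1992 moves the distributed dependencies out of the default install, users upgrading to 3.x who rely on Ray/Modin must opt in explicitly. A minimal install sketch, assuming the package exposes `modin` and `ray` as optional extras per the project's 3.x packaging:

```shell
# awswrangler 3.x no longer pulls in Modin/Ray by default (#1992).
# Install the optional extras when distributed execution is needed.
# The extra names below are an assumption based on the 3.x packaging.
pip install "awswrangler[modin,ray]"
```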
Features/Enhancements:
- enhancement: Refactor engine switching when Ray is installed by @LeonLuttenberger in #1792
- logging: Enable user to configure RayLogger by @jaidisido in #1801
- enhancement: Add support for boto3 kwargs to `timestream.create_table` by @cnfait in #1819
- enhancement: Upgrade Ray to 2.2.x and PyArrow to 7+ by @LeonLuttenberger in #1865
- enhancement: Unload ray default max file size by @kukushking in #1912
- enhancement: Remove session serialization/deserialization by @kukushking in #1957
- enhancement: Unify return values for write json by @LeonLuttenberger in #1960
- feature: Log data sizes in load test benchmarks by @LeonLuttenberger in #1949
- enhancement: Add write_table_args by @kukushking in #1978
- feature: Distribute DynamoDB Parallel Scan by @jaidisido in #1981
- enhancement: Use fast file metadata provider by @kukushking in #1997
- enhancement: Add `names` parameter support to PyArrow reading by @LeonLuttenberger in #2008
- enhancement: Add support for JSON PyArrow data source by @LeonLuttenberger in #2019
- enhancement: Set `ray.data` parallelisation to -1 by default by @jaidisido in #2022
- enhancement: Add distributed variant of the `_read_parquet_metadata_file` function based on the PyArrow file system by @LeonLuttenberger in #2050
- feature: Add faster PyArrow S3fs listing in distributed mode by @jaidisido in #2030
- feature: Validate distributed kwargs by @kukushking in #2051
- enhancement: Distribute S3 describe_objects by @jaidisido in #2069
- feature: Distributed S3 copy/merge by @kukushking in #2070
- enhancement: Add `bulk_read` option for reading large numbers of Parquet files quickly by @LeonLuttenberger in #2033
- enhancement: Upgrade Ray to 2.3 by @jaidisido in #2084
- enhancement: Extract `parallelism` and `bulk_read` into `ray_modin_args` by @LeonLuttenberger in #2081
- deprecate: boto3 resources by @kukushking in #2097
Fixes:
- fix: Check row count before creating the Ray dataset in S3 Select by @kukushking in #1808
- fix: Allow to pass pandas dfs to Ray/Modin calls by @kukushking in #1812
- fix: Fix empty arrow refs by @kukushking in #1816
- fix: Sanitize column names modifying the data frame in distributed mode by @LeonLuttenberger in #1926
Documentation:
- docs: AWS Glue on Ray docs by @jaidisido in #1810
- docs: Minor - clarify `datasource.on_write_complete` docs by @kukushking in #2100
Tests:
- tests: Add tests for Glue Ray jobs by @LeonLuttenberger in #1832
- tests: Remove `awswrangler.distributed` from coverage report by @LeonLuttenberger in #1884
- tests: Create Load Testing Benchmark Analytics by @malachi-constant in #1905
- tests: Adjust load test benchmark values by @malachi-constant in #1910
- tests: Remove exports from glueray stack by @malachi-constant in #2020
- tests: Add `test_modin_s3_read_parquet_many_files` by @LeonLuttenberger in #2096
Full Changelog: 3.0.0rc2...3.0.0rc3