aws/aws-sdk-pandas 3.0.0 on GitHub

Breaking changes 💥

Move dependencies to optional by @jaidisido in #1992 🔓
- Dependencies required by the following modules have been moved to optional: redshift, mysql, postgres, sqlserver, oracle, gremlin, sparql, deltalake
- The required dependencies can be easily installed with pip install awswrangler[<MODULE_NAME>], for example pip install awswrangler[redshift]
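Because these drivers are no longer installed by default, it can help to check up front which extras an environment is missing. The following is a minimal sketch, not part of the library: the helper `missing_extras` is hypothetical, and the mapping from extra name to importable driver module is an assumption that should be verified against the project's documentation.

```python
import importlib.util
from typing import Dict, List

# ASSUMPTION: import names of the driver behind each extra; verify before relying on them.
ASSUMED_DRIVER_MODULES: Dict[str, str] = {
    "redshift": "redshift_connector",
    "mysql": "pymysql",
    "postgres": "pg8000",
    "sqlserver": "pyodbc",
    "oracle": "oracledb",
    "gremlin": "gremlin_python",
    "sparql": "SPARQLWrapper",
    "deltalake": "deltalake",
}


def missing_extras(
    extras: List[str],
    driver_modules: Dict[str, str] = ASSUMED_DRIVER_MODULES,
) -> List[str]:
    """Return the extras whose driver module cannot be found in this environment."""
    missing = []
    for extra in extras:
        module_name = driver_modules[extra]
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(module_name) is None:
            missing.append(extra)
    return missing
```

For each name returned, for example by `missing_extras(["redshift", "mysql"])`, the corresponding `pip install awswrangler[<MODULE_NAME>]` command installs the missing driver.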
Change SQL formatters for Athena and LakeFormation so that they properly format types by @Taragolis and @LeonLuttenberger in #1416 #1543 #1684 💾
- For example, a parameter of type dt.datetime is formatted as DATETIME xxxx-xx-xx xx:xx:xx, while a parameter of type str is formatted as "x"
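The idea of type-aware formatting can be illustrated with a small sketch. This is not the library's formatter: the function `format_sql_literal` is hypothetical, and the literal syntax it emits (Trino-style `TIMESTAMP '...'` and single-quoted strings) is an assumption chosen for illustration, so the library's exact output may differ.

```python
import datetime as dt
from decimal import Decimal
from typing import Any


def format_sql_literal(value: Any) -> str:
    """Render a Python value as a SQL literal chosen by its type (illustrative only)."""
    if value is None:
        return "NULL"
    # bool is checked before int because bool is a subclass of int
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float, Decimal)):
        return str(value)
    # datetime is checked before date because datetime is a subclass of date
    if isinstance(value, dt.datetime):
        stamp = value.strftime("%Y-%m-%d %H:%M:%S")
        return f"TIMESTAMP '{stamp}'"
    if isinstance(value, dt.date):
        return f"DATE '{value.isoformat()}'"
    if isinstance(value, str):
        # Double any embedded single quotes so the literal stays well-formed
        escaped = value.replace("'", "''")
        return f"'{escaped}'"
    raise TypeError(f"Unsupported parameter type: {type(value).__name__}")
```

The point of dispatching on the Python type is that a string and a timestamp holding the same characters produce different literals, which is why queries relying on the old formatting may need adjusting.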
Refactor function signatures so that closely related parameters are grouped into a single parameter defined as a TypedDict by @LeonLuttenberger and @kukushking in #1855 #1996 #2016 #2055 #2081 💼
- Glue catalog parameters are grouped together in to_parquet, to_csv and to_json
- Athena UNLOAD and CTAS parameters are grouped together
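The grouping pattern can be sketched with the standard library alone. The names below (`CatalogSettings`, `describe_write`, and the individual keys) are hypothetical and chosen for illustration; they are not the library's actual types or signatures.

```python
from typing import Dict, Optional, TypedDict


class CatalogSettings(TypedDict, total=False):
    """Hypothetical group of catalog-related options (not the library's type)."""

    description: str
    parameters: Dict[str, str]
    columns_comments: Dict[str, str]


def describe_write(path: str, catalog_settings: Optional[CatalogSettings] = None) -> str:
    """Accept one grouped argument instead of several loosely related keyword arguments."""
    settings: CatalogSettings = catalog_settings or {}
    description = settings.get("description", "<none>")
    n_comments = len(settings.get("columns_comments", {}))
    return f"{path}: description={description}, column comments={n_comments}"
```

A caller passes a plain dictionary, such as `describe_write("s3://bucket/prefix/", {"description": "demo"})`, and a type checker can validate its keys against the TypedDict. Code written for 2.x that passed such options as separate keyword arguments would need to move them into the grouped parameter.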
Deprecate wr.s3.merge_upsert_table by @kukushking in #2076 ⚠️
Deprecate updated_name parameter in update_ruleset by @jaidisido in #2122 ⚠️
Drop support for Python 3.7 ⚠️

New functionalities 🚀

AWS SDK for pandas can now run at scale 🚀💻🚀

Tutorials

AWS Blogs

Scale AWS SDK for pandas workloads with AWS Glue for Ray

Features/Enhancements 🚀

Thread-safety improvements by @kukushking in #2186
Allow Python 3.11 by @kukushking in #2101 🐍
Add use_threads parameter to dynamodb.read_items by @LeonLuttenberger in #2113 📈
Distribute wr.dynamodb.put_df with executor task by @LeonLuttenberger in #2118 📈
Add additional arg for glue database DatabaseInput by @malachi-constant in #2067 🔧
Add overloads for functions which can have multiple return value types by @LeonLuttenberger in #1855
Add support for boto3 kwargs to timestream.create_table by @cnfait in #1819
Upgrade Ray to 2.2.x and PyArrow to 7+ by @LeonLuttenberger in #1865
Upgrade to Ray 2.0 by @kukushking in #1635
Add partitioning on block level by @kukushking in #1653
Use fast file metadata provider by @kukushking in #1997
Distribute DynamoDB Parallel Scan by @jaidisido in #1981
Add faster Pyarrow S3fs listing in distributed mode by @jaidisido in #2030
Add distributed variant of the _read_parquet_metadata_file function based on the PyArrow file system by @LeonLuttenberger in #2050
Validate distributed kwargs by @kukushking in #2051
Add @Experimental and @Deprecated annotations by @kukushking in #2062
Distribute S3 describe_objects by @jaidisido in #2069
Distributed S3 copy/merge by @kukushking in #2070
Add bulk_read option for reading large amounts of Parquet files quickly by @LeonLuttenberger in #2033
Deprecate boto3 resources by @kukushking in #2097
Add retries for s3 select by @kukushking in #1780
Make tqdm progress reporting opt-in by @kukushking in #1741
Distribute data types inference by @jaidisido in #1692
Change to singledispatch, add repartitioning utility, fix distributed write text regression by @kukushking in #1611
Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in #1699
Configure scheduling options, remove dependencies on internal ray impl by @kukushking in #1734
Validate partitions along row axis, add warning by @kukushking in #1700
Refactor executor module by @kukushking in #2120
Distribute parquet datasource and add missing features, enable all tests by @kukushking in #1711
Distribute Timestream write with executor by @jaidisido in #1715
Distribute s3.to_json and s3.to_csv by @LeonLuttenberger in #1631
Distribute s3.read_csv, s3.read_json and s3.read_fwf by @LeonLuttenberger in #1567 #1607
Distribute s3.wait_objects by @LeonLuttenberger in #1539
Distribute s3.to_parquet by @kukushking in #1526
Distribute s3.delete_objects by @malachi-constant in #1474
Distribute s3.read_parquet by @jaidisido in #1513
Add ThreadPoolExecutor and RayExecutor; refactor threading/ray; add single-path distributed s3.select_query by @kukushking in #1446
Add distributed Lake Formation read by @jaidisido in #1397
Refactor ray datasources by @kukushking in #1687
Distribute S3 select over multiple paths and scan ranges by @jaidisido in #1445
Add Literal typing for mode and projection_types by @LeonLuttenberger in #2191

Fixes 🛠️

Sanitize bucketing col names by @kukushking in #2155
Allow writing files from an empty dataframe by @malachi-constant in #2045
Athena out of bound dates by @kukushking in #2180
Fix partition block overwriting by @kukushking in #1695
Distributed S3 Select - check row count before creating the Ray dataset by @kukushking in #1808
Allow passing pandas DataFrames to Ray/Modin calls by @kukushking in #1812
Add retries to read_parquet_metadata_distributed by @jaidisido in #2196
Fix default utcnow argument in start_query by @LeonLuttenberger in #2193

Documentation 📚

Athena Iceberg tutorial by @kukushking in #2117
Add at scale section by @kukushking in #2119
Documentation spell-checking improvements by @LeonLuttenberger in #2165
Add AWS Glue on Ray docs by @jaidisido in #1810
Update config tutorial to include new configuration values by @LeonLuttenberger in #1696
Improve documentation on running SDK for pandas at scale by @jaidisido in #1697
Add "Introduction to Ray" Tutorials by @LeonLuttenberger in #1661
Add SDK for pandas job on ray cluster tutorial by @malachi-constant in #1616
Add typeddicts to docs by @LeonLuttenberger in #2167

Tests 🧪

Add PR linter GitHub action by @jaidisido in #2106
Replace load tests bucket with SSM parameter by @jaidisido in #2121
OpenSearch index cleanup / skip by @kukushking in #2149
Add benchmark tests by @jaidisido in #2143
Add tests for Glue Ray jobs by @LeonLuttenberger in #1832
Remove awswrangler.distributed from coverage report by @LeonLuttenberger in #1884
Consolidate unit and load tests by @jaidisido in #1525
Distribute tests in tox config by @malachi-constant in #1469

Full Changelog: 2.20.1...3.0.0
