Breaking changes
- Move dependencies to optional by @jaidisido in #1992
  - Dependencies required by the following modules have been moved to optional: redshift, mysql, postgres, sqlserver, oracle, gremlin, sparql, deltalake
  - The required dependencies can be installed with `pip install awswrangler[<MODULE_NAME>]`, for example `pip install awswrangler[redshift]`
- Change SQL formatters for Athena and Lake Formation so that they properly format types by @Taragolis and @LeonLuttenberger in #1416 #1543 #1684
  - For example, a parameter of type `dt.datetime` is parsed into `DATETIME xxxx-xx-xx xx:xx:xx`, while a parameter of type `str` is formatted into `"x"`
- Refactor function signatures so that closely related parameters are grouped into a single parameter defined as a `TypedDict` by @LeonLuttenberger and @kukushking in #1855 #1996 #2016 #2055 #2081
  - Glue catalog parameters are grouped together in `to_parquet`, `to_csv` and `to_json`
  - Athena UNLOAD and CTAS parameters are grouped together
- Deprecate `wr.s3.merge_upsert_table` by @kukushking in #2076
- Deprecate `updated_name` parameter in `update_ruleset` by @jaidisido in #2122
- Stop support for Python 3.7
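The parameter grouping described in the list above can be pictured with a minimal, self-contained sketch. The names `CatalogSettings` and `write_dataset` are hypothetical stand-ins invented for illustration, not the library's actual types or functions; only `typing.TypedDict` itself is real.

```python
from typing import Dict, List, Optional, TypedDict


class CatalogSettings(TypedDict, total=False):
    """Hypothetical group of closely related catalog options (illustration only)."""

    description: str
    parameters: Dict[str, str]
    columns_comments: Dict[str, str]


def write_dataset(
    path: str,
    catalog_settings: Optional[CatalogSettings] = None,
) -> List[str]:
    """Hypothetical writer: returns the names of the catalog options supplied."""
    settings: CatalogSettings = catalog_settings or {}
    # A real writer would forward these options to the catalog; this sketch
    # only reports which keys the caller chose to set.
    return sorted(settings.keys())


# Before grouping, each option would be its own keyword argument.
# After grouping, related options travel together in one typed dictionary.
supplied = write_dataset(
    "s3://example-bucket/prefix/",
    catalog_settings={"description": "Daily sales", "parameters": {"source": "erp"}},
)
print(supplied)
```

Because the dictionary is declared with `total=False`, callers set only the keys they need, and a type checker can still flag misspelled or wrongly typed keys.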
New functionalities
AWS SDK for pandas can now run at scale
Tutorials
- 034 - Distributing Calls Using Ray
- 035 - Distributing Calls on Ray Remote Cluster
- 036 - Distributing Calls with Glue Interactive Sessions on Ray
AWS Blogs
Features/Enhancements
- Thread-safety improvements by @kukushking in #2186
- Allow Python 3.11 by @kukushking in #2101
- Add `use_threads` parameter to `dynamodb.read_items` by @LeonLuttenberger in #2113
- Distribute `wr.dynamodb.put_df` with executor task by @LeonLuttenberger in #2118
- Add additional arg for Glue database `DatabaseInput` by @malachi-constant in #2067
- Add overloads for functions which can have multiple return value types by @LeonLuttenberger in #1855
- Add support for boto3 kwargs to `timestream.create_table` by @cnfait in #1819
- Upgrade Ray to 2.2.x and PyArrow to 7+ by @LeonLuttenberger in #1865
- Upgrade to Ray 2.0 by @kukushking in #1635
- Add partitioning on block level by @kukushking in #1653
- Use fast file metadata provider by @kukushking in #1997
- Distribute DynamoDB Parallel Scan by @jaidisido in #1981
- Add faster PyArrow S3fs listing in distributed mode by @jaidisido in #2030
- Add distributed variant of the `_read_parquet_metadata_file` function based on the PyArrow file system by @LeonLuttenberger in #2050
- Validate distributed kwargs by @kukushking in #2051
- Add `@Experimental` and `@Deprecated` annotations by @kukushking in #2062
- Distribute S3 `describe_objects` by @jaidisido in #2069
- Distributed S3 copy/merge by @kukushking in #2070
- Add `bulk_read` option for reading large amounts of Parquet files quickly by @LeonLuttenberger in #2033
- Deprecate boto3 resources by @kukushking in #2097
- Add retries for S3 Select by @kukushking in #1780
- Make tqdm progress reporting opt-in by @kukushking in #1741
- Distribute data types inference by @jaidisido in #1692
- Change to singledispatch, add repartitioning utility, fix distributed write text regression by @kukushking in #1611
- Optimize distributed CSV I/O by adding PyArrow-based datasource by @LeonLuttenberger in #1699
- Configure scheduling options, remove dependencies on internal Ray implementation by @kukushking in #1734
- Validate partitions along row axis, add warning by @kukushking in #1700
- Refactor executor module by @kukushking in #2120
- Distribute parquet datasource and add missing features, enable all tests by @kukushking in #1711
- Distribute Timestream write with executor by @jaidisido in #1715
- Distribute `s3.to_json` and `s3.to_csv` by @LeonLuttenberger in #1631
- Distribute `s3.read_csv`, `s3.read_json` and `s3.read_fwf` by @LeonLuttenberger in #1567 #1607
- Distribute `s3.wait_objects` by @LeonLuttenberger in #1539
- Distribute `s3.to_parquet` by @kukushking in #1526
- Distribute `s3.delete_objects` by @malachi-constant in #1474
- Distribute `s3.read_parquet` by @jaidisido in #1513
- Add ThreadPoolExecutor and RayExecutor; refactor threading/ray; add single-path distributed `s3.select_query` by @kukushking in #1446
- Add distributed Lake Formation read by @jaidisido in #1397
- Refactor Ray datasources by @kukushking in #1687
- Distribute S3 select over multiple paths and scan ranges by @jaidisido in #1445
- Add `Literal` typing for `mode` and `projection_types` by @LeonLuttenberger in #2191
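Two typing items in the list above, overloads for functions with multiple return types and `Literal` typing for string options, can be illustrated with a small standalone sketch. The functions `read_rows` and `describe_write` and the alias `WriteMode` are hypothetical and exist only for this example; the mode values shown are assumed for illustration and are not taken from these notes.

```python
from typing import Iterator, List, Literal, Union, get_args, overload

# Assumed set of string options, used here only to demonstrate Literal typing.
WriteMode = Literal["append", "overwrite", "overwrite_partitions"]


@overload
def read_rows(rows: List[int], chunked: Literal[False] = ...) -> List[int]: ...


@overload
def read_rows(rows: List[int], chunked: Literal[True]) -> Iterator[List[int]]: ...


def read_rows(
    rows: List[int], chunked: bool = False
) -> Union[List[int], Iterator[List[int]]]:
    """Hypothetical reader: a single list, or an iterator of chunks if chunked."""
    if chunked:
        # Lazily produce chunks of at most two rows each.
        return (rows[i : i + 2] for i in range(0, len(rows), 2))
    return list(rows)


def describe_write(mode: WriteMode) -> str:
    """Hypothetical helper: Literal is not enforced at runtime, so validate."""
    if mode not in get_args(WriteMode):
        raise ValueError(f"Unsupported mode: {mode!r}")
    return f"mode={mode}"


whole = read_rows([1, 2, 3])  # type checkers infer List[int]
chunks = list(read_rows([1, 2, 3], chunked=True))  # Iterator[List[int]] consumed
print(whole, chunks, describe_write("append"))
```

With the overloads in place, a static type checker can narrow the return type from the value of `chunked`, so callers do not need casts or `isinstance` checks.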
Fixes
- Sanitize bucketing col names by @kukushking in #2155
- Allow writing files from an empty dataframe by @malachi-constant in #2045
- Athena out of bound dates by @kukushking in #2180
- Fix partition block overwriting by @kukushking in #1695
- Distributed S3 Select: check row count before creating the Ray dataset by @kukushking in #1808
- Allow passing pandas DataFrames to Ray/Modin calls by @kukushking in #1812
- Add retries to `read_parquet_metadata_distributed` by @jaidisido in #2196
- Fix default `utcnow` argument in `start_query` by @LeonLuttenberger in #2193
Documentation
- Athena Iceberg tutorial by @kukushking in #2117
- Add at scale section by @kukushking in #2119
- Documentation spell-checking improvements by @LeonLuttenberger in #2165
- Add AWS Glue on Ray docs by @jaidisido in #1810
- Update config tutorial to include new configuration values by @LeonLuttenberger in #1696
- Improve documentation on running SDK for pandas at scale by @jaidisido in #1697
- Add "Introduction to Ray" Tutorials by @LeonLuttenberger in #1661
- Add SDK for pandas job on Ray cluster tutorial by @malachi-constant in #1616
- Add TypedDicts to docs by @LeonLuttenberger in #2167
Tests
- Add PR linter GitHub action by @jaidisido in #2106
- Replace load tests bucket with SSM parameter by @jaidisido in #2121
- OpenSearch index cleanup / skip by @kukushking in #2149
- Add benchmark tests by @jaidisido in #2143
- Add tests for Glue Ray jobs by @LeonLuttenberger in #1832
- Remove `awswrangler.distributed` from coverage report by @LeonLuttenberger in #1884
- Consolidate unit and load tests by @jaidisido in #1525
- Distribute tests in tox config by @malachi-constant in #1469
Full Changelog: 2.20.1...3.0.0