We are excited to announce the preview release of Delta Lake 2.3.0 on Apache Spark 3.3. As with Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.3.0/
- Maven artifacts: https://oss.sonatype.org/content/repositories/iodelta-1066
- Python artifacts: https://test.pypi.org/project/delta-spark/2.3.0rc1/
The key features in this release are as follows:

- Zero-copy convert to Delta from Iceberg tables using `CONVERT TO DELTA`. This generates a Delta table in the same location and does not rewrite any parquet files (this and several of the features below are exercised in the sketch after this list).
- Support `SHALLOW CLONE` for Delta, Parquet, and Iceberg tables to clone a source table without copying the data files. `SHALLOW CLONE` creates a copy of the source table's definition but refers to the source table's data files.
- Support idempotent writes for DML operations. This feature adds idempotency to `INSERT`/`DELETE`/`UPDATE`/`MERGE` etc. operations using the SQL configurations `spark.databricks.delta.write.txnAppId` and `spark.databricks.delta.write.txnVersion`.
- Support "when not matched by source" clauses for the Merge command to update or delete rows in the target table that don't have matches in the source table based on the merge condition. This clause is supported in the Python, Scala, and Java `DeltaTable` APIs; SQL support will be added in Spark 3.4.
- Support `CREATE TABLE LIKE` to create empty Delta tables using the definition and metadata of an existing table or view.
- Support reading Change Data Feed (CDF) in SQL queries using the `table_changes` table-valued function.
- Unblock Change Data Feed (CDF) batch reads on column mapping enabled tables when `DROP COLUMN` and `RENAME COLUMN` have been used.
- Improved read and write performance on S3 when writing from a single cluster. Efficient file listing decreases the metadata processing time when calculating a table snapshot, which is most impactful for tables with many commits. Set the Hadoop configuration `delta.enableFastS3AListFrom` to `true` to enable it.
- Record `VACUUM` operations in the transaction log. With this feature, `VACUUM` operations and their associated metrics (e.g. `numDeletedFiles`) will now show up in table history.
- Support reading Delta tables with deletion vectors.
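To make the headline features concrete, here is a minimal PySpark sketch that exercises several of them. The table names, paths, `spark` session, and `updates_df` DataFrame are hypothetical placeholders; the Iceberg conversion additionally requires the `delta-iceberg` artifact on the classpath, and the CDF read assumes the table was created with `delta.enableChangeDataFeed` set.

```python
from delta.tables import DeltaTable

# Zero-copy convert an Iceberg table in place; no parquet files are rewritten.
spark.sql("CONVERT TO DELTA iceberg.`/data/events_iceberg`")

# Shallow clone: copies the table definition but references the source's data files.
spark.sql("CREATE TABLE events_dev SHALLOW CLONE events")

# Idempotent DML: a retry carrying the same (txnAppId, txnVersion) pair is skipped.
spark.conf.set("spark.databricks.delta.write.txnAppId", "nightly-etl")
spark.conf.set("spark.databricks.delta.write.txnVersion", "42")
spark.sql("DELETE FROM events WHERE event_date < '2022-01-01'")

# "When not matched by source": delete target rows that have no match in the source.
(DeltaTable.forName(spark, "events").alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .whenNotMatchedBySourceDelete()
    .execute())

# Read the Change Data Feed through the new table-valued function.
spark.sql("SELECT * FROM table_changes('events', 0, 10)").show()
```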
- Other notable changes:
  - Support schema evolution in `MERGE` for `UPDATE SET <assignments>` and `INSERT (...) VALUES (...)` actions. Previously, schema evolution was only supported for `UPDATE SET *` and `INSERT *` actions. See the sketch after this list.
  - Add `.show()` support for `COUNT(*)` aggregate pushdown.
  - Enforce idempotent writes for `df.saveAsTable` for overwrite and append mode.
  - Support Table Features to selectively add individual features when upgrading the table protocol version. This enables users to add only active features, and it will facilitate connectivity as downstream Delta connectors can selectively implement feature support.
  - Automatically generate partition filters for additional generation expressions:
    - Support the `trunc` and `date_trunc` functions.
    - Support the `date_format` function with format `yyyy-MM-dd`.
  - Block protocol downgrades when replacing a Delta table to prevent any incorrect time-travel or CDF queries.
  - Fix `replaceWhere` with the DataFrame V2 overwrite API to correctly evaluate less-than conditions.
  - Fix dynamic partition overwrite for tables with more than one partition data type.
  - Fix schema evolution for `INSERT OVERWRITE` with complex data types when the source schema is read-incompatible.
  - Fix the Delta streaming source to correctly detect read-incompatible schema changes during backfill when there is exactly one schema change in the versions read.
  - Fix a bug in `VACUUM` where sometimes the default retention period was used to remove files instead of the retention period specified in the table properties.
  - Include the table name in the DataFrame returned by the `deltaTable.detail()` Python/Scala/Java API.
  - Improve the log message for `VACUUM table_name DRY RUN`.
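For the `MERGE` schema-evolution change above, a short sketch (hypothetical `target`/`source` tables; schema evolution is gated on the documented `spark.databricks.delta.schema.autoMerge.enabled` config): referencing a source column that is missing from the target in an explicit assignment list now evolves the target schema rather than failing.

```python
# Enable automatic schema evolution for MERGE.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# `new_col` exists only in `source`; with 2.3.0 these explicit UPDATE SET and
# INSERT (...) VALUES (...) actions add it to `target` instead of erroring.
spark.sql("""
    MERGE INTO target t
    USING source s
    ON t.id = s.id
    WHEN MATCHED THEN
      UPDATE SET t.value = s.value, t.new_col = s.new_col
    WHEN NOT MATCHED THEN
      INSERT (id, value, new_col) VALUES (s.id, s.value, s.new_col)
""")
```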
How to use the preview release
For this preview we have published the artifacts to a staging repository. Here’s how you can use them:
- spark-submit: Add `--repositories https://oss.sonatype.org/content/repositories/iodelta-1066/` to the command line arguments. For example:

  ```bash
  spark-submit --packages io.delta:delta-core_2.12:2.3.0rc1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1066/ examples/examples.py
  ```
- Currently the Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta `2.3.0rc1` by providing just the `--packages io.delta:delta-core_2.12:2.3.0rc1` argument.
- Maven project:

  ```xml
  <repositories>
    <repository>
      <id>staging-repo</id>
      <url>https://oss.sonatype.org/content/repositories/iodelta-1066/</url>
    </repository>
  </repositories>

  <dependency>
    <groupId>io.delta</groupId>
    <artifactId>delta-core_2.12</artifactId>
    <version>2.3.0rc1</version>
  </dependency>
  ```
- SBT project:

  ```scala
  libraryDependencies += "io.delta" %% "delta-core" % "2.3.0rc1"
  resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1066/"
  ```
- Delta-spark:

  ```bash
  pip install -i https://test.pypi.org/simple/ delta-spark==2.3.0rc1
  ```
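Once installed, a session can be configured with the usual Delta extensions. A minimal sketch, assuming the standard `configure_spark_with_delta_pip` helper from the `delta` package; because the RC jars live in the staging repository, `spark.jars.repositories` is pointed there as well:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-2.3.0rc1-smoke-test")
    # Standard Delta-on-Spark wiring per the Delta docs.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # The RC artifacts are in the staging repo, not Maven Central.
    .config("spark.jars.repositories",
            "https://oss.sonatype.org/content/repositories/iodelta-1066/")
)
# Adds the matching io.delta:delta-core Maven coordinates to the session.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Quick smoke test: write and read back a small Delta table.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-rc-check")
spark.read.format("delta").load("/tmp/delta-rc-check").show()
```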
Credits
Allison Portis, Andreas Chatzistergiou, Andrew Li, Bo Zhang, Brayan Jules, Burak Yavuz, Christos Stavrakakis, Daniel Tenedorio, Dhruv Shah, Felipe Pessoto, Fred Liu, Fredrik Klauss, Gengliang Wang,
Haejoon Lee, Hussein Nagree, Jackie Zhang, Jiaheng Tang, Jintian Liang, Johan Lasperas, Jungtaek Lim, Kam Cheung Ting, Koki Otsuka, Lars Kroll, Lin Ma, Lukas Rupprecht, Ming DAI, Mitchell Riley, Ole Sasse, Paddy Xu, Prakhar Jain, Pranav, Rahul Shivu Mahadev, Rajesh Parangi, Ryan Johnson, Scott Sandre, Serge Rielau, Shixiong Zhu, Slim Ouertani, Tobias Fabritz, Tom van Bussel, Tushar Machavolu, Tyson Condie, Venki Korukanti, Vitalii Li, Wenchen Fan, Xinyi Yu, Yaohua Zhao, Yingyi Bu