We are excited to announce the preview release of Delta Lake 2.3.0 on Apache Spark 3.3. As with Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.3.0/
- Maven artifacts: https://oss.sonatype.org/content/repositories/iodelta-1066
- Python artifacts: https://test.pypi.org/project/delta-spark/2.3.0rc1/
The key features in this release are as follows:

- Zero-copy convert to Delta from Iceberg tables using `CONVERT TO DELTA`. This generates a Delta table in the same location and does not rewrite any parquet files (this and several of the features below are exercised in the sketch after this list).
- Support `SHALLOW CLONE` for Delta, Parquet, and Iceberg tables to clone a source table without copying the data files. `SHALLOW CLONE` creates a copy of the source table's definition but refers to the source table's data files.
- Support idempotent writes for DML operations. This feature adds idempotency to `INSERT`/`DELETE`/`UPDATE`/`MERGE` etc. operations using the SQL configurations `spark.databricks.delta.write.txnAppId` and `spark.databricks.delta.write.txnVersion`.
- Support "when not matched by source" clauses for the Merge command to update or delete rows in the target table that don't have matches in the source table based on the merge condition. This clause is supported in the Python, Scala, and Java `DeltaTable` APIs; SQL support will be added in Spark 3.4.
- Support `CREATE TABLE LIKE` to create empty Delta tables using the definition and metadata of an existing table or view.
- Support reading Change Data Feed (CDF) in SQL queries using the `table_changes` table-valued function.
- Unblock Change Data Feed (CDF) batch reads on column mapping enabled tables when `DROP COLUMN` and `RENAME COLUMN` have been used.
- Improved read and write performance on S3 when writing from a single cluster. Efficient file listing decreases the metadata processing time when calculating a table snapshot, which is most impactful for tables with many commits. Set the Hadoop configuration `delta.enableFastS3AListFrom` to `true` to enable it.
- Record `VACUUM` operations in the transaction log. With this feature, `VACUUM` operations and their associated metrics (e.g. `numDeletedFiles`) will now show up in table history.
- Support reading Delta tables with deletion vectors.
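To make the headline features concrete, here is a minimal PySpark sketch that exercises several of them. The table names, paths, `spark` session, and `updates_df` DataFrame are hypothetical placeholders; the Iceberg conversion additionally requires the `delta-iceberg` artifact on the classpath, and the CDF read assumes the table was created with `delta.enableChangeDataFeed` set.

```python
from delta.tables import DeltaTable

# Zero-copy convert an Iceberg table in place; no parquet files are rewritten.
spark.sql("CONVERT TO DELTA iceberg.`/data/events_iceberg`")

# Shallow clone: copies the table definition but references the source's data files.
spark.sql("CREATE TABLE events_dev SHALLOW CLONE events")

# Idempotent DML: a retry carrying the same (txnAppId, txnVersion) pair is skipped.
spark.conf.set("spark.databricks.delta.write.txnAppId", "nightly-etl")
spark.conf.set("spark.databricks.delta.write.txnVersion", "42")
spark.sql("DELETE FROM events WHERE event_date < '2022-01-01'")

# "When not matched by source": delete target rows that have no match in the source.
(DeltaTable.forName(spark, "events").alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .whenNotMatchedBySourceDelete()
    .execute())

# Read the Change Data Feed through the new table-valued function.
spark.sql("SELECT * FROM table_changes('events', 0, 10)").show()
```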
- Other notable changes:
  - Support schema evolution in `MERGE` for `UPDATE SET <assignments>` and `INSERT (...) VALUES (...)` actions. Previously, schema evolution was only supported for `UPDATE SET *` and `INSERT *` actions. See the sketch after this list.
  - Add `.show()` support for `COUNT(*)` aggregate pushdown.
  - Enforce idempotent writes for `df.saveAsTable` for overwrite and append mode.
  - Support Table Features to selectively add individual features when upgrading the table protocol version. This enables users to add only active features, and it will facilitate connectivity as downstream Delta connectors can selectively implement feature support.
  - Automatically generate partition filters for additional generation expressions:
    - Support the `trunc` and `date_trunc` functions.
    - Support the `date_format` function with format `yyyy-MM-dd`.
  - Block protocol downgrades when replacing a Delta table to prevent any incorrect time-travel or CDF queries.
  - Fix `replaceWhere` with the DataFrame V2 overwrite API to correctly evaluate less-than conditions.
  - Fix dynamic partition overwrite for tables with more than one partition data type.
  - Fix schema evolution for `INSERT OVERWRITE` with complex data types when the source schema is read-incompatible.
  - Fix the Delta streaming source to correctly detect read-incompatible schema changes during backfill when there is exactly one schema change in the versions read.
  - Fix a bug in `VACUUM` where sometimes the default retention period was used to remove files instead of the retention period specified in the table properties.
  - Include the table name in the DataFrame returned by the `deltaTable.detail()` Python/Scala/Java API.
  - Improve the log message for `VACUUM table_name DRY RUN`.
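For the `MERGE` schema-evolution change above, a short sketch (hypothetical `target`/`source` tables; schema evolution is gated on the documented `spark.databricks.delta.schema.autoMerge.enabled` config): referencing a source column that is missing from the target in an explicit assignment list now evolves the target schema rather than failing.

```python
# Enable automatic schema evolution for MERGE.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# `new_col` exists only in `source`; with 2.3.0 these explicit UPDATE SET and
# INSERT (...) VALUES (...) actions add it to `target` instead of erroring.
spark.sql("""
    MERGE INTO target t
    USING source s
    ON t.id = s.id
    WHEN MATCHED THEN
      UPDATE SET t.value = s.value, t.new_col = s.new_col
    WHEN NOT MATCHED THEN
      INSERT (id, value, new_col) VALUES (s.id, s.value, s.new_col)
""")
```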
How to use the preview release
For this preview we have published the artifacts to a staging repository. Here’s how you can use them:
- spark-submit: Add `--repositories https://oss.sonatype.org/content/repositories/iodelta-1066/` to the command line arguments. For example:

  ```bash
  spark-submit --packages io.delta:delta-core_2.12:2.3.0rc1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1066/ examples/examples.py
  ```
- Currently the Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta `2.3.0rc1` by providing just the `--packages io.delta:delta-core_2.12:2.3.0rc1` argument.
- Maven project:

  ```xml
  <repositories>
    <repository>
      <id>staging-repo</id>
      <url>https://oss.sonatype.org/content/repositories/iodelta-1066/</url>
    </repository>
  </repositories>

  <dependency>
    <groupId>io.delta</groupId>
    <artifactId>delta-core_2.12</artifactId>
    <version>2.3.0rc1</version>
  </dependency>
  ```
- SBT project:

  ```scala
  libraryDependencies += "io.delta" %% "delta-core" % "2.3.0rc1"
  resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1066/"
  ```
- Delta-spark:

  ```bash
  pip install -i https://test.pypi.org/simple/ delta-spark==2.3.0rc1
  ```
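Once installed, a session can be configured with the usual Delta extensions. A minimal sketch, assuming the standard `configure_spark_with_delta_pip` helper from the `delta` package; because the RC jars live in the staging repository, `spark.jars.repositories` is pointed there as well:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-2.3.0rc1-smoke-test")
    # Standard Delta-on-Spark wiring per the Delta docs.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # The RC artifacts are in the staging repo, not Maven Central.
    .config("spark.jars.repositories",
            "https://oss.sonatype.org/content/repositories/iodelta-1066/")
)
# Adds the matching io.delta:delta-core Maven coordinates to the session.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Quick smoke test: write and read back a small Delta table.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-rc-check")
spark.read.format("delta").load("/tmp/delta-rc-check").show()
```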
Credits
Allison Portis, Andreas Chatzistergiou, Andrew Li, Bo Zhang, Brayan Jules, Burak Yavuz, Christos Stavrakakis, Daniel Tenedorio, Dhruv Shah, Felipe Pessoto, Fred Liu, Fredrik Klauss, Gengliang Wang,
Haejoon Lee, Hussein Nagree, Jackie Zhang, Jiaheng Tang, Jintian Liang, Johan Lasperas, Jungtaek Lim, Kam Cheung Ting, Koki Otsuka, Lars Kroll, Lin Ma, Lukas Rupprecht, Ming DAI, Mitchell Riley, Ole Sasse, Paddy Xu, Prakhar Jain, Pranav, Rahul Shivu Mahadev, Rajesh Parangi, Ryan Johnson, Scott Sandre, Serge Rielau, Shixiong Zhu, Slim Ouertani, Tobias Fabritz, Tom van Bussel, Tushar Machavolu, Tyson Condie, Venki Korukanti, Vitalii Li, Wenchen Fan, Xinyi Yu, Yaohua Zhao, Yingyi Bu