These are preliminary notes for the Delta 4.0 Preview. The final release notes are still being generated.
We are excited to announce RC2 of the preview release of Delta Lake 4.0.0! Instructions for how to use this release candidate are at the end of these notes. To give feedback on this release candidate, please create issues in our Delta repository.
- Please note the final version (and the version of every release candidate) for the preview release will be "4.0.0rc1". Apologies for any confusion this may cause, but due to our release infrastructure this is the versioning we must use.
Delta Spark
Delta Spark 4.0 preview is built on Apache Spark™ 4.0.0-preview1. Similar to Apache Spark, we have released Maven artifacts for Scala 2.13.
- RC2 artifacts: delta-spark_2.13, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://github.com/delta-io/delta/releases/tag/v4.0.0-preview-rc2
The key features of this release are:
- Support for Type Widening: Delta Spark can now change the type of a column to a wider type using the ALTER TABLE t CHANGE COLUMN col TYPE type command or with schema evolution during MERGE and INSERT operations. See the type widening documentation for a list of all supported type changes and additional information. The table will be readable by Delta 4.0 readers without requiring the data to be rewritten. For compatibility with older versions, a rewrite of the data can be triggered using the ALTER TABLE t DROP FEATURE 'typeWidening' command.
- Support for Spark Connect (aka Delta Connect): Spark Connect is a new initiative in Apache Spark that adds a decoupled client-server infrastructure which allows Spark applications to connect remotely to a Spark server and run SQL / DataFrame operations. Delta Connect allows Delta operations to be made in applications running in such client-server mode. Further instructions on how to try this out are coming soon.
- Support for the Variant data type: Variant is a new Apache Spark data type that enables flexible and efficient processing of semi-structured data without a user-specified schema. Variant data does not require a fixed schema on write; instead, it is queried using a schema-on-read approach. This allows flexible ingestion and enables faster processing with the Spark Variant binary encoding format.
- Support for Coordinated Commits: Coordinated Commits is a new writer table feature which allows users to designate a “Commit Coordinator” for their Delta table. A commit coordinator is an entity with a unique identifier which maintains information about commits. Once a commit coordinator has been set for a table, all writes to the table must be coordinated through it. This single point of ownership of commits for the table makes cross-environment (e.g. cross-cloud) writes safe. Examples of Commit Coordinators are catalogs (Hive Metastore, Unity Catalog, etc.), DynamoDB, or any system which can implement the commit coordinator API. This release also adds a DynamoDB Commit Coordinator which can use a DynamoDB table to coordinate commits for a table. Delta tables with commit coordinators are still readable through the object storage paths, making reads backward compatible.
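As a rough illustration of the type widening commands mentioned above, the following Python sketch assembles the SQL statements in the order they would typically be run. The table name, column name, and helper function names are hypothetical. The delta.enableTypeWidening table property is not stated in these notes; it is an assumption based on the type widening documentation and should be checked there before use.

```python
def type_widening_statements(table, column, new_type):
    """Return the SQL statements for one type widening round trip, in order."""
    return [
        # Assumption: the feature is enabled through this table property
        # (not stated in these notes; check the type widening documentation).
        f"ALTER TABLE {table} SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')",
        # Widen the column in place, using the command from these notes.
        f"ALTER TABLE {table} CHANGE COLUMN {column} TYPE {new_type}",
        # Drop the feature, which triggers a data rewrite for older readers.
        f"ALTER TABLE {table} DROP FEATURE 'typeWidening'",
    ]


def apply_statements(spark, statements):
    """Run each statement on an existing SparkSession that has Delta configured."""
    for statement in statements:
        spark.sql(statement)
```

For example, apply_statements(spark, type_widening_statements("events", "clicks", "BIGINT")) would widen a hypothetical INT column clicks to BIGINT and then drop the feature. In practice the last statement is only needed when older readers must access the table.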
More detailed release notes coming soon!
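To make the schema-on-read idea behind the Variant type more concrete, here is a minimal Python sketch that stores a JSON document in a VARIANT column of a Delta table and extracts one field at query time. The table name, field names, and helper function are hypothetical. It assumes the Spark 4.0 SQL functions parse_json and variant_get and a SparkSession already configured for Delta; treat it as an illustration rather than the project's recommended usage.

```python
# Hypothetical table and field names; the statements are meant to run in order.
VARIANT_DEMO_SQL = [
    # A Delta table with a VARIANT column and no schema for the payload.
    "CREATE TABLE variant_demo (id INT, payload VARIANT) USING delta",
    # parse_json converts a JSON string into a Variant value on write.
    "INSERT INTO variant_demo "
    "SELECT 1, parse_json('{\"device\": {\"os\": \"linux\"}, \"clicks\": 3}')",
    # variant_get extracts a path and casts it to the requested type on read.
    "SELECT id, variant_get(payload, '$.device.os', 'string') AS os FROM variant_demo",
]


def run_variant_demo(spark):
    """Create and fill the demo table, then return the rows of the final query."""
    for statement in VARIANT_DEMO_SQL[:-1]:
        spark.sql(statement)
    return spark.sql(VARIANT_DEMO_SQL[-1]).collect()
```

The payload structure is never declared when the table is created; the path and target type are supplied only in the final query.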
Delta Kernel Java
The Delta Kernel project is a set of Java and Rust libraries for building Delta connectors that can read from and write to Delta tables without the need to understand the Delta protocol details. Release notes for this component to come later.
- RC2 artifacts: delta-kernel-api, delta-kernel-defaults
How to use this Release Candidate
Download Spark 4.0.0-preview1 from https://archive.apache.org/dist/spark/spark-4.0.0-preview1.
For this release candidate, we have published the artifacts to a staging repository. Here’s how you can use them:
Spark Submit
Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1148 to the command line arguments.
spark-submit --packages io.delta:delta-spark_2.13:4.0.0rc1 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1148 examples/examples.py
Currently, the Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 4.0.0rc1 by just providing the --packages io.delta:delta-spark_2.13:4.0.0rc1 argument.
Spark Shell
bin/spark-shell --packages io.delta:delta-spark_2.13:4.0.0rc1 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Spark SQL
bin/spark-sql --packages io.delta:delta-spark_2.13:4.0.0rc1 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Maven
<repositories>
<repository>
<id>staging-repo</id>
<url>https://oss.sonatype.org/content/repositories/iodelta-1148</url>
</repository>
</repositories>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-spark_2.13</artifactId>
<version>4.0.0rc1</version>
</dependency>
SBT Project
libraryDependencies += "io.delta" %% "delta-spark" % "4.0.0rc1"
resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1148"
(PySpark) Delta-Spark
- Note: the Spark version for PyPI is 4.0.0.dev1
- Download two artifacts from pre-release: https://github.com/delta-io/delta/releases/tag/v4.0.0-preview-rc2
- Artifacts to download are:
delta-spark-4.0.0rc1.tar.gz
delta_spark-4.0.0rc1-py3-none-any.whl
- Keep them in one directory. Let's call it ~/Downloads
pip install ~/Downloads/delta_spark-4.0.0rc1-py3-none-any.whl
pip show delta-spark should show output similar to the following:
Name: delta-spark
Version: 4.0.0rc1
Summary: Python APIs for using Delta Lake with Apache Spark
Home-page: https://github.com/delta-io/delta/
Author: The Delta Lake Project Authors
Author-email: delta-users@googlegroups.com
License: Apache-2.0
Location: /Users/allison.portis/opt/anaconda3/envs/delta-release-4-0/lib/python3.9/site-packages
Requires: importlib-metadata, pyspark
Required-by:
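Once the wheel is installed, a quick way to check it from Python is to start a session with the same two Delta settings used in the Spark Shell and Spark SQL commands above. The sketch below is one possible way to do that, not an official procedure. It uses configure_spark_with_delta_pip from the delta package to add the matching delta-spark Maven coordinate, and it assumes the RC jars are already in the local cache (for example after running the spark-submit command above with --repositories). The function name and app name are hypothetical.

```python
# The same two settings passed via --conf in the shell commands above.
DELTA_SESSION_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_delta_session(app_name="delta-rc-check"):
    """Build a SparkSession configured for the pip-installed Delta release candidate."""
    # Imported lazily so this module can be loaded where pyspark is not installed.
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_SESSION_CONF.items():
        builder = builder.config(key, value)
    # Adds the delta-spark Maven package matching the installed pip version.
    # Assumption: the RC jars are already resolvable from the local cache.
    return configure_spark_with_delta_pip(builder).getOrCreate()
```

A simple smoke test would be to call build_delta_session(), write a small DataFrame with format("delta") to a scratch path, and read it back.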
Note: the artifacts delta-flink, delta-iceberg, delta-hudi, delta-hive and delta-standalone are not included in this preview but will be available in a future release.