This is a preview of the Delta Lake 4.0 release notes; the final release notes are still being prepared.
We are excited to announce the preview release of Delta Lake 4.0.0 RC1! Instructions for how to use this release candidate are at the end of these notes. To give feedback on this release candidate, please create issues in our Delta repository.
Delta Spark
Delta Spark 4.0 preview is built on Apache Spark™ 4.0.0-preview1. Similar to Apache Spark, we have released Maven artifacts for Scala 2.13.
- RC1 artifacts: delta-spark_2.13, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://github.com/delta-io/delta/releases/tag/v4.0.0-preview-rc1
The key features of this release are:
- Support for Type Widening: Delta Spark can now change the type of a column to a wider type using the ALTER TABLE t CHANGE COLUMN col TYPE type command, or through schema evolution during MERGE and INSERT operations. See the type widening documentation for the list of all supported type changes and additional information. The table remains readable by Delta 4.0 readers without requiring the data to be rewritten. For compatibility with older versions, a rewrite of the data can be triggered using the ALTER TABLE t DROP FEATURE 'typeWidening' command.
- Support for the Variant data type. Variant is a new Apache Spark data type that enables flexible, efficient processing of semi-structured data without a user-specified schema. Variant data does not require a fixed schema on write; instead, it is queried using a schema-on-read approach. This allows flexible ingestion, and the Spark Variant binary encoding format enables faster processing than operating on raw JSON strings.
- Support for the Managed Commit table feature. Managed Commit is a writer table feature that lets users designate a “Commit Owner” for their Delta table. A commit owner is an entity with a unique identifier that maintains information about commits. Once a commit owner has been designated, all writes to the table must be managed by it. This single point of ownership over commits makes cross-environment (e.g. cross-cloud) writes safe. This release also adds a DynamoDB Commit Owner, which can use a DynamoDB instance to manage commits to a table.
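The type widening and Variant features above can be sketched in Spark SQL. This is an illustrative sketch, not taken from the release: the table and column names (events, user_id, raw_events, payload) are hypothetical, and it assumes a Spark 4.0 session with the Delta extensions enabled. parse_json and variant_get are Spark 4.0 SQL functions.

```sql
-- Type widening: promote an existing INT column to BIGINT in place,
-- without rewriting the underlying data files.
ALTER TABLE events CHANGE COLUMN user_id TYPE BIGINT;

-- Variant: ingest semi-structured JSON without declaring a write schema.
CREATE TABLE raw_events (payload VARIANT) USING delta;
INSERT INTO raw_events SELECT parse_json('{"device": "sensor-1", "temp": 21.5}');

-- Schema-on-read: extract a typed field at query time.
SELECT variant_get(payload, '$.temp', 'double') AS temp FROM raw_events;
```

If older readers need access to a widened table, the ALTER TABLE ... DROP FEATURE 'typeWidening' command described above triggers the compatibility rewrite.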
More detailed release notes coming soon!
Delta Kernel Java
The Delta Kernel project is a set of Java and Rust libraries for building Delta connectors that can read and write Delta tables without needing to understand the Delta protocol details. Release notes for this component will come later.
- RC1 artifacts: delta-kernel-api, delta-kernel-defaults
How to use this Release Candidate
Download Spark 4.0.0-preview1 from https://archive.apache.org/dist/spark/spark-4.0.0-preview1.
For this release candidate, we have published the artifacts to a staging repository. Here’s how you can use them:
Spark Submit
Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1147 to the command-line arguments:
spark-submit --packages io.delta:delta-spark_2.13:4.0.0rc1 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1147 examples/examples.py
Currently the Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 4.0.0rc1 by providing just the --packages io.delta:delta-spark_2.13:4.0.0rc1 argument.
Spark Shell
bin/spark-shell --packages io.delta:delta-spark_2.13:4.0.0rc1 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1147 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Spark SQL
bin/spark-sql --packages io.delta:delta-spark_2.13:4.0.0rc1 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1147 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Maven
<repositories>
<repository>
<id>staging-repo</id>
<url>https://oss.sonatype.org/content/repositories/iodelta-1147</url>
</repository>
</repositories>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-spark_2.13</artifactId>
<version>4.0.0rc1</version>
</dependency>
SBT Project
libraryDependencies += "io.delta" %% "delta-spark" % "4.0.0rc1"
resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1147"
(PySpark) Delta-Spark
- Note: the Spark version on PyPI is 4.0.0.dev1
- Download two artifacts from pre-release: https://github.com/delta-io/delta/releases/tag/v4.0.0rc1
- Artifacts to download are:
  - delta-spark-4.0.0rc1.tar.gz
  - delta_spark-4.0.0rc1-py3-none-any.whl
- Keep them in one directory; let's call it ~/Downloads
pip install ~/Downloads/delta_spark-4.0.0rc1-py3-none-any.whl
Running pip show delta-spark should show output similar to the following:
Name: delta-spark
Version: 4.0.0rc1
Summary: Python APIs for using Delta Lake with Apache Spark
Home-page: https://github.com/delta-io/delta/
Author: The Delta Lake Project Authors
Author-email: delta-users@googlegroups.com
License: Apache-2.0
Location: /Users/allison.portis/opt/anaconda3/envs/delta-release-4-0/lib/python3.9/site-packages
Requires: importlib-metadata, pyspark
Required-by:
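Once the wheel is installed, a SparkSession can be configured for Delta. The sketch below is illustrative, not from the release notes: it uses the delta-spark helper configure_spark_with_delta_pip, assumes a local Spark 4.0.0-preview1 / PySpark 4.0.0.dev1 environment, and the app and table names are arbitrary.

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Builder with the standard Delta SQL extension and catalog settings
# (the same two --conf values used in the shell commands above).
builder = (
    SparkSession.builder.appName("delta-4.0-preview")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the matching delta-spark Maven
# package to spark.jars.packages before the session is created.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS demo (id BIGINT) USING delta")
```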
Note: the delta-flink, delta-iceberg, delta-hudi, delta-hive, and delta-standalone artifacts are not included in this preview but will be available in a future release.