Delta Lake 3.1.0
We are excited to announce the preview release of Delta Lake 3.1.0. This release includes several exciting new features.
Delta Spark
Delta Spark 3.1.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/3.1.0/index.html
- Maven artifacts: https://oss.sonatype.org/content/repositories/iodelta-1129
- PyPI artifacts: attached to this release. Follow the instructions in the "Delta-spark PyPi" section at the end of these release notes to install the package locally.
The key features of this release are:
- Support for merge with deletion vectors to reduce the write overhead for merge operations. This feature improves merge performance severalfold.
- Support for optimizing min/max aggregation queries using the table metadata, which improves the performance of simple aggregation queries (e.g., SELECT min(x) FROM deltaTable) by up to 100x.
- (Experimental) Liquid clustering for better table layout. Now Delta allows clustering the data in a Delta table for better data skipping. Currently this is an experimental feature. See documentation and example for how to try out this feature.
- Support for DEFAULT value columns. Delta supports defining default expressions for columns on Delta tables. Delta generates default values for columns when users do not explicitly provide values for them when writing to such tables, or when the user explicitly specifies the DEFAULT SQL keyword for any such column. See the documentation on how to enable and try out this feature.
- Support for Hive Metastore schema sync. Adds a post-commit hook for syncing the table schema and properties to HMS (or an HMS-compatible metastore such as AWS Glue) whenever they change. See the documentation on how to enable this feature.
- Query Delta Sharing tables from Delta-Spark. Delta-Spark now allows querying Delta tables shared using the Delta Sharing protocol. Supported queries include batch queries, streaming queries and CDF queries. Delta tables with deletion vectors or column mapping enabled can also be shared and read in Delta-Spark. See the documentation for further details.
- Auto compaction to address the small files problem during table writes. Auto compaction, which runs at the end of the write query, combines small files within partitions into larger files to reduce the metadata size and improve query performance.
- Optimized writes. This optimization repartitions and rebalances data before writing it out to a Delta table. Optimized writes improve file size as data is written and benefit subsequent reads on the table.
- Other notable changes include
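To illustrate why the min/max aggregation optimization listed above can avoid reading data files, here is a minimal Python sketch. It is not Delta's implementation: FileStats and table_min_max are hypothetical names, and the sketch assumes only that the table metadata records a row count and per-column minimum and maximum values for each data file. It also ignores deletion vectors, which can make file-level bounds wider than the live data.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional, Tuple


@dataclass
class FileStats:
    """Hypothetical stand-in for the per-file statistics kept in table metadata."""
    num_records: int
    min_values: Dict[str, Any] = field(default_factory=dict)
    max_values: Dict[str, Any] = field(default_factory=dict)


def table_min_max(files: Iterable[FileStats], column: str) -> Optional[Tuple[Any, Any]]:
    """Return (min, max) of `column` using only file-level statistics.

    Returns None when the statistics cannot answer the query (no rows, or a
    non-empty file without statistics for the column); an engine would then
    have to fall back to scanning the data.
    """
    mins, maxs = [], []
    for f in files:
        if f.num_records == 0:
            continue  # empty files cannot affect the result
        if column not in f.min_values or column not in f.max_values:
            return None  # statistics missing: metadata alone is not enough
        mins.append(f.min_values[column])
        maxs.append(f.max_values[column])
    if not mins:
        return None
    return min(mins), max(maxs)


files = [
    FileStats(100, {"x": 5}, {"x": 40}),
    FileStats(250, {"x": -3}, {"x": 17}),
    FileStats(0),  # an empty file with no statistics
]
print(table_min_max(files, "x"))  # (-3, 40)
```

The work is proportional to the number of files rather than the number of rows, which is where a large speedup for queries such as SELECT min(x) FROM deltaTable would come from.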
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.1.0/delta-uniform.html
Delta Universal Format (UniForm) will allow you to read Delta tables with Hudi and Iceberg clients. Delta 3.1.0 provides the following improvements:
- Enhanced Iceberg support, called IcebergCompatV2, which supports List and Map data types and improves compatibility by writing timestamps as int64, per the Iceberg spec.
- A new SQL command, REORG TABLE table APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2)), to upgrade existing Delta tables to UniForm.
- Delta file statistics conversion to Iceberg, including max/min/rowCount/nullCount.
Delta Kernel
- API documentation: https://docs.delta.io/3.1.0/api/java/kernel/index.html
The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without needing to understand the Delta protocol details.
Delta 3.0.0 released the first version of Kernel. In this release, read support is further enhanced and the APIs are solidified, taking into account the feedback received from connectors trying out the first version in Delta 3.0.0.
- Support for data skipping for given query predicates. Now Kernel can prune the list of files to scan for a given query predicate using the file level statistics stored in the Delta metadata. This helps connectors read less data than usual.
- Improved Delta table snapshot reconstruction latency. Kernel can now load the initial protocol and metadata much faster due to improved table state reconstruction.
- Support for column mapping id mode. Now tables with column mapping id mode can be read by Kernel.
- Misc. API changes and bug fixes.
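The data skipping item above can be pictured with a small Python sketch. This is not the Kernel API (which is Java); FileEntry and files_to_scan are hypothetical names, and the sketch assumes only that each file may carry minimum and maximum statistics for the filtered column. A file is skipped only when its [min, max] range proves that no row can satisfy a range predicate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FileEntry:
    """Hypothetical file entry with optional per-column min/max statistics."""
    path: str
    min_values: Dict[str, int] = field(default_factory=dict)
    max_values: Dict[str, int] = field(default_factory=dict)


def files_to_scan(
    files: List[FileEntry],
    column: str,
    lower: Optional[int] = None,
    upper: Optional[int] = None,
) -> List[str]:
    """Return paths of files that may contain rows with lower <= column <= upper."""
    selected = []
    for f in files:
        lo = f.min_values.get(column)
        hi = f.max_values.get(column)
        if lo is None or hi is None:
            selected.append(f.path)  # no statistics: cannot prove the file is irrelevant
            continue
        if lower is not None and hi < lower:
            continue  # every value in the file is below the requested range
        if upper is not None and lo > upper:
            continue  # every value in the file is above the requested range
        selected.append(f.path)
    return selected


files = [
    FileEntry("part-0.parquet", {"id": 1}, {"id": 10}),
    FileEntry("part-1.parquet", {"id": 11}, {"id": 20}),
    FileEntry("part-2.parquet", {"id": 21}, {"id": 30}),
    FileEntry("part-3.parquet"),  # written without statistics
]
# Predicate: id >= 15 AND id <= 22
print(files_to_scan(files, "id", lower=15, upper=22))
```

Skipping is conservative by design: files without statistics are always kept, so pruning can only reduce the data read, never change the query result.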
For more information, refer to:
- User guide on the step-by-step process of using Kernel in a standalone Java program or in a distributed processing connector.
- Slides explaining the rationale behind Kernel and the API design.
- Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
- Table and default TableClient API Java documentation
Delta Flink
Delta-Flink 3.1.0 is built on top of Apache Flink™ 1.16.1.
- Documentation: https://github.com/delta-io/delta/tree/branch-3.1/connectors/flink
- API Documentation: https://docs.delta.io/3.1.0/api/java/flink/index.html
The key features of this release are
- Flink write job startup latency improvement using Kernel. In this version, Flink has an option to use Kernel to load the Delta table metadata (i.e., the table schema), which helps reduce the startup time by up to 45x.
Delta Standalone
- Documentation: https://docs.delta.io/3.1.0/delta-standalone.html
- API Documentation: https://docs.delta.io/3.1.0/api/java/standalone/index.html
There are no updates to Delta Standalone in this release.
Credits
Ala Luszczak, Allison Portis, Ami Oka, Amogh Akshintala, Andreas Chatzistergiou, Bart Samwel, BjarkeTornager, Christos Stavrakakis, Costas Zarifis, Daniel Tenedorio, Dhruv Arya, EJ Song, Eric Maynard, Felipe Pessoto, Fred Storage Liu, Fredrik Klauss, Gengliang Wang, Gerhard Brueckl, Haejoon Lee, Hao Jiang, Jared Wang, Jiaheng Tang, Jing Wang, Johan Lasperas, Kaiqi Jin, Kam Cheung Ting, Lars Kroll, Li Haoyi, Lin Zhou, Lukas Rupprecht, Mark Jarvin, Max Gekk, Ming DAI, Nick Lanham, Ole Sasse, Paddy Xu, Patrick Leahey, Peter Toth, Prakhar Jain, Renan Tomazoni Pinzon, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Thang Long Vu, Tom van Bussel, Venki Korukanti, Vitalii Li, Wei Luo, Wenchen Fan, Xin Zhao, ericm-db, jintao shen, panbingkun
How to use the preview release
Delta-Spark
Download Spark 3.5.0 from https://spark.apache.org/downloads.html
For this preview, we have published the artifacts to a staging repository. Here’s how you can use them:
spark-submit
Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1129
to the command line arguments.
Example:
spark-submit --packages io.delta:delta-spark_2.12:3.1.0 \
--repositories \
https://oss.sonatype.org/content/repositories/iodelta-1129 examples/examples.py
Currently, Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 3.1.0 by just providing the --packages io.delta:delta-spark_2.12:3.1.0 argument.
Spark-shell
bin/spark-shell --packages io.delta:delta-spark_2.12:3.1.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1129 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Spark-SQL
bin/spark-sql --packages io.delta:delta-spark_2.12:3.1.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1129 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Maven project
<repositories>
  <repository>
    <id>staging-repo</id>
    <url>https://oss.sonatype.org/content/repositories/iodelta-1129</url>
  </repository>
</repositories>
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-spark_2.12</artifactId>
  <version>3.1.0</version>
</dependency>
SBT project
libraryDependencies += "io.delta" %% "delta-spark" % "3.1.0"
resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1129"
Delta-spark PyPi:
- Download two artifacts from pre-release here. Artifacts to download are:
- delta-spark-3.1.0.tar.gz
- delta_spark-3.1.0-py3-none-any.whl
- Keep them in one directory. Let's call it ~/Downloads.
pip install ~/Downloads/delta_spark-3.1.0-py3-none-any.whl
pip show delta-spark
should show output similar to the following:
Name: delta-spark
Version: 3.1.0
Summary: Python APIs for using Delta Lake with Apache Spark
Home-page: https://github.com/delta-io/delta/
Author: The Delta Lake Project Authors
Author-email: delta-users@googlegroups.com
License: Apache-2.0
Location: <user-home>/.conda/envs/delta-release/lib/python3.8/site-packages
Requires: importlib-metadata, pyspark
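Once the wheel is installed, a PySpark application needs the same settings that the spark-submit and shell commands earlier in these notes pass on the command line. The following is a minimal sketch, assuming pyspark is installed and the staging repository is still reachable. preview_conf and build_session are hypothetical helper names introduced for illustration; the configuration keys themselves (spark.jars.packages, spark.jars.repositories, spark.sql.extensions, spark.sql.catalog.spark_catalog) are standard Spark settings.

```python
STAGING_REPO = "https://oss.sonatype.org/content/repositories/iodelta-1129"


def preview_conf(scala_version: str = "2.12") -> dict:
    """Spark settings equivalent to the command-line flags shown in these notes."""
    return {
        "spark.jars.packages": f"io.delta:delta-spark_{scala_version}:3.1.0",
        "spark.jars.repositories": STAGING_REPO,
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name: str = "delta-3.1.0-preview"):
    """Create a SparkSession with the preview settings (requires pyspark)."""
    # Imported lazily so that preview_conf() can be inspected without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in preview_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Print the settings; call build_session() instead to start Spark.
    for key, value in preview_conf().items():
        print(f"{key}={value}")
```

Keeping the settings in a plain dictionary makes it easy to switch to the Scala 2.13 artifacts with preview_conf("2.13") or to drop the staging repository entry once the artifacts are available from a regular repository.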