We are excited to announce the preview release of Delta Lake 2.1.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/2.1.0/index.html
- Maven artifacts: https://oss.sonatype.org/content/repositories/iodelta-1087/
- Python artifacts: https://test.pypi.org/project/delta-spark/2.1.0rc1/
The key features in this preview are as follows:
- Support for Apache Spark 3.3.
- Support for [TIMESTAMP | VERSION] AS OF in SQL. Earlier versions of Delta only supported time travel through the DataFrame API. With the necessary Spark parser changes released in Spark 3.3, Delta now supports time travel in SQL.
- Support for SHOW COLUMNS to query the columns of a Delta table in SQL.
- Support for Describe Detail in the Scala and Python DeltaTable API.
- Support for returning operation metrics from SQL Delete commands. Previously, SQL Delete commands returned an empty DataFrame; they now return a DataFrame with num_affected_rows.
- Optimize performance improvements.
- Add a config to use repartition(1) instead of coalesce(1) in Optimize for better performance when merging many small files.
- Improve Optimize performance by using a queue-based approach to parallelize the compaction jobs.
- Other notable changes
- Support for using variables in the VACUUM and OPTIMIZE SQL commands.
- Improvements for CONVERT TO DELTA with catalog tables.
- Autofill the partition schema from the catalog when it’s not provided.
- Use partition information from the catalog to find the data files to commit instead of doing a full directory scan. Instead of committing all data files in the table directory, only data files under the directories of active partitions will be committed.
- Improve Update performance by enabling schema pruning in the first pass.
- Fix for DeltaTableBuilder to preserve table property case of non-delta properties when setting properties.
- Fix for duplicate CDF row output for delete-when-matched merges with multiple matches.
- Fix for consistent timestamps in a MERGE command.
- Fix for incorrect operation metrics for DataFrame writes with a replaceWhere option.
- Change in log4j properties file format. Apache Spark upgraded the log4j version from 1.x to 2.x which has a different format for the log4j file. Refer to the Spark upgrade notes.
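As a rough illustration of the SQL and API features listed above, the Python sketch below builds the corresponding statements and runs them through an existing SparkSession. The table name, version, timestamp, and delete predicate are hypothetical placeholders, and the helper function names are ours, not part of Delta Lake. It assumes a session that already has Delta Lake 2.1.0 configured.

```python
def preview_sql_statements(table, version, timestamp, delete_predicate):
    """Build the SQL statements for the features named in the list above.

    All arguments are illustrative placeholders supplied by the caller.
    """
    return {
        # Time travel in SQL, by version and by timestamp.
        "version_as_of": f"SELECT * FROM {table} VERSION AS OF {version}",
        "timestamp_as_of": f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'",
        # List the columns of a Delta table.
        "show_columns": f"SHOW COLUMNS IN {table}",
        # DELETE is expected to return a DataFrame with num_affected_rows.
        "delete": f"DELETE FROM {table} WHERE {delete_predicate}",
    }


def run_preview_features(spark, table, version, timestamp, delete_predicate):
    """Run each statement on `spark` and return the resulting DataFrames by name."""
    statements = preview_sql_statements(table, version, timestamp, delete_predicate)
    results = {name: spark.sql(sql) for name, sql in statements.items()}

    # Describe Detail through the Python DeltaTable API.
    # Imported lazily so the helper above stays usable without delta-spark installed.
    from delta.tables import DeltaTable

    results["detail"] = DeltaTable.forName(spark, table).detail()
    return results
```

For example, `run_preview_features(spark, "events", 0, "2022-08-01 00:00:00", "id < 10")` would return a dict of DataFrames, provided the hypothetical `events` table exists and the version and timestamp fall within its commit history. Note that the delete statement modifies the table, so try it on scratch data only.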
Benchmark framework update:
We have improved the benchmark framework (initial version added in version 1.2.0), including support for benchmarking arbitrary functions and not just SQL queries. We’ve also added Terraform scripts to automatically generate the infrastructure to run benchmarks on AWS and GCP.
How to use the preview release:
For this preview we have published the artifacts to a staging repository. Here’s how you can use them:
- spark-submit: Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1087/ to the command line arguments. For example:
spark-submit --packages io.delta:delta-core_2.12:2.1.0rc1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1087/ examples/examples.py
- Maven project:
<repositories>
<repository>
<id>staging-repo</id>
<url>https://oss.sonatype.org/content/repositories/iodelta-1087/</url>
</repository>
</repositories>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.12</artifactId>
<version>2.1.0rc1</version>
</dependency>
- SBT project:
libraryDependencies += "io.delta" %% "delta-core" % "2.1.0rc1"
resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1087/"
- Delta-spark:
pip install -i https://test.pypi.org/simple/ delta-spark==2.1.0rc1
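For PySpark users, the same staging repository can also be set on the session itself. The sketch below is one possible way to do this, not an official recipe: it reuses the package coordinates and repository URL from the options above, together with the standard Spark settings `spark.jars.packages` and `spark.jars.repositories` and the usual Delta extension and catalog settings. The function names and the application name are hypothetical.

```python
# Values taken from the instructions above.
STAGING_REPO = "https://oss.sonatype.org/content/repositories/iodelta-1087/"
DELTA_PACKAGE = "io.delta:delta-core_2.12:2.1.0rc1"


def delta_preview_conf():
    """Return Spark settings that pull the preview artifacts from the staging repo."""
    return {
        "spark.jars.packages": DELTA_PACKAGE,
        "spark.jars.repositories": STAGING_REPO,
        # Standard settings that enable Delta Lake SQL support and the Delta catalog.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name="delta-preview"):
    """Create a SparkSession with the preview settings applied.

    pyspark is imported lazily so that delta_preview_conf() can be inspected
    without a Spark installation.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_preview_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

These settings need to be in place before the first session is created in the process, because the packages are resolved when the JVM starts.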
Credits
Adam Binford, Allison Portis, Andreas Chatzistergiou, Andrew Vine, Andy Lam, Chang Yong Lik, Christos Stavrakakis, David Lewis, Denis Krivenko, Denny Lee, EJ Song, Edmondo Porcu, Felipe Pessoto, Fred Liu, Fu Chen, Grzegorz Kołakowski, Hedi Bejaoui, Hussein Nagree, Ionut Boicu, Ivan Sadikov, Jackie Zhang, Jiawei Bao, Jintao Shen, Jintian Liang, Jonas Irgens Kylling, Juliusz Sompolski, Junlin Zeng, KaiFei Yi, Kam Cheung Ting, Karen Feng, Koert Kuipers, Lars Kroll, Lin Zhou, Lukas Rupprecht, Max Gekk, Min Yang, Ming DAI, Nick, Ole Sasse, Prakhar Jain, Rahul Shivu Mahadev, Rajesh Parangi, Rui Wang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Tathagata Das, Terry Kim, Thomas Newton, Tom van Bussel, Tyson Condie, Venki Korukanti, Vini Jaiswal, Will Jones, Xi Liang, Yijia Cui, Yousry Mohamed, Zach Schuermann, sherlockbeard, yikf