We are excited to announce the preview release of Delta Lake 2.2.0 on Apache Spark 3.3. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/latest/index.html
- Maven artifacts: https://oss.sonatype.org/content/repositories/iodelta-1102
- Python artifacts: https://test.pypi.org/project/delta-spark/2.2.0rc1/
The key features in this release are as follows:

- `LIMIT` pushdown into Delta scan. Improves the performance of queries containing `LIMIT` clauses by pushing the `LIMIT` down into the Delta scan during query planning. The Delta scan uses the `LIMIT` and the file-level row counts to reduce the number of files scanned, which helps queries read far fewer files and can make `LIMIT` queries 10-100x faster depending upon the table size. (See the first sketch after this list.)
- Aggregate pushdown into Delta scan for `SELECT COUNT(*)`. Aggregation queries such as `SELECT COUNT(*)` on Delta tables are satisfied using file-level row counts in the Delta table metadata rather than by counting rows in the underlying data files. This significantly reduces the query time, as the query only needs to read the table metadata, and can make full-table count queries 10-100x faster. (See the first sketch after this list.)
- Support for collecting file-level statistics as part of the `CONVERT TO DELTA` command. These statistics potentially help speed up queries on the Delta table. By default, the statistics are now collected as part of the `CONVERT TO DELTA` command. To disable statistics collection, specify the `NO STATISTICS` clause in the command. Example: `CONVERT TO DELTA table_name NO STATISTICS`.
- Improved performance of the `DELETE` command by pruning the columns to read when searching for files to rewrite.
- Fix for a bug in the DynamoDB-based S3 multi-cluster mode configuration. The previous version wrote an incorrect timestamp, which is used by DynamoDB's TTL feature to clean up expired items. The timestamp value has been fixed, and the table attribute has been renamed from `commitTime` to `expireTime`. If you already have TTL enabled, please follow the migration steps here.
- Fix for non-deterministic behavior during `MERGE` when working with sources that are non-deterministic.
- Removed the restrictions on using Delta tables with column mapping in certain Streaming + CDF cases. Previously, we blocked Streaming + CDF if the Delta table had column mapping enabled, even when it did not contain any RENAME or DROP column operations. (See the second sketch after this list.)
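The two pushdown features can be seen end to end with a small PySpark sketch. This is a minimal sketch, assuming Delta `2.2.0rc1` is on the classpath and a Delta table already exists at the hypothetical path `/tmp/delta/events`:

```python
from pyspark.sql import SparkSession

# Minimal sketch: the session config below is the standard way to enable
# Delta Lake on Spark; the table path /tmp/delta/events is hypothetical.
spark = (
    SparkSession.builder.appName("delta-2.2-pushdown-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# LIMIT pushdown: the planner passes the LIMIT to the Delta scan, which
# uses file-level row counts to scan only enough files to cover 10 rows.
spark.sql("SELECT * FROM delta.`/tmp/delta/events` LIMIT 10").show()

# Aggregate pushdown: a plain COUNT(*) is answered from file-level row
# counts in the table metadata, without reading the underlying data files.
spark.sql("SELECT COUNT(*) FROM delta.`/tmp/delta/events`").show()
```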
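Similarly, streaming the Change Data Feed from a column-mapped table is no longer blocked in the cases described above. A minimal sketch, continuing from the session above and assuming a hypothetical table `events` with `delta.enableChangeDataFeed` set to `true`, column mapping enabled, and no renamed or dropped columns (the checkpoint path is also a placeholder):

```python
# Stream the Change Data Feed of a column-mapped Delta table.
changes = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("events")  # hypothetical table name
)

# Write the change rows to the console for demonstration purposes.
(changes.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/events_cdf")  # placeholder
    .start())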
Other notable changes
- Improved the monitoring of Delta state construction queries (additional queries run as part of planning) by making them visible in the Spark UI.
- Support for multiple `where()` calls in the Optimize Scala/Python API. (See the sketch after this list.)
- Support for passing Hadoop configurations via the `DeltaTable` API. (See the sketch after this list.)
- Support for partition column names starting with `.` or `_` in the `CONVERT TO DELTA` command.
- Improvements to metrics in table history.
- Fix for a metric in the `MERGE` command.
- Source type metric for `CONVERT TO DELTA`.
- Metrics for `DELETE` on partitions.
- Additional vacuum stats.
- Fix for accidental protocol downgrades with the `RESTORE` command. Until now, `RESTORE TABLE` could downgrade the protocol version of the table, which could result in inconsistent reads with time travel. With this fix, the protocol version is never downgraded from the current one.
- Fix for a bug in `MERGE INTO` when there are multiple `UPDATE` clauses and one of the `UPDATE`s involves schema evolution.
- Fix for a bug where the active `SparkSession` object is sometimes not found when using Delta APIs.
- Fix for an issue where the partition schema could not be set during the initial commit.
- Catch exceptions when writing the `last_checkpoint` file fails.
- Fix for an issue when restarting a streaming query with the `AvailableNow` trigger on a Delta table.
- Fix for an issue with CDF and Streaming where the offset is not correctly updated when there are no data changes.
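The two Optimize/`DeltaTable` items above can be sketched in Python. This is a minimal sketch under stated assumptions: the table path, partition columns (`date`, `country`), bucket, and credential values are all placeholders, and the `hadoopConf` keyword follows the Python API added in this release:

```python
from delta.tables import DeltaTable

# Multiple where() calls on the optimize builder (new in 2.2): the chained
# partition predicates narrow which files are compacted.
dt = DeltaTable.forPath(spark, "/tmp/delta/events")  # hypothetical path
(dt.optimize()
   .where("date = '2022-11-01'")
   .where("country = 'US'")
   .executeCompaction())

# Passing Hadoop configurations via the DeltaTable API (new in 2.2), e.g.
# per-table credentials; keys and values here are placeholders only.
dt2 = DeltaTable.forPath(
    spark,
    "s3a://my-bucket/delta/events",
    hadoopConf={"fs.s3a.access.key": "<access-key>",
                "fs.s3a.secret.key": "<secret-key>"},
)
```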
How to use the preview release
For this preview we have published the artifacts to a staging repository. Here’s how you can use them:
- spark-submit: Add `--repositories https://oss.sonatype.org/content/repositories/iodelta-1102/` to the command line arguments. For example:

```
spark-submit --packages io.delta:delta-core_2.12:2.2.0rc1 --repositories https://oss.sonatype.org/content/repositories/iodelta-1102/ examples/examples.py
```
- Spark shells (PySpark and Scala): Currently the shells do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta `2.2.0rc1` by providing just the `--packages io.delta:delta-core_2.12:2.2.0rc1` argument.
- Maven project:
```xml
<repositories>
  <repository>
    <id>staging-repo</id>
    <url>https://oss.sonatype.org/content/repositories/iodelta-1102/</url>
  </repository>
</repositories>

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.12</artifactId>
  <version>2.2.0rc1</version>
</dependency>
```
- SBT project:

```scala
libraryDependencies += "io.delta" %% "delta-core" % "2.2.0rc1"
resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1102/"
```
- Delta-spark:

```
pip install -i https://test.pypi.org/simple/ delta-spark==2.2.0rc1
```
Credits
Abhishek Somani, Adam Binford, Allison Portis, Amir Mor, Andreas Chatzistergiou, Anish Shrigondekar, Carl Fu, Carlos Peña, Chen Shuai, Christos Stavrakakis, Eric Maynard, Fabian Paul, Felipe Pessoto, Fredrik Klauss, Ganesh Chand, Hedi Bejaoui, Helge Brügner, Hussein Nagree, Ionut Boicu, Jackie Zhang, Jiaheng Tang, Jintao Shen, Jintian Liang, Joe Harris, Johan Lasperas, Jonas Irgens Kylling, Josh Rosen, Juliusz Sompolski, Jungtaek Lim, Kam Cheung Ting, Karthik Subramanian, Kevin Neville, Lars Kroll, Lin Ma, Linhong Liu, Lukas Rupprecht, Max Gekk, Ming Dai, Mingliang Zhu, Nick Karpov, Ole Sasse, Paddy Xu, Patrick Marx, Prakhar Jain, Pranav, Rajesh Parangi, Ronald Zhang, Ryan Johnson, Sabir Akhadov, Scott Sandre, Serge Rielau, Shixiong Zhu, Supun Nakandala, Thang Long Vu, Tom van Bussel, Tyson Condie, Venki Korukanti, Vitalii Li, Weitao Wen, Wenchen Fan, Xinyi, Yuming Wang, Zach Schuermann, Zainab Lawal, sherlockbeard (github id)