We are excited to announce the release of Delta Lake 1.2.0 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/1.2.0/index.html
- Maven artifacts: delta-core_2.12, delta-core_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/1.2.0/
Key features in this release
- Support for multi-cluster writes to Delta Lake tables stored in S3. Users now have the option of specifying a new and experimental `LogStore` implementation that supports concurrent reads and writes to a single Delta Lake table in S3 from multiple Spark drivers. See the documentation for more details.
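
  A minimal sketch of enabling this, assuming the delta-storage-s3-dynamodb artifact is on the classpath; the DynamoDB configuration keys shown here are illustrative of the documented setup, so check the linked documentation for the exact names:

  ```python
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("multi-cluster-s3-writes")
      # Route s3a:// paths through the experimental DynamoDB-backed LogStore.
      .config("spark.delta.logStore.s3a.impl",
              "io.delta.storage.S3DynamoDBLogStore")
      # DynamoDB table used to coordinate commits across drivers
      # (assumed key names; see the documentation for the exact configuration).
      .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
      .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-west-2")
      .getOrCreate()
  )

  # Drivers on different clusters can now safely append to the same table.
  spark.range(100).write.format("delta").mode("append").save("s3a://my-bucket/delta/events")
  ```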
- Support for compacting small files (optimize) into larger files in a Delta Lake table. A reduced number of data files improves read latency due to reduced metadata size and per-file overheads such as file-open and file-close costs. See the documentation for more details.
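
  A minimal sketch, using an illustrative table path and partition predicate:

  ```python
  # Compact the table's small files into larger ones.
  spark.sql("OPTIMIZE delta.`/tmp/delta/events`")

  # Compaction can also be scoped to a subset of partitions.
  spark.sql("OPTIMIZE delta.`/tmp/delta/events` WHERE date >= '2022-03-01'")
  ```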
- Support for data skipping using column statistics. Column statistics are collected for each file as part of Delta Lake table writes. When reading a Delta Lake table, these statistics can be used to skip files that cannot match the filters in the query. See the documentation for more details.
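
  A minimal sketch (path and filter are illustrative); statistics are collected automatically on write, so no extra setup is needed:

  ```python
  # Write a table; per-file column statistics are recorded in the Delta log.
  spark.range(0, 1000000).toDF("event_id") \
      .write.format("delta").save("/tmp/delta/skipping_demo")

  # Per-file min/max statistics let this read skip files whose event_id
  # range cannot possibly match the filter.
  spark.read.format("delta").load("/tmp/delta/skipping_demo") \
      .where("event_id = 42").show()
  ```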
- Support for restoring a Delta table to an earlier version. Restoring to an earlier version number or to the version at a specific timestamp is supported using the SQL command, Scala APIs, or Python APIs. See the documentation for more details.
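
  A minimal sketch of both flavors (path, version, and timestamp are illustrative):

  ```python
  from delta.tables import DeltaTable

  dt = DeltaTable.forPath(spark, "/tmp/delta/events")
  dt.restoreToVersion(1)                # restore to an earlier version number
  dt.restoreToTimestamp("2022-03-01")   # or to the state at a timestamp

  # The equivalent SQL command:
  spark.sql("RESTORE TABLE delta.`/tmp/delta/events` TO VERSION AS OF 1")
  ```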
- Support for column renaming in a Delta Lake table without the need to rewrite the underlying Parquet data files. See the documentation for more details.
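
  A minimal sketch: renaming requires the table to use column mapping in `name` mode, which also upgrades the table protocol (table path and column names are illustrative):

  ```python
  spark.sql("""
      ALTER TABLE delta.`/tmp/delta/events` SET TBLPROPERTIES (
          'delta.columnMapping.mode' = 'name',
          'delta.minReaderVersion' = '2',
          'delta.minWriterVersion' = '5')
  """)

  # The rename is metadata-only; no Parquet files are rewritten.
  spark.sql("ALTER TABLE delta.`/tmp/delta/events` RENAME COLUMN eventId TO event_id")
  ```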
- Support for arbitrary characters in column names in Delta tables. Previously, the supported characters were limited by the restrictions of the Parquet data format. Column names containing special characters such as spaces, tabs, `,`, `{`, `(`, etc. are now supported. See the documentation for more details.
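
  A minimal sketch, again relying on column mapping (path and column names are illustrative):

  ```python
  spark.sql("""
      CREATE TABLE delta.`/tmp/delta/chars_demo` (
          `event id` BIGINT,
          `price ($)` DOUBLE
      ) USING DELTA
      TBLPROPERTIES (
          'delta.columnMapping.mode' = 'name',
          'delta.minReaderVersion' = '2',
          'delta.minWriterVersion' = '5')
  """)
  ```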
- Support for automatic data skipping using generated columns. For any partition column that is a generated column, partition filters will be automatically generated from any data filters on its generating column(s), when possible.
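
  A minimal sketch (path and schema are illustrative): `date` is a generated partition column, so the filter on `ts` below is translated into a partition filter on `date`:

  ```python
  from delta.tables import DeltaTable

  DeltaTable.create(spark) \
      .location("/tmp/delta/gen_demo") \
      .addColumn("ts", "TIMESTAMP") \
      .addColumn("date", "DATE", generatedAlwaysAs="CAST(ts AS DATE)") \
      .partitionedBy("date") \
      .execute()

  # Only partitions whose date can contain matching timestamps are scanned.
  spark.read.format("delta").load("/tmp/delta/gen_demo") \
      .where("ts >= '2022-03-01 00:00:00'").show()
  ```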
- Support for Google Cloud Storage is now generally available. See the documentation on how to read and write Delta Lake tables in Google Cloud Storage.
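
  A minimal sketch, assuming the GCS connector and the delta-storage jar are on the classpath (bucket name is illustrative):

  ```python
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder.appName("delta-on-gcs")
      # Route gs:// paths through the GCS LogStore implementation.
      .config("spark.delta.logStore.gs.impl", "io.delta.storage.GCSLogStore")
      .getOrCreate()
  )

  spark.range(10).write.format("delta").save("gs://my-bucket/delta/events")
  ```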
Other notable changes
- Create a new module `delta-storage`. This extracts the `LogStore` interface and implementations into a separate module, which is published as its own jar. This enables new implementations of `LogStore` without depending on the complete Delta jars. See the migration guide here for more details.
- Improve the error messages and exceptions to be better organized and queryable.
- Support for the `gettimestamp` expression in generated columns.
- Snapshot/Checkpoint management improvements:
  - Make loading snapshots resilient to corrupt checkpoints in Delta. When reading a checkpoint fails, we try to search for an alternative checkpoint and use it to construct the snapshot.
  - Fix snapshot writing so that a write does not fail when a checkpoint fails due to non-fatal errors.
  - Optimization to reduce the number of `list` calls to storage.
- Improved output metrics for the DELETE command.
- Improved output metrics for the UPDATE command.
- Optimize the merge operation in Delta tables with a large number of columns.
- Fix a `NullPointerException` when trying to reference a `DeltaLog` created with a `SparkContext` that has stopped.
- Fix an issue in handling null partition column values in the change data capture feature.
- Fix an issue when adding a new column to a Delta table whose preceding column is of type `Array`.
- Fix an issue where the file list iterator was not closed when reading large log files in the Delta streaming source.
- Throw proper exceptions when searching for a Delta table in the catalog.
- Fix a schema evolution issue when the column type is an array of structs.
- Better handling of `FileNotFoundException` when reading Delta log files, to distinguish between corrupt log files and no files found.
Benchmark Framework
Independent of this release, we have also built a framework for writing large-scale performance benchmarks on Delta tables using a real cluster. Currently, the framework provides a TPC-DS-inspired benchmark to measure ingestion time (e.g., the time taken to create the TPC-DS tables) and query times, but we encourage the community to contribute more benchmarks that measure the performance of different real-world workloads on Delta tables.
Credits
Adam Binford, Alex Liu, Allison Portis, Anton Okolnychyi, Bart Samwel, Carmen Kwan, Chang Yong Lik, Christian Williams, Christos Stavrakakis, David Lewis, Denny Lee, Fabio Badalì, Fred Liu, Gengliang Wang, Hoang Pham, Hussein Nagree, Hyukjin Kwon, Jackie Zhang, Jan Paw, John ODwyer, Junlin Zeng, Junyong Lee, Kam Cheung Ting, Kapil Sreedharan, Lars Kroll, Liwen Sun, Maksym Dovhal, Mariusz Krynski, Meng Tong, Peng Zhong, Prakhar Jain, Pranav, Ryan Johnson, Sabir Akhadov, Scott Sandre, Shixiong Zhu, Sri Tikkireddy, Tathagata Das, Tyson Condie, Vegard Stikbakke, Venkata Sai Akhil Gudesa, Venki Korukanti, Vini Jaiswal, Wenchen Fan, Will Jones, Xinyi Yu, Yann Byron, Yaohua Zhao, Yijia Cui