We are excited to announce the release of Delta Lake 1.1.0 on Apache Spark 3.2. Similar to Apache Spark™, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. The key features in this release are as follows.
- Performance improvements in MERGE operation - On partitioned tables, MERGE operations will automatically repartition the output data before writing to files. This ensures better out-of-the-box performance for both the MERGE operation and subsequent read operations.
- Support for passing Hadoop configurations via DataFrameReader/Writer options - You can now set Hadoop FileSystem configurations (e.g., access credentials) via DataFrameReader/Writer options. Previously, the only way to pass such configurations was to set them in the Spark session configuration, which applied the same value to all reads and writes. Now you can set them to different values for each read and write. See the documentation for more details.
- Support for arbitrary expressions in `replaceWhere` DataFrameWriter option - Instead of expressions only on partition columns, you can now use arbitrary expressions in the `replaceWhere` DataFrameWriter option. That is, you can replace arbitrary data in a table directly with DataFrame writes. See the documentation for more details.
- Improvements to nested field resolution and schema evolution in MERGE operation on array of structs - When applying the MERGE operation on a target table having a column typed as an array of nested structs, the nested columns between the source and target data are now resolved by name and not by position in the struct. This makes structs in arrays behave consistently with structs outside arrays. When automatic schema evolution is enabled for MERGE, nested columns in structs in arrays will follow the same evolution rules (e.g., a column is added if no column by the same name exists in the table) as columns in structs outside arrays. See the documentation for more details.
- Support for Generated Columns in MERGE operation - You can now apply MERGE operations on tables that have Generated Columns.
- Fix for rare data corruption issue on GCS - Experimental GCS support released in Delta Lake 1.0 has a rare bug that can lead to Delta tables being unreadable due to partially written transaction log files. This issue has now been fixed (1, 2).
- Fix for the incorrect return object in Python `DeltaTable.convertToDelta()` - This existing API now returns the correct Python object of type `delta.tables.DeltaTable` instead of an incorrectly typed, and therefore unusable, object.
- Python type annotations - We have added Python type annotations, which improve auto-completion performance in editors that support type hints. Optionally, you can enable static checking through mypy or built-in tools (for example, PyCharm tools).
- Other notable changes
  - Removed support for reading tables with certain special characters in partition column names. See migration guide for details.
  - Support for “delta.`path`” in `DeltaTable.forName()` for consistency with other APIs.
  - Improvements to DeltaTableBuilder API introduced in Delta 1.0.0
    - Fix for bug that prevented passing of multiple partition columns in Python `DeltaTableBuilder.partitionBy`.
    - Throw error when column data type is not specified.
  - Improved support for MERGE/UPDATE/DELETE on temp views.
  - Support for setting `userMetadata` in the commit information when creating or replacing tables.
  - Fix for an incorrect analysis exception in MERGE with multiple INSERT and UPDATE clauses and automatic schema evolution enabled.
  - Fix for incorrect handling of special characters (e.g. spaces) in paths by MERGE/UPDATE/DELETE operations.
  - Fix for Vacuum parallel mode being affected by Adaptive Query Execution, which is enabled by default in Apache Spark 3.2.
  - Fix for earliest valid time travel version.
  - Fix for Hadoop configurations not being used to write checkpoints.
  - Multiple fixes (1, 2, 3) to Delta Constraints.
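
The per-operation Hadoop configurations and the arbitrary `replaceWhere` expressions described above can be sketched in PySpark roughly as follows. This is a minimal sketch, not code from the release: the helper names (`storage_options`, `read_with_options`, `range_predicate`, `replace_matching_rows`) are hypothetical, the Azure storage account key is just one example of an `fs.`-prefixed Hadoop option, and the column name and date range in the usage note are made up. The functions accept an existing `SparkSession` or DataFrame, so nothing touches Spark until you call them.

```python
from typing import Any, Dict


def storage_options(account: str, access_key: str) -> Dict[str, str]:
    """Build Hadoop FileSystem options for a single read or write.

    Assumption: an Azure Data Lake Storage Gen2 account key serves as the
    example credential; substitute the keys your file system expects.
    """
    return {f"fs.azure.account.key.{account}.dfs.core.windows.net": access_key}


def read_with_options(spark: Any, path: str, options: Dict[str, str]) -> Any:
    """Read a Delta table, passing Hadoop configurations to this read only."""
    return spark.read.format("delta").options(**options).load(path)


def range_predicate(column: str, start: str, end: str) -> str:
    """Build a half-open range predicate; the column need not be a partition column."""
    return f"{column} >= '{start}' AND {column} < '{end}'"


def replace_matching_rows(df: Any, path: str, predicate: str) -> None:
    """Overwrite only the rows of the target table that match the predicate."""
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)
        .save(path)
    )
```

For instance, two calls to `read_with_options` with different `storage_options(...)` dictionaries would read from two storage accounts in the same session, and `replace_matching_rows(df, path, range_predicate("event_time", "2021-01-01", "2021-02-01"))` would replace one month of data, provided every row in `df` satisfies that predicate.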
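
For the MERGE improvements listed above, an upsert with automatic schema evolution might look like the sketch below. It assumes `target` is a `delta.tables.DeltaTable` and `source` is a DataFrame whose columns (including structs nested in arrays) are matched to the target by name. The function name, the aliases `t` and `s`, and the join condition in the usage note are illustrative assumptions. The configuration key is the session setting Delta Lake uses to enable automatic schema evolution.

```python
from typing import Any

# Session configuration that enables automatic schema evolution for MERGE.
AUTO_MERGE_CONF = "spark.databricks.delta.schema.autoMerge.enabled"


def upsert_with_schema_evolution(
    spark: Any, target: Any, source: Any, condition: str
) -> None:
    """Merge `source` into `target`, letting new source columns evolve the schema."""
    spark.conf.set(AUTO_MERGE_CONF, "true")
    (
        target.alias("t")
        .merge(source.alias("s"), condition)
        .whenMatchedUpdateAll()      # update every column of matched rows
        .whenNotMatchedInsertAll()   # insert source rows that have no match
        .execute()
    )
```

A call such as `upsert_with_schema_evolution(spark, DeltaTable.forPath(spark, path), updates_df, "t.id = s.id")` would then apply the upsert. Note that the setting stays enabled for the rest of the session unless you reset it.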
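
The Python `DeltaTable` changes above (the corrected return type of `convertToDelta()` and the path-style identifier accepted by `forName()`) can be illustrated with a short sketch. The helper names and the paths in the usage note are hypothetical, and the functions import `delta.tables` lazily, so the `delta-spark` package is needed only when they are actually called.

```python
from typing import Any


def path_identifier(path: str, fmt: str = "delta") -> str:
    """Format a path-based table identifier such as delta.`/tmp/events`."""
    return f"{fmt}.`{path}`"


def load_by_path_identifier(spark: Any, path: str) -> Any:
    """Look up a Delta table through forName() using a path-style identifier."""
    from delta.tables import DeltaTable  # requires the delta-spark package

    return DeltaTable.forName(spark, path_identifier(path))


def convert_parquet_directory(spark: Any, path: str) -> Any:
    """Convert an unpartitioned Parquet directory in place and return the table."""
    from delta.tables import DeltaTable  # requires the delta-spark package

    # As of this release the return value is a usable delta.tables.DeltaTable.
    return DeltaTable.convertToDelta(spark, path_identifier(path, "parquet"))
```

With these helpers, `convert_parquet_directory(spark, "/tmp/raw").toDF()` would read the freshly converted table, which is the kind of chained call that the incorrectly typed return object previously prevented.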
Credits
Abhishek Somani, Adam Binford, Alex Jing, Alexandre Lopes, Allison Portis, Bogdan Raducanu, Bart Samwel, Burak Yavuz, David Lewis, Eunjin Song, Feng Zhu, Flavio Cruz, Florian Valeye, Fred Liu, Guy Khazma, Jacek Laskowski, Jackie Zhang, Jarred Parrett, JassAbidi, Jose Torres, Junlin Zeng, Junyong Lee, KamCheung Ting, Karen Feng, Lars Kroll, Li Zhang, Linhong Liu, Liwen Sun, Maciej, Max Gekk, Meng Tong, Prakhar Jain, Pranav Anand, Rahul Mahadev, Ryan Johnson, Sabir Akhadov, Scott Sandre, Shixiong Zhu, Shuting Zhang, Tathagata Das, Terry Kim, Tom Lynch, Vijayan Prabhakaran, Vítor Mussa, Wenchen Fan, Yaohua Zhao, Yijia Cui, YuXuan Tay, Yuchen Huo, Yuhong Chen, Yuming Wang, Yuyuan Tang, Zach Schuermann, ericfchang, gurunath