github delta-io/delta v3.0.0rc1
Delta Lake 3.0.0 Preview


We are excited to announce the preview release of Delta Lake 3.0.0. This release includes several exciting new features and artifacts.

Highlights

Here are the most important aspects of 3.0.0.

Delta Universal Format (UniForm)

Delta Universal Format (UniForm) allows you to read Delta tables with Hudi and Iceberg clients. Iceberg support is available with this preview, and Hudi support is coming soon. UniForm takes advantage of the fact that all three table storage formats (Delta, Iceberg, and Hudi) consist of Parquet data files plus a metadata layer. In this release, UniForm automatically generates Iceberg metadata, allowing Iceberg clients to read Delta tables as if they were Iceberg tables. Create a UniForm-enabled table using the following command:

CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES (
  'delta.universalFormat.enabledFormats' = 'iceberg');

Every write to this table will automatically keep Iceberg metadata updated. See the documentation here for more details.
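UniForm can also be enabled on an existing Delta table by setting the same table property. A hedged sketch (UniForm may additionally require column mapping to be enabled, so that property is set here as an illustrative assumption; check the UniForm documentation for the exact requirements in your version):

```sql
-- Enable UniForm (Iceberg) on an existing Delta table.
-- Column mapping is assumed to be a prerequisite here.
ALTER TABLE T SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.universalFormat.enabledFormats' = 'iceberg');
```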

Delta Kernel

The Delta Kernel project is a set of Java libraries (with Rust coming soon) for building Delta connectors that can read (and, soon, write to) Delta tables without needing to understand the details of the Delta protocol.

You can use this library to do the following:

  • Read data from small Delta tables in a single thread in a single process.
  • Read data from large Delta tables using multiple threads in a single process.
  • Build a complex connector for a distributed processing engine and read very large Delta tables.
  • [soon!] Write to Delta tables from multiple threads / processes / distributed engines.

Here is an example of a simple table scan with a filter:

TableClient myTableClient = DefaultTableClient.create();         // define a client (more details below)
Table myTable = Table.forPath("/delta/table/path");              // define what table to scan
Snapshot mySnapshot = myTable.getLatestSnapshot(myTableClient);  // define which version of the table to scan
Scan myScan = mySnapshot.getScanBuilder(myTableClient)           // specify the scan details
        .withFilters(scanFilter)
        .build();
Scan.readData(...)                                               // returns the table data

For more information, refer to the Delta Kernel GitHub docs.

Delta Connectors: welcome to the Delta repository!

All previous connectors from https://github.com/delta-io/connectors have been moved to this repository (https://github.com/delta-io/delta) as we aim to unify our Delta connector ecosystem structure. This includes Delta-Standalone, Delta-Flink, Delta-Hive, PowerBI, and SQL-Delta-Import. The repository https://github.com/delta-io/connectors is now deprecated.

Delta Spark

Delta Spark 3.0.0 is built on top of Apache Spark™ 3.4. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13. Note that the Delta Spark Maven artifact has been renamed from delta-core to delta-spark.

The key features of this release are:

  • Delta Universal Format. Write as Delta, read as Iceberg! See the highlighted section above.
  • Up to 2x faster MERGE operation. MERGE now better leverages data skipping, uses the insert-only code path in more cases, and has an overall improved execution, achieving up to 2x better performance in various scenarios.
  • Performance of DELETE using Deletion Vectors improved by more than 2x. This fix improves the file-path canonicalization logic by avoiding an expensive Path.toUri.toString call for each row in the table, resulting in a several-hundred-percent speed boost on DELETE operations (only when Deletion Vectors are enabled on the table).
  • Support streaming reads from column mapping enabled tables when DROP COLUMN and RENAME COLUMN have been used. This includes streaming support for Change Data Feed. See the documentation here for more details.
  • Support specifying the columns for which Delta will collect file-skipping statistics via the table property delta.dataSkippingStatsColumns. Previously, Delta would only collect file-skipping statistics for the first N columns in the table schema (defaulting to 32). Now, users can easily customize this.
  • Support zero-copy convert to Delta from Iceberg tables on Apache Spark 3.4 using CONVERT TO DELTA. This feature was excluded from the Delta Lake 2.4 release since Iceberg did not yet support Apache Spark 3.4. This command generates a Delta table in the same location and does not rewrite any Parquet files.
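As a sketch, the last two features above can be exercised with Spark SQL along these lines (table, column, and path names are illustrative):

```sql
-- Collect file-skipping statistics only for the specified columns.
ALTER TABLE events SET TBLPROPERTIES (
  'delta.dataSkippingStatsColumns' = 'event_time,user_id');

-- Zero-copy, in-place conversion of an Iceberg table to Delta.
CONVERT TO DELTA iceberg.`/path/to/iceberg_table`;
```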

Other notable changes include

  • Minor fix to Delta table path URI concatenation
  • Support writing parquet data files to the data subdirectory via the SQL configuration spark.databricks.delta.write.dataFilesToSubdir. This is used to add UniForm support on BigQuery.
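A hedged sketch of enabling the subdirectory behavior at the session level (assumes a Spark SQL session; the flag name is taken from the item above):

```sql
-- Write new Parquet data files under the data subdirectory of the table.
SET spark.databricks.delta.write.dataFilesToSubdir = true;
```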

Delta Flink

Delta-Flink 3.0.0 is built on top of Apache Flink™ 1.16.1.

The key features of this release are:

  • Support for Flink SQL and Catalog. You can now use the Flink/Delta connector for Flink SQL jobs. You can CREATE Delta tables, SELECT data from them (uses the Delta Source), and INSERT new data into them (uses the Delta Sink). Note: for correct operation on Delta tables, you must first configure the Delta Catalog using CREATE CATALOG before running any SQL command on Delta tables. For more information, please see the documentation here.
  • Significant performance improvement to Global Committer initialization. The last delta version successfully committed by a given Flink application is now loaded lazily, significantly reducing CPU utilization in the most common scenarios.
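The SQL path described above can be sketched as follows (the catalog name, table schema, and option values are illustrative assumptions; see the connector documentation for the supported options):

```sql
-- The Delta Catalog must be configured before any SQL on Delta tables.
CREATE CATALOG delta_catalog WITH ('type' = 'delta-catalog');
USE CATALOG delta_catalog;

-- Create a Delta table, write to it (Delta Sink), and read it back (Delta Source).
CREATE TABLE events (event_id BIGINT, payload STRING)
  WITH ('connector' = 'delta', 'table-path' = '/tmp/delta/events');

INSERT INTO events VALUES (1, 'hello');
SELECT * FROM events;
```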

Delta Standalone

The key features in this release are:

  • Support for disabling Delta checkpointing during commits. For very large tables with millions of files, performing Delta checkpoints can become an expensive overhead during writes. Users can now disable this checkpointing by setting the Hadoop configuration property io.delta.standalone.checkpointing.enabled to false. Doing so is safe, and suggested, only if another job will periodically perform the checkpointing.
  • Performance improvement to snapshot initialization. When a Delta table is loaded at a particular version, the snapshot must contain, at a minimum, the latest protocol and metadata. This change improves the snapshot load performance for repeated table changes.
  • Support adding absolute paths to the Delta log. This now enables users to manually perform SHALLOW CLONEs and create Delta tables with external files.
  • Fix in schema evolution to prevent adding non-nullable columns to existing Delta tables
  • Dropped support for Scala 2.11. Due to lack of community demand and a very low number of downloads, we have dropped Scala 2.11 support.

Liquid Partitioning

Liquid Clustering is a new effort to revamp how clustering works in Delta, addressing the shortcomings of Hive-style partitioning and the current ZORDER clustering. This feature will be available to preview soon; in the meantime, please refer to Liquid Clustering #1874 for more information.

Credits

Ahir Reddy, Ala Luszczak, Alex, Allen Reese, Allison Portis, Antoine Amend, Bart Samwel, Boyang Jerry Peng, CabbageCollector, Carmen Kwan, Christos Stavrakakis, Denny Lee, Desmond Cheong, Eric Ogren, Felipe Pessoto, Fred Liu, Fredrik Klauss, Gerhard Brueckl, Gopi Krishna Madabhushi, Grzegorz Kołakowski, Herivelton Andreassa, Jackie Zhang, Jiaheng Tang, Johan Lasperas, Junyong Lee, K.I. (Dennis) Jung, Kam Cheung Ting, Krzysztof Chmielewski, Lars Kroll, Lin Ma, Luca Menichetti, Lukas Rupprecht, Ming DAI, Mohamed Zait, Ole Sasse, Olivier Nouguier, Pablo Flores, Paddy Xu, Patrick Pichler, Paweł Kubit, Prakhar Jain, Ryan Johnson, Sabir Akhadov, Satya Valluri, Scott Sandre, Shixiong Zhu, Siying Dong, Son, Tathagata Das, Terry Kim, Tom van Bussel, Venki Korukanti, Wenchen Fan, Yann Byron, Yaohua Zhao, Yuhong Chen, Yuming Wang, Yuya Ebihara, aokolnychyi, gurunath, jintao shen, maryannxue, noelo, panbingkun, windpiger, wwang-talend
