apache/iceberg apache-iceberg-1.4.0 on GitHub

API
- Implement bound expression sanitization (#8149)
- Remove overflow checks in DefaultCounter causing performance issues (#8297)
- Support incremental scanning with branch (#5984)
- Add a validation API to DeleteFiles which validates files exist (#8525)
Core
- Use V2 format by default in new tables (#8381)
- Use zstd compression for Parquet by default in new tables (#8593)
- Add strict metadata cleanup mode and enable it by default (#8397) (#8599)
- Avoid generating huge manifests during commits (#6335)
- Add a writer for unordered position deletes (#7692)
- Optimize DeleteFileIndex (#8157)
- Optimize lookup in DeleteFileIndex without useful bounds (#8278)
- Optimize split offsets handling (#8336)
- Optimize computing user-facing state in data tasks (#8346)
- Don't persist useless file and position bounds for deletes (#8360)
- Don't persist counts for paths and positions in position delete files (#8590)
- Support setting system-level properties via environment variables (#5659)
- Add JSON parser for ContentFile and FileScanTask (#6934)
- Add REST spec and request for commits to multiple tables (#7741)
- Add REST API for committing changes against multiple tables (#7569)
- Default to exponential retry strategy in REST client (#8366)
- Support registering tables with REST session catalog (#6512)
- Add last updated timestamp and snapshot ID to partitions metadata table (#7581)
- Add total data size to partitions metadata table (#7920)
- Extend ResolvingFileIO to support bulk operations (#7976)
- Key metadata in Avro format (#6450)
- Add AES GCM encryption stream (#3231)
- Fix a connection leak in streaming delete filters (#8132)
- Fix lazy snapshot loading history (#8470)
- Fix unicode handling in HTTPClient (#8046)
- Fix paths for unpartitioned specs in writers (#7685)
- Fix OOM caused by Avro decoder caching (#7791)
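The two default changes above (V2 format and zstd for Parquet) apply to newly created tables. A table that must keep the earlier behaviour can set the properties explicitly at creation time. The following is a minimal sketch, assuming the earlier defaults were format version 1 and gzip; the helper `create_table_ddl` and the table name `db.events` are hypothetical, and only the property keys `format-version` and `write.parquet.compression-codec` come from Iceberg's table configuration.

```python
# Sketch: build a Spark SQL CREATE TABLE statement that pins table properties
# explicitly instead of relying on the new 1.4.0 defaults.
# The helper and table name are hypothetical; the values below are the
# assumed pre-1.4.0 defaults.
LEGACY_DEFAULTS = {
    "format-version": "1",
    "write.parquet.compression-codec": "gzip",
}


def create_table_ddl(table, columns, props):
    """Return a CREATE TABLE statement with the given TBLPROPERTIES."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    tblprops = ", ".join(f"'{key}'='{value}'" for key, value in props.items())
    return f"CREATE TABLE {table} ({cols}) USING iceberg TBLPROPERTIES ({tblprops})"


ddl = create_table_ddl(
    "db.events",
    [("id", "bigint"), ("ts", "timestamp")],
    LEGACY_DEFAULTS,
)
print(ddl)
```

The resulting string would be passed to a Spark session configured with an Iceberg catalog, for example via `spark.sql(ddl)`.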
Spark
- Added support for Spark 3.5
  - Code for DELETE, UPDATE, and MERGE commands has moved to Spark, and all related extensions have been dropped from Iceberg.
  - Support for WHEN NOT MATCHED BY SOURCE clause in MERGE.
  - Column pruning in merge-on-read operations.
  - Ability to request a larger advisory partition size for the final write, producing well-sized output files without reducing job parallelism.
- Dropped support for Spark 3.1
- Deprecated support for Spark 3.2
- Support vectorized reads for merge-on-read operations in Spark 3.4 and 3.5 (#8466)
- Increase default advisory partition size for writes in Spark 3.5 (#8660)
- Support distributed planning in Spark 3.4 and 3.5 (#8123)
- Support pushing down system functions by V2 filters in Spark 3.4 and 3.5 (#7886)
- Support fanout position delta writers in Spark 3.4 and 3.5 (#7703)
- Use fanout writers for unsorted tables by default in Spark 3.5 (#8621)
- Support multiple shuffle partitions per file in compaction in Spark 3.4 and 3.5 (#7897)
- Output net changes across snapshots for carryover rows in CDC (#7326)
- Display read metrics on Spark SQL UI (#7447) (#8445)
- Adjust split size to benefit from cluster parallelism in Spark 3.4 and 3.5 (#7714)
- Add fast_forward procedure (#8081)
- Support filters when rewriting position deletes (#7582)
- Support setting current snapshot with ref (#8163)
- Make backup table name configurable during migration (#8227)
- Add write and SQL options to override compression config (#8313)
- Correct partition transform functions to match the spec (#8192)
- Enable extra commit properties with metadata delete (#7649)
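As an illustration of the new `fast_forward` procedure listed above, the sketch below builds the corresponding Spark SQL `CALL` statement. It assumes the procedure takes `table`, `branch`, and `to` arguments under the catalog's `system` namespace; the helper `fast_forward_sql`, the catalog name `my_catalog`, and the branch names are hypothetical.

```python
# Sketch: compose a CALL statement for the fast_forward procedure.
# Assumes named arguments table, branch, and to; all names passed in
# below are hypothetical placeholders.
def fast_forward_sql(catalog, table, branch, to):
    """Return a CALL statement that fast-forwards `branch` to the head of `to`."""
    return (
        f"CALL {catalog}.system.fast_forward("
        f"table => '{table}', branch => '{branch}', to => '{to}')"
    )


stmt = fast_forward_sql("my_catalog", "db.events", "main", "audit-branch")
print(stmt)
```

With a Spark session that has the Iceberg SQL extensions enabled, the statement would be run as `spark.sql(stmt)`.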
Flink
- Support ordering splits by file sequence number (#7661)
- Fix serialization in TableSink with anonymous object (#7866)
- Switch to FileScanTaskParser for JSON serialization of IcebergSourceSplit (#7978)
- Custom partitioner for bucket partitions (#7161)
- Implement data statistics coordinator to aggregate data statistics from operator subtasks (#7360)
- Support alter table column (#7628)
Parquet
- Add encryption config to read and write builders (#2639)
- Skip writing bloom filters for deletes (#7617)
- Cache codecs by name and level (#8182)
- Fix decimal data reading from ParquetAvroValueReaders (#8246)
- Handle filters with transforms by assuming data must be scanned (#8243)
ORC
- Handle filters with transforms by assuming the filter matches (#8244)
Vendor Integrations
- GCP: Fix single byte read in GCSInputStream (#8071)
- GCP: Add properties for OAuth2 and update library (#8073)
- GCP: Add prefix and bulk operations to GCSFileIO (#8168)
- GCP: Add bundle jar for GCP-related dependencies (#8231)
- GCP: Add range reads to GCSInputStream (#8301)
- AWS: Add bundle jar for AWS-related dependencies (#8261)
- AWS: Support configuring the storage class for S3FileIO (#8154)
- AWS: Add FileIO tracker/closer to Glue catalog (#8315)
- AWS: Update S3 signer spec to allow an optional string body in S3SignRequest (#8361)
- Azure: Add FileIO that supports ADLSv2 storage (#8303)
- Azure: Make ADLSFileIO implement DelegateFileIO (#8563)
- Nessie: Provide better commit message on table registration (#8385)
Dependencies
- Bump Nessie to 0.71.0
- Bump ORC to 1.9.1
- Bump Arrow to 12.0.1
- Bump AWS Java SDK to 2.20.131
