Summary

This release introduces a new features: Segment Merge and Rollup to simplify users day to day operational work. A new metrics plugin is added to support dropwizard. As usual, new functionalities and many UI/ Performance improvements.

The release was cut from the following commit: 13c9ee9 and the following cherry-picks: 668b5e0, ee887b9

Support Segment Merge and Roll-up

LinkedIn operates a large multi-tenant cluster that serves a business metrics dashboard, and noticed that their tables consisted of millions of small segments. This was leading to slow operations in Helix/Zookeeper, long running queries due to having too many tasks to process, as well as using more space because of a lack of compression.

To solve this problem they added the Segment Merge task, which compresses segments based on timestamps and rolls up/aggregates older data. The task can be run on a schedule or triggered manually via the Pinot REST API.

At the moment this feature is only available for offline tables, but will be added for real-time tables in a future release.

Major Changes:

Integrate enhanced SegmentProcessorFramework into MergeRollupTaskExecutor (#7180)
Merge/Rollup task scheduler for offline tables. (#7178)
Fix MergeRollupTask uploading segments not updating their metadata (#7289)
MergeRollupTask integration tests (#7283)
Add mergeRollupTask delay metrics (#7368)
MergeRollupTaskGenerator enhancement: enable parallel buckets scheduling (#7481)
Use maxEndTimeMs for merge/roll-up delay metrics. (#7617)

UI Improvement

This release also sees improvements to Pinot’s query console UI.

Cmd+Enter shortcut to run query in query console (#7359)
Showing tooltip in SQL Editor (#7387)
Make the SQL Editor box expandable (#7381)
Fix tables ordering by number of segments (#7564)

SQL Improvements

There have also been improvements and additions to Pinot’s SQL implementation.

New functions:

IN (#7542)
LASTWITHTIME (#7584)
ID_SET on MV columns (#7355)
Raw results for Percentile TDigest and Est (#7226),
Add timezone as argument in function toDateTime (#7552)

New predicates are supported:

LIKE(#7214)
REGEXP_EXTRACT(#7114)
FILTER(#7566)

Query compatibility improvements:

Infer data type for Literal (#7332)
Support logical identifier in predicate (#7347)
Support JSON queries with top-level array path expression. (#7511)
Support configurable group by trim size to improve results accuracy (#7241)

Performance Improvements

This release contains many performance improvement, you may sense it for you day to day queries. Thanks to all the great contributions listed below:

Reduce the disk usage for segment conversion task (#7193)
Simplify association between Java Class and PinotDataType for faster mapping (#7402)
Avoid creating stateless ParseContextImpl once per jsonpath evaluation, avoid varargs allocation (#7412)
Replace MINUS with STRCMP (#7394)
Bit-sliced range index for int, long, float, double, dictionarized SV columns (#7454)
Use MethodHandle to access vectorized unsigned comparison on JDK9+ (#7487)
Add option to limit thread usage per query (#7492)
Improved range queries (#7513)
Faster bitmap scans (#7530)
Optimize EmptySegmentPruner to skip pruning when there is no empty segments (#7531)
Map bitmaps through a bounded window to avoid excessive disk pressure (#7535)
Allow RLE compression of bitmaps for smaller file sizes (#7582)
Support raw index properties for columns with JSON and RANGE indexes (#7615)
Enhance BloomFilter rule to include IN predicate(#7444) (#7624)
Introduce LZ4_WITH_LENGTH chunk compression type (#7655)
Enhance ColumnValueSegmentPruner and support bloom filter prefetch (#7654)
Apply the optimization on dictIds within the segment to DistinctCountHLL aggregation func (#7630)
During segment pruning, release the bloom filter after each segment is processed (#7668)
Fix JSONPath cache inefficient issue (#7409)
Optimize getUnpaddedString with SWAR padding search (#7708)
Lighter weight LiteralTransformFunction, avoid excessive array fills (#7707)
Inline binary comparison ops to prevent function call overhead (#7709)
Memoize literals in query context in order to deduplicate them (#7720)

Other Notable New Features and Changes

Human Readable Controller Configs (#7173)
Add the support of geoToH3 function (#7182)
Add Apache Pulsar as Pinot Plugin (#7223) (#7247)
Add dropwizard metrics plugin (#7263)
Introduce OR Predicate Execution On Star Tree Index (#7184)
Allow to extract values from array of objects with jsonPathArray (#7208)
Add Realtime table metadata and indexes API. (#7169)
Support array with mixing data types (#7234)
Support force download segment in reload API (#7249)
Show uncompressed znRecord from zk api (#7304)
Add debug endpoint to get minion task status. (#7300)
Validate CSV Header For Configured Delimiter (#7237)
Add auth tokens and user/password support to ingestion job command (#7233)
Add option to store the hash of the upsert primary key (#7246)
Add null support for time column (#7269)
Add mode aggregation function (#7318)
Support disable swagger in Pinot servers (#7341)
Delete metadata properly on table deletion (#7329)
Add basic Obfuscator Support (#7407)
Add AWS sts dependency to enable auth using web identity token. (#7017)(#7445)
Mask credentials in debug endpoint /appconfigs (#7452)
Fix /sql query endpoint now compatible with auth (#7230)
Fix case sensitive issue in BasicAuthPrincipal permission check (#7354)
Fix auth token injection in SegmentGenerationAndPushTaskExecutor (#7464)
Add segmentNameGeneratorType config to IndexingConfig (#7346)
Support trigger PeriodicTask manually (#7174)
Add endpoint to check minion task status for a single task. (#7353)
Showing partial status of segment and counting CONSUMING state as good segment status (#7327)
Add "num rows in segments" and "num segments queried per host" to the output of Realtime Provisioning Rule (#7282)
Check schema backward-compatibility when updating schema through addSchema with override (#7374)
Optimize IndexedTable (#7373)
Support indices remove in V3 segment format (#7301)
Optimize TableResizer (#7392)
Introduce resultSize in IndexedTable (#7420)
Offset based realtime consumption status checker (#7267)
Add causes to stack trace return (#7460)
Create controller resource packages config key (#7488)
Enhance TableCache to support schema name different from table name (#7525)
Add validation for realtimeToOffline task (#7523)
Unify CombineOperator multi-threading logic (#7450)
Support no downtime rebalance for table with 1 replica in TableRebalancer (#7532)
Introduce MinionConf, move END_REPLACE_SEGMENTS_TIMEOUT_MS to minion config instead of task config. (#7516)
Adjust tuner api (#7553)
Adding config for metrics library (#7551)
Add geo type conversion scalar functions (#7573)
Add BOOLEAN_ARRAY and TIMESTAMP_ARRAY types (#7581)
Add MV raw forward index and MV BYTES data type (#7595)
Enhance TableRebalancer to offload the segments from most loaded instances first (#7574)
Improve get tenant API to differentiate offline and realtime tenants (#7548)
Refactor query rewriter to interfaces and implementations to allow customization (#7576)
In ServiceStartable, apply global cluster config in ZK to instance config (#7593)
Make dimension tables creation bypass tenant validation (#7559)
Allow Metadata and Dictionary Based Plans for No Op Filters (#7563)
Reject query with identifiers not in schema (#7590)
Round Robin IP addresses when retry uploading/downloading segments (#7585)
Support multi-value derived column in offline table reload (#7632)
Support segmentNamePostfix in segment name (#7646)
Add select segments API (#7651)
Controller getTableInstance() call now returns the list of live brokers of a table. (#7556)
Allow MV Field Support For Raw Columns in Text Indices (#7638)
Allow override distinctCount to segmentPartitionedDistinctCount (#7664)
Add a quick start with both UPSERT and JSON index (#7669)
Add revertSegmentReplacement API (#7662)
Smooth segment reloading with non blocking semantic (#7675)
Clear the reused record in PartitionUpsertMetadataManager (#7676)
Replace args4j with picocli (#7665)
Handle datetime column consistently (#7645)(#7705)
Allow to carry headers with query requests (#7696) (#7712)
Allow adding JSON data type for dimension column types (#7718)
Separate SegmentDirectoryLoader and tierBackend concepts (#7737)
Implement size balanced V4 raw chunk format (#7661)
Add presto-pinot-driver lib (#7384)

Major Bug fixes

Fix null pointer exception for non-existed metric columns in schema for JDBC driver (#7175)
Fix the config key for TASK_MANAGER_FREQUENCY_PERIOD (#7198)
Fixed pinot java client to add zkClient close (#7196)
Ignore query json parse errors (#7165)
Fix shutdown hook for PinotServiceManager (#7251) (#7253)
Make STRING to BOOLEAN data type change as backward compatible schema change (#7259)
Replace gcp hardcoded values with generic annotations (#6985)
Fix segment conversion executor for in-place conversion (#7265)
Fix reporting consuming rate when the Kafka partition level consumer isn't stopped (#7322)
Fix the issue with concurrent modification for segment lineage (#7343)
Fix TableNotFound error message in PinotHelixResourceManager (#7340)
Fix upload LLC segment endpoint truncated download URL (#7361)
Fix task scheduling on table update (#7362)
Fix metric method for ONLINE_MINION_INSTANCES metric (#7363)
Fix JsonToPinotSchema behavior to be consistent with AvroSchemaToPinotSchema (#7366)
Fix currentOffset volatility in consuming segment(#7365)
Fix misleading error msg for missing URI (#7367)
Fix the correctness of getColumnIndices method (#7370)
Fix SegmentZKMetadta time handling (#7375)
Fix retention for cleaning up segment lineage (#7424)
Fix segment generator to not return illegal filenames (#7085)
Fix missing LLC segments in segment store by adding controller periodic task to upload them (#6778)
Fix parsing error messages returned to FileUploadDownloadClient (#7428)
Fix manifest scan which drives /version endpoint (#7456)
Fix missing rate limiter if brokerResourceEV becomes null due to ZK connection (#7470)
Fix race conditions between segment merge/roll-up and purge (or convertToRawIndex) tasks: (#7427)
Fix pql double quote checker exception (#7485)
Fix minion metrics exporter config (#7496)
Fix segment unable to retry issue by catching timeout exception during segment replace (#7509)
Add Exception to Broker Response When Not All Segments Are Available (Partial Response) (#7397)
Fix segment generation commands (#7527)
Return non zero from main with exception (#7482)
Fix parquet plugin shading error (#7570)
Fix the lowest partition id is not 0 for LLC (#7066)
Fix star-tree index map when column name contains '.' (#7623)
Fix cluster manager URLs encoding issue(#7639)
Fix fieldConfig nullable validation (#7648)
Fix verifyHostname issue in FileUploadDownloadClient (#7703)
Fix TableCache schema to include the built-in virtual columns (#7706)
Fix DISTINCT with AS function (#7678)
Fix SDF pattern in DataPreprocessingHelper (#7721)
Fix fields missing issue in the source in ParquetNativeRecordReader (#7742)

apache/pinot release-0.9.0 Apache Pinot 0.9.0 on GitHub