github apache/druid druid-0.20.0


Apache Druid 0.20.0 contains around 160 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 36 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

# New Features

# Ingestion

# Combining InputSource

A new combining InputSource has been added, allowing the user to combine multiple input sources during ingestion. Please see https://druid.apache.org/docs/0.20.0/ingestion/native-batch.html#combining-input-source for more details.
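
As a rough illustration, the ioConfig below combines a local directory with an HTTP source. The `combining` type and its `delegates` list follow the linked documentation; the directory, file filter, and URI are hypothetical placeholders.

```python
import json

# Sketch of a parallel-task ioConfig that reads from two input sources.
# The baseDir, filter, and URI values are made-up placeholders.
io_config = {
    "type": "index_parallel",
    "inputSource": {
        "type": "combining",
        "delegates": [
            {"type": "local", "baseDir": "/data/batch-a", "filter": "*.json"},
            {"type": "http", "uris": ["https://example.com/batch-b.json"]},
        ],
    },
    "inputFormat": {"type": "json"},
}

print(json.dumps(io_config, indent=2))
```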

#10387

# Automatically determine numShards for parallel ingestion hash partitioning

When hash partitioning is used in parallel batch ingestion, it is no longer necessary to specify numShards in the partition spec. Druid can now automatically determine a number of shards by scanning the data in a new ingestion phase that determines the cardinalities of the partitioning key.
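
A minimal sketch of a tuningConfig that relies on this behavior is shown below. The dimension name is hypothetical, and `forceGuaranteedRollup` is included on the assumption that hash partitioning in parallel ingestion requires perfect rollup, as described in the native batch documentation.

```python
import json

# Sketch of a parallel-task tuningConfig with a hashed partitionsSpec.
# numShards is deliberately omitted so that Druid can determine it.
tuning_config = {
    "type": "index_parallel",
    "forceGuaranteedRollup": True,  # assumption: needed for hash partitioning
    "partitionsSpec": {
        "type": "hashed",
        "partitionDimensions": ["countryName"],  # hypothetical dimension
    },
}

print(json.dumps(tuning_config, indent=2))
```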

#10419

# Subtask file count limits for parallel batch ingestion

The size-based splitHintSpec now supports a new maxNumFiles parameter, which limits how many files can be assigned to individual subtasks in parallel batch ingestion.

The segment-based splitHintSpec used for reingesting data from existing Druid segments also has a new maxNumSegments parameter which functions similarly.

Please see https://druid.apache.org/docs/0.20.0/ingestion/native-batch.html#split-hint-spec for more details.
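
The sketch below illustrates how a size limit and a file-count limit can bound a split together. It is an illustrative greedy packer, not Druid's actual splitting code; `split_files` and the sample sizes are hypothetical, and the `maxSplitSize` field name in the spec is taken from the linked documentation.

```python
# Sketch of a size-based splitHintSpec with the new file-count limit.
split_hint_spec = {"type": "maxSize", "maxSplitSize": 1_000_000_000, "maxNumFiles": 1000}


def split_files(file_sizes, max_split_size, max_num_files):
    """Greedily group files so each group respects both limits."""
    splits, current, current_bytes = [], [], 0
    for size in file_sizes:
        over_size = bool(current) and current_bytes + size > max_split_size
        over_count = len(current) >= max_num_files
        if over_size or over_count:
            # Close the current split and start a new one.
            splits.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# Five small files: the count limit (2 per split) is what forces new splits.
print(split_files([10, 10, 10, 10, 10], max_split_size=100, max_num_files=2))
```

With many small files, the size limit alone would pack them all into one subtask; the count limit is what keeps each subtask's file list bounded.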

#10243

# Task slot usage metrics

New task slot usage metrics have been added. Please see the entries for the taskSlot metrics at https://druid.apache.org/docs/0.20.0/operations/metrics.html#indexing-service for more details.

#10379

# Compaction

# Support for all partitioning schemes for auto-compaction

A partitioning spec can now be defined for auto-compaction, allowing users to repartition their data at compaction time. Please see the documentation for the new partitionsSpec property in the compaction tuningConfig for more details:

https://druid.apache.org/docs/0.20.0/configuration/index.html#compaction-tuningconfig
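
For example, an auto-compaction config that repartitions a datasource by a single dimension might look roughly like the following. The datasource name, dimension, and row target are hypothetical; the `single_dim` fields follow the native batch partitioning documentation.

```python
import json

# Sketch of an auto-compaction config with a partitionsSpec in its tuningConfig.
compaction_config = {
    "dataSource": "wikipedia",  # hypothetical datasource
    "tuningConfig": {
        "partitionsSpec": {
            "type": "single_dim",
            "partitionDimension": "countryName",  # hypothetical dimension
            "targetRowsPerSegment": 5_000_000,
        }
    },
}

print(json.dumps(compaction_config, indent=2))
```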

#10307

# Auto-compaction status API

A new coordinator API which shows the status of auto-compaction for a datasource has been added. The new API shows whether auto-compaction is enabled for a datasource, and a summary of how far compaction has progressed.

The web console has also been updated to show this information:

https://user-images.githubusercontent.com/177816/94326243-9d07e780-ff57-11ea-9f80-256fa08580f0.png

Please see https://druid.apache.org/docs/latest/operations/api-reference.html#compaction-status for details on the new API, and https://druid.apache.org/docs/latest/operations/metrics.html#coordination for information on new related compaction metrics.
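
A small sketch of building the request URL for the status API follows. The endpoint path follows the linked API reference; the coordinator address and datasource name are placeholders, and `compaction_status_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def compaction_status_url(coordinator, datasource=None):
    """Build the URL for the auto-compaction status API (issue a GET against it)."""
    url = f"{coordinator}/druid/coordinator/v1/compaction/status"
    if datasource is not None:
        # Restrict the response to a single datasource.
        url += "?" + urlencode({"dataSource": datasource})
    return url


print(compaction_status_url("http://localhost:8081", "wikipedia"))
# http://localhost:8081/druid/coordinator/v1/compaction/status?dataSource=wikipedia
```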

#10371
#10438

# Querying

# Query segment pruning with hash partitioning

Druid now supports query-time segment pruning (excluding certain segments as read candidates for a query) for hash partitioned segments. This optimization applies when all of the partitionDimensions specified in the hash partition spec during ingestion time are present in the filter set of a query, and the filters in the query filter on discrete values of the partitionDimensions (e.g., selector filters). Segment pruning with hash partitioning is not supported with non-discrete filters such as bound filters.

To take advantage of this feature with existing segments, you will need to reingest them: segment pruning requires a partitionFunction to be stored together with the segments, and segments created by older versions of Druid do not have one. It is not necessary to specify the partitionFunction explicitly, as the default is the same partition function that was used in prior versions of Druid.

Note that segments created with the default partitionDimensions value (partition by all dimensions + the time column) cannot be pruned in this manner; the segments need to be created with an explicit partitionDimensions.
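
The eligibility conditions described in this section can be summarized as a small check. This is only a model of the stated rules, not Druid's implementation; `can_prune` and the dimension names are hypothetical.

```python
def can_prune(partition_dimensions, discrete_filter_dims):
    """Return True when hash-based segment pruning could apply.

    partition_dimensions: explicit partitionDimensions from the hash partition
        spec (empty when the default was used at ingestion time).
    discrete_filter_dims: dimensions the query filters on with discrete-value
        filters such as selector filters (bound filters do not count).
    """
    if not partition_dimensions:
        # Default partitionDimensions: pruning is not possible.
        return False
    # Every partition dimension must appear in the discrete filter set.
    return set(partition_dimensions) <= set(discrete_filter_dims)


print(can_prune(["country"], ["country", "city"]))   # True
print(can_prune(["country", "city"], ["country"]))   # False
print(can_prune([], ["country"]))                    # False
```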

#9810
#10288

# Vectorization

To enable vectorization features, please set the druid.query.default.context.vectorizeVirtualColumns property to true or set the vectorize property in the query context. Please see https://druid.apache.org/docs/0.20.0/querying/query-context.html#vectorization-parameters for more information.
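
As a sketch, the per-query route looks like the following native query, which sets both parameters in its context. The datasource, interval, and column names are hypothetical.

```python
import json

# Sketch of a native timeseries query that opts in to vectorization
# through the query context.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "intervals": ["2020-01-01/2020-02-01"],
    "granularity": "day",
    "aggregations": [
        {"type": "doubleMax", "name": "maxAdded", "fieldName": "added"}
    ],
    "context": {
        "vectorize": "true",
        "vectorizeVirtualColumns": "true",
    },
}

print(json.dumps(query, indent=2))
```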

# Vectorization support for expression virtual columns

Expression virtual columns now have vectorization support (depending on the expressions being used), which can result in a 3-5x performance improvement in some cases.

Please see https://druid.apache.org/docs/0.20.0/misc/math-expr.html#vectorization-support for details on the specific expressions that support vectorization.

#10388
#10401
#10432

# More vectorization support for aggregators

Vectorization support has been added for several aggregation types: numeric min/max aggregators, variance aggregators, ANY aggregators, and aggregators from the druid-histogram extension.

#10260 - numeric min/max
#10304 - histogram
#10338 - ANY
#10390 - variance

We've observed about a 1.3x to 1.8x performance improvement in some cases with vectorization enabled for the min, max, and ANY aggregators, and about 1.04x to 1.07x with the histogram aggregator.

# offset parameter for GroupBy and Scan queries

It is now possible to set an offset parameter for GroupBy and Scan queries, which tells Druid to skip a number of rows when returning results. Please see https://druid.apache.org/docs/0.20.0/querying/limitspec.html and https://druid.apache.org/docs/0.20.0/querying/scan-query.html for details.
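
A brief sketch of both forms follows: in a GroupBy query the offset sits inside the limitSpec, while in a Scan query it is a top-level property. The datasource, interval, and column names are hypothetical.

```python
import json

INTERVALS = ["2020-01-01/2020-02-01"]

# GroupBy: skip the first 20 result rows, then return the next 10.
group_by = {
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "intervals": INTERVALS,
    "granularity": "all",
    "dimensions": ["channel"],
    "aggregations": [{"type": "count", "name": "rows"}],
    "limitSpec": {
        "type": "default",
        "limit": 10,
        "offset": 20,
        "columns": [{"dimension": "rows", "direction": "descending"}],
    },
}

# Scan: offset and limit are top-level properties.
scan = {
    "queryType": "scan",
    "dataSource": "wikipedia",
    "intervals": INTERVALS,
    "columns": ["__time", "channel"],
    "offset": 20,
    "limit": 10,
}

print(json.dumps(group_by, indent=2))
print(json.dumps(scan, indent=2))
```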

#10235
#10233

# OFFSET clause for SQL queries

Druid SQL queries now support an OFFSET clause. Please see https://druid.apache.org/docs/0.20.0/querying/sql.html#offset for details.
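
One use of OFFSET is simple result paging. The helper below is a hypothetical sketch that builds the JSON payload for a SQL request; the table and column names are placeholders.

```python
def paged_sql(base_sql, page, page_size):
    """Build a SQL payload for a zero-based page of results."""
    offset = page * page_size
    return {"query": f"{base_sql} LIMIT {page_size} OFFSET {offset}"}


payload = paged_sql("SELECT channel FROM wikipedia ORDER BY channel", page=2, page_size=10)
print(payload["query"])
# SELECT channel FROM wikipedia ORDER BY channel LIMIT 10 OFFSET 20
```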

#10279

# Substring search operators

Druid has added new substring search operators in its expression language and for SQL queries.

Please see documentation for CONTAINS_STRING and ICONTAINS_STRING string functions for Druid SQL (https://druid.apache.org/docs/0.20.0/querying/sql.html#string-functions) and documentation for contains_string and icontains_string for the Druid expression language (https://druid.apache.org/docs/0.20.0/misc/math-expr.html#string-functions).

We've observed about a 2.5x performance improvement in some cases by using these functions instead of STRPOS.
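
As an illustration, a STRPOS-based filter and its CONTAINS_STRING equivalent are shown below, followed by a plain-Python model of what the two new functions compute. The table and column names are hypothetical, and the Python functions only model the semantics; they are not Druid code.

```python
# SQL before and after: STRPOS returns 0 when the substring is absent.
strpos_sql = "SELECT COUNT(*) FROM wikipedia WHERE STRPOS(page, 'Druid') > 0"
contains_sql = "SELECT COUNT(*) FROM wikipedia WHERE CONTAINS_STRING(page, 'Druid')"


def contains_string(haystack, needle):
    """Case-sensitive substring test (models contains_string)."""
    return needle in haystack


def icontains_string(haystack, needle):
    """Case-insensitive substring test (models icontains_string)."""
    return needle.lower() in haystack.lower()


print(contains_string("Apache Druid", "druid"))   # False
print(icontains_string("Apache Druid", "druid"))  # True
```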

#10350

# UNION ALL operator for SQL queries

Druid SQL queries now support the UNION ALL operator, which fuses the results of multiple queries together. Please see https://druid.apache.org/docs/0.20.0/querying/sql.html#union-all for details on what query shapes are supported by this operator.

#10324

# Cluster-wide default query context settings

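It is now possible to set cluster-wide default query context properties by adding a configuration of the form druid.query.default.context.*, with * replaced by the property name (for example, druid.query.default.context.vectorizeVirtualColumns, as used in the Vectorization section below).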
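
The sketch below models how cluster-wide defaults and a per-query context combine, on the assumption that values set in the query context take precedence over the cluster-wide defaults. `effective_context` is a hypothetical helper, not a Druid API.

```python
def effective_context(cluster_defaults, query_context):
    """Merge contexts; per-query values override cluster-wide defaults."""
    return {**cluster_defaults, **query_context}


cluster_defaults = {"vectorize": "true", "timeout": 60000}
query_context = {"timeout": 5000}

print(effective_context(cluster_defaults, query_context))
# {'vectorize': 'true', 'timeout': 5000}
```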

#10208

# Other features

# Improved retention rules UI

The retention rules UI in the web console has been improved. It now provides suggestions and basic validation in the period dropdown, shows the cluster default rules, and makes editing the default rules more accessible.

#10226

# Redis cache extension enhancements

The Redis cache extension now supports Redis Cluster, selecting which database is used, connecting to password-protected servers, and period-style configurations for the expiration and timeout properties.

#10240

# Disable sending server version in response headers

It is now possible to disable sending of server version information in Druid's response headers.

This is controlled by a new property druid.server.http.sendServerVersion, which defaults to true.

#9832

# Specify byte-based configuration properties with units

Druid now supports units for specifying byte-based configuration properties, e.g.:

druid.server.maxSize=300g

equivalent to

druid.server.maxSize=300000000000

Please see https://druid.apache.org/docs/0.20.0/configuration/human-readable-byte.html for more details.
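
To make the conversion concrete, here is a minimal parser sketch. Only the 300g example above comes from these notes; the rest (decimal k/m/g/t/p suffixes, binary Ki/Mi/Gi-style suffixes, an optional trailing b, and case insensitivity) is an assumption based on the linked documentation, and `parse_bytes` is a hypothetical name.

```python
import re

# Decimal (powers of 1000) and binary (powers of 1024) unit factors.
_DECIMAL = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12, "p": 10**15}
_BINARY = {"ki": 2**10, "mi": 2**20, "gi": 2**30, "ti": 2**40, "pi": 2**50}


def parse_bytes(text):
    """Convert a string such as '300g' or '1GiB' to a byte count."""
    match = re.fullmatch(r"(\d+)([kmgtp]i?)?b?", text.strip().lower())
    if match is None:
        raise ValueError(f"not a byte size: {text!r}")
    number = int(match.group(1))
    unit = match.group(2) or ""
    factor = _BINARY[unit] if unit.endswith("i") else _DECIMAL[unit]
    return number * factor


print(parse_bytes("300g"))  # 300000000000
```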

#10203

# Bug fixes

# Fix query correctness issue when historical has no segment timeline

Druid 0.20.0 fixes a query correctness issue when a broker issues a query expecting a historical to have certain segments for a datasource, but the historical when queried does not actually have any segments for that datasource (e.g., they were all unloaded before the historical processed the query). Prior to 0.20.0, the query would return successfully but without the results from the segments that were missing in the manner described previously. In 0.20.0, queries will now fail in such situations.

#10199

# Fix issue preventing result-level cache from being populated

Druid 0.20.0 fixes an issue introduced in 0.19.0 (#10337) which can prevent query caches from being populated when result-level caching is enabled.

#10341

# Fix for variance aggregator ordering

The variance aggregator previously used an incorrect comparator that compared using an aggregator's internal count variable instead of the variance.

#10340

# Fix incorrect caching for groupBy queries with limit specs

Druid 0.20.0 fixes an issue with groupBy queries and caching, where the limitSpec of the query was not considered in the cache key, leading to potentially incorrect results if queries that are identical except for the limitSpec are issued.

#10093

# Fix for stringFirst and stringLast with rollup enabled

#7243 has been resolved: the stringFirst and stringLast aggregators no longer cause an exception when used during ingestion with rollup enabled.

#10332

# Upgrading to Druid 0.20.0

Please be aware of the following considerations when upgrading from 0.19.0 to 0.20.0. If you're updating from an earlier version than 0.19.0, please see the release notes of the relevant intermediate versions.

# Default maxSize

druid.server.maxSize will now default to the sum of maxSize values defined within the druid.segmentCache.locations. The user can still provide a custom value for druid.server.maxSize which will take precedence over the default value.
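
The new default can be modeled as follows. This is a sketch of the rule stated above, not Druid's code; `default_server_max_size` and the sample locations are hypothetical.

```python
def default_server_max_size(locations, configured_max_size=None):
    """Return the effective druid.server.maxSize.

    locations: entries from druid.segmentCache.locations, each with a maxSize.
    configured_max_size: explicit druid.server.maxSize, if the user set one.
    """
    if configured_max_size is not None:
        # An explicit value takes precedence over the computed default.
        return configured_max_size
    return sum(location["maxSize"] for location in locations)


locations = [
    {"path": "/mnt/disk1/segments", "maxSize": 100_000_000_000},
    {"path": "/mnt/disk2/segments", "maxSize": 200_000_000_000},
]

print(default_server_max_size(locations))  # 300000000000
```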

#10255

# Compaction and kill task ID changes

Compaction and kill tasks issued by the coordinator will now have their task IDs prefixed by coordinator-issued, while user-issued kill tasks will be prefixed by api-issued.

#10278

# New size limits for parallel ingestion split hint specs

The size-based and segment-based splitHintSpec for parallel batch ingestion now apply a default limit of 1000 files or segments per subtask, controlled by the maxNumFiles and maxNumSegments parameters, respectively.

#10243

# New PostAggregator and AggregatorFactory methods

Users who have developed an extension with custom PostAggregator or AggregatorFactory implementations will need to update their extensions, as these two interfaces have new methods defined in 0.20.0.

PostAggregator now has a new method:

  ValueType getType();

To support type information on PostAggregator, AggregatorFactory also has 2 new methods:

  public abstract ValueType getType();

  public abstract ValueType getFinalizedType();

Please see #9638 for more details on the interface changes.

# New Expr-related methods

Users who have developed an extension with custom Expr implementations will need to update their extensions, as Expr and related interfaces have changed in 0.20.0. Please see the PR below for details:

#10401

# More accurate query/cpu/time metric

In 0.20.0, the accuracy of the query/cpu/time metric has been improved. Previously, it did not account for certain portions of work during query processing, described in more detail in the following PR:

#10377

# New audit log service metric columns

If you are using audit logging, please be aware that new columns have been added to the audit log service metric (comment, remote_address, and created_date). An optional payload column has also been added, which can be enabled by setting druid.audit.manager.includePayloadAsDimensionInMetric to true.

#10373

# sqlQueryContext in request logs

If you are using query request logging, the request log events will now include the sqlQueryContext for SQL queries.

#10368

# Additional per-segment state in metadata store

Hash-partitioned segments created by Druid 0.20.0 will now have additional partitionFunction data in the metadata store.

Additionally, compaction tasks will now store additional per-segment information in the metadata store, used for tracking compaction history.

#10288
#10413

# Known issues

# druid.segmentCache.locationSelectorStrategy injection failure

Specifying a value for druid.segmentCache.locationSelectorStrategy prevents services from starting due to an injection error. Please see #10348 for more details.

# Resource leak in web console data sampler

When a timeout occurs while sampling data in the web console, internal resources created to read from the input source are not properly closed. Please see #10467 for more information.

# Credits

Thanks to everyone who contributed to this release!

@a2l007
@abhishekagarwal87
@abhishekrb19
@ArvinZheng
@belugabehr
@capistrant
@ccaominh
@clintropolis
@code-crusher
@dylwylie
@fermelone
@FrankChen021
@gianm
@himanshug
@jihoonson
@jon-wei
@josephglanville
@joykent99
@kroeders
@lightghli
@lkm
@mans2singh
@maytasm
@medb
@mghosh4
@nishantmonu51
@pan3793
@richardstartin
@sthetland
@suneet-s
@tarunparackal
@tdt17
@tourvi
@vogievetsky
@wjhypo
@xiangqiao123
@xvrl
