dmlc/xgboost v1.6.0 on GitHub

v1.6.0 (2022 Apr 16)

After a long period of development, XGBoost v1.6.0 is packed with many new features and
improvements. We summarize them in the following sections starting with an introduction to
some major new features, then moving on to language binding specific changes including new
features and notable bug fixes for that binding.

Development of categorical data support

This version of XGBoost features new improvements and full coverage of experimental
categorical data support in the Python and C packages with tree models. The hist, approx,
and gpu_hist tree methods all support training with categorical data. Also, partition-based
categorical split is introduced in this release. This split type was first available in
LightGBM in the context of gradient boosting. The previous XGBoost release supported one-hot
split, where the splitting criterion is of the form x \in {c}, i.e. the categorical feature x is
tested against a single candidate. The new release allows for more expressive conditions: x \in S,
where the categorical feature x is tested against multiple candidates. Moreover, it is now
possible to use any of the tree algorithms (hist, approx, gpu_hist) when creating categorical splits.
For more information, please see our tutorial on categorical data, along with
examples linked on that page. (#7380, #7708, #7695, #7330, #7307, #7322, #7705,
#7652, #7592, #7666, #7576, #7569, #7529, #7575, #7393, #7465, #7385, #7371, #7745, #7810)
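
The difference between the two split conditions can be sketched in plain Python. This is only an illustration of the tests x \in {c} and x \in S described above, not XGBoost code; the function names and the sample categories are hypothetical.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def partition_split(values: Sequence[Hashable], s: Set[Hashable]) -> Tuple[List[int], List[int]]:
    """Partition-based split: rows with x in S go to one side, the rest to the other."""
    matched: List[int] = []
    rest: List[int] = []
    for i, x in enumerate(values):
        (matched if x in s else rest).append(i)
    return matched, rest


def one_hot_split(values: Sequence[Hashable], c: Hashable) -> Tuple[List[int], List[int]]:
    """One-hot split: the special case where S holds a single candidate c."""
    return partition_split(values, {c})


# Hypothetical categorical feature column.
colors = ["red", "green", "blue", "green", "yellow", "red"]

print(one_hot_split(colors, "green"))            # ([1, 3], [0, 2, 4, 5])
print(partition_split(colors, {"red", "blue"}))  # ([0, 2, 5], [1, 3, 4])
```

A one-hot split can only isolate one category per tree node, while a partition-based split can group several categories in a single node.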

In the future, we will continue to improve categorical data support with new features and
optimizations. We are also looking forward to bringing the feature beyond the Python binding;
contributions and feedback are welcome! Lastly, because the feature is experimental, its
behavior might be subject to change, especially the default values of related
hyper-parameters.

Experimental support for multi-output model

XGBoost 1.6 features initial support for the multi-output model, which includes
multi-output regression and multi-label classification. Along with this, the XGBoost
classifier has proper support for base margin without the need for the user to flatten the
input. In this initial support, XGBoost builds one model for each target, similar to the
sklearn meta estimator. For more details, please see our quick
introduction.

(#7365, #7736, #7607, #7574, #7521, #7514, #7456, #7453, #7455, #7434, #7429, #7405, #7381)
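
The "one model for each target" strategy can be sketched without any library. The classes below are hypothetical stand-ins (a trivial mean predictor plays the role of a single-target model); the sketch shows the strategy only and says nothing about how XGBoost implements it internally.

```python
from typing import Callable, List, Sequence


class MeanRegressor:
    """Hypothetical single-target model: always predicts the training mean."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> "MeanRegressor":
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X: Sequence[Sequence[float]]) -> List[float]:
        return [self.mean_ for _ in X]


class OneModelPerTarget:
    """Fits an independent single-target model for every column of Y."""

    def __init__(self, make_model: Callable[[], MeanRegressor]) -> None:
        self.make_model = make_model

    def fit(self, X, Y) -> "OneModelPerTarget":
        n_targets = len(Y[0])
        self.models_ = [
            self.make_model().fit(X, [row[t] for row in Y]) for t in range(n_targets)
        ]
        return self

    def predict(self, X) -> List[List[float]]:
        per_target = [m.predict(X) for m in self.models_]
        # Transpose so that each row holds all outputs for one sample.
        return [list(row) for row in zip(*per_target)]


X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]

model = OneModelPerTarget(MeanRegressor).fit(X, Y)
print(len(model.models_))      # 2
print(model.predict([[5.0]]))  # [[2.5, 25.0]]
```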

External memory support

External memory support for both the approx and hist tree methods is considered feature
complete in XGBoost 1.6. Building upon the iterator-based interface introduced in the
previous version, both hist and approx now iterate over each batch of data during
training and prediction. In previous versions, hist concatenated all the batches into
an internal representation; this step is removed in this version. As a result, users can
expect higher scalability in terms of data size but might experience lower performance due
to disk IO. (#7531, #7320, #7638, #7372)
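
The idea of iterating over batches instead of concatenating them can be sketched as follows. This is a simplified illustration in plain Python, not XGBoost's implementation: the `batches` generator is a hypothetical stand-in for data read from disk, and the cut points are made up.

```python
import bisect
from typing import Iterable, Iterator, List


def batches(data: List[float], batch_size: int) -> Iterator[List[float]]:
    """Hypothetical batch source; in practice each batch would be loaded from disk."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def histogram_by_batch(batch_iter: Iterable[List[float]], cuts: List[float]) -> List[int]:
    """Accumulate bin counts one batch at a time, never holding all rows at once."""
    counts = [0] * (len(cuts) + 1)
    for batch in batch_iter:
        for x in batch:
            # The bin index is the number of cut points that are <= x.
            counts[bisect.bisect_right(cuts, x)] += 1
    return counts


data = [0.5, 1.5, 2.5, 3.5, 0.1, 2.9, 1.0]
cuts = [1.0, 2.0, 3.0]

print(histogram_by_batch(batches(data, batch_size=3), cuts))  # [2, 2, 2, 1]
```

Only one batch is resident at a time, which is where the scalability gain comes from; the cost is that every pass over the data has to re-read the batches.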

Rewritten approx

The approx tree method is rewritten based on the existing hist tree method. The
rewrite closes the feature gap between approx and hist and improves the performance.
Now the behavior of approx should be more aligned with hist and gpu_hist. Here is a
list of user-visible changes:

Supports both max_leaves and max_depth.
Supports grow_policy.
Supports monotonic constraint.
Supports feature weights.
Uses max_bin in place of sketch_eps.
Supports categorical data.
Faster performance for many of the datasets.
Improved performance and robustness for distributed training.
Supports prediction cache.
Significantly better performance for external memory when depthwise policy is used.

New serialization format

Based on the existing JSON serialization format, we introduce UBJSON support as a more
efficient alternative. Both formats will be available in the future, and we plan to
gradually phase out support for the old binary model format. Users can choose between the
formats in the serialization function by providing the file extension json or ubj. Also,
the save_raw function in all supported language bindings gains a new parameter for exporting
the model in different formats. The available options are json, ubj, and deprecated; see the
documentation for the language binding you are using for details. Lastly, the default internal
serialization format is set to UBJSON, which affects Python pickle and R RDS. (#7572, #7570,
#7358, #7571, #7556, #7549, #7416)

General new features and improvements

Aside from the major new features mentioned above, some others are summarized here:

Users can now access the build information of the XGBoost binary in the Python and C
interfaces. (#7399, #7553)
Auto-configuration of seed_per_iteration is removed. Distributed training should now
generate results closer to single-node training when sampling is used. (#7009)
A new parameter huber_slope is introduced for the Pseudo-Huber objective.
During source build, XGBoost can choose cub in the system path automatically. (#7579)
XGBoost now honors the CPU counts from CFS, which is usually set in docker
environments. (#7654, #7704)
The metric aucpr is rewritten for better performance and GPU support. (#7297, #7368)
Metric calculation is now performed in double precision. (#7364)
XGBoost no longer mutates the global OpenMP thread limit. (#7537, #7519, #7608, #7590,
#7589, #7588, #7687)
The default behavior of max_leaves and max_depth is now unified. (#7302, #7551)
CUDA fat binary is now compressed. (#7601)
Deterministic result for evaluation metric and linear model. In previous versions of
XGBoost, evaluation results might differ slightly for each run due to parallel reduction
for floating-point values, which is now addressed. (#7362, #7303, #7316, #7349)
XGBoost now uses double for GPU Hist node sum, which improves the accuracy of
gpu_hist. (#7507)
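
The run-to-run differences mentioned in the item on deterministic results come from the fact that floating-point addition is not associative, so the order in which partial sums are combined matters. The plain-Python sketch below illustrates the effect and one generic remedy (combining partial results in a fixed order). The helper names are hypothetical, and this is not a description of XGBoost's actual implementation.

```python
from typing import Dict, Iterable, List


def chunk_sums(values: List[float], n_chunks: int) -> Dict[int, float]:
    """Per-chunk partial sums, as separate threads might compute them."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return {i: sum(values[i * size:(i + 1) * size]) for i in range(n_chunks)}


def reduce_in_order(partials: Dict[int, float], order: Iterable[int]) -> float:
    """Combine partial sums in the given order."""
    total = 0.0
    for i in order:
        total += partials[i]
    return total


def reduce_in_fixed_order(partials: Dict[int, float]) -> float:
    """Always combine by chunk index, regardless of which chunk finished first."""
    return reduce_in_order(partials, sorted(partials))


partials = chunk_sums([0.1, 0.2, 0.3], n_chunks=3)

# Two possible arrival orders give two different results.
print(reduce_in_order(partials, [0, 1, 2]))  # 0.6000000000000001
print(reduce_in_order(partials, [2, 1, 0]))  # 0.6

# A fixed reduction order gives the same answer on every run.
print(reduce_in_fixed_order(partials))       # 0.6000000000000001
```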

Performance improvements

Most of the performance improvements are integrated into other refactors during feature
development. The approx tree method should see significant performance gains for many
datasets, as mentioned in the previous section, while the hist tree method also enjoys improved
performance with the removal of the internal pruner along with some other
refactoring. Lastly, gpu_hist no longer synchronizes the device during training. (#7737)

General bug fixes

This section lists bug fixes that are not specific to any language binding.

The num_parallel_tree is now a model parameter instead of a training hyper-parameter,
which fixes model IO with random forest. (#7751)
Fixes in CMake script for exporting configuration. (#7730)
XGBoost can now handle unsorted sparse input. This includes text file formats like
libsvm and scipy sparse matrix where column index might not be sorted. (#7731)
Fix the tree param feature type. This affects inputs with more columns than
the maximum value of int32. (#7565)
Fix external memory with gpu_hist and subsampling. (#7481)
Check the number of trees in inplace predict. This avoids a potential segfault when an
incorrect value for iteration_range is provided. (#7409)
Fix unstable results in Cox regression. (#7756)

Changes in the Python package

Other than the changes in Dask, the XGBoost Python package gained some new features and
improvements along with small bug fixes.

Python 3.7 is now the minimum required Python version. (#7682)
Pre-built binary wheel for Apple Silicon. (#7621, #7612, #7747) Apple Silicon users will
now be able to run pip install xgboost to install XGBoost.
macOS users no longer need to install libomp from Homebrew, as the XGBoost wheel now
bundles the libomp.dylib library.
There are new parameters for users to specify the custom metric with new
behavior. XGBoost can now output transformed prediction values when a custom objective is
not supplied. See our explanation in the tutorial for details.
For the sklearn interface, following the estimator guideline from scikit-learn, all
parameters in fit that are not related to input data are moved into the constructor
and can be set by set_params. (#6751, #7420, #7375, #7369)
Apache Arrow format is now supported, which can bring better performance to users'
pipelines. (#7512)
Pandas nullable types are now supported (#7760)
A new function get_group is introduced for DMatrix to allow users to get the group
information in the custom objective function. (#7564)
More training parameters are exposed in the sklearn interface instead of relying on the
**kwargs. (#7629)
A new attribute feature_names_in_ is defined for all sklearn estimators like
XGBRegressor to follow the convention of sklearn. (#7526)
More work on Python type hint. (#7432, #7348, #7338, #7513, #7707)
Support the latest pandas Index type. (#7595)
Fix for Feature shape mismatch error on s390x platform (#7715)
Fix using feature names for constraints with multiple groups (#7711)
We clarified the behavior of the callback function when it contains mutable
states. (#7685)
Lastly, there are some code cleanups and maintenance work. (#7585, #7426, #7634, #7665,
#7667, #7377, #7360, #7498, #7438, #7667, #7752, #7749, #7751)

Changes in the Dask interface

The Dask module now supports a user-supplied host IP and port address for the scheduler node.
Please see the introduction and API document for reference. (#7645, #7581)
Internal DMatrix construction in Dask now honors thread configuration. (#7337)
A fix for nthread configuration using the Dask sklearn interface. (#7633)
The Dask interface can now handle empty partitions. An empty partition is different
from an empty worker: the latter refers to the case when a worker has no partition of an
input dataset, while the former refers to partitions on a worker that have zero
size. (#7644, #7510)
Scipy sparse matrix is supported as Dask array partition. (#7457)
Dask interface is no longer considered experimental. (#7509)

Changes in the R package

This section summarizes the new features, improvements, and bug fixes to the R package.

load.raw can optionally construct and return a booster. (#7686)
Fix parsing decision stump, which affects both transforming text representation to data
table and plotting. (#7689)
Implement feature weights. (#7660)
Some improvements for complying with the CRAN release policy. (#7672, #7661, #7763)
Support CSR data for predictions (#7615)
Document update (#7263, #7606)
New maintainer for the CRAN package (#7691, #7649)
Handle non-standard installation of the toolchain on macOS (#7759)

Changes in JVM-packages

Some new features for JVM-packages are introduced for a more integrated GPU pipeline and
better compatibility with musl-based Linux. Aside from this, we have a few notable bug
fixes.

Users can specify the tracker IP address for training, which helps when running XGBoost in
restricted network environments. (#7808)
Add support for detecting musl-based Linux (#7624)
Add DeviceQuantileDMatrix to Scala binding (#7459)
Add RAPIDS plugin support, so more of the JVM pipeline can now be accelerated by RAPIDS (#7491, #7779, #7793, #7806)
The setters for CPU and GPU are more aligned (#7692, #7798)
Control logging for early stopping (#7326)
Do not repartition when nWorker = 1 (#7676)
Fix the prediction issue for multi:softmax (#7694)
Fix for serialization of custom objective and eval (#7274)
Update documentation about Python tracker (#7396)
Remove jackson from the dependencies, which fixes CVE-2020-36518. (#7791)
Some refactoring to the training pipeline for better compatibility between CPU and
GPU. (#7440, #7401, #7789, #7784)
Maintenance work. (#7550, #7335, #7641, #7523, #6792, #4676)

Deprecation

Other than the changes in the Python package and serialization, we removed some features
that were deprecated in previous releases. Also, as mentioned in the previous section, we plan
to phase out the old binary format in future releases.

Remove old warning in 1.3 (#7279)
Remove label encoder deprecated in 1.3. (#7357)
Remove old callback deprecated in 1.3. (#7280)
Pre-built binary will no longer support deprecated CUDA architectures including sm35 and
sm50. Users can continue to use these platforms with source build. (#7767)

Documentation

This section lists some of the general changes to XGBoost's documentation. For changes
specific to a language binding, please visit the related sections.

The documentation is overhauled to use the new RTD theme, along with integration of Python
examples using Sphinx gallery. Also, we replaced most of the hard-coded URLs with Sphinx
references. (#7347, #7346, #7468, #7522, #7530)
Small updates along with fixes for broken links, typos, etc. (#7684, #7324, #7334, #7655,
#7628, #7623, #7487, #7532, #7500, #7341, #7648, #7311)
Update document for GPU. (#7403)
Document the status of RTD hosting. (#7353)
Update document for building from source. (#7664)
Add note about CRAN release. (#7395)

Maintenance

This is a summary of maintenance work that is not specific to any language binding.

Add CMake option to use /MD runtime (#7277)
Add clang-format configuration. (#7383)
Code cleanups (#7539, #7536, #7466, #7499, #7533, #7735, #7722, #7668, #7304, #7293,
#7321, #7356, #7345, #7387, #7577, #7548, #7469, #7680, #7433, #7398)
Improved tests with better coverage and the latest dependencies (#7573, #7446, #7650, #7520,
#7373, #7723, #7611, #7771)
Improved automation of the release process. (#7278, #7332, #7470)
Compiler workarounds (#7673)
Change shebang used in CLI demo. (#7389)
Update affiliation (#7289)

CI

Some fixes and updates to XGBoost's CI infrastructure. (#7739, #7701, #7382, #7662, #7646,
#7582, #7407, #7417, #7475, #7474, #7479, #7472, #7626)
