v1.6.0 (2022 Apr 16)
After a long period of development, XGBoost v1.6.0 is packed with many new features and improvements. We summarize them in the following sections, starting with an introduction to some major new features, then moving on to language-binding-specific changes, including both new features and notable bug fixes for each package.
Development of categorical data support
This version of XGBoost features new improvements and full coverage of experimental categorical data support in the Python and C packages with the tree model. The `hist`, `approx`, and `gpu_hist` tree methods now all support training with categorical data. Also, the partition-based categorical split is featured in this release. This split type was first available in LightGBM in the context of gradient boosting. In previous versions, only `gpu_hist` supported the one-hot-encoding-based split, which has the form `x \in {c}` where `{c}` is the set of all categories. In this new release, `{c}` can optionally be split into 2 sets for the left and right nodes using any of the aforementioned tree methods. For more information, please see our tutorial on categorical data, along with examples linked on that page. (#7380, #7708, #7695, #7330, #7307, #7322, #7705, #7652, #7592, #7666, #7576, #7569, #7529, #7575, #7393, #7465, #7385, #7371, #7745, #7810)
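As a quick illustration, here is a minimal sketch of training on categorical data through the scikit-learn interface; the toy column names and values are made up for this example:

```python
import pandas as pd
import xgboost as xgb

# Toy data: the "color" column uses the pandas category dtype.
X = pd.DataFrame(
    {
        "color": pd.Categorical(["red", "green", "blue", "green", "red", "blue"]),
        "size": [1.0, 2.0, 0.5, 1.5, 2.5, 0.75],
    }
)
y = [0, 1, 0, 1, 0, 1]

# enable_categorical opts into the experimental native categorical support.
clf = xgb.XGBClassifier(tree_method="hist", enable_categorical=True)
clf.fit(X, y)
```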
In the future, we will continue to improve categorical data support with new features and optimizations. We also look forward to bringing the feature beyond the Python binding; contributions and feedback are welcome! Lastly, given its experimental status, the behavior might be subject to change, especially the default values of related hyper-parameters.
Experimental support for multi-output model
XGBoost 1.6 features initial support for multi-output models, which includes multi-output regression and multi-label classification. Along with this, the XGBoost classifier has proper support for base margin without the need for the user to flatten the input. In this initial support, XGBoost builds one model for each target, similar to the sklearn meta estimator. For more details, please see our quick introduction and the sketch below. (#7365, #7736, #7607, #7574, #7521, #7514, #7456, #7453, #7455, #7434, #7429, #7405, #7381)
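Here is a minimal sketch of multi-output regression under this initial support; the synthetic data is made up for the example:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 8))
# Two regression targets: y has shape (n_samples, n_targets).
y = np.stack([2.0 * X[:, 0], X[:, 1] - X[:, 2]], axis=1)

# Internally one model is built per target, similar to the sklearn
# meta estimator approach.
reg = xgb.XGBRegressor(tree_method="hist")
reg.fit(X, y)
print(reg.predict(X).shape)  # (128, 2)
```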
External memory support
External memory support for both the approx and hist tree methods is considered feature-complete in XGBoost 1.6. Building upon the iterator-based interface introduced in the previous version, both `hist` and `approx` now iterate over each batch of data during training and prediction. In previous versions, `hist` concatenated all the batches into an internal representation, which is removed in this version. As a result, users can expect higher scalability in terms of data size but might experience lower performance due to disk IO. (#7531, #7320, #7638, #7372)
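For reference, a minimal sketch of the iterator-based interface; the batch files and their layout are hypothetical placeholders:

```python
import os
import pickle

import xgboost

class BatchIterator(xgboost.DataIter):
    """Iterate over pickled (X, y) batches stored on disk."""

    def __init__(self, file_paths):
        self._file_paths = file_paths
        self._it = 0
        # XGBoost writes its external-memory cache under this prefix.
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data):
        """Feed the next batch to XGBoost; return 0 when exhausted, 1 otherwise."""
        if self._it == len(self._file_paths):
            return 0
        with open(self._file_paths[self._it], "rb") as fd:
            X, y = pickle.load(fd)
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self):
        """Rewind to the first batch for the next iteration pass."""
        self._it = 0

it = BatchIterator(["batch-0.pkl", "batch-1.pkl"])  # hypothetical files
Xy = xgboost.DMatrix(it)
booster = xgboost.train({"tree_method": "approx"}, Xy)
```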
Rewritten `approx`
The `approx` tree method is rewritten based on the existing `hist` tree method. The rewrite closes the feature gap between `approx` and `hist` and improves the performance. Now the behavior of `approx` should be more aligned with `hist` and `gpu_hist`. Here's a list of user-visible changes, followed by a short parameter sketch:
- Supports both `max_leaves` and `max_depth`.
- Supports `grow_policy`.
- Supports monotonic constraint.
- Supports feature weights.
- Use `max_bin` to replace `sketch_eps`.
- Supports categorical data.
- Faster performance for many of the datasets.
- Improved performance and robustness for distributed training.
- Supports prediction cache.
- Significantly better performance for external memory when the `depthwise` policy is used.
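A small parameter sketch illustrating the list above; the values are arbitrary:

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=512, n_features=16, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "tree_method": "approx",
    "grow_policy": "lossguide",  # now supported by approx
    "max_leaves": 64,            # now supported by approx
    "max_bin": 256,              # replaces the removed sketch_eps
}
booster = xgb.train(params, dtrain, num_boost_round=10)
```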
New serialization format
Based on the existing JSON serialization format, we introduce UBJSON support as a more efficient alternative. Both formats will be available in the future, and we plan to gradually phase out support for the old binary model format. Users can opt to use the different formats in the serialization function by providing the file extension `json` or `ubj`. Also, the `save_raw` function in all supported language bindings gains a new parameter for exporting the model in different formats; the available options are `json`, `ubj`, and `deprecated`. See the document for the language binding you are using for details. Lastly, the default internal serialization format is set to UBJSON, which affects Python pickle and R RDS. (#7572, #7570, #7358, #7571, #7556, #7549, #7416)
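For example, a minimal sketch of selecting the format by file extension and through `save_raw` in Python:

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(random_state=0)
booster = xgb.train({}, xgb.DMatrix(X, label=y), num_boost_round=2)

# The file extension selects the on-disk serialization format.
booster.save_model("model.json")  # JSON
booster.save_model("model.ubj")   # UBJSON

# save_raw gains a parameter for the in-memory format.
raw_ubj = booster.save_raw(raw_format="ubj")
```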
General new features and improvements
Aside from the major new features mentioned above, some others are summarized here:
- Users can now access the build information of the XGBoost binary in the Python and C interfaces. (#7399, #7553)
- Auto-configuration of `seed_per_iteration` is removed; distributed training should now generate results closer to single-node training when sampling is used. (#7009)
- A new parameter `huber_slope` is introduced for the `Pseudo-Huber` objective (see the sketch after this list).
- During a source build, XGBoost can choose cub in the system path automatically. (#7579)
- XGBoost now honors the CPU counts from CFS, which is usually set in docker environments. (#7654, #7704)
- The metric `aucpr` is rewritten for better performance and GPU support. (#7297, #7368)
- Metric calculation is now performed in double precision. (#7364)
- XGBoost no longer mutates the global OpenMP thread limit. (#7537, #7519, #7608, #7590, #7589, #7588, #7687)
- The default behavior of `max_leaves` and `max_depth` is now unified. (#7302, #7551)
- The CUDA fat binary is now compressed. (#7601)
- Deterministic results for evaluation metrics and the linear model. In previous versions of XGBoost, evaluation results might differ slightly between runs due to parallel reduction on floating-point values, which is now addressed. (#7362, #7303, #7316, #7349)
- XGBoost now uses double precision for the GPU Hist node sum, which improves the accuracy of `gpu_hist`. (#7507)
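As referenced in the list above, a minimal sketch of the Pseudo-Huber objective with the new parameter; the slope value is arbitrary:

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=256, n_features=8, random_state=0)

reg = xgb.XGBRegressor(
    objective="reg:pseudohubererror",
    huber_slope=0.5,  # new parameter controlling the slope of the Pseudo-Huber loss
)
reg.fit(X, y)
```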
Performance improvements
Most of the performance improvements are integrated into other refactors during feature developments. The approx
should see significant performance gain for many datasets as mentioned in the previous section, while the hist
tree method also enjoys improved performance with the removal of the internal pruner
along with some other refactoring. Lastly, gpu_hist
no longer synchronizes the device during training. (#7737)
General bug fixes
This section lists bug fixes that are not specific to any language binding.
- `num_parallel_tree` is now a model parameter instead of a training hyper-parameter, which fixes model IO with random forests. (#7751)
- Fixes in the CMake script for exporting configuration. (#7730)
- XGBoost can now handle unsorted sparse input. This includes text file formats like libsvm and scipy sparse matrices where the column index might not be sorted. (#7731)
- Fix the tree param feature type; this affects inputs with a number of columns greater than the maximum value of int32. (#7565)
- Fix external memory with `gpu_hist` and subsampling. (#7481)
- Check the number of trees in inplace predict; this avoids a potential segfault when an incorrect value for `iteration_range` is provided. (#7409)
- Fix non-stable results in cox regression. (#7756)
Changes in the Python package
Other than the changes in Dask, the XGBoost Python package gained some new features and improvements along with small bug fixes.
- Python 3.7 is required as the lowest Python version. (#7682)
- Pre-built binary wheels for Apple Silicon. (#7621, #7612, #7747) Apple Silicon users can now run `pip install xgboost` to install XGBoost.
- macOS users no longer need to install `libomp` from Homebrew, as the XGBoost wheel now bundles the `libomp.dylib` library.
- There are new parameters for users to specify the custom metric with new behavior: XGBoost can now output transformed prediction values when a custom objective is not supplied. See our explanation in the tutorial for details, and the sketch after this list.
- For the sklearn interface, following the estimator guideline from scikit-learn, all parameters in `fit` that are not related to input data are moved into the constructor and can be set by `set_params`. (#6751, #7420, #7375, #7369)
- The Apache Arrow format is now supported, which can bring better performance to users' pipelines. (#7512)
- Pandas nullable types are now supported. (#7760)
- A new function `get_group` is introduced for `DMatrix` to allow users to get the group information in the custom objective function. (#7564)
- More training parameters are exposed in the sklearn interface instead of relying on `**kwargs`. (#7629)
- A new attribute `feature_names_in_` is defined for all sklearn estimators like `XGBRegressor` to follow the sklearn convention. (#7526)
- More work on Python type hints. (#7432, #7348, #7338, #7513, #7707)
- Support the latest pandas Index type. (#7595)
- Fix for the "Feature shape mismatch" error on the s390x platform. (#7715)
- Fix using feature names for constraints with multiple groups. (#7711)
- We clarified the behavior of the callback function when it contains mutable states. (#7685)
- Lastly, there are some code cleanups and maintenance work. (#7585, #7426, #7634, #7665, #7667, #7377, #7360, #7498, #7438, #7667, #7752, #7749, #7751)
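As referenced in the custom metric item above, a minimal sketch of the new sklearn-style behavior, where the callable receives transformed (final) predictions when no custom objective is supplied; the metric here is only an example:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=256, n_features=16, random_state=0)

def mean_absolute_error(y_true, y_pred):
    # y_pred holds transformed predictions, not raw margins.
    return np.mean(np.abs(y_true - y_pred))

# eval_metric is now a constructor parameter rather than a fit() argument.
reg = xgb.XGBRegressor(tree_method="hist", eval_metric=mean_absolute_error)
reg.fit(X, y, eval_set=[(X, y)])
```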
Changes in the Dask interface
- The Dask module now supports a user-supplied host IP and port address for the scheduler node; see the sketch after this list. Please see the introduction and the API document for reference. (#7645, #7581)
- Internal `DMatrix` construction in dask now honors the thread configuration. (#7337)
- A fix for `nthread` configuration when using the Dask sklearn interface. (#7633)
- The Dask interface can now handle empty partitions. An empty partition is different from an empty worker; the latter refers to the case where a worker has no partition of an input dataset, while the former refers to partitions on a worker that have zero size. (#7644, #7510)
- Scipy sparse matrices are supported as Dask array partitions. (#7457)
- The Dask interface is no longer considered experimental. (#7509)
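As referenced in the first item above, a sketch of supplying the scheduler address, assuming the `xgboost.scheduler_address` dask config key described in the Dask troubleshooting documentation; the address is a placeholder:

```python
import dask

# Placeholder address for a restricted network environment; set this before
# training so the tracker binds to the given host (and, optionally, port).
dask.config.set({"xgboost.scheduler_address": "192.0.0.100:12345"})
# ... then create a distributed Client and call xgboost.dask.train as usual.
```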
Changes in the R package
This section summarizes the new features, improvements, and bug fixes to the R package.
- `load.raw` can optionally construct a booster as the return value. (#7686)
- Fix parsing decision stumps, which affects both transforming the text representation to a data table and plotting. (#7689)
- Implement feature weights. (#7660)
- Some improvements for complying with the CRAN release policy. (#7672, #7661, #7763)
- Support CSR data for predictions. (#7615)
- Document updates. (#7263, #7606)
- New maintainer for the CRAN package. (#7691, #7649)
- Handle non-standard installation of the toolchain on macOS. (#7759)
Changes in JVM-packages
Some new features for JVM-packages are introduced for a more integrated GPU pipeline and better compatibility with musl-based Linux. Aside from this, we have a few notable bug fixes.
- Users can specify the tracker IP address for training, which helps run XGBoost in restricted network environments. (#7808)
- Add support for detecting musl-based Linux (#7624)
- Add `DeviceQuantileDMatrix` to the Scala binding. (#7459)
- Add RAPIDS plugin support; now more of the JVM pipeline can be accelerated by RAPIDS. (#7491, #7779, #7793, #7806)
- The setters for CPU and GPU are more aligned (#7692, #7798)
- Control logging for early stopping (#7326)
- Do not repartition when nWorker = 1 (#7676)
- Fix the prediction issue for `multi:softmax`. (#7694)
- Fix for serialization of custom objective and eval. (#7274)
- Update documentation about Python tracker (#7396)
- Remove jackson from dependency, which fixes CVE-2020-36518. (#7791)
- Some refactoring to the training pipeline for better compatibility between CPU and GPU. (#7440, #7401, #7789, #7784)
- Maintenance work. (#7550, #7335, #7641, #7523, #6792, #4676)
Deprecation
Other than the changes in the Python package and serialization, we removed some deprecated features in previous releases. Also, as mentioned in the previous section, we plan to phase out the old binary format in future releases.
- Remove old warning in 1.3 (#7279)
- Remove label encoder deprecated in 1.3. (#7357)
- Remove old callback deprecated in 1.3. (#7280)
- Pre-built binary will no longer support deprecated CUDA architectures including sm35 and sm50. Users can continue to use these platforms with source build. (#7767)
Documentation
This section lists some of the general changes to XGBoost's documentation; for language-binding-specific changes, please visit the related sections.
- The documentation is overhauled to use the new RTD theme, along with the integration of Python examples using Sphinx gallery. Also, we replaced most of the hard-coded URLs with Sphinx references. (#7347, #7346, #7468, #7522, #7530)
- Small update along with fixes for broken links, typos, etc. (#7684, #7324, #7334, #7655, #7628, #7623, #7487, #7532, #7500, #7341, #7648, #7311)
- Update document for GPU. [skip ci] (#7403)
- Document the status of RTD hosting. (#7353)
- Update document for building from source. (#7664)
- Add note about CRAN release [skip ci] (#7395)
Maintenance
This is a summary of maintenance work that is not specific to any language binding.
- Add CMake option to use /MD runtime (#7277)
- Add clang-format configuration. (#7383)
- Code cleanups (#7539, #7536, #7466, #7499, #7533, #7735, #7722, #7668, #7304, #7293, #7321, #7356, #7345, #7387, #7577, #7548, #7469, #7680, #7433, #7398)
- Improved tests with better coverage and the latest dependencies. (#7573, #7446, #7650, #7520, #7373, #7723, #7611, #7771)
- Improved automation of the release process. (#7278, #7332, #7470)
- Compiler workarounds (#7673)
- Change shebang used in CLI demo. (#7389)
- Update affiliation (#7289)
CI
Some fixes and updates to XGBoost's CI infrastructure. (#7739, #7701, #7382, #7662, #7646, #7582, #7407, #7417, #7475, #7474, #7479, #7472, #7626)
Artifacts
You can verify the downloaded packages by running the following in your Unix shell:
echo "<hash> <artifact>" | shasum -a 256 --check
e9334dd8f87d1b6b51bdf7a0efee0936aa0d0329b94688aab3da980b8c834b86 ./xgboost_r_gpu_linux_1.6.0.tar.gz
1a470c948326b060cb8b02f9aca7ca92c48e1006ca411b6c2960dff24ea78594 ./xgboost_r_gpu_win64_1.6.0.tar.gz
035704167465dd9104f13eaa37f8839ca10da183a88745a32bc30e6834bd1738 ./xgboost.tar.gz