xgboost 0.81 on Python PyPI

New feature: feature interaction constraints

Users are now able to control which features (independent variables) are allowed to interact by specifying feature interaction constraints (#3466).
A tutorial is available, as well as R and Python examples.
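
As a rough illustration in Python, the constraints can be passed as an ordinary training parameter. This is a minimal sketch, not taken from the tutorial: it assumes the parameter is named interaction_constraints and accepts a nested-list string, and the helper names, feature indices, and other parameter values are hypothetical.

```python
import json


def constrained_params(groups):
    """Build a parameter dict in which only features in the same group may interact."""
    return {
        "eta": 0.1,        # illustrative value
        "max_depth": 4,    # illustrative value
        # Assumed format: a nested-list string such as "[[0, 1], [2, 3, 4]]".
        "interaction_constraints": json.dumps(groups),
    }


def train_constrained(X, y, groups, num_boost_round=50):
    """Train a booster under the given constraints (requires xgboost)."""
    import xgboost as xgb  # imported lazily so the helper above works without it

    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(constrained_params(groups), dtrain, num_boost_round)


# Features 0 and 1 may interact with each other, as may features 2, 3 and 4.
params = constrained_params([[0, 1], [2, 3, 4]])
print(params["interaction_constraints"])
```

Features that never share a group are not allowed to appear together along the same tree path under this setup.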

New feature: learning to rank using scikit-learn interface

The learning-to-rank task is now available for the scikit-learn interface of the Python package (#3560, #3848). It is now possible to integrate the XGBoost ranking model into a scikit-learn learning pipeline.
An example of using the XGBRanker class can be found at demo/rank/rank_sklearn.py.
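
The following is a minimal sketch of how XGBRanker might be used, not a copy of the demo script. It assumes fit() accepts a group argument holding the number of rows per query; the helper names, the query IDs, and the hyperparameter values are hypothetical.

```python
from itertools import groupby


def group_sizes(query_ids):
    """Return the size of each run of consecutive, identical query IDs."""
    return [len(list(run)) for _, run in groupby(query_ids)]


def fit_ranker(X, y, query_ids):
    """Fit an XGBRanker; rows must already be ordered so each query is contiguous."""
    from xgboost import XGBRanker  # lazy import: requires xgboost and scikit-learn

    ranker = XGBRanker(objective="rank:pairwise", n_estimators=50)
    ranker.fit(X, y, group=group_sizes(query_ids))
    return ranker


# Three queries with 2, 3 and 1 candidate documents respectively.
sizes = group_sizes(["q1", "q1", "q2", "q2", "q2", "q3"])
print(sizes)
```

Because the group sizes are positional, the rows of X and y have to be sorted by query before fitting.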

New feature: R interface for SHAP interactions

SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. Previously, this feature was only available from the Python package; now it is available from the R package as well (#3636).

New feature: GPU predictor now uses multiple GPUs to predict

The GPU predictor is now able to utilize multiple GPUs at once to accelerate prediction (#3738).

New feature: Scale distributed XGBoost to large-scale clusters

Fix OS file descriptor limit assertion error on large clusters (#3835, dmlc/rabit#73) by replacing the select()-based AllReduce/Broadcast with a poll()-based implementation.
Mitigate the tracker "thundering herd" issue on large clusters by adding exponential backoff retry when workers connect to the tracker.
With these changes, we were able to scale to 1.5k executors on a 12-billion-row dataset after some additional tuning.
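
The retry idea can be sketched in Python as follows. This is an illustration of exponential backoff in general, not the actual rabit or tracker code: the function name, the default delays, and the added random jitter are all assumptions.

```python
import random
import socket
import time


def connect_with_backoff(host, port, max_retries=5, base_delay=0.5,
                         connect=socket.create_connection, sleep=time.sleep):
    """Try to connect, doubling the wait (plus jitter) after each failure."""
    for attempt in range(max_retries):
        try:
            return connect((host, port))
        except OSError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)
            # Jitter spreads workers out so they do not all retry at once.
            sleep(delay + random.uniform(0, delay))
```

Doubling the delay after each failure reduces the number of simultaneous connection attempts a single tracker has to absorb when many workers start at the same time.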

New feature: Additional objective functions for GPUs

New objective functions ported to GPU: hinge, multi:softmax, multi:softprob, count:poisson, reg:gamma, reg:tweedie.
With supported objectives, XGBoost will select the correct devices based on your system and the n_gpus parameter.

Major bug fix: learning to rank with XGBoost4J-Spark

Previously, repartitionForData would shuffle data and lose the ordering necessary for the ranking task.
To fix this issue, data points within each RDD partition are now explicitly grouped by their group (query session) IDs (#3654). Empty RDD partitions are also handled carefully (#3750).
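
The per-partition grouping can be pictured with a small, Spark-free Python sketch. It is only an illustration of the idea, not the XGBoost4J-Spark code: partitions are modeled as plain lists of (group ID, features) pairs, and the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_partition(rows):
    """Gather rows of one partition by group ID, keeping first-seen group order."""
    first_seen = {}
    for group_id, _ in rows:
        first_seen.setdefault(group_id, len(first_seen))
    # sorted() is stable, so rows keep their relative order inside each group.
    ordered = sorted(rows, key=lambda row: first_seen[row[0]])
    return [(gid, [features for _, features in run])
            for gid, run in groupby(ordered, key=itemgetter(0))]


def group_all(partitions):
    """Apply the grouping per partition; empty partitions yield empty lists."""
    return [group_partition(p) for p in partitions]


partitions = [[("q1", "a"), ("q2", "b"), ("q1", "c")], []]
print(group_all(partitions))
```

Keeping each query's rows contiguous is what lets the ranking objective compare documents within the same query.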

Major bug fix: early stopping fixed in XGBoost4J-Spark

The earlier implementation of early stopping had incorrect semantics and did not let users specify the direction of optimization (maximize / minimize).
A new parameter, maximize_evaluation_metrics, specifies whether a metric should be maximized or minimized as part of the early stopping criteria (#3808). Early stopping now also has correct semantics.
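
Why the direction matters can be shown with a small Python sketch. This is a generic illustration of early stopping, not the XGBoost4J-Spark implementation: the function name, its arguments, and the patience rule are assumptions.

```python
def best_iteration(scores, maximize, patience):
    """Return the index of the best score, stopping after `patience` rounds without improvement."""
    best_idx = 0
    for i, score in enumerate(scores):
        if maximize:
            improved = score > scores[best_idx]
        else:
            improved = score < scores[best_idx]
        if improved:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop early
    return best_idx


auc_like = [0.70, 0.75, 0.74, 0.73, 0.80]
print(best_iteration(auc_like, maximize=True, patience=2))
print(best_iteration(auc_like, maximize=False, patience=2))
```

The same score history gives different stopping points depending on the direction, which is why a metric such as AUC must be flagged as one to maximize.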

API changes

Column sampling by level (colsample_bylevel) is now functional for the hist algorithm (#3635, #3862)
The gpu: tag for regression objectives is now deprecated. XGBoost will select the correct devices automatically (#3643)
Add disable_default_eval_metric parameter to disable default metric (#3606)
Experimental AVX support for gradient computation is removed (#3752)
XGBoost4J-Spark
- Add rank:ndcg and rank:map to supported objectives (#3697)
Python package
- Add callbacks argument to fit() function of scikit-learn API (#3682)
- Add XGBRanker to scikit-learn interface (#3560, #3848)
- Add validate_features argument to predict() function of scikit-learn API (#3653)
- Allow scikit-learn grid search over parameters specified as keyword arguments (#3791)
- Add coef_ and intercept_ as properties of scikit-learn wrapper (#3855). Some scikit-learn functions expect these properties.

Performance improvements

Address very high GPU memory usage for large data (#3635)
Fix performance regression within EvaluateSplits() of gpu_hist algorithm (#3680)

Bug-fixes

Fix a problem in GPU quantile sketch with tiny instance weights (#3628)
Fix copy constructor for HostDeviceVectorImpl to prevent dangling pointers (#3657)
Fix a bug in partitioned file loading (#3673)
Fix an uninitialized pointer in gpu_hist (#3703)
Reshare data among GPUs when the number of GPUs is changed (#3721)
Add back max_delta_step to split evaluation (#3668)
Do not round up integer thresholds for integer features in JSON dump (#3717)
Use dmlc::TemporaryDirectory to handle temporaries in cross-platform way (#3783)
Fix accuracy problem with gpu_hist when min_child_weight and lambda are set to 0 (#3793)
Make sure that tree_method parameter is recognized and not silently ignored (#3849)
XGBoost4J-Spark
- Make sure thresholds are considered when executing predict() method (#3577)
- Avoid losing precision when computing probabilities by converting to Double early (#3576)
- getTreeLimit() should return Int (#3602)
- Fix checkpoint serialization on HDFS (#3614)
- Throw ControlThrowable instead of InterruptedException so that it is properly re-thrown (#3632)
- Remove extraneous output to stdout (#3665)
- Allow specification of task type for custom objectives and evaluations (#3646)
- Fix distributed updater check (#3739)
- Fix issue when spark job execution thread cannot return before we execute first() (#3758)
Python package
- Fix accessing DMatrix.handle before it is set (#3599)
- XGBClassifier.predict() should return margin scores when output_margin is set to true (#3651)
- Early stopping callback should maximize metric of form NDCG@n- (#3685)
- Preserve feature names when slicing DMatrix (#3766)
R package
- Replace nround with nrounds to match actual parameter (#3592)
- Amend xgb.createFolds to handle classes of a single element (#3630)
- Fix buggy random generator and make colsample_bytree functional (#3781)

Maintenance: testing, continuous integration, build system

Add sanitizers tests to Travis CI (#3557)
Add NumPy, Matplotlib, Graphviz as requirements for doc build (#3669)
Comply with CRAN submission policy (#3660, #3728)
Remove copy-paste error in JVM test suite (#3692)
Disable flaky tests in R-package/tests/testthat/test_update.R (#3723)
Make Python tests compatible with scikit-learn 0.20 release (#3731)
Separate out restricted and unrestricted tasks, so that pull requests don't build downloadable artifacts (#3736)
Add multi-GPU unit test environment (#3741)
Allow plug-ins to be built by CMake (#3752)
Test wheel compatibility on CPU containers for pull requests (#3762)
Fix broken doc build due to Matplotlib 3.0 release (#3764)
Produce xgboost.so for XGBoost-R on Mac OSX, so that make install works (#3767)
Retry Jenkins CI tests up to 3 times to improve reliability (#3769, #3775, #3776, #3777)
Add basic unit tests for gpu_hist algorithm (#3785)
Fix Python environment for distributed unit tests (#3806)
Test wheels on CUDA 10.0 container for compatibility (#3838)
Fix JVM doc build (#3853)

Maintenance: Refactor C++ code for legibility and maintainability

Merge generic device helper functions into GPUSet class (#3626)
Re-factor column sampling logic into ColumnSampler class (#3635, #3637)
Replace std::vector with HostDeviceVector in MetaInfo and SparsePage (#3446)
Simplify DMatrix class (#3395)
De-duplicate CPU/GPU code using Transform class (#3643, #3751)
Remove obsoleted QuantileHistMaker class (#3761)
Remove obsoleted NoConstraint class (#3792)

Other Features

C++20-compliant Span class for safe pointer indexing (#3548, #3588)
Add helper functions to manipulate multiple GPU devices (#3693)
XGBoost4J-Spark
- Allow specifying the host IP from the xgboost-tracker.properties file (#3833). This comes in handy when the hosts file doesn't correctly define localhost.

Usability Improvements

Add reference to GitHub repository in pom.xml of JVM packages (#3589)
Add R demo of multi-class classification (#3695)
Document JSON dump functionality (#3600, #3603)
Document CUDA requirement and lack of external memory for GPU algorithms (#3624)
Document LambdaMART objectives, both pairwise and listwise (#3672)
Document aucpr evaluation metric (#3687)
Document gblinear parameters: feature_selector and top_k (#3780)
Add instructions for using MinGW-built XGBoost with Python (#3774)
Remove nonexistent parameter use_buffer from documentation (#3610)
Update Python API doc to include all classes and members (#3619, #3682)
Fix typos and broken links in documentation (#3618, #3640, #3676, #3713, #3759, #3784, #3843, #3852)
Binary classification demo should produce LIBSVM with 0-based indexing (#3652)
Process data once for Python and CLI examples of learning to rank (#3666)
Include full text of Apache 2.0 license in the repository (#3698)
Save predictor parameters in model file (#3856)
JVM packages
- Let users specify feature names when calling getModelDump and getFeatureScore (#3733)
- Warn the user about the lack of over-the-wire encryption (#3667)
- Fix errors in examples (#3719)
- Document choice of trackers (#3831)
- Document that vanilla Apache Spark is required (#3854)
Python package
- Document that custom objective can't contain colon (:) (#3601)
- Show a better error message for failed library loading (#3690)
- Document that feature importance is unavailable for non-tree learners (#3765)
- Document behavior of get_fscore() for zero-importance features (#3763)
- Recommend pickling as the way to save XGBClassifier / XGBRegressor / XGBRanker (#3829)
R package
- Enlarge variable importance plot to make it more visible (#3820)
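
For the pickling recommendation in the Python package notes above, a minimal sketch using the standard pickle module might look like this. The helper names and the file name are hypothetical, and the round trip is shown with a plain dictionary standing in for a fitted XGBClassifier, XGBRegressor, or XGBRanker.

```python
import os
import pickle
import tempfile


def save_estimator(estimator, path):
    """Serialize a fitted estimator to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(estimator, f)


def load_estimator(path):
    """Load an estimator previously written by save_estimator()."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a stand-in object; a fitted scikit-learn wrapper would be used the same way.
stand_in = {"n_estimators": 100, "max_depth": 3}
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    save_estimator(stand_in, model_path)
    restored = load_estimator(model_path)
print(restored == stand_in)
```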

BREAKING CHANGES

External memory page files have changed, breaking backwards compatibility for temporary storage used during external memory training. This only affects external memory users upgrading their xgboost version - we recommend clearing all *.page files before resuming training. Model serialization is unaffected.
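
One way to clear stale page files before resuming training is sketched below. This is an illustrative helper, not part of xgboost: the function name is hypothetical, and it assumes the page files sit directly inside a single cache directory.

```python
from pathlib import Path


def clear_page_files(cache_dir):
    """Delete every *.page file directly under cache_dir; return the removed names."""
    removed = []
    for page_file in sorted(Path(cache_dir).glob("*.page")):
        page_file.unlink()
        removed.append(page_file.name)
    return removed
```

Calling clear_page_files() on the directory used for external memory caching leaves all other files, including saved models, untouched.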

Known issues

Quantile sketcher fails to produce any quantile for some edge cases (#2943)
The hist algorithm leaks memory when used with learning rate decay callback (#3579)
Using a custom evaluation function together with early stopping causes an assertion failure in XGBoost4J-Spark (#3595)
Early stopping doesn't work with gblinear learner (#3789)
Label and weight vectors are not reshared upon the change in number of GPUs (#3794). To get around this issue, delete the DMatrix object and re-load.
The DMatrix Python objects are initialized with incorrect values when given array slices (#3841)
The gpu_id parameter is broken and not yet properly supported (#3850)

Acknowledgement

Contributors (in no particular order): Hyunsu Cho (@hcho3), Jiaming Yuan (@trivialfis), Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), Andy Adinets (@canonizer), Vadim Khotilovich (@khotilov), Sergei Lebedev (@superbobry)

First-time Contributors (in no particular order): Matthew Tovbin (@tovbinm), Jakob Richter (@jakob-r), Grace Lam (@grace-lam), Grant W Schneider (@grantschneider), Andrew Thia (@BlueTea88), Sergei Chipiga (@schipiga), Joseph Bradley (@jkbradley), Chen Qin (@chenqin), Jerry Lin (@linjer), Dmitriy Rybalko (@rdtft), Michael Mui (@mmui), Takahiro Kojima (@515hikaru), Bruce Zhao (@BruceZhaoR), Wei Tian (@weitian), Saumya Bhatnagar (@Sam1301), Juzer Shakir (@JuzerShakir), Zhao Hang (@cleghom), Jonathan Friedman (@jontonsoup), Bruno Tremblay (@meztez), Boris Filippov (@frenzykryger), @Shiki-H, @mrgutkun, @gorogm, @htgeis, @jakehoare, @zengxy, @KOLANICH

First-time Reviewers (in no particular order): Nikita Titov (@StrikerRUS), Xiangrui Meng (@mengxr), Nirmal Borah (@Nirmal-Neel)

This is a stable release of xgboost version 0.81 on Python PyPI.