CatBoost for Apache Spark
This release includes the CatBoost for Apache Spark package, which supports training, model application, and feature evaluation on the Apache Spark platform. We've prepared two introductory videos: CatBoost for Apache Spark introduction and CatBoost for Apache Spark Architecture. More details are available on the CatBoost for Apache Spark home page.
CatBoost now supports a recursive feature elimination procedure. When you have lots of feature candidates and want to keep only the most influential ones, it trains models repeatedly and keeps only the strongest features by feature importance. See our tutorial for details.
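To make the idea concrete, here is a minimal, library-free sketch of a recursive elimination loop. It is not CatBoost's implementation: the function names are hypothetical, and absolute Pearson correlation with the target stands in for the feature importance that a retrained model would supply at each step.

```python
import math
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; a stand-in for a model's feature importance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_x * var_y)


def recursive_elimination(columns, target, num_to_select, steps=3):
    """Drop the weakest features over several steps until num_to_select remain.

    columns: dict mapping feature name -> list of values (hypothetical layout).
    """
    remaining = dict(columns)
    for step in range(steps):
        excess = len(remaining) - num_to_select
        if excess <= 0:
            break
        # Spread the remaining eliminations evenly over the steps that are left.
        drop_now = math.ceil(excess / (steps - step))
        # A real procedure would retrain a model on `remaining` here and read
        # its feature importances; this sketch uses correlation instead.
        scores = {name: abs_correlation(col, target) for name, col in remaining.items()}
        for name in sorted(scores, key=scores.get)[:drop_now]:
            del remaining[name]
    # Return the surviving feature names, strongest first.
    return sorted(
        remaining,
        key=lambda name: abs_correlation(remaining[name], target),
        reverse=True,
    )


# Synthetic data: two informative features and four pure-noise features.
rng = random.Random(0)
n_rows = 500
columns = {f"noise_{i}": [rng.gauss(0, 1) for _ in range(n_rows)] for i in range(4)}
columns["signal_a"] = [rng.gauss(0, 1) for _ in range(n_rows)]
columns["signal_b"] = [rng.gauss(0, 1) for _ in range(n_rows)]
target = [
    3 * a + 2 * b + rng.gauss(0, 0.5)
    for a, b in zip(columns["signal_a"], columns["signal_b"])
]

selected = recursive_elimination(columns, target, num_to_select=2, steps=3)
print(selected)
```

Eliminating in several steps rather than all at once lets the importances be re-estimated after each batch of weak features is removed, which matters when the importance comes from a trained model rather than from a fixed statistic as in this sketch.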
- Supported the exact leaves estimation method for quantile, MAE and MAPE losses on GPU. You can enable it by setting `leaf_estimation_method=Exact` explicitly; in upcoming releases we plan to make it the default.
- Supported uncertainty prediction for multiclassification models
- #1568 Added support for SHAP values calculation for MultiRMSE models
- #1520 Added support for `pathlib.Path` in the Python package
- #1456 Added prehashed categorical features and text features to C API for model inference.
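As background for the exact leaves estimation item above, the following sketch shows why an exact method exists for MAE: the constant that minimizes mean absolute error over a set of residuals is their median. The residual values and helper name are hypothetical, and the snippet only illustrates the underlying math, assuming that "exact" means choosing the loss-minimizing leaf value directly. It does not reproduce CatBoost's GPU implementation.

```python
import statistics


def mae(residuals, leaf_value):
    """Mean absolute error of a single constant leaf value over the residuals."""
    return sum(abs(r - leaf_value) for r in residuals) / len(residuals)


# Hypothetical residuals of the objects that fell into one leaf.
residuals = [-2.0, -0.5, 0.1, 0.4, 9.0]

exact_value = statistics.median(residuals)  # exact minimizer of MAE
mean_value = statistics.fmean(residuals)    # minimizer of squared error, for contrast

print(mae(residuals, exact_value), mae(residuals, mean_value))
```

For a quantile loss with level alpha, the same argument gives the alpha-quantile of the residuals instead of the median.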
Losses and metrics
- Supported Huber and Tweedie losses in GPU training
- QueryAUC metric implemented by @fibersel
- We changed how NDCG is calculated for groups without relevant docs to make our NDCG score fully compatible with the XGBoost and LightGBM implementations. We now use `dcg==1` when there are no relevant objects in a group (that is, when the ideal DCG equals zero); previously we used `score==0` in that case.
- With the help of the Intel developers team, we switched our threading model implementation to Intel Threading Building Blocks. This gives up to a 20% speedup on 28 threads and around a 2x speedup when training on 120 threads, and largely improves scalability.
- Speed up rendering fstat plots.
- Slightly speed up string casting in python package during pool creation.
- Added path expansion when saving/loading files in R by @david-cortes
- Added functionality to restore R handle after deserializing model by @david-cortes
- Retrieve R pointers outside loops to speed up scalar access by @david-cortes
- Multiple R documentation edits from @david-cortes and @jameslamb
- #1588 Added precision for converting params to json
- #1525 Problem with missing exported functions in Windows R package dll
- #1315 Low CPU utilization in CPU cross-validation
- #785 Predict on single item with iloc fixed by @feeeper
- Segfaults due to null pointer in pool in R package fixed by @david-cortes
- #1553 Added check for baseline dimensions count in apply
- #1606 Allow using CatBoost in the AWS Lambda environment: fixed a bug with setting thread names
- #1609 and #1309 Print proper error message if all params in grid were invalid
- Ability to use docstrings in estimators added by @pawelopiela
- Allow extra space at the end of line for libsvm format
- We would like to recognize the Intel software engineering team's contributions to the CatBoost project.
- Many thanks to our individual contributors: @david-cortes @jameslamb @pawelopiela @feeeper @fibersel
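As an illustration of the NDCG change listed under Losses and metrics, here is a minimal sketch of a per-group NDCG that returns 1 when the group contains no relevant objects. The linear gain, the log2 position discount, and the `score_without_relevant` parameter are assumptions made for this example, not a description of CatBoost's internals.

```python
import math


def dcg(relevances):
    """DCG with linear gain and a log2 position discount (assumed form)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances_in_predicted_order, score_without_relevant=1.0):
    """NDCG for one group; relevances are listed in the model's ranking order.

    score_without_relevant is a hypothetical knob: 1.0 mirrors the new
    convention described in these notes, 0.0 mirrors the previous one.
    """
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    if ideal == 0:
        # No relevant objects in the group, so the ratio is undefined.
        return score_without_relevant
    return dcg(relevances_in_predicted_order) / ideal


print(ndcg([0, 0, 0]))                              # group without relevant docs
print(ndcg([0, 0, 0], score_without_relevant=0.0))  # previous convention
print(ndcg([0, 1]))                                 # relevant doc ranked second
```

Because the metric is averaged over groups, the convention chosen for such groups shifts the reported score, which is why matching the other libraries matters when comparing results.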