catboost/catboost v0.25 on GitHub

CatBoost for Apache Spark

This release includes the CatBoost for Apache Spark package, which supports training, model application and feature evaluation on the Apache Spark platform. We've prepared two introductory videos: CatBoost for Apache Spark introduction and CatBoost for Apache Spark Architecture. More details are available on the CatBoost for Apache Spark home page.

Feature selection

CatBoost now supports a recursive feature elimination procedure: when you have lots of candidate features and want to keep only the most influential ones, it trains models repeatedly and retains the strongest features by feature importance. See our tutorial for details.
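To make the procedure concrete, here is a minimal, library-agnostic sketch of recursive feature elimination in plain Python. It is not CatBoost's implementation or API: the function name, the `importance_fn` callback, the `drop_per_round` parameter and the toy scores are all hypothetical and exist only for illustration. In real use the callback would train a model on the given feature subset and return its feature importances.

```python
from typing import Callable, Dict, List, Sequence


def recursive_feature_elimination(
    features: Sequence[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    num_to_select: int,
    drop_per_round: int = 1,
) -> List[str]:
    """Drop the weakest features round by round until num_to_select remain.

    importance_fn is a hypothetical callback: given the current feature
    subset, it retrains a model and returns {feature_name: importance}.
    """
    if not 1 <= num_to_select <= len(features):
        raise ValueError("num_to_select must be between 1 and len(features)")
    if drop_per_round < 1:
        raise ValueError("drop_per_round must be at least 1")

    remaining = list(features)
    while len(remaining) > num_to_select:
        # Importances are recomputed every round on the surviving subset.
        importances = importance_fn(remaining)
        weakest_first = sorted(remaining, key=lambda name: importances[name])
        # Never drop more than needed to reach the target size.
        n_drop = min(drop_per_round, len(remaining) - num_to_select)
        dropped = set(weakest_first[:n_drop])
        remaining = [name for name in remaining if name not in dropped]
    return remaining


# Made-up importances standing in for "train a model and read its importances".
TOY_SCORES = {
    "age": 0.40,
    "income": 0.35,
    "zip_code": 0.05,
    "noise_a": 0.01,
    "noise_b": 0.02,
}


def toy_importance(subset: List[str]) -> Dict[str, float]:
    return {name: TOY_SCORES[name] for name in subset}


selected = recursive_feature_elimination(
    list(TOY_SCORES), toy_importance, num_to_select=2, drop_per_round=2
)
print(selected)
```

Dropping several features per round trades accuracy of the ranking for fewer retraining passes, which matters when each pass is a full model fit.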

New features

Added support for the exact leaf estimation method for Quantile, MAE and MAPE losses on GPU. Enable it explicitly by setting leaf_estimation_method=Exact; we plan to make it the default in upcoming releases.
Added support for uncertainty prediction for multiclassification models
#1568 Added support for SHAP values calculation for MultiRMSE models
#1520 Added support for pathlib.Path in the Python package
#1456 Added prehashed categorical features and text features to C API for model inference.
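
For intuition about why an exact leaf estimate is possible for quantile-type losses: the constant that minimizes MAE over a set of residuals is their median, and for the quantile loss at level alpha it is an alpha-quantile. The sketch below checks this on made-up numbers in plain Python. It only illustrates the idea and is an assumption about what "exact" refers to here; it is not CatBoost's GPU implementation, and the function names are hypothetical.

```python
import math
from typing import Sequence


def pinball_loss(value: float, residuals: Sequence[float], alpha: float) -> float:
    """Total quantile (pinball) loss when `value` is predicted for every residual."""
    total = 0.0
    for r in residuals:
        diff = r - value
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total


def exact_leaf_value(residuals: Sequence[float], alpha: float = 0.5) -> float:
    """Return an alpha-quantile of the residuals, which minimizes the pinball loss."""
    ordered = sorted(residuals)
    # Smallest element with at least alpha * n residuals at or below it.
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


residuals = [-3.0, -1.0, 0.5, 2.0, 10.0]  # made-up residuals falling into one leaf

median = exact_leaf_value(residuals, alpha=0.5)  # alpha=0.5 corresponds to MAE
mean = sum(residuals) / len(residuals)

print("median:", median, "loss:", pinball_loss(median, residuals, 0.5))
print("mean:  ", mean, "loss:", pinball_loss(mean, residuals, 0.5))
```

The median gives a lower absolute-error total than the mean on this data, which is why a quantile of the residuals, rather than an averaged gradient step, is the natural leaf value for these losses.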

Losses and metrics

Supported Huber and Tweedie losses in GPU training
QueryAUC metric implemented by @fibersel

Breaking changes

We changed how NDCG is calculated for groups without relevant documents to make our NDCG score fully compatible with the XGBoost and LightGBM implementations. When a group contains no relevant objects (that is, when the ideal DCG equals zero), the score for that group is now 1; previously it was 0.
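
The effect of this change can be sketched in a few lines of plain Python. This is an illustration only, not CatBoost's code: it assumes a linear gain and a log2 position discount, and the `score_when_no_relevant` parameter is a hypothetical knob used to show the old and new conventions side by side.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and log2 discount (assumed form)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances_in_predicted_order: Sequence[float],
         score_when_no_relevant: float = 1.0) -> float:
    """NDCG for one group; falls back to a fixed score when the ideal DCG is zero."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    if ideal == 0.0:
        # No relevant objects in the group: the ratio is undefined, so use a convention.
        return score_when_no_relevant
    return dcg(relevances_in_predicted_order) / ideal


groups = [
    [0, 0, 0],  # no relevant objects at all
    [0, 1],     # the single relevant object is ranked second
]

new_scores = [ndcg(g, score_when_no_relevant=1.0) for g in groups]  # convention after this release
old_scores = [ndcg(g, score_when_no_relevant=0.0) for g in groups]  # convention before this release

print("new mean NDCG:", sum(new_scores) / len(new_scores))
print("old mean NDCG:", sum(old_scores) / len(old_scores))
```

Because the metric is averaged over groups, datasets with many groups that have no relevant objects will report noticeably higher NDCG after upgrading, even though the model's ranking is unchanged.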

Speedups

With the help of the Intel developer team, we switched our threading model implementation to Intel Threading Building Blocks. This gives up to a 20% speedup on 28 threads and around a 2x speedup when training on 120 threads, and it substantially improves scalability.
Sped up rendering of fstat plots.
Slightly sped up string casting in the Python package during pool creation.

R package

Added path expansion when saving/loading files in R by @david-cortes
Added functionality to restore R handle after deserializing model by @david-cortes
Retrieve R pointers outside loops to speed up scalar access by @david-cortes
Multiple R documentation edits from @david-cortes and @jameslamb
#1588 Added precision for converting params to json

Bugfixes

#1525 Problem with missing exported functions in Windows R package dll
#1315 Low CPU utilization in CPU cross-validation
#785 Predict on single item with iloc fixed by @feeeper
Segfaults due to null pointer in pool in R package fixed by @david-cortes
#1553 Added check for baseline dimensions count in apply
#1606 Allow to use CatBoost in AWS Lambda environment: fix bug with setting thread names
#1609 and #1309 Print proper error message if all params in grid were invalid
Ability to use docstrings in estimators added by @pawelopiela
Allow extra space at the end of line for libsvm format

Thanks!

We would like to recognize the Intel software engineering team’s contributions to the CatBoost project.
Many thanks to our individual contributors: @david-cortes @jameslamb @pawelopiela @feeeper @fibersel
