catboost/catboost v0.10.0 on GitHub

Breaking changes

R package

In R package we have changed parameter name target to label in method save_pool()

Python package

We don't support Python 3.4 anymore
CatBoostClassifier and CatBoostRegressor get_params() method now returns only the params that were explicitly set when constructing the object. That means that CatBoostClassifier and CatBoostRegressor get_params() will not contain 'loss_function' if it was not specified.
This also means that this code:

model1 = CatBoostClassifier()
params = model1.get_params()
model2 = CatBoost(params)

will create model2 with default loss_function RMSE, not with Logloss.
This breaking change is done to support sklearn interface, so that sklearn GridSearchCV can work.

We've removed several attributes and changed them to functions. This was needed to avoid sklearn warnings:
is_fitted_ => is_fitted()
metadata_ => get_metadata()
We removed file with model from constructor of estimator. This was also done to avoid sklearn warnings.

Educational materials

We added tutorial for our ranking modes.
We published our slides, you are very welcome to use them.

Improvements

All

Now it is possible to save model in json format.
We have added Java interface for CatBoost model
We now have static linkage with CUDA, so you don't have to install any particular version of CUDA to get catboost working on GPU.
We implemented both multiclass modes on GPU, it is very fast.
It is possible now to use multiclass with string labels, they will be inferred from data
Added use_weights parameter to metrics. By default all metrics, except for AUC use weights, but you can disable it. To calculate metric value without weights, you need to set this parameter to false. Example: Accuracy:use_weights=false. This can be done only for custom_metrics or eval_metric, not for the objective function. Objective function always uses weights if they are present in the dataset.
We now use snapshot time intervals. It will work much faster if you save snapshot every 5 or 10 minutes instead of saving it on every iteration.
Reduced memory consumption by ranking modes.
Added automatic feature importance evaluation after completion of GPU training.
Allow inexistent indexes in ignored features list
Added new metrics: LogLikelihoodOfPrediction, RecallAt:top=k, PrecisionAt:top=k and MAP:top=k.
Improved quality for multiclass with weighted datasets.
Pairwise modes now support automatic pairs generation (see tutorial for that).
Metric QueryAverage is renamed to a more clear AverageGain. This is a very important ranking metric. It shows average target value in top k documents of a group.
Introduced parameter best_model_min_trees - the minimal number of trees the best model should have.

Python

We now support sklearn GridSearchCV: you can pass categorical feature indices when constructing estimator. And then use it in GridSearchCV.
We added new method to utils - building of ROC curve: get_roc_curve.
Added get_gpu_device_count() method to python package. This is a way to check if your CUDA devices are available.
We implemented automatical selection of decision-boundary using ROC curve. You can select best classification boundary given the maximum FPR or FNR that you allow to the model. Take a look on catboost.select_threshold(self, data=None, curve=None, FPR=None, FNR=None, thread_count=-1). You can also calculate FPR and FNR for each boundary value.
We have added pool slicing: pool.slice(doc_indices)
Allow GroupId and SubgroupId specified as strings.

R package

GPU support in R package. You need to use parameter task_type='GPU' to enable GPU training.
Models in R can be saved/restored by means of R: save/load or saveRDS/readRDS

Speedups

New way of loading data in Python using FeaturesData structure. Using FeaturesData will speed up both loading data for training and for prediction. It is especially important for prediction, because it gives around 10 to 20 times python prediction speedup.
Training multiclass on CPU ~ 60% speedup
Training of ranking modes on CPU ~ 50% speedup
Training of ranking modes on GPU ~ 50% speedup for datasets with many features and not very many objects
Speedups of metric calculation on GPU. Example of speedup on our internal dataset: training with - AUC eval metric with test dataset with 2kk objects is speeded up 7sec => 0.2 seconds per iteration.
Speedup of all modes on CPU training.

We also did a lot of stability improvements, and improved usability of the library, added new parameter synonyms and improved input data validations.

Thanks a lot to all people who created issues on github. And thanks a lot to our contributor https://github.com/pukhlyakova who implemented many new useful metrics!