- It is now possible to train models on huge datasets that do not fit into CPU RAM.
This is accomplished by storing only quantized data in memory, which is many times smaller than the raw data. Use the `catboost.utils.quantize` function to create a quantized `Pool` this way. See the usage example in issue #1116.
Implemented by @noxwell.
- The Python `Pool` class now has a `save_quantization_borders` method that allows saving the resulting borders into a file and using them for quantization of other datasets. Quantization can be a bottleneck of training, especially on GPU; doing quantization once for several trainings can significantly reduce running time. For large datasets it is recommended to perform quantization first, save the quantization borders, use them to quantize the validation dataset, and then use the quantized training and validation datasets for further training.
Use saved borders when quantizing other Pools by specifying the `input_borders` parameter.
Implemented by @noxwell.
- Training with text features is now supported on CPU
- It is now possible to set `border_count` > 255 for GPU training. This might be useful if you have a "golden feature", see the docs.
- Feature weights are implemented.
Specify weights for specific features by index or name via the `feature_weights` parameter.
Scores for splits with these features will be multiplied by the corresponding weights.
Implemented by @Taube03.
- Feature penalties can be used for cost-efficient gradient boosting.
Penalties are specified in a similar fashion to feature weights, using the `first_feature_use_penalties` parameter.
This parameter penalizes the first usage of a feature and should be used when the calculation of the feature is costly.
The penalty value (the cost of using a feature) is subtracted from the scores of splits on this feature as long as the feature has not yet been used in the model.
After the feature has been used once, it is considered free to keep using it, so no further subtraction is done.
There is also a common multiplier for all `first_feature_use_penalties`; it can be specified via the `penalties_coefficient` parameter.
Implemented by @Taube03 (issue #1155).
- `recordCount` attribute is added to PMML models (issue #1026).
## New losses and metrics
- New ranking objective `StochasticRank`; details in the paper.
- `Tweedie` loss is now supported. It can be a good solution for a right-skewed target with many zero values; see the tutorial.
The `prediction_type` for this loss will be equal to `Exponent`. Implemented by @ilya-pchelintsev (issue #577).
- Classification metrics now support a new parameter `proba_border`. With this parameter you can set the decision boundary for treating a prediction as negative or positive. Implemented by @ivanychev.
- `TotalF1` supports a new parameter `average` with possible value `macro`. Implemented by @ilya-pchelintsev.
- It is now possible to specify a custom multi-label metric in Python. Note that it is only possible to calculate this metric and use it as `eval_metric`; it is not possible to use it as an optimization objective.
To write a multi-label metric, you need to define a Python class which inherits from the `MultiLabelCustomMetric` class. Implemented by @azikmsu.
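A pure-Python sketch of the custom-metric contract (`is_max_optimal` / `evaluate` / `get_final_error`); in real use the class should inherit from CatBoost's `MultiLabelCustomMetric`, which is deliberately not imported here so the sketch runs standalone:

```python
import numpy as np

class SubsetAccuracy:
    """Fraction of objects whose labels are ALL predicted correctly."""

    def is_max_optimal(self):
        return True  # larger is better

    def evaluate(self, approxes, targets, weight):
        # approxes/targets: one row per label, one column per object.
        approxes = np.asarray(approxes)
        targets = np.asarray(targets)
        predictions = (approxes > 0.0).astype(int)
        exact = (predictions == targets).all(axis=0)  # every label correct
        return float(exact.sum()), float(exact.size)  # (error_sum, weight_sum)

    def get_final_error(self, error, weight):
        return error / weight if weight else 0.0

metric = SubsetAccuracy()
approxes = [[1.0, -1.0, 1.0], [1.0, 1.0, -1.0]]
targets = [[1, 0, 1], [1, 1, 1]]
error, weight = metric.evaluate(approxes, targets, None)
score = metric.get_final_error(error, weight)  # 2 of 3 objects fully correct
```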
## Improvements of grid and randomized search
- `class_weights` parameter is now supported in grid/randomized search. Implemented by @vazgenk.
- Invalid option configurations are automatically skipped during grid/randomized search. Implemented by @borzunov.
- `get_best_score` returns the train/validation best score after grid/randomized search (in case of `refit=False`). Implemented by @rednevaler.
## Improvements of model analysis tools
- Computation of SHAP interaction values for CatBoost models. You can pass `type=EFstrType.ShapInteractionValues` to `CatBoost.get_feature_importance` to get a matrix of SHAP interaction values for every prediction.
By default, SHAP interaction values are calculated for all features; you may also restrict the calculation to features of interest.
Implemented by @IvanKozlov98.
- SHAP values can now be calculated approximately, which is much faster than the default mode. To use this mode, set the SHAP calculation type to `"Approximate"`. Implemented by @LordProtoss (issue #1146).
- `PredictionDiff` model analysis method can now be used with models that contain non-symmetric trees. Implemented by @felixandrer.
## New educational materials
- A tutorial on Tweedie regression.
- A tutorial on Poisson regression.
- A detailed tutorial on the different types of the AUC metric, which explains how they can be used for binary classification, multi-class classification, and ranking tasks.
- When using the `CatBoostRegressor.predict` function for models trained with these losses, the `prediction_type` will be equal to `Exponent` (issue #1184). Implemented by @garkavem.
This release also contains bug fixes and performance improvements, including a major speedup for sparse data on GPU.