New functionality
- It is now possible to train models on huge datasets that do not fit into CPU RAM. This can be accomplished by storing only quantized data in memory (it is many times smaller). Use the `catboost.utils.quantize` function to create a quantized `Pool` this way. See the usage example in issue #1116.
Implemented by @noxwell.
- Python `Pool` class now has a `save_quantization_borders` method that allows saving the resulting borders to a file and using them for quantization of other datasets. Quantization can be a bottleneck of training, especially on GPU. Performing quantization once for several trainings can significantly reduce running time. For large datasets it is recommended to perform quantization first, save the quantization borders, use them to quantize the validation dataset, and then use the quantized training and validation datasets for further training. Use saved borders when quantizing other pools by specifying the `input_borders` parameter of the `quantize` method.
Implemented by @noxwell.
- Training with text features is now supported on CPU.
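The border-reuse idea described above can be illustrated with a minimal pure-Python sketch. This is a conceptual model of what quantization borders do, not the CatBoost implementation; the function names `compute_borders` and `quantize` here are illustrative, and the real workflow uses `Pool.quantize`, `save_quantization_borders`, and `input_borders` as described above.

```python
import bisect

def compute_borders(values, border_count):
    """Pick candidate borders as midpoints between distinct sorted values,
    keeping at most border_count of them (a simplified scheme)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if len(candidates) <= border_count:
        return candidates
    step = len(candidates) / border_count
    return [candidates[int(i * step)] for i in range(border_count)]

def quantize(values, borders):
    """Map each float value to a small integer bin index via the borders."""
    return [bisect.bisect_right(borders, v) for v in values]

# Compute borders once on the training data...
train = [0.1, 0.4, 0.4, 0.9, 1.5]
borders = compute_borders(train, border_count=3)
train_q = quantize(train, borders)

# ...then reuse the same borders on validation data, so bin indices stay consistent
valid = [0.0, 0.5, 2.0]
valid_q = quantize(valid, borders)
```

Storing only the small integer bin indices instead of the original floats is what makes the quantized dataset many times smaller than the raw one.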
- It is now possible to set `border_count` > 255 for GPU training. This might be useful if you have a "golden feature", see the docs.
- Feature weights are implemented.
Specify weights for individual features by index or name, e.g. `feature_weights="FeatureName1:1.5,FeatureName2:0.5"`. Scores for splits with these features will be multiplied by the corresponding weights.
Implemented by @Taube03.
- Feature penalties can be used for cost-efficient gradient boosting.
Penalties are specified similarly to feature weights, using the `first_use_feature_penalties` parameter. This parameter penalizes the first usage of a feature, and should be used when the calculation of the feature is costly. The penalty value (the cost of using the feature) is subtracted from the scores of splits on this feature if the feature has not yet been used in the model. After the feature has been used once, it is considered free to keep using it, so no further subtraction is done. There is also a common multiplier for all `first_use_feature_penalties`, which can be specified with the `penalties_coefficient` parameter.
Implemented by @Taube03 (issue #1155).
- `recordCount` attribute is added to PMML models (issue #1026).
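The feature weight and first-use penalty mechanics described above can be sketched in a few lines. This is a simplified illustration of the scoring rule, not CatBoost's internal split-scoring code; the function `adjusted_score` is a name invented for this sketch.

```python
def adjusted_score(raw_score, feature, used_features,
                   feature_weights, first_use_penalties, penalties_coefficient=1.0):
    """Illustrative split-score adjustment: multiply by the feature's weight,
    and subtract its first-use penalty if the feature is not yet in the model."""
    score = raw_score * feature_weights.get(feature, 1.0)
    if feature not in used_features:
        score -= penalties_coefficient * first_use_penalties.get(feature, 0.0)
    return score

weights = {"FeatureName1": 1.5, "FeatureName2": 0.5}
penalties = {"CostlyFeature": 3.0}

# FeatureName1's split score is boosted by its weight
s1 = adjusted_score(10.0, "FeatureName1", set(), weights, penalties)
# CostlyFeature pays a one-time penalty while unused...
s2 = adjusted_score(10.0, "CostlyFeature", set(), weights, penalties)
# ...but once it has entered the model, its splits are no longer penalized
s3 = adjusted_score(10.0, "CostlyFeature", {"CostlyFeature"}, weights, penalties)
```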
New losses and metrics
- New ranking objective `StochasticRank`; see the paper for details.
- `Tweedie` loss is now supported. It can be a good solution for a right-skewed target with many zero values, see the tutorial.
When using the `CatBoostRegressor.predict` function, the default `prediction_type` for this loss will be `Exponent`. Implemented by @ilya-pchelintsev (issue #577).
- Classification metrics now support a new parameter `proba_border`. With this parameter you can set the decision boundary for treating a prediction as negative or positive. Implemented by @ivanychev.
- Metric `TotalF1` supports a new parameter `average` with possible values `weighted`, `micro`, and `macro`. Implemented by @ilya-pchelintsev.
- It is now possible to specify a custom multi-label metric in Python. Note that it is only possible to calculate this metric and use it as
`eval_metric`. It is not possible to use it as an optimization objective.
To write a multi-label metric, you need to define a Python class which inherits from the `MultiLabelCustomMetric` class. Implemented by @azikmsu.
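The three `average` modes for `TotalF1` correspond to the standard F1 averaging definitions. A small self-contained sketch of the difference (written from the standard definitions, not CatBoost source; `total_f1` is an illustrative name):

```python
def f1(tp, fp, fn):
    """Per-class F1 from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def total_f1(per_class, average):
    """per_class: list of (tp, fp, fn) tuples, one per class."""
    if average == "micro":
        # Pool the counts over all classes, then compute a single F1
        tp = sum(c[0] for c in per_class)
        fp = sum(c[1] for c in per_class)
        fn = sum(c[2] for c in per_class)
        return f1(tp, fp, fn)
    scores = [f1(*c) for c in per_class]
    if average == "macro":
        # Unweighted mean of per-class F1 scores
        return sum(scores) / len(scores)
    if average == "weighted":
        # Mean of per-class F1 weighted by class support (tp + fn)
        supports = [tp + fn for tp, _, fn in per_class]
        return sum(s * w for s, w in zip(scores, supports)) / sum(supports)
    raise ValueError(average)

stats = [(8, 2, 0), (1, 0, 3)]  # class 0: support 8, class 1: support 4
micro = total_f1(stats, "micro")
macro = total_f1(stats, "macro")
weighted = total_f1(stats, "weighted")
```

`micro` favors frequent classes through pooled counts, `macro` treats every class equally, and `weighted` sits in between by weighting each class's F1 by its support.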
Improvements of grid and randomized search
- `class_weights` parameter is now supported in grid/randomized search. Implemented by @vazgenk.
- Invalid option configurations are automatically skipped during grid/randomized search. Implemented by @borzunov.
- `get_best_score` returns the train/validation best score after grid/randomized search (in case of `refit=False`). Implemented by @rednevaler.
Improvements of model analysis tools
- Computation of SHAP interaction values for CatBoost models. You can pass `type=EFstrType.ShapInteractionValues` to `CatBoost.get_feature_importance` to get a matrix of SHAP interaction values for every prediction. By default, SHAP interaction values are calculated for all features. You may specify features of interest using the `interaction_indices` argument.
Implemented by @IvanKozlov98.
- SHAP values can now be calculated approximately, which is much faster than the default mode. To use this mode, set the `shap_calc_type` parameter of the `CatBoost.get_feature_importance` function to `"Approximate"`. Implemented by @LordProtoss (issue #1146).
- `PredictionDiff` model analysis method can now be used with models that contain non-symmetric trees. Implemented by @felixandrer.
New educational materials
- A tutorial on Tweedie regression
- A tutorial on Poisson regression
- A detailed tutorial on different types of the AUC metric, which explains how different types of AUC can be used for binary classification, multiclassification, and ranking tasks.
Breaking changes
- When using the `CatBoostRegressor.predict` function for models trained with `Poisson` loss, the default `prediction_type` will be `Exponent` (issue #1184). Implemented by @garkavem.
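For losses like `Poisson` and `Tweedie`, the model's raw score approximates the logarithm of the target mean, so the `Exponent` prediction type exponentiates the raw value to produce strictly positive predictions. A minimal sketch of the transform (the function name `apply_prediction_type` is illustrative, not the CatBoost API):

```python
import math

def apply_prediction_type(raw_values, prediction_type):
    """Convert raw model scores into predictions (simplified illustration)."""
    if prediction_type == "RawFormulaVal":
        return list(raw_values)
    if prediction_type == "Exponent":
        # exp guarantees strictly positive output, as required for count-like targets
        return [math.exp(v) for v in raw_values]
    raise ValueError(prediction_type)

raw = [-0.5, 0.0, 2.0]
preds = apply_prediction_type(raw, "Exponent")
```

This is why code that previously post-processed `RawFormulaVal` output with its own `exp` call needs updating for models trained with these losses.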
This release also contains bug fixes and performance improvements, including a major speedup for sparse data on GPU.