Breaking changes
- We fixed a bug in CatBoost `Pool` initialization from `numpy.array` and `pandas.DataFrame` with string values that could cause a slight inconsistency when using a trained model from older versions. Around 1% of categorical feature hashes were treated incorrectly. If you experience a quality drop after the update, you should consider retraining your model. A minimal sketch of the affected initialization is shown below.
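The following sketch, assuming the Python package's `Pool` constructor (column names and data here are purely illustrative), shows the kind of initialization the fix affects:

```python
import pandas as pd
from catboost import Pool

df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],  # string categorical feature
    "size": [1.0, 2.5, 3.0, 0.5],              # numerical feature
})
labels = [0, 1, 1, 0]

# String categorical values are hashed internally; this hashing is what the
# fix corrected, so models trained with older versions may score such pools
# slightly differently and may need retraining.
pool = Pool(data=df, label=labels, cat_features=[0])
```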
Major Features And Improvements
- The algorithm for finding the most influential training samples for a given object from the 'Finding Influential Training Samples for Gradient Boosted Decision Trees' paper is implemented. For every object from the input pool, this mode calculates scores for every object from the train pool. A positive score means that the given train object has made a negative contribution to the given test object prediction, and vice versa for negative scores. The higher the absolute value of the score, the higher the contribution. See the `get_object_importance` model method in the Python package and the `ostr` mode in the CLI version. A tutorial for Python is available here. More details and examples will be published in the documentation soon. A usage sketch is shown after this list.
- We have implemented a new way of exploring feature importance: SHAP values from the paper. These allow you to understand which features are most influential for a given object and to get more insight into your model; see details in a tutorial and the sketch after this list.
- Save model as code functionality is published. For now, you can save a model as Python code with categorical features and as C++ code without categorical features. An export sketch follows after this list.
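A minimal sketch of the object importance mode, assuming the documented `get_object_importance` signature; the toy data and `top_size` value are illustrative:

```python
from catboost import CatBoost, Pool

# Hypothetical pools: train_pool was used to fit the model, test_pool holds
# the objects whose predictions we want to explain.
train_pool = Pool([[0, 1], [1, 0], [1, 1], [0, 0]], label=[1, 0, 1, 0])
test_pool = Pool([[0, 1], [1, 0]], label=[1, 0])

model = CatBoost({"iterations": 10, "logging_level": "Silent"})
model.fit(train_pool)

# Returns indices of influential train objects and their matching scores
# (a positive score means the train object contributed negatively to the
# test prediction); see the tutorial for the exact layout.
indices, scores = model.get_object_importance(test_pool, train_pool, top_size=2)
```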
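A minimal sketch of obtaining SHAP values, assuming the current Python API spelling where they are requested via `get_feature_importance` with `type="ShapValues"`; the toy data is illustrative:

```python
from catboost import CatBoostClassifier, Pool

X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [1, 0, 1, 0]
pool = Pool(X, label=y)

model = CatBoostClassifier(iterations=10, logging_level="Silent")
model.fit(pool)

# Shape: (n_objects, n_features + 1); the last column is the expected value
# (base prediction), and each row's contributions sum to the raw prediction.
shap_values = model.get_feature_importance(pool, type="ShapValues")
contributions, expected = shap_values[:, :-1], shap_values[:, -1]
```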
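A minimal sketch of the save-model-as-code export, assuming `save_model` accepts `format="python"` and `format="cpp"`; whether the `pool` argument is required depends on whether the model uses categorical features:

```python
from catboost import CatBoostClassifier, Pool

train_pool = Pool([[0, 1], [1, 0], [1, 1], [0, 0]], label=[1, 0, 1, 0])
model = CatBoostClassifier(iterations=10, logging_level="Silent")
model.fit(train_pool)

# Export as Python code (categorical features are supported for this format).
model.save_model("model.py", format="python", pool=train_pool)

# Export as C++ code (currently without categorical features support).
model.save_model("model.cpp", format="cpp")
```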
Bug Fixes and Other Changes
- Fixed `_catboost` reinitialization issues #268 and #269.
- The GPU parameter `use_cpu_ram_for_cat_features` was renamed to `gpu_cat_features_storage` with possible values `CpuPinnedMemory` and `GpuRam`. The default is `GpuRam`. A configuration sketch is shown below.
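A minimal configuration sketch of the renamed parameter, assuming GPU training via the Python package (`iterations` is illustrative, and running it requires a GPU build):

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=100,
    task_type="GPU",
    # Replaces the old use_cpu_ram_for_cat_features flag; valid values are
    # "CpuPinnedMemory" and "GpuRam" (the default).
    gpu_cat_features_storage="CpuPinnedMemory",
)
```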
Thanks to our Contributors
This release contains contributions from the CatBoost team.
As usual, we are grateful to all who filed issues or helped resolve them, asked and answered questions.