github catboost/catboost v1.0.0
1.0.0

In this release, we decided to increment the major version because we think CatBoost is stable and production-ready. We know that CatBoost is widely used in many companies and individual projects, and we think that the features we added over the last year are worth a major version bump. And of course, like many programmers, we love the magic of binary numbers, and we want to celebrate the 100₂ anniversary of CatBoost's first release on GitHub 🥳

New losses

  • We've implemented a multi-label multiclass loss function that allows us to predict multiple labels for each object #1420
  • Added LogCosh loss implementation #844
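
For reference, the LogCosh loss named above is commonly defined as log(cosh(prediction - target)). The sketch below is a plain-Python illustration of that formula under this assumption, not CatBoost's implementation; the function name log_cosh_loss is hypothetical. The loss behaves like squared error for small residuals and like absolute error for large ones.

```python
import math

def log_cosh_loss(predictions, targets):
    """Mean log(cosh(pred - target)) over all objects (illustrative only)."""
    total = 0.0
    for pred, target in zip(predictions, targets):
        r = abs(pred - target)
        # Stable form of log(cosh(r)): r + log(1 + exp(-2r)) - log(2).
        # It avoids the overflow that cosh(r) would hit for large residuals.
        total += r + math.log1p(math.exp(-2.0 * r)) - math.log(2.0)
    return total / len(predictions)

# A perfect prediction contributes zero loss.
print(log_cosh_loss([1.0, 2.0], [1.0, 2.0]))
```

In CatBoost itself a loss is selected by name through the loss_function training parameter rather than written by hand.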

Fully distributed CatBoost for Apache Spark

  • In this release the Apache Spark package became truly distributed. In the previous version, CatBoost stored test datasets in the controller process memory; now test datasets are split evenly across workers.

Major speedup on CPU

We've improved training speed on numeric datasets:

  • 28% speedup on the Higgs dataset (1000 trees, binclass, 16-core Intel CPU): 405 seconds -> 315 seconds
  • 20% speedup on a small numeric dataset (480K rows, 60 features, 100 trees, binclass, 16-core Intel CPU): 3.7 seconds -> 2.9 seconds
  • 53% speedup on the sparse one-hot encoded airlines dataset (1000 trees): 381 seconds -> 249 seconds
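
The notes do not say how the percentages were derived. A minimal sketch, assuming a speedup is measured as the old time divided by the new time, minus one (the helper name speedup_percent is hypothetical). Other conventions, such as the fraction of time saved (1 - new/old), give different figures.

```python
def speedup_percent(old_seconds, new_seconds):
    """Relative speedup as old/new - 1, expressed in percent (assumed convention)."""
    return (old_seconds / new_seconds - 1.0) * 100.0

# Timings quoted in the list above (seconds before, seconds after).
timings = {
    "Higgs": (405.0, 315.0),
    "small numeric": (3.7, 2.9),
    "airlines": (381.0, 249.0),
}

for name, (old, new) in timings.items():
    print(f"{name}: {speedup_percent(old, new):.1f}%")
```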

R package

  • Update C++ handles by reference to avoid redundant copies by @david-cortes
  • Avoid calculating groupwise feature importance: do not calculate feature importance for groupwise metrics by default
  • R tests clear environment after runs so they won't find temporary data from previous runs
  • Fixed a failure in R when a single feature was ignored via ignored_features
  • Fix feature_count attribute with ignored_features

CV improvements

  • Added support for text features and embeddings in cross-validation mode
  • We've changed the way cross-validation works. Previously, CatBoost trained a small batch of trees on each fold and then switched to the next fold or the next batch of trees. In 1.0.0, CatBoost trains the full model on each fold. That reduces the memory and time overhead of starting a new batch: only one CPU to GPU memory copy is needed per fold, not one per batch of trees. As a result, the interactive mean metric plot is unavailable until training has finished on all folds.
  • Important change: from now on, use_best_model and early stopping work independently on each fold, as we are trying to make single-fold training as close to regular training as possible. If one model stops at iteration i, we use its last metric value in the mean score plot for points in [i+1; last iteration).
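
The padding rule for the mean score plot can be sketched in plain Python. This is only an illustration of the rule as described above, not CatBoost's code; the function name mean_metric_curve and the metric values are hypothetical.

```python
def mean_metric_curve(fold_histories):
    """Average per-iteration metric values across folds.

    A fold that stopped early is extended with its last recorded value
    for all later iterations, then the folds are averaged pointwise.
    """
    n_iterations = max(len(history) for history in fold_histories)
    curve = []
    for it in range(n_iterations):
        values = [
            history[it] if it < len(history) else history[-1]
            for history in fold_histories
        ]
        curve.append(sum(values) / len(values))
    return curve

# Hypothetical metric histories: the second fold stopped one iteration early.
folds = [
    [0.50, 0.40, 0.30],
    [0.60, 0.50],
]
curve = mean_metric_curve(folds)
print(curve)
```

Here the second fold contributes its last value, 0.50, to the third point of the mean curve.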

GPU improvements

  • Fixed distributed training performance on Ethernet networks, giving a ~2x training time speedup. For 2 hosts with 8 V100 GPUs per host, 10 Gigabit Ethernet, 300 factors, 150M samples, 200 trees: 3300s -> 1700s
  • We've found a bug in the model-size-reg implementation on GPU that led to worse quality of the resulting model, especially in comparison to a model trained on CPU with equal parameters

Rust

  • Enabled loading a model from a buffer in Rust by @manavsah

Bugfixes

  • Fix for model predictions with text and embedding features
  • Switch to TBB local executor to limit TLS size and avoid memory leakage #1835
  • Switch to tcmalloc under Linux x86_64 to avoid memory fragmentation bug in LFAlloc
  • Fix for case of ignored text feature
  • Fixed application of baseline in C++ code. The baseline is now added before activation functions are applied and object labels are determined.
  • Fixes for scikit-learn compatibility validation #1783 and #1785
  • Fix for thread_count = -1 in set_params(). Issue #1800
  • Fix potential sigsegv in the model evaluator. Fixes #1809
  • Fix slow (u)int8 & (u)int16 parsing as catfeatures. Fixes #718
  • Adjust the boost_from_average option before the automatic learning rate is chosen
  • Fix embeddings with CrossEntropy mode #1654
  • Fix object importance #1820
  • Fix data provider without target #1827
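
The order of operations in the baseline fix listed above can be illustrated with a small sketch. It assumes binary classification with a logistic activation; the function names are hypothetical and this is not CatBoost's C++ code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_probability(raw_score, baseline=0.0):
    """Add the baseline to the raw score first, then apply the activation."""
    return sigmoid(raw_score + baseline)

def predict_label(raw_score, baseline=0.0, threshold=0.5):
    """Determine the label from the baseline-adjusted probability."""
    return int(predict_probability(raw_score, baseline) > threshold)

# The same raw score can produce a different label once the baseline is included.
print(predict_label(-0.5), predict_label(-0.5, baseline=1.0))
```

Adding the baseline after the activation instead would shift the probability itself and could push it outside the [0, 1] range.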