Release: PyCaret 2.2 | Release Date: October 28, 2020
Summary of Changes
-
Modules Impacted:
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
-
Separate Train and Test Set: A new parameter test_data has been added to the setup function of pycaret.classification and pycaret.regression. When a DataFrame is passed as test_data, it is used as the holdout set and the train_size parameter is ignored. test_data must be labeled and the shape of test_data must match the shape of data.
-
Disable Default Preprocessing: A new parameter preprocess has been added to the setup function. When preprocess is set to False, no transformations are applied except for train_test_split and custom transformations passed in the custom_pipeline param. Data must be ready for modeling (no missing values, no dates, categorical data encoded) when preprocess is set to False.
-
Custom Metrics: New functions get_metrics, add_metric, and remove_metric have been added to pycaret.classification, pycaret.regression, and pycaret.clustering. They can be used to list, add, and remove the metrics used in model evaluation.
-
Custom Transformations: A new parameter custom_pipeline has been added to the setup function. It takes a tuple of (str, transformer) or a list of tuples. When passed, the custom transformers are appended to the preprocessing pipeline and are applied on each CV fold separately and on the final fit. All custom transformations are applied after train_test_split and before pycaret's internal transformations.
-
GPU-enabled Training: To use a GPU for training, set the use_gpu parameter of the setup function to True or force. When set to True, it uses the GPU with algorithms that support it and falls back to the CPU for the rest. When set to force, it uses only GPU-enabled algorithms and raises exceptions if they are unavailable. The following algorithms are supported on GPU:
- Extreme Gradient Boosting: pycaret.classification, pycaret.regression
- LightGBM: pycaret.classification, pycaret.regression
- CatBoost: pycaret.classification, pycaret.regression
- Random Forest: pycaret.classification, pycaret.regression
- K-Nearest Neighbors: pycaret.classification, pycaret.regression
- Support Vector Machine: pycaret.classification, pycaret.regression
- Logistic Regression: pycaret.classification
- Ridge Classifier: pycaret.classification
- Linear Regression: pycaret.regression
- Lasso Regression: pycaret.regression
- Ridge Regression: pycaret.regression
- Elastic Net (Regression): pycaret.regression
- K-Means: pycaret.clustering
- Density-Based Spatial Clustering: pycaret.clustering
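
The difference between True and force is only in what happens when an algorithm cannot run on the GPU. The sketch below restates that rule in plain Python. resolve_device and its gpu_supported argument are hypothetical names invented for this illustration and are not part of pycaret.

```python
def resolve_device(use_gpu, gpu_supported):
    """Return "gpu" or "cpu" for one algorithm, following the use_gpu rules.

    Hypothetical helper for illustration only; not part of pycaret.
    """
    if use_gpu not in (True, False, "force"):
        raise ValueError("use_gpu must be True, False, or 'force'")
    if not use_gpu:
        return "cpu"  # use_gpu=False: everything is trained on the CPU
    if gpu_supported:
        return "gpu"  # True or 'force' with a GPU-capable algorithm
    if use_gpu == "force":
        # 'force' never falls back silently
        raise RuntimeError("GPU requested with 'force' but not available")
    return "cpu"  # use_gpu=True: fall back to the CPU


# Show how each setting behaves for an algorithm without GPU support.
for setting in (False, True, "force"):
    try:
        print(setting, "->", resolve_device(setting, gpu_supported=False))
    except RuntimeError as exc:
        print(setting, "-> error:", exc)
```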
-
Hyperparameter Tuning: New methods for hyperparameter tuning have been added to the tune_model function for pycaret.classification and pycaret.regression. Two new parameters, search_library and search_algorithm, have been added to the tune_model function. search_library can be scikit-learn, scikit-optimize, tune-sklearn, or optuna. The search_algorithm param can take the following values, depending on its search_library:
- scikit-learn: random, grid
- scikit-optimize: bayesian
- tune-sklearn: random, grid, bayesian, hyperopt, bohb
- optuna: random, tpe
Except for scikit-learn, the search libraries are not hard dependencies of pycaret and must be installed separately.
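
The valid combinations can be written down as a lookup table. The sketch below is a plain-Python illustration, not pycaret code: SEARCH_ALGORITHMS and resolve_search_algorithm are hypothetical names, and the first entry of each list is treated as the default, following the defaults given in the tune_model parameter description later in these notes.

```python
# Allowed search_algorithm values per search_library; the first entry is
# treated as the library's default (assumption taken from these notes).
SEARCH_ALGORITHMS = {
    "scikit-learn": ["random", "grid"],
    "scikit-optimize": ["bayesian"],
    "tune-sklearn": ["random", "grid", "bayesian", "hyperopt", "bohb"],
    "optuna": ["tpe", "random"],
}


def resolve_search_algorithm(search_library="scikit-learn", search_algorithm=None):
    """Validate a library/algorithm pair and fill in the default algorithm.

    Hypothetical helper for illustration only; not part of pycaret.
    """
    if search_library not in SEARCH_ALGORITHMS:
        raise ValueError(f"unknown search_library: {search_library!r}")
    allowed = SEARCH_ALGORITHMS[search_library]
    if search_algorithm is None:
        return allowed[0]  # library-specific default
    if search_algorithm not in allowed:
        raise ValueError(
            f"{search_algorithm!r} is not available for {search_library!r}; "
            f"choose one of {allowed}"
        )
    return search_algorithm


print(resolve_search_algorithm("optuna"))
print(resolve_search_algorithm("tune-sklearn", "bohb"))
```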
-
Early Stopping: Early stopping is now supported for hyperparameter tuning. A new parameter early_stopping has been added to the tune_model function for pycaret.classification and pycaret.regression. It is ignored when search_library is scikit-learn, or if the estimator doesn't have a 'partial_fit' attribute. It can be either an object accepted by the search library or one of the following:
- asha for Asynchronous Successive Halving Algorithm
- hyperband for Hyperband
- median for median stopping rule
- When False or None, early stopping will not be used.
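
To give a feel for what these schedulers do, the sketch below implements synchronous successive halving, the simpler sequential relative of the asynchronous algorithm selected by asha. It is an illustration under stated assumptions, not what the search libraries run internally: the function name, the eta reduction factor, and the toy evaluate function are all invented for this example.

```python
def successive_halving(configs, evaluate, min_iters=1, max_iters=10, eta=3):
    """Return the best configuration from a non-empty list.

    At each round every surviving configuration is scored with the current
    iteration budget, the best 1/eta are kept, and the budget grows by eta.
    Poor configurations are therefore stopped early and cheaply.
    """
    survivors = list(configs)
    iters = min_iters
    while len(survivors) > 1 and iters <= max_iters:
        # Higher score is better.
        ranked = sorted(survivors, key=lambda c: evaluate(c, iters), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        iters *= eta
    return survivors[0]


def toy_score(learning_rate, iters):
    """Toy objective: best at learning_rate 0.3, independent of iters."""
    return -abs(learning_rate - 0.3)


candidates = [i / 10 for i in range(10)]
print(successive_halving(candidates, toy_score))
```

With ten candidates and eta=3, the first round keeps three configurations and the second keeps one, so seven of the ten never receive more than the minimum budget.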
-
Iterative Imputation: Iterative imputation of numeric and categorical missing values is now implemented. New parameters imputation_type, iterative_imputation_iters, categorical_iterative_imputer, and numeric_iterative_imputer have been added to the setup function. Read the blog post for more details: https://www.linkedin.com/pulse/iterative-imputation-pycaret-22-antoni-baum/?trackingId=Shg1zF%2F%2FR5BE7XFpzfTHkA%3D%3D
-
New Plots: The following new plots have been added:
- lift: pycaret.classification
- gain: pycaret.classification
- tree: pycaret.classification, pycaret.regression
- feature_all: pycaret.classification, pycaret.regression
-
CatBoost Compatibility: CatBoostClassifier and CatBoostRegressor are now compatible with plot_model. This requires catboost>=0.23.2.
-
Log Plots in MLFlow Server: Any plot available in the plot_model function can now be logged to the MLFlow tracking server. To log specific plots, pass a list containing plot IDs in the log_plots parameter. Check the documentation of plot_model to see all available plots.
-
Data Split Stratification: A new parameter data_split_stratify has been added to the setup function of pycaret.classification and pycaret.regression. It controls stratification during train_test_split. When set to True, the split is stratified by the target column. To stratify on any other columns, pass a list of column names.
-
Fold Strategy: A new parameter fold_strategy has been added to the setup function for pycaret.classification and pycaret.regression. By default, it is 'stratifiedkfold' for pycaret.classification and 'kfold' for pycaret.regression. Possible values are:
- kfold for KFold CV;
- stratifiedkfold for Stratified KFold CV;
- groupkfold for Group KFold CV;
- timeseries for TimeSeriesSplit CV; or
- a custom CV generator object compatible with scikit-learn.
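
The practical difference between kfold and timeseries is which rows may appear in the training side of each split. The sketch below generates both kinds of index splits in plain Python. It is a simplified illustration (no shuffling, no gap, no maximum training size) with hypothetical function names, not the scikit-learn implementation that pycaret relies on.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled K-fold CV."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def timeseries_indices(n_samples, n_splits):
    """Yield expanding-window splits: training rows always precede test rows."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


for name, splitter in (("kfold", kfold_indices), ("timeseries", timeseries_indices)):
    print(name)
    for train, test in splitter(6, 3):
        print("  train", train, "test", test)
```

In the K-fold output, later rows are used to predict earlier ones; in the time-series output they never are, which is why the latter suits temporally ordered data.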
-
Global Fold Parameter: A new parameter fold has been added to the setup function for pycaret.classification and pycaret.regression. It controls the number of folds to be used in cross-validation. This is a global setting that can be overridden at the function level by using the fold parameter within each function. Ignored when fold_strategy is a custom object.
-
Fold Groups: A new parameter fold_groups in the setup function supplies optional group labels when fold_strategy is groupkfold. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing the group label.
-
Transformation Pipeline: All transformations are now applied after train_test_split.
-
Data Type Handling: Internal handling of data types has been changed from int64 and float64 to int32 and float32 respectively, in order to improve memory usage and performance, as well as for better compatibility with GPU-based algorithms.
-
AutoML Behavior Change: The automl function in pycaret.classification and pycaret.regression no longer re-fits the model on the entire dataset. If the model needs to be fitted on the entire dataset including the holdout set, finalize_model must be called explicitly.
-
Default Tuning Grid: The default hyperparameter tuning grid for RandomForest, XGBoost, CatBoost, and LightGBM has been amended to remove extreme values for max_depth and other training-intensive parameters, in order to speed up the tuning process.
-
Random Forest Default Values: The default value of n_estimators for RandomForestClassifier and RandomForestRegressor has been changed from 10 to 100 to make it consistent with the default behavior of scikit-learn.
-
AUC for Multiclass Classification: AUC for Multiclass target is now available in the metric evaluation.
-
Google Colab Display: All output printed on screen (information grid, score grids) is now format compatible with Google Colab resulting in semantic improvements.
-
Sampling Parameter Removed: The sampling parameter has been removed from the setup function of pycaret.classification and pycaret.regression.
-
Type Hinting: In order to make both the usage and development easier, type hints have been added to all updated pycaret functions, in accordance with best practices. Users can leverage those by using an IDE with support for type hints.
-
Documentation: All Modules documentation on the website is now retired. Updated documentation is available here: https://pycaret.readthedocs.io/en/latest/
Function Level Changes
New Functions Introduced in PyCaret 2.2
-
get_metrics: Returns table of available metrics used for CV.
pycaret.classification
pycaret.regression
pycaret.clustering
-
add_metric: Adds a custom metric for model evaluation.
pycaret.classification
pycaret.regression
pycaret.clustering
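
A custom metric is an ordinary function that takes the true and predicted values and returns a number. The sketch below defines one such function in plain Python: a hypothetical cost-weighted error for binary labels in which a missed positive costs more than a false alarm. The function name and the cost values are invented for illustration. The commented registration call at the end assumes that add_metric accepts an id, a display name, and the score function in that order, which should be checked against the documentation before use.

```python
def misclassification_cost(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Average cost per sample for binary labels (1 = positive, 0 = negative).

    Hypothetical metric for illustration; lower values are better.
    """
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            total += fn_cost  # missed positive
        elif truth == 0 and pred == 1:
            total += fp_cost  # false alarm
    return total / len(y_true)


# One false negative (cost 5) and one false positive (cost 1) over 4 samples.
print(misclassification_cost([1, 0, 1, 0], [0, 0, 1, 1]))

# Assumed registration after setup() has been run (signature not verified here):
# add_metric("cost", "Misclassification Cost", misclassification_cost,
#            greater_is_better=False)
```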
-
remove_metric: Removes a custom metric.
pycaret.classification
pycaret.regression
pycaret.clustering
-
save_config: Saves all global variables to a pickle file, so that a session can be resumed later without rerunning the setup function.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
-
load_config: Load global variables from pickle file into Python environment.
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
setup
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
Following new parameters have been added:
-
test_data: pandas.DataFrame, default = None
If not None, test_data is used as a hold-out set, and the train_size parameter is ignored. test_data must be labeled and the shape of data and test_data must match.
-
preprocess: bool, default = True
When set to False, no transformations are applied except for train_test_split and custom transformations passed in the custom_pipeline param. Data must be ready for modeling (no missing values, no dates, categorical data encoded) when preprocess is set to False.
-
imputation_type: str, default = 'simple'
The type of imputation to use. Can be either 'simple' or 'iterative'.
-
iterative_imputation_iters: int, default = 5
The number of iterations. Ignored when imputation_type is not 'iterative'.
-
categorical_iterative_imputer: str, default = 'lightgbm'
Estimator for iterative imputation of missing values in categorical features. Ignored when imputation_type is not 'iterative'.
-
numeric_iterative_imputer: str, default = 'lightgbm'
Estimator for iterative imputation of missing values in numeric features. Ignored when imputation_type is set to 'simple'.
-
data_split_stratify: bool or list, default = False
Controls stratification during 'train_test_split'. When set to True, will stratify by target column. To stratify on any other columns, pass a list of column names. Ignored when data_split_shuffle is False.
-
fold_strategy: str or sklearn CV generator object, default = 'stratifiedkfold' / 'kfold'
Choice of cross validation strategy. Possible values are:
- 'kfold'
- 'stratifiedkfold'
- 'groupkfold'
- 'timeseries'
- a custom CV generator object compatible with scikit-learn.
-
fold: int, default = 10
The number of folds to be used in cross-validation. Must be at least 2. This is a global setting that can be overridden at the function level by using the fold parameter. Ignored when fold_strategy is a custom object.
-
fold_shuffle: bool, default = False
Controls the shuffle parameter of CV. Only applicable when fold_strategy is 'kfold' or 'stratifiedkfold'. Ignored when fold_strategy is a custom object.
-
fold_groups: str or array-like, with shape (n_samples,), default = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels. -
use_gpu: str or bool, default = False
When set to 'force', will try to use GPU with all algorithms that support it, and raise exceptions if they are unavailable. When set to True, will use GPU with algorithms that support it, and fall back to CPU if they are unavailable. When False, all algorithms are trained using CPU only. -
custom_pipeline: transformer or list of transformers or tuple, default = None
When passed, will append the custom transformers in the preprocessing pipeline and are applied on each CV fold separately and on the final fit. All the custom transformations are applied after 'train_test_split' and before pycaret's internal transformations.
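
The phrase "applied on each CV fold separately" matters for transformers that learn something from the data: they are fitted on the training part of a fold only and then applied to both parts. The sketch below illustrates this with a scikit-learn-style transformer written in plain Python. MeanCenter, as_step_list, and fit_transform_fold are hypothetical names for this example; a transformer passed to pycaret would need to follow the scikit-learn fit/transform interface and handle DataFrames.

```python
class MeanCenter:
    """Subtract per-column means learned during fit (rows are lists of floats)."""

    def fit(self, X, y=None):
        n_rows = len(X)
        self.means_ = [sum(column) / n_rows for column in zip(*X)]
        return self

    def transform(self, X):
        return [[value - mean for value, mean in zip(row, self.means_)] for row in X]


def as_step_list(custom_pipeline):
    """Accept one (name, transformer) tuple or a list of them; return a list."""
    if custom_pipeline is None:
        return []
    if isinstance(custom_pipeline, tuple):
        return [custom_pipeline]
    return list(custom_pipeline)


def fit_transform_fold(steps, X_train, X_test):
    """Fit each step on the training rows only, then transform both parts."""
    for _name, transformer in steps:
        transformer.fit(X_train)
        X_train = transformer.transform(X_train)
        X_test = transformer.transform(X_test)
    return X_train, X_test


steps = as_step_list(("center", MeanCenter()))
train_out, test_out = fit_transform_fold(steps, [[1.0], [3.0]], [[5.0]])
print(train_out, test_out)
```

The test row is centered with the training mean (2.0), not with a mean that includes the test row itself, so no information leaks from the held-out part into the fit.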
compare_models
pycaret.classification
pycaret.regression
Following new parameters have been added:
-
cross_validation: bool = True
When set to False, metrics are evaluated on the holdout set. The fold param is ignored when cross_validation is set to False.
-
errors: str = "ignore"
When set to 'ignore', will skip the model with exceptions and continue. If 'raise', will stop the function when exceptions are raised. -
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
create_model
pycaret.classification
pycaret.regression
Following new parameters have been added:
-
cross_validation: bool = True
When set to False, metrics are evaluated on the holdout set. The fold param is ignored when cross_validation is set to False.
-
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
Following parameters have been removed:
- ensemble - Deprecated - use the ensemble_model function directly.
- method - Deprecated - use the ensemble_model function directly.
- system - Moved to private API.
tune_model
pycaret.classification
pycaret.regression
Following new parameters have been added:
-
search_library: str, default = 'scikit-learn'
The search library used for tuning hyperparameters. Possible values:
- 'scikit-learn' - default, requires no further installation https://github.com/scikit-learn/scikit-learn
- 'scikit-optimize' - pip install scikit-optimize https://scikit-optimize.github.io/stable/
- 'tune-sklearn' - pip install tune-sklearn ray[tune] https://github.com/ray-project/tune-sklearn
- 'optuna' - pip install optuna https://optuna.org/
-
search_algorithm: str, default = None
The search algorithm depends on the search_library parameter. Some search algorithms require additional libraries to be installed. When None, will use the search library-specific default algorithm.
scikit-learn possible values:
- random (default)
- grid
scikit-optimize possible values:
- bayesian (default)
tune-sklearn possible values:
- random (default)
- grid
- bayesian: pip install scikit-optimize
- hyperopt: pip install hyperopt
- bohb: pip install hpbandster ConfigSpace
optuna possible values:
- tpe (default)
- random
-
early_stopping: bool or str or object, default = False
Use early stopping to stop fitting to a hyperparameter configuration if it performs poorly. Ignored when search_library is scikit-learn, or if the estimator does not have a 'partial_fit' attribute. Can be either an object accepted by the search library or one of the following:
- 'asha' for Asynchronous Successive Halving Algorithm
- 'hyperband' for Hyperband
- 'median' for Median Stopping Rule
- If False or None, early stopping will not be used.
-
early_stopping_max_iters: int, default = 10
The maximum number of epochs to run for each sampled configuration. Ignored if early_stopping is False or None.
-
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels. -
return_tuner: bool, default = False
When set to True, will return a tuple of (model, tuner_object). -
tuner_verbose: bool or int, default = True
If True or above 0, will print messages from the tuner. Higher values print more messages. Ignored when the verbose param is False.
ensemble_model
pycaret.classification
pycaret.regression
Following new parameters have been added:
-
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
blend_models
pycaret.classification
pycaret.regression
Following new parameters have been added:
-
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels. -
weights: list, default = None
Sequence of weights (float or int) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights when None. -
The default value for the method parameter has been changed from hard to auto.
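
The effect of the weights parameter under soft voting can be shown with a few lines of arithmetic. The sketch below computes the weighted average of class probabilities for a single sample in plain Python. soft_vote is a hypothetical helper and the probabilities are made up for illustration; it is not the voting implementation that pycaret uses.

```python
def soft_vote(probabilities, weights=None):
    """Return (predicted class index, averaged probabilities) for one sample.

    probabilities: one list of class probabilities per model.
    weights: one number per model; uniform weights are used when None.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total_weight
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged


# Three models: the first is confident in class 0, the other two lean to class 1.
model_probs = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]

uniform_label, _ = soft_vote(model_probs)
weighted_label, _ = soft_vote(model_probs, weights=[1, 3, 3])
print(uniform_label, weighted_label)
```

With uniform weights the single confident model carries the vote for class 0; giving the other two models three times the weight flips the prediction to class 1.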
stack_models
pycaret.classification
pycaret.regression
Following new parameters have been added:
-
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
calibrate_model
pycaret.classification
Following new parameters have been added:
-
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
plot_model
pycaret.classification
pycaret.regression
Following new parameters have been added:
-
fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the 'n_splits' parameter of the CV generator in the setup function.
-
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
evaluate_model
pycaret.classification
pycaret.regression
Following new parameters have been added:
-
fold: int or scikit-learn compatible CV generator, default = None
Controls cross-validation. If None, the CV generator in the fold_strategy parameter of the setup function is used. When an integer is passed, it is interpreted as the 'n_splits' parameter of the CV generator in the setup function.
-
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels.
finalize_model
pycaret.classification
pycaret.regression
Following new parameters have been added:
-
fit_kwargs: Optional[dict] = None
Dictionary of arguments passed to the fit method of the model. -
groups: Optional[Union[str, Any]] = None
Optional group labels when 'GroupKFold' is used for the cross-validation. It takes an array with shape (n_samples, ) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels. -
model_only: bool, default = True
When set to False, only the model object is re-trained and all the transformations in Pipeline are ignored.
models
pycaret.classification
pycaret.regression
pycaret.clustering
pycaret.anomaly
Following new parameters have been added:
-
internal: bool, default = False
When True, will return extra columns and rows used internally. -
raise_errors: bool, default = True
When False, will suppress all exceptions, ignoring models that couldn't be created.