github quantumblacklabs/kedro 0.15.0

Major features and improvements

  • Added KedroContext base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config, runner).
  • Added a new CLI command kedro jupyter convert to facilitate converting Jupyter Notebook cells into Kedro nodes.
  • Added support for pip-compile and new Kedro command kedro build-reqs that generates requirements.txt based on
  • Running kedro install will install packages into the conda environment if src/environment.yml exists in your project.
  • Added a new --node flag to kedro run, allowing users to run only the nodes with the specified names.
  • Added new --from-nodes and --to-nodes run arguments, allowing users to run a range of nodes from the pipeline.
  • Added the prefix params: to the parameters specified in parameters.yml, which allows users to tell parameter entries apart from their other node inputs and outputs.
  • Jupyter Lab/Notebook now starts with only one kernel by default.
  • Added the following datasets:
    • CSVHTTPDataSet to load CSV using HTTP(s) links.
    • JSONBlobDataSet to load json (-delimited) files from Azure Blob Storage.
    • ParquetS3DataSet in contrib for usage with pandas. (by @mmchougule)
    • CachedDataSet in contrib, which caches data in memory to avoid I/O and network operations. It clears the cache once a dataset is no longer needed by a pipeline. (by @tsanikgr)
    • YAMLLocalDataSet in contrib to load and save local YAML files. (by @Minyus)
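
To illustrate the --node, --from-nodes and --to-nodes flags listed above, here is a minimal sketch of range selection on a toy dependency graph. It is not Kedro's implementation: the DEPENDENCIES graph, the node names and the select_nodes helper are all hypothetical, and the sketch assumes that "from" keeps the given nodes plus everything downstream of them, while "to" keeps the given nodes plus everything they depend on.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical toy pipeline: node name -> names of the nodes it depends on.
DEPENDENCIES: Dict[str, List[str]] = {
    "load": [],
    "clean": ["load"],
    "features": ["clean"],
    "train": ["features"],
    "report": ["train"],
    "plots": ["clean"],
}


def _downstream(start: Iterable[str], deps: Dict[str, List[str]]) -> Set[str]:
    """Return the start nodes plus every node that depends on them transitively."""
    selected = set(start)
    changed = True
    while changed:
        changed = False
        for node, parents in deps.items():
            if node not in selected and any(p in selected for p in parents):
                selected.add(node)
                changed = True
    return selected


def _upstream(end: Iterable[str], deps: Dict[str, List[str]]) -> Set[str]:
    """Return the end nodes plus every node they require transitively."""
    selected: Set[str] = set()
    stack = list(end)
    while stack:
        node = stack.pop()
        if node not in selected:
            selected.add(node)
            stack.extend(deps[node])
    return selected


def select_nodes(
    deps: Dict[str, List[str]],
    from_nodes: Optional[List[str]] = None,
    to_nodes: Optional[List[str]] = None,
    node_names: Optional[List[str]] = None,
) -> List[str]:
    """Narrow the pipeline by each filter that was supplied (assumed semantics)."""
    selected = set(deps)
    if from_nodes:
        selected &= _downstream(from_nodes, deps)
    if to_nodes:
        selected &= _upstream(to_nodes, deps)
    if node_names:
        selected &= set(node_names)
    # Preserve the declaration order of the toy graph for readable output.
    return [name for name in deps if name in selected]


print(select_nodes(DEPENDENCIES, from_nodes=["clean"], to_nodes=["train"]))
print(select_nodes(DEPENDENCIES, node_names=["train"]))
```

Under these assumptions the first call keeps clean, features and train, because plots and report are downstream of clean but not required by train, and the second call keeps only train.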

Bug fixes and other changes

  • Documentation improvements including instructions on how to initialise a Spark session using YAML configuration.
  • anyconfig default log level changed from INFO to WARNING.
  • Added information on installed plugins to kedro info.
  • Added style sheets for project documentation, so the output of kedro build-docs will resemble the style of kedro docs.

Breaking changes to the API

  • Simplified the Kedro template in with the introduction of KedroContext class.
  • Merged FilepathVersionMixIn and S3VersionMixIn under one abstract class, AbstractVersionedDataSet, which extends AbstractDataSet.
  • The name argument of Pipeline is now keyword-only.
  • CSVLocalDataSet no longer supports URLs. CSVHTTPDataSet supports URLs.
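
As a rough illustration of what a keyword-only name argument means for calling code, the sketch below uses a hypothetical ToyPipeline class rather than Kedro's Pipeline. The constructor signature is an assumption made for the example; only the keyword-only behaviour of name comes from the notes above.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline class; NOT Kedro's Pipeline."""

    def __init__(self, nodes, *, name=None):
        # The bare * makes every parameter after it keyword-only.
        self.nodes = list(nodes)
        self.name = name


# Passing name by keyword works.
pipeline = ToyPipeline(["split_data", "train_model"], name="ds")
print(pipeline.name)

# Passing it positionally is rejected by Python itself.
try:
    ToyPipeline(["split_data", "train_model"], "ds")
except TypeError as exc:
    print("positional name rejected:", exc)
```

Code written against 0.14.X that passed the name positionally would need to switch to the name=... form.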

Migration guide from Kedro 0.14.X to Kedro 0.15.0

Migration for Kedro project template

This guide assumes that:

  • The framework-specific code has not been altered significantly.
  • Your project-specific code is stored in the dedicated Python package under src/.

The breaking changes were introduced in the following project template files:

  • <project-name>/.ipython/profile_default/startup/
  • <project-name>/
  • <project-name>/src/tests/
  • <project-name>/src/<package-name>/
  • <project-name>/.kedro.yml (new file)

The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using kedro new) and move code and files bit by bit as suggested in the detailed guide below:

  1. Create a new project with the same name by running kedro new.

  2. Copy the following folders to the new project:

    • results/
    • references/
    • notebooks/
    • logs/
    • data/
    • conf/
  3. If you customised your src/<package>/, make sure you apply the same customisations to src/<package>/

    • If you customised get_config(), you can override the config_loader property in your ProjectContext derived class.
    • If you customised create_catalog(), you can override the catalog property in your ProjectContext derived class.
    • If you customised run(), you can override the run() method in your ProjectContext derived class.
    • If you customised the default env, you can override it in your ProjectContext derived class or pass it at construction. By default, env is local.
    • If you customised the default root_conf, you can override the CONF_ROOT attribute in your ProjectContext derived class. By default, the KedroContext base class has its CONF_ROOT attribute set to conf.
  4. The following syntax changes are introduced in IPython and Jupyter Notebook/Lab sessions:

    • proj_dir -> context.project_path
    • proj_name -> context.project_name
    • conf -> context.config_loader
    • io -> context.catalog (e.g., io.load() -> context.catalog.load())
  5. If you customised your, you need to apply the same customisations to your in the new project.

  6. Copy the contents of the old project's src/requirements.txt into the new project's src/ and, from the project root directory, run the kedro build-reqs command in your terminal window.
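
The env and CONF_ROOT overrides described in step 3 might look roughly like the sketch below. To stay self-contained it does not import Kedro: KedroContextStandIn is a hypothetical stand-in that mirrors only what these notes state (CONF_ROOT defaults to conf, env defaults to local and can be passed at construction). The conf_dir property, the "settings" and "staging" values and the project path are invented for illustration.

```python
from pathlib import Path


class KedroContextStandIn:
    """Hypothetical stand-in for the KedroContext base class; NOT Kedro's code."""

    CONF_ROOT = "conf"  # default stated in the release notes

    def __init__(self, project_path, env="local"):  # env defaults to local
        self.project_path = Path(project_path)
        self.env = env

    @property
    def conf_dir(self):
        # Hypothetical helper: shows how CONF_ROOT and env could combine.
        return self.project_path / self.CONF_ROOT / self.env


class ProjectContext(KedroContextStandIn):
    # Customised configuration root, replacing a customised root_conf.
    CONF_ROOT = "settings"

    def __init__(self, project_path, env="staging"):
        # Customised default env; callers can still pass env explicitly.
        super().__init__(project_path, env)


default_ctx = ProjectContext("/tmp/my-project")
prod_ctx = ProjectContext("/tmp/my-project", env="prod")
print(default_ctx.conf_dir)
print(prod_ctx.conf_dir)
```

In a real project the derived class would extend the actual KedroContext and would also need whatever abstract members that class requires, so treat this only as a picture of where the two customisations go.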

Migration for versioning custom dataset classes

If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes:

  1. Make sure your dataset inherits from AbstractVersionedDataSet only.
  2. Call super().__init__() with the appropriate arguments in the dataset's __init__. If you store data on the local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in an exists_function and a glob_function that emulate exists and glob on that filesystem (see CSVS3DataSet as an example).
  3. Remove setting of the _filepath and _version attributes in the dataset's __init__, as this is taken care of in the base abstract class.
  4. Any calls to _get_load_path and _get_save_path methods should take no arguments.
  5. Ensure you convert the output of _get_load_path and _get_save_path appropriately, as these now return PurePaths instead of strings.
  6. Make sure _check_paths_consistency is called with PurePaths as input arguments, instead of strings.
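
Steps 2 to 5 can be pictured with the sketch below. It deliberately avoids importing Kedro: StandInVersionedDataSet is a hypothetical, heavily simplified stand-in for AbstractVersionedDataSet that performs no version resolution, and MyTextDataSet is an invented example dataset. Only the argument names filepath, version, exists_function and glob_function, the no-argument _get_load_path and _get_save_path calls, and the PurePath return type are taken from the steps above.

```python
import os
import tempfile
from pathlib import Path, PurePath


class StandInVersionedDataSet:
    """Hypothetical stand-in for AbstractVersionedDataSet; NOT Kedro's code."""

    def __init__(self, filepath, version, exists_function=None, glob_function=None):
        # The base class owns these attributes (step 3).
        self._filepath = PurePath(filepath)
        self._version = version
        self._exists_function = exists_function
        self._glob_function = glob_function

    def _get_load_path(self) -> PurePath:
        # Simplification: the real class resolves a versioned path here.
        return self._filepath

    def _get_save_path(self) -> PurePath:
        # Simplification: the real class resolves a versioned path here.
        return self._filepath


class MyTextDataSet(StandInVersionedDataSet):
    """Invented local-filesystem dataset used only for illustration."""

    def __init__(self, filepath, version=None):
        # Step 2: delegate to the base class; filepath and version suffice locally.
        # Step 3: no manual assignment of _filepath or _version here.
        super().__init__(PurePath(filepath), version)

    def _load(self) -> str:
        # Step 4: no arguments. Step 5: convert the PurePath before doing I/O.
        load_path = Path(self._get_load_path())
        return load_path.read_text()

    def _save(self, data: str) -> None:
        save_path = Path(self._get_save_path())
        save_path.parent.mkdir(parents=True, exist_ok=True)
        save_path.write_text(data)


# Usage: round-trip a string through a temporary directory.
target = os.path.join(tempfile.mkdtemp(), "nested", "notes.txt")
dataset = MyTextDataSet(target)
dataset._save("hello")
print(dataset._load())
```

A dataset on a remote filesystem would additionally pass its own exists_function and glob_function to super().__init__(), as step 2 describes.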

These steps should have brought your project to Kedro 0.15.0. There might be some more minor tweaks needed as every project is unique, but now you have a pretty solid base to work with. If you run into any problems, please consult the Kedro documentation.

Thanks for supporting contributions

Dmitry Vukolov, Jo Stichbury, Angus Williams, Deepyaman Datta, Mayur Chougule, Marat Kopytjuk, Evan Miller, Yusuke Minami

