github quantumblacklabs/kedro 0.15.6

Major features and improvements

TL;DR We're launching kedro.extras, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in kedro.extras.datasets use fsspec to access a variety of data stores, including local file systems, network file systems, cloud object stores (including S3 and GCP), and Hadoop. This change will allow #178 to happen in the next major release of Kedro.

An example of this new system can be seen below, loading the CSV SparkDataSet from S3:

weather:
  type: spark.SparkDataSet  # Note the specified type; this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
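
To illustrate how a prefix such as s3a:// in filepath can select a storage backend, here is a minimal sketch that splits the protocol from the rest of the path. It is not fsspec's implementation: the helper name split_protocol and the default protocol "file" are assumptions made for this example.

```python
def split_protocol(filepath, default="file"):
    """Split a filepath into (protocol, path).

    Hypothetical helper for illustration only; fsspec uses its own logic.
    """
    protocol, separator, path = filepath.partition("://")
    if not separator:
        # No "://" present: assume a plain local path.
        return default, filepath
    return protocol, path


remote = split_protocol("s3a://your_bucket/data/01_raw/weather*")
local = split_protocol("data/01_raw/weather.csv")
print(remote)
print(local)
```

A real filesystem layer would then look up a backend by the protocol name and hand it the remaining path.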

You can also load data incrementally whenever it is dumped into a directory by using IncrementalDataSet, an extension of PartitionedDataSet (a feature that allows you to load a directory of files). The IncrementalDataSet stores information about the last processed partition in a checkpoint.
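
The checkpoint idea can be sketched in plain Python. This is an illustration of the concept only, not Kedro's implementation: the function name select_new_partitions and the rule that partition IDs are compared as strings are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple


def select_new_partitions(
    partitions: Dict[str, str], checkpoint: Optional[str]
) -> Tuple[Dict[str, str], Optional[str]]:
    """Return the partitions that sort after the checkpoint and the next checkpoint.

    Assumption: partition IDs are strings that sort in processing order.
    """
    new_ids = sorted(
        pid for pid in partitions if checkpoint is None or pid > checkpoint
    )
    new_partitions = {pid: partitions[pid] for pid in new_ids}
    # Keep the old checkpoint when nothing new has arrived.
    next_checkpoint = new_ids[-1] if new_ids else checkpoint
    return new_partitions, next_checkpoint


partitions = {
    "2019-12-01.csv": "first",
    "2019-12-02.csv": "second",
    "2019-12-03.csv": "third",
}
new_partitions, checkpoint = select_new_partitions(partitions, "2019-12-01.csv")
print(list(new_partitions), checkpoint)
```

Only the partitions after the stored checkpoint are returned, and the checkpoint advances to the last one processed.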

New features

  • Added a layer attribute for datasets in kedro.extras.datasets to specify the name of a layer according to data engineering convention. This attribute will be passed to kedro-viz in future releases.
  • Enabled loading a particular version of a dataset in Jupyter Notebooks and IPython, using catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>").
  • Added property run_id on ProjectContext, used for versioning using the Journal. To customise your journal run_id you can override the private method _get_run_id().
  • Added the ability to install all optional kedro dependencies via pip install "kedro[all]".
  • Modified the DataCatalog's load order for datasets. The loading order is now:
    • kedro.extras.datasets
    • Import path, specified in type
  • Added an optional copy_mode flag to CachedDataSet and MemoryDataSet to specify the copy mode (deepcopy, copy or assign) to use when loading and saving.
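
The difference between the three copy modes can be shown with the standard library. This is a sketch of the general semantics, not the code used by CachedDataSet or MemoryDataSet: the helper apply_copy_mode is hypothetical, and mapping "copy" to copy.copy is an assumption.

```python
import copy


def apply_copy_mode(data, copy_mode="deepcopy"):
    """Return data according to the requested copy mode (hypothetical helper)."""
    if copy_mode == "deepcopy":
        return copy.deepcopy(data)  # fully independent copy
    if copy_mode == "copy":
        return copy.copy(data)  # new container, shared nested objects
    if copy_mode == "assign":
        return data  # same object, no copy at all
    raise ValueError(f"Unknown copy mode: {copy_mode!r}")


original = {"rows": [1, 2, 3]}
deep = apply_copy_mode(original, "deepcopy")
shallow = apply_copy_mode(original, "copy")
same = apply_copy_mode(original, "assign")

print(deep["rows"] is original["rows"])
print(shallow["rows"] is original["rows"])
print(same is original)
```

A deep copy is the safest choice when nodes might mutate their inputs, while assign avoids copying costs for large or uncopyable objects.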

New Datasets

| Type | Description | Location |
|------|-------------|----------|
| ParquetDataSet | Handles parquet datasets using Dask | kedro.extras.datasets.dask |
| PickleDataSet | Work with Pickle files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pickle |
| CSVDataSet | Work with CSV files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| TextDataSet | Work with text files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| ExcelDataSet | Work with Excel files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| HDFDataSet | Work with HDF using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| YAMLDataSet | Work with YAML files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.yaml |
| MatplotlibWriter | Save Matplotlib images using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.matplotlib |
| NetworkXDataSet | Work with NetworkX files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.networkx |
| BioSequenceDataSet | Work with bio-sequence objects using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.biosequence |
| GBQTableDataSet | Work with Google BigQuery | kedro.extras.datasets.pandas |
| FeatherDataSet | Work with feather files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
| IncrementalDataSet | Inherits from PartitionedDataSet and remembers the last processed partition | |

Files with a new location

| Type | New Location |
|------|--------------|
| JSONDataSet | kedro.extras.datasets.pandas |
| CSVBlobDataSet | kedro.extras.datasets.pandas |
| JSONBlobDataSet | kedro.extras.datasets.pandas |
| SQLTableDataSet | kedro.extras.datasets.pandas |
| SQLQueryDataSet | kedro.extras.datasets.pandas |
| SparkDataSet | kedro.extras.datasets.spark |
| SparkHiveDataSet | kedro.extras.datasets.spark |
| SparkJDBCDataSet | kedro.extras.datasets.spark |
| kedro/contrib/decorators/ | kedro/extras/decorators/ |
| kedro/contrib/io/transformers/ | kedro/extras/transformers/ |
| kedro/contrib/colors/logging/ | kedro/extras/logging/ |
| extras/ | tools/ipython/ |
| kedro/contrib/io/cached/ | kedro/io/ |
| kedro/contrib/io/catalog_with_default/ | kedro/io/ |
| kedro/contrib/config/ | kedro/config/ |

Upcoming deprecations

| Category | Type |
|----------|------|
| Datasets | BioSequenceLocalDataSet |
| Decorators | kedro.contrib.decorators.memory_profiler |
| Configuration Loaders | kedro.contrib.config.TemplatedConfigLoader |

Bug fixes and other changes

  • Added the option to set/overwrite params in config.yaml using YAML dict style instead of string CLI formatting only.
  • Kedro CLI arguments --node and --tag now support comma-separated values; alternative methods will be deprecated in future releases.
  • Fixed a bug in the invalidate_cache method of ParquetGCSDataSet and CSVGCSDataSet.
  • --load-version no longer breaks if the version value contains a colon.
  • Enabled running nodes with duplicate inputs.
  • Improved error message when empty credentials are passed into SparkJDBCDataSet.
  • Fixed bug that caused an empty project to fail unexpectedly with ImportError in template/.../
  • Fixed bug related to saving dataframe with categorical variables in table mode using HDFS3DataSet.
  • Fixed bug that caused unexpected behavior when using from_nodes and to_nodes in pipelines using transcoding.
  • Credentials nested in the dataset config are now also resolved correctly.
  • Bumped minimum required pandas version to 0.24.0 to make use of pandas.DataFrame.to_numpy (recommended alternative to pandas.DataFrame.values).
  • Docs improvements.
  • Pipeline.transform skips modifying node inputs/outputs containing params: or parameters keywords.
  • Support for dataset_credentials key in the credentials for PartitionedDataSet is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
  • Datasets can now define a confirm function, which is called after a node function executes successfully if the node's confirms argument contains that dataset's name.
  • Made the resume prompt on pipeline run failure use --from-nodes instead of --from-inputs to avoid unnecessarily re-running nodes that had already executed.
  • When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use the --idle-timeout option to change this.
  • Added kedro-viz to the Kedro project template requirements.txt file.
  • Removed the results and references folder from the project template.
  • Updated contribution process in
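
The confirm behaviour described in the list above can be sketched as follows. This is a simplified illustration, not Kedro's runner: the class CheckpointedDataset and the function run_node are hypothetical names, and the real API differs in how nodes declare which datasets they confirm.

```python
class CheckpointedDataset:
    """Hypothetical dataset exposing a confirm() hook."""

    def __init__(self):
        self.confirmed = False

    def load(self):
        return [1, 2, 3]

    def confirm(self):
        # In a real dataset this might persist a checkpoint.
        self.confirmed = True


def run_node(func, dataset, confirms=True):
    """Run func on the loaded data and confirm only after it succeeds."""
    result = func(dataset.load())  # an exception here skips confirm()
    if confirms:
        dataset.confirm()
    return result


ok_dataset = CheckpointedDataset()
total = run_node(sum, ok_dataset)

failed_dataset = CheckpointedDataset()
try:
    run_node(lambda data: 1 / 0, failed_dataset)
except ZeroDivisionError:
    pass

print(total, ok_dataset.confirmed, failed_dataset.confirmed)
```

Because confirmation happens only after the node function returns, a failed run leaves the dataset unconfirmed and the same data can be processed again on the next attempt.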

Breaking changes to the API

  • Existing MatplotlibWriter dataset in contrib was renamed to MatplotlibLocalWriter.
  • kedro/contrib/io/matplotlib/ was renamed to kedro/contrib/io/matplotlib/
  • was renamed to

Thanks for supporting contributions

Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez
