## Major features and improvements
**TL;DR:** We're launching `kedro.extras`, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in `kedro.extras.datasets` use `fsspec` to access a variety of data stores including local file systems, network file systems, cloud object stores (including S3 and GCP) and Hadoop; read more about this here. The change will allow #178 to happen in the next major release of Kedro.

An example of this new system can be seen below, loading the CSV `SparkDataSet` from S3:
```yaml
weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
```
You can also load data incrementally whenever it is dumped into a directory with the extension to `PartitionedDataSet`, a feature that allows you to load a directory of files. The `IncrementalDataSet` stores information about the last processed partition in a `checkpoint`; read more about this feature here.
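As a rough sketch (the entry name, bucket path and underlying dataset type below are illustrative, not taken from this release), an `IncrementalDataSet` is configured in the catalog much like a `PartitionedDataSet`:

```yaml
weather_partitions:
  type: IncrementalDataSet
  path: s3://your_bucket/data/01_raw/weather/
  dataset: pandas.CSVDataSet
  credentials: dev_s3
```

Broadly, each load then returns only the partitions that come after the stored checkpoint, rather than the whole directory.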
## New features
- Added `layer` attribute for datasets in `kedro.extras.datasets` to specify the name of a layer according to data engineering convention; this feature will be passed to `kedro-viz` in future releases.
- Enabled loading a particular version of a dataset in Jupyter Notebooks and IPython, using `catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>")`.
- Added property `run_id` on `ProjectContext`, used for versioning using the `Journal`. To customise your journal `run_id` you can override the private method `_get_run_id()`.
- Added the ability to install all optional Kedro dependencies via `pip install "kedro[all]"`.
- Modified the `DataCatalog`'s load order for datasets. The loading order is now as follows:
  - `kedro.io`
  - `kedro.extras.datasets`
  - Import path, specified in `type`
- Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify the copy mode (`deepcopy`, `copy` or `assign`) to use when loading and saving.
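The three copy modes follow standard Python copy semantics. The snippet below is not the Kedro API itself; it just illustrates what each mode means, using the plain `copy` module:

```python
from copy import copy, deepcopy

data = {"weather": [1, 2, 3]}

assigned = data        # "assign": same object is returned, nothing is copied
shallow = copy(data)   # "copy": new dict, but nested objects are shared
deep = deepcopy(data)  # "deepcopy": fully independent copy of everything

data["weather"].append(4)

print(assigned["weather"])  # [1, 2, 3, 4] - same object, sees the mutation
print(shallow["weather"])   # [1, 2, 3, 4] - the inner list is still shared
print(deep["weather"])      # [1, 2, 3]    - unaffected by the mutation
```

`assign` is the fastest mode but is only safe if your nodes never mutate the data they receive; `deepcopy` is the safest and the slowest.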
## New Datasets
Type | Description | Location |
---|---|---|
`ParquetDataSet` | Handles Parquet datasets using Dask | `kedro.extras.datasets.dask` |
`PickleDataSet` | Work with Pickle files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pickle` |
`CSVDataSet` | Work with CSV files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`TextDataSet` | Work with text files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`ExcelDataSet` | Work with Excel files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`HDFDataSet` | Work with HDF using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`YAMLDataSet` | Work with YAML files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.yaml` |
`MatplotlibWriter` | Save Matplotlib images using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.matplotlib` |
`NetworkXDataSet` | Work with NetworkX files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.networkx` |
`BioSequenceDataSet` | Work with bio-sequence objects using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence` |
`GBQTableDataSet` | Work with Google BigQuery | `kedro.extras.datasets.pandas` |
`FeatherDataSet` | Work with feather files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`IncrementalDataSet` | Inherits from `PartitionedDataSet` and remembers the last processed partition | `kedro.io` |
## Files with a new location
Type | New Location |
---|---|
`JSONDataSet` | `kedro.extras.datasets.pandas` |
`CSVBlobDataSet` | `kedro.extras.datasets.pandas` |
`JSONBlobDataSet` | `kedro.extras.datasets.pandas` |
`SQLTableDataSet` | `kedro.extras.datasets.pandas` |
`SQLQueryDataSet` | `kedro.extras.datasets.pandas` |
`SparkDataSet` | `kedro.extras.datasets.spark` |
`SparkHiveDataSet` | `kedro.extras.datasets.spark` |
`SparkJDBCDataSet` | `kedro.extras.datasets.spark` |
`kedro/contrib/decorators/retry.py` | `kedro/extras/decorators/retry_node.py` |
`kedro/contrib/decorators/memory_profiler.py` | `kedro/extras/decorators/memory_profiler.py` |
`kedro/contrib/io/transformers/transformers.py` | `kedro/extras/transformers/time_profiler.py` |
`kedro/contrib/colors/logging/color_logger.py` | `kedro/extras/logging/color_logger.py` |
`extras/ipython_loader.py` | `tools/ipython/ipython_loader.py` |
`kedro/contrib/io/cached/cached_dataset.py` | `kedro/io/cached_dataset.py` |
`kedro/contrib/io/catalog_with_default/data_catalog_with_default.py` | `kedro/io/data_catalog_with_default.py` |
`kedro/contrib/config/templated_config.py` | `kedro/config/templated_config.py` |
|
## Upcoming deprecations
| Category | Type |
|---|---|
| Datasets | `BioSequenceLocalDataSet` |
| | `CSVGCSDataSet` |
| | `CSVHTTPDataSet` |
| | `CSVLocalDataSet` |
| | `CSVS3DataSet` |
| | `ExcelLocalDataSet` |
| | `FeatherLocalDataSet` |
| | `JSONGCSDataSet` |
| | `JSONLocalDataSet` |
| | `HDFLocalDataSet` |
| | `HDFS3DataSet` |
| | `kedro.contrib.io.cached.CachedDataSet` |
| | `kedro.contrib.io.catalog_with_default.DataCatalogWithDefault` |
| | `MatplotlibLocalWriter` |
| | `MatplotlibS3Writer` |
| | `NetworkXLocalDataSet` |
| | `ParquetGCSDataSet` |
| | `ParquetLocalDataSet` |
| | `ParquetS3DataSet` |
| | `PickleLocalDataSet` |
| | `PickleS3DataSet` |
| | `TextLocalDataSet` |
| | `YAMLLocalDataSet` |
| Decorators | `kedro.contrib.decorators.memory_profiler` |
| | `kedro.contrib.decorators.retry` |
| | `kedro.contrib.decorators.pyspark.spark_to_pandas` |
| | `kedro.contrib.decorators.pyspark.pandas_to_spark` |
| Transformers | `kedro.contrib.io.transformers.transformers` |
| Configuration Loaders | `kedro.contrib.config.TemplatedConfigLoader` |
## Bug fixes and other changes
- Added the option to set/overwrite params in `config.yaml` using YAML dict style instead of string CLI formatting only.
- Kedro CLI arguments `--node` and `--tag` support comma-separated values; alternative methods will be deprecated in future releases.
- Fixed a bug in the `invalidate_cache` method of `ParquetGCSDataSet` and `CSVGCSDataSet`.
- `--load-version` now won't break if the version value contains a colon.
- Enabled running `node`s with duplicate inputs.
- Improved the error message when empty credentials are passed into `SparkJDBCDataSet`.
- Fixed a bug that caused an empty project to fail unexpectedly with `ImportError` in `template/.../pipeline.py`.
- Fixed a bug related to saving a dataframe with categorical variables in table mode using `HDFS3DataSet`.
- Fixed a bug that caused unexpected behaviour when using `from_nodes` and `to_nodes` in pipelines using transcoding.
- Credentials nested in the dataset config are now also resolved correctly.
- Bumped the minimum required pandas version to 0.24.0 to make use of `pandas.DataFrame.to_numpy` (recommended alternative to `pandas.DataFrame.values`).
- Docs improvements.
- `Pipeline.transform` now skips modifying node inputs/outputs containing `params:` or `parameters` keywords.
- Support for the `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
- Datasets can have a new `confirm` function, which is called after a successful node function execution if the node contains a `confirms` argument with such a dataset name.
- Made the resume prompt on pipeline run failure use `--from-nodes` instead of `--from-inputs`, to avoid unnecessarily re-running nodes that had already executed.
- When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use the `--idle-timeout` option to update it.
- Added `kedro-viz` to the Kedro project template `requirements.txt` file.
- Removed the `results` and `references` folders from the project template.
- Updated the contribution process in `CONTRIBUTING.md`.
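For example, rather than using a `dataset_credentials` key, credentials for the underlying dataset now go inside the `dataset` config itself (the entry name, path and credentials names below are illustrative):

```yaml
partitioned_weather:
  type: PartitionedDataSet
  path: s3://your_bucket/data/01_raw/weather/
  credentials: dev_s3      # used to list the partitions
  dataset:
    type: pandas.CSVDataSet
    credentials: dev_s3    # dataset credentials, specified explicitly inside the dataset config
```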
## Breaking changes to the API
- Existing `MatplotlibWriter` dataset in `contrib` was renamed to `MatplotlibLocalWriter`.
- `kedro/contrib/io/matplotlib/matplotlib_writer.py` was renamed to `kedro/contrib/io/matplotlib/matplotlib_local_writer.py`.
- `kedro.contrib.io.bioinformatics.sequence_dataset.py` was renamed to `kedro.contrib.io.bioinformatics.biosequence_local_dataset.py`.
## Thanks for supporting contributions
Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez