## Major features and improvements
**TL;DR:** We're launching `kedro.extras`, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in `kedro.extras.datasets` use `fsspec` to access a variety of data stores including local file systems, network file systems, cloud object stores (including S3 and GCP) and Hadoop; read more about this here. The change will allow #178 to happen in the next major release of Kedro.

An example of this new system can be seen below, loading the CSV `SparkDataSet` from S3:
```yaml
weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
```
You can also load data incrementally whenever it is dumped into a directory with the extension to `PartitionedDataSet`, a feature that allows you to load a directory of files. The `IncrementalDataSet` stores information about the last processed partition in a `checkpoint`; read more about this feature here.
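As a rough sketch (the entry name, bucket path and underlying dataset type below are illustrative, not taken from this release), an `IncrementalDataSet` is configured in the catalog much like a `PartitionedDataSet`:

```yaml
weather_partitions:
  type: IncrementalDataSet
  path: s3://your_bucket/data/01_raw/weather/
  dataset: pandas.CSVDataSet
  credentials: dev_s3
```

Broadly, each load then returns only the partitions that come after the stored checkpoint, rather than the whole directory.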
## New features
- Added `layer` attribute for datasets in `kedro.extras.datasets` to specify the name of a layer according to data engineering convention; this feature will be passed to `kedro-viz` in future releases.
- Enabled loading a particular version of a dataset in Jupyter Notebooks and IPython, using `catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>")`.
- Added property `run_id` on `ProjectContext`, used for versioning using the `Journal`. To customise your journal `run_id` you can override the private method `_get_run_id()`.
- Added the ability to install all optional Kedro dependencies via `pip install "kedro[all]"`.
- Modified the `DataCatalog`'s load order for datasets. The loading order is now as follows:
  - `kedro.io`
  - `kedro.extras.datasets`
  - Import path, specified in `type`
- Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify the copy mode (`deepcopy`, `copy` or `assign`) to use when loading and saving.
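The three copy modes follow standard Python copy semantics. The snippet below is not the Kedro API itself; it just illustrates what each mode means, using the plain `copy` module:

```python
from copy import copy, deepcopy

data = {"weather": [1, 2, 3]}

assigned = data        # "assign": same object is returned, nothing is copied
shallow = copy(data)   # "copy": new dict, but nested objects are shared
deep = deepcopy(data)  # "deepcopy": fully independent copy of everything

data["weather"].append(4)

print(assigned["weather"])  # [1, 2, 3, 4] - same object, sees the mutation
print(shallow["weather"])   # [1, 2, 3, 4] - the inner list is still shared
print(deep["weather"])      # [1, 2, 3]    - unaffected by the mutation
```

`assign` is the fastest mode but is only safe if your nodes never mutate the data they receive; `deepcopy` is the safest and the slowest.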
## New Datasets
Type | Description | Location |
---|---|---|
`ParquetDataSet` | Handles Parquet datasets using Dask | `kedro.extras.datasets.dask` |
`PickleDataSet` | Work with Pickle files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pickle` |
`CSVDataSet` | Work with CSV files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`TextDataSet` | Work with text files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`ExcelDataSet` | Work with Excel files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`HDFDataSet` | Work with HDF using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`YAMLDataSet` | Work with YAML files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.yaml` |
`MatplotlibWriter` | Save Matplotlib images using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.matplotlib` |
`NetworkXDataSet` | Work with NetworkX files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.networkx` |
`BioSequenceDataSet` | Work with bio-sequence objects using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence` |
`GBQTableDataSet` | Work with Google BigQuery | `kedro.extras.datasets.pandas` |
`FeatherDataSet` | Work with feather files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
`IncrementalDataSet` | Inherits from `PartitionedDataSet` and remembers the last processed partition | `kedro.io` |
## Files with a new location
Type | New Location |
---|---|
`JSONDataSet` | `kedro.extras.datasets.pandas` |
`CSVBlobDataSet` | `kedro.extras.datasets.pandas` |
`JSONBlobDataSet` | `kedro.extras.datasets.pandas` |
`SQLTableDataSet` | `kedro.extras.datasets.pandas` |
`SQLQueryDataSet` | `kedro.extras.datasets.pandas` |
`SparkDataSet` | `kedro.extras.datasets.spark` |
`SparkHiveDataSet` | `kedro.extras.datasets.spark` |
`SparkJDBCDataSet` | `kedro.extras.datasets.spark` |
`kedro/contrib/decorators/retry.py` | `kedro/extras/decorators/retry_node.py` |
`kedro/contrib/decorators/memory_profiler.py` | `kedro/extras/decorators/memory_profiler.py` |
`kedro/contrib/io/transformers/transformers.py` | `kedro/extras/transformers/time_profiler.py` |
`kedro/contrib/colors/logging/color_logger.py` | `kedro/extras/logging/color_logger.py` |
`extras/ipython_loader.py` | `tools/ipython/ipython_loader.py` |
`kedro/contrib/io/cached/cached_dataset.py` | `kedro/io/cached_dataset.py` |
`kedro/contrib/io/catalog_with_default/data_catalog_with_default.py` | `kedro/io/data_catalog_with_default.py` |
`kedro/contrib/config/templated_config.py` | `kedro/config/templated_config.py` |
|
## Upcoming deprecations
| Category | Type |
|---|---|
| Datasets | `BioSequenceLocalDataSet` |
| | `CSVGCSDataSet` |
| | `CSVHTTPDataSet` |
| | `CSVLocalDataSet` |
| | `CSVS3DataSet` |
| | `ExcelLocalDataSet` |
| | `FeatherLocalDataSet` |
| | `JSONGCSDataSet` |
| | `JSONLocalDataSet` |
| | `HDFLocalDataSet` |
| | `HDFS3DataSet` |
| | `kedro.contrib.io.cached.CachedDataSet` |
| | `kedro.contrib.io.catalog_with_default.DataCatalogWithDefault` |
| | `MatplotlibLocalWriter` |
| | `MatplotlibS3Writer` |
| | `NetworkXLocalDataSet` |
| | `ParquetGCSDataSet` |
| | `ParquetLocalDataSet` |
| | `ParquetS3DataSet` |
| | `PickleLocalDataSet` |
| | `PickleS3DataSet` |
| | `TextLocalDataSet` |
| | `YAMLLocalDataSet` |
| Decorators | `kedro.contrib.decorators.memory_profiler` |
| | `kedro.contrib.decorators.retry` |
| | `kedro.contrib.decorators.pyspark.spark_to_pandas` |
| | `kedro.contrib.decorators.pyspark.pandas_to_spark` |
| Transformers | `kedro.contrib.io.transformers.transformers` |
| Configuration Loaders | `kedro.contrib.config.TemplatedConfigLoader` |
## Bug fixes and other changes
- Added the option to set/overwrite params in `config.yaml` using YAML dict style instead of string CLI formatting only.
- Kedro CLI arguments `--node` and `--tag` support comma-separated values; alternative methods will be deprecated in future releases.
- Fixed a bug in the `invalidate_cache` method of `ParquetGCSDataSet` and `CSVGCSDataSet`.
- `--load-version` now won't break if the version value contains a colon.
- Enabled running `node`s with duplicate inputs.
- Improved the error message when empty credentials are passed into `SparkJDBCDataSet`.
- Fixed a bug that caused an empty project to fail unexpectedly with `ImportError` in `template/.../pipeline.py`.
- Fixed a bug related to saving a dataframe with categorical variables in table mode using `HDFS3DataSet`.
- Fixed a bug that caused unexpected behaviour when using `from_nodes` and `to_nodes` in pipelines using transcoding.
- Credentials nested in the dataset config are now also resolved correctly.
- Bumped the minimum required pandas version to 0.24.0 to make use of `pandas.DataFrame.to_numpy` (recommended alternative to `pandas.DataFrame.values`).
- Docs improvements.
- `Pipeline.transform` now skips modifying node inputs/outputs containing `params:` or `parameters` keywords.
- Support for the `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
- Datasets can have a new `confirm` function, which is called after a successful node function execution if the node contains a `confirms` argument with such a dataset name.
- Made the resume prompt on pipeline run failure use `--from-nodes` instead of `--from-inputs`, to avoid unnecessarily re-running nodes that had already executed.
- When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use the `--idle-timeout` option to update it.
- Added `kedro-viz` to the Kedro project template `requirements.txt` file.
- Removed the `results` and `references` folders from the project template.
- Updated the contribution process in `CONTRIBUTING.md`.
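For example, rather than using a `dataset_credentials` key, credentials for the underlying dataset now go inside the `dataset` config itself (the entry name, path and credentials names below are illustrative):

```yaml
partitioned_weather:
  type: PartitionedDataSet
  path: s3://your_bucket/data/01_raw/weather/
  credentials: dev_s3      # used to list the partitions
  dataset:
    type: pandas.CSVDataSet
    credentials: dev_s3    # dataset credentials, specified explicitly inside the dataset config
```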
## Breaking changes to the API
- Existing `MatplotlibWriter` dataset in `contrib` was renamed to `MatplotlibLocalWriter`.
- `kedro/contrib/io/matplotlib/matplotlib_writer.py` was renamed to `kedro/contrib/io/matplotlib/matplotlib_local_writer.py`.
- `kedro.contrib.io.bioinformatics.sequence_dataset.py` was renamed to `kedro.contrib.io.bioinformatics.biosequence_local_dataset.py`.
## Thanks for supporting contributions
Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez