## Major features and improvements
- New CLI commands and command flags:
  - Load multiple `kedro run` CLI flags from a configuration file with the `--config` flag (e.g. `kedro run --config run_config.yml`); a sketch of such a file follows at the end of this list.
  - Run parametrised pipelines with the `--params` flag (e.g. `kedro run --params param1:value1,param2:value2`).
  - Lint your project code using the `kedro lint` command; your project is linted with `black` (Python 3.6+), `flake8` and `isort`.
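  A minimal sketch of what `run_config.yml` might contain. The keys are assumed to mirror the `kedro run` CLI flags (check `kedro run --help` for the options your version accepts); the values below are illustrative:

  ```yaml
  run:
    tags: tag1, tag2
    env: base
    parallel: true
  ```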
- Load specific environments with Jupyter notebooks using the `KEDRO_ENV` environment variable (e.g. `KEDRO_ENV=test kedro jupyter notebook`), which globally sets the environment for the `run`, `jupyter notebook` and `jupyter lab` commands.
- Added the following datasets:
  - `CSVGCSDataSet` dataset in `contrib` for working with CSV files in Google Cloud Storage.
  - `ParquetGCSDataSet` dataset in `contrib` for working with Parquet files in Google Cloud Storage.
  - `JSONGCSDataSet` dataset in `contrib` for working with JSON files in Google Cloud Storage.
  - `MatplotlibS3Writer` dataset in `contrib` for saving Matplotlib images to S3.
  - `PartitionedDataSet` for working with datasets split across multiple files (see the sketch after this list).
  - `JSONDataSet` dataset for working with JSON files that uses `fsspec` to communicate with the underlying filesystem. It doesn't support the `http(s)` protocol for now.
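  A minimal sketch of loading with `PartitionedDataSet`. It assumes `load()` returns a dictionary mapping partition ids to functions that lazily load each partition; the path and the use of the new `JSONDataSet` as the underlying dataset are illustrative:

  ```python
  from kedro.io import JSONDataSet, PartitionedDataSet

  # One underlying JSONDataSet is created per file found under `path`
  events = PartitionedDataSet(path="data/01_raw/events", dataset=JSONDataSet)

  partitions = events.load()  # dict: partition id -> callable that loads it
  for partition_id, load_partition in sorted(partitions.items()):
      data = load_partition()  # partitions are loaded lazily, one at a time
      print(partition_id, type(data))
  ```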
- Added `s3fs_args` to all S3 datasets (see the sketch below).
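  A sketch of the new argument, assuming `s3fs_args` is forwarded to the underlying `s3fs.S3FileSystem`; the dataset, path and endpoint below are illustrative (e.g. pointing at an S3-compatible store):

  ```python
  from kedro.io import CSVS3DataSet

  cars = CSVS3DataSet(
      filepath="s3://my-bucket/data/cars.csv",
      s3fs_args={"client_kwargs": {"endpoint_url": "http://localhost:9000"}},
  )
  df = cars.load()  # returns a pandas DataFrame
  ```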
- Pipelines can be subtracted with `pipeline1 - pipeline2` (see the sketch below).
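  A minimal sketch of pipeline subtraction; the node functions and dataset names are illustrative:

  ```python
  from kedro.pipeline import Pipeline, node

  def clean(raw):
      return raw

  def train(clean_data):
      return "model"

  full = Pipeline([
      node(clean, "raw", "clean_data"),
      node(train, "clean_data", "model"),
  ])
  training = Pipeline([node(train, "clean_data", "model")])

  # Subtraction removes the right-hand pipeline's nodes from the left-hand one
  preprocessing = full - training  # contains only the `clean` node
  ```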
## Bug fixes and other changes
- `ParallelRunner` now works with `SparkDataSet`.
- Allowed the use of nulls in `parameters.yml`, as in the example below.
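  A sketch of a now-valid `parameters.yml`; the parameter names are illustrative:

  ```yaml
  model_options:
    test_size: 0.2
    random_seed: null  # nulls are now accepted
  ```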
- Fixed an issue where `%reload_kedro` wasn't reloading all user modules.
- Fixed the `pandas_to_spark` and `spark_to_pandas` decorators to work with functions that take kwargs.
- Fixed a bug where `kedro jupyter notebook` and `kedro jupyter lab` would run a different Jupyter installation to the one in the local environment.
- Implemented Databricks-compatible dataset versioning for `SparkDataSet`.
- Fixed a bug where `kedro package` would fail in certain situations where `kedro build-reqs` was used to generate `requirements.txt`.
- Made the `bucket_name` argument optional for the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet` - the bucket name can now be included in the filepath along with the filesystem protocol (e.g. `s3://bucket-name/path/to/key.csv`).
- Documentation improvements and fixes.
## Breaking changes to the API
- Renamed the entry point for running pip-installed projects to `run_package()` instead of `main()` in `src/<package>/run.py` (see the sketch below).
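  If your project exposes a console script, its target must follow the rename. A sketch assuming a standard setuptools `entry_points` declaration in `setup.py`; the project and package names are illustrative:

  ```python
  from setuptools import find_packages, setup

  setup(
      name="my-project",
      packages=find_packages(where="src"),
      package_dir={"": "src"},
      entry_points={
          "console_scripts": [
              # previously: "my-project = my_project.run:main"
              "my-project = my_project.run:run_package",
          ]
      },
  )
  ```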
- The `bucket_name` key has been removed from the string representation of the following datasets: `CSVS3DataSet`, `HDFS3DataSet`, `PickleS3DataSet`, `contrib.io.parquet.ParquetS3DataSet`, `contrib.io.gcs.JSONGCSDataSet`.
- Moved the `mem_profiler` decorator to `contrib` and separated the `contrib` decorators so that their dependencies are modular. You may need to update your import paths; for example, the pyspark decorators should now be imported as `from kedro.contrib.decorators.pyspark import <pyspark_decorator>` instead of `from kedro.contrib.decorators import <pyspark_decorator>`.
## Thanks for supporting contributions
Sheldon Tsen, @roumail, Karlson Lee, Waylon Walker, Deepyaman Datta, Giovanni, Zain Patel