github quantumblacklabs/kedro 0.15.5

Major features and improvements

  • New CLI commands and command flags:
    • Load multiple kedro run CLI flags from a configuration file with the --config flag (e.g. kedro run --config run_config.yml)
    • Run parametrised pipeline runs with the --params flag (e.g. kedro run --params param1:value1,param2:value2).
    • Lint your project code using the kedro lint command, your project is linted with black (Python 3.6+), flake8 and isort.
  • Load specific environments with Jupyter notebooks using KEDRO_ENV which will globally set run, jupyter notebook and jupyter lab commands using environment variables.
  • Added the following datasets:
    • CSVGCSDataSet dataset in contrib for working with CSV files in Google Cloud Storage.
    • ParquetGCSDataSet dataset in contrib for working with Parquet files in Google Cloud Storage.
    • JSONGCSDataSet dataset in contrib for working with JSON files in Google Cloud Storage.
    • MatplotlibS3Writer dataset in contrib for saving Matplotlib images to S3.
    • PartitionedDataSet for working with datasets split across multiple files.
    • JSONDataSet dataset for working with JSON files that uses fsspec to communicate with the underlying filesystem. It doesn't support http(s) protocol for now.
  • Added s3fs_args to all S3 datasets.
  • Pipelines can be deducted with pipeline1 - pipeline2.

Bug fixes and other changes

  • ParallelRunner now works with SparkDataSet.
  • Allowed the use of nulls in parameters.yml.
  • Fixed an issue where %reload_kedro wasn't reloading all user modules.
  • Fixed pandas_to_spark and spark_to_pandas decorators to work with functions with kwargs.
  • Fixed a bug where kedro jupyter notebook and kedro jupyter lab would run a different Jupyter installation to the one in the local environment.
  • Implemented Databricks-compatible dataset versioning for SparkDataSet.
  • Fixed a bug where kedro package would fail in certain situations where kedro build-reqs was used to generate requirements.txt.
  • Made bucket_name argument optional for the following datasets: CSVS3DataSet, HDFS3DataSet, PickleS3DataSet, contrib.io.parquet.ParquetS3DataSet, contrib.io.gcs.JSONGCSDataSet - bucket name can now be included into the filepath along with the filesystem protocol (e.g. s3://bucket-name/path/to/key.csv).
  • Documentation improvements and fixes.

Breaking changes to the API

  • Renamed entry point for running pip-installed projects to run_package() instead of main() in src/<package>/run.py.
  • bucket_name key has been removed from the string representation of the following datasets: CSVS3DataSet, HDFS3DataSet, PickleS3DataSet, contrib.io.parquet.ParquetS3DataSet, contrib.io.gcs.JSONGCSDataSet.
  • Moved the mem_profiler decorator to contrib and separated the contrib decorators so that dependencies are modular. You may need to update your import paths, for example the pyspark decorators should be imported as from kedro.contrib.decorators.pyspark import <pyspark_decorator> instead of from kedro.contrib.decorators import <pyspark_decorator>.

Thanks for supporting contributions

Sheldon Tsen, @roumail, Karlson Lee, Waylon Walker, Deepyaman Datta, Giovanni, Zain Patel

latest releases: 0.17.5, 0.17.4, 0.17.3...
21 months ago