Kedro v0.15.6 Release Notes

Release Date: 2020-02-26
  • Major features and improvements

    TL;DR We're launching kedro.extras, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in kedro.extras.datasets use fsspec to access a variety of data stores, including local file systems, network file systems, cloud object stores (including S3 and GCS) and Hadoop; see the documentation for more detail. This change will allow #178 to happen in the next major release of Kedro.

    An example of this new system can be seen below, loading the CSV SparkDataSet from S3:

    ```yaml
    weather:
      type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
      filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
      credentials: dev_s3
      file_format: csv
    ```
    

    You can also load data incrementally whenever it is dumped into a directory, thanks to the IncrementalDataSet extension to PartitionedDataSet, a dataset that allows you to load a directory of files. The IncrementalDataSet stores information about the last processed partition in a checkpoint; read more about this feature in the documentation.
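
    A minimal sketch of how this might look from Python is shown below; the directory path, the underlying pandas.CSVDataSet type and the processing step are illustrative assumptions rather than part of this release note:

    ```python
    from kedro.io import IncrementalDataSet

    # A sketch only: the directory path and the pandas.CSVDataSet type are assumptions.
    data_set = IncrementalDataSet(
        path="data/01_raw/increments",  # directory that keeps receiving new partition files
        dataset="pandas.CSVDataSet",    # how each individual partition is read
    )

    new_partitions = data_set.load()    # only the partitions added since the stored checkpoint
    # ... process `new_partitions` here (hypothetical downstream step) ...
    data_set.confirm()                  # update the checkpoint so these partitions are skipped next run
    ```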

    New features

    • Added a layer attribute for datasets in kedro.extras.datasets to specify the name of a layer according to data engineering convention; this will be passed to kedro-viz in future releases.
    • Enabled loading a particular version of a dataset in Jupyter Notebooks and IPython, using catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>") (see the sketch after this list).
    • Added a run_id property on ProjectContext, used for versioning with the Journal. To customise your journal run_id, override the private method _get_run_id().
    • Added the ability to install all optional kedro dependencies via pip install "kedro[all]".
    • Modified the DataCatalog's load order for datasets. The loading order is now:
      • kedro.io
      • kedro.extras.datasets
      • Import path, specified in type
    • Added an optional copy_mode flag to CachedDataSet and MemoryDataSet to specify the copy mode (deepcopy, copy or assign) to use when loading and saving (see the sketch after this list).
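
    As a rough illustration of the versioned load and copy_mode features above (the "weather" dataset name and the version timestamp are placeholders, and catalog is the variable that Kedro injects into a kedro jupyter notebook session):

    ```python
    from kedro.io import MemoryDataSet

    # Versioned load in a Jupyter/IPython session; `catalog` is provided by Kedro,
    # and the dataset name and version timestamp below are placeholders.
    df = catalog.load("weather", version="2019-12-13T15.08.09.255Z")

    # copy_mode controls how MemoryDataSet (and CachedDataSet) hands data back:
    # "deepcopy", "copy" or "assign" (no copying at all).
    shared = MemoryDataSet(data=df, copy_mode="assign")
    assert shared.load() is df  # with "assign", the very same object is returned
    ```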

    New Datasets

    | Type | Description | Location |
    | --- | --- | --- |
    | ParquetDataSet | Handles parquet datasets using Dask | kedro.extras.datasets.dask |
    | PickleDataSet | Work with Pickle files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pickle |
    | CSVDataSet | Work with CSV files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
    | TextDataSet | Work with text files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.text |
    | ExcelDataSet | Work with Excel files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
    | HDFDataSet | Work with HDF files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
    | YAMLDataSet | Work with YAML files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.yaml |
    | MatplotlibWriter | Save Matplotlib images using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.matplotlib |
    | NetworkXDataSet | Work with NetworkX files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.networkx |
    | BioSequenceDataSet | Work with bio-sequence objects using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.biosequence |
    | GBQTableDataSet | Work with Google BigQuery | kedro.extras.datasets.pandas |
    | FeatherDataSet | Work with feather files using fsspec to communicate with the underlying filesystem | kedro.extras.datasets.pandas |
    | IncrementalDataSet | Inherits from PartitionedDataSet and remembers the last processed partition | kedro.io |

    Files with a new location

    | Type | New Location |
    | --- | --- |
    | JSONDataSet | kedro.extras.datasets.pandas |
    | CSVBlobDataSet | kedro.extras.datasets.pandas |
    | JSONBlobDataSet | kedro.extras.datasets.pandas |
    | SQLTableDataSet | kedro.extras.datasets.pandas |
    | SQLQueryDataSet | kedro.extras.datasets.pandas |
    | SparkDataSet | kedro.extras.datasets.spark |
    | SparkHiveDataSet | kedro.extras.datasets.spark |
    | SparkJDBCDataSet | kedro.extras.datasets.spark |
    | kedro/contrib/decorators/retry.py | kedro/extras/decorators/retry_node.py |
    | kedro/contrib/decorators/memory_profiler.py | kedro/extras/decorators/memory_profiler.py |
    | kedro/contrib/io/transformers/transformers.py | kedro/extras/transformers/time_profiler.py |
    | kedro/contrib/colors/logging/color_logger.py | kedro/extras/logging/color_logger.py |
    | extras/ipython_loader.py | tools/ipython/ipython_loader.py |
    | kedro/contrib/io/cached/cached_dataset.py | kedro/io/cached_dataset.py |
    | kedro/contrib/io/catalog_with_default/data_catalog_with_default.py | kedro/io/data_catalog_with_default.py |
    | kedro/contrib/config/templated_config.py | kedro/config/templated_config.py |

    Upcoming deprecations

    | Category | Type |
    | --- | --- |
    | Datasets | BioSequenceLocalDataSet |
    | | CSVGCSDataSet |
    | | CSVHTTPDataSet |
    | | CSVLocalDataSet |
    | | CSVS3DataSet |
    | | ExcelLocalDataSet |
    | | FeatherLocalDataSet |
    | | JSONGCSDataSet |
    | | JSONLocalDataSet |
    | | HDFLocalDataSet |
    | | HDFS3DataSet |
    | | kedro.contrib.io.cached.CachedDataSet |
    | | kedro.contrib.io.catalog_with_default.DataCatalogWithDefault |
    | | MatplotlibLocalWriter |
    | | MatplotlibS3Writer |
    | | NetworkXLocalDataSet |
    | | ParquetGCSDataSet |
    | | ParquetLocalDataSet |
    | | ParquetS3DataSet |
    | | PickleLocalDataSet |
    | | PickleS3DataSet |
    | | TextLocalDataSet |
    | | YAMLLocalDataSet |
    | Decorators | kedro.contrib.decorators.memory_profiler |
    | | kedro.contrib.decorators.retry |
    | | kedro.contrib.decorators.pyspark.spark_to_pandas |
    | | kedro.contrib.decorators.pyspark.pandas_to_spark |
    | Transformers | kedro.contrib.io.transformers.transformers |
    | Configuration Loaders | kedro.contrib.config.TemplatedConfigLoader |

    ๐Ÿ› Bug fixes and other changes

    • Added the option to set/overwrite params in config.yaml using YAML dict style instead of string CLI formatting only.
    • Kedro CLI arguments --node and --tag support comma-separated values; alternative methods will be deprecated in future releases.
    • Fixed a bug in the invalidate_cache method of ParquetGCSDataSet and CSVGCSDataSet.
    • --load-version now won't break if the version value contains a colon.
    • Enabled running nodes with duplicate inputs.
    • Improved the error message when empty credentials are passed into SparkJDBCDataSet.
    • Fixed a bug that caused an empty project to fail unexpectedly with ImportError in template/.../pipeline.py.
    • Fixed a bug related to saving a dataframe with categorical variables in table mode using HDFS3DataSet.
    • Fixed a bug that caused unexpected behaviour when using from_nodes and to_nodes in pipelines using transcoding.
    • Credentials nested in the dataset config are now also resolved correctly.
    • Bumped the minimum required pandas version to 0.24.0 to make use of pandas.DataFrame.to_numpy (recommended alternative to pandas.DataFrame.values).
    • Docs improvements.
    • Pipeline.transform skips modifying node inputs/outputs containing params: or parameters keywords.
    • Support for the dataset_credentials key in the credentials for PartitionedDataSet is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
    • Datasets can have a new confirm function, which is called after a successful node function execution if the node's confirms argument includes that dataset name (see the sketch after this list).
    • Made the resume prompt on pipeline run failure use --from-nodes instead of --from-inputs, to avoid unnecessarily re-running nodes that had already executed.
    • When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use the --idle-timeout option to change this.
    • Added kedro-viz to the Kedro project template requirements.txt file.
    • Removed the results and references folders from the project template.
    • Updated the contribution process in CONTRIBUTING.md.

    Breaking changes to the API

    • Existing MatplotlibWriter dataset in contrib was renamed to MatplotlibLocalWriter.
    • kedro/contrib/io/matplotlib/matplotlib_writer.py was renamed to kedro/contrib/io/matplotlib/matplotlib_local_writer.py.
    • kedro.contrib.io.bioinformatics.sequence_dataset.py was renamed to kedro.contrib.io.bioinformatics.biosequence_local_dataset.py.

    ๐Ÿ‘ Thanks for supporting contributions

    Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez