# Kedro v0.15.6 Release Notes

Release date: 2020-02-26
## Major features and improvements

TL;DR: We're launching `kedro.extras`, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in `kedro.extras.datasets` use `fsspec` to access a variety of data stores including local file systems, network file systems, cloud object stores (including S3 and GCP), and Hadoop; read more about this in the documentation. The change will allow #178 to happen in the next major release of Kedro.

An example of this new system can be seen below, loading the CSV `SparkDataSet` from S3:

```yaml
weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
```
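The `credentials: dev_s3` key in the example above refers to an entry in the project's credentials configuration (typically `conf/local/credentials.yml`). A hypothetical entry might look like the following; the key names shown are illustrative, so check the Kedro documentation for the exact keys your dataset expects:

```yaml
dev_s3:
  aws_access_key_id: YOUR_ACCESS_KEY_ID
  aws_secret_access_key: YOUR_SECRET_ACCESS_KEY
```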
You can also load data incrementally whenever it is dumped into a directory, thanks to the extension to `PartitionedDataSet`, a feature that allows you to load a directory of files. The `IncrementalDataSet` stores the information about the last processed partition in a `checkpoint`; read more about this feature in the documentation.
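The checkpoint idea behind `IncrementalDataSet` can be illustrated with a minimal, Kedro-free sketch. All names and the file layout below are illustrative, not the actual Kedro API: only partitions newer than the recorded checkpoint are returned, and the checkpoint advances after each load.

```python
import os
import tempfile

def load_incrementally(directory, checkpoint_path):
    """Return only the partitions added since the last recorded checkpoint."""
    last = ""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            last = f.read().strip()
    # Partitions are compared in lexicographic order, e.g. date-stamped names.
    new_partitions = sorted(p for p in os.listdir(directory) if p > last)
    if new_partitions:
        with open(checkpoint_path, "w") as f:
            f.write(new_partitions[-1])  # remember the last processed partition
    return new_partitions

# Example: two runs over a growing directory.
data_dir = tempfile.mkdtemp()
checkpoint = os.path.join(tempfile.mkdtemp(), "CHECKPOINT")
for name in ("2020-01-01.csv", "2020-01-02.csv"):
    open(os.path.join(data_dir, name), "w").close()
first = load_incrementally(data_dir, checkpoint)   # both partitions are new
open(os.path.join(data_dir, "2020-01-03.csv"), "w").close()
second = load_incrementally(data_dir, checkpoint)  # only the newly added one
```

The real implementation persists its checkpoint via `fsspec` as well, so it works on remote storage, but the control flow is essentially the one sketched here.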
## New features

- Added a `layer` attribute for datasets in `kedro.extras.datasets` to specify the name of a layer according to data engineering convention; this feature will be passed to `kedro-viz` in future releases.
- Enabled loading a particular version of a dataset in Jupyter Notebooks and IPython, using `catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>")`.
- Added property `run_id` on `ProjectContext`, used for versioning using the `Journal`. To customise your journal `run_id` you can override the private method `_get_run_id()`.
- Added the ability to install all optional kedro dependencies via `pip install "kedro[all]"`.
- Modified the `DataCatalog`'s load order for datasets. The loading order is now the following:
  1. `kedro.io`
  2. `kedro.extras.datasets`
  3. Import path, specified in `type`
- Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify the copy mode (`deepcopy`, `copy` or `assign`) to use when loading and saving.
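The difference between the three copy modes can be shown with a small, Kedro-free sketch using only the standard library; `_copy_with_mode` is an illustrative helper, not the actual Kedro function:

```python
import copy

def _copy_with_mode(data, copy_mode):
    # "deepcopy": fully independent copy; nested objects are duplicated too
    if copy_mode == "deepcopy":
        return copy.deepcopy(data)
    # "copy": shallow copy; a new container, but nested objects are shared
    if copy_mode == "copy":
        return copy.copy(data)
    # "assign": no copy at all; the very same object is handed out
    if copy_mode == "assign":
        return data
    raise ValueError(f"Unknown copy mode: {copy_mode}")

original = {"rows": [1, 2, 3]}
assigned = _copy_with_mode(original, "assign")    # same object
shallow = _copy_with_mode(original, "copy")       # new dict, shared list
deep = _copy_with_mode(original, "deepcopy")      # new dict, new list
```

`assign` is the cheapest but mutations leak between nodes; `deepcopy` is the safest but can be expensive for large in-memory data, which is why the flag is worth exposing.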
## New Datasets

| Type | Description | Location |
| --- | --- | --- |
| `ParquetDataSet` | Handles parquet datasets using Dask | `kedro.extras.datasets.dask` |
| `PickleDataSet` | Work with Pickle files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pickle` |
| `CSVDataSet` | Work with CSV files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `TextDataSet` | Work with text files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.text` |
| `ExcelDataSet` | Work with Excel files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `HDFDataSet` | Work with HDF using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `YAMLDataSet` | Work with YAML files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.yaml` |
| `MatplotlibWriter` | Save Matplotlib images using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.matplotlib` |
| `NetworkXDataSet` | Work with NetworkX files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.networkx` |
| `BioSequenceDataSet` | Work with bio-sequence objects using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence` |
| `GBQTableDataSet` | Work with Google BigQuery | `kedro.extras.datasets.pandas` |
| `FeatherDataSet` | Work with feather files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `IncrementalDataSet` | Inherits from `PartitionedDataSet` and remembers the last processed partition | `kedro.io` |
## Files with a new location

| Type | New Location |
| --- | --- |
| `JSONDataSet` | `kedro.extras.datasets.pandas` |
| `CSVBlobDataSet` | `kedro.extras.datasets.pandas` |
| `JSONBlobDataSet` | `kedro.extras.datasets.pandas` |
| `SQLTableDataSet` | `kedro.extras.datasets.pandas` |
| `SQLQueryDataSet` | `kedro.extras.datasets.pandas` |
| `SparkDataSet` | `kedro.extras.datasets.spark` |
| `SparkHiveDataSet` | `kedro.extras.datasets.spark` |
| `SparkJDBCDataSet` | `kedro.extras.datasets.spark` |
| `kedro/contrib/decorators/retry.py` | `kedro/extras/decorators/retry_node.py` |
| `kedro/contrib/decorators/memory_profiler.py` | `kedro/extras/decorators/memory_profiler.py` |
| `kedro/contrib/io/transformers/transformers.py` | `kedro/extras/transformers/time_profiler.py` |
| `kedro/contrib/colors/logging/color_logger.py` | `kedro/extras/logging/color_logger.py` |
| `extras/ipython_loader.py` | `tools/ipython/ipython_loader.py` |
| `kedro/contrib/io/cached/cached_dataset.py` | `kedro/io/cached_dataset.py` |
| `kedro/contrib/io/catalog_with_default/data_catalog_with_default.py` | `kedro/io/data_catalog_with_default.py` |
| `kedro/contrib/config/templated_config.py` | `kedro/config/templated_config.py` |
## Upcoming deprecations

| Category | Type |
| --- | --- |
| Datasets | `BioSequenceLocalDataSet` |
|  | `CSVGCSDataSet` |
|  | `CSVHTTPDataSet` |
|  | `CSVLocalDataSet` |
|  | `CSVS3DataSet` |
|  | `ExcelLocalDataSet` |
|  | `FeatherLocalDataSet` |
|  | `JSONGCSDataSet` |
|  | `JSONLocalDataSet` |
|  | `HDFLocalDataSet` |
|  | `HDFS3DataSet` |
|  | `kedro.contrib.io.cached.CachedDataSet` |
|  | `kedro.contrib.io.catalog_with_default.DataCatalogWithDefault` |
|  | `MatplotlibLocalWriter` |
|  | `MatplotlibS3Writer` |
|  | `NetworkXLocalDataSet` |
|  | `ParquetGCSDataSet` |
|  | `ParquetLocalDataSet` |
|  | `ParquetS3DataSet` |
|  | `PickleLocalDataSet` |
|  | `PickleS3DataSet` |
|  | `TextLocalDataSet` |
|  | `YAMLLocalDataSet` |
| Decorators | `kedro.contrib.decorators.memory_profiler` |
|  | `kedro.contrib.decorators.retry` |
|  | `kedro.contrib.decorators.pyspark.spark_to_pandas` |
|  | `kedro.contrib.decorators.pyspark.pandas_to_spark` |
| Transformers | `kedro.contrib.io.transformers.transformers` |
| Configuration Loaders | `kedro.contrib.config.TemplatedConfigLoader` |

## Bug fixes and other changes
- Added the option to set/overwrite params in `config.yaml` using YAML dict style instead of string CLI formatting only.
- Kedro CLI arguments `--node` and `--tag` support comma-separated values; alternative methods will be deprecated in future releases.
- Fixed a bug in the `invalidate_cache` method of `ParquetGCSDataSet` and `CSVGCSDataSet`.
- `--load-version` now won't break if the version value contains a colon.
- Enabled running `node`s with duplicate inputs.
- Improved the error message when empty credentials are passed into `SparkJDBCDataSet`.
- Fixed a bug that caused an empty project to fail unexpectedly with `ImportError` in `template/.../pipeline.py`.
- Fixed a bug related to saving dataframes with categorical variables in table mode using `HDFS3DataSet`.
- Fixed a bug that caused unexpected behaviour when using `from_nodes` and `to_nodes` in pipelines using transcoding.
- Credentials nested in the dataset config are now also resolved correctly.
- Bumped the minimum required pandas version to 0.24.0 to make use of `pandas.DataFrame.to_numpy` (recommended alternative to `pandas.DataFrame.values`).
- Docs improvements.
- `Pipeline.transform` skips modifying node inputs/outputs containing `params:` or `parameters` keywords.
- Support for the `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
- Datasets can have a new `confirm` function, which is called after a successful node function execution if the node contains a `confirms` argument with such a dataset name.
- Made the resume prompt on pipeline run failure use `--from-nodes` instead of `--from-inputs` to avoid unnecessarily re-running nodes that had already executed.
- When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use the `--idle-timeout` option to update it.
- Added `kedro-viz` to the Kedro project template's `requirements.txt` file.
- Removed the `results` and `references` folders from the project template.
- Updated the contribution process in `CONTRIBUTING.md`.
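The `confirm`/`confirms` mechanism mentioned above can be sketched without Kedro. The class and function names below are illustrative, not the actual Kedro API; the point is that the dataset's `confirm` hook runs only after the consuming node succeeds, which is what makes it safe to advance a checkpoint there.

```python
class CheckpointedDataSet:
    """Toy dataset that only advances its checkpoint when confirmed."""

    def __init__(self):
        self.pending = "partition-2020-02-26"
        self.checkpoint = None

    def load(self):
        return self.pending

    def confirm(self):
        # Called only after the consuming node has finished successfully.
        self.checkpoint = self.pending


def run_node(func, dataset, confirms=False):
    result = func(dataset.load())  # if this raises, confirm() never runs
    if confirms:
        dataset.confirm()
    return result


ds = CheckpointedDataSet()
run_node(lambda partition: partition.upper(), ds, confirms=True)
```

In Kedro itself, `IncrementalDataSet` is the natural user of this hook: listing it in a node's `confirms` argument means its checkpoint only moves forward once the downstream processing has succeeded.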
## Breaking changes to the API

- The existing `MatplotlibWriter` dataset in `contrib` was renamed to `MatplotlibLocalWriter`.
- `kedro/contrib/io/matplotlib/matplotlib_writer.py` was renamed to `kedro/contrib/io/matplotlib/matplotlib_local_writer.py`.
- `kedro.contrib.io.bioinformatics.sequence_dataset.py` was renamed to `kedro.contrib.io.bioinformatics.biosequence_local_dataset.py`.
## Thanks for supporting contributions
Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez