# Kedro v0.15.6 Release Notes

Release date: 2020-02-26
## Major features and improvements

TL;DR: We're launching `kedro.extras`, the new home for our revamped series of datasets, decorators and dataset transformers. The datasets in `kedro.extras.datasets` use `fsspec` to access a variety of data stores including local file systems, network file systems, cloud object stores (including S3 and GCP), and Hadoop; read more about this in the documentation. The change will allow #178 to happen in the next major release of Kedro.

An example of this new system can be seen below, loading the CSV `SparkDataSet` from S3:

```yaml
weather:
  type: spark.SparkDataSet  # Observe the specified type, this affects all datasets
  filepath: s3a://your_bucket/data/01_raw/weather*  # filepath uses fsspec to indicate the file storage system
  credentials: dev_s3
  file_format: csv
```
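The `credentials: dev_s3` key in the example above refers to an entry in the project's credentials configuration (typically `conf/local/credentials.yml`). A hypothetical entry might look like the following; the key names shown are illustrative, so check the Kedro documentation for the exact keys your dataset expects:

```yaml
dev_s3:
  aws_access_key_id: YOUR_ACCESS_KEY_ID
  aws_secret_access_key: YOUR_SECRET_ACCESS_KEY
```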
You can also load data incrementally whenever it is dumped into a directory, thanks to the extension to `PartitionedDataSet`, a feature that allows you to load a directory of files. The `IncrementalDataSet` stores the information about the last processed partition in a `checkpoint`; read more about this feature in the documentation.
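The checkpoint idea behind `IncrementalDataSet` can be illustrated with a minimal, Kedro-free sketch. All names and the file layout below are illustrative, not the actual Kedro API: only partitions newer than the recorded checkpoint are returned, and the checkpoint advances after each load.

```python
import os
import tempfile

def load_incrementally(directory, checkpoint_path):
    """Return only the partitions added since the last recorded checkpoint."""
    last = ""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            last = f.read().strip()
    # Partitions are compared in lexicographic order, e.g. date-stamped names.
    new_partitions = sorted(p for p in os.listdir(directory) if p > last)
    if new_partitions:
        with open(checkpoint_path, "w") as f:
            f.write(new_partitions[-1])  # remember the last processed partition
    return new_partitions

# Example: two runs over a growing directory.
data_dir = tempfile.mkdtemp()
checkpoint = os.path.join(tempfile.mkdtemp(), "CHECKPOINT")
for name in ("2020-01-01.csv", "2020-01-02.csv"):
    open(os.path.join(data_dir, name), "w").close()
first = load_incrementally(data_dir, checkpoint)   # both partitions are new
open(os.path.join(data_dir, "2020-01-03.csv"), "w").close()
second = load_incrementally(data_dir, checkpoint)  # only the newly added one
```

The real implementation persists its checkpoint via `fsspec` as well, so it works on remote storage, but the control flow is essentially the one sketched here.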
## New features

- Added a `layer` attribute for datasets in `kedro.extras.datasets` to specify the name of a layer according to data engineering convention; this feature will be passed to `kedro-viz` in future releases.
- Enabled loading a particular version of a dataset in Jupyter Notebooks and IPython, using `catalog.load("dataset_name", version="<2019-12-13T15.08.09.255Z>")`.
- Added property `run_id` on `ProjectContext`, used for versioning using the `Journal`. To customise your journal `run_id` you can override the private method `_get_run_id()`.
- Added the ability to install all optional kedro dependencies via `pip install "kedro[all]"`.
- Modified the `DataCatalog`'s load order for datasets. The loading order is now the following:
  1. `kedro.io`
  2. `kedro.extras.datasets`
  3. Import path, specified in `type`
- Added an optional `copy_mode` flag to `CachedDataSet` and `MemoryDataSet` to specify the copy mode (`deepcopy`, `copy` or `assign`) to use when loading and saving.
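The difference between the three copy modes can be shown with a small, Kedro-free sketch using only the standard library; `_copy_with_mode` is an illustrative helper, not the actual Kedro function:

```python
import copy

def _copy_with_mode(data, copy_mode):
    # "deepcopy": fully independent copy; nested objects are duplicated too
    if copy_mode == "deepcopy":
        return copy.deepcopy(data)
    # "copy": shallow copy; a new container, but nested objects are shared
    if copy_mode == "copy":
        return copy.copy(data)
    # "assign": no copy at all; the very same object is handed out
    if copy_mode == "assign":
        return data
    raise ValueError(f"Unknown copy mode: {copy_mode}")

original = {"rows": [1, 2, 3]}
assigned = _copy_with_mode(original, "assign")    # same object
shallow = _copy_with_mode(original, "copy")       # new dict, shared list
deep = _copy_with_mode(original, "deepcopy")      # new dict, new list
```

`assign` is the cheapest but mutations leak between nodes; `deepcopy` is the safest but can be expensive for large in-memory data, which is why the flag is worth exposing.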
## New Datasets

| Type | Description | Location |
| --- | --- | --- |
| `ParquetDataSet` | Handles parquet datasets using Dask | `kedro.extras.datasets.dask` |
| `PickleDataSet` | Work with Pickle files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pickle` |
| `CSVDataSet` | Work with CSV files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `TextDataSet` | Work with text files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.text` |
| `ExcelDataSet` | Work with Excel files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `HDFDataSet` | Work with HDF using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `YAMLDataSet` | Work with YAML files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.yaml` |
| `MatplotlibWriter` | Save Matplotlib images using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.matplotlib` |
| `NetworkXDataSet` | Work with NetworkX files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.networkx` |
| `BioSequenceDataSet` | Work with bio-sequence objects using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.biosequence` |
| `GBQTableDataSet` | Work with Google BigQuery | `kedro.extras.datasets.pandas` |
| `FeatherDataSet` | Work with feather files using `fsspec` to communicate with the underlying filesystem | `kedro.extras.datasets.pandas` |
| `IncrementalDataSet` | Inherits from `PartitionedDataSet` and remembers the last processed partition | `kedro.io` |
## Files with a new location

| Type | New Location |
| --- | --- |
| `JSONDataSet` | `kedro.extras.datasets.pandas` |
| `CSVBlobDataSet` | `kedro.extras.datasets.pandas` |
| `JSONBlobDataSet` | `kedro.extras.datasets.pandas` |
| `SQLTableDataSet` | `kedro.extras.datasets.pandas` |
| `SQLQueryDataSet` | `kedro.extras.datasets.pandas` |
| `SparkDataSet` | `kedro.extras.datasets.spark` |
| `SparkHiveDataSet` | `kedro.extras.datasets.spark` |
| `SparkJDBCDataSet` | `kedro.extras.datasets.spark` |
| `kedro/contrib/decorators/retry.py` | `kedro/extras/decorators/retry_node.py` |
| `kedro/contrib/decorators/memory_profiler.py` | `kedro/extras/decorators/memory_profiler.py` |
| `kedro/contrib/io/transformers/transformers.py` | `kedro/extras/transformers/time_profiler.py` |
| `kedro/contrib/colors/logging/color_logger.py` | `kedro/extras/logging/color_logger.py` |
| `extras/ipython_loader.py` | `tools/ipython/ipython_loader.py` |
| `kedro/contrib/io/cached/cached_dataset.py` | `kedro/io/cached_dataset.py` |
| `kedro/contrib/io/catalog_with_default/data_catalog_with_default.py` | `kedro/io/data_catalog_with_default.py` |
| `kedro/contrib/config/templated_config.py` | `kedro/config/templated_config.py` |
## Upcoming deprecations

| Category | Type |
| --- | --- |
| Datasets | `BioSequenceLocalDataSet` |
|  | `CSVGCSDataSet` |
|  | `CSVHTTPDataSet` |
|  | `CSVLocalDataSet` |
|  | `CSVS3DataSet` |
|  | `ExcelLocalDataSet` |
|  | `FeatherLocalDataSet` |
|  | `JSONGCSDataSet` |
|  | `JSONLocalDataSet` |
|  | `HDFLocalDataSet` |
|  | `HDFS3DataSet` |
|  | `kedro.contrib.io.cached.CachedDataSet` |
|  | `kedro.contrib.io.catalog_with_default.DataCatalogWithDefault` |
|  | `MatplotlibLocalWriter` |
|  | `MatplotlibS3Writer` |
|  | `NetworkXLocalDataSet` |
|  | `ParquetGCSDataSet` |
|  | `ParquetLocalDataSet` |
|  | `ParquetS3DataSet` |
|  | `PickleLocalDataSet` |
|  | `PickleS3DataSet` |
|  | `TextLocalDataSet` |
|  | `YAMLLocalDataSet` |
| Decorators | `kedro.contrib.decorators.memory_profiler` |
|  | `kedro.contrib.decorators.retry` |
|  | `kedro.contrib.decorators.pyspark.spark_to_pandas` |
|  | `kedro.contrib.decorators.pyspark.pandas_to_spark` |
| Transformers | `kedro.contrib.io.transformers.transformers` |
| Configuration Loaders | `kedro.contrib.config.TemplatedConfigLoader` |

## Bug fixes and other changes
- Added the option to set/overwrite params in `config.yaml` using YAML dict style instead of string CLI formatting only.
- Kedro CLI arguments `--node` and `--tag` support comma-separated values; alternative methods will be deprecated in future releases.
- Fixed a bug in the `invalidate_cache` method of `ParquetGCSDataSet` and `CSVGCSDataSet`.
- `--load-version` now won't break if the version value contains a colon.
- Enabled running `node`s with duplicate inputs.
- Improved the error message when empty credentials are passed into `SparkJDBCDataSet`.
- Fixed a bug that caused an empty project to fail unexpectedly with `ImportError` in `template/.../pipeline.py`.
- Fixed a bug related to saving dataframes with categorical variables in table mode using `HDFS3DataSet`.
- Fixed a bug that caused unexpected behaviour when using `from_nodes` and `to_nodes` in pipelines using transcoding.
- Credentials nested in the dataset config are now also resolved correctly.
- Bumped the minimum required pandas version to 0.24.0 to make use of `pandas.DataFrame.to_numpy` (recommended alternative to `pandas.DataFrame.values`).
- Docs improvements.
- `Pipeline.transform` skips modifying node inputs/outputs containing `params:` or `parameters` keywords.
- Support for the `dataset_credentials` key in the credentials for `PartitionedDataSet` is now deprecated. The dataset credentials should be specified explicitly inside the dataset config.
- Datasets can have a new `confirm` function, which is called after a successful node function execution if the node contains a `confirms` argument with such a dataset name.
- Made the resume prompt on pipeline run failure use `--from-nodes` instead of `--from-inputs` to avoid unnecessarily re-running nodes that had already executed.
- When closed, Jupyter notebook kernels are automatically terminated after 30 seconds of inactivity by default. Use the `--idle-timeout` option to update it.
- Added `kedro-viz` to the Kedro project template's `requirements.txt` file.
- Removed the `results` and `references` folders from the project template.
- Updated the contribution process in `CONTRIBUTING.md`.
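The `confirm`/`confirms` mechanism mentioned above can be sketched without Kedro. The class and function names below are illustrative, not the actual Kedro API; the point is that the dataset's `confirm` hook runs only after the consuming node succeeds, which is what makes it safe to advance a checkpoint there.

```python
class CheckpointedDataSet:
    """Toy dataset that only advances its checkpoint when confirmed."""

    def __init__(self):
        self.pending = "partition-2020-02-26"
        self.checkpoint = None

    def load(self):
        return self.pending

    def confirm(self):
        # Called only after the consuming node has finished successfully.
        self.checkpoint = self.pending


def run_node(func, dataset, confirms=False):
    result = func(dataset.load())  # if this raises, confirm() never runs
    if confirms:
        dataset.confirm()
    return result


ds = CheckpointedDataSet()
run_node(lambda partition: partition.upper(), ds, confirms=True)
```

In Kedro itself, `IncrementalDataSet` is the natural user of this hook: listing it in a node's `confirms` argument means its checkpoint only moves forward once the downstream processing has succeeded.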
## Breaking changes to the API

- The existing `MatplotlibWriter` dataset in `contrib` was renamed to `MatplotlibLocalWriter`.
- `kedro/contrib/io/matplotlib/matplotlib_writer.py` was renamed to `kedro/contrib/io/matplotlib/matplotlib_local_writer.py`.
- `kedro.contrib.io.bioinformatics.sequence_dataset.py` was renamed to `kedro.contrib.io.bioinformatics.biosequence_local_dataset.py`.
## Thanks for supporting contributions
Andrii Ivaniuk, Jonas Kemper, Yuhao Zhu, Balazs Konig, Pedro Abreu, Tam-Sanh Nguyen, Peter Zhao, Deepyaman Datta, Florian Roessler, Miguel Rodriguez Gutierrez