Kedro v0.16.0 Release Notes
Release Date: 2020-05-20
Major features and improvements
CLI
- Added new CLI commands (only available for projects created using Kedro 0.16.0 or later):
  - `kedro catalog list` to list datasets in your catalog
  - `kedro pipeline list` to list pipelines
  - `kedro pipeline describe` to describe a specific pipeline
  - `kedro pipeline create` to create a modular pipeline
- Improved the CLI speed by up to 50%.
- Improved error handling when making a typo on the CLI. We now suggest some of the possible commands you meant to type, `git`-style.
Framework
- All modules in `kedro.cli` and `kedro.context` have been moved into `kedro.framework.cli` and `kedro.framework.context` respectively. `kedro.cli` and `kedro.context` will be removed in future releases.
- Added `Hooks`, a new mechanism for extending Kedro.
- Fixed `load_context` changing the user's current working directory.
- Allowed the source directory to be configurable in `.kedro.yml`.
- Added the ability to specify nested parameter values inside your node inputs, e.g. `node(func, "params:a.b", None)`.
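To illustrate the addressing scheme, here is a minimal, self-contained sketch of how a nested key such as `params:a.b` maps onto a nested parameters dictionary. This is an illustration only, not Kedro's internal implementation, and `resolve_param` is a hypothetical helper:

```python
from functools import reduce

def resolve_param(parameters: dict, key: str):
    """Resolve a dotted key like "params:a.b" against a nested dict.

    Hypothetical helper, shown only to illustrate the addressing scheme.
    """
    path = key.removeprefix("params:").split(".")
    return reduce(lambda node, part: node[part], path, parameters)

# Parameters as they might appear after loading conf/base/parameters.yml
parameters = {"a": {"b": 42, "c": "hello"}}
print(resolve_param(parameters, "params:a.b"))  # 42
```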
DataSets
- Added the following new datasets.

| Type | Description | Location |
| --- | --- | --- |
| `pillow.ImageDataSet` | Work with image files using Pillow | `kedro.extras.datasets.pillow` |
| `geopandas.GeoJSONDataSet` | Work with geospatial data using GeoPandas | `kedro.extras.datasets.geopandas` |
| `api.APIDataSet` | Work with data from HTTP(S) API requests | `kedro.extras.datasets.api` |
- Added `joblib` backend support to `pickle.PickleDataSet`.
- Added versioning support to the `MatplotlibWriter` dataset.
- Added the ability to install dependencies for a given dataset with more granularity, e.g. `pip install "kedro[pandas.ParquetDataSet]"`.
- Added the ability to specify extra arguments, e.g. `encoding` or `compression`, for `fsspec.spec.AbstractFileSystem.open()` calls when loading/saving a dataset. See Example 3 in the docs.
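As an illustration, a catalog entry could forward such arguments to the filesystem `open()` call via `fs_args`. This is a hypothetical entry; the dataset name and paths are examples only:

```yaml
# conf/base/catalog.yml -- hypothetical entry; names and paths are examples
reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv.gz
  fs_args:
    open_args_load:
      # forwarded to fsspec's AbstractFileSystem.open() when loading
      mode: rb
      compression: gzip
    open_args_save:
      mode: wb
      compression: gzip
```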
Other
- Added a `namespace` property on `Node`, related to the modular pipeline where the node belongs.
- Added an option to enable asynchronous loading of inputs and saving of outputs in both the `SequentialRunner(is_async=True)` and `ParallelRunner(is_async=True)` classes.
- Added a `MemoryProfiler` transformer.
- Removed the requirement to have all dependencies for a dataset module in order to use only a subset of the datasets within.
- Added support for `pandas>=1.0`.
- Enabled Python 3.8 compatibility. Please note that a Spark workflow may be unreliable for this Python version as `pyspark` is not fully compatible with 3.8 yet.
- Renamed the "features" layer to "feature" layer to be consistent with (most) other layers and the relevant FAQ.
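Conceptually, the `is_async=True` mode overlaps I/O-bound dataset loads instead of performing them one after another. A rough, self-contained sketch of the idea using `concurrent.futures` directly; this is not Kedro's internal code, and `load` is a stand-in for a dataset load:

```python
from concurrent.futures import ThreadPoolExecutor

def load(name: str) -> str:
    # Stand-in for an I/O-bound dataset load.
    return f"data from {name}"

input_names = ["cars", "boats", "planes"]

# With is_async=True, a runner can kick off all of a node's input
# loads concurrently and gather the results before calling the node.
with ThreadPoolExecutor() as pool:
    loaded = list(pool.map(load, input_names))

print(loaded)
```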
Bug fixes and other changes
- ๐ Fixed a bug where a new version created mid-run by an external system caused inconsistencies in the load versions used in the current run.
- Documentation improvements:
  - Added instructions in the documentation on how to create a custom runner.
  - Updated the contribution process in `CONTRIBUTING.md` - added Developer Workflow.
  - Documented installation of the development version of Kedro in the FAQ section.
  - Added the missing `_exists` method to the `MyOwnDataSet` example in `04_user_guide/08_advanced_io`.
- Fixed a bug where `PartitionedDataSet` and `IncrementalDataSet` were not working with the `s3a` or `s3n` protocol.
- Added the ability to read a partitioned parquet file from a directory in `pandas.ParquetDataSet`.
- Replaced `functools.lru_cache` with `cachetools.cachedmethod` in `PartitionedDataSet` and `IncrementalDataSet` for per-instance cache invalidation.
- Implemented a custom glob function for `SparkDataSet` when running on Databricks.
- Fixed a bug in `SparkDataSet` not allowing for loading data from DBFS on a Windows machine using Databricks-connect.
- Improved the error message for `DataSetNotFoundError` to suggest possible dataset names the user meant to type.
- Added the option for contributors to run Kedro tests locally without a Spark installation with `make test-no-spark`.
- Added an option to lint the project without applying the formatting changes (`kedro lint --check-only`).
Breaking changes to the API
Datasets
- Deleted obsolete datasets from `kedro.io`.
- Deleted the `kedro.contrib` and `extras` folders.
- Deleted the obsolete `CSVBlobDataSet` and `JSONBlobDataSet` dataset types.
- Made the `invalidate_cache` method on datasets private.
- The `get_last_load_version` and `get_last_save_version` methods are no longer available on `AbstractDataSet`.
- `get_last_load_version` and `get_last_save_version` have been renamed to `resolve_load_version` and `resolve_save_version` on `AbstractVersionedDataSet`, the results of which are cached.
- The `release()` method on datasets extending `AbstractVersionedDataSet` clears the cached load and save version. All custom datasets must call `super()._release()` inside `_release()`.
- `TextDataSet` no longer has `load_args` and `save_args`. These can instead be specified under `open_args_load` or `open_args_save` in `fs_args`.
- The `PartitionedDataSet` and `IncrementalDataSet` method `invalidate_cache` was made private: `_invalidate_caches`.
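If you maintain a custom versioned dataset, the `super()._release()` requirement above means your `_release()` override must chain to the parent class so the cached versions are cleared. A minimal sketch of the pattern, using a stand-in base class rather than Kedro's `AbstractVersionedDataSet`:

```python
class VersionedBase:
    """Stand-in for a versioned dataset base class (not Kedro's)."""

    def __init__(self):
        # Cached resolved load/save versions.
        self._version_cache = {"load": "2020-05-20T00.00.00.000Z", "save": None}

    def _release(self):
        self._version_cache.clear()

class MyCustomDataSet(VersionedBase):
    def _release(self):
        # ...custom clean-up would go here...
        super()._release()  # required: clears the cached versions

ds = MyCustomDataSet()
ds._release()
print(ds._version_cache)  # {}
```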
Other
- Removed `KEDRO_ENV_VAR` from `kedro.context` to speed up the CLI run time.
- `Pipeline.name` has been removed in favour of `Pipeline.tag()`.
- Dropped `Pipeline.transform()` in favour of the `kedro.pipeline.modular_pipeline.pipeline()` helper function.
- Made the constant `PARAMETER_KEYWORDS` private, and moved it from `kedro.pipeline.pipeline` to `kedro.pipeline.modular_pipeline`.
- Layers are no longer part of the dataset object, as they have moved to the `DataCatalog`.
- Python 3.5 is no longer supported by the current and all future versions of Kedro.
Migration guide from Kedro 0.15.* to 0.16.0
Migration for datasets
Since all the datasets (from `kedro.io` and `kedro.contrib.io`) were moved to `kedro/extras/datasets`, you must update the type of all datasets in the `<project>/conf/base/catalog.yml` file.

Here is how it should be changed: `type: <SomeDataSet>` -> `type: <subfolder of kedro/extras/datasets>.<SomeDataSet>` (e.g. `type: CSVDataSet` -> `type: pandas.CSVDataSet`).

In addition, all the specific datasets like `CSVLocalDataSet`, `CSVS3DataSet` etc. were deprecated. Instead, you must use generalized datasets like `CSVDataSet`. E.g. `type: CSVS3DataSet` -> `type: pandas.CSVDataSet`.

Note: No changes are required if you are using your custom dataset.
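For example, a hypothetical catalog entry (the dataset name and path are illustrative) would change like this:

```yaml
# Before (0.15.*)
cars:
  type: CSVLocalDataSet
  filepath: data/01_raw/cars.csv
---
# After (0.16.0)
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/cars.csv
```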
Migration for Pipeline.transform()
`Pipeline.transform()` has been dropped in favour of the `pipeline()` constructor. The following changes apply:

- Remember to import `from kedro.pipeline import pipeline`
- The `prefix` argument has been renamed to `namespace`
- `datasets` has been broken down into more granular arguments:
  - `inputs`: Independent inputs to the pipeline
  - `outputs`: Any output created in the pipeline, whether an intermediary dataset or a leaf output
  - `parameters`: `params:...` or `parameters`
As an example, code that used to look like this with the `Pipeline.transform()` constructor:

```python
result = my_pipeline.transform(
    datasets={"input": "new_input", "output": "new_output", "params:x": "params:y"},
    prefix="pre",
)
```

When used with the new `pipeline()` constructor, becomes:

```python
from kedro.pipeline import pipeline

result = pipeline(
    my_pipeline,
    inputs={"input": "new_input"},
    outputs={"output": "new_output"},
    parameters={"params:x": "params:y"},
    namespace="pre",
)
```
Migration for decorators, color logger, transformers etc.
Since some modules were moved to other locations, you need to update the import paths appropriately. You can find the list of moved files in the `0.15.6` release notes under the section titled "Files with a new location".
Migration for KEDRO_ENV_VAR, the environment variable
> Note: If you haven't made significant changes to your `kedro_cli.py`, it may be easier to simply copy the updated `kedro_cli.py` and `.ipython/profile_default/startup/00-kedro-init.py` from GitHub or a newly generated project into your old project.

We've removed `KEDRO_ENV_VAR` from `kedro.context`. To get your existing project template working, you'll need to remove all instances of `KEDRO_ENV_VAR` from your project template:

- From the imports in `kedro_cli.py` and `.ipython/profile_default/startup/00-kedro-init.py`: `from kedro.context import KEDRO_ENV_VAR, load_context` -> `from kedro.framework.context import load_context`
- Remove the `envvar=KEDRO_ENV_VAR` line from the click options in `run`, `jupyter_notebook` and `jupyter_lab` in `kedro_cli.py`
- Replace `KEDRO_ENV_VAR` with `"KEDRO_ENV"` in `_build_jupyter_env`
- Replace `context = load_context(path, env=os.getenv(KEDRO_ENV_VAR))` with `context = load_context(path)` in `.ipython/profile_default/startup/00-kedro-init.py`
Migration for kedro build-reqs
We have upgraded `pip-tools`, which is used by `kedro build-reqs`, to 5.x. This `pip-tools` version requires `pip>=20.0`. To upgrade `pip`, please refer to their documentation.
Thanks for supporting contributions
@foolsgold, Mani Sarkar, Priyanka Shanbhag, Luis Blanche, Deepyaman Datta, Antony Milne, Panos Psimatikas, Tam-Sanh Nguyen, Tomasz Kaczmarczyk, Kody Fischer, Waylon Walker