Changelog History

  • v2.3.0.dev1

    June 09, 2020
  • v2.3.0.dev0

    June 03, 2020
  • v2.2.4 Changes

    March 12, 2020

    ✨ New features and improvements

    • NEW: Add Span.char_span method.
    • NEW: Base language support for Yoruba and Basque.
    • NEW: Add --tag-map-path argument to debug-data and train commands.
    • NEW: Add add_lemma option to displacy dependency visualizer.
    • Add IDX as an attribute available via Doc.to_array.
    • Improve speed of adding a large number of patterns to EntityRuler.
    • Replace python-mecab3 with fugashi for Japanese.
    • Improve language data for Norwegian, Luxembourgish, Finnish, Slovak, Romanian, Greek and German.

    🔴 Bug fixes

    • Fix issue #3979, #4819, #4871: Add tok2vec parameters to train command.
    • Fix issue #4009: Fix use of pretrained vectors in text classifier.
    • Fix issue #4342: Improve CLI training with base model.
    • Fix issue #4432: Add destructors for states in TransitionSystem.
    • Fix issue #4440: Require HEAD for is_parsed in Doc.from_array.
    • Fix issue #4615: Update SHAPE docs and examples.
    • Fix issue #4665: Allow HEAD field in CoNLL-U format to be an underscore.
    • Fix issue #4673: Ensure correct array module is used when returning a vector via Vocab.
    • Fix issue #4674: Make set_entities in the KnowledgeBase more robust.
    • Fix issue #4677: Add missing tags to tag maps for el, es and pt.
    • Fix issue #4688: Iterate over lr_edges until Doc.sents are correct.
    • Fix issue #4703, #4823: Facilitate large training files.
    • Fix issue #4707: Auto-exclude disabled when calling from_disk during load.
    • Fix issue #4717: Fix int value handling in Matcher.
    • Fix issue #4719: Add message when cli train script throws exception.
    • Fix issue #4723: Update EntityLinker example.
    • Fix issue #4725: Take care of global vectors in multiprocessing.
    • Fix issue #4770: Include Doc.cats in serialization of Doc and DocBin.
    • Fix issue #4772: Fix bug in EntityLinker.predict.
    • Fix issue #4777: Fix link to user hooks in documentation.
    • Fix issue #4829: Update build dependencies in pyproject.toml.
    • Fix issue #4830: Warn for punctuation in entities when training with noise.
    • Fix issue #4833: Make example scripts work with transformer starter models.
    • Fix issue #4849: Fix serialization of ENT_ID.
    • Fix issue #4862: Fix and improve URL pattern.
    • Fix issue #4868: Include .pyx and .pxd files in the distribution.
    • Fix issue #4876: Add friendlier error to entity linking example script.
    • Fix issue #4903: Fix handling of custom underscore attributes during multiprocessing.
    • Fix issue #4924: Fix handling of empty docs or golds in Language.evaluate.
    • Fix issue #4934: Prevent updating component config if the Model was already defined.
    • Fix issue #4935: Fix Sentencizer.pipe for empty Doc.
    • Fix issue #4961: Remove old docs section links.
    • Fix issue #4965: Sync Span.__eq__ and Span.__hash__.
    • Fix issue #4975: Adjust srsly pin.
    • Fix issue #5048: Fix behavior of get_doc test utility.
    • Fix issue #5073: Normalize IS_SENT_START to SENT_START for Matcher.
    • Fix issue #5075: Make it impossible to create invalid heads with Doc.from_array.
    • Fix issue #5082: Correctly set vector of merged span in merge_entities.
    • Fix issue #5115: Ensure paths in Tokenizer.to_disk and Tokenizer.from_disk.
    • Fix issue #5117: Clarify behavior of Doc.is_ flags for empty Docs.
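
Issue #4965 touches on a general Python rule: objects that compare equal must also hash equal, otherwise sets and dict keys behave inconsistently. The sketch below shows one common way to keep the two methods in sync by deriving both from the same key. The `SimpleSpan` class and its fields are hypothetical and are not spaCy's `Span` implementation.

```python
class SimpleSpan:
    """Hypothetical stand-in for a span; not spaCy's Span class."""

    def __init__(self, doc_id, start, end, label=""):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label

    def _key(self):
        # Single source of truth used by both equality and hashing.
        return (self.doc_id, self.start, self.end, self.label)

    def __eq__(self, other):
        if not isinstance(other, SimpleSpan):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = SimpleSpan("doc1", 0, 2, "ORG")
b = SimpleSpan("doc1", 0, 2, "ORG")
unique = {a, b}  # equal spans collapse into a single set member
```

Because both methods read `_key()`, adding a field to the comparison cannot silently leave the hash out of date.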

    📖 Documentation and examples

    • Fix various typos and inconsistencies.
    • Add new projects to the spaCy Universe.

    👥 Contributors

    Thanks to @polm, @mmaybeno, @jarib, @questoph, @aajanki, @mr-bjerre, @Tclack88, @thiagola92, @tamuhey, @Olamyy, @AlJohri, @iechevarria, @iurshina, @lineality, @pbadeer, @BramVanroy, @kabirkhan, @ceteri, @omri374, @maknotavailable, @onlyanegg, @drndos, @ju-sh, @nlptechbook, @chkoar, @Jan-711, @MisterKeefe, @bryant1410, @mirfan899, @dhpollack and @mabraham for the pull requests and contributions!

  • v2.2.4.dev0

    March 08, 2020
  • v2.2.3 Changes

    November 21, 2019

    ✨ New features and improvements

    • NEW: Tokenizer.explain method to see which rule or pattern was matched.

      tok_exp = nlp.tokenizer.explain("(don't)")
      assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
      assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]

    • NEW: Official Python 3.8 wheels for spaCy and its dependencies.

    • Base language support for Korean.

    • Add Scorer.las_per_type (labelled dependency scores per label).

    • Rework Chinese language initialization and tokenization.

    • Improve language data for Luxembourgish.

    🔴 Bug fixes

    • Fix issue #4573, #4645: Improve tokenizer usage docs.
    • Fix issue #4575: Add error in debug-data if no dev docs are available.
    • Fix issue #4582: Make as_tuples=True in Language.pipe work with multiprocessing.
    • Fix issue #4590: Correctly call on_match in DependencyMatcher.
    • Fix issue #4593: Build wheels for Python 3.8.
    • Fix issue #4604: Fix realloc in Retokenizer.split.
    • Fix issue #4656: Fix conllu2json converter when -n > 1.
    • Fix issue #4662: Fix Language.evaluate for components without .pipe method.
    • Fix issue #4670: Ensure EntityRuler is deserialized correctly from disk.
    • Fix issue #4680: Raise error if non-string labels are added to Tagger or TextCategorizer.
    • Fix issue #4691: Make Vectors.find return keys in correct order.

    📖 Documentation and examples

    • Fix various typos and inconsistencies.

    👥 Contributors

    Thanks to @yash1994, @walterhenry, @prilopes, @f11r, @questoph, @erip, @richardpaulhudson and @GuiGel for the pull requests and contributions.

  • v2.2.3.dev0

    November 21, 2019
  • v2.2.2 Changes

    October 31, 2019

    ✨ New features and improvements

    • NEW: Support multiprocessing in nlp.pipe via the n_process argument (Python 3 only).
    • Base language support for Luxembourgish.
    • Add noun chunks iterator for Swedish.
    • Retrained models for Greek, Norwegian Bokmål and Lithuanian that now correctly support parser-based sentence segmentation.
    • Repackaged models for Greek and German with improved lookup tables via spacy-lookups-data.
    • Add warning in debug-data for low sentences per doc ratio.
    • Improve checks and errors related to ill-formed IOB input in convert and debug-data CLI.
    • Support training dict format as JSONL.
    • Make EntityRuler ID resolution 2× faster and support "id" in patterns to set Token.ent_id.
    • Improve rendering of named entity spans in displacy for RTL languages.
    • Update Thinc to ditch thinc_gpu_ops for simpler GPU install.
    • Support Mish activation in spacy pretrain.
    • Add forwards-compatible support for new Language.disable_pipes API, which will become the default in the future. The method can now also take a list of component names as its first argument (instead of a variable number of arguments).

      - disabled = nlp.disable_pipes("tagger", "parser")
      + disabled = nlp.disable_pipes(["tagger", "parser"])

    • Add forwards-compatible support for new Matcher.add and PhraseMatcher.add API, which will become the default in the future. The patterns are now the second argument and a list (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.

      patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
      - matcher.add("GoogleNow", None, *patterns)
      + matcher.add("GoogleNow", patterns)
      - matcher.add("GoogleNow", on_match, *patterns)
      + matcher.add("GoogleNow", patterns, on_match=on_match)

    • Add new and improved tokenization alignment in gold.align behind a feature flag. The new alignment may produce backwards-incompatible results, so it won't be enabled by default before v3.0.

      import spacy.gold
      spacy.gold.USE_NEW_ALIGN = True
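
The Language.disable_pipes change listed above accepts both the old variable-argument call and the new single-list call. A minimal sketch of how such argument normalization can work is shown here. The helper `normalize_names` is hypothetical and is not part of spaCy's API.

```python
def normalize_names(*names):
    """Return component names as a list for both call styles.

    Hypothetical helper for illustration only:
        normalize_names("tagger", "parser")    # old style, variable arguments
        normalize_names(["tagger", "parser"])  # new style, a single list
    """
    # New style: exactly one argument that is itself a list or tuple.
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        return list(names[0])
    # Old style: each name was passed as its own positional argument.
    return list(names)


old_style = normalize_names("tagger", "parser")
new_style = normalize_names(["tagger", "parser"])
```

Both calls yield the same list, so code written against either style keeps working during the transition.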

    🔴 Bug fixes

    • Fix issue #1303: Support multiprocessing in nlp.pipe.
    • Fix issue #1745: Ditch thinc_gpu_ops for simpler GPU install.
    • Fix issue #2411: Update Thinc to fix compilation on cygwin.
    • Fix issue #3412: Prevent division by zero in Vectors.most_similar.
    • Fix issue #3618: Fix memory leak for long-running parsing processes.
    • Fix issue #4241: Update Greek lookups in spacy-lookups-data.
    • Fix issue #4269: Extend unicode character block for Sinhala.
    • Fix issue #4362: Improve URL_PATTERN and handling in tokenizer.
    • Fix issue #4373: Make PhraseMatcher.vocab consistent with Matcher.vocab.
    • Fix issue #4377: Clarify serialization of extension attributes.
    • Fix issue #4382: Improve usage of pkg_resources and handling of entry points.
    • Fix issue #4386: Consider batch_size when sorting similar vectors.
    • Fix issue #4389: Fix ner_jsonl2json converter.
    • Fix issue #4397: Ensure on_match callback is executed in PhraseMatcher.
    • Fix issue #4401, #4408: Fix sentence segmentation in Greek, Norwegian and Lithuanian models.
    • Fix issue #4402: Fix issue with how training data was passed through the pipeline.
    • Fix issue #4406: Correct spelling in lemmatizer API docs.
    • Fix issue #4418, #4438: Improve knowledge base and Wikidata parsing.
    • Fix issue #4435: Fix PhraseMatcher.remove for overlapping patterns.
    • Fix issue #4443: Fix bug in Vectors.most_similar.
    • Fix issue #4452: Fix gold.docs_to_json documentation.
    • Fix issue #4463: Add missing cats to GoldParse.from_annot_tuples in Scorer.
    • Fix issue #4470: Suppress convert output if writing to stdout.
    • Fix issue #4475: Correct mistake in docs example.
    • Fix issue #4485: Update tag maps and docs for English and German.
    • Fix issue #4493: Update information in spaCy Universe.
    • Fix issue #4496: Improve docs of PhraseMatcher.add arguments.
    • Fix issue #4506: Ensure Vectors.most_similar returns 1.0 for identical vectors.
    • Fix issue #4509: Fix None iteration error in entity linking script.
    • Fix issue #4524: Fix typo in Parser sample construction of GoldParse.
    • Fix issue #4528: Fix serialization of extension attribute values in DocBin.
    • Fix issue #4529: Ensure GoldParse is initialized correctly with misaligned tokens.
    • Fix issue #4538: Backport memory leak fix to v2.1.x branch and release v2.1.9.

    ⚠️ Backwards incompatibilities

    • The unused attributes lemma_rules, lemma_index, lemma_exc and lemma_lookup of the Language.Defaults have now been removed to prevent confusion (e.g. if users add rules that then have no effect). The only place lemmatization tables are stored and can be modified at runtime is via nlp.vocab.lookups.

      - nlp.Defaults.lemma_lookup["spaCies"] = "spaCy"
      + lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
      + lemma_lookup["spaCies"] = "spaCy"

    📖 Documentation and examples

    • Fix various typos and inconsistencies.
    • Add more projects to the spaCy Universe.

    👥 Contributors

    Thanks to @tamuhey, @PeterGilles, @akornilo, @danielkingai2, @ghollah, @pberba, @gustavengstrom, @ju-sh, @kabirkhan, @ZhuoruLin, @nipunsadvilkar and @neelkamath for the pull requests and contributions.

  • v2.2.1 Changes

    October 03, 2019

    ✨ New features and improvements

    • Make Vectors.most_similar return the top most similar vectors instead of only one.
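
Returning several nearest neighbours instead of a single best match can be sketched in plain Python. This only illustrates the idea and is not spaCy's Vectors.most_similar implementation. The function names, the `n` parameter and the toy table are assumptions made for the example.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # guard against division by zero
    return dot / (norm_u * norm_v)


def top_n_similar(query, table, n=2):
    """Return the n (key, score) pairs most similar to query, best first."""
    scored = ((key, cosine(query, vec)) for key, vec in table.items())
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Toy vector table (made-up values).
table = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "empty": [0.0, 0.0],
}
results = top_n_similar([1.0, 0.0], table, n=2)
```

Here `results` holds the two best keys in descending order of similarity, and the all-zero vector is scored as 0.0 rather than raising an error.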

    🔴 Bug fixes

    • Fix issue #4365: Fix tag map in Dutch model.
    • Fix issue #4368: Fix initialization of DocBin with attributes.

    📖 Documentation and examples

    👥 Contributors

    Thanks to @bintay and @svlandeg for the pull requests and contributions.

  • v2.2.0 Changes

    October 02, 2019

    ⚠️ This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

    ✨ New features and improvements

    • NEW: Pretrained core models for Norwegian (MIT) and Lithuanian (CC BY-SA).
    • NEW: Better pre-trained Dutch NER using custom labelled UD corpus instead of WikiNER.
    • NEW: Make spaCy roughly 5-10× smaller on disk (depending on your platform) by compressing and moving lookups to a separate package.
    • NEW: EntityLinker and KnowledgeBase API to train and access entity linking models, plus scripts to train your own Wikidata models.
    • NEW: 10× faster PhraseMatcher and improved phrase matching algorithm.
    • NEW: DocBin class to efficiently serialize collections of Doc objects.
    • NEW: Train text classification models on the command line with spacy train and get textcat results via the Scorer.
    • NEW: debug-data command to validate your training and development data, get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more.
    • NEW: Efficient Lookups class using Bloom filters that allows storing, accessing and serializing large dictionaries via vocab.lookups.
    • Data augmentation in spacy train via the --orth-variant-level flag, which defines the percentage of occurrences of some tokens subject to replacement during training.
    • Add nlp.pipe_labels (labels assigned by pipeline components) and include "labels" in nlp.meta.
    • Support spacy_displacy_colors entry point to allow packages to add entity colors to displacy.
    • Allow template config option in displacy to customize entity HTML template.
    • Improve match pattern validation and handling of unsupported attributes.
    • Add lookup lemmatization data for Croatian and Serbian.
    • Update and improve language data for Chinese, Croatian, Thai, Romanian, Hindi and English.
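
The --orth-variant-level flag is described above as the share of token occurrences that are subject to replacement. As a rough sketch of that idea, the code below swaps tokens for orthographic variants with a given probability. It is not spaCy's implementation: the `VARIANTS` table, the `augment` function and the explicit random generator are assumptions made for illustration.

```python
import random

# Hypothetical variant table: each token maps to alternative spellings.
VARIANTS = {
    '"': ["``", "''"],
    "...": ["\u2026"],  # single-character ellipsis
}


def augment(tokens, level, rng):
    """Replace tokens that have variants with probability `level` (0.0 to 1.0)."""
    out = []
    for token in tokens:
        options = VARIANTS.get(token)
        if options and rng.random() < level:
            out.append(rng.choice(options))
        else:
            out.append(token)
    return out


tokens = ["He", "said", '"', "wait", "..."]
unchanged = augment(tokens, 0.0, random.Random(0))     # level 0: nothing replaced
all_replaced = augment(tokens, 1.0, random.Random(0))  # level 1: every candidate replaced
```

Tokens without an entry in the table are never touched, so the level only controls how often the known candidates are varied.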

    🔴 Bug fixes

    • Fix issue #3258: Reduce package size on disk by moving and compressing large dictionaries.
    • Fix issue #3540: Update lemma and vector information after splitting a token.
    • Fix issue #3687: Automatically skip duplicates in Doc.retokenize.
    • Fix issue #3830: Retrain German model and fix subtok errors.
    • Fix issue #3850: Allow customizing entity HTML template in displaCy.
    • Fix issue #3879, #3951, #4154: Fix bug in Matcher retry loop that'd cause problems with ? operator.
    • Fix issue #3917: Raise error for negative token indices in displacy.
    • Fix issue #3922: Add PhraseMatcher.remove method.
    • Fix issue #3959, #4133: Make sure both pos and tag are correctly serialized.
    • Fix issue #3972: Ensure PhraseMatcher returns multiple matches for identical rules.
    • Fix issue #4020: Raise error for overlapping entities in biluo_tags_from_offsets.
    • Fix issue #4051: Ensure retokenizer sets POS tags correctly on merge.
    • Fix issue #4070: Improve token pattern checking without validation.
    • Fix issue #4096: Add checks for cycles in debug-data.
    • Fix issue #4100: Improve docs on phrase pattern attributes.
    • Fix issue #4102: Correct mistakes in English lookup lemmatizer data.
    • Fix issue #4104: Make visualized NER examples in docs more clear.
    • Fix issue #4107: Automatically set span root attributes on merging.
    • Fix issue #4111, #4170: Improve NER/IOB converters.
    • Fix issue #4120: Correctly handle ? operator at the end of pattern.
    • Fix issue #4123: Provide more details in cycle error message E069.
    • Fix issue #4138: Correctly open .html files as UTF-8 in evaluate command.
    • Fix issue #4139: Make emoticon data a raw string.
    • Fix issue #4148: Add missing API docs for force flag on set_extension.
    • Fix issue #4155: Correct language code for Serbian.
    • Fix issue #4165: Add more attributes to matcher validation schema.
    • Fix issue #4190: Fix caching issue that'd cause tokenizer to not be deserialized correctly.
    • Fix issue #4200: Work around tqdm bug that'd remove text color from terminal output.
    • Fix issue #4229: Fix handling of pre-set entities.
    • Fix issue #4238: Flush tokenizer cache when affixes, token_match, or special cases are modified.
    • Fix issue #4242: Make .pos/.tag distinction more clear in the docs.
    • Fix issue #4245: Fix bug that occurred when processing empty string in Korean.
    • Fix issue #4262: Fix handling of spaces in Japanese.
    • Fix issue #4269: Tokenize punctuation correctly in Kannada, Tamil, and Telugu and add unicode characters to default sentencizer config.
    • Fix issue #4270: Fix --vectors-loc documentation.
    • Fix issue #4302: Remove duplicate Parser.tok2vec property.
    • Fix issue #4303: Correctly support as_tuples and return_matches in Matcher.pipe.
    • Fix issue #4307: Ensure that pre-set entities are preserved and allow overwriting unset tokens.
    • Fix issue #4308: Fix bug that could cause PhraseMatcher with very large lists to miss matches.
    • Fix issue #4348: Ensure training doesn't crash with empty batches.

    ⚠️ Backwards incompatibilities

    • This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
    • The lemmatization tables have been moved to their own package, spacy-lookups-data, which is not installed by default. If you're using pre-trained models, nothing changes, because the tables are now included in the model packages. If you want to use the lemmatizer for other languages that don't yet have pre-trained models (e.g. Turkish or Croatian) or start off with a blank model that contains lookup data (e.g. spacy.blank("en")), you'll need to explicitly install spaCy plus data via pip install spacy[lookups]. The data will be registered automatically via entry points.
    • Lemmatization tables (rules, exceptions, index and lookups) are now part of the Vocab and serialized with it. This means that serialized objects (nlp, pipeline components, vocab) will now include additional data, and models written to disk will include additional files.
    • The Lemmatizer class is now initialized with an instance of Lookups containing the rules and tables, instead of dicts as separate arguments. This makes it easier to share data tables and modify them at runtime. This is mostly internals, but if you've been implementing a custom Lemmatizer, you'll need to update your code.
    • If you've been training your own models, you'll need to retrain them with the new version.
    • The Dutch model has been trained on a new NER corpus (custom labelled UD instead of WikiNER), so its predictions may be very different compared to the previous version. The results should be significantly better and more generalizable, though.
    • The spacy download command does not set the --no-deps pip argument anymore by default, meaning that model package dependencies (if available) will now also be downloaded and installed. If spaCy (which is also a model dependency) is not installed in the current environment, e.g. if a user has built from source, --no-deps is added back automatically to prevent spaCy from being downloaded and installed again from pip.
    • The built-in biluo_tags_from_offsets converter is now stricter and will raise an error if entities are overlapping (instead of silently skipping them). If your data contains invalid entity annotations, make sure to clean it and resolve conflicts. You can now also use the new debug-data command to find problems in your data.
    • Pipeline components can now overwrite IOB tags of tokens that are not yet part of an entity. Once a token has an ent_iob value set, it won't be reset to an "unset" state and will always have at least O assigned. list(doc.ents) now actually keeps the annotations on the token level consistent, instead of resetting O to an empty string.
    • The default punctuation in the Sentencizer has been extended and now includes more characters common in various languages. This also means that the results it produces may change, depending on your text. If you want the previous behaviour with limited characters, set punct_chars=[".", "!", "?"] on initialization.
    • The PhraseMatcher algorithm was rewritten from scratch and it's now 10× faster. The rewrite also resolved a few subtle bugs with very large terminology lists. So if you were matching large lists, you may see slightly different results; however, the results should now be fully correct. See #4309 for details on this change.
    • The Serbian language class (introduced in v2.1.8) incorrectly used the language code rs instead of sr. This has now been fixed, so Serbian is now available via spacy.lang.sr.
    • The "sources" in the meta.json have changed from a list of strings to a list of dicts. This is mostly internals, but if your code used nlp.meta["sources"], you might have to update it.
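
Since biluo_tags_from_offsets now raises an error on overlapping entities, it can help to check annotations before conversion. The following is a minimal pre-check, assuming entities are `(start_char, end_char, label)` tuples with an exclusive end offset. The helper name `find_overlaps` is hypothetical and not part of spaCy.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) entities whose character spans overlap."""
    overlaps = []
    ordered = sorted(entities, key=lambda ent: (ent[0], ent[1]))
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so no later entity can overlap `first`
            overlaps.append((first, second))
    return overlaps


clean = [(0, 5, "ORG"), (10, 16, "GPE")]
conflicting = [(0, 10, "ORG"), (5, 12, "PERSON"), (20, 25, "GPE")]

clean_report = find_overlaps(clean)
conflict_report = find_overlaps(conflicting)
```

An empty report means the offsets are free of overlaps; any reported pair needs to be resolved by hand or by a project-specific rule.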

    📈 Benchmarks

    Model            Language    Version  UAS    LAS    POS    NER F  Vec  Size
    en_core_web_sm   English     2.2.0    91.61  89.71  97.03  85.07  𐄂    11 MB
    en_core_web_md   English     2.2.0    91.65  89.77  97.14  86.10  ✓    91 MB
    en_core_web_lg   English     2.2.0    91.98  90.16  97.21  86.30  ✓    789 MB
    de_core_news_sm  German      2.2.0    90.75  88.63  96.29  83.11  𐄂    14 MB
    de_core_news_md  German      2.2.0    91.26  89.36  96.44  83.42  ✓    214 MB
    es_core_news_sm  Spanish     2.2.0    90.20  87.05  96.79  89.45  𐄂    15 MB
    es_core_news_md  Spanish     2.2.0    90.89  87.94  97.03  89.86  ✓    74 MB
    pt_core_news_sm  Portuguese  2.2.0    89.53  86.07  79.96  87.97  𐄂    20 MB
    fr_core_news_sm  French      2.2.0    87.27  84.28  94.38  82.77  𐄂    14 MB
    fr_core_news_md  French      2.2.0    88.82  86.07  95.15  82.82  ✓    84 MB
    it_core_news_sm  Italian     2.2.0    90.79  86.94  96.06  86.29  𐄂    13 MB
    nl_core_news_sm  Dutch       2.2.0    76.79  69.53  90.10  68.79  𐄂    14 MB
    el_core_news_sm  Greek       2.2.0    84.40  80.98  94.41  71.88  𐄂    10 MB
    el_core_news_md  Greek       2.2.0    87.96  84.88  96.38  77.59  ✓    126 MB
    nb_core_news_sm  Norwegian   2.2.0    89.02  86.49  95.72  83.99  𐄂    12 MB
    lt_core_news_sm  Lithuanian  2.2.0    59.87  48.00  74.02  76.58  𐄂    12 MB
    xx_ent_wiki_sm   Multi       2.2.0    -      -      -      79.88  𐄂    3 MB

    💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
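
To make the UAS and LAS columns concrete, here is a small sketch of how such scores are commonly computed: UAS counts tokens with the correct head, and LAS additionally requires the correct dependency label. The function name and the data are made up for illustration. This is not spaCy's Scorer, which may differ in detail (for example in how punctuation is treated).

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for parallel lists of (head, label)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0, 0.0
    correct_heads = 0
    correct_labelled = 0
    for (gold_head, gold_label), (pred_head, pred_label) in zip(gold, predicted):
        if gold_head == pred_head:
            correct_heads += 1
            # A labelled match requires the head to be right as well.
            if gold_label == pred_label:
                correct_labelled += 1
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_labelled / total


# Toy example: head index and dependency label for each of four tokens.
gold = [(1, "nsubj"), (1, "ROOT"), (3, "det"), (1, "dobj")]
pred = [(1, "nsubj"), (1, "ROOT"), (1, "det"), (1, "obj")]
uas, las = attachment_scores(gold, pred)
```

In this toy case three of four heads are correct and two of those also carry the right label, so LAS can never exceed UAS.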

    📖 Documentation and examples

    👥 Contributors

    Thanks to @ICLRandD, @phiedulxp, @ajrader, @RyanZHe, @jenojp, @yanaiela, @isaric, @mrdbourke, @avramandrei, @Pavle992, @chkoar, @wannaphongcom, @BreakBB, @b1uec0in, @mihaigliga21, @tamuhey, @euand, @Hazoom, @SeanBE, @esemeniuc, @zqianem, @ajkl, @jaydeepborkar, @EarlGreyT and @er-raoniz for the pull requests and contributions.

    Special thanks to our spaCy team @svlandeg and @adrianeboyd for the bug fixes and new features, @polm for the Bloom filters implementation and data compression, and @yvespeirsman, @lemontheme, @jarib, @miktoki and @rokasramas for the help and resources for the new models.

  • v2.1.9 Changes

    October 28, 2019

    This is a small maintenance update that backports a bug fix for a memory leak that'd occur in long-running parsing processes. It's intended for users who can't or don't yet want to upgrade to spaCy v2.2 (e.g. because it requires retraining all the models). If you're able to upgrade, skip this version and install the latest v2.2 instead.

    🔴 Bug fixes

    • Fix issue #3618: Fix memory leak for long-running parsing processes.
    • Fix issue #4538: Backport memory leak fix to v2.1.x branch.