Programming language: Python

License: MIT License

Tags: Text Processing Data Analysis Scientific Engineering Linguistic Human Machine Interfaces Diff

Latest version: v4.2.0

TextDistance

TextDistance is a Python library for computing the distance between two or more sequences using many algorithms.

Features:

30+ algorithms
Pure Python implementation
Simple usage
Comparison of more than two sequences
Some algorithms have more than one implementation in one class
Optional numpy usage for maximum speed

Algorithms

Edit based

Algorithm	Class	Functions
Hamming	`Hamming`	`hamming`
MLIPNS	`Mlipns`	`mlipns`
Levenshtein	`Levenshtein`	`levenshtein`
Damerau-Levenshtein	`DamerauLevenshtein`	`damerau_levenshtein`
Jaro-Winkler	`JaroWinkler`	`jaro_winkler`, `jaro`
Strcmp95	`StrCmp95`	`strcmp95`
Needleman-Wunsch	`NeedlemanWunsch`	`needleman_wunsch`
Gotoh	`Gotoh`	`gotoh`
Smith-Waterman	`SmithWaterman`	`smith_waterman`

Token based

Algorithm	Class	Functions
Jaccard index	`Jaccard`	`jaccard`
Sørensen–Dice coefficient	`Sorensen`	`sorensen`, `sorensen_dice`, `dice`
Tversky index	`Tversky`	`tversky`
Overlap coefficient	`Overlap`	`overlap`
Tanimoto distance	`Tanimoto`	`tanimoto`
Cosine similarity	`Cosine`	`cosine`
Monge-Elkan	`MongeElkan`	`monge_elkan`
Bag distance	`Bag`	`bag`

Sequence based

Algorithm	Class	Functions
longest common subsequence similarity	`LCSSeq`	`lcsseq`
longest common substring similarity	`LCSStr`	`lcsstr`
Ratcliff-Obershelp similarity	`RatcliffObershelp`	`ratcliff_obershelp`

Compression based

Normalized compression distance with different compression algorithms.

Classic compression algorithms:

Algorithm	Class	Function
Arithmetic coding	`ArithNCD`	`arith_ncd`
RLE	`RLENCD`	`rle_ncd`
BWT RLE	`BWTRLENCD`	`bwtrle_ncd`

Normal compression algorithms:

Algorithm	Class	Function
Square Root	`SqrtNCD`	`sqrt_ncd`
Entropy	`EntropyNCD`	`entropy_ncd`

Work-in-progress algorithms that compare two strings as arrays of bits:

Algorithm	Class	Function
BZ2	`BZ2NCD`	`bz2_ncd`
LZMA	`LZMANCD`	`lzma_ncd`
ZLib	`ZLIBNCD`	`zlib_ncd`

See blog post for more details about NCD.
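As a rough illustration of the idea, the sketch below computes the commonly cited NCD formula, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), using zlib from the standard library. The helper names `compressed_size` and `ncd` are hypothetical, and this is not necessarily how the classes above are implemented.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Hypothetical helper for illustration only, not the library's code.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


text = b"the quick brown fox jumps over the lazy dog " * 20
other = bytes(range(256)) * 4

print(ncd(text, text))   # small: the second copy compresses almost for free
print(ncd(text, other))  # larger: the inputs share little structure
```

Values close to 0 suggest that the inputs share most of their structure, while values close to 1 suggest they have little in common.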

Phonetic

Algorithm	Class	Functions
MRA	`MRA`	`mra`
Editex	`Editex`	`editex`

Simple

Algorithm	Class	Functions
Prefix similarity	`Prefix`	`prefix`
Postfix similarity	`Postfix`	`postfix`
Length distance	`Length`	`length`
Identity similarity	`Identity`	`identity`
Matrix similarity	`Matrix`	`matrix`

Installation

Stable

Pure Python implementation only:

pip install textdistance

With extra libraries for maximum speed:

pip install "textdistance[extras]"

With all libraries (required for benchmarking and testing):

pip install "textdistance[benchmark]"

With algorithm specific extras:

pip install "textdistance[Hamming]"

Algorithms with available extras: DamerauLevenshtein, Hamming, Jaro, JaroWinkler, Levenshtein.

Dev

Via pip:

pip install -e git+https://github.com/life4/textdistance.git#egg=textdistance

Or clone the repo and install with some extras:

git clone https://github.com/life4/textdistance.git
cd textdistance
pip install -e ".[benchmark]"

Usage

Every algorithm has two interfaces:

A class with algorithm-specific parameters for customization.
A class instance with default parameters for quick and simple usage.

All algorithms have some common methods:

.distance(*sequences) -- calculate distance between sequences.
.similarity(*sequences) -- calculate similarity for sequences.
.maximum(*sequences) -- maximum possible value for distance and similarity. For any sequences: distance + similarity == maximum.
.normalized_distance(*sequences) -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal, and 1 totally different.
.normalized_similarity(*sequences) -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different, and 1 equal.
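The relationship between these methods can be sketched for the Hamming case without the library. The function names below are hypothetical and this is not the library's implementation; it only mirrors the invariant distance + similarity == maximum and the two normalized forms.

```python
from itertools import zip_longest


def hamming_distance(a, b):
    """Count positions where the sequences differ; unmatched tail items count too."""
    sentinel = object()  # compares unequal to every real item
    return sum(x != y for x, y in zip_longest(a, b, fillvalue=sentinel))


def hamming_maximum(a, b):
    """Largest possible distance: every position of the longer sequence differs."""
    return max(len(a), len(b))


def hamming_similarity(a, b):
    """Defined so that distance + similarity == maximum."""
    return hamming_maximum(a, b) - hamming_distance(a, b)


def normalized_distance(a, b):
    """Distance scaled to [0, 1]; two empty sequences are treated as equal."""
    maximum = hamming_maximum(a, b)
    return hamming_distance(a, b) / maximum if maximum else 0.0


def normalized_similarity(a, b):
    """Complement of the normalized distance."""
    return 1.0 - normalized_distance(a, b)


print(hamming_distance("test", "text"))       # 1
print(hamming_similarity("test", "text"))     # 3
print(normalized_distance("test", "text"))    # 0.25
print(normalized_similarity("test", "text"))  # 0.75
```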

Most common init arguments:

qval -- q-value used to split sequences into q-grams. Possible values:
- 1 (default) -- compare sequences by characters.
- 2 or more -- transform sequences into q-grams.
- None -- split sequences by words.
as_set -- for token-based algorithms:
- True -- t and ttt are equal.
- False (default) -- t and ttt are different.
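To make the effect of these two arguments concrete, here is a minimal stand-alone sketch. The `qgrams` and `jaccard` helpers are hypothetical names written for illustration; they follow the conventions described above but are not taken from the library.

```python
from collections import Counter


def qgrams(sequence, qval=1):
    """Split a sequence into tokens following the qval convention."""
    if qval is None:
        return sequence.split()  # words
    if qval == 1:
        return list(sequence)    # single characters
    # Overlapping windows of length qval
    return [sequence[i:i + qval] for i in range(len(sequence) - qval + 1)]


def jaccard(a, b, qval=1, as_set=False):
    """Jaccard index over token multisets, or over plain sets when as_set=True."""
    tokens_a, tokens_b = Counter(qgrams(a, qval)), Counter(qgrams(b, qval))
    if as_set:
        # Drop repeat counts so that "t" and "ttt" look identical
        tokens_a, tokens_b = Counter(set(tokens_a)), Counter(set(tokens_b))
    intersection = sum((tokens_a & tokens_b).values())
    union = sum((tokens_a | tokens_b).values())
    return intersection / union if union else 1.0


print(qgrams("test", qval=2))            # ['te', 'es', 'st']
print(qgrams("hello world", qval=None))  # ['hello', 'world']
print(jaccard("t", "ttt", as_set=True))  # 1.0
print(jaccard("t", "ttt"))               # one shared token out of three
```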

Examples

For example, Hamming distance:

import textdistance

textdistance.hamming('test', 'text')
# 1

textdistance.hamming.distance('test', 'text')
# 1

textdistance.hamming.similarity('test', 'text')
# 3

textdistance.hamming.normalized_distance('test', 'text')
# 0.25

textdistance.hamming.normalized_similarity('test', 'text')
# 0.75

textdistance.Hamming(qval=2).distance('test', 'text')
# 2

All other algorithms have the same interface.

Articles

A few articles with examples of how to use textdistance in the real world:

Extra libraries

For the main algorithms, textdistance tries to call known external libraries (fastest first) when they are available (installed on your system) and applicable (the external implementation can compare the given type of sequences). Install textdistance with extras to enable this feature.

You can disable this by passing the external=False argument on init:

import textdistance
hamming = textdistance.Hamming(external=False)
hamming('text', 'testit')
# 3
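The general "fastest external library first, pure Python as fallback" pattern might look like the sketch below. This is an illustration under stated assumptions, not the library's actual dispatch code: the `CANDIDATES` list and both function names are hypothetical, and the module and function names in it are the ones listed for Levenshtein in the Benchmarks section.

```python
import importlib


def pure_levenshtein(a, b):
    """Pure-Python Levenshtein distance (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical priority list of (module, function) pairs, fastest first.
CANDIDATES = [("Levenshtein", "distance"), ("jellyfish", "levenshtein_distance")]


def levenshtein(a, b, external=True):
    """Use an installed external library for strings, else fall back to pure Python."""
    if external and isinstance(a, str) and isinstance(b, str):
        for module_name, func_name in CANDIDATES:
            try:
                func = getattr(importlib.import_module(module_name), func_name)
            except (ImportError, AttributeError):
                continue  # not installed or unexpected API: try the next one
            return func(a, b)
    return pure_levenshtein(a, b)


print(levenshtein("kitten", "sitting"))                  # 3
print(levenshtein("kitten", "sitting", external=False))  # 3
print(levenshtein([1, 2, 3], [1, 3]))                    # 1, lists always use the fallback
```

Restricting the external path to strings reflects the "possible" condition above: the fallback is what keeps non-string sequences working.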

Supported libraries:

Algorithms:

DamerauLevenshtein
Hamming
Jaro
JaroWinkler
Levenshtein

Benchmarks

Without extras installed:

algorithm	library	function	time
DamerauLevenshtein	jellyfish	damerau_levenshtein_distance	0.00965294
DamerauLevenshtein	pyxdameraulevenshtein	damerau_levenshtein_distance	0.151378
DamerauLevenshtein	pylev	damerau_levenshtein	0.766461
DamerauLevenshtein	textdistance	DamerauLevenshtein	4.13463
DamerauLevenshtein	abydos	damerau_levenshtein	4.3831
Hamming	Levenshtein	hamming	0.0014428
Hamming	jellyfish	hamming_distance	0.00240262
Hamming	distance	hamming	0.036253
Hamming	abydos	hamming	0.0383933
Hamming	textdistance	Hamming	0.176781
Jaro	Levenshtein	jaro	0.00313561
Jaro	jellyfish	jaro_distance	0.0051885
Jaro	py_stringmatching	jaro	0.180628
Jaro	textdistance	Jaro	0.278917
JaroWinkler	Levenshtein	jaro_winkler	0.00319735
JaroWinkler	jellyfish	jaro_winkler	0.00540443
JaroWinkler	textdistance	JaroWinkler	0.289626
Levenshtein	Levenshtein	distance	0.00414404
Levenshtein	jellyfish	levenshtein_distance	0.00601647
Levenshtein	py_stringmatching	levenshtein	0.252901
Levenshtein	pylev	levenshtein	0.569182
Levenshtein	distance	levenshtein	1.15726
Levenshtein	abydos	levenshtein	3.68451
Levenshtein	textdistance	Levenshtein	8.63674

Total: 24 libs.

Yeah, so slow. Use TextDistance in production only with extras.

TextDistance uses the benchmark results to optimize algorithm selection and tries to call the fastest external library first (if possible).

You can run the benchmark manually on your system:

pip install "textdistance[benchmark]"
python3 -m textdistance.benchmark

TextDistance shows a benchmark results table for your system and saves library priorities into the libraries.json file in TextDistance's folder. textdistance then uses this file to call the fastest algorithm implementation. A default [libraries.json](textdistance/libraries.json) is already included in the package.
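The idea of ranking implementations by measured speed and storing the order as JSON can be sketched with the standard library alone. Everything here is a hypothetical illustration: the two toy implementations, the `rank_by_speed` helper and the JSON layout are assumptions, and the real libraries.json format may differ.

```python
import json
import timeit


def naive_hamming(a, b):
    """Toy implementation based on zip."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def indexed_hamming(a, b):
    """Toy implementation based on indexing."""
    shared = min(len(a), len(b))
    return sum(a[i] != b[i] for i in range(shared)) + abs(len(a) - len(b))


def rank_by_speed(candidates, args, number=1000):
    """Return candidate names sorted from fastest to slowest on the given input."""
    timings = {
        name: timeit.timeit(lambda func=func: func(*args), number=number)
        for name, func in candidates.items()
    }
    return sorted(timings, key=timings.get)


# Assumed layout: algorithm name mapped to implementation names, fastest first.
priorities = {
    "Hamming": rank_by_speed(
        {"naive": naive_hamming, "indexed": indexed_hamming},
        args=("text", "testit"),
    )
}
print(json.dumps(priorities, indent=2))
```

The resulting order depends on the machine, which is why re-running the benchmark locally can change which implementation is tried first.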

Running tests

All you need is task. See [Taskfile.yml](./Taskfile.yml) for the list of available commands. For example, to run tests including third-party libraries usage, execute task pytest-external:run.

Contributing

PRs are welcome!

Found a bug? Fix it!
Want to add more algorithms? Sure! Just make it with the same interface as other algorithms in the lib and add some tests.
Can you make something faster? Great! Just avoid external dependencies and remember that everything should work not only with strings.
Something else that you think is good? Do it! Just make sure that CI passes and everything from the README is still applicable (interface, features, and so on).
Have no time to code? Tell your friends and subscribers about textdistance. More users, more contributions, more amazing features.

Thank you ❤️
