Fuzzy String Matching: the Python fuzzywuzzy Library

fuzzywuzzy performs fuzzy string matching, using Levenshtein distance to measure the difference between sequences.

Installing fuzzywuzzy

Method 1:

pip install fuzzywuzzy

Method 2, installing in Anaconda:
1. Open the Anaconda command window: Start -> All Programs -> Anaconda -> Anaconda Prompt
2. In Anaconda Prompt, run pip install followed by the path and file name of the .whl file

Levenshtein Distance

The similarity measure fuzzywuzzy uses for fuzzy matching is the Levenshtein distance.

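Introduction to Levenshtein distance
Levenshtein distance is a kind of edit distance that expresses how different two strings are. The edit distance is the minimum number of steps needed to turn string A into string B, where each step deletes one character, changes one character, or inserts one character.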

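For example, turning acat into gate takes the following edits:

Delete the leading a (acat -> cat)
Change c to g (cat -> gat)
Append e (gat -> gate)
So the Levenshtein distance between acat and gate is 3.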
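
The three edit operations above map directly onto the classic dynamic-programming recurrence. The following is a minimal sketch of that recurrence, not the code fuzzywuzzy itself runs; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    # Before any character of a is consumed, reaching b[:j] takes j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # keep (cost 0) or change (cost 1)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("acat", "gate"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of b rather than with the product of both lengths.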

Using fuzzywuzzy

Usage

>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process

Simple Ratio

>>> fuzz.ratio("this is a test", "this is a test!")
    97
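
To see where a score like 97 can come from, here is a rough sketch that uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer (fuzzywuzzy falls back to it when python-Levenshtein is not installed). The name `simple_ratio` is hypothetical, and the library's real implementation may differ in detail.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity on a 0-100 scale: 2 * matched characters / total characters."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# 14 characters match out of 14 + 15 in total: 2 * 14 / 29 is about 0.966
print(simple_ratio("this is a test", "this is a test!"))  # 97
```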

Partial Ratio

>>> fuzz.partial_ratio("this is a test", "this is a test!")
    100
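
The partial ratio scores the shorter string against the best-matching piece of the longer one, which is why the trailing "!" no longer costs anything. Below is a simplified sketch of that idea that tries every window of the longer string by brute force. The name `partial_ratio_sketch` is hypothetical, `difflib.SequenceMatcher` is a stand-in scorer, and the library picks its candidate windows more cleverly, so scores can differ on other inputs.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best 0-100 score of the shorter string against any equal-length window of the longer one."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0  # assumption: treat an empty input as no match
    best = 0.0
    # Slide a window the size of the shorter string across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```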

Token Sort Ratio

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100
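
The token sort ratio ignores word order: both strings are split into tokens, the tokens are sorted, and the rejoined strings are compared. A minimal sketch of that idea follows. The name `token_sort_ratio_sketch` is hypothetical, `difflib.SequenceMatcher` is a stand-in scorer, and the real library also strips punctuation during preprocessing, which this sketch skips.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Compare two strings after lowercasing and sorting their whitespace-separated tokens."""
    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))

    matcher = SequenceMatcher(None, normalize(s1), normalize(s2))
    return int(round(100 * matcher.ratio()))


# Both strings normalize to "a bear fuzzy was wuzzy", so they compare as identical.
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```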

Token Set Ratio

>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    100
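
The token set ratio goes one step further and ignores repeated words as well: each string is reduced to a set of tokens, and the shared tokens are compared against the shared tokens plus each string's leftovers. The sketch below follows that idea under simplifying assumptions. The names `_ratio` and `token_set_ratio_sketch` are hypothetical, `difflib.SequenceMatcher` is a stand-in scorer, and the library's preprocessing is omitted.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Compare token sets: the shared tokens versus shared tokens plus each side's leftovers."""
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    # The best of the three pairwise comparisons is the score.
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


# The duplicate "fuzzy" disappears once the tokens become a set.
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```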

Process

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
    [('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
    ('Dallas Cowboys', 90)
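
Conceptually, process.extract scores the query against every choice and returns the best few. The following sketch shows that ranking step only. The names `score` and `extract_sketch` are hypothetical and the scorer is a plain case-insensitive `difflib.SequenceMatcher` ratio, whereas the library's default scorer is more elaborate, so the scores below the top match will not necessarily agree with the output shown above.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    """Case-insensitive 0-100 similarity (a stand-in for the library's scorer)."""
    return int(round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()))


def extract_sketch(query: str, choices: list, limit: int = 5) -> list:
    """Return up to `limit` (choice, score) pairs, best match first."""
    scored = [(choice, score(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top = extract_sketch("new york jets", choices, limit=2)
print(top[0])  # ('New York Jets', 100)
```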