Fuzzy String Matching: the Python fuzzywuzzy Library

fuzzywuzzy performs fuzzy string matching, using the Levenshtein distance to measure the difference between sequences.

Installing fuzzywuzzy

Method 1:

pip install fuzzywuzzy

Method 2, installing in Anaconda:
1. Open the Anaconda command window: Start -> All Programs -> Anaconda -> Anaconda Prompt
2. In Anaconda Prompt, run pip install followed by the path and file name of the whl file

Levenshtein distance

The similarity measure fuzzywuzzy uses for fuzzy matching is the Levenshtein distance.

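  • Introduction to the Levenshtein distance
    The Levenshtein distance is an edit distance that measures how different two strings are. The edit distance is the minimum number of steps needed to turn string A into string B, where each step deletes one character, substitutes one character, or inserts one character.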

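For example, turning acat into gate takes the following edits:

Delete the leading a (acat -> cat)
Change c to g (cat -> gat)
Append e (gat -> gate)
So the Levenshtein distance between acat and gate is 3.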
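
The definition above can be checked with a small dynamic-programming sketch. This is a textbook implementation written for illustration, not the code fuzzywuzzy runs internally; the function name levenshtein is chosen here.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("acat", "gate"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of b rather than with the product of both lengths.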

Using fuzzywuzzy

  • Usage
>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process
  • Simple Ratio
>>> fuzz.ratio("this is a test", "this is a test!")
97
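
To see where the 97 comes from: when the optional python-Levenshtein package is not installed, fuzzywuzzy falls back on difflib.SequenceMatcher from the standard library. The sketch below assumes that fallback. ratio_sketch is a hypothetical name, not part of fuzzywuzzy, and the library's handling of edge cases may differ.

```python
from difflib import SequenceMatcher


def ratio_sketch(s1: str, s2: str) -> int:
    """Similarity as 2 * matched_chars / total_chars, scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# 14 matching characters out of 14 + 15 in total: 2 * 14 / 29 = 0.9655...
print(ratio_sketch("this is a test", "this is a test!"))  # 97
```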
  • Partial Ratio
>>> fuzz.partial_ratio("this is a test", "this is a test!")
100
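
The idea behind partial matching is to score the shorter string against the best-matching substring of the longer one. A rough sliding-window sketch of that idea follows. It is an approximation for illustration only: fuzzywuzzy chooses its candidate substrings differently, and partial_ratio_sketch is a hypothetical name.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best 0-100 score of the shorter string against any equally long window of the longer one."""
    shorter, longer = sorted((s1, s2), key=len)
    best = 0.0
    # Slide a window of len(shorter) characters across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# The first window of the longer string equals the shorter string exactly.
print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```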
  • Token Sort Ratio
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
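
Token sort makes word order irrelevant: both strings are split into tokens, the tokens are sorted, and the rejoined strings are compared. A minimal sketch under simplifying assumptions (lowercasing and whitespace splitting only, whereas the library also strips punctuation); token_sort_ratio_sketch is a hypothetical name.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Compare two strings after sorting their whitespace-separated tokens."""
    def normalize(s: str) -> str:
        # Lowercase, split on whitespace, sort the tokens, and rejoin them.
        return " ".join(sorted(s.lower().split()))

    matcher = SequenceMatcher(None, normalize(s1), normalize(s2))
    return int(round(100 * matcher.ratio()))


# Both inputs normalize to "a bear fuzzy was wuzzy".
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```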
  • Token Set Ratio
>>> fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
84
>>> fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
100
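
Token set goes one step further and ignores repeated words: the shared tokens are compared against each string's full token set, and the best of the pairwise scores is kept. The sketch below follows that idea with plain whitespace tokenization, which is a simplification of the library's preprocessing; token_set_ratio_sketch is a hypothetical name.

```python
from difflib import SequenceMatcher


def _score(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Compare the shared tokens with each string's tokens and keep the best score."""
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()  # shared tokens plus those only in s1
    combined2 = (common + " " + only2).strip()  # shared tokens plus those only in s2
    return max(
        _score(common, combined1),
        _score(common, combined2),
        _score(combined1, combined2),
    )


# The duplicate "fuzzy" vanishes once the tokens are put in a set.
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```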
  • Process
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
('Dallas Cowboys', 90)
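
Conceptually, process scores every choice against the query and returns the highest-scoring ones. A sketch of that ranking step using only the standard library is shown below. The names simple_scorer, extract_sketch and extract_one_sketch are hypothetical, and the scorer is a plain case-insensitive ratio, so the scores will not match those of the library's default scorer.

```python
from difflib import SequenceMatcher


def simple_scorer(query: str, choice: str) -> int:
    """Case-insensitive similarity on a 0-100 scale (stand-in for a fuzzywuzzy scorer)."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def extract_sketch(query, choices, scorer=simple_scorer, limit=5):
    """Return up to `limit` (choice, score) pairs, best score first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


def extract_one_sketch(query, choices, scorer=simple_scorer):
    """Return the single best (choice, score) pair, or None if there are no choices."""
    best = extract_sketch(query, choices, scorer, limit=1)
    return best[0] if best else None


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(extract_sketch("new york jets", choices, limit=2))
print(extract_one_sketch("cowboys", choices))
```

Passing the scorer as a parameter mirrors how the library lets you swap scoring functions; any callable that takes two strings and returns a number works here.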