fuzzywuzzy performs fuzzy string matching, using Levenshtein distance to calculate the differences between sequences.
Installing fuzzywuzzy
Method 1:
    pip install fuzzywuzzy
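Method 2, installing under Anaconda:
1. Open the Anaconda command window: Start -> All Programs -> Anaconda -> Anaconda Prompt
2. In Anaconda Prompt, type pip install followed by the path to the .whl file, including its file name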
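Levenshtein distance
The distance fuzzywuzzy uses to measure similarity during fuzzy matching is the Levenshtein distance.
- Introduction to Levenshtein distance
Levenshtein distance is an edit distance that expresses how different two strings are. The edit distance is the minimum number of steps needed to turn string A into string B, where each step deletes one character, substitutes one character, or inserts one character.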
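For example, turning acat into gate takes the following edits:
Delete the leading a
Change c to g
Append e
No shorter sequence of edits works, so the Levenshtein distance between acat and gate is 3.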
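The definition above can be checked with a small dynamic-programming sketch. The function name `levenshtein` is a hypothetical helper written for illustration; it is not part of fuzzywuzzy, and the library's own implementation may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character deletions, insertions,
    and substitutions needed to turn a into b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("acat", "gate"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the second string rather than with the product of both lengths.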
Using fuzzywuzzy
- Usage
    from fuzzywuzzy import fuzz
- Simple Ratio
    fuzz.ratio("this is a test", "this is a test!")
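To give a feel for what a 0 to 100 similarity score means, here is a minimal standard-library sketch. It assumes the score is a SequenceMatcher-style ratio 2*M/T (M matched characters, T total characters in both strings), which is how fuzzywuzzy behaves when it falls back to Python's difflib. The name `simple_ratio` is hypothetical and this is an approximation, not the library's code.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    # ratio() returns 2*M/T in [0, 1]; scale to 0..100 and round.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


# 14 characters match out of 14 + 15 in total: 2 * 14 / 29 is about 0.966
print(simple_ratio("this is a test", "this is a test!"))  # 97
```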
- Partial Ratio
    fuzz.partial_ratio("this is a test", "this is a test!")
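A partial ratio scores the shorter string against the best-matching substring of the longer one. The sketch below is a simplified sliding-window version built on difflib; `partial_ratio_sketch` is a hypothetical name, and the library picks candidate substrings differently, so scores may not always agree with fuzz.partial_ratio.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    best = 0.0
    # Compare the shorter string with every same-length window of the longer one.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# The shorter string appears verbatim inside the longer one.
print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```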
- Token Sort Ratio
    fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
- Token Set Ratio
    fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
    fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
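A token set ratio goes one step further and ignores repeated words: the tokens shared by both strings are compared against the shared tokens plus each string's leftover tokens, and the best score wins. The sketch below follows that outline with difflib; the helper names are hypothetical and the details are an assumption, not the library's exact code.

```python
from difflib import SequenceMatcher


def _ratio(s1: str, s2: str) -> int:
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    # Take the best of the three pairwise comparisons.
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


# The duplicate "fuzzy" disappears once the tokens are treated as a set.
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```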
- Process
    choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
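A list of choices like this is typically ranked against a query string; fuzzywuzzy provides process.extract and process.extractOne for that. To show the underlying idea without depending on the library, the sketch below scores every choice with a plain difflib ratio and keeps the top results. The names `score` and `rank_choices` are hypothetical, and the plain ratio is an assumption: the library's default scorer is more elaborate, so its scores will differ.

```python
from difflib import SequenceMatcher


def score(query: str, choice: str) -> int:
    # Case-insensitive SequenceMatcher ratio scaled to 0..100.
    matcher = SequenceMatcher(None, query.lower(), choice.lower())
    return int(round(100 * matcher.ratio()))


def rank_choices(query: str, choices: list, limit: int = 2) -> list:
    scored = [(choice, score(query, choice)) for choice in choices]
    # Highest score first; keep only the top `limit` entries.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
best = rank_choices("new york jets", choices, limit=2)
print(best[0])  # ('New York Jets', 100)
```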