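This code comes from the reverse maximum matching implementation in the book 《python自然语言处理实战 核心技术与算法》:
Suppose you already have source code for forward maximum matching. You can then reverse the document, reverse every dictionary entry to get a reversed dictionary, and run forward maximum matching on the reversed document; the result is a reverse maximum matching segmentation. The same trick works the other way round: if you already have reverse maximum matching, reverse the document, turn the forward dictionary into a reversed dictionary, and feed both into the reverse maximum matching algorithm to get a forward segmentation.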
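The first case above (starting from a forward matcher) can be sketched as follows. This is only a minimal illustration, not the book's code: the function names `forward_max_match` and `reverse_max_match` are made up for this example, and unlike the class below, unknown characters are kept as single-character tokens.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right longest match; unknown characters become one-character tokens."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink it until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, dictionary):
    """Reverse matching built from the forward matcher by reversing text and dictionary."""
    reversed_dict = {w[::-1] for w in dictionary}
    reversed_tokens = forward_max_match(text[::-1], reversed_dict)
    # Tokens come out back to front and each one is reversed, so undo both.
    return [t[::-1] for t in reversed_tokens[::-1]]


words = {"南京市", "南京市长", "长江大桥", "人民解放军", "大桥", "江大桥"}
print(forward_max_match("南京市长江大桥", words))  # ['南京市长', '江大桥']
print(reverse_max_match("南京市长江大桥", words))  # ['南京市', '长江大桥']
```

The class below goes in the opposite direction: its scanning loop is the reverse matcher, and forward matching is obtained by reversing the text and the dictionary.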
class IMM(object):
    """Dictionary-based maximum matching tokenizer (reverse matching by default)."""

    def __init__(self, dic_path, reversed_match=True):
        self.dictionary = set()
        self.maximum = 0  # length of the longest dictionary entry
        self.reversed_match = reversed_match
        with open(dic_path, "r", encoding="utf-8-sig") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                if self.reversed_match:  # reverse maximum matching: store entries as-is
                    self.dictionary.add(line)
                else:  # forward maximum matching: store reversed entries
                    self.dictionary.add(line[::-1])
                if len(line) > self.maximum:
                    self.maximum = len(line)

    def cut(self, text):
        # Forward matching is done by scanning the reversed text from its end.
        if not self.reversed_match:
            text = text[::-1]
        index = len(text)
        result = []  # tokens, in the order they are found
        while index > 0:
            word = None
            for size in range(self.maximum, 0, -1):
                if index < size:
                    continue
                piece = text[(index - size):index]
                if piece in self.dictionary:
                    word = piece
                    if self.reversed_match:
                        result.append(word)
                    else:
                        result.append(word[::-1])  # restore the original character order
                    index -= size
                    break
            if not word:
                # No dictionary entry ends here: skip one character (it is dropped from the output).
                index -= 1
        if self.reversed_match:
            return result[::-1]  # tokens were collected back to front
        else:
            return result
import os

path = r"E:\学习相关资料\python自然语言处理实战核心技术与算法--代码\第三章"
doc = "imm_dic.txt"
text = "南京市长江大桥"
doc_in_path = os.path.join(path, doc)

tokenizer = IMM(doc_in_path)  # reverse maximum matching
print(tokenizer.cut(text))  # ['南京市', '长江大桥']
tokenizer = IMM(doc_in_path, reversed_match=False)  # forward maximum matching
print(tokenizer.cut(text))  # ['南京市长', '江大桥']
The contents of imm_dic.txt are:
南京市
南京市长
长江大桥
人民解放军
大桥
江大桥
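To see how reverse maximum matching works through 南京市长江大桥 with this dictionary, it helps to list the windows it tries. The helper below is a hypothetical tracing function written for this post (`trace_reverse_match` is not part of the book's code); it mirrors the scanning loop of `cut` but records each candidate instead of building tokens.

```python
def trace_reverse_match(text, dictionary):
    """Return the (window, matched) pairs tried by reverse maximum matching, in order."""
    max_len = max(len(w) for w in dictionary)
    steps, index = [], len(text)
    while index > 0:
        # Start with the longest window that ends at `index`, then shrink it.
        for size in range(min(max_len, index), 0, -1):
            piece = text[index - size:index]
            hit = piece in dictionary
            steps.append((piece, hit))
            if hit or size == 1:
                index -= size
                break
    return steps


words = {"南京市", "南京市长", "长江大桥", "人民解放军", "大桥", "江大桥"}
for piece, hit in trace_reverse_match("南京市长江大桥", words):
    print(piece, "match" if hit else "no match")
# 市长江大桥 no match
# 长江大桥 match
# 南京市 match
```

Because the scan starts at the end of the sentence, 长江大桥 is claimed first and 南京市 is what remains, whereas a forward scan would claim 南京市长 first and leave 江大桥.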
Here, a string is reversed like this:
x = "hello world"
z = x[::-1]
print(z)
This reverses the string character by character.
When opening the file, I use encoding="utf-8-sig" rather than encoding="utf-8", mainly because of what I found when reading the dictionary with each:
dic = []
with open(doc_in_path, "r", encoding="utf-8-sig") as f:
    for line in f:
        line = line.strip()
        if line:
            dic.append(line)
print(dic)
['南京市', '南京市长', '长江大桥', '人民解放军', '大桥', '江大桥']
dic = []
with open(doc_in_path, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            dic.append(line)
print(dic)
['\ufeff南京市', '南京市长', '长江大桥', '人民解放军', '大桥', '江大桥']
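The same difference can be reproduced without my dictionary file. The sketch below is an assumption-laden stand-in: it writes a small temporary file that starts with a UTF-8 byte order mark (BOM), which is what some editors add when saving as UTF-8, and reads it back with both encodings. The helper name `read_lines` and the two-word file are made up for the example.

```python
import os
import tempfile

# Write a tiny dictionary file that begins with a UTF-8 BOM.
# Encoding with "utf-8-sig" prepends the BOM bytes EF BB BF.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("南京市\n长江大桥\n".encode("utf-8-sig"))
    tmp_path = tmp.name


def read_lines(path, encoding):
    """Read non-empty, stripped lines from `path` using the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


with_bom = read_lines(tmp_path, "utf-8")         # BOM survives as "\ufeff"
without_bom = read_lines(tmp_path, "utf-8-sig")  # BOM is stripped on decode
os.remove(tmp_path)

print(with_bom)               # ['\ufeff南京市', '长江大桥']
print(without_bom)            # ['南京市', '长江大桥']
print("南京市" in with_bom)    # False: the first entry no longer matches
```

Note that `strip()` does not remove "\ufeff", since Python does not treat it as whitespace. If changing the encoding is not an option, calling `lstrip("\ufeff")` on the first line is another way to get rid of it.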
"\ufeff"的存在,限制我只能使用“utf-8-sig”