Configuring the HanLP Word Segmenter on a Server and Adding Custom Dictionaries

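1. To start with, the HanLP files are already on my server.

2. Open the data directory and take a look: the dictionary folder holds the dictionaries that ship with HanLP.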

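3. The custom folder is where we can add our own dictionaries. It already contains some organization-name and place-name dictionaries, and you can add more.

Here I added a number of entity dictionaries: actor, app, singer, and so on.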

The added dictionaries use a simple format: one word per line.
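As a rough illustration of the one-word-per-line format, the Python sketch below writes a list of entity names to a dictionary file. The helper name `write_custom_dict` and the file name `app.txt` are made up for this example; only the one-entry-per-line layout comes from the description above.

```python
import os
import tempfile


def write_custom_dict(path, words):
    """Write one word per line (UTF-8), skipping blank entries and duplicates."""
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for w in words:
            w = w.strip()
            if w and w not in seen:
                seen.add(w)
                f.write(w + "\n")
    return len(seen)


# Hypothetical file name; in practice the file would go into data/dictionary/custom
demo_path = os.path.join(tempfile.mkdtemp(), "app.txt")
count = write_custom_dict(demo_path, ["天天动听", "天天动听", " ", "微信"])
print(count)
```

Removing duplicates and blank lines up front is optional, but it keeps the dictionary files easy to diff and review.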

4. Now configure the dictionary paths so that the dictionaries take effect during segmentation.

Open the hanlp.properties configuration file.

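The path in this file must be changed to the directory where the HanLP files are stored.

My files are in my private folder, so I changed the path accordingly. This setting must be changed; otherwise the segmenter will not find the dictionary files when it is called.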
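A quick sanity check can catch a wrong path before the segmenter is ever started. The sketch below assumes the setting in question is the `root` key, as in the stock hanlp.properties, and that dictionaries live under `data/dictionary` below it. The function names `read_root` and `check_root` and the sample path are illustrative only.

```python
import os


def read_root(properties_text):
    """Return the value of the `root` key from hanlp.properties text, or None."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        key, sep, value = line.partition("=")
        if sep and key.strip() == "root":
            return value.strip()
    return None


def check_root(root):
    """True if <root>/data/dictionary exists on this machine."""
    return root is not None and os.path.isdir(os.path.join(root, "data", "dictionary"))


# Sample text standing in for the real file; the path is an assumption
sample = "# install location\nroot=/home/public/hanlp-1.7.2-release/\n"
root = read_root(sample)
print(root, check_root(root))
```

In real use, the text would be read from the actual hanlp.properties file, and a `False` result from `check_root` would point to the path problem described above.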

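5. The file then references many dictionaries. If you do not want one of them, simply comment it out; for example, the bigram dictionary has been commented out here.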

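6. The comments in the file explain clearly where custom dictionaries are added. Entries are separated by ";", and each additional file name must begin with a space (that is, the space goes right after the ";"). The leading space means the file is in the same folder as the preceding dictionary.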
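To make the separator convention concrete, here is a small Python sketch that assembles such a line. It assumes the key is named `CustomDictionaryPath`, as in the stock hanlp.properties, and that an entry starting with a space is resolved in the same folder as the previous entry. The file names `actor.txt`, `app.txt`, and `singer.txt` are hypothetical stand-ins for the entity dictionaries mentioned earlier.

```python
def build_custom_dictionary_path(first, extra_files):
    """Join dictionary entries with ';', prefixing same-folder files with a space.

    first       -- path of the first dictionary, relative to the HanLP root
    extra_files -- file names located in the same folder as `first`
    """
    parts = [first] + [" " + name.strip() for name in extra_files]
    return "CustomDictionaryPath=" + ";".join(parts)


line = build_custom_dictionary_path(
    "data/dictionary/custom/CustomDictionary.txt",
    ["actor.txt", "app.txt", "singer.txt"],  # hypothetical file names
)
print(line)
# CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; actor.txt; app.txt; singer.txt
```

Generating the line rather than typing it by hand mainly helps when the list of dictionaries is long and a missing space would be easy to overlook.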

7. That is basically everything. Now let's test it.
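One thing that may be worth checking before the test: HanLP 1.x caches the compiled custom dictionary as a `.bin` file next to the first custom dictionary (for example `CustomDictionary.txt.bin`), and a stale cache can hide newly added words until it is deleted. The sketch below, with the hypothetical helper `remove_custom_dict_cache`, removes such cache files; it assumes every `.bin` file in the custom folder is a cache that HanLP can regenerate, and it is demonstrated on a temporary directory rather than a real installation.

```python
import glob
import os
import tempfile


def remove_custom_dict_cache(custom_dir):
    """Delete *.bin cache files in custom_dir and return their names, sorted."""
    removed = []
    for path in glob.glob(os.path.join(custom_dir, "*.bin")):
        os.remove(path)
        removed.append(os.path.basename(path))
    return sorted(removed)


# Demo on a throwaway directory standing in for data/dictionary/custom
demo_dir = tempfile.mkdtemp()
for name in ("CustomDictionary.txt", "CustomDictionary.txt.bin", "app.txt"):
    open(os.path.join(demo_dir, name), "w").close()

removed = remove_custom_dict_cache(demo_dir)
remaining = sorted(os.listdir(demo_dir))
print(removed, remaining)
```

The cache is rebuilt the next time the segmenter loads the dictionaries, so the first run after deleting it may be slower.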

LIB specifies where the segmenter is installed:

Example sentence: 打开天天动听

Test before adding the dictionary:

Test after adding the dictionary:

Source code:

# -*- coding: utf-8 -*-
# Usage (Python 3 assumed):
#   python /home/public/word_segment/hanlp_1.py <input file> <output file> pos|no
#   pos: write each word with its part-of-speech tag; no: write words only
import sys
import time

from jpype import JClass, getDefaultJVMPath, startJVM


def word_seg_pos(sent, newseg):
    '''Return the segmented sentence with part-of-speech tags.'''
    # The HanLP class exposes a static segment(); Segment instances expose seg()
    if newseg == HanLP_default:
        words = newseg.segment(sent)
    else:
        words = newseg.seg(sent)
    wplist = []
    for wd in words:
        # str() converts Java strings/objects to Python strings explicitly
        wp = str(wd.word) + '/' + str(wd.nature)
        wplist.append(wp)
    return " ".join(wplist)


def only_word_seg(sent, newseg):
    '''Return the segmented sentence, words only.'''
    if newseg == HanLP_default:
        words = newseg.segment(sent)
    else:
        words = newseg.seg(sent)
    wordlist = []
    for wd in words:
        wordlist.append(str(wd.word))
    return " ".join(wordlist)


if __name__ == '__main__':
    start_time = time.time()

    # Specify the segmenter
    LIB = "/home/public/hanlp-1.7.2-release"  # where the segmenter is installed
    # Classpath: the HanLP jar plus LIB itself (for the configuration file)
    startJVM(getDefaultJVMPath(),
             "-Djava.class.path=%s/hanlp-1.7.2.jar:%s" % (LIB, LIB),
             "-Xms1g", "-Xmx1g")
    HanLP_default = JClass('com.hankcs.hanlp.HanLP')  # .newSegment()
    # HanLP_HMM = JClass('com.hankcs.hanlp.seg.HMM.HMMSegment')()
    # HanLP_CRF = JClass('com.hankcs.hanlp.seg.CRF.CRFSegment')()
    # HanLP_NShort = JClass('com.hankcs.hanlp.seg.NShort.NShortSegment')()
    print('word segment prepared...')

    # Enable place-name and organization-name recognition
    # HanLP = HanLP_default.newSegment()#.enablePlaceRecognize(True)
    # HanLP = HanLP.enableOrganizationRecognize(True)

    # Read the command-line arguments
    to_seg_file = sys.argv[1]
    result_file = sys.argv[2]
    pos = sys.argv[3]

    # Smoke test on the example sentence
    test_sent = '打开天天动听'
    if pos == 'pos':  # with part-of-speech tags
        print('example:', word_seg_pos(test_sent, HanLP_default))
    else:  # words only
        print('example:', only_word_seg(test_sent, HanLP_default))
    print('test successfully')

    # Segment the input file line by line
    with open(to_seg_file, encoding='utf-8') as fin, \
            open(result_file, 'w', encoding='utf-8') as f:
        for line in fin:
            line = line.strip()  # drop the newline so it is not segmented as a token
            try:
                if pos == 'pos':
                    line_result = word_seg_pos(line, HanLP_default)
                else:
                    line_result = only_word_seg(line, HanLP_default)
                line_result = line_result.strip()
                if line_result:
                    f.write(line_result + '\n')
            except Exception:
                print('raise an error:', line)
                continue

    end_time = time.time()
    print('running time:', end_time - start_time)
    print('word segment done...')
