Custom tagging with nltk

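Question:

I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and the noun phrases those verbs should apply to. I'm working with NLTK but not getting the results I'd hoped for, e.g.: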

>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'")) 
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')] 
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'")) 
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')] 
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'")) 
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]

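In each case the tagger has failed to recognize that the first word (select, move and copy) was intended as a verb. I know I can create custom taggers and grammars to work around this, but I'm hesitant to reinvent the wheel when a lot of this is out of my league. I would particularly prefer a solution that can also handle non-English languages.

So, my questions are: Is there a better tagger for this type of grammar? Is there a way to weight an existing tagger towards the verb form rather than the noun form? Is there a way to train a tagger? Is there a better approach altogether?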

Answer

One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:

>>> import nltk.tag, nltk.data 
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER) 
>>> model = {'select': 'VB'} 
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)

Then you get

>>> tagger.tag(['select', 'the', 'files']) 
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]

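The same approach can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.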
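To try the manual-model idea without downloading any pre-trained tagger, here is a minimal sketch. It assumes a stand-in backoff, `DefaultTagger('NN')`, in place of the real NLTK tagger; the `COMMAND_VERBS` list and the `tag_command` helper are hypothetical names introduced only for illustration.

```python
import nltk

# Hypothetical command vocabulary: these words should always be tagged as verbs.
COMMAND_VERBS = ['select', 'move', 'copy']
model = {word: 'VB' for word in COMMAND_VERBS}

# Stand-in backoff (an assumption for this sketch): tags every other word 'NN'.
# In real use this would be the pre-trained tagger loaded in the answer above.
fallback = nltk.tag.DefaultTagger('NN')

# Words found in the model get the forced tag; everything else goes to the backoff.
tagger = nltk.tag.UnigramTagger(model=model, backoff=fallback)

def tag_command(text):
    # str.split() keeps the sketch free of tokenizer data downloads.
    return tagger.tag(text.split())

print(tag_command('copy the files to harddrive'))
```

With this stand-in, only the listed verbs come back as 'VB' and every other word is 'NN'; swapping in a real default tagger as the backoff should restore sensible tags for the remaining words.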

Answer

Jacob's answer is spot on. However, to expand on it, you may find you need more than just unigrams.

For example, consider these three sentences:

select the files 
use the select function on the sockets 
the select was good

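Here, the word "select" is used as a verb, an adjective and a noun respectively. A unigram tagger cannot model this. Even a bigram tagger can't handle it, because two of the cases share the same preceding word (i.e. "the"). You need a trigram tagger to handle this case correctly.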

import nltk.tag, nltk.data 
from nltk import word_tokenize 
default_tagger = nltk.data.load(nltk.tag._POS_TAGGER) 

def evaluate(tagger, sentences):
    # Tag each sentence and count how many pass their check function.
    good, total = 0, 0
    for sentence, func in sentences:
        tags = tagger.tag(nltk.word_tokenize(sentence))
        print(tags)
        good += func(tags)
        total += 1
    print('Accuracy:', good / total)

sentences = [ 
    ('select the files', lambda tags: ('select', 'VB') in tags), 
    ('use the select function on the sockets', lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags), 
    ('the select was good', lambda tags: ('select', 'NN') in tags), 
] 

train_sents = [ 
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')], 
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'), ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')], 
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')], 
] 

tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger) 
evaluate(tagger, sentences) 
#model = tagger._context_to_tag

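Note that you can use NLTK's NgramTagger to train a tagger with an arbitrarily high n-gram order, but typically you don't get much of a performance increase beyond trigrams.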
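The difference between orders can be checked directly by passing `n` to `NgramTagger`. The sketch below reuses the three training sentences from this answer and, as a simplifying assumption, uses no backoff tagger at all, so every tag comes only from the learned contexts (unseen words get `None`). The `tag_of` helper is a hypothetical name for this illustration.

```python
from nltk.tag import NgramTagger

train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'),
     ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]

def tag_of(tagger, sentence, word):
    """Return the tag given to the first occurrence of `word` in `sentence`."""
    return next(tag for tok, tag in tagger.tag(sentence.split()) if tok == word)

# n=2 sees only the previous tag; n=3 sees the previous two tags.
bigram = NgramTagger(2, train_sents)
trigram = NgramTagger(3, train_sents)

adjective_use = 'use the select function on the sockets'
noun_use = 'the select was good'

for name, tagger in [('bigram', bigram), ('trigram', trigram)]:
    print(name,
          tag_of(tagger, adjective_use, 'select'),
          tag_of(tagger, noun_use, 'select'))
```

For the bigram tagger both uses of "select" follow a 'DT' tag, so they fall into the same context and necessarily receive the same tag. The trigram tagger sees ('VB', 'DT') in one sentence and only ('DT',) in the other, so it can tell them apart.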

Is it possible to use both a model (as in Jacob's answer) and training sentences (as in this answer)? – Sadik 2015-10-17 13:11:02
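One way this might be done is by chaining the taggers through `backoff`: a trigram tagger trained on sentences, backing off to a manual unigram model, backing off to a default. This is a sketch rather than a confirmed recommendation from either answer; `DefaultTagger('NN')` is an assumed stand-in for the pre-trained NLTK tagger, and the extra model entries ('move', 'copy') are illustrative.

```python
from nltk.tag import DefaultTagger, TrigramTagger, UnigramTagger

# Manual model (as in Jacob's answer), extended with illustrative entries.
model = {'select': 'VB', 'move': 'VB', 'copy': 'VB'}

# Training sentences (as in this answer).
train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'),
     ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]

# Assumed stand-in for the pre-trained tagger: everything unknown becomes 'NN'.
fallback = DefaultTagger('NN')

# Chain: trained trigram contexts -> manual unigram model -> fallback.
manual = UnigramTagger(model=model, backoff=fallback)
combined = TrigramTagger(train_sents, backoff=manual)

print(combined.tag('copy the files'.split()))
print(combined.tag('use the select function on the sockets'.split()))
print(combined.tag('the select was good'.split()))
```

The trained contexts take precedence where they exist (so "select" can still be tagged as an adjective or noun), while words such as "copy" that never appear in the training sentences fall through to the manual model.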

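Answer

See Jacob's answer.

In later versions (at least nltk 3.2) nltk.tag._POS_TAGGER does not exist. The default taggers are usually downloaded into the nltk_data/taggers/ directory, e.g.: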

>>> import nltk 
>>> nltk.download('maxent_treebank_pos_tagger')

Usage is as follows.

>>> import nltk.tag, nltk.data 
>>> tagger_path = '/path/to/nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle' 
>>> default_tagger = nltk.data.load(tagger_path) 
>>> model = {'select': 'VB'} 
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)

See also: How to do POS tagging using the NLTK POS tagger in Python.

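Answer

Bud's answer is correct. Also, according to this link,

if your nltk_data packages were installed correctly, NLTK knows where they are on your system, and you don't need to pass an absolute path.

Meaning, you can just say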

tagger_path = 'taggers/maxent_treebank_pos_tagger/english.pickle'
default_tagger = nltk.data.load(tagger_path)
