Counting word and phrase frequencies with Python NLTK

Question:

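I am using NLTK and trying to count the word phrases of a given length in a particular document, along with the frequency of each phrase. I tokenize the string to get a list of data.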

from nltk.util import ngrams 
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.collocations import * 


data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a", "test", "this", "is", "this", "is", "real", "not", "a", "test"] 

bigrams = ngrams(data, 2) 

bigrams_c = {} 
for b in bigrams:
    if b not in bigrams_c:
        bigrams_c[b] = 1
    else:
        bigrams_c[b] += 1

The code above produces output like this:

(('is', 'this'), 1) 
(('test', 'this'), 2) 
(('a', 'test'), 3) 
(('this', 'is'), 4) 
(('is', 'not'), 1) 
(('real', 'not'), 2) 
(('is', 'real'), 2) 
(('not', 'a'), 3)

This part is what I expected.

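My question is: is there a more convenient way to do this for phrases up to length 4 or 5, without duplicating this code just to change the counting variable?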

Answer

Since you tagged this nltk, here is how to do it with nltk's own tools, which have more features than the ones in the standard Python collections.

from nltk import ngrams, FreqDist 
all_counts = dict() 
for size in 2, 3, 4, 5: 
    all_counts[size] = FreqDist(ngrams(data, size))

Each element of the dictionary all_counts is a dictionary of ngram frequencies. For example, you can get the five most common trigrams like this:

all_counts[3].most_common(5)

Holy smoke, this works so much better than what I had written before. Thanks a lot, masterful answer! – user1610950

Answer

Yes. Instead of running this loop, use collections.Counter(bigrams) or pandas.Series(bigrams).value_counts() to compute the counts in a single line.
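As a rough sketch of how the Counter suggestion could be extended to phrases of length 2 through 5, the snippet below uses only the standard library. The helper name count_ngrams is hypothetical and not part of nltk or collections; it builds n-grams with zip, which should yield the same tuples as nltk's ngrams(data, n) with default arguments.

```python
from collections import Counter

data = ["this", "is", "not", "a", "test", "this", "is", "real", "not", "a",
        "test", "this", "is", "this", "is", "real", "not", "a", "test"]


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # Zipping n staggered slices of the list gives a sliding window of
    # width n; zip stops at the shortest slice, so no padding is added.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# One Counter per phrase length, keyed by that length.
all_counts = {n: count_ngrams(data, n) for n in range(2, 6)}

print(all_counts[2][("this", "is")])   # 4, matching the question's output
print(all_counts[3].most_common(1))    # [(('not', 'a', 'test'), 3)]
```

Because each value is a plain Counter, most_common(k) and lookup by tuple work the same way as in the loop-based version from the question.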
