How to compute bigram estimates without using the nltk library?
Problem description:
I'm very new to Python, and for this project I'm computing bigrams without using any Python packages. I have to use Python 2.7. Here is what I have so far. It reads a file, hello.txt, and produces output like {'Hello', 'How'} 5. Now, for the bigram estimate, I have to divide that 5 by the number of times 'Hello' occurs in the whole text file. I'm stuck, any help please!
f = open("hello.txt", 'r')
dictionary = {}
for line in f:
    for word in line.split():
        items = line.split()
        bigrams = []
        for i in range(len(items) - 1):
            bigrams.append((items[i], items[i+1]))
my_dict = {i: bigrams.count(i) for i in bigrams}
# print(my_dict)
with open('bigram.txt', 'wt') as out:
    out.write(str(my_dict))
f.close()
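The estimate described above (the count of a bigram divided by the count of its first word) can be sketched as follows. The sample sentence is invented for illustration, and the code runs on both Python 2.7 and 3:

```python
# Sketch: estimate P('How' | 'Hello') as count(('Hello', 'How')) / count('Hello').
# The sample text below is made up for illustration.
from collections import Counter

tokens = "Hello How are you Hello How is he Hello there".split()

unigram_counts = Counter(tokens)                  # counts of single words
bigram_counts = Counter(zip(tokens, tokens[1:]))  # counts of adjacent word pairs

# float() keeps the division from truncating under Python 2
estimate = bigram_counts[('Hello', 'How')] / float(unigram_counts['Hello'])
print(estimate)  # 2 of the 3 'Hello' tokens are followed by 'How', so 2/3
```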
Answer
I'll answer your question with a very simple piece of code, just for illustration. Note that bigram estimation is a little more involved than you might think. It needs to be done in a divide-and-conquer fashion, and it can be estimated with different models, the most common of which is the hidden Markov model; I'll explain the idea in the code below. Note also that the larger your data, the better the estimate. I tested the following code on the Brown Corpus.
def bigramEstimation(file):
    '''A very basic solution for the sake of illustration.
    It can be calculated in a more sophisticated way.
    '''
    lst = []       # This will contain the tokens
    unigrams = {}  # for unigrams and their counts
    bigrams = {}   # for bigrams and their counts

    # 1. Read the text file and split it into a list of tokens
    text = open(file, 'r').read()
    lst = text.strip().split()
    print 'Read', len(lst), 'tokens...'
    del text  # No further need for the text variable

    # 2. Generate unigram frequencies
    for l in lst:
        if l not in unigrams:
            unigrams[l] = 1
        else:
            unigrams[l] += 1
    print 'Generated', len(unigrams), 'unigrams...'

    # 3. Generate bigrams with frequencies
    for i in range(len(lst) - 1):
        temp = (lst[i], lst[i+1])  # Tuples are easier to reuse than nested lists
        if temp not in bigrams:
            bigrams[temp] = 1
        else:
            bigrams[temp] += 1
    print 'Generated', len(bigrams), 'bigrams...'

    # Now the Hidden Markov Model:
    # bigramProb = (Count(bigram) / Count(first_word)) + (Count(first_word) / total_words_in_corpus)
    # A few things we need to keep in mind
    total_corpus = sum(unigrams.values())
    # You can add smoothed estimation if you want

    print 'Calculating bigram probabilities and saving to file...'

    # Comment out the following three lines if you do not want a header in the file.
    with open("bigrams.txt", 'a') as out:
        out.write('Bigram' + '\t' + 'Bigram Count' + '\t' + 'Uni Count' + '\t' + 'Bigram Prob')
        out.write('\n')

    for k, v in bigrams.iteritems():
        # first_word = 'hello' in ('hello', 'world')
        first_word = k[0]
        first_word_count = unigrams[first_word]
        # float() is needed so Python 2 does not truncate the division to an integer
        bi_prob = bigrams[k] / float(unigrams[first_word])
        uni_prob = unigrams[first_word] / float(total_corpus)
        final_prob = bi_prob + uni_prob
        with open("bigrams.txt", 'a') as out:
            out.write(k[0] + ' ' + k[1] + '\t' + str(v) + '\t' + str(first_word_count) + '\t' + str(final_prob))  # Delete whatever you don't want written to the file
            out.write('\n')


# Calling the function
bigramEstimation('hello.txt')
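For comparison, the counting in steps 2 and 3 above can be written much more compactly with the standard library. This is a sketch of the same counts, not the original answer code, and the token list is a made-up stand-in for the file contents:

```python
# Compact equivalent of steps 2 and 3 above, using collections.Counter.
# The token list here is invented for illustration.
from collections import Counter

lst = "the cat sat on the mat the cat".split()

unigrams = Counter(lst)               # step 2: unigram frequencies
bigrams = Counter(zip(lst, lst[1:]))  # step 3: bigram frequencies

print(unigrams['the'])          # 3
print(bigrams[('the', 'cat')])  # 2
```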
I hope this helps!
See https://stackoverflow.com/questions/7591258/fast-n-gram-calculation and https://stackoverflow.com/questions/21883108/fast-optimize-n-gram-implementations-in-python and https://stackoverflow.com/questions/40373414/counting-bigrams-real-fast-with-or-without-multiprocessing-python – alvas
I need the bigram estimate... all the other answers just give the bigram counts. I need the probability. Example: Count('Hello How') / Count('Hello'). Do you know how to do that? – Ash
You need an n-gram language model... – alvas
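Following up on that comment, the simplest such model is a maximum-likelihood bigram language model, where P(w2 | w1) = Count((w1, w2)) / Count(w1), which is exactly the ratio Ash asks for. A minimal sketch (the corpus is invented for illustration; unseen bigrams get probability 0.0 here, and a real model would add smoothing):

```python
# Minimal MLE bigram language model: P(w2 | w1) = count((w1, w2)) / count(w1).
# The corpus is made up for illustration; unseen events get probability 0.0.
from collections import Counter

corpus = "hello how are you hello how is it".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Conditional probability of w2 following w1 (0.0 if w1 is unseen)."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / float(unigram_counts[w1])

print(bigram_prob('hello', 'how'))  # 1.0: every 'hello' is followed by 'how'
print(bigram_prob('how', 'are'))    # 0.5: one of the two 'how' tokens precedes 'are'
```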