How to calculate bigram estimates without using the nltk library?

Problem description:

I'm very new to Python, and for this project I'm computing bigrams without using any Python packages. I have to use Python 2.7. This is what I have so far. It takes a file, hello.txt, and then gives an output like {'Hello', 'How'} 5. Now, for the bigram estimation, I have to divide 5 by the count of Hello (how many times 'Hello' appears in the whole text file). I'm stuck; any help please!

f = open("hello.txt", 'r')
dictionary = {}
for line in f:
    for word in line.split():
        items = line.split()
        bigrams = []
        for i in range(len(items) - 1):
            bigrams.append((items[i], items[i+1]))
            my_dict = {i: bigrams.count(i) for i in bigrams}
            # print(my_dict)
            with open('bigram.txt', 'wt') as out:
                out.write(str(my_dict))
f.close()

See https://*.com/questions/7591258/fast-n-gram-calculation, https://*.com/questions/21883108/fast-optimize-n-gram-implementations-in-python, and https://*.com/questions/40373414/counting-bigrams-real-fast-with-or-without-multiprocessing-python – alvas


I need the bigram estimation... all the other answers just give the bigrams themselves. I need the probability. Example: count(Hello How) / count(Hello). Do you know how to do that? – Ash


You need an ngram language model... – alvas

I'm answering your question with a very simple piece of code, just for illustration. Note that bigram estimation is a bit more complicated than you might think. It needs to be done in a divide-and-conquer fashion, and it can be estimated with different models, the most common being the Hidden Markov Model, which I explain in the code below. Note that the larger your data, the better the estimation. I tested the following code on the Brown Corpus.

def bigramEstimation(file): 
    '''A very basic solution for the sake of illustration. 
    It can be calculated in a more sophisticated way. 
    ''' 

    unigrams = {} # for unigrams and their counts 
    bigrams = {} # for bigrams and their counts 

    # 1. Read the text file and split it into a list of tokens 
    text = open(file, 'r').read() 
    lst = text.strip().split() 
    print 'Read ', len(lst), ' tokens...' 

    del text # No further need for the text var 

    # 2. Generate unigram frequencies 
    for l in lst: 
        if l not in unigrams: 
            unigrams[l] = 1 
        else: 
            unigrams[l] += 1 

    print 'Generated ', len(unigrams), ' unigrams...' 

    # 3. Generate bigrams with frequencies 
    for i in range(len(lst) - 1): 
        temp = (lst[i], lst[i+1]) # Tuples are easier to reuse than nested lists 
        if temp not in bigrams: 
            bigrams[temp] = 1 
        else: 
            bigrams[temp] += 1 

    print 'Generated ', len(bigrams), ' bigrams...' 

    # Now Hidden Markov Model 
    # bigramProb = (Count(bigram)/Count(first_word)) + (Count(first_word)/total_words_in_corpus) 
    # A few things we need to keep in mind 
    total_corpus = sum(unigrams.values()) 
    # You can add smoothed estimation if you want 

    print 'Calculating bigram probabilities and saving to file...' 

    with open("bigrams.txt", 'w') as out: 
        # Delete the next two lines if you do not want the header in the file. 
        out.write('Bigram' + '\t' + 'Bigram Count' + '\t' + 'Uni Count' + '\t' + 'Bigram Prob') 
        out.write('\n') 

        for k, v in bigrams.iteritems(): 
            # e.g. first_word = 'hello' in ('hello', 'world') 
            first_word = k[0] 
            first_word_count = unigrams[first_word] 
            # float() avoids Python 2 integer division truncating the ratios to 0 
            bi_prob = float(bigrams[k]) / unigrams[first_word] 
            uni_prob = float(unigrams[first_word]) / total_corpus 

            final_prob = bi_prob + uni_prob 
            out.write(k[0] + ' ' + k[1] + '\t' + str(v) + '\t' + str(first_word_count) + '\t' + str(final_prob)) # Delete whatever you don't want to print into a file 
            out.write('\n') 


# Calling 
bigramEstimation('hello.txt') 

I hope this helps you!


See also http://cs.nyu.edu/courses/spring17/CSCI-UA.0480-009/lecture3-and-half-n-grams.pdf – alvas


Thanks for the response, but I think it's a little bit off. If I have the text "Hello Hello How", then for the bigram P(How | Hello) it should be the count of (Hello How), which is 1, divided by the count of (Hello), which is 2. Probability 1/2. – Ash
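The maximum-likelihood estimate described in this comment, P(second | first) = count(first second) / count(first), can be sketched as follows. This is an illustrative sketch only (`bigram_probability` is a hypothetical name, and it runs under both Python 2.7 and 3 thanks to the explicit `float()`):

```python
from collections import Counter

def bigram_probability(tokens, first, second):
    """MLE of P(second | first) = count((first, second)) / count(first)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # float() so the ratio is not truncated by Python 2 integer division
    return float(bigrams[(first, second)]) / unigrams[first]
```

With the tokens `['Hello', 'Hello', 'How']`, this gives P(How | Hello) = 1/2, matching the example above.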


And what estimate do you get for "hello hello"? – Mohammed