Python - Ignore numbers and symbols in bigram frequencies

Problem description:

I'm trying to find bigram frequencies in the text of a txt file. So far it works, but it also counts numbers and symbols. Here is my code:

import nltk
import prettytable

# Read the whole file and split it into tokens
with open('tweets.txt') as f:
    text = f.read()
tokens = nltk.word_tokenize(text)

# Two-column table: bigram on the left, count on the right
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'

# Count every pair of adjacent tokens
bgs = nltk.bigrams(tokens)
fdist = nltk.FreqDist(bgs)

for row in fdist.most_common(100):
    pt.add_row(row)
print(pt)


Below is the code output: 
+------------------+--------+
| Words            | Counts |
+------------------+--------+
| ('https', ':')   |   1615 |
| ('!', '#')       |    445 |
| ('Thank', 'you') |    386 |
| ('.', '``')      |    358 |
| ('.', 'I')       |    354 |
| ('.', 'Thank')   |    337 |
| ('``', '@')      |    320 |
| ('&', 'amp')     |    290 |

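Is there a way to ignore numbers and symbols (like !,.,?,:)? Since the text consists of tweets, I want to ignore numbers and symbols, except for the #'s and @'s.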

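For bigrams, fdist.most_common() returns a sequence of pairs, each holding a bigram tuple and its integer count. So we need to look inside each bigram tuple and keep only the pairs we want, together with their counts. Try: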

import nltk
from nltk.probability import FreqDist
from nltk.util import ngrams
from pprint import pprint

def filter_most_common_bigrams(mc_bigrams_counts):
    """Keep only (bigram, count) pairs whose bigram passes the filter."""
    filtered_mc_bigrams_counts = []
    for mc_bigram_count in mc_bigrams_counts:
        bigram, count = mc_bigram_count
        # Keep word/word pairs, or a '#'/'@' marker followed by a word
        if all([gram.isalpha() for gram in bigram]) or bigram[0] in "#@" and bigram[1].isalpha():
            filtered_mc_bigrams_counts.append((bigram, count))
    return tuple(filtered_mc_bigrams_counts)

text = """Is there a way to ignore numbers and symbols (like !,.,?,:)?
Since the text are tweets, I want to ignore numbers and symbols, except for the #'s and @'s
https: !# . Thank you . `` 12 hi . 1st place 1 love 13 in @twitter # twitter"""

tokenized_text = nltk.word_tokenize(text)
bigrams = ngrams(tokenized_text, 2)
fdist = FreqDist(bigrams)
mc_bigrams_counts = fdist.most_common(100)
pprint(filter_most_common_bigrams(mc_bigrams_counts))

The key part of the code is:

if all([gram.isalpha() for gram in bigram]) or bigram[0] in "#@" and bigram[1].isalpha(): 
    filtered_mc_bigrams_counts.append((bigram, count)) 

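This checks that either every token in the bigram consists only of letters or, alternatively, that the first token is a # or @ symbol and the second token consists of letters. Because `and` binds more tightly than `or`, the condition reads as `A or (B and C)`. Only bigrams that satisfy it are appended, each together with its count from fdist.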
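To make the condition easier to try in isolation, here is a minimal sketch that pulls it out into a standalone predicate. The name `keep_bigram` and the sample bigrams are made up for illustration, and the marker test uses a tuple membership check rather than the substring check `in "#@"`; for ordinary single-character tokens the two should behave the same.

```python
def keep_bigram(bigram):
    """Return True if the bigram passes the filter described above."""
    first, second = bigram
    all_letters = first.isalpha() and second.isalpha()
    # Same grouping as the original condition: A or (B and C)
    tagged_word = first in ("#", "@") and second.isalpha()
    return all_letters or tagged_word

# Hypothetical sample bigrams, not taken from the tweets file
samples = [
    ("Thank", "you"),   # both alphabetic: kept
    ("#", "twitter"),   # hashtag marker followed by a word: kept
    ("@", "twitter"),   # mention marker followed by a word: kept
    ("https", ":"),     # second token is punctuation: dropped
    ("1st", "place"),   # "1st" contains a digit: dropped
    (".", "I"),         # first token is punctuation: dropped
]
kept = [b for b in samples if keep_bigram(b)]
print(kept)
```

Note that `str.isalpha()` returns False for any token containing a digit or punctuation character, which is why "1st" is rejected.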

Result:

((('to', 'ignore'), 2), 
(('and', 'symbols'), 2), 
(('ignore', 'numbers'), 2), 
(('numbers', 'and'), 2), 
(('for', 'the'), 1), 
(('@', 'twitter'), 1), 
(('Is', 'there'), 1), 
(('text', 'are'), 1), 
(('a', 'way'), 1), 
(('Thank', 'you'), 1), 
(('want', 'to'), 1), 
(('Since', 'the'), 1), 
(('I', 'want'), 1), 
(('#', 'twitter'), 1), 
(('the', 'text'), 1), 
(('are', 'tweets'), 1), 
(('way', 'to'), 1), 
(('except', 'for'), 1), 
(('there', 'a'), 1))
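One thing to be aware of: the code above filters after calling most_common(100), so it may return fewer than 100 rows. If the goal is the top N among the kept bigrams, one possible variation is to filter before counting. The sketch below is standard-library only and uses collections.Counter in place of FreqDist (in NLTK 3, FreqDist is a Counter subclass, so most_common should behave the same way). The names `is_wanted` and `top_filtered_bigrams` and the hand-written token list are assumptions for illustration.

```python
from collections import Counter

def is_wanted(bigram):
    """Word/word pair, or a '#'/'@' marker followed by a word."""
    first, second = bigram
    return (first.isalpha() and second.isalpha()) or (
        first in ("#", "@") and second.isalpha()
    )

def top_filtered_bigrams(tokens, n):
    """Count only the wanted bigrams, then return the n most common."""
    bigrams = zip(tokens, tokens[1:])  # adjacent token pairs
    counts = Counter(b for b in bigrams if is_wanted(b))
    return counts.most_common(n)

# Hypothetical, already-tokenized input standing in for word_tokenize output
tokens = ["Thank", "you", ".", "Thank", "you", "!",
          "#", "nltk", "1", "love", "@", "twitter"]
top = top_filtered_bigrams(tokens, 2)
print(top)
```

The bigrams are still built from the unfiltered token stream, so words that were separated by punctuation or numbers in the original text are not paired up artificially.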