Delete specific words from a file

Problem description:

I want to delete the stop words from a file (each line contains a sentence, a tab, and then an English word). The stop words are in a separate file, and the language is Persian. The code below works, but the problem is that it removes a stop word from one line and then fails to remove the same stop word from other lines. This happens with almost every stop word. I guessed it might be a normalization issue, so I normalized both files by importing the hazm module (hazm is like NLTK, but for Persian). The problem did not change, though. Can anybody help?

from hazm import *

punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~،؟«؛'

file1 = "stopwords.txt"
file2 = "test/پر.txt"

witoutStops = []
corpuslines = []

def RemStopWords(file1, file2):
    with open(file1, encoding="utf-8") as stopfile:
        normalizer = Normalizer()
        stopwords = stopfile.read()
        stopwords = normalizer.normalize(stopwords)
        with open(file2, encoding="utf-8") as trainfile:
            with open("y.txt", "w", encoding="utf-8") as newfile:
                for line in trainfile:
                    tmp = line.strip().split("\t")
                    tmp[0] = normalizer.normalize(tmp[0])
                    corpuslines.append(tmp)
                    for row in corpuslines:
                        line = ""
                        tokens = row[0].split()
                        for token in tokens:
                            if token not in stopwords:
                                line += token + " "
                    line = line.strip() + "\n"
                    for i in punctuation:  # deletes punctuation
                        if i in line:
                            line = line.replace(i, "")
                    newfile.write(line)
                    witoutStops.append(line)
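One thing worth noting about the code above (an editorial observation, not part of the original question): `stopfile.read()` returns a single string, so `token not in stopwords` is a substring test rather than a word-membership test. Splitting the file contents into a set gives exact matching:

```python
# Toy stand-in for the contents of stopwords.txt (English words for clarity).
stopwords_text = "the\nan\nin"

token = "he"
print(token in stopwords_text)   # True: "he" is a substring of "the"

stopword_set = set(stopwords_text.split())
print(token in stopword_set)     # False: exact word membership only
```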

The stop-words file: https://www.dropbox.com/s/irjkjmwkzwnnpnk/stopwords.txt?dl=0

The corpus file: https://www.dropbox.com/s/p4m8san3xhr0pdj/%D9%BE%D8%B1.txt?dl=0


Possible duplicate of [Delete stop words using regular expression](http://stackoverflow.com/questions/41417528/delete-stop-words-using-regular-expression) –

I found the problem. In some of the text, punctuation marks are attached to the words, so the code treats them as part of the word rather than as punctuation. If the punctuation is deleted first, by moving the three lines of punctuation-removal code to just after the line `tmp[0] = normalizer.normalize(tmp[0])`, and the stop words are removed afterwards, then all stop words are removed.
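The fix described above can be sketched as follows, as a minimal, self-contained version: punctuation is stripped before the stop-word check, and the stop words are held in a set. The hazm normalization and the original file paths are omitted here for brevity, and the sample words are illustrative only.

```python
# Punctuation characters from the question (ASCII plus Persian marks).
punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~،؟«؛'

def remove_stop_words(line, stopwords):
    # 1. Delete punctuation first, so marks attached to a word
    #    no longer prevent it from matching a stop word.
    for ch in punctuation:
        line = line.replace(ch, "")
    # 2. Then filter out the stop words token by token.
    return " ".join(t for t in line.split() if t not in stopwords)

# Toy usage; the real code would read stopwords.txt and the corpus file.
stops = {"the", "a", "in"}
print(remove_stop_words("the cat, in the hat.", stops))  # cat hat
```

With punctuation removed first, "hat." becomes "hat" before the membership test, which is exactly why the reordering makes every stop word match.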