Removing specific words from a file
Problem description:
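I want to remove stop words from a file (each line contains a sentence, a tab, and then an English word). The stop words are in a separate file, and the language is Persian. The code below works, but it removes a stop word in one line and leaves the same stop word in other lines. This happens for almost every stop word. I guessed it might be a normalization issue, so I normalized both files with the hazm module (hazm is like NLTK, but for Persian). The problem did not change. Can somebody help?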
from hazm import *
punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~،؟«؛'
file1 = "stopwords.txt"
file2 = "test/پر.txt"
witoutStops = []
corpuslines = []
def RemStopWords (file1, file2):
    with open(file1, encoding = "utf-8") as stopfile:
        normalizer = Normalizer()
        stopwords = stopfile.read()
        stopwords = normalizer.normalize(stopwords)
        with open(file2, encoding = "utf-8") as trainfile:
            with open ("y.txt", "w", encoding = "utf-8") as newfile:
                for line in trainfile:
                    tmp = line.strip().split("\t")
                    tmp[0] = normalizer.normalize(tmp[0])
                    corpuslines.append(tmp)
                for row in corpuslines:
                    line = ""
                    tokens = row[0].split()
                    for token in tokens:
                        if token not in stopwords:
                            line += token + " "
                    line = line.strip() + "\n"
                    for i in punctuation: # deletes punctuations
                        if i in line:
                            line = line.replace(i, "")
                    newfile.write(line)
                    witoutStops.append (line)
Stop-word file: https://www.dropbox.com/s/irjkjmwkzwnnpnk/stopwords.txt?dl=0
Corpus file: https://www.dropbox.com/s/p4m8san3xhr0pdj/%D9%BE%D8%B1.txt?dl=0
Answer
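I found the problem. In some lines a punctuation mark is attached to a word, and the code treats it as part of the word rather than as punctuation, so the token no longer matches the stop word. If the punctuation is removed first, by moving the three lines that strip punctuation to right after the line "tmp[0] = normalizer.normalize(tmp[0])", and the stop words are removed only after that, all stop words are dropped.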
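The reordering described above can be sketched roughly as follows. This is a minimal illustration and not the asker's final code: the helper names `load_stop_words` and `remove_stop_words` are hypothetical, the hazm normalization step is left out so the snippet runs on its own, and storing the stop words in a set (so that only whole tokens match, not substrings of the file contents) is an extra change assumed here rather than something stated in the answer.

```python
# Same punctuation characters as in the question's code.
PUNCTUATION = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~،؟«؛'

# Translation table that deletes every punctuation character.
_STRIP_PUNCT = str.maketrans("", "", PUNCTUATION)


def load_stop_words(text):
    """Split the stop-word file contents into a set of whole words.

    Assumption: one stop word per whitespace-separated token.
    """
    return set(text.split())


def remove_stop_words(sentence, stop_words):
    """Strip punctuation first, then drop tokens that are stop words."""
    cleaned = sentence.translate(_STRIP_PUNCT)
    return " ".join(tok for tok in cleaned.split() if tok not in stop_words)


# "and," only matches the stop word "and" once the comma is gone.
stop_words = load_stop_words("the\nof\nand\n")
print(remove_stop_words("the end, of the road and, the", stop_words))
# prints: end road
```

In the original function, the same idea would mean normalizing `tmp[0]`, stripping punctuation from it, and only then splitting it into tokens and filtering them.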
Possible duplicate of [Delete stop words using regular expression](http://stackoverflow.com/questions/41417528/delete-stop-words-using-regular-expression)