Splitting pandas DataFrame column values

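Question:

I have an Excel dataset containing usertype, ID, and a description of the property. I have imported this file into Python pandas as a dataframe (df).

Now I want to split the content of the Description column into one-word, two-word, and three-word tokens. I can do one-word tokenization with the help of the NLTK library, but I am stuck on two-word and three-word tokenization. For example, one of the rows in the Description column contains the sentence -

A brand new residential apartment in Mumbai main road with portable water.

I want this sentence to be split as

"A brand", "brand new", "new residential", "residential apartment" ... "portable water".

And this split should be applied to every row of that column.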

Image of my dataset in excel format

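How about 1) not posting images, 2) not posting links to images, and 3) much less links to images of _excel_ data.

And read: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples

There is an 'ngrams' function in nltk that does this easily; it takes an argument for the number of words you want to group together. – kev8484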

Answer

Here is a small example using ngrams from nltk. Hope it helps:

import pandas as pd
from nltk.util import ngrams
from nltk import word_tokenize

# Creating test dataframe
df = pd.DataFrame({'text': ['my first sentence',
                            'this is the second sentence',
                            'third sent of the dataframe']})
print(df)

Input dataframe:

                          text
0            my first sentence
1  this is the second sentence
2  third sent of the dataframe

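Now we can use ngrams together with word_tokenize to build bigrams and trigrams and apply this to every row of the dataframe. For bigrams, we pass the tokenized words and the value 2 to the ngrams function; for trigrams, we pass the value 3. The result returned by ngrams is a generator, so it is converted to a list. For each row, the lists of bigrams and trigrams are stored in separate columns.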

df['bigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 2))) 
df['trigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 3))) 
print(df)

Result:

     text \ 
0   my first sentence 
1 this is the second sentence 
2 third sent of the dataframe 

                bigram \ 
0       [(my, first), (first, sentence)] 
1 [(this, is), (is, the), (the, second), (second, sentence)] 
2 [(third, sent), (sent, of), (of, the), (the, dataframe)] 

                trigram 
0          [(my, first, sentence)] 
1 [(this, is, the), (is, the, second), (the, second, sentence)] 
2  [(third, sent, of), (sent, of, the), (of, the, dataframe)]
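
The question asks for phrases such as "A brand" and "brand new" rather than tuples. A minimal sketch of one way to get that follows. It is not the answerer's code: it assumes a plain whitespace split is an acceptable tokenizer (so punctuation stays attached to words), builds the n-grams with zip instead of nltk, and uses a hypothetical helper name, ngram_phrases, and a shortened sample sentence. Note also that word_tokenize in the answer above relies on NLTK tokenizer data that may have to be downloaded first.

```python
import pandas as pd


def ngram_phrases(text, n):
    """Return every run of n consecutive words in `text` as one string.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    tokens = text.split()
    # zip over n shifted copies of the token list yields the n-word windows
    windows = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(window) for window in windows]


# Shortened version of the sentence from the question
df = pd.DataFrame({"Description": ["A brand new residential apartment"]})

df["bigram"] = df["Description"].apply(lambda s: ngram_phrases(s, 2))
df["trigram"] = df["Description"].apply(lambda s: ngram_phrases(s, 3))

print(df["bigram"][0])
print(df["trigram"][0])
```

If real tokenization is needed (for example, separating the trailing period from "water."), the whitespace split can be swapped for word_tokenize as in the answer above.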
