Loading a column from a CSV file into spaCy

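Question:

I'm new to spaCy and NLTK in general, so I apologize in advance if this seems like a stupid question.

Based on the spaCy tutorial, I have to use the following command to load text into a doc.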

doc = nlp(u'Hello, world. Natural Language Processing in 10 lines of code.')

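However, I have a lot of text stored in tabular format, either in SQL Server or in Excel. It basically has two columns. The first column has a unique identifier. The second column has a short text.

How do I load these into spaCy? Do I need to convert them into a NumPy array or a pandas DataFrame and then load that into the doc?

Thanks in advance for your help!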

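Answer

This should be pretty straightforward: you can use any method to read the texts from your database (a pandas DataFrame, a CSV reader, etc.) and then iterate over them.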
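As a rough illustration of the "read, then iterate" idea, here is a sketch that pulls (id, text) rows out of a database with Python's built-in sqlite3 module. SQLite only stands in for SQL Server here, and the table name `documents` and its columns `doc_id` and `text` are made up for the example. With SQL Server you would need a suitable driver, but the loop would probably look much the same.

```python
import sqlite3


def fetch_texts(conn):
    """Yield (doc_id, text) pairs from the hypothetical `documents` table."""
    cursor = conn.execute("SELECT doc_id, text FROM documents ORDER BY doc_id")
    for doc_id, text in cursor:
        yield doc_id, text


def demo_connection():
    """Build a small in-memory database so the sketch runs on its own."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, text TEXT)")
    conn.executemany(
        "INSERT INTO documents (doc_id, text) VALUES (?, ?)",
        [(1, "Foo bar bar"), (2, "bar bar black sheep")],
    )
    conn.commit()
    return conn
```

Each pair yielded by `fetch_texts(demo_connection())` can then be handed to `nlp` one at a time.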

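In the end it depends on what you want to do and how you want to process your texts. If you want to process each text individually, simply iterate over your data line by line: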

for doc_id, line in rows:  # rows: an iterable of (id, text) pairs
    doc = nlp(line)
    # do something with each text
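To make the loop above a bit more concrete, here is a minimal sketch that reads the two columns with the standard csv module and keeps each parsed result next to its identifier. The default column names `DocID` and `Text` and the tab delimiter are assumptions for the example, and `nlp` is just any callable that takes a string, so nothing here depends on a particular spaCy version.

```python
import csv


def parse_rows(path, nlp, id_column="DocID", text_column="Text", delimiter="\t"):
    """Return {id: nlp(text)} for every row of a delimited file.

    The ids are kept as strings, exactly as csv.DictReader returns them.
    """
    parsed = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            parsed[row[id_column]] = nlp(row[text_column])
    return parsed
```

With spaCy installed, `nlp` would be the object returned by `spacy.load(...)`, and each value in the returned dict would then be a `Doc`.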

Alternatively, you can join the texts into one string and process them as a single document:

with open('some_large_text_file.txt') as f:  # closes the file when done
    text = f.read()
doc = nlp(text)

For a more advanced usage example, see this code snippet of streaming input and output using pipe().
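Since the question has an identifier column, one thing that may help when streaming is the `as_tuples=True` option of `nlp.pipe()`, which carries a piece of context along with each text. The helper below is only a sketch: the name `pipe_with_ids` is made up, and it assumes `rows` is an iterable of (id, text) pairs.

```python
def pipe_with_ids(nlp, rows, batch_size=1000):
    """Yield (doc_id, doc) pairs, letting spaCy batch the texts internally.

    With as_tuples=True, nlp.pipe() takes (text, context) tuples and
    yields (doc, context) tuples, so the id travels with its text.
    """
    text_and_id = ((text, doc_id) for doc_id, text in rows)
    for doc, doc_id in nlp.pipe(text_and_id, as_tuples=True, batch_size=batch_size):
        yield doc_id, doc
```

Because this is a generator, nothing is parsed until you loop over the result, which keeps memory use low for large tables.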

However, if you read the data into a DataFrame, you can use 'df.apply()' or an equivalent to feed the rows to 'nlp' instead of iterating. – alexis

Answer

Given a CSV file (tab-separated here) like this:

$ cat test.tsv 
DocID Text WhateverAnnotations 
1 Foo bar bar dot dot dot 
2 bar bar black sheep dot dot dot dot 

$ cut -f2 test.tsv 
Text 
Foo bar bar 
bar bar black sheep

And in code:

$ python 
>>> import pandas as pd 
>>> pd.read_csv('test.tsv', delimiter='\t') 
    DocID     Text WhateverAnnotations 
0  1   Foo bar bar   dot dot dot 
1  2 bar bar black sheep  dot dot dot dot 
>>> df = pd.read_csv('test.tsv', delimiter='\t') 
>>> df['Text'] 
0   Foo bar bar 
1 bar bar black sheep 
Name: Text, dtype: object

To use pipe in spaCy:

>>> import spacy 
>>> nlp = spacy.load('en') 
>>> for parsed_doc in nlp.pipe(iter(df['Text']), batch_size=1, n_threads=4): 
...  print (parsed_doc[0].text, parsed_doc[0].tag_) 
... 
Foo NNP 
bar NN

To use pandas.DataFrame.apply():

>>> df['Parsed'] = df['Text'].apply(nlp) 

>>> df['Parsed'].iloc[0] 
Foo bar bar 
>>> type(df['Parsed'].iloc[0]) 
<class 'spacy.tokens.doc.Doc'> 
>>> df['Parsed'].iloc[0][0].tag_ 
'NNP' 
>>> df['Parsed'].iloc[0][0].text 
'Foo'

To benchmark.

First, duplicate the rows to get 2 million lines:

$ cat test.tsv 
DocID Text WhateverAnnotations 
1 Foo bar bar dot dot dot 
2 bar bar black sheep dot dot dot dot 

$ tail -n 2 test.tsv > rows2 

$ perl -ne 'print "$_" x1000000' rows2 > rows2000000 

$ cat test.tsv rows2000000 > test-2M.tsv 

$ wc -l test-2M.tsv 
2000003 test-2M.tsv 

$ head test-2M.tsv 
DocID Text WhateverAnnotations 
1 Foo bar bar dot dot dot 
2 bar bar black sheep dot dot dot dot 
1 Foo bar bar dot dot dot 
1 Foo bar bar dot dot dot 
1 Foo bar bar dot dot dot 
1 Foo bar bar dot dot dot 
1 Foo bar bar dot dot dot 
1 Foo bar bar dot dot dot 
1 Foo bar bar dot dot dot
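If perl is not at hand, the same test file can be produced from Python. This is a sketch of the shell steps above, not part of the original benchmark: the function name `make_benchmark_tsv` is made up, and it assumes the first line of the source file is the header and every later line is a data row.

```python
def make_benchmark_tsv(src_path, dst_path, repeats=1_000_000):
    """Copy src_path to dst_path, then append each data row `repeats` times."""
    with open(src_path, encoding="utf-8") as src:
        # Make sure every line ends with a newline so rows never run together.
        lines = [line if line.endswith("\n") else line + "\n" for line in src]
    data_rows = lines[1:]  # everything after the header line
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines)  # header plus the original rows
        for row in data_rows:
            for _ in range(repeats):
                dst.write(row)  # same order as the perl one-liner
```

Calling `make_benchmark_tsv('test.tsv', 'test-2M.tsv')` on the three-line file should give the same 2000003 lines as the shell version.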

[nlppipe.py]:

import time 

import pandas as pd 
import spacy 


df = pd.read_csv('test-2M.tsv', delimiter='\t') 
nlp = spacy.load('en') 

start = time.time() 
for parsed_doc in nlp.pipe(iter(df['Text']), batch_size=1000, n_threads=4): 
    x = parsed_doc[0].tag_ 
print (time.time() - start)

[dfapply.py]:

import time 

import pandas as pd 
import spacy 


df = pd.read_csv('test-2M.tsv', delimiter='\t') 
nlp = spacy.load('en') 

start = time.time() 
df['Parsed'] = df['Text'].apply(nlp) 

for doc in df['Parsed']: 
    x = doc[0].tag_ 
print (time.time() - start)
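Both scripts follow the same start/stop timing pattern, so if you want to compare more variants it might be handy to factor it out. The helpers below are only a sketch with made-up names (`time_call`, `compare`). Each strategy is assumed to be a function that takes the texts and fully consumes its results, which matters because `nlp.pipe()` is lazy.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def compare(strategies, data):
    """Time each named strategy on the same data; return {name: seconds}."""
    timings = {}
    for name, func in strategies.items():
        _, timings[name] = time_call(func, data)
    return timings
```

For example, the strategies dict could map 'pipe' to a function that loops over `nlp.pipe(texts, batch_size=1000)` and 'apply' to one that calls `nlp` on each text in turn.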

Answer

I think alexis's comment about using pandas.apply() is the best answer; this worked well for me:

import pandas as pd
import spacy

nlp = spacy.load('en')  # load the model, as in the other answers
df = pd.read_csv('doc filename.txt')
df['text_as_spacy_objects'] = df['text column name'].apply(nlp)
