How to tag text for spaCy's sense2vec implementation
Question:
spaCy has implemented a sense2vec word embeddings package, which they have documented.
The vectors are all of the form WORD|POS.
For example, the sentence
Dear local newspaper, I think effects computers have on people are great learning skills/affects because they give us time to chat with friends/new people, helps us learn about the globe(astronomy) and keeps us out of trouble
needs to be converted into
Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT I|PRON think|VERB effects|NOUN computers|NOUN have|VERB on|ADP people|NOUN are|VERB great|ADJ learning|NOUN skills/affects|NOUN because|ADP they|PRON give|VERB us|PRON time|NOUN to|PART chat|VERB with|ADP friends/new|ADJ people|NOUN ,|PUNCT helps|VERB us|PRON learn|VERB about|ADP the|DET globe(astronomy|NOUN)|PUNCT and|CONJ keeps|VERB us|PRON out|ADP of|ADP trouble|NOUN !|PUNCT
so that it can be passed through the sense2vec pretrained embeddings and be interpretable in the sense2vec format.
How can this be done?
Answer
The following is based on spaCy's bin/merge.py implementation, which does exactly what is needed:
# Note: this code targets the old spaCy 1.x API (spacy.en, Span.merge).
from spacy.en import English
import re

# Map spaCy entity labels to the coarser tags used in sense2vec keys.
LABELS = {
    'ENT': 'ENT',
    'PERSON': 'ENT',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

# The pipeline is loaded lazily on the first call.
nlp = None


def tag_words_in_sense2vec_format(passage):
    global nlp
    if nlp is None:
        nlp = English()
    # spaCy expects unicode text, so decode byte strings first.
    if isinstance(passage, bytes):
        passage = passage.decode('utf-8', errors='ignore')
    doc = nlp(passage)
    return transform_doc(doc)


def transform_doc(doc):
    # Merge each named entity into a single token tagged with its mapped label.
    for index, ent in enumerate(doc.ents):
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
        #if index % 100 == 0: print ("enumerating at entity index " + str(index));
    #for np in doc.noun_chunks:
    #    while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
    #        np = np[1:]
    #    np.merge(np.root.tag_, np.text, np.root.ent_type_)
    # Emit one line of WORD|TAG tokens per non-empty sentence.
    strings = []
    for index, sent in enumerate(doc.sents):
        if sent.text.strip():
            strings.append(' '.join(represent_word(w) for w in sent if not w.is_space))
        #if index % 100 == 0: print ("converting at sentence index " + str(index));
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''


def represent_word(word):
    # URLs collapse to a single placeholder key.
    if word.like_url:
        return '%%URL|X'
    # Whitespace inside a merged token becomes an underscore.
    text = re.sub(r'\s', '_', word.text)
    # Prefer the mapped entity label, falling back to the coarse POS tag.
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag
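The tag-selection rules in represent_word can be tried without loading a spaCy model. The following is a minimal sketch, assuming a hypothetical FakeToken stand-in that carries only the four attributes the function reads (text, pos_, ent_type_, like_url). FakeToken is not part of spaCy, and the LABELS table is trimmed to a few entries for brevity.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; only the attributes that
# represent_word reads are modelled here.
FakeToken = namedtuple('FakeToken', ['text', 'pos_', 'ent_type_', 'like_url'])

# Trimmed copy of the LABELS table, for illustration only.
LABELS = {'PERSON': 'ENT', 'ORG': 'ENT', 'GPE': 'ENT', 'DATE': 'DATE'}


def represent_word(word):
    # URLs collapse to a single placeholder key.
    if word.like_url:
        return '%%URL|X'
    # Whitespace inside a (merged) token becomes an underscore.
    text = re.sub(r'\s', '_', word.text)
    # Prefer the mapped entity label, fall back to the coarse POS tag.
    tag = LABELS.get(word.ent_type_, word.pos_)
    if not tag:
        tag = '?'
    return text + '|' + tag


examples = [
    FakeToken('newspaper', 'NOUN', '', False),            # plain word
    FakeToken('New York', 'PROPN', 'GPE', False),         # merged entity
    FakeToken('http://example.com', 'X', '', True),       # URL
    FakeToken('foo', '', '', False),                      # no tag available
]
for tok in examples:
    print(represent_word(tok))
```

The four cases cover an ordinary word, a multi-word entity, a URL, and a token with no tag at all, which are the branches the function distinguishes.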
Calling
print(tag_words_in_sense2vec_format("Dear local newspaper, ..."))
produces
Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT ...
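Going the other way can be handy for checking the output or building lookup keys. This is a minimal sketch, assuming tokens are separated by whitespace and that the tag follows the last '|' in each token. parse_tagged_line is a hypothetical helper, not part of spaCy or sense2vec.

```python
def parse_tagged_line(line):
    """Split a line of WORD|TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last '|', so a word that itself
        # contains '|' keeps everything before the final separator.
        word, sep, tag = token.rpartition('|')
        if not sep:
            # No separator found: treat the whole token as the word.
            word, tag = token, '?'
        pairs.append((word, tag))
    return pairs


line = "Dear|ADJ local|ADJ newspaper|NOUN ,|PUNCT"
for word, tag in parse_tagged_line(line):
    print(word, tag)
```

Because merged entities use underscores instead of spaces, splitting on whitespace keeps each multi-word entity as a single key.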