Overview of a Search-Based Question Answering System
-
Tokenization: How | do | you | like | NLP?
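A minimal sketch of the tokenization step, assuming English input and a simple regular-expression tokenizer. The function name `tokenize` is hypothetical. Unlike the whitespace split shown above, this version separates the question mark into its own token, which usually makes later filtering easier.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("How do you like NLP?")
print(" | ".join(tokens))
```

For languages written without spaces, such as Chinese, a dictionary-based or statistical word segmenter would be needed instead of a regular expression.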
-
Preprocessing:
Spell correction
Lemmatisation and stemming
Stop-word removal
Word filtering
Synonyms
…
-
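The preprocessing steps listed above can be sketched as a single pass over the tokens. This is only an illustration: `VOCABULARY`, `STOP_WORDS`, and `SYNONYMS` are hypothetical toy resources, spell correction is approximated with the standard library's `difflib.get_close_matches`, and `stem` is a crude suffix stripper standing in for a real stemmer or lemmatiser.

```python
import difflib

# Hypothetical toy resources; a real system would load much larger ones.
VOCABULARY = ["how", "do", "you", "like", "love", "nlp", "search", "engine", "engines"]
STOP_WORDS = {"how", "do", "you", "the", "a"}
SYNONYMS = {"love": "like"}  # map each synonym to one canonical word

def correct_spelling(token: str) -> str:
    """Replace an unknown token with its closest vocabulary entry, if one is close enough."""
    if token in VOCABULARY:
        return token
    matches = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token

def stem(token: str) -> str:
    """Strip a few common English suffixes (a rough stand-in for real stemming)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(tokens: list[str]) -> list[str]:
    result = []
    for token in tokens:
        token = token.lower()
        if not token.isalnum():             # word filter: drop punctuation-only tokens
            continue
        token = correct_spelling(token)     # spell correction
        if token in STOP_WORDS:             # stop-word removal
            continue
        token = SYNONYMS.get(token, token)  # synonym normalisation
        result.append(stem(token))          # stemming
    return result

print(preprocess(["How", "do", "you", "lov", "serch", "engines", "?"]))
```

The order of the steps is a design choice: correcting spelling before the stop-word and synonym lookups lets misspelled words still hit those tables.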
Text representation (text → vector):
Boolean Vector (0,1,0,0,0,1,0)
Count Vector (0,1,2,0,0,0,0)
TF-IDF (0.7,0.3,0,0.2,0)
Word2Vec (0.15,0.6,0.3,0.7)
Seq2Seq (0.03,0.05,0.02,0.01)
-
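The first three text representations listed above (Boolean, count, and TF-IDF vectors) can be computed without any library, as sketched here. The helper names are hypothetical, and the TF-IDF formula used (term frequency divided by document length, times log(N / document frequency)) is one common variant among several. Word2Vec and Seq2Seq vectors require a trained model and are not covered by this sketch.

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of every distinct token in the corpus."""
    return sorted({token for doc in docs for token in doc})

def boolean_vector(doc, vocab):
    """1 if the term occurs in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]

def count_vector(doc, vocab):
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def tfidf_vector(doc, vocab, docs):
    """tf = count / document length, idf = log(N / document frequency)."""
    counts = Counter(doc)
    vector = []
    for term in vocab:
        tf = counts[term] / len(doc)
        df = sum(1 for d in docs if term in d)  # >= 1 because vocab comes from docs
        vector.append(tf * math.log(len(docs) / df))
    return vector

# Toy corpus of already tokenized documents (illustrative only).
docs = [["like", "nlp"], ["like", "search", "search"], ["nlp", "search", "engine"]]
vocab = build_vocabulary(docs)

print(vocab)
print(boolean_vector(docs[1], vocab))
print(count_vector(docs[1], vocab))
print(tfidf_vector(docs[1], vocab, docs))
```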
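Compute similarity (use an inverted index to limit comparisons to candidate documents and reduce time complexity):
Euclidean distance
Cosine distance
Jaccard distance
-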
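The three distance measures listed above, together with a small inverted index, might look like the following. This is a sketch under simple assumptions: vectors are plain Python lists of equal length, documents are token lists, and all function names are hypothetical. The inverted index maps each token to the ids of the documents containing it, so a query only needs to be compared against documents that share at least one token with it.

```python
import math
from collections import defaultdict

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 for a zero vector (a convention chosen here)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def jaccard_distance(tokens_a, tokens_b):
    """1 - |intersection| / |union| over the two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for token in tokens:
            index[token].add(doc_id)
    return index

def candidate_ids(query_tokens, index):
    """Ids of documents sharing at least one token with the query."""
    ids = set()
    for token in query_tokens:
        ids |= index.get(token, set())
    return ids

docs = [["like", "nlp"], ["search", "engine"], ["nlp", "search"]]
index = build_inverted_index(docs)
print(candidate_ids(["nlp"], index))  # only these documents need a similarity score
```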
Rank by similarity
-
Filter
-
Return results
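The last three steps (rank, filter, return) can be sketched over a tiny question-answer store. Everything here is an assumption made for illustration: the `QA_PAIRS` data, the use of Jaccard similarity as the score, and the `min_score` and `top_k` parameters. The outline does not say which filter is applied, so a score threshold is used as one plausible choice.

```python
def jaccard_similarity(tokens_a, tokens_b):
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Hypothetical store of (tokenized question, answer) pairs.
QA_PAIRS = [
    (["what", "is", "nlp"], "NLP stands for natural language processing."),
    (["what", "is", "an", "inverted", "index"],
     "An inverted index maps each term to the documents containing it."),
    (["how", "to", "cook", "rice"], "Rinse the rice, then simmer it in water."),
]

def answer(query_tokens, qa_pairs, min_score=0.2, top_k=2):
    # Score every stored question against the query.
    scored = [(jaccard_similarity(query_tokens, question), reply)
              for question, reply in qa_pairs]
    # Rank: highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Filter: drop weak matches below the threshold.
    kept = [(score, reply) for score, reply in scored if score >= min_score]
    # Return: at most top_k results.
    return kept[:top_k]

for score, reply in answer(["what", "is", "an", "index"], QA_PAIRS):
    print(f"{score:.2f}  {reply}")
```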