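Assume a document has been segmented into the tokens $w_1, w_2, w_3, \dots, w_n$. In the formulas that follow, $p(w)$ denotes the probability of the whole token sequence.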
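unigram (1-gram):
Every token in the sentence is treated as independent of the others, so the sentence probability is simply the product of the individual token probabilities.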
$$
\begin{aligned}
p(w) &= p(w_1)\, p(w_2)\, p(w_3) \cdots p(w_n) \\
&= \prod_{i=1}^{n} p(w_i)
\end{aligned}
$$
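The unigram product can be sketched in a few lines of Python. This is only an illustration: the notes do not say how $p(w_i)$ is obtained, so the sketch assumes a maximum-likelihood estimate (count of the token divided by the total number of tokens). The toy corpus and the function names `unigram_probs` and `unigram_sentence_prob` are hypothetical.

```python
from collections import Counter
import math


def unigram_probs(corpus):
    """Estimate p(w) for each token as count(w) / total tokens (assumed MLE)."""
    counts = Counter(corpus)
    total = len(corpus)
    return {token: count / total for token, count in counts.items()}


def unigram_sentence_prob(sentence, probs):
    """Multiply the independent token probabilities; unseen tokens get 0."""
    return math.prod(probs.get(token, 0.0) for token in sentence)


# Hypothetical toy corpus, already tokenized.
corpus = ["i", "like", "nlp", "i", "like", "python"]
probs = unigram_probs(corpus)

# p(i) * p(like) * p(nlp) = (2/6) * (2/6) * (1/6) = 1/54
p_unigram = unigram_sentence_prob(["i", "like", "nlp"], probs)
print(p_unigram)
```

Because the tokens are treated as independent, any reordering of the same tokens gets the same probability under this model.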
bigram (2-gram):
Under the Markov assumption, the probability of each token is conditioned only on the token immediately before it.
$$
\begin{aligned}
p(w) &= p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1}) \\
&= p(w_1) \prod_{i=2}^{n} p(w_i \mid w_{i-1})
\end{aligned}
$$
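A minimal sketch of the bigram product follows. It assumes the conditional probabilities are estimated by counting, $p(w_i \mid w_{i-1}) \approx \mathrm{count}(w_{i-1}, w_i) / \mathrm{count}(w_{i-1})$, with no smoothing; that estimator is an assumption of this sketch, not something stated in the notes. The toy corpus and the name `bigram_sentence_prob` are hypothetical.

```python
from collections import Counter


def bigram_sentence_prob(sentence, corpus):
    """p(w1) * prod_{i>=2} p(w_i | w_{i-1}), using unsmoothed count ratios."""
    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    if not sentence:
        return 1.0  # empty product
    if unigram_counts[sentence[0]] == 0:
        return 0.0  # first token never seen in the corpus

    # p(w1): plain unigram estimate for the first token.
    prob = unigram_counts[sentence[0]] / len(corpus)

    # p(w_i | w_{i-1}) for every adjacent pair in the sentence.
    for prev, cur in zip(sentence, sentence[1:]):
        if unigram_counts[prev] == 0:
            return 0.0
        prob *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return prob


# Hypothetical toy corpus, already tokenized.
corpus = ["i", "like", "nlp", "i", "like", "python"]

# p(i) * p(like | i) * p(nlp | like) = (2/6) * (2/2) * (1/2) = 1/6
p_bigram = bigram_sentence_prob(["i", "like", "nlp"], corpus)
print(p_bigram)
```

Unlike the unigram case, word order now matters: a sentence containing a pair that never appears adjacently in the corpus gets probability 0 under this unsmoothed estimate.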
trigram (3-gram):
Under the (second-order) Markov assumption, the probability of each token is conditioned on the two tokens immediately before it.
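$$
\begin{aligned}
p(w) &= p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2, w_1) \cdots p(w_n \mid w_{n-1}, w_{n-2}) \\
&= p(w_1)\, p(w_2 \mid w_1) \prod_{i=3}^{n} p(w_i \mid w_{i-1}, w_{i-2})
\end{aligned}
$$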
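The trigram factorization can be sketched the same way. As an assumption of this sketch (the notes do not specify an estimator), each conditional is an unsmoothed count ratio, $p(w_i \mid w_{i-1}, w_{i-2}) \approx \mathrm{count}(w_{i-2}, w_{i-1}, w_i) / \mathrm{count}(w_{i-2}, w_{i-1})$. The toy corpus and the name `trigram_sentence_prob` are hypothetical.

```python
from collections import Counter


def trigram_sentence_prob(sentence, corpus):
    """p(w1) * p(w2 | w1) * prod_{i>=3} p(w_i | w_{i-1}, w_{i-2}), unsmoothed."""
    uni = Counter(corpus)
    bi = Counter(zip(corpus, corpus[1:]))
    tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

    if not sentence:
        return 1.0  # empty product
    if uni[sentence[0]] == 0:
        return 0.0  # first token never seen in the corpus

    # p(w1): unigram estimate for the first token.
    prob = uni[sentence[0]] / len(corpus)

    # p(w2 | w1): only one token of history is available here.
    if len(sentence) >= 2:
        prob *= bi[(sentence[0], sentence[1])] / uni[sentence[0]]

    # p(w_i | w_{i-1}, w_{i-2}) for i >= 3.
    for first, second, third in zip(sentence, sentence[1:], sentence[2:]):
        if bi[(first, second)] == 0:
            return 0.0  # history pair never seen
        prob *= tri[(first, second, third)] / bi[(first, second)]
    return prob


# Hypothetical toy corpus, already tokenized.
corpus = ["i", "like", "nlp", "i", "like", "python"]

# p(i) * p(like | i) * p(nlp | like, i) = (2/6) * (2/2) * (1/2) = 1/6
p_trigram = trigram_sentence_prob(["i", "like", "nlp"], corpus)
print(p_trigram)
```

Longer histories make the counts sparser, so in practice zero counts become much more common as the order of the n-gram grows; this sketch simply returns 0 in that case.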
