Notes on Chapter 12 of Introduction to Information Retrieval (in English)
Language models for information retrieval
Abstract
This chapter first introduces the concept of language models and then describes the basic and most commonly used language modeling approach to IR, the query likelihood model. It then compares the language modeling approach with other approaches to IR. Finally, it briefly describes various extensions to the language modeling approach.
Language Model
The simplest language model is equivalent to a probabilistic finite automaton.
Figure: A simple finite automaton and some of the strings in the language it generates. An arrow marks the start state of the automaton, and a double circle indicates a (possible) finishing state.
Figure: A one-state finite automaton that acts as a unigram language model, with a partial specification of the state emission probabilities.
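As a concrete sketch of this view, the following Python snippet implements a one-state unigram model (the emission probabilities and stop probability are hypothetical, not the ones from the figure):

```python
# A one-state unigram language model: a single state that emits words
# according to fixed probabilities. (All numbers here are hypothetical.)
emission = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01,
            "said": 0.03, "likes": 0.02, "that": 0.04}
p_stop = 0.2  # probability of halting in the (single) finishing state

def sequence_probability(words):
    """P(w_1 ... w_n): emit each word independently, continuing with
    probability (1 - p_stop) after each emission, then stopping."""
    p = 1.0
    for w in words:
        p *= (1 - p_stop) * emission.get(w, 0.0)
    return p * p_stop

print(sequence_probability(["the", "frog", "said"]))
```

Multiplying in $(1 - p_{stop})$ per word and $p_{stop}$ at the end makes the probabilities of all finite strings sum to one; many IR treatments drop the stop probability, since, if the same value is assumed for every document model, it contributes the same factor to every score.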
Types of language models
- Unigram language model: the simplest model, which estimates each term independently of any context. For example:
  $$P_{uni}(t_1 t_2 t_3 t_4) = P(t_1)P(t_2)P(t_3)P(t_4)$$
- Bigram language model: conditions each term on the previous term. For example:
  $$P_{bi}(t_1 t_2 t_3 t_4) = P(t_1)P(t_2|t_1)P(t_3|t_2)P(t_4|t_3)$$
Such models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models.
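To make the difference concrete, here is a toy Python comparison (the probability tables are hypothetical, chosen only to illustrate the two factorizations above):

```python
# Toy unigram and bigram probability tables (hypothetical values).
unigram = {"frog": 0.01, "said": 0.03, "that": 0.04}
bigram = {("frog", "said"): 0.20, ("said", "that"): 0.50}

def p_unigram(words):
    # Each term is scored independently of its neighbors.
    p = 1.0
    for w in words:
        p *= unigram.get(w, 0.0)
    return p

def p_bigram(words):
    # The first term has no history; each later term is conditioned
    # on the term immediately before it.
    p = unigram.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        p *= bigram.get((prev, cur), 0.0)
    return p

sentence = ["frog", "said", "that"]
print(p_unigram(sentence))  # 0.01 * 0.03 * 0.04
print(p_bigram(sentence))   # 0.01 * 0.20 * 0.50
```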
Multinomial distributions over words
The equations presented above do not give the multinomial probability of a bag of words, because they do not sum over all possible orderings of those words, as is done by the multinomial coefficient (the first term on the right-hand side) in the standard presentation of a multinomial model:
$$P(d) = \frac{L_d!}{tf_{t_1,d}!\, tf_{t_2,d}! \cdots tf_{t_M,d}!}\, P(t_1)^{tf_{t_1,d}} P(t_2)^{tf_{t_2,d}} \cdots P(t_M)^{tf_{t_M,d}}$$
Here $L_d = \sum_{1 \le i \le M} tf_{t_i,d}$ is the length of document $d$ and $M$ is the size of the term vocabulary.
The fundamental problem in designing language models is that we do not know exactly what we should use as the model $M_d$.
We pretend that the document $d$ is only a representative sample of text drawn from a model distribution, treating it like a fine-grained topic. We then estimate a language model from this sample, use that model to calculate the probability of observing any word sequence, and, finally, rank documents according to their probability of generating the query.
The query likelihood model
We construct the corresponding language model $M_d$ for each document $d$ in the collection. Our goal is to rank documents by $P(d|q)$, the likelihood of the document given the query. By Bayes' rule, $P(d|q) = P(q|d)P(d)/P(q)$; $P(q)$ is the same for all documents, and if the prior $P(d)$ is treated as uniform, ranking by $P(d|q)$ is equivalent to ranking by the query likelihood $P(q|d)$.
The most common way to do this is to use the multinomial unigram language model, which is equivalent to a multinomial naive Bayes model,
so we have:
$$P(q|M_d) = K_q \prod_{t \in V} P(t|M_d)^{tf_{t,d}}$$
where $K_q = L_q! / (tf_{t_1,q}!\, tf_{t_2,q}! \cdots tf_{t_M,q}!)$ is the multinomial coefficient for the query $q$; it is constant for a given query and can therefore be ignored for ranking.
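As a toy numeric illustration (the probabilities are made up): suppose $q$ = "frog said", with $P(\text{frog}|M_d) = 0.01$ and $P(\text{said}|M_d) = 0.03$. Each query term occurs once, so $K_q = 2!/(1!\,1!) = 2$ and
$$P(q|M_d) = 2 \times 0.01 \times 0.03 = 6 \times 10^{-4}$$
Since $K_q$ is identical for every document, ranking depends only on the product of the term probabilities.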
- For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process. The approach is to:
  - infer an LM for each document;
  - estimate $P(q|M_{d_i})$, the probability of generating the query according to each of these document models;
  - rank the documents according to these probabilities.
Estimating the query generation probability
How do we estimate $P(q|M_d)$? The probability of producing the query given the LM $M_d$ of document $d$, using maximum likelihood estimation (MLE) and the unigram assumption, is:
$$\hat{P}(q|M_d) = \prod_{t \in q} \hat{P}_{mle}(t|M_d) = \prod_{t \in q} \frac{tf_{t,d}}{L_d}$$
where $L_d$ is the number of tokens in document $d$.
- The obvious shortcoming of the MLE estimate is sparseness: a single query term that does not occur in $d$ makes the whole product zero. This motivates smoothing the probability distributions. The intuition is that an unseen term is still possible, but should be no more likely in the document than predicted by the whole collection; that is, if $tf_{t,d} = 0$, then
  $$\hat{P}(t|M_d) \le \frac{cf_t}{T}$$
  where $cf_t$ is the raw count of the term in the collection, and $T$ is the raw size (number of tokens) of the entire collection.
- A simple idea that works well in practice is to use a mixture between a document-specific multinomial distribution and a multinomial distribution estimated from the entire collection:
  $$\hat{P}(t|d) = \lambda \hat{P}_{mle}(t|M_d) + (1 - \lambda)\hat{P}_{mle}(t|M_c)$$
  where $0 < \lambda < 1$ and $M_c$ is a language model built from the entire document collection. This is linear interpolation (Jelinek-Mercer) smoothing; a runnable sketch of the resulting scorer follows this list.
- Use an LM built from the whole collection as a prior distribution in a Bayesian updating process.
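Putting the pieces together, here is a minimal sketch of query-likelihood retrieval with the linear-interpolation smoothing above (the toy documents, $\lambda = 0.5$, and function names are my own illustrative choices, not the chapter's):

```python
import math
from collections import Counter

# Toy collection (hypothetical documents for illustration).
docs = {
    "d1": "the frog said that the toad likes the pond".split(),
    "d2": "the dog barked at the frog".split(),
    "d3": "stock market report".split(),
}

# Collection statistics: cf_t (term counts over all documents) and T.
collection = Counter()
for words in docs.values():
    collection.update(words)
T = sum(collection.values())

def score(query, doc_words, lam=0.5):
    """Log query likelihood log P(q|d) under the smoothed model
    P(t|d) = lam * tf_{t,d}/L_d + (1 - lam) * cf_t/T."""
    tf = Counter(doc_words)
    L_d = len(doc_words)
    log_p = 0.0
    for t in query:
        p = lam * tf[t] / L_d + (1 - lam) * collection[t] / T
        if p == 0.0:              # term unseen even in the collection
            return float("-inf")
        log_p += math.log(p)
    return log_p

query = "frog pond".split()
for d in sorted(docs, key=lambda d: score(query, docs[d]), reverse=True):
    print(d, round(score(query, docs[d]), 3))
```

Working in log space avoids floating-point underflow on long queries; the ranking is identical to multiplying the raw probabilities.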
Language modeling versus other approaches in information retrieval
The LM approach provides a novel way of looking at the problem of text retrieval, which links it with a lot of recent work in speech and language processing.
The LM approach assumes that documents and expressions of information needs are objects of the same type, and assesses their match by importing the tools and methods of language modeling from speech and natural language processing.
The resulting model is mathematically precise, conceptually simple, computationally tractable, and intuitively appealing.
Extended language modeling approaches
- You can look at the probability of a query language model $M_q$ generating the document. This approach is less appealing because there is much less text available to estimate an LM from the query text; on the other hand, it is easy to see how to incorporate relevance feedback into such a model: expand the query with terms from known relevant documents and re-estimate $M_q$.
- Rather than directly generating in either direction, we can make an LM from both the document and the query, and then ask how different these two language models are from each other.
Figure: Three ways of developing the language modeling approach: (a) query likelihood; (b) document likelihood; (c) model comparison.
For instance, one way to model the risk of returning a document $d$ as relevant to a query $q$ is to use the Kullback-Leibler (KL) divergence between their respective language models:
$$R(d; q) = KL(M_d \| M_q) = \sum_{t \in V} P(t|M_q) \log \frac{P(t|M_q)}{P(t|M_d)}$$
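The divergence in the equation above can be computed directly from two already-smoothed unigram distributions; in this sketch the toy models and the simplified vocabulary handling are my own assumptions, not the chapter's exact recipe:

```python
import math

def kl_divergence(p, q_dist, vocab):
    """KL(p || q) = sum_t p(t) * log(p(t) / q(t)).
    Assumes both distributions are smoothed, so q(t) > 0 wherever p(t) > 0."""
    total = 0.0
    for t in vocab:
        if p.get(t, 0.0) > 0.0:
            total += p[t] * math.log(p[t] / q_dist[t])
    return total

# Hypothetical smoothed query and document models over a tiny vocabulary.
vocab = ["frog", "pond", "stock"]
M_q = {"frog": 0.5, "pond": 0.4, "stock": 0.1}
M_d = {"frog": 0.4, "pond": 0.3, "stock": 0.3}

# Per the equation above, P(t|M_q) sits inside the sum, so we pass the
# query model first. A lower divergence means a better match.
print(kl_divergence(M_q, M_d, vocab))
```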
- Introduce translation models to bridge the query-document gap and address issues of alternate expression.
Assume that the translation model can be represented by a conditional probability distribution $T(\cdot|\cdot)$ between vocabulary terms. The form of the translation query generation model is then:
$$P(q|M_d) = \prod_{t \in q} \sum_{v \in V} P(v|M_d)\, T(t|v)$$
Here $P(v|M_d)$ is the basic document language model and $T(t|v)$ translates document terms $v$ into query terms $t$, so a document can score well even on a query term it never contains.
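A small illustrative implementation of that scoring rule (the document model and translation table below are invented for this example):

```python
# Hypothetical document language model and translation table T(t|v):
# each inner dict gives the probability of "translating" document term v
# into query term t, and each row sums to 1.
P_doc = {"automobile": 0.6, "race": 0.4}
T = {
    "automobile": {"car": 0.5, "automobile": 0.5},
    "race": {"race": 0.8, "competition": 0.2},
}

def translation_query_likelihood(query, p_doc, table):
    """P(q|M_d) = product over query terms t of
    sum over vocabulary terms v of P(v|M_d) * T(t|v)."""
    p = 1.0
    for t in query:
        p *= sum(p_v * table[v].get(t, 0.0) for v, p_v in p_doc.items())
    return p

# "car" never occurs in the document, but the translation table lets
# "automobile" generate it, so the query still gets nonzero probability.
print(translation_query_likelihood(["car", "race"], P_doc, T))
```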