Notes on Introduction to Information Retrieval, Chapter 12 (in English)

Language models for information retrieval

Abstract

This chapter first introduces the concept of language models, and then describes the basic and most commonly used language modeling approach to IR, the query likelihood model. It then compares the language modeling approach with other approaches to IR. Finally, it briefly describes various extensions to the language modeling approach.

Language Model

The simplest language model is equivalent to a probabilistic finite automaton.


A simple finite automaton and some of the strings in the language it generates. An arrow indicates the start state of the automaton, and a double circle indicates a (possible) finishing state.


A one-state finite automaton that acts as a unigram language model. We show a partial specification of the state emission probabilities.

Types of language models

  • unigram language model

$P_{\text{uni}}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)$

for example:

$P_{\text{uni}}(\text{frog likes toad}) = P(\text{frog})\,P(\text{likes})\,P(\text{toad})$

  • bigram language models, which condition on the previous term:

$P_{\text{bi}}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)$

Such models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models.
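The contrast between the two model types can be sketched as follows; the emission probabilities are made up purely for illustration:

```python
def unigram_prob(tokens, p):
    """P_uni(t1 ... tn) = P(t1) * P(t2) * ... * P(tn): every term is
    generated independently, so word order does not matter."""
    prob = 1.0
    for t in tokens:
        prob *= p.get(t, 0.0)
    return prob

def bigram_prob(tokens, p_first, p_pair):
    """P_bi(t1 ... tn) = P(t1) * P(t2|t1) * ... * P(tn|t(n-1)):
    each term is conditioned on the previous term."""
    prob = p_first.get(tokens[0], 0.0)
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p_pair.get((prev, cur), 0.0)
    return prob

# Illustrative (assumed) emission probabilities:
p_uni = {"frog": 0.01, "likes": 0.02, "toad": 0.01}
print(unigram_prob(["frog", "likes", "toad"], p_uni))  # 0.01 * 0.02 * 0.01
```

Note that the unigram score is unchanged under any reordering of the tokens, while the bigram score is not.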

Multinomial distributions over words

The equations presented above do not give the multinomial probability of a bag of words, because they do not sum over all possible orderings of those words, as is done by the multinomial coefficient (the first term on the right-hand side) in the standard presentation of a multinomial model:

$P(d) = \dfrac{L_d!}{tf_{t_1,d}!\, tf_{t_2,d}! \cdots tf_{t_M,d}!}\, P(t_1)^{tf_{t_1,d}}\, P(t_2)^{tf_{t_2,d}} \cdots P(t_M)^{tf_{t_M,d}}$
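A minimal sketch of this bag-of-words probability, with the multinomial coefficient made explicit (toy probabilities assumed):

```python
import math
from collections import Counter

def multinomial_bag_prob(bag, p):
    """P(d) = L_d! / (tf_1! * ... * tf_M!) * prod_t P(t)^tf_t,
    where the leading coefficient counts the orderings of the bag."""
    tf = Counter(bag)
    L = sum(tf.values())
    coeff = math.factorial(L)
    prob = 1.0
    for t, n in tf.items():
        coeff //= math.factorial(n)  # divide by tf! for each distinct term
        prob *= p[t] ** n
    return coeff * prob

# 3!/(2! * 1!) = 3 orderings of the bag {a, a, b}:
print(multinomial_bag_prob(["a", "a", "b"], {"a": 0.5, "b": 0.5}))  # 3 * 0.125 = 0.375
```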

The fundamental problem in designing language models is that we do not know what exactly we should use as the model $M_d$.

We pretend that the document d is only a representative sample of text drawn from a model distribution, treating it like a fine-grained topic. We then estimate a language model from this sample, use that model to calculate the probability of observing any word sequence, and, finally, rank documents according to their probability of generating the query.

The query likelihood model

We construct the corresponding language model $M_d$ for each document d in the collection. Our goal is to rank the documents by $P(d|q)$, the likelihood of the document given the query.

$P(d \mid q) = \dfrac{P(q \mid d)\, P(d)}{P(q)}$

The most common way to do this is to use the multinomial unigram language model, which is equivalent to a multinomial naive Bayes model,

so we have:

$P(q \mid M_d) = K_q \prod_{t \in V} P(t \mid M_d)^{tf_{t,q}}$

$K_q$ is the multinomial coefficient for the query q.

  • For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process. The approach is to:
    1. infer an LM for each document
    2. estimate $P(q \mid M_{d_i})$, the probability of generating the query according to each of these document models
    3. rank the documents according to these probabilities
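The three steps can be sketched as follows; the document texts are made up for illustration, and no smoothing is applied yet (so an unseen query term zeroes a score):

```python
from collections import Counter

def mle_model(tokens):
    """Step 1: infer an LM M_d for a document (MLE unigram estimates)."""
    tf = Counter(tokens)
    return {t: n / len(tokens) for t, n in tf.items()}

def query_likelihood(query, model):
    """Step 2: P(q | M_d) under the unigram assumption."""
    p = 1.0
    for t in query:
        p *= model.get(t, 0.0)
    return p

def rank(docs, query):
    """Step 3: rank documents by their query likelihood."""
    q = query.split()
    scores = {name: query_likelihood(q, mle_model(text.split()))
              for name, text in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {"d1": "click go the shears boys click click click",
        "d2": "click click metal shears"}
print(rank(docs, "shears click"))  # d2: (1/4)*(2/4) = 0.125 beats d1: (1/8)*(4/8) = 0.0625
```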

Estimating the query generation probability

We now consider how to estimate $P(q \mid M_d)$.

The probability of producing the query given the LM Md of document d using maximum likelihood estimation (MLE) and the unigram assumption is:

$\hat{P}(q \mid M_d) = \prod_{t \in q} \hat{P}_{\text{mle}}(t \mid M_d) = \prod_{t \in q} \dfrac{tf_{t,d}}{L_d}$

  • approaches to smoothing probability distributions
  1. if $tf_{t,d} = 0$, then

$\hat{P}(t \mid M_d) \le \dfrac{cf_t}{T}$

where $cf_t$ is the raw count of the term in the collection, and T is the raw size (number of tokens) of the entire collection.

  2. a simple idea that works well in practice is to use a mixture between a document-specific multinomial distribution and a multinomial distribution estimated from the entire collection:

$\hat{P}(t \mid d) = \lambda \hat{P}_{\text{mle}}(t \mid M_d) + (1 - \lambda)\, \hat{P}_{\text{mle}}(t \mid M_c)$

$M_c$ is a language model built from the entire document collection.

  3. use an LM built from the whole collection as a prior distribution in a Bayesian updating process:

$\hat{P}(t \mid d) = \dfrac{tf_{t,d} + \alpha \hat{P}(t \mid M_c)}{L_d + \alpha}$
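A sketch of the last two smoothing schemes on a toy two-token document and four-token collection (values assumed for illustration); both give an unseen term nonzero probability:

```python
from collections import Counter

def mixture_prob(t, doc, coll, lam=0.5):
    """Linear interpolation (mixture) smoothing:
    P(t|d) = lam * P_mle(t|M_d) + (1 - lam) * P_mle(t|M_c)."""
    d, c = Counter(doc), Counter(coll)
    return lam * d[t] / len(doc) + (1 - lam) * c[t] / len(coll)

def dirichlet_prob(t, doc, coll, alpha=2.0):
    """Bayesian updating with the collection model as prior:
    P(t|d) = (tf_{t,d} + alpha * P(t|M_c)) / (L_d + alpha)."""
    d, c = Counter(doc), Counter(coll)
    return (d[t] + alpha * c[t] / len(coll)) / (len(doc) + alpha)

doc = "a b".split()
coll = "a b c d".split()
# Term "c" never occurs in the document, yet gets nonzero probability:
print(mixture_prob("c", doc, coll))    # 0.5*0 + 0.5*(1/4) = 0.125
print(dirichlet_prob("c", doc, coll))  # (0 + 2*(1/4)) / (2 + 2) = 0.125
```

Either way, the smoothed estimates still form a proper distribution over the vocabulary.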

Language modeling versus other approaches in information retrieval

The LM approach provides a novel way of looking at the problem of text retrieval, which links it with a lot of recent work in speech and language processing.

The LM approach assumes that documents and expressions of information needs are objects of the same type, and assesses their match by importing the tools and methods of language modeling from speech and natural language processing.

The resulting model is mathematically precise, conceptually simple, computationally tractable, and intuitively appealing.

Extended language modeling approaches

  • You can look at the probability of a query language model $M_q$ generating the document

This direction is less appealing because there is much less text available to estimate an LM based on the query text; however, it is easy to see how to incorporate relevance feedback into such a model.

  • Rather than directly generating in either direction, we can make an LM from both the document and the query, and then ask how different these two language models are from each other.


Three ways of developing the language modeling approach. (a) Query likelihood.
(b) Document likelihood.
(c) Model comparison.

For instance, one way to model the risk of returning a document d as relevant to a query q is to use the Kullback-Leibler (KL) divergence between their respective language models:

$R(d; q) = KL(M_d \,\|\, M_q) = \sum_{t \in V} P(t \mid M_d) \log \dfrac{P(t \mid M_d)}{P(t \mid M_q)}$
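A minimal sketch of the divergence computation; both models are assumed to be already smoothed, so that $P(t \mid M_q) > 0$ wherever $P(t \mid M_d) > 0$:

```python
import math

def kl_divergence(p_d, p_q):
    """KL(M_d || M_q) = sum_t P(t|M_d) * log(P(t|M_d) / P(t|M_q)).
    Zero when the models are identical, larger the more they differ."""
    return sum(p * math.log(p / p_q[t]) for t, p in p_d.items() if p > 0)

same = {"a": 0.5, "b": 0.5}
print(kl_divergence(same, same))                  # 0.0
print(kl_divergence({"a": 0.9, "b": 0.1}, same))  # > 0: the models disagree
```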

  • introduce translation models to bridge the query-document gap and address issues of alternate expression.

assume that the translation model can be represented by a conditional probability distribution T(·|·) between vocabulary terms. The form of the translation query generation model is then:

$P(q \mid M_d) = \prod_{t \in q} \sum_{v \in V} P(v \mid M_d)\, T(t \mid v)$
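A sketch under an assumed toy translation table (the probabilities below are made up); the inner sum lets a document about "car" generate the query term "automobile":

```python
def translation_query_likelihood(query, doc_model, T):
    """P(q|M_d) = prod_{t in q} sum_{v in V} P(v|M_d) * T(t|v),
    where T[t][v] is the probability of 'translating' term v into t."""
    p = 1.0
    for t in query:
        p *= sum(pv * T.get(t, {}).get(v, 0.0)
                 for v, pv in doc_model.items())
    return p

doc_model = {"car": 0.8, "fast": 0.2}            # P(v | M_d)
T = {"automobile": {"car": 0.3, "automobile": 0.6},
     "car": {"car": 0.6, "automobile": 0.3}}     # T(t | v), illustrative
print(translation_query_likelihood(["automobile"], doc_model, T))  # 0.8*0.3 + 0.2*0 = 0.24
```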