11 Probabilistic information retrieval
In the Boolean or vector space models of IR, given the query and document representations, a system can make only an uncertain guess about whether a document has content relevant to the information need. Probability theory provides a principled foundation for such reasoning under uncertainty.
These probabilistic models are all built on basic probability theory, most importantly Bayes' rule and the odds of an event.
The Binary Independence Model:
To make estimating the probability P(R|d, q) practical, some simplifying assumptions are introduced. Documents and queries are both represented as binary term incidence vectors, and we assume that the relevance of each document is independent of the relevance of other documents.
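A binary term incidence vector records only whether each vocabulary term occurs in a document, not how often. A minimal sketch (the two toy documents below are made up for illustration):

```python
# Toy collection (made-up documents, for illustration only).
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
}

# Vocabulary: every distinct term in the collection, in a fixed order.
vocab = sorted({t for text in docs.values() for t in text.split()})

def incidence_vector(text):
    """1 if the term occurs in the document, 0 otherwise (counts are discarded)."""
    terms = set(text.split())
    return [1 if t in terms else 0 for t in vocab]

vectors = {d: incidence_vector(text) for d, text in docs.items()}
```

Note that repeated occurrences ("the" appears twice in d1) collapse to a single 1, which is exactly the information the BIM works with.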
Deriving a ranking function for query terms:
Given a query q, we wish to order returned documents by descending P(R = 1|d, q). Under the BIM, this is modeled as ordering by P(R = 1|~x,~q). Rather than estimating this probability directly, we can rank by any order-preserving quantity such as the odds, because we are interested only in the ranking of documents.
pt = P(xt = 1|R = 1,~q): probability of a term appearing in a document relevant to the query.
ut = P(xt = 1|R = 0,~q): probability of a term appearing in a non-relevant document.
Each query term present in a document then contributes a log odds ratio ct = log[pt/(1 − pt)] + log[(1 − ut)/ut] to the document's retrieval score.
We can provide a theoretical justification for the most frequently used form of idf weighting. If relevant documents are a tiny fraction of the collection, ut (the probability of term occurrence in non-relevant documents for a query) can be approximated by dft/N, and then
log[(1 − ut)/ut] = log[(N − dft)/dft] ≈ log N/dft.
To estimate pt, we can use the frequency of term occurrence in known relevant documents (if we know some).
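This approximation is easy to check numerically. A sketch with made-up values: the collection size N, the document frequencies, and the default pt = 0.5 (which makes the log[pt/(1 − pt)] term vanish) are all illustrative assumptions, not values from the text:

```python
import math

N = 1_000_000                            # collection size (assumed)
df = {"rare": 100, "common": 500_000}    # document frequencies (assumed)

def bim_weight(t, p=0.5):
    """ct = log[p/(1-p)] + log[(1-u)/u], with u approximated by df/N."""
    u = df[t] / N
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

def idf(t):
    """The familiar idf weight log(N/df)."""
    return math.log(N / df[t])
```

For a rare term, log[(N − dft)/dft] is numerically almost identical to log(N/dft); for a very common term (dft close to N/2 here) the two diverge, which is consistent with the approximation holding only when dft ≪ N.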
It is perhaps the severity
of the modeling assumptions that makes achieving good performance difficult.
Compared to other probabilistic approaches, such as the BIM, the main difference is that the LM approach does away with explicitly modeling relevance (whereas relevance is the central variable evaluated in the BIM approach).
12 Language models for information retrieval
A language model is a function that puts a probability measure over strings drawn from some vocabulary.
Types of language models
By the chain rule, P(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t1t2)P(t4|t1t2t3).
The simplest form of language model throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model:
Puni(t1t2t3t4) =
P(t1)P(t2)P(t3)P(t4)
There are many more complex kinds of language models, such as bigram language models, which condition on the previous term:
Pbi(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)
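The two factorizations above can be contrasted on a toy corpus, with probabilities estimated by maximum likelihood from counts. The corpus below is made up for illustration:

```python
from collections import Counter

corpus = "a b a b a c".split()   # made-up training text

unigram = Counter(corpus)                      # term counts
bigram = Counter(zip(corpus, corpus[1:]))      # adjacent-pair counts

def p_uni(t):
    return unigram[t] / len(corpus)

def p_bi(t, prev):
    return bigram[(prev, t)] / unigram[prev]

def puni(seq):
    """Puni(t1..tn) = P(t1) P(t2) ... P(tn): no conditioning context."""
    prob = 1.0
    for t in seq:
        prob *= p_uni(t)
    return prob

def pbi(seq):
    """Pbi(t1..tn) = P(t1) P(t2|t1) ... P(tn|tn-1): condition on the previous term."""
    prob = p_uni(seq[0])
    for prev, t in zip(seq, seq[1:]):
        prob *= p_bi(t, prev)
    return prob
```

On this corpus the bigram model assigns "a b" a higher probability than the unigram model does, because "b" always follows "a" in the training text even though "b" is not especially frequent overall.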
The
query likelihood
A document can be regarded as a language model: the document defines a probability distribution over strings, and a query is scored as a string that this model might generate.
For retrieval based on a
language model (henceforth LM), we treat the generation of queries as a random
process. The approach is to
1. Infer a LM for each
document.
2. Estimate P(q|Mdi), the
probability of generating the query according to each of these document models.
3. Rank the documents according to
these probabilities.
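The three steps above can be sketched with unigram document models. The documents, the query, and the use of Jelinek-Mercer smoothing with the collection model (lambda = 0.5) are illustrative assumptions, not details fixed by the steps:

```python
from collections import Counter

# Made-up collection.
docs = {
    "d1": "xyzzy information retrieval retrieval models",
    "d2": "language models for speech",
}

# Collection model, used to smooth terms missing from a document.
coll = Counter(t for text in docs.values() for t in text.split())
coll_len = sum(coll.values())

def score(query, text, lam=0.5):
    """P(q|Md): product over query terms of the smoothed document model."""
    tf = Counter(text.split())          # step 1: infer a LM for the document
    dlen = sum(tf.values())
    prob = 1.0
    for t in query.split():             # step 2: probability of generating q
        p_doc = tf[t] / dlen
        p_coll = coll[t] / coll_len
        prob *= lam * p_doc + (1 - lam) * p_coll
    return prob

query = "retrieval models"
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)  # step 3
```

Without the collection-model term, d2 would score zero for this query (it never contains "retrieval"); smoothing lets it receive a small but nonzero probability while d1 still ranks first.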
By changing which probability is estimated, the same framework yields other models: instead of the query likelihood P(q|Md), one can rank by the document likelihood P(d|Mq), or build models from both the query and the document and compare them directly.