11 Probabilistic information retrieval
In the Boolean or vector space models of IR, given the query and document representations, a system can make only an uncertain guess about whether a document has content relevant to the information need. Probability theory provides a principled foundation for such reasoning under uncertainty.
These probabilistic models are all built on basic probability theory, most importantly Bayes' rule and the odds of an event.
The Binary Independence Model:
To make estimating the probability P(R|d, q) practical, some simplifying assumptions are introduced. Documents and queries are both represented as binary term incidence vectors, and we assume that the relevance of each document is independent of the relevance of other documents.
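A binary term incidence vector records only whether each vocabulary term occurs in a document, not how often. A minimal sketch (the two toy documents below are made up for illustration):

```python
# Toy collection (made-up documents, for illustration only).
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
}

# Vocabulary: every distinct term in the collection, in a fixed order.
vocab = sorted({t for text in docs.values() for t in text.split()})

def incidence_vector(text):
    """1 if the term occurs in the document, 0 otherwise (counts are discarded)."""
    terms = set(text.split())
    return [1 if t in terms else 0 for t in vocab]

vectors = {d: incidence_vector(text) for d, text in docs.items()}
```

Note that repeated occurrences ("the" appears twice in d1) collapse to a single 1, which is exactly the information the BIM works with.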
Deriving a ranking function for query terms:
Given a query q, we wish to order returned documents by descending P(R = 1|d, q). Under the BIM, this is modeled as ordering by P(R = 1|~x,~q). Rather than estimating this probability directly, we can rank by any order-preserving quantity such as the odds, because we are interested only in the ranking of documents.
pt = P(xt = 1|R = 1,~q): probability of a term appearing in a document relevant to the query.
ut = P(xt = 1|R = 0,~q): probability of a term appearing in a non-relevant document.
Each query term present in a document then contributes a log odds ratio ct = log[pt/(1 − pt)] + log[(1 − ut)/ut] to the document's retrieval score.
We can provide a theoretical justification for the most frequently used form of idf weighting. If relevant documents are a tiny fraction of the collection, ut (the probability of term occurrence in non-relevant documents for a query) can be approximated by dft/N, and then
log[(1 − ut)/ut] = log[(N − dft)/dft] ≈ log N/dft.
To estimate pt, we can use the frequency of term occurrence in known relevant documents (if we know some).
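This approximation is easy to check numerically. A sketch with made-up values: the collection size N, the document frequencies, and the default pt = 0.5 (which makes the log[pt/(1 − pt)] term vanish) are all illustrative assumptions, not values from the text:

```python
import math

N = 1_000_000                            # collection size (assumed)
df = {"rare": 100, "common": 500_000}    # document frequencies (assumed)

def bim_weight(t, p=0.5):
    """ct = log[p/(1-p)] + log[(1-u)/u], with u approximated by df/N."""
    u = df[t] / N
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

def idf(t):
    """The familiar idf weight log(N/df)."""
    return math.log(N / df[t])
```

For a rare term, log[(N − dft)/dft] is numerically almost identical to log(N/dft); for a very common term (dft close to N/2 here) the two diverge, which is consistent with the approximation holding only when dft ≪ N.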
It is perhaps the severity
of the modeling assumptions that makes achieving good performance difficult.
Compared to other probabilistic approaches, such as the BIM, the main difference is that the LM approach does away with explicitly modeling relevance (whereas relevance is the central variable evaluated in the BIM approach).
12 Language models for information retrieval
A language model is a function that puts a probability measure over strings drawn from some vocabulary.
Types of language models
By the chain rule, P(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t1t2)P(t4|t1t2t3).
The simplest form of language model throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model:
Puni(t1t2t3t4) =
P(t1)P(t2)P(t3)P(t4)
There are many more complex kinds of language models, such as bigram language models, which condition on the previous term:
Pbi(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)
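The two factorizations above can be contrasted on a toy corpus, with probabilities estimated by maximum likelihood from counts. The corpus below is made up for illustration:

```python
from collections import Counter

corpus = "a b a b a c".split()   # made-up training text

unigram = Counter(corpus)                      # term counts
bigram = Counter(zip(corpus, corpus[1:]))      # adjacent-pair counts

def p_uni(t):
    return unigram[t] / len(corpus)

def p_bi(t, prev):
    return bigram[(prev, t)] / unigram[prev]

def puni(seq):
    """Puni(t1..tn) = P(t1) P(t2) ... P(tn): no conditioning context."""
    prob = 1.0
    for t in seq:
        prob *= p_uni(t)
    return prob

def pbi(seq):
    """Pbi(t1..tn) = P(t1) P(t2|t1) ... P(tn|tn-1): condition on the previous term."""
    prob = p_uni(seq[0])
    for prev, t in zip(seq, seq[1:]):
        prob *= p_bi(t, prev)
    return prob
```

On this corpus the bigram model assigns "a b" a higher probability than the unigram model does, because "b" always follows "a" in the training text even though "b" is not especially frequent overall.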
The
query likelihood
A document can be regarded as a language model: the document defines a probability distribution over strings, and a query is scored as a string that this model might generate.
For retrieval based on a
language model (henceforth LM), we treat the generation of queries as a random
process. The approach is to
1. Infer a LM for each
document.
2. Estimate P(q|Mdi), the
probability of generating the query according to each of these document models.
3. Rank the documents according to
these probabilities.
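The three steps above can be sketched with unigram document models. The documents, the query, and the use of Jelinek-Mercer smoothing with the collection model (lambda = 0.5) are illustrative assumptions, not details fixed by the steps:

```python
from collections import Counter

# Made-up collection.
docs = {
    "d1": "xyzzy information retrieval retrieval models",
    "d2": "language models for speech",
}

# Collection model, used to smooth terms missing from a document.
coll = Counter(t for text in docs.values() for t in text.split())
coll_len = sum(coll.values())

def score(query, text, lam=0.5):
    """P(q|Md): product over query terms of the smoothed document model."""
    tf = Counter(text.split())          # step 1: infer a LM for the document
    dlen = sum(tf.values())
    prob = 1.0
    for t in query.split():             # step 2: probability of generating q
        p_doc = tf[t] / dlen
        p_coll = coll[t] / coll_len
        prob *= lam * p_doc + (1 - lam) * p_coll
    return prob

query = "retrieval models"
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)  # step 3
```

Without the collection-model term, d2 would score zero for this query (it never contains "retrieval"); smoothing lets it receive a small but nonzero probability while d1 still ranks first.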
By changing which probability is estimated, the same framework yields other models: instead of the query likelihood P(q|Md), one can rank by the document likelihood P(d|Mq), or build models from both the query and the document and compare them directly.