Friday, February 21, 2014

Reading notes for Unit 7

Relevance feedback and query expansion
Query refinement

Global methods include:
• Query expansion/reformulation with a thesaurus or WordNet
• Query expansion via automatic thesaurus generation
• Techniques like spelling correction
Local methods include:
• Relevance feedback
• Pseudo relevance feedback, also known as blind relevance feedback
• (Global) indirect relevance feedback

The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set.

The basic procedure is:
• The user issues a (short, simple) query.
• The system returns an initial set of retrieval results.
• The user marks some returned documents as relevant or non-relevant.
• The system computes a better representation of the information need based on the user feedback.
• The system displays a revised set of retrieval results.

Rocchio algorithm
The Rocchio Algorithm is the classic algorithm for implementing relevance feedback. It models a way of incorporating relevance feedback information into the vector space model.
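As a concrete sketch (the α/β/γ values and the toy vectors below are illustrative assumptions, not values from the notes), the Rocchio update moves the query vector toward the centroid of the relevant documents and away from the centroid of the non-relevant ones:

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query toward the centroid of the relevant
    documents and away from the centroid of the non-relevant ones."""
    dims = len(q0)

    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]

    rel_c, nonrel_c = centroid(relevant), centroid(nonrelevant)
    # Negative component weights are conventionally clipped to zero.
    return [max(0.0, alpha * q0[i] + beta * rel_c[i] - gamma * nonrel_c[i])
            for i in range(dims)]

# Toy 3-term vocabulary: the original query matches only term 0;
# feedback pulls weight onto the terms shared by the relevant documents.
q_new = rocchio([1, 0, 0],
                relevant=[[0, 1, 0], [0, 1, 1]],
                nonrelevant=[[0, 0, 1]])
```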

Relevance feedback can improve both recall and precision.

Probabilistic relevance feedback

Rather than reweighting the query in a vector space, if a user has told us some relevant and nonrelevant documents, then we can proceed to build a classifier.
These use only collection statistics and information about the term distribution within the documents judged relevant. They preserve no memory of the original query.

Cases where relevance feedback alone is not sufficient include:
• Misspellings.
• Cross-language information retrieval.
• Mismatch of searcher’s vocabulary versus collection vocabulary.



In addition, the relevance feedback approach requires relevant documents to be similar to each other.

Muddiest Points for Unit 6

My question is: since we only use the average precision, why do we need to calculate the recall?

Wednesday, February 12, 2014

Muddiest Points for Unit 5

For "Multinomial and Multiple-Bernoulli", there is an example of how the formulas are built: one is indexed over the terms of the query and one over the terms of the model M. Does the final result indicate a correspondence between the frequency of each term in the query and the frequency of each term in M?

Reading notes for Unit 6


  • To measure ad hoc information retrieval effectiveness, we need a test collection:
1. A document collection
2. A test suite of information needs
3. A set of relevance judgments

  • A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.
Evaluation of unranked retrieval sets

  • The two most frequent and basic measures for information retrieval effectiveness are precision and recall.
  • Accuracy is not an appropriate measure for information retrieval problems. A system tuned to maximize accuracy can appear to perform well by simply deeming all documents non-relevant to all queries. However, labeling all documents as non-relevant is completely unsatisfying to an information retrieval system user.
  • A single measure that trades off precision versus recall is the F measure.
  • We use a harmonic mean rather than the simpler average (arithmetic mean).
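The precision/recall/F trade-off above can be sketched as follows (the document IDs are toy examples):

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall, and balanced F measure (harmonic mean of P and R)
    for an unranked result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)          # relevant documents retrieved
    p = tp / len(retrieved) if retrieved else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# 4 documents retrieved, 2 of which are among the 3 relevant ones.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 4, 5})
```

Note that the harmonic mean is dominated by the smaller of P and R, which is why it punishes the "return everything" strategy that would maximize recall alone.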
Evaluation of ranked retrieval results

  • Examining the entire precision-recall curve is very informative, but there is often a desire to boil this information down to a few numbers, or perhaps even a single number. The traditional way of doing this is the 11-point interpolated average precision.
  • In web search, what matters is rather how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents.
  • Mean Average Precision (MAP) provides a single-figure measure of quality across recall levels.
  • An alternative, which alleviates this problem, is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned.
  • Another concept sometimes used in evaluation is the ROC curve. In many fields, a common aggregate measure is to report the area under the ROC curve, which is the ROC analog of MAP.
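Average precision, the per-query quantity behind MAP, can be sketched as follows (the rankings and relevance sets are made-up examples):

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant
    document is retrieved; unretrieved relevant docs contribute 0."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k          # precision at rank k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
m = mean_average_precision([(["d1", "d2", "d3", "d4"], {"d1", "d3"}),
                            (["d2", "d1"], {"d1"})])
```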
Assessing relevance

  • To properly evaluate a system, your test information needs must be germane to the documents in the test document collection, and appropriate for predicted usage of the system.
  • Given information needs and documents, you need to collect relevance assessments. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems, and perhaps other sources such as the results of Boolean keyword searches or documents found by expert searchers in an interactive process.
  • Humans and their relevance judgments are quite idiosyncratic and variable. In the social sciences, a common measure for agreement between judges is the kappa statistic.
  • The choice of a human assessor can make a considerable absolute difference to reported scores, but has in general been found to have little impact on the relative effectiveness ranking of either different systems or variants of a single system which are being compared for effectiveness.
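Since the kappa statistic is mentioned above, here is a minimal sketch of kappa = (P(A) − P(E)) / (1 − P(E)), where P(A) is the observed agreement and P(E) is the chance agreement computed from marginals pooled across both judges; the judgment counts below are invented for illustration:

```python
def kappa(yy, yn, ny, nn):
    """Kappa for two judges: yy = both say relevant, yn = judge 1 yes /
    judge 2 no, ny = the reverse, nn = both say non-relevant."""
    n = yy + yn + ny + nn
    p_agree = (yy + nn) / n                      # observed agreement P(A)
    p_yes = (2 * yy + yn + ny) / (2 * n)         # pooled marginal for "relevant"
    p_chance = p_yes ** 2 + (1 - p_yes) ** 2     # chance agreement P(E)
    return (p_agree - p_chance) / (1 - p_chance)

k = kappa(yy=300, yn=20, ny=10, nn=70)
```

A kappa value above 0.8 is usually taken as good agreement, and values in the 0.67–0.8 range as fair; the toy counts above land in the "fair" band.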
Critiques and justifications of the concept of relevance

  • One clear problem with the relevance-based assessment that we have presented is the distinction between relevance and marginal relevance.
Results snippets

  • The two basic kinds of summaries are static, which are always the same regardless of the query, and dynamic (or query-dependent), which are customized according to the user’s information need as deduced from a query. Dynamic summaries attempt to explain why a particular document was retrieved for the query at hand.

  • Dynamic summaries are generally regarded as greatly improving the usability of IR systems, but they present a complication for IR system design.

Monday, February 10, 2014

Reading notes for Unit 5

11 Probabilistic information retrieval
In the Boolean or vector space models of IR, given the query and document representations, a system has an uncertain guess of whether a document has content relevant to the information need. Probability theory provides a principled foundation for such reasoning under uncertainty.
All these models are grounded in basic probability theory.
The Binary Independence Model:
To make estimating the probability function P(R|d, q) practical, some simplifying assumptions are introduced.
Documents and queries are both represented as binary term incidence vectors; we assume here that the relevance of each document is independent of the relevance of other documents.
Deriving a ranking function for query terms:
Given a query q, we wish to order returned documents by descending P(R = 1|d, q). Under the BIM, this is modeled as ordering by P(R = 1|~x, ~q). Rather than estimating this probability directly, we can rank by any quantity monotonic in it, because we are interested only in the ranking of documents.
pt = P(xt = 1|R = 1,~q): probability of a term appearing in a document relevant to the query
ut = P(xt = 1|R = 0,~q): probability of a term appearing in a non-relevant document.
The ct terms are log odds ratios for the terms in the query:
ct = log[pt/(1 − pt)] + log[(1 − ut)/ut]
This provides a theoretical justification for the most frequently used form of idf weighting: approximating ut (the probability of term occurrence in non-relevant documents for a query) by dft/N gives
log[(1 − ut)/ut] = log[(N − dft)/dft] ≈ log(N/dft)
Then we can use the frequency of term occurrence in known relevant documents
(if we know some).
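A minimal sketch of this term weight, under the common assumption pt = 0.5 (so the log-odds term for relevant documents vanishes and the weight reduces to the idf-like log[(N − dft)/dft]); the df and N values are illustrative:

```python
import math

def bim_term_weight(df_t, N, p_t=0.5):
    """ct = log[pt/(1-pt)] + log[(1-ut)/ut], with ut approximated by
    the fraction df_t/N of documents containing the term."""
    u_t = df_t / N
    return math.log(p_t / (1 - p_t)) + math.log((1 - u_t) / u_t)

# Rare term in a 1000-document collection: weight is close to log(N/df).
w = bim_term_weight(df_t=10, N=1000)
```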
Compared to other probabilistic approaches, such as the BIM, the main difference of the LM approach is that it does away with explicitly modeling relevance (whereas this is the central variable evaluated in the BIM approach). It is perhaps the severity of the modeling assumptions that makes achieving good performance difficult.

12 Language models for information retrieval
A language model is a function that puts a probability measure over strings drawn from some vocabulary.
Types of language models
P(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t1t2)P(t4|t1t2t3)
The simplest form of language model throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model:
Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)
There are many more complex kinds of language models, such as bigram language models, which condition on the previous term,
 Pbi(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)
The query likelihood
A document could be regarded as a language model.
Each document thus induces its own model, built from the sequence of terms it contains.
For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process. The approach is to
1. Infer a LM for each document.
2. Estimate P(q|Mdi), the probability of generating the query according to each of these document models.
3. Rank the documents according to these probabilities.
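The three steps above can be sketched with a unigram document model; the linear-interpolation (Jelinek-Mercer) smoothing against the collection model is an added assumption here, used to avoid zero probabilities for query terms absent from a document:

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """P(q|Md) under a unigram model: product over query terms of a
    smoothed term probability, interpolating the document model with
    the collection model."""
    d, c = Counter(doc), Counter(collection)
    d_len, c_len = sum(d.values()), sum(c.values())
    p = 1.0
    for t in query:
        p *= lam * d[t] / d_len + (1 - lam) * c[t] / c_len
    return p

# Toy collection of term occurrences; score one document for query ["a"].
collection = ["a", "b", "c", "a", "b", "a"]
score = query_likelihood(["a"], ["a", "b", "a"], collection)
```

Ranking then amounts to computing this score for every document and sorting in descending order.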

By changing which side of the retrieval problem is modeled (the query or the document), the language modeling approach can be varied into other models.

Monday, February 3, 2014

Muddiest Points for Unit 4

  • The reason that we use log-frequency weighting is that we want to weaken the linear relation between frequency and weight. But I can’t see why using log is more accurate than linear.
  • Is there any rule that two terms should be combined together so that we could decrease the space of vector?

Sunday, February 2, 2014

Reading notes for Unit 4

  • Complex queries can be handled by processing Boolean queries.
  • The Boolean retrieval model contrasts with ranked retrieval models.
  •  Compared to Boolean queries, richer query models are needed for the following reasons:
    • Be tolerant to spelling mistakes and inconsistent choice of words.
    • The index has to be augmented to capture the proximities of terms in documents.
    • We need term frequency information in posting lists.
    • We wish to have an effective method to order (or “rank”) the returned results.
  • Parametric and zone indexes allow us to index and retrieve documents by metadata, and they give us a simple means for scoring (and thereby ranking) documents in response to a query.
  • The dictionary for a parametric index comes from a fixed vocabulary, while the dictionary for a zone index comprises whatever vocabulary stems from the text of that zone.
  • In fact, we can reduce the size of the dictionary by encoding the zone in which a term occurs in the postings.

  • Weighted zone scoring: Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the document contributes a Boolean value.
  • Learning weights: training examples consist of a query q and a document d. The weights gi are then “learned” from these examples.
  • Weighting the importance of a term in a document, based on the statistics of occurrence of the term.
  • Term frequency; inverse document frequency; tf-idf_{t,d} = tf_{t,d} × idf_t.
  •  By viewing each document as a vector of such weights, we can compute a score between a query and each document.
  • We denote by ~V (d) the vector derived from document d, with one component in the vector for each dictionary term. The standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations ~V (d1) and ~V (d2)
    • sim(d1, d2) = ~V(d1) · ~V(d2) / (|~V(d1)| |~V(d2)|)
  • Practical Benefit: Given a document d (potentially one of the di in the collection), consider searching for the documents in the collection most similar to d. Such a search is useful in a system where a user may identify a document and seek others like it.
  • Queries as vectors: Assign to each document d a score equal to the dot product ~v(q) ·~v(d).
  • In a typical setting we have a collection of documents each represented by a vector, a free text query represented by a vector, and a positive integer K. We seek the K documents of the collection with the highest vector space scores on the given query.
  • The process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known as accumulators.
  •  Variant of term-weighting for the vector space model.
    • Sublinear tf scaling
    • Normalize the tf weights of all terms occurring in a document by the maximum tf in that document
    • Pivoted document length normalization
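The tf-idf weighting and cosine scoring above can be sketched as follows; the sublinear (1 + log10 tf) scaling is one of the variants listed, and the toy document frequencies are invented for illustration:

```python
import math
from collections import Counter

def tfidf_vector(terms, df, N):
    """Sparse tf-idf vector with sublinear tf scaling:
    (1 + log10 tf_{t,d}) * log10(N / df_t)."""
    tf = Counter(terms)
    return {t: (1 + math.log10(tf[t])) * math.log10(N / df[t]) for t in tf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

df = {"car": 2, "insurance": 1, "auto": 3}   # toy document frequencies
N = 10                                       # toy collection size
v1 = tfidf_vector(["car", "insurance", "car"], df, N)
v2 = tfidf_vector(["car", "auto"], df, N)
sim = cosine(v1, v2)
```

Scoring a free-text query is the same computation with the query itself turned into a (usually unweighted or lightly weighted) vector.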