Reading notes for Unit 4
- Complex queries can be answered by processing Boolean queries.
- The Boolean retrieval model contrasts with ranked retrieval models.
- Compared to Boolean queries, richer
query models are needed for the following reasons:
  - We need to be tolerant of spelling mistakes and inconsistent word choice.
  - The index has to be augmented to capture the proximities of terms in documents.
  - We need term frequency information in posting lists.
  - We wish to have an effective method to order (or “rank”) the returned results.
- Parametric and zone indexes: they allow us to index and retrieve documents by metadata, and they give us a simple means of
scoring (and thereby ranking) documents in response to a query.
- The dictionary for a parametric index comes from a fixed vocabulary, while the dictionary for a zone index must structure
whatever vocabulary stems from the text of that zone.
- In fact, we can reduce the size of the dictionary by
encoding the zone in which a term occurs in the postings.

- Weighted zone scoring: Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d)
a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the
document contributes a Boolean value.
- Learning weights: each training example consists of a query q and a document d, together with a relevance judgment for d on q.
The zone weights g_i (the coefficients of the linear combination) are then “learned” from these examples.
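A minimal sketch of weighted zone scoring, under some assumptions not stated in these notes: the Boolean query is a conjunction (AND) of terms, zones are plain strings split on whitespace, and the zone weights g_i sum to 1 so that the score stays in [0, 1]. The function name `weighted_zone_score` and the example zones are hypothetical.

```python
def weighted_zone_score(query_terms, doc_zones, weights):
    """Return sum_i g_i * s_i, where s_i is 1 if zone i matches the query, else 0.

    Assumption: the query is an AND of its terms, and the weights sum to 1.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("zone weights must sum to 1")
    score = 0.0
    for zone, g in weights.items():
        tokens = set(doc_zones.get(zone, "").lower().split())
        # Boolean zone score: 1 only if every query term occurs in this zone.
        s = 1 if all(term in tokens for term in query_terms) else 0
        score += g * s
    return score


# Hypothetical document with three zones and hand-picked weights.
weights = {"author": 0.2, "title": 0.3, "body": 0.5}
doc = {
    "author": "william shakespeare",
    "title": "the tempest",
    "body": "full fathom five thy father lies",
}
print(weighted_zone_score(["shakespeare"], doc, weights))
```

In the learning setting, the hand-picked values in `weights` would instead be fitted to the training examples.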
- Weighting the importance of a term in a
document, based on the statistics of occurrence of the term.
- Term frequency; inverse document frequency; $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$.
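The tf-idf weight can be sketched as follows. These notes do not spell out idf, so the sketch assumes the common definition idf_t = log10(N / df_t), where N is the number of documents and df_t is the number of documents containing t. Whitespace tokenization and the name `tf_idf_weights` are also assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return (per-document {term: tf-idf} dicts, {term: idf}) for a list of strings."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Assumed definition: idf_t = log10(N / df_t).
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency in this document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights, idf


weights, idf = tf_idf_weights(["car insurance", "car auto", "car car best"])
print(idf["car"], weights[0]["insurance"])
```

A term that occurs in every document (here `car`) gets idf 0 and so contributes nothing to any score, which is the intended effect of the idf factor.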
- By viewing each document as a vector of
such weights, we can compute a score between a query and each document.
- We denote by $\vec{V}(d)$ the vector derived from document d, with one component in the vector for each dictionary term.
The standard way of quantifying the similarity between two documents $d_1$ and $d_2$ is to compute the
cosine similarity of their vector representations $\vec{V}(d_1)$ and $\vec{V}(d_2)$:
- $\mathrm{sim}(d_1, d_2) = \dfrac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|}$
- Practical Benefit:
Given a document d (potentially one of the di in the collection), consider
searching for the documents in the collection most similar to d. Such a search
is useful in a system where a user may identify a document and seek others like
it.
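A small sketch of cosine similarity and of the "more like this" search described above. It assumes each document vector is stored as a sparse dict mapping terms to weights (for example tf-idf weights). The names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


def most_similar(doc_id, vectors):
    """Rank all other documents by cosine similarity to vectors[doc_id], best first."""
    target = vectors[doc_id]
    scored = [
        (cosine_similarity(target, vec), other_id)
        for other_id, vec in vectors.items()
        if other_id != doc_id
    ]
    return sorted(scored, reverse=True)


vectors = {
    "d1": {"car": 1.0, "auto": 1.0},
    "d2": {"car": 2.0, "auto": 2.0},
    "d3": {"best": 1.0},
}
print(most_similar("d1", vectors))
```

Because the denominator normalizes for length, `d2` ranks first for `d1` even though its weights are twice as large. Only the direction of the vectors matters.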
- Queries as vectors:
Assign to each document d a score equal to the dot product $\vec{v}(q) \cdot \vec{v}(d)$, where $\vec{v}$ denotes the length-normalized (unit) version of $\vec{V}$.
- In a typical
setting we have a collection of documents each represented by a vector, a free
text query represented by a vector, and a positive integer K.
We seek the K documents of the collection with the highest vector space scores
on the given query.
- The process of
adding in contributions one query term at a time is sometimes known as term-at-a-time
scoring or accumulation, and the N elements of the array Scores are therefore
known as accumulators.
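Term-at-a-time scoring with accumulators might look like the sketch below. It assumes documents are numbered 0 to N-1, that postings already carry a per-document term weight, and that `doc_lengths` holds the precomputed vector length of each document. These data layouts and the name `cosine_score` are assumptions for illustration, not a fixed API.

```python
import heapq


def cosine_score(query_weights, postings, doc_lengths, k):
    """Return the top-k (score, doc_id) pairs using term-at-a-time accumulation.

    query_weights: {term: weight of term in the query}
    postings:      {term: [(doc_id, weight of term in that document), ...]}
    doc_lengths:   list of vector lengths, indexed by doc_id
    """
    scores = [0.0] * len(doc_lengths)  # the N accumulators

    # Add in the contribution of one query term at a time.
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_td * w_tq

    # Normalize each accumulated score by the document's length.
    for doc_id, length in enumerate(doc_lengths):
        if length > 0:
            scores[doc_id] /= length

    # Select the K highest-scoring documents without sorting all N.
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in enumerate(scores)))


postings = {
    "car": [(0, 1.0), (1, 2.0)],
    "auto": [(1, 1.0), (2, 1.5)],
}
print(cosine_score({"car": 1.0, "auto": 1.0}, postings, [1.0, 2.0, 3.0], k=2))
```

A heap-based selection is used here because only the K best documents are needed. A full sort of the accumulators would also work on small collections.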
- Variants of term weighting for the vector space model:
  - Sublinear tf scaling
  - Normalizing the tf weights of all terms occurring in a document by the maximum tf in that document
  - Pivoted document length normalization
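The first two variants can be sketched briefly. The formulas below are the commonly used forms and are assumptions, since these notes only name the techniques: sublinear scaling as 1 + log(tf) for tf > 0, and maximum tf normalization as a + (1 - a) * tf / tf_max with a smoothing parameter a (0.4 is a frequently quoted value). Pivoted document length normalization needs collection-level statistics and is not sketched here.

```python
import math


def sublinear_tf(tf):
    """Assumed form: 1 + log(tf) for tf > 0, otherwise 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def max_tf_normalized(tf_counts, a=0.4):
    """Assumed form: a + (1 - a) * tf / tf_max for every term in one document."""
    if not tf_counts:
        return {}
    tf_max = max(tf_counts.values())
    return {term: a + (1.0 - a) * tf / tf_max for term, tf in tf_counts.items()}


print(sublinear_tf(1), sublinear_tf(100))
print(max_tf_normalized({"car": 4, "auto": 2}))
```

Both variants dampen the effect of a term that occurs many times in one document, so that twenty occurrences do not count as twenty times the evidence of one.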