Sunday, February 2, 2014

Reading notes for Unit 4

  •     Complex queries can be expressed and processed as Boolean queries.
  • The Boolean retrieval model contrasts with ranked retrieval models.
  •  Compared to Boolean queries, richer query models are needed for the following reasons:
    • We want to be tolerant of spelling mistakes and inconsistent choice of words.
    • The index has to be augmented to capture the proximities of terms in documents.
    • We need term frequency information in postings lists.
    • We wish to have an effective method to order (or “rank”) the returned results.
  •  Parametric and zone indexes allow us to index and retrieve documents by metadata, and they give us a simple means for scoring (and thereby ranking) documents in response to a query.
  • The dictionary for a parametric index comes from a fixed vocabulary, while the dictionary for a zone index comprises whatever vocabulary stems from the text of that zone.
  • In fact, we can reduce the size of the dictionary by encoding the zone in which a term occurs in the postings.
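One way to picture this: instead of one dictionary entry per term-zone pair (e.g. william.author and william.title), a single entry per term keeps the zone alongside each posting. A minimal sketch in Python (the data structure and names here are my own illustration, not from the reading):

```python
from collections import defaultdict

def build_zone_index(docs):
    """Build an inverted index whose postings record the zone of each
    occurrence, so the dictionary needs only one entry per term
    rather than one per term-zone pair."""
    index = defaultdict(list)  # term -> list of (docID, zone) postings
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            for term in set(text.lower().split()):
                index[term].append((doc_id, zone))
    for postings in index.values():
        postings.sort()  # keep postings in docID order
    return index
```

For example, with docs = {1: {"title": "the wind", "author": "mitchell"}}, the posting list for "mitchell" is [(1, "author")] — the zone travels with the posting instead of inflating the dictionary.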

  • Weighted zone scoring: Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the document contributes a Boolean value.
  • Learning weights: training examples consist of a query q and a document d, together with a relevance judgment for d on q. The weights gi are then “learned” from these examples.
  • Weighting the importance of a term in a document, based on the statistics of occurrence of the term.
  • Term frequency; inverse document frequency; tf-idf_{t,d} = tf_{t,d} × idf_t.
  •  By viewing each document as a vector of such weights, we can compute a score between a query and each document.
  • We denote by V(d) the vector derived from document d, with one component in the vector for each dictionary term. The standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations V(d1) and V(d2):
    • sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)
  • Practical Benefit: Given a document d (potentially one of the di in the collection), consider searching for the documents in the collection most similar to d. Such a search is useful in a system where a user may identify a document and seek others like it.
  • Queries as vectors: assign to each document d a score equal to the dot product v(q) · v(d) of the length-normalized vectors.
  • In a typical setting we have a collection of documents each represented by a vector, a free text query represented by a vector, and a positive integer K. We seek the K documents of the collection with the highest vector space scores on the given query.
  • The process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known as accumulators.
  •  Variants of term weighting for the vector space model:
    • Sublinear tf scaling
    • Normalize the tf weights of all terms occurring in a document by the maximum tf in that document
    • Pivoted document length normalization
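Putting several of these notes together, here is a minimal sketch in Python of tf-idf weighting with sublinear tf scaling, term-at-a-time accumulation, and cosine (length) normalization; the sample documents, function names, and variable names are my own illustrations, not from the reading:

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Score each document against a free-text query: tf-idf weights
    with sublinear tf scaling, accumulated term-at-a-time, then
    divided by document vector length (cosine normalization)."""
    N = len(docs)
    tfs = [Counter(doc.split()) for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())

    def weight(term, tf):
        if term not in tf:
            return 0.0
        # Sublinear tf scaling (1 + log tf) times idf = log(N / df).
        return (1 + math.log(tf[term])) * math.log(N / df[term])

    # Term-at-a-time scoring: one accumulator per document.
    scores = [0.0] * N
    for term in query.split():
        for d in range(N):
            scores[d] += weight(term, tfs[d])
    # Cosine normalization by document vector length.
    for d in range(N):
        length = math.sqrt(sum(weight(t, tfs[d]) ** 2 for t in tfs[d]))
        if length > 0:
            scores[d] /= length
    return scores
```

Note this treats the query as an unweighted bag of terms; a full vector-space implementation would also tf-idf-weight the query vector, and the top K results would then be read off by sorting the accumulators.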
