IR2140: Reading Notes for Unit 3

Index Construction

The method of Blocked sort-based indexing, makes the collection assembling all term-docID pairs. External sorting algorithm that uses disk minimize the number of random disk seeks during sorting.

Single-pass in-memory indexing processes tokens one by one during each successive call. When a term occurs for the first time, it is added to the dictionary. The call returns postings list for subsequent occurrences of the term. It eliminates the expensive sorting step.

Distributed indexing is an application of MapReduce. It breaks a large computing problem into smaller parts be recasting it in terms of manipulation of key-value pairs. In map phase, input data was split into key-value pairs. During reduce phase, inverters collect all values( docIDs) for a given key( term ID).

Dynamic indexing has one simple way to do it. Periodically reconstruct the index from scratch.

Index compression

2 central data structures in information retrival are dictionary and the inverted index. Employ compression techniques for them are essential for efficient IR system.

Heaps’ law is used to estimate the number of terms. Even though vocabulary growth depends a lot on the nature of collection and how processed. The law are proved which is the dictionary continues to increase with more documents in the collection rather than the maximum vocabulary size reached. The size of dictionary is quite large for large collections.

Zipf’s law is used for modeling the distribution of terms.

Dictionary as a string. Using fixed-width entries for terms is clearly wasteful. So we store one long string of characters. Blocked storage is further compressing the dictionary by grouping terms. Careful there is tradeoff between compression and the speed of term.

Better compression with increased block size. If we have to partition the dictionary onto pages that are stored on disk, then we can index the first term of each page using a B-tree.

Variable encoding method use fewer bits for short gaps.

Variable Byte code: use an integral number of bytes to encode a gap. The 1^st bite of the byte is a continuation bit. VB codes use an adaptive number of bytes depending on the size of the gap. Bit-level codes adapt the length of the code on the finer grained bit level. The simplest is unary code. r codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset. Length of G: in unary. Offset of G: in binary.

IR2140

Friday, January 24, 2014

Reading Notes for Unit 3

No comments:

Post a Comment