Index Construction
The method of Blocked
sort-based indexing, makes the collection assembling all term-docID pairs. External sorting algorithm that uses disk minimize the number of random
disk seeks during sorting.
Single-pass in-memory
indexing processes tokens one by one during each successive call. When a term
occurs for the first time, it is added to the dictionary. The call returns
postings list for subsequent occurrences of the term. It eliminates the
expensive sorting step.
Distributed indexing
is an application of MapReduce. It breaks a large computing problem into
smaller parts be recasting it in terms of manipulation of key-value pairs. In
map phase, input data was split into key-value pairs. During reduce phase,
inverters collect all values( docIDs) for a given key( term ID).
Dynamic indexing has
one simple way to do it. Periodically reconstruct the index from scratch.
Index compression
2 central data structures in information retrival are dictionary and the inverted index. Employ compression techniques for them are
essential for efficient IR system.
Heaps’ law is used
to estimate the number of terms. Even though vocabulary growth depends a lot on
the nature of collection and how processed. The law are proved which is the
dictionary continues to increase with more documents in the collection rather
than the maximum vocabulary size reached. The size of dictionary is quite large
for large collections.
Zipf’s law is used for modeling the distribution of terms.
Dictionary as a string. Using fixed-width entries for terms is
clearly wasteful. So we store one long string of characters. Blocked storage is
further compressing the dictionary by grouping terms. Careful there is tradeoff
between compression and the speed of term.
Better compression with increased
block size. If we have to partition the dictionary onto pages that are stored
on disk, then we can index the first term of each page using a B-tree.
Variable encoding method use fewer
bits for short gaps.
Variable Byte code: use an integral number of bytes to encode a
gap. The 1st bite of the byte is a continuation bit. VB codes use an
adaptive number of bytes depending on the size of the gap. Bit-level codes adapt
the length of the code on the finer grained bit level. The simplest is unary
code. r codes implement variable-length
encoding by splitting the representation of a gap G into a pair of length and
offset. Length of G: in unary. Offset of G: in binary.
No comments:
Post a Comment