Tuesday, January 28, 2014

Muddiest Points for Unit 3

One problem with this kind of index is that some words occur only rarely in the collection, while many words appear much more frequently in specific kinds of documents. I was wondering: could the words listed in the table be grouped by similar meaning, so that we would not need to list all the documents that are unrelated to their topics?

Friday, January 24, 2014

Reading Notes for Unit 3

Index Construction
Blocked sort-based indexing (BSBI) parses the collection into term-docID pairs, sorts the pairs block by block, and merges the sorted blocks. Its external sorting algorithm uses disk in a way that minimizes the number of random disk seeks during sorting.
Single-pass in-memory indexing (SPIMI) processes tokens one by one during each successive call. When a term occurs for the first time, it is added to the dictionary and a new postings list is created; subsequent occurrences of the term are simply appended to that postings list. This eliminates the expensive sorting step.
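A rough Python sketch of the SPIMI idea (function and variable names are my own): postings are accumulated per term as tokens stream in, so no global sort of term-docID pairs is needed; terms are only sorted once, when the block is written out.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build an in-memory index block from (term, docID) pairs,
    SPIMI-style: no global sort of the pairs is required."""
    dictionary = defaultdict(list)  # term -> postings list
    for term, doc_id in token_stream:
        postings = dictionary[term]      # term added on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)      # docIDs arrive in order
    # sort terms only once, when writing the block out
    return {term: dictionary[term] for term in sorted(dictionary)}
```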
Distributed indexing is an application of MapReduce, which breaks a large computing problem into smaller parts by recasting it in terms of manipulations of key-value pairs. In the map phase, the input data is split into key-value pairs. During the reduce phase, inverters collect all values (docIDs) for a given key (termID).
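The two phases can be sketched as ordinary Python functions (in a real system each phase would run on many machines; the names here are my own):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: split a document into (term, docID) key-value pairs."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Reduce: an inverter collects all docIDs for each term
    into a sorted postings list."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```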

Dynamic indexing has one simple approach: periodically reconstruct the index from scratch.
Index compression
The two central data structures in information retrieval are the dictionary and the inverted index. Employing compression techniques for both is essential for an efficient IR system.
Heaps’ law is used to estimate the number of distinct terms (the vocabulary size). Even though vocabulary growth depends a lot on the nature of the collection and how it is processed, the law shows that the dictionary continues to grow as more documents are added to the collection, rather than reaching some maximum vocabulary size. So the dictionary is quite large for large collections.
Zipf’s law is used for modeling the distribution of terms: the collection frequency of the i-th most common term is proportional to 1/i.
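Both laws are simple enough to write down directly. A sketch, with Heaps parameters k = 44 and b = 0.49 (illustrative values of the kind reported for the RCV1 collection; other collections differ):

```python
def heaps_vocab(num_tokens, k=44, b=0.49):
    """Heaps' law M = k * T^b: vocabulary keeps growing with
    collection size instead of hitting a ceiling."""
    return k * num_tokens ** b

def zipf_cf(rank, cf_top):
    """Zipf's law: the i-th most frequent term has collection
    frequency roughly cf_top / i."""
    return cf_top / rank
```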
Dictionary as a string. Using fixed-width entries for terms is clearly wasteful, so we store the terms as one long string of characters. Blocked storage compresses the dictionary further by grouping terms into blocks. Be careful: there is a tradeoff between compression and the speed of term lookup.
Larger block sizes give better compression. If we have to partition the dictionary onto pages that are stored on disk, then we can index the first term of each page using a B-tree.
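A minimal sketch of the dictionary-as-a-string idea (names are mine; a real dictionary would also store document frequencies and postings pointers): terms are concatenated into one string, and each entry keeps only a pointer to its term's start instead of a wasteful fixed-width field.

```python
def dictionary_as_string(terms):
    """Store sorted terms in one long string plus start pointers;
    term i spans pointers[i] .. pointers[i+1] in the string."""
    sorted_terms = sorted(terms)
    big_string = "".join(sorted_terms)
    pointers, pos = [], 0
    for term in sorted_terms:
        pointers.append(pos)
        pos += len(term)
    return big_string, pointers
```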
Variable-length encoding methods use fewer bits for short gaps.

Variable byte (VB) code: use an integral number of bytes to encode a gap. The first bit of each byte is a continuation bit, so VB codes use an adaptive number of bytes depending on the size of the gap. Bit-level codes adapt the length of the code at the finer-grained bit level; the simplest is the unary code. γ codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset. Length of G: in unary. Offset of G: in binary.
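Both codes are short enough to implement directly. A sketch (my own function names; the γ code here returns a bit string for readability):

```python
def vb_encode(n):
    """Variable byte code: 7 payload bits per byte; the high
    (continuation) bit is set only on the last byte of the gap."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the final byte
    return out

def gamma_encode(g):
    """Gamma code: length of the offset in unary, then the offset
    (binary representation of g with its leading 1 removed)."""
    offset = bin(g)[3:]  # strip '0b' and the leading 1
    return "1" * len(offset) + "0" + offset
```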

Muddiest Points for Unit 2

Different punctuation marks can have different meanings, but it could be difficult for a system to automatically identify the right interpretation. I was thinking: is relying on experience the normal way, for example gathering statistical data and using it when deciding how to handle punctuation?


Analyzing the context of the paragraph could help when stemming. How effective would that be?

Friday, January 10, 2014

Reading Notes for Unit 2


1.2
Build inverted index:
There are 4 major steps of building an inverted index.
1)      Collect the documents to be indexed;
2)      Tokenize the text, turning each document into a list of tokens;
3)      Do linguistic preprocessing;
4)      Index the documents.
Input to indexing: a list of normalized tokens for each document (a list of term-docID pairs).
Core indexing step: sorting the list so that the terms are alphabetical.
The inverted index is the most efficient structure for supporting ad hoc text search.
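The four steps above fit in a few lines of Python (a sketch with my own names; linguistic preprocessing is reduced to lowercasing here). The core step is sorting the (term, docID) pairs so the terms come out alphabetically:

```python
def build_inverted_index(docs):
    """docs maps docID -> text. Tokenize, form (term, docID) pairs,
    sort them (the core indexing step), then merge into postings lists."""
    pairs = sorted((tok.lower(), doc_id)
                   for doc_id, text in docs.items()
                   for tok in text.split())
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index
```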
 


2
1, Define the basic unit of a document and the character sequence:
Convert the byte sequence into a linear sequence of characters. The first step is to determine the encoding.
Appropriate indexing granularity is essential.
 2, Determine the vocabulary of terms using tokenization and linguistic preprocessing:
1)      Tokenization:
The major question is: what are the correct tokens to use?
Tokenization requires the language of the document to be known. Handling hyphens can be complex, splitting on white space can split what should be regarded as a single token, and each new language presents new issues.
2)      Dropping common terms: stop words
Some extremely common words appear to be of little value in helping select documents matching a user need.
One way to determine a stop list is to sort the terms by collection frequency and take the most frequent ones. Much of the time, not indexing stop words does little harm.
3)      Normalization:
Token normalization
Remove characters like hyphens, or maintain equivalence relations between unnormalized tokens.
Some issues: Accents and diacritics; Capitalization/ case-folding; Other issues in English; Other languages.
4)      Stemming and lemmatization:
It would be useful for a search for one of these words to return documents that contain another word in the set. The goal of stemming and lemmatization is to reduce inflectional forms, or to change words to a common base form.
 

3, How to process posting list intersection in sublinear time?
Skip list: segmenting postings lists with skip pointers. This allows us to avoid processing parts of a postings list that will not figure in the search results.
Where do we place skips? A simple heuristic for a postings list of length P is to use √P evenly spaced skip pointers. Building effective skip pointers is easy if the index is relatively static.
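A sketch of postings intersection with √P-spaced skips (names are mine; real skip pointers are stored in the list rather than recomputed): whenever a skip target is still ≤ the other list's current docID, we jump ahead instead of stepping one entry at a time.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using sqrt(len)-spaced skips."""
    s1 = max(1, int(math.sqrt(len(p1))))
    s2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1          # skip: target still not past p2[j]
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer
```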
4, To process two-word phrase queries, we can treat each biword as a vocabulary term; longer phrases can be processed by breaking them down into biwords.
Rather than a biword index, a positional index is most commonly employed. We still need to access the inverted index entries for each distinct term, but rather than simply checking that both terms are in a document, we also need to check that their positions of appearance in the document are compatible with the phrase query being evaluated.
Combination schemes: the strategies of biword indexes and positional indexes can be fruitfully combined. A combination strategy uses a phrase index (or just a biword index) for certain queries and a positional index for other phrase queries.
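The positional compatibility check for a two-word phrase is simple to sketch (names are mine): given each term's position list within one document, the second term must occur at position p + 1 for some position p of the first term.

```python
def two_word_phrase_match(pos1, pos2):
    """True if term2 appears immediately after term1 somewhere in the
    document, given each term's sorted position list."""
    following = set(pos2)
    return any(p + 1 in following for p in pos1)
```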
 


3
1, The data structure to be used in vocabulary lookup operation is dictionary.
Two broad classes of solutions: hashing and search trees.

2, Two techniques to handle wildcard queries:
Permuterm indexes: to find m*n, look up n$m*, where $ is the end-of-word symbol. However, this takes too much space.
k-gram indexes: to find re*ve, issue the Boolean query $re AND ve$ against the k-gram index.
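Generating the index keys for both techniques is straightforward (a sketch, my own names): the permuterm index stores every rotation of term + '$', and the k-gram index stores the term's character k-grams with '$' marking the boundaries.

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; a wildcard query m*n is rotated
    to n$m* and matched by prefix against these rotations."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def kgrams(term, k=3):
    """k-gram index keys for a term, with '$' marking its boundaries."""
    t = "$" + term + "$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]
```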

3, Two steps to correct spelling errors in queries (isolated-term correction): the first based on edit distance, the second based on k-gram overlap. The definition of edit distance: the minimum number of edit operations required to transform s1 into s2.
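Edit distance has a standard dynamic-programming solution; a sketch with insertions, deletions, and substitutions each costing 1:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s1 into s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete everything
    for j in range(n + 1):
        d[0][j] = j          # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]
```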
Context-sensitive spelling correction: enumerate corrections of each of the individual query terms and try the substituted phrases.


4, Phonetic correction: for each term, generate a “phonetic hash” so that similar-sounding terms hash to the same value.
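A simplified Soundex-style hash as a sketch (this omits some of the full algorithm's special rules, e.g. for h and w between consonants): keep the first letter, map the remaining consonants to digit classes, drop vowels and adjacent repeats, then pad or truncate to four characters.

```python
def soundex(term):
    """Simplified Soundex: similar-sounding terms hash to the same
    4-character code, e.g. HERMANN and HERMAN both give H655."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    term = term.lower()
    result = term[0].upper()          # retain the first letter
    prev = codes.get(term[0], "")
    for ch in term[1:]:
        code = codes.get(ch, "")      # vowels etc. map to ""
        if code and code != prev:     # drop adjacent repeats
            result += code
        prev = code
    return (result + "000")[:4]       # pad/truncate to 4 chars
```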

Muddiest Points for Unit 1


The huge amount of data we have now may exist in very complex formats. I suspect these kinds of data may be hard to convert into a single structured format. Also, much of the information could be redundant; how should we deal with that kind of data?