Tuesday, January 28, 2014

Muddiest Points for Unit 3

One problem with this kind of index is that some words occur only rarely in the collection, while many words appear much more frequently in specific kinds of documents. I was wondering: could the words listed in the table be grouped by similar meaning, so that we would not need to list all the documents that are unrelated to their topics?

Friday, January 24, 2014

Reading Notes for Unit 3

Index Construction
Blocked sort-based indexing (BSBI) parses the collection into term-docID pairs, sorts the pairs block by block, and merges the sorted blocks. Its external sorting algorithm uses disk in a way that minimizes the number of random disk seeks during sorting.
Single-pass in-memory indexing (SPIMI) processes tokens one by one during each successive call. When a term occurs for the first time, it is added to the dictionary and a new postings list is created; subsequent occurrences of the term are simply appended to that postings list. This eliminates the expensive sorting step.
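A rough Python sketch of the SPIMI idea (function and variable names are my own): postings are accumulated per term as tokens stream in, so no global sort of term-docID pairs is needed; terms are only sorted once, when the block is written out.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build an in-memory index block from (term, docID) pairs,
    SPIMI-style: no global sort of the pairs is required."""
    dictionary = defaultdict(list)  # term -> postings list
    for term, doc_id in token_stream:
        postings = dictionary[term]      # term added on first occurrence
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)      # docIDs arrive in order
    # sort terms only once, when writing the block out
    return {term: dictionary[term] for term in sorted(dictionary)}
```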
Distributed indexing is an application of MapReduce, which breaks a large computing problem into smaller parts by recasting it in terms of manipulations of key-value pairs. In the map phase, the input data is split into key-value pairs. During the reduce phase, inverters collect all values (docIDs) for a given key (termID).
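The two phases can be sketched as ordinary Python functions (in a real system each phase would run on many machines; the names here are my own):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: split a document into (term, docID) key-value pairs."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Reduce: an inverter collects all docIDs for each term
    into a sorted postings list."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```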

Dynamic indexing has one simple approach: periodically reconstruct the index from scratch.
Index compression
The two central data structures in information retrieval are the dictionary and the inverted index. Employing compression techniques for both is essential for an efficient IR system.
Heaps’ law is used to estimate the number of distinct terms (the vocabulary size). Even though vocabulary growth depends a lot on the nature of the collection and how it is processed, the law shows that the dictionary continues to grow as more documents are added to the collection, rather than reaching some maximum vocabulary size. So the dictionary is quite large for large collections.
Zipf’s law is used for modeling the distribution of terms: the collection frequency of the i-th most common term is proportional to 1/i.
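Both laws are simple enough to write down directly. A sketch, with Heaps parameters k = 44 and b = 0.49 (illustrative values of the kind reported for the RCV1 collection; other collections differ):

```python
def heaps_vocab(num_tokens, k=44, b=0.49):
    """Heaps' law M = k * T^b: vocabulary keeps growing with
    collection size instead of hitting a ceiling."""
    return k * num_tokens ** b

def zipf_cf(rank, cf_top):
    """Zipf's law: the i-th most frequent term has collection
    frequency roughly cf_top / i."""
    return cf_top / rank
```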
Dictionary as a string. Using fixed-width entries for terms is clearly wasteful, so we store the terms as one long string of characters. Blocked storage compresses the dictionary further by grouping terms into blocks. Be careful: there is a tradeoff between compression and the speed of term lookup.
Larger block sizes give better compression. If we have to partition the dictionary onto pages that are stored on disk, then we can index the first term of each page using a B-tree.
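A minimal sketch of the dictionary-as-a-string idea (names are mine; a real dictionary would also store document frequencies and postings pointers): terms are concatenated into one string, and each entry keeps only a pointer to its term's start instead of a wasteful fixed-width field.

```python
def dictionary_as_string(terms):
    """Store sorted terms in one long string plus start pointers;
    term i spans pointers[i] .. pointers[i+1] in the string."""
    sorted_terms = sorted(terms)
    big_string = "".join(sorted_terms)
    pointers, pos = [], 0
    for term in sorted_terms:
        pointers.append(pos)
        pos += len(term)
    return big_string, pointers
```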
Variable-length encoding methods use fewer bits for short gaps.

Variable byte (VB) code: use an integral number of bytes to encode a gap. The first bit of each byte is a continuation bit, so VB codes use an adaptive number of bytes depending on the size of the gap. Bit-level codes adapt the length of the code at the finer-grained bit level; the simplest is the unary code. γ codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset. Length of G: in unary. Offset of G: in binary.
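Both codes are short enough to implement directly. A sketch (my own function names; the γ code here returns a bit string for readability):

```python
def vb_encode(n):
    """Variable byte code: 7 payload bits per byte; the high
    (continuation) bit is set only on the last byte of the gap."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the final byte
    return out

def gamma_encode(g):
    """Gamma code: length of the offset in unary, then the offset
    (binary representation of g with its leading 1 removed)."""
    offset = bin(g)[3:]  # strip '0b' and the leading 1
    return "1" * len(offset) + "0" + offset
```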

Muddiest Points for Unit 2

Different punctuation marks can have different meanings, but it could be difficult for a system to automatically identify the right interpretation. I was thinking: is relying on experience the normal way, for example gathering statistical data and using it when deciding how to handle punctuation?


Analyzing the context of the paragraph could help when stemming. How effective would that be?

Friday, January 10, 2014

Reading Notes for Unit 2


1.2
Build inverted index:
There are 4 major steps of building an inverted index.
1)      Collect the documents to be indexed;
2)      Tokenize the text, turning each document into a list of tokens;
3)      Do linguistic preprocessing;
4)      Index the documents.
Input to indexing: a list of normalized tokens for each document (a list of term-docID pairs).
Core indexing step: sorting the list so that the terms are alphabetical.
The inverted index is the most efficient structure for supporting ad hoc text search.
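The four steps above fit in a few lines of Python (a sketch with my own names; linguistic preprocessing is reduced to lowercasing here). The core step is sorting the (term, docID) pairs so the terms come out alphabetically:

```python
def build_inverted_index(docs):
    """docs maps docID -> text. Tokenize, form (term, docID) pairs,
    sort them (the core indexing step), then merge into postings lists."""
    pairs = sorted((tok.lower(), doc_id)
                   for doc_id, text in docs.items()
                   for tok in text.split())
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index
```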
 


2
1, Define the basic unit of a document and the character sequence:
Convert the byte sequence into a linear sequence of characters. The first step is to determine the encoding.
Appropriate indexing granularity is essential.
 2, Determine the vocabulary of terms using tokenization and linguistic preprocessing:
1)      Tokenization:
The major question is: what are the correct tokens to use?
Tokenization requires the language of the document to be known. Handling hyphens can be complex, splitting on white space can split what should be regarded as a single token, and each new language presents new issues.
2)      Dropping common terms: stop words
Some extremely common words appear to be of little value in helping select documents matching a user need.
One way to determine a stop list is to sort the terms by collection frequency and take the most frequent ones. Much of the time, not indexing stop words does little harm.
3)      Normalization:
Token normalization
Remove characters like hyphens, or maintain equivalence relations between unnormalized tokens.
Some issues: Accents and diacritics; Capitalization/ case-folding; Other issues in English; Other languages.
4)      Stemming and lemmatization:
It would be useful for a search for one of these words to return documents that contain another word in the set. The goal of stemming and lemmatization is to reduce inflectional forms, or to change words to a common base form.
 

3, How to process posting list intersection in sublinear time?
Skip list: segmenting postings lists with skip pointers. This allows us to avoid processing parts of a postings list that will not figure in the search results.
Where do we place skips? A simple heuristic for a postings list of length P is to use √P evenly spaced skip pointers. Building effective skip pointers is easy if the index is relatively static.
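A sketch of postings intersection with √P-spaced skips (names are mine; real skip pointers are stored in the list rather than recomputed): whenever a skip target is still ≤ the other list's current docID, we jump ahead instead of stepping one entry at a time.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using sqrt(len)-spaced skips."""
    s1 = max(1, int(math.sqrt(len(p1))))
    s2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1          # skip: target still not past p2[j]
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer
```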
4, To process two-word phrase queries, we can treat each biword as a vocabulary term; longer phrases can be processed by breaking them down into biwords.
Rather than a biword index, a positional index is most commonly employed. We still need to access the inverted index entries for each distinct term, but rather than simply checking that both terms are in a document, we also need to check that their positions of appearance in the document are compatible with the phrase query being evaluated.
Combination schemes: the strategies of biword indexes and positional indexes can be fruitfully combined. A combination strategy uses a phrase index (or just a biword index) for certain queries and a positional index for other phrase queries.
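The positional compatibility check for a two-word phrase is simple to sketch (names are mine): given each term's position list within one document, the second term must occur at position p + 1 for some position p of the first term.

```python
def two_word_phrase_match(pos1, pos2):
    """True if term2 appears immediately after term1 somewhere in the
    document, given each term's sorted position list."""
    following = set(pos2)
    return any(p + 1 in following for p in pos1)
```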
 


3
1, The data structure to be used in vocabulary lookup operation is dictionary.
Two broad classes of solutions: hashing and search trees.

2, Two techniques to handle wildcard queries:
Permuterm indexes: to find m*n, look up n$m*, where $ is the end-of-word symbol. However, this takes too much space.
k-gram indexes: to find re*ve, issue the Boolean query $re AND ve$ against the k-gram index.
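Generating the index keys for both techniques is straightforward (a sketch, my own names): the permuterm index stores every rotation of term + '$', and the k-gram index stores the term's character k-grams with '$' marking the boundaries.

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; a wildcard query m*n is rotated
    to n$m* and matched by prefix against these rotations."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def kgrams(term, k=3):
    """k-gram index keys for a term, with '$' marking its boundaries."""
    t = "$" + term + "$"
    return [t[i:i + k] for i in range(len(t) - k + 1)]
```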

3, Two steps to correct spelling errors in queries (isolated-term correction): the first based on edit distance, the second based on k-gram overlap. The definition of edit distance: the minimum number of edit operations required to transform s1 into s2.
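Edit distance has a standard dynamic-programming solution; a sketch with insertions, deletions, and substitutions each costing 1:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s1 into s2."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete everything
    for j in range(n + 1):
        d[0][j] = j          # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]
```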
Context-sensitive spelling correction: enumerate corrections of each of the individual query terms and try the substituted phrases.


4, Phonetic correction: for each term, generate a “phonetic hash” so that similar-sounding terms hash to the same value.
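A simplified Soundex-style hash as a sketch (this omits some of the full algorithm's special rules, e.g. for h and w between consonants): keep the first letter, map the remaining consonants to digit classes, drop vowels and adjacent repeats, then pad or truncate to four characters.

```python
def soundex(term):
    """Simplified Soundex: similar-sounding terms hash to the same
    4-character code, e.g. HERMANN and HERMAN both give H655."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    term = term.lower()
    result = term[0].upper()          # retain the first letter
    prev = codes.get(term[0], "")
    for ch in term[1:]:
        code = codes.get(ch, "")      # vowels etc. map to ""
        if code and code != prev:     # drop adjacent repeats
            result += code
        prev = code
    return (result + "000")[:4]       # pad/truncate to 4 chars
```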

Muddiest Points for Unit 1


The huge amount of data we have now may exist in very complex formats. I suspect these kinds of data may be hard to convert into a single structured format. Also, much of the information could be redundant; how should we deal with that kind of data?