Friday, January 10, 2014

Reading Notes for Unit 2


1.2
Build inverted index:
There are 4 major steps of building an inverted index.
1)      Collect the documents to be indexed;
2)      Tokenize the text, turning each document into a list of tokens;
3)      Do linguistic preprocessing;
4)      Index the documents.
Input to indexing: a list of normalized tokens for each document (a list of term–docID pairs).
Core indexing step: sorting the list so that the terms are in alphabetical order.
Inverted index structure is the most efficient structure for supporting ad hoc text search.
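The four steps above can be sketched in Python (a minimal illustration; tokenization and normalization here are just a lowercase whitespace split):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build an inverted index from {docID: text}."""
    # Steps 1-3: collect, tokenize, and (minimally) normalize each document.
    pairs = []  # (term, docID) pairs -- the input to indexing
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    # Core indexing step: sort so that terms are in alphabetical order.
    pairs.sort()
    # Step 4: merge duplicates into postings lists of sorted docIDs.
    index = defaultdict(list)
    for term, doc_id in pairs:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)

idx = build_inverted_index({1: "new home sales", 2: "home sales rise"})
```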
 


2
1, Define the basic unit of a document and define the character sequence:
Decode the byte sequence into a linear sequence of characters. The first step is to determine the encoding.
Appropriate indexing granularity is essential.
2, Determine the vocabulary of terms using tokenization and linguistic preprocessing:
1)      Tokenization:
The major question is: what are the correct tokens to use?
Tokenization requires the language of the document to be known. Handling hyphens can be complex, and splitting on whitespace can split what should be regarded as a single token. Each new language presents new issues.
2)      Dropping common terms: stop words
Some extremely common words appear to be of little value in helping select documents matching a user need.
A stop list can be built by sorting the terms by collection frequency and taking the most frequent ones. In practice, not indexing stop words does little harm.
3)      Normalization:
Token normalization
Remove characters such as hyphens (forming equivalence classes of tokens), or maintain relations between unnormalized tokens.
Some issues: accents and diacritics; capitalization/case-folding; other issues in English; other languages.
4)      Stemming and lemmatization:
It would be useful for a search for one of these words to return documents that contain another word in the set. The goal of stemming and lemmatization is to reduce inflectional forms of a word to a common base form.
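A rough illustration of suffix stripping (a toy rule set for demonstration only, not the Porter stemmer or a real lemmatizer):

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustrative only)."""
    rules = [("sses", "ss"), ("ies", "i"), ("ational", "ate"),
             ("ing", ""), ("s", "")]
    for suffix, replacement in rules:
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)] + replacement
    return word
```

For example, `toy_stem("caresses")` and `toy_stem("caress")` map to the same form, so a query on one can match documents containing the other.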
 

3, How to process posting list intersection in sublinear time?
Skip list: segmenting postings lists with skip pointers. This allows us to avoid processing parts of the postings list that will not figure in the search results.
Where do we place skips? A simple heuristic is to use √P evenly spaced skip pointers for a postings list of length P. Building effective skip pointers is easy if an index is relatively static.
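A sketch of intersection with skip pointers, using the √P spacing heuristic above (skips are implicit here, computed from list length rather than stored):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists of docIDs,
    skipping ahead sqrt(len) positions when safe."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if it does not overshoot p2[j].
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer
```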
4, To process two-word phrase queries, we can treat each biword as a vocabulary term. Longer phrases can be processed by breaking them down into biwords.
Rather than a biword index, a positional index is most commonly employed. We still access the inverted index entries for each distinct term, but rather than simply checking that both terms are in a document, we need to check that their positions of appearance in the document are compatible with the phrase query being evaluated.
Combination schemes: the strategies of biword indexes and positional indexes can be fruitfully combined. A combination strategy uses a phrase index (or just a biword index) for certain queries and a positional index for other phrase queries.
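The positional check for a two-term phrase query can be sketched as follows (assuming positional postings are stored as docID → sorted position lists):

```python
def positional_phrase_match(pos1, pos2):
    """Return docIDs where term2 occurs immediately after term1.

    pos1, pos2: positional postings for the two terms,
    each a dict mapping docID -> list of token positions.
    """
    matches = []
    for doc_id in pos1.keys() & pos2.keys():  # docs containing both terms
        positions2 = set(pos2[doc_id])
        # The phrase matches if some occurrence of term1 is directly
        # followed by an occurrence of term2.
        if any(p + 1 in positions2 for p in pos1[doc_id]):
            matches.append(doc_id)
    return sorted(matches)
```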
 


3
1, The data structure used for the vocabulary lookup operation is the dictionary.
Two broad classes of solutions: hashing and search trees.

2, Two techniques to handle wildcard queries:
Permuterm indexes: to find m*n, look up n$m*, where $ is the end symbol. However, the permuterm index takes up considerable space.
k-gram indexes: to find re*ve, issue the Boolean query $re AND ve$ over the k-gram index, then post-filter the candidates against the original wildcard.
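The permuterm rewrite can be sketched as follows (single-* queries only; a real index would store the rotations in a search tree rather than scanning the vocabulary):

```python
def permuterm_rotations(term):
    """All rotations of term + '$' (the end symbol); each rotation
    is a key in the permuterm index pointing back to the term."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

def permuterm_query(wildcard):
    """Rewrite a single-* wildcard query m*n into the prefix n$m."""
    prefix, suffix = wildcard.split("*")
    return suffix + "$" + prefix

def wildcard_lookup(wildcard, vocabulary):
    """Return vocabulary terms matching the wildcard query."""
    key = permuterm_query(wildcard)
    return sorted(term for term in vocabulary
                  if any(rot.startswith(key)
                         for rot in permuterm_rotations(term)))
```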

3, Two steps to correct spelling errors in queries (isolated-term correction): the first is based on edit distance, the second on k-gram overlap. Definition of edit distance: the minimum number of edit operations required to transform s1 into s2.
Context-sensitive spelling correction: enumerate corrections of each of the query terms and try the resulting combinations.
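The edit distance used in the first step is the standard dynamic program (Levenshtein distance, counting insertions, deletions, and substitutions):

```python
def edit_distance(s1, s2):
    """Minimum number of edit operations to transform s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]
```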


4,  Phonetic correction: for each term, generate a “phonetic hash” so that similar-sounding terms hash to the same value.
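Soundex is the classic example of such a phonetic hash; a sketch of the standard algorithm (first letter plus three digits, with vowels resetting and h/w not separating duplicate codes):

```python
def soundex(name):
    """Classic Soundex phonetic hash of a name."""
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()          # keep the first letter as-is
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:     # drop repeats of the same code
            result += code
        if ch not in "hw":            # h and w do not reset the code
            prev = code               # vowels (code "") do reset it
        if len(result) == 4:
            break
    return result.ljust(4, "0")       # pad to first letter + 3 digits
```

Similar-sounding names hash to the same value, e.g. "Robert" and "Rupert" both map to R163.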
