Friday, March 28, 2014

Reading notes for Unit 10

Information retrieval systems often have to deal with very large amounts of data. They must be able to process many gigabytes or even terabytes of text, and to build and maintain an index for millions of documents.
Parallel query processing is an efficient way to handle queries over very large index files. In practice, a search engine retrieves data from huge postings and index files, so parallel query processing becomes necessary. The index files are split across machines that work simultaneously, each returning its top-K results to a central server for merging.
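As a sketch of that last step, a central server might merge the per-shard top-K lists like this (the shard results and k are made up for illustration):

```python
import heapq

def merge_top_k(shard_results, k):
    """Merge ranked (score, doc_id) lists from several index shards,
    each already sorted by descending score, keeping the overall top k."""
    merged = heapq.merge(*shard_results, key=lambda pair: -pair[0])
    return list(merged)[:k]

# Hypothetical results returned by three shards.
shard_a = [(0.9, "d1"), (0.4, "d7")]
shard_b = [(0.8, "d3"), (0.2, "d9")]
shard_c = [(0.6, "d5")]

print(merge_top_k([shard_a, shard_b, shard_c], k=3))
# [(0.9, 'd1'), (0.8, 'd3'), (0.6, 'd5')]
```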

I also learned about redundancy and fault tolerance in distributed search engines, index construction, and statistical analysis of a corpus of text.

Muddiest Points for Unit 9

I do not have any questions for this class.

Friday, March 21, 2014

Reading notes for Unit 9

19 Web search basics
A user searching for maui golf real estate is not merely seeking news or entertainment on the subject of housing on golf courses on the island of Maui, but is instead likely seeking to purchase such a property.
It is crucial that we understand the users of web search as well. This is again a significant change from traditional information retrieval, where users were typically professionals with at least some training in the art of phrasing queries over a well-authored collection whose style and structure they understood well. In contrast, web search users tend to not know (or care) about the heterogeneity of web content, the syntax of query languages and the art of phrasing queries; indeed, a mainstream tool (as web search has come to become) should not place such onerous demands on billions of people. To a first approximation, comprehensiveness grows with index size, although it does matter which specific pages a search engine indexes – some pages are more informative than others. It is also difficult to reason about the fraction of the Web indexed by a search engine, because there is an infinite number of dynamic web pages.

21 Link analysis
In this chapter what I have learned is that the analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search. This chapter focuses on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query.

Link analysis for web search has intellectual antecedents in the field of citation analysis, aspects of which overlap with an area known as bibliometrics. These disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of citations amongst them. Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from a web page to another as a conferral of authority. Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough.

Paper: Authoritative Sources in a Hyperlinked Environment
The structure of a hyperlinked environment can be a rich source of information, and the content of the environment provides an effective means for understanding it. The author develops algorithmic tools for extracting information from the link structures of such environments, and reports on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue is the distillation of broad search topics through the discovery of “authoritative” information sources on such topics. An algorithmic formulation is proposed based on the relationship between a set of relevant authoritative pages and a set of “hub” pages that join them together in the link structure. The formulation has connections to the eigenvectors of certain matrices associated with the link graph, which in turn motivate additional heuristics for link-based analysis. This is the point of this paper.
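The hub/authority iteration the paper describes can be sketched as follows; the tiny link graph here is made up for illustration:

```python
def hits(links, iterations=20):
    """Sketch of Kleinberg's hub/authority (HITS) iteration.
    `links` maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that link to it.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so scores stay bounded.
        a = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a for p, v in auth.items()}
        hub = {p: v / h for p, v in hub.items()}
    return hub, auth

# Made-up graph: pages a and b both point to c.
hub, auth = hits({"a": ["c"], "b": ["c"]})
print(max(auth, key=auth.get))  # c
```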


Paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine

Saturday, March 1, 2014

Reading notes for Unit 8

1, Human-Computer Interaction

Design Principles

Offer informative feedback.
 Reduce working memory load.
Provide alternative interfaces for novice and expert users.

The Role of Visualization

Brushing and linking refers to the connecting of two or more views of the same data. A mouse-click on the title of the chapter causes the text of the chapter itself to appear in another window, in a linking action. Panning and zooming or focus-plus-context can be used to change the view of the contents within the overview window. Additionally, there are a large number of graphical methods for depicting trees and hierarchies.

Evaluating Interactive Systems

Precision and recall measures have been widely used for comparing the ranking results of non-interactive systems, but are less appropriate for assessing interactive systems. Empirical data involving human users is time-consuming to gather and difficult to draw conclusions from.

2, The Information Access Process

Information access process :


1. Start with an information need.
2. Select a system and collections to search on.
3. Formulate a query.
4. Send the query to the system.
5. Receive the results in the form of information items.
6. Scan, evaluate, and interpret the results.
7. Either stop, or
8. Reformulate the query and go to step 4.

From these observations it is convenient to divide the entire information access process into two main components: search/retrieval, and analysis/synthesis of results. User interfaces should allow both kinds of activity to be tightly interwoven, while still allowing each to be performed independently.

3, Starting Points

The meanings of category labels differ somewhat among collections. Most are designed to help organize the documents and to aid in query specification. Most interfaces that depict category hierarchies graphically do so by associating a document directly with the node of the category hierarchy to which it has been assigned.

4, Query Classification

  1. Boolean Queries
  2. From Command Lines to Forms and Menus
  3. Faceted Queries
  4. Graphical Approaches to Query Specification
  5. Phrases and Proximity
  6. Natural Language and Free Text Queries

5, Context

Interface techniques for placing the current document set in the context of other information types, in order to make the document set more understandable.

Document Surrogates



The most common way to show results for a query is to list information about documents in order of their computed relevance to the query. Alternatively, for pure Boolean ranking, documents are listed according to a metadata attribute, such as date.

Query Term Hits Within Document Content

  • KWIC: A facility related to highlighting is the keyword-in-context (KWIC) document surrogate. Sentence fragments, full sentences, or groups of sentences that contain query terms are extracted from the full text and presented for viewing along with other kinds of surrogate information (such as document title and abstract).
  • TileBars: The user enters a query in a faceted format, with one topic per line. After the system retrieves documents (using a quorum or statistical ranking algorithm), a graphical bar is displayed next to the title of each document showing the degree of match for each facet.
  • SeeSoft: The SeeSoft visualization represents text in a manner resembling columns of newspaper text, with one `line' of text on each horizontal line of the strip.

Query Term Hits Between Documents

  • InfoCrystal: The InfoCrystal shows how many documents contain each subset of query terms.
  • VIBE and Lyberworld: In these displays, query terms are placed in an abstract graphical space.
  • Lattices: Several researchers have employed a graphical depiction of a mathematical lattice for the purposes of query formulation, where the query consists of a set of constraints on a hierarchy of categories (actually, semantic attributes in these systems).

Using Hyperlinks to Organize Retrieval Results

Tables

Tabular display is another approach for showing relationships among retrieved documents.

6, Using Relevance Judgements

Interfaces for Standard Relevance Feedback

A standard interface for relevance feedback consists of a list of titles with checkboxes beside the titles that allow the user to mark relevant documents. This can imply either that unmarked documents are not relevant or that no opinion has been made about unmarked documents, depending on the system. Another option is to provide a choice among several checkboxes indicating relevant or not relevant (with no selection implying no opinion).

Studies of User Interaction with Relevance Feedback Systems

Standard relevance feedback assumes the user is involved in the interaction by specifying the relevant documents. In some interfaces users are also able to select which terms to add to the query.

Fetching Relevant Information in the Background

Standard relevance feedback is predicated on the goal of improving an ad hoc query or building a profile for a routing query. More recently researchers have begun developing systems that monitor users' progress and behavior over long interaction periods in an attempt to predict which documents or actions the user is likely to want in future. 

Group Relevance Judgements

Recently there has been much interest in using relevance judgements from a large number of different users to rate or rank information of general interest.

Pseudo-Relevance Feedback

Muddiest Points for Unit 7

I find it hard to decide how to handle different kinds of feedback, especially implicit feedback. In interactive IR, what kind of activity should produce a positive feedback score and what kind should produce a negative one?

Friday, February 21, 2014

Reading notes for Unit 7

Relevance feedback and query expansion
query refinement

Global methods include:
• Query expansion/reformulation with a thesaurus or WordNet
• Query expansion via automatic thesaurus generation
• Techniques like spelling correction
Local methods include:
• Relevance feedback
• Pseudo relevance feedback, also known as blind relevance feedback
• (Global) indirect relevance feedback

The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set.

The basic procedure is:
• The user issues a (short, simple) query.
• The system returns an initial set of retrieval results.
• The user marks some returned documents as relevant or non-relevant.
• The system computes a better representation of the information need based on the user feedback.
• The system displays a revised set of retrieval results.

Rocchio algorithm
The Rocchio Algorithm is the classic algorithm for implementing relevance feedback. It models a way of incorporating relevance feedback information into the vector space model.

Relevance feedback can improve both recall and precision.
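A minimal sketch of the Rocchio update, assuming typical weight values (alpha = 1.0, beta = 0.75, gamma = 0.15) that the notes do not fix:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback in the vector space model: move the
    query vector toward the centroid of relevant documents and away
    from the centroid of nonrelevant ones. Vectors are lists of term
    weights; alpha/beta/gamma are assumed typical values."""
    dims = len(query)

    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]

    rel_c = centroid(relevant)
    nonrel_c = centroid(nonrelevant)
    updated = [alpha * query[i] + beta * rel_c[i] - gamma * nonrel_c[i]
               for i in range(dims)]
    # Negative term weights are usually clipped to zero.
    return [max(w, 0.0) for w in updated]

# Hypothetical two-term query with one judged-relevant document.
print(rocchio([1.0, 0.0], relevant=[[0.0, 1.0]], nonrelevant=[]))
# [1.0, 0.75]
```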

Probabilistic relevance feedback

Rather than reweighting the query in a vector space, if a user has told us some relevant and nonrelevant documents, then we can proceed to build a classifier.
These methods use only collection statistics and information about the term distribution within the documents judged relevant; they preserve no memory of the original query.

Cases where relevance feedback alone is not sufficient include:
• Misspellings.
• Cross-language information retrieval.
• Mismatch of searcher’s vocabulary versus collection vocabulary.



The relevance feedback approach also requires relevant documents to be similar to each other.

Muddiest Points for Unit 6

My question is: since we only use average precision, why do we need to calculate recall?

Wednesday, February 12, 2014

Muddiest Points for Unit 5

For "Multinomial and Multiple-Bernoulli", there is an example of how the formulas are built: one is targeted at the terms of the query and one at the terms of the model M. Does the final result indicate the correspondence between the frequency of each term in the query and the frequency of each term in M?

Reading notes for Unit 6


  • To measure ad hoc information retrieval effectiveness, we need a test collection consisting of:
    1. A document collection
    2. A test suite of information needs
    3. A set of relevance judgments

  • A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.
Evaluation of unranked retrieval sets

  • The two most frequent and basic measures for information retrieval effectiveness are precision and recall.
  • Accuracy is not an appropriate measure for information retrieval problems. A system tuned to maximize accuracy can appear to perform well by simply deeming all documents non-relevant to all queries. However, labeling all documents as non-relevant is completely unsatisfying to an information retrieval system user.
  • A single measure that trades off precision versus recall is the F measure.
  • We use a harmonic mean rather than the simpler average (arithmetic mean).
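The contrast between the harmonic and arithmetic means can be seen in a small sketch (the precision/recall values are made up):

```python
def f_measure(precision, recall, beta=1.0):
    """General F measure; beta = 1 gives the balanced harmonic mean F1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A system with lopsided precision/recall looks fine under the
# arithmetic mean but poor under the harmonic mean.
print((0.1 + 0.9) / 2)               # 0.5
print(round(f_measure(0.1, 0.9), 2))  # 0.18
```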
Evaluation of ranked retrieval results

  • Examining the entire precision-recall curve is very informative, but there is often a desire to boil this information down to a few numbers, or perhaps even a single number. The traditional way of doing this is the 11-point interpolated average precision.
  • In web search, what matters is rather how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents.
  • Mean Average Precision (MAP) provides a single-figure measure of quality across recall levels.
  • An alternative, which alleviates this problem, is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned.
  • Another concept sometimes used in evaluation is the ROC curve. In many fields, a common aggregate measure is to report the area under the ROC curve, which is the ROC analog of MAP.
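Average precision and MAP for hypothetical rankings can be sketched as:

```python
def average_precision(ranked, relevant):
    """Average precision for one query: mean of the precision values
    at each rank where a relevant document appears, divided over all
    known relevant documents."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over several (ranking, relevant-set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Made-up ranking: relevant docs found at ranks 1 and 3.
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
# (1/1 + 2/3) / 2 ~ 0.83
```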
Assessing relevance

  • To properly evaluate a system, your test information needs must be germane to the documents in the test document collection, and appropriate for predicted usage of the system.
  • Given information needs and documents, you need to collect relevance assessments. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems, and perhaps other sources such as the results of Boolean keyword searches or documents found by expert searchers in an interactive process.
  • Humans and their relevance judgments are quite idiosyncratic and variable. In the social sciences, a common measure for agreement between judges is the kappa statistic.
  • The choice of assessor can make a considerable absolute difference to reported scores, but has in general been found to have little impact on the relative effectiveness ranking of either different systems or variants of a single system which are being compared for effectiveness.
Critiques and justifications of the concept of relevance

  • One clear problem with the relevance-based assessment that we have presented is the distinction between relevance and marginal relevance.
Results snippets

  • The two basic kinds of summaries are static, which are always the same regardless of the query, and dynamic (or query-dependent), which are customized according to the user’s information need as deduced from a query. Dynamic summaries attempt to explain why a particular document was retrieved for the query at hand.
  • Dynamic summaries are generally regarded as greatly improving the usability of IR systems, but they present a complication for IR system design.

Monday, February 10, 2014

Reading notes for Unit 5

11 Probabilistic information retrieval
In the Boolean or vector space models of IR, given the query and document representations, a system has an uncertain guess of whether a document has content relevant to the information need. Probability theory provides a principled foundation for such reasoning under uncertainty.
All the theories are based on the basic probabilistic theories.
The Binary Independence Model:
To make estimating the probability function P(R|d, q) practical, some simplifying assumptions are introduced:
documents and queries are both represented as binary term incidence vectors, and we assume that the relevance of each document is independent of the relevance of other documents.
Deriving a ranking function for query terms:
Given a query q, we wish to order returned documents by descending P(R = 1|d, q). Under the BIM, this is modeled as ordering by P(R = 1|~x,~q). Rather than estimating this probability directly, we can work with quantities that preserve the ranking, because we are interested only in the ranking of documents.
pt = P(xt = 1|R = 1,~q): probability of a term appearing in a document relevant to the query
ut = P(xt = 1|R = 0,~q): probability of a term appearing in a non-relevant document.
The ct terms are log odds ratios for the terms in the query:
ct = log[pt/(1 − pt)] + log[(1 − ut)/ut]
We can provide a theoretical justification for the most frequently used form of idf weighting:
if ut (the probability of term occurrence in nonrelevant documents for a query) is approximated by dft/N, then
log[(1 − ut)/ut] = log[(N − dft)/dft] ≈ log(N/dft)
Then we can use the frequency of term occurrence in known relevant documents
(if we know some).
It is perhaps the severity of the modeling assumptions that makes achieving good performance difficult.
Compared to other probabilistic approaches, such as the BIM, the main difference is that the LM approach does away with explicitly modeling relevance (whereas this is the central variable evaluated in the BIM approach).

12 Language models for information retrieval
A language model is a function that puts a probability measure over strings drawn from some vocabulary.
Types of language models
P(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t1t2)P(t4|t1t2t3)
The simplest form of language model simply throws away all conditioning context, and estimates each term independently. Such a model is called a unigram language model:
Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)
There are many more complex kinds of language models, such as bigram language models, which condition on the previous term,
 Pbi(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)
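A maximum-likelihood unigram model like Puni above can be estimated from simple counts; the toy text and query here are made up:

```python
from collections import Counter

def unigram_lm(text):
    """Maximum-likelihood unigram model: P(t) = count(t) / total tokens."""
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

def prob_uni(model, query):
    """Puni(q): product of the individual term probabilities."""
    p = 1.0
    for t in query.split():
        p *= model.get(t, 0.0)
    return p

m = unigram_lm("the cat sat on the mat")
print(prob_uni(m, "the cat"))  # (2/6) * (1/6) ~ 0.056
```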
The query likelihood
A document can be regarded as a language model: every document has its own term statistics, and the model assigns probabilities to sequences of terms.
For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process. The approach is to
1. Infer a LM for each document.
2. Estimate P(q|Mdi), the probability of generating the query according to each of these document models.
3. Rank the documents according to these probabilities.

Depending on which side is treated as the model and which as the observation, the language modeling approach can be varied into other related models.

Monday, February 3, 2014

Muddiest Points for Unit 4

  • The reason that we use log-frequency weighting is that we want to weaken the linear relation between frequency and weight. But I can’t see why using log is more accurate than linear.
  • Is there any rule for deciding when two terms should be combined, so that we could reduce the dimensionality of the vector space?

Sunday, February 2, 2014

Reading notes for Unit 4

  •     Complex queries could be achieved by processing Boolean queries.
  • The Boolean retrieval model contrasts with ranked retrieval models.
  •  Compared to Boolean queries, richer query models are needed for the following reasons:
    • Be tolerant to spelling mistakes and inconsistent choice of words.
    • The index has to be augmented to capture the proximities of terms in documents.
    • We need term frequency information in posting lists.
    • We wish to have an effective method to order (or “rank”) the returned results.
  • Parametric and zone indexes allow us to index and retrieve documents by metadata, and they give us a simple means for scoring (and thereby ranking) documents in response to a query.
  • The dictionary for a parametric index comes from a fixed vocabulary, while the dictionary for a zone index comprises whatever vocabulary stems from the text of that zone.
  • In fact, we can reduce the size of the dictionary by encoding the zone in which a term occurs in the postings.

  • Weighted zone scoring: Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the document contributes a Boolean value.
  • Learning weights: training examples consist of a query q and a document d. The weights gi are then “learned” from these examples.
  • Weighting the importance of a term in a document, based on the statistics of occurrence of the term.
  • Term frequency; inverse document frequency; tf-idf(t,d) = tf(t,d) × idf(t).
  •  By viewing each document as a vector of such weights, we can compute a score between a query and each document.
  • We denote by ~V(d) the vector derived from document d, with one component in the vector for each dictionary term. The standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations ~V(d1) and ~V(d2):
    • sim(d1, d2) = ~V(d1) · ~V(d2) / (|~V(d1)| |~V(d2)|)
  • Practical Benefit: Given a document d (potentially one of the di in the collection), consider searching for the documents in the collection most similar to d. Such a search is useful in a system where a user may identify a document and seek others like it.
  • Queries as vectors: Assign to each document d a score equal to the dot product ~v(q) ·~v(d).
  • In a typical setting we have a collection of documents each represented by a vector, a free text query represented by a vector, and a positive integer K. We seek the K documents of the collection with the highest vector space scores on the given query.
  • The process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known as accumulators.
  •  Variant of term-weighting for the vector space model.
    • Sublinear tf scaling
    • Normalize the tf weights of all terms occurring in a document by the maximum tf in that document
    • Pivoted document length normalization
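Several of the points above (tf-idf weighting and cosine similarity) fit into one small sketch; the toy documents and the base-10 log are assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors: tf-idf(t,d) = tf(t,d) * log10(N / df(t))."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))          # document frequency per term
    vocab = sorted(df)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * math.log10(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine(u, v):
    """sim(d1, d2) = V(d1) . V(d2) / (|V(d1)| |V(d2)|)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vocab, vecs = tfidf_vectors(["a b", "a c", "b b"])
print(vocab)  # ['a', 'b', 'c']
```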

Tuesday, January 28, 2014

Muddiest Points for Unit 3

One problem with the index terms is that some words occur with limited frequency across documents, while many words appear more frequently in specific kinds of documents. I was wondering whether words with similar meanings could be grouped together in the table, so that they would not need to list all the docs unrelated to their topics.

Friday, January 24, 2014

Reading Notes for Unit 3

Index Construction
Blocked sort-based indexing assembles all term-docID pairs from the collection, then uses an external sorting algorithm that relies on disk while minimizing the number of random disk seeks during sorting.
Single-pass in-memory indexing processes tokens one by one. When a term occurs for the first time, it is added to the dictionary and a postings list is created for it; subsequent occurrences of the term are appended to that list. This eliminates the expensive sorting step.
Distributed indexing is an application of MapReduce. It breaks a large computing problem into smaller parts by recasting it in terms of manipulation of key-value pairs. In the map phase, input data is split into key-value pairs. During the reduce phase, inverters collect all values (docIDs) for a given key (term ID).

Dynamic indexing has one simple way to do it. Periodically reconstruct the index from scratch.
Index compression
Two central data structures in information retrieval are the dictionary and the inverted index. Employing compression techniques for them is essential for an efficient IR system.
Heaps’ law is used to estimate the number of distinct terms. Even though vocabulary growth depends a lot on the nature of the collection and how it is processed, the law shows that the dictionary continues to grow as more documents are added to the collection, rather than reaching a maximum vocabulary size. The dictionary can be quite large for large collections.
Zipf’s law is used for modeling the distribution of terms.
Dictionary as a string: using fixed-width entries for terms is clearly wasteful, so we store one long string of characters. Blocked storage further compresses the dictionary by grouping terms. Be careful: there is a tradeoff between compression and the speed of term lookup.
Better compression with increased block size. If we have to partition the dictionary onto pages that are stored on disk, then we can index the first term of each page using a B-tree.
Variable encoding method use fewer bits for short gaps.

Variable byte (VB) code: use an integral number of bytes to encode a gap. The first bit of each byte is a continuation bit. VB codes use an adaptive number of bytes depending on the size of the gap. Bit-level codes adapt the length of the code at the finer-grained bit level. The simplest is unary code. γ codes implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset: the length of G in unary, and the offset of G in binary.
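A sketch of VB encoding and decoding, with the continuation bit set on the final byte of each gap as in the standard scheme:

```python
def vb_encode(n):
    """Variable byte encode one gap: 7 payload bits per byte, with the
    high bit set on the last (terminating) byte of the number."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the final byte
    return out

def vb_decode(stream):
    """Decode a byte stream of VB-encoded gaps back into numbers."""
    numbers, n = [], 0
    for b in stream:
        if b < 128:
            n = 128 * n + b          # intermediate byte
        else:
            n = 128 * n + (b - 128)  # final byte: emit the number
            numbers.append(n)
            n = 0
    return numbers

print(vb_encode(824))  # [6, 184]
print(vb_decode(vb_encode(5) + vb_encode(824)))  # [5, 824]
```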

Muddiest Points for Unit 2

Different punctuation marks can have different meanings, but it could be difficult for a system to automatically identify the right interpretation. I was wondering: is relying on experience a normal way, for example collecting statistical data and using it when deciding how to punctuate?


Analyzing the context of a paragraph could help when stemming. How effective is this in practice?

Friday, January 10, 2014

Reading Notes for Unit 2


1.2
Build inverted index:
There are 4 major steps of building an inverted index.
1) Collect the documents to be indexed;
2) Tokenize the text, turning each document into a list of tokens;
3) Do linguistic preprocessing;
4) Index the documents.
Input to indexing: a list of normalized tokens for each document (a list of pairs of term and docID).
Core indexing step: sorting the list so that the terms are alphabetical.
The inverted index structure is the most efficient structure for supporting ad hoc text search.
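The four steps can be followed in miniature (the toy documents are made up, and lowercasing stands in for the full linguistic preprocessing step):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Collect, tokenize, normalize (lowercase), and index documents:
    map each term to a sorted list of the docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    # Dictionary alphabetized, postings kept as sorted docID lists.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = ["Caesar was ambitious", "Brutus killed Caesar"]
print(build_inverted_index(docs)["caesar"])  # [0, 1]
```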
 


2
1, Define the basic unit of a document and define character sequence:
Convert byte sequence into linear sequence of characters. The first step would be to determine encoding.
Appropriate indexing granularity is essential.
 2, Determine the vocabulary of terms using tokenization and linguistic preprocessing:
1) Tokenization:
The major question is what the correct tokens to use are.
It requires the language of the document to be known. Handling hyphens can be complex, and splitting on white space can split what should be regarded as a single token. And each new language presents some new issues.
2) Dropping common terms: stop words
Some extremely common words appear to be of little value in helping select documents matching a user need.
One way to determine a stop list is to sort the terms by collection frequency. Much of the time, not indexing stop words does little harm.
3) Normalization:
Token normalization
Remove characters like hyphens, or maintain relations between unnormalized tokens.
Some issues: accents and diacritics; capitalization/case-folding; other issues in English; other languages.
4) Stemming and lemmatization:
It would be useful for a search for one of these words to return documents that contain another word in the set. The goal of stemming and lemmatization is to reduce inflectional forms of words, or simply to change words to their base form.
 

3, How to process posting list intersection in sublinear time?
Skip list: segmenting postings lists with skip pointers. This allows us to avoid processing parts of the postings list that will not figure in the search results.
Where do we place skips? A simple heuristic is to use √P evenly spaced skip pointers for a postings list of length P. Building effective skip pointers is easy if an index is relatively static.
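The intersection with √P skip spacing can be sketched as follows (the postings lists are made-up docIDs):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists; each list uses a skip
    spacing of sqrt(len), per the heuristic above."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow skip pointers while they stay below the target,
            # then advance one step.
            while i + skip1 < len(p1) and p1[i + skip1] < p2[j]:
                i += skip1
            i += 1
        else:
            while j + skip2 < len(p2) and p2[j + skip2] < p1[i]:
                j += skip2
            j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 16, 32], [1, 2, 3, 8, 34]))  # [2, 8]
```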
 4, To process two-word phrase queries, we treat each biword as a vocabulary term. Longer phrases can be processed by breaking them down.
Rather than a biword index, a positional index is most commonly employed. We still need to access the inverted index entries for each distinct term. Rather than simply checking that both terms are in a document, we need to check that their positions of appearance in the document are compatible with the phrase query being evaluated.
Combination schemes: the strategies of biword indexes and positional indexes can be fruitfully combined. A combination strategy uses a phrase index or just a biword index for certain queries and uses a positional index for other phrase queries.
 


3
1, The data structure to be used in vocabulary lookup operation is dictionary.
Two broad classes of solutions: hashing and search trees.

2, Two techniques to handle wildcard queries:
permuterm indexes: to find m*n, rotate the query to n$m*, where $ is the end-of-word symbol. However, this takes too much space.
 k-gram indexes: To find re*ve, issue the Boolean query $re AND ve$.

3, Two steps to correct spelling errors in queries (isolated-term correction): the 1st based on edit distance and 2nd based on k-gram overlap. The definition of edit distance: the minimum number of edit operations required to transform s1 into s2.
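The edit distance definition corresponds to the classic dynamic program (the example words are made up):

```python
def edit_distance(s1, s2):
    """Minimum number of insert/delete/replace operations required to
    transform s1 into s2, via the classic dynamic-programming table."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete everything
    for j in range(n + 1):
        dp[0][j] = j          # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost)  # replace / copy
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```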
Context sensitive spelling correction: enumerate corrections of each of the three query terms.


4,  Phonetic correction: for each term, generate a “phonetic hash” so that similar-sounding terms hash to the same value.
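One widely used phonetic hash is Soundex; a sketch of the classic scheme, slightly simplified (the strict standard's handling of h/w between consonants is not reproduced here):

```python
def soundex(term):
    """Classic-style Soundex hash: keep the first letter, map the rest
    to digit classes, drop vowels and adjacent duplicates, and
    pad/truncate to four characters."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    term = term.lower()
    result = term[0].upper()
    prev = codes.get(term[0], "")
    for ch in term[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]

# Similar-sounding names hash to the same value.
print(soundex("herman"), soundex("hermann"))  # H655 H655
```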

Muddiest Points for Unit 1


The huge amount of data we have now may exist in very complex formats. I doubt that such data can easily be converted into a single structured format. Also, much of the information could be redundant; how should we deal with that kind of data?