Friday, March 28, 2014

Reading notes for Unit 10

Information retrieval systems often have to deal with very large amounts of data. They must be able to process many gigabytes or even terabytes of text, and to build and maintain an index for millions of documents.
Parallel query processing is an efficient way to evaluate a query against a very large index. I learned that in practice, a search engine retrieves data from huge posting and index files, so parallel query processing becomes necessary: the index is split across machines that work simultaneously, and each returns its top-k results to a central server for merging.
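A minimal sketch of the merge step at the central server, assuming each shard has already computed its own top-k list (the shard contents, scores, and document IDs below are made up for illustration):

```python
import heapq

# Hypothetical per-shard results: each shard returns its own top-k
# (score, doc_id) pairs. All names and scores are illustrative.
shard_results = [
    [(0.92, "d3"), (0.85, "d7"), (0.40, "d1")],   # shard 0
    [(0.88, "d12"), (0.61, "d9")],                 # shard 1
    [(0.95, "d20"), (0.30, "d15")],                # shard 2
]

def merge_top_k(shard_results, k):
    """Central server merges per-shard top-k lists into a global top-k."""
    all_hits = [hit for shard in shard_results for hit in shard]
    return heapq.nlargest(k, all_hits)  # highest scores first

print(merge_top_k(shard_results, 3))
```

Each shard only needs to ship k results over the network, so the merge cost at the central server is small compared with scanning the full postings.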

I also learned about redundancy and fault tolerance in distributed search engines, index construction, and statistical analysis of a corpus of text.

Muddiest Points for Unit 9

I do not have any questions for this class.

Friday, March 21, 2014

Reading notes for Unit 9

19 Web search basics
A user searching for maui golf real estate is not merely seeking news or entertainment on the subject of housing on golf courses on the island of Maui, but is instead likely seeking to purchase such a property.
It is crucial that we understand the users of web search as well. This is again a significant change from traditional information retrieval, where users were typically professionals with at least some training in the art of phrasing queries over a well-authored collection whose style and structure they understood well. In contrast, web search users tend to not know (or care) about the heterogeneity of web content, the syntax of query languages and the art of phrasing queries; indeed, a mainstream tool (as web search has come to become) should not place such onerous demands on billions of people. To a first approximation, comprehensiveness grows with index size, although it does matter which specific pages a search engine indexes – some pages are more informative than others. It is also difficult to reason about the fraction of the Web indexed by a search engine, because there is an infinite number of dynamic web pages.

21 Link analysis
In this chapter what I have learned is that the analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search. In this chapter we focus on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query.

Link analysis for web search has intellectual antecedents in the field of citation analysis, aspects of which overlap with an area known as bibliometrics. These disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of citations amongst them. Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from a web page to another as a conferral of authority. Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough.

Paper: Authoritative Sources in a Hyperlinked Environment
The structure of a hyperlinked environment can be a rich source of information, and the content of the environment provides effective means for understanding it. The author develops algorithmic tools for extracting information from the link structures of such environments, and reports on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue is the distillation of broad search topics through the discovery of "authoritative" information sources on such topics. The author proposes an algorithmic formulation based on the relationship between a set of relevant authoritative pages and a set of "hub pages" that join them together in the link structure. The formulation has connections to the eigenvectors of certain matrices associated with the link graph, which in turn motivate additional heuristics for link-based analysis. This is the main point of the paper.
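The hub/authority computation from this paper (Kleinberg's HITS algorithm) can be sketched on a toy link graph; the graph and node names below are made up for illustration, not taken from the paper:

```python
# A minimal HITS (hubs and authorities) sketch on a toy link graph.
graph = {
    "a": ["b", "c"],   # page "a" links to "b" and "c"
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def hits(graph, iterations=50):
    nodes = list(graph)
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to the node.
        auth = {n: sum(hub[m] for m in nodes if n in graph[m]) for n in nodes}
        # Hub score: sum of authority scores of the pages the node links to.
        hub = {n: sum(auth[m] for m in graph[n]) for n in nodes}
        # Normalize (L2) so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

hub, auth = hits(graph)
# "c" receives the most in-links in this toy graph, so it should
# emerge as the top authority.
```

The repeated mutual update is exactly the eigenvector connection the paper points out: the scores converge to the principal eigenvectors of the link matrices.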


Paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine

Saturday, March 1, 2014

Reading notes for Unit 8

1, Human-Computer Interaction

Design Principles

Offer informative feedback.
Reduce working memory load.
Provide alternative interfaces for novice and expert users.

The Role of Visualization

Brushing and linking refers to the connecting of two or more views of the same data. A mouse-click on the title of the chapter causes the text of the chapter itself to appear in another window, in a linking action. Panning and zooming or focus-plus-context can be used to change the view of the contents within the overview window. Additionally, there are a large number of graphical methods for depicting trees and hierarchies.

Evaluating Interactive Systems

Precision and recall measures have been widely used for comparing the ranking results of non-interactive systems, but are less appropriate for assessing interactive systems. Empirical data involving human users is time consuming to gather and difficult to draw conclusions from.

2, The Information Access Process

Information access process :


1. Start with an information need.
2. Select a system and collections to search on.
3. Formulate a query.
4. Send the query to the system.
5. Receive the results in the form of information items.
6. Scan, evaluate, and interpret the results.
7. Either stop, or,
8. Reformulate the query and go to step 4.

From these observations it is convenient to divide the entire information access process into two main components: search/retrieval, and analysis/synthesis of results. User interfaces should allow both kinds of activity to be tightly interwoven, rather than requiring that they be done independently.

3, Starting Points

The meanings of category labels differ somewhat among collections. Most are designed to help organize the documents and to aid in query specification. Most interfaces that depict category hierarchies graphically do so by associating a document directly with the node of the category hierarchy to which it has been assigned.

4, Query Classification

  • 1. Boolean Queries
  • 2. From Command Lines to Forms and Menus
  • 3. Faceted Queries
  • 4. Graphical Approaches to Query Specification
  • 5. Phrases and Proximity
  • 6. Natural Language and Free Text Queries

5, Context

Interface techniques can place the current document set in the context of other information types, in order to make the document set more understandable.

Document Surrogates



The most common way to show results for a query is to list information about documents in order of their computed relevance to the query. Alternatively, for pure Boolean ranking, documents are listed according to a metadata attribute, such as date.

Query Term Hits Within Document Content

  • KWIC
  • A facility related to highlighting is the keyword-in-context (KWIC)  document surrogate. Sentence fragments, full sentences, or groups of sentences that contain query terms are extracted from the full text and presented for viewing along with other kinds of surrogate information (such as document title and abstract).
  • TileBars
  • The user enters a query in a faceted format, with one topic per line. After the system retrieves documents (using a quorum or statistical ranking algorithm), a graphical bar is displayed next to the title of each document showing the degree of match for each facet.
  •  SeeSoft
  • The SeeSoft visualization represents text in a manner resembling columns of newspaper text, with one `line' of text on each horizontal line of the strip. 
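The KWIC surrogate described above can be sketched with a simple character window around each query-term occurrence (real systems extract sentence fragments, but the idea is the same; the sample document below is illustrative):

```python
import re

def kwic(text, term, window=30):
    """Keyword-in-context: return each occurrence of `term` with
    `window` characters of surrounding text on either side."""
    hits = []
    for m in re.finditer(re.escape(term), text, re.IGNORECASE):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        hits.append("..." + text[start:end] + "...")
    return hits

doc = ("Link analysis on the Web treats hyperlinks as a conferral of "
       "authority. Not every hyperlink implies such authority conferral.")
for line in kwic(doc, "authority", window=20):
    print(line)
```

Each extracted fragment would be shown alongside the usual surrogate information such as the document title and abstract.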

Query Term Hits Between Documents

  • InfoCrystal
  • The InfoCrystal shows how many documents contain each subset of query terms.
  • VIBE and Lyberworld
  • In these displays, query terms are placed in an abstract graphical space.
  • Lattices
  • Several researchers have employed a graphical depiction of a mathematical lattice for the purposes of query formulation, where the query consists of a set of constraints on a hierarchy of categories (actually, semantic attributes in these systems)
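The subset counts that InfoCrystal displays (how many documents contain each subset of query terms) can be computed directly; the toy collection below is made up for illustration:

```python
from itertools import combinations

# Toy collection: which query terms each document contains (illustrative).
query_terms = ("web", "search", "ranking")
docs = {
    "d1": {"web", "search"},
    "d2": {"web", "search", "ranking"},
    "d3": {"ranking"},
    "d4": {"web"},
}

def subset_counts(query_terms, docs):
    """For each non-empty subset of query terms, count the documents
    containing exactly that subset of the query terms."""
    counts = {}
    for r in range(1, len(query_terms) + 1):
        for subset in combinations(query_terms, r):
            counts[subset] = sum(
                1 for terms in docs.values()
                if terms.intersection(query_terms) == set(subset)
            )
    return counts

counts = subset_counts(query_terms, docs)
```

For N query terms there are 2^N - 1 non-empty subsets, which is why InfoCrystal's display is practical only for a handful of terms.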

Using Hyperlinks to Organize Retrieval Results

Tables

Tabular display is another approach for showing relationships among retrieved documents.

6, Using Relevance Judgements

Interfaces for Standard Relevance Feedback

A standard interface for relevance feedback consists of a list of titles with checkboxes beside the titles that allow the user to mark relevant documents. This can imply either that unmarked documents are not relevant or that no opinion has been made about unmarked documents, depending on the system. Another option is to provide a choice among several checkboxes indicating relevant or not relevant (with no selection implying no opinion).

Studies of User Interaction with Relevance Feedback Systems

Standard relevance feedback assumes the user is involved in the interaction by specifying the relevant documents. In some interfaces users are also able to select which terms to add to the query.

Fetching Relevant Information in the Background

Standard relevance feedback is predicated on the goal of improving an ad hoc query or building a profile for a routing query. More recently researchers have begun developing systems that monitor users' progress and behavior over long interaction periods in an attempt to predict which documents or actions the user is likely to want in future. 

Group Relevance Judgements

Recently there has been much interest in using relevance judgements from a large number of different users to rate or rank information of general interest.

Pseudo-Relevance Feedback

Muddiest Points for Unit 7

I feel like it is hard to decide how to deal with different feedback. Especially the implicit feedback. In interactive IR, what kind of activity should give a positive feedback score and what kind of activity should give a negative feedback?

Friday, February 21, 2014

Reading notes for Unit 7

Relevance feedback and query expansion
query refinement

Global methods include:
• Query expansion/reformulation with a thesaurus or WordNet
• Query expansion via automatic thesaurus generation
• Techniques like spelling correction
Local methods include:
• Relevance feedback
• Pseudo relevance feedback, also known as blind relevance feedback
• (Global) indirect relevance feedback

The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set.

The basic procedure is:
• The user issues a (short, simple) query.
• The system returns an initial set of retrieval results.
• The user marks some returned documents as relevant or non-relevant.
• The system computes a better representation of the information need based on the user feedback.
• The system displays a revised set of retrieval results.

Rocchio algorithm
The Rocchio Algorithm is the classic algorithm for implementing relevance feedback. It models a way of incorporating relevance feedback information into the vector space model.
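A minimal sketch of the Rocchio update q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant); the alpha/beta/gamma values below are conventional defaults rather than prescribed ones, and the sample vectors are made up for illustration:

```python
# Minimal Rocchio relevance-feedback update in the vector space model.
# Vectors are dicts mapping term -> weight.
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector
    q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    new_q = {}
    def add(vec, weight):
        for term, w in vec.items():
            new_q[term] = new_q.get(term, 0.0) + weight * w
    add(query, alpha)
    for doc in relevant:
        add(doc, beta / len(relevant))
    for doc in nonrelevant:
        add(doc, -gamma / len(nonrelevant))
    # Negative term weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"golf": 1.0, "maui": 1.0}
rel = [{"golf": 0.5, "estate": 0.8}]
nonrel = [{"news": 0.9}]
print(rocchio(q, rel, nonrel))
```

Note how the update can pull new terms (here "estate") into the query even though the user never typed them, which is how relevance feedback improves recall.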

Relevance feedback can improve both recall and precision.

Probabilistic relevance feedback

Rather than reweighting the query in a vector space, if a user has told us some relevant and nonrelevant documents, then we can proceed to build a classifier.
These methods use only collection statistics and information about the term distribution within the documents judged relevant. They preserve no memory of the original query.
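One classic instance of such a weighting (not named in these notes, so treat the choice as an assumption) is the Robertson/Sparck Jones relevance weight, which is built purely from collection statistics and counts over the judged-relevant documents:

```python
import math

def rsj_weight(r, R, n, N):
    """Robertson/Sparck Jones term weight from relevance judgements:
    r = relevant docs containing the term, R = total relevant docs,
    n = docs in the collection containing the term, N = collection size.
    The 0.5 terms are the usual smoothing to avoid zeros."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# A term occurring in most relevant docs but few docs overall
# gets a high positive weight.
print(rsj_weight(r=8, R=10, n=20, N=1000))
```

A term common in the relevant set but rare in the collection scores high; a term no more frequent among relevant documents than in the collection at large scores near or below zero.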

Cases where relevance feedback alone is not sufficient include:
• Misspellings.
• Cross-language information retrieval.
• Mismatch of searcher’s vocabulary versus collection vocabulary.



In addition, the relevance feedback approach requires relevant documents to be similar to each other.

Muddiest Points for Unit 6

My question is: since we only use average precision, why do we still need to calculate recall?