IR Systems and Web Search
By Sri Harsha Tumuluri (UNI: st2653)
World Wide Web
The Web is becoming a universal repository of human knowledge and culture.
It allows ideas and information to be shared on an unprecedented scale.
Any user can create their own Web documents and make them point to any other Web documents without restriction.
Home shopping and home banking are becoming very
popular and have generated several hundred million
dollars in revenues.
The Web has Problems!
Despite so much success, the Web has introduced new
problems of its own.
Finding useful information on the Web is frequently a
tedious and difficult task.
There is no fundamental data model for the Web, which means the definition and structure of information are often of low quality.
These difficulties have attracted renewed interest in IR and its technology as a promising solution.
Information Retrieval takes center stage
What is an Information Retrieval System?
An Information Retrieval System is capable of storage,
retrieval and maintenance of information.
Information can be text, images, audio, video, and other multimedia objects.
An information retrieval process begins when a user
enters a query into the system.
Several objects may match the query, perhaps with
different degrees of relevancy.
Indexing documents
Indexes are one of the most important components of IR systems and web search.
Indexing a collection gives us the ability to fetch the required documents quickly.
It improves query efficiency many times over, but building indexes over a collection takes a lot of time and memory.
When creating indexes for a web search engine, we have to crawl all the documents on the World Wide Web.
Creating indexes (Basic idea)
1. Collect all the documents to be indexed.
2. Tokenize the text in each of the documents. (Tokens can be thought of as the words in the document.)
3. Do some processing on the list of words to turn them into index terms.
4. Index the documents that each term occurs in by creating an inverted index (a sketch follows below).
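A minimal sketch of these four steps in Python (the sample documents and function names are invented for illustration; step 3 is reduced to lower-casing for brevity):

```python
import re
from collections import defaultdict

def tokenize(text):
    # Step 2: lower-case and split on non-alphanumeric characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_inverted_index(docs):
    # docs: dict mapping doc_id -> raw text.
    index = defaultdict(set)              # term -> set of doc IDs (postings)
    for doc_id, text in docs.items():
        for token in tokenize(text):      # Steps 2-3
            index[token].add(doc_id)      # Step 4: record the posting
    # Keep postings sorted so they can be merged efficiently later.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Brutus killed Caesar",
    2: "Calpurnia warned Caesar",
    3: "Brutus and Calpurnia",
}
index = build_inverted_index(docs)
print(index["brutus"])     # [1, 3]
print(index["calpurnia"])  # [2, 3]
```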
Processing of words…
Tokenization
This is the process in which we break down the contents of documents into words, phrases, symbols, or other meaningful elements called tokens.
Stop words
There is a set of words in the English language that are considered very common.
Examples of stop words are: a, an, the, which, that, is, at, etc.
Processing of words…
Stop words contd..
These words have no specific meaning and do not convey anything
important or unique about a sentence or document.
Sounds good?
Hold on! But can we really eliminate stop words?
In some cases we cannot eliminate stop words, e.g., for queries like:
1. To be or not to be
2. The Who
3. The Rock
So stop-word elimination should be done very carefully, keeping cases like these in mind (a small illustration follows below).
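A tiny illustration of this caveat (the stop-word list below is a toy example, not a standard list): after removal, a query like [to be or not to be] has nothing left to search on.

```python
STOP_WORDS = {"a", "an", "the", "which", "that", "is", "at", "to", "be", "or", "not", "who"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("to be or not to be".split()))  # [] -- the entire query disappears
print(remove_stop_words("the who".split()))             # [] -- the band name is lost
print(remove_stop_words("the rock".split()))            # ['rock'] -- the intent changes
```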
Processing of words…
Stemming
Stemming is the process of reducing words to their stem, base, or root form – generally a written word form.
The stem need not be identical to the morphological root
of the word; it is usually sufficient that related words map
to the same stem.
Words like Cat, Cats, Catlike, Cat's, Cats' will stem to the word Cat.
Words like Institute, Institutionalization, Institutional will all stem to a common root.
Processing of words…
Stemming contd..
Stemming makes it easier to group words or tokens that are derived from the same word.
Many search engines treat words with the same stem as synonyms.
Problems with Stemming:
Can you stem query terms like [Larry Page], [Steve Jobs], [Bill Gates]? (See the toy sketch below.)
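A toy suffix-stripping sketch of the idea (this is not the Porter algorithm or any production stemmer; the suffix rules are invented for illustration):

```python
# Strip a few common suffixes so related word forms map to one stem.
SUFFIXES = ["'s", "s'", "like", "es", "s"]

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["Cat", "Cats", "Catlike", "Cat's", "Cats'"]:
    print(w, "->", toy_stem(w))   # all map to "cat"

# The drawback raised above: names get mangled too.
print(toy_stem("Jobs"))           # "job" -- bad for a query like [Steve Jobs]
```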
Processing of words…
Thesaurus Expansions
Pick words from the query and then expand them using a good thesaurus.
Example: A person searching for the query [Cars] would also like to see results for [Automobiles].
Problems!
If the query is short and not descriptive, the results can drift in an undesirable direction.
Ex: [Mining] would be expanded to [Data Mining] & [Gold Mining] (see the sketch below).
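A minimal sketch of thesaurus-based query expansion (the synonym table and function name are invented for illustration; a real system would use a resource such as WordNet):

```python
# Tiny hand-built thesaurus mapping a term to related terms.
THESAURUS = {
    "cars": ["automobiles"],
    "mining": ["data mining", "gold mining"],  # shows how a short query can drift
}

def expand_query(query):
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(THESAURUS.get(term, []))
    return expanded

print(expand_query("cars"))    # ['cars', 'automobiles']
print(expand_query("mining"))  # ['mining', 'data mining', 'gold mining'] -- drift
```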
Inverted index
Documents are broken down into words and
they are saved along with document IDs.
Inverted index
Postings lists: words and the document IDs they occur in (figure: http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html)
So if we want to fetch results for the query [Brutus Calpurnia], we take the intersection of the postings lists of the words [Brutus] and [Calpurnia], as in the sketch below.
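A sketch of this intersection as a linear merge of two sorted postings lists (the document IDs below are example values):

```python
def intersect(p1, p2):
    # Walk both sorted postings lists in parallel, keeping common doc IDs.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus    = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31] -> documents matching [Brutus Calpurnia]
```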
Different Models of IR systems
Boolean Model
Vector Space Model (Ranked retrieval model)
Probabilistic Model ( Ranked retrieval model )
Boolean Model
The Boolean model is one of the earliest Information Retrieval models and is very simple to understand and use.
The user specifies queries with keywords combined using operators such as AND, OR, and NOT.
The Boolean model is based on set theory.
AND (∧): intersection of two sets.
OR (∨): union of two sets.
NOT (¬): complement of a set, i.e., all the documents that do not contain a particular word.
Boolean Model Example
1. Want to get results that contain both the terms "Data" and "Mining": Query = [data AND mining]
2. Want to get results that contain either of the terms "Data" and "Mining": Query = [data OR mining]
3. Similarly, if a user wants to find all the documents that contain the terms X and Y but not Z, they would form the query = [X AND Y NOT Z]
A sketch of evaluating such queries with set operations follows below.
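A sketch of these Boolean queries evaluated as set operations over postings (the postings sets below are invented for illustration):

```python
# Postings represented as Python sets of document IDs.
postings = {
    "data":   {1, 2, 5, 7},
    "mining": {2, 3, 7, 9},
    "gold":   {3, 7},
}

print(postings["data"] & postings["mining"])   # AND: intersection -> {2, 7}
print(postings["data"] | postings["mining"])   # OR:  union        -> {1, 2, 3, 5, 7, 9}
print((postings["data"] & postings["mining"]) - postings["gold"])  # X AND Y NOT Z -> {2}
```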
Boolean Model Drawbacks
It only returns documents based on exact matching; that is, it only checks whether a document contains the query terms or not.
It cannot rank the documents by relevance.
When the query has too many conditions it becomes very difficult to understand, as the operators and words get cluttered.
However, it is still used by many power users such as librarians and researchers, as it gives them the feeling of being in control of the retrieval process.
Vector Space Model (Ranked retrieval)
In the vector space model, documents/webpages are represented by a vector of terms.
If words are chosen as terms, then every word in the vocabulary becomes an independent dimension in a very high-dimensional vector space.
Even the query is considered a vector.
Each dimension has a weight associated with it. The weight of a component in a document vector is zero if the corresponding word is not present in that document.
Vector Space Model (Ranked retrieval)
The weights of the terms are not inherent in the model, but we can use a weighting scheme such as tf-idf.
To assign a numeric score to a document for a query, the
model measures the similarity between the query vector
and the document vector.
Typically, the angle between two vectors is used as a
measure of divergence between the vectors, and cosine
of the angle is used as the numeric similarity.
Cosine is 1 for identical vectors and 0 for orthogonal
vectors.
Vector Space Model (Ranked retrieval)
Scoring of document is computed as follows:
Score(q, d) = ( V(q) · V(d) ) / ( |V(q)| |V(d)| )
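A minimal sketch of this cosine scoring using raw term-frequency weights (tf-idf or any other weighting scheme could be plugged in; the vocabulary and texts are invented for illustration):

```python
import math
from collections import Counter

def to_vector(text, vocab):
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]   # term-frequency weights, 0 for absent terms

def score(v_q, v_d):
    # Score(q, d) = (V(q) . V(d)) / (|V(q)| |V(d)|)
    dot    = sum(a * b for a, b in zip(v_q, v_d))
    norm_q = math.sqrt(sum(a * a for a in v_q))
    norm_d = math.sqrt(sum(b * b for b in v_d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

vocab = ["data", "mining", "gold", "web"]
query = to_vector("data mining", vocab)
doc   = to_vector("data mining and web mining", vocab)
print(round(score(query, doc), 3))  # ~0.866: 1 for identical directions, 0 for orthogonal vectors
```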
Demerits
It considers a document as a list of words and does not consider word ordering. Ex: "New York" is split into the two words New and York and then processed separately.
The vectors are usually huge and most of their components are zero. The query vector then has to be compared against every document vector to find relevant results.
Conclusion
The field of information retrieval has come a long way in the
last forty years, and has enabled easier and faster
information discovery.
Techniques developed in the field have been used in many
other areas and have yielded many new technologies which
are used by people on an everyday basis, e.g., web search
engines, junk-email filters, news clipping services.
Going forward, the field is attacking many critical problems that users face in today's information-ridden world.
With exponential growth in the amount of information available, information retrieval will play an increasingly important role in the future.
References
[1] http://people.ischool.berkeley.edu/~hearst/irbook/1/node1.html
[2] http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
[3] http://en.wikipedia.org/wiki/Stemming
[4] http://www.cs.columbia.edu/~gravano/cs6111/Lectures/lect03-2.pdf
[5] http://nlp.stanford.edu/IR-book/html/htmledition/normalization-equivalence-classing-of-terms-1.html
[6] http://comminfo.rutgers.edu/~aspoerri/InfoCrystal/Ch_2.html#2.3.1
[7] Modern Information Retrieval by Amit Singhal.