Information Retrieval

Chapter 2 by Rajendra Akerkar, Pawan Lingras
Presented by:
Xxxxxx
Information retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually
stored on computers).
Process: Information Retrieval
Figure 2.1: Transforming a text document to a weighted list of keywords
1. The first step in transforming a document is simply to list all the words in the document.
2. The second step is the removal of some of the most commonly occurring words.
Data Mining has emerged as one of the most exciting and dynamic
fields in computing science. The driving force for data mining is
the presence of petabyte-scale online archives that potentially
contain valuable bits of information hidden in them. Commercial
enterprises have been quick to recognize the value of this
concept; consequently, within the span of a few years, the
software market itself for data mining is expected to be in excess
of $10 billion. Data mining refers to a family of techniques used
to detect interesting nuggets of relationships/knowledge in data.
While the theoretical underpinnings of the field have been around
for quite some time (in the form of pattern recognition,
statistics, data analysis and machine learning), the practice and
use of these techniques have been largely ad-hoc. With the
availability of large databases to store, manage and assimilate
data, the new thrust of data mining lies at the intersection of
database systems, artificial intelligence and algorithms that
efficiently analyze data. The distributed nature of several
databases, their size and the high complexity of many techniques
present interesting computational challenges.
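The first two steps can be sketched in a few lines of Python, applied to the opening sentence of the sample text above. This is only a minimal illustration: the `STOP_WORDS` set is a small hypothetical list chosen for this example, and the function names are assumptions, not part of the chapter.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# practical systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "as", "be", "for", "has", "in",
              "is", "most", "of", "one", "the", "to"}

def list_words(text):
    """Step 1: list all the words in the document, lower-cased."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(words):
    """Step 2: drop the most commonly occurring words."""
    return [w for w in words if w not in STOP_WORDS]

sample = ("Data Mining has emerged as one of the most exciting "
          "and dynamic fields in computing science.")
words = list_words(sample)
keywords = remove_stop_words(words)
print(keywords)
```

The remaining words are the candidates that later steps turn into weighted keywords.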

A given word may occur in a variety of syntactic forms:
◦ plurals
◦ past tense
◦ gerund forms (a noun derived from a verb)

The word connect may appear as:
◦ connector, connection, connections, connected, connecting, connects, preconnection, and postconnection

A stem is what is left after a word's affixes (prefixes and suffixes) are removed:
◦ or, ion, s, ed, and ing are suffixes
◦ pre and post are prefixes



Use of stems may arguably improve retrieval performance.
Users rarely specify the exact form of the word they are looking for.
It is reasonable to retrieve documents that contain related forms of the word.
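As a rough illustration of stemming, the sketch below strips only the prefixes and suffixes named for the connect example. It is a naive affix stripper written for this example, not the stemming algorithm used in the chapter; `naive_stem` and `min_stem` are hypothetical names, and real stemmers apply far more careful rules (this one would, for instance, reduce "present" to "sent").

```python
# Affixes taken from the connect example; longer suffixes are listed first
# so that "ions" is tried before "ion" and "s".
PREFIXES = ("post", "pre")
SUFFIXES = ("ions", "ion", "ing", "ed", "or", "s")

def naive_stem(word, min_stem=3):
    """Strip at most one prefix and one suffix, keeping at least min_stem letters."""
    word = word.lower()
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word

forms = ["connector", "connection", "connections", "connected",
         "connecting", "connects", "preconnection", "postconnection"]
stems = [naive_stem(w) for w in forms]
print(stems)
```

Every form in the list reduces to the stem connect, so a query for any one of them could match documents containing the others.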
Calculating frequency of each word
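Counting how often each word (or stem) occurs can be sketched with Python's standard `collections.Counter`. The token list below is made up for illustration and is not taken from the chapter.

```python
from collections import Counter

def word_frequencies(words):
    """Return a mapping from each word to the number of times it occurs."""
    return Counter(words)

# Hypothetical list of stems left after stop-word removal and stemming.
tokens = ["data", "mine", "data", "field", "mine", "data"]
freq = word_frequencies(tokens)
print(freq.most_common())
```

These raw counts can be used directly as keyword weights or as the input to a weighting scheme.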
Term Document Matrix
• Term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent various documents.
• Columns correspond to various index terms.
• Values in the matrix can be either the frequency or weight of the index term (identified by the column) in the document (identified by the row).
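A small sketch of building such a matrix from raw frequencies follows, with one row per document and one column per index term as described above. The three token lists and the `build_tdm` helper are assumptions made for this example.

```python
from collections import Counter

def build_tdm(docs):
    """Build a term-document matrix from a list of token lists.

    Returns (terms, matrix): terms is the sorted list of index terms (columns),
    and matrix[i][j] is the frequency of terms[j] in document i (rows).
    """
    terms = sorted({token for doc in docs for token in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[term] for term in terms])
    return terms, matrix

# Hypothetical, already pre-processed documents.
docs = [
    ["data", "mine", "data"],
    ["retrieve", "document", "data"],
    ["document", "index", "term", "document"],
]
terms, tdm = build_tdm(docs)
print(terms)
for row in tdm:
    print(row)
```

Replacing the raw counts with weights would change only the values stored in the matrix, not its layout.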
Thank You