Chapter 14 - VCU DMB Lab.


Chapter 14
TEXT MINING
Cios / Pedrycz / Swiniarski / Kurgan
Presented by: Yulong Zhang
Nov. 16, 2011
© 2007 Cios / Pedrycz / Swiniarski / Kurgan
Outline
• Introduction
• Information Retrieval
– Definition
– Architecture of IR Systems
– Linguistic Preprocessing
– Measures of Text Retrieval
– Vector-Space Model
– Text Similarity Measures
– Misc
• Improving Information Retrieval Systems
– Latent Semantic Indexing
– Relevance Feedback
What we have learned – structured data
Feature extraction
Feature selection
Discretization
Clustering
Classification
……
But what should we do for these data?
Introduction
So far we have focused on data mining methods for analysis and extraction of useful knowledge from structured data such as flat files and relational or transactional databases.
In contrast, text mining is concerned with the analysis of text databases that consist mainly of semi-structured or unstructured data, such as:
– collections of articles, research papers, e-mails, blogs, discussion forums, and WWW pages
Introduction
Semi-structured data is neither completely structured
(e.g., relational table) nor completely unstructured (e.g.,
free text)
– a semi-structured document has:
• some structured fields such as title, list of authors, keywords,
publication date, category, etc.
• and some unstructured fields such as abstract and contents
Important differences between the two:
• The number of features
– structured data: ~10^0 to ~10^2 features
– semi-structured data: ~10^3 features
• Features of semi-structured data are sparse
• Rapid growth in size (only a tiny portion is relevant, and even less is useful)
Introduction
There are three main types of retrieval within the
knowledge discovery process framework:
– data retrieval - retrieval of structured data from DBMS and data
warehouses (Chapter 6)
SELECT * FROM xx WHERE yy = zz
– information retrieval - concerns organization and retrieval of
information from large collections of semi-structured or
unstructured text-based databases and the web
– knowledge retrieval - generation of knowledge from (usually)
structured data (Chapters 9-13)
IF xx THEN yy ELSE zz
Information Retrieval
...[the] actions, methods and procedures for
recovering stored data to provide information
on a given subject
ISO 2382/1 (1984)
Information Retrieval
Definitions
– database
• a collection of documents
– document
• a sequence of terms in a natural language that expresses ideas
about some topic
– term
• a semantic unit, a word, phrase, or root of a word
– query
• a request for documents that cover a particular topic of interest to
the user of an IR system
[Figure: nesting of the definitions – a term (e.g., “Obama”) occurs in a document, and a collection of documents forms a database]
Information Retrieval
Definitions
– IR system
• its goal is to find relevant documents in response to the user’s
request
• performs matching between the language of the query and the
language of the document
– simple word matching does not work since the same word has many different semantic meanings, e.g., MAKE:
“to make a mistake”
“make of a car”
“to make up excuses”
“to make for an exit”
“it is just a make-believe”
Also, one word may have many morphological variants: make, makes, made, making
Information Retrieval
Definitions
• other problems
– consider query “Abraham Lincoln”
should it return a document that contains the following sentences:
“Abraham owns a Lincoln. It is a great car.”?
– consider query “what is the funniest movie ever made”: how can the IR system know what the user’s idea of a funny movie is?
– word order alone can change the meaning completely:
“No wait! Data mining is an interesting course!”
“Wait! Is data mining an interesting course? No!”
• difficulties include inherent properties of natural language, high
expectations of the user, etc.
– a number of mechanisms were developed to cope with them
Information Retrieval
Cannot provide one “best” answer to the user
– many algorithms provide one “correct” answer, such as
SELECT Price FROM Sales WHERE Item = “book”
or: find shortest path from node A to node B
– IR, on the other hand, provides a range of possibly best answers and lets the user choose
• query “Lincoln” may return information about
– Abraham Lincoln
– Lincoln dealership
– The Lincoln Memorial
– The University of Nebraska-Lincoln
– The Lincoln University in New Zealand
IR systems do not give just one right answer but perform an approximate search that returns multiple, potentially correct answers.
Information Retrieval
IR system provides information based on the stored data
– the key is to provide some measurement of relevance
between the stored data and the user’s query
• i.e., the relation between requested information and retrieved
information
• given a query, the IR system has to check whether the stored
information is relevant to the query
Information Retrieval
IR system provides information based on the stored data
– IR systems use heuristics to find relevant information
• they find a “close” answer and use a heuristic to measure its closeness to the “right” answer
• the inherent difficulty is that very often we do not know what the
right answer is!
– we just measure how close to the right answer we can come
– the solution involves using measures of precision and recall
• they are used to measure the “accuracy” of IR systems
Will be discussed later
Outline
• Introduction
• Information Retrieval
– Definition
– Architecture of IR Systems
– Linguistic Preprocessing
– Measures of Text Retrieval
– Vector-Space Model
– Text Similarity Measures
– Misc
• Improving Information Retrieval Systems
– Latent Semantic Indexing
– Relevance Feedback
Architecture of IR Systems
[Figure: architecture of an IR system. Source documents (web pages such as “Welcome to Lincoln Center for Perfor…”, “Welcome to the Univers. of Nebraska”, “March 31, 1849 Lincoln returns…”) are stored in the database as tagged data, e.g.
D1: <DOC> <DOCNO> 1</DOCNO><TEXT>Lincoln Park Zoo is every …</DOC>
D2: <DOC> <DOCNO> 2</DOCNO><TEXT>This website includes a biography, photographs, …</DOC>
and as textual data, e.g.
D1: Lincoln Park Zoo is everyone’s zoo …
D2: This website includes a biography, photographs, and lots …
D3: Biography of Abraham Lincoln, the sixteenth President …
From the textual data an inverted index is built, e.g.
lincoln: D1, D2, D13, D54, …   zoo: D2, D43, D198, …   website: D1, D2, D3, D4, …   university: D4, D8, D14, …
The user’s query (“where is the University of Nebraska Lincoln?”) is transformed into weighted terms (where, university, nebraska, lincoln, with the weights 0.15, 0.10, 0.14, 0.75 as listed in the figure). A similarity measure compares the transformed query with each document – similarity(D1, query), similarity(D2, query), … – and the system returns a list of ranked documents, e.g. document D4, document D52, document D12, document D134, …]
Architecture of IR Systems
– Search database
• is organized as an inverted index file of the significant
character strings that occur in the stored tagged data
– the inverted file specifies where the strings occur in the text
– the strings include words after excluding determiners,
conjunctions, and prepositions - known as STOP-WORDS
» determiner is a non-lexical element preceding a noun in a noun
phrase, e.g., the, that, two, a, many, all
» conjunction is used to combine multiple sentences, e.g., and, or
» preposition links nouns, pronouns, and phrases to other words in
a sentence, e.g., on, beneath, over, of, during, beside
– the stored words use a common form
» they are stemmed to obtain the ROOT FORM by removing common
prefixes and suffixes
» synonyms to a given word are also found and used
Disease, diseased, diseases, illness, unwellness, malady, sickness, …
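To make this concrete, here is a minimal Python sketch of how such an inverted index could be built; the stop-word list, the crude suffix-stripping rule, and the toy documents are illustrative stand-ins for a full stop-word list and a real stemmer, and synonym expansion is omitted.

    # Toy inverted-index construction: stop-word removal, crude stemming, indexing.
    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "is", "of", "and", "or", "on", "over", "during"}

    def preprocess(text):
        """Lowercase, tokenize, drop stop words, and crudely strip a plural 's'."""
        terms = []
        for tok in re.findall(r"[a-z]+", text.lower()):
            if len(tok) < 2 or tok in STOP_WORDS:
                continue
            if tok.endswith("s") and len(tok) > 3:   # very rough stand-in for stemming
                tok = tok[:-1]
            terms.append(tok)
        return terms

    def build_inverted_index(docs):
        """docs: dict of doc id -> text. Returns term -> set of doc ids."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in preprocess(text):
                index[term].add(doc_id)
        return index

    docs = {
        "D1": "Lincoln Park Zoo is everyone's zoo",
        "D3": "Biography of Abraham Lincoln, the sixteenth President",
    }
    index = build_inverted_index(docs)
    print(sorted(index["lincoln"]))   # ['D1', 'D3']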
Architecture of IR Systems
– Query
• is composed of character strings combined by Boolean operators
and additional features such as contextual or positional operators
– query is also preprocessed in terms of removing determiners,
conjunctions, and prepositions
– No linguistic analysis of the semantics of the stored texts, or
of the queries, is performed
• thus the IR systems are domain-independent
How does it fit in the CS taxonomy?
[Figure: a taxonomy tree locating Information Retrieval within computer science; the nodes shown include Computers, Databases, Artificial Intelligence, Robotics, Algorithms, Search, Natural Language Processing, Information Retrieval, Machine Translation, Networking, Language Analysis, Semantics, and Parsing]
By Rada Mihalcea, “Natural Language Processing”
Outline
• Introduction
• Information Retrieval
– Definition
– Architecture of IR Systems
– Linguistic Preprocessing
– Measures of Text Retrieval
– Vector-Space Model
– Text Similarity Measures
– Misc
• Improving Information Retrieval Systems
– Latent Semantic Indexing
– Relevance Feedback
Linguistic Preprocessing
Creation of the inverted index requires linguistic preprocessing, which aims at extracting important terms from a document represented as a bag of words.
Term extraction involves two main operations:
– Removal of stop words
– Stemming
Linguistic Preprocessing
• Removal of stop words
– stop words are defined as terms that are irrelevant although
they occur frequently in the documents:
• determiner is a non-lexical element preceding a noun in a noun
phrase, and includes articles (a, an, the), demonstratives, when
used with noun phrases (this, that, these, those), possessive
determiners (her, his, its, my, our, their, your) and quantifiers (all,
few, many, several, some, every)
• conjunction is a part of speech that is used to combine two words,
phrases, or clauses together, and includes coordinating
conjunctions (for, and, nor, but, or, yet, so), correlative
conjunctions (both … and, either … or, not (only) … but (… also)),
and subordinating conjunctions (after, although, if, unless,
because)
Linguistic Preprocessing
• Removal of stop words
– stop words are defined as terms that are irrelevant even though they may occur frequently in the documents:
• preposition links nouns, pronouns and phrases to other words
in a sentence (on, beneath, over, of, during, beside, etc.)
• Finally, the stop words include some custom-defined words,
which are related to the subject of the database
e.g., for a database that lists all research papers related to
brain modeling, the words brain and model should be removed
Linguistic Preprocessing
• Stemming
– words that appear in documents often have many
morphological variants
– each word that is not a stop word is reduced into its
corresponding stem word (term)
• words are stemmed to obtain root form by removing common
prefixes and suffixes
• in this way, we can identify groups of corresponding words
where the words in the group are syntactical variants of each
other, and collect only one word per group
– for instance, words disease, diseases, and diseased share a
common stem term disease, and can be treated as different
occurrences of this word
• For English, stemming is not a big problem – publicly available algorithms give good results. The most widely used is the Porter stemmer at http://www.tartarus.org/~martin/PorterStemmer/
• E.g., in Slovenian, 10-20 different forms correspond to the same word (“to laugh” in Slovenian): smej, smejal, smejala, smejale, smejali, smejalo, smejati, smejejo, smejeta, smejete, smejeva, smeješ, smejemo, smejiš, smeje, smejoč, smejta, smejte, smejva
• In Chinese… 一切尽在不言中 (“it all goes without saying”).
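For illustration, a Porter stemmer can be called from Python roughly as follows; this assumes the NLTK package is installed, and the exact stem strings can differ slightly between implementations.

    # Stemming morphological variants with NLTK's Porter stemmer (pip install nltk).
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["make", "makes", "made", "making", "disease", "diseases", "diseased"]:
        print(word, "->", stemmer.stem(word))
    # make, makes, making collapse to "make"; disease, diseases, diseased collapse
    # to one common stem (typically "diseas"). Irregular forms such as "made" are
    # not handled by suffix-stripping stemmers.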
Outline
• Introduction
• Information Retrieval
– Definition
– Architecture of IR Systems
– Linguistic Preprocessing
– Measures of Text Retrieval
– Vector-Space Model
– Text Similarity Measures
– Misc
• Improving Information Retrieval Systems
– Latent Semantic Indexing
– Relevance Feedback
Measures of Text Retrieval
Let us suppose that an IR system returned a set of
documents to the user’s query.
We define measures that allow us to evaluate how accurate (correct) the system’s answer was.
Two types of documents can be found in a database:
– relevant documents, which are relevant to the user’s query
– retrieved documents, which are returned to the user by the
system
Precision and Recall
[Figure: the entire collection of documents split by relevance and by retrieval. The overlap of the “Retrieved documents” and “Relevant documents” sets, marked X, contains the documents that are retrieved and relevant; the remaining regions are retrieved and irrelevant, not retrieved and relevant, and not retrieved and irrelevant.]
• Precision
– evaluates ability to retrieve documents that are mostly relevant
  precision = X / (total number of retrieved documents)
• Recall
– evaluates ability of the search to find all of the relevant items in the corpus
  recall = X / (total number of relevant documents)
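As a minimal sketch (Python; the document ids are made up), the two measures can be computed from sets of retrieved and relevant documents:

    # Precision and recall over sets of document ids.
    def precision(retrieved, relevant):
        """Fraction of the retrieved documents that are relevant."""
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

    def recall(retrieved, relevant):
        """Fraction of the relevant documents that were retrieved."""
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0

    retrieved = {"D1", "D2", "D3", "D4"}
    relevant = {"D2", "D4", "D7"}
    print(precision(retrieved, relevant))   # 2/4 = 0.5
    print(recall(retrieved, relevant))      # 2/3 = 0.666...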
Precision and Recall
• Trade-off between precision and recall
[Figure: the precision–recall trade-off, with precision on the vertical axis (0 to 1) and recall on the horizontal axis (0 to 1). The ideal case is high precision and high recall; a system with high precision but low recall returns relevant documents but misses many of them, while a system with high recall but low precision returns most relevant documents but includes lots of unwanted documents.]
Computing Recall
• Number of relevant documents is often not available,
and thus we use techniques to estimate it, such as
– sampling across the database and performing relevance
judgment on the documents
– applying different retrieval algorithms to the same
database for the same query
• the relevant documents are the aggregate of all found
documents
– the generated list is a gold standard used to compute recall
Computing Recall and Precision
total # of relevant docs = 7

rank  doc #  relevant?  precision / recall
 1    134    yes        P = 1/1 = 1,    R = 1/7 = 0.14
 2    1987   yes        P = 2/2 = 1,    R = 2/7 = 0.29
 3    21
 4    8712   yes        P = 3/4 = 0.75, R = 3/7 = 0.43
 5    112
 6    567    yes        P = 4/6 = 0.67, R = 4/7 = 0.57
 7    810
 8    12     yes        P = 5/8 = 0.63, R = 5/7 = 0.71
 9    346
 10   478
 11   7834   yes        P = 6/11 = 0.55, R = 6/7 = 0.86
 12   3412
(still missing one relevant document, and thus will never reach 100% recall)
• For a given query
– generate the ranked list of retrieved documents
– adjust a threshold on the ranked list to generate different sets of
retrieved documents, thus with different recall/precision measures
• mark each document in the ranked list that is relevant according to the
gold standard
• compute recall and precision for each position in the ranked list that
contains a relevant document
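A small sketch of this procedure in Python; the ranked list and gold standard reproduce the table above, and the id 999 is a hypothetical stand-in for the one relevant document that is never retrieved:

    # Precision/recall at every position in the ranked list that holds a relevant doc.
    def precision_recall_points(ranked, relevant):
        points, hits = [], 0
        for i, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
                points.append((hits / i, hits / len(relevant)))   # (P, R) at rank i
        return points

    ranked = [134, 1987, 21, 8712, 112, 567, 810, 12, 346, 478, 7834, 3412]
    relevant = {134, 1987, 8712, 567, 12, 7834, 999}   # 7 relevant docs; 999 is hypothetical
    for p, r in precision_recall_points(ranked, relevant):
        print(f"P = {p:.2f}, R = {r:.2f}")
    # P = 1.00, R = 0.14 ... P = 0.55, R = 0.86 -- recall never reaches 1.0 because
    # one relevant document is missing from the ranked list.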
Computing Recall/Precision Points: Example 1
Let total # of relevant docs = 6

n    doc #  relevant
1    588    x
2    589    x
3    576
4    590    x
5    986
6    592    x
7    984
8    988
9    578
10   985
11   103
12   591
13   772    x
14   990

Check each new recall point:
R = 1/6 = 0.167; P = 1/1 = 1
R = 2/6 = 0.333; P = 2/2 = 1
R = 3/6 = 0.5;   P = 3/4 = 0.75
R = 4/6 = 0.667; P = 4/6 = 0.667
R = 5/6 = 0.833; P = 5/13 = 0.38
(missing one relevant document, so 100% recall is never reached)
Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
Computing Recall/Precision Points: Example 2
Let total # of relevant docs = 6

n    doc #  relevant
1    588    x
2    576
3    589    x
4    342
5    590    x
6    717
7    984
8    772    x
9    321    x
10   498
11   113
12   628
13   772
14   592    x

Check each new recall point:
R = 1/6 = 0.167; P = 1/1 = 1
R = 2/6 = 0.333; P = 2/3 = 0.667
R = 3/6 = 0.5;   P = 3/5 = 0.6
R = 4/6 = 0.667; P = 4/8 = 0.5
R = 5/6 = 0.833; P = 5/9 = 0.556
R = 6/6 = 1.0;   P = 6/14 = 0.429
Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
Compare Two or More Systems
• The curve closest to the upper right-hand
corner of the graph indicates the best
performance
Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
R-Precision
• Precision at the R-th position in the ranking of results for a query that has R relevant documents.

n    doc #  relevant
1    588    x
2    589    x
3    576
4    590    x
5    986
6    592    x
7    984
8    988
9    578
10   985
11   103
12   591
13   772    x
14   990

R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
F-Measure
• One measure of performance that takes into
account both recall and precision.
• Harmonic mean of recall and precision:
  F = 2PR / (P + R) = 2 / (1/P + 1/R)
• Compared to arithmetic mean, both need to
be high for harmonic mean to be high.
Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
E Measure (parameterized F Measure)
• A variant of F measure that allows weighting
emphasis on precision over recall:
  E = (1 + β²)PR / (β²P + R) = (1 + β²) / (β²/R + 1/P)
• Value of β controls the trade-off:
– β = 1: Equally weight precision and recall (E = F).
– β > 1: Weight recall more.
– β < 1: Weight precision more.
Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
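A small sketch of both measures in Python (beta stands for β, and the E formula follows the parameterization above):

    # F-measure (harmonic mean of P and R) and the parameterized E measure.
    def f_measure(p, r):
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def e_measure(p, r, beta=1.0):
        """(1 + beta^2) * P * R / (beta^2 * P + R); reduces to F when beta = 1."""
        denom = beta ** 2 * p + r
        return (1 + beta ** 2) * p * r / denom if denom else 0.0

    p, r = 0.75, 0.5
    print(f_measure(p, r))             # 0.6
    print(e_measure(p, r, beta=1.0))   # 0.6, same as F
    print(e_measure(p, r, beta=2.0))   # ~0.536, weights recall more heavily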
Mean Average Precision
(MAP)
• Average Precision: Average of the precision
values at the points at which each relevant
document is retrieved.
– Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
– Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
• Mean Average Precision: Average of the
average precision value for a set of queries.
Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
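A sketch of average precision for a single query (Python; the data reproduce Example 1, with 111 as a hypothetical stand-in for the relevant document that never appears in the ranking):

    # Average precision for one query; MAP averages this value over a set of queries.
    def average_precision(ranked, relevant):
        hits, total = 0, 0.0
        for i, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
                total += hits / i            # precision at this relevant document
        return total / len(relevant)         # relevant docs never retrieved add 0

    ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
    relevant = {588, 589, 590, 592, 772, 111}   # 6 relevant docs; 111 is hypothetical
    # Prints ~0.6335, i.e. Ex1's 0.633 (the slide rounds 5/13 to 0.38 before averaging).
    print(average_precision(ranked, relevant))

    def mean_average_precision(per_query):
        """per_query: list of (ranked, relevant) pairs, one per query."""
        return sum(average_precision(rk, rel) for rk, rel in per_query) / len(per_query)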
Non-Binary Relevance
• Documents are rarely entirely relevant or
non-relevant to a query
• Many sources of graded relevance
judgments
– Relevance judgments on a 5-point scale
– Multiple judges
– Click distribution and deviation from expected
levels (but click-through != relevance
judgments)
Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
Cumulative Gain
• With graded relevance judgments, we can compute the gain at each rank.
• Cumulative Gain at rank n:
  CGn = rel1 + rel2 + … + reln
  (where reli is the graded relevance of the document at position i)

n    doc #  relevance (gain)  CGn
1    588    1.0               1.0
2    589    0.6               1.6
3    576    0.0               1.6
4    590    0.8               2.4
5    986    0.0               2.4
6    592    1.0               3.4
7    984    0.0               3.4
8    988    0.0               3.4
9    578    0.0               3.4
10   985    0.0               3.4
11   103    0.0               3.4
12   591    0.0               3.4
13   772    0.2               3.6
14   990    0.0               3.6
Adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong)
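A one-function Python sketch that reproduces the CGn column from the gains above:

    # Cumulative gain: running sum of the graded relevance values down the ranking.
    def cumulative_gain(gains):
        cg, total = [], 0.0
        for g in gains:
            total += g
            cg.append(round(total, 1))
        return cg

    gains = [1.0, 0.6, 0.0, 0.8, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0]
    print(cumulative_gain(gains))
    # [1.0, 1.6, 1.6, 2.4, 2.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.4, 3.6, 3.6]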
Outline
• Introduction
• Information Retrieval
– Definition
– Architecture of IR Systems
– Linguistic Preprocessing
– Measures of Text Retrieval
– Vector-Space Model
– Text Similarity Measures
– Misc
• Improving Information Retrieval Systems
– Latent Semantic Indexing
– Relevance Feedback
How to Measure Text Similarity?
It is a well-studied problem
– metrics use a “bag of words” model
• it completely ignores word order and syntactic structure
• it treats both document and query as a bag of independent words
– common “stop words” are removed
– words are stemmed to reduce them to their root form
• the preprocessed words are called terms
– vector-space model is used to calculate similarity measure
between documents and a query, and between two documents
The Vector-Space Model
Assumptions:
– Vocabulary: a set of all distinct terms that remain after
preprocessing documents in the database; it contains t index
terms
• these “orthogonal” terms form a vector space
– each term, i, in either a document or query, j, is given a real-valued weight wij.
• documents and queries are expressed as t-dimensional vectors
dj = (w1j, w2j, …, wtj)
3D Example of the Vector-Space Model
Example
– document D1 = 2T1 + 6T2 + 5T3
– document D2 = 5T1 + 5T2 + 2T3
– query Q1 = 0T1 + 0T2 + 2T3
– which document is closer to the query?
• how to measure it?
  – Distance?
  – Angle?
  – Projection?
[Figure: D1, D2, and Q drawn as vectors in the 3-dimensional space spanned by the terms T1, T2, and T3]
The Vector-Space Model

       T1    T2    …    Tt
D1     w11   w21   …    wt1
D2     w12   w22   …    wt2
:      :     :          :
Dn     w1n   w2n   …    wtn
• Collection of n documents
– represented in the vector-space model by a term-document
matrix
– a cell in the matrix corresponds to the weight of a term in
the document
• value of zero means that the term does not exist in the
document
• Next we explain how the weights are computed
Term Weights
Frequency of a term
– more frequent terms in a document are more important
• they are more indicative of the topic of a document
fij = frequency of term i in document j
– the frequency is normalized by dividing by the frequency of
the most common term in the document
tfij = fij / maxi(fij)
Term Weights
Inverse document frequency
– used to indicate the term’s discriminative power
• terms that appear in many different documents are less indicative
for a specific topic
dfi = document frequency of term i = # of documents containing term i
idfi = inverse document frequency of term i = log2 (N / dfi)
where N is the total # of documents in the database, and log2 is used to
dampen the effect relative to tfij
TF-IDF Weighting
Term frequency-inverse document frequency (tf-idf) weighting
wij = tfij × idfi = tfij × log2(N / dfi)
– the highest weight is assigned to terms that occur frequently in
the document but rarely in the rest of the database
• some other ways of determining term weights have also been
proposed
– the tf-idf weighting was found to work very well through extensive experimentation, and thus it is widely used
Example of TF-IDF Weighting
Consider the following document:
“data cube contains x data dimension, y data dimension, and z data dimension”
where grey color stands for “ignored” letters, and the other colors indicate distinct terms
– the frequencies are:
data (4), dimension (3), cube (1), contain (1)
– we assume that the entire collection contains 10,000 documents, and the document frequencies of these four terms are:
data (1300), dimension (250), cube (50), contain (3100)
– then:
data:      tf = 4/4   idf = log2(10000/1300) = 2.94   tf-idf = 2.94
dimension: tf = 3/4   idf = log2(10000/250) = 5.32    tf-idf = 3.99
cube:      tf = 1/4   idf = log2(10000/50) = 7.64     tf-idf = 1.91
contain:   tf = 1/4   idf = log2(10000/3100) = 1.69   tf-idf = 0.42
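The same numbers can be reproduced with a few lines of Python (N and the frequencies are taken straight from the example):

    # TF-IDF weights for the example: w = (f / max_f) * log2(N / df).
    import math

    N = 10000                                             # documents in the collection
    term_freq = {"data": 4, "dimension": 3, "cube": 1, "contain": 1}
    doc_freq  = {"data": 1300, "dimension": 250, "cube": 50, "contain": 3100}

    max_f = max(term_freq.values())
    for term, f in term_freq.items():
        tf = f / max_f
        idf = math.log2(N / doc_freq[term])
        print(f"{term}: tf = {tf:.2f}  idf = {idf:.2f}  tf-idf = {tf * idf:.2f}")
    # data: 2.94, dimension: 3.99, cube: 1.91, contain: 0.42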
Outline
• Introduction
• Information Retrieval
– Definition
– Architecture of IR Systems
– Linguistic Preprocessing
– Measures of Text Retrieval
– Vector-Space Model
– Text Similarity Measures
– Misc
• Improving Information Retrieval Systems
– Latent Semantic Indexing
– Relevance Feedback
Text Similarity Measure
Text similarity measure is a function used to compute
degree of similarity between two vectors
– it is usually used to measure the similarity between the query and each of the documents in the database
• it is used to rank the documents
– the order indicates their relevance to the query
• usually a threshold is used to control the number of retrieved
relevant documents
Cosine Similarity Measure
Is defined as the cosine of the angle between the two vectors:
  similarity(dj, q) = (dj · q) / (|dj| × |q|)
                    = Σi=1..t (wij × wiq) / ( sqrt(Σi=1..t wij²) × sqrt(Σi=1..t wiq²) )
  where t is the # of terms
[Figure: D1 = 2T1 + 6T2 + 5T3, D2 = 5T1 + 5T2 + 2T3, and Q = 0T1 + 0T2 + 2T3 shown as vectors in the T1–T2–T3 space]
– example
  D1 = 2T1 + 6T2 + 5T3, D2 = 5T1 + 5T2 + 2T3, Q = 0T1 + 0T2 + 2T3
  similarity(D1, Q) = (2×0 + 6×0 + 5×2) / ( sqrt(4 + 36 + 25) × sqrt(0 + 0 + 4) ) = 10 / ( sqrt(65) × sqrt(4) ) = 0.62
  similarity(D2, Q) = (5×0 + 5×0 + 2×2) / ( sqrt(25 + 25 + 4) × sqrt(0 + 0 + 4) ) = 4 / ( sqrt(54) × sqrt(4) ) = 0.27
  document D1 is more than twice as similar to the query as document D2
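A short Python sketch of the same computation; the three vectors are the term weights of D1, D2, and Q from the example:

    # Cosine similarity between two term-weight vectors.
    import math

    def cosine_similarity(d, q):
        dot = sum(wd * wq for wd, wq in zip(d, q))
        norm_d = math.sqrt(sum(w * w for w in d))
        norm_q = math.sqrt(sum(w * w for w in q))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    D1, D2, Q = [2, 6, 5], [5, 5, 2], [0, 0, 2]
    print(round(cosine_similarity(D1, Q), 2))   # 0.62
    print(round(cosine_similarity(D2, Q), 2))   # 0.27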
Outline
• Introduction
• Information Retrieval
– Definition
– Architecture of IR Systems
– Linguistic Preprocessing
– Measures of Text Retrieval
– Vector-Space Model
– Text Similarity Measures
– Misc
• Improving Information Retrieval Systems
– Latent Semantic Indexing
– Relevance Feedback
If we have extra time…
I can further talk about:
• Different levels of text mining
(word-> sentence -> segment -> …)
• The application of text mining in other areas
(Spam and malicious executables detection)
Outline
• Introduction
• Information Retrieval
– Definition
– Architecture of IR Systems
– Linguistic Preprocessing
– Measures of Text Retrieval
– Vector-Space Model
– Text Similarity Measures
– Levels of Text Mining
– Misc
• Improving Information Retrieval Systems
– Latent Semantic Indexing
– Relevance Feedback
Improving IR Systems
The basic design of the IR systems described so far can
be enhanced to increase precision and recall, and to
improve the term-document matrix
– The former issue is often addressed by using a relevance feedback mechanism, while the latter is addressed by using latent semantic indexing
Improving IR Systems
Latent Semantic Indexing
– improves effectiveness of the IR system by retrieving
documents that are more relevant to a user's query through
manipulation of the term-document matrix
• original term-document matrix is often too large for the available
computing resources, is presumed noisy (some anecdotal
occurrences of terms should be eliminated) and too sparse with
respect to the "true" term-document matrix
– therefore, the original matrix is approximated by a smaller, “denoisified” matrix
• the new matrix is computed using singular value decomposition
(SVD) technique (Chapter 7)
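A minimal sketch of the low-rank approximation step (Python with NumPy; the random 6×4 matrix is only a stand-in for a real term-document matrix, and k is the number of latent dimensions kept):

    # Rank-k approximation of a term-document matrix via SVD -- the core step of LSI.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((6, 4))        # toy term-document matrix: 6 terms x 4 documents

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2                          # keep only the k strongest latent "concepts"
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # smaller, "de-noisified" matrix

    # Documents (and, after projection, queries) can be compared in the k-dim space:
    docs_latent = np.diag(s[:k]) @ Vt[:k, :]       # k x (number of documents)
    print(A_k.shape, docs_latent.shape)            # (6, 4) (2, 4)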
Improving IR Systems
Relevance Feedback
– modification of the information search process to improve accuracy
– it adds terms to the initial query that may match relevant documents
[Figure: the relevance feedback loop between the user and the IR system (query, matching against documents, new terms, new query); the numbered arrows 1–6 correspond to the steps listed below]
1. perform IR search on the initial query
2. get feedback from the user as to what documents are relevant and find new
(with respect to the initial query) terms from known relevant documents
3. add new terms to the initial query to form a new query
4. repeat the search using the new query
5. return a set of relevant documents based on the new query
6. user evaluates the returned documents
Improving IR Systems
Relevance Feedback
– performed manually
• user manually identifies relevant documents, and new terms are
selected either manually or automatically
– performed automatically
• relevant documents are identified automatically by using the top
ranked documents, and new terms are selected automatically
1. identification of the N top-ranked documents
2. identification of all terms from the N top-ranked documents
3. selection of the feedback terms
4. merging the feedback terms with the original query
5. identifying the top-ranked documents for the modified query
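A sketch of the automatic variant in Python; search() and extract_terms() are hypothetical helpers standing in for the IR system's retrieval and term-extraction steps:

    # Automatic (pseudo-) relevance feedback: expand the query with terms taken from
    # the N top-ranked documents of the initial search, then search again.
    def relevance_feedback(query_terms, search, extract_terms, n_docs=10, n_terms=5):
        ranked = search(query_terms)                  # initial retrieval
        top_docs = ranked[:n_docs]                    # 1. N top-ranked documents
        candidates = extract_terms(top_docs)          # 2. terms from those documents
        feedback = [t for t in candidates             # 3. select the feedback terms
                    if t not in query_terms][:n_terms]
        new_query = list(query_terms) + feedback      # 4. merge with original query
        return search(new_query)                      # 5. retrieve for the new query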
Improving IR Systems
Relevance Feedback
• Pros
– it usually improves average precision by increasing the
number of good terms in the query
• Cons
– requires more computational work
– it can decrease effectiveness
• one bad new word can undo the good caused by lots of good
words