ir_spring06_lec09b


Modeling the Internet and the
Web:
Text Analysis
1
Outline
• Indexing
• Lexical processing
• Content-based ranking
• Probabilistic retrieval
• Latent semantic analysis
• Text categorization
• Exploiting hyperlinks
• Document clustering
• Information extraction
2
Information Retrieval
• Analyzing the textual content of individual Web
pages
– given a user’s query
– determine a maximally related subset of documents
• Retrieval
– index a collection of documents (access efficiency)
– rank documents by importance (accuracy)
• Categorization (classification)
– assign a document to one or more categories
3
Indexing
• Inverted index
– effective for very large collections of
documents
– associates lexical items to their occurrences
in the collection
• Terms ω
– lexical items: words or expressions
• Vocabulary V
– the set of terms of interest
4
Inverted Index
• The simplest example
– a dictionary
• each key is a term ω ∈ V
• the associated value b(ω) points to a bucket (posting list)
– a bucket is a list of pointers marking all occurrences of ω
in the text collection
5
Inverted Index
• Bucket entries:
– document identifier (DID)
• the ordinal number within the collection
– separate entry for each occurrence of the term
• DID
• offset (in characters) of term’s occurrence within this
document
– present a user with a short context
– enables vicinity queries
6
Inverted Index
7
Inverted Index Construction
• Parse documents
• Extract terms i
– if i is not present
• insert i in the inverted index
• Insert the occurrence in the bucket
8
Searching with Inverted Index
• To find a term ω in an indexed collection of
documents
– obtain b(ω) from the inverted index
– scan the bucket to obtain the list of occurrences
• To find k terms
– get k lists of occurrences
– combine lists by elementary set operations
9
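A minimal Python sketch (added for illustration, not part of the original slides) of an in-memory inverted index with (DID, offset) buckets and an AND query answered by set intersection; the toy documents and whitespace tokenization are simplifying assumptions:

# Minimal in-memory inverted index sketch (illustrative only).
# Buckets map each term to a list of (DID, offset) pairs, and a
# multi-term query is answered by intersecting the DID sets.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(list)          # term -> bucket of (DID, offset)
    for did, text in enumerate(docs):
        offset = 0
        for token in text.lower().split():
            index[token].append((did, offset))
            offset += len(token) + 1   # advance offset (assumes single spaces)
    return index

def search_all(index, terms):
    # AND query: documents containing every query term
    sets = [{did for did, _ in index.get(t, [])} for t in terms]
    return set.intersection(*sets) if sets else set()

docs = ["web graph analysis", "graph theory", "web search on the web"]
idx = build_index(docs)
print(sorted(idx["web"]))                 # buckets for 'web'
print(search_all(idx, ["web", "graph"]))  # -> {0}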
Inverted Index Implementation
• Size = Θ(|V|)
• Implemented using a hash table
• Buckets stored in memory
– construction algorithm is trivial
• Buckets stored on disk
– impractical due to disk access time
• use specialized secondary memory algorithms
10
Bucket Compression
• Reduce memory for each pointer in the buckets:
– for each term sort occurrences by DID
– store as a list of gaps - the sequence of differences
between successive DIDs
• Advantage – significant memory saving
– frequent terms produce many small gaps
– small integers encoded by short variable-length
codewords
• Example:
the sequence of DIDs: (14, 22, 38, 42, 66, 122, 131, 226 )
a sequence of gaps: (14, 8, 16, 4, 24, 56, 9, 95)
11
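An illustrative Python sketch of gap encoding for the bucket above, with a simple variable-byte code standing in for the short variable-length codewords; the encoding scheme shown is just one possible choice:

# Gap (delta) encoding sketch for a bucket of DIDs, plus a simple
# variable-byte code for the gaps.
def to_gaps(dids):
    return [dids[0]] + [b - a for a, b in zip(dids, dids[1:])]

def vbyte_encode(nums):
    out = bytearray()
    for n in nums:
        chunk = []
        while True:
            chunk.append(n & 0x7F)     # 7-bit groups, least significant first
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80               # mark the final (least significant) byte
        out.extend(reversed(chunk))
    return bytes(out)

dids = [14, 22, 38, 42, 66, 122, 131, 226]
gaps = to_gaps(dids)
print(gaps)                              # [14, 8, 16, 4, 24, 56, 9, 95]
print(len(vbyte_encode(gaps)), "bytes")  # small gaps -> short codewords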
Lexical Processing
• Performed prior to indexing or converting
documents to vector representations
– Tokenization
• extraction of terms from a document
– Text conflation and vocabulary reduction
• Stemming
– reducing words to their root forms
• Removing stop words
– common words, such as articles, prepositions, non-informative adverbs
– 20-30% index size reduction
12
Tokenization
• Extraction of terms from a document
– stripping out
• administrative metadata
• structural or formatting elements
• Example
– removing HTML tags
– removing punctuation and special characters
– folding character case (e.g. all to lower case)
13
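A small illustrative Python sketch of the tokenization steps listed above (tag stripping, punctuation removal, case folding); the regular expressions are simplifying assumptions, not a full HTML parser:

# Simple tokenization sketch: strip HTML tags, drop punctuation and
# special characters, fold case, split on whitespace.
import re

def tokenize(html):
    text = re.sub(r"<[^>]+>", " ", html)    # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)    # remove punctuation / special chars
    return text.lower().split()             # fold case, extract terms

print(tokenize("<p>Modeling the <b>Web</b>: text analysis!</p>"))
# ['modeling', 'the', 'web', 'text', 'analysis']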
Stemming
• Want to reduce all morphological variants of a word
to a single index term
– e.g. a document containing words like fish and fisher may
not be retrieved by a query containing fishing (no fishing
explicitly contained in the document)
• Stemming - reduce words to their root form
• e.g. fish – becomes a new index term
• Porter stemming algorithm (1980)
– relies on a preconstructed suffix list with associated rules
• e.g. if suffix=IZATION and prefix contains at least one vowel
followed by a consonant, replace with suffix=IZE
– BINARIZATION => BINARIZE
14
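An illustrative Python sketch of the single IZATION to IZE rule from the example above; the real Porter stemmer applies many such suffix rules in several phases:

# One Porter-style suffix rule, as in the BINARIZATION => BINARIZE example.
import re

def strip_ization(word):
    if word.lower().endswith("ization"):
        stem = word[:-len("ization")]
        # rule fires only if the prefix contains a vowel followed by a consonant
        if re.search(r"[aeiou][^aeiou]", stem.lower()):
            return stem + "ize"
    return word

print(strip_ization("binarization"))   # -> binarize
print(strip_ization("nation"))         # unchanged (suffix does not match)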
Content Based Ranking
• A Boolean query
– results in many matching documents
– e.g., the Google query ‘Web AND graphs’
returns 4,040,000 matches
• Problem
– the user can examine only a fraction of the results
• Content-based ranking
– arranges results in order of relevance to the user
15
Choice of Weights
query q: web graph

document   text                      terms
d1         web web graph             web graph
d2         graph web net graph net   graph web net
d3         page web complex          page web complex

Term weights (terms: web, graph, net, page, complex):

       web   graph  net   page  complex
q      wq1   wq2    -     -     -
d1     w11   w12    -     -     -
d2     w21   w22    w23   -     -
d3     w31   -      -     w34   w35

What weights retrieve the most relevant pages?
16
Vector-space Model
• Text documents are mapped to a high-dimensional vector space
• Each document d
– is represented as a sequence of terms ω(t)

    d = ( ω(1), ω(2), ω(3), …, ω(|d|) )

• Unique terms in a set of documents
– determine the dimension of the vector space
17
Example
document   text                      terms
d1         web web graph             web graph
d2         graph web net graph net   graph web net
d3         page web complex          page web complex
Boolean representation of vectors:
V = [ web, graph, net, page, complex ]
V1 = [1 1 0 0 0]
V2 = [1 1 1 0 0]
V3 = [1 0 0 1 1]
18
Vector-space Model
• 1, 2 and 3 are terms
in document, x and x
are document vectors
• Vector-space
representations are
sparse, |V| >> |d|
19
Term frequency (TF)
• A term that appears many times within a
document is likely to be more important than a
term that appears only once
• nij - number of occurrences of term ωj in
document di
• Term frequency

    TFij = nij / |di|
20
Inverse document frequency (IDF)
• A term that occurs in a few documents is likely to
be a better discriminator than a term that
appears in most or all documents
• nj - number of documents which contain the term
ωj
• n - total number of documents in the set
• Inverse document frequency

    IDFj = log( n / nj )
21
Inverse document frequency (IDF)
22
Full Weighting (TF-IDF)
• The TF-IDF weight of a term
ωj in document di is

    xij = TFij × IDFj
23
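An illustrative Python sketch computing the TF-IDF weights defined above (TFij = nij/|di|, IDFj = log(n/nj)) for the three toy documents used earlier:

# TF-IDF sketch following the slide definitions.
import math
from collections import Counter

docs = [["web", "web", "graph"],
        ["graph", "web", "net", "graph", "net"],
        ["page", "web", "complex"]]

n = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}   # n_j
idf = {t: math.log(n / df[t]) for t in vocab}

def tfidf(doc):
    counts = Counter(doc)
    return {t: (counts[t] / len(doc)) * idf[t] for t in counts}

print(tfidf(docs[0]))   # 'web' gets weight 0: it appears in every document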
Document Similarity
• Ranks documents by measuring the
similarity between each document and the
query
• Similarity between two documents d and d'
is a function s(d, d') ∈ R
• In a vector-space representation the
cosine coefficient of two document vectors
is a measure of similarity
24
Cosine Coefficient
• The cosine of the angle formed by two document
vectors x and x' is

    cos(x, x') = x^T x' / ( ||x|| ||x'|| )

• Documents with many common terms will have
vectors closer to each other than documents with
fewer overlapping terms
25
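An illustrative Python sketch of the cosine coefficient applied to the Boolean vectors V1, V2, V3 from the earlier example:

# Cosine similarity for the Boolean document vectors
# (V = [web, graph, net, page, complex]); pure Python, no libraries.
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

v1 = [1, 1, 0, 0, 0]   # d1: web graph
v2 = [1, 1, 1, 0, 0]   # d2: graph web net
v3 = [1, 0, 0, 1, 1]   # d3: page web complex

print(round(cosine(v1, v2), 3))   # 0.816 - two shared terms
print(round(cosine(v1, v3), 3))   # 0.408 - one shared term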
Retrieval and Evaluation
• Compute document vectors for a set of
documents D
• Find the vector associated with the user
query q
• Using s(xi, q), i = 1,…,n, assign a similarity
score for each document
• Retrieve top ranking documents R
• Compare R with R* - documents actually
relevant to the query
26
Retrieval and Evaluation Measures
• Precision (π) - fraction of retrieved documents
that are actually relevant

    π = |R ∩ R*| / |R|

• Recall (ρ) - fraction of relevant documents that
are retrieved

    ρ = |R ∩ R*| / |R*|
27
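An illustrative Python sketch of set-based precision and recall; the retrieved set R and relevant set R* below are made up:

# Precision and recall from a retrieved set R and a relevant set R_star.
def precision_recall(R, R_star):
    hits = len(R & R_star)
    return hits / len(R), hits / len(R_star)

R = {1, 2, 3, 4, 5}        # retrieved document ids (hypothetical)
R_star = {2, 4, 6, 8}      # relevant document ids (hypothetical)
p, r = precision_recall(R, R_star)
print(p, r)                # 0.4 precision, 0.5 recall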
Probabilistic Retrieval
• Probabilistic Ranking Principle (PRP)
(Robertson, 1977)
– ranking of the documents in the order of
decreasing probability of relevance to the user
query
– probabilities are estimated as accurately as
possible on basis of available data
– overall effectiveness of such a system will be
the best obtainable
28
Probabilistic Model
• PRP can be stated by introducing a Boolean
variable R (relevance) for a document d, for a
given user query q as P(R | d,q)
• Documents should be retrieved in order of
decreasing probability

    P(R | d, q) ≥ P(R | d', q)

• d' - a document that has not yet been retrieved
29
Latent Semantic Analysis
• Why is it needed?
– serious problems for retrieval methods based
on term matching
• vector-space similarity approach works only if the
terms of the query are explicitly present in the
relevant documents
– rich expressive power of natural language
• queries often contain terms that express concepts
related to the text to be retrieved
30
Synonymy and Polysemy
• Synonymy
– the same concept can be expressed using different
sets of terms
• e.g. bandit, brigand, thief
– negatively affects recall
• Polysemy
– identical terms can be used in very different semantic
contexts
• e.g. bank
– repository where important material is saved
– the slope beside a body of water
– negatively affects precision
31
Latent Semantic Indexing (LSI)
• A statistical technique
• Uses linear algebra technique called singular
value decomposition (SVD)
– attempts to estimate the hidden structure
– discovers the most important associative patterns
between words and concepts
• Data driven
32
LSI and Text Documents
• Let X denote the term-document matrix
    X = [x1 . . . xn]^T
– each row is the vector-space representation of a document
– each column contains the occurrences of a term in each
document in the dataset
• Latent semantic indexing
– compute the SVD of X:

    X = U Σ V^T

• Σ - singular value matrix
– set to zero all but the largest K singular values, obtaining Σ̂
– obtain the reconstruction of X by:

    X̂ = U Σ̂ V^T
33
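An illustrative numpy sketch of LSI as described above: compute the SVD of a small made-up term-document matrix, keep the K largest singular values, and reconstruct X̂:

# LSI sketch: SVD plus rank-K reconstruction of the term-document matrix.
import numpy as np

# rows = documents, columns = terms (X = [x1 ... xn]^T)
X = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [1, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
K = 2
s_hat = np.zeros_like(s)
s_hat[:K] = s[:K]                    # keep only the K largest singular values
X_hat = U @ np.diag(s_hat) @ Vt      # rank-K reconstruction of X

print(np.round(X_hat, 2))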
LSI Example
• A collection of documents:
d1: Indian government goes for open-source software
d2: Debian 3.0 Woody released
d3: Wine 2.0 released with fixes for Gentoo 1.4 and Debian 3.0
d4: gnuPOD released: iPOD on Linux… with GPLed software
d5: Gentoo servers running at open-source mySQL database
d6: Dolly the sheep not totally identical clone
d7: DNA news: introduced low-cost human genome DNA chip
d8: Malaria-parasite genome database on the Web
d9: UK sets up genome bank to protect rare sheep breeds
d10: Dolly’s DNA damaged
34
LSI Example
• The term-document matrix X^T

               d1   d2   d3   d4   d5   d6   d7   d8   d9   d10
open-source     1    0    0    0    1    0    0    0    0    0
software        1    0    0    1    0    0    0    0    0    0
Linux           0    0    0    1    0    0    0    0    0    0
released        0    1    1    1    0    0    0    0    0    0
Debian          0    1    1    0    0    0    0    0    0    0
Gentoo          0    0    1    0    1    0    0    0    0    0
database        0    0    0    0    1    0    0    1    0    0
Dolly           0    0    0    0    0    1    0    0    0    1
sheep           0    0    0    0    0    1    0    0    0    0
genome          0    0    0    0    0    0    1    1    1    0
DNA             0    0    0    0    0    0    2    0    0    1
35
LSI Example
• The reconstructed term-document matrix X̂^T after projecting on a subspace
of dimension K=2
• Σ = diag(2.57, 2.49, 1.99, 1.9, 1.68, 1.53, 0.94, 0.66, 0.36, 0.10)

               d1     d2     d3     d4     d5     d6     d7     d8     d9     d10
open-source    0.34   0.28   0.38   0.42   0.24   0.00   0.04   0.07   0.02   0.01
software       0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
Linux          0.44   0.37   0.50   0.55   0.31  -0.01  -0.03   0.06   0.00  -0.02
released       0.63   0.53   0.72   0.79   0.45  -0.01  -0.05   0.09  -0.00  -0.04
Debian         0.39   0.33   0.44   0.48   0.28  -0.01  -0.03   0.06   0.00  -0.02
Gentoo         0.36   0.30   0.41   0.45   0.26   0.00   0.03   0.07   0.02   0.01
database       0.17   0.14   0.19   0.21   0.14   0.04   0.25   0.11   0.09   0.12
Dolly         -0.01  -0.01  -0.01  -0.02   0.03   0.08   0.45   0.13   0.14   0.21
sheep         -0.00  -0.00  -0.00  -0.01   0.03   0.06   0.34   0.10   0.11   0.16
genome         0.02   0.01   0.02   0.01   0.10   0.19   1.11   0.34   0.36   0.53
DNA           -0.03  -0.04  -0.04  -0.06   0.11   0.30   1.70   0.51   0.55   0.81
36
Probabilistic LSA
• Aspect model (aggregate Markov model)
– let an event be the occurrence of a term ω in a
document d
– let z ∈ {z1, … , zK} be a latent (hidden) variable
associated with each event
– the probability of each event (ω, d) is

    P(ω, d) = P(d) Σz P(ω | z) P(z | d)

• select a document from a density P(d)
• select a latent concept z with probability P(z|d)
• choose a term ω, sampling from P(ω|z)
37
Aspect Model Interpretation
• In a probabilistic latent semantic space
– each document is a vector
– uniquely determined by the mixing coordinates
P(zk|d), k=1,…,K
• i.e., rather than being represented through terms, a document is
represented through latent variables that in turn are responsible
for generating terms
38
Analogy with LSI
• all n × m document-term joint probabilities factor as

    P = U Σ V^T

– uik = P(di | zk)
– vjk = P(ωj | zk)
– Σkk = P(zk)
– P is a properly normalized probability distribution
– entries are nonnegative
39
Fitting the Parameters
• Parameters estimated by maximum likelihood
using EM
– E step

    P(zk | di, ωj) ∝ P(ωj | zk) P(di | zk) P(zk)

– M step

    P(ωj | zk) ∝ Σi=1..n nij P(zk | di, ωj)

    P(zk | di) ∝ Σj=1..|V| nij P(zk | di, ωj)

    P(zk) ∝ Σi=1..n Σj=1..|V| nij P(zk | di, ωj)
40
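An illustrative numpy sketch of the EM iterations above in the asymmetric parameterization P(ω, d) = P(d) Σz P(ω|z) P(z|d); the count matrix N and the number of aspects K are made up:

# PLSA EM sketch: alternate E and M steps on a random count matrix.
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 3, size=(6, 8)).astype(float)   # n_ij: docs x terms
n_docs, n_terms = N.shape
K = 2

p_w_z = rng.random((K, n_terms)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w|z)
p_z_d = rng.random((n_docs, K));  p_z_d /= p_z_d.sum(1, keepdims=True)  # P(z|d)

for _ in range(50):
    # E step: posterior P(z | d, w) for every (document, term) pair
    post = p_z_d[:, :, None] * p_w_z[None, :, :]         # shape (docs, K, terms)
    post /= post.sum(1, keepdims=True) + 1e-12
    # M step: reweight by the observed counts n_ij
    weighted = N[:, None, :] * post
    p_w_z = weighted.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
    p_z_d = weighted.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12

print(np.round(p_z_d, 2))   # mixing coordinates P(z|d) for each document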
Text Categorization
• Grouping textual documents into different fixed
classes
• Examples
– predict a topic of a Web page
– decide whether a Web page is relevant with respect
to the interests of a given user
• Machine learning techniques
– k nearest neighbors (k-NN)
– Naïve Bayes
– support vector machines
41
k Nearest Neighbors
• Memory based
– learns by memorizing all the training instances
• Prediction of x’s class
– measure distances between x and all training
instances
– return a set N(x,D,k) of the k points closest to x
– predict a class for x by majority voting
• Performs well in many domains
– asymptotic error rate of the 1-NN classifier is always
less than twice the optimal Bayes error
42
Naïve Bayes
• Estimates the conditional probability of the class given the
document
P ( d | c,  ) P ( c |  )
P (c | d ,  ) 
 P ( d | c,  ) P ( c |  )
P(d |  )
•  - parameters of the model
• P(d) – normalization factor (cP(c|d)=1)
– classes are assumed to be mutually exclusive
• Assumption: the terms in a document are conditionally
independent given the class
– false, but often adequate
– gives reasonable approximation
• interested in discrimination among classes
43
Bernoulli Model
• An event – a document as a whole
– a bag of words
– words are attributes of the event
– a vocabulary term ω is a Bernoulli attribute
• 1, if ω is in the document
• 0, otherwise
– binary attributes are mutually independent
given the class
• the class is the only cause of appearance of each word
in a document
44
Bernoulli Model
• Generating a document
– tossing |V| independent coins
– the occurrence of each word in a document is a
Bernoulli event
P ( d | c,  ) P ( c |  )
P (c | d ,  ) 
 P ( d | c,  ) P ( c |  )
P(d |  )
|V |
P(d | c, )   x j P( j | c)  (1  x j )(1  P( j | c))
j 1
– xj = 1[0] - j does [does not] occur in d
– P(j|c) – probability of observing j in documents of
class c
45
Multinomial Model
• Document – a sequence of events
W1,…,W|d|
• Take into account
– number of occurrences of each word
– length of the document
– serial order among words
• significant (could be modeled with a Markov chain)
• here word occurrences are assumed independent – a bag-of-words
representation
46
Multinomial Model
• Generating a document
– throwing a die with |V| faces |d| times
– occurrence of each word is multinomial event
P ( d | c,  ) P ( c |  )
P (c | d ,  ) 
 P ( d | c,  ) P ( c |  )
P(d |  )
|V |
P(d | c, )  GP(| d |) P( j | c)
nj
j 1
• nj is the number of occurrences of j in d
• P(j|c) – probability that j occurs at any position
t  [ 1,…,|d| ]
• G – normalization constant
47
Learning Naïve Bayes
• Estimate parameters  from the available
data
• Training data set is a collection of labeled
documents { (di, ci), i = 1,…,n }
48
Learning Bernoulli Model
• θc,j = P(ωj | c), j = 1,…,|V|, c = 1,…,K
– estimated as

    θ̂c,j = (1 / Nc) Σi: ci=c xij

– Nc = |{ i : ci = c }|
– xij = 1 if ωj occurs in di
• class prior probabilities πc = P(c)
– estimated as

    π̂c = Nc / n
49
Learning Multinomial Model
• Generative parameters θc,j = P(ωj | c)
– must satisfy Σj θc,j = 1 for each class c
• Distributions of terms given the class

    θ̂c,j = ( qj + Σi: ci=c nij ) / ( α + Σl=1..|V| Σi: ci=c nil )

– qj and α are hyperparameters of the Dirichlet prior
– nij is the number of occurrences of ωj in di
• Unconditional class probabilities

    π̂c = ( q'c + Nc ) / ( α' + n )
50
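An illustrative Python sketch of a multinomial Naïve Bayes classifier with Dirichlet-style smoothing in the spirit of the estimates above (here qj = 1 for every term and α = |V|, i.e. Laplace smoothing); the training documents are made up:

# Multinomial Naive Bayes sketch on a toy labeled collection.
import math
from collections import Counter

train = [(["web", "graph", "web"], "cs"),
         (["graph", "net", "web"], "cs"),
         (["sheep", "dna", "genome"], "bio"),
         (["dna", "dna", "clone"], "bio")]

vocab = sorted({w for doc, _ in train for w in doc})
classes = sorted({c for _, c in train})
prior = {c: sum(cls == c for _, cls in train) / len(train) for c in classes}

counts = {c: Counter() for c in classes}
for doc, c in train:
    counts[c].update(doc)

def log_theta(w, c):
    # (q_j + count) / (alpha + total counts), with q_j = 1 and alpha = |V|
    return math.log((1 + counts[c][w]) / (len(vocab) + sum(counts[c].values())))

def predict(doc):
    scores = {c: math.log(prior[c]) + sum(log_theta(w, c) for w in doc)
              for c in classes}
    return max(scores, key=scores.get)

print(predict(["web", "net"]))      # -> cs
print(predict(["genome", "dna"]))   # -> bio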
Support Vector Classifiers
• Support vector machines
– Cortes and Vapnik (1995)
– well suited for high-dimensional data
– binary classification
• Training set
D = {(xi, yi), i = 1,…,n}, xi ∈ Rm and yi ∈ {-1, 1}
• Linear discriminant classifier
– separating hyperplane
{ x : f(x) = wT x + w0 = 0 }
• model parameters: w ∈ Rm and w0 ∈ R
51
Support Vector Machines
• Binary classification function
h : Rm → {0, 1} defined as

    h(x) = 1 if f(x) ≥ 0, and h(x) = 0 otherwise

• Training data is linearly separable:
– yi f(xi) > 0 for each i = 1,…,n
• Sufficient condition for D to be linearly separable
– the number of training examples
n = |D| is less than or equal to m + 1
52
Perceptron
Perceptron ( D )
  w ← 0
  w0 ← 0
  repeat
      e ← 0
      for i ← 1,…,n
          do s ← sign( yi ( wT xi + w0 ) )
             if s < 0
                 then w ← w + yi xi
                      w0 ← w0 + yi
                      e ← e + 1
  until e = 0
  return ( w, w0 )
53
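A numpy version of the perceptron pseudocode above, added for illustration; it assumes linearly separable toy data, adds an epoch limit as a safeguard, and also updates on the boundary case s = 0 so that training can leave the all-zero initial weights:

# Perceptron sketch: mistake-driven updates until no errors remain.
import numpy as np

def perceptron(X, y, max_epochs=100):
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(max_epochs):          # guard instead of 'repeat until e = 0'
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + w0) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                w0 += yi
                errors += 1
        if errors == 0:
            break
    return w, w0

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, w0 = perceptron(X, y)
print(w, w0)                             # a separating hyperplane for the toy data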
Overfitting
54
Optimal Separating Hyperplane
• Unique for each linearly separable data set
• Its associated risk of overfitting is smaller than
for any other separating hyperplane
• Margin M of the classifier
– the distance between the separating hyperplane and
the closest training samples
– optimal separating hyperplane – maximum margin
• Can be obtained by solving the constrained
optimization problem

    max w,w0  M = 1 / ||w||   subject to   yi ( wT xi + w0 ) ≥ 1,  i = 1,…,n
55
Optimal Hyperplane and Margin
56
Support Vectors
• Karush-Kuhn-Tucker condition for each xi:

    αi [ yi ( wT xi + w0 ) - 1 ] = 0

• If αi > 0 then the distance of xi from the
separating hyperplane is M
• Support vectors - points with associated αi > 0
• The decision function h(x) is computed from

    f(x) = Σi=1..n yi αi xT xi
57
Feature Selection
• Limitations with large number of terms
– many terms can be irrelevant for class discrimination
• text categorization methods can degrade in accuracy
– time requirements of the learning algorithm increase
exponentially with the number of terms
• Feature selection is a dimensionality reduction
technique
– limits overfitting by identifying the irrelevant terms
• Categorized into two types
– filter model
– wrapper model
58
Filter Model
• Feature selection is applied as a preprocessing step
– determines which features are relevant before learning takes
place
• E.g., the FOCUS algorithm (Almuallim & Dietterich,
1991)
– performs exhaustive search of all vector space subsets,
– determines a minimal set of terms that can provide a consistent
labeling of the training data
• Information theoretic approaches perform well for filter
models
59
Wrapper Model
• Feature selection is based on the estimates of
the generalization error
– specific learning algorithm is used to find the error
estimates
– heuristic search is applied through subsets of terms
– set of terms with minimum estimated error is selected
• Limitations
– can overfit the data if used with classifiers having high
capacity
60
Information Gain Method
• Information gain G – a measure of the information about the class that is
provided by the observation of each term
• Also defined as
– the mutual information I(C, Wj) between the class C and the term Wj

    G(Wj) = Σc=1..K Σωj∈{0,1} P(c, ωj) log [ P(c, ωj) / ( P(c) P(ωj) ) ]
• For feature selection
– compute the information gain for each unique term
– remove terms whose information gain is less than some predefined
threshold
• Limitations
– relevance assessment of each term is done separately
– effect of term co-occurrences is not considered
61
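An illustrative Python sketch of the information gain formula above for a binary term indicator; the documents and class labels are made up:

# Information gain of a term from (class, term-present) co-occurrence counts.
import math
from collections import Counter

def information_gain(term_present, labels):
    n = len(labels)
    joint = Counter(zip(labels, term_present))          # counts for P(c, w_j)
    p_c = Counter(labels)
    p_w = Counter(term_present)
    g = 0.0
    for (c, w), cnt in joint.items():
        p_cw = cnt / n
        g += p_cw * math.log(p_cw / ((p_c[c] / n) * (p_w[w] / n)))
    return g

# does the term occur in each document, and each document's class
term_present = [1, 1, 0, 0, 1, 0]
labels       = ["sport", "sport", "politics", "politics", "sport", "politics"]
print(round(information_gain(term_present, labels), 3))  # ~0.693: highly informative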
Average Relative Entropy
Method
• Whole sets of features are tested for
relevance to the class (Koller and
Sahami, 1996)
• For feature selection
– determine relevance of a selected set using
the average relative entropy
62
Average Relative Entropy
Method
• Let x ∈ V, and let xg be the projection of x onto G ⊆ V
– to estimate the quality of G, measure the distance between
P(C|x) and P(C|xg) using the average relative entropy

    ΔG = Σf P(f) δG(f)

• For an optimal set of features
– ΔG should be small
• Limitations
– parameters are computationally intractable
– distributions are hard to estimate accurately
63
Markov Blanket Method
• M ⊆ V is a Markov blanket for term Wj (Wj ∉ M)
if Wj is conditionally independent of all features in
V - M - {Wj}, given M
• then the class C is conditionally independent of Wj, given M
• Feature selection is performed by
– removing features for which a Markov
blanket is found
64
Approximate Markov Blanket
• For each term Wj in G,
– compute the correlation of Wj with each Wi
– obtain a set Mj of the k terms that have the highest
correlation with Wj
– compute the average cross entropy δ(Wj, Mj)
– select the term for which the average relative
entropy is minimum
• Repeat steps until a predefined number of
terms are eliminated from the set G
65
Measures of Performance
• Determine the accuracy of the classification
model
• To estimate performance of a classification
model
– compare the hypothesis function with the true
classification function
• For a two class problem,
– performance is characterized by the confusion
matrix
66
Confusion Matrix
• TN - irrelevant values not retrieved
• TP - relevant values retrieved
• FP - irrelevant values retrieved
• FN - relevant values not retrieved

                          Actual Category
                            -      +
Predicted Category   -     TN     FN
                     +     FP     TP
• Total retrieved terms = TP + FP
• Total relevant terms = TP + FN
67
Measures of Performance
• For balanced domains
– accuracy characterizes performance
A = (TP+TN) / |D|
– classification error, E = 1 - A
• For unbalanced domains
– precision and recall characterize performance

    π = TP / (TP + FP)

    ρ = TP / (TP + FN)
68
Precision-Recall Curve
Breakeven Point
At the breakeven point, π(t*) = ρ(t*)
69
Precision-Recall Averages
• Microaveraging

    πμ = Σc=1..K TPc / Σc=1..K (TPc + FPc)

    ρμ = Σc=1..K TPc / Σc=1..K (TPc + FNc)

• Macroaveraging

    πM = (1/K) Σc=1..K πc

    ρM = (1/K) Σc=1..K ρc
70
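An illustrative Python sketch of micro- and macro-averaged precision from per-class (TPc, FPc) counts; the counts are made up:

# Micro vs. macro averaging of precision over categories.
def micro_macro_precision(per_class):
    # per_class: list of (TP, FP) pairs, one per category
    tps = sum(tp for tp, _ in per_class)
    fps = sum(fp for _, fp in per_class)
    micro = tps / (tps + fps)
    macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
    return micro, macro

per_class = [(90, 10), (5, 15), (40, 40)]   # (TP_c, FP_c) for three classes
micro, macro = micro_macro_precision(per_class)
print(round(micro, 3), round(macro, 3))     # micro favours the large class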
Applications
• Text categorization methods use
– the document vector or ‘bag of words’ representation
• Exploiting domain-specific aspects of the Web
– e.g., sports pages, citations related to AI
– improves classification performance
71
Classification of Web Pages
• Use of text classification to
– extract information from web documents
– automatically generate knowledge bases
• Web→KB systems (Craven et al.)
– train machine-learning subsystems
• to predict classes and relations
• to populate the KB from data collected from the Web
– provide an ontology and training examples as
inputs
72
Knowledge Extraction
• Consists of two steps
– assign a new web page to one node of the
class hierarchy
– fill in the class attributes by extracting relevant
information from the document
• Naive Bayes classifier
– discriminate between the categories
– predict the class for a web page
73
Example
74
Experimental Results
              Actual Category
Predicted
category     cou    stu    fac    sta    pro    dep    oth   Precision
Cou          202     17      0      0      1      0    552     26.2
Stu            0    421     14     17      2      0    519     43.3
Fac            5     56    118     16      3      0    264     17.9
Sta            0     15      1      4      0      0     45      6.2
Pro            8      9     10      5     62      0    384     13.0
Dep           10      8      3      1      5      4    209      1.7
Oth           19     32      7      3     12      0   1064     93.6
Recall      82.8   75.4   77.1    8.7   72.9  100.0   35.0
75
Classification of News Stories
• Reuters-21578
– consists of 21,578 news stories, assembled
and manually labeled
– 672 categories; each story can belong to more
than one category
• Data set is split into training and test data
76
Experimental Results
• ModApte split (Joachims 1998)
– 9603 training documents and 3299 test documents, 90 categories

Prediction Method          Performance breakeven (%)
Naïve Bayes                73.4
Rocchio                    78.7
Decision tree              78.9
k-NN                       82.0
Rule induction             82.0
Support vector (RBF)       86.3
Multiple decision trees    87.8
77
Email and News Filtering
• ‘Bag of words’ representation
– removes important order information
– need to hand-program terms, e.g., ‘confidential
message’, ‘urgent and personal’
• Naïve Bayes classifier is applied for junk email
filtering
• Feature selection is performed by
– eliminating rare words
– retaining important terms, determined by mutual
information
78
Example Data Set
• Data set consisted of
– 1578 junk messages
– 211 legitimate messages
• The loss from a false positive is higher than the loss from a false negative
• Classify a message as junk
– only if its junk probability is greater than 99.9%
79
Supervised Learning with
Unlabeled Data
• Assigning labels to training set is
– expensive
– time consuming
• Abundance of unlabeled data
– suggests possible use to improve learning
80
Why Unlabeled Data?
• Consider positive and negative examples
– as two separate distributions
– with a very large number of samples available, the
parameters of the distributions can be estimated well
– only a few labeled points are then needed to decide which
Gaussian is associated with the positive and which with the
negative class
• In text domains
– categories can be guessed using term co-occurrences
81
Why Unlabeled Data?
82
EM and Naïve Bayes
• A class variable for unlabeled data
– is treated as a missing variable
– estimated using EM
• Steps involved
– find the conditional probability, for each
document
– compute statistics for parameters using the
probability
– use statistics for parameter re-estimation
83
Experimental Results
84
Transductive SVM
• The optimization problem
– that leads to computing the optimal separating
hyperplane
    min w,w0  ||w||   subject to   yi ( wT xi + w0 ) ≥ 1

– becomes –

    min y'1,…,y'n, w,w0  ||w||   subject to   yi ( wT xi + w0 ) ≥ 1
                                             y'j ( wT x'j + w0 ) ≥ 1

– the missing labels (y'1, …, y'n) are filled in using maximum
margin separation
85
Exploiting Hyperlinks – Co-training
• Each document instance has two alternate
views (Blum and Mitchell 1998)
– the terms in the document, x1
– the terms in the hyperlinks that point to the document, x2
• Each view is sufficient to determine the class of
the instance
– the labeling function that classifies examples is the
same whether applied to x1 or to x2
– x1 and x2 are conditionally independent, given the
class
86
Co-training Algorithm
• Labeled data are used to infer two Naïve
Bayes classifiers, one for each view
• Each classifier will
– examine unlabeled data
– pick the most confidently predicted positive
and negative examples
– add these to the labeled examples
• Classifiers are now retrained on the
augmented set of labeled examples
87
Relational Learning
• Data is in relational format
• Learning algorithm exploits the relations
among data items
• Relations among web documents
– hyperlinked structure of the web
– semi-structured organization of text in HTML
88
Example of Classification Rule
• FOIL algorithm (Quinlan 1990) is used
– to learn classification rules in the WebKB
domain
student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A),
has_jane(B), has_paul(B), not(has_mail(B)).
89
Document Clustering
• Process of finding natural groups in data
– training data are unsupervised
– data are represented as bags of words
• A few useful applications
– automatic grouping of web pages into clusters
based on their content
– grouping results of a search engine query
90
Example
• User query – ‘World Cup’
• Excerpt from search engine results
– http://www.fifaworldcup.com - soccer
– http://www.dubaiworldcup.com - horse racing
– http://www.wcsk8.com - skiing
– http://www.robocup.org - robot soccer
• Document clustering results (www.vivisimo.com)
– FIFA world cup (44)
– Soccer (42)
– Sports (24)
– History (19)
91
Hierarchical Clustering
• Generates a binary tree, called
dendrogram
– does not presume a predefined number of
clusters
– consider clustering n objects
• the root node consists of a cluster containing all n
objects
• the n leaf nodes correspond to clusters, each
containing one of the n objects
92
Hierarchical Clustering
Algorithm
• Given
– a set of N items to be clustered
– NxN distance (or similarity) matrix
• Assign each item to its own cluster
– N items will have N clusters
• Find the closest pair of clusters and merge them into a
single cluster
– distances between the clusters equal the distances between the
items they contain
• Compute distances between the new cluster and each of
the old clusters
• Repeat until a single cluster of size N is formed
93
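An illustrative Python sketch of single-link agglomerative clustering following the steps above, over a small made-up distance matrix:

# Single-link hierarchical clustering: repeatedly merge the closest pair.
def single_link(dist, n):
    clusters = {i: {i} for i in range(n)}
    while len(clusters) > 1:
        # closest pair = shortest item-to-item distance between clusters
        a, b = min(
            ((a, b) for a in clusters for b in clusters if a < b),
            key=lambda p: min(dist[i][j] for i in clusters[p[0]] for j in clusters[p[1]]),
        )
        print("merge", sorted(clusters[a]), "+", sorted(clusters[b]))
        clusters[a] |= clusters.pop(b)

dist = [[0, 1, 4, 5],
        [1, 0, 3, 6],
        [4, 3, 0, 2],
        [5, 6, 2, 0]]
single_link(dist, 4)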
Hierarchical Clustering
• Chaining-effect
– 'closest' - defined as the shortest distance between
clusters
– cluster shapes become elongated chains
– objects far away from each other tend to be grouped
into the same cluster
• Different ways of defining ‘closest’
– single-link clustering
– complete-link clustering
– average-distance clustering
– domain-specific knowledge, such as cosine distance,
TF-IDF weights, etc.
94
Probabilistic Model-based
Clustering
• Model-based clustering assumes
– existence of generative probabilistic model for
data, as a mixture model with K components
• Each component corresponds
– to a probability distribution model for one of
the clusters
• Need to learn the parameters of each
component model
95
Probabilistic Model-based
Clustering
• Apply Naïve Bayes model for document
clustering
– contains one parameter per dimension
– the dimensionality of document vectors is typically
high (5,000 to 50,000)
96
Related Approaches
• Integrate ideas from hierarchical clustering
and probabilistic model-based clustering
– combine dimensionality reduction with
clustering
• Dimension reduction techniques can
destroy the cluster structure
– need for objective function to achieve more
reliable clustering in lower dimension space
97
Information Extraction
• Automatically extract information from unstructured
text data on Web pages
• Represent extracted information in some
well-defined schema
• E.g.
– crawl the Web searching for information about
certain technologies or products of interest
• extract information on authors and books from
various online bookstore and publisher pages
98
Info Extraction as Classification
• Represent each document as a sequence of words
• Use a ‘sliding window’ of width k as input to a
classifier
– each of the k inputs is a word in a specific position
• The system is trained on positive and negative
examples (typically manually labeled)
• Limitation: no account of sequential constraints
– e.g. the ‘author’ field usually precedes the ‘address’
field in the header of a research paper
– can be fixed by using stochastic finite-state models
99
Hidden Markov Models
Example: Classify short segments
of text in terms of whether they
correspond to the title, author
names, addresses, affiliations, etc.
100
Hidden Markov Model
• Each state corresponds to one of the fields that we
wish to extract
– e.g. paper title, author name, etc.
• The true Markov state sequence is unknown at parse time
– we can only see noisy observations from each state
• the sequence of words in the document
• Each state has a characteristic probability distribution
over the set of all possible words
– e.g. specific distribution of words from the state ‘title’
101
Training HMM
• Given a sequence of words and HMM
– parse the observed sequence into a
corresponding set of inferred states
• Viterbi algorithm
• Can be trained
– in supervised manner with manually labeled data
– bootstrapped using a combination of labeled and
unlabeled data
102
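An illustrative Python sketch of the Viterbi algorithm for a tiny two-state HMM (states 'title' and 'author'); all probabilities are made up:

# Viterbi sketch: parse a word sequence into the most likely state sequence.
import math

states = ["title", "author"]
start  = {"title": 0.8, "author": 0.2}
trans  = {"title": {"title": 0.7, "author": 0.3},
          "author": {"title": 0.1, "author": 0.9}}
emit   = {"title": {"modeling": 0.4, "web": 0.4, "smith": 0.2},
          "author": {"modeling": 0.1, "web": 0.1, "smith": 0.8}}

def viterbi(words):
    V = [{s: math.log(start[s]) + math.log(emit[s][words[0]]) for s in states}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][w])
            ptr[s] = prev
        V.append(row); back.append(ptr)
    # follow back-pointers from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["modeling", "web", "smith"]))   # ['title', 'title', 'author']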