Data Mining For Hypertext - CS

sdbi – winter 2001
Data Mining For Hypertext:
A Tutorial Survey
Based on a paper by:
Soumen Chakrabarti
Indian Institute of Technology Bombay.

[email protected]
Lecture by:
Noga Kashti
Efrat Daum
11/11/01
Let's start with definitions…

Hypertext - a collection of documents (or "nodes") containing cross-references or "links" which, with the aid of an interactive browser program, allow the reader to move easily from one document to another.

Data Mining - analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.
Two Ways For Getting Information From The Web:

Clicking On Hyperlinks

Searching Via Keyword Queries
Some History…

Before the Web became popular, hypertext had been used by the ACM, SIGIR, SIGLINK/SIGWEB and digital libraries communities.

The old IR (Information Retrieval) deals with documents, whereas the Web deals with semi-structured data.
Some Numbers…

The Web exceeds 800 million HTML pages on about three million servers.

Almost a million pages are added daily.

A typical page changes in a few months.

Several hundred gigabytes change every month.
Difficulties With Accessing Information On The Web:

The usual problems of text search (synonymy, polysemy, context sensitivity) become much more severe.

Semi-structured data.

Sheer size and flux.

No consistent standard or style.
The Old Search Process Is Often Unsatisfactory!

Deficiency of scale.

Poor accuracy (low recall and low precision).
Better Solutions: Data Mining And Machine Learning

Natural language (NL) techniques.

Statistical techniques for learning structure in various forms from text, hypertext and semi-structured data.
Issues We'll Discuss

Models
Supervised learning
Unsupervised learning
Semi-supervised learning
Social network analysis
Models For Text

Representation for text with statistical analyses only (bag-of-words):

The vector space model
The binary model
The multinomial model
Models For Text (cont.)

The vector space model:

Documents -> tokens -> canonical forms.
A canonical token is an axis in a Euclidean space.
The t-th coordinate of d is n(d,t), where t is a term and d is a document.
The Vector Space Model: Normalize The Document Length To 1

$\|d\|_1 = \sum_t n(d,t)$

$\|d\|_2 = \sqrt{\sum_t n(d,t)^2}$

$\|d\|_\infty = \max_t n(d,t)$

$\mathrm{IDF}(t) = 1 + \log \frac{N}{N_t}$ (where N is the number of documents and $N_t$ is the number of documents containing term t)

$\mathrm{TFIDF}(d,t) = \frac{n(d,t)}{\|d\|_1} \cdot \mathrm{IDF}(t)$
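As a concrete illustration of this weighting, here is a minimal sketch in Python (the toy corpus and variable names are made up for the example) that builds L1-normalized TFIDF vectors:

import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)
    counts = [Counter(d) for d in docs]
    # N_t = number of documents containing term t
    df = Counter(t for c in counts for t in c)
    vectors = []
    for c in counts:
        length = sum(c.values())                            # ||d||_1
        vec = {t: (n / length) * (1 + math.log(N / df[t]))  # TF * IDF
               for t, n in c.items()}
        vectors.append(vec)
    return vectors

docs = [["car", "gearbox", "racing"],
        ["auto", "transmission", "auto"],
        ["car", "auto", "transmission"]]
print(tfidf_vectors(docs)[0])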
More Models For Text

The binary model: a document is a set of terms, which is a subset of the lexicon. Word counts are not significant.

The multinomial model: a die with |T| faces. Every face has a probability θt of showing up when tossed. After deciding on the total word count, the author tosses the die repeatedly and writes down the term that shows up each time.
Models For Hypertext

Hypertext: text with hyperlinks.
Varying levels of detail.

Example: a directed graph (D, L)
D – the set of nodes/documents/pages
L – the set of links
Models For Semi-structured Data

A point of convergence for the web (documents) and database (data) communities.
Models For Semi-structured Data (cont.)

Example: topic directories with tree-structured hierarchies, such as the Open Directory Project and Yahoo!

Another representation: XML.
Supervised Learning (classification)

Algorithm initialization: training data, where each item is marked with a label or class from a discrete finite set.

Input: unlabeled data.

Algorithm's role: guess the labels of the data.
Supervised Learning (cont.)

Example: topic directories.

Advantages: provide structure, restrict keyword search, and can enable powerful searches.
Probabilistic Models For Text Learning

Let c1,…,cm be m classes or topics, each with a set of training documents Dc.

Prior probability of a class: $\Pr(c) = \frac{|D_c|}{\sum_{c'} |D_{c'}|}$

T: the universe of terms in all the training documents.
Probabilistic Models For Text Learning (cont.)

Naive Bayes classification:

Assumption: for each class c, there is a binary text generator model.
Model parameters: Φc,t – the probability that a document in class c will mention term t at least once.
Naive Bayes classification (cont.)

$\Pr(d \mid c) = \prod_{t \in d} \Phi_{c,t} \prod_{t \in T,\, t \notin d} (1 - \Phi_{c,t})$

Problems:

Short documents are discouraged.
The estimate of Pr(d|c) is likely to be greatly distorted.
Naive Bayes classification (cont.)

With the multinomial model:

$\Pr(d \mid c) = \binom{\|d\|_1}{\{n(d,t)\}} \prod_{t \in d} \theta_{c,t}^{\,n(d,t)}$
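A minimal multinomial naive Bayes sketch in Python follows (the multinomial coefficient is dropped because it is identical for every class, and Laplace smoothing is added so unseen terms do not zero out the product; both choices are assumptions of this sketch, not part of the slide):

import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs):
    """labeled_docs: list of (token_list, class_label) pairs.
    Returns class priors and log(theta_{c,t}) with Laplace smoothing."""
    class_docs = defaultdict(int)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in labeled_docs:
        class_docs[c] += 1
        term_counts[c].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labeled_docs) for c, n in class_docs.items()}
    log_theta = {}
    for c, counts in term_counts.items():
        total = sum(counts.values())
        log_theta[c] = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                        for t in vocab}
    return priors, log_theta

def classify(tokens, priors, log_theta):
    # The multinomial coefficient is omitted: it is the same for every class.
    scores = {c: math.log(priors[c]) + sum(log_theta[c].get(t, 0.0) for t in tokens)
              for c in priors}
    return max(scores, key=scores.get)

train = [(["car", "gearbox", "racing"], "autos"),
         (["auto", "transmission", "clutch"], "autos"),
         (["stock", "market", "shares"], "finance")]
priors, log_theta = train_multinomial_nb(train)
print(classify(["transmission", "gearbox"], priors, log_theta))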
Naive Bayes classification (cont.)

Problems:

Again, short documents are discouraged.
Inter-term correlation is ignored.
Multiplicative Φc,t 'surprise' factor.

Conclusion: both models are effective.
More Probabilistic Models For Text Learning

Parameter smoothing and feature selection.
Limited dependence modeling.
The maximum entropy technique.
Support vector machines (SVMs).
Hierarchies over class labels.
Learning Relations

Classification extension: a combination of statistical and relational learning.
Improves accuracy.
The ability to invent predicates.
Can represent hyperlink graph structure and word statistics of neighboring documents.
Learned rules will not depend on specific keywords.
Unsupervised learning

Given: hypertext documents.
Goal: a hierarchy among the documents.
What is a good clustering?
Basic clustering techniques

Techniques for clustering:

k-means
hierarchical agglomerative clustering
Basic clustering techniques (cont.)

Document representations:

unweighted vector space
TFIDF vector space

Similarity between two documents:

cos(θ), where θ is the angle between their corresponding vectors
the distance between the (length-normalized) vectors
k-means clustering

The k-means algorithm:

Input:
d1,…,dn - a set of n documents
k - the number of clusters desired (k ≤ n)

Output:
C1,…,Ck – k clusters containing the n classified documents
k-means clustering (cont.)

The k-means algorithm (cont.):

Initialize: guess k initial means m1,…,mk.

Until there are no changes in any mean:
For each document d: d belongs to ci if ||d - mi|| is the minimum of all the k distances.
For 1 ≤ i ≤ k: replace mi with the mean of all the documents assigned to ci.
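A minimal k-means sketch in Python (the distance and mean functions are passed in as parameters; the toy 2-D points in the usage example merely stand in for document vectors):

import random

def kmeans(docs, k, dist, mean, iters=100):
    """docs: list of vectors, k: number of clusters.
    dist(a, b) -> float, mean(list_of_vectors) -> vector."""
    means = random.sample(docs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each document joins its nearest mean
        for d in docs:
            i = min(range(k), key=lambda j: dist(d, means[j]))
            clusters[i].append(d)
        # Update step: recompute each mean from its cluster
        new_means = [mean(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:
            break
        means = new_means
    return clusters, means

# Toy usage with 2-D points standing in for document vectors
dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
mean = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans(points, 2, dist, mean)[1])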
k-means clustering – Example

(Figure: for k = 2 and for k = 3, the cluster means m1, m2, m3 are shown moving at each iteration until they settle at the centers of the clusters.)
k-means clustering (cont.)

Problem: high dimensionality.
E.g.: if each of 30000 dimensions has only two possible values, the vector space size is 2^30000.

Solution: projecting out some dimensions.
Agglomerative clustering

Documents are merged into superdocuments or groups until only one group is left.

Some definitions:

s(d1, d2) = the similarity between documents d1 and d2

The self-similarity of group A:
$s(A) = \frac{1}{|A| \, (|A| - 1)} \sum_{d_1, d_2 \in A,\ d_1 \neq d_2} s(d_1, d_2)$
Agglomerative clustering (cont.)

The agglomerative clustering algorithm:

Input:
d1,…,dn - a set of n documents

Output:
G – the final group, with a nested hierarchy
Agglomerative clustering (cont.)

The agglomerative clustering algorithm:

Initialize: G := {G1,…,Gn}, where Gi = {di}.

While |G| > 1:
Find A and B in G such that s(A ∪ B) is maximized.
G := (G – {A, B}) ∪ {A ∪ B}

Time: O(n^2)
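A naive bottom-up sketch of this loop in Python, using group self-similarity as the merge criterion (the pairwise similarity function is a parameter; this illustrative version is not optimized and does not reach the O(n^2) bound mentioned above):

from itertools import combinations

def self_similarity(group, sim):
    """Average pairwise similarity of the documents inside a group."""
    if len(group) < 2:
        return 0.0
    pairs = list(combinations(group, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def agglomerative(docs, sim):
    """Returns a nested tuple describing the merge hierarchy."""
    groups = [(frozenset([d]), d) for d in docs]   # (members, tree)
    while len(groups) > 1:
        # Pick the pair of groups whose union is most self-similar
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda ij: self_similarity(groups[ij[0]][0] | groups[ij[1]][0], sim))
        merged = (groups[i][0] | groups[j][0], (groups[i][1], groups[j][1]))
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0][1]

# Toy usage: similarity is higher for points that are close on a line
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
print(agglomerative([1, 2, 10, 11, 20], sim))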
Agglomerative clustering (cont.)

The agglomerative clustering algorithm – Example:

Documents a–g are merged in order of decreasing similarity:
s(b,c) = 0.7
s(f,g) = 0.6
s(b-c, d) = 0.5
s(e, f-g) = 0.4
s(a, b-c-d) = 0.3
s(a-b-c-d, e-f-g) = 0.1

(Figure: the resulting dendrogram over a, b, c, d, e, f, g, with the merge similarity marked on a scale from 0 down to 0.8.)
Techniques from linear algebra

Documents and terms are represented by vectors in Euclidean space.

Applications of linear algebra to text analysis:
Latent semantic indexing (LSI)
Random projections
Co-occurring terms

Example (figure): the term 'car' co-occurs with 'gearbox' ("Linear potentiometer for a racing car gearbox…"), and the term 'auto' co-occurs with 'transmission' ("Auto Transmission interchange W/404 to 504?? …").
Latent semantic indexing (LSI)

Vector space model of documents:

Let m = |T|, the lexicon size.
Let n = the number of documents.
Define the term-by-document matrix $A \in R^{m \times n}$, where $a_{ij}$ = the number of occurrences of term i in document j.
Latent semantic indexing (LSI) (cont.)

(Figure: the m×n term-by-document matrix, with rows t1,…,tm for terms and columns d1,…,dn for documents; the rows for 'car' and 'auto' are marked as similar.)

How can we reduce it?
Singular Value Decomposition (SVD)

Let $A \in R^{m \times n}$, $m \ge n$, be a matrix.
The singular value decomposition of A is the factorization $A = U D V^T$, where:
U and V are orthogonal: $U^T U = V^T V = I_n$
$D = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$ with $\sigma_i \ge 0$, $1 \le i \le n$

Then:
$U = [u_1, \dots, u_n]$; $u_1, \dots, u_n$ are the left singular vectors.
$V = [v_1, \dots, v_n]$; $v_1, \dots, v_n$ are the right singular vectors.
$\sigma_1, \dots, \sigma_n$ are the singular values of A.
Singular Value Decomposition (SVD) (cont.)

$A A^T = (U D V^T)(V D^T U^T) = U D I D U^T = U D^2 U^T$
$\Rightarrow A A^T U = U D^2 = [\sigma_1^2 u_1, \dots, \sigma_n^2 u_n]$
$\Rightarrow$ for $1 \le i \le n$: $A A^T u_i = \sigma_i^2 u_i$
$\Rightarrow$ the columns of U are the eigenvectors of $A A^T$.

Similarly, $A^T A = V D^2 V^T$
$\Rightarrow$ the columns of V are the eigenvectors of $A^T A$.

The eigenvalues of $A A^T$ (or $A^T A$) are $\sigma_1^2, \dots, \sigma_n^2$.
Singular Value Decomposition (SVD) (cont.)

$A = U D V^T = [u_1, u_2, \dots, u_n] \,\mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)\, [v_1, v_2, \dots, v_n]^T = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots + \sigma_n u_n v_n^T$

Let $A_k = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots + \sigma_k u_k v_k^T$, $k \le n$, be the k-truncated SVD.

rank($A_k$) = k
$\|A - A_k\|_2 \le \|A - M_k\|_2$ for any matrix $M_k$ of rank k.
Singular Value Decomposition (SVD) (cont.)

Note: $A, A_k \in R^{m \times n}$

$[A]_{m \times n} = [U]_{m \times n} [D]_{n \times n} [V^T]_{n \times n}$

After the reduction:

$[A_k]_{m \times n} = [U_k]_{m \times k} [D_k]_{k \times k} [V_k^T]_{k \times n}$
LSI with SVD

Define $q \in R^m$ – a query vector:
$q_i \neq 0$ if term i is a part of the query.

Then $A^T q \in R^n$ is the answer vector:
$(A^T q)_j \neq 0$ if document j contains one or more terms of the query.

How can we do better?
LSI with SVD (cont.)

Use $A_k$ instead of A: calculate $A_k^T q$.

Now, a query on 'car' will also return documents containing the word 'auto'.
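A small sketch of the whole LSI pipeline using numpy (the tiny term-by-document matrix and the choice k = 2 are made-up illustrative values):

import numpy as np

# Toy term-by-document matrix A (rows = terms, columns = documents)
terms = ["car", "auto", "gearbox", "transmission", "stock"]
A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 2]], dtype=float)

# Full SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Query on 'car' only
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0

print("A^T q  :", A.T @ q)    # only documents literally containing 'car'
print("Ak^T q :", Ak.T @ q)   # related documents may get nonzero scores too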
Random projections

Theorem:

Let:
$v \in R^n$ - a unit vector
H - a randomly oriented l-dimensional subspace through the origin
X - the random variable measuring the square of the length of the projection of v on H

Then:
$E[X] = \frac{l}{n}$, and if l is chosen between $\log n$ and $O(\sqrt{n})$, then
$\Pr\left( \left| X - \frac{l}{n} \right| > \epsilon \frac{l}{n} \right) \le 2 \sqrt{l}\, e^{-(l-1)\epsilon^2 / 4}$, where $0 < \epsilon < \frac{1}{2}$.
Random projections (cont.)

A projection of a set of points onto a randomly oriented subspace causes only a small distortion of the inter-point distances.

The technique:
reduce the dimensionality of the points
speed up the distance computations
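A minimal random-projection sketch using numpy (a Gaussian random matrix is one common way to realize a "randomly oriented" subspace; the dimensions below are made up):

import numpy as np

rng = np.random.default_rng(0)

n, l = 10000, 200                 # original and reduced dimensionality
docs = rng.random((50, n))        # 50 random "document vectors"

# Project onto l random directions; dividing by sqrt(l) roughly preserves lengths
R = rng.standard_normal((n, l)) / np.sqrt(l)
projected = docs @ R

a, b = docs[0], docs[1]
pa, pb = projected[0], projected[1]
print(np.linalg.norm(a - b), np.linalg.norm(pa - pb))  # the two distances are close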
Semi-supervised learning

Real-life applications contain both:
labeled documents
unlabeled documents

Between supervised and unsupervised learning.
Learning from labeled and unlabeled documents

Expectation Maximization (EM) Algorithm:

Initialize: train a naive Bayes classifier using only the labeled data.

Repeat the EM iteration until near-convergence:

E-step: $\theta_{c,t} = \frac{1 + \sum_d n(d,t) \Pr(c \mid d)}{|T| + \sum_d \Pr(c \mid d) \sum_\tau n(d,\tau)}$

M-step: assign class probabilities Pr(c|d) to all the documents that are not labeled, using the θc,t estimates.

Classification error is reduced by a third in the best cases.
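A compact sketch of this EM loop for a multinomial naive Bayes model in Python (function names, the toy data, and the simple prior smoothing are illustrative assumptions, not taken from the paper):

import math
from collections import Counter

def em_nb(labeled, unlabeled, classes, iters=10):
    """labeled: list of (tokens, class); unlabeled: list of token lists."""
    vocab = sorted({t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d})
    labeled_docs = [(Counter(d), {c: 1.0 if c == y else 0.0 for c in classes})
                    for d, y in labeled]
    unlabeled_counts = [Counter(d) for d in unlabeled]
    soft = [{c: 0.0 for c in classes} for _ in unlabeled_counts]   # Pr(c|d), unlabeled

    def estimate(docs):
        """The slide's E-step: theta_{c,t} and class priors from soft-labeled docs."""
        theta, prior = {}, {}
        for c in classes:
            num = {t: 1.0 for t in vocab}          # the "+1" smoothing term
            den = float(len(vocab))                # the "+|T|" smoothing term
            weight = 0.0
            for counts, pr in docs:
                for t, n in counts.items():
                    num[t] += n * pr[c]
                den += pr[c] * sum(counts.values())
                weight += pr[c]
            theta[c] = {t: num[t] / den for t in vocab}
            prior[c] = (weight + 1.0) / (len(docs) + len(classes))
        return theta, prior

    # Initialize: a naive Bayes classifier trained on the labeled data only
    theta, prior = estimate(labeled_docs)
    for _ in range(iters):
        # M-step (as labeled on the slide): assign Pr(c|d) to the unlabeled documents
        for i, counts in enumerate(unlabeled_counts):
            log_p = {c: math.log(prior[c]) +
                        sum(n * math.log(theta[c][t]) for t, n in counts.items())
                     for c in classes}
            m = max(log_p.values())
            z = sum(math.exp(v - m) for v in log_p.values())
            soft[i] = {c: math.exp(log_p[c] - m) / z for c in classes}
        # E-step (as labeled on the slide): re-estimate theta from labeled + unlabeled
        theta, prior = estimate(labeled_docs + list(zip(unlabeled_counts, soft)))
    return theta, prior, soft

labeled = [(["car", "gearbox"], "autos"), (["stock", "market"], "finance")]
unlabeled = [["auto", "gearbox", "clutch"], ["shares", "market"]]
theta, prior, soft = em_nb(labeled, unlabeled, ["autos", "finance"])
print(soft)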
Relaxation labeling

The hypertext model:

Documents are nodes in a hypertext graph.
There are other sources of information induced by the links.

(Figure: linked documents whose classes are still unknown, marked with "?".)
Relaxation labeling (cont.)

Notation: c = class, t = term, N = neighbors.

In supervised learning: Pr(t|c).
In hypertext, using the neighbors' terms: Pr(t(d), t(N(d)) | c).
A better model, using the neighbors' classes: Pr(t(d), c(N(d)) | c).

Problem: circularity – the neighbors' classes are themselves unknown.
Relaxation labeling (cont.)

Resolving the circularity:

Initialize: assign Pr(0)(c|d) to each document d ∈ N(d1), where d1 is a test document (using text only).

Iterate:
$\Pr^{(r+1)}(c \mid d, N(d)) = \sum_{c(N(d))} \Pr^{(r)}(c(N(d))) \cdot \Pr^{(r)}(c \mid d, c(N(d)))$
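For small neighborhoods the sum over neighbor-class configurations can be written out directly. Here is a toy sketch in Python; the text-only probabilities, the neighbor graph, and the class-coupling matrix used to approximate Pr(c | d, c(N(d))) are all illustrative assumptions, not something specified on the slide:

from itertools import product

def relaxation_labeling(text_prob, neighbors, coupling, classes, iters=10):
    """text_prob[d][c]: Pr(c|d) from a text-only classifier (the Pr^(0) initialization).
    neighbors[d]: list of documents linked to d.
    coupling[c][cn]: assumed probability that a neighbor has class cn given d has class c."""
    pr = {d: dict(p) for d, p in text_prob.items()}   # current Pr^(r)(c|d)
    for _ in range(iters):
        new_pr = {}
        for d in pr:
            nbrs = neighbors.get(d, [])
            scores = {c: 0.0 for c in classes}
            # Sum over all joint class assignments of the neighbors
            for assignment in product(classes, repeat=len(nbrs)):
                # Pr^(r)(c(N(d))): probability of this neighbor configuration
                p_conf = 1.0
                for nbr, c_n in zip(nbrs, assignment):
                    p_conf *= pr[nbr][c_n]
                # Pr^(r)(c | d, c(N(d))): text prior reweighted by the class coupling
                weights = {c: text_prob[d][c] for c in classes}
                for c_n in assignment:
                    for c in classes:
                        weights[c] *= coupling[c][c_n]
                z = sum(weights.values()) or 1.0
                for c in classes:
                    scores[c] += p_conf * weights[c] / z
            new_pr[d] = scores
        pr = new_pr
    return pr

classes = ["autos", "finance"]
text_prob = {"d1": {"autos": 0.6, "finance": 0.4},
             "d2": {"autos": 0.9, "finance": 0.1},
             "d3": {"autos": 0.2, "finance": 0.8}}
neighbors = {"d1": ["d2", "d3"], "d2": ["d1"], "d3": ["d1"]}
coupling = {"autos": {"autos": 0.8, "finance": 0.2},
            "finance": {"autos": 0.2, "finance": 0.8}}
print(relaxation_labeling(text_prob, neighbors, coupling, classes))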
Social network analysis

Social networks arise:
between academics, by coauthoring and advising.
between movie personnel, by directing and acting.
between people, by making phone calls.
between web pages, by hyperlinking to other web pages.

Applications:
Google (PageRank)
HITS

Google: PageRank

$\mathrm{PageRank}(v) = \frac{p}{N} + (1 - p) \sum_{u \to v} \frac{\mathrm{PageRank}(u)}{\mathrm{OutDegree}(u)}$

where:
→ means "links to"
N = total number of nodes in the Web graph

Simulates a random walk on the web graph.
Uses a score of popularity.
The popularity score is precomputed, independently of the query.
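A straightforward power-iteration sketch of this formula in Python (the tiny example graph and the value p = 0.15 are illustrative choices, and dangling nodes without out-links are not handled):

def pagerank(graph, p=0.15, iters=50):
    """graph: dict node -> list of nodes it links to. Returns node -> score."""
    nodes = list(graph)
    N = len(nodes)
    rank = {v: 1.0 / N for v in nodes}
    for _ in range(iters):
        new_rank = {v: p / N for v in nodes}           # random-jump term p/N
        for u, out in graph.items():
            for v in out:                               # each out-link passes on an equal share
                new_rank[v] += (1 - p) * rank[u] / len(out)
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(graph))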
Hyperlink induced topic search (HITS)

Depends on a search engine.

For each node u in the graph, calculate an authority score ($a_u$) and a hub score ($h_u$):

Initialize $h_u = a_u = 1$.
Repeat until convergence:
$a_v := \sum_{u \to v} h_u$ and $h_u := \sum_{u \to v} a_v$
$\sum_u h_u$ and $\sum_v a_v$ are normalized to 1.
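A minimal HITS iteration in Python on a toy graph (L1 normalization is used here to match the "normalized to 1" sums on the slide; the example graph is made up):

def hits(graph, iters=50):
    """graph: dict node -> list of nodes it links to. Returns (authority, hub) scores."""
    nodes = list(graph)
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # Authority of v: sum of hub scores of pages linking to v
        new_auth = {v: 0.0 for v in nodes}
        for u, out in graph.items():
            for v in out:
                new_auth[v] += hub[u]
        # Hub of u: sum of authority scores of the pages u links to
        new_hub = {u: sum(new_auth[v] for v in out) for u, out in graph.items()}
        # Normalize each score vector so it sums to 1
        a_sum, h_sum = sum(new_auth.values()), sum(new_hub.values())
        auth = {v: s / a_sum for v, s in new_auth.items()}
        hub = {v: s / h_sum for v, s in new_hub.items()}
    return auth, hub

graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c", "b"]}
print(hits(graph))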


Interesting pages include links to other interesting pages.

The goal:
many relevant pages
few irrelevant pages
fast
Conclusion

Supervised learning
Probabilistic models

Unsupervised learning
Techniques for clustering:
k-means (top-down)
agglomerative (bottom-up)
Techniques for reducing dimensionality:
LSI with SVD
Random projections

Semi-supervised learning
The EM algorithm
Relaxation labeling
References

http://www.engr.sjsu.edu/~knapp/HCIRDFSC/C/k_means.htm

http://ei.cs.vt.edu/~cs5604/cs5604cnCL/CL-illus.html

http://www.cs.utexas.edu/users/inderjit/Datamining

Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections (Cutting, Karger, Pedersen, Tukey)