On Node Classification in Dynamic Content-based Networks

Charu Aggarwal
IBM T. J. Watson Research Center
[email protected]

Nan Li
University of California, Santa Barbara
[email protected]

Presented by Nan Li
Motivation
[Figure: a co-authorship network at Year 2001 (Jiawei Han, Jian Pei, Ke Wang, Kenneth A. Ross, ...), with each author annotated by paper keywords such as "Data Mining", "Sequential Pattern", and "Association Rules"]
Motivation
[Figure: the co-authorship network at Year 2002, now including Marianne Winslett, Xifeng Yan, and Philip S. Yu, with updated keywords such as "Stream", "Web", and "Pattern Mining"]
Motivation
[Figure: the co-authorship network at Year 2003, now including Charu Aggarwal, with keywords such as "Clustering", "Indexing", "Graph", and "XML"]
Motivation
Networks are annotated with an increasing amount of text
• Citation networks, co-authorship networks, product databases with large amounts of text content, etc.
• Highly dynamic
The node classification problem
• Arises in many network scenarios in which the underlying nodes are associated with content.
• A subset of the nodes in the network may be labeled.
• Can we use these labeled nodes, in conjunction with the structure, to classify the nodes that are not currently labeled?
Applications
Challenges
Information networks are very large
• Scalable and efficient
Many such networks are dynamic
• Updatable in real time
• Self-adaptable and robust
Such networks are often noisy
• Intelligent and selective
Heterogeneous correlations in such networks
[Figure: a network whose nodes are labeled with classes A, B, and C]
Outline
Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text- and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Related Work
Link-based classification (Bhagat et al., WebKDD 2007)
• Local iterative
• Global nearest neighbor
Content-only classification (Nigam et al., Machine Learning 2000)
• Each object's own attributes only
Relational classification (Sen et al., Technical Report 2004)
• Each object's own attributes
• Attributes and known labels of the neighbors
Collective classification (Macskassy & Provost, JMLR 2007; Sen et al., Technical Report 2004; Chakrabarti, SIGMOD 1998)
• Local classification
  • Flexible: ranging from a decision tree to an SVM
• Approximate inference algorithms
  • Iterative classification
  • Gibbs sampling
  • Loopy belief propagation
  • Relaxation labeling
Outline
Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text- and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
DYCOS in a Nutshell
Node classification in a dynamic environment
Dynamic network: the entire network at time t is denoted by G_t = (N_t, A_t, T_t), where N_t is the node set, A_t the edge set, and T_t the set of currently labeled nodes.
Problem statement:
Classify the unlabeled nodes (N_t \ T_t) using both the content and the structure of the network, for all time stamps, in an efficient and accurate manner.
[Figure: network snapshots at times t, t+1, and t+2]
Both the structure and the content of the network change over time!
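To make the notation concrete, here is a minimal sketch of one snapshot G_t as a Python data structure; the class and field names are hypothetical, not taken from any released DYCOS implementation.

```python
from dataclasses import dataclass, field

@dataclass
class GraphSnapshot:
    """One snapshot G_t = (N_t, A_t, T_t) of the dynamic network."""
    nodes: set = field(default_factory=set)      # N_t: all nodes at time t
    edges: set = field(default_factory=set)      # A_t: undirected edges (u, v)
    labeled: dict = field(default_factory=dict)  # T_t: node -> known class label

    def unlabeled(self):
        # The classification targets at time t: N_t \ T_t
        return self.nodes - set(self.labeled)
```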
Semi-bipartite Transformation
Text-augmented representation (sketched below)
• Leveraged for a random walk-based classification model that uses both links and text
• Two partitions: structural nodes and word nodes
• Semi-bipartite: one partition of nodes is allowed to have edges either within the set, or to nodes in the other partition.
Efficient updates upon dynamic changes
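A minimal sketch of the transformation, assuming each node carries a bag of keywords; the function and variable names are illustrative rather than from a reference implementation. Note that node_to_words and word_to_nodes are exactly the two inverted lists discussed on a later slide.

```python
from collections import defaultdict

def build_semi_bipartite(edges, node_words):
    """Build the text-augmented, semi-bipartite adjacency structure.

    Structural nodes keep their links to each other and gain links to word
    nodes; word nodes link only back to structural nodes (the bipartite side).
    """
    struct_adj = defaultdict(set)     # structural node -> structural neighbors
    node_to_words = defaultdict(set)  # structural node -> its word nodes
    word_to_nodes = defaultdict(set)  # word node -> structural nodes containing it

    for u, v in edges:                # edges within the structural partition
        struct_adj[u].add(v)
        struct_adj[v].add(u)
    for node, words in node_words.items():  # edges across the two partitions
        for w in words:
            node_to_words[node].add(w)
            word_to_nodes[w].add(node)
    return struct_adj, node_to_words, word_to_nodes
```

Because all three maps are hash-based, a newly arriving node or edge is absorbed by a handful of set insertions, which is what makes the efficient dynamic updates claimed above plausible.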
Random Walk-Based Classification
Random walks over the augmented structure
• Starting node: the unlabeled node to be classified.
• Structural hop: a random jump from a structural node to one of its neighbors
• Content-based multi-hop: a jump from a structural node to another through implicit common word nodes
• Structural parameter: ps
Classification
• Classify the starting node with the most frequently encountered class label during the random walks (see the sketch below)
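A sketch of the classification step, reusing the structures built above. The parameter names follow the slides (ps, with l walks of length h); the sampling details, such as uniform choices and the handling of dead ends, are assumptions made for illustration.

```python
import random
from collections import Counter

def dycos_classify(start, struct_adj, node_to_words, word_to_nodes,
                   labels, p_s=0.7, num_walks=100, walk_len=5):
    """Classify one unlabeled node via short random walks (num_walks = l,
    walk_len = h). Returns the most frequently encountered class label."""
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            if random.random() < p_s and struct_adj[node]:
                # Structural hop: move to a random graph neighbor
                node = random.choice(sorted(struct_adj[node]))
            elif node_to_words[node]:
                # Content multi-hop: node -> shared word -> another node
                word = random.choice(sorted(node_to_words[node]))
                node = random.choice(sorted(word_to_nodes[word]))
            else:
                break  # dead end: no neighbors and no keywords
            if node in labels:
                visits[labels[node]] += 1  # tally labels seen along the walk
    return visits.most_common(1)[0][0] if visits else None
```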
Gini-Index & Inverted Lists
Discriminative keywords
• A set M_t of the top m words with the highest discriminative power is used to construct the word-node partition.
• Gini-index: for a word w and k classes, let n_j(w) denote the frequency of w among nodes of class j; then
  p_i(w) = n_i(w) / Σ_{j=1..k} n_j(w)
  G(w) = Σ_{j=1..k} p_j(w)²
• The value of G(w) lies in the range (0, 1).
• Words with a higher value of the gini-index are more discriminative for classification purposes (see the sketch below).
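The formula transcribes directly into code; class_counts is a hypothetical per-word vector of the class frequencies n_1(w), ..., n_k(w).

```python
def gini_index(class_counts):
    """G(w) = sum_j p_j(w)^2, where p_j(w) = n_j(w) / sum_i n_i(w)."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return sum((n / total) ** 2 for n in class_counts)

# A word concentrated in one class is highly discriminative...
print(gini_index([9, 1]))  # 0.82
# ...while an evenly spread word is not (the minimum is 1/k for k classes).
print(gini_index([5, 5]))  # 0.5
```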
Inverted lists
• Inverted list of keywords for each node
• Inverted list of nodes for each keyword
Analysis
Why do we care?
• DYCOS essentially uses Monte-Carlo sampling to sample various paths from each unlabeled node.
• Advantage: a fast approach
• Disadvantage: some loss of accuracy
• Can we analyze how accurate DYCOS sampling is?
Probabilistic bound: bi-class classification
• Two classes C1 and C2
• E[Pr[C1]] = f1, E[Pr[C2]] = f2, b = f1 − f2 ≥ 0
• Pr[mis-classification] ≤ exp(−l·b²/2), where l is the number of random walks
Probabilistic bound: multi-class classification
• k classes {C1, C2, …, Ck}
• b-accurate sampling (defined in the backup slides)
• Pr[b-accurate] ≥ 1 − (k−1)·exp(−l·b²/2), worked in the sketch below
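For intuition, a small script that evaluates the bi-class bound and inverts the multi-class bound to estimate how many walks are needed for a target failure probability; the plugged-in numbers (b = 0.1, eps = 0.01) are illustrative, not from the paper.

```python
import math

def misclassification_bound(l, b):
    """Bi-class bound from the slide: Pr[mis-classification] <= exp(-l*b^2/2)."""
    return math.exp(-l * b * b / 2)

def walks_needed(b, eps, k=2):
    """Smallest l with (k-1)*exp(-l*b^2/2) <= eps (inverting the k-class bound)."""
    return math.ceil(2 * math.log((k - 1) / eps) / (b * b))

print(misclassification_bound(l=1000, b=0.1))  # ~0.0067
print(walks_needed(b=0.1, eps=0.01, k=5))      # 1199 walks
```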
Outline
Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text- and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Experimental Results
Data sets
• CORA: a set of research papers and the citation relations among them.
• Each node is a paper and each edge is a citation relation.
• A total of 12,313 English words are extracted from the paper titles.
• We segment the data into 10 synthetic time periods.
• DBLP: a set of authors and their collaborations.
• Each node is an author and each edge is a collaboration.
• A total of 194 English words in the domain of computer science are used.
• We segment the data into 36 annual graphs from year 1975 to year 2010.
Name | Node #  | Edge #    | Class # | Labeled Node #
CORA | 19,396  | 75,021    | 5       | 14,814
DBLP | 806,635 | 4,414,135 | 5       | 18,999
Experimental Results
NetKit-SRL toolkit
• An open-source, publicly available toolkit for statistical relational learning in networked data (Macskassy and Provost, 2007).
• Instantiations of previous relational and collective classification algorithms
• Configuration
  • Local classifier: domain-specific class prior
  • Relational classifier: network-only multinomial Bayes classifier
  • Collective inference: relaxation labeling
Parameters (collected in the sketch below)
• 1) The number of most discriminative words, m; 2) the size constraint of the inverted list for each keyword, a; 3) the number of top content-hop neighbors, q; 4) the number of random walks, l; 5) the length of each random walk, h; 6) the structural parameter, ps.
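As a sketch, the six knobs might be collected as follows; the values for a, ps, l, and h echo the settings shown on the sensitivity slides, while the values for m and q are placeholders.

```python
# Hypothetical DYCOS parameter settings (names from the slide above).
dycos_params = {
    "m": 5,     # number of most discriminative words (placeholder value)
    "a": 30,    # size constraint of each keyword's inverted list
    "q": 10,    # number of top content-hop neighbors (placeholder value)
    "l": 3,     # number of random walks per unlabeled node
    "h": 5,     # length of each random walk
    "ps": 0.7,  # structural parameter (ps = 70%)
}
```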
The results demonstrate that DYCOS improves the classification accuracy over NetKit
by 7.18% to 17.44%, while reducing the runtime to only 14.60% to 18.95% of that of NetKit.
Experimental Results
DYCOS vs. NetKit on CORA
[Figures: classification accuracy comparison and classification time comparison on CORA]
Experimental Results
Parameter Sensitivity of DYCOS
[Figures: sensitivity to m, l, and h (a=30, ps=70%) on CORA data; sensitivity to a, m, and ps (l=3, h=5) on DBLP data]
Experimental Results
Dynamic Updating Time: CORA
Time Period        | 1     | 2     | 3     | 4     | 5
Update Time (Sec.) | 0.019 | 0.013 | 0.015 | 0.013 | 0.023

Time Period        | 6     | 7     | 8     | 9     | 10
Update Time (Sec.) | 0.015 | 0.014 | 0.014 | 0.013 | 0.011
Dynamic Updating Time: DBLP
Time Period        | 1975-1989 | 1990-1999 | 2000-2010
Update Time (Sec.) | 0.03107   | 0.22671   | 1.00154
Outline
Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text- and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Conclusion
We propose an efficient, dynamic, and scalable method for node classification in dynamic networks.
We provide an analysis of how accurate the proposed method can be expected to be in practice.
We present experimental results on real data sets and show that our algorithms are more effective and efficient than competing algorithms.
Challenges
Information networks are very large
• Scalable and efficient
Many such networks are dynamic
• Updatable in real time
• Self-adaptable and robust
Such networks are often noisy
• Intelligent and selective
Heterogeneous correlations in such networks
• The correlations between the label of o and the content of o
• The correlations between the label of o and the contents and labels of
objects in the neighborhood of o
• The correlations between the label of o and the unobserved labels of
objects in the neighborhood of o
Analysis
Lemma. Consider two classes with expected visit probabilities f1 and f2, respectively, such that b = f1 − f2 > 0. Then the probability that the class visited most often during the sampled random-walk process is reversed to class 2 is at most
  exp(−l·b²/2)
Definition. Consider the node classification problem with a total of k classes. We define the sampling process to be b-accurate if none of the classes whose expected visit probability is at least b less than that of the class with the largest expected visit probability turns out to have the largest sampled visit probability.
Theorem. The probability that the sampling process results in a b-accurate reported majority class is at least
  1 − (k−1)·exp(−l·b²/2)
Experimental Results
[Figures: classification accuracy comparison and classification time comparison on DBLP]
Experimental Results
[Figures: parameter sensitivity, shown as sensitivity to m, l, and h; to a, l, and h; to m, a, and ps; and to a, m, and ps]