networkMining
Download
Report
Transcript networkMining
E
N
M
W
T
I
R
O
K
G
N
I
Jerry Scripps
N
Overview
What is network mining?
Motivation
Preliminaries
definitions
metrics
network types
Network mining techniques
What is Network Mining?
Statistics
Mathematics
Computer Science
Data Mining
Pattern Recognition
Machine Learning
Social Network
Analysis
Graph Theory
Network Mining
What is Network Mining?
Border Disciplines
Statistics
Computer Science
Physics
Math
Psychology
Law Enforcement
Sociology
Military
Biology
Medicine
Chemistry
Business
What is Network Mining?
Examples:
Discovering communities within
collaboration networks
Finding authoritative web pages on a
given topic
Selecting the most influential people in a
social network
Network Mining – Motivation
Emerging Data Sets
World wide web
Social networking
Collaboration databases
Customer or Employee sets
Genomic data
Terrorist sets
Supply Chains
Many more…
Network Mining – Motivation
Direct Applications
What is the
community around
msu.edu?
What are the
authoritative pages?
Who has the most
influence?
Who is the likely
member of terrorist
cell?
Is this a news story
about crime, politics or
business?
Network Mining – Motivation
Indirect Applications
Convert ordinary data
sets into networks
Integrate network
mining techniques
into other techniques
Preliminaries
Definitions
Metrics
Network Types
Definitions
Community
Node (vertex, point,
object)
Link (edge, arc)
Metrics
Node
Degree
Closeness
Betweenness
Clustering
coefficient
Node Pair
Graph distance
Min-cut
Common neighbors
Jaccard’s coef
Adamic/adar
Pref. attachment
Katz
Hitting time
Rooted pageRank
simRank
Bibliographic
metrics
Network
Characteristic
path length
Clustering
coefficient
Min-cut
diameter
Network Types – Random
Network Types – Small World
Watts &
Strogatz
Regular
Small
World
Random
Networks – Scale-free
GVSU FaceBook
GVSU FaceBook (log scale)
1000
1000
800
600
400
100
10
200
0
600
500
400
300
Degree
200
100
0
1
1000
100
10
Degree
1
Counts
Barabasi & Bonabeau
Degree follows a power law ~ 1/kn
Can be found in a wide variety of real-world
networks
Counts
Network recap
Network Type Clustering
coefficient
Random
Low
Characteristic Power Law
path length
Low
No
Regular
High
High
No
Small world
High
Low
?
Scale-free
?
?
Yes
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link-Based Classification
?
Include features from linked objects:
building a single model on all features
Fusion of link and attribute models
Link-Based Classification
Chakrabarti, et al.
Copying data from neighboring web pages
actually reduced accuracy
Using the label from neighboring page
improved accuracy
111011 ?
101011 B
010010 A
011110 A
Link-Based Classification
Lu & Getoor
Define vectors for attributes and links
Attribute data OA(X)
Link data LD(X) constructed using
mode (single feature – class of plurality)
count (feature for each class – count for neighbors)
binary (feature for each class – 0/1 if exists)
111011 ?
101011 B
010010 A
011110 A
OA (attr)
LD (link)
111011
210
1
A
…
…
Model 1Model Model 2
Link-Based Classification
Lu & Getoor
Define probabilities for both
1
P(c | w , OA( X ))
Attribute
exp( w OA( X )c) 1
o
Link
P(c | wl , LD( X ))
T
o
1
exp( wlT LD( X )c) 1
Class estimation: Cˆ ( X ) P(c | w0 , OA( X )) P(c | w1 , LD( X ))
Collective Classification
Uses both attributes and links
Iteratively update the unlabeled instances
message passing, loopy belief nets, etc.
Link-Based Classification
Summary
Using class of neighbors improves accuracy
Using separate models for attribute and link data further
improves accuracy
Other considerations:
improvements are possible by using community information
knowledge of network type could also benefit classifier
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Prediction
Link Prediction
Liben-Nowell and Kleinberg
Tested node-pair metrics:
Graph distance
Common neighbors
Jaccards coefficient
Adamic/adar
Preferential attachment
Katz
Hitting time
Rooted PageRank
SimRank
Neighborhood
Ensemble of paths
Link Prediction - results
Link Prediction – newer methods
maximum likelihood
stochastic block model
probabilistic
Link Prediction – summary
There is room for growth – best predictor
has accuracy of only around 9%
Predicting collaborations is difficult
New problem could be to predict the
direction of the link
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Ranking
Ranking – Markov Chain Based
Random-surfer analogy
Problem with cycles
PageRank uses random vector
Ranking – summary
Other methods such as HITS and SALSA
also based on Markov chain
Ranking has been applied in other areas:
text summarization
anomaly detection
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Influence
Influence Maximization
Problem: find the best nodes to activate
Approaches:
degree – fast but not effective
greedy – effective but slow
improvements to greedy: degree heuristics
and Shapely value
use communities
cost-benefit – probabilistic approach
Maximizing influence
model-based
Problem – finding the k best nodes to activate to
maximize the number of nodes activated
Models:
independent cascade – when activated a node has a one-time
change to activate neighbors with prob. pij
linear threshold – node becomes activated when the percent of
its neighbors crosses a threshold
Maximizing influence
model-based
Models: independent cascade & linear threshold
A function f:S→S*, can be created using either
model
Functions use monte-carlo, hill-climbing solution
Submodular functions, f (S {v}) f (S ) f (T {v}) f (T )
where ST are proven in another work to be
NP-C but by using a hill-climbing solution can
get to within 1-1/e of optimum.
Maximizing influence – cost/benefit
Assumptions:
If customer purchases profit is:
product x sells for $100
a discount of 10% can be offered to various
prospective customers
90 if discount is offered
100 if discount is not offered
Expected lift in profit (ELP) from offering
discount is:
90*P(buy|discount) - 100*P(buy|no discount)
Maximizing influence – cost/benefit
n
ELP( X , Y , M ) r1 P( X i 1 | X k , Y , f i1 ( M )) r0 P( X i 1 | X k , Y , f i 0 ( M )) c
k
i 1
Goal is to find M that
maximizes global ELP
Three approximations
used:
single pass
greedy
hill-climbing
Xi is the decision of customer i
to buy
Y is vector of product
attributes
M is vector of marketing
decision
f is a function to set the ith
element of M
r0 and r1 are revenue gained
c is the cost of marketing
Comparison of approaches
Size of starting
set
uses attributes
probabilities
Cost/benefit
variable - based
on max. lift
yes
extracted from
data set
Model-based
fixed
no
assigned to
links
An extension would be to spread influence to
the most number of communities
Improvements can be made in speed
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Communities
Gibson, Kleinberg and Raghavan
Query
Search Engine
Root Set
Base Set: add forward
and back links
Use HITS to find top 10
hubs and authorities
Flake, Lawrence and Giles
Uses Min-cut
Start with seed set
Add linked nodes
Find nodes from
outgoing links
Create virtual source node
Add virtual sink linking it to all nodes
Find the min-cut of the virtual source and
sink
Community Finding
Girvan and Newman – minimize betweenness
Clauset, et al. – agglomerative, uses modularity
Shi & Malik – spectral clustering
Communities - summary
There are many options for building
communities around a small group of
nodes
Possible future directions
finding communities in networks having
different link types
impact of network type on community finding
techniques