Transcript Slide 1
I
L
M
I
K
N
I
N
Jerry Scripps
G
N
Overview
What is link mining?
Motivation
Preliminaries
definitions
metrics
network types
Link mining techniques
What is Link Mining?
Graph Theory
Statistics
Link Mining
Data
Machine Learning
Social Network
Analysis
Database
What is Link Mining?
Examples:
Discovering communities within
collaboration networks
Finding authoritative web pages on a
given topic
Selecting the most influential people in a
social network
Link Mining – Motivation
Emerging Data Sets
World wide web
Social networking
Collaboration databases
etc.
Link Mining – Motivation
Direct Applications
What is the
community around
msu.edu?
What are the
authoritative pages?
Who has the most
influence?
Who is the likely
member of terrorist
cell?
Is this a news story
about crime, politics or
business?
Link Mining – Motivation
Indirect Applications
Convert ordinary data
sets into networks
Integrate link mining
techniques into other
techniques
Preliminaries
Definitions
Metrics
Network Types
Definitions
Community
Node (vertex, point,
object)
Link (edge, arc)
Metrics
Node
Degree
Closeness
Betweenness
Clustering
coefficient
Node Pair
Graph distance
Min-cut
Common neighbors
Jaccard’s coef
Adamic/adar
Pref. attachment
Katz
Hitting time
Rooted pageRank
simRank
Bibliographic
metrics
Network
Characteristic
path length
Clustering
coefficient
Min-cut
Network Types
Watts &
Strogatz
Regular
Small
World
Random
Networks – Scale-free
GVSU FaceBook
GVSU FaceBook (log scale)
1000
1000
800
600
400
100
10
200
0
600
500
400
300
Degree
200
100
0
1
1000
100
10
Degree
1
Counts
Barabasi & Bonabeau
Degree follows a power law ~ 1/kn
Can be found in a wide variety of real-world
networks
Counts
Network recap
Network Type Clustering
coefficient
Random
Low
Characteristic Power Law
path length
Low
No
Regular
High
High
No
Small world
High
Low
?
Scale-free
?
?
Yes
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Link-Based Classification
?
Include features from linked objects:
building a single model on all features
Fusion of link and attribute models
Link-Based Classification
Chakrabarti, et al.
Copying data from neighboring web pages
actually reduced accuracy
Using the label from neighboring page
improved accuracy
111011 ?
101011 B
010010 A
011110 A
Link-Based Classification
Lu & Getoor
Define vectors for attributes and links
Attribute data OA(X)
Link data LD(X) constructed using
mode (single feature – class of plurality)
count (feature for each class – count for neighbors)
binary (feature for each class – 0/1 if exists)
111011 ?
101011 B
010010 A
011110 A
OA (attr)
LD (link)
111011
210
1
A
…
…
Model 1Model Model 2
Link-Based Classification
Lu & Getoor
Define probabilities for both
1
P(c | w , OA( X ))
Attribute
exp( w OA( X )c) 1
o
Link
P(c | wl , LD( X ))
T
o
1
exp( wlT LD( X )c) 1
Class estimation: Cˆ ( X ) P(c | w0 , OA( X )) P(c | w1 , LD( X ))
Link-Based Classification
Summary
Using class of neighbors improves accuracy
Using separate models for attribute and link data further
improves accuracy
Other considerations:
improvements are possible by using community information
knowledge of network type could also benefit classifier
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Link Prediction
Link Prediction
Liben-Nowell and Kleinberg
Tested node-pair metrics:
Graph distance
Common neighbors
Jaccards coefficient
Adamic/adar
Preferential attachment
Katz
Hitting time
Rooted PageRank
SimRank
Neighborhood
Ensemble of paths
Link Prediction - results
Link Prediction – summary
There is room for growth – best predictor
has accuracy of only around 9%
Predicting collaborations is difficult
Finding communities could help if most
collaborations are intra-community
New problem could be to predict the
direction of the link
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Ranking
Ranking – Markov Chain Based
Random-surfer analogy
Problem with cycles
PageRank uses random vector
Ranking – summary
Other methods such as HITS and SALSA
also based on Markov chain
Ranking has been applied in other areas:
text summarization
anomaly detection
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Influence
Maximizing influence
model-based
Problem – finding the k best nodes to activate to
maximize the number of nodes activated
Models:
independent cascade – when activated a node has a one-time
change to activate neighbors with prob. pij
linear threshold – node becomes activated when the percent of
its neighbors crosses a threshold
Maximizing influence
model-based
Models: independent cascade & linear threshold
A function f:S→S*, can be created using either
model
Functions use monte-carlo, hill-climbing solution
Submodular functions, f (S {v}) f (S ) f (T {v}) f (T )
where ST are proven in another work to be
NP-C but by using a hill-climbing solution can
get to within 1-1/e of optimum.
Maximizing influence – cost/benefit
Assumptions:
If customer purchases profit is:
product x sells for $100
a discount of 10% can be offered to various
prospective customers
90 if discount is offered
100 if discount is not offered
Expected lift in profit (ELP) from offering
discount is:
90*P(buy|discount) - 100*P(buy|no discount)
Maximizing influence – cost/benefit
n
ELP( X , Y , M ) r1 P( X i 1 | X k , Y , f i1 ( M )) r0 P( X i 1 | X k , Y , f i 0 ( M )) c
k
i 1
Goal is to find M that
maximizes global ELP
Three approximations
used:
single pass
greedy
hill-climbing
Xi is the decision of customer i
to buy
Y is vector of product
attributes
M is vector of marketing
decision
f is a function to set the ith
element of M
r0 and r1 are revenue gained
c is the cost of marketing
Comparison of approaches
Size of starting
set
uses attributes
probabilities
Cost/benefit
variable - based
on max. lift
yes
extracted from
data set
Model-based
fixed
no
assigned to
links
An extension would be to spread influence to
the most number of communities
Improvements can be made in speed
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Communities
Gibson, Kleinberg and Raghavan
Query
Search Engine
Root Set
Base Set: add forward
and back links
Use HITS to find top 10
hubs and authorities
Reddy and Kitsuregawa
Bipartite graph
Given an initial set of nodes T
build I from the nodes pointed to
from T
Repeat:
use relax_cocite to expand T and I
prune T and I using dense bipartite
graph function DBPG(T,I,α,β)
T
u
v
w
I
Flake, Lawrence and Giles
Uses Min-cut
Start with seed set
Add linked nodes
Find nodes from
outgoing links
Create virtual source node
Add virtual sink linking it to all nodes
Find the min-cut of the virtual source and
sink
Neville, Adler and Jensen
A
0110
Distance based on links and attributes
If link exists score is number of
B
common attributes zero otherwise
1100
score(A,B)=2, score(A,C)=1,
score(B,C)=0
Used with 3 partitioning algorithms:
Karger’s Min-Cut
MajorClust
Spectral partitioning by Shi & Malik
C
1101
Communities - summary
There are many options for building
communities around a small group of
nodes
Possible future directions
finding communities in networks having
different link types
impact of network type on community finding
techniques
Techniques
Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Link Completion
Goldenberg, Kubica and Komarek
Problem: given a network and n-1 members
of a community find the nth
random
counting
popular
NB
NN
cGraph
BayesNet
EBS and LR
Conclusions
Link mining is a young, dynamic field of study
with problem areas that continue to emerge and
morph as techniques continue to evolve
Opportunities for improvements exist in
using community knowledge
using network knowledge
We are the living links in a life force that moves and plays around and
through us, binding the deepest soils with the farthest stars.
Alan Chadwick
Ranking
Based on Markov Chain
Rank is sum of node
weights from incoming
links
Breaks down when cycles
exist
3
2
4
5
9
6
14
15
9
Ranking - continued
General approach
PageRank
HITS approach
ap = authority score for p
Bp = backlinks of p
a p aq / N p (1 ) E ( p)
qB p
ap = authority score for p
hp = hub score for p
Bp = backlinks of p
Normalize between iterations
ap
hp
h
qB p
a
qB p
q
q