Transcript Slide 1

I
L
M
I
K
N
I
N
Jerry Scripps
G
N
Overview



What is link mining?
Motivation
Preliminaries




definitions
metrics
network types
Link mining techniques
What is Link Mining?
Graph Theory
Statistics
Link Mining
Data
Machine Learning
Social Network
Analysis
Database
What is Link Mining?
Examples:
 Discovering communities within
collaboration networks
 Finding authoritative web pages on a
given topic
 Selecting the most influential people in a
social network
Link Mining – Motivation
Emerging Data Sets




World wide web
Social networking
Collaboration databases
etc.
Link Mining – Motivation
Direct Applications





What is the
community around
msu.edu?
What are the
authoritative pages?
Who has the most
influence?
Who is the likely
member of terrorist
cell?
Is this a news story
about crime, politics or
business?
Link Mining – Motivation
Indirect Applications


Convert ordinary data
sets into networks
Integrate link mining
techniques into other
techniques
Preliminaries



Definitions
Metrics
Network Types
Definitions
Community
Node (vertex, point,
object)
Link (edge, arc)
Metrics
Node




Degree
Closeness
Betweenness
Clustering
coefficient
Node Pair











Graph distance
Min-cut
Common neighbors
Jaccard’s coef
Adamic/adar
Pref. attachment
Katz
Hitting time
Rooted pageRank
simRank
Bibliographic
metrics
Network



Characteristic
path length
Clustering
coefficient
Min-cut
Network Types
Watts &
Strogatz
Regular
Small
World
Random
Networks – Scale-free

GVSU FaceBook
GVSU FaceBook (log scale)
1000
1000
800
600
400
100
10
200
0
600
500
400
300
Degree
200
100
0
1
1000
100
10
Degree
1
Counts

Barabasi & Bonabeau
Degree follows a power law ~ 1/kn
Can be found in a wide variety of real-world
networks
Counts

Network recap
Network Type Clustering
coefficient
Random
Low
Characteristic Power Law
path length
Low
No
Regular
High
High
No
Small world
High
Low
?
Scale-free
?
?
Yes
Techniques






Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Link-Based Classification
?
Include features from linked objects:

building a single model on all features

Fusion of link and attribute models
Link-Based Classification
Chakrabarti, et al.


Copying data from neighboring web pages
actually reduced accuracy
Using the label from neighboring page
improved accuracy
111011 ?
101011 B
010010 A
011110 A
Link-Based Classification
Lu & Getoor

Define vectors for attributes and links


Attribute data OA(X)
Link data LD(X) constructed using



mode (single feature – class of plurality)
count (feature for each class – count for neighbors)
binary (feature for each class – 0/1 if exists)
111011 ?
101011 B
010010 A
011110 A
OA (attr)
LD (link)
111011
210
1
A
…
…
Model 1Model Model 2
Link-Based Classification
Lu & Getoor

Define probabilities for both
1
P(c | w , OA( X )) 
 Attribute
exp(  w OA( X )c)  1
o


Link
P(c | wl , LD( X )) 
T
o
1
exp(  wlT LD( X )c)  1
Class estimation: Cˆ ( X )  P(c | w0 , OA( X )) P(c | w1 , LD( X ))
Link-Based Classification
Summary



Using class of neighbors improves accuracy
Using separate models for attribute and link data further
improves accuracy
Other considerations:


improvements are possible by using community information
knowledge of network type could also benefit classifier
Techniques






Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Link Prediction
Link Prediction
Liben-Nowell and Kleinberg
Tested node-pair metrics:
 Graph distance
 Common neighbors
 Jaccards coefficient
 Adamic/adar
 Preferential attachment
 Katz
 Hitting time
 Rooted PageRank
 SimRank
Neighborhood
Ensemble of paths
Link Prediction - results
Link Prediction – summary




There is room for growth – best predictor
has accuracy of only around 9%
Predicting collaborations is difficult
Finding communities could help if most
collaborations are intra-community
New problem could be to predict the
direction of the link
Techniques






Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Ranking
Ranking – Markov Chain Based



Random-surfer analogy
Problem with cycles
PageRank uses random vector
Ranking – summary


Other methods such as HITS and SALSA
also based on Markov chain
Ranking has been applied in other areas:


text summarization
anomaly detection
Techniques






Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Influence
Maximizing influence
model-based


Problem – finding the k best nodes to activate to
maximize the number of nodes activated
Models:


independent cascade – when activated a node has a one-time
change to activate neighbors with prob. pij
linear threshold – node becomes activated when the percent of
its neighbors crosses a threshold
Maximizing influence
model-based




Models: independent cascade & linear threshold
A function f:S→S*, can be created using either
model
Functions use monte-carlo, hill-climbing solution
Submodular functions, f (S  {v})  f (S )  f (T  {v})  f (T )
where ST are proven in another work to be
NP-C but by using a hill-climbing solution can
get to within 1-1/e of optimum.
Maximizing influence – cost/benefit

Assumptions:



If customer purchases profit is:



product x sells for $100
a discount of 10% can be offered to various
prospective customers
90 if discount is offered
100 if discount is not offered
Expected lift in profit (ELP) from offering
discount is:

90*P(buy|discount) - 100*P(buy|no discount)
Maximizing influence – cost/benefit
n
ELP( X , Y , M )   r1 P( X i  1 | X k , Y , f i1 ( M ))  r0 P( X i  1 | X k , Y , f i 0 ( M ))  c
k
i 1


Goal is to find M that
maximizes global ELP
Three approximations
used:



single pass
greedy
hill-climbing






Xi is the decision of customer i
to buy
Y is vector of product
attributes
M is vector of marketing
decision
f is a function to set the ith
element of M
r0 and r1 are revenue gained
c is the cost of marketing
Comparison of approaches
Size of starting
set
uses attributes
probabilities


Cost/benefit
variable - based
on max. lift
yes
extracted from
data set
Model-based
fixed
no
assigned to
links
An extension would be to spread influence to
the most number of communities
Improvements can be made in speed
Techniques






Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Communities
Gibson, Kleinberg and Raghavan
Query
Search Engine
Root Set
Base Set: add forward
and back links
Use HITS to find top 10
hubs and authorities
Reddy and Kitsuregawa



Bipartite graph
Given an initial set of nodes T
build I from the nodes pointed to
from T
Repeat:


use relax_cocite to expand T and I
prune T and I using dense bipartite
graph function DBPG(T,I,α,β)
T
u
v
w
I
Flake, Lawrence and Giles







Uses Min-cut
Start with seed set
Add linked nodes
Find nodes from
outgoing links
Create virtual source node
Add virtual sink linking it to all nodes
Find the min-cut of the virtual source and
sink
Neville, Adler and Jensen
A




0110
Distance based on links and attributes
If link exists score is number of
B
common attributes zero otherwise
1100
score(A,B)=2, score(A,C)=1,
score(B,C)=0
Used with 3 partitioning algorithms:
 Karger’s Min-Cut
 MajorClust
 Spectral partitioning by Shi & Malik
C
1101
Communities - summary


There are many options for building
communities around a small group of
nodes
Possible future directions


finding communities in networks having
different link types
impact of network type on community finding
techniques
Techniques






Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Link Completion
Goldenberg, Kubica and Komarek
Problem: given a network and n-1 members
of a community find the nth








random
counting
popular
NB
NN
cGraph
BayesNet
EBS and LR
Conclusions


Link mining is a young, dynamic field of study
with problem areas that continue to emerge and
morph as techniques continue to evolve
Opportunities for improvements exist in


using community knowledge
using network knowledge
We are the living links in a life force that moves and plays around and
through us, binding the deepest soils with the farthest stars.
Alan Chadwick
Ranking



Based on Markov Chain
Rank is sum of node
weights from incoming
links
Breaks down when cycles
exist
3
2
4
5
9
6
14
15
9
Ranking - continued

General approach




PageRank 
HITS approach




ap = authority score for p
Bp = backlinks of p
a p    aq / N p  (1   ) E ( p)
qB p
ap = authority score for p
hp = hub score for p
Bp = backlinks of p
Normalize between iterations
ap 
hp 
h
qB p
a
qB p
q
q