networkMining

Download Report

Transcript networkMining

E
N
M
W
T
I
R
O
K
G
N
I
Jerry Scripps
N
Overview



What is network mining?
Motivation
Preliminaries




definitions
metrics
network types
Network mining techniques
What is Network Mining?
Statistics
Mathematics
Computer Science
Data Mining
Pattern Recognition
Machine Learning
Social Network
Analysis
Graph Theory
Network Mining
What is Network Mining?
Border Disciplines






Statistics
Computer Science
Physics
Math
Psychology
Law Enforcement






Sociology
Military
Biology
Medicine
Chemistry
Business
What is Network Mining?
Examples:
 Discovering communities within
collaboration networks
 Finding authoritative web pages on a
given topic
 Selecting the most influential people in a
social network
Network Mining – Motivation
Emerging Data Sets








World wide web
Social networking
Collaboration databases
Customer or Employee sets
Genomic data
Terrorist sets
Supply Chains
Many more…
Network Mining – Motivation
Direct Applications





What is the
community around
msu.edu?
What are the
authoritative pages?
Who has the most
influence?
Who is the likely
member of terrorist
cell?
Is this a news story
about crime, politics or
business?
Network Mining – Motivation
Indirect Applications


Convert ordinary data
sets into networks
Integrate network
mining techniques
into other techniques
Preliminaries



Definitions
Metrics
Network Types
Definitions
Community
Node (vertex, point,
object)
Link (edge, arc)
Metrics
Node




Degree
Closeness
Betweenness
Clustering
coefficient
Node Pair











Graph distance
Min-cut
Common neighbors
Jaccard’s coef
Adamic/adar
Pref. attachment
Katz
Hitting time
Rooted pageRank
simRank
Bibliographic
metrics
Network




Characteristic
path length
Clustering
coefficient
Min-cut
diameter
Network Types – Random
Network Types – Small World
Watts &
Strogatz
Regular
Small
World
Random
Networks – Scale-free

GVSU FaceBook
GVSU FaceBook (log scale)
1000
1000
800
600
400
100
10
200
0
600
500
400
300
Degree
200
100
0
1
1000
100
10
Degree
1
Counts

Barabasi & Bonabeau
Degree follows a power law ~ 1/kn
Can be found in a wide variety of real-world
networks
Counts

Network recap
Network Type Clustering
coefficient
Random
Low
Characteristic Power Law
path length
Low
No
Regular
High
High
No
Small world
High
Low
?
Scale-free
?
?
Yes
Techniques





Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link-Based Classification
?
Include features from linked objects:

building a single model on all features

Fusion of link and attribute models
Link-Based Classification
Chakrabarti, et al.


Copying data from neighboring web pages
actually reduced accuracy
Using the label from neighboring page
improved accuracy
111011 ?
101011 B
010010 A
011110 A
Link-Based Classification
Lu & Getoor

Define vectors for attributes and links


Attribute data OA(X)
Link data LD(X) constructed using



mode (single feature – class of plurality)
count (feature for each class – count for neighbors)
binary (feature for each class – 0/1 if exists)
111011 ?
101011 B
010010 A
011110 A
OA (attr)
LD (link)
111011
210
1
A
…
…
Model 1Model Model 2
Link-Based Classification
Lu & Getoor

Define probabilities for both
1
P(c | w , OA( X )) 
 Attribute
exp(  w OA( X )c)  1
o


Link
P(c | wl , LD( X )) 
T
o
1
exp(  wlT LD( X )c)  1
Class estimation: Cˆ ( X )  P(c | w0 , OA( X )) P(c | w1 , LD( X ))
Collective Classification



Uses both attributes and links
Iteratively update the unlabeled instances
message passing, loopy belief nets, etc.
Link-Based Classification
Summary



Using class of neighbors improves accuracy
Using separate models for attribute and link data further
improves accuracy
Other considerations:


improvements are possible by using community information
knowledge of network type could also benefit classifier
Techniques





Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Prediction
Link Prediction
Liben-Nowell and Kleinberg
Tested node-pair metrics:
 Graph distance
 Common neighbors
 Jaccards coefficient
 Adamic/adar
 Preferential attachment
 Katz
 Hitting time
 Rooted PageRank
 SimRank
Neighborhood
Ensemble of paths
Link Prediction - results
Link Prediction – newer methods



maximum likelihood
stochastic block model
probabilistic
Link Prediction – summary



There is room for growth – best predictor
has accuracy of only around 9%
Predicting collaborations is difficult
New problem could be to predict the
direction of the link
Techniques






Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Link Completion
Ranking
Ranking – Markov Chain Based



Random-surfer analogy
Problem with cycles
PageRank uses random vector
Ranking – summary


Other methods such as HITS and SALSA
also based on Markov chain
Ranking has been applied in other areas:


text summarization
anomaly detection
Techniques





Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Influence
Influence Maximization


Problem: find the best nodes to activate
Approaches:





degree – fast but not effective
greedy – effective but slow
improvements to greedy: degree heuristics
and Shapely value
use communities
cost-benefit – probabilistic approach
Maximizing influence
model-based


Problem – finding the k best nodes to activate to
maximize the number of nodes activated
Models:


independent cascade – when activated a node has a one-time
change to activate neighbors with prob. pij
linear threshold – node becomes activated when the percent of
its neighbors crosses a threshold
Maximizing influence
model-based




Models: independent cascade & linear threshold
A function f:S→S*, can be created using either
model
Functions use monte-carlo, hill-climbing solution
Submodular functions, f (S  {v})  f (S )  f (T  {v})  f (T )
where ST are proven in another work to be
NP-C but by using a hill-climbing solution can
get to within 1-1/e of optimum.
Maximizing influence – cost/benefit

Assumptions:



If customer purchases profit is:



product x sells for $100
a discount of 10% can be offered to various
prospective customers
90 if discount is offered
100 if discount is not offered
Expected lift in profit (ELP) from offering
discount is:

90*P(buy|discount) - 100*P(buy|no discount)
Maximizing influence – cost/benefit
n
ELP( X , Y , M )   r1 P( X i  1 | X k , Y , f i1 ( M ))  r0 P( X i  1 | X k , Y , f i 0 ( M ))  c
k
i 1


Goal is to find M that
maximizes global ELP
Three approximations
used:



single pass
greedy
hill-climbing






Xi is the decision of customer i
to buy
Y is vector of product
attributes
M is vector of marketing
decision
f is a function to set the ith
element of M
r0 and r1 are revenue gained
c is the cost of marketing
Comparison of approaches
Size of starting
set
uses attributes
probabilities


Cost/benefit
variable - based
on max. lift
yes
extracted from
data set
Model-based
fixed
no
assigned to
links
An extension would be to spread influence to
the most number of communities
Improvements can be made in speed
Techniques





Link-Based Classification
Link Prediction
Ranking
Influential Nodes
Community Finding
Communities
Gibson, Kleinberg and Raghavan
Query
Search Engine
Root Set
Base Set: add forward
and back links
Use HITS to find top 10
hubs and authorities
Flake, Lawrence and Giles







Uses Min-cut
Start with seed set
Add linked nodes
Find nodes from
outgoing links
Create virtual source node
Add virtual sink linking it to all nodes
Find the min-cut of the virtual source and
sink
Community Finding



Girvan and Newman – minimize betweenness
Clauset, et al. – agglomerative, uses modularity
Shi & Malik – spectral clustering
Communities - summary


There are many options for building
communities around a small group of
nodes
Possible future directions


finding communities in networks having
different link types
impact of network type on community finding
techniques