Prediction - UBC Computer Science

Download Report

Transcript Prediction - UBC Computer Science

Link Prediction in Social
Networks
Laks V.S. Lakshmanan
(based on Kleinberg and Liben-Nowell. The Link
Prediction Problem for Social Networks. CIKM
2003.)
The Problem
Examples: Co-authorship graphs, scientists
serving on common PC/editorial board,
business leaders working on common
board, etc.
 Can current state of network be used to
predict future links?
 Note: factors extraneous to n/w (need to
be) ignored (e.g., they met on a beach).

Motivation & Significance
Model for Network evolution.
 Predict likely interactions, not explicitly
observed, based on observed links –
useful for terrorist network monitoring.
 Link prediction is one instance of Social
Network Analysis.
 Application for “Friend” suggestion in
online SN. Notice, this takes on a flavor of
link recommendation.

Main Contributions
Features intrinsic to n/w can offer a good measure of how
likely a future link is.
 Can significantly outperform chance.
 Need more sophisticated measures than direct ones (e.g.,
geodesic distances).
 Note: Extra-network features can significantly improve
prediction accuracy: e.g., keywords describing interests of
each scientist, or keywords extracted from their
papers/abstracts/titles. Not considered here.

◦ See Al Hasan et al. SDM workshop 2006.
http://www.cs.rpi.edu/~zaki/PaperDir/LINK06.pdf
◦ Ben Taskar et al. Link prediction in relational data. In Proceedings
of Neural Information Processing Systems, 2004.
◦ A. Popescul and L. Ungar. Statistical relational learning for link
prediction. In Workshop on Learning Statistical Models from
Relational Data at IJCAI, 2003.
Link Prediction Problem Formalized

G[t0,t0’] = the multi-graph of social interactions
between actors (users) during the training
interval [t0,t0’].
◦ Separate edge for each interaction (paper, phone call,
email, ...) with timestamp.
Predict edges in test interval [t1,t1’], where
t1>t0’.
 E_old = edges present in G[t0,t0’].
 E_new = edges in G[t1,t1’] but not in
G[t0,t0’].
 Refinement: Restrict to those nodes adjacent to
>=κ edges in G[t0,t0’] and to >=κ’ edges in
G[t1,t1’]. Call them the core nodes.

Example
John
Mary
Hui
Asif
Ram
John
Nita
Mary
Asif
Ram
Hui
•Say κ = κ’ = 2.
•Only evaluate how well we
do on {John, Ram, Asif,
Mary}.
•E*_new = {(J,M), (R,M)},
when restricted to the
“core” nodes above.
•Evaluation: among the topmost likely edges predicted,
how well we do on precision
and recall.
•Precision = analog of
soundness.
•Recall = analog of
completeness.
Some simple notions



Triadic closure [Easley & Kleinberg 2010]: if
x and y have a common friend, then the
likelihood of x, y being friends increases.
Clustering coefficient of node x = fraction of
its neighbors that are connected by a link.
Bearman and Moody’s study: teenage girls
who have a low clustering coefficient in their
network of friends are significantly more
likely to contemplate suicide than those
whose clustering coefficient is high! (Ouch!)
Various predictors
Notation: Γ(u) = neighbors of u in
G[t0,t0’].
 score(u,v) = score that indicates
likelihood of (u,v) being in E*_new.
 Neighborhood-based:

◦ Shortest distance: (negated) length of
shortest path (geodesic).
◦ Common neighbors: Γ(u) п Γ(v).
◦ Jaccard coefficient:
|Γ(u)пΓ(v)|/|Γ(u)υΓ(v)|.
More predictors
◦ Adamic/Adar: ∑zЄΓ(u)пΓ(v) 1/log|Γ(z)|.
◦ Preferential attachment: |Γ(u)|x|Γ(v)|.

Based on ensemble of paths:
◦ Katz, Hitting/Commute Time, Rooted
PageRank, SimRank.
The Katz Score







Katz: ∑i βi|pathsi(u,v)|, where 0<β<1.
Dealing with multi-graphs:
◦ Unweighted  boolean.
◦ Weighted  count #interactions b/w
neighbors.
Shorter paths  more affinity.
More paths  more affinity.
K = βA + β2A2 + ... = I – (I – βA)-1, where
A = adjacency matrix of graph.
Notice computational challenges.
Fact: For convergence, need β < [1/(largest
eigenvalue of A)].
Hitting & Commute Time

Hitting & Commute Time: Perform a random
walk, starting at u, follow neighbor of current
node w.p. 1/deg(curr).
◦ Adapted in the obvious way if edges are weighted.

Hu,v=expected time for walk to reach v.
◦ Not symmetric.



Cu,v=Hu,v+Hv,u.
πx=stationary distribution weight of x.
Stationary normed versions:
◦ Hu,vπv.
◦ Hu,vπv+Hv,uπu.

Score, in all cases, negative of measure. (Why?)
Rooted PageRank

Rooted PR:πv in the following RW: At any
step,
◦ Jump to u with probability p.
◦ Move to one of the neighbors of current
node at random with probability 1-p.
Avoids some pitfalls of HT/CT by
minimizing dependence on nodes far away
from both u and v.
 Computational challenge (of RPR, HT, CT)
– similar to PageRank.

SimRank
SimRank:
SR(u,u) = 1.
When u != v,
SR(u,v) =
γ.∑wЄΓ(u)∑xЄΓ(v)SR(w,x)/|Γ(u)||Γ(v)|,
where 0<γ<1.
SR(u,v) = 0 when Γ(u) u Γ(v) = .
RW interpretation: E[γi], where i=earliest time at
which RWs started at u and v meet for the
first time.
Qn: Can the RW interpretation be leveraged to
exploit PageRank-like techniques to compute
SimRank?

Predictors Summary
Neighborhood-based (easy to compute) and
Path-based (more accurate).
 Output sorted in descending score order of
pages.
 Consider top-n pages among core nodes for
evaluation.
 A performance trick, which pays off with
quality as well: use low rank approximation
Ak of adjacency matrix A.
 Works very well since usually A is extremely
sparse.

Meta-measures
score*(u,v) = |Γ(v) п Top-K(u)|, for
some K.
//how many of your neighbors are my
nearest score-neighbors?//
 score*(u,v) = ∑zЄΓ(v)пTop-K(u)score(u,z).
//what is the total strength of my
relationship with my nearest scoreneighbors who are your
neighbors?//  unseen bigrams.

Conclusions


While many predictors significantly
outperform random, their precision can use
considerable improvement (e.g., < 20%).
Need for better prediction algorithms:
◦ Perhaps more recent training data is more
important?
◦ Combining measures (e.g., using Bayesian
classifier).
◦ Leveraging node properties.

Need fast algorithms for approximating
measures.
Recommended Papers
◦ Link Prediction: & SN evolution
◦ Al Hasan et al. Link prediction using supervised learning. SDM
2006 Workshop.
◦ Ben Taskar et al. Link prediction in relational data. In Proceedings
of Neural Information Processing Systems, 2004.
◦ A. Popescul and L. Ungar. Statistical relational learning for link
prediction. In Workshop on Learning Statistical Models from
Relational Data at IJCAI, 2003.
◦ Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins.
Microscopic evolution of social networks. In Proceedings of the
14th ACM International Conference on Knowledge Discovery
and Data Mining (KDD'08).
◦ Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. Graphs
over time: densification laws, shrinking diameters and possible
explanations. In Proceedings of the 11th ACM International
Conference on Knowledge Discovery and Data Mining
(KDD'05).
Recommended Papers










Computational Aspects:
Kurt C. Foster, Stephen Q. Muth, John J. Potterat, and Richard B. Rothenberg. A faster katz
status score algorithm. Computational & Mathematical Organization Theory, 7(4):275285,
2001.
Purnamrita Sarkar and Andrew W. Moore. A tractable approach to finding closest
truncated-commute-time neighbors in large graphs. In Proceedings of the 23rd Conference
on Uncertainty in Artificial Intelligence (UAI'07).
Purnamrita Sarkar, Andrew W. Moore, and Amit Prakash. Fast incremental proximity search
in large graphs. In Proceedings of the 25th International Conference on Machine Learning
(ICML'08).
Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In
Proceedings of the Eighth ACM International Conference on Knowledge Discovery and
Data Mining (KDD'02).
Dmitry Lizorkin et al. Accuracy estimate and optimization techniques for SimRank
computation.VLDB Journal, 2008.
Daniel Fogaras, et al. Towards scaling fully personalized pagerank: Algorithms, lower
bounds, and experiments. Internet Mathematics, 2(3), 2005.
Peixiang Zhao, Jiawei Han, and Yizhou Sun. P-Rank: a comprehensive structural similarity
measure over information networks. CIKM 2009.
P. Zhao, J. Han, and Y. Sun. P-rank: a comprehensive structural similarity measure over
information networks. InCIKM, pages 553{562, 2009.
Pei Lee, Laks V.S. Lakshmanan and Jeffrey Xu Yu. On Top-k Structural Similarity Search.
ICDE 2012.
Recommended Papers





Surveys & Some Theory:
Lise Getoor and Christopher P. Diehl. Link Mining: A Survey.
SIGKDD Explorations 2005.
Linyuan Lü and Tao Zhou. Link prediction in complex
networks: A survey. Physica A: Statistical Mechanics and its
Applications. Volume 390, Issue 6, 15 March 2011, Pages
1150–1170.
Mohammad Al Hasan and Mohammed J. Zaki. A Survey of
Link Prediction in Social Networks. Social Network Data
Analytics 2011, pp. 243-275.
Purnamrita Sarkar, Deepayan Chakrabarti, and Andrew W.
Moore. 2011. Theoretical justification of popular link
prediction heuristics. In Proceedings of the Twenty-Second
international joint conference on Artificial Intelligence - Volume
Volume Three (IJCAI'11), Toby Walsh (Ed.), Vol. Volume Three.
AAAI Press 2722-2727.