16-02-05-aditya-ita-talkx - People at VT Computer Science
Download
Report
Transcript 16-02-05-aditya-ita-talkx - People at VT Computer Science
Leveraging Information Theory for
Mining Graphs and Sequences:
From Propagation to Segmentation
B. Aditya Prakash
Computer Science
Virginia Tech.
ITA Workshop, San Diego, Feb 5, 2016
Networks are everywhere!
Facebook Network [2010]
Gene Regulatory Network
[Decourty 2008]
Human Disease Network
[Barabasi 2007]
The Internet [2005]
Prakash 2016
2
Dynamical Processes over networks
are also everywhere!
Prakash 2016
3
Why do we care?
• Social collaboration
• Information Diffusion
• Viral Marketing
• Epidemiology and Public Health
• Cyber Security
• Human mobility
• Games and Virtual Worlds
• Ecology
........
Prakash 2016
4
Why do we care? (1: Epidemiology)
• Dynamical Processes over networks
[AJPH 2007]
SI Model
Diseases over contact networks
Prakash 2016
CDC data: Visualization of
the first 35 tuberculosis
(TB) patients and their
1039 contacts
5
Why do we care? (1: Epidemiology)
• Dynamical Processes over networks
• Each circle is a hospital
• ~3000 hospitals
• More than 30,000 patients
transferred
[US-MEDICARE
NETWORK 2005]
Problem: Given k units of
disinfectant, whom to immunize?
Prakash 2016
6
Why do we care? (1: Epidemiology)
~6x
fewer!
CURRENT PRACTICE
[US-MEDICARE
NETWORK 2005]
OUR METHOD
Prakash 2016
7
Hospital-acquired inf. took 99K+
lives, cost $5B+ (all per year)
Why do we care? (2: Online
Diffusion)
> 800m users, ~$1B
revenue [WSJ 2010]
~100m active users
> 50m users
Prakash 2016
8
Why do we care? (2: Online
Diffusion)
• Dynamical Processes over networks
Buy Versace™!
Followers
Celebrity
Social Media Marketing
Prakash 2016
9
Why do we care?
(3: To change the world?)
• Dynamical Processes over networks
Social networks and Collaborative Action
Prakash 2016
10
High Impact – Multiple Settings
epidemic out-breaks
Q. How to squash rumors faster?
products/viruses
Q. How do opinions spread?
transmit s/w patches
Q. How to market better?
Prakash 2016
11
Research Theme
ANALYSIS
Understanding
POLICY/
ACTION
DATA
Large real-world
networks & processes
Prakash 2016
Managing/Utili
zing 12
Research Theme – Public Health
ANALYSIS
Will an epidemic
happen?
POLICY/
ACTION
DATA
Modeling # patient
transfers
Prakash 2016
How to control
out-breaks?
13
Research Theme – Social Media
ANALYSIS
# cascades in
future?
POLICY/
ACTION
DATA
Modeling Tweets
spreading
Prakash 2016
How to market
better?14
In this talk
Q1: How to find hidden
culprits?
DATA
Large real-world
networks & processes
Q2: How to segment multidimensional sequences?
Prakash 2016
15
Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
– Q1: How to find hidden culprits?
– Q2: How to segment data sequences?
• Conclusion
Prakash 2016
16
Culprits Motivation
• Patient zeroes
– Who started the epidemic?
• Rumors
– Who started the rumor?
Prakash 2016
17
But: Real data is noisy!
We don’t know who exactly are infected
• Epidemiology
CDC
– Public-health surveillance
Lab
Hospital
Not sure
CNN headlines
Not sure
?
?
Surveillance Pyramid
[Nishiura+, PLoS ONE 2011]
Each level has a certain probability to
miss some truly infected people
Prakash 2016
18
Real data is noisy!
Correcting missing data is by itself very important
• Social Media
– Twitter: due to the uniform samples [Morstatter+, ICWSM
2013], the relevant ‘infected’ tweets may be missed
Tweets
Missing
?
Sampled
Tweets
? Missing
Sampling
Prakash 2016
19
Outline
•
•
•
•
•
Motivation---Introduction
Problem Definition
Our Approach
Experiments
Conclusion
Prakash 2016
20
The Problem
• GIVEN:
[Sundareisan, Vreeken, Prakash 2015]
– Graph G(V, E) from historical data
– Infected set D Ì V, sampled (p%) and incomplete
– Infectivity β of the virus (assumed to follow the SI
model)
• FIND:
– Seed set i.e. patient zeros/culprits
– Set C- (the missing infected nodes)
– Ripple R (the order of infections)
Prakash 2016
21
Related Work – Culprits (Partial)
• Shah and Zaman, IEEE TIT, 2011
– One seed.
– Provably finds MLE seed for d-regular trees
– SI process
• Lappas et. al., KDD, 2010.
– k seeds (takes in Input k)
– Infected graph assumed to be in steady-state
– IC model
• Prakash et. al., ICDM, 2012. (NetSleuth)
– Finds number of seeds automatically.
– Assumes no mistakes in infected set D.
Prakash 2016
22
Related Work – Missing Nodes
(Partial)
• Costenbader and Valente 2003; Kossinets 2006,
Borgatti et al. 2006
– Study the effect of sampling on macro level network
statistics
• Adiga et. al. 2013
– Sensitivity of total infections to noise in network
structure
• Sadikov et al., WSDM, 2011
– correct for sampling for macro level cascade statistics
Prakash 2016
23
Outline
• Motivation---Introduction
• Problem Definition
• Our Approach
–
–
–
–
MDL
Decoupling
Finding S given C
Finding C given S
• Experiments
• Conclusion
Prakash 2016
24
MDL-Minimum Description Length
Principle
• Occam’s Razor
– Simplest model is the best model
• “Induction by Compression”
• Related to Bayesian approaches
• MDL cost in bits = Model cost + Data cost
– Best model least cost in bits
Data +
Model
Channel
Sender
Prakash 2016
Receiver
25
MDL Encoding For Our Problem
The Model
Seeds (S), Ripple (R)
Sender
Graph G(V, E)
Infectivity (β)
Sampling (p)
Seeds (S)
Infected set (D C-)
Ripple (R)
Missing nodes (C-)
Missing nodes (C-)
Data given
model
Prakash 2016
Receiver
Graph G(V, E)
Infectivity (β)
Sampling (p)
26
Model (S, R) Cost
• Scoring the seed set (S)
Number of possible |S|sized sets
En-coding integer |S|
• Scoring the ripple?
Prakash 2016
27
Model (S, R) Cost
• Scoring a ripple (R)
Infected
Snapshot
Original
Graph
Ripple R1
Ripple R2
Prakash 2016
28
Model (S, R) Cost
• Ripple cost
Ripple R
How long is the ripple
How the ‘frontier’
advances
Prakash 2016
29
Cost of the data (C-)
• We have to transmit the missed nodes C(green nodes)
– So that receiver can recover D
Detail: γ = 1 – p i.e. the
probability of a node to be truly
missing
Prakash 2016
30
Total MDL Cost
• Finally
• And our problem is now
– Find S, R, C- to minimize it
Prakash 2016
31
Outline
• Motivation---Introduction
• Problem Definition
• Our Approach
–
–
–
–
MDL
Decoupling
Finding S given C
Finding C given S
• Experiments
• Conclusion
Prakash 2016
32
Our Approach: Decoupling
• The two problems are
– Finding seeds/ripple (S, R)
– Finding Missing nodes (C-)
• Can we decouple them?
Prakash 2016
33
Decoupling the problems (contd.)
• Finding seeds depends on missing nodes.
Legend
Missing nodes
Seed
Infected node
NetSleuth: No missing
nodes as Input
NetSleuth: correct missing
nodes filled in as input
Prakash 2016
34
Decoupling the problems (cont.)
• Finding missing nodes also depends on seeds.
Not
Infected
Infected
Most probably A was
missed
B
Seed
A
S
Prakash 2016
35
Outline
• Motivation---Introduction
• Problem Definition
• Our Approach
–
–
–
–
MDL
Decoupling
Finding S given C
Finding C given S
• Experiments
• Conclusion
Prakash 2016
36
Finding missing nodes (S) and
culprits (C-)
• Suppose an oracle gives us the missing nodes
(C-)
• We have complete infected set (D U C-)
• Apply NetSleuth directly
– NO SAMPLING INVOLVED
– Will give us the
seed set
* Prakash et. al.,
ICDM 2012
Prakash 2016
Legend
Missing nodes
Seed
Infected node
Applying NetSleuth* on
Oracle’s Answer
37
Outline
• Motivation---Introduction
• Problem Definition
• Our Approach
–
–
–
–
MDL
Decoupling
Finding S given C
Finding C given S
• Experiments
• Conclusion
Prakash 2016
38
Missing Nodes (C-) given (S)
• Oracle gives us S, find C• Naïve Approach?
– Find all possible C– Pick the best one according to MDL
– Infeasible! ( æ V ö sets)
ç
÷
è V\D ø
Prakash 2016
39
Our Approach
• Sub-problem 1: |Seeds| = 1
– |Missing nodes| = 1
• Sub-problem 2: Finding the right number of
missing nodes.
• Sub-problem 3: |Seeds| > 1
Prakash 2016
40
Sub Problem 1: Best hidden culprit
given one seed
• Best node is one which makes the Seed s more
likely
– We use empirical risk as the measure
• Sanity Check: ideally risk should be 0
– So best hidden culprit,
Prakash 2016
41
Sub-Problem 1: Best Hidden Culprit
• Using some results in Prakash et. al. 2012 (see
details in paper), we can rewrite it as
u1 is the eigenvector corresponding to the
smallest eigenvalue of the Laplacian submatrix
of D
Prakash 2016
42
Detour: Laplacian Submatrix
• Laplacian = Deg(G) – A(G)
Laplacian
Degree
Adjacency
• LD = take only rows for nodes in D
(Laplacian submatrix)
Laplacian
Laplacian
Submatrix
D
• u1 (smallest eigenvalue’s eigenvector)
ƛ
Prakash 2016
Eigenvector
43
Okay
• How to solve this quickly?
Proof Omitted:
see paper
Prakash 2016
44
Best hidden hazard
• Choose n* such
Measures
– how connected a node n is to centrally located
infected nodes w.r.t. s in D
– Depends on the seed as well as the structure
Prakash 2016
45
Sub-Problem 2: How many missing
nodes?
• MDL?
• Add nodes based on Z-scores till MDL
increases.
– MDL is not convex!
– But it has convex like behavior…..
Prakash 2016
46
Sub-Problem - 3: What if |Seeds| > 1
SKIP!
Using z-scores:
Missing nodes are
near one seed
Ideal: Missing nodes
near both seeds
Prakash 2016
47
Sub problem 3: What if |Seeds| > 1
• Exonerate previous seeds
SKIP!
– Make previous seeds uninfected and calculate u1
– The blame is transferred to the locality of the
older seed
• Complete Z-score = maxover all seeds Z-score (n)
– Maximum as we need high quality missing nodes
• Take nodes with top-k complete Z-scores
Prakash 2016
48
Finding missing nodes given seeds
Phew!
Prakash 2016
49
The complete algorithm – NetFill
(Outline)
Running time: sub-quadratic in practice
Prakash 2016
50
Outline
•
•
•
•
•
Motivation---Introduction
Problem Definition
Our Approach
Experiments
Conclusion
Prakash 2016
51
Datasets
• Real and Synthetic graphs.
• Real and Simulated cascades.
• Graphs
– GRID
– AS-OREGON
– FLIXSTER
• a friendship network with movie ratings
• Cascade: the same movie rating from friends
– MEME-TRACKER
Webpage
Meme
Time
Citation
• hl-mt and hl-hl
Prakash 2016
Gomez-Rodriguez et. al. 2010
52
Baselines
• NETSLEUTH
• Simulation
– Simulate the SI process till we reach D
– Seeds = Input.
– Missing nodes = I \ D.
• Frontier
– Nodes “next in line” to be infected.
• At the boundary (frontier) of infected set.
• Seeds = Find seeds given missing nodes ( NetSleuth on
Frontier + data D)
Prakash 2016
53
Visualizing Performance (Grid
connected)
NetSleuth
Seeds
Missing nodes
Legend: Correct
Simulation
Seeds
Missing nodes
FP
FN
Frontier
Seeds
Missing nodes
Seeds
Prakash 2016
NetFill
Seeds
Missing nodes
Infected
54
MDL Grid: Finding the correct size
of Missing nodes (automatically)
Prakash 2016
55
Evaluation Metrics
• For the accuracy of C- (missing nodes)
– Precision
– MCC-Matthew’s correlation coefficient.
-1 <= MCC <= 1
Closer to 1 the
better
Prakash 2016
56
Evaluation Metrics (contd.)
• For seeds (S) and ripple (R)
– Q = MDL(algorithm) / MDL(true)
• From literature (see paper for details)
– Again, closer to 1 the better
Prakash 2016
57
Grid-connected
(Synthetic Graph, Synthetic Cascades)
Closer to 1 the better
Prakash 2016
58
AS-Oregon
(Real Graph, Synthetic Cascades)
Closer to 1 the better
Prakash 2016
59
Meme-Tracker HL-MT
(Real Graph, Real Cascades)
See Paper for more
experiments e.g.
scalability, robustness etc.
Closer to 1 the better
Prakash 2016
60
Meme-Tracker– case study
• 96,000 node graph for the meme “State of the
economy”
• Found missing websites like
“www.nbcbayarea.com”,
“chicagotribune.com” and some blog posts.
Prakash 2016
61
Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
– Q1: How to find hidden culprits?
– Q2: How to segment data sequences?
• Conclusion
Prakash 2016
62
Problem 2: Sequence Segmentation
[Chen, Amiri, Prakash 2016]
Prakash 2016
63
Data sequence
• A data sequence is a sequence of data observations
with time information. D={(x1T,t1),…,(xNT,tN)}
Multidimensional
Sequences
Each observation is a
patient (a set of
feature values)
Each observation is
a tweet (a vector of
words)
Prakash 2016
64
Segmenting data sequences
• Given a data sequence D={(x1T,t1),…,(xNT,tN)}, we want
to automatically find a time segmentation in a way
that the consecutive segments are not similar.
Data sequence D={(x1T,t1),…,(xNT,tN)}
s1
t1
s2
c1
s3
c2
s4
c3
time
si: segment i
ci: cut point i
tN
Find a segmentation (ci as the cut points) such that
patterns in s1 and s2 are different, similarly s2&s3, s3&s4
Prakash 2016
65
Properties of a good solution
• P1 (Temporal Closeness)
– Some data values co-occur (happen close in time)
together in the data sequence (we capture these
values as co-occurrence clusters).
• P2 (Self-Guided)
– We want to automatically find the number of cut
points and the optimal segmentation without using
parameters like an aggregation window, or the
number of segments.
• P3 (Scalability)
– The solution should finish within reasonable time for
real datasets.
Prakash 2016
66
DASSA---Framework
(DAta Sequence Segmentation
Automatically)
• STEP 1: We use information
bottleneck and MDL principle
to find co-occurrence clusters
as the new data
representation.
• STEP 2: We construct a
segment-graph to efficiently
represent all possible
segmentations, with carefully
designed edge weights.
• STEP 3: We define the best
path, and propose a novel
algorithm to find the optimal
path (faster than the state of
the art).
Prakash 2016
67
STEP 1: Co-occurrence Clustering
• Why do we want to find co-occurring data values?
See the example of a color-ball sequence below.
Prakash 2016
68
STEP 1: Co-occurrence Clustering
• Information bottleneck [Tisbhy et al.] finds a compact representation
of X such that the information about Y is maximally kept.
• In our formulation, we use X as the set of data values, Y as the set of
all time segments.
– find a compact representation (the co-occurrence clusters) such that the
information about time segments are maximally kept (Similar/Dissimilar
segments remain similar/dissimilar).
• Example: [1, 100], 1
[2], 2
[50], 3
[1, 100], 4
[2], 5
[5], 6
Prakash 2016
69
STEP 1: Co-occurrence Clustering
• To find , we initialize as X, and greedily
merge to minimize the loss of temporal
information: i.e.
–
= X (and given P(X,Y))
– Merge xi and xj such δ(.) is minimum
• Output:
,
,
Prakash 2016
70
STEP 1: Co-occurrence clustering
• Obvious question: when to stop?!
• Answer: use MDL
– best model takes least number of bits
Pseudo code for the merging
using our MDL code
Cost to describe the model
Cost to describe the data given the model
Prakash 2016
71
STEP 2: Building the segment-graph
• We use segment-graph (which is a DAG) to represent all
possible segmentation of the original sequence
– Nodes: each node is a possible time segment, we further introduce a
source node s and a target node t
– Edges: an edge is created for each adjacent time segment pair, if the
ending time of node i equals to the start time of node j, we have an
edge e(i,j).
– Edge weights:
Satisfying three
important
axioms, see
details in paper
Prakash 2016
72
STEP 3: Finding the optimal path
• Optimal path = maximum average edge weight
(ALP)
• We propose a novel DAG-ALP algorithm to efficiently
find the average longest path of a DAG in O(h|E|),
while the state of the art ALP algorithm is
O(|V|2|E|). (See details of the algorithm in paper)
Average longest path in
the segment-graph
(corresponds to the
optimal segmentation in
the original sequence).
Prakash 2016
73
Experiments
• Datasets: Datasets from different domains are used
to test the performance of DASSA.
Prakash 2016
74
Experiments
• Baselines: There is no existing algorithm that
segments general data sequences. Hence, we adapt
one of the time series segmentation algorithm, and
further use three variations of DASSA as baselines.
Prakash 2016
75
Experiments
• For data sequences with ground truth segmentation,
we use the F1 score to measure the performance of
different algorithm.
Finding the exact ground
truth in some cases, beating
the baselines
Prakash 2016
76
Experiments
• For data sequences with ground truth segmentation,
we use the F1 score to measure the performance of
different algorithm.
Prakash 2016
77
Experiments---Flu propagation
• For data sequences with ground truth segmentation,
we use the F1 score to measure the performance of
different algorithm.
Prakash 2016
Interesting pattern found
for disease propagation
78
Case Studies---Twitter
• We do case studies to reveal interesting patterns in
the data sequences.
Frequent words used in different
segments show interesting shifting
pattern
Prakash 2016
79
Case Studies---Ebola (Sierra Leone)
• We do case studies to reveal interesting patterns in
the data sequences.
1. Ebola moves from
Kono and Kambia to Bo
2. Deaths reduce
3. New cases also go
down
Prakash 2016
80
Outline
• Motivation
• Part 1: Learning Models (Empirical Studies)
– Q1: How to find hidden culprits?
– Q2: How to segment data sequences?
• Conclusion
Prakash 2016
81
Future Plans
ANALYSIS
Understanding
POLICY/
ACTION
DATA
Large real-world
networks & processes
Managing
Prakash 2016
82
Scalability – Big Data
• Datasets of unprecedented scale
– High dimensionality and sample size!
• Need scalable algorithms for
– Learning Models
– Developing Policy
• Leverage parallel systems
– Map-Reduce clusters (like Hadoop) for data-intensive
jobs (more than 6000 machines)
– Parallelized compute-intensive simulations (like
Condor)
Prakash 2016
83
Uncertain Data in Cascade analysis
(more implementable policies)
Correcting for missing data
Original, Nodes
sampled off
Designing More Robust
Immunization Policies
Culprits, and missing
nodes filled in
Zhang and Prakash. CIKM
2014
Sundereisan, Vreeken, Prakash. 2014
Prakash 2016
84
Group-based Immunization
[Zhang, Adiga, Vullikanti, Prakash, 2015]
How to select groups to minimize the epidemic?
• Epidemiology
• People are grouped by ages,
demographics, occupations …
• Social Media
A
D
• Friends are grouped by the
same interests
• E.g., Facebook pages
C
B
F
E
Results:
First approximation algorithms
for the problem
Prakash 2015
85
Graph Sequences
• Segmenting graph sequences
…….
Segment flu cascades?
• Correct for missing data in such snapshots
Prakash 2016
86
References
1.
2.
Scalable Vaccine Distribution in Large Graphs given Uncertain Data (Yao Zhang and B. Aditya Prakash) -- In CIKM 2014.
Fast Influence-based Coarsening for Large Networks (Manish Purohit, B. Aditya Prakash, Chahhyun Kang, Yao Zhang and V. S.
Subrahmanian) – In SIGKDD 2014
3.
4.
DAVA: Distributing Vaccines over Large Networks under Prior Information (Yao Zhang and B. Aditya Prakash) -- In SDM 2014
Fractional Immunization on Networks (B. Aditya Prakash, Lada Adamic, Jack Iwashnya, Hanghang Tong, Christos Faloutsos) – In
SDM 2013
Spotting Culprits in Epidemics: Who and How many? (B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos) – In ICDM 2012,
Brussels Vancouver (Invited to KAIS Journal Best Papers of ICDM.)
Gelling, and Melting, Large Graphs through Edge Manipulation (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad, Michalis
Faloutsos, Christos Faloutsos) – In ACM CIKM 2012, Hawaii (Best Paper Award)
Rise and Fall Patterns of Information Diffusion: Model and Implications (Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash,
Lei Li, Christos Faloutsos) – In SIGKDD 2012, Beijing
Interacting Viruses on a Network: Can both survive? (Alex Beutel, B. Aditya Prakash, Roni Rosenfeld, Christos Faloutsos) – In
SIGKDD 2012, Beijing
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
Winner-takes-all: Competing Viruses or Ideas on fair-play networks (B. Aditya Prakash, Alex Beutel, Roni Rosenfeld, Christos
Faloutsos) – In WWW 2012, Lyon
Threshold Conditions for Arbitrary Cascade Models on Arbitrary Networks (B. Aditya Prakash, Deepayan Chakrabarti, Michalis
Faloutsos, Nicholas Valler, Christos Faloutsos) - In IEEE ICDM 2011, Vancouver (Invited to KAIS Journal Best Papers of
ICDM.)
Times Series Clustering: Complex is Simpler! (Lei Li, B. Aditya Prakash) - In ICML 2011, Bellevue
Epidemic Spreading on Mobile Ad Hoc Networks: Determining the Tipping Point (Nicholas Valler, B. Aditya Prakash, Hanghang
Tong, Michalis Faloutsos and Christos Faloutsos) – In IEEE NETWORKING 2011, Valencia, Spain
Formalizing the BGP stability problem: patterns and a chaotic model (B. Aditya Prakash, Michalis Faloutsos and Christos
Faloutsos) – In IEEE INFOCOM NetSciCom Workshop, 2011.
On the Vulnerability of Large Graphs (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad and Christos Faloutsos) – In IEEE
ICDM 2010, Sydney, Australia
Virus Propagation on Time-Varying Networks: Theory and Immunization Algorithms (B. Aditya Prakash, Hanghang Tong,
Nicholas Valler, Michalis Faloutsos and Christos Faloutsos) – In ECML-PKDD 2010, Barcelona, Spain
MetricForensics: A Multi-Level Approach for Mining Volatile Graphs (Keith Henderson, Tina Eliassi-Rad, Christos Faloutsos,
Leman Akoglu, Lei Li, Koji Maruhashi, B. Aditya Prakash and Hanghang Tong) - In SIGKDD 2010, Washington D.C.
Prakash 2016
87
Acknowledgements
Collaborators
Deepayan Chakrabarti,
Hanghang Tong,
Kunal Punera,
Ashwin Sridharan,
Sridhar Machiraju,
Mukund Seshadri,
Alice Zheng,
Lei Li,
Polo Chau,
Nicholas Valler,
Alex Beutel,
Xuetao Wei
Christos Faloutsos
Roni Rosenfeld,
Michalis Faloutsos,
Lada Adamic,
Theodore Iwashyna (M.D.),
Dave Andersen,
Tina Eliassi-Rad,
Iulian Neamtiu,
Varun Gupta,
Jilles Vreeken,
V. S. Subrahmanian
John Brownstein (M.D.)
Prakash 2016
88
Acknowledgements
• Students
Liangzhe Chen
Shashidhar Sundereisan
Benjamin Wang
Yao Zhang
Sorour Amiri
Prakash 2016
89
Acknowledgements
Funding
Prakash 2016
90
Propagation, Networks
and Sequences
B. Aditya Prakash
http://www.cs.vt.edu/~badityap
Analysis
Policy/Action
Prakash 2016
Data
91