Slide - Virginia Tech

IEEE BDSE2013
Computational Methods for Testing Adequacy and Quality
of Massive Synthetic Proximity Social Networks
Huadong Xia, Christopher Barrett,
Jiangzhuo Chen, Madhav Marathe
Network Dynamics and Simulation Science Laboratory
Virginia Tech
NDSSL TR-13-153
Acknowledgement
We thank our external collaborators and members of the Network Dynamics and
Simulation Science Laboratory (NDSSL) for their suggestions and comments.
This work has been partially supported by DTRA Grant HDTRA1-11-1-0016, DTRA
CNIMS Contract HDTRA1-11-D-0016-0001, NIH MIDAS Grant 2U01GM070694-09, NSF
PetaApps Grant OCI-0904844, NSF NetSE Grant CNS-1011769.
Outline
• Background and Contributions
• Methods: Network Synthesis
• Comparison of Large Scale Networks
• Conclusions
Importance of Computational Epidemiological Models
• Pandemics cause substantial social, economic
and health impacts
– 1918 flu pandemic, killed 50-100 million people or 3
to 5 percent of world population.
– …
– SARS 2003, H1N1 2009, Avian flu (H7N9) 2013
• Mathematical and Computational models have
played an important role in understanding and
controlling epidemics
– controlled experiments are not allowed for ethical reasons.
– understand the space-time dynamics of epidemics
Networked Epidemiology
• Heterogeneous
• Spatial-Temporal features of populations
• Massive, Irregular, Dynamic and Unstructured
• Social contact networks are usually synthesized
(Figure from the Internet)
The Four V’s in Networked Epidemiology
• Volume (facts in Delhi)
– 13.85M population
– 2.67M households
– >200M contacts
– 2.64M locations
• Velocity
– Interactions change every second
– Node status changes every second
– They are modeled at minute scale
– (Figure snapshots: 7am, 9am, 3pm, 8pm)
• Variety
– Demographics
– Geographic
– Temporal feature
– Virus infectivity
– ……
• Veracity
– Data: Do we collect enough raw data to render a clear picture?
– Method: Do we extract all useful information out of the available raw data?
Social Contact Network Modeling and Analysis
• The Veracity of the network one makes depends on:
– Time available to make such a network (human, computational)
– The data available to make the network
– The specific question that one would like to investigate
• Networks at different levels of detail may be constructed for the same region.
• How do we evaluate networks that span large regions?
– How to compare two networks constructed for the same
population?
– When is the synthesized network adequate?
Contributions
• Propose a number of network measurements to understand
and compare urban-scale social contact networks, which are
extremely large, dynamic and unstructured.
• Explore quantitatively the adequacy standards in modeling
proximity networks.
Synthetic Populations and Their Contact Networks
Goal:
• Determine who is where, and when.
Process:
• Create a statistically accurate baseline population
• Assign each individual to a home
• Estimate their activities and where these take place
• Determine each individual’s contacts and locations throughout a day.
Constructing Synthetic Social Contact Networks
What Is a Network
People
Vertex attributes:
• age
• household size
• gender
• income
•…
Locations
Vertex attributes:
• (x,y,z)
• land use
•…
Edge attributes:
• activity type: shop, work,
school
• (start time 1, end time 1)
• (start time 2, end time 2)
•…
• Networks capture the social interactions pertinent to the disease
• We focus on flu-like diseases, for which the appropriate network is a social
contact network based on proximity relationships.
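A minimal sketch of how the people-location structure above could be represented and projected onto proximity contacts. The `Visit` record, the `contact_edges` helper, the sample data, and the rule "same location with overlapping time intervals means contact" are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Visit:
    """One people-location edge: who is where, doing what, and when."""
    person: str
    location: str
    activity: str
    start: int  # minutes since midnight
    end: int


def contact_edges(visits):
    """Project people-location visits onto people-people contacts.

    Two people are assumed to be in contact when they are at the same
    location during overlapping intervals; the edge weight is the total
    overlap in minutes.
    """
    by_location = defaultdict(list)
    for visit in visits:
        by_location[visit.location].append(visit)

    contacts = defaultdict(int)
    for location_visits in by_location.values():
        for a, b in combinations(location_visits, 2):
            if a.person == b.person:
                continue
            overlap = min(a.end, b.end) - max(a.start, b.start)
            if overlap > 0:
                pair = tuple(sorted((a.person, b.person)))
                contacts[pair] += overlap
    return dict(contacts)


# Hypothetical sample data.
visits = [
    Visit("p1", "school_A", "school", 480, 840),
    Visit("p2", "school_A", "school", 540, 900),
    Visit("p3", "office_B", "work", 540, 1020),
    Visit("p1", "mall_C", "shop", 900, 960),
    Visit("p3", "mall_C", "shop", 930, 990),
]
contacts = contact_edges(visits)
```

The pairwise loop is quadratic in the number of visitors per location, which is acceptable for a sketch but would need sub-location grouping or interval indexing at urban scale.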
Two Sets of Data Sources and Generation Methods
for Delhi Synthetic Population and Network
Data
– demographics: India census 2001 (coarse network); India census 2001 + microdata from the India Human Development Survey, UMD (detailed network)
– geographic data: LandScan 2007 (coarse network); MapMyIndia (detailed network)
– activity: generic activity templates (coarse network); Thane travel survey + residential contact survey (detailed network)
Method
– people: distribution (coarse network); distribution/IPF (detailed network)
– locations: density (coarse network); real locations + homes along roads (detailed network)
– activity schedules: categorized templates (coarse network); decision tree + templates (detailed network)
– activity locations: gravity model; configuration model (residential contacts, detailed network only)
Residential Contacts: for the Detailed Network Only
(Figure labels: Office; Residential Area; Mall; School)
Population Synthesis
(Figure: individuals labeled by gender and age, e.g. M33, F22, F11; steps "Extract individuals" and "Split into HHs"; panels: Population for the coarse network; Population for the detailed network)
How to Compare Two Networks
• Metrics
– Entity level: the population, built infrastructure and their layout
– Collective level: validate against aggregate statistics
– Network level: structural properties
– Epidemic dynamics level: policy effects
Comparison for Synthetic Populations
Individual level age-gender structure
Household level demographic structure
Entropy: 1.35 vs. 1.02
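The slide does not say which entropy is reported or over which distribution. As a hedged illustration only, assuming Shannon entropy over a categorical attribute such as household type, the quantity could be computed as follows (`shannon_entropy` and the base-2 logarithm are assumptions):

```python
import math
from collections import Counter


def shannon_entropy(labels, base=2.0):
    """Shannon entropy of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(
        (count / total) * math.log(count / total, base)
        for count in counts.values()
    )
```

A higher value means the categories are spread more evenly; a population in which every household has the same type has entropy 0.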
Precision of Location Distribution
(Figure panels: LandScan grid; synthetic locations, the coarse network; real locations, the detailed network)
Activity Statistics
Temporal Visiting Degree in Randomly Selected Locations
Note: first row: the coarse network; second row: the detailed network
GPL: Temporal and Spatial Properties
travel distance distribution
radius of gyration distribution
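For reference, a small sketch of the radius of gyration of one person's visited locations, using the standard definition (root mean square distance of the visited points from their centroid). Planar (x, y) coordinates and equal weight per visit are assumptions; the slides do not state how visits are weighted.

```python
import math


def radius_of_gyration(points):
    """RMS distance of visited (x, y) points from their centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(
        sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    )
```

For example, a person who only visits two locations 2 units apart has a radius of gyration of 1.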
GPL: Structural Properties
• In the people-location network GPL, the degrees of a large portion
of non-home locations follow a power-law-like distribution.
People-People Network GP
Disease Spread in a Social Network
• Within-host disease model: SEIR
• Between-host disease model:
– probabilistic transmissions along edges of social contact
network
– from infectious people to susceptible people
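The two-level model above can be sketched as a discrete-time simulation. This is a simplified illustration, not the simulator used in the study: it assumes one time step per day, a single per-contact transmission probability `p_transmit`, and fixed incubation and infectious durations (the study uses heterogeneous durations). All function and parameter names are hypothetical.

```python
import random
from collections import Counter


def simulate_seir(contacts, seeds, p_transmit, incubation_days,
                  infectious_days, days, rng):
    """Run SEIR on a contact network given as {person: [neighbors]}."""
    state = {person: "S" for person in contacts}
    timer = {}  # days left in the current E or I stage
    for person in seeds:
        state[person] = "I"
        timer[person] = infectious_days

    history = []
    for _ in range(days):
        # Between-host: infectious people may infect susceptible neighbors.
        newly_exposed = set()
        for person, status in state.items():
            if status != "I":
                continue
            for neighbor in contacts[person]:
                if state[neighbor] == "S" and rng.random() < p_transmit:
                    newly_exposed.add(neighbor)

        # Within-host: advance E -> I -> R when a stage timer runs out.
        for person in list(timer):
            timer[person] -= 1
            if timer[person] > 0:
                continue
            if state[person] == "E":
                state[person] = "I"
                timer[person] = infectious_days
            else:
                state[person] = "R"
                del timer[person]

        for person in newly_exposed:
            state[person] = "E"
            timer[person] = incubation_days
        history.append(Counter(state.values()))
    return state, history
```

Usage: `simulate_seir({0: [1], 1: [0, 2], 2: [1]}, [0], 0.1, 2, 4, 300, random.Random(0))` returns the final state of each person and the daily S/E/I/R counts.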
Epidemic Simulations to Study the Delhi Population
• Disease model
– Flu similar to H1N1 in 2009: assume R0=1.35, 1.40, 1.45, 1.60
(only the results for R0=1.35 are shown, but the others are similar)
– SEIR model: heterogeneous incubation and infectious durations
– 10 random seeds every day
• Interventions
– Vaccination: implemented at the beginning of the epidemic; compliance rate 25%
– Antiviral: implemented when 1% of the population is infectious; covers 50% of the
population; effective for 15 days
– School closure: implemented when 1% of the population is infectious;
compliance rate 60%; lasts for 21 days
– Work closure: implemented when 1% of the population is infectious; compliance
rate 50%; lasts for 21 days
• Five configurations in total (including the base case). Each configuration is
simulated for 300 days with 30 replicates.
Comparison in Epidemic Simulations
• Impact on Epidemic Dynamics (R0=1.35):
– The coarse network uses generic activity schedules, in which people travel much more frequently.
Therefore, the two networks show very different epidemic dynamics in the base case.
Epidemic Simulation Results: Interventions
• Similarities of the two networks:
– Vaccination is still the most effective strategy.
– Pharmaceutical interventions are more effective than non-pharmaceutical ones.
– School closure is more effective than work closure.
• Differences between the two networks:
– Severity is significantly different.
– In delaying the outbreak, school closure is more effective than antivirals in the
coarse network; the opposite holds in the detailed network.
Metrics Review
Categories: Metrics
• Underlying Synthetic Population: Household Structure; Location Layout; Duration of Activities; Number of Daily Activities
• GPL: Travel Distance; Radius of Gyration; Temporal Degree of Random Locations; Degree of People-Location Graphs
• GP: Degree, Clustering Coefficient, Contact Duration, Shortest Path
• Epidemic Dynamics: No Interventions; Pharmaceutical Interventions; Non-Pharmaceutical Interventions
Conclusions
• A novel methodology for creating a realistic social contact
network for a typical urban area in a developing country
• Comparison to a coarser network suggests:
– Similarities reflect generic properties of social contact networks
– Region-specific features are captured in the detailed model
– The epidemic dynamics of the region are strongly influenced by the activity
patterns and demographic structure of local residents
– A higher-resolution social contact network helps us make better public
health policy
• A realistic representation of social networks requires adequate
empirical input. We propose the following criteria of adequacy:
– Does the new input decrease the uncertainty of the system?
– Does the new input significantly change the epidemics and intervention
policy?
END
Questions?
EXTRA SLIDES
Epidemic Simulation Results: Vulnerability
• Calibrate R0 to be 1.35
• Vulnerability is defined as the normalized number of infections over 10,000 runs of random
simulations
• The vulnerability distribution of the detailed network is flat compared to that of the coarse network,
and the detailed network is less vulnerable due to less frequent travel.
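One way to read this definition, offered here as an assumption rather than the authors' exact procedure, is per individual: the fraction of simulation runs in which that individual becomes infected. A minimal sketch with hypothetical names:

```python
from collections import Counter


def vulnerability(infected_per_run, population):
    """Fraction of runs in which each person in `population` is infected.

    `infected_per_run` holds one collection of infected people per run.
    """
    counts = Counter()
    for infected in infected_per_run:
        counts.update(set(infected))  # count each person once per run
    runs = len(infected_per_run)
    return {person: counts[person] / runs for person in population}
```

The distribution of these per-person values over the population is what could then be compared between the coarse and the detailed network.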
Epidemic Simulation Results
• Calibrate R0 to be 1.35
Delhi: National Capital Territory of India
• Case study:
– Delhi (NCT-I): a representative South Asian city that had not been studied
before.
• Statistics:
– 13.85 million people in 2001; 22 million in 2011
– Most populous metropolis: 2nd in India; 4th in the world
– 573 square miles, 9 regions (refer to the picture)
– The Yamuna river runs through the urban area.
• Unique socio-cultural characteristics:
– Large slum area
– Tropical weather
– Environmental hygiene
Two Versions of Delhi Networks
• The coarse network:
– Based on very limited data
– Generic methodology applicable to any region in the world
• The detailed network:
– Requires household level micro sample data and other detailed data,
not available for all countries
• Improvement on results is expected:
– to evaluate the network generation model;
– to understand importance of different levels of details.
V1: Synthetic Population Generation
• Population generation
Input: Joint distribution of age and gender of the population in Delhi (from the India
census 2001)
Algorithm:
– Normalize the counts in the joint distribution of age and gender into a joint
probability table
– Create 13.85 million individuals one by one.
For each individual:
Randomly select a cell c of the joint table according to the cell probabilities.
Create a person with the age and gender corresponding to the cell c.
End
Output: 13.85 million individuals, each associated with the
disaggregate attributes of gender and age.
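The V1 procedure can be sketched in a few lines. The function name, the dictionary layout of the joint table, and the age-group labels are illustrative assumptions; the census table itself is not reproduced here.

```python
import random


def synthesize_population(joint_counts, n_people, rng):
    """Draw people from a joint (age group, gender) count table.

    `joint_counts` maps (age_group, gender) cells to census counts.
    """
    cells = list(joint_counts)
    total = sum(joint_counts.values())
    # Normalize counts into a joint probability table.
    probabilities = [joint_counts[cell] / total for cell in cells]
    # Select one cell per individual according to the cell probabilities.
    draws = rng.choices(cells, weights=probabilities, k=n_people)
    return [{"age_group": age, "gender": gender} for age, gender in draws]
```

Usage: `synthesize_population({("0-4", "F"): 120, ("0-4", "M"): 130}, 1000, random.Random(0))` returns 1,000 person records. Because people are drawn independently, this version carries no household structure.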
Data Input
• Demographic Data: basic census data + India Micro-Sample
– India Census 2001
– Micro sample for household structure: India Human Development Survey 2005 by the University
of Maryland and the National Council of Applied Economic Research. For each household sample it
gives hh size, hh head’s age, hh income, house type, animal care; and for each individual in the hh:
demographic details, religion, work, marital status, relationship to head, etc.
• Activity Data: Thane travel survey + residential contacts survey
– Activity templates from 2001 Household Travel Survey statistics for Thane, India, and
2005-2009 school attendance statistics from the UNESCO Institute of Statistics (UIS)
o Activity templates are extracted with CART and assigned to the synthetic population with a
decision tree.
– Survey on residential area contacts in India, conducted by NDSSL
o Approximately 40% of adults in India do not travel to work. The survey focused on them.
o Collected people’s age, gender, and contact durations/frequencies near their home.
• Location Data: MapMyIndia data
– Ward-wise statistics for population and households.
– Coordinates for locations such as schools, shopping centers, hotels, etc.
– Infrastructure such as roads, railway stations, land use, etc.
– Boundary for each city, town and ward.
V2: synthetic population creation method
• Same methodology as we used for US populations:
Input: total # of households
Aggregate distribution of demographic properties from Census: hh size, householder’s age
Household micro-samples
Output: Synthetic population with household structure. Each individual is assigned an age and gender.
Algorithm:
1. Estimate the joint distribution of household size and householder’s age:
1) construct a joint table of hh size and householder’s age: fill in the # of samples for each cell
2) multiply the total # of households by the distributions to calculate the marginal totals for the table
3) run IPF to get a converged joint table
4) normalize: divide the count in each cell by (total # of samples); this gives the probability of each cell.
(illustrated in next slide)
2. Create the synthetic households and population:
1) randomly select a cell with the probability given in the joint table
2) select a household sample h uniformly at random from all samples associated with that cell
3) create a synthetic household H, so that H has the same members as h and each member in H has the same
demographic attributes as the corresponding member in h.
4) repeat steps 2.1-2.3 until the # of synthetic households is equal to the total # of households from
the Census.
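Step 2 of the algorithm (sampling households from the fitted joint table) might look like the sketch below. The names and data layout are assumptions, and the sketch further assumes that every cell with positive probability has at least one micro-sample household.

```python
import random


def synthesize_households(cell_probs, samples_by_cell, n_households, rng):
    """Create synthetic households by copying sampled micro-data households.

    `cell_probs` maps (hh_size, householder_age) cells to probabilities;
    `samples_by_cell` maps each cell to a list of sample households, where
    a household is a list of member attribute dicts.
    """
    cells = list(cell_probs)
    weights = [cell_probs[cell] for cell in cells]
    households = []
    while len(households) < n_households:
        # 2.1: pick a cell with the probability in the joint table.
        cell = rng.choices(cells, weights=weights, k=1)[0]
        # 2.2: pick a sample household uniformly at random from that cell.
        sample = rng.choice(samples_by_cell[cell])
        # 2.3: copy its members and their demographic attributes.
        households.append([dict(member) for member in sample])
    return households
```

Copying whole sample households is what preserves within-household structure (for example, the ages of parents and children), which independent draws per person cannot do.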
IPF example
Row targets: 20, 30, 35, 15. Column targets: 35, 40, 25.

Start (first column: row targets; top row: column targets)
            35     40     25
    20       6      6      3
    30       8     10     10
    35       9     10      9
    15       3     14      8

Iteration 1, row adjustment (first column: row targets; top row: column sums)
         29.62  39.61  30.76
    20    8.00   8.00   4.00
    30    8.57  10.71  10.71
    35   11.25  12.50  11.25
    15    1.80   8.40   4.80

Iteration 1, column adjustment (first column: row sums; top row: column targets)
            35     40     25
 20.78    9.45   8.08   3.25
 29.65   10.13  10.82   8.71
 35.06   13.29  12.62   9.14
 14.51    2.13   8.48   3.90

Iteration 2, row adjustment
         34.81  40.09  25.10
    20    9.10   7.77   3.13
    30   10.25  10.95   8.81
    35   13.27  12.60   9.13
    15    2.20   8.77   4.03

Iteration 2, column adjustment
            35     40     25
 20.02    9.15   7.76   3.12
 30.00   10.30  10.92   8.77
 35.01   13.34  12.57   9.09
 14.98    2.21   8.75   4.02

Iteration 3, row adjustment
         34.99  40.00  25.00
    20    9.14   7.75   3.11
    30   10.30  10.92   8.78
    35   13.34  12.57   9.09
    15    2.21   8.76   4.02

Iteration 3, column adjustment: Finished
            35     40     25
 20.00    9.14   7.75   3.11
 30.00   10.30  10.92   8.77
 35.00   13.34  12.57   9.09
 15.00    2.21   8.76   4.02
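The iteration in this example can be reproduced with a short iterative proportional fitting routine. This is a generic sketch (the function name, tolerance and iteration cap are assumptions); it expects a seed table with positive row and column sums, and targets whose totals agree.

```python
def ipf(seed, row_targets, col_targets, tol=1e-6, max_iter=100):
    """Iterative proportional fitting of a 2-D table to marginal targets."""
    table = [[float(x) for x in row] for row in seed]
    for _ in range(max_iter):
        # Row adjustment: scale each row to its target total.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [x * target / row_sum for x in table[i]]
        # Column adjustment: scale each column to its target total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
        # Columns now match exactly; stop once the rows match as well.
        if all(abs(sum(row) - t) < tol
               for row, t in zip(table, row_targets)):
            break
    return table


# Seed table and targets from the example above.
seed = [[6, 6, 3], [8, 10, 10], [9, 10, 9], [3, 14, 8]]
fitted = ipf(seed, row_targets=[20, 30, 35, 15], col_targets=[35, 40, 25])
```

Dividing every cell of `fitted` by the table total then yields the joint probability table used for household sampling.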
V2: household distribution – a snapshot
• Households are distributed along real streets/community blocks.
• V2 avoids placing households on rivers, lakes, green land, etc. (V1 distributes them
uniformly within each 1 mile × 1 mile block)
Flowchart: Generating Activity Sequences based on Thane Survey
for Delhi-V2
• Activity templates generation
– Data sources: demographics of the Thane sample population; UIS statistics
– Steps: decision tree; sampling from the frequency distribution of reported activity sequences; sampling from the frequency distribution of trips (trip start time, trip length)
– Outcome: commute categories; activity sequences; 1) demographics, 2) activity template (activity sequence, activity duration)
Generation of the Residential Network
• Motivation for the residential contact network:
– Approximately 40% of adults in India do not travel to work. The network models interactions
among them around their homes (within the residential area).
• Survey data collected:
– age and gender of people staying at home: node label
– contact durations/frequencies of each person near their home: edge label/node
degree
• Formal question: generate a random network s.t.
– The degree distribution of the nodes is given
– The label of each node is given
– Assumption: the network tends to be homophilous (nodes with similar labels are connected
with higher probability)
• Method:
– Configuration model with the added feature of node homophily.
– Refer to the next slide for details.
Random Network Generation: configuration
model with the added feature of node homophily.
For each edge-type in (long-dur, mid-dur, short-dur), do:
1. Initialize each node with a degree drawn i.i.d. from the degree distribution
according to its label (age/gender).
2. Form a list of “stubs”: connections of nodes that haven’t been matched with
neighbors. Call it stubList.
3. Pick a starting node v0 randomly.
4. For each of v0’s stubs, choose an element v1 from the stubList as follows:
1) v1 is chosen randomly from the stubList;
2) if v1 is the same as v0 or already connected to v0, go to 4.1);
3) with probability p (>0.5), test whether v1 is similar to v0;
if not, go to 4.1) and repeat the selection;
4) create an edge between v0 and v1; its duration is computed randomly based
on the edge-type (long, mid or short duration).
Done.
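A hedged sketch of the stub-matching procedure for a single edge type. Several details are assumptions not stated on the slide: "similar" is taken to mean "same label", every node is visited once in random order as v0, a retry cap (`max_tries`) guards against endless rejection, and edge durations are omitted. `degree_of` is a hypothetical callable that draws a degree for a given label.

```python
import random


def homophilous_configuration_model(labels, degree_of, p_test, rng,
                                    max_tries=100):
    """Match stubs at random, preferring partners with the same label.

    `labels` maps node -> label; `degree_of(label, rng)` draws a degree.
    Returns a set of undirected edges as frozensets {v0, v1}.
    """
    stubs = {v: degree_of(labels[v], rng) for v in labels}
    stub_list = [v for v, d in stubs.items() for _ in range(d)]
    edges = set()

    order = list(labels)
    rng.shuffle(order)
    for v0 in order:
        while stubs[v0] > 0:
            partner = None
            for _ in range(max_tries):
                v1 = rng.choice(stub_list)
                # Reject self-loops and duplicate edges.
                if v1 == v0 or frozenset((v0, v1)) in edges:
                    continue
                # With probability p_test, require a similar partner.
                if rng.random() < p_test and labels[v1] != labels[v0]:
                    continue
                partner = v1
                break
            if partner is None:
                break  # no acceptable partner found; leave stubs unmatched
            edges.add(frozenset((v0, partner)))
            for v in (v0, partner):
                stubs[v] -= 1
                stub_list.remove(v)
    return edges
```

Running it once per edge type (long, mid, short duration) and attaching a randomly drawn duration to each edge would follow the outer loop on the slide.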