Sampling Research Questions

Bruce D. Spencer
Statistics Department and Institute for Policy Research
Northwestern University
SAMSI Workshop 10/21/10
Introduction
• At the end of the opening workshop the
group in Sampling, Modeling, and
Inference raised a number of open
questions related to sampling.
• Today I will discuss those questions, most
of which are still unsolved.
Goal of Sample-Based Inference
• What is the target of the inference?
– a stochastic model that generated a network
or set of networks
– population of networks, e.g., dynamic
networks
– multiple networks on a single population of
edges
– single network
Various Network Sampling Designs
• Conventional sample design to learn about the network
– probabilities do not depend on observed data
– E.g., Current Population Survey
• Adaptive sample design using the network
– probabilities may depend on observed data
– E.g., RDS (respondent-driven sampling); ego-centric samples; link-tracing designs
• Two-phase sampling to target further investigation of
missing data or measurement error
• Subsampling (?) to reduce computational burden at
possible loss of efficiency
Conventional Sampling Design to
Learn about the Network(s)
Samples of nodes or of edges - used for
• description of network(s)
• prediction of future state of network
• prediction of links/gaps/nodes
• fitting a model to the graph
Limitations from Sampling
• Sampling introduces random error into the estimates
(and possibly bias, since E f(X) ≠ f (EX) for nonlinear f )
• Sampling variance needs to be estimated, maybe bias
does too; may be problematic for small samples
• Some population characteristics may not be “estimable”
from a sample
– E.g., maximum path length between any two nodes?
– Number of components in a general graph?
– What does “estimable” mean?
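The point that E f(X) ≠ f(EX) for nonlinear f can be checked exactly on a small example. The sketch below is only an illustration under stated assumptions: a hypothetical population of six degree values, simple random sampling without replacement of n = 3, and f(x) = x². It enumerates every possible sample, so the expectations are exact rather than simulated.

```python
from itertools import combinations

# Hypothetical population values (e.g., node degrees); not taken from the talk.
degrees = [3, 2, 2, 2, 1, 0]
n = 3

# Every size-n sample is equally likely under simple random sampling.
samples = list(combinations(degrees, n))
sample_means = [sum(s) / n for s in samples]

true_mean = sum(degrees) / len(degrees)
expected_mean = sum(sample_means) / len(sample_means)
expected_square = sum(m * m for m in sample_means) / len(sample_means)

# The sample mean is unbiased, but its square overestimates the squared mean.
# The gap equals the sampling variance of the sample mean.
bias_of_square = expected_square - true_mean ** 2
print(expected_mean - true_mean, bias_of_square)
```

The second printed value is positive, which is the bias that a plug-in estimate of a nonlinear function inherits from sampling variance.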
Limitations from Sampling
• If elements of interest (edges/non-edges, stars, motifs,
etc.) have unequal probabilities of being observed, then
– need to know the probabilities and adjust for them
– or, need to have a model that explains the population
– or, sometimes, both.
E.g.: Induced Graph Sampling
• Undirected parent graph (V, G), with node set V and edge set G
• Sample nodes $S \subseteq V$
• Observe $G(S) \subseteq G$: observe edge/non-edge
between u, v iff $u, v \in S$
• Conventional sampling with possibly unequal
probabilities (including multiple-frame stratified
multi-stage): probability of including $u_1, u_2, \ldots, u_j$
and excluding $v_1, v_2, \ldots, v_k$ knowable for any j, k
• Denote inclusion probabilities by $\pi(\cdot)$
Horvitz-Thompson Estimators of Totals
Unbiased estimator of $N = |V|$ is $\hat{N} = \sum_{u \in S} 1/\pi(u)$.
Unbiased estimator of $R = |G|$ is $\hat{R} = \frac{1}{2} \sum_{u,v \in S} T_{u,v}/\pi(u,v)$,
with $T_{u,v} = 1$ if $u \neq v$ are adjacent and $= 0$ otherwise.
Unbiased estimators of variances of H-T estimators
$V(\hat{N})$, $V(\hat{R})$, etc. are available.
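The two Horvitz-Thompson totals can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the toy graph, the function name `ht_totals`, and the choice of simple random sampling without replacement (so that π(u) = n/N and π(u,v) = n(n−1)/(N(N−1))) are all assumptions made for the example. Averaging over every possible sample gives the exact expectation of each estimator.

```python
from itertools import combinations

def ht_totals(sample, edges, pi1, pi2):
    """H-T estimates of N (nodes) and R (edges) from the induced subgraph on `sample`."""
    S = sorted(sample)
    n_hat = sum(1.0 / pi1(u) for u in S)
    # Summing over unordered sampled pairs equals half the sum over ordered pairs u != v.
    r_hat = sum(1.0 / pi2(u, v)
                for u, v in combinations(S, 2)
                if frozenset((u, v)) in edges)
    return n_hat, r_hat

# Hypothetical toy parent graph: 6 nodes, 5 edges.
nodes = range(6)
edges = {frozenset(e) for e in [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]}
N, n = 6, 3

# Inclusion probabilities under simple random sampling without replacement (assumed design).
pi1 = lambda u: n / N
pi2 = lambda u, v: n * (n - 1) / (N * (N - 1))

# Average over all equally likely samples: the exact expectation of each estimator.
samples = list(combinations(nodes, n))
estimates = [ht_totals(s, edges, pi1, pi2) for s in samples]
mean_n_hat = sum(e[0] for e in estimates) / len(samples)
mean_r_hat = sum(e[1] for e in estimates) / len(samples)
print(mean_n_hat, mean_r_hat)  # matches N = 6 and R = 5 up to rounding
```

With unequal probabilities only `pi1` and `pi2` change; the estimator itself stays the same.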
H-T Estimators of Triad Distribution
Define
$T_{k,u,v,w} = 1$ if u, v, w are distinct vertices sharing
k edges and $= 0$ otherwise.
$T_k$: number of triads in V with k edges, $0 \le k \le 3$.
Unbiased estimator of $T_k$ is $\hat{T}_k = \frac{1}{6} \sum_{u,v,w \in S} T_{k,u,v,w}/\pi(u,v,w)$.
Other totals estimated similarly, e.g., number of
stars or other motifs.
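A triad-census version of the same idea can be sketched as follows. The assumptions are again mine: a toy graph, simple random sampling without replacement (so the triple inclusion probability is the constant n(n−1)(n−2)/(N(N−1)(N−2))), and hypothetical helper names `triad_census` and `ht_triads`. Looping over unordered triples absorbs the factor 1/6 that appears when the sum runs over ordered triples.

```python
from itertools import combinations

def triad_census(node_set, edges):
    """Count triads on `node_set` by their number of edges k = 0, 1, 2, 3."""
    counts = [0, 0, 0, 0]
    for triple in combinations(sorted(node_set), 3):
        k = sum(frozenset(pair) in edges for pair in combinations(triple, 2))
        counts[k] += 1
    return counts

def ht_triads(sample, edges, pi3):
    """H-T estimates of (T_0, T_1, T_2, T_3), assuming a constant triple probability pi3."""
    return [c / pi3 for c in triad_census(sample, edges)]

# Hypothetical toy parent graph and design.
nodes = range(6)
edges = {frozenset(e) for e in [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]}
N, n = 6, 4
pi3 = n * (n - 1) * (n - 2) / (N * (N - 1) * (N - 2))

# Exact expectation: average the estimates over all equally likely samples.
samples = list(combinations(nodes, n))
mean_t_hat = [sum(ht_triads(s, edges, pi3)[k] for s in samples) / len(samples)
              for k in range(4)]
true_t = triad_census(nodes, edges)
print(true_t, mean_t_hat)
```

The averaged estimates agree with the population census, which is what unbiasedness means here. Stars and other motifs follow the same pattern with a different indicator.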
Degree Distribution
• $d_u$: degree of node u (its number of edges)
• M: maximum degree in (V, G)
• $N_r$: number of nodes of degree r, $0 \le r \le M$
• $(F_0, F_1, \ldots, F_M)$ is the degree distribution, with $F_r = N_r / N$
• Degree distribution of the sample can differ from
degree distribution of the population.
“Subnets of Scale-Free Networks are Not Scale-Free: Sampling Properties of Networks,”
Stumpf, Wiuf, May (PNAS, 2005)
Estimation of Degree Distribution
• Induced subgraph from SRS of size n from (V, G)
• $N_r$: number of nodes of degree r in parent graph
• $N_r(S)$: number of nodes of degree r in subgraph
Set $\mathbf{N} = (N_0, \ldots, N_M)^T$ and $\mathbf{N}(S) = (N_0(S), \ldots, N_M(S))^T$.
O. Frank (1980 JSPI): If $n > M$ then $E\,\hat{\mathbf{N}} = \mathbf{N}$, where
$\hat{\mathbf{N}} = B^{-1}\mathbf{N}(S)$ and B is a triangular matrix whose
entries are probabilities referring to the
hypergeometric distribution.
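A small sketch of this estimator follows. The explicit form of B used here is an assumption, my reading of "probabilities referring to the hypergeometric distribution" rather than a formula quoted from Frank's paper: entry (j, r) is the probability that a node of parent degree r is sampled and has degree j in the induced subgraph, (n/N)·C(r,j)·C(N−1−r, n−1−j)/C(N−1, n−1). The toy graph and the helper names are likewise hypothetical, and N and M are treated as known.

```python
from itertools import combinations
from math import comb

def frank_matrix(N, n, M):
    """B[j][r] = P(node of parent degree r is sampled and has subgraph degree j) under SRS."""
    return [[(n / N) * comb(r, j) * comb(N - 1 - r, n - 1 - j) / comb(N - 1, n - 1)
             for r in range(M + 1)] for j in range(M + 1)]

def solve_upper(B, y):
    """Back-substitution for the upper-triangular system B x = y."""
    m = len(y)
    x = [0.0] * m
    for j in reversed(range(m)):
        tail = sum(B[j][r] * x[r] for r in range(j + 1, m))
        x[j] = (y[j] - tail) / B[j][j]
    return x

def sample_degree_counts(sample, edges, M):
    """N(S): number of sampled nodes with degree r = 0..M in the induced subgraph."""
    counts = [0] * (M + 1)
    for u in sample:
        d = sum(frozenset((u, v)) in edges for v in sample if v != u)
        counts[d] += 1
    return counts

# Hypothetical toy parent graph: degrees are 3, 2, 2, 2, 1, 0, so M = 3.
nodes = range(6)
edges = {frozenset(e) for e in [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]}
N, n, M = 6, 4, 3  # n > M, so the diagonal of B is nonzero
B = frank_matrix(N, n, M)

# Exact expectation of the estimator over all equally likely samples.
samples = list(combinations(nodes, n))
estimates = [solve_upper(B, sample_degree_counts(s, edges, M)) for s in samples]
mean_estimate = [sum(e[r] for e in estimates) / len(samples) for r in range(M + 1)]
true_counts = sample_degree_counts(nodes, edges, M)
print(true_counts, mean_estimate)
```

Individual estimates can be negative or non-integer; only their average over samples recovers the parent counts.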
Estimation of Degree Distribution
To get degree distribution, set $\hat{F}_r = \hat{N}_r / \hat{N}$.
This will have small bias for large enough n.
I have extended the proof to accommodate
complex sampling with unequal probabilities.
Instead of $\hat{\mathbf{N}} = B^{-1}\mathbf{N}(S)$ we use $\hat{B}^{-1}\mathbf{N}(S)$
with $E\,\hat{B} = B$.
Estimation of Mean and Variance of
Degree Distribution
The mean of the degree distribution is equal to
$2R/N$ and the variance can be shown to equal
$$\frac{2\,[\,T_1 + N T_2 + 3(N-1) T_3\,]}{N(N-2)} - \frac{4R^2}{N^2}$$
(Frank 1981, Soc. Meth.). We have unbiased
H-T estimates of $N$, $R$, $T_k$. Plug in and (optionally)
subtract $\hat{V}(\hat{R}/\hat{N})$ for the variance estimate.
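The identity between the degree variance and the totals N, R, T_1, T_2, T_3 can be checked numerically. The sketch below uses a hypothetical toy graph and compares the variance computed directly from the degrees with the expression 2[T_1 + N·T_2 + 3(N−1)·T_3] / (N(N−2)) − 4R²/N². It works with full population values only; it does not implement the plug-in estimator or its bias correction.

```python
from itertools import combinations

# Hypothetical toy parent graph, fully observed.
N = 6
edges = {frozenset(e) for e in [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]}
R = len(edges)

# Triad census: T[k] = number of node triples spanning exactly k edges.
T = [0, 0, 0, 0]
for triple in combinations(range(N), 3):
    k = sum(frozenset(pair) in edges for pair in combinations(triple, 2))
    T[k] += 1

# Direct computation from the degree sequence.
degrees = [sum(frozenset((u, v)) in edges for v in range(N) if v != u)
           for u in range(N)]
mean_direct = sum(degrees) / N
var_direct = sum(d * d for d in degrees) / N - mean_direct ** 2

# The same quantities written in terms of the totals N, R, T_1, T_2, T_3.
mean_from_totals = 2 * R / N
var_from_totals = (2 * (T[1] + N * T[2] + 3 * (N - 1) * T[3]) / (N * (N - 2))
                   - 4 * R ** 2 / N ** 2)
print(var_direct, var_from_totals)
```

Because every input on the right-hand side is a total, each can be replaced by its H-T estimate when only a sample is available.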
Partial Recap
• Using induced graph subsamples from conventional samples where
joint inclusion probabilities are known, we can estimate
– population values of descriptive statistics based on totals
– degree distribution.
• (Only undirected graphs at one point in time discussed.)
• What about
– other descriptive statistics
– model fitting
– large variances when sample size small
– adaptive samples?
Approaches to Model Fitting
1. You trust* your model.
• Under certain conditions** on the sample design
and the model, you can ignore the way the sample
was selected and treat the sample as having been
generated from the model.
• The sampling mechanism needs to be carefully
examined to make sure it meets the requirements,
which depend on the model being used.
* Reagan and others, “trust but verify”
** Handcock and Gile (2010 AoAS) call the condition “amenability” and
relate it to “ignorability” (Rubin 1976).
Approaches to Model Fitting
2. “Model as descriptive statistic”. You do not necessarily
believe the model, but you want to fit the model the
way you would if you completely observed the
population.
• Anathema to many social scientists. . .
• E.g., in ERGMs, model fitting for population depends on
sufficient statistics that are population totals. One can
estimate them with H-T estimates (or alternatives) and then fit
model. (Pavel Krivitsky poster)
• I have not investigated how to implement for other models.
• If both approaches are tried, “large” differences in fits can
indicate model misspecification.
Adaptive Sampling
• Probabilities of observations depend on data from sampled units.
• Provides more information about network than conventional samples
(Frank). Note: variances may be too large when sample is
conventional but sparse.
• Probabilities of observing triads and larger configurations are typically
unavailable; even for dyads, probabilities are known for ego-centric designs
but not for link-tracing designs (Handcock and Gile 2010).
• In order to use full data, either need to estimate unknown
probabilities (hard!!) or rely on model if amenability condition can be
verified and model validated.
• E.g., when using conventional unequal probability samples to
estimate a population total, the amenability condition typically does
not hold.
Model Validation
• Model validation is important, but challenging when
sampling probabilities are unknown.
• At the heart of every adaptive sample is a conventional
sample.
• Use conventional sample to fit model as descriptive
statistic. Compare result to model fitted under
assumption of ignorability/amenability for (i) conventional
sample and (ii) larger and more informative adaptive
sample.
Recap
• What is the population (network, or set of
networks) from which sample is selected?
• Sample design (and inference) to learn about
the network
– Static
– Over time
– Description of network
– Prediction of future state of network and prediction
of links/gaps/nodes
Recap
• Sample design (and inference) using the
network to learn about a population
– Respondent Driven Sampling
– Adaptive Sampling
– Others
– Static and over time
Recap
• Subsampling design (and inference) to
– Ease computational burden
– Target further investigation to learn about
measurement error
• When can inferences be made based on
sample design information to provide
approx. unbiasedness whether or not
model is valid?
Recap
• How can model inferences be made?
– What models?
• Exponential random graph models
• Mixed membership stochastic block models
• Latent space models
• Agent based models
– What network characteristics (what summary
statistics)?
Recap
• What is effect of measurement error (and
missing data, non-response) on inferences
about network?
– RDS samples
– Others
• How to design and analyze randomized
experiments when subjects are part of a static
network? Dynamic?
– Google experiments
– Experiments on adolescents in schools (e.g., drug
counseling, obesity “treatment”) – effects on peers