Transcript slides

Statistics vs Big Data
Constantinos Daskalakis
CSAIL and EECS, MIT
Greek Stochastics πœƒ
YOU WANT BIG DATA?
I’LL GIVE YOU BIG DATA!
BIG Data
β€’ Facebook: 20 petabytes of images daily
β€’ Human genome: 40 exabytes of storage by 2025
β€’ SKA Telescope: 1 exabyte daily
High-dimensional: DNA microarray, computer vision
Expensive: experimental drugs, financial records
What properties do your BIG distributions have?
e.g. 1: Play the lottery?
Is it uniform? Is the lottery unfair?
from Hitlotto.com:
β€œLottery experts agree, past number histories
can be the key to predicting future winners.”
True Story!
β€’ Polish lottery Multilotek
– Choose 20 distinct numbers "uniformly" at random out of 1 to 80.
– Initial machine biased
β€’ e.g., probability of drawing 50-59 too small
Thanks to Krzysztof Onak (pointer) and Eric Price (graph)
New Jersey Pick 3, 4 Lottery
β€’ New Jersey Pick k (k = 3, 4) Lottery.
– Pick k random digits in order.
– 10^k possible values.
β€’ Data:
– Pick 3: 8522 results from 5/22/75 to 10/15/00
β€’ πœ’Β²-test (on Excel) answers "42% confidence"
– Pick 4: 6544 results from 9/1/77 to 10/15/00
β€’ fewer results than possible values (10⁴)
β€’ not a good idea to run a πœ’Β² test:
β€’ convergence to the πœ’Β² distribution won't kick in for such a small sample size
β€’ (textbook) rule of thumb: expect at least 5 observations of each element in the domain under the null hypothesis before running πœ’Β²
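In Python, the πœ’Β² computation above looks roughly as follows; the draws here are synthetic stand-ins (the actual Pick-3 records are not reproduced), and scipy's chisquare is used for the test itself:

    import numpy as np
    from scipy import stats

    # Synthetic stand-in for the Pick-3 data: 8522 draws of a 3-digit number.
    rng = np.random.default_rng(0)
    k = 3
    n_draws = 8522
    draws = rng.integers(0, 10**k, size=n_draws)

    observed = np.bincount(draws, minlength=10**k)
    expected = np.full(10**k, n_draws / 10**k)

    # Textbook rule of thumb: expected count >= 5 in every cell.
    # Holds for Pick 3 (8522/1000 ~ 8.5) but fails badly for Pick 4 (6544/10000 < 1).
    print("min expected count:", expected.min())

    chi2, pvalue = stats.chisquare(observed, expected)
    print(f"chi2 = {chi2:.1f}, p-value = {pvalue:.3f}")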
e.g. 2: Independence Testing
Shopping patterns:
Independent of zip code?
e.g. 2: Linkage Disequilibrium
[figure: genome with locus 1, locus 2, …, locus 𝑛]
Single nucleotide polymorphisms: are they independent?
Suppose 𝑛 loci, 2 possible states each, then:
β€’ state of one's genome ∈ {0,1}ⁿ
β€’ humans: some distribution 𝑝 over {0,1}ⁿ
Question: Is 𝑝 a product dist'n OR far from all product dist'ns?
Should we expect the genomes from the 1000 Genomes Project to be sufficient? Up to how many loci?
e.g. 3: Outbreak of diseases
β€’ Similar patterns in different years?
β€’ More prevalent near large airports?
[maps: Flu 2005 vs Flu 2006]
Distributions on BIG domains
β€’ Given samples of a distribution, need to know, e.g.,
– entropy
– number of distinct elements
– "shape" (monotone, unimodal, etc.)
– closeness to uniform, Gaussian, Zipfian…
β€’ No assumptions on the shape of the distribution
– i.e., no parametric, smoothness, monotonicity, or normality assumptions,…
β€’ Considered in Statistics, information theory, machine
learning, databases, algorithms, physics, biology,…
Old questions, new challenges
Classical setting: small domain 𝐷; 𝑛 large, 𝐷 small (comparatively)
– e.g., domain: 1000 tosses of a coin
– asymptotic analysis; computation not crucial
Modern setting: large domain 𝐷; 𝑛 small, 𝐷 large (comparatively)
– e.g., domain: one human genome
– new challenges: samples, computation, communication, storage
A Key Question
β€’ How many samples do you need in terms of the domain size?
– Do you need to estimate the probabilities of each domain item?
– OR –
– Can sample complexity be sublinear in the size of the domain?
(sublinear sample size rules out standard statistical techniques)
Aim
Algorithms with sublinear sample complexity
[Venn diagram: statistics, machine learning, information theory, algorithms]
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Problem formulation
Model
β€’ 𝒫: family of distributions over a discrete domain 𝐷
– may be non-parametric, e.g. unimodal, product, log-concave
Problem
β€’ Given: samples from unknown 𝑝
β€’ With probability 0.9, distinguish 𝑝 ∈ 𝒫 vs 𝑑(𝑝, 𝒫) > πœ€
– where 𝑑(𝑝, 𝒫) = min_{π‘žβˆˆπ’«} β„“β‚(𝑝, π‘ž)/2, and β„“β‚(𝑝, π‘ž)/2 = max_{events β„°} |𝑝(β„°) βˆ’ π‘ž(β„°)| (total variation)
Objective
β€’ Minimize samples: sublinear in |𝐷|?
β€’ Computational efficiency
Well-studied Problem
β€’ (Composite) hypothesis testing:
– Neyman-Pearson test
– Kolmogorov-Smirnov test
– Pearson's chi-squared test
– Generalized likelihood ratio test
– …
β€’ Quantities of interest:
– 𝑃_F = Pr(accept when hypothesis false)
– 𝑃_M = Pr(reject when hypothesis true)
– Consistency
– Error exponents: exp(βˆ’π‘  Β· 𝑅) as 𝑠 β†’ ∞
β€’ Asymptotic regime: results kick in when 𝑠 ≫ |𝐷|
β€’ Our focus instead: sublinear in |𝐷|? strong control of false positives?
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Testing Fairness of a Coin
β€’ 𝑏: unknown probability of heads
β€’ 𝒫 = {Bernoulli(1/2)}; 𝑝 = Bernoulli(𝑏)
β€’ Distinguish 𝑝 ∈ 𝒫 vs 𝑑_TV(𝑝, 𝒫) > πœ€
β€’ Question: Is 𝑏 = 0.5 OR |𝑏 βˆ’ 0.5| > πœ–?
β€’ Goal: Toss the coin several times, deduce the correct answer w/ probability > 0.99
β€’ Number of samples required?
– Can estimate 𝑏 by tossing π‘˜ times, then taking 𝑏̂ = (1/π‘˜) Ξ£_{𝑖=1..π‘˜} 𝑍_𝑖
– By concentration bounds, if π‘˜ > 𝑂(1/πœ–Β²), then |𝑏̂ βˆ’ 𝑏| < πœ–/3 w/ probability > 0.99
β€’ Are Ξ©(1/πœ–Β²) many samples necessary?
– Suppose there is a tester using π‘˜ samples
– Then it can distinguish one sample from 𝑋 = (𝑋₁, …, 𝑋_π‘˜), where each 𝑋_𝑖 ∼ 𝐡(0.5), from one sample from π‘Œ = (π‘Œβ‚, …, π‘Œ_π‘˜), where each π‘Œ_𝑖 ∼ 𝐡(0.5 + πœ–), w/ probability > 0.99
– Claim: Any tester has error probability at least Β½ (1 βˆ’ 𝑑_TV(𝑋, π‘Œ))
– 𝑑_TV(𝑋, π‘Œ) ≀ √2 Β· 𝐻(𝑋, π‘Œ) = 𝑂(πœ– Β· βˆšπ‘˜), so π‘˜ = Ξ©(1/πœ–Β²) is needed for small error probability
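A minimal sketch of this estimate-then-threshold tester, assuming i.i.d. coin tosses; the constant 30 in the sample size and the πœ–/2 acceptance threshold are illustrative, not the optimal constants:

    import numpy as np

    def fair_coin_tester(samples, eps):
        """Declare 'fair' iff the empirical bias is within eps/2 of 1/2.

        With k = C/eps^2 tosses, Hoeffding puts the empirical mean within
        eps/3 of the true bias w/ probability > 0.99, which separates
        b = 1/2 from |b - 1/2| > eps.
        """
        return abs(np.mean(samples) - 0.5) < eps / 2

    rng = np.random.default_rng(1)
    eps = 0.1
    k = int(30 / eps**2)                 # C = 30 is an illustrative constant
    fair = rng.random(k) < 0.5           # b = 0.5
    biased = rng.random(k) < 0.5 + eps   # b = 0.5 + eps
    print(fair_coin_tester(fair, eps), fair_coin_tester(biased, eps))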
Testing Uniformity
β€’ 𝒫 = {Uniform_𝐷}; 𝑝: unknown distribution over 𝐷
β€’ Sample access to 𝑝
β€’ Question: is 𝑝 = π‘ˆ_𝐷 or 𝑑_TV(𝑝, π‘ˆ_𝐷) > πœ–? (i.e., 𝑝 ∈ 𝒫 vs 𝑑_TV(𝑝, 𝒫) > πœ€)
β€’ [Paninski'03]: Θ(√|𝐷|/πœ–Β²) samples and time
"Intuition:"
β€’ (Lower Bound) Suppose π‘ž is the uniform distribution over {1, …, π‘š} and 𝑝 is either uniform on {1, …, π‘š} or uniform on a random size-π‘š/2 subset of {1, …, π‘š}
– unless Ξ©(βˆšπ‘š) samples are observed, there are no collisions, hence the two cases cannot be distinguished
β€’ (Upper Bound) Collision statistics suffice to distinguish
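A sketch of a collision-based tester along these lines; this is the collision idea in its simplest form, not Paninski's exact test, and the acceptance threshold and sample-size constants are illustrative:

    import numpy as np
    from collections import Counter

    def collision_tester(samples, m, eps):
        """Accept 'uniform' iff the empirical collision rate is close to 1/m.

        Under the uniform distribution the pairwise collision probability
        is exactly 1/m; a distribution eps-far in TV has collision
        probability >= (1 + c*eps^2)/m.  The threshold below is illustrative.
        """
        k = len(samples)
        counts = Counter(samples)
        collisions = sum(c * (c - 1) // 2 for c in counts.values())
        pairs = k * (k - 1) // 2
        return collisions / pairs <= (1 + 2 * eps**2) / m

    rng = np.random.default_rng(2)
    m, eps = 10_000, 0.25
    k = int(20 * np.sqrt(m) / eps**2)    # Theta(sqrt(m)/eps^2) samples
    uniform = rng.integers(0, m, size=k)
    # Far instance from the lower bound: uniform over a random half of [m]
    subset = rng.choice(m, size=m // 2, replace=False)
    far = rng.choice(subset, size=k)
    print(collision_tester(uniform, m, eps), collision_tester(far, m, eps))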
Proving Lower Bounds
β€’ [Le Cam'73]: Consider two disjoint sets of distributions 𝒫₁, 𝒫₂. Suppose algorithm π’œ is given π‘˜ samples from some unknown 𝑝 ∈ 𝒫₁ βˆͺ 𝒫₂ and claims to distinguish 𝑝 ∈ 𝒫₁ vs 𝑝 ∈ 𝒫₂.
β€’ Then: Pr[error] β‰₯ Β½ (1 βˆ’ inf 𝑑_TV(𝑝₁, 𝑝₂)), where the inf is over 𝑝₁ ∈ conv_π‘˜(𝒫₁), 𝑝₂ ∈ conv_π‘˜(𝒫₂)
– conv_π‘˜(𝒫): all distributions generating π‘˜ samples as follows:
β€’ choose a random distribution 𝑝 from 𝒫 (according to some distribution over 𝒫)
β€’ then generate π‘˜ samples from 𝑝
Proving Lower Bounds
β€’ [Le Cam'73] (as above): Pr[error] β‰₯ Β½ (1 βˆ’ inf_{𝑝₁ ∈ conv_π‘˜(𝒫₁), 𝑝₂ ∈ conv_π‘˜(𝒫₂)} 𝑑_TV(𝑝₁, 𝑝₂))
β€’ To prove the Ξ©(√|𝐷|/πœ–Β²) lower bound for uniformity testing take:
– 𝒫₁ = {Uniform[π‘š]}
– 𝒫₂ = {π‘ž | π‘ž_{2𝑖+1} = (1 Β± π‘πœ–)/π‘š, π‘ž_{2𝑖+2} = (1 βˆ“ π‘πœ–)/π‘š, βˆ€π‘–}
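A sketch of how a member of 𝒫₂ generates its π‘˜ samples, i.e. the conv_π‘˜(𝒫₂) sampling process above; the constant 𝑐 is the one from the slide:

    import numpy as np

    rng = np.random.default_rng(3)

    def sample_from_P2(m, eps, k, c=2.0):
        """Draw k samples from a random member of P2: pair up the domain and
        perturb each pair's probabilities to (1+c*eps)/m, (1-c*eps)/m with a
        random sign per pair.  Each member of P2 is at total variation
        distance c*eps/2 from uniform.
        """
        signs = rng.choice([-1.0, 1.0], size=m // 2)   # one sign per pair
        q = np.empty(m)
        q[0::2] = (1 + c * eps * signs) / m
        q[1::2] = (1 - c * eps * signs) / m
        return rng.choice(m, size=k, p=q)

    samples = sample_from_P2(m=1000, eps=0.25, k=400)

With fewer than order βˆšπ‘š/πœ–Β² samples, the collision statistics of such samples are nearly indistinguishable from those of uniform samples, which is the heart of the lower bound.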
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Identity Testing ("goodness of fit")
β€’ 𝑝, π‘ž: distributions over 𝐷
β€’ π‘ž: given; sample access to 𝑝
β€’ Question: is 𝑝 = π‘ž or 𝑑_TV(𝑝, π‘ž) > πœ–?
– i.e., 𝒫 = {π‘ž}, 𝑝 unknown: distinguish 𝑝 ∈ 𝒫 vs 𝑑_TV(𝑝, 𝒫) > πœ€
β€’ [Batu-Fischer-Fortnow-Kumar-Rubinfeld-White'01]…
β€’ [Paninski'08, Valiant-Valiant'14]: Θ(√|𝐷|/πœ–Β²) samples and time
β€’ [w/ Acharya-Kamath'15]: a tolerant goodness-of-fit test with the same sample size can distinguish: πœ’Β²(𝑝, π‘ž) ≀ πœ–Β² vs ℓ₁²(𝑝, π‘ž) > 4 Β· πœ–Β²
– πœ’Β²(𝑝, π‘ž) ≔ Ξ£_{π‘–βˆˆπ·} (𝑝_𝑖 βˆ’ π‘ž_𝑖)Β²/π‘ž_𝑖
– Cauchy-Schwarz: πœ’Β²(𝑝, π‘ž) β‰₯ β„“β‚(𝑝, π‘ž)Β²
A new πœ’ 2 - Goodness of Fit Test
β€’ Goal: given π‘ž and sample access to 𝑝 distinguish:
Case 1: πœ’ 2 𝑝, π‘ž ≀ πœ– 2 vs Case 2: β„“12 𝑝, π‘ž β‰₯ 4 β‹… πœ– 2
β€’ Approach: Draw Poisson(π‘š) many samples from 𝑝
Side-Note:
β€’ 𝑁𝑖 : # of appearances of symbol 𝑖 ∈ 𝐷
β€’ Pearson’s πœ’ 2 test uses
π‘π‘–βˆ’π‘šβ‹…π‘žπ‘– 2
– 𝑁𝑖 ~ Poisson π‘š β‹… 𝑝𝑖
statistic 𝑖
– 𝑁𝑖
π‘–βˆˆπ·
independent random variables
β€’ Statistic: 𝑍 =
𝑁𝑖 βˆ’π‘šβ‹…π‘žπ‘– 2 βˆ’π‘π‘–
𝑖
π‘šβ‹…π‘žπ‘–
πœ’ 2 (𝑝, π‘ž)
π‘šβ‹…π‘žπ‘–
β€’ Subtracting 𝑁𝑖 in the
numerator gives an unbiased
estimator and importantly
may hugely decrease
variance
– 𝐸 𝑍 =π‘šβ‹…
– Case 1: 𝐸 𝑍 ≀ π‘š β‹… πœ– 2 ; Case 2: 𝐸 𝑍 β‰₯ π‘š β‹… 4 πœ– 2
– chug chug chug…bound variance of 𝑍 β†’ O
suffice to distinguish
𝐷
πœ–2
samples
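A sketch of this test in Python; the midpoint threshold 2π‘šπœ–Β² is illustrative, and the stated guarantee of course relies on the variance bound, which the code does not verify:

    import numpy as np

    def chi_square_gof_test(q, p_sampler, m, eps):
        """Sketch of the chi^2 goodness-of-fit test from the slide.

        Draws Poisson(m) samples, computes
            Z = sum_i ((N_i - m*q_i)^2 - N_i) / (m*q_i),
        and thresholds it.  E[Z] = m * chi^2(p, q), so Case 1 gives
        E[Z] <= m*eps^2 and Case 2 gives E[Z] >= 4*m*eps^2.
        """
        rng = np.random.default_rng(4)
        n_samples = rng.poisson(m)    # Poissonization makes the N_i independent
        samples = p_sampler(n_samples)
        N = np.bincount(samples, minlength=len(q))
        Z = np.sum(((N - m * q) ** 2 - N) / (m * q))
        return "Case 1 (close in chi^2)" if Z <= 2 * m * eps**2 else "Case 2 (far in l1)"

    # Example: q uniform over 1000 items, p = q
    D = 1000
    q = np.full(D, 1 / D)
    eps = 0.2
    m = int(10 * np.sqrt(D) / eps**2)
    rng_p = np.random.default_rng(5)
    print(chi_square_gof_test(q, lambda n: rng_p.integers(0, D, size=n), m, eps))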
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Testing Properties of Distributions
β€’ so far 𝒫 ={single distribution}
– restrictive, as rarely know hypothesis distribution exactly
β€’ natural extension: test structural properties
β€’ monotonicity: β€œPDF is monotone,” e.g. cancer vs radiation
β€’ unimodality: β€œPDF is single-peaked,” e.g. single source of disease
β€’ log-concavity: β€œlog PDF is concave”
β€’ monotone-hazard rate: β€œlog (1 βˆ’ CDF) is concave”
β€’ product distribution, e.g. testing linkage disequilibrium
β€’ Example question:
– 𝒫 = {unimodal distributions over [π‘š]}
– Sample access to 𝑝
– Is 𝑝 unimodal OR is 𝑝 πœ–-far from all unimodal distributions?
Testing Properties of Distributions
[w/ Acharya and Kamath 2015]:
1. Testing identity, monotonicity, log-concavity, monotone hazard rate, unimodality for distributions over an (ordered) set 𝐷 is doable w/ 𝑂(√|𝐷|/πœ–Β²) samples and time.
2. Testing monotonicity/independence of a distribution over 𝐷 = [π‘š]^𝑑 is doable w/ 𝑂(π‘š^{𝑑/2}/πœ–Β²) ≑ 𝑂(√|𝐷|/πœ–Β²) samples and time.
– previous best for monotonicity testing: 𝑂(π‘š^{π‘‘βˆ’0.5}/πœ–β΄) [Bhattacharyya-Fischer-Rubinfeld-Valiant'11]
– previous best for independence: 𝑑 = 2, worse bounds [Batu et al.'01]
3. All bounds above are optimal
– Matching lower bounds for 1 and 2 via Le Cam.
4. Unified approach, computationally efficient tests
N.B. Contemporaneous work of [Canonne et al.'2015] provides a different unified approach for testing structure, but their results are suboptimal.
A Natural Approach
Goal: Given 𝒫 and sample access to 𝑝, distinguish 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > πœ–.
Choose a hypothesis π‘ž ∈ 𝒫 β†’ Test the hypothesis (how well does π‘ž fit 𝑝?)
A Natural Approach (cont'd)
β€’ Goal: Given 𝒫 and sample access to 𝑝, distinguish 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > πœ–.
β€’ A Learning-Followed-By-Testing Algorithm:
1. Learn hypothesis π‘ž ∈ 𝒫 s.t.
β€’ 𝑝 ∈ 𝒫 β‡’ β„“β‚(𝑝, π‘ž) < πœ–/2 (needs a cheap "proper learner")
β€’ β„“β‚(𝑝, 𝒫) > πœ– ⟹ β„“β‚(𝑝, π‘ž) > πœ– (automatic since π‘ž ∈ 𝒫)
2. Reduce to "tolerant goodness of fit":
β€’ given πœ–, sample access to 𝑝, and an explicit description of π‘ž, distinguish β„“β‚(𝑝, π‘ž) < πœ–/2 vs β„“β‚(𝑝, π‘ž) > πœ–
β€’ Problem: the tolerant tester requires an almost linear number of samples in the support of 𝑝
– namely Ξ©(|𝐷|/log|𝐷|) samples [Valiant-Valiant'10]
β€’ Could try investing more samples for more accurate learning, but the proper-learning complexity vs tolerant-testing complexity tradeoff does not work out to give optimal testing complexity
A Modified Approach
β€’ Goal: Given 𝒫 and sample access to 𝑝, distinguish 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > πœ–.
β€’ A Learning-Followed-By-Testing Algorithm:
1. Learn hypothesis π‘ž ∈ 𝒫 s.t.
β€’ 𝑝 ∈ 𝒫 β‡’ πœ’Β²(𝑝, π‘ž) < πœ–Β²/2 (needs a cheap "proper learner" in πœ’Β²)
β€’ β„“β‚(𝑝, 𝒫) > πœ– ⟹ β„“β‚(𝑝, π‘ž) > πœ– (automatic since π‘ž ∈ 𝒫)
2. Reduce to "tolerant goodness of fit" in (πœ’Β², ℓ₁):
β€’ given πœ–, sample access to 𝑝, and an explicit description of π‘ž, distinguish πœ’Β²(𝑝, π‘ž) < πœ–Β²/2 vs β„“β‚(𝑝, π‘ž) > πœ–
β€’ Now tolerant testing has the right complexity of 𝑂(√|𝐷|/πœ–Β²)
β€’ Pertinent question: are there sublinear proper learners in πœ’Β²?
– We show that the πœ’Β²-learning complexity is dominated by the testing complexity for all properties of distributions we consider
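Schematically, the reduction looks as follows; P_learner and tolerant_gof are assumed subroutines (a πœ’Β² proper learner for 𝒫 and the (πœ’Β², ℓ₁) tolerant test above), and the 50/50 sample split is illustrative:

    def test_property(P_learner, tolerant_gof, samples, eps):
        """Skeleton of the learn-then-test reduction (all names illustrative).

        P_learner: proper learner returning some q in the class P, with
                   chi^2(p, q) small whenever p is truly in P.
        tolerant_gof: the (chi^2, l1) tolerant goodness-of-fit test above.
        """
        half = len(samples) // 2
        q = P_learner(samples[:half])                # Step 1: learn a candidate q in P
        return tolerant_gof(q, samples[half:], eps)  # Step 2: tolerant goodness of fit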
Tutorial: part 2
Summary so far
β€’ Hypothesis testing in the small-sample regime:
[diagram: 𝑝 β†’ i.i.d. samples β†’ Test β†’ Pass/Fail?]
– 𝑝: unknown distribution over some discrete set 𝐷
– 𝒫: set of distributions over 𝐷
– Given: 𝝐, 𝜹, sample access to 𝑝
– Goal: w/ prob β‰₯ 1 βˆ’ 𝜹 tell 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > 𝝐
– Properties of interest: Is 𝑝 uniform? unimodal? log-concave? MHR? a product measure?
β€’ All above properties can be tested w/ 𝑂((√|𝑫|/𝝐²) Β· log(1/𝜹)) samples and time
β€’ Unified approach based on a modified Pearson's goodness-of-fit test: statistic 𝑍 = Ξ£_{π‘–βˆˆπ·} ((𝑁_𝑖 βˆ’ 𝐸_𝑖)Β² βˆ’ 𝑁_𝑖)/𝐸_𝑖
– tight control of false positives: want to be able to both assert and reject the null hypothesis
– accommodates sublinear sample size
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Other Distances (beyond ℓ₁)
β€’ So far focused on ℓ₁ (a.k.a. total variation) distance:
given sample access to 𝑝, w/ prob β‰₯ 1 βˆ’ 𝜹 distinguish 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > 𝝐
β€’ Stronger distances?
– [Acharya-D-Kamath]: results are actually shown for πœ’Β², which dominates: πœ’Β²(𝑝, 𝒫) β‰₯ β„“β‚(𝑝, 𝒫)Β²
– Should also extend to KL, since KL(𝑝, 𝒫) β‰₯ Β½ Β· β„“β‚(𝑝, 𝒫)Β² (Pinsker)
β€’ Weaker distances?
– β„“β‚‚ is easy to test [Goldreich-Ron], but makes less sense, e.g.:
𝑝 = (2/π‘š, 2/π‘š, …, 2/π‘š, 0, 0, …, 0)
π‘ž = (0, 0, …, 0, 2/π‘š, 2/π‘š, …, 2/π‘š)
– ℓ₁ distance = 2, but β„“β‚‚ distance = 2/βˆšπ‘š
Tolerance
β€’ So far, focused on the non-tolerant version:
– Given a set of distributions 𝒫, and sample access to 𝑝
– Distinguish: 𝑝 ∈ 𝒫 vs 𝑑_TV(𝑝, 𝒫) > 𝝐
β€’ Tolerant version:
– Distinguish: 𝑑_TV(𝑝, 𝒫) < 𝝐/2 vs 𝑑_TV(𝑝, 𝒫) > 𝝐
– [Valiant-Valiant'10]: Ξ©(|𝐷|/log|𝐷|) samples are needed
β€’ Tolerant version in (πœ’Β², ℓ₁):
– Distinguish: πœ’Β²(𝑝, 𝒫) < 𝝐/2 vs ℓ₁²(𝑝, 𝒫) > 𝝐
– [w/ Acharya, Kamath'15]: 𝑂(√|𝐷|/πœ–Β²) samples suffice
– different ratios between the two thresholds only change the constants in the 𝑂() notation
Goodness of Fit
Our goodness-of-fit test was given an explicit distribution π‘ž and sample access to a distribution 𝑝, and was asked to test 𝑝 = π‘ž vs 𝑑_TV(𝑝, π‘ž) > 𝝐.
Sometimes both distributions are unknown, e.g.
Transactions of 20-30 yr olds
Transactions of 30-40 yr olds
Same or different?
Goodness of Fit w/ two unknowns
[diagram: 𝑝 β†’ i.i.d. samples, π‘ž β†’ i.i.d. samples β†’ Test β†’ Pass/Fail?]
Given sample access to two unknown distributions 𝑝, π‘ž:
Distinguish 𝑝 = π‘ž vs 𝑑_TV(𝑝, π‘ž) > 𝝐
Goodness of Fit w/ two unknowns
β€’ [Batu Fortnow Rubinfeld Smith White], [P. Valiant], …
β€’ [Chan Diakonikolas Valiant Valiant]: Tight upper and lower bound of Θ(max{|𝐷|^{2/3}/πœ–^{4/3}, √|𝐷|/πœ–Β²}).
β€’ Why different (from the known-π‘ž case)?
– Collision statistics are all that matter
– Collisions on "heavy" elements can hide the collision statistics of the rest of the domain
– Construct pairs of distributions where the heavy elements are identical, but the "light" elements are either identical or very different
Continuous Distributions
β€’ What can we say about continuous distributions?
– without extra assumptions such as smoothness of the PDF or parametric modeling, we cannot stick to hard distances (ℓ₁, πœ’Β², KL)
β€’ Instead of restricting 𝑝, 𝒫, let us switch distances
β€’ Can extend results if we switch to Kolmogorov distance
– recall: 𝑑_TV(𝑝, π‘ž) = sup_β„° |𝑝(β„°) βˆ’ π‘ž(β„°)|
– in contrast: 𝑑_K(𝑝, π‘ž) = sup_{β„°: rectangle} |𝑝(β„°) βˆ’ π‘ž(β„°)|
β€’ Now want to distinguish: 𝑝 ∈ 𝒫 vs 𝑑_K(𝑝, 𝒫) > 𝝐
β€’ Claim: Tolerant testing in Kolmogorov distance of any distribution property (continuous or discrete) of 𝑑-dimensional distributions can be performed from 𝑂(𝑑/πœ–Β²) samples.
β€’ Importantly: Kolmogorov distance allows graceful scaling with the dimensionality of the data
Dvoretzky–Kiefer–Wolfowitz inequality
β€’ Suppose 𝑋₁, …, 𝑋_𝑛 are i.i.d. samples from a (single-dimensional) 𝐹, and let 𝐹_𝑛 be the resulting empirical CDF, namely 𝐹_𝑛(π‘₯) = (1/𝑛) Ξ£_{𝑖=1..𝑛} 1_{𝑋_𝑖 ≀ π‘₯}.
Then: Pr[𝑑_K(𝐹, 𝐹_𝑛) > πœ–] ≀ 2𝑒^{βˆ’2π‘›πœ–Β²}, βˆ€πœ– > 0.
β€’ i.e. 𝑂(1/πœ–Β²) samples suffice to learn any single-dimensional dist'n to within πœ– in Kolmogorov distance.
β€’ VC inequality ⟹ the same is true for 𝑑-dimensional distributions when #samples is at least 𝑂(𝑑/πœ–Β²)
β€’ After learning in Kolmogorov, can tolerant-test any property.
β€’ Runtime under investigation.
– trouble: computing/approximating the Kolmogorov distance of two high-dimensional distributions is generally a hard computational question.
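A small sketch of learning in Kolmogorov distance via the empirical CDF; the unknown 𝐹 is taken to be N(0,1) purely for illustration:

    import numpy as np
    from scipy.stats import norm

    def dkw_sample_size(eps, delta):
        """Smallest n with 2*exp(-2*n*eps^2) <= delta, per the DKW inequality."""
        return int(np.ceil(np.log(2 / delta) / (2 * eps**2)))

    eps, delta = 0.05, 0.01
    n = dkw_sample_size(eps, delta)          # ~1060 samples
    rng = np.random.default_rng(6)
    xs = np.sort(rng.standard_normal(n))     # unknown F = N(0,1), for illustration

    def F_n(t):
        """Empirical CDF: F_n(t) = (1/n) * #{i : X_i <= t}."""
        return np.searchsorted(xs, t, side="right") / n

    # With probability >= 1 - delta, sup_t |F_n(t) - F(t)| <= eps.
    grid = np.linspace(-4, 4, 1001)
    emp = np.array([F_n(t) for t in grid])
    print("max deviation on grid:", np.abs(emp - norm.cdf(grid)).max())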
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Testing in High-Dimensions
β€’ Already talked about testing high-dimensional distributions in Kolmogorov distance.
– Sample complexity is 𝑂(𝑑/πœ–Β²)
β€’ Next focus: discrete distributions, stronger distances
High-Dimensional Discrete Distn's
β€’ Consider a source generating 𝑛-bit strings ∈ {0,1}ⁿ (e.g., 400-bit images):
– 0011010101 (sample 1)
– 0101001110 (sample 2)
– 0011110100 (sample 3)
– …
β€’ Are bits/pixels independent?
– Our algorithms require Θ(2^{𝑛/2}/πœ–Β²) samples
β€’ Is a source generating graphs over 𝑛 nodes Erdős-RΓ©nyi 𝐺(𝑛, Β½)?
– Our algorithms require Θ(2^{(𝑛 choose 2)/2}/πœ–Β²) samples
β€’ Exponential dependence on 𝑛 unsettling, but necessary
– Lower bound exploits high possible correlation among bits
β€’ Nature is not adversarial
– Often high-dimensional systems have structure, e.g. Markov random fields (MRFs), graphical models (Bayes nets), etc.
Testing high-dimensional distributions with combinatorial structure?
High-Dimensional Discrete Distn's
β€’ Same setting as above: a source generating 𝑛-bit strings ∈ {0,1}ⁿ; without structure, independence testing requires Θ(2^{𝑛/2}/πœ–Β²) samples
β€’ [w/ Dikkala, Kamath'16]: If the unknown 𝑝 is known to be an Ising model, then poly(𝑛, 1/πœ–) samples suffice to test independence, goodness of fit. (extends to MRFs)
Ising Model
β€’ Statistical physics, computer vision, neuroscience, social science
β€’ Ising model:
– Probability distribution defined in terms of a graph 𝐺 = (𝑉, 𝐸), edge potentials πœƒ_𝑒, node potentials πœƒ_𝑣
– State space {±1}^𝑉
– 𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣 + Ξ£_{π‘£βˆˆπ‘‰} πœƒ_𝑣 π‘₯_𝑣)
– High |πœƒ_𝑒|'s ⟹ strongly (anti-)correlated spins
πœƒπ‘£ = 0
Ising
Model:
Strong
vs
weak
ties
β€œhigh temperature regime”
β€œlow temperature regime”
πœƒπ‘’ = 1
πœƒπ‘’ = 0.5
πœƒπ‘’ = 0.25
πœƒπ‘’ = 0.125
πœƒπ‘’ = 0
Testing Ising Models
𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣 + Ξ£_{π‘£βˆˆπ‘‰} πœƒ_𝑣 π‘₯_𝑣)
β€’ Given: sample access to an Ising model 𝑝_πœƒ over 𝐺 = (𝑉, 𝐸) w/ 𝑛 = |𝑉| nodes, π‘š = |𝐸| edges
– πœƒ unknown, graph 𝐺 unknown
– 𝑝_πœƒ supported on {±1}^𝑉
β€’ Goal: distinguish 𝑝_πœƒ ∈ ℐ({±1}^𝑉) vs ℓ₁(𝑝_πœƒ, ℐ({±1}^𝑉)) > πœ–
β€’ [w/ Dikkala, Kamath'16]: poly(𝑛, 1/πœ–) samples suffice
β€’ Warmup:
– SKL(𝑝_πœƒ, 𝑝_πœƒβ€²) = KL(𝑝_πœƒ, 𝑝_πœƒβ€²) + KL(𝑝_πœƒβ€², 𝑝_πœƒ)
= Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} (πœƒ_𝑒 βˆ’ πœƒβ€²_𝑒)(𝐸_πœƒ[𝑋_𝑒𝑋_𝑣] βˆ’ 𝐸_πœƒβ€²[𝑋_𝑒𝑋_𝑣]) + Ξ£_{π‘£βˆˆπ‘‰} (πœƒ_𝑣 βˆ’ πœƒβ€²_𝑣)(𝐸_πœƒ[𝑋_𝑣] βˆ’ 𝐸_πœƒβ€²[𝑋_𝑣])
– SKL(𝑝_πœƒ, 𝑝_πœƒβ€²) β‰₯ ℓ₁(𝑝_πœƒ, 𝑝_πœƒβ€²)Β²
Testing Ising Models (cont'd)
β€’ Same setup; the goal is now stated in SKL: distinguish 𝑝_πœƒ ∈ ℐ({±1}^𝑉) vs SKL(𝑝_πœƒ, ℐ({±1}^𝑉)) > πœ–
β€’ [w/ Dikkala, Kamath'16]: poly(𝑛, 1/πœ–) samples suffice
β€’ Since SKL(𝑝_πœƒ, 𝑝_πœƒβ€²) β‰₯ ℓ₁(𝑝_πœƒ, 𝑝_πœƒβ€²)Β², a guarantee in SKL implies one in ℓ₁ (with πœ– ↦ πœ–Β²)
Testing Ising Models (focus: πœƒ_𝑣 = 0)
𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣)
β€’ Given: sample access to an Ising model 𝑝_πœƒ over 𝐺 = (𝑉, 𝐸) w/ 𝑛 = |𝑉| nodes, π‘š = |𝐸| edges; πœƒ unknown, graph 𝐺 unknown; 𝑝_πœƒ supported on {±1}^𝑉
β€’ Goal: distinguish 𝑝_πœƒ = π‘ˆ({±1}^𝑉) vs SKL(𝑝_πœƒ, π‘ˆ({±1}^𝑉)) > πœ–
β€’ [w/ Dikkala, Kamath'16]: poly(𝑛, 1/πœ–) samples suffice
β€’ Warmup: SKL(𝑝_πœƒ, 𝑝_πœƒβ€²) = Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} (πœƒ_𝑒 βˆ’ πœƒβ€²_𝑒)(𝐸_πœƒ[𝑋_𝑒𝑋_𝑣] βˆ’ 𝐸_πœƒβ€²[𝑋_𝑒𝑋_𝑣])
β€’ SKL(𝑝_πœƒ, π‘ˆ({±1}^𝑉)) > πœ– ⟹ Ξ£_{πΈβˆ‹π‘’=(𝑒,𝑣)} πœƒ_𝑒 Β· 𝐸_πœƒ[𝑋_𝑒𝑋_𝑣] > πœ– ⟹ βˆƒ(𝑒,𝑣): |𝐸_πœƒ[𝑋_𝑒𝑋_𝑣]| > πœ–/(π‘š Β· πœƒ_max)
β€’ 𝑝_πœƒ = π‘ˆ({±1}^𝑉) ⟹ βˆ€(𝑒,𝑣): 𝐸_πœƒ[𝑋_𝑒𝑋_𝑣] = 0
β€’ Hence, can distinguish which is the case from 𝑂(π‘šΒ² Β· πœƒΒ²_max/πœ–Β²) samples.
β€’ This localizes the departure from uniformity (independence) on some edge.
Testing Ising Models (focus: πœƒ_𝑣 = 0)
𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣); setup as above
β€’ Goal: distinguish 𝑝_πœƒ = π‘ˆ({±1}^𝑉) vs SKL(𝑝_πœƒ, π‘ˆ({±1}^𝑉)) > πœ–; cheaper nonlocalizing test?
β€’ Claim: By expending some samples, can identify a distinguishing statistic of the form 𝑍 = Ξ£_{𝑒,𝑣} 𝑐_{𝑒𝑣} 𝑋_𝑒 𝑋_𝑣, where 𝑐_{𝑒𝑣} ∈ {±1} for all 𝑒, 𝑣.
β€’ Issue: can't bound Var[𝑍] intelligently, as the terms aren't pairwise independent.
– If πœƒ_𝑒 = 0, βˆ€π‘’, then Var[𝑍] = 𝑛²
– Otherwise the best one can say is the trivial Var[𝑍] = 𝑂(𝑛⁴)
β€’ and this is, in fact, tight at low temperature: consider two disjoint cliques with super-strong ties (πœƒ_𝑒 = +∞), and suppose all 𝑐_{𝑒,𝑣} = 1; then 𝑍 dances around its mean by Ω(𝑛²)
– How about high temperature?
Ising Model: Strong vs weak ties (πœƒ_𝑣 = 0)
[figure: sample configurations at πœƒ_𝑒 = 1, 0.5, 0.25, 0.125, 0, from the "low temperature regime" to the "high temperature regime"]
β€’ Low temperature: exponential mixing of the Glauber dynamics
β€’ High temperature (πœƒ_𝑒 ≀ 1/𝑑_max): 𝑂(𝑛 Β· log 𝑛) mixing of the Glauber dynamics
Testing Ising Models (focus: πœƒ_𝑣 = 0)
β€’ Setup and statistic 𝑍 = Ξ£_{𝑒,𝑣} 𝑐_{𝑒𝑣} 𝑋_𝑒 𝑋_𝑣 as above
β€’ If πœƒ_𝑒 = 0, βˆ€π‘’: Var[𝑍] = 𝑛²
β€’ Low temperature: Var[𝑍] = 𝑂(𝑛⁴)
β€’ High temperature: Var[𝑍] = 𝑂(𝑛³/𝑑_max) ⟹ 𝑂(𝑛²) for dense graphs
– proof via exchangeable pairs [Stein, …, Chatterjee 2006]
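A sketch of the ingredients, assuming zero node potentials: an approximate sampler via Glauber dynamics and the statistic 𝑍; the burn-in length is illustrative, and fast mixing is only guaranteed at high temperature:

    import numpy as np

    rng = np.random.default_rng(7)

    def glauber_sample(theta, burn_in=20_000):
        """Approximate sample from the Ising model p_theta (zero node
        potentials) via Glauber dynamics; theta is the symmetric matrix of
        edge potentials.  Burn-in is illustrative, not a mixing-time bound.
        """
        n = theta.shape[0]
        x = rng.choice([-1.0, 1.0], size=n)
        for _ in range(burn_in):
            v = rng.integers(n)
            field = theta[v] @ x - theta[v, v] * x[v]  # sum_{u != v} theta_uv * x_u
            p_plus = 1 / (1 + np.exp(-2 * field))      # conditional law of x_v
            x[v] = 1.0 if rng.random() < p_plus else -1.0
        return x

    def Z_statistic(x, c):
        """Z = sum_{u<v} c_uv * x_u * x_v for a sign matrix c."""
        n = len(x)
        iu = np.triu_indices(n, k=1)
        return np.sum(c[iu] * np.outer(x, x)[iu])

    n = 30
    theta = np.zeros((n, n))                      # theta = 0: p_theta is uniform
    x = glauber_sample(theta)
    c = np.where(rng.random((n, n)) < 0.5, -1.0, 1.0)
    print("Z =", Z_statistic(x, c))               # mean 0, variance ~ n^2 under uniform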
Exchangeable Pairs
β€’ Goal: Given 𝑓(β‹…) and a random vector 𝑋 ∼ 𝐷, want to bound moments of 𝑓(𝑋), or prove concentration of 𝑓(𝑋) about its mean
β€’ Approach:
– Define a pair of random vectors (𝑋, 𝑋′) such that:
β€’ (𝑋, 𝑋′) has the same distribution as (𝑋′, 𝑋) (exchangeability)
β€’ the marginal distributions are 𝐷 (faithfulness)
– Find an anti-symmetric function 𝐹(π‘₯, 𝑦) (i.e. 𝐹(π‘₯, 𝑦) = βˆ’πΉ(𝑦, π‘₯), βˆ€π‘₯, 𝑦) such that 𝐸[𝐹(𝑋, 𝑋′)|𝑋] = 𝑓(𝑋) βˆ’ 𝐸[𝑓(𝑋)]
β€’ Claims:
1. Var[𝑓(𝑋)] = Β½ Β· 𝐸[(𝑓(𝑋) βˆ’ 𝑓(𝑋′)) Β· 𝐹(𝑋, 𝑋′)]
2. Let 𝑣(𝑋) = Β½ Β· 𝐸[|(𝑓(𝑋) βˆ’ 𝑓(𝑋′)) Β· 𝐹(𝑋, 𝑋′)| ∣ 𝑋]. If 𝑣(𝑋) ≀ 𝐢 a.s., then Pr[|𝑓(𝑋) βˆ’ 𝐸[𝑓(𝑋)]| β‰₯ 𝑑] ≀ 2 Β· 𝑒^{βˆ’π‘‘Β²/2𝐢}
Silly Example
β€’ 𝑓(𝑋) = Ξ£_{𝑖=1..𝑛} 𝑋_𝑖, where 𝑋 = (𝑋₁, …, 𝑋_𝑛) ∼ 𝐷₁ Γ— β‹― Γ— 𝐷_𝑛
– Suppose for all 𝑖: 𝐸[𝑋_𝑖] = πœ‡_𝑖 and 𝑋_𝑖 ∈ [π‘Ž_𝑖, 𝑏_𝑖] a.s.
β€’ Goal: prove concentration of 𝑓(𝑋) about its mean
β€’ Define the exchangeable pair (𝑋, 𝑋′) as follows:
– sample 𝑋 ∼ 𝐷₁ Γ— β‹― Γ— 𝐷_𝑛
– pick 𝑖 u.a.r.; sample 𝑋′_𝑖 ∼ 𝐷_𝑖; and set 𝑋′_𝑗 = 𝑋_𝑗, βˆ€π‘— β‰  𝑖
β€’ Choose the anti-symmetric function 𝐹(𝑋, 𝑋′) = 𝑛 Β· (𝑓(𝑋) βˆ’ 𝑓(𝑋′)):
– 𝐸[𝐹(𝑋, 𝑋′)|𝑋] = 𝑛 Β· 𝑓(𝑋) βˆ’ Ξ£_𝑖 (πœ‡_𝑖 + Ξ£_{𝑗≠𝑖} 𝑋_𝑗) = Ξ£_𝑖 𝑋_𝑖 βˆ’ Ξ£_𝑖 πœ‡_𝑖 = 𝑓(𝑋) βˆ’ 𝐸[𝑓(𝑋)]
β€’ Bounding 𝑣(𝑋) = Β½ Β· 𝐸[(𝑓(𝑋) βˆ’ 𝑓(𝑋′)) Β· 𝐹(𝑋, 𝑋′) ∣ 𝑋]:
– 𝑣(𝑋) = Β½ Β· (1/𝑛) Β· Ξ£_𝑖 𝑛 Β· 𝐸[(𝑋_𝑖 βˆ’ 𝑋′_𝑖)Β² ∣ 𝑋] ≀ Β½ Β· Ξ£_𝑖 (𝑏_𝑖 βˆ’ π‘Ž_𝑖)Β²
β€’ Pr[|𝑓(𝑋) βˆ’ 𝐸[𝑓(𝑋)]| β‰₯ 𝑑] ≀ 2 Β· 𝑒^{βˆ’π‘‘Β²/Ξ£_𝑖 (𝑏_𝑖 βˆ’ π‘Ž_𝑖)²}
Interesting Example: Ising
β€’ 𝑋 ∼ 𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣); 𝑓_𝑐(π‘₯) = Ξ£_{𝑒,𝑣} 𝑐_{𝑒𝑣} π‘₯_𝑒 π‘₯_𝑣
β€’ How to define the exchangeable pair?
– Natural approach: sample 𝑋 ∼ 𝑝_πœƒ
– Do one step of the Glauber dynamics from 𝑋 to find 𝑋′
β€’ i.e. pick a random node 𝑣 and resample 𝑋′_𝑣 from the marginal of 𝑝_πœƒ at 𝑣, conditioning on the state of all other nodes being 𝑋_{βˆ’π‘£}
β€’ Harder question: find an anti-symmetric 𝐹(β‹…,β‹…) s.t. 𝐸[𝐹(𝑋, 𝑋′)|𝑋] = 𝑓_𝑐(𝑋) βˆ’ 𝐸[𝑓_𝑐(𝑋)]
– Approach that works for any 𝑓:
β€’ 𝐹(π‘₯, π‘₯β€²) = Ξ£_{𝑑=0..∞} 𝐸[𝑓(𝑋_𝑑) βˆ’ 𝑓(𝑋′_𝑑) | 𝑋₀ = π‘₯, 𝑋′₀ = π‘₯β€²]
β€’ where (𝑋_𝑑)_{𝑑β‰₯0}, (𝑋′_𝑑)_{𝑑β‰₯0} are two (potentially coupled) executions of the Glauber dynamics starting from states π‘₯ and π‘₯β€² respectively (requires a good coupling)
β€’ Challenging question: bound
Var[𝑓_𝑐(𝑋)] = Β½ Β· 𝐸[(𝑓_𝑐(𝑋) βˆ’ 𝑓_𝑐(𝑋′)) Β· 𝐹(𝑋, 𝑋′)]
= Β½ Β· Ξ£_{𝑑=0..∞} 𝐸[(𝑓_𝑐(𝑋) βˆ’ 𝑓_𝑐(𝑋′)) Β· 𝐸[𝑓_𝑐(𝑋_𝑑) βˆ’ 𝑓_𝑐(𝑋′_𝑑) | 𝑋₀, 𝑋′₀]]
– Need to show the function contracts as the Glauber dynamics unravels
Showing Contraction
β€’ Generous coupling: choose the same node, but update independently
– Different from the "greedy coupling" typically used, where the same node is chosen and the update is coordinated to maximize the probability of the same update
β€’ Lemma 1: [statement on slide not transcribed]
β€’ Lemma 2: [statement on slide not transcribed]
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Future Directions
Markov Chain Testing
Example: 𝑛 = 52 cards, β‰₯ 7 riffle shuffles needed* [Bayer-Diaconis'92]
– 𝑑_TV(Uni, 6Γ— riffle) > 0.6
– 𝑑_TV(Uni, 7Γ— riffle) β‰ˆ 0.33
*riffle shuffle = Gilbert-Shannon-Reeds (GSR) model for a distribution on card permutations.
Empirical Fact: [two shuffling processes pictured] are different Markov chains! [Diaconis'03]
[Ongoing work with Dikkala, Gravin]
β€’ Question: how close is a real shuffle to the GSR distribution?
β€’ Given: sample access to shuffles ∼ 𝐹_riffle (𝑛² variables)
β€’ Goal: distinguish: the shuffle is GSR vs ℓ₁(shuffle, GSR) > πœ€
Markov Chain Testing
Question: test 𝑀 = 𝑀* vs 𝑑𝑖𝑠𝑑(𝑀, 𝑀*) > πœ€. Which distance?
β€’ Let 𝑃, 𝑄 be the transition matrices of chains 𝑀, 𝑀*
β€’ Object of interest: word^π‘˜_𝑣(𝑃) = (𝑣 ≑ 𝑣₁ β†’ 𝑣₂ β†’ β‹― β†’ 𝑣_π‘˜) ∼ 𝑃 Γ— 𝑃 Γ— β‹― Γ— 𝑃
β€’ Pertinent question: asymptotic 𝑑_TV(word^π‘˜_𝑣(𝑃), word^π‘˜_𝑣(𝑄)) as π‘˜ β†’ ∞?
β€’ Easier to quantify: 1 βˆ’ 𝑑_H(word^π‘˜_𝑣(𝑃), word^π‘˜_𝑣(𝑄))Β² β‰ˆ 𝜌(𝐴)^π‘˜, for all 𝑣, where 𝐴 = (βˆšπ‘ƒ_{𝑖𝑗} Β· 𝑄_{𝑖𝑗})_{𝑖𝑗} and 𝜌(β‹…) is the spectral radius
β€’ So the proposed notion of distance: 𝑑𝑖𝑠𝑑(𝑀, 𝑀*) = 1 βˆ’ 𝜌((βˆšπ‘ƒ_{𝑖𝑗} Β· 𝑄_{𝑖𝑗})_{𝑖𝑗})
β€’ Results: testing symmetric 𝑀, 𝑀* with 𝑂(πœ€π‘›Β² + cover time) samples
[Ongoing work with Dikkala, Gravin]
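A sketch of computing this distance for two given transition matrices; the three-state chains below are made up for illustration:

    import numpy as np

    def chain_dist(P, Q):
        """dist(M, M*) = 1 - rho(A) with A_ij = sqrt(P_ij * Q_ij),
        where rho is the spectral radius.  Equals 0 when P = Q, since the
        spectral radius of a stochastic matrix is 1.
        """
        A = np.sqrt(P * Q)
        return 1 - np.max(np.abs(np.linalg.eigvals(A)))

    # Two simple symmetric chains on 3 states:
    P = np.array([[0.5, 0.25, 0.25],
                  [0.25, 0.5, 0.25],
                  [0.25, 0.25, 0.5]])
    Q = np.array([[0.6, 0.2, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.2, 0.2, 0.6]])
    print(chain_dist(P, P), chain_dist(P, Q))   # 0.0, > 0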
Testing Combinatorial Structure
Efficiently testing combinatorial structure:
β€’ Is the phylogenetic tree assumption true? e.g., Sapiens-Neanderthal early interbreeding [Slatkin et al.'13]
β€’ Is a graphical model a tree? [ongoing work with Bresler, Acharya]
Testing from a Single Sample
β€’ Given one social network, one brain, etc., how
can we test the validity of a certain generative
model?
β€’ Get many samples from one sample?
[Venn diagram: information theory, machine learning, statistics, algorithms]
Thank You!