Transcript slides

Statistics vs Big Data
Constantinos Daskalakis
CSAIL and EECS, MIT
Greek Stochastics πœƒ
YOU WANT BIG DATA?
I’LL GIVE YOU BIG DATA!
BIG Data
β€’ Facebook: 20 petabytes of images daily
β€’ Human genome: 40 exabytes of storage by 2025
β€’ SKA Telescope: 1 exabyte daily
High-dimensional: DNA microarray, computer vision
Expensive: experimental drugs, financial records
What properties do your BIG distributions have?
e.g. 1: Play the lottery?
Is it uniform? Is the lottery unfair?
from Hitlotto.com:
β€œLottery experts agree, past number histories
can be the key to predicting future winners.”
True Story!
β€’ Polish lottery Multilotek
– Choose 20 distinct numbers "uniformly" at random out of 1 to 80.
– Initial machine biased
β€’ e.g., probability of drawing 50-59 too small
Thanks to Krzysztof Onak (pointer) and Eric Price (graph)
New Jersey Pick 3, 4 Lottery
β€’ New Jersey Pick k (k = 3, 4) Lottery.
– Pick k random digits in order.
– 10^k possible values.
β€’ Data:
– Pick 3: 8522 results from 5/22/75 to 10/15/00
β€’ πœ’Β²-test (on Excel) answers "42% confidence"
– Pick 4: 6544 results from 9/1/77 to 10/15/00
β€’ fewer results than possible values (10⁴)
β€’ not a good idea to run a πœ’Β² test:
β€’ convergence to the πœ’Β² distribution won't kick in for such a small sample size
β€’ (textbook) rule of thumb: expect at least 5 observations of each element in the domain under the null hypothesis before running πœ’Β²
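In Python, the πœ’Β² computation above looks roughly as follows; the draws here are synthetic stand-ins (the actual Pick-3 records are not reproduced), and scipy's chisquare is used for the test itself:

    import numpy as np
    from scipy import stats

    # Synthetic stand-in for the Pick-3 data: 8522 draws of a 3-digit number.
    rng = np.random.default_rng(0)
    k = 3
    n_draws = 8522
    draws = rng.integers(0, 10**k, size=n_draws)

    observed = np.bincount(draws, minlength=10**k)
    expected = np.full(10**k, n_draws / 10**k)

    # Textbook rule of thumb: expected count >= 5 in every cell.
    # Holds for Pick 3 (8522/1000 ~ 8.5) but fails badly for Pick 4 (6544/10000 < 1).
    print("min expected count:", expected.min())

    chi2, pvalue = stats.chisquare(observed, expected)
    print(f"chi2 = {chi2:.1f}, p-value = {pvalue:.3f}")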
e.g. 2: Independence Testing
Shopping patterns:
Independent of zip code?
e.g. 2: Linkage Disequilibrium
[figure: genome with locus 1, locus 2, …, locus 𝑛]
Single nucleotide polymorphisms: are they independent?
Suppose 𝑛 loci, 2 possible states each, then:
β€’ state of one's genome ∈ {0,1}ⁿ
β€’ humans: some distribution 𝑝 over {0,1}ⁿ
Question: Is 𝑝 a product dist'n OR far from all product dist'ns?
Should we expect the genomes from the 1000 Genomes Project to be sufficient? Up to how many loci?
e.g. 3: Outbreak of diseases
β€’ Similar patterns in different years?
β€’ More prevalent near large airports?
[maps: Flu 2005 vs Flu 2006]
Distributions on BIG domains
β€’ Given samples of a distribution, need to know, e.g.,
– entropy
– number of distinct elements
– "shape" (monotone, unimodal, etc.)
– closeness to uniform, Gaussian, Zipfian…
β€’ No assumptions on the shape of the distribution
– i.e., no parametric, smoothness, monotonicity, or normality assumptions,…
β€’ Considered in Statistics, information theory, machine
learning, databases, algorithms, physics, biology,…
Old questions, new challenges
Classical setting: small domain 𝐷; 𝑛 large, 𝐷 small (comparatively)
– e.g., domain: 1000 tosses of a coin
– asymptotic analysis; computation not crucial
Modern setting: large domain 𝐷; 𝑛 small, 𝐷 large (comparatively)
– e.g., domain: one human genome
– new challenges: samples, computation, communication, storage
A Key Question
β€’ How many samples do you need in terms of the domain size?
– Do you need to estimate the probabilities of each domain item?
– OR –
– Can sample complexity be sublinear in the size of the domain?
(sublinear sample size rules out standard statistical techniques)
Aim
Algorithms with sublinear sample complexity
[Venn diagram: statistics, machine learning, information theory, algorithms]
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Problem formulation
Model
β€’ 𝒫: family of distributions over a discrete domain 𝐷
– may be non-parametric, e.g. unimodal, product, log-concave
Problem
β€’ Given: samples from unknown 𝑝
β€’ With probability 0.9, distinguish 𝑝 ∈ 𝒫 vs 𝑑(𝑝, 𝒫) > πœ€
– where 𝑑(𝑝, 𝒫) = min_{π‘žβˆˆπ’«} β„“β‚(𝑝, π‘ž)/2, and β„“β‚(𝑝, π‘ž)/2 = max_{events β„°} |𝑝(β„°) βˆ’ π‘ž(β„°)| (total variation)
Objective
β€’ Minimize samples: sublinear in |𝐷|?
β€’ Computational efficiency
Well-studied Problem
β€’ (Composite) hypothesis testing:
– Neyman-Pearson test
– Kolmogorov-Smirnov test
– Pearson's chi-squared test
– Generalized likelihood ratio test
– …
β€’ Quantities of interest:
– 𝑃_F = Pr(accept when hypothesis false)
– 𝑃_M = Pr(reject when hypothesis true)
– Consistency
– Error exponents: exp(βˆ’π‘  Β· 𝑅) as 𝑠 β†’ ∞
β€’ Asymptotic regime: results kick in when 𝑠 ≫ |𝐷|
β€’ Our focus instead: sublinear in |𝐷|? strong control of false positives?
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Testing Fairness of a Coin
β€’ 𝑏: unknown probability of heads
β€’ 𝒫 = {Bernoulli(1/2)}; 𝑝 = Bernoulli(𝑏)
β€’ Distinguish 𝑝 ∈ 𝒫 vs 𝑑_TV(𝑝, 𝒫) > πœ€
β€’ Question: Is 𝑏 = 0.5 OR |𝑏 βˆ’ 0.5| > πœ–?
β€’ Goal: Toss the coin several times, deduce the correct answer w/ probability > 0.99
β€’ Number of samples required?
– Can estimate 𝑏 by tossing π‘˜ times, then taking 𝑏̂ = (1/π‘˜) Ξ£_{𝑖=1..π‘˜} 𝑍_𝑖
– By concentration bounds, if π‘˜ > 𝑂(1/πœ–Β²), then |𝑏̂ βˆ’ 𝑏| < πœ–/3 w/ probability > 0.99
β€’ Are Ξ©(1/πœ–Β²) many samples necessary?
– Suppose there is a tester using π‘˜ samples
– Then it can distinguish one sample from 𝑋 = (𝑋₁, …, 𝑋_π‘˜), where each 𝑋_𝑖 ∼ 𝐡(0.5), from one sample from π‘Œ = (π‘Œβ‚, …, π‘Œ_π‘˜), where each π‘Œ_𝑖 ∼ 𝐡(0.5 + πœ–), w/ probability > 0.99
– Claim: Any tester has error probability at least Β½ (1 βˆ’ 𝑑_TV(𝑋, π‘Œ))
– 𝑑_TV(𝑋, π‘Œ) ≀ √2 Β· 𝐻(𝑋, π‘Œ) = 𝑂(πœ– Β· βˆšπ‘˜), so π‘˜ = Ξ©(1/πœ–Β²) is needed for small error probability
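A minimal sketch of this estimate-then-threshold tester, assuming i.i.d. coin tosses; the constant 30 in the sample size and the πœ–/2 acceptance threshold are illustrative, not the optimal constants:

    import numpy as np

    def fair_coin_tester(samples, eps):
        """Declare 'fair' iff the empirical bias is within eps/2 of 1/2.

        With k = C/eps^2 tosses, Hoeffding puts the empirical mean within
        eps/3 of the true bias w/ probability > 0.99, which separates
        b = 1/2 from |b - 1/2| > eps.
        """
        return abs(np.mean(samples) - 0.5) < eps / 2

    rng = np.random.default_rng(1)
    eps = 0.1
    k = int(30 / eps**2)                 # C = 30 is an illustrative constant
    fair = rng.random(k) < 0.5           # b = 0.5
    biased = rng.random(k) < 0.5 + eps   # b = 0.5 + eps
    print(fair_coin_tester(fair, eps), fair_coin_tester(biased, eps))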
Testing Uniformity
β€’ 𝒫 = {Uniform_𝐷}; 𝑝: unknown distribution over 𝐷
β€’ Sample access to 𝑝
β€’ Question: is 𝑝 = π‘ˆ_𝐷 or 𝑑_TV(𝑝, π‘ˆ_𝐷) > πœ–? (i.e., 𝑝 ∈ 𝒫 vs 𝑑_TV(𝑝, 𝒫) > πœ€)
β€’ [Paninski'03]: Θ(√|𝐷|/πœ–Β²) samples and time
"Intuition:"
β€’ (Lower Bound) Suppose π‘ž is the uniform distribution over {1, …, π‘š} and 𝑝 is either uniform on {1, …, π‘š} or uniform on a random size-π‘š/2 subset of {1, …, π‘š}
– unless Ξ©(βˆšπ‘š) samples are observed, there are no collisions, hence the two cases cannot be distinguished
β€’ (Upper Bound) Collision statistics suffice to distinguish
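A sketch of a collision-based tester along these lines; this is the collision idea in its simplest form, not Paninski's exact test, and the acceptance threshold and sample-size constants are illustrative:

    import numpy as np
    from collections import Counter

    def collision_tester(samples, m, eps):
        """Accept 'uniform' iff the empirical collision rate is close to 1/m.

        Under the uniform distribution the pairwise collision probability
        is exactly 1/m; a distribution eps-far in TV has collision
        probability >= (1 + c*eps^2)/m.  The threshold below is illustrative.
        """
        k = len(samples)
        counts = Counter(samples)
        collisions = sum(c * (c - 1) // 2 for c in counts.values())
        pairs = k * (k - 1) // 2
        return collisions / pairs <= (1 + 2 * eps**2) / m

    rng = np.random.default_rng(2)
    m, eps = 10_000, 0.25
    k = int(20 * np.sqrt(m) / eps**2)    # Theta(sqrt(m)/eps^2) samples
    uniform = rng.integers(0, m, size=k)
    # Far instance from the lower bound: uniform over a random half of [m]
    subset = rng.choice(m, size=m // 2, replace=False)
    far = rng.choice(subset, size=k)
    print(collision_tester(uniform, m, eps), collision_tester(far, m, eps))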
Proving Lower Bounds
β€’ [Le Cam'73]: Consider two disjoint sets of distributions 𝒫₁, 𝒫₂. Suppose algorithm π’œ is given π‘˜ samples from some unknown 𝑝 ∈ 𝒫₁ βˆͺ 𝒫₂ and claims to distinguish 𝑝 ∈ 𝒫₁ vs 𝑝 ∈ 𝒫₂.
β€’ Then: Pr[error] β‰₯ Β½ (1 βˆ’ inf 𝑑_TV(𝑝₁, 𝑝₂)), where the inf is over 𝑝₁ ∈ conv_π‘˜(𝒫₁), 𝑝₂ ∈ conv_π‘˜(𝒫₂)
– conv_π‘˜(𝒫): all distributions generating π‘˜ samples as follows:
β€’ choose a random distribution 𝑝 from 𝒫 (according to some distribution over 𝒫)
β€’ then generate π‘˜ samples from 𝑝
Proving Lower Bounds
β€’ [Le Cam'73] (as above): Pr[error] β‰₯ Β½ (1 βˆ’ inf_{𝑝₁ ∈ conv_π‘˜(𝒫₁), 𝑝₂ ∈ conv_π‘˜(𝒫₂)} 𝑑_TV(𝑝₁, 𝑝₂))
β€’ To prove the Ξ©(√|𝐷|/πœ–Β²) lower bound for uniformity testing take:
– 𝒫₁ = {Uniform[π‘š]}
– 𝒫₂ = {π‘ž | π‘ž_{2𝑖+1} = (1 Β± π‘πœ–)/π‘š, π‘ž_{2𝑖+2} = (1 βˆ“ π‘πœ–)/π‘š, βˆ€π‘–}
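A sketch of how a member of 𝒫₂ generates its π‘˜ samples, i.e. the conv_π‘˜(𝒫₂) sampling process above; the constant 𝑐 is the one from the slide:

    import numpy as np

    rng = np.random.default_rng(3)

    def sample_from_P2(m, eps, k, c=2.0):
        """Draw k samples from a random member of P2: pair up the domain and
        perturb each pair's probabilities to (1+c*eps)/m, (1-c*eps)/m with a
        random sign per pair.  Each member of P2 is at total variation
        distance c*eps/2 from uniform.
        """
        signs = rng.choice([-1.0, 1.0], size=m // 2)   # one sign per pair
        q = np.empty(m)
        q[0::2] = (1 + c * eps * signs) / m
        q[1::2] = (1 - c * eps * signs) / m
        return rng.choice(m, size=k, p=q)

    samples = sample_from_P2(m=1000, eps=0.25, k=400)

With fewer than order βˆšπ‘š/πœ–Β² samples, the collision statistics of such samples are nearly indistinguishable from those of uniform samples, which is the heart of the lower bound.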
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Identity Testing ("goodness of fit")
β€’ 𝑝, π‘ž: distributions over 𝐷
β€’ π‘ž: given; sample access to 𝑝
β€’ Question: is 𝑝 = π‘ž or 𝑑_TV(𝑝, π‘ž) > πœ–?
– i.e., 𝒫 = {π‘ž}, 𝑝 unknown: distinguish 𝑝 ∈ 𝒫 vs 𝑑_TV(𝑝, 𝒫) > πœ€
β€’ [Batu-Fischer-Fortnow-Kumar-Rubinfeld-White'01]…
β€’ [Paninski'08, Valiant-Valiant'14]: Θ(√|𝐷|/πœ–Β²) samples and time
β€’ [w/ Acharya-Kamath'15]: a tolerant goodness-of-fit test with the same sample size can distinguish: πœ’Β²(𝑝, π‘ž) ≀ πœ–Β² vs ℓ₁²(𝑝, π‘ž) > 4 Β· πœ–Β²
– πœ’Β²(𝑝, π‘ž) ≔ Ξ£_{π‘–βˆˆπ·} (𝑝_𝑖 βˆ’ π‘ž_𝑖)Β²/π‘ž_𝑖
– Cauchy-Schwarz: πœ’Β²(𝑝, π‘ž) β‰₯ β„“β‚(𝑝, π‘ž)Β²
A new πœ’ 2 - Goodness of Fit Test
β€’ Goal: given π‘ž and sample access to 𝑝 distinguish:
Case 1: πœ’ 2 𝑝, π‘ž ≀ πœ– 2 vs Case 2: β„“12 𝑝, π‘ž β‰₯ 4 β‹… πœ– 2
β€’ Approach: Draw Poisson(π‘š) many samples from 𝑝
Side-Note:
β€’ 𝑁𝑖 : # of appearances of symbol 𝑖 ∈ 𝐷
β€’ Pearson’s πœ’ 2 test uses
π‘π‘–βˆ’π‘šβ‹…π‘žπ‘– 2
– 𝑁𝑖 ~ Poisson π‘š β‹… 𝑝𝑖
statistic 𝑖
– 𝑁𝑖
π‘–βˆˆπ·
independent random variables
β€’ Statistic: 𝑍 =
𝑁𝑖 βˆ’π‘šβ‹…π‘žπ‘– 2 βˆ’π‘π‘–
𝑖
π‘šβ‹…π‘žπ‘–
πœ’ 2 (𝑝, π‘ž)
π‘šβ‹…π‘žπ‘–
β€’ Subtracting 𝑁𝑖 in the
numerator gives an unbiased
estimator and importantly
may hugely decrease
variance
– 𝐸 𝑍 =π‘šβ‹…
– Case 1: 𝐸 𝑍 ≀ π‘š β‹… πœ– 2 ; Case 2: 𝐸 𝑍 β‰₯ π‘š β‹… 4 πœ– 2
– chug chug chug…bound variance of 𝑍 β†’ O
suffice to distinguish
𝐷
πœ–2
samples
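A sketch of this test in Python; the midpoint threshold 2π‘šπœ–Β² is illustrative, and the stated guarantee of course relies on the variance bound, which the code does not verify:

    import numpy as np

    def chi_square_gof_test(q, p_sampler, m, eps):
        """Sketch of the chi^2 goodness-of-fit test from the slide.

        Draws Poisson(m) samples, computes
            Z = sum_i ((N_i - m*q_i)^2 - N_i) / (m*q_i),
        and thresholds it.  E[Z] = m * chi^2(p, q), so Case 1 gives
        E[Z] <= m*eps^2 and Case 2 gives E[Z] >= 4*m*eps^2.
        """
        rng = np.random.default_rng(4)
        n_samples = rng.poisson(m)    # Poissonization makes the N_i independent
        samples = p_sampler(n_samples)
        N = np.bincount(samples, minlength=len(q))
        Z = np.sum(((N - m * q) ** 2 - N) / (m * q))
        return "Case 1 (close in chi^2)" if Z <= 2 * m * eps**2 else "Case 2 (far in l1)"

    # Example: q uniform over 1000 items, p = q
    D = 1000
    q = np.full(D, 1 / D)
    eps = 0.2
    m = int(10 * np.sqrt(D) / eps**2)
    rng_p = np.random.default_rng(5)
    print(chi_square_gof_test(q, lambda n: rng_p.integers(0, D, size=n), m, eps))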
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Testing Properties of Distributions
β€’ so far 𝒫 ={single distribution}
– restrictive, as rarely know hypothesis distribution exactly
β€’ natural extension: test structural properties
β€’ monotonicity: β€œPDF is monotone,” e.g. cancer vs radiation
β€’ unimodality: β€œPDF is single-peaked,” e.g. single source of disease
β€’ log-concavity: β€œlog PDF is concave”
β€’ monotone-hazard rate: β€œlog (1 βˆ’ CDF) is concave”
β€’ product distribution, e.g. testing linkage disequilibrium
β€’ Example question:
– 𝒫 = {unimodal distributions over [π‘š]}
– Sample access to 𝑝
– Is 𝑝 unimodal OR is 𝑝 πœ–-far from all unimodal distributions?
Testing Properties of Distributions
[w/ Acharya and Kamath 2015]:
1. Testing identity, monotonicity, log-concavity, monotone hazard rate, unimodality for distributions over an (ordered) set 𝐷 is doable w/ 𝑂(√|𝐷|/πœ–Β²) samples and time.
2. Testing monotonicity/independence of a distribution over 𝐷 = [π‘š]^𝑑 is doable w/ 𝑂(π‘š^{𝑑/2}/πœ–Β²) ≑ 𝑂(√|𝐷|/πœ–Β²) samples and time.
– previous best for monotonicity testing: 𝑂(π‘š^{π‘‘βˆ’0.5}/πœ–β΄) [Bhattacharyya-Fischer-Rubinfeld-Valiant'11]
– previous best for independence: 𝑑 = 2, worse bounds [Batu et al.'01]
3. All bounds above are optimal
– Matching lower bounds for 1 and 2 via Le Cam.
4. Unified approach, computationally efficient tests
N.B. Contemporaneous work of [Canonne et al.'2015] provides a different unified approach for testing structure, but their results are suboptimal.
A Natural Approach
Goal: Given 𝒫 and sample access to 𝑝, distinguish 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > πœ–.
Choose a hypothesis π‘ž ∈ 𝒫 β†’ Test the hypothesis (how well does π‘ž fit 𝑝?)
A Natural Approach (cont'd)
β€’ Goal: Given 𝒫 and sample access to 𝑝, distinguish 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > πœ–.
β€’ A Learning-Followed-By-Testing Algorithm:
1. Learn hypothesis π‘ž ∈ 𝒫 s.t.
β€’ 𝑝 ∈ 𝒫 β‡’ β„“β‚(𝑝, π‘ž) < πœ–/2 (needs a cheap "proper learner")
β€’ β„“β‚(𝑝, 𝒫) > πœ– ⟹ β„“β‚(𝑝, π‘ž) > πœ– (automatic since π‘ž ∈ 𝒫)
2. Reduce to "tolerant goodness of fit":
β€’ given πœ–, sample access to 𝑝, and an explicit description of π‘ž, distinguish β„“β‚(𝑝, π‘ž) < πœ–/2 vs β„“β‚(𝑝, π‘ž) > πœ–
β€’ Problem: the tolerant tester requires an almost linear number of samples in the support of 𝑝
– namely Ξ©(|𝐷|/log|𝐷|) samples [Valiant-Valiant'10]
β€’ Could try investing more samples for more accurate learning, but the proper-learning complexity vs tolerant-testing complexity tradeoff does not work out to give optimal testing complexity
A Modified Approach
β€’ Goal: Given 𝒫 and sample access to 𝑝, distinguish 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > πœ–.
β€’ A Learning-Followed-By-Testing Algorithm:
1. Learn hypothesis π‘ž ∈ 𝒫 s.t.
β€’ 𝑝 ∈ 𝒫 β‡’ πœ’Β²(𝑝, π‘ž) < πœ–Β²/2 (needs a cheap "proper learner" in πœ’Β²)
β€’ β„“β‚(𝑝, 𝒫) > πœ– ⟹ β„“β‚(𝑝, π‘ž) > πœ– (automatic since π‘ž ∈ 𝒫)
2. Reduce to "tolerant goodness of fit" in (πœ’Β², ℓ₁):
β€’ given πœ–, sample access to 𝑝, and an explicit description of π‘ž, distinguish πœ’Β²(𝑝, π‘ž) < πœ–Β²/2 vs β„“β‚(𝑝, π‘ž) > πœ–
β€’ Now tolerant testing has the right complexity of 𝑂(√|𝐷|/πœ–Β²)
β€’ Pertinent question: are there sublinear proper learners in πœ’Β²?
– We show that the πœ’Β²-learning complexity is dominated by the testing complexity for all properties of distributions we consider
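Schematically, the reduction looks as follows; P_learner and tolerant_gof are assumed subroutines (a πœ’Β² proper learner for 𝒫 and the (πœ’Β², ℓ₁) tolerant test above), and the 50/50 sample split is illustrative:

    def test_property(P_learner, tolerant_gof, samples, eps):
        """Skeleton of the learn-then-test reduction (all names illustrative).

        P_learner: proper learner returning some q in the class P, with
                   chi^2(p, q) small whenever p is truly in P.
        tolerant_gof: the (chi^2, l1) tolerant goodness-of-fit test above.
        """
        half = len(samples) // 2
        q = P_learner(samples[:half])                # Step 1: learn a candidate q in P
        return tolerant_gof(q, samples[half:], eps)  # Step 2: tolerant goodness of fit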
Tutorial: part 2
Summary so far
β€’ Hypothesis testing in the small-sample regime:
[diagram: 𝑝 β†’ i.i.d. samples β†’ Test β†’ Pass/Fail?]
– 𝑝: unknown distribution over some discrete set 𝐷
– 𝒫: set of distributions over 𝐷
– Given: 𝝐, 𝜹, sample access to 𝑝
– Goal: w/ prob β‰₯ 1 βˆ’ 𝜹 tell 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > 𝝐
– Properties of interest: Is 𝑝 uniform? unimodal? log-concave? MHR? a product measure?
β€’ All above properties can be tested w/ 𝑂((√|𝑫|/𝝐²) Β· log(1/𝜹)) samples and time
β€’ Unified approach based on a modified Pearson's goodness-of-fit test: statistic 𝑍 = Ξ£_{π‘–βˆˆπ·} ((𝑁_𝑖 βˆ’ 𝐸_𝑖)Β² βˆ’ 𝑁_𝑖)/𝐸_𝑖
– tight control of false positives: want to be able to both assert and reject the null hypothesis
– accommodates sublinear sample size
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Other Distances (beyond ℓ₁)
β€’ So far focused on ℓ₁ (a.k.a. total variation) distance:
given sample access to 𝑝, w/ prob β‰₯ 1 βˆ’ 𝜹 distinguish 𝑝 ∈ 𝒫 vs β„“β‚(𝑝, 𝒫) > 𝝐
β€’ Stronger distances?
– [Acharya-D-Kamath]: results are actually shown for πœ’Β², which dominates: πœ’Β²(𝑝, 𝒫) β‰₯ β„“β‚(𝑝, 𝒫)Β²
– Should also extend to KL, since KL(𝑝, 𝒫) β‰₯ Β½ Β· β„“β‚(𝑝, 𝒫)Β² (Pinsker)
β€’ Weaker distances?
– β„“β‚‚ is easy to test [Goldreich-Ron], but makes less sense, e.g.:
𝑝 = (2/π‘š, 2/π‘š, …, 2/π‘š, 0, 0, …, 0)
π‘ž = (0, 0, …, 0, 2/π‘š, 2/π‘š, …, 2/π‘š)
– ℓ₁ distance = 2, but β„“β‚‚ distance = 2/βˆšπ‘š
Tolerance
β€’ So far, focused on the non-tolerant version:
– Given a set of distributions 𝒫, and sample access to 𝑝
– Distinguish: 𝑝 ∈ 𝒫 vs 𝑑_TV(𝑝, 𝒫) > 𝝐
β€’ Tolerant version:
– Distinguish: 𝑑_TV(𝑝, 𝒫) < 𝝐/2 vs 𝑑_TV(𝑝, 𝒫) > 𝝐
– [Valiant-Valiant'10]: Ξ©(|𝐷|/log|𝐷|) samples are needed
β€’ Tolerant version in (πœ’Β², ℓ₁):
– Distinguish: πœ’Β²(𝑝, 𝒫) < 𝝐/2 vs ℓ₁²(𝑝, 𝒫) > 𝝐
– [w/ Acharya, Kamath'15]: 𝑂(√|𝐷|/πœ–Β²) samples suffice
– different ratios between the two thresholds only change the constants in the 𝑂() notation
Goodness of Fit
Our goodness-of-fit test was given an explicit distribution π‘ž and sample access to a distribution 𝑝, and was asked to test 𝑝 = π‘ž vs 𝑑_TV(𝑝, π‘ž) > 𝝐.
Sometimes both distributions are unknown, e.g.
Transactions of 20-30 yr olds
Transactions of 30-40 yr olds
Same or different?
Goodness of Fit w/ two unknowns
[diagram: 𝑝 β†’ i.i.d. samples, π‘ž β†’ i.i.d. samples β†’ Test β†’ Pass/Fail?]
Given sample access to two unknown distributions 𝑝, π‘ž:
Distinguish 𝑝 = π‘ž vs 𝑑_TV(𝑝, π‘ž) > 𝝐
Goodness of Fit w/ two unknowns
β€’ [Batu Fortnow Rubinfeld Smith White], [P. Valiant], …
β€’ [Chan Diakonikolas Valiant Valiant]: Tight upper and lower bound of Θ(max{|𝐷|^{2/3}/πœ–^{4/3}, √|𝐷|/πœ–Β²}).
β€’ Why different (from the known-π‘ž case)?
– Collision statistics are all that matter
– Collisions on "heavy" elements can hide the collision statistics of the rest of the domain
– Construct pairs of distributions where the heavy elements are identical, but the "light" elements are either identical or very different
Continuous Distributions
β€’ What can we say about continuous distributions?
– without extra assumptions such as smoothness of the PDF or parametric modeling, we cannot stick to hard distances (ℓ₁, πœ’Β², KL)
β€’ Instead of restricting 𝑝, 𝒫, let us switch distances
β€’ Can extend results if we switch to Kolmogorov distance
– recall: 𝑑_TV(𝑝, π‘ž) = sup_β„° |𝑝(β„°) βˆ’ π‘ž(β„°)|
– in contrast: 𝑑_K(𝑝, π‘ž) = sup_{β„°: rectangle} |𝑝(β„°) βˆ’ π‘ž(β„°)|
β€’ Now want to distinguish: 𝑝 ∈ 𝒫 vs 𝑑_K(𝑝, 𝒫) > 𝝐
β€’ Claim: Tolerant testing in Kolmogorov distance of any distribution property (continuous or discrete) of 𝑑-dimensional distributions can be performed from 𝑂(𝑑/πœ–Β²) samples.
β€’ Importantly: Kolmogorov distance allows graceful scaling with the dimensionality of the data
Dvoretzky–Kiefer–Wolfowitz inequality
β€’ Suppose 𝑋₁, …, 𝑋_𝑛 are i.i.d. samples from a (single-dimensional) 𝐹, and let 𝐹_𝑛 be the resulting empirical CDF, namely 𝐹_𝑛(π‘₯) = (1/𝑛) Ξ£_{𝑖=1..𝑛} 1_{𝑋_𝑖 ≀ π‘₯}.
Then: Pr[𝑑_K(𝐹, 𝐹_𝑛) > πœ–] ≀ 2𝑒^{βˆ’2π‘›πœ–Β²}, βˆ€πœ– > 0.
β€’ i.e. 𝑂(1/πœ–Β²) samples suffice to learn any single-dimensional dist'n to within πœ– in Kolmogorov distance.
β€’ VC inequality ⟹ the same is true for 𝑑-dimensional distributions when #samples is at least 𝑂(𝑑/πœ–Β²)
β€’ After learning in Kolmogorov, can tolerant-test any property.
β€’ Runtime under investigation.
– trouble: computing/approximating the Kolmogorov distance of two high-dimensional distributions is generally a hard computational question.
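A small sketch of learning in Kolmogorov distance via the empirical CDF; the unknown 𝐹 is taken to be N(0,1) purely for illustration:

    import numpy as np
    from scipy.stats import norm

    def dkw_sample_size(eps, delta):
        """Smallest n with 2*exp(-2*n*eps^2) <= delta, per the DKW inequality."""
        return int(np.ceil(np.log(2 / delta) / (2 * eps**2)))

    eps, delta = 0.05, 0.01
    n = dkw_sample_size(eps, delta)          # ~1060 samples
    rng = np.random.default_rng(6)
    xs = np.sort(rng.standard_normal(n))     # unknown F = N(0,1), for illustration

    def F_n(t):
        """Empirical CDF: F_n(t) = (1/n) * #{i : X_i <= t}."""
        return np.searchsorted(xs, t, side="right") / n

    # With probability >= 1 - delta, sup_t |F_n(t) - F(t)| <= eps.
    grid = np.linspace(-4, 4, 1001)
    emp = np.array([F_n(t) for t in grid])
    print("max deviation on grid:", np.abs(emp - norm.cdf(grid)).max())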
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Testing in High-Dimensions
β€’ Already talked about testing high-dimensional distributions in Kolmogorov distance.
– Sample complexity is 𝑂(𝑑/πœ–Β²)
β€’ Next focus: discrete distributions, stronger distances
High-Dimensional Discrete Distn's
β€’ Consider a source generating 𝑛-bit strings ∈ {0,1}ⁿ (e.g., 400-bit images):
– 0011010101 (sample 1)
– 0101001110 (sample 2)
– 0011110100 (sample 3)
– …
β€’ Are bits/pixels independent?
– Our algorithms require Θ(2^{𝑛/2}/πœ–Β²) samples
β€’ Is a source generating graphs over 𝑛 nodes Erdős-RΓ©nyi 𝐺(𝑛, Β½)?
– Our algorithms require Θ(2^{(𝑛 choose 2)/2}/πœ–Β²) samples
β€’ Exponential dependence on 𝑛 unsettling, but necessary
– Lower bound exploits high possible correlation among bits
β€’ Nature is not adversarial
– Often high-dimensional systems have structure, e.g. Markov random fields (MRFs), graphical models (Bayes nets), etc.
Testing high-dimensional distributions with combinatorial structure?
High-Dimensional Discrete Distn's
β€’ Same setting as above: a source generating 𝑛-bit strings ∈ {0,1}ⁿ; without structure, independence testing requires Θ(2^{𝑛/2}/πœ–Β²) samples
β€’ [w/ Dikkala, Kamath'16]: If the unknown 𝑝 is known to be an Ising model, then poly(𝑛, 1/πœ–) samples suffice to test independence, goodness of fit. (extends to MRFs)
Ising Model
β€’ Statistical physics, computer vision, neuroscience, social science
β€’ Ising model:
– Probability distribution defined in terms of a graph 𝐺 = (𝑉, 𝐸), edge potentials πœƒ_𝑒, node potentials πœƒ_𝑣
– State space {±1}^𝑉
– 𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣 + Ξ£_{π‘£βˆˆπ‘‰} πœƒ_𝑣 π‘₯_𝑣)
– High |πœƒ_𝑒|'s ⟹ strongly (anti-)correlated spins
πœƒπ‘£ = 0
Ising
Model:
Strong
vs
weak
ties
β€œhigh temperature regime”
β€œlow temperature regime”
πœƒπ‘’ = 1
πœƒπ‘’ = 0.5
πœƒπ‘’ = 0.25
πœƒπ‘’ = 0.125
πœƒπ‘’ = 0
Testing Ising Models
𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣 + Ξ£_{π‘£βˆˆπ‘‰} πœƒ_𝑣 π‘₯_𝑣)
β€’ Given: sample access to an Ising model 𝑝_πœƒ over 𝐺 = (𝑉, 𝐸) w/ 𝑛 = |𝑉| nodes, π‘š = |𝐸| edges
– πœƒ unknown, graph 𝐺 unknown
– 𝑝_πœƒ supported on {±1}^𝑉
β€’ Goal: distinguish 𝑝_πœƒ ∈ ℐ({±1}^𝑉) vs ℓ₁(𝑝_πœƒ, ℐ({±1}^𝑉)) > πœ–
β€’ [w/ Dikkala, Kamath'16]: poly(𝑛, 1/πœ–) samples suffice
β€’ Warmup:
– SKL(𝑝_πœƒ, 𝑝_πœƒβ€²) = KL(𝑝_πœƒ, 𝑝_πœƒβ€²) + KL(𝑝_πœƒβ€², 𝑝_πœƒ)
= Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} (πœƒ_𝑒 βˆ’ πœƒβ€²_𝑒)(𝐸_πœƒ[𝑋_𝑒𝑋_𝑣] βˆ’ 𝐸_πœƒβ€²[𝑋_𝑒𝑋_𝑣]) + Ξ£_{π‘£βˆˆπ‘‰} (πœƒ_𝑣 βˆ’ πœƒβ€²_𝑣)(𝐸_πœƒ[𝑋_𝑣] βˆ’ 𝐸_πœƒβ€²[𝑋_𝑣])
– SKL(𝑝_πœƒ, 𝑝_πœƒβ€²) β‰₯ ℓ₁(𝑝_πœƒ, 𝑝_πœƒβ€²)Β²
Testing Ising Models (cont'd)
β€’ Same setup; the goal is now stated in SKL: distinguish 𝑝_πœƒ ∈ ℐ({±1}^𝑉) vs SKL(𝑝_πœƒ, ℐ({±1}^𝑉)) > πœ–
β€’ [w/ Dikkala, Kamath'16]: poly(𝑛, 1/πœ–) samples suffice
β€’ Since SKL(𝑝_πœƒ, 𝑝_πœƒβ€²) β‰₯ ℓ₁(𝑝_πœƒ, 𝑝_πœƒβ€²)Β², a guarantee in SKL implies one in ℓ₁ (with πœ– ↦ πœ–Β²)
Testing Ising Models (focus: πœƒ_𝑣 = 0)
𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣)
β€’ Given: sample access to an Ising model 𝑝_πœƒ over 𝐺 = (𝑉, 𝐸) w/ 𝑛 = |𝑉| nodes, π‘š = |𝐸| edges; πœƒ unknown, graph 𝐺 unknown; 𝑝_πœƒ supported on {±1}^𝑉
β€’ Goal: distinguish 𝑝_πœƒ = π‘ˆ({±1}^𝑉) vs SKL(𝑝_πœƒ, π‘ˆ({±1}^𝑉)) > πœ–
β€’ [w/ Dikkala, Kamath'16]: poly(𝑛, 1/πœ–) samples suffice
β€’ Warmup: SKL(𝑝_πœƒ, 𝑝_πœƒβ€²) = Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} (πœƒ_𝑒 βˆ’ πœƒβ€²_𝑒)(𝐸_πœƒ[𝑋_𝑒𝑋_𝑣] βˆ’ 𝐸_πœƒβ€²[𝑋_𝑒𝑋_𝑣])
β€’ SKL(𝑝_πœƒ, π‘ˆ({±1}^𝑉)) > πœ– ⟹ Ξ£_{πΈβˆ‹π‘’=(𝑒,𝑣)} πœƒ_𝑒 Β· 𝐸_πœƒ[𝑋_𝑒𝑋_𝑣] > πœ– ⟹ βˆƒ(𝑒,𝑣): |𝐸_πœƒ[𝑋_𝑒𝑋_𝑣]| > πœ–/(π‘š Β· πœƒ_max)
β€’ 𝑝_πœƒ = π‘ˆ({±1}^𝑉) ⟹ βˆ€(𝑒,𝑣): 𝐸_πœƒ[𝑋_𝑒𝑋_𝑣] = 0
β€’ Hence, can distinguish which is the case from 𝑂(π‘šΒ² Β· πœƒΒ²_max/πœ–Β²) samples.
β€’ This localizes the departure from uniformity (independence) on some edge.
Testing Ising Models (focus: πœƒ_𝑣 = 0)
𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣); setup as above
β€’ Goal: distinguish 𝑝_πœƒ = π‘ˆ({±1}^𝑉) vs SKL(𝑝_πœƒ, π‘ˆ({±1}^𝑉)) > πœ–; cheaper nonlocalizing test?
β€’ Claim: By expending some samples, can identify a distinguishing statistic of the form 𝑍 = Ξ£_{𝑒,𝑣} 𝑐_{𝑒𝑣} 𝑋_𝑒 𝑋_𝑣, where 𝑐_{𝑒𝑣} ∈ {±1} for all 𝑒, 𝑣.
β€’ Issue: can't bound Var[𝑍] intelligently, as the terms aren't pairwise independent.
– If πœƒ_𝑒 = 0, βˆ€π‘’, then Var[𝑍] = 𝑛²
– Otherwise the best one can say is the trivial Var[𝑍] = 𝑂(𝑛⁴)
β€’ and this is, in fact, tight at low temperature: consider two disjoint cliques with super-strong ties (πœƒ_𝑒 = +∞), and suppose all 𝑐_{𝑒,𝑣} = 1; then 𝑍 dances around its mean by Ω(𝑛²)
– How about high temperature?
Ising Model: Strong vs weak ties (πœƒ_𝑣 = 0)
[figure: sample configurations at πœƒ_𝑒 = 1, 0.5, 0.25, 0.125, 0, from the "low temperature regime" to the "high temperature regime"]
β€’ Low temperature: exponential mixing of the Glauber dynamics
β€’ High temperature (πœƒ_𝑒 ≀ 1/𝑑_max): 𝑂(𝑛 Β· log 𝑛) mixing of the Glauber dynamics
Testing Ising Models (focus: πœƒ_𝑣 = 0)
β€’ Setup and statistic 𝑍 = Ξ£_{𝑒,𝑣} 𝑐_{𝑒𝑣} 𝑋_𝑒 𝑋_𝑣 as above
β€’ If πœƒ_𝑒 = 0, βˆ€π‘’: Var[𝑍] = 𝑛²
β€’ Low temperature: Var[𝑍] = 𝑂(𝑛⁴)
β€’ High temperature: Var[𝑍] = 𝑂(𝑛³/𝑑_max) ⟹ 𝑂(𝑛²) for dense graphs
– proof via exchangeable pairs [Stein, …, Chatterjee 2006]
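A sketch of the ingredients, assuming zero node potentials: an approximate sampler via Glauber dynamics and the statistic 𝑍; the burn-in length is illustrative, and fast mixing is only guaranteed at high temperature:

    import numpy as np

    rng = np.random.default_rng(7)

    def glauber_sample(theta, burn_in=20_000):
        """Approximate sample from the Ising model p_theta (zero node
        potentials) via Glauber dynamics; theta is the symmetric matrix of
        edge potentials.  Burn-in is illustrative, not a mixing-time bound.
        """
        n = theta.shape[0]
        x = rng.choice([-1.0, 1.0], size=n)
        for _ in range(burn_in):
            v = rng.integers(n)
            field = theta[v] @ x - theta[v, v] * x[v]  # sum_{u != v} theta_uv * x_u
            p_plus = 1 / (1 + np.exp(-2 * field))      # conditional law of x_v
            x[v] = 1.0 if rng.random() < p_plus else -1.0
        return x

    def Z_statistic(x, c):
        """Z = sum_{u<v} c_uv * x_u * x_v for a sign matrix c."""
        n = len(x)
        iu = np.triu_indices(n, k=1)
        return np.sum(c[iu] * np.outer(x, x)[iu])

    n = 30
    theta = np.zeros((n, n))                      # theta = 0: p_theta is uniform
    x = glauber_sample(theta)
    c = np.where(rng.random((n, n)) < 0.5, -1.0, 1.0)
    print("Z =", Z_statistic(x, c))               # mean 0, variance ~ n^2 under uniform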
Exchangeable Pairs
β€’ Goal: Given 𝑓(β‹…) and a random vector 𝑋 ∼ 𝐷, want to bound moments of 𝑓(𝑋), or prove concentration of 𝑓(𝑋) about its mean
β€’ Approach:
– Define a pair of random vectors (𝑋, 𝑋′) such that:
β€’ (𝑋, 𝑋′) has the same distribution as (𝑋′, 𝑋) (exchangeability)
β€’ the marginal distributions are 𝐷 (faithfulness)
– Find an anti-symmetric function 𝐹(π‘₯, 𝑦) (i.e. 𝐹(π‘₯, 𝑦) = βˆ’πΉ(𝑦, π‘₯), βˆ€π‘₯, 𝑦) such that 𝐸[𝐹(𝑋, 𝑋′)|𝑋] = 𝑓(𝑋) βˆ’ 𝐸[𝑓(𝑋)]
β€’ Claims:
1. Var[𝑓(𝑋)] = Β½ Β· 𝐸[(𝑓(𝑋) βˆ’ 𝑓(𝑋′)) Β· 𝐹(𝑋, 𝑋′)]
2. Let 𝑣(𝑋) = Β½ Β· 𝐸[|(𝑓(𝑋) βˆ’ 𝑓(𝑋′)) Β· 𝐹(𝑋, 𝑋′)| ∣ 𝑋]. If 𝑣(𝑋) ≀ 𝐢 a.s., then Pr[|𝑓(𝑋) βˆ’ 𝐸[𝑓(𝑋)]| β‰₯ 𝑑] ≀ 2 Β· 𝑒^{βˆ’π‘‘Β²/2𝐢}
Silly Example
β€’ 𝑓(𝑋) = Ξ£_{𝑖=1..𝑛} 𝑋_𝑖, where 𝑋 = (𝑋₁, …, 𝑋_𝑛) ∼ 𝐷₁ Γ— β‹― Γ— 𝐷_𝑛
– Suppose for all 𝑖: 𝐸[𝑋_𝑖] = πœ‡_𝑖 and 𝑋_𝑖 ∈ [π‘Ž_𝑖, 𝑏_𝑖] a.s.
β€’ Goal: prove concentration of 𝑓(𝑋) about its mean
β€’ Define the exchangeable pair (𝑋, 𝑋′) as follows:
– sample 𝑋 ∼ 𝐷₁ Γ— β‹― Γ— 𝐷_𝑛
– pick 𝑖 u.a.r.; sample 𝑋′_𝑖 ∼ 𝐷_𝑖; and set 𝑋′_𝑗 = 𝑋_𝑗, βˆ€π‘— β‰  𝑖
β€’ Choose the anti-symmetric function 𝐹(𝑋, 𝑋′) = 𝑛 Β· (𝑓(𝑋) βˆ’ 𝑓(𝑋′)):
– 𝐸[𝐹(𝑋, 𝑋′)|𝑋] = 𝑛 Β· 𝑓(𝑋) βˆ’ Ξ£_𝑖 (πœ‡_𝑖 + Ξ£_{𝑗≠𝑖} 𝑋_𝑗) = Ξ£_𝑖 𝑋_𝑖 βˆ’ Ξ£_𝑖 πœ‡_𝑖 = 𝑓(𝑋) βˆ’ 𝐸[𝑓(𝑋)]
β€’ Bounding 𝑣(𝑋) = Β½ Β· 𝐸[(𝑓(𝑋) βˆ’ 𝑓(𝑋′)) Β· 𝐹(𝑋, 𝑋′) ∣ 𝑋]:
– 𝑣(𝑋) = Β½ Β· (1/𝑛) Β· Ξ£_𝑖 𝑛 Β· 𝐸[(𝑋_𝑖 βˆ’ 𝑋′_𝑖)Β² ∣ 𝑋] ≀ Β½ Β· Ξ£_𝑖 (𝑏_𝑖 βˆ’ π‘Ž_𝑖)Β²
β€’ Pr[|𝑓(𝑋) βˆ’ 𝐸[𝑓(𝑋)]| β‰₯ 𝑑] ≀ 2 Β· 𝑒^{βˆ’π‘‘Β²/Ξ£_𝑖 (𝑏_𝑖 βˆ’ π‘Ž_𝑖)²}
Interesting Example: Ising
β€’ 𝑋 ∼ 𝑝_πœƒ(π‘₯) ∝ exp(Ξ£_{𝑒=(𝑒,𝑣)∈𝐸} πœƒ_𝑒 π‘₯_𝑒 π‘₯_𝑣); 𝑓_𝑐(π‘₯) = Ξ£_{𝑒,𝑣} 𝑐_{𝑒𝑣} π‘₯_𝑒 π‘₯_𝑣
β€’ How to define the exchangeable pair?
– Natural approach: sample 𝑋 ∼ 𝑝_πœƒ
– Do one step of the Glauber dynamics from 𝑋 to find 𝑋′
β€’ i.e. pick a random node 𝑣 and resample 𝑋′_𝑣 from the marginal of 𝑝_πœƒ at 𝑣, conditioning on the state of all other nodes being 𝑋_{βˆ’π‘£}
β€’ Harder question: find an anti-symmetric 𝐹(β‹…,β‹…) s.t. 𝐸[𝐹(𝑋, 𝑋′)|𝑋] = 𝑓_𝑐(𝑋) βˆ’ 𝐸[𝑓_𝑐(𝑋)]
– Approach that works for any 𝑓:
β€’ 𝐹(π‘₯, π‘₯β€²) = Ξ£_{𝑑=0..∞} 𝐸[𝑓(𝑋_𝑑) βˆ’ 𝑓(𝑋′_𝑑) | 𝑋₀ = π‘₯, 𝑋′₀ = π‘₯β€²]
β€’ where (𝑋_𝑑)_{𝑑β‰₯0}, (𝑋′_𝑑)_{𝑑β‰₯0} are two (potentially coupled) executions of the Glauber dynamics starting from states π‘₯ and π‘₯β€² respectively (requires a good coupling)
β€’ Challenging question: bound
Var[𝑓_𝑐(𝑋)] = Β½ Β· 𝐸[(𝑓_𝑐(𝑋) βˆ’ 𝑓_𝑐(𝑋′)) Β· 𝐹(𝑋, 𝑋′)]
= Β½ Β· Ξ£_{𝑑=0..∞} 𝐸[(𝑓_𝑐(𝑋) βˆ’ 𝑓_𝑐(𝑋′)) Β· 𝐸[𝑓_𝑐(𝑋_𝑑) βˆ’ 𝑓_𝑐(𝑋′_𝑑) | 𝑋₀, 𝑋′₀]]
– Need to show the function contracts as the Glauber dynamics unravels
Showing Contraction
β€’ Generous coupling: choose the same node, but update independently
– Different from the "greedy coupling" typically used, where the same node is chosen and the update is coordinated to maximize the probability of the same update
β€’ Lemma 1: [statement on slide not transcribed]
β€’ Lemma 2: [statement on slide not transcribed]
The Menu
Motivation
Problem Formulation
Uniformity Testing, Goodness of Fit
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Future Directions
Markov Chain Testing
Example: 𝑛 = 52 cards, β‰₯ 7 riffle shuffles needed* [Bayer-Diaconis'92]
– 𝑑_TV(Uni, 6Γ— riffle) > 0.6
– 𝑑_TV(Uni, 7Γ— riffle) β‰ˆ 0.33
*riffle shuffle = Gilbert-Shannon-Reeds (GSR) model for a distribution on card permutations.
Empirical Fact: [two shuffling processes pictured] are different Markov chains! [Diaconis'03]
[Ongoing work with Dikkala, Gravin]
β€’ Question: how close is a real shuffle to the GSR distribution?
β€’ Given: sample access to shuffles ∼ 𝐹_riffle (𝑛² variables)
β€’ Goal: distinguish: the shuffle is GSR vs ℓ₁(shuffle, GSR) > πœ€
Markov Chain Testing
Question: test 𝑀 = 𝑀* vs 𝑑𝑖𝑠𝑑(𝑀, 𝑀*) > πœ€. Which distance?
β€’ Let 𝑃, 𝑄 be the transition matrices of chains 𝑀, 𝑀*
β€’ Object of interest: word^π‘˜_𝑣(𝑃) = (𝑣 ≑ 𝑣₁ β†’ 𝑣₂ β†’ β‹― β†’ 𝑣_π‘˜) ∼ 𝑃 Γ— 𝑃 Γ— β‹― Γ— 𝑃
β€’ Pertinent question: asymptotic 𝑑_TV(word^π‘˜_𝑣(𝑃), word^π‘˜_𝑣(𝑄)) as π‘˜ β†’ ∞?
β€’ Easier to quantify: 1 βˆ’ 𝑑_H(word^π‘˜_𝑣(𝑃), word^π‘˜_𝑣(𝑄))Β² β‰ˆ 𝜌(𝐴)^π‘˜, for all 𝑣, where 𝐴 = (βˆšπ‘ƒ_{𝑖𝑗} Β· 𝑄_{𝑖𝑗})_{𝑖𝑗} and 𝜌(β‹…) is the spectral radius
β€’ So the proposed notion of distance: 𝑑𝑖𝑠𝑑(𝑀, 𝑀*) = 1 βˆ’ 𝜌((βˆšπ‘ƒ_{𝑖𝑗} Β· 𝑄_{𝑖𝑗})_{𝑖𝑗})
β€’ Results: testing symmetric 𝑀, 𝑀* with 𝑂(πœ€π‘›Β² + cover time) samples
[Ongoing work with Dikkala, Gravin]
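A sketch of computing this distance for two given transition matrices; the three-state chains below are made up for illustration:

    import numpy as np

    def chain_dist(P, Q):
        """dist(M, M*) = 1 - rho(A) with A_ij = sqrt(P_ij * Q_ij),
        where rho is the spectral radius.  Equals 0 when P = Q, since the
        spectral radius of a stochastic matrix is 1.
        """
        A = np.sqrt(P * Q)
        return 1 - np.max(np.abs(np.linalg.eigvals(A)))

    # Two simple symmetric chains on 3 states:
    P = np.array([[0.5, 0.25, 0.25],
                  [0.25, 0.5, 0.25],
                  [0.25, 0.25, 0.5]])
    Q = np.array([[0.6, 0.2, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.2, 0.2, 0.6]])
    print(chain_dist(P, P), chain_dist(P, Q))   # 0.0, > 0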
Testing Combinatorial Structure
Efficiently testing combinatorial structure:
β€’ Is the phylogenetic tree assumption true? e.g., Sapiens-Neanderthal early interbreeding [Slatkin et al.'13]
β€’ Is a graphical model a tree? [ongoing work with Bresler, Acharya]
Testing from a Single Sample
β€’ Given one social network, one brain, etc., how
can we test the validity of a certain generative
model?
β€’ Get many samples from one sample?
[Venn diagram: information theory, machine learning, statistics, algorithms]
Thank You!