How to Fake Data If You Must

How to Fake Data
if you must
Rachel Fewster
Latest in a series of useful manuals…
[Slide image: covers of manuals dated 1954, 1965 and 2014, including "How to Fake Data if you must"]
The Key Idea…
If a hat is covered evenly in red and white stripes…
… it will be half red and half white.
What do real data look like?
Most humans have a
very poor sense of
randomness!
All the world's a stage,
And all the men and women merely players.
They have their exits and their entrances;
And one man in his time plays many parts...
or
Contrariwise, if it was so, it might be; and if it were so, it would be;
but as it isn't, it ain't. That's logic.
200 coin tosses (0=Tail, 1=Head): Real or Fake?
In 200 tosses, there is a 97% chance of a run of at least 6 consecutive 0s/1s.
Real data have long runs of failures or successes
Fake data are usually completely far-fetched
Sports teams win/lose... Stocks and shares rise/fall...
… understanding randomness is useful!
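The 97% figure for runs in 200 tosses can be checked exactly rather than by simulation. The sketch below is an added illustration, not part of the talk: it assumes a fair coin, and the function name `prob_run_at_least` is hypothetical.

```python
from fractions import Fraction

def prob_run_at_least(n_tosses: int, run_length: int) -> Fraction:
    """Exact chance that n fair tosses (n >= 1) contain a run of at least
    run_length equal outcomes in a row."""
    # c[m] = number of ways to split m tosses into consecutive runs,
    # each of length 1 .. run_length - 1 (so no run reaches run_length).
    c = [0] * (n_tosses + 1)
    c[0] = 1
    for m in range(1, n_tosses + 1):
        c[m] = sum(c[m - j] for j in range(1, run_length) if m - j >= 0)
    # Runs alternate between 0s and 1s, so choosing the first symbol
    # (2 choices) fixes the whole sequence.
    no_long_run = Fraction(2 * c[n_tosses], 2 ** n_tosses)
    return 1 - no_long_run

p = prob_run_at_least(200, 6)
print(f"P(run of at least 6 in 200 tosses) = {float(p):.3f}")
```

Working with `Fraction` keeps the count exact; the value is only converted to a float for printing.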
• Bad luck for sportspeople after being featured on the cover of Sports Illustrated magazine
• Baseballer Eddie Mathews was the first person ever pictured, in 1954
• Immediately lost a 9-game winning streak and also broke his hand
• Real data contain patterns that fakers
don’t expect
• Fake-detectors can exploit this!
Who wants to fake data?
• Electoral finance returns…
• Toxic emissions reports…
• Business tax returns…
• Me!
Can I fool you?!
Land areas of world countries: real or fake?
[Two tally charts of first digits, 1 to 9]
First chart, tallies from digit 1 upwards (empty rows not shown): IIIII, III, III, I, I, II, I
This one has as many 1s as 5-9s put together!
Second chart, tallies from digit 1 upwards (empty rows not shown): I, I, III, I, IIII, I, II, III
This one seems more even…
Real land areas of world countries
1: IIIII
2: III
3:
4: III
5: I
6:
7: I
8: II
9: I
11 of them begin with digits 1 – 4…
Only 5 begin with digits 5 – 9…
Random Newspaper:
1: IIII IIII
2: IIII III
3: IIII
4: II
5: IIII
6:
7: II
8: III
9:
10 out of 34 numbers begin with a 1…
None out of 34 begin with a 9!
The Curious Case of the Grimy Log-books
• In 1881, American astronomer Simon Newcomb noticed something funny about books of logarithm tables…
The books were always grubby on the first pages…
… but clean on the last pages
The first pages deal with numbers beginning with digits 1 and 2…
The last pages deal with numbers beginning with digits 8 and 9…
Why?
People seemed to look up numbers beginning with 1 and 2 more often than they looked up numbers beginning with 8 and 9.
Because numbers beginning with 1 and 2 are MORE COMMON than numbers beginning with 8 and 9!!
Newcomb’s Law
American Journal of Mathematics, 1881
The First Digits…
Over 30% of numbers begin with a 1
Only 5% of numbers begin with a 9
[Figure labels: Numbers beginning with a 1; Numbers beginning with a 9]
There is the same “opportunity” for numbers to begin with 9 as with 1 … but for some reason they don’t!
What was obvious to Newcomb...
Log books are precisely
6.58 times grubbier on
page 1 than on page 9?
0.301 = log10(2/1)
0.176 = log10(3/2)
0.125 = log10(4/3)
Chance of a number starting with digit d = $\log_{10}\left(\frac{d+1}{d}\right)$
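As a quick check of the first-digit formula log10((d + 1)/d), the sketch below tabulates the chance for each digit. It is an added illustration; the function name `benford_probability` is hypothetical.

```python
import math

def benford_probability(d: int) -> float:
    """Chance that a number starts with digit d (1..9) under Newcomb's formula."""
    return math.log10((d + 1) / d)

probs = {d: benford_probability(d) for d in range(1, 10)}
for d, p in probs.items():
    print(d, round(p, 3))

# The nine chances add up to log10(10/1) = 1, and the ratio of digit 1
# to digit 9 is where a "times grubbier" comparison comes from.
ratio_1_to_9 = probs[1] / probs[9]
print("ratio of 1s to 9s:", round(ratio_1_to_9, 2))
```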
Reactions to Newcomb’s law
Nothing!
…for 57 years!
Enter Frank Benford: 1938
Physicist with the
General Electric
Company
Assembled over
20,000 numbers and
counted their first
digits!
‘A study as wide as time and energy permitted.’
Populations
Numbers from newspapers
Drainage rates of rivers
Numbers from Reader's Digest articles
Street addresses of American Men of Science
About 30% begin with a 1
About 5% begin with a 9
Benford gave the ‘law’ its name…
Anomalous numbers !!
…but no explanation.
“…The logarithmic law applies to
outlaw numbers that are without known
relationship,
rather than to those that follow an
orderly course;
and so the logarithmic relation is essentially
a Law of Anomalous Numbers.”
Outlaw numbers…?
Reader's Digest numbers
Street addresses of American Men of Science
Populations
Newspaper numbers
Drainage rates of rivers
Explanations for Benford’s Law
What is the explanation?
• Numbers from a wide range of data sources have about 30% of 1’s, down to only 5% of 9’s.
• Benford called these ‘outlaw’ or ‘anomalous’
numbers. They include street addresses of
American Men of Science, populations, areas,
numbers from magazines and newspapers.
• Benford’s ‘orderly’ numbers don’t follow the law
– like atomic weights and physical constants
Popular Explanations
• Scale Invariance
• Base Invariance
• Complicated Measure Theory
• Divine choice
• Mystery of Nature
Scale Invariance
• Assume there is a universal law of nature governing the frequencies of first digits.
• Whatever the law is, it shouldn’t depend on the UNITS of the measurements (km², miles², acres)
SCALING (multiplying) the numbers should give the same frequencies for the digits
Requiring the same digit-frequencies for any scaling forces the ‘universal law’ to be Benford’s Law
But… Why should there be a universal law in the first place?
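The scaling idea can be tried numerically. This is an added sketch under stated assumptions, not the talk's own example: the areas are simulated (spread evenly over six orders of magnitude on the log scale), and the helper names are hypothetical. The same areas are expressed in three units and the first-digit frequencies are compared.

```python
import random
from collections import Counter

def first_digit(x: float) -> int:
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:.15e}"[0])

def digit_frequencies(values):
    counts = Counter(first_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

random.seed(1)
# Assumption: simulated areas in km^2, uniform on the log10 scale from 0 to 6.
areas_km2 = [10 ** random.uniform(0, 6) for _ in range(100_000)]
areas_miles2 = [a * 0.3861 for a in areas_km2]  # 1 km^2 is about 0.3861 square miles
areas_acres = [a * 247.1 for a in areas_km2]    # 1 km^2 is about 247.1 acres

results = {}
for name, data in [("km2", areas_km2), ("miles2", areas_miles2), ("acres", areas_acres)]:
    results[name] = digit_frequencies(data)
    print(name, [round(results[name][d], 3) for d in range(1, 10)])
```

Changing units only multiplies every number by a constant, which slides the data along the log scale without changing its shape.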
Base Invariance
• Again… assume there is a universal law of nature governing the frequencies of first digits.
• It shouldn’t matter whether it is being observed by humans with 10 fingers…
… or ducks with 6 toes.
Chance of a number starting with digit d in base 10 = $\log_{10}\left(\frac{d+1}{d}\right)$
There should be nothing special about Base 10.
Again, requiring the same law for every number base forces the ‘universal law’ to be Benford’s Law
But… WHY should there be a universal law?
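For a concrete view of "the same law for every number base", the sketch below uses the usual base-b form of the formula, log_b((d + 1)/d). That generalisation is an assumption added here; the slides only state the base-10 version.

```python
import math

def benford_probability(d: int, base: int = 10) -> float:
    """Chance of leading digit d (1 .. base-1) for numbers written in the given base."""
    return math.log((d + 1) / d) / math.log(base)

for base in (10, 6):
    probs = [benford_probability(d, base) for d in range(1, base)]
    print(f"base {base}:", [round(p, 3) for p in probs])
```

In either base the chances add up to 1 and fall steadily from the smallest leading digit to the largest.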
Popular Explanations
• Scale Invariance
• Base Invariance
• Complicated Measure Theory
• Divine choice
• Mystery of Nature
Scale Invariance and Base Invariance: these two say that IF there is a universal law, it must be Benford’s. They don’t explain why there should be a law to start with!
In a nutshell …
If you grab numbers from all over the place (a random mix of distributions), their digit frequencies ultimately converge to Benford’s Law
That’s why THIS works well
It doesn’t really explain WHAT will work well, nor why
It doesn’t explain why street addresses of American Men of Science works well!
The Key Idea…
If a hat is covered evenly in red and white stripes…
… it will be half red and half white.
If the red stripes cover half the
base, they’ll cover about half the hat
The red stripes and the white stripes
even out over the shape of the hat
What if the red stripes cover 30% of the base?
[Figure: striped hat over a base from 0 to 6, red stripes from 0 to 0.3, 1 to 1.3, …, 5 to 5.3]
Then they’ll cover about 30% of the hat.
What if the red stripes cover precisely fraction 0.301 of the base?
[Figure: the same hat with red stripes from 0 to 0.301, 1 to 1.301, …, 5 to 5.301]
Then they’ll cover fraction ~0.301 of the hat.
0.301 = log10(2/1)
It doesn’t have to be a symmetrically shaped hat
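The stripe claim can be checked on a lopsided hat. In this added sketch the hat shape is an assumption (a 60/40 mix of two bell curves, chosen only to be smooth and asymmetric), and the function names are hypothetical. It adds up the area of the hat that sits over stripes starting at every integer.

```python
import math

def normal_cdf(x: float, mean: float, sd: float) -> float:
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def hat_cdf(x: float) -> float:
    """Assumed lopsided 'hat': a 60/40 mix of two bell curves over the base."""
    return 0.6 * normal_cdf(x, 2.0, 0.8) + 0.4 * normal_cdf(x, 4.0, 0.6)

def striped_area(stripe_width: float) -> float:
    """Area of the hat over the stripes [n, n + stripe_width) for every integer n."""
    # The hat is negligible outside -10 .. 20, so this range captures all of it.
    return sum(hat_cdf(n + stripe_width) - hat_cdf(n) for n in range(-10, 20))

for width in (0.5, 0.3, 0.301):
    print(width, round(striped_area(width), 4))
```

The total area of the hat is 1, so each printed value is directly the fraction of the hat that is red.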
What has
a stripey hat
got to do with
Think of X as a random number…
We want the probability that X has first digit = 1
Let the ‘hat’ be a probability density curve for X
Then AREAS on the hat give PROBABILITIES for X
Total area = 1
Area = 0.95 from 1 to 5, so Pr(1 < X < 5) = 0.95
In the same way ….
[Figure: striped hat over a base from 0 to 6, red stripes from 0 to 0.301, 1 to 1.301, …, 5 to 5.301]
If the red stripes somehow represent the X
values with first digit = 1,
and the red stripes have area ~ 0.301,
then Pr(X has first digit 1) ~ 0.301.
So X values with first digit=1 somehow lie
on a set of evenly spaced stripes?
Write X in Scientific Notation:
$X = r \times 10^n$
r is between 1 and 10
n is an integer
For example…
$124 = 1.24 \times 10^2$
$76 = 7.6 \times 10^1$
For the first digit of X, only r matters
X has first digit = 1 exactly when $1 \le r < 2$
For example, 124 has 1 < r < 2, so its first digit is 1; 76 has r > 2, so its first digit is not 1
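The scientific-notation step can be written out directly. In this added sketch the helper names `scientific_notation` and `first_digit` are hypothetical; it leans on Python's `e` number format to split X into r and n.

```python
def scientific_notation(x: float) -> tuple[float, int]:
    """Split a positive number into (r, n) with x = r * 10**n and 1 <= r < 10."""
    mantissa, exponent = f"{x:.15e}".split("e")
    return float(mantissa), int(exponent)

def first_digit(x: float) -> int:
    r, _ = scientific_notation(x)
    return int(r)  # the integer part of r is the first digit of x

for x in (124, 76, 0.0034):
    r, n = scientific_notation(x)
    print(f"{x} = {r} x 10^{n}, first digit {first_digit(x)}")
```

Only r is used to find the first digit; n just records how far the decimal point moved.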
Take logs to base 10…
$\log X = \log r + \log(10^n)$
Or in other words…
$\log X = \log r + n$
r is between 1 and 10, n is an integer
X has first digit 1 when $1 \le r < 2$
i.e. ...when $\log 1 \le \log r < \log 2$
i.e. ...when $0 \le \log r < 0.301$
X has first digit 1 precisely when log(X) is between n and n + 0.301 for any integer n
n=0: $0 \le \log X < 0.301$, X from 1 to 2
n=1: $1 \le \log X < 1.301$, X from 10 to 20
n=2: $2 \le \log X < 2.301$, X from 100 to 200
STRIPES!!
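The stripe condition on log(X) can be compared with the first digit read straight off the number. This is an added sketch with hypothetical helper names. The test values are chosen to stay away from exact stripe edges such as 2 or 20, where floating-point rounding could matter.

```python
import math

STRIPE_WIDTH = math.log10(2)  # about 0.301

def in_red_stripe(x: float) -> bool:
    """True when log10(x) lies between n and n + 0.301 for some integer n."""
    log_x = math.log10(x)
    n = math.floor(log_x)
    return log_x - n < STRIPE_WIDTH

def starts_with_one(x: float) -> bool:
    return f"{x:.15e}"[0] == "1"

values = (1.5, 17, 76, 124, 1999, 2001, 0.0015)
for x in values:
    print(x, round(math.log10(x), 3), in_red_stripe(x), starts_with_one(x))
```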
The ‘hat’ is the probability density curve for log(X)
[Figure: striped hat over the log(X) axis from 0 to 6, red stripes from 0 to 0.301, 1 to 1.301, …, 5 to 5.301]
X values with first digit = 1 satisfy:
n=0: $0 \le \log X < 0.301$, X from 1 to 2
n=1: $1 \le \log X < 1.301$, X from 10 to 20
n=2: $2 \le \log X < 2.301$, X from 100 to 200
and so on!
So X values with first digit=1 DO lie on
evenly spaced stripes, on the log scale!
The PROBABILITY of getting first digit 1 is the AREA of the red stripes,
which is approximately the fraction of the base they cover: 0.301.
We’ve done it!
We’ve shown that we
really should expect
the first digit to be
1 about 30% of the
time
Intuitively…
So the smallest numbers (first digit = 1) are stretched out, and get the highest probability
[Figure: striped hat over the log scale from 0 to 6, red stripes from n to n + 0.301]
The log scale distorts:
small numbers (e.g. 100) are stretched out;
larger numbers (e.g. 900) are bunched up.
The first digit corresponds to regularly
spaced stripes on the log scale.
When is this going to work?
We need a lot of stripes to balance out big ones and little ones
We get one stripe every integer…
So we need a lot of integers!
The distribution of X needs to be WIDE on the log scale
[Figure: striped hat over the log scale from 0 to 6]
X ranges from 0 to 6 on the log scale…
So it ranges from 1 to $10^6$ on the usual scale!
1 .. 2 .. Miss a few ... 999,999 .. 1,000,000
These are Benford’s ‘Outlaw Numbers’!
All we need is a distribution that is:
• WIDE (4 – 6 orders of magnitude or more)
• Reasonably SMOOTH …
Then the red stripes will even out to cover
about 30% of the total area.
Outlaw numbers…!
Try it on some well-known distributions
[Figure panels: Histogram of X, the numbers of interest; Histogram of log10(X), the ‘hat’; probability density curves overlaid]
Stripes on the hat: numbers in the red stripes begin with 1
Despite the small number of stripes, an excellent fit to Benford:
1s : 0.33 vs 0.30
2s : 0.17 vs 0.17
...
9s: 0.049 vs 0.046
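A comparison like the one on this slide can be reproduced by simulation. The distribution here is an assumption (a lognormal, so that log(X) is a smooth bell-shaped hat several integers wide). It is not necessarily the distribution behind the slide's figures.

```python
import math
import random
from collections import Counter

random.seed(2)
# Assumption: lognormal X, whose log is a wide, smooth bell-shaped 'hat'.
sample = [random.lognormvariate(5.0, 2.0) for _ in range(100_000)]

# First digit read off scientific notation, e.g. "1.24...e+02" -> 1.
counts = Counter(int(f"{x:.15e}"[0]) for x in sample)
observed = {d: counts[d] / len(sample) for d in range(1, 10)}
benford = {d: math.log10((d + 1) / d) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}s: {observed[d]:.3f} vs {benford[d]:.3f}")
```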
Why does it go wrong with Uniform?
[Figure panels: Histogram of X, the numbers of interest; Histogram of log10(X), the ‘hat’; probability density curves overlaid]
The escalating hat-shape means that the stripes for 1 carry LESS than their rim proportion; stripes for 9 carry MORE
The hat doesn’t fulfil the ‘smooth’ criterion: has a discontinuity at the end
Because we understand why Benford’s Law
works, we don’t need to be surprised by the
obvious counterexamples
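The Uniform counterexample is easy to simulate. The range 0 to 1000 is an assumption chosen for this added sketch. With that range every first digit is equally likely, so digit 1 falls well short of Benford and digit 9 overshoots.

```python
import math
import random
from collections import Counter

random.seed(3)
# Assumption: X uniform between 0 and 1000.
sample = [random.uniform(0.0, 1000.0) for _ in range(100_000)]
sample = [x for x in sample if x > 0]  # guard against an exact zero

counts = Counter(int(f"{x:.15e}"[0]) for x in sample)
observed = {d: counts[d] / len(sample) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}s: {observed[d]:.3f} vs Benford {math.log10((d + 1) / d):.3f}")
```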
In Real Life…
World Populations:
From 50 for the Pitcairn Islands …
To $1.3 \times 10^9$ for China…
Wide (9 integers => 9 stripes)
First digits very good fit to Benford!
Electorate populations?
From 583,000 to 773,000 in California:
The hat has less than one stripe! Benford doesn’t work here.
Of course not! All the first digits are 5, 6, or 7…
But naturally occurring populations are a different story
Cities in California:
- from 94 in the city of Vernon…
- to 3.9 million in Los Angeles…
Wide enough (5 integers => 5 stripes)
Yes! It’s Benford!
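The width check can be done from just the smallest and largest values quoted on the slides. This added sketch measures the raw width on the log10 scale, which comes out a little smaller than the slides' count of integers touched. The function name `log_width` is hypothetical.

```python
import math

def log_width(smallest: float, largest: float) -> float:
    """Width of the data on the log10 scale, in orders of magnitude."""
    return math.log10(largest) - math.log10(smallest)

examples = {
    "World populations": (50, 1.3e9),
    "California electorates": (583_000, 773_000),
    "California cities": (94, 3.9e6),
}
for name, (low, high) in examples.items():
    print(f"{name}: {log_width(low, high):.2f} orders of magnitude")
```

A width below 1 means the hat cannot contain even one full stripe-and-gap cycle, which is the electorate case.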
Your tax return….?
If you plan to fake data, you should first
check whether it ought to be Benford!
But beware: the IRD also has some other
tricks up its sleeve….
My opinions
Benford’s Law is:
• Interesting
• Fun / quirky
• Not very useful
• Not at all mysterious
It is a heuristic:
• It can be explained
• It’s futile to try to prove it – it won’t succumb to rigour because it isn’t exact
Even statistical goodness-of-fit tests that accommodate uncertainty are usually too strict – highly Benford data will ‘fail’ the test
Even the explanation is based on a model: in real life, data are not random variables drawn from probability density functions
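The point about goodness-of-fit tests being too strict can be illustrated with a chi-square statistic computed by hand. This is an added sketch under stated assumptions: the data are simulated from a lognormal chosen so that the digit frequencies sit close to Benford without matching it exactly. The cutoff 15.507 is the standard 5% critical value for 8 degrees of freedom.

```python
import math
import random
from collections import Counter

random.seed(4)
n = 200_000
# Assumption: lognormal data whose first digits are close to, but not exactly, Benford.
sample = [random.lognormvariate(0.0, 1.0) for _ in range(n)]

counts = Counter(int(f"{x:.15e}"[0]) for x in sample)
benford = {d: math.log10((d + 1) / d) for d in range(1, 10)}

# Largest gap between an observed digit frequency and its Benford value.
max_gap = max(abs(counts[d] / n - benford[d]) for d in range(1, 10))
# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (counts[d] - n * benford[d]) ** 2 / (n * benford[d]) for d in range(1, 10)
)

CRITICAL_5_PERCENT = 15.507  # chi-square critical value, 8 degrees of freedom
print(f"largest gap from Benford: {max_gap:.4f}")
print(f"chi-square = {chi_square:.1f}, rejects Benford: {chi_square > CRITICAL_5_PERCENT}")
```

With a sample this large, small gaps that look like a good fit by eye are enough to push the statistic past the cutoff.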
Thanks for listening!
To find out more:
• A Simple Explanation of Benford’s Law
by R. M. Fewster
The American Statistician, 2009.
• My website (includes class activity)