Benford's Law
Benford’s Very Strange Law
John D. Barrow
Simon Newcomb
1888: "We are probably nearing the limit of all we can know about astronomy"
1835-1909
‘Note on the Frequency of Use of the Different Digits in Natural Numbers’, 1881
Log Tables Yield…
Newcomb’s ‘Law’
"That the ten digits do not occur with
equal frequency must be evident to
anyone making much use of logarithmic
tables, and noticing how much faster
the first pages wear out than the last
ones.
The first significant figure is oftener 1
than any other digit, and the frequency
diminishes up to 9."
The law of probability of the occurrence of
numbers is such that all mantissae [fractional
part] of their logarithms are equally probable.
Data on first digits are evenly spread on a logarithmic scale
But they will not be on a linear scale: there they become increasingly sparse towards larger values
Newcomb said this law was “evident”
P(d) = [log(d+1) – log(d)]/[log(10) – log(1)] = log10(1 + 1/d)
Probability of the First Digit Being Equal to d
P(d) = log10[1 + 1/d], d = 1, 2, …, 9
Ignore signs and leading zeros and take the first significant digit, e.g. for -3.1526 it is 3
A Big Surprise
You might have thought P(1) = P(2) = P(3) = … = P(9) = 0.11…
But…
P(1) = 0.30
P(2) = 0.18
P(3) = 0.12
P(4) = 0.10
P(5) = 0.08
P(6) = 0.07
P(7) = 0.06
P(8) = 0.05
P(9) = 0.05
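The values above can be reproduced directly from the formula P(d) = log10[1 + 1/d]. This is a minimal sketch; the function name benford_first_digit is chosen here for illustration.

```python
import math

def benford_first_digit(d: int) -> float:
    """Probability that the first significant digit equals d (d = 1..9)."""
    if not 1 <= d <= 9:
        raise ValueError("first significant digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Tabulate the nine probabilities and confirm that they sum to 1.
probs = {d: benford_first_digit(d) for d in range(1, 10)}
for d, p in probs.items():
    print(f"P({d}) = {p:.2f}")
print("total =", round(sum(probs.values()), 12))
```

The probabilities sum to 1 because the sum of log10[(d+1)/d] over d = 1 to 9 telescopes to log10(10).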
Rediscovered by
Frank Benford
at GEC in 1938
1883-1948
P(d)= log10[1 + 1/d] first-digit distribution
then becomes known as
“Benford’s Law”
‘The Law of Anomalous Numbers’ (1938)
Benford gathered 20,000 pieces of data and studied
First-digit frequencies
Data (%)      1     2     3     4     5     6     7     8     9
River areas   31.0  16.4  10.7  11.3  7.2   8.6   5.5   4.2   5.1
Baseball      32.7  17.6  12.6  9.8   7.4   6.4   4.9   5.6   3.0
Magazines     33.4  18.5  12.4  7.51  7.1   6.5   5.5   4.9   4.2
Powers of 2   30    17    13    10    7     7     6     6     5
20 tables     30.6  18.5  12.4  9.4   8.0   6.4   5.1   4.9   4.7
Half lives    29.6  17.8  1.7   10.5  9.9   4.8   5.2   5.2   5.2
Benford Law   30.1  17.6  12.5  9.7   7.9   6.7   5.8   5.1   4.6
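The "Powers of 2" row is easy to check by hand. The sketch below counts the first digits of 2^n and prints them next to the Benford percentages. The choice of the first 1000 powers (n = 0 to 999) is an assumption made here; the source does not say how many powers were used.

```python
import math
from collections import Counter

N_POWERS = 1000  # assumed sample size, not taken from the source

# First digit of each power of two, read off its decimal representation.
counts = Counter(int(str(2 ** n)[0]) for n in range(N_POWERS))
power_freq = {d: counts[d] / N_POWERS for d in range(1, 10)}

print("digit  powers of 2  Benford")
for d in range(1, 10):
    benford = math.log10(1 + 1 / d)
    print(f"{d:>5}  {100 * power_freq[d]:>10.1f}%  {100 * benford:>6.1f}%")
```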
Random street addresses
Picking Raffle Tickets
P(1) = 1/2 with 2 tickets
P(1) = 1/3 with 3 tickets
P(1) = 1/5 with 5 tickets
P(1) = 1/9 with 9 tickets
P(1) goes up as we go to 19 tickets, then falls
P(1) depends on the number of tickets
[Graph: P(1) against number of tickets]
Take an average over all possible numbers of tickets
The average is 30.1%
[Graph: P(1) against number of tickets. Credit: S. Mould]
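The rise and fall of P(1) with the number of tickets can be checked by brute force. This is a minimal sketch, assuming tickets are numbered 1 to n and one is drawn uniformly at random; the function name p_first_digit_one is illustrative.

```python
from fractions import Fraction

def p_first_digit_one(n_tickets: int) -> Fraction:
    """Chance that a ticket drawn uniformly from 1..n_tickets starts with the digit 1."""
    ones = sum(1 for t in range(1, n_tickets + 1) if str(t)[0] == "1")
    return Fraction(ones, n_tickets)

# P(1) falls from 2 to 9 tickets, climbs until 19, falls until 99, climbs again...
for n in (2, 3, 5, 9, 10, 19, 50, 99, 199, 999):
    p = p_first_digit_one(n)
    print(f"{n:>4} tickets: P(1) = {p} = {float(p):.3f}")
```

Exact fractions are used so that the small cases match the values on the slide without rounding.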
Universal distribution P(x) for numbers with units
Means it must be scale invariant
$P(kx) = f(k)\,P(x)$
Since $\int P(x)\,dx = 1$ we must have $\int P(kx)\,dx = 1/k$
so $1/k = \int P(kx)\,dx = f(k)\int P(x)\,dx = f(k)$
Means $f(k) = 1/k$
Take $d/dk$ of $P(kx) = f(k)\,P(x)$:
$\frac{dP(kx)}{d(kx)}\,\frac{d(kx)}{dk} = x\,\frac{dP(kx)}{d(kx)} = -\frac{P(x)}{k^2}$
Put $k = 1$: $x\,\frac{dP}{dx} = -P(x)$
Means $P(x) \propto 1/x$
In reality we won't go to zero or infinity so don't worry about $\int_0^\infty \frac{1}{x}\,dx$ being infinite
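Scale invariance can be illustrated numerically. The sketch below draws samples with density proportional to 1/x over six decades, rescales them by a unit-conversion factor, and compares first-digit frequencies before and after. The range 1 to 10^6, the factor 2.54 and the sample size are all arbitrary choices made here for illustration.

```python
import math
import random

def first_digit(x: float) -> int:
    """First significant digit of a non-zero number, ignoring its sign."""
    return int(f"{abs(x):.17e}"[0])

def digit_freqs(values):
    """Fraction of values whose first significant digit is 1, 2, ..., 9."""
    counts = [0] * 10
    for v in values:
        counts[first_digit(v)] += 1
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(0)
# If log10(x) is uniform then the density of x is proportional to 1/x.
sample = [10 ** rng.uniform(0, 6) for _ in range(100_000)]
rescaled = [2.54 * x for x in sample]  # e.g. a change of units

before, after = digit_freqs(sample), digit_freqs(rescaled)
for d in range(1, 10):
    print(d, f"{before[d]:.3f}", f"{after[d]:.3f}", f"{math.log10(1 + 1 / d):.3f}")
```

Both columns of frequencies should sit close to the Benford values, which is the practical meaning of "no preferred units".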
Other Digits
By the same kind of analysis we can determine the probability that the
second digit will have a certain value.
It's only necessary to consider a single order of magnitude, since the pattern
is repeated on each order.
For example, in the base 10, the probability of the
second digit being "3" is equal to the sum of the probabilities
of the first two digits being "1.3", "2.3", "3.3", ... or "9.3" for numbers in the
range from 1 to 10.
This is indicated by the shaded regions in the logarithmic scale:
The fraction in 1.3 to 1.4 is [log(1.4) – log(1.3)]/[log(10) – log(1)] = log10(1.4/1.3) = 0.032
Now just find the fractions in 2.3 to 2.4 etc and add all the answers together
Probabilities for Successive Significant Digits
P(first digit is d) = log[1 + 1/d], d = 1,2,3,…9.
$P(\text{second digit is } d) = \sum_{k=1}^{9} \log_{10}\left[1 + (10k + d)^{-1}\right]$, d = 0,1,2,…,9. (Newcomb)
The joint distribution of all digits can be found and they are not independent
$P(\text{first} = d_1, \ldots, k\text{th} = d_k) = \log_{10}\left[1 + \left(\sum_{i=1}^{k} d_i\,10^{k-i}\right)^{-1}\right]$
E.g. for 0.314: $P(3,1,4) = \log_{10}\left[1 + (314)^{-1}\right] = 0.0014$
Unconditional probability that second digit is 2 is P(second digit = 2) = 0.109,
But conditional probability that it is 2 given that the first is 1 is 0.115
Dependence falls off fast as distance between digits increases
Distribution of the nth digit approaches a uniform distribution on 0,1,2,…,9 very fast
as $n \to \infty$, so $P \to 1/10$ for occurrence of each 0,1,2,…,9, as $\ln(1 + 1/n) \approx 1/n$
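The joint law for leading digits makes the second-digit and conditional probabilities a short computation. This is a sketch using only the formulas on this slide; the function names are chosen here for illustration.

```python
import math

def p_leading_digits(*digits: int) -> float:
    """Joint probability that the first k significant digits are d1, ..., dk."""
    if not digits or not 1 <= digits[0] <= 9:
        raise ValueError("the first significant digit must be 1..9")
    if any(not 0 <= d <= 9 for d in digits):
        raise ValueError("digits must be 0..9")
    n = int("".join(str(d) for d in digits))  # e.g. (3, 1, 4) -> 314
    return math.log10(1 + 1 / n)

def p_second_digit(d: int) -> float:
    """Unconditional probability that the second significant digit is d."""
    return sum(p_leading_digits(k, d) for k in range(1, 10))

def p_second_given_first(d2: int, d1: int) -> float:
    """Conditional probability of the second digit given the first."""
    return p_leading_digits(d1, d2) / p_leading_digits(d1)

print("P(3,1,4) =", round(p_leading_digits(3, 1, 4), 5))
print("d  P(second = d)  P(second = d | first = 1)")
for d in range(10):
    print(d, f"{p_second_digit(d):.3f}", f"{p_second_given_first(d, 1):.3f}")
```

The two columns differ, which is the dependence between digits noted above; the second-digit column is already much flatter than the first-digit law.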
Invariances Pick Out Benford
Scale invariance –
no preferred units
Base invariance wrt base of
arithmetic b
P(d) = log_b(1 + 1/d), d = 1, 2, …, b – 1
But why should there be a
distribution like this at all?
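The base-b form of the law can be tabulated in the same way as the base-10 case. This is a minimal sketch; the function name benford_base and the bases shown are illustrative choices.

```python
import math

def benford_base(d: int, base: int) -> float:
    """First-digit probability for digit d (1..base-1) in arithmetic of the given base."""
    if base < 2 or not 1 <= d < base:
        raise ValueError("need base >= 2 and 1 <= d < base")
    return math.log(1 + 1 / d, base)

for base in (2, 8, 10, 16):
    probs_b = [benford_base(d, base) for d in range(1, base)]
    print(f"base {base:>2}: P(1) = {probs_b[0]:.3f}, total = {sum(probs_b):.3f}")
```

In base 2 every number starts with 1, so the law gives P(1) = 1 there.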
Do All First-Digit Distributions
Follow Newcomb-Benford?
US tax return data
Random number generator
Not Everything Follows Benford
Continued fraction digits are mostly 1’s in
general but they are not Benford-Newcomb-like
a = k + x = integer + fractional part
For almost all real numbers:
P(k) = ln[1 + 1/(k(k + 2))]/ln[2]
P(1) = 0.42, P(2) = 0.17, P(3) = 0.09, P(4) = 0.06, P(5) = 0.04
Steeper than Benford: $P(k) \propto 1/k^2$ as $k \to \infty$, since $\ln(1+x) \approx x$ for small $x$
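The continued-fraction digit probabilities can be compared with Benford's first-digit law side by side. This is a sketch of the formula on this slide only; the function name cf_digit_probability is chosen here.

```python
import math

def cf_digit_probability(k: int) -> float:
    """Probability that a continued-fraction digit equals k, for almost all real numbers."""
    if k < 1:
        raise ValueError("continued-fraction digits are positive integers")
    return math.log2(1 + 1 / (k * (k + 2)))

print("k  continued fraction  Benford")
for k in range(1, 10):
    print(k, f"{cf_digit_probability(k):.3f}", f"{math.log10(1 + 1 / k):.3f}")

# Unlike first digits, k is unbounded; the probabilities still sum to 1.
print("partial sum to k = 100000:", sum(cf_digit_probability(k) for k in range(1, 100_001)))
```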
First digits are Benford-Newcomb distributed so long as
• Data measure the same phenomenon (e.g. all prices or all areas)
• There are no built-in maximum or minimum values
• The numbers are not assigned (like phone numbers)
• The underlying distribution is fairly smooth
• There are more observations of small items than large ones
• Data span several orders of magnitude (several whole numbers on the log scale):
* The distribution must be broad rather than narrow *
Red area is the relative probability that the first digit is 1
Blue area is the relative probability that the first digit is 8
Broad: ratios of areas proportional to widths, e.g. incomes, populations
Narrow: ratios of areas not proportional to widths, e.g. human heights, IQ scores
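The broad versus narrow contrast can be seen in a small simulation. The two distributions below are stand-ins invented for this sketch, not data from the lecture: a log-normal spanning many orders of magnitude for the broad case, and a normal with mean 170 and standard deviation 10 (think heights in cm) for the narrow case.

```python
import math
import random

def first_digit(x: float) -> int:
    """First significant digit of a non-zero number, ignoring its sign."""
    return int(f"{abs(x):.17e}"[0])

def share_starting_with(values, digit: int) -> float:
    """Fraction of values whose first significant digit equals `digit`."""
    return sum(1 for v in values if first_digit(v) == digit) / len(values)

rng = random.Random(1)
n = 100_000
broad = [rng.lognormvariate(10, 3) for _ in range(n)]  # hypothetical, many decades wide
narrow = [rng.gauss(170, 10) for _ in range(n)]        # hypothetical, well under one decade

print("Benford P(1):", round(math.log10(2), 3))
print("broad  P(1):", round(share_starting_with(broad, 1), 3))
print("narrow P(1):", round(share_starting_with(narrow, 1), 3))
```

The narrow sample almost always starts with 1 simply because nearly all of its values lie between 100 and 200.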
Different Types of Data
Benford-like? yes, yes, no, yes
Winning Lotteries
The Massachusetts Numbers Game – State Lottery
1. Bet on a 4-digit number
2. A 4-digit number is generated randomly
3. All winners share the jackpot
A Possible Strategy
To avoid sharing the prize, assume entrants pick numbers from
their experience (i.e. not at random) and obey Benford's law. So pick
numbers that are least probable under the Benford-Newcomb law:
start with 9s and 8s
Evidence (Hill 1988) that numbers ‘randomly’ chosen by people tend
to start with low digits
Generalised Benford’s Laws
A random process with probability distribution $P(x) \propto 1/x$
gives Benford data for first digits:
P(d) = log10[1 + 1/d]
Random processes with $P(x) \propto 1/x^a$ and $a \neq 1$ give
$P(d) = C\int_d^{d+1} x^{-a}\,dx = (10^{1-a} - 1)^{-1}\left[(d+1)^{1-a} - d^{1-a}\right]$
For a = 2: P(d = 1) = 0.56, P(d = 2) = 0.185,
P(d = 3) = 0.09, P(d = 9) = 0.012
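The generalised law is simple to evaluate for any exponent a. This is a minimal sketch of the formula above; the function name generalised_benford is illustrative, and treating a = 1 as the ordinary Benford limit is a convenience added here.

```python
import math

def generalised_benford(d: int, a: float) -> float:
    """First-digit probability when the underlying density is proportional to x**(-a)."""
    if not 1 <= d <= 9:
        raise ValueError("first significant digit must be between 1 and 9")
    if math.isclose(a, 1.0):
        return math.log10(1 + 1 / d)  # ordinary Benford law in the limit a -> 1
    return ((d + 1) ** (1 - a) - d ** (1 - a)) / (10 ** (1 - a) - 1)

for a in (1.0, 1.1, 2.0):
    row = ", ".join(f"{generalised_benford(d, a):.3f}" for d in range(1, 10))
    print(f"a = {a}: {row}")
```

Larger a concentrates more weight on the digit 1; a = 0 would give a uniform 1/9 for every digit.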
For prime numbers from 1 to N
a(N) = 1/[log N – c]
c = 1.10 ± 0.05 for large N
Perone et al
A Well-defined Approach to Uniformity by the Primes
Christian Perone
a = 1.10
Detecting Fraud
‘Natural’ distributions and their combinations should follow Benford
Maybe ‘doctored’ or ‘artificial’ constructions do not?
Mark Nigrini Univ. Cincinnati PhD thesis (1992)
‘The detection of income tax evasion through an analysis of digital distributions’
Data from the lines of 169,662 IRS model files follow Benford's law
closely.
Fraudulent data taken from a 1995 Kings County, New York, District
Attorney's Office study of cash disbursement and payroll in business
do not follow Benford's law.
The fraudulent or concocted data appear to have far fewer numbers
starting with 1 and many more starting with 5 or 6 than do true data.
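A first-digit screen of this kind can be sketched with a chi-square comparison against Benford's law. This is an assumption about method, not a description of Nigrini's program, and both data sets below are synthetic stand-ins: powers of 2 play the role of 'natural' figures, and the integers 100 to 999 play the role of made-up figures with evenly spread first digits.

```python
import math
from collections import Counter

def first_digit(x) -> int:
    """First significant digit of a non-zero number, ignoring its sign."""
    return int(f"{abs(x):.17e}"[0])

def benford_chi_square(values) -> float:
    """Chi-square statistic of observed first-digit counts against Benford's law."""
    values = [v for v in values if v != 0]  # zero has no first significant digit
    n = len(values)
    counts = Counter(first_digit(v) for v in values)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (counts[d] - expected) ** 2 / expected
    return chi2

CRITICAL_5_PERCENT = 15.51  # chi-square critical value, 8 degrees of freedom, 5% level

natural_like = [2 ** n for n in range(1000)]  # synthetic stand-in for genuine data
made_up = list(range(100, 1000))              # synthetic stand-in for concocted data

for name, data in (("natural-like", natural_like), ("made-up", made_up)):
    chi2 = benford_chi_square(data)
    verdict = "flag for review" if chi2 > CRITICAL_5_PERCENT else "consistent with Benford"
    print(f"{name}: chi-square = {chi2:.1f} -> {verdict}")
```

A large statistic only says the digits are unlike Benford's law; data that are narrow, capped or assigned fail the test for innocent reasons, so a flag is a prompt to look closer rather than proof of fraud.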
Forensic Accounting with Newcomb-Benford
Robert Burton, the chief financial investigator for the Brooklyn District
Attorney recalled in an interview that he had read an article by Dr. Nigrini
that fascinated him.
"He had done his Ph.D. dissertation on the potential use of Benford's Law
to detect tax evasion, and I got in touch with him in what turned out to be a
mutually beneficial relationship," Mr. Burton said. "Our office had handled
seven cases of admitted fraud, and we used them as a test of Dr. Nigrini's
computer program. It correctly spotted all seven cases as 'involving
probable fraud.'"
He feels your pain
President Clinton’s Tax Returns over 13 Years