Applying Benford’s Law to Large, Natural Data Sets


APPLYING BENFORD’S LAW
OF LEADING DIGITS TO
LARGE, NATURAL DATA SETS
BACKGROUND OF BENFORD’S LAW


Discovered by Simon Newcomb in 1881 and again
by Frank Benford in 1938, Benford’s Law of
Leading Digits suggests that in a majority of
real-life data sets, the leading digits of data
entries are logarithmically, rather than
uniformly, distributed.
As such, in a base 10 number system we would
expect to observe a leading digit of 1 about 30.1%
of the time, whereas we would expect to observe a
leading digit of 9 only about 4.6% of the time.
INTRODUCTION TO BENFORD’S LAW


For any positive real number x, we can represent
x in scientific notation as
$x = M_B(x) \cdot B^{k(x)}$
where $M_B(x)$, with $1 \le M_B(x) < B$, is called the mantissa
of x, and the integer $k(x)$ represents the exponent value.
Benford’s Law of Leading Digits provides us with
an expected distribution of the mantissas in a
natural data set. According to the law, the
probability of observing a data entry beginning
with digit d in base B is
$\mathrm{Prob}(\text{first digit is } d) = \log_B\left(1 + \frac{1}{d}\right)$.
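As an illustration of the two formulas above, here is a minimal Python sketch that splits a positive number into mantissa and exponent and evaluates the first-digit probability. The function names are our own choices for this example and do not come from the original study.

```python
import math

def benford_first_digit_prob(d, base=10):
    """Probability that the leading digit is d in the given base."""
    if not 1 <= d < base:
        raise ValueError("leading digit must satisfy 1 <= d < base")
    return math.log(1 + 1 / d, base)

def mantissa_and_exponent(x, base=10):
    """Split positive x into (M, k) with x = M * base**k and 1 <= M < base."""
    if x <= 0:
        raise ValueError("x must be positive")
    k = math.floor(math.log(x, base))
    m = x / base ** k
    # Guard against floating-point error at exact powers of the base.
    if m >= base:
        m, k = m / base, k + 1
    elif m < 1:
        m, k = m * base, k - 1
    return m, k

def leading_digit(x, base=10):
    """Leading digit of positive x, read off the integer part of the mantissa."""
    m, _ = mantissa_and_exponent(x, base)
    return int(m)

# The nine base-10 first-digit probabilities sum to 1.
first_digit_probs = [benford_first_digit_prob(d) for d in range(1, 10)]
```

The guard after the division matters for inputs such as 1000, where the floating-point logarithm can land just below the true integer exponent.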
BENFORD BASE-10 DISTRIBUTION
For a base 10 number system, we expect to see the following
distribution of digits in a data set that satisfies Benford’s Law.
Benford Base-10 Probability by Digit Position

Digit    1st Digit    2nd Digit    3rd Digit    4th Digit
0        --           0.11968      0.10178      0.10018
1        0.30103      0.11389      0.10138      0.10014
2        0.17609      0.10882      0.10097      0.10010
3        0.12494      0.10433      0.10057      0.10006
4        0.09691      0.10031      0.10018      0.10002
5        0.07918      0.09668      0.09979      0.09998
6        0.06695      0.09337      0.09940      0.09994
7        0.05799      0.09035      0.09902      0.09990
8        0.05115      0.08757      0.09864      0.09986
9        0.04576      0.08500      0.09827      0.09982
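The second-, third-, and fourth-digit columns follow from the first-digit law by summing over every possible block of preceding digits. The sketch below is one way to compute them; the function name `digit_position_probs` is hypothetical.

```python
import math

def digit_position_probs(position):
    """Benford base-10 probabilities for the digit at `position` (1 = leading).

    For positions beyond the first, sum log10(1 + 1/(10*prefix + d)) over
    every admissible block of preceding digits (`prefix`).
    """
    if position < 1:
        raise ValueError("position must be at least 1")
    if position == 1:
        return {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    lo, hi = 10 ** (position - 2), 10 ** (position - 1)
    return {
        d: sum(math.log10(1 + 1 / (10 * prefix + d)) for prefix in range(lo, hi))
        for d in range(10)
    }

second = digit_position_probs(2)
third = digit_position_probs(3)
fourth = digit_position_probs(4)
```

Each later position is visibly closer to the uniform value 1/10, which is why last-digit tests are compared against the Uniform distribution.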
APPLICATIONS OF BENFORD’S LAW


Benford’s Law has proved a useful tool for detecting
fraud and data irregularities because humans are
notoriously poor random number generators.
Discrepancies from the Benford distribution may
suggest issues in data validity, such as
inconsistencies in data collection methods,
rounding errors, or even nefarious activities such as
fraud or data distortion.
Knowing that the leading digits of the mantissas
should be logarithmically distributed (becoming
gradually more uniform the further out we progress),
we can compare combinations of the first digits and
last digits to the Benford and Uniform distributions,
respectively, to judge conformity to the expected digit
distribution.
COMPARISON OF NATURAL DATA SETS



In this study, we compare two natural data sets and
their conformity to the Benford distribution.
The first data set is an example of a natural data set
with an extremely good fit for the Benford
distribution. This data set is made up of hydrology
and streamflow statistics from the U.S. Geological
Survey.
The second is a much more controversial data set that
demonstrates considerable discrepancies from the
digit distribution we would expect if it were truly
Benford. This data set was derived from a paper
published by Phil Jones and Michael Mann, two of the
researchers accused of data distortion in the 2009
Climategate Scandal.
ISSUES ARISING FROM BENFORD ANALYSIS


It is still an open question as to which data sets
should conform to Benford’s Law. As a rule of thumb,
data sets that are large, span multiple orders of
magnitude, and have a sufficient number of
significant digits tend to conform. However, it is
still possible for data meeting these conditions to
fail to be Benford without any nefarious activity.
Though the Chi-square statistic is the most popular
and well-documented statistic, we must take into
account the extreme sensitivity of the Chi-square
statistic when dealing with large data sets that have
few degrees of freedom (in cases such as these, the
Chi-square statistic tends to overestimate the error).
For comparison’s sake, we include these values, but
rely primarily on the mean absolute deviation
(percent deviation from the intended distribution) for
our analysis.
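To illustrate why the two statistics behave differently on large data sets, the sketch below computes both for two made-up count vectors that have identical proportions but differ in size by a factor of 100. The counts are hypothetical, and the mean absolute deviation is taken here as the average absolute gap between observed and expected proportions, which is our reading of the statistic.

```python
def chi_square(observed_counts, expected_probs):
    """Pearson chi-square statistic for counts against expected probabilities."""
    n = sum(observed_counts)
    return sum(
        (o - n * p) ** 2 / (n * p)
        for o, p in zip(observed_counts, expected_probs)
    )

def mean_absolute_deviation(observed_counts, expected_probs):
    """Average absolute gap between observed and expected proportions."""
    n = sum(observed_counts)
    gaps = [abs(o / n - p) for o, p in zip(observed_counts, expected_probs)]
    return sum(gaps) / len(gaps)

uniform = [0.1] * 10

# Hypothetical counts: same proportions, 1,000 vs. 100,000 observations.
small = [110, 90] + [100] * 8
large = [c * 100 for c in small]

chi_small, chi_large = chi_square(small, uniform), chi_square(large, uniform)
mad_small = mean_absolute_deviation(small, uniform)
mad_large = mean_absolute_deviation(large, uniform)
```

With the proportions held fixed, the chi-square value grows in direct proportion to the sample size while the mean absolute deviation does not change.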
INTENTIONS

The primary goal of this study is to get a sense of
when Benford’s Law should hold for natural data sets.
As such, discrepancies from Benford’s Law need not
indicate fraud or nefarious activity, and it is not our
intent to accuse anyone of such behavior; our goal is
to see whether or not certain data sets follow
Benford’s Law, and comment on the results.
HYDROLOGY STATISTICS
BACKGROUND AND DATA DESCRIPTION
This data set consists of streamflow
statistics from the U.S. Geological Survey.
The characteristics of this data make it perfect
for a Benford analysis:

• The data spans a time period of 130 years.
• The data set is the largest analyzed in Benford
  literature to date.
• The data set spans nine orders of magnitude.
• The methods employed to measure stream flow have
  not changed at all during the time period, suggesting
  that there will be no distortions due to data collection
  method changes.

HYDROLOGY STATISTICS
DESCRIPTION OF BENFORD TESTS


In a previous study by Miller-Nigrini (2008), a first
two-digit analysis was performed on this data set,
displaying a very close fit to the Benford distribution.
In this study, we analyzed the distribution of the first
three digits. The statement of Benford’s Law may be
revised as follows to predict the probability of
observing a data entry beginning with a predetermined
three-digit combination:
$\mathrm{Prob}(\text{first three digits are } d_1 d_2 d_3) = \log_{10}\left[1 + \frac{1}{100 d_1 + 10 d_2 + d_3}\right]$
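The three-digit formula can be checked numerically with a short sketch such as the following. The function name is our own. Summing over all 900 combinations from 100 to 999 should give 1, because the logarithms telescope to log10(1000/100).

```python
import math

def first_three_digits_prob(d1, d2, d3):
    """Benford probability that a number starts with the digits d1 d2 d3."""
    if not 1 <= d1 <= 9 or not 0 <= d2 <= 9 or not 0 <= d3 <= 9:
        raise ValueError("need 1 <= d1 <= 9 and 0 <= d2, d3 <= 9")
    return math.log10(1 + 1 / (100 * d1 + 10 * d2 + d3))

# Total probability over every three-digit start, 100 through 999.
total = sum(
    first_three_digits_prob(d1, d2, d3)
    for d1 in range(1, 10)
    for d2 in range(10)
    for d3 in range(10)
)
```

The most likely start, 100, has probability log10(1.01), roughly ten times that of the least likely start, 999.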
HYDROLOGY STATISTICS
RESTRICTING THE DATA SET



Due to potential rounding discrepancies, we only wanted to
include numbers with at least four significant digits so that
the first three would be unaltered by rounding. However,
by pruning our data set to include only values for which we
could trust the first three digits, we were limiting ourselves
to a mere 16.1% of our original 457,440 data entries,
existing in only one order of magnitude. This resulted in a
strange, non-Benford distribution.
Having restricted our data set so thoroughly, we could not
conclude that our data set was truly non-Benford.
Therefore, we decided to ignore the limitation on significant
digits and perform a Benford analysis on a larger portion of
the data set under the assumption that any rounding
errors would “smooth out” over such a large data set. This
enabled us to look at a data set with over 400,000 entries
spanning six orders of magnitude.
We compare and contrast these two subsets of the
hydrology statistics in the following slides.
HYDROLOGY STATISTICS
COMPARISON OF RESTRICTED VS. UNRESTRICTED DATA
[Figure panels: Restricted First 3-Digits; Unrestricted First 3-Digits]
In this comparison, we can see that the unrestricted data set
displays a much better conformance to the Benford
distribution (shown in pink). This may be attributed in part
to the difference in size, but is primarily due to the presence
of data entries spanning an extra five orders of magnitude in
the unrestricted data as opposed to the restricted set.
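The effect of range on conformance can be illustrated with synthetic data. The sketch below does not use the USGS data; it draws two log-normal samples, one confined to roughly a single order of magnitude and one spread over many, and measures each sample's first-digit mean absolute deviation from Benford. The distribution, its parameters, the seed, and the sample size are all assumptions made for the illustration.

```python
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def leading_digit(x):
    """Leading digit of a positive float, read from its scientific notation."""
    return int(f"{x:.15e}"[0])

def first_digit_mad(values):
    """Mean absolute deviation of first-digit proportions from Benford."""
    counts = [0] * 9
    for v in values:
        counts[leading_digit(v) - 1] += 1
    n = len(values)
    return sum(abs(c / n - p) for c, p in zip(counts, BENFORD)) / 9

rng = random.Random(0)  # fixed seed so the run is repeatable

# Narrow sample: almost all values fall between about 0.4 and 2.5.
narrow = [rng.lognormvariate(0.0, 0.3) for _ in range(50_000)]
# Wide sample: values spread over many orders of magnitude.
wide = [rng.lognormvariate(0.0, 3.0) for _ in range(50_000)]

narrow_mad = first_digit_mad(narrow)
wide_mad = first_digit_mad(wide)
```

In this synthetic setting the wide sample sits much closer to the Benford distribution than the narrow one. That matches the pattern seen in the restricted and unrestricted hydrology subsets, though it does not establish the cause in the real data.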
HYDROLOGY STATISTICS
MEASURING CONFORMANCE TO BENFORD’S LAW

The following table reports the Chi-square and mean
absolute deviation values for Benford tests of the first, first
two, and first three (both restricted and unrestricted)
digits. Again, we treat the significant Chi-square values
with caution, as several of our data sets contain over
400,000 values.
Benford Test                         Chi-Square    Mean Absolute Deviation
First Digit                          45.82         0.00086
First Two-Digits                     178.74        0.00017
First Three-Digits (Restricted)      12054.70      0.00039
First Three-Digits (Unrestricted)    23345.30      0.00020
HYDROLOGY STATISTICS
CONCLUSIONS

Though both data sets display a relatively good fit for the
Benford distribution, we notice that ignoring our initial
limitation on the number of significant digits (thereby
giving us a much larger data set spanning five additional
orders of magnitude) gives us a better value for the mean
absolute deviation. Our unrestricted data set is a mere
0.02% away from what we would expect in a Benford
distributed data set.
                        Restricted    Unrestricted
Size of Data Set        73,828        446,055
Orders of Magnitude     1             6
# Significant Digits    ≥4            ≥3
Mean Abs. Deviation     0.00039       0.00020
CLIMATE DATA
BACKGROUND AND DATA DESCRIPTION



A massive email leak at the Climatic Research Unit
in November 2009 led to allegations of scientific
misconduct and data distortion in the climate science
community. The scandal soon earned the title
“Climategate”.
The data set analyzed consisted of data from a
paper published by Phil D. Jones and Michael Mann
(two of the researchers accused in the scandal) in
2004, titled “Global Surface Temperatures Over the
Past Two Millennia”.
The data set contains a total of 32,451 observations
(measured as deviations from an average
temperature); this data set was further broken down
into 30 data subsets (ranging in size from 335 to 1991
entries) covering different regions of the world.
CLIMATE DATA
DESCRIPTION OF BENFORD TESTS


Because these data entries were measured as
deviations from an average temperature, the option of
a first-digit analysis was discarded (due to the
presence of so many data entries beginning with 0).
Instead, the last two digits were analyzed in four
different Benford tests:




• Endings: Compares each of the 100 possible last two
  digit combinations to the expected uniform probability,
  1/100.
• Non/Doubles: Compares the proportion of total
  non-double endings to 9/10 and the proportion of total
  double digit endings to 1/10.
• Non/Doubles(Split): Compares the proportion of total
  non-double endings to 9/10 and the proportion of each
  double digit ending to 1/100.
• Doubles(Conditional): Evaluates the double digit ending
  proportions conditionally (given that a double occurs),
  comparing each double digit ending combination to 1/10
  of the total double digit endings.
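The four tests can be sketched as mean absolute deviation calculations over different groupings of the ending counts. The function names below are hypothetical, and the exact way the original study computed its statistics is an assumption. As a check, the example uses the double-digit counts and subset size that this deck reports for the Tasmania Unsmoothed subset.

```python
def mad(observed_props, expected_props):
    """Mean absolute deviation between observed and expected proportions."""
    gaps = [abs(o - e) for o, e in zip(observed_props, expected_props)]
    return sum(gaps) / len(gaps)

def endings_mad(ending_counts):
    """Endings test: 100 counts (endings 00..99) against 1/100 each."""
    n = sum(ending_counts)
    return mad([c / n for c in ending_counts], [1 / 100] * 100)

def non_doubles_mad(double_counts, n):
    """Non/Doubles test: non-doubles against 9/10, all doubles against 1/10."""
    doubles = sum(double_counts)
    return mad([(n - doubles) / n, doubles / n], [9 / 10, 1 / 10])

def non_doubles_split_mad(double_counts, n):
    """Non/Doubles(Split): non-doubles against 9/10, each double against 1/100."""
    doubles = sum(double_counts)
    observed = [(n - doubles) / n] + [c / n for c in double_counts]
    return mad(observed, [9 / 10] + [1 / 100] * 10)

def doubles_conditional_mad(double_counts):
    """Doubles(Conditional): each double against 1/10 of all doubles."""
    doubles = sum(double_counts)
    return mad([c / doubles for c in double_counts], [1 / 10] * 10)

# Counts of endings 00, 11, ..., 99 reported for Tasmania Unsmoothed.
tasmania_doubles = [57, 80, 64, 57, 0, 0, 0, 0, 0, 0]
tasmania_n = 1991

nd = non_doubles_mad(tasmania_doubles, tasmania_n)
nds = non_doubles_split_mad(tasmania_doubles, tasmania_n)
dc = doubles_conditional_mad(tasmania_doubles)
```

Under these definitions the three double-digit tests round to 0.02958, 0.01629, and 0.12000, which agrees with the Mean Abs. Dev. values reported for that subset. The Endings test needs all 100 ending counts, which the deck does not list in full.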
CLIMATE DATA
DISTRIBUTION OF DOUBLE-DIGIT ENDING COMBINATIONS

In an amalgamation of all 30 data subsets, we
observe a significant spike of values ending in the
double digit ending combination 77, and a deficit
of values ending in 00.
CLIMATE DATA
ANALYSIS OF CLIMATE DATA AMALGAMATION


In an analysis of all 32,451 data entries, we see a 3.93%
deviation from our expected Uniform distribution in the
Doubles (Conditional) test.
An issue that arises in the climate data analysis is the fact
that with only three significant digits, we would not expect
the last two digit distribution to be entirely uniform, as we
have not progressed far enough out in the mantissa to
ensure uniformity. We should see a distribution that is
slightly biased toward lower values, though less so than a
first digit Benford distribution.
Benford Last Two-Digits Test

Statistic         Endings    Non/Doubles    Non/Doubles(S)    Doubles(C)
Chi-Square        7288.75    6.34           594.49            613.89
Mean Abs. Dev.    0.00424    0.00419        0.00387           0.03927
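The expected bias toward lower endings can be made concrete under a simplifying assumption: that every value has exactly three significant digits and its first three digits follow Benford’s Law. The probability of a given last-two-digit ending is then the three-digit probability summed over the nine possible leading digits. This sketch illustrates the reasoning only and is not a model of the climate data; the function name is hypothetical.

```python
import math

def last_two_digit_probs_three_sig():
    """P(ending = d2 d3) for three-significant-digit Benford values.

    Sums log10(1 + 1/(100*d1 + ending)) over the leading digit d1.
    Returns a list indexed by the ending read as an integer 0..99.
    """
    return [
        sum(math.log10(1 + 1 / (100 * d1 + ending)) for d1 in range(1, 10))
        for ending in range(100)
    ]

probs = last_two_digit_probs_three_sig()
most_likely = probs[0]    # ending 00
least_likely = probs[99]  # ending 99
```

Under this assumption the probabilities decrease steadily from ending 00 to ending 99, starting a little above the uniform value 1/100 and finishing a little below it.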
CLIMATE DATA
ANALYSIS OF INDIVIDUAL DATA SUBSETS

Due to large discrepancies in the number of times that
particular ending digit combinations were observed, we
chose to analyze each of the thirty data subsets
individually. This uncovered a number of subsets with
ending digit distributions that seemed to be outside the
realm of random chance. We include the double digit
ending statistics for two of these strange subsets (the
Western US Unsmoothed and Tasmania Unsmoothed data
sets) below:
Data Set    #Entries    00    11    22    33    44    55    66    77    88    99    #Doubles
West. US    1781        4     6     4     5     1     8     0     38    0     24    90
Tasmania    1991        57    80    64    57    0     0     0     0     0     0     258
In our next few slides, we provide an example of the
analysis that was performed on each of the thirty data
subsets, using the data from these two subsets.
CLIMATE DATA
ANALYSIS OF SUBSETS – WESTERN US UNSMOOTHED

If this data subset were truly Benford, we would expect to
see a slight bias toward the lower double digit
combinations, and a more uniform decrease as the ending
digit combinations increase. When the last two-digit
analysis was expanded to include all 100 possible ending
combinations, we observed a random scattering of large
numbers of occurrences interspersed with 17 ending
combinations that did not occur a single time.
Benford Last Two-Digits Test

Statistic         Endings    Non/Doubles    Non/Doubles(S)    Doubles(C)
Chi-Square        1399.07    48.42          125.23            152.00
Mean Abs. Dev.    0.00771    0.04947        0.01169           0.09778
In addition to the non-Benford pattern of ending digit
combinations, we have a 9.78% deviation from the
distribution of double-digit endings (Doubles(C)) that we
would expect if this subset were Benford.
CLIMATE DATA
ANALYSIS OF SUBSETS – TASMANIA UNSMOOTHED


As seen in the previous table, this data subset demonstrates a
strong bias towards lower double-digit ending combinations.
Originally, we suspected that this anomaly may be due to a lack
of range (i.e. if our range covered only the interval [0,0.4], we
would not expect to observe any ending combinations above 40).
However, our range covers the interval [-4.43,3.59], and
includes ending digit combinations ranging from 00 to 99.
An expansion of the analysis to include all 100 possible
two-digit ending combinations demonstrates an impressive lack of
pattern, with 46 ending combinations being observed not a
single time, while others are observed as many as 80 times. A
sample of this data is included below:
Ending    00    01    02    03    04    05    06    07    08    09    10    11    12    13    14
Count     57    0     0     72    2     0     79    0     49    2     0     80    6     0     66
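A tally like the one above can be produced by reading each value as text, keeping its last two digits, and counting how many of the 100 possible endings never appear. The sketch below shows one way to do this. The function names are hypothetical, and the sample values are made up apart from the two range endpoints quoted on the previous slide.

```python
from collections import Counter

def last_two_digits(value_text):
    """Last two digits of a number given as text, ignoring sign and point."""
    digits = [ch for ch in value_text if ch.isdigit()]
    if len(digits) < 2:
        raise ValueError("need at least two digits")
    return "".join(digits[-2:])

def ending_counts(value_texts):
    """Counts for endings 00..99, in order, including endings never seen."""
    counts = Counter(last_two_digits(v) for v in value_texts)
    return [counts.get(f"{e:02d}", 0) for e in range(100)]

# Hypothetical sample; only -4.43 and 3.59 appear in the deck.
sample = ["-4.43", "3.59", "0.11", "1.03", "-0.03"]
counts = ending_counts(sample)
never_observed = sum(1 for c in counts if c == 0)
```

Working from the text form rather than from floats keeps trailing zeros, so a value recorded as 1.50 contributes the ending 50 rather than 15.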
CLIMATE DATA
ANALYSIS OF SUBSETS – TASMANIA UNSMOOTHED (CONT.)

Though the majority of the last two-digit tests reveal
only a 1 or 2% discrepancy from the expected
distribution, our Doubles (Conditional) test reports a
deviation of 12.00% from the Uniform distribution
that we would expect if this data set were truly
Benford.
Benford Last Two-Digits Test

Statistic         Endings    Non/Doubles    Non/Doubles(S)    Doubles(C)
Chi-Square        3261.49    19.36          538.58            400.68
Mean Abs. Dev.    0.01134    0.02958        0.01629           0.12000
CLIMATE DATA
CONCLUSIONS


An identical analysis performed on all thirty data
subsets revealed multiple cases of disparities from the
uniform last two-digit distribution to which they were
compared. Even with significant deviations in several
of the subsets, we would expect that if the data were
truly Benford, these discrepancies would smooth out
in an amalgamation of all 32,451 values.
As mentioned previously, we currently have no way of
determining if this data set should conform to
Benford’s Law. Though the data set is large, spans
multiple orders of magnitude, and reports data values
to three significant digits, it is still entirely possible
that the deviations are due to non-nefarious factors,
such as rounding errors, discrepancies in data
collection methods, or simply non-Benford behavior.
CONCLUSIONS



In this study we have seen two natural data sets whose
conformance to Benford’s Law varies drastically: a set of
hydrology statistics whose leading digit distribution is a
very close fit, and a set of controversial climate statistics
whose ending digit distribution reveals many discrepancies
from a Benford data set.
It is still an open question as to which data sets should
conform to Benford’s Law; though many researchers
believe that this law is a characteristic intrinsic to our
number system, there is no set of criteria that guarantees
conformance to the expected leading digit distribution.
It has been our goal to provide an in-depth Benford
analysis of several large natural data sets, to demonstrate
Benford techniques, and to address common issues that
arise in a Benford analysis.