Introductionx - The City University of New York

Introduction
and Background
FOS/FCM 705
Mathematical Statistics for Forensic Analysis
Applied Bayesian Statistics
Spring 2017
Bayes Theorem Example
• A taxicab was involved in a hit and run
accident at night. Two cab companies, the
Green and the Blue, operate in the city. The
following facts are known:
o 85% of the cabs in the city are Green and 15% are
Blue
o A witness identified the cab as blue. The court tested
the reliability of the witness under the same
circumstances that existed on the night of the
accident and concluded that the witness correctly
identified each one of the two colors 80% of the time
and failed 20% of the time.
• What is the probability that the cab involved in
the accident was actually blue?
• What does probability mean in this context?
All Models are Wrong,
Some Models are Useful
Now it would be very remarkable if any system existing in the real world
could be exactly represented by any simple model. However, cunningly
chosen parsimonious models often do provide remarkably useful
approximations. For example, the law PV = RT relating pressure P, volume V
and temperature T of an "ideal" gas via a constant R is not exactly true for
any real gas, but it frequently provides a useful approximation and
furthermore its structure is informative since it springs from a physical view
of the behavior of gas molecules.
For such a model there is no need to ask the question "Is the model true?".
If "truth" is to be the "whole truth" the answer must be "No". The only
question of interest is "Is the model illuminating and useful?".
Box, G. E. P. (1979), "Robustness in the strategy
of scientific model building", in Launer, R. L.;
Wilkinson, G. N. (eds.), Robustness in
Statistics, Academic Press, pp. 201–236.
Bayes Theorem Example Comment
• Notice that in the previous example we assumed
we knew the various percentages and the question
was essentially “If we know these percentages then
what is the probability that the cab was blue?”
• This is characteristic of a probability problem: we
assume that we know what the ‘model’ is and then
use probability theory to calculate the probability of
the event we are interested in.
Bayes Theorem Example More Comments
• Suppose there were 1000 cabs in the city. Then 850 are green
and 150 are blue.
o The witness would say ‘blue’ for 80% of the 150 blue cabs, or 120 cabs, and would
also say ‘blue’ for 20% of the 850 green cabs, or 170 cabs (the 20% wrong). At trial
the witness says ‘blue.’ What is the probability the cab was blue?
o If we repeated the identical situation many times, what percentage of the time
when the witness said blue would the cab be blue? (frequentist probability)
• Suppose our original assumptions about the population or the
witness were incorrect. What about our conclusion?
• Garbage In, Garbage Out
• Suppose the judge ‘thought’ a priori that witnesses can identify
the correct color somewhere between 60 and 80 percent of
the time. Now what percentage of the time when the witness
said blue would the cab be blue?
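The questions above can be checked numerically. The following is a minimal Python sketch (the slides themselves use R; the function name `posterior_blue` is an assumption made for illustration). It applies Bayes' theorem to the stated percentages and then recomputes the answer for a few witness accuracies in the judge's 60 to 80 percent range.

```python
def posterior_blue(prior_blue, accuracy):
    """P(cab is blue | witness says 'blue') by Bayes' theorem.

    prior_blue: fraction of cabs that are blue (0.15 in the slide).
    accuracy:   chance the witness names the correct color (0.80 in the slide).
    """
    says_blue_and_is_blue = prior_blue * accuracy
    says_blue_and_is_green = (1.0 - prior_blue) * (1.0 - accuracy)
    return says_blue_and_is_blue / (says_blue_and_is_blue + says_blue_and_is_green)


# Counting version with 1000 cabs: 120 blue cabs and 170 green cabs
# would be called 'blue' by the witness.
blue_called_blue = 0.80 * 150
green_called_blue = 0.20 * 850
print(blue_called_blue / (blue_called_blue + green_called_blue))

# Sensitivity to the assumed witness accuracy (garbage in, garbage out).
for accuracy in (0.60, 0.70, 0.80):
    print(accuracy, round(posterior_blue(0.15, accuracy), 3))
```

With the slide's numbers the answer is 120/290, so a cab called ‘blue’ is still more likely to be green. Lower witness accuracy pushes the probability down further.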
Virginia Murders
Run this code and the graphical output is:
Is there a trend?
Virginia Murders -2
• Notice that to analyze the Virginia Data we first
entered the data into the computer. Next we
plotted the data so that we had a graphical
summary. Lastly we “fit” a best straight line through
the data. So far this is descriptive statistics. As soon
as we use this straight line to predict future values
we enter the realm of inferential statistics.
Atlantic Hurricanes
• The following is a list of the numbers of Atlantic
hurricanes for each year from 1960-2010, 51 years in
all.
• How can we describe this data? You’ll notice
that just looking at the numbers doesn’t tell us
too much because there are too many
numbers.
• We can describe the numbers graphically or
with numeric summaries.
Atlantic Hurricanes
The chart to the right is called a histogram or
sometimes a bar plot. It gives the yearly hurricane counts and the
number of times each occurred. This is a graphical
summary. It summarizes the following, which is a tabular
summary. Note the number of hurricanes on the 1st
row and the frequencies on the 2nd.
• The following is a numerical summary of the
hurricane data. The six numbers you see are called
the minimum, maximum, median, mean, 1st
quartile, and 3rd quartile.
• Notice that in the hurricane example (as far as
we have followed it) we had some data and
we summarized it. This is the province of
descriptive statistics.
Matching Hats
• You go to a party with 4 others and you all
check your hats. The hat check person
unfortunately is a little tipsy. Thus she
randomly chooses hats and gives the hats
back to the party members as they leave.
o What is the probability that 1 person gets the
correct hat?
o What is the probability that 2 people get the
correct hat?
o What is the probability that nobody gets the
correct hat?
• What does probability mean in this case?
Matching Hats Continued
• Suppose we do not know how to solve the
matching hat problem using probability. We might
‘simulate’ the experience multiple times. Each time
we look at the number of matches.
• After many trials a good guess for the probability of
1 match is the proportion of times we get 1 match.
Also a good guess for the probability of 2,3,…, k
matches is the proportion of times we get that
many matches in our simulation.
• In this case we are estimating the probabilities from
the simulated sample. We used statistics.
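The simulation idea described above can be sketched in a few lines. This is a Python illustration, not the R code from the slides; the function names, the seed, and the default of 50,000 trials are assumptions chosen for the example. Five people are used because you attend the party with 4 others.

```python
import random
from collections import Counter


def count_matches(n_people, rng):
    """Hand the hats back in random order and count who gets their own hat."""
    hats = list(range(n_people))
    rng.shuffle(hats)
    return sum(1 for person, hat in enumerate(hats) if person == hat)


def simulate_hat_check(n_people=5, n_trials=50_000, seed=705):
    """Estimate P(k matches) by the proportion of trials with k matches."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(count_matches(n_people, rng) for _ in range(n_trials))
    return {k: counts[k] / n_trials for k in range(n_people + 1)}


proportions = simulate_hat_check()
for matches, proportion in proportions.items():
    print(matches, round(proportion, 3))
```

Exactly 4 matches never occurs: if four people have their own hats, the fifth must as well.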
Matching Hat Simulation
• Using the R code on the 2nd following
page I ran the matching hat
simulation 50000 times and created
the following bar chart:
Matching Hat Simulation
• Note that we obtained 1 match about 37% of
the time.
• In many repeatable situations we do not know if
an outcome we are looking for is going to occur
on the next trial but in a long sequence of trials
the relative frequency of occurrence of this
outcome approaches some fixed value. We call
this fixed value the probability of the outcome.
• In our simulation the relative frequency of 1 match is
about 37%. This may be used as an estimate of
the true probability.
R Code for Hat Check Problem
Probability, Statistics and the
Hat Check Problem
• Notice that in the original statement of the problem
we stated the entire problem including that the hat
check person ‘randomly’ selected the hats. Thus
our model was completely determined and the
solution (which we didn’t give) could be
determined. This solution should be the long run
relative frequency of occurrence of getting 1
match if we did the experiment. (Probability
problem)
• The simulation solution gives us an estimate of the
true probability based just on the sample. (Statistics)
Probability, Statistics and the Hat Check Problem
In this particular problem, the theoretical probabilities can
be calculated. (Not easy, but they follow.)
Note the results are pretty close to the simulation. Often, the simulation can
be completed with pretty good accuracy much more easily than developing
the related theoretical math. Sometimes the math problem is so difficult that
the simulation is the only reasonable way to approximate a solution.
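For only 5 hats the theoretical probabilities can also be obtained by brute force, without any clever math. The Python sketch below (an illustration, not the derivation used in the slides; the function name is an assumption) lists all 5! = 120 equally likely ways to return the hats and counts the matches in each.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def exact_match_distribution(n_people=5):
    """Exact P(k matches) by enumerating every equally likely hat assignment."""
    counts = Counter(
        sum(1 for person, hat in enumerate(perm) if person == hat)
        for perm in permutations(range(n_people))
    )
    total = sum(counts.values())  # n_people! assignments in all
    return {k: Fraction(counts[k], total) for k in range(n_people + 1)}


exact = exact_match_distribution()
for matches, prob in exact.items():
    print(matches, prob, round(float(prob), 4))
```

The exact probability of 1 match is 3/8 = 0.375, which agrees with the simulated value of about 37%.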
More Probability Problems
• Two fair dice are tossed. What is the probability that the
sum of the numbers on the two up faces equals 6?
• Assume that you take a true-false test and you know
nothing, so you guess each answer. If the test has 7
questions, what is the probability that you get
above 50% correct? How about if you think you have
between a 50% and 70% chance of getting any
question correct? What if you take the test and get 5
correct? If you then take the test again, what is the
probability that you get above 50% correct?
• It is known that the number of hurricanes to hit the
Atlantic per year is a Poisson random variable with
parameter λ = 6.176 and that years are independent
of one another. What is the probability that 5
hurricanes hit the Atlantic next year? (Note our
assumptions here. Are they reasonable?)
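The first two problems can be worked directly once the model is fixed. The Python sketch below is illustrative only; `prob_more_than_half` is a hypothetical helper name, and it assumes the questions are answered independently with the same chance of success. The dice answer comes from listing all 36 outcomes, and the test answer from the binomial distribution.

```python
from fractions import Fraction
from itertools import product
from math import comb

# Two fair dice: list all 36 equally likely outcomes and count sums of 6.
outcomes = list(product(range(1, 7), repeat=2))
p_sum_six = Fraction(sum(1 for a, b in outcomes if a + b == 6), len(outcomes))
print(p_sum_six)


def prob_more_than_half(n_questions, p_correct):
    """Binomial probability of getting more than 50% of the questions right."""
    needed = n_questions // 2 + 1  # 4 or more correct out of 7
    return sum(
        comb(n_questions, k) * p_correct**k * (1 - p_correct) ** (n_questions - k)
        for k in range(needed, n_questions + 1)
    )


# Pure guessing, then the ends of the 50% to 70% range mentioned above.
for p in (0.5, 0.6, 0.7):
    print(p, round(prob_more_than_half(7, p), 4))
```

The question about retaking the test after scoring 5 is different in kind: the chance of a correct answer is no longer assumed known and has to be updated from the observed score.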
More on Hurricane Problem
Statistics
• For the hurricane problem on the previous page an
interesting question is how we ‘knew’ that the
number of Atlantic hurricanes in a season is a
Poisson random variable with parameter λ = 6.176,
with years independent of one another.
o The answer is that we looked at data pertaining to Atlantic hurricanes and
used that sample data to infer what the true hurricane number distribution
is. This is an example of inferential statistics, where we infer something
about a population from a sample.
o In probability problems we assume we know the characteristics of the
population and consider how samples will behave.
o In inferential statistics problems we use the sample to infer some
characteristics of the population.
More on Hurricane Problem
Statistics
• For the hurricane data we
collected data on the number of
Atlantic hurricanes for the 51
years between 1960 and 2010.
The mean number of hurricanes
was 6.176. (Add up the yearly counts
and divide the result by 51.) We
then plotted a histogram and
then the density function (we’ll
learn about this later) of a Poisson
distribution on the same plot.
Does the Poisson closely fit the
histogram? If yes, then we may
use the Poisson distribution to
predict numbers of hurricanes in
the future. This would be
inferential statistics.
Why Try Poisson?
• The Poisson probability
mass function is:
$P(X = k) = \dfrac{\lambda^{k} e^{-\lambda}}{k!}$
for k = 0, 1, 2, 3, 4, …, where λ is
some positive parameter
representing the mean of X and
k is the number of ‘successes’, in
this case hurricanes. For λ = 6.176
this plots as:
Does the plot look something
like the hurricane histogram? In
addition, the Poisson distribution
often appears when we are
counting random events which
occur over time.
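To see the shape of this distribution without the plot, the probabilities can be tabulated. The following Python sketch uses the standard Poisson formula, probability of k equals exp(-λ) λ^k / k!, with λ = 6.176 from the hurricane data; the function name `poisson_pmf` is an assumption for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events when the mean is lam."""
    return exp(-lam) * lam**k / factorial(k)


lam = 6.176  # mean yearly hurricane count from the 1960-2010 data
for k in range(16):
    print(k, round(poisson_pmf(k, lam), 4))

# The earlier question: probability of exactly 5 hurricanes next year.
p_five = poisson_pmf(5, lam)
print(p_five)
```

Under this model the probability of exactly 5 hurricanes in a year is a little under 16%, and the largest probabilities sit at 5, 6, and 7 hurricanes.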
Statistics Definition
• The discipline of statistics provides methods for
organizing and summarizing data and for drawing
conclusions based on information contained in the
data.
• This includes methods for
o Designing experiments to collect data
o Extracting information from data through organization, summary and
display.
o Making decisions and predictions in the presence of uncertainty and
variation.
Populations, Data, Samples
• Engineers and scientists are constantly exposed to
collections of facts, or data
• Usually the data comes from a well-defined
collection of objects we are interested in, called the
population.
• Examples of possible populations:
o All students at John Jay College
o The quality of all light bulbs being made by General Electric
• Usually we are interested in some characteristic of
the population. If we measure that characteristic for
all members of the population then we have taken
a census. Usually a census is impractical or
unfeasible. Then we just measure the characteristic
of interest for a well-chosen sample from the
population. A sample is a subset of the population.
Variables
• A variable is any characteristic whose value may
change from one object to another in the
population. We shall initially denote variables by
lowercase letters from the end of our alphabet.
Examples include
• x = brand of calculator owned by a student
• y = number of visits to a particular Web site during a
specified period
• z = braking distance of an automobile under
specified conditions
Variables and Samples
• Usually data for our census or sample results from
making observations either on a single variable or
simultaneously on two or more variables.
• A univariate data set consists of observations on a
single variable.
• For example, we might determine the type of
transmission, automatic (A) or manual (M), on each
of ten automobiles recently purchased at a certain
dealership, resulting in the categorical data set
• M A A A M A A M A A
Variables and Samples II
• The following sample of lifetimes (hours) of brand D
batteries put to a certain use is a numerical
univariate data set:
• 5.6 5.1 6.2 6.0 5.8 6.5 5.8 5.5
• We have bivariate data when observations are
made on each of two variables.
• An example of bivariate data would be measuring
the height and weight of each subject, so an
observation might be a pair like (68, 132).
• When each observation involves two or more
variables then we talk about multivariate data.
Branches of Statistics
• An investigator who has collected data may wish
simply to summarize and describe important
features of the data. This entails using methods from
descriptive statistics.
• So creating any summary of data from a sample is
part of descriptive statistics. This may involve
o Graphical techniques such as creating histograms, boxplots, dotplots,
scatter plots, pie charts, etc.
o Calculation of numerical summary measures, such as means, standard
deviations, and correlation coefficients.
• Much of the display and calculation needed
in descriptive statistics is done with dedicated
statistical software packages. We will use R.
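As a small illustration of numerical summary measures, the Python sketch below summarizes the eight brand D battery lifetimes listed earlier. The course uses R, so this is only an analogous sketch. The quartiles come from `statistics.quantiles` with `method="inclusive"`, which interpolates between data points; other quartile conventions give slightly different values.

```python
import statistics

# Brand D battery lifetimes (hours) from the univariate example.
lifetimes = [5.6, 5.1, 6.2, 6.0, 5.8, 6.5, 5.8, 5.5]

# Cut points that split the sorted data into four equal parts.
q1, median, q3 = statistics.quantiles(lifetimes, n=4, method="inclusive")

summary = {
    "min": min(lifetimes),
    "1st quartile": q1,
    "median": median,
    "mean": statistics.mean(lifetimes),
    "3rd quartile": q3,
    "max": max(lifetimes),
    "std dev": statistics.stdev(lifetimes),  # sample standard deviation
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```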
Branches of Statistics II
• Having obtained a sample from a population, an
investigator would frequently like to use sample
information to draw some type of conclusion (make
an inference of some sort) about the population.
• That is, the sample is a means to an end rather than
an end in itself. Techniques for generalizing from a
sample to a population are gathered within the
branch of our discipline called inferential statistics.
• Mathematical probability theory is needed to
develop inferential statistics and study the field.
• We start with a short introduction to sampling and
descriptive statistics.
Sampling
• The population we are interested in may be a finite,
identifiable, unchanging collection of individuals or
objects. Then we are considering an enumerative
study.
o An example would be if we are interested in the average GPA of all
students attending John Jay College in the Spring 2017 semester.
o In an enumerative study a sampling frame is a listing of all items in the
population that might be part of the sample. Hopefully, but not always,
the sampling frame coincides with the population.
• For example, the frame might consist of all signatures on a petition to
qualify a certain initiative for the ballot in an upcoming election; a
sample is usually selected to ascertain whether the number of valid
signatures exceeds a specified value.
• An analytic study is just defined as one which is not
enumerative.
o For example, we might be interested in the average weight of a
McDonald’s Quarter Pounder Hamburger. This is not a fixed list. More
quarter pounders are cooked every day.
Collecting Data
• If data is not properly collected, an investigator
may not be able to answer the questions under
consideration with a reasonable degree of
confidence. Garbage in, Garbage out!!!
• A common problem is that the target population,
the one about which conclusions are to be
drawn, may be different from the population
actually sampled (the sampling frame). See the
following slide for a famous case.
Landon in a Landslide: The Poll That Changed Polling
The 1936 presidential election proved a decisive battle, not only in
shaping the nation’s political future but for the future of opinion
polling. The Literary Digest, the venerable magazine founded in
1890, had correctly predicted the outcomes of the 1916, 1920, 1924,
1928, and 1932 elections by conducting polls. These polls were a
lucrative venture for the magazine: readers liked them;
newspapers played them up; and each “ballot” included a
subscription blank. The 1936 postal card poll claimed to have
asked one fourth of the nation’s voters which candidate they
intended to vote for. In Literary Digest's October 31 issue, based on
more than 2,000,000 returned post cards, it issued its prediction:
Republican presidential candidate Alfred Landon would win 57
percent of the popular vote and 370 electoral votes.
Has anyone heard about President Landon? Why was Literary
Digest’s poll off by so much?
1936 Presidential Election
Look up “1936 Presidential Election” in Wikipedia for the
information on the left and a fuller discussion of the election.
It looks like Landon did not get 57% of the vote.
1936 Presidential Election
• Literary Digest’s sampling frame for its survey was
not representative of the voters in 1936. Although it
had polled 10 million individuals (only about 2.4
million of whom responded, still an astronomical
number for any survey), it had surveyed first its own
readers, a group with disposable incomes well above
the national average of the time. The Digest used two
other lists, registered automobile owners and telephone
subscribers. Pretty much all of these lists contained
people who had jobs. In 1936 we were in the depths of
the Great Depression, and many voters did not have
jobs. Hence the poll erred. (Literary Digest in
Wikipedia)
2016 Presidential Election
• Almost all polls for the 2016 presidential election in
the United States predicted that Hillary Clinton
would defeat Donald Trump. In each state several
polls were taken at various times by various
methods. The polling ‘experts’ weighted these polls
in various complicated manners in coming up with
the final prediction.
• The following are the results of simulating the results
of the election in numbers of electoral college votes
won by Hillary Clinton. The simulation was done the
night before the election using probabilities
appearing on the web site fivethirtyeight.com
2016 Presidential Election
• In the following, simelcttot is the vector of simulated
electoral college votes for Hillary Clinton. The simulation
was done 10000 times. The descriptive statistics are fairly
self-explanatory.
The next morning fivethirtyeight.com changed its prediction
somewhat to give Hillary Clinton about a 69% chance of winning.
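The general shape of such a simulation can be sketched as follows. This is a Python illustration with made-up toy inputs: the three "states", their electoral votes, and their win probabilities are hypothetical placeholders, not the fivethirtyeight.com figures or the code behind simelcttot. It also assumes states are decided independently, which real forecasts do not.

```python
import random
import statistics

# Hypothetical toy inputs: name -> (electoral votes, assumed win probability).
toy_states = {
    "State A": (10, 0.80),
    "State B": (20, 0.55),
    "State C": (8, 0.30),
}


def simulate_votes(states, n_sims=10_000, seed=2016):
    """Simulate total electoral votes won, one independent draw per state."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    totals = []
    for _ in range(n_sims):
        total = sum(votes for votes, p_win in states.values() if rng.random() < p_win)
        totals.append(total)
    return totals


totals = simulate_votes(toy_states)
print(min(totals), statistics.median(totals), statistics.mean(totals), max(totals))

# Estimated chance of winning a majority of the 38 toy electoral votes.
print(sum(1 for t in totals if t > 19) / len(totals))
```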
2016 Presidential Election
Most of the ‘experts’ used previous information to figure
out how to weight the results. Simply said, the probabilities used in the
predictions involved calculations of probabilities conditional on previous
information about the pollsters, the district, the country and their
relationships.
The fivethirtyeight.com prediction was actually one of the better
predictions. In fact, the following link is a discussion by Nate Silver of
fivethirtyeight.com of why his predictions were so ‘good.’
http://fivethirtyeight.com/features/why-fivethirtyeight-gave-trump-a-better-chance-than-almost-anyone-else/
Sampling Methods – Simple Random Sample
• When data collection entails selecting individuals or objects
from a frame, the simplest method for ensuring a
representative selection is to take a simple random sample.
This is one for which any particular subset of the specified size
(e.g., a sample of size 100) has the same chance of being
selected as any other. Easy to say, not always so easy to do.
• A simple random sample is considered the gold standard
when we want to obtain information about a population from
a sample.
1969 Draft Lottery
• December 1, 1969 marked the date of the first draft lottery
held since 1942. This drawing determined the order of
induction for men born between January 1, 1944 and
December 31, 1950. A large glass container held 366 blue
plastic balls containing every possible birth date and affecting
men between 18 and 26 years old.
• WASHINGTON, Jan. 3 The new draft lottery is being challenged
by statisticians and politicians on the ground that the selection
process did not produce a truly random result.
Results
http://en.wikipedia.org/wiki/Draft_lottery_(1969)
1969 Draft Lottery
• The method used seems to be random. What
happened? This was on TV. (Nobody trusted the
government at that time.) Balls were segregated by
month. Each month’s balls were dropped into the
container in sequence with thorough mixing between.
So at the end all balls were (supposedly) mixed. Then
the balls were picked 1 by 1, usually by famous people.
The first number picked corresponded to a birthday (e.g., 32
was February 1). All men with that birthday were drafted
before anyone else. And so on.
• What happened? End of year tended to be picked
early. (I was happy. My birthday is in March.)
• The selection wasn’t random. It’s difficult to select a
random sample.
Simple Random Sample by Computer
• Computers are currently one of the most reliable
mechanisms for creating a simple random sample.
• Look at the following session from the R statistical
package.
• Starting from a list of the integers from 1 to 20, the sample
function produced a pseudo-random sample of size 5: 8 5 13
20 18.
• Results from computerized procedures tend to be
superior to humanly generated results.
• If the army had had computers and used the code on the
following page, then the draft lottery results might have
been fairer.
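A computer-drawn sample of this kind can be sketched as below. This is a Python analogue, not the R session or the draft lottery code from the slides; the seed is arbitrary, and 1968 is used only as a convenient leap year for listing all 366 possible birth dates.

```python
import random
from datetime import date, timedelta

rng = random.Random(1969)  # arbitrary seed so the run is repeatable

# Simple random sample of size 5 from the integers 1 to 20, without replacement.
sample_of_five = rng.sample(range(1, 21), 5)
print(sample_of_five)

# Draft lottery sketch: put all 366 birth dates in a random order.
start = date(1968, 1, 1)  # any leap year works; only month and day are used
birthdays = [start + timedelta(days=d) for d in range(366)]
draft_order = birthdays[:]
rng.shuffle(draft_order)

for rank, day in enumerate(draft_order[:5], start=1):
    print(rank, day.strftime("%B %d"))
```

Every ordering of the 366 dates is intended to be equally likely here, so no month is systematically favored, unlike the poorly mixed physical drawing.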
Code for Draft Lottery
New York Lottery Take 5
Sampling Methods – Stratified Sample
• Stratified sampling entails separating the
population units into non-overlapping groups and
taking a simple random sample from each
one.
• For example, a store wants information on incomes
of customers buying TVs. They break these buyers
into Sony, Samsung, and Other and then take a
random sample from each group. Later this data is
combined in some way.
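A stratified sample for the TV example might be drawn as in the Python sketch below. The buyer IDs, group sizes, and the function name `stratified_sample` are hypothetical, made up only to show the mechanics: one simple random sample is taken within each non-overlapping group.

```python
import random


def stratified_sample(strata, n_per_stratum, seed=0):
    """Take a simple random sample of up to n_per_stratum units from each group."""
    rng = random.Random(seed)
    return {
        name: rng.sample(units, min(n_per_stratum, len(units)))
        for name, units in strata.items()
    }


# Hypothetical buyer IDs, grouped by the brand of TV purchased.
buyers = {
    "Sony": [f"sony-{i}" for i in range(40)],
    "Samsung": [f"samsung-{i}" for i in range(60)],
    "Other": [f"other-{i}" for i in range(25)],
}

chosen = stratified_sample(buyers, n_per_stratum=5)
for brand, sample in chosen.items():
    print(brand, sample)
```

Taking the same number from each group is only one possible design; sampling in proportion to group size is another, and the choice affects how the results are later combined.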
Sampling Methods
• A convenience sample is obtained by selecting
individuals or objects without systematic
randomization.
o For example, suppose I wish to determine the mean weight of students at
John Jay College. I conveniently choose as my sample the students in
Mat 301 and weigh each one. Are my results representative of the weights
of all John Jay students?
o This is not considered a reliable sampling method.
• With a self-selected sample the respondents
individually decide whether to take part.
o For example a radio show host wants to determine who is the best singer
so asks his listeners to text in a name. Only interested listeners answer.
o A school runs a satisfaction survey and asks students their ratings for
various services. The survey is sent to all students but only 5% of students
reply. One result is that 82% of respondents reported they were happy with
school e-mail service. The administration used this number in its planning
for the next year. Reliable?
What’s Next?
Descriptive
Statistics