Class 5 Lecture: Probability and the Normal Curve


Sociology 5811:
Lecture 6: Probability,
Probability Distributions,
Normal Distributions
Copyright © 2005 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• Problem set #1 Due Today
• Problem Set #2 handed out today; due in a week
• Class Schedule
• Done with univariate stats
• Starting probability today
Review: Z-Score
• The Z-score: One way to assess relative
placement of cases in a distribution
– Can be used for comparisons, like quantiles
• Converts all values of variables to a new scale,
with mean of zero, S.D. of 1
– Scores typically run from about –3 to +3
• Formula:
$$Z_i = \frac{d_i}{s_Y} = \frac{Y_i - \bar{Y}}{s_Y}$$
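As a rough illustration of the Z-score formula, the following Python sketch standardizes a short list of values. The data and the function name `z_scores` are hypothetical, and the sample standard deviation (n - 1 denominator) is assumed for s_Y.

```python
import statistics

def z_scores(values):
    """Convert each value to a Z-score: (Y_i - mean) / sample S.D."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample S.D. (n - 1 denominator), an assumption
    return [(y - mean) / sd for y in values]

# Hypothetical data, for illustration only
years = [10, 12, 12, 14, 16, 20]
zs = z_scores(years)
for y, z in zip(years, zs):
    print(f"Y = {y:2d}  Z = {z:+.2f}")
```

By construction, the rescaled values have a mean of zero and an S.D. of 1.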
Probability Defined
• Definition: “The probability of a particular
outcome is the proportion of times that outcome
would occur in a long run of repeated
observations” (Agresti & Finlay 1997, p. 81)
• Probability of event A defined as p(A):
$$p(A) = \frac{\text{outcomes in which } A \text{ occurs}}{\text{total number of outcomes}}$$
• Example: Coin Flip… probability of “heads”
– 1 outcome is “heads”, 2 total possible outcomes
– p(“heads”) = 1 / 2 = .5
Probability
• Question: What is the probability of picking a
red marble out of a bowl with 2 red and 8 green?
$$p(\text{red}) = \frac{\text{outcomes in which red occurs}}{\text{total number of outcomes}}$$
– There are 2 outcomes that are red
– There are 10 total possible outcomes
– p(red) = 2 divided by 10
– p(red) = .20
Frequencies and Probability
• Note: The probability of picking a color relates
to the frequency of each color in the jar
– 8 green marbles, 2 red marbles, 10 total
– p(Green) = .8 p(Red) = .2
• For nominal or ordinal variables:
$$p(x) = \frac{f(x)}{N}$$
• Where f(x) is the frequency of x in a sample
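A minimal Python sketch of the frequency-to-probability rule, using the marble bowl from the example (2 red, 8 green). The function name `probabilities` is made up for illustration.

```python
from collections import Counter

# The bowl from the example: 2 red and 8 green marbles
bowl = ["red"] * 2 + ["green"] * 8

def probabilities(cases):
    """Return p(x) = f(x) / N for each distinct value x."""
    freq = Counter(cases)   # f(x): frequency of each value
    n = len(cases)          # N: total number of cases
    return {value: count / n for value, count in freq.items()}

p = probabilities(bowl)
print(p["red"], p["green"])  # 0.2 0.8
```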
Frequency Charts and Probability
[Figure: frequency chart of HIGHEST YEAR OF SCHOOL COMPLETED, GSS Data (N=2904). Y-axis: Frequency, 0 to 1000; x-axis: years of school, 0 to 20. Callouts: note that the total N is 2904; note that 392 individuals have 16 years of education.]
Probabilities: Nominal/Ordinal
• Height of bars in a frequency chart reflects the
probability of choosing cases from our dataset
• If we pulled a case at random from our data, what is
the probability of choosing a person with 16 years of education?
• Notation: p(Y=16)
• Computed as number of people with 16 years of
education (frequency) divided by total N:
$$p(Y = 16) = \frac{f(Y = 16)}{N} = \frac{392}{2904} = .135$$
Probability Distributions
• In a frequency plot, the height of bars reflects
frequency
• Dividing each value by N converts a chart to a
“probability distribution”
• Indicating the probability of choosing an individual with a
given value of Y
• Entire plots can be converted to probability
distributions
• Shape of the distribution is preserved
• Height of bar represents probabilities rather than
frequencies.
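The conversion from a frequency chart to a probability distribution can be sketched in Python as follows. The frequency table here is hypothetical (it is not the GSS data); only the final line uses figures given in the lecture (392 of 2904 cases).

```python
# Hypothetical frequency table (value of Y -> frequency); not the GSS data
frequencies = {12: 50, 14: 20, 16: 25, 18: 5}

def to_probability_distribution(freq_table):
    """Divide each frequency by N so bar heights become probabilities."""
    n = sum(freq_table.values())
    return {y: f / n for y, f in freq_table.items()}

dist = to_probability_distribution(frequencies)
for y, p in dist.items():
    print(f"p(Y={y}) = {p:.3f}")

# The one GSS value given in the lecture: 392 of 2904 cases have Y = 16
print(round(392 / 2904, 3))  # 0.135
```

Because every bar is divided by the same N, the relative heights (the shape) stay the same, and the probabilities sum to 1.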
Probability Distribution Example
[Figure: probability distribution of HIGHEST YEAR OF SCHOOL COMPLETED. Y-axis (labeled Percent): 0 to .440; x-axis: years of school, 0 to 20. Callout: as we calculated, p(Y=16) = .135.]
Probability: Continuous Variables
• Continuous measures can take on an infinite
number of values
• So, it doesn’t make sense to think of the probability of
picking any exact value
• 1. Typically, only one case has a given value
• The sample may contain a case with 16.238908 years of
education: p(Y=16.238908) = 1/N
• 2. Most exact values have a frequency of 0
• Ex: 0 cases with 16.48900242 years of education
• The probability p(Y=16.48900242) is zero.
Continuous Distributions
• Continuous distributions can be approximated by
connecting peaks of a histogram:
[Figure: histogram with a line connecting the peaks of the bars. Callout: the line approximates the height of bars for all values of Y.]
Continuous Probability Distributions
• For continuous probability distributions:
• Probabilities are not associated with single values
• e.g., the probability that Y=16
• Instead, probabilities are associated with a range
of values
• e.g., the probability that Y is between 15 and 20
• These are visually represented by the area under a
distribution between 15 and 20
$$p(Y \text{ in a range}) = \frac{\text{Area under curve in range}}{\text{Total area under curve}}$$
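One way to see the area idea in action is to approximate the areas numerically. The Python sketch below uses the trapezoid rule on a hypothetical triangle-shaped curve over 0 to 20; the curve, the helper names `area` and `p_in_range`, and the step count are all assumptions made for illustration.

```python
def area(curve, lo, hi, steps=10_000):
    """Approximate the area under `curve` between lo and hi (trapezoid rule)."""
    width = (hi - lo) / steps
    total = 0.5 * (curve(lo) + curve(hi))
    for i in range(1, steps):
        total += curve(lo + i * width)
    return total * width

# Hypothetical curve: a triangle over 0..20 that peaks at Y = 10
def triangle(y):
    return max(0.0, 10 - abs(y - 10))

def p_in_range(curve, a, b, lo, hi):
    """Area under the curve between a and b, divided by the total area."""
    return area(curve, a, b) / area(curve, lo, hi)

p = p_in_range(triangle, 15, 20, 0, 20)
print(round(p, 3))  # approximately 0.125
```

For this triangle, the area between 15 and 20 is 12.5 out of a total of 100, so the probability is about .125.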
Continuous Probability Distributions
p(Red) = Red Area / (Red Area + Blue Area)
Probability Distributions: Notation
• Notation: α
• Greek alpha (α) is used to refer to a probability for a
continuous distribution
• Notation: p(15<Y<20) = α
• α = Probability of variable Y between 15 and 20
• You can also choose an open-ended range
• p(Y>.4) = α
• Or multiple ranges
• p(.2<Y<.4 or Y>8) = α
• Question: If p(Y>MdnY) = α, what is α?
Continuous Probability
Distributions Examples
• p(a<Y<b) = α
Continuous Probability
Distributions Examples
• p(Y<a) = α
Continuous Probability
Distributions Examples
• p(Y<a, Y>b) = α
The “Normal” Distribution
• A particular shape of symmetrical distribution
that comes up a lot
• Some biological phenomena have this distribution, such as
height, cholesterol levels
• Certain statistical regularities take this form
• It is a “Bell-Shaped” distribution
• Note: not all bell-shaped curves are normal distributions.
Example of a Normal Curve
[Figure: Normal Curve, Mean = .5, SD = .7. X-axis: -2.07 to 3.07; y-axis: 0.0 to 1.2.]
Normal Curves
• Normal Curves are a “family” of curves
• They all share the same general curvature and
formula
• But, there are infinite variations with different means,
standard deviations
• They have different centers (means) and are more or less
spread out.
• Examples of different normal curves:
• Mean for male height = 70 inches, S.D. = 4
• Mean for cholesterol = 182, S.D. = 38.
Formula for Normal Curves
• The shape of a normal probability distribution
can be expressed as a function:
$$p(Y) = \frac{e^{-(Y - \mu_Y)^2 / 2\sigma_Y^2}}{\sqrt{2\pi\sigma_Y^2}}$$
• Where:
– e refers to a constant (2.718)
– π refers to a constant (3.142)
– μ_Y refers to the mean of the normal curve
– σ_Y refers to the standard deviation of the normal curve.
Properties of Normal Curves
• If you choose a mean and s.d., you can plot a
corresponding normal curve
• Probability distributions can also be normal
• Remember: the proportion of area under the curve in a
given range is equal to the probability of picking someone
in that range
• Normal curves are useful because:
• The probability of cases falling in a certain range on a
normal curve is well known
• Thus, it is easy to determine p(a<Y<b)!
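To make the "choose a mean and s.d." step concrete, here is a small Python sketch that evaluates the standard normal-curve formula at a few points. It uses the male height figures from the lecture (mean = 70, S.D. = 4); the function name `normal_curve` is an assumption for illustration.

```python
import math

def normal_curve(y, mean, sd):
    """Height of the normal curve at y for a given mean and standard deviation."""
    exponent = -((y - mean) ** 2) / (2 * sd ** 2)
    return math.exp(exponent) / math.sqrt(2 * math.pi * sd ** 2)

# Male height example from the lecture: mean = 70 inches, S.D. = 4
for height in range(58, 83, 4):
    print(height, round(normal_curve(height, 70, 4), 4))
```

The curve is highest at the mean and symmetric around it, so heights 66 and 74 get the same value.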
Properties of Normal Curves
• Normal curves have well-known properties:
– About 68% of area under the curve (and thus cases) falls
within 1 standard deviation of the mean
– About 95% of cases fall within 2 standard deviations
– About 99.7% of cases (often rounded to 99%) fall within 3 standard deviations
• In fact, the percentage can be easily determined
for any number of standard deviations (e.g.,
1.5 or 2.3890 standard deviations)
• Note: This is only true of normal curves
• You can’t apply these rules to non-normal distributions.
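The percentage for any number of standard deviations can be computed rather than looked up. This Python sketch relies on the standard identity that the share of a normal distribution within k standard deviations of the mean equals erf(k / sqrt(2)); the function name `share_within` is hypothetical.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 1.5, 2.3890):
    print(f"within {k} S.D.: {share_within(k):.4f}")
```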
Properties of Normal Curves
• The predictable link between standard deviations
and percent of cases falling near the mean makes
normal curves very useful
• 1. You can determine the probability associated
with any range of values around the mean
• e.g., there is a .95 probability that a person randomly chosen
will fall within 2 SD of mean
• 2. You can convert Z-scores (# standard
deviations) into something like a percentile
• If a case falls 3 standard deviations above the mean, it must
be at or above the 99th percentile.
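Converting a Z-score into a percentile can be sketched with the normal cumulative distribution, written here in terms of `math.erf`. This assumes the variable really is normally distributed; the function name `percentile_from_z` is made up for illustration.

```python
import math

def percentile_from_z(z):
    """Percent of a normal distribution falling below a given Z-score."""
    return 100 * 0.5 * (1 + math.erf(z / math.sqrt(2)))

for z in (-1, 0, 1, 2, 3):
    print(f"Z = {z:+d}: {percentile_from_z(z):.2f}th percentile")
```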
Properties of Normal Curves
• Visually:
Question: Why
are these
referred to as
Z, 2Z, 3Z?
Normal Distribution: Example
• Male height is normally distributed
– Distribution: mean = 70 inches, S.D. = 4 inches
• Question: Is this a frequency distribution or a probability distribution?
[Figure: normal curve of male height. X-axis: 55 to 85 inches.]
Normal Distribution: Example
• Male height is normally distributed
• Distribution: mean = 70 inches, S.D. = 4 inches
• What is the range of heights that encompasses
99% of the population?
• Hint: that’s +/- 3 standard deviations
• Answer: 70 +/- (3)(4) = 70 +/- 12
• Range = 58" to 82"
• This is very useful information
• Ex: If you are designing a car to comfortably fit most
people.
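The arithmetic in this example, mean plus or minus a number of standard deviations, is simple enough to express as a short Python helper. The function name `range_within` is hypothetical.

```python
def range_within(mean, sd, k):
    """Return the interval mean +/- k standard deviations."""
    return mean - k * sd, mean + k * sd

# Male height: mean = 70 inches, S.D. = 4 inches, k = 3
low, high = range_within(70, 4, 3)
print(low, high)  # 58 82
```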
Normal Distribution: Example
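• About 99% of cases (more precisely, 99.7%) fall within 3 S.D. of the mean
– So a total of less than 1% fall above 82 inches or below 58 inches
[Figure: normal curve of male height with both tails marked. X-axis: 55 to 85 inches.]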
Normal Distributions and Inference
• The link between normal distributions and
probabilities allows us to draw conclusions
• Example: Suppose you are a detective
• You suspect that a person is taking an illegal drug
• One side-effect of the drug is that it raises cholesterol to
extremely high levels
• Strategy: Take a sample of blood from person
• Compare with known distribution for normal people
• Observation: Blood cholesterol is 5 standard deviations
above the mean…
Normal Distributions and Inference
• What can you tell by knowing cholesterol is 5
standard deviations above the mean?
• About 99% are within 3 standard deviations, 1% or less are not
• A much lower percentage fall 5 S.D.'s from the mean
• Based on properties of a normal curve:
• Only .000000287 of cases fall 5 or more S.D.'s above the
mean
• Conclusion: It is improbable that the person is
not taking drugs
• But, in a world of 6 billion people, there are 1,722 such
people – you can’t be absolutely certain…
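The tail figure quoted above can be checked with the normal upper-tail probability, sketched here in Python using `math.erfc`. The function name `upper_tail` is an assumption; the 6 billion population figure is the one used in the lecture.

```python
import math

def upper_tail(z):
    """Proportion of a normal distribution falling z or more S.D.'s above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p = upper_tail(5)
expected_people = p * 6_000_000_000  # world population figure from the lecture

print(f"{p:.3e}")
print(round(expected_people))
```

The unrounded probability gives a count slightly different from 1,722, which comes from multiplying the rounded value .000000287 by 6 billion.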
Samples and Populations
• Issue: As social scientists, we wish to describe
and understand large sets of people (or
organizations or countries)
– School achievement of American teenagers
– Fertility of individuals in Indonesia
– Behavior of organizations in the auto industry
• Problem: It is seldom possible to collect data on
all relevant people (or organizations or countries)
that we hope to study.
Samples and Populations
• How can we calculate the mean or standard
deviation for a population, without data on most
individuals?
– Without even knowing the total N of the population?
• Are we stuck?
• IDEA: Maybe we can gain some understanding
of large groups, even if we have information
about only some of the cases within the group
– We can examine part of the group and try to make
intelligent guesses about what the entire group is like.
Populations Defined
• Population: The entire set of persons, objects, or
events that have at least one common characteristic
of interest to a researcher (Knoke, p. 15)
• Populations (and things we’d like to study)
– Voting age Americans (their political views)
– 6th grade students attending a particular school (their
performance on a math test)
– People (their response to a new AIDS drug)
– Small companies (their business strategies).
Population: Defined
• People in those populations have one common
characteristic, even if they are different in many
other ways
– Example: Voting-age Americans may differ wildly,
but they share the fact that they are voting-age
Americans
• Beyond literal definition, a population is the
general group that we wish to study and gain
insight into.
Sample: Defined
• Sample: A subset of a population
– Any subset, chosen in any way
– But, manner of choosing makes some samples more
useful than others
– Datasets are usually samples of a larger population
• Beyond literal definition, sample often means
“the group that we have data on”.
Statistical Inference: Defined
• Our Goal: to describe populations
• However, we only have data on a sample (a
subset) of the population
• We hope that studying a sample will give us some
insight into the overall population
• Statistical Inference: making statistical
generalizations about a population from evidence
contained in a sample (Knoke, 77).
Statistical Inference
• When is statistical inference likely to work?
• 1. When a sample is large
– If a sample approaches the size of the population, it is
likely to be a good reflection of that population
• 2. When a sample is representative of the entire
population
– As opposed to a sample that is atypical in some way,
and thus not reflective of the larger group.
Random Samples
• One way to get a representative sample is by
choosing one randomly
• Definition: A sample chosen from a population
such that each observation has an equal chance
of being selected (Knoke, p. 77)
– Probability of selection:
$$p(\text{selection}) = \frac{1}{N}$$
• Randomness is one strategy to avoid “bias”, the
circumstance when a sample is not representative
of the larger population.
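Drawing a random sample can be sketched in Python with the standard library's `random.sample`, which selects cases without replacement so that every case is equally likely to be chosen. The population of 1,000 case IDs, the sample size, and the seed are all hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n cases without replacement, each case equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

population = list(range(1, 1001))  # hypothetical population of N = 1000 case IDs
sample = simple_random_sample(population, 25, seed=5811)

print(sorted(sample))
print("p(selection) on a single draw =", 1 / len(population))
```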
Biased Samples: Examples
• Biased samples can lead to false conclusions
about characteristics of populations
• What are the problems with these samples?
– Internet survey asking people the number of CDs they
own (population = all Americans)
– Telephone survey conducted during the day of
political opinions (pop = voting age Americans)
– Survey of an Intro Psych class on causes of stress and
anxiety (pop = All humans)
– Survey of Fortune 500 firms on reasons that firms
succeed (pop = all companies).
Statistical Inference
• Statistical inference involves two tasks:
• 1. Using information from a sample to estimate
properties of the population
• 2. Using laws of statistics and information from
the sample to determine how close our estimate is
likely to be
– We can determine whether or not we are confident in
our assessment of a population
Statistical Inference Example
• Population: Students in the United States
• Sample: Individuals in this classroom
• Question: What is the mean number of CD’s
owned by students in the US?
– Goal #1: Use information on students in this class to
guess the mean number of CD’s owned by students in
the US
– Goal #2: Try to determine how close (or far off) our
estimate of the population mean might be. Estimate
the quality of the guess.
• Part #2 helps prevent us from drawing
inappropriate conclusions from #1
Population and Sample Notation
• Characteristics of populations are called
parameters
• Characteristics of a sample are called statistics
• To keep things straight, mathematicians use
Greek letters to refer to populations and Roman
letters to refer to samples
– Mean of sample is: Y-bar
– Mean of population is Greek mu: μ
– Standard deviation of sample is: s
– Standard deviation of a population is lower case
Greek sigma: σ
Population and Sample Notation
• An estimate of a population parameter based on
information from a sample is called a “point
estimate”
– Example of a point estimate: Based on this sample,
I’d guess that the mean # of CDs owned by students
in the U.S. is 47.
• Formulas to estimate a population parameter from
a sample are “estimators”
Estimation: Notation
• We often wish to estimate population parameters,
using information from a sample we have
• We may use a variety of formulas to do this
• Mathematicians identify estimates of population
parameters in formulas by placing a caret (“^” )
over the parameter
– The caret is called a “hat”
– An estimate of σ is called “sigma-hat”
– Symbol: σ̂
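Population and Sample Distributions
[Figure: a population distribution labeled with μ and σ, and a sample distribution labeled with Y-bar and s.]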
Populations and Samples
• Population parameters (μ, σ) are constants
– There is one true value, but it is unknown
• Sample statistics (Y-bar, s) are variables
– Up until now we’ve treated them as constants
– There are many possible samples, and thus many possible
values for each
– In fact, the range of possible values makes up a distribution –
the “sampling distribution”
• This provides insight into the probable location of the
population mean
– Even if you only have one single sample to look at
– This “trick” lets us draw conclusions!!!
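The idea that sample statistics vary from sample to sample can be illustrated with a small simulation. In this Python sketch the population is entirely made up (10,000 simulated values), and the sample size of 50 and the 1,000 repetitions are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: 10,000 simulated values (made-up numbers)
population = [rng.gauss(47, 15) for _ in range(10_000)]
mu = statistics.mean(population)

# Draw many random samples and record the mean of each one
sample_means = [
    statistics.mean(rng.sample(population, 50))
    for _ in range(1_000)
]

print("population mean:", round(mu, 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("S.D. of sample means:", round(statistics.stdev(sample_means), 2))
```

Each sample gives a somewhat different Y-bar, but the sample means cluster around the population mean, which is why a single sample can still be informative.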