Class 6 Lecture: Samples and Populations


Sociology 5811:
Lecture 6: Samples, Populations
Copyright © 2005 by Evan Schofer
Do not copy or distribute without
permission
Announcements
• Problem set #2 due next Tuesday, Sept 27
Problem Set: Z-table
• Several problems require looking up area under
the normal curve associated with certain Z-scores
• Requires use of “Z-table”
• Found on Knoke, p. 459
• Issue: We know that about 95% of the area under a normal
curve falls within +/- 2 standard deviations
• Thus: Area under normal curve from Z = -2 to Z = 2 is
approximately .95
• Area left of Z = -2 and right of Z = 2, combined, is about .05
• But, what if we want area for a value like 1.4?
• Z-table lists areas for all values!
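The table lookups can also be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later and the standard-library `statistics.NormalDist` class; the helper names `area_zero_to_z` and `area_beyond_z` are illustrative and do not come from Knoke.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def area_zero_to_z(z):
    """Area under the standard normal curve between 0 and z (for z >= 0)."""
    return STD_NORMAL.cdf(z) - 0.5

def area_beyond_z(z):
    """Area in the upper tail, to the right of z."""
    return 1.0 - STD_NORMAL.cdf(z)

for z in (0.40, 1.4, 2.0):
    print(f"Z = {z:.2f}: area 0 to Z = {area_zero_to_z(z):.4f}, "
          f"area beyond Z = {area_beyond_z(z):.4f}")
```

For any Z, the two areas add up to .5 (half of the curve), and doubling the "0 to Z" area gives the area from -Z to +Z.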
Problem Set: Z-table
• Let’s look at Z = .40
• Area from 0 to Z: .15
• Area beyond +Z: .35
• Question: What is the area from -Z to +Z?
Review: Probability
• Probability of event A defined as p(A):
$$p(A) = \frac{\text{number of outcomes in which } A \text{ occurs}}{\text{total number of outcomes}}$$
• “The probability of a particular outcome is the
proportion of times that outcome would occur in
a long run of repeated observations” (Agresti &
Finlay 1997, p. 81)
• Example: p(red) = 2/10 = .20
Probability
• Question: What is the probability of picking red
twice in a row (assuming you replaced the red
one after you picked)?
• Answer: Because the draws are independent, the combined
probability is the product of the individual probabilities
• Each probability is .20
• .20 x .20 = .04
• Under 5% chance!
• Conclusion: If you pick many times, you are
unlikely to continually get atypical colors
• It can happen, but it is very improbable.
• Ex: Picking red 5 times: Probability is .00032.
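The multiplication above can be checked with a few lines of code. This is a minimal sketch: `prob_repeated` and `simulate_repeated` are hypothetical helper names, and the simulation assumes each draw is independent with the same probability, as when the item is replaced after every pick.

```python
import random

def prob_repeated(p_single, n_draws):
    """Probability that an outcome with probability p_single occurs
    on every one of n_draws independent draws."""
    return p_single ** n_draws

def simulate_repeated(p_single, n_draws, trials=100_000, seed=0):
    """Estimate the same probability by repeating the experiment many times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One trial succeeds only if every draw comes up "red"
        if all(rng.random() < p_single for _ in range(n_draws)):
            hits += 1
    return hits / trials

print("red twice, exact:    ", prob_repeated(0.20, 2))
print("red twice, simulated:", simulate_repeated(0.20, 2))
print("red 5 times, exact:  ", prob_repeated(0.20, 5))
```

The simulated proportion should land close to the exact value, which illustrates the "long run of repeated observations" definition of probability.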
Review: Probability Distributions
• Both nominal/ordinal and continuous measures
can be conceived of as probability distributions
– Nominal/Ordinal: Height of bars indicates probability
of picking someone with that value
– Continuous: Can’t be graphed in separate bars
• Instead, a continuous curve approximates probability
• Area under curve = probability of picking a case within a
given range of values.
Review: Probability Distributions
• Notation:
– Greek alpha (α) is used to refer to probabilities in a
range for a continuous distribution
• P(Y<a) = α
• P(Y<a, Y>b) = α
Review: Normal Distributions
• Normal curves have well-known properties:
• About 68% of area under the curve (and thus cases) falls within 1
standard deviation of the mean
• About 95% of cases fall within 2 standard deviations
• About 99.7% of cases fall within 3 standard deviations
• Percentages translate directly into probabilities
• Thus, it is easy to determine the probability
associated with any range around the mean
• e.g., there is a .95 probability that a person randomly chosen
will fall within 2 SD of mean
• This property makes normal curves very useful!
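These percentages can be reproduced from the normal curve itself. A minimal sketch, assuming Python 3.8+ and `statistics.NormalDist`; the function name `share_within` is illustrative only.

```python
from statistics import NormalDist

def share_within(k, mu=0.0, sigma=1.0):
    """Share of a normal distribution that lies within k standard
    deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
```

The result does not depend on which mean and standard deviation are passed in, which is why the same rule applies to any normally distributed variable.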
Samples and Populations
• Issue: As social scientists, we wish to describe
and understand large sets of people (or
organizations or countries)
• School achievement of American teenagers
• Fertility of individuals in Indonesia
• Behavior of organizations in the auto industry
• Problem: It is seldom possible to collect data on
all relevant people (or organizations or countries)
that we hope to study.
Samples and Populations
• How can we calculate the mean or standard
deviation for a population, without data on most
individuals?
• Without even knowing the total N of the population?
• Are we stuck?
• IDEA: Maybe we can gain some understanding
of large groups, even if we have information
about only some of the cases within the group
• We can examine part of the group and try to make
intelligent guesses about what the entire group is like.
Populations Defined
• Population: The entire set of persons, objects, or
events that have at least one common characteristic
of interest to a researcher (Knoke, p. 15)
• Populations (and things we’d like to study)
• Voting age Americans (their political views)
• 6th grade students attending a particular school (their
performance on a math test)
• People (their response to a new AIDS drug)
• Small companies (their business strategies).
Population: Defined
• People in those populations have one common
characteristic, even if they are different in many
other ways
• Example: Voting age Americans may differ wildly, but they
share the fact that they are voting aged Americans
• Beyond literal definition, a population is the
general group that we wish to study and gain
insight into.
Sample: Defined
• Sample: A subset of a population
• Any subset, chosen in any way
• But, manner of choosing makes some samples more useful
than others
• Datasets are usually samples of a larger population
• Beyond literal definition, sample often means
“the group that we have data on”.
Statistical Inference: Defined
• Our Goal: to describe populations
– However, we only have data on a sample (a subset) of
the population
– We hope that studying a sample will give us some
insight into the overall population
• Statistical Inference: making statistical
generalizations about a population from evidence
contained in a sample (Knoke, 77).
Statistical Inference
• When is statistical inference likely to work?
• 1. When a sample is large
• If a sample approaches the size of the population, it is likely
to be a good reflection of that population
• 2. When a sample is representative of the entire
population
• As opposed to a sample that is atypical in some way, and
thus not reflective of the larger group.
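Both conditions can be illustrated with a small simulation. This is only a sketch on made-up data: the population (20,000 normally distributed scores with mean 50 and S.D. 15) and the "top half only" biased sample are assumptions chosen for illustration, not examples from the lecture.

```python
import random
from statistics import fmean

rng = random.Random(42)

# Hypothetical population of 20,000 scores
population = [rng.gauss(50, 15) for _ in range(20_000)]
true_mean = fmean(population)

def avg_abs_error(sample_size, repeats=200):
    """Average distance between a random sample's mean and the true mean."""
    errors = []
    for _ in range(repeats):
        sample = rng.sample(population, sample_size)
        errors.append(abs(fmean(sample) - true_mean))
    return fmean(errors)

small_error = avg_abs_error(10)     # small random samples
large_error = avg_abs_error(1000)   # large random samples

# A large but unrepresentative sample: drawn only from the top half of scores
top_half = sorted(population)[len(population) // 2:]
biased_error = abs(fmean(rng.sample(top_half, 1000)) - true_mean)

print(f"typical error, n = 10 (random):   {small_error:.2f}")
print(f"typical error, n = 1000 (random): {large_error:.2f}")
print(f"error, n = 1000 (top half only):  {biased_error:.2f}")
```

Larger random samples land closer to the population mean, while the unrepresentative sample stays far off no matter how many cases it contains.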
Random Samples
• One way to get a representative sample is by
choosing one randomly
• Definition: A sample chosen from a population
such that each observation has an equal chance
of being selected (Knoke, p. 77)
– Probability of selection: $p(\text{selection}) = \frac{1}{N}$
• Randomness is one strategy to avoid “bias”, the
circumstance when a sample is not representative
of the larger population.
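The equal-chance property can be checked empirically. A minimal sketch using Python's standard `random` module; `selection_frequencies` is a hypothetical helper, and the population is just the integers 0 to N-1 standing in for cases.

```python
import random
from collections import Counter

def selection_frequencies(population_size, sample_size, trials, seed=0):
    """Draw many random samples and report how often each case is chosen."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        # random.sample picks without replacement, each case equally likely
        counts.update(rng.sample(range(population_size), sample_size))
    return [counts[i] / trials for i in range(population_size)]

# Single draws from a population of N = 10: each case should appear
# in about 1/N = .10 of the trials
freqs = selection_frequencies(population_size=10, sample_size=1, trials=50_000)
print([round(f, 3) for f in freqs])
```

With a sample of n cases drawn without replacement, each case's chance of ending up in the sample is n/N, still equal across cases.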
Biased Samples: Examples
• Biased samples can lead to false conclusions
about characteristics of populations
• What are the problems with these samples?
– Internet survey asking people the number of CDs they
own (population = all Americans)
– Telephone survey conducted during the day of
political opinions (pop = voting age Americans)
– Survey of an Intro Psych class on causes of stress and
anxiety (pop = All humans)
– Survey of Fortune 500 firms on reasons that firms
succeed (pop = all companies).
Statistical Inference
• Statistical inference involves two tasks:
• 1. Using information from a sample to estimate
properties of the population
• 2. Using laws of statistics and information from
the sample to determine how close our estimate is
likely to be
– We can determine whether or not we are confident in
our assessment of a population
Statistical Inference Example
• Population: Students in the United States
• Sample: Individuals in this classroom
• Question: What is the mean number of CDs
owned by students in the US?
• Goal #1: Use information on students in this class to guess
the mean number of CDs owned by students in the US
• Goal #2: Try to determine how close (or far off) our
estimate of the population mean might be. Estimate the
quality of the guess.
• Part #2 helps prevent us from drawing
inappropriate conclusions from #1
Population and Sample Notation
• Characteristics of populations are called
parameters
• Characteristics of a sample are called statistics
• To keep things straight, mathematicians use
Greek letters to refer to populations and Roman
letters to refer to samples
– Mean of sample is: Y-bar
– Mean of population is Greek mu: μ
– Standard deviation of sample is: s
– Standard deviation of a population is lower case
Greek sigma: σ
Population and Sample Notation
• An estimate of a population parameter based on
information from a sample is called a “point
estimate”
– Example of a point estimate:
• Based on this sample, I estimate that the mean # of CDs
owned by students in the U.S. is 47.
• Formulas to estimate a population parameter from
a sample are “estimators”.
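As a concrete illustration, the sample mean and sample standard deviation can serve as estimators. The data below are invented so that the mean comes out to 47, matching the example; they are not real survey results. Note that `statistics.stdev` divides by n - 1, which is an assumption about which estimator is wanted.

```python
from statistics import mean, stdev

# Hypothetical sample: number of CDs owned by six students (made-up data)
cds_owned = [12, 30, 45, 50, 61, 84]

y_bar = mean(cds_owned)   # sample mean: point estimate of the population mean
s = stdev(cds_owned)      # sample standard deviation (n - 1 denominator)

print(f"Point estimate of the mean: {y_bar}")
print(f"Sample standard deviation:  {s:.2f}")
```

Here `mean` and `stdev` play the role of estimators (the formulas), while the numbers they return for this particular sample are the point estimates.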
Estimation: Notation
• We often wish to estimate population parameters,
using information from a sample we have
• We may use a variety of formulas to do this
• Mathematicians identify estimates of population
parameters in formulas by placing a caret (“^” )
over the parameter
– The caret is called a “hat”
– An estimate of σ is called “sigma-hat”
– Symbol: σ̂
Populations and Samples
• Population parameters (μ, σ) are constants
• There is one true value, but it is usually unknown
• Sample statistics (Y-bar, s) are variables
• Up until now we’ve treated them as constants
• But, there are many possible samples
• Different samples yield different values of the mean & S.D.
– Like any variable, the mean and S.D. have a
distribution!
• Called the “sampling distribution”
• Made up of the values of the statistic across all possible
samples from a given population
• We’ll discuss it later…
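A quick simulation shows why a sample statistic behaves like a variable. This is a sketch on an invented population (10,000 normally distributed values with mean 200 and S.D. 40); `sample_means` is a hypothetical helper, and 1,000 samples only approximate the full set of possible samples.

```python
import random
from statistics import fmean, stdev

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw many random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [fmean(rng.sample(population, sample_size))
            for _ in range(n_samples)]

# Hypothetical population
pop_rng = random.Random(1)
population = [pop_rng.gauss(200, 40) for _ in range(10_000)]

means = sample_means(population, sample_size=25, n_samples=1_000)

print(f"population mean:           {fmean(population):.1f}")
print(f"smallest sample mean:      {min(means):.1f}")
print(f"largest sample mean:       {max(means):.1f}")
print(f"S.D. of the sample means:  {stdev(means):.1f}")
```

The population mean is a single fixed number, while the sample means spread out around it; that spread is what the sampling distribution describes.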
Population and Sample Distributions
[Figure: population distribution (μ, σ) and sample distribution (Y-bar, s)]
Population Distributions
• Population distributions are typically conceived
of as probability distributions
• Because we don’t usually see the whole thing… We just
pull individuals out based on relative probability
• Some populations are finite and could be graphed as
a raw frequency plot or histogram (examples?)
• Many populations are infinite, can’t ever be graphed as a
frequency plot/histogram (examples?)
• The main thing that matters about a population is
how likely you are to pick a person with a given
value (or in a range of values).
Populations and Samples: Overview

| | Population | Sample |
|---|---|---|
| Characteristics | “parameters” | “statistics” |
| Characteristics are: | constant (one for population) | variables (varies for each sample) |
| Notation | Greek (μ, σ) | Roman (Y-bar, s) |
| Estimate | “hat”: σ̂ | “point estimate” based on sample |
Review: Normal Distribution
• Example: Blood Cholesterol
• normally distributed
• mean = 200
• S.D. = 40
• What is the range of cholesterol that encompasses
95% of the population?
• Answer: 200 +/- (2)(40) = 200 +/- 80
– Range = 120 to 280
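The same answer can be checked against the normal curve directly. A minimal sketch, assuming Python 3.8+ and `statistics.NormalDist`; the "2 standard deviations" rule is a rounding of the more precise 1.96, so the exact 95% range is slightly narrower.

```python
from statistics import NormalDist

# Cholesterol distribution from the example: mean 200, S.D. 40
cholesterol = NormalDist(mu=200, sigma=40)

# Rule-of-thumb range: mean +/- 2 standard deviations
low, high = 200 - 2 * 40, 200 + 2 * 40
share = cholesterol.cdf(high) - cholesterol.cdf(low)
print(f"{low} to {high} covers {share:.4f} of the population")

# Range that covers exactly the middle 95%
exact_low = cholesterol.inv_cdf(0.025)
exact_high = cholesterol.inv_cdf(0.975)
print(f"exact middle 95%: {exact_low:.1f} to {exact_high:.1f}")
```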
Normal Distributions and Inference
• The link between normal distributions and
probabilities allows us to draw conclusions
• Example: Suppose you are a detective
• You suspect that a person is taking an illegal drug
• One side-effect of the drug is that it raises cholesterol to
extremely high levels
• Strategy: Take a sample of blood from person
• Compare with known distribution for normal people
• Observation: Blood cholesterol is 5 standard deviations
above the mean…
Normal Distributions and Inference
• What can you tell by knowing cholesterol is 5
standard deviations above the mean?
• About 99.7% are within 3 standard deviations, about 0.3% not
• A much lower percentage fall 5 S.D’s from the mean
• Based on properties of a normal curve:
• Only .000000287 of cases fall 5 or more S.D’s above the
mean
• Conclusion: It is improbable that the person is
not taking drugs
• But, in a world of 6 billion people, there are 1,722 such
people – you can’t be absolutely certain…
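The tail probability and the head count can be reproduced as follows. This is a sketch using the standard normal upper tail computed with `math.erfc`; the helper name `upper_tail` is illustrative, and the 6 billion figure is the one used in the lecture.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable, computed with the
    complementary error function for accuracy far out in the tail."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p_five_sd = upper_tail(5.0)
world_population = 6_000_000_000  # figure used in the lecture
expected_people = p_five_sd * world_population

print(f"P(5 or more S.D. above the mean) = {p_five_sd:.3e}")
print(f"expected number of such people   = {expected_people:.0f}")
```

The lecture's 1,722 comes from multiplying the rounded value .000000287 by 6 billion; the unrounded tail probability gives roughly 1,720.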