Hatfield.Topic 1 - Department of Statistics

Department of Statistics
TEXAS A&M UNIVERSITY
STAT 211
Instructor: Keith Hatfield
1
Topic 1: Data collection and summarization
• Populations and samples
• Frequency distributions
• Histograms
• Mean, median, variance and standard deviation
• Quartiles, interquartile range
• Boxplots
What is Statistics?
• What do you think of when you hear the
word “statistics”? (sports, boring, not applicable to my field of
study)
• Statistics: The science of collecting,
classifying, and interpreting data.
• Anticipated learning outcomes:
– appreciate and apply basic statistical
methods in an everyday life setting
(Election polls, clinical trials, lies, big lies & statistics)
– appreciate and apply basic statistical
methods in their scientific field
3
Collecting data
• Observational study
– Observe a group and measure quantities of interest.
– This is passive data collection in that one does not
attempt to influence the group.
– The purpose of the study is to describe the group.
• Experimental study
– Deliberately impose treatments on groups in order to
observe responses.
– The purpose is to study whether the treatments cause
a change in the responses
4
Observational Study Terms
• Population: The entire group of interest
• Sample: A part of the population selected to draw
conclusions about the entire population
• Census: A sample that attempts to include the
entire population
• Parameter: A number that describes the
population
• Statistic: A number produced from a sample that
estimates a population parameter
5
Horry County SC, Murder Case
• Do juries properly represent the racial makeup
of Horry County which is 13% African American?
• What is the population parameter of interest?
• What sample statistic could be used to estimate
the parameter and does the sample support the
claim?
• 295 jurors summoned, 22 were African
American
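For the jury question above, one natural sample statistic is the proportion of summoned jurors who were African American. The short Python sketch below computes it from the two counts on the slide; the variable names are illustrative only and are not part of the course materials.

```python
# Counts taken from the slide: 295 jurors summoned, 22 African American.
summoned = 295
african_american = 22

# Sample statistic: the sample proportion.
p_hat = african_american / summoned

# Population figure quoted on the slide for Horry County.
county_share = 0.13

print(f"sample proportion = {p_hat:.3f}")
print(f"county proportion = {county_share:.3f}")
print(f"difference        = {p_hat - county_share:+.3f}")
```

The sample proportion comes out at roughly 7.5%, below the 13% county figure. Whether a gap of that size could plausibly arise by chance alone is a separate question that this summary does not answer.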
6
Experiment Terms
• Experimental Group: A collection of
experimental units subjected to a difference
in treatment, imposed by the experimenter.
• Control Group: A collection of experimental
units subjected to the same conditions as
those in an experimental group except that
no treatment is imposed.
• This design helps control for potential
confounding effects.
7
What are “confounding” effects?
• When you have multiple factors in a study and you
can’t tell which factor causes a change in the
variable of interest.
• Example: Does going to church make you live
longer? Not necessarily. There are too many other
factors, or “lurking variables” (discussed later).
• Best to set up study with everything else constant
and have only one factor changed. That way, you’re
more apt to identify that the change in the variable
is due to the change you instituted in the study.
8
NCTR study (National Center for Toxicological Research)
• A large scale study was conducted to see if a new drug might have
potential toxic effects. They used rats for the experiment.
• Dose groups of 0, 100, 200, and 400 ppg were evaluated for liver
tumors at the end of a two-week exposure to the drug. (Which is
the control and which are the experimental groups?)
• What comparisons would you want to make?
• Should you evaluate each group on consecutive days at the end of
the study?
9
Analyzing data with StatCrunch
• StatCrunch is a statistical software package that runs
through a Web browser.
• You can access StatCrunch once you have registered and
created an account ($$). See the information tab in
eCampus for details.
• No tutorials for StatCrunch, but demonstrations of how to
perform basic tasks and tests will be done in class.
• Note that the homework uses StatCrunch. Several
datasets will be given in the homework and in class
examples. I don’t advise using your calculator for this
purpose as it can be tedious and lead to input errors.
10
All about variables
• Variable: Any characteristic or quantity to be measured on
units in a study
• Categorical variable: Places a unit into one of several
categories
– Examples: Gender, race, political party
• Quantitative variable: Takes on numerical values for which
arithmetic makes sense
– Examples: SAT score, number of siblings, cost of textbooks
• Univariate data has one variable.
• Bivariate data has two variables.
• Multivariate data has three or more variables.
11
Cereal data
mfr: A = American Home; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina
type: cold or hot
calories: calories per serving
protein: grams of protein
fat: grams of fat
sodium: milligrams of sodium
fiber: grams of dietary fiber
carbo: grams of complex carbohydrates
sugars: grams of sugars
potass: milligrams of potassium
vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
shelf: display shelf (1, 2, or 3, counting from the floor)
weight: weight in ounces of one serving
cups: number of cups in one serving
rating: a rating of the cereal
12
Summarizing a single categorical variable
• Frequency - number of times the value occurs in the data
• Relative frequency - proportion of the data with the value
mfr   Frequency   Relative Frequency
A     1           0.012987013
G     22          0.2857143
K     23          0.2987013
N     6           0.077922076
P     9           0.116883114
Q     8           0.103896104
R     8           0.103896104
Cereal data
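As a rough illustration of how the frequency and relative frequency columns relate, the Python sketch below rebuilds a column of manufacturer codes from the counts in the table and tallies it. The raw cereal file is not reproduced here, so the rebuilt column is a stand-in, and the variable names are assumptions made for this example.

```python
from collections import Counter

# Manufacturer counts copied from the frequency table on the slide.
counts = {"A": 1, "G": 22, "K": 23, "N": 6, "P": 9, "Q": 8, "R": 8}

# Stand-in for the raw data column: one manufacturer code per cereal.
mfr_column = [code for code, k in counts.items() for _ in range(k)]

# Frequency: number of times each value occurs.
frequency = Counter(mfr_column)
n = sum(frequency.values())

# Relative frequency: frequency divided by the total number of observations.
relative_frequency = {code: frequency[code] / n for code in frequency}

for code in sorted(frequency):
    print(f"{code}  {frequency[code]:>3}  {relative_frequency[code]:.4f}")
```

The relative frequencies always add up to 1, which is a quick check on a table like this one.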
13
Analyzing a single quantitative variable
• Consider the concentration data which contains the
concentration of suspended solids in parts per million at 50
locations along a river.
• What is a typical concentration? (Generally characterized by
the center of the data)
• How much spread is there in the concentrations along the
river? (Generally, the relative “width” of the data…how
dispersed they are around the center)?
– Wide versus narrow and the inherent good and bad things about
spread.
– Discuss the difference in typical and spread if taken at a single
point on the river, versus several points along the river.
14
Histograms
• Histogram - bar graph of binned or grouped data where the
height of the bar above each bin denotes the frequency (relative
frequency) of values in the bin
• Typical concentration?
• Spread?
• Roughly how many
concentrations below 50?
15
Choosing the number of histogram bins
• General rule: # of bins ≈ √(# of observations)
– Most stat packages will do this for you, but sometimes you may
want to change the number of bins or categories, depending on
what you want the data to convey….
• Following is a sample of historical geyser eruptions
from Old Faithful in Yellowstone National Park.
Demonstration done in class, typical outputs
shown on next two slides show same data from
different perspectives.
Old Faithful data
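To make the effect of the bin count concrete, here is a minimal Python sketch. It assumes the common rule of thumb that the number of bins is about the square root of the number of observations, and it uses a small set of hypothetical values invented for this example (not the Old Faithful data). The function names are also illustrative.

```python
import math

def sqrt_rule_bins(n_obs):
    """Rule-of-thumb bin count: roughly the square root of the number of observations."""
    return max(1, round(math.sqrt(n_obs)))

def bin_counts(values, n_bins):
    """Count how many values fall in each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    counts = [0] * n_bins
    if hi == lo:
        # All values identical: everything lands in the first bin.
        counts[0] = len(values)
        return counts
    width = (hi - lo) / n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical values, invented only for illustration.
values = [43, 45, 47, 50, 51, 52, 54, 55, 58, 60, 62, 65, 70, 75, 78, 80]

k = sqrt_rule_bins(len(values))
print(k, bin_counts(values, k))
print(2 * k, bin_counts(values, 2 * k))  # same data, twice as many bins
```

Printing the counts for two different bin choices shows how the same data can look smoother or more jagged depending on the number of bins.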
16
Data presented from an alarmist point of view
17
Data presented from a “calming” point of view
18
Describing the shape of quantitative data
• Symmetric data has roughly the same mirror image on
each side of a center value.
• Skewed data has one side (either right or left) which is
much longer than the other relative to the mode (peak
value).
– The above definitions are most useful when describing data
with a single mode.
• Multimodal data has more than one mode.
• Beware of outliers when describing shape.
• Shape of the concentration data?
19
States data from 1996
• Define the shape of each variable.
POVERTY: percentage of the state population living in poverty
CRIME: violent crime rate per 100,000 population
COLLEGE: percentage of the state population enrolled in college
METRO: percentage of the state population living in a metropolitan area
INCOME: median household income in 1996 dollars
20
Shapes of states data – Percentage living in poverty
21
Shapes of states data – Violent crime rates per 100K
22
Shapes of states data - % living in metro area
23
Shapes of states data – Income
24
Summary statistics for quantitative data
• Measures of central tendency (typical)
– The sample median is the middle observation if the values are
arranged in increasing order.
– The sample mean of n observations is the average, the sum of
the values divided by n.
$X_1, \ldots, X_n$ represents n data values

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
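The two measures of center can be checked on a small data set. The sketch below uses the ten x values from the worked variance table in these slides and compares a by-hand calculation with Python's standard `statistics` module. Averaging the two middle values when n is even is a common convention and is an assumption here, since the slide only speaks of "the middle observation".

```python
import statistics

# The ten x values from the worked variance table in these slides.
x = [5, 4, 3, 2, 2, 5, 7, 3, 4, 9]

n = len(x)
mean_by_hand = sum(x) / n  # sum of the values divided by n

ordered = sorted(x)
mid = n // 2
# Assumed convention: with an even n, average the two middle values.
median_by_hand = (ordered[mid - 1] + ordered[mid]) / 2 if n % 2 == 0 else ordered[mid]

print(mean_by_hand, statistics.mean(x))
print(median_by_hand, statistics.median(x))
```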
25
Summary statistics for quantitative data
• pth percentile: the value such that p×100% of values are below it and (1−p)×100% are above it. (How to actually find the value? Multiply the
percentile by the # of observations and round up if necessary.)
– first quartile (Q1) is the 25th percentile
– second quartile (Q2) is the 50th percentile (median)
– third quartile (Q3) is the 75th percentile
• 5-number summary: Min, Q1, Q2, Q3, Max
– Boxplots: Stacking boxplots can be very useful for comparing multiple
groups (you’ll see in 2 slides).
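The "multiply and round up" rule above can be sketched in a few lines of Python. This is one reading of the slide's rule, applied to the ten x values from the worked variance table in these slides; the function name is hypothetical. Statistical packages often interpolate between neighboring values, so their quartiles may differ slightly from this rule.

```python
import math

def percentile_by_slide_rule(values, pct):
    """Value at 1-based position ceil(pct/100 * n) in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    position = max(1, math.ceil(pct * n / 100))
    return ordered[position - 1]

# The ten x values from the worked variance table in these slides.
x = [5, 4, 3, 2, 2, 5, 7, 3, 4, 9]

# 5-number summary: Min, Q1, Q2, Q3, Max
five_number = (
    min(x),
    percentile_by_slide_rule(x, 25),
    percentile_by_slide_rule(x, 50),
    percentile_by_slide_rule(x, 75),
    max(x),
)
print(five_number)
```

For these values the sorted data are 2, 2, 3, 3, 4, 4, 5, 5, 7, 9, so the rule picks the 3rd, 5th and 8th values for Q1, Q2 and Q3.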
26
• From the boxplot above
– Are more than 75% of the values below 80?
– Are more than 75% of the values above 40?
– What percentage of values fall roughly between 45 and
70?
– Is the data symmetrical?
– What are the approximate maximum and minimum
values?
27
Summary statistics for quantitative data
• Measures of spread:
– Interquartile range, IQR = Q3-Q1, the range of the middle
50% of the data
– sample variance, s^2, is the sum of squared deviations from the
sample mean divided by n−1

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
– sample standard deviation, s, is the square root of sample
variance. Preferred because it has the same units as the data.
28
Calculation of sample variance (partial from data)
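Obs      x     x bar   (x-xbar)   (x-xbar)^2   x^2
1        5     4.4      0.6        0.4          25
2        4     4.4     -0.4        0.2          16
3        3     4.4     -1.4        2            9
4        2     4.4     -2.4        5.8          4
5        2     4.4     -2.4        5.8          4
6        5     4.4      0.6        0.4          25
7        7     4.4      2.6        6.8          49
8        3     4.4     -1.4        2            9
9        4     4.4     -0.4        0.2          16
10       9     4.4      4.6        21.2         81
Totals   44             0          44.4         238

The deviations (x-xbar) sum to 0. The squared deviations are shown rounded to one decimal place; the total of 44.4 is computed from the unrounded values.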
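The totals in this table are enough to finish the calculation. The Python sketch below computes the sample variance and standard deviation from the ten x values, once from the definition and once from the equivalent shortcut that uses the x^2 column. The shortcut formula is a standard algebraic identity and is not stated on the slide, so treat it as an added assumption about why that column is there.

```python
import math
import statistics

# The ten x values from the table.
x = [5, 4, 3, 2, 2, 5, 7, 3, 4, 9]
n = len(x)
x_bar = sum(x) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
ss_dev = sum((xi - x_bar) ** 2 for xi in x)
s2 = ss_dev / (n - 1)
s = math.sqrt(s2)

# Equivalent shortcut using the x^2 column:
# (sum of x^2 - (sum of x)^2 / n) / (n - 1)
s2_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

print(round(ss_dev, 1), round(s2, 3), round(s, 3))
print(round(s2_shortcut, 3), round(statistics.variance(x), 3))
```

With a sum of squared deviations of 44.4 and n − 1 = 9, the sample variance is about 4.93 and the sample standard deviation about 2.22.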
29
Cereal data
• Compare rating across shelf…
– Numerically using StatCrunch “Summary Stats”
30
Cereal info – Comparative boxplots
• Boxplot/outliers – An example of comparative boxplots.
– Graphically using StatCrunch “Graphics>Boxplots”
31
Comparing measures of
central tendency and spread
• The sample mean and the sample standard deviation
are good measures of center and spread, respectively,
for symmetric data
• If the data set is skewed or has outliers, the sample
median and the interquartile range are more
commonly used.
• Note about trimmed mean.
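A small numerical experiment shows why the median is preferred when outliers are present. The Python sketch below takes the ten x values from the worked variance table in these slides and introduces a hypothetical data-entry error (9 recorded as 90). The `trimmed_mean` helper is an illustrative definition that drops a fixed number of values from each end; other definitions of the trimmed mean exist.

```python
import statistics

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for this many values")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# The ten x values from the worked variance table in these slides.
x = [5, 4, 3, 2, 2, 5, 7, 3, 4, 9]

# Hypothetical data-entry error: the largest value 9 recorded as 90.
x_outlier = [90 if xi == 9 else xi for xi in x]

for label, data in (("original", x), ("with outlier", x_outlier)):
    print(label,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "trimmed mean =", trimmed_mean(data, 1))
```

The single bad value moves the mean from 4.4 to 12.5, while the median and the trimmed mean are unchanged.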
32
Case Study: Salary data
• A fictitious large university decides to study the salaries of their
graduates. A survey was conducted of 2232 recent graduates
from engineering and education majors.
• The salary data consists of three variables:
– Gender: Male or Female
– Major: Education or Engineering
– Salary: Reported in $
• What types of variables do we have?
33
Salary data by major
• Are both majors equally represented in the survey?
• Do salaries differ across major?
34
Salary data by gender
• Are both genders equally represented in the survey?
Summary statistics for Salary:
Group by: Gender

Gender   n       Mean     Variance     Std. Dev.   Median   Min      Max
Female   1,088   41,108   97,633,984   9,881       36,369   33,070   64,279
Male     1,144   50,589   86,189,224   9,284       54,471   29,027   61,533
• Do salaries differ across gender? Discrimination?
35
Salary data by gender within each major
• How do male and female salaries compare in engineering?
Summary statistics for Salary:
Where: Major=Engineering
Group by: Gender

Gender   n     Mean     Variance    Std. Dev.   Median   Min      Max
Female   232   59,921   3,900,454   1,975       59,994   53,598   64,279
Male     924   55,022   4,146,587   2,036       55,019   48,019   61,533
• How do male and female salaries compare in education?
Summary statistics for Salary:
Where: Major=Education
Group by: Gender

Gender   n     Mean     Variance    Std. Dev.   Median   Min      Max
Female   856   36,009   1,004,212   1,002       36,009   33,070   39,411
Male     220   31,971   1,238,722   1,113       32,002   29,027   35,608

Please read the additional file for Topic 1 for more info
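The within-major and overall summaries can be reconciled with a weighted average. The Python sketch below uses only the group sizes and means reported in the summary tables of this case study; the dictionary layout and names are assumptions made for the example. It recovers each gender's overall mean from the two within-major means and shows how differently the two genders are split across majors.

```python
# (n, mean salary) for each gender within each major, copied from the summary tables.
groups = {
    "Female": {"Engineering": (232, 59921), "Education": (856, 36009)},
    "Male": {"Engineering": (924, 55022), "Education": (220, 31971)},
}

overall = {}
for gender, majors in groups.items():
    total_n = sum(n for n, _ in majors.values())
    # Overall mean = weighted average of the within-major means, weighted by n.
    overall[gender] = sum(n * mean for n, mean in majors.values()) / total_n
    share_eng = majors["Engineering"][0] / total_n
    print(f"{gender}: n = {total_n}, overall mean = {overall[gender]:.0f}, "
          f"share in engineering = {share_eng:.1%}")
```

Women have the higher mean salary within each major, yet the lower mean overall, because most women in the survey are education majors and most men are engineering majors. This kind of reversal is often called Simpson's paradox, and major acts as the lurking variable here.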
36