Topic 01 - Dept. of Statistics, Texas A&M University
Department of Statistics
TEXAS A&M UNIVERSITY
STAT 211
Instructor: Webster West
Register for the online materials
• Go to http://dl.stat.tamu.edu/dostat/
• Click on Register here
• Specify account info and click Submit
• Click on log in
• Enter your account info and click on Log in!
• Click on Add course
• Enter the following information
– Course reference:
– Registration code:
Topic 1: Data collection and summarization
• Populations and samples - pages 5 - 8
• Frequency distributions - pages 14 - 17
• Histograms - pages 17 - 20
• Mean, median, variance and standard deviation - pages 24 - 28
• Quartiles, interquartile range - pages 29 - 31
• Boxplots - pages 31 - 33
What is Statistics?
• What do you think of when you hear the
word “statistics”?
• Statistics: The science of collecting,
classifying, and interpreting data.
• Anticipated learning outcomes:
– appreciate and apply basic statistical
methods in an everyday life setting
– appreciate and apply basic statistical
methods in their scientific field
Collecting data
• Observational study: Observe a group and
measure quantities of interest. This is
passive data collection in that one does not
attempt to influence the group. The purpose
of the study is to describe the group.
• Experiment: Deliberately impose
treatments on groups in order to observe
responses. The purpose is to study whether
the treatments cause a change in the
responses.
Observational Study Terms
• Population: The entire group of interest
• Sample: A part of the population selected to
draw conclusions about the entire population
• Census: A sample that attempts to include the
entire population
• Parameter: A fixed unknown number that
describes the population
• Statistic: A number produced from a sample
that estimates a population parameter
Horry County SC Murder Case
• Do juries properly represent the racial makeup
of Horry County, which is 13% African American?
• 295 jurors summoned, 22 were African
American
• What is the population parameter of interest?
• What sample statistic could be used to estimate
the parameter?
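One natural reading of the two questions above can be sketched in a few lines of Python. The sketch assumes the parameter of interest is the proportion of African Americans among all people eligible to be summoned, and that the sample statistic is the corresponding proportion among the 295 summoned jurors. The variable names are illustrative only.

```python
# Figures quoted on the slide.
summoned = 295            # jurors summoned (the sample)
african_american = 22     # African American jurors among them
county_share = 0.13       # stated African American share of Horry County

# Sample statistic: proportion of summoned jurors who are African American.
sample_proportion = african_american / summoned

print(f"sample proportion = {sample_proportion:.3f}")
print(f"county share      = {county_share:.3f}")
```

Whether a gap of this size could arise by chance alone is a question for later topics; the sketch only computes the statistic.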
Experiment Terms
• Experimental Group: A collection of
experimental units subjected to a real
treatment.
• Control Group: A collection of experimental
units subjected to the same conditions as
those in an experimental group except that
no treatment is imposed.
• This design helps control for potential
confounding effects.
NCTR study
• A large-scale study was conducted to see if a new
drug might have potential toxic effects.
• Dose groups of 0, 100, 200, and 400 ppg were
evaluated for liver tumors at the end of a two-week
exposure to the drug.
• What comparisons would you want to make?
• Should you evaluate each group on consecutive
days at the end of the study?
Analyzing data with StatCrunch
• StatCrunch is a statistical software package that
runs through a Web browser like Internet Explorer.
• You can access StatCrunch for free via DoStat.
• If you are not on the TAMU system, you will need to
enter the passcode,
.
• When you access the StatCrunch site, the window
below will appear. Click on the Run button.
All about variables
• Variable: Any characteristic or quantity to be
measured on units in a study
• Categorical variable: Places a unit into one of
several categories
– Examples: Gender, race, political party
• Quantitative variable: Takes on numerical
values for which arithmetic makes sense
– Examples: SAT score, number of siblings, cost of textbooks
• Univariate data has one variable.
• Bivariate data has two variables.
• Multivariate data has three or more variables.
Cereal data
What types of variables do we have in this data set?
– mfr: A = American Home; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina
– type: cold or hot
– calories: calories per serving
– protein: grams of protein
– fat: grams of fat
– sodium: milligrams of sodium
– fiber: grams of dietary fiber
– carbo: grams of complex carbohydrates
– sugars: grams of sugars
– potass: milligrams of potassium
– vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
– shelf: display shelf (1, 2, or 3, counting from the floor)
– weight: weight in ounces of one serving
– cups: number of cups in one serving
– rating: a rating of the cereal
Summarizing a single categorical variable
• Frequency - number of times the value occurs in the data
• Relative frequency - proportion of the data with the value
• Cereal Data
mfr   Frequency   Relative Frequency
A      1          0.012987013
G     22          0.2857143
K     23          0.2987013
N      6          0.077922076
P      9          0.116883114
Q      8          0.103896104
R      8          0.103896104
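The frequency and relative frequency columns can be reproduced with a short Python sketch. It rebuilds a stand-in for the mfr column from the counts in the table (the raw cereal file is not reproduced here) and then tallies it; the variable names are illustrative only.

```python
from collections import Counter

# Manufacturer counts copied from the table above.
counts = {"A": 1, "G": 22, "K": 23, "N": 6, "P": 9, "Q": 8, "R": 8}

# Expand into one code per cereal, as the raw mfr column would look.
mfr_column = [code for code, k in counts.items() for _ in range(k)]

frequency = Counter(mfr_column)          # number of times each value occurs
n = len(mfr_column)                      # total number of cereals
relative_frequency = {code: freq / n for code, freq in frequency.items()}

for code in sorted(frequency):
    print(f"{code}  {frequency[code]:>3}  {relative_frequency[code]:.4f}")
```

The relative frequencies always sum to 1, which is a quick check on any such table.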
Analyzing a single quantitative variable
• Consider the concentration data which contains
the concentration of suspended solids in parts
per million at 50 locations along a river.
• What is a typical concentration along the river?
• How much spread is there in the concentrations
along the river?
• Typical is generally characterized by the center
of the data
• Spread is generally reported as an interval
containing most of the data
Histograms
• Histogram - bar graph of binned data where the
height of the bar above each bin denotes the
frequency (or relative frequency) of values in the bin
• Typical concentration?
• Spread?
• Roughly how many concentrations below 50?
• StatCrunch
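As a rough illustration of the binning behind a histogram, the Python sketch below counts values in equal-width bins and prints a text version of the bars. The data values, the bin width of 10, and the helper name `bin_counts` are assumptions made for the example; they are not the course's concentration data or StatCrunch's method.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical concentrations (ppm), not the course data set.
values = [27, 34, 41, 45, 48, 52, 56, 58, 61, 61, 63, 66, 68, 72, 75, 83]

start, width, n_bins = 20, 10, 7   # bins 20-29, 30-39, ..., 80-89
counts = bin_counts(values, start, width, n_bins)

for k, c in enumerate(counts):
    lo = start + k * width
    print(f"{lo:>3}-{lo + width - 1:<3} {'#' * c}  ({c}, rel. freq. {c / len(values):.2f})")

# Number of values below 50: add up the bins that end at or before 50.
below_50 = sum(counts[: (50 - start) // width])
print("below 50:", below_50)
```

The tallest bar suggests a typical value, and the range of bins holding most of the counts suggests the spread.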
Choosing the number of histogram bins
• General rule: # of bins ≈ √(# of observations)
• Choosing the number of bins for a histogram can
be tricky! Consider the Old Faithful data.
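The effect of the bin count can be sketched in Python. The example assumes the general rule is the common square-root rule of thumb, and it uses made-up waiting times with two clusters rather than the actual Old Faithful data; `bin_counts` is a hypothetical helper.

```python
import math

def bin_counts(values, n_bins):
    """Split the range of the data into n_bins equal-width bins and count values in each."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - low) / width), n_bins - 1)  # put the maximum in the last bin
        counts[k] += 1
    return counts

# Hypothetical waiting times (minutes) with two clusters; not the Old Faithful data.
values = [50, 52, 53, 55, 56, 58, 60, 78, 79, 80, 81, 82, 83, 84, 85, 86]

rule_of_thumb = round(math.sqrt(len(values)))   # square-root rule (assumed)
for n_bins in (2, rule_of_thumb, 8):
    print(n_bins, bin_counts(values, n_bins))
```

With only two bins the counts look nearly even; with more bins the empty stretch between the two clusters becomes visible.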
Describing the shape of quantitative data
• Symmetric data has roughly the same
mirror image on each side of a center value.
• Skewed data has one side (either right or
left) which is much longer than the other
relative to the mode (peak value).
• The above definitions are most useful when
describing data with a single mode.
• Multimodal data has more than one mode.
• Beware of outliers when describing shape.
• Shape of the concentration data?
States data from 1996
• Define the shape of each variable.
– POVERTY: percentage of the state population living in poverty
– CRIME: violent crime rate per 100,000 population
– COLLEGE: percentage of the state's population who are enrolled in college
– METRO: percentage of the state population living in a metropolitan area
– INCOME: median household income in 1996 dollars
• Where does TX fall for each variable?
Stem and leaf plots
• Separate each value into a stem (all but the
rightmost digit) and a leaf (the rightmost digit)
• Write unique sorted stems in a vertical column
• Add each leaf to the right of its stem in
increasing order
• StatCrunch
Variable: concentration
2 : 7
3 : 024
3 : 6779
4 : 002
4 : 56778
5 : 03
5 : 66689
6 : 1111222
6 : 55566899
7 : 012
7 : 55679
8 : 3
8 : 7
9 : 1
9 : 5
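The three construction steps above can be sketched in Python as follows. The function name `stem_and_leaf` and the data values are made up for illustration; this is not StatCrunch's implementation.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return a list of (stem, leaves) pairs for non-negative integers."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # stem: all but the last digit; leaf: last digit
        rows[stem].append(leaf)
    return [(stem, "".join(str(d) for d in rows[stem])) for stem in sorted(rows)]

# Hypothetical values, not the concentration data.
values = [27, 30, 32, 34, 36, 41, 45, 45, 53, 58]

for stem, leaves in stem_and_leaf(values):
    print(f"{stem} : {leaves}")
```

This version prints one row per stem. The concentration display appears to split each stem into two rows (leaves 0-4 and 5-9), which is a common variation when rows would otherwise get long.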
Histograms vs. Stem and leaf plots
• Stem and leaf plots (typically) display actual
data values whereas histograms do not
• Stem and leaf plots are more useful for
small data sets (fewer than 100 values)
• Histograms can be constructed for larger
data sets
Summary statistics for quantitative data
• Measures of center (typical)
– The sample median is the middle observation if
the values are arranged in increasing order.
– The sample mean of n observations is the
average, the sum of the values divided by n.
$X_1, \ldots, X_n$ represents n data values
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
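A small Python sketch of both measures of center, using made-up values. It computes the mean and median directly from the definitions and then checks them against the standard library's `statistics` module.

```python
import statistics

# Hypothetical data values X_1, ..., X_n.
x = [3, 7, 8, 12, 20]
n = len(x)

mean_by_formula = sum(x) / n            # sum of the values divided by n
median_by_sorting = sorted(x)[n // 2]   # middle observation (n is odd here)

# The standard library gives the same answers.
assert mean_by_formula == statistics.mean(x)
assert median_by_sorting == statistics.median(x)

print(mean_by_formula, median_by_sorting)
```

When n is even there is no single middle observation; `statistics.median` then returns the average of the two middle values.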
Summary statistics for quantitative data
• (100×p)th percentile - the value such that p×100% of
values are below it and (1-p)×100% are above it
– first quartile (Q1) is the 25th percentile
– second quartile (Q2) is the 50th percentile (median)
– third quartile (Q3) is the 75th percentile
• 5-number summary: Min, Q1, Q2, Q3, Max
– Boxplots: Stacking boxplots can be very useful for
comparing multiple groups
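The 5-number summary can be sketched in Python with the standard library. Software packages use slightly different conventions for quartiles, so the "inclusive" method chosen here is an assumption and may not match StatCrunch or the textbook on every data set; the helper name and data are illustrative.

```python
import statistics

def five_number_summary(values):
    """Min, Q1, Q2, Q3, Max using the 'inclusive' quartile convention."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical data, already sorted for readability.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(values))
```

These five numbers are exactly what a basic boxplot draws: the box spans Q1 to Q3 with a line at Q2.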
Summary statistics for quantitative data
• Measures of spread:
– Interquartile range, IQR = Q3-Q1, the range of the
middle 50% of the data
– sample variance, $s^2$, is the sum of squared deviations
from the sample mean divided by n-1
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
– sample standard deviation, s, is the square root of
sample variance. Preferred because it has the same units
as the data.
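The three measures of spread can be computed directly from their definitions, as in the Python sketch below. The data values are made up, and the "inclusive" quartile convention used for the IQR is an assumption, since software packages differ on how quartiles are interpolated.

```python
import math
import statistics

# Hypothetical data values.
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
x_bar = sum(x) / n

# Sample variance: squared deviations from the mean, summed, divided by n - 1.
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
# Sample standard deviation: square root of the variance, in the units of the data.
s = math.sqrt(s2)

# Interquartile range: Q3 - Q1.
q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1

print(f"s^2 = {s2:.3f}, s = {s:.3f}, IQR = {iqr}")
```

The hand-computed values agree with `statistics.variance` and `statistics.stdev`, which also divide by n - 1.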
Cereal data
• Compare rating across shelf.
Comparing measures of center and spread
• The sample mean and the sample standard
deviation are good measures of center and
spread, respectively, for symmetric data
• If the data set is skewed or has outliers, the
sample median and the interquartile range
are more commonly used
• Mean versus median
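A quick Python sketch of why the median is preferred for skewed data or data with outliers. The values are made up; the point is only to compare how the two measures react when one extreme value is added.

```python
import statistics

# Hypothetical, roughly symmetric data.
symmetric = [30, 32, 35, 38, 40]
# The same data with one extreme value added.
with_outlier = symmetric + [400]

for label, data in [("symmetric", symmetric), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean = {statistics.mean(data):.1f}, "
          f"median = {statistics.median(data):.1f}")
```

The single outlier pulls the mean far from the bulk of the data, while the median barely moves.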
Case Study: Salary data
• A fictitious large university decides to study the
salaries of its graduates. A survey of 2232 recent
graduates from engineering and education majors
was conducted.
• The salary data consists of three variables:
– Gender: Male or Female
– Major: Education or Engineering
– Salary: Reported in $
• What types of variables do we have?
Salary data by major
• Are both majors equally represented in the survey?
• Do salaries differ across major?
Salary data by gender
• Are both genders equally represented in the survey?
• Do salaries differ across gender? Discrimination?
Salary data by gender within each major
• How do male and female salaries compare in engineering?
• How do male and female salaries compare in education?
• What’s going on?
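One possibility worth checking is that the genders are not equally spread across the two majors. The Python sketch below uses entirely made-up records (not the survey data) in which salaries are identical for men and women within each major, yet the overall averages differ because of who is in which major. The helper `mean_salary_by` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (gender, major, salary); not the survey data.
records = (
    [("Male", "Engineering", 60000)] * 3 + [("Male", "Education", 40000)] * 1
    + [("Female", "Engineering", 60000)] * 1 + [("Female", "Education", 40000)] * 3
)

def mean_salary_by(key):
    """Average salary for each group, where key maps a record to its group label."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {group: mean(salaries) for group, salaries in groups.items()}

by_gender = mean_salary_by(lambda r: r[0])
by_gender_and_major = mean_salary_by(lambda r: (r[0], r[1]))

print(by_gender)
print(by_gender_and_major)
```

In this toy example the gap between genders disappears once major is taken into account. Whether that is what happens in the salary data has to be checked with the actual plots.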
Let’s Make a Deal
• This is motivation to study probability.
• Should you switch or should you stay with
your original choice?