Reasoning with Statistics
Descriptive statistics
for one variable
What to describe?
What is the “location” or “center” of the
data? (“measures of location”)
How do the data vary? (“measures of
variability”).
Types of statistics
Descriptive Statistics
Gives numerical and graphic
procedures to summarize a
collection of data in a clear
and understandable way
Inferential Statistics
Provides procedures to
draw inferences about a
population from a sample
Reasons for using statistics
aid in summarization
aid in “getting at what’s going on”
aid in extracting “information” from the data
aid in communication
Frequency distribution
The frequency with which observations are
assigned to each category or point on a
measurement scale.
Most basic form of descriptive statistic
May be expressed as a percentage of the total sample
found in each category
Source : Reasoning with Statistics, by Frederick Williams &
Peter Monge, fifth edition, Harcourt College Publishers.
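As a minimal sketch of a frequency distribution, the snippet below counts how often each category occurs and expresses each count as a percentage of the total sample. The `responses` data are hypothetical and invented for illustration only.

```python
from collections import Counter

# Hypothetical nominal observations (invented for illustration).
responses = ["car", "bus", "car", "bike", "car", "bus", "walk", "car"]

# Frequency of each category.
counts = Counter(responses)
total = sum(counts.values())

# Percentage of the total sample found in each category.
percentages = {category: 100 * n / total for category, n in counts.items()}

for category, n in counts.most_common():
    print(f"{category:<5} {n:>2} {percentages[category]:5.1f}%")
```

Because the categories here are nominal, the order of the rows carries no meaning; sorting by count is only a display choice.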
Frequency distribution
The distribution is “read” differently depending
upon the measurement level
Nominal scales are read as discrete measurements at
each level (no ordering)
Ordinal measures show tendencies: categories can be
ranked, but the distances between them should not be
compared (ordering exists, but not distance)
Interval (distance exists, but no ratios) and ratio
scales (ratios exist) allow for comparison among
categories
Sex      N    Mean   Median  TrMean  StDev  SE Mean  Minimum  Maximum  Q1     Q3
female   126  91.23  90.00   90.83   11.32  1.01     65.00    120.00   85.00  98.25
male     100  96.79  110.00  105.62  17.39  1.74     75.00    162.00   95.00  118.75
[Figure: Fastest Ever Driving Speed, 226 Stat 100 Students, Fall 1998. Separate panels for 100 men and 126 women; horizontal axis: Speed (70 to 160).]
[Figure: Fastest Ever Driving Speed, 226 Stat 100 Students, Fall 1998. Speed (60 to 160) by Gender (female, male).]
Source: Protecting Children from Harmful Television: TV Ratings and the V-chip
Amy I. Nathanson, PhD Lecturer, University of California at Santa Barbara
Joanne Cantor, PhD Professor, Communication Arts, University of Wisconsin-Madison
Source: http://www.elonka.com/kryptos/
Web page on cryptography
Ancestry of US residents
Source: UCLA International Institute
Source: Cornell University website
Source: www.cit.cornell.edu/computer/students/bandwidth/charts.html
Source: Verisign
Search engine use
The percentage of online searches done by US
home and work web surfers in July 2006
Source: NY Times
Source: Verisign
Old Faithful Geyser
Duration in minutes of 272 eruptions of the Old
Faithful geyser, with the waiting time in minutes
to the next eruption.
> library(datasets)
> faithful[1:10,]
   eruptions waiting
1      3.600      79
2      1.800      54
3      3.333      74
4      2.283      62
5      4.533      85
6      2.883      55
7      4.700      88
8      3.600      85
9      1.950      51
10     4.350      85
> summary(faithful)
   eruptions        waiting
 Min.   :1.600   Min.   :43.0
 1st Qu.:2.163   1st Qu.:58.0
 Median :4.000   Median :76.0
 Mean   :3.488   Mean   :70.9
 3rd Qu.:4.454   3rd Qu.:82.0
 Max.   :5.100   Max.   :96.0
Normal distribution
Many characteristics are distributed through the
population in a ‘normal’ manner
Normal curves have well-defined statistical properties
Parametric statistics are based on the assumption that the
variables are distributed normally
These are the most commonly used statistics
This is the famous “Bell curve” where many
cases fall near the middle of the distribution and
few fall very high or very low
I.Q.
Statistical properties of
the normal distribution
I.Q. distribution
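One well-known statistical property of the normal curve is the share of cases that falls within one, two, and three standard deviations of the mean. The sketch below computes those shares with Python's `statistics.NormalDist`. The I.Q. scale used here (mean 100, standard deviation 15) is an assumption chosen for illustration; the slides do not state the scale.

```python
from statistics import NormalDist

# Assumption: I.Q. scores scaled to mean 100 and standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

def share_within(dist, k):
    """Proportion of cases within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    low = iq.mean - k * iq.stdev
    high = iq.mean + k * iq.stdev
    print(f"within {k} SD ({low:.0f} to {high:.0f}): {share_within(iq, k):.1%}")
```

The shares depend only on the number of standard deviations, not on the particular mean and standard deviation, so the same percentages hold for any normally distributed variable.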
Measures of central tendency
Mode (Mo): the most frequent score in a
distribution
– good for nominal data
Median (Md): the midpoint or midscore in
a distribution
(50% of cases above, 50% of cases below)
– insensitive to extreme cases
– requires at least ordinal data; also used with interval or ratio data
Source : Reasoning with Statistics, by Frederick Williams & Peter Monge, fifth edition, Harcourt College Publishers.
Measures of central tendency
Mean
The ‘average’ score—total score divided by the
number of scores
has a number of useful statistical properties
however, can be sensitive to extreme scores
many statistics based on mean
Sensitive to ‘outliers’
Extreme cases that just happened to end up in your
sample by chance
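The following sketch illustrates how the three measures of central tendency react to a single outlier. The scores are hypothetical; the point is only that the mean moves while the median and mode stay put.

```python
from statistics import mean, median, mode

# Hypothetical scores (invented for illustration).
scores = [2, 3, 4, 5, 5, 5, 6, 7, 8]

# The same scores plus one extreme case.
with_outlier = scores + [55]

for label, data in (("original", scores), ("with outlier", with_outlier)):
    print(f"{label:<13} mean={mean(data):5.1f} "
          f"median={median(data):4.1f} mode={mode(data)}")
```

In this example the single extreme score doubles the mean, while the median and mode are unchanged.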
Index of central tendency
Source: http://www.uwsp.edu/psych/stat/5/skewnone.gif
Source: Scianta.com
Source: www.wilderdom.com/.../L2-1UnderstandingIQ.html
Source: CSAP’s Data Pathways
Measures of dispersion
Look at how widely scattered over the scale the scores
are
Groups with identical means can be more or less
diverse
To find out how the group is distributed, we need to
know how far or close individual members are from the
mean
Like mean, only meaningful for interval or ratio-level
measures
Measures of dispersion
Range
Distance between the highest and lowest scores in a
distribution;
sensitive to extreme scores;
compensate by calculating interquartile range (distance
between the 25th and 75th percentile points) which
represents the range of scores for the middle half of a
distribution
Usually used in combination with other measures of
dispersion.
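A small sketch of the range and the interquartile range, using hypothetical scores. Note that quartiles can be computed by several conventions; this example relies on the default ("exclusive") method of Python's `statistics.quantiles`, so other software may report slightly different quartile values.

```python
from statistics import quantiles

def value_range(values):
    """Distance between the highest and lowest scores."""
    return max(values) - min(values)

def interquartile_range(values):
    """Distance between the 25th and 75th percentile points."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    return q3 - q1

# Hypothetical scores (invented for illustration).
scores = [61, 64, 67, 70, 72, 75, 78, 81, 84]
with_outlier = scores + [150]

for label, data in (("original", scores), ("with outlier", with_outlier)):
    print(f"{label:<13} range={value_range(data)} "
          f"IQR={interquartile_range(data)}")
```

Adding one extreme score stretches the range considerably, while the interquartile range, which describes only the middle half of the distribution, changes far less.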
Range
Source: www.animatedsoftware.com/statglos/sgrange.htm
Source: http://pse.cs.vt.edu/SoSci/converted/Dispersion_I/box_n_hist.gif
Average Deviation (Mean Deviation)
Merits:
1. Easy to calculate and understand.
2. This can be calculated from any average.
3. It is less affected by extreme observations.
Demerits:
1. It is mathematically awkward: taking absolute values discards the
signs of the deviations.
2. Because it can be calculated from any average (mean, median, or mode),
it is not a uniquely defined measure.
3. Its use is very limited in statistical work.
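As a rough illustration of the points above, the sketch below computes the average deviation around two different averages. The function name `average_deviation` and the data are hypothetical. It also shows why absolute values are needed: the signed deviations from the mean always sum to zero.

```python
from statistics import mean, median

def average_deviation(values, center=mean):
    """Mean of the absolute deviations from a chosen average."""
    c = center(values)
    return sum(abs(x - c) for x in values) / len(values)

# Hypothetical scores (invented for illustration).
scores = [1, 2, 3, 4, 20]

# Signed deviations from the mean cancel out, so they cannot measure spread.
signed_sum = sum(x - mean(scores) for x in scores)

print("sum of signed deviations:", signed_sum)
print("average deviation about the mean:  ", average_deviation(scores, mean))
print("average deviation about the median:", average_deviation(scores, median))
```

The two results differ, which is the sense in which the measure depends on the average chosen.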
Measures of dispersion
Variance (S²)
Average of squared distances of individual points
from the mean
High variance means that most scores are far away
from the mean. Low variance indicates that most
scores cluster tightly about the mean.
Standard Deviation (SD)
A summary statistic of how much scores vary from
the mean
Square root of the Variance
expressed in the original units of measurement
Used in a number of inferential statistics
Variance vs. Standard Deviation
Variance
Population
Sample
Standard Deviation
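A sketch of the population and sample versions, assuming the usual definitions: the population variance divides the sum of squared deviations by N, the sample variance divides by n - 1, and the standard deviation is the square root of the variance in each case. The scores are hypothetical, and the results are checked against Python's `statistics` module.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical scores (invented for illustration).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(scores)
mean_score = sum(scores) / n

# Sum of squared distances of individual points from the mean.
sum_sq = sum((x - mean_score) ** 2 for x in scores)

population_variance = sum_sq / n        # divide by N
sample_variance = sum_sq / (n - 1)      # divide by n - 1
population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_variance)

# The hand computations agree with the library functions.
assert isclose(population_variance, pvariance(scores))
assert isclose(sample_variance, variance(scores))
assert isclose(population_sd, pstdev(scores))
assert isclose(sample_sd, stdev(scores))

print(f"population: variance={population_variance:.3f} SD={population_sd:.3f}")
print(f"sample:     variance={sample_variance:.3f} SD={sample_sd:.3f}")
```

The standard deviation is in the original units of measurement, while the variance is in squared units.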
Skewness of distributions
Measures of skewness look at how lopsided
a distribution is: how far it departs from
the symmetry of the normal curve
When the median and the mean are
different, the distribution is usually skewed. As a
rule of thumb, the greater the difference, the greater the skew.
Distributions that trail away to the left are
negatively skewed and those that trail away to
the right are positively skewed
If the skewness is extreme, the researcher should
either transform the data to make them better
resemble a normal curve or else use a different
set of statistics (nonparametric statistics) to
carry out the analysis
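One common way to put a number on skewness is the moment coefficient: the third central moment divided by the 1.5 power of the second central moment. The sketch below is one such measure, not necessarily the one the slides had in mind; the function name `skewness` and the data are hypothetical.

```python
from statistics import mean, median

def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data: one long tail to the right, and its mirror image.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 12]
left_tailed = [-x for x in right_tailed]
symmetric = [1, 2, 3, 4, 5]

for label, data in (("right-tailed", right_tailed),
                    ("left-tailed", left_tailed),
                    ("symmetric", symmetric)):
    print(f"{label:<13} mean={mean(data):6.2f} median={median(data):5.1f} "
          f"skewness={skewness(data):6.2f}")
```

In these examples the sign of the coefficient matches the direction of the tail, and the mean is pulled toward the tail relative to the median.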
Different Shapes of Distributions
Source: http://faculty.vassar.edu/lowry/f0204.gif
Skewness of distributions
Source: http://www.polity.org.za/html/govdocs/reports/aids/images/image022.gif
Distribution of posting frequency on Usenet
Kurtosis
Measures of kurtosis look at how sharply the
distribution rises to a peak and then drops away
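A common moment-based measure is excess kurtosis: the fourth central moment divided by the squared second central moment, minus 3, so that a normal curve scores zero. The sketch below is one possible measure under that assumption; the function name `excess_kurtosis` and the data are hypothetical.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - m) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data: evenly spread scores versus scores piled up
# at the center with a few far from it.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]
peaked = [5, 5, 5, 5, 5, 5, 5, 1, 9]

print(f"flat:   {excess_kurtosis(flat):.2f}")
print(f"peaked: {excess_kurtosis(peaked):.2f}")
```

The flat data give a negative value and the peaked data a positive one.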