Reasoning with Statistics


Descriptive statistics
for one variable
What to describe?

What is the “location” or “center” of the
data? (“measures of location”)

How do the data vary? (“measures of
variability”).
Types of statistics

Descriptive Statistics
Gives numerical and graphic
procedures to summarize a
collection of data in a clear
and understandable way

Inferential Statistics
Provides procedures to
draw inferences about a
population from a sample
Reasons for using statistics




aid in summarization
aid in “getting at what’s going on”
aid in extracting “information” from the data
aid in communication
Frequency distribution

The frequency with which observations are
assigned to each category or point on a
measurement scale.
Most basic form of descriptive statistic
 May be expressed as a percentage of the total sample
found in each category
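
A frequency distribution with percentages can be sketched in a few lines of Python. The response data below are hypothetical and are not taken from the slides; `frequency_table` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical nominal data: the search engine named by each of 10 respondents.
responses = ["Google", "Yahoo", "Google", "MSN", "Google",
             "Yahoo", "Google", "AOL", "Google", "Yahoo"]


def frequency_table(observations):
    """Return {category: (count, percent of total)}, most frequent first."""
    counts = Counter(observations)
    total = len(observations)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.most_common()}


table = frequency_table(responses)
for category, (count, percent) in table.items():
    print(f"{category:<8}{count:>3}{percent:>7.1f}%")
```

Because the categories here are nominal, the order of the rows carries no meaning; sorting by count is only a display choice.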

Source : Reasoning with Statistics, by Frederick Williams &
Peter Monge, fifth edition, Harcourt College Publishers.
Frequency distribution

The distribution is “read” differently depending
upon the measurement level
Nominal scales are read as discrete measurements at
each level (no ordering)
 Ordinal measures show tendencies, but distances between
categories should not be compared (ordering exists, but not
distance)
 Interval (distance exists, but no ratios) and ratio
scales (ratios exist) allow for comparison among
categories

Sex       N    Mean  Median  TrMean  StDev  SE Mean
female  126   91.23   90.00   90.83  11.32     1.01
male    100   96.79  110.00  105.62  17.39     1.74

Sex     Minimum  Maximum     Q1      Q3
female    65.00   120.00  85.00   98.25
male      75.00   162.00  95.00  118.75
Fastest Ever Driving Speed
226 Stat 100 Students, Fall 1998
[Figure: distribution of speed for men (n = 100) and women (n = 126); horizontal axis: Speed, 70 to 160]
[Figure: speed by gender (female, male); vertical axis: 60 to 160]
Source: Protecting Children from Harmful Television: TV Ratings and the V-chip
Amy I. Nathanson, PhD Lecturer, University of California at Santa Barbara
Joanne Cantor, PhD Professor, Communication Arts, University of Wisconsin-Madison
Source: http://www.elonka.com/kryptos/
Web page on cryptography
Ancestry of US residents
Source: UCLA International Institute
Source: Cornell University website
Source: www.cit.cornell.edu/computer/students/bandwidth/charts.html
Source: Verisign
Search engine use
The percentage of online searches done by US
home and work web surfers in July 2006
NY Times
Source: Verisign
Old Faithful Geyser
Duration in minutes of 272 eruptions of the Old
Faithful geyser, with the waiting time in minutes to the next eruption.
> library(datasets)
> faithful[1:10,]
   eruptions waiting
1      3.600      79
2      1.800      54
3      3.333      74
4      2.283      62
5      4.533      85
6      2.883      55
7      4.700      88
8      3.600      85
9      1.950      51
10     4.350      85
> summary(faithful)
   eruptions        waiting
 Min.   :1.600   Min.   :43.0
 1st Qu.:2.163   1st Qu.:58.0
 Median :4.000   Median :76.0
 Mean   :3.488   Mean   :70.9
 3rd Qu.:4.454   3rd Qu.:82.0
 Max.   :5.100   Max.   :96.0
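
The same kind of summary can be sketched in Python for the ten eruption durations listed in the session above. This is an illustration on those ten rows only, so the numbers will differ from the full 272-row summary. `five_number_summary` is a hypothetical helper, and the "inclusive" quantile method is assumed here because it interpolates between data points in the same way as R's default.

```python
import statistics

# First ten eruption durations (minutes), copied from the listing above.
eruptions = [3.600, 1.800, 3.333, 2.283, 4.533,
             2.883, 4.700, 3.600, 1.950, 4.350]


def five_number_summary(values):
    """Min, quartiles, mean and max of a list of numbers."""
    # n=4 splits the data into quarters and returns the three cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


summary = five_number_summary(eruptions)
for name, value in summary.items():
    print(f"{name:<7}{value:7.3f}")
```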
Normal distribution

Many characteristics are distributed through the
population in a ‘normal’ manner


Normal curves have well-defined statistical properties
Parametric statistics are based on the assumption that the
variables are distributed normally


Most commonly used statistics
This is the famous “Bell curve” where many
cases fall near the middle of the distribution and
few fall very high or very low

I.Q.
Statistical properties of
the normal distribution
I.Q. distribution
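
One well-known property of the normal curve is the share of cases that falls within one, two, or three standard deviations of the mean. The sketch below assumes an I.Q. scale with mean 100 and standard deviation 15; that scaling is a common convention and is not stated in the slides. `share_within` is an illustrative helper name.

```python
from statistics import NormalDist

# Assumption: I.Q. scores scaled to mean 100 and standard deviation 15.
iq = NormalDist(mu=100, sigma=15)


def share_within(dist, k):
    """Proportion of a normal distribution lying within k SDs of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


for k in (1, 2, 3):
    low = iq.mean - k * iq.stdev
    high = iq.mean + k * iq.stdev
    print(f"within {k} SD ({low:.0f} to {high:.0f}): {share_within(iq, k):.1%}")
```

The proportions depend only on the number of standard deviations, not on the particular mean and standard deviation chosen.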
Measures of central tendency

Mode (Mo): the most frequent score in a
distribution


good for nominal data
Median (Md): the midpoint or midscore in
a distribution
(50% of cases above, 50% of cases below)
- insensitive to extreme cases
- interval or ratio data

Measures of central tendency

Mean
The ‘average’ score: the total of all scores divided by the
number of scores
 has a number of useful statistical properties
 many statistics are based on the mean
 however, it is sensitive to extreme scores (‘outliers’)

Outliers: extreme cases that just happened to end up in your
sample by chance
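
A small sketch of how the three measures react to a single extreme case. The scores are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical scores, then the same scores with one extreme case added.
scores = [4, 5, 5, 6, 7, 5, 6]
with_outlier = scores + [60]

for label, data in (("original", scores), ("with outlier", with_outlier)):
    print(
        f"{label:<13}"
        f"mean={statistics.mean(data):6.2f}  "
        f"median={statistics.median(data):4.1f}  "
        f"mode={statistics.mode(data)}"
    )
```

The single added value pulls the mean well away from the bulk of the scores, while the median moves only slightly and the mode does not change.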
Index of central tendency
Source: http://www.uwsp.edu/psych/stat/5/skewnone.gif
Source: Scianta.com
Source: www.wilderdom.com/.../L2-1UnderstandingIQ.html
Source: CSAP’s Data Pathways
Measures of dispersion




Look at how widely scattered over the scale the scores
are
Groups with identical means can be more or less
diverse
To find out how the group is distributed, we need to
know how far or close individual members are from the
mean
Like mean, only meaningful for interval or ratio-level
measures
Measures of dispersion

Range
Distance between the highest and lowest scores in a
distribution
 sensitive to extreme scores
 compensate by calculating the interquartile range (distance
between the 25th and 75th percentile points), which
represents the range of scores for the middle half of a
distribution

Usually used in combination with other measures of
dispersion.
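
The range and the interquartile range can be compared on a small example. The scores are hypothetical, the helper names are illustrative, and the "inclusive" quantile method is one of several conventions for locating the 25th and 75th percentile points.

```python
import statistics


def value_range(values):
    """Distance between the highest and lowest scores."""
    return max(values) - min(values)


def interquartile_range(values):
    """Distance between the 25th and 75th percentile points."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical scores, then the same scores with one extreme case added.
scores = [2, 4, 4, 5, 6, 7, 8, 9, 10]
with_extreme = scores + [40]

for label, data in (("original", scores), ("with extreme", with_extreme)):
    print(f"{label:<13}range={value_range(data):5.1f}  "
          f"IQR={interquartile_range(data):5.2f}")
```

One extreme score changes the range a great deal but barely moves the interquartile range, which is why the latter is used to compensate.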
Range
Source: www.animatedsoftware.com/ statglos/sgrange.htm
Source: http://pse.cs.vt.edu/SoSci/converted/Dispersion_I/box_n_hist.gif

Average Deviation (Mean Deviation)
Merits:
1. Easy to calculate and understand.
2. This can be calculated from any average.
3. It is less affected by extreme observations.
Demerits:
1. This is mathematically incomplete because it ignores negative
signs.
2. As it can be calculated from any average, it does not have
certainty (i.e., it is not a well defined measure).
3. Its use is very limited in statistical work.
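
The second demerit can be seen directly: the average deviation gives different answers depending on which average it is measured from. A minimal sketch with hypothetical scores; `average_deviation` is an illustrative helper name.

```python
import statistics


def average_deviation(values, center=statistics.fmean):
    """Mean of the absolute distances of the values from a chosen average.

    `center` is any function that returns an average of the values,
    for example statistics.fmean or statistics.median.
    """
    c = center(values)
    return sum(abs(x - c) for x in values) / len(values)


# Hypothetical scores with one large value.
scores = [2, 4, 6, 8, 30]

about_mean = average_deviation(scores)
about_median = average_deviation(scores, center=statistics.median)
print(f"average deviation about the mean:   {about_mean:.1f}")
print(f"average deviation about the median: {about_median:.1f}")
```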
Measures of dispersion

Variance (S²)

Average of squared distances of individual points
from the mean

High variance means that most scores are far away
from the mean. Low variance indicates that most
scores cluster tightly about the mean.
Standard Deviation (SD)
A summary statistic of how much scores vary from
the mean
Square root of the Variance
expressed in the original units of measurement
 Used in a number of inferential statistics

Variance vs. Standard Deviation
Variance
Population
Sample
Standard Deviation
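
The population and sample versions differ only in the divisor: the sum of squared deviations is divided by N for a population and, by the usual convention, by n - 1 for a sample. The sketch below uses hypothetical scores and an illustrative helper, `variance`, and compares the result with Python's `statistics` module.

```python
import math
import statistics


def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample)
    or by n (population)."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / (n - 1) if sample else sum_of_squares / n


# Hypothetical scores.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_var = variance(scores, sample=False)
samp_var = variance(scores, sample=True)
# The standard deviation is the square root of the variance, which
# returns it to the original units of measurement.
print(f"population: variance={pop_var:.3f}  SD={math.sqrt(pop_var):.3f}")
print(f"sample:     variance={samp_var:.3f}  SD={math.sqrt(samp_var):.3f}")

# The standard library provides the same four quantities.
assert math.isclose(pop_var, statistics.pvariance(scores))
assert math.isclose(samp_var, statistics.variance(scores))
```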
Skewness of distributions

Measures look at how lopsided
distributions are—how far from the ideal
of the normal curve they are

When the median and the mean are
different, the distribution is skewed. The
greater the difference, the greater the skew.


Distributions that trail away to the left are
negatively skewed and those that trail away to
the right are positively skewed
If the skewness is extreme, the researcher should
either transform the data to make them better
resemble a normal curve or else use a different
set of statistics—nonparametric statistics—to
carry out the analysis
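
One common numerical measure of skew is the moment coefficient: the average cubed deviation from the mean divided by the cube of the (population) standard deviation. The slides do not say which measure they have in mind, so this choice is an assumption, and the data and the helper name `skewness` are illustrative.

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5.

    Zero for a symmetric distribution, positive when the tail trails
    away to the right, negative when it trails away to the left.
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 12]
left_tailed = [13 - x for x in right_tailed]  # mirror image of right_tailed

for label, data in (("symmetric", symmetric),
                    ("right-tailed", right_tailed),
                    ("left-tailed", left_tailed)):
    print(f"{label:<13}mean={statistics.fmean(data):5.2f}  "
          f"median={statistics.median(data):5.2f}  "
          f"skewness={skewness(data):+.2f}")
```

In the right-tailed example the mean is pulled above the median, and in the mirrored example it is pulled below, which matches the mean-versus-median rule of thumb given above.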
Different Shapes of Distributions
Source: http://faculty.vassar.edu/lowry/f0204.gif
Skewness of distributions
Source: http://www.polity.org.za/html/govdocs/reports/aids/images/image022.gif
Distribution of posting frequency on Usenet
Kurtosis

Measures of kurtosis look at how sharply the
distribution rises to a peak and then drops away
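
A common way to put a number on this is excess kurtosis: the average fourth-power deviation from the mean divided by the squared variance, minus 3 so that a normal curve scores zero. Which measure the slides intend is not stated, so this convention is an assumption; the data and the helper name `excess_kurtosis` are illustrative.

```python
import statistics


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3.

    Roughly zero for normally distributed data, positive for a sharper
    peak with heavier tails, negative for a flatter distribution.
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3


# Hypothetical data: scores spread evenly versus scores piled at the centre.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]
peaked = [5, 5, 5, 5, 5, 4, 6, 1, 9]

print(f"flat:   {excess_kurtosis(flat):+.2f}")
print(f"peaked: {excess_kurtosis(peaked):+.2f}")
```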