PP #4 - Personal Web Pages

Descriptive Statistics
Used to describe the basic features of
the data in any quantitative study.
Both graphical displays and descriptive
summary statistics provide the basis
of nearly any quantitative analysis of
data.
Descriptive Statistics
The purpose of descriptive statistics is
to organize and summarize data so
that the data are more readily
comprehended.
That is, descriptive statistics describe
distributions with numbers.
The Process of Becoming Familiar with the Data
• ???
The Process of Becoming Familiar with the Data
• Screening for valid values
• Missing data
• Value labels
• Levels of measurement
• Center
• Spread
• Shape
• Rank or relative position
• Association
Background Information
• Types of variables
– Qualitative
– Quantitative
• Scales or Levels of Measurement
– Nominal – Names the category; a qualitative
variable is therefore measured on a nominal scale.
– Ordinal – Values can be ordered and reflect
differing degrees or amounts of the characteristic
being studied, but differences between values are
not interpretable.
– Interval – Values can be ordered, and
differences between values are interpretable.
– Ratio – Zero is a meaningful value, so ratios
make sense.
Examples of Levels of
Measurement
• Nominal – Numbers assigned to sports
figures, gender, party affiliation
• Ordinal – Numbers assigned to educational
attainment, rank in population
• Interval – Temperature (Celsius or Fahrenheit):
there is a zero, but it depends on the scale used and
is not an absolute zero, so a temperature of 100
degrees is not twice as hot as a temperature of 50 degrees.
• Ratio – Has an absolute zero, weight,
count of the number of people, height,
distance, elapsed time.
Why is knowing the level of measurement important?
• It will help you decide how to interpret
the data from that variable.
• It helps you decide which statistical
analysis is appropriate for the values
assigned.
• http://www.socialresearchmethods.net/selstat/ssstart.htm
Central Tendency
Central Tendency refers to measuring the
center or average. Only the notation for
mean is standard.
The most common measures of central
tendency are:
– Arithmetic Mean or Mean – $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
– Mode – Mo – the item that occurs with greatest
frequency
– Median – Mdn – the middle score when the
observations are arranged in order of
magnitude, so that an equal number of scores
fall below and above.
Examples of these measures
• Mean of: 2, 3, 6, 7, 3, 5, 10
(2 + 3 + 6 + 7 + 3 + 5 + 10)/ 7 = 36/ 7 = 5.14
• Mode of: 2, 3, 6, 7, 3, 5, 10 is 3
• Median of: 2, 3, 6, 7, 3, 5, 10
First the data are ordered: 2, 3, 3, 5, 6, 7, 10.
The middle value is 5, so that is the
median.
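The three calculations above can be checked with a short Python sketch. It assumes only the standard-library `statistics` module; the variable names are illustrative and not part of the original slides.

```python
import statistics

# Data set from the slide
data = [2, 3, 6, 7, 3, 5, 10]

mean = statistics.mean(data)      # sum of the items divided by their count: 36 / 7
mode = statistics.mode(data)      # the item that occurs most often
median = statistics.median(data)  # middle item once the data are sorted

print(round(mean, 2), mode, median)  # prints: 5.14 3 5
```

Note that `statistics.median` sorts the data internally, so the ordering step on the slide does not have to be done by hand.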
Some Important Points About These Measures
• Mode is the only measure of central
tendency that can be used for nominal data.
• Median is barely affected by extreme
values; it is resistant to extreme
observations.
Some Important Points About These Measures
• Mean or Average is affected by
extremely small or large values. We
say that it is sensitive or nonresistant
to the influence of extreme
observations.
• The mean is the balance point of the
distribution.
• In symmetric distributions the mean
and median are close together.
More important points
• In skewed data the mean is pulled toward the
tail of the distribution.
• The median is not necessarily preferred over
the mean just because it is resistant. However, if
the data are known to be strongly skewed, then
the median is preferable.
• Finally, the average is usually the
measure of central tendency of choice
because it is stable from sample to sample.
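A small experiment makes the difference between a resistant and a nonresistant measure concrete. The sketch below reuses the earlier data set and appends one extreme value; the value 100 is a hypothetical outlier chosen only for illustration.

```python
import statistics

scores = [2, 3, 3, 5, 6, 7, 10]
with_outlier = scores + [100]  # hypothetical extreme value, not from the slides

mean_before = statistics.mean(scores)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(scores)
median_after = statistics.median(with_outlier)

# How far each measure moved after adding the extreme value
mean_shift = mean_after - mean_before
median_shift = median_after - median_before

print(round(mean_shift, 2), median_shift)
```

Under these assumptions the mean jumps from about 5.14 to 17, while the median only moves from 5 to 5.5, which is what "pulled toward the tail" looks like in numbers.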
Measuring Spread or Variability
There are several measures of variability. These
measures give an added dimension to the data.
More information about the data is better than
less.
Example: A test was given in two classes and the
average in one class was 97 and the average in
the other was 94. Was the test more difficult for
the second class? Was it easier to get an A in the first class
than in the other? Not necessarily, to both questions.
The spread of the test grades might help answer
the questions. Say that the grades in the
first class ranged from 85 to 100 and in the second class
from 92 to 96.
Measures of Variability, Spread or Dispersion
• Range – Difference between the highest and
lowest items in a distribution. This measure
is not responsive to each item in the
distribution.
• Quartiles Q1 and Q3 – Medians of the parts
of the distribution to the left and right of the
median.
• Interquartile range – IQR is the range between
Q1 and Q3.
IQR is used to find outliers. The rule is that if
an item is more than 1.5 times the IQR below Q1 or
above Q3, then it is considered
an outlier.
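The range, the quartiles, and the 1.5 times IQR rule can be sketched in Python as follows. The helper `quartiles` is a hypothetical function written for this illustration; it follows the definition on the slide (medians of the parts to the left and right of the median), which is only one of several quartile conventions in use.

```python
import statistics

def quartiles(values):
    """Return the first and third quartiles as the medians of the halves
    on each side of the median (the middle item is left out when n is odd)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                  # items left of the median
    upper = ordered[len(ordered) - half:]   # items right of the median
    return statistics.median(lower), statistics.median(upper)

data = [2, 3, 6, 7, 3, 5, 10]

value_range = max(data) - min(data)  # highest item minus lowest item
q1, q3 = quartiles(data)
iqr = q3 - q1

# The 1.5 * IQR rule: items beyond either fence are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(value_range, q1, q3, iqr, lower_fence, upper_fence, outliers)
```

For this data set the fences fall at -3 and 13, so no item is flagged as an outlier.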
The Five-Number Summary
A convenient and quick way to graph and
give some preliminary descriptive statistics
is to determine the five-number summary.
Besides the median and the quartiles, we need
two additional bits of information:
the maximum and the minimum.
Example: The data set in a previous slide
was: 2, 3, 3, 5, 6, 7, 10.
The median is 5.
Q1 and Q3 are 3 and 7 respectively.
The minimum is 2.
The maximum is 10.
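As a rough check, the five-number summary can also be computed with the standard library. This sketch assumes `statistics.quantiles` (Python 3.8 or later) with its default "exclusive" method, which happens to agree with the slide's values for this data set; other quartile conventions can give slightly different results on other data.

```python
import statistics

data = [2, 3, 3, 5, 6, 7, 10]

# quantiles(..., n=4) returns the three cut points: first quartile, median, third quartile
first_q, median, third_q = statistics.quantiles(data, n=4)

# (minimum, first quartile, median, third quartile, maximum)
summary = (min(data), first_q, median, third_q, max(data))
print(summary)  # prints: (2, 3.0, 5.0, 7.0, 10)
```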
The Boxplot
The boxplot would look like:
[Figure: boxplot of VAR00001, N = 7, vertical axis from 0 to 12]
Deviations from the mean
Another way to measure spread is to measure the
deviations from the mean or average. For our
example:
The average is 5.14, so the deviations are
2 – 5.14, 3 – 5.14, 3 – 5.14, 5 – 5.14, 6 – 5.14, 7 –
5.14, 10 – 5.14. So, they are:
-3.14, -2.14, -2.14, -0.14, 0.86, 1.86, 4.86.
Notice that, apart from rounding, they add up to zero.
So the deviations tell you something about the
spread, but since their sum is always zero, they are
squared before being added.
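The following sketch, in plain Python with illustrative variable names, shows why the raw deviations cannot serve as a measure of spread on their own and what squaring them produces.

```python
data = [2, 3, 3, 5, 6, 7, 10]

mean = sum(data) / len(data)            # 36 / 7
deviations = [x - mean for x in data]   # each item minus the mean

# Positive and negative deviations cancel, so the sum is zero
# (up to floating-point rounding).
total = sum(deviations)

# Squaring removes the signs, so the sum no longer cancels.
sum_of_squares = sum(d ** 2 for d in deviations)

print(abs(total) < 1e-9, round(sum_of_squares, 2))  # prints: True 46.86
```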

Deviations from the mean continued
• Dividing the sum of the squared
deviations from the mean by the number of
sample items gives the average
squared deviation, or variance.
However, we will find out that dividing
by the number of items minus 1 instead gives
an unbiased estimator of the variance.
• So the formula becomes:
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$
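To see the effect of the divisor, the sketch below computes the variance of the running example both ways. It assumes the standard-library `statistics` module, where `variance` divides by the number of items minus 1 and `pvariance` divides by the number of items.

```python
import statistics

data = [2, 3, 3, 5, 6, 7, 10]
n = len(data)
mean = sum(data) / n
sum_of_squares = sum((x - mean) ** 2 for x in data)

divide_by_n = sum_of_squares / n               # average squared deviation
divide_by_n_minus_1 = sum_of_squares / (n - 1) # sample variance

# The library functions should agree with the hand calculations.
library_sample = statistics.variance(data)
library_population = statistics.pvariance(data)

print(round(divide_by_n, 2), round(divide_by_n_minus_1, 2))  # prints: 6.69 7.81
```

Dividing by the smaller number always gives the larger result, and the gap shrinks as the sample grows.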
Standard Deviation
A more useful and popular statistic is the standard
deviation. Its units are the same as those of the items
in the data set. Fortunately, it does not require
a new calculation: taking the square root of the
variance gives the standard deviation.
Like the mean, the standard deviation is nonresistant to
extreme values.
The formula then is:
$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}$
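A minimal check of the relationship between the two statistics, again assuming the standard-library `statistics` module and the data set used in the earlier slides:

```python
import math
import statistics

data = [2, 3, 3, 5, 6, 7, 10]

sample_variance = statistics.variance(data)  # divides by n - 1
std_dev = math.sqrt(sample_variance)         # square root of the variance

# statistics.stdev performs the same calculation in one step.
library_std_dev = statistics.stdev(data)

print(round(std_dev, 2))  # prints: 2.79
```

The variance is in squared units, while the standard deviation of about 2.79 is in the same units as the data themselves.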
Class Demos
Outliers
Demo Data
Teacher Stress Data
Key for Teacher Stress Data