Data Distributions:


Important Properties of Distributions:
• Focus is on summarizing the distribution as a whole, rather than individual values
• The distribution of points represents a combination of:
– Common patterns or group conditions
– Unique features or individual conditions
• How to summarize/describe these:
a) Numerically (in statistical indexes)
b) Visually (in images and pictures)
c) Verbally (in words and phrases)
(Parenthetical Note)
• Note: focus here is on summarizing the distribution of one variable at a time
– This is called univariate analysis or statistics
• If we consider the combined (joint) distribution of multiple variables (to see how they are interrelated):
– Analyzing two variables jointly is called bivariate analysis
– Analyzing three or more variables jointly is called multivariate analysis
Important Properties of Distributions:
1) Central Tendency:
• What is the typical or average value of the distribution?
• Where is the middle of the data?
2) Variation:
• How widely are the data points spread out? (range)
• How concentrated are the data points within the distribution? (variance)
3) Size:
• How numerous are the data points in the distribution?
4) Symmetry (also called skew):
• How lop-sided is the distribution across its range?
5) Peakiness (also called modality):
• Are all data points smoothly spread over the values?
• Are there notable peaks or lumps in the distribution?
• How many and how sharp are the peaks?
Central Tendency: (3 common measures)
1) Mode:
• The most common, popular, or “typical” value.
• Applies to all levels of variables – nominal & up
2) Median:
• The “midpoint” (50/50) of the ordered distribution.
• Divides the distribution into upper and lower halves.
• Variable must be at least ordinal level (ordered).
3) Mean:
• The “average” (“center of gravity”) of the values.
• Weighted by the size or value of the data points.
• Variable needs to be at least interval level.
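As a minimal sketch of the three measures, the snippet below uses Python's standard `statistics` module. The `scores` list is made-up illustration data, not taken from the slides.

```python
from statistics import mean, median, multimode

# Hypothetical data list (an assumption for illustration only)
scores = [2, 3, 3, 5, 7, 8, 14]

modes = multimode(scores)  # most frequent value(s); a list, since ties are possible
mid = median(scores)       # middle value of the sorted list
avg = mean(scores)         # sum of the values divided by their count

print("mode(s):", modes)
print("median:", mid)
print("mean:", avg)
```

Note that the mode only requires counting categories, the median requires sorting, and the mean requires arithmetic on the values, which mirrors the level-of-measurement requirements listed above.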
Which one is the correct measure of Central Tendency?
1) Depends on the type of data
• Nominal = mode
• Ordinal = median
• Interval/Ratio = mean (quasi-interval?)
2) Depends on the distribution of the variable
• Highly skewed or weirdly distributed variables
• Unusual or extreme outliers (AKA the “Bill Gates effect” or the “New York City effect”)
• Variables with infinitely many “unique” values
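The outlier point can be illustrated with a small sketch. The income figures below are invented for the example (in thousands of dollars); the idea is only to show how one extreme value pulls the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical incomes in $1000s (assumed values, for illustration only)
incomes = [30, 35, 40, 45, 50]

# The same group after one extremely wealthy person joins it
with_outlier = incomes + [100_000]

print("without outlier: mean =", mean(incomes), " median =", median(incomes))
print("with outlier:    mean =", mean(with_outlier), " median =", median(with_outlier))
```

In this made-up case the mean jumps from 40 to 16,700 while the median only moves from 40 to 42.5, which is why the median is often preferred for highly skewed variables.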
How to compute measures of Central Tendency?
A. By hand (& calculator)?
• See the textbook and the handouts
• Notice the difference between formulas for: (a) data list, (b) frequency table, (c) grouped distribution
B. By SPSS?
• Use one of 3 procedures:
1) Frequencies command → computes more kinds of statistics and an accompanying chart; more detailed output
2) Descriptives command → quickly computes the most common statistics, but no median and no charts
3) Explore command → wider array of information
Shape of Distribution: properties
1) Symmetry:
• “Lopsidedness” → unevenness around center
• “Skew” = the technical name for asymmetry
– Skew = direction of the longer tail
– Left-Skew = negative; Right-Skew = positive
• Some statistics assume symmetric distribution
• If symmetric, mean & median = same
2) Peakiness:
• Multi-modality → number of peaks
• “Kurtosis” → sharpness of peaks
3) Truncation:
• Some values are excluded or “censored”
How to tell Shape of a Distribution?
1) Look at frequency table (if # values = small):
2) Look at frequency graph:
• Bar chart or line graph (if # values = small)
• Histogram (if # values = large)
3) Compare values of median and mean:
• Difference between Mean & Median = skew
• If Mean > Median: skewed to the right
• If Mean < Median: skewed to the left
4) Box Plots:
[Figures: example bar chart, histogram, and box plots]
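The mean-versus-median comparison can be sketched in a few lines. The helper name `skew_direction` and the three data lists are hypothetical, and the comparison is only a rule of thumb, not a formal skewness statistic.

```python
from statistics import mean, median

def skew_direction(values):
    """Guess the skew direction by comparing mean and median (rule of thumb)."""
    m, md = mean(values), median(values)
    if m > md:
        return "right"   # mean pulled toward a longer upper tail
    if m < md:
        return "left"    # mean pulled toward a longer lower tail
    return "none apparent"

# Hypothetical data lists (assumed values, for illustration only)
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]
left_tailed = [1, 17, 18, 18, 19, 19, 19, 20]
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right_tailed", right_tailed),
                   ("left_tailed", left_tailed),
                   ("symmetric", symmetric)]:
    print(name, "->", skew_direction(data))
```

A frequency graph or box plot remains the safer check, since an equal mean and median does not by itself guarantee a symmetric distribution.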
Variation (the spread of the data):
1) Range:
• The difference between the highest and lowest values in the distribution
2) Inter-Quartile Range:
• The difference (range) between the 25th & 75th percentiles (the cut-offs for the lowest & highest quarters) of the distribution (the span of the middle 50%)
3) Variance (& standard deviation):
• The amount of variation around the mean (the average squared deviation).
• Counts the amount but not direction of deviation.
• Weights large deviations more heavily.
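As a rough sketch of the first two measures, the snippet below computes the range and the inter-quartile range with Python's standard library. The `scores` list is made up, and quartile conventions differ between textbooks and software, so the IQR from `statistics.quantiles` (default "exclusive" method) may not match a hand or SPSS calculation exactly.

```python
from statistics import quantiles

# Hypothetical data list (an assumption for illustration only)
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: highest value minus lowest value
value_range = max(scores) - min(scores)

# Quartiles: n=4 returns the three cut points (25th, 50th, 75th percentiles)
q1, q2, q3 = quantiles(scores, n=4)

# Inter-quartile range: span of the middle 50% of the data
iqr = q3 - q1

print("range:", value_range)
print("IQR:", iqr)
```

The range depends only on the two most extreme values, while the IQR ignores them, which makes the IQR less sensitive to outliers.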
How to compute the Variance:
1) Compute the Mean of the distribution
2) Compute the deviation of each score from
the Mean of the distribution
3) Square the deviations from the mean
4) Add all the squared deviations together
5) Divide by the total number of scores
To Compute the Standard Deviation:
1) Take the square root of the variance
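The steps above can be sketched directly in code. The function name `variance_by_steps` and the `scores` list are hypothetical; the function divides by the total number of scores, as in step 5.

```python
import math

def variance_by_steps(scores):
    """Follow the five listed steps, dividing by the number of scores."""
    n = len(scores)
    mean = sum(scores) / n                    # 1) mean of the distribution
    deviations = [x - mean for x in scores]   # 2) deviation of each score
    squared = [d ** 2 for d in deviations]    # 3) square the deviations
    total = sum(squared)                      # 4) add the squared deviations
    return total / n                          # 5) divide by the number of scores

# Hypothetical data list (an assumption for illustration only)
scores = [2, 4, 4, 4, 5, 5, 7, 9]

variance = variance_by_steps(scores)
std_dev = math.sqrt(variance)   # standard deviation = square root of the variance

print("variance:", variance)
print("standard deviation:", std_dev)
```

For this made-up list the mean is 5, the squared deviations sum to 32, and dividing by 8 scores gives a variance of 4 and a standard deviation of 2.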
2 Measures of Variance?
• Note 2 slightly different formulas:
– Population/Description formula:
$$\frac{\sum_i (x_i - \bar{x})^2}{N}$$
– Sample/Estimation formula:
$$\frac{\sum_i (x_i - \bar{x})^2}{N - 1}$$
How to compute the Variance:
Note two different computing strategies that yield the same answers:
1) Definitional Formula:
• Requires computing the mean first & then deviations
• Uses deviation scores and decimal fractions
• Messier computations (with decimal fractions)
2) Computational Formula:
• Computations occur in the same step
• Does not compute deviations
• Simpler computations (decimals only at the end)
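The slides do not spell out the computational formula, so the sketch below assumes the usual textbook form, in which the sum of squared deviations is obtained as Σx² − (Σx)²/N. Both function names and the `scores` list are hypothetical, and both functions use the population (divide by N) version.

```python
import math

def variance_definitional(scores):
    """Mean first, then squared deviations from the mean."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / n

def variance_computational(scores):
    """Assumed textbook shortcut: uses only the sum and the sum of squares."""
    n = len(scores)
    total = sum(scores)                     # sum of x
    total_sq = sum(x * x for x in scores)   # sum of x squared
    return (total_sq - total ** 2 / n) / n

# Hypothetical data list; its mean is 3.5, so the deviations are fractional
scores = [1, 2, 4, 7]

v_def = variance_definitional(scores)
v_comp = variance_computational(scores)

print("definitional: ", v_def)
print("computational:", v_comp)
assert math.isclose(v_def, v_comp)
```

The shortcut is convenient for hand calculation because all intermediate sums stay whole numbers when the scores are whole numbers. In floating-point software the definitional form is generally the safer choice, since subtracting two large sums can lose precision.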