Descriptive Statistics
Programming in R
Data Analysis Module: Basic Descriptive Statistics
Data Analysis Module
Basic Descriptive Statistics and Confidence Intervals
Basic Visualizations
Histograms
Pie Charts
Bar Charts
Scatterplots
T-tests
One Sample
Paired
Independent Two Sample
ANOVA
Chi Square and Odds
Regression Basics
Data Analysis: Descriptive Statistics
In this session I will explain:
• Measures of central tendency and
variation
• How to use figures to summarize a
single variable (univariate data)
• How to create these in R.
Data Analysis: Descriptive Statistics
• Center, or where do we find most of the
data
• Distribution or shape, such as a bell
shaped curve
• Variation or dispersion: how spread
out is the data? On average, how far are
observations from the center?
• Outliers…have we got Bill Gates in our
salary sample?
Measure of central tendency
The “center” of a data set can be
described using two different measures:
1. Mean – the commonly known
“average”
2. Median – the midpoint
The mean
• The sample mean is sometimes
called “x bar”
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Translation, add up all the values
and divide by the number of values
• Usually, this is what people call the
average
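A minimal sketch of the translation above: compute the mean by hand, then check it against R's built-in mean(). The vector `prices` is made-up illustration data and does not come from the slides.

```r
# Hypothetical data: five made-up prices (not from the slides)
prices <- c(10, 12, 15, 18, 20)

# Add up all the values and divide by the number of values
mean_by_hand <- sum(prices) / length(prices)

# R's built-in function should give the same answer
mean_builtin <- mean(prices)

print(mean_by_hand)
print(mean_builtin)
```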
The median
• The middle of the data is called the
median
– Sort the data from smallest to largest
– If there are an odd number of
observations, the middle number is the
median
– For even number of observations, the
median is the midpoint between the
two middle numbers
Median price = (7521 + 8139)/2 = 7830
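The steps for finding the median can be sketched in R as follows. Only the two middle values (7521 and 8139) come from the slide; the other prices are hypothetical, added so the example has a full data set to sort.

```r
# Hypothetical prices; only 7521 and 8139 are taken from the slide
prices <- c(9500, 7521, 6200, 8139, 9000, 7000)

sorted <- sort(prices)   # sort from smallest to largest
n <- length(sorted)      # 6, an even number of observations

# Even n: the median is the midpoint between the two middle numbers
median_by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

# R's built-in function handles both the odd and the even case
median_builtin <- median(prices)

# Odd n: drop the largest value, leaving 5 observations,
# so the median is simply the middle number
median_odd <- median(sorted[-n])

print(median_by_hand)
print(median_builtin)
print(median_odd)
```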
Shape and skewness
Normal variables and standard deviation
• In a symmetric, bell shaped
distribution, we are able to describe
the entire distribution using only two
numbers, the mean and the
standard deviation
• The standard deviation is roughly
the average distance that
observations are from their mean
Calculating the standard deviation
$$\text{Standard deviation} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Translation: Find the difference between the
mean and each value in the dataset, square
each difference, add these up, divide by the
total number of values minus 1, then take the
square root of that (or, get R to do it for you)
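As a rough sketch, the translation above can be followed step by step in R and compared with the built-in sd(). The vector `x` is made-up illustration data.

```r
# Hypothetical data (not from the slides)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

deviations <- x - mean(x)     # difference between each value and the mean
sum_sq <- sum(deviations^2)   # square each difference and add them up

# Divide by the number of values minus 1, then take the square root
sd_by_hand <- sqrt(sum_sq / (length(x) - 1))

# Or get R to do it for you
sd_builtin <- sd(x)

print(sd_by_hand)
print(sd_builtin)
```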
And we care because?
The Empirical Rule
For any normal curve, approximately
• 68% of the values fall within 1 standard
deviation of the mean
• 95% of the values fall within 2 standard
deviations of the mean
• 99.7% of the values fall within 3 standard
deviations of the mean
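The percentages in the Empirical Rule can be checked against the normal curve itself. This is a small sketch using pnorm(), R's normal cumulative distribution function; the helper name `within_k_sd` is hypothetical and chosen only for this example.

```r
# Area under the standard normal curve within k standard deviations
# of the mean (hypothetical helper, not part of base R)
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)

# Prints values close to 68%, 95%, and 99.7%
print(round(100 * coverage, 1))
```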
Other things to describe
• How many modes?
• The range, minimum and maximum
Eruption times of Old Faithful geyser in
Yellowstone National Park, 1997 (n=107)
[Figure: histogram. x-axis: time of eruption in minutes; y-axis: # of eruptions]
This histogram shows
a bimodal shape.
The data has a
minimum of 1.67
minutes and a
maximum of 4.93
minutes, for a range
of 3.26 minutes.
http://wps.aw.com/wps/media/objects/15/15719/projects/ch3_faithful/index.html
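A similar histogram and range can be sketched with R's built-in `faithful` dataset. Note the assumption: `faithful` is a different, larger sample of Old Faithful eruptions than the 1997 data (n=107) on the slide, so its minimum, maximum, and range will not match the numbers quoted above.

```r
# Built-in dataset; NOT the 1997 sample (n = 107) shown on the slide
eruptions <- faithful$eruptions   # eruption durations in minutes

lo <- min(eruptions)
hi <- max(eruptions)
data_range <- hi - lo             # range = maximum - minimum

print(c(minimum = lo, maximum = hi, range = data_range))

# Draw the histogram; two separate humps indicate a bimodal shape
hist(eruptions,
     main = "Old Faithful eruption times",
     xlab = "Time of eruption in minutes",
     ylab = "# of eruptions")
```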
The five number summary
• Minimum, maximum, median, lower
quartile and upper quartile
[Figure: box plot labeled Minimum, Lower Quartile, Median, Upper Quartile, Maximum]
The visual representation of the five
number summary is the box plot, also
called the box and whiskers plot
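A minimal sketch of the five number summary in R, using fivenum(). The `sleep_hours` vector is hypothetical data made up for this example. fivenum() reports Tukey's hinges as the quartiles, which can differ slightly from the quartiles that summary() and quantile() report on other data sets.

```r
# Hypothetical hours of sleep for 10 students (made-up data)
sleep_hours <- c(3, 5, 6, 6, 7, 7, 8, 8, 9, 16)

# Minimum, lower quartile, median, upper quartile, maximum
five <- fivenum(sleep_hours)
print(five)

# summary() reports the same kind of summary, plus the mean
print(summary(sleep_hours))

# Box plot: the box runs from the lower to the upper quartile,
# with a line at the median
boxplot(sleep_hours, horizontal = TRUE, xlab = "Hours of sleep")
```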
Interpreting box plots
¼ of students slept between 3 and 6 hours,
¼ slept between 6 and 7 hours,
¼ slept between 7 and 8 hours, and
¼ slept between 8 and 16 hours.
Outlier: any value more than 1.5 interquartile ranges (IQR)
beyond the closest quartile, shown with stars.
Copyright ©2005 Brooks/Cole, a division of Thomson Learning, Inc.
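The outlier rule can be sketched in R as follows. The `sleep_hours` vector is hypothetical data, chosen so that its quartiles resemble the box plot described above; it is not the data behind the original figure.

```r
# Hypothetical hours of sleep for 10 students (made-up data)
sleep_hours <- c(3, 5, 6, 6, 7, 7, 8, 8, 9, 16)

q1 <- quantile(sleep_hours, 0.25, names = FALSE)  # lower quartile
q3 <- quantile(sleep_hours, 0.75, names = FALSE)  # upper quartile
iqr <- q3 - q1                                    # same as IQR(sleep_hours)

# Anything more than 1.5 IQR beyond the closest quartile is an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- sleep_hours[sleep_hours < lower_fence | sleep_hours > upper_fence]
print(outliers)

# boxplot.stats() applies the same 1.5 * IQR rule, measured from the hinges
print(boxplot.stats(sleep_hours)$out)
```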
Other ways to visualize data
• When developing a visual
representation of a single variable,
the most common tools are
histograms, pie charts, bar charts,
box plots, and stem and leaf plots.
• We’ve already seen a histogram
and a box plot
How to produce these in R
• Use summary() to get the
mean, median, first quartile, third
quartile, minimum, and maximum.
• Use table() to get frequency counts
• Use prop.table() to turn counts into
proportions (multiply by 100 for percentages)
• Plus, pie(), barplot(), hist(), and
boxplot() to get pie charts, bar charts,
histograms, and box plots,
respectively.
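The functions listed above can be tried together in one short sketch. Both variables, `hours` and `major`, are hypothetical data made up for illustration. In a non-interactive session, R writes the plots to a file instead of showing them on screen.

```r
# Hypothetical data, made up for illustration
hours <- c(3, 5, 6, 6, 7, 7, 8, 8, 9, 16)                 # numeric variable
major <- c("Math", "Biology", "Math", "History", "Biology",
           "Math", "Math", "History", "Biology", "Math")  # categorical variable

print(summary(hours))         # min, quartiles, median, mean, max

counts <- table(major)        # frequency counts per category
props <- prop.table(counts)   # proportions that sum to 1
percents <- 100 * props       # convert proportions to percentages

print(counts)
print(percents)

pie(counts, main = "Majors")                  # pie chart
barplot(counts, ylab = "Number of students")  # bar chart
hist(hours, xlab = "Hours of sleep")          # histogram
boxplot(hours, ylab = "Hours of sleep")       # box plot
```

Pie and bar charts suit categorical variables such as `major`, while histograms and box plots suit numeric variables such as `hours`.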