STAT02 - Descriptive statistics (cont.)

Applied statistics for testing and evaluation – MED4
Lecturer: Smilen Dimitrov
Introduction
• We previously discussed the arithmetic mean as a measure of central tendency (location) of a data sample (collection) in descriptive statistics.
• Here we continue with two other important measures of central tendency: the mode and the median.
• We will also get acquainted with frequency tables and their graphical form, histograms, and with the range as a measure of statistical variability (dispersion or spread) in descriptive statistics.
• We will look at how to perform these operations in R, and learn a bit more about plotting.
Arithmetic mean as central tendency, range and outliers
• The range is the length of the smallest interval that contains all the data.
– It is calculated by subtracting the smallest observation from the greatest.
• In R, we can use the commands min and max to find the range of a data collection.
• We can use abline to plot straight lines.
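The steps above can be sketched as follows. The raisin counts are made-up illustrative values, not the data set used in the lecture, and the variable names are assumptions. Note that R's own range function returns the pair (minimum, maximum) rather than their difference, so diff(range(x)) gives the same number as max(x) - min(x).

```r
# Hypothetical raisin counts per box (illustrative values only)
raisins <- c(28, 25, 27, 29, 26, 28, 30, 27, 28, 26)

data_min   <- min(raisins)
data_max   <- max(raisins)
data_range <- data_max - data_min   # length of the smallest interval holding all data
data_mean  <- mean(raisins)

cat("min:", data_min, " max:", data_max, " range:", data_range, "\n")

# Bar graph of the quantities, with horizontal lines added by abline
barplot(raisins, names.arg = seq_along(raisins),
        xlab = "Box number", ylab = "Raisins per box")
abline(h = data_min,  lty = 2)        # lower end of the range
abline(h = data_max,  lty = 2)        # upper end of the range
abline(h = data_mean, col = "red")    # arithmetic mean
```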
Arithmetic mean as central tendency, range and outliers
• Our sample data set (raisins), with quantities plotted as a bar graph (using barplot), and with the range and arithmetic mean shown.
Arithmetic mean as central tendency, range and outliers
• Our sample data set (raisins), with quantities plotted as a point/line plot (using plot), and with the range and arithmetic mean shown.
– With plot, the y axis is automatically scaled to the range of the data.
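A minimal sketch of such a point/line plot, again using made-up raisin counts rather than the lecture's data. The call par("usr") is used here only to inspect the axis limits that plot chose; by default they extend slightly beyond the data range instead of starting at zero.

```r
# Hypothetical raisin counts per box (illustrative values only)
raisins <- c(28, 25, 27, 29, 26, 28, 30, 27, 28, 26)

# type = "b" draws both points and connecting lines
plot(raisins, type = "b", xlab = "Box number", ylab = "Raisins per box")
abline(h = range(raisins), lty = 2)       # range() returns c(min, max): two lines
abline(h = mean(raisins), col = "red")    # arithmetic mean

# Limits of the y axis as chosen automatically by plot
y_limits <- par("usr")[3:4]
print(y_limits)
```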
Arithmetic mean as central tendency, range and outliers
• Our sample data set (raisins), with quantities plotted as a point/line plot (using plot), and with the range and arithmetic mean shown, after changing a single value so that it lies outside the original range.
Arithmetic mean as central tendency, range and outliers
• Both the range and the arithmetic mean change considerably if even one value is very different from the others.
– An outlier is a single observation "far away" from the rest of the data.
• However, one outlier does not change the fact that the other values still lie close to the original arithmetic mean and within the original range.
• Therefore we need tools and methods for describing central tendency and variability that are less sensitive to outliers.
• For central tendency, we can use the mode and the median.
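The effect described above can be checked numerically. This is a sketch with made-up raisin counts; the outlier value 60 is an arbitrary choice for illustration.

```r
# Hypothetical raisin counts per box (illustrative values only)
raisins <- c(28, 25, 27, 29, 26, 28, 30, 27, 28, 26)

# Change one value so that it lies far outside the original range
with_outlier <- raisins
with_outlier[5] <- 60

cat("mean:  ", mean(raisins), "->", mean(with_outlier), "\n")                # 27.4 -> 30.8
cat("range: ", diff(range(raisins)), "->", diff(range(with_outlier)), "\n")  # 5 -> 35
cat("median:", median(raisins), "->", median(with_outlier), "\n")            # 27.5 -> 28
```

In this example one changed value moves the mean by more than three raisins and makes the range seven times larger, while the median moves by half a raisin.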
Mode and frequency distribution
• By definition, outliers occur rarely: they are single occurrences.
• It is therefore useful to see which value occurs most often (most frequently): the mode.
– The mode is the most frequent value assumed by a random variable, or occurring in a sample of a random variable.
– It applies both to probability distributions and to collections of experimental data.
– It can be unusable for real-valued data (where each value typically occurs only once), unless we apply histogram techniques.
– It can be applied to nominal data (for instance, the most frequent name).
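R has no built-in command for the statistical mode (the existing mode function reports how an object is stored, not its most frequent value). One possible sketch, built on table, is shown below; the helper name stat_mode and the sample values are assumptions made for illustration.

```r
# Hypothetical helper: returns the most frequent value(s) as character strings.
# If several values share the highest count, all of them are returned.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

# Integer data (illustrative raisin counts)
raisins <- c(28, 25, 27, 29, 26, 28, 30, 27, 28, 26)
print(stat_mode(raisins))       # "28" (occurs three times)

# Nominal data: the most frequent name
first_names <- c("Anna", "Peter", "Anna", "Lars")
print(stat_mode(first_names))   # "Anna"
```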
Mode and frequency distribution
• To see which value occurs most frequently, we must first count how many times each value occurs in the data collection. This frequency count is the distribution.
– Collecting and aggregating data results in a distribution. Distributions are most often presented as a histogram or a table (frequency table), with the aim of approximating them by a mathematical function and drawing conclusions.
– The frequency of an event i is the number n_i of times the event occurred in the experiment or the study. These frequencies are often represented graphically in histograms.
• absolute frequencies: when the counts n_i themselves are given
• (relative) frequencies: when the counts are normalized by the total number of events:
$$ f_i = \frac{n_i}{N}, \qquad N = \sum_i n_i $$
Building a histogram and frequency table (ex using applets)
1. Standard collection of our data
2. Building a point plot histogram "manually", from the individual counts observed
3. Building a frequency table from a point plot histogram
4. Transition from a point plot histogram to a bar graph histogram
Mode and frequency distribution - histogram
• Histogram – graphical display of a frequency table (distribution)
– A histogram is a graphical display of tabulated frequencies.
– A histogram is the graphical version of a table which shows what proportion of cases fall into each of several or many specified categories.
– The categories are usually specified as non-overlapping intervals of some variable, called bins.
– In a more general mathematical sense, a histogram is simply a mapping that counts the number of observations that fall into various disjoint categories (known as bins), whereas the graph of a histogram is merely one way to represent a histogram.
Mode and frequency distribution - histogram
• In R, a frequency table is obtained with the table command.
• A histogram is most easily drawn (for integer data) by plotting the output of table using plot or barplot.
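A short sketch of both steps, using made-up raisin counts rather than the lecture's data. One point to keep in mind: table lists only the values that actually occur, so a value with zero count leaves no gap in the resulting graph.

```r
# Hypothetical raisin counts per box (illustrative values only)
raisins <- c(28, 25, 27, 29, 26, 28, 30, 27, 28, 26)

# Frequency table: how many times each distinct value occurs
freq_table <- table(raisins)
print(freq_table)

# Histogram as vertical lines (plot on a one-dimensional table)
plot(freq_table, xlab = "Raisins per box", ylab = "Absolute frequency")

# Histogram as a bar graph
barplot(freq_table, xlab = "Raisins per box", ylab = "Absolute frequency")
```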
Mode and frequency distribution - histogram
• R has a dedicated command, hist, for plotting a histogram.
– However, since it accepts real as well as integer numeric data, it needs some fine-tuning to graph integer data correctly.
• Plotting relative frequencies is straightforward: divide the counts by the number of elements (length) in the data collection.
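One common way to do this fine-tuning, shown here as a sketch and not necessarily the approach used in the lecture, is to place the bin edges halfway between the integers so that each integer value gets its own bin. The raisin counts are made-up illustrative values.

```r
# Hypothetical raisin counts per box (illustrative values only)
raisins <- c(28, 25, 27, 29, 26, 28, 30, 27, 28, 26)

# Bin edges at 24.5, 25.5, ..., 30.5: one bin per integer value
bin_edges <- seq(min(raisins) - 0.5, max(raisins) + 0.5, by = 1)
h <- hist(raisins, breaks = bin_edges,
          xlab = "Raisins per box", main = "Absolute frequencies")
print(h$counts)

# Relative frequencies: counts divided by the number of elements
rel_freq <- table(raisins) / length(raisins)
print(rel_freq)
barplot(rel_freq, xlab = "Raisins per box", ylab = "Relative frequency")
```

The expression prop.table(table(raisins)) gives the same relative frequencies.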
Median
• A median is a number that separates the higher half of a sample, a population, or a probability distribution from the lower half.
– At most half the population have values less than the median, and at most half have values greater than the median.
– If both groups contain less than half the population, then some of the population is exactly equal to the median.
• To compute it in R, one should:
– Sort the data collection in ascending order.
– Find out whether the data collection has an odd or an even number of elements.
• If the number is odd, return the middle element of the sorted collection.
• If the number is even, return the mean of the two middle elements of the sorted collection.
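The procedure above can be sketched as a small function. The name manual_median is an assumption made for illustration; R also provides a built-in median command, which is used here as a cross-check.

```r
# Hypothetical helper implementing the steps listed above
manual_median <- function(x) {
  s <- sort(x)                 # sort in ascending order
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]             # odd count: the middle element
  } else {
    mean(s[c(n / 2, n / 2 + 1)])   # even count: mean of the two middle elements
  }
}

print(manual_median(c(5, 1, 3)))       # 3
print(manual_median(c(4, 1, 3, 2)))    # 2.5

# Cross-check against the built-in command
print(median(c(4, 1, 3, 2)))           # 2.5
```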
Review
Descriptive statistics
• Measures of central tendency (location)
– Arithmetic mean
– Median
– Mode
• Measure of statistical variability (dispersion, spread)
– Range
Exercise for mini-module 2 – STAT02
Exercise
Use the Sample Data Set of Southern Oscillations, given on
http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4412.htm
• Collect the southern oscillation data per month for three consecutive years in an Excel sheet.
– Choose the years based on your group number g, according to the formula $y_i = 1955 + 3(g - 1) + (i - 1)$, $i = 1, 2, 3$ (so group 1 would choose 1955, 1956, 1957; group 2 would choose 1958, 1959, 1960; etc.)
– Multiply all oscillation data by 10 so as to work with integers.
– Hint: you could use month numbers as row names, and years as column names, in Excel and in R.
• Import the data into R and, for each year, find the arithmetic mean, the median and the mode of the oscillation.
• Using R, plot the oscillation for each month as a quantity, for each of the assigned years. Mark the range and the median graphically on each graph.
• Using R, plot the relative frequency histogram for each of the assigned years. Mark the arithmetic mean graphically on each graph.
Delivery:
Deliver the collected data (in tabular format), the found statistics and the requested
graphs for the assigned years in an electronic document. You are welcome to include
R code as well.
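As a quick check of the year-selection formula in the exercise, it can be evaluated in R. The function name assigned_years is an assumption made for illustration.

```r
# Years assigned to group g: y_i = 1955 + 3 * (g - 1) + (i - 1), for i = 1, 2, 3
assigned_years <- function(g) {
  i <- 1:3
  1955 + 3 * (g - 1) + (i - 1)
}

print(assigned_years(1))   # 1955 1956 1957
print(assigned_years(2))   # 1958 1959 1960
```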