Introduction: Welcome to HSCI 800

CHAPTERS 1 AND 2:
DESCRIPTIVE STATISTICS
STAT 241/251
Outline
2

Data: definitions and examples

Tabular and Graphical Summaries


Categorical
Quantitative

Data Distributions

Numerical Summaries


Measures of Location
Measures of Spread

Boxplots

Data transformations
3
Data
Definitions and Summaries
Data – A Definition
4
A Need for Organization
5
6
Variables
7

Categorical:
 Nominal:
 Ordinal:

Quantitative:
 Discrete:
 Continuous:
More on Variables and Data
8




Why should we care about the type of variable?
Caution: Categorical variables are often recorded
using numbers (e.g. yes=1, no=0). Don’t mistake
these for quantitative variables.
Univariate Data: Data on one variable. E.g. weight
or age or time or…
Multivariate Data: Data on multiple variables. E.g.
weight and age and time and…
Raw Data
9
Four algorithms have been developed for cracking coded
transmissions. Trials with each of the algorithms were run and
the following data were collected:

Trial   Algorithm   Time to Completion (sec)   Success
1       3           1.34                       yes
2       1           3.45                       yes
3       4           0.99                       no
…       …           …                          …
In all, 2800 trials were done. Thus the complete table is 3 by
2800. This does not lend itself well to drawing conclusions.
Tables are Still Useful
10

Tables can be used to summarize data.

There are two basic types of summary tables:
 Frequency Table
 Relative Frequency Table

The definitions of each will change according to the
type of data (Categorical or Quantitative).
For Categorical Data
11


A Frequency Table is a table that displays the total
number of cases falling into each category of a
single categorical variable.
A Relative Frequency Table displays the
percentage/proportion of cases rather than the
number of cases.
Success   Count        Success   Percentage
Yes       2520         Yes       90%
No        280          No        10%
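As a rough illustration, both tables could be produced in Python along the following lines. The list `outcomes` is rebuilt from the counts shown on the slide (2520 yes, 280 no) because the raw trial data are not available here, and the variable names are illustrative only.

```python
from collections import Counter

# Assumption: rebuild the Success column from the summary counts on the
# slide, since the 2800 raw observations are not listed.
outcomes = ["Yes"] * 2520 + ["No"] * 280

# Frequency table: number of cases falling into each category.
frequency = Counter(outcomes)

# Relative frequency table: proportion of cases in each category.
n = len(outcomes)
relative_frequency = {category: count / n for category, count in frequency.items()}

for category in frequency:
    print(f"{category:<4} {frequency[category]:>5} {relative_frequency[category]:>6.0%}")
```

The relative frequencies always add up to 1 (or 100%), which is a quick check that no case was dropped.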
Steps to making a Frequency Table for
Quantitative Data
12
1. Identify the smallest and largest observations to
obtain the range of the values for the data.
2. Divide the (adjusted) range into equal sized
non-overlapping bins.
3. Count the number of observations (frequency) in each
bin.
4. Calculate the Relative Frequency for each bin
(optional).
Time to Completion
13
Time (in seconds)   Count
0 – 0.99            780
1.00 – 1.99         1276
2.00 – 2.99         614
3.00 – 3.99         130
- The bins should be of equal length
- It should be clear where each bin starts and ends.
- Use 5-20 bins
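The optional relative frequency column can be computed from the counts in this table. A minimal sketch, using only the bin labels and counts shown above (the variable names are illustrative):

```python
# Bin labels and counts copied from the Time to Completion table.
bins = ["0 - 0.99", "1.00 - 1.99", "2.00 - 2.99", "3.00 - 3.99"]
counts = [780, 1276, 614, 130]

total = sum(counts)  # should match the 2800 trials
relative = [count / total for count in counts]

for label, count, share in zip(bins, counts, relative):
    print(f"{label:<12} {count:>5} {share:>7.1%}")
```

The second bin holds a little under half of the trials (about 45.6%), which is harder to see from the raw counts alone.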
Displaying Categorical Data
14

Bar Charts

Pie Charts

Side-by-Side Bar Charts

Side-by-Side Pie Charts
Pictures Speak Louder than Tables
15
Pie Charts
16

Pie Charts present categories as slices of a circle, where the
area of each slice is proportional to the number (or proportion)
of cases in each category.
Caution with Fancy 3-D Plots
17
Displaying Quantitative Data
18

Histograms: The quantitative equivalent to the bar
chart. It is a graphical version of the frequency or
relative frequency table. One difference with the
bar chart is that there is no space between the
bars.

Stem and Leaf plots (not covered in this course)

Box-plots
Histogram Example
19
Consider the following ages:
18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42,
49, 25, 37, 71
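One way to turn these ages into a histogram is to follow the frequency-table steps for quantitative data. The sketch below assumes bins of width 10 starting at 10; that choice is an assumption made for illustration, and other bin widths are equally valid.

```python
ages = [18, 45, 23, 34, 33, 39, 50, 19, 51, 68, 36, 26, 42, 49, 25, 37, 71]

# Assumption: bins of width 10, i.e. [10, 20), [20, 30), ..., [70, 80).
bin_width = 10
start = (min(ages) // bin_width) * bin_width      # lower edge of first bin
stop = (max(ages) // bin_width + 1) * bin_width   # upper edge of last bin

# Count the observations (frequency) falling in each bin.
counts = {}
for lower in range(start, stop, bin_width):
    counts[lower] = sum(lower <= age < lower + bin_width for age in ages)

# Text histogram: one '#' per observation, no gaps between adjacent bins.
for lower, count in counts.items():
    print(f"{lower}-{lower + bin_width - 1}: {'#' * count}")
```

With these bins the tallest bar is the 30-39 bin, with five observations.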
20
Distributions and Numerical
Summaries
Data Distributions
21



The definition of data distribution changes for
categorical variables and quantitative variables,
but both have the same goals: to characterize the
behavior of the variable.
Categorical Variable: Its distribution is the list of
categories of the variable, along with the frequency
of each.
Quantitative Variable: Listing every possible value with
its frequency is not practical, so we need to describe
other features.
Describing Distributions
for Quantitative Data
22

Shape

Center

Spread
The disadvantage here is that we don’t have as good a
grasp of the data as we do with categorical variables,
but the advantage is that we can work with these
numerically.
Shape part 1 - Modes
23






Does the distribution (viewed using a histogram,
say) have no humps, one hump or more than one
hump?
We call the humps modes (the most popular
value(s) the variable can take on).
A distribution with no modes is called uniform.
A distribution with one mode is called unimodal.
A distribution with two modes is called bimodal.
A distribution with many modes is called
multimodal.
Shape part 2 - Symmetry
24


If we cut the distribution at the center and find an
approximate mirror image on both sides, the
distribution is said to be symmetric.
The ends of the distribution are known as its tails.
Skewed
25


If the distribution is not symmetric, then we say that
it is skewed.
We say it is skewed in the direction of the longer
tail. Thus it can be left/negatively skewed or
right/positively skewed.
Shape part 3 - Outliers
26




An Outlier is an observation that is quite far from
the ‘body’ of the distribution.
They can cause problems with just about every
method we will discuss in this course, so they must be
identified.
In some cases, outliers are removed, but this must be
done with great caution.
If an outlier is to be removed, it should be
mentioned in any subsequent conclusion/discussion.
Numerical Summaries
27



The center and spread of the data are described
numerically using summary statistics.
These aim to communicate as much as possible about
the data.
The shape of the distribution plays an important
role in the choice of summary statistics.
Center 1 – Median
28


The median is the middle observation of the ordered
list.
Calculating the median
Order the data (usually from lowest to highest)
 If there are an odd number of observations, select the
middle observation
 If there are an even number of observations, take the
average of the two middle observations.
E.g. if there are 7 observations, take the 4th; if there are 8
observations, take the average of observations 4 and 5.


Center 2 – Mean
29


The mean is simply the average of the observations.
We’ll use the mean extensively in this course.
Let y_i be the ith observation of variable y.
Let n be the number of observations.
Then we denote the mean using \bar{y}
and calculate it using:
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
More Mean
30



The mean summarizes the center well if the
distribution is symmetric, unimodal and there are no
outliers.
Otherwise the median is a better choice.
At this point, it may appear that the median is the
natural choice to calculate the center, but the mean
is often favoured. The reason is somewhat beyond
our scope for the moment.
An exercise
31


Consider two small data sets.
 1, 3, 7, 2, 12, 4, 8, 5, 8
 1, 3, 7, 2, 12, 4, 8, 5, 8, 120
Calculate the mean and median
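One way to check the hand calculation is with Python's standard `statistics` module. The names `without_outlier` and `with_outlier` are illustrative labels for the two data sets above.

```python
from statistics import mean, median

without_outlier = [1, 3, 7, 2, 12, 4, 8, 5, 8]
with_outlier = without_outlier + [120]   # same data plus the value 120

for name, data in [("first", without_outlier), ("second", with_outlier)]:
    print(f"{name}: mean = {mean(data):.2f}, median = {median(data)}")
```

Adding the single value 120 moves the mean from about 5.56 to 17, while the median only moves from 5 to 6.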
Spread 1 – Interquartile Range
32





The IQR is the range of the middle 50% of the
data.
The first quartile (Q1) is the value with 25% of the
ordered data below it.
The third quartile (Q3) is the value with 75% of the
ordered values below it.
IQR = Q3 - Q1
It is a number, not an interval.
Spread 2 – Variance and SD
33


The Variance is the ‘average’ of the squared
differences (or deviations) from the mean.
The Standard Deviation is the square root of the
variance.
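Written out with y_i as the ith observation, n as the number of observations and \bar{y} as the mean, these definitions are usually stated as below. The symbols s^2 and s are a common convention assumed here, and the divisor n-1 is why ‘average’ appears in quotes.

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2
\qquad
s = \sqrt{s^2}
```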
Properties of SD and Variance
34





They cannot be negative
The SD has the same units as the observations/mean
They are zero iff all observations have the same
value. (i.e. there is no spread!)
The larger they are the more spread out the data
are.
So why divide by n-1? Why not just take the
average?
What we should retain
35




The median and IQR are good measures of center and
spread even when the distribution is skewed or has
outliers.
The mean and variance are good measures of center
and spread when the distribution is symmetric without
outliers, and not multimodal.
When the data are symmetric without outliers and aren’t
multimodal, we typically only report the mean and SD.
If the distribution is multimodal, then summary statistics
are not appropriate.
36
Boxplots
Box-plot
37
[Figure: box-plot with labelled parts: Upper and Lower Whiskers, Q3 (3rd Quartile), Median, Q1 (1st Quartile)]
Box-plot
38


The Box-plot is a visual display of the 5-number
summary.
It is useful for comparing two or more distributions.
Constructing a Boxplot
39




Step 1 – the Box: Identify the Median, 1st and 3rd
Quartiles and complete the box.
Step 2 – the Fences: Fences are only used for
construction purposes. The fences are 1.5 × IQR away
from each Quartile (below Q1 and above Q3).
Step 3 – the Whiskers: Extend a line from each quartile
to the most extreme observation within the fences.
These lines are the whiskers.
Step 4 – Outliers: Any observation outside the fences
should be drawn as an individual point. These are
potential outliers.
Lifetime of Pacemakers
40
Replacing a pacemaker is a big deal. Data were
collected on pacemaker lifetimes (in years). Here are
the raw data:
12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9, 20.4, 11.1, 15.5,
12.4, 10.4, 10.7, 10.2
Summary Statistics:
Median = 11.6
Q1 = 10.4
Q3 = 12.9
Draw an appropriate boxplot.
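The numbers needed to draw the boxplot can be worked out as in the sketch below. It assumes the median-of-halves convention for quartiles (the median of the lower and upper halves of the ordered data), which reproduces the summary statistics given above. Other software may use a different quartile convention, and the helper name `quartiles_by_halves` is hypothetical.

```python
from statistics import median

lifetimes = [12.3, 11.7, 11.5, 9.2, 1.2, 13.4, 12.9, 20.4, 11.1, 15.5,
             12.4, 10.4, 10.7, 10.2]

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    Assumption: for an odd number of observations the middle value is
    left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered), median(ordered[-half:])

# Step 1: the box.
q1, med, q3 = quartiles_by_halves(lifetimes)
iqr = q3 - q1

# Step 2: fences sit 1.5 x IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Step 3: whiskers end at the most extreme observations inside the fences.
inside = [x for x in lifetimes if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Step 4: observations outside the fences are potential outliers.
outliers = sorted(x for x in lifetimes if x < lower_fence or x > upper_fence)

print("Q1, median, Q3:", q1, med, q3)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high)
print("potential outliers:", outliers)
```

For these data the fences fall at about 6.65 and 16.65 years, so the whiskers end at 9.2 and 15.5, and the lifetimes 1.2 and 20.4 are drawn as individual points.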
Boxplot Questions
41

When are box-plots inappropriate?

When do we favor box-plots over histograms?

Explain why a point identified as an outlier by a
box-plot may not be an outlier.
42
Linear Transformations!
Purpose
43

There are many situations which lead to the need to
linearly transform data.
 E.g. Your firm sends some temperature readings to an
American firm, so it transforms the readings from degrees
Celsius to degrees Fahrenheit.

When we transform the data, what happens to
summary statistics?
What are you measuring?
44


How a numerical measure is affected depends on what
it measures: location or spread.
We create three classes of numerical summaries
which are affected differently by transformations:
 Measures of Location: Mean, Median, Midrange,
Quartiles, Percentiles, Min and Max
 Measures of Spread: Standard Deviation, IQR, Range
 Variance
Measures of Location
45


These are affected by both adding/subtracting and
multiplying/dividing
Let m be the current measure of location, then
 Adding the constant a will result in the new measure:
m’ = m + a
 Multiplying by b will lead to: m’ = bm
 Using the linear function f(x)=a+bx will lead to:
m’ = a + bm
Measures of Spread
46


These are affected by multiplying/dividing, but not
by adding/subtracting
Let m be the current measure of spread, then
 Adding the constant a will result in the new measure:
m’ = m
 Multiplying by b will lead to: m’ = |b|m
(the absolute value is needed because a measure of
spread cannot be negative)
 Using the linear function f(x)=a+bx will lead to:
m’ = |b|m
Variance
47


Affected by the same operations as the measures of
spread, but the multiplier is squared.
Let v be the variance, then
 Adding the constant a will result in the new measure:
v’ = v
 Multiplying by b will lead to: v’ = b²v
 Using the linear function f(x)=a+bx will lead to:
v’ = b²v
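These rules can be checked numerically. The sketch below uses the Celsius-to-Fahrenheit conversion f(x) = 32 + 1.8x from the earlier example; the temperature readings are hypothetical values chosen only for illustration.

```python
import math
from statistics import mean, median, stdev, variance

# Hypothetical temperature readings in degrees Celsius (illustrative only).
celsius = [18.0, 21.5, 19.0, 25.0, 22.5]

# Linear transformation f(x) = a + b*x with a = 32 and b = 1.8.
a, b = 32.0, 1.8
fahrenheit = [a + b * x for x in celsius]

# Measures of location follow m' = a + b*m.
print(mean(fahrenheit), a + b * mean(celsius))
print(median(fahrenheit), a + b * median(celsius))

# Measures of spread ignore the shift a and scale by the size of b
# (taken in absolute value, since spread cannot be negative).
print(stdev(fahrenheit), abs(b) * stdev(celsius))

# The variance scales by b squared.
print(variance(fahrenheit), b ** 2 * variance(celsius))
```

Each printed pair should agree up to floating-point rounding.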
Questions
48


The average weight of watermelons from a farm is
4.3 kg with a SD of 1.5 kg. What are the mean
and SD of the weight in lbs?
The median and IQR on a final exam are 50 and
22. The instructor decides to multiply the results by 1.13
and add 5 to each grade. What are the summary
statistics now?
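A possible worked check of both questions, applying the location and spread rules. The conversion factor 1 kg ≈ 2.2046 lb is not given on the slide and is supplied here as an assumption.

```latex
\begin{align*}
\text{mean}'   &= 2.2046 \times 4.3 \approx 9.48\ \text{lb} \\
\text{SD}'     &= 2.2046 \times 1.5 \approx 3.31\ \text{lb} \\
\text{median}' &= 5 + 1.13 \times 50 = 61.5 \\
\text{IQR}'    &= 1.13 \times 22 = 24.86
\end{align*}
```

The added 5 shifts the median but leaves the IQR unchanged, since the IQR is a measure of spread.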
