Session 2 - bjgumm.com!

II. Descriptive Statistics
(Zar, Chapters 1 - 4)
Statistics and Randomization
Universe/Population: X1, X2, X4, …
Sample: {c1, c2, c3, c4, …}
Describe the Population
Randomize:
Group 1: {y1, y2, …, ym}
Group 2: {z1, z2, …, z(n-m)}
Statistical Test
Extrapolate
Conclusion
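The randomization step can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `randomize`, the sample labels c1 to c10, the group size m = 5, and the fixed seed are all hypothetical and not from the slides.

```python
import random


def randomize(sample, m, seed=None):
    """Shuffle the sample and split it into Group 1 (size m) and Group 2 (size n - m)."""
    rng = random.Random(seed)      # a seed makes the illustration repeatable
    shuffled = list(sample)        # copy so the original sample is untouched
    rng.shuffle(shuffled)
    return shuffled[:m], shuffled[m:]


# Hypothetical sample of n = 10 labelled units.
sample = [f"c{i}" for i in range(1, 11)]
group1, group2 = randomize(sample, m=5, seed=0)
print(group1)
print(group2)
```

Every unit lands in exactly one group, and which group it lands in depends only on the shuffle, not on any property of the unit.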
Statistical Test
Hypothesis
Null Hypothesis: H0: Group 1 = Group 2
Alternative, Two-sided: HA: Group 1 ≠ Group 2
Or
Alternative, One-sided: HA1: Group 1 < Group 2
HA2: Group 1 > Group 2
Types of Data
. Discrete
. Binary
(Examples: alive or dead
heads or tails
Drug "A" or Drug "B"
Male or Female
Normal/Disease)
Representation as data:
0 = alive
1 = dead
or
"A" for "alive"
"D" for "dead"
The sample is then {x1, x2, …, xn},
with each xi having only two choices
Summarize by
(1) Table
Status   Number
Alive    25
Dead     10
(2) Histogram
(a) Numbers: Alive 25, Dead 10
(b) Percent: Alive 71%, Dead 29%
. Coded
(ex. diagnosis, genus/species,
race, TNM, stage, color)
Representation as data:
By name or coded name
1 = Caucasian, Non Hispanic
2 = Black (African American)
3 = Hispanic
or just “C”, “B” (or “A”), and “H”;
if 4 = Oriental, then C, B (A), H, O.
Race
Summarize by
(1) Table
Race   Number   Percent
W      10       29%
B      5        15%
H      12       34%
O      7        21%
(2) Histogram
(a) Numbers
(b) Percent
. Ordered Scale
Examples:
Date,
Severity Scales (Benign, Possible Ca, Probable Ca, Cancer),
Agreement/Preference (Likert: Strongly Disagree, Disagree,
Neutral, Agree, Strongly Agree)
Stage
Strength Scales (0, +, ++, +++)
Represented by an Integer Scale
1: Benign;
2: Possible;
3: Not Sure or Neutral;
4: Probable;
5: Cancer
Summarized by:
(1) Table
Stage   Number   Percent
0       10       23.8
+       8        19.0
++      14       33.3
+++     10       23.8
(2) Histogram
(a) Percent
(b) Cumulative Percent
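The percent and cumulative percent for an ordered scale come from the counts and a running total. A minimal sketch in Python, assuming the four stage counts 10, 8, 14, 10 from the table; the variable names are illustrative only.

```python
# Counts per stage, in scale order (assumed from the table: 0, +, ++, +++).
counts = {"0": 10, "+": 8, "++": 14, "+++": 10}

total = sum(counts.values())
running = 0
rows = []
for stage, k in counts.items():
    running += k  # running total up to and including this stage
    rows.append((stage, k, round(100 * k / total, 1), round(100 * running / total, 1)))

print("Stage  Number  Percent  Cumulative")
for stage, k, pct, cum_pct in rows:
    print(f"{stage:>5}  {k:>6}  {pct:>7.1f}  {cum_pct:>10.1f}")
```

Cumulative percent only makes sense because the categories are ordered; for coded data such as race it has no meaning.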
Continuous
Ratio Scales
Scale differences are the same
(Ex: most data that have a zero)
[Axis: linear scale from 0 to 10]
True ratio data
(Ex: normalized data: raw data / background, treated / control, effector / target)
[Axis: continuous log scale from 0.1 to 10]
Representation of data:
real number
scientific notation
real number w/significant digits
{x1, x2, … ,xn}
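Summarized by
(1) Table
Ex: 10 data points
x1 = 1.5, x2 = 8, x3 = 4, x4 = 1, x5 = 4.5,
x6 = 5, x7 = 3.5, x8 = 4.5, x9 = 7, x10 = 5.5
(a) Point Plot
[Point plot: the ten values marked on an axis from 0 to 8, in the order x4, x1, x7, x3, x5 = x8, x6, x10, x9, x2; x5 and x8 are stacked at 4.5]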
(b) histogram
(1) form “bins”
Ex: 0-2, 2-4, 4-6, 6-8
(2) count number of data points in
each bin and plot # or %
(a) Count
(b) Percent
(c) Cumulative Histogram
1) form bins as before
0-2, 2-4, 4-6, 6-8
2) Count number ≤ or ≥ each bin edge
≤: 0, 0-2, 0-4, 0-6, 0-8, 0-10
≥: 0-10, 2-10, 4-10, 6-10, 8-10, 0
[Axis: bin edges 0, 2, 4, 6, 8, 10]
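The binning and counting steps can be sketched in Python for the ten example points. The slides do not say which bin a value on an edge (such as 4) belongs to, so this sketch assumes each bin includes its lower edge and the last bin also includes its upper edge; the helper name `bin_counts` is illustrative.

```python
from itertools import accumulate

data = [1.5, 8, 4, 1, 4.5, 5, 3.5, 4.5, 7, 5.5]
edges = [0, 2, 4, 6, 8]  # bins 0-2, 2-4, 4-6, 6-8


def bin_counts(values, edges):
    """Count values per bin; assumed convention: lower edge included, last bin closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    return counts


counts = bin_counts(data, edges)                      # histogram (a): count
percent = [100 * c / len(data) for c in counts]       # histogram (b): percent
cum_up = list(accumulate(counts))                     # bins 0-2, 0-4, 0-6, 0-8
cum_down = list(accumulate(reversed(counts)))[::-1]   # bins 0-8, 2-8, 4-8, 6-8

print(counts, percent, cum_up, cum_down)
```

With a different edge convention the value 4 would move from the 4-6 bin to the 2-4 bin, so the counts depend on that choice.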
What else can we do to summarize, or describe, the data?
(1) define where the center of the data lies
(measures of central tendency)
(2) how the data varies from that center
(measures of dispersion)
Center
Dispersion
Two numbers instead of all n
Chapter 3
Measures of Central Tendency
Where is the middle of the data?
Random Sample: x1, x2, …, xn
(1) The arithmetic mean (average)
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Ex: $\bar{x} = \frac{1.5 + 8 + 4 + \cdots + 5.5}{10} = 4.45$
Center of Gravity: the mean is the balance point of the point plot.
(2) The order statistics
x(1) = min (xi) ≤ x(2) ≤ x(3) ≤ … ≤ x(n) = max(xi)
x4 ≤ x1 ≤ x7 ≤ x3 ≤ x5 = x8 ≤ x6 ≤ x10 ≤ x9 ≤ x2
x(1) ≤ x(2) ≤ x(3) ≤ x(4) ≤ x(5.5) = x(5.5) ≤ x(7) ≤ x(8) ≤ x(9) ≤ x(10)
For ties, sum up the indices the tied values occupy and divide by the
number of ties.
Ex., x5 and x8 are tied (4.5) and occupy positions 5 and 6,
so the order statistic index is (5+6)/2 = 5.5.
The order statistic is x(5.5).
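The tie rule can be sketched as a small ranking routine in Python. This is an illustrative sketch (the function name `average_ranks` is an assumption): it sorts the data, finds runs of equal values, and gives every member of a run the average of the positions the run occupies.

```python
def average_ranks(values):
    """Return the 1-based order statistic index of each value, averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2  # mean of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


data = [1.5, 8, 4, 1, 4.5, 5, 3.5, 4.5, 7, 5.5]  # x1 .. x10
print(average_ranks(data))
```

For the example data, x5 and x8 both receive index 5.5, and every other point keeps its plain position in the sorted list.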
Median - middle order statistic:
If n is odd, it’s the middle statistic
$x_{median} = x_{((n+1)/2)}$
If n is even, it’s the average of the two middles
$x_{median} = \left( x_{(n/2)} + x_{(n/2+1)} \right) / 2$
If we want a formula that has even and odd together,
we can use the greatest integer function:
$x_{med} = \frac{x_{([(n+1)/2])} + x_{([n/2]+1)}}{2}$
Where [-] is the “greatest integer in …”
In the example above, n = 10, so [(n+1)/2] = 5 and [n/2] + 1 = 6
$x_{med} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{x_5 + x_8}{2} = \frac{4.5 + 4.5}{2} = 4.5$
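A single rule for odd and even n can be sketched in Python with integer (floor) division. This sketch uses the indices [(n+1)/2] and [n/2] + 1, which are the same position when n is odd and the two middle positions when n is even; the function name is illustrative.

```python
def median_by_order_statistics(values):
    """Median as the average of order statistics [(n+1)/2] and [n/2] + 1 (1-based)."""
    xs = sorted(values)
    n = len(xs)
    lower = (n + 1) // 2   # greatest integer in (n + 1) / 2
    upper = n // 2 + 1     # greatest integer in n / 2, plus one
    return (xs[lower - 1] + xs[upper - 1]) / 2  # shift to Python's 0-based indexing


data = [1.5, 8, 4, 1, 4.5, 5, 3.5, 4.5, 7, 5.5]
print(median_by_order_statistics(data))       # even n: average of the 5th and 6th
print(median_by_order_statistics([3, 1, 2]))  # odd n: the 2nd order statistic
```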
Plot the order statistic index (plot i on the y-axis)
against the corresponding order statistic (x(i) on the x-axis).
The plot is called a frequency polygon.
(3) The Mode
The x where the histogram is maximal.
Usually use the midpoint of the box
where the histogram is maximal.
Ex: In our continuous example:
The mode is in the box 4-6, so mode = (4+6)/2 = 5.0
(4) The mid-range
$\text{mid-range} = \frac{x_{(1)} + x_{(n)}}{2} = \frac{1.0 + 8.0}{2} = 4.5$
(5) The geometric mean
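$x_{geo} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}$
Derivation of the geometric mean.
Let $y_i = \log_{10}(x_i)$
Then
$\bar{y} = \sum_{i=1}^{n} y_i / n = \sum_{i=1}^{n} \log_{10}(x_i) / n = \frac{1}{n} \log_{10}\left( \prod_{i=1}^{n} x_i \right)$
$10^{\bar{y}} = 10^{\frac{1}{n} \log_{10} \prod x_i} = \sqrt[n]{\prod_{i=1}^{n} x_i} = x_{geo}$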
(6) The harmonic mean
$\text{harmonic mean} = \frac{1}{\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
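The geometric and harmonic means can be sketched in Python for the ten example points. The geometric mean is computed through the mean of the base-10 logs, and compared against the direct n-th root of the product; both helper names are illustrative, and all values are assumed to be positive.

```python
import math

data = [1.5, 8, 4, 1, 4.5, 5, 3.5, 4.5, 7, 5.5]


def geometric_mean(values):
    """10 raised to the mean of log10(x_i); requires every x_i > 0."""
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


def harmonic_mean(values):
    """n divided by the sum of reciprocals; requires every x_i > 0."""
    return len(values) / sum(1 / v for v in values)


direct = math.prod(data) ** (1 / len(data))  # n-th root of the product
print(geometric_mean(data), direct)          # the two routes agree up to rounding
print(harmonic_mean(data))
```

For positive data that are not all equal, the harmonic mean is below the geometric mean, which is below the arithmetic mean.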
SUMMARY: Measures of Central Tendency
(1) MEAN
Data evenly weighted
$\bar{x} = \sum_{i=1}^{n} x_i / n$
Average of salaries in lab:
4 hard working G.R.A. at 20,000 each = 80,000
1 Faculty member = 100,000
Total = 180,000
Mean = 180,000 / 5 = 36,000.
(2) Median
Center of Data
50% above, 50% below
Median=20,000
(3) Mode - needs bin sizes to be about the same
Mode=20,000
(4) Midrange - only the endpoints.
(100,000 + 20,000) / 2 = 60,000
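The salary example can be checked with Python's standard `statistics` module. The list below assumes four salaries of 20,000 and one of 100,000, as in the example; the midrange is computed by hand because the module has no function for it.

```python
import statistics

# Four G.R.A. salaries and one faculty salary.
salaries = [20_000, 20_000, 20_000, 20_000, 100_000]

mean = statistics.mean(salaries)                 # every value weighted evenly
median = statistics.median(salaries)             # middle order statistic
mode = statistics.mode(salaries)                 # most frequent value
midrange = (min(salaries) + max(salaries)) / 2   # uses only the endpoints

print(mean, median, mode, midrange)  # 36000 20000 20000 60000.0
```

One large salary pulls the mean and the midrange upward, while the median and the mode stay at the typical value.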
Chapter 4
Measures of Dispersion and Variability
(1) Range
Range = x(n) - x(1)
(2) Mean Deviation
$\text{mean deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
(3) Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$
Sometimes called the sample variance.
Sometimes called the moment of inertia.
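The first three dispersion measures can be sketched in Python for the ten example points. The variance is computed both from the squared deviations and from the shortcut with the sum of squares, to show that the two forms agree; variable names are illustrative.

```python
data = [1.5, 8, 4, 1, 4.5, 5, 3.5, 4.5, 7, 5.5]
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)                    # x(n) - x(1)
mean_deviation = sum(abs(x - mean) for x in data) / n

# Definition: sum of squared deviations over n - 1.
var_definition = sum((x - mean) ** 2 for x in data) / (n - 1)
# Shortcut: (sum of squares - n * mean^2) over n - 1.
var_shortcut = (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

print(data_range, mean_deviation, var_definition, var_shortcut)
```

The shortcut form needs only the running sums of x and x squared, but it can lose precision when the mean is large compared with the spread.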
Variance (cont.)
Each data point is selected randomly and independently
of all other points, so each represents a degree of freedom.
A sample of n points is a vector in n-dimensional space.
The new statistics used by $s^2$ are $x_i - \bar{x}$
$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = \sum_{i=1}^{n} x_i - n\bar{x} = \sum_{i=1}^{n} x_i - n \left( \sum_{i=1}^{n} x_i / n \right) = 0$
so that the $x_i - \bar{x}$ are not independent.
Estimating the true mean by $\bar{x}$ costs one
degree of freedom, leaving (n-1) degrees of freedom.
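The loss of one degree of freedom can be seen numerically: once the sample mean is fixed, any n - 1 of the deviations determine the last one. A small Python sketch with the ten example points (variable names are illustrative):

```python
data = [1.5, 8, 4, 1, 4.5, 5, 3.5, 4.5, 7, 5.5]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]
total = sum(deviations)                 # zero, up to floating-point rounding

# Because the deviations sum to zero, the last one is minus the sum of the others.
recovered_last = -sum(deviations[:-1])

print(total)
print(recovered_last, deviations[-1])
```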
(4) The standard deviation
$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = sd$
The units of s are the same as xi
(5) The standard error of the mean
$se = s / \sqrt{n}$
(6) The coefficient of variation
$CV = \frac{s}{\bar{x}} \cdot 100\%$
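The standard deviation, standard error of the mean, and coefficient of variation follow one from another. A minimal Python sketch for the ten example points, assuming the n - 1 divisor for the variance:

```python
import math

data = [1.5, 8, 4, 1, 4.5, 5, 3.5, 4.5, 7, 5.5]
n = len(data)
mean = sum(data) / n

s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))  # same units as the data
se = s / math.sqrt(n)                                        # shrinks as n grows
cv = 100 * s / mean                                          # unit-free, in percent

print(f"s = {s:.3f}, se = {se:.3f}, CV = {cv:.1f}%")
```

The CV is only meaningful for data on a ratio scale with a positive mean.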
(7) Quartiles (Divide the data in Quarters)
1st: $q_1 = x_{([n/4]+1)}$ (25th percentile)
Ex: $q_1 = x_{([10/4]+1)} = x_{(3)} = x_7 = 3.5$
2nd: median (50%)
3rd: $q_3 = x_{([3n/4]+1)}$ (75th percentile)
Ex: $q_3 = x_{([10 \cdot 3/4]+1)} = x_{(8)} = x_{10} = 5.5$
Interquartile range:
$IQR = q_3 - q_1$
Percentiles (Divide into %)
jth Percentile = $x_{(nj/100)}$
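The quartile rule can be sketched in Python with floor division, taking the order statistics at positions [n/4] + 1 and [3n/4] + 1. Other quartile conventions exist and can give slightly different values, so this is only an illustration of this particular rule; the function names are assumptions.

```python
def order_statistic(values, k):
    """Return x(k), the k-th smallest value (1-based)."""
    return sorted(values)[k - 1]


def quartiles(values):
    """First and third quartiles by the [n/4] + 1 and [3n/4] + 1 rule."""
    n = len(values)
    q1 = order_statistic(values, n // 4 + 1)
    q3 = order_statistic(values, 3 * n // 4 + 1)
    return q1, q3


data = [1.5, 8, 4, 1, 4.5, 5, 3.5, 4.5, 7, 5.5]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 3.5 5.5 2.0
```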
(8) Indices of diversity
“Shannon Index” (Information Theory)
$H' = -\sum_{i=1}^{k} p_i \log p_i$
where $p_i = h_i / n$
and hi = data pts in the ith bin.
The pi’s represent & estimate
the “true” probabilities in the bins (∑pi = 100%).
$H' = \frac{n \log n - \sum_{i=1}^{k} h_i \log h_i}{n}$
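The two forms of the Shannon index can be compared numerically. This Python sketch makes several assumptions: natural logarithms (the slides do not fix the base), empty bins skipped (treating 0 log 0 as 0), and illustrative bin counts 2, 1, 5, 2, which are what the bins 0-2, 2-4, 4-6, 6-8 hold for the ten example points when each bin includes its lower edge.

```python
import math


def shannon_from_proportions(counts):
    """H' = -sum p_i log p_i with p_i = h_i / n; empty bins contribute nothing."""
    n = sum(counts)
    return -sum((h / n) * math.log(h / n) for h in counts if h > 0)


def shannon_from_counts(counts):
    """H' = (n log n - sum h_i log h_i) / n, the count-based form."""
    n = sum(counts)
    return (n * math.log(n) - sum(h * math.log(h) for h in counts if h > 0)) / n


counts = [2, 1, 5, 2]  # assumed bin counts, see the lead-in
print(shannon_from_proportions(counts), shannon_from_counts(counts))
```

The index is largest, log k, when all k bins are equally full, and zero when every point falls in one bin.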
So, How do we use a measure of the Center
and a measure of Dispersion to represent the data?
(1) Mean ± SD or SE
In a Table
In a Graph
More common: Histogram Bars with whiskers
Problem: perception of lower limits -- who is similar?
Choices:
Standard Deviation: Show Population Variability
Standard Error: Show Mean Comparisons
Confidence Interval: Shows the result of the t-test
Box Plot: Median with quartiles; Whiskers for Min & Max; Circles/Asterisk for outliers
Extrapolation to the Universe
Sample Space
Universe
The Probability Density Function f(x) is estimated by the Histogram
as the sample size n gets large and the bin width gets small ($n \to \infty$, bin width $\to 0$).
Parameter in the Universe ← Statistic in the Sample Space (as $n \to \infty$)
$\mu = \int x f(x)\,dx \leftarrow \bar{x}_n$
$\sigma^2 = \int (x - \mu)^2 f(x)\,dx \leftarrow s_n^2$
Median: $0.5 = \int_{-\infty}^{Median} f(x)\,dx$, $Median \leftarrow x_{med}$
$F(x) = \int_{-\infty}^{x} f(u)\,du \leftarrow$ Cumulative Histogram(x)
F(x) is called the distribution function and is also
approximated by the frequency polygon.
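A sample version of the distribution function can be sketched in Python as the fraction of data points at or below x. This empirical distribution function is a standard estimate of F(x); treating it as the limiting form of the cumulative histogram (bin width going to zero, counts scaled to proportions) is an assumption of this sketch, and the name `ecdf` is illustrative.

```python
def ecdf(values, x):
    """Fraction of sample points less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)


data = [1.5, 8, 4, 1, 4.5, 5, 3.5, 4.5, 7, 5.5]

for x in (0, 2, 4, 4.5, 6, 8):
    print(x, ecdf(data, x))
```

The function rises from 0 to 1 in steps of 1/n at each data point (2/n at the tie), and it crosses 0.5 at the sample median.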