Data Handling/Statistics - LSU Macromolecular Studies Group

Data Handling/Statistics
There is no substitute for books—
— you need professional help!
My personal favorites, from which this lecture is drawn:
•The Cartoon Guide to Statistics, L. Gonick & W. Smith
•Data Reduction in the Physical Sciences, P. R. Bevington
•Workshop Statistics, A. J. Rossman & B. L. Chance
•Numerical Recipes, W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling
•Origin 6.1 Users Manual, MicroCal Corporation
Outline
•Our motto
•What those books look like
•Stuff you need to be able to look up
•Samples & Populations
•Mean, Standard Deviation, Standard Error
•Probability
•Random Variables
•Propagation of Errors
•Stuff you must be able to do on a daily basis
•Plot
•Fit
•Interpret
Our Motto
That which can be taught can be learned.
An opposing, non-CMC IGERT
viewpoint
The “progress” of civilization relies on being able to
do more and more things while thinking less and
less about them.
What those books look like
The Cartoon Guide to Statistics
In this example, the author
provides step-by-step analysis
of the statistics of a poll.
Similar logic and style tell
you how to tell two populations
apart, whether your measly
five replicate runs truly
represent the situation, etc.
The Cartoon Guide gives an
enjoyable account of statistics in
scientific and everyday life.
An Introduction
to Error Analysis
A very readable text, but with
enough math to be rigorous. The
cover says it all – the book’s
emphasis is how statistics and
error analysis are important in the
everyday.
Author John Taylor is known as “Mr.
Wizard” at Univ. of Colorado, for his popular
science lectures aimed at youngsters.
Bevington
Bevington is really good at introducing
basic concepts, along with simple code
that really, really works. Our lab uses a
lot of Bevington code, often translated from
Fortran to Visual Basic.
“Workshop Statistics”
This book has a website full of data that it
tells you how to analyze. The test cases are
often pretty interesting, too.
Many little shadow boxes provide info.
“Numerical Recipes”
A more modern and thicker version of Bevington.
Code comes in Fortran, C, Basic (others?). Includes
advanced topics like digital filtering, but harder to read
on the simpler things. With this plus Bevington and a
lot of time, you can fit, smooth, filter practically
anything.
Stuff you need to be able to look up
Samples vs. Populations
The world as we
understand it, based
on science.
The world as God
understands it, based
on omniscience.
Statistics is not art but artifice–a bridge to help us
understand phenomena, based on limited observations.
Our problem
Sitting behind the target, can
we say with some specific level of
confidence whether a circle
drawn around this single arrow
(a measurement) hits the
bullseye (the population mean)?
Measuring a molecular weight by
one Zimm plot, can we say with
any certainty that we have obtained
the same answer God would have
gotten?
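Sample View
Average: $\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2}$
Standard error of mean: $\mathrm{SEM} = \frac{s}{\sqrt{n}}$
Population View
Average: $\mu = E(x)$
Variance: $\sigma^2$
Standard deviation: $\sigma = \sigma(\bar{x})\,\sqrt{n}$ (the purple equation)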
Sample View: direct, experimental, tangible
The single most important thing about this is the reduction
in the standard error of the mean according
to inverse root n.
$\mathrm{SEM} = \frac{s}{\sqrt{n}} \sim \frac{1}{\sqrt{n}}$ (for large n)
Three times better takes 9 times longer (or costs
9 times more, or takes 9 times more disk space).
If you remembered nothing else from this lecture, it
would be a success!
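A minimal Python sketch of the sample-view formulas, using only the standard library. The replicate values are made up for illustration; they are not data from this lecture.

```python
import math
import statistics

def sample_summary(values):
    """Return (mean, standard deviation s, standard error of the mean)."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)          # uses the n - 1 denominator
    sem = s / math.sqrt(n)
    return mean, s, sem

# Hypothetical replicate measurements (illustrative numbers only).
runs = [10.1, 9.8, 10.3, 10.0, 9.9]
mean, s, sem = sample_summary(runs)
print(f"mean = {mean:.3f}, s = {s:.3f}, SEM = {sem:.3f}")

# Nine times as many points with the same scatter: the SEM drops threefold.
sem_9x = s / math.sqrt(9 * len(runs))
print(f"SEM with 9x the data and the same s: {sem_9x:.3f}")
```

The second print is the "three times better takes 9 times longer" rule in action: s itself stays put, while the uncertainty of the average shrinks.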
Population View: conceptual,
layered with arcana!
The purple equation in the table is an expression of the central
limit theorem. If we measure many averages, we do not
always get the same average:
$\bar{x}$ is itself a random variable!
“If one takes random samples of size n
from a population with mean $\mu$ and
standard deviation $\sigma$, then (for large n) $\bar{x}$ itself
approaches a normal distribution
with mean $\mu$ and standard
deviation $\sigma/\sqrt{n}$” (from “Cartoon...”).
It means…if you want to estimate $\sigma$, which only
God really knows, you should measure many averages, each
involving n data points, figure their standard deviation,
and multiply by $n^{1/2}$. This is hard work!
Huh?
A lot of times, $\sigma$ is approximated by s.
If you wanted to estimate the population average $\mu$,
the best you can do is to measure many averages and
average those.
A lot of times $\mu$ is approximated by $\bar{x}$.
IT’S HARD TO KNOW WHAT GOD DOES.
I think the $\sigma$ in the purple equation should be an s, but the equation only works in the limit
of large n anyhow, so there is no difference.
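The "hard work" of measuring many averages is cheap on a computer. The sketch below plays God by choosing a population mean and standard deviation (MU and SIGMA are invented for the demonstration), then checks that the scatter of the averages, multiplied by the square root of n, recovers the population standard deviation. A normal population is assumed only for convenience.

```python
import math
import random
import statistics

random.seed(1)                      # fixed seed so the run is repeatable

MU, SIGMA = 100.0, 15.0             # hypothetical "true" population values
n = 25                              # data points per average
n_averages = 4000                   # how many averages we measure

averages = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(n))
    for _ in range(n_averages)
]

sd_of_averages = statistics.stdev(averages)
sigma_estimate = sd_of_averages * math.sqrt(n)
mean_of_averages = statistics.fmean(averages)

print(f"scatter of the averages: {sd_of_averages:.2f}")
print(f"sigma / sqrt(n):         {SIGMA / math.sqrt(n):.2f}")
print(f"estimated sigma:         {sigma_estimate:.2f}")
print(f"average of the averages: {mean_of_averages:.2f}")
```

In a real experiment you rarely get 4000 averages, which is why s and the sample average usually stand in for the population values.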
You got to compromise, fool!
The t-distribution was invented by
a statistician named Gosset, who was forced
by his employer (the Guinness brewery!)
to publish under a pseudonym.
He chose “Student” and his t-distribution is
known as Student’s t.
The Student’s t distribution helps us assign confidence in
our imperfect experiments on small samples.
Input: desired confidence level, estimate of population
mean (or estimated probability),
estimated error of the mean (or probability).
Output: ± something
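As a rough illustration of that input/output recipe, here is a sketch that turns a handful of replicates into "mean ± something" at 95% confidence. The t values are the standard two-sided 95% table entries you would look up in any of the books; the replicate numbers are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of
# freedom (n - 1). Standard table values, looked up rather than computed.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def confidence_interval_95(values):
    """Return (mean, half_width) of the 95% confidence interval for the mean."""
    n = len(values)
    dof = n - 1
    if dof not in T_95:
        raise ValueError("this little table covers 2 to 11 data points only")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, T_95[dof] * sem

runs = [10.1, 9.8, 10.3, 10.0, 9.9]   # five hypothetical replicate runs
mean, half_width = confidence_interval_95(runs)
print(f"{mean:.2f} ± {half_width:.2f} (95% confidence, n = {len(runs)})")
```

Note how much larger the multiplier is for tiny samples (12.7 for two points) than the roughly 2 you would use if you knew the population values.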
Probability
…is another arcane concept in the “population” category: something
we would like to know but cannot. As a concept, it’s
wonderful. The true mean of a distribution of mass is given as the
probability of each mass times that mass, summed over all masses.
The standard deviation follows a similarly simple rule. In what follows, F means a
normalized frequency (think mole fraction!) and P is a probability
density. P(x)dx represents the fraction of things (think molecules)
with property x (think mass) between x-dx/2 and x+dx/2.
Discrete system
$\mu = \sum_{\text{all } x} x\,F(x)$
$\sigma^2 = \sum_{\text{all } x} (x - \mu)^2 F(x)$
Continuous system
$\mu = \int_{-\infty}^{\infty} x\,P(x)\,dx$
$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 P(x)\,dx$
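The discrete-system sums are easy to try out. This sketch assumes a made-up three-component mixture (the masses and mole counts are purely illustrative) and normalizes the counts so that F(x) behaves like a mole fraction.

```python
def mean_and_variance(values, frequencies):
    """Mean and variance of a discrete distribution.

    `frequencies` need not be normalized; they are divided by their sum
    so that F(x) adds up to 1, like mole fractions.
    """
    total = sum(frequencies)
    F = [f / total for f in frequencies]
    mu = sum(x * Fx for x, Fx in zip(values, F))
    var = sum((x - mu) ** 2 * Fx for x, Fx in zip(values, F))
    return mu, var

# Hypothetical mixture: molar masses and the number of moles of each.
masses = [10_000.0, 20_000.0, 30_000.0]
moles = [1.0, 2.0, 1.0]

mu, var = mean_and_variance(masses, moles)
print(f"mean = {mu:.0f}, standard deviation = {var ** 0.5:.0f}")
```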
Here’s a normal probability density distribution from
“Workshop…”, which has you discover these rules using actual data:
$\mu \pm \sigma$: 68% of results
$\mu \pm 2\sigma$: 95% of results
What it means
Although you don’t usually know the distribution
(either $\mu$ or $\sigma$), about 68% of your measurements will
fall within $\pm 1\sigma$ of $\mu$….if the distribution is a “normal”,
bell-shaped curve. t-tests allow you to kinda play this
backwards: given a finite sample size, with some
average, $\bar{x}$, and standard deviation, s (inferior to
$\mu$ and $\sigma$, respectively), how far away do we think the true
$\mu$ is?
Details
No way I could do it better than “Cartoon…”
or “Workshop…”
Remember…this is the part of the lecture
entitled “things you must be able to look up.”
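One thing that is quick to look up in code is the 68% / 95% rule for a normal distribution. This sketch uses `NormalDist` from Python's standard `statistics` module; it assumes a perfectly normal population, which real data only approximate.

```python
from statistics import NormalDist

def fraction_within(k_sigma):
    """Fraction of a normal population lying within ±k_sigma of its mean."""
    z = NormalDist()                      # standard normal: mean 0, sigma 1
    return z.cdf(k_sigma) - z.cdf(-k_sigma)

for k in (1, 2, 3):
    print(f"within ±{k} sigma: {100 * fraction_within(k):.1f}%")
```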
Propagation of errors
Suppose you give 30 people a ruler and ask them to measure
the length and width of a room. Owing to general
incompetence, otherwise known as human nature,
you will get not one answer but many. Your averages
will be L and W, and standard deviations sW and sL.
Now, you want to buy carpet, so need area A = L·W.
What is the uncertainty in A due to the measurement errors
in L and W?
Answer! There is no telling….but you have several options
to estimate it.
A = L·W example
Here are your measured data:
$L = 30 \pm 1$ ft
$W = 19 \pm 2$ ft
You can consider “most” and “least” cases:
$A_{max} = (L + \sigma_L)(W + \sigma_W) = 31 \cdot 21\ \text{ft}^2 = 651\ \text{ft}^2$
$A_{min} = (L - \sigma_L)(W - \sigma_W) = 29 \cdot 17\ \text{ft}^2 = 493\ \text{ft}^2$
$A_{average} = \frac{651 + 493}{2}\ \text{ft}^2 = 572\ \text{ft}^2$
estimated uncertainty: $\frac{651 - 493}{2} = 79$
reported area: $(570 \pm 80)\ \text{ft}^2$
Another way
We can use a formula for how $\sigma$ propagates.
Suppose some function y (think area) depends on
two measured quantities t and s (think length &
width). Then the variance in y follows this rule:
$\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$
Aren’t you glad you took partial differential equations?
What??!! You didn’t? Well, sign up. PDE is the bare
minimum math for scientists.
Translation in our case, where A = L·W:
$\sigma_A^2 = \left(\frac{\partial A}{\partial L}\right)^2 \sigma_L^2 + \left(\frac{\partial A}{\partial W}\right)^2 \sigma_W^2 = W^2 \sigma_L^2 + L^2 \sigma_W^2$
Problem: we don’t know W, L, $\sigma_L$ or $\sigma_W$! These are
population numbers we could only get if we had the
entire planet measure this particular room. We therefore
assume that our measurement set is large enough (n=30)
that we can use our measured averages for W and L and
our standard deviations for $\sigma_L$ and $\sigma_W$.
$\sigma_A^2 = (19\ \text{ft})^2 (1\ \text{ft})^2 + (30\ \text{ft})^2 (2\ \text{ft})^2 = 3961\ \text{ft}^4$
$\sigma_A = 63\ \text{ft}^2$
So the value to report is:
$(19 \cdot 30)\ \text{ft}^2 \pm 63\ \text{ft}^2$
or....
$A = (570 \pm 63)\ \text{ft}^2$
Compare this to our empirical, most/least calculation:
$A = (570 \pm 80)\ \text{ft}^2$
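If you did not take partial differential equations, a computer can take the partial derivatives for you. The sketch below applies the propagation rule to the room example, estimating each derivative by a central difference. The helper name `propagate` and its step size are choices made for this illustration, and the rule as coded treats the errors in L and W as independent.

```python
import math

def propagate(f, values, sigmas, rel_step=1e-6):
    """Estimate the standard deviation of f(*values) from the independent
    uncertainties `sigmas`, using central-difference partial derivatives."""
    variance = 0.0
    for i, (v, s) in enumerate(zip(values, sigmas)):
        h = abs(v) * rel_step or rel_step   # fall back to rel_step if v == 0
        up = list(values)
        down = list(values)
        up[i] = v + h
        down[i] = v - h
        dfdv = (f(*up) - f(*down)) / (2 * h)
        variance += (dfdv * s) ** 2
    return math.sqrt(variance)

def area(length, width):
    return length * width

L, W = 30.0, 19.0          # measured averages, ft
s_L, s_W = 1.0, 2.0        # measured standard deviations, ft

sigma_A = propagate(area, [L, W], [s_L, s_W])
print(f"A = {area(L, W):.0f} ± {sigma_A:.0f} ft^2")
```

The same helper works for any function of the measured quantities, which is handy when the algebra gets messier than a product.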
Error propagation caveats
The equation, $\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$, assumes
normal behavior. Large systematic errors (for example,
3 euroguys who report their values in metric units) are not
taken into consideration properly. In many cases, there
will be good knowledge a priori about the uncertainty in
one or more parameters: in photon counting, if N is
the number of photons detected, then $\sigma_N = N^{1/2}$. Systematic
error is not included in this estimate, so photon folk are
well advised to just repeat experiments to determine
real standard deviations that do take systematic errors into
account.
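The photon-counting rule is worth a quick numerical look, because it is the inverse-root-n story again. This sketch assumes pure counting statistics and nothing else (no systematic error, no background); the count levels are arbitrary examples.

```python
import math

def counting_uncertainty(n_counts):
    """Return (sigma_N, relative uncertainty) for N detected photons,
    assuming pure counting statistics: sigma_N = sqrt(N)."""
    sigma = math.sqrt(n_counts)
    return sigma, sigma / n_counts

for n_counts in (100, 10_000, 1_000_000):
    sigma, rel = counting_uncertainty(n_counts)
    print(f"N = {n_counts:>9,d}: sigma_N = {sigma:6.0f}, relative = {100 * rel:.1f}%")
```

A hundredfold increase in counts buys only a tenfold improvement in relative precision.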
Stuff you must know how to do on a daily basis
Plot!!!
99.97% of the trend can be explained
by the fitted relation.
[Figure: Γ/Hz vs. q²/10¹⁰ cm⁻², Larger Particle, 30.9 g/ml, with linear fit; r = 0.99987, r² = 0.9997]
Parameter   Value         Error
A           -0.00267      44.94619
B           2.25237E-7    8.46749E-10
R           SD            N     P
0.99987     118.8859      21    <0.0001
Intercept = -0.003 ± 45
(i.e., zero!)
The same data
How to find this file: twilight users rcueto e739
[Figure: D_app/cm² s⁻¹ vs. q²/10¹⁰ cm⁻², Larger Particle, 30.9 g/ml, with linear fit; r = -0.444, r² = 0.20]
Parameter   Value          Error
A           2.2725E-7      7.62107E-10
B           -3.09723E-20   1.43575E-20
R           SD             N     P
-0.44355    2.01583E-9     21    0.044
Only 20% of the variation in the data can be
explained by the line! While
Γ depended on q², D_app does not!
What does the famous “r²” really tell us?
Suppose you
invented a new
polymer that you
hoped was more
stable over time
than its
predecessor…
So you check.
time    melting point
2       110.2
4       110.9
8       108.8
12      109.1
16      109.0
24      108.5
36      110.0
48      109.2
Question:
What describes the data better:
A simple average
(meaning things aren’t really
changing over time: it is stable)
OR
A trend
(meaning melting point might be
dropping over time)?
How well does the mean describe
the data?
These are called ‘residuals.’
The sum of the square of all the
residuals characterizes how well
the data fit the mean.
$S_t = \sum_i (T_i - T_{mean})^2$ (= 4.6788)
How much better is a fit
(i.e., a regression in this case)?
The regression also has residuals.
The sum of their squares is smaller
than St.
$S_r = \sum_i (T_i - T_{fit,i})^2$ (= 4.3079)
The r² value simply compares
the fit to the mean, by comparing
the sums of the squares:
$r^2 = \frac{S_t - S_r}{S_t}$
$r^2 = \frac{4.6788 - 4.3079}{4.6788} = 0.0793$
In our case, the fit was NOT a
dramatic improvement,
explaining only 7.9% of the
variability of the data!
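The whole r² story fits in a few lines of Python. This sketch uses the time and melting-point values from the table and a plain least-squares straight line; the variable names are choices made here, not anything from Origin or Excel.

```python
time_points = [2, 4, 8, 12, 16, 24, 36, 48]
melting_point = [110.2, 110.9, 108.8, 109.1, 109.0, 108.5, 110.0, 109.2]

n = len(time_points)
t_mean = sum(time_points) / n
T_mean = sum(melting_point) / n

# Least-squares straight line: T = intercept + slope * time.
slope = (
    sum((t - t_mean) * (T - T_mean) for t, T in zip(time_points, melting_point))
    / sum((t - t_mean) ** 2 for t in time_points)
)
intercept = T_mean - slope * t_mean

# Sums of squared residuals about the mean (S_t) and about the line (S_r).
S_t = sum((T - T_mean) ** 2 for T in melting_point)
S_r = sum(
    (T - (intercept + slope * t)) ** 2
    for t, T in zip(time_points, melting_point)
)
r_squared = (S_t - S_r) / S_t

print(f"S_t = {S_t:.3f}, S_r = {S_r:.3f}, r^2 = {r_squared:.3f}")
```

Because a straight line with zero slope is just the mean, S_r can never exceed S_t for a least-squares line; r² measures how much of S_t the slope managed to remove.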
Plot showing 95% confidence limits.
Excel doesn’t excel at this!
[Figure: Rh/nm vs. c/μg-ml⁻¹ with linear fit and 95% confidence limits; annotations: “Range of Rg values observed in MALLS” and “(3/5)^(1/2) Rh”]
[6/7/01 13:44 "/Rhapp" (2452067)]
Linear Regression for BigSilk_Ravgnm:
Y=A+B*X
Parameter   Value      Error
A           20.88925   0.19213
B           0.01762    0.01105
R           SD         N    P
0.62332     0.28434    6    0.18611
Interpreting data: Life on the bleeding edge of
cutting technology. Or is that bleating edge?
The noise level in individual runs is much less than
the run-to-run variation. That’s why many runs are
a good idea. More would be good here, but we are
still overcoming the shock that we can do this at all!
[Figure: Rg/nm vs. M, with M from 3E6 to 2E7; n = 0.324 +/- 0.04, df = 3.12 +/- 0.44]
Excel does not automatically
provide ± estimates!
Correlation Caveat!
Correlation ≠ Cause. No, Correlation = Association.
Country          Life Expectancy   People per TV   TV's per person
Angola           44                200             0.0050
Australia        76.5              2               0.5000
Cambodia         49.5              177             0.0056
Canada           76.5              1.7             0.5882
China            70                8               0.1250
Egypt            60.5              15              0.0667
France           78                2.6             0.3846
Haiti            53.5              234             0.0043
Iraq             67                18              0.0556
Japan            79                1.8             0.5556
Madagascar       52.5              92              0.0109
Mexico           72                6.6             0.1515
Morocco          64.5              21              0.0476
Pakistan         56.5              73              0.0137
Russia           69                3.2             0.3125
South Africa     64                11              0.0909
Sri Lanka        71.5              28              0.0357
Uganda           51                191             0.0052
United Kingdom   76                3               0.3333
United States    75.5              1.3             0.7692
Vietnam          65                29              0.0345
Yemen            50                38              0.0263
[Figure: Life Expectancy vs. TV's per person, with linear fit y = 35.441x + 57.996, R² = 0.5782]
58% of the variation in life expectancy is associated with TV’s.
Would we save lives by sending TV’s to Uganda?
Linearize it!
[Figure: Life Expectancy vs. People per TV, with linear fit y = -0.1156x + 70.717, R² = 0.6461]
Linearity is
improved by
plotting Life vs.
people per TV
rather than TV’s
per person.
Observant scientists are adept at seeing curvature. Train
your eye by looking for defects in wallpaper, door trim,
lumber bought at Home Depot, etc. And try to straighten
out your data, rather than let the computer fit a nonlinear form,
which it is quite happy to do!
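To see what the two ways of plotting do to the fit, here is a sketch that runs a straight-line least-squares fit on the life expectancy data both ways. The numbers are the 22 countries from the table; the helper `linear_fit` is written for this illustration and is not Excel's or Origin's routine.

```python
life_expectancy = [44, 76.5, 49.5, 76.5, 70, 60.5, 78, 53.5, 67, 79, 52.5,
                   72, 64.5, 56.5, 69, 64, 71.5, 51, 76, 75.5, 65, 50]
people_per_tv = [200, 2, 177, 1.7, 8, 15, 2.6, 234, 18, 1.8, 92,
                 6.6, 21, 73, 3.2, 11, 28, 191, 3, 1.3, 29, 38]
tvs_per_person = [1 / p for p in people_per_tv]

def linear_fit(x, y):
    """Least-squares line y = intercept + slope*x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean, sxy ** 2 / (sxx * syy)

for label, x in (("TV's per person", tvs_per_person),
                 ("People per TV", people_per_tv)):
    slope, intercept, r2 = linear_fit(x, life_expectancy)
    print(f"{label:16s} y = {slope:.4f}x + {intercept:.3f}, R^2 = {r2:.4f}")
```

A higher R² is only a hint. Plot both versions and look for curvature with your own eyes before trusting either line.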
Plots are pictures of
science, worth
thousands of words
in boring tables.
These 4 plots all have the
same slopes, intercepts and
r values!
From whence do those lines come?
Least squares fitting.
“Linear Fits”
The fitted coefficients
appear linearly in the
expression, e.g.,
$y = a + bx + cx^2 + dx^3$
An analytical “best fit” exists!
“Nonlinear fits”
At least some of the fitted coefficients
appear in transcendental
arguments, e.g.,
$y = a + b e^{-cx} + d\cos(ex)$
Best fit found by trial & error.
Beware false solutions! Try
several initial guesses!
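To make "an analytical best fit exists" concrete, here is a sketch of a polynomial least-squares fit done by solving the normal equations directly, with no initial guesses and no iteration. The function name `polyfit` and the noise-free test data are inventions for this example; production code would normally lean on a tested library routine instead.

```python
def polyfit(x, y, degree):
    """Least-squares coefficients [a0, a1, ...] of
    y = a0 + a1*x + a2*x**2 + ..., found from the normal equations."""
    m = degree + 1
    # Normal equations: sum_k a_k * sum_i x_i**(j+k) = sum_i y_i * x_i**j
    M = [[sum(xi ** (j + k) for xi in x) for k in range(m)] for j in range(m)]
    rhs = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(m)]

    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, m):
            factor = M[row][col] / M[col][col]
            for k in range(col, m):
                M[row][k] -= factor * M[col][k]
            rhs[row] -= factor * rhs[col]

    # Back substitution.
    coeffs = [0.0] * m
    for row in range(m - 1, -1, -1):
        tail = sum(M[row][k] * coeffs[k] for k in range(row + 1, m))
        coeffs[row] = (rhs[row] - tail) / M[row][row]
    return coeffs

# Hypothetical noise-free data generated from y = 1 + 2x - 0.5x^2 + 0.1x^3.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [1 + 2 * x - 0.5 * x ** 2 + 0.1 * x ** 3 for x in xs]

coefficients = polyfit(xs, ys, 3)
print("a, b, c, d =", ", ".join(f"{c:.4f}" for c in coefficients))
```

The fit is "linear" because the unknowns a, b, c, d enter linearly, even though the curve itself is a cubic. A nonlinear form has no such one-shot solution.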
CURVE FITTING:
Fit the trend or fit the points?
Earth’s mean
annual temp has
natural
fluctuations year
to year.
To capture a long
term trend, we
don’t want to fit
the points, so use a
low-order
polynomial
regression.
BUT,
The bumps and jiggles
in the U.S. population
data are ‘real.’
We don’t want to lose
them in a simple trend.
REGRESSION: We lost the baby boom!
SINGLE POLYNOMIAL: Does
funny things (see 1905).
SPLINE: YES: Lots of
individual polynomials give us
a smooth fit (especially good
for interpolation).
All data points are not created equal.
Since that one point has
so much error (or noise) should
we really worry about minimizing
its square? No.
We should minimize “chi-squared.”
$\chi^2 = \sum_{i=1}^{n} \frac{(y_i - y_{fit})^2}{\sigma_i^2}$
Goodness of fit parameter that should
be unity for a “fit within error”:
$\chi^2_{reduced} = \frac{1}{\nu} \sum_{i=1}^{n} \frac{(y_i - y_{fit})^2}{\sigma_i^2}$
$\nu$ is the # of degrees of freedom
$\nu$ = n - # of parameters fitted
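Here is a sketch of a chi-squared straight-line fit in which each point is weighted by its own error bar. The closed-form sums are the textbook weighted least-squares result for y = a + b·x; the data set, including the one very noisy point, is invented for the example.

```python
import math

def weighted_line_fit(x, y, sigma):
    """Fit y = a + b*x by minimizing chi-squared with per-point sigmas.

    Returns (a, b, sigma_a, sigma_b, chi2, reduced_chi2).
    """
    w = [1.0 / s ** 2 for s in sigma]
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx ** 2

    a = (Sxx * Sy - Sx * Sxy) / delta
    b = (S * Sxy - Sx * Sy) / delta
    sigma_a = math.sqrt(Sxx / delta)
    sigma_b = math.sqrt(S / delta)

    chi2 = sum(((yi - (a + b * xi)) / si) ** 2
               for xi, yi, si in zip(x, y, sigma))
    dof = len(x) - 2                 # n minus the two fitted parameters
    return a, b, sigma_a, sigma_b, chi2, chi2 / dof

# Hypothetical data: the last point is far off but has a huge error bar.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 14.0]
sigma = [0.2, 0.2, 0.2, 0.2, 5.0]

a, b, sigma_a, sigma_b, chi2, reduced = weighted_line_fit(x, y, sigma)
print(f"a = {a:.2f} ± {sigma_a:.2f}, b = {b:.2f} ± {sigma_b:.2f}")
print(f"chi-squared = {chi2:.2f}, reduced chi-squared = {reduced:.2f}")
```

Because the noisy point carries so little weight, the slope stays close to what the four good points suggest instead of being dragged upward.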
Why is a fit based on chi-squared so
special?
Based on chi: these two curves fit equally well!
Based on |chi| (absolute value): these
three curves fit equally well!
Based on max(chi): outliers exert
too strong an influence!
$\chi^2$ caveats
•Chi-square lower than unity is meaningless…if you
trust your $\sigma^2$ estimates in the first place.
•Fitting too many parameters will lower $\chi^2$ but this may
be just doing a better and better job of fitting the noise!
•A fit should go smoothly THROUGH the noise, not
follow it!
•There is such a thing as enforcing a “parsimonious” fit
by minimizing a quantity a bit more complicated than $\chi^2$.
This is done when you have a-priori information that the
fitted line must be “smooth”.
Achtung! Warning!
This lecture is an example of a very dangerous
phenomenon: “what you need to know.” Before you were
born, I took a statistics course somewhere in undergraduate
school. Most of this stuff I learned from experience….um…
experiments. A proper math course, or a course from LSU’s
Department of Experimental Statistics would firm up your
knowledge greatly.
AND BUY THOSE BOOKS! YOU WILL NEED THEM!
Cool Excel/Origin Demo