Don’t Be Afraid to Ask


Quantitative Methods
of Data Analysis
Bill Menke, Instructor
Natalia Zakharova, TA
Goals
Make you comfortable with the analysis of
numerical data through practice
Teach you a set of widely-applicable data
analysis techniques
Provide a setting in which you can creatively
apply what you’ve learned to a project of
your own choosing
SYLLABUS
September 03 (W)   Intro; Issues associated with working with data
September 08 (M)   Issues associated with coding; MatLab tutorial
September 10 (W)   Linear Algebra Review; Least Squares
September 15 (M)   Probability and Uncertainty
September 17 (W)   Variance and other measures of error, bootstraps
September 22 (M)   The principle of maximum likelihood
September 24 (W)   Advanced topics in Least-Squares, Part 1
September 29 (M)   Advanced topics in Least-Squares, Part 2
October 01 (W)     Interpolation and splines
October 06 (M)     Hypothesis testing
October 08 (W)     Linear systems, impulse response & convolutions
October 13 (M)     Filter Theory
October 15 (W)     Applications of Filters
October 20 (M)     Midterm Exam
October 22 (W)     Orthogonal functions; Fourier series
October 27 (M)     Basic properties of Fourier transforms
October 29 (W)     Fourier transforms and convolutions
November 03 (M)    Sampling theory
November 05 (W)    Spectral analysis; power spectra
November 12 (W)    Statistics of spectra; practical considerations
November 17 (M)    Wavelet analysis
November 19 (W)    Empirical Orthogonal Functions
December 01 (M)    Adjoint methods
December 03 (W)    Class project presentations
December 08 (M)    Review for Final
Homework
Assigned on a weekly basis
Due Mondays at the start of class
Due in hardcopy; arrange it so that the
numbered problems (typically 1, 2, and 3) can
be physically separated from one another
Advice: start early; seek assistance of
classmates, TA and me (in that order)
Project
Substantial and creative analysis of a
dataset of your choice
Chance to apply a wide suite of
techniques learned in this class in a
realistic setting
might (or might not) be part of your
research; might (or might not) lead to
a paper
Project Dates
September 17 (W) 1-page abstract due;
then schedule brief meeting with me
November 05 (W) Progress report due
December 03 (W) Brief presentation of
results in class
December 08 (M) Hardcopy of Project
Report due at start of class
Grading
Homework   20%
Midterm    15%
Final      15%
Project    50%
You should read my grading policy:
www.ldeo.columbia.edu/users/menke/gradingpolicy.html
Software
Excel
point-and-click environment
little overhead for quick analysis
hard to automate repetitive tasks
hard to document operations
columns, rows of data
cell-oriented formulas
MatLab
scripting environment
some overhead, so less quick
easy to automate repetitive tasks
easy to document operations
vectors and matrices of data
general programming environment
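To give a flavor of the difference, here is a minimal MatLab sketch (the filename mydata.txt and its two-column layout are hypothetical) that loads, plots, and summarizes a dataset; re-running the script repeats every step exactly, which is what makes repetitive work easy to automate and to document:

D = load('mydata.txt');    % read a plain-text file into a matrix, one row per line
t = D(:,1);                % first column: time
d = D(:,2);                % second column: measurement
plot(t, d, 'k-');          % quick look at the data
xlabel('time'); ylabel('measurement');
fprintf('N = %d samples, mean = %g\n', length(d), mean(d));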
Survey
1. Put an x in this triangle (vertices: THEORY, LAB & FIELD WORK, DATA ANALYSIS) that represents your expectation for your career.
2. Have you had a course that included (check all that apply): matrices & linear algebra; probability and statistics; vector calculus; computer programming
3. Calculate this sum w/o electronic assistance: 472+964+674+2 = ______
4. Plot the function y(x) = 1 + 2x + 3x² on this graph:
5. Estimate the number of minutes it would take you to walk from Morningside to Lamont: ______.
The Nature of Data
please read
Doug Martinson’s Part 1: Fundamentals
(available under Courseworks)
Key Ideas
How data were estimated is important!
Data are never perfect; inherently contain error
You analyze data to learn something specific,
not to show virtuosity with some analysis
method!
A scientific objective must be clearly articulated;
the analysis method must fit the objective.
Data Lingo
Discrete vs. Continuous
data is always discrete
(a series of individual numbers, such as a sequence of readings off a thermometer)
even though the process being observed may
be continuous
(the temperature of the room, which varies
continuously as a function of time and space)
Sequential vs. non-sequential Data
data often has some sort of natural
organization, the most common of which is
sequential, e.g.:
temperature of this room, measured
every fifteen minutes
temperature right now along the hallway,
measured every thirty centimeters
Such data is often called a time-series, even in
the case where the organization is not based on
time, but on distance (or whatever) …
Multivariate Data
while a time-series is the data analog of a
function of one variable, e.g. f(t)
a multivariate dataset is the data analog of a
function of two or more variables, e.g. f(x,y)
My photo, at left, is one such
multivariate dataset, because the
digital camera that captured the
image was measuring light
intensity as a function of two
independent spatial variables.
There are 300,000 individual
measurements in this (relatively
low resolution) image.
Precision and Dynamic Range
Any measurement is made only to a finite number
of decimal places, or precision. It can make a
big difference whether the measurement is to
one decimal place, 1.1, or to 7, 1.096752
A sequence of values will vary in size. The
dynamic range quantifies the ratio of the largest
value to the smallest (non-zero) value*. It can
make a big difference if all the data vary in the
range 1.0046-1.0062 or 1.0046-11018.
* See Doug’s notes for the exact definition, which involves a logarithm
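As a rough numerical illustration (Doug's notes give the exact, logarithm-based definition; the decibel convention and the data values below are just assumptions for this sketch):

d     = [1.0046 1.0051 0 1.0055 1.0062];  % hypothetical data, including one exact zero
dnz   = abs(d(d ~= 0));                   % drop exact zeros before forming the ratio
ratio = max(dnz) / min(dnz);              % dynamic range as a plain ratio
dB    = 20 * log10(ratio);                % the same ratio expressed logarithmically (dB)
fprintf('dynamic range: %g (%g dB)\n', ratio, dB);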
Vectors and Matrices
A list of measurements (d1, d2, d3 .. ) can be organized
very effectively into a vector, d.
A table of data

         site1   site2   site3
time1    d11     d12     d13
time2    d21     d22     d23
time3    d31     d32     d33
As we will see during the semester, the algebra of vector
and matrix arithmetic can then be used very effectively
to implement many different kinds of data analysis
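In MatLab this organization is immediate. A small sketch with made-up numbers:

d = [10.2; 11.7; 9.8];          % measurements d1, d2, d3 as a column vector (made-up values)
D = [ 1.2  3.4  2.2 ;           % row 1 = time1 at site1, site2, site3 (made-up values)
      1.5  3.1  2.0 ;           % row 2 = time2
      1.4  3.6  2.3 ];          % row 3 = time3
siteMeans = mean(D, 1);         % average over times: one value per site (column means)
timeMeans = mean(D, 2);         % average over sites: one value per time (row means)

Once the data live in d and D, whole analyses reduce to a few lines of matrix arithmetic.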
Precision* vs. Accuracy
precision – repeatability of the measurement; what is the scatter if you make the measurement many times?
accuracy – difference between the center of a group of scattered measurements and the true value of what you're trying to measure
* Note the different sense of the word precision than 3 slides ago.
Signal-to-Noise Ratio
Error in the data vs. size of the data:
The size of the error is most meaningful when compared to the size of the data.
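One simple way to put a number on this comparison is the ratio of the rms size of the data to the rms size of the error; the values below are made up, and in practice the error usually has to be estimated rather than known:

d   = [4.9 5.2 5.1 4.8 5.0];                 % hypothetical measurements
e   = [0.1 -0.2 0.1 -0.1 0.05];              % hypothetical errors in those measurements
snr = sqrt(mean(d.^2)) / sqrt(mean(e.^2));   % rms data size over rms error size
fprintf('signal-to-noise ratio: %g\n', snr); % much greater than 1: data stand well above the error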
some examples of data …
and techniques used to analyze them
Biggest peak has a period of exactly one year … makes sense, it's the annual cycle in river flow.
But what about these smaller peaks?
Daily Temperature 1947-2008 at LaGuardia Airport (LGA), New York, NY
I'm suspicious of these 'exactly zero' values. Missing data, defaulted to zero, maybe?
LaGuardia Airport (New York, NY) Temperature vs. Precipitation
Mean precip in a given one-degree temperature range
Related to: Conditional probability that it will rain, given the temperature?
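One way such a curve can be built is by binning: average the precipitation over all days whose temperature falls in each one-degree bin. A hedged sketch with synthetic stand-in data (a real analysis would load the LGA records instead):

T = 50 + 20*randn(5000,1);                    % stand-in daily temperatures
P = max(0, randn(5000,1));                    % stand-in daily precipitation
edges = floor(min(T)) : 1 : ceil(max(T));     % one-degree bin edges
meanP = nan(length(edges)-1, 1);              % one conditional mean per bin
for k = 1:length(edges)-1
    in = (T >= edges(k)) & (T < edges(k+1));  % days whose temperature falls in this bin
    if any(in)
        meanP(k) = mean(P(in));               % mean precip, given this temperature range
    end
end
centers = edges(1:end-1) + 0.5;               % bin centers for plotting
plot(centers, meanP, 'o-');
xlabel('temperature (one-degree bins)'); ylabel('mean precipitation');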
Ground motion here at Lamont on
August 28 2008
Here’s an earthquake
But what are these little things?
A simple filtering technique to
accentuate the longer, clear periods in
the data
Here’s an earthquake
Little similar-looking earthquakes, and lots of them!
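The slide does not say which filter was used; a running (boxcar) mean is one simple choice that suppresses short-period wiggles and leaves the longer, clearer periods. A sketch with synthetic data:

d  = randn(2000,1);              % stand-in ground-motion samples
L  = 21;                         % filter length in samples (an arbitrary odd choice)
w  = ones(L,1) / L;              % boxcar weights that sum to one
ds = conv(d, w, 'same');         % smoothed record, same length as the original
plot(1:length(d), d, 1:length(d), ds);   % raw and smoothed, overlaid
legend('raw', 'smoothed');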
A little case study:
correlation of dN-15 and dust
in the Vostok Ice Core
ftp://ftp.ncdc.noaa.gov/pub/data/paleo/icecore/antarctica/vostok/dustnat.txt
Ice age (GT4)    Dust Conc (ppm)
4509             0.025
4552             0.018
4690             0.015
5259             0.033
5674             0.028
6486             0.013
...
417095           0.013
420527           0.015
421761           0.012
(522 lines of data)

A little Googling indicates that dust data readily available on web
Given as age vs. dust concentration, assuming the GT4 age model (which relates depth in the ice core to age)
About 1 sample per few hundred years at the top of the core, declining to one sample per few thousand years at the bottom
ftp://sidads.colorado.edu/pub/DATASETS/AGDC/bender_nsidc_0107
Vos_O2-N2_isotope_data_all.txt
Core   Depth    dN15   dO18   dO18atm   dO2/N2
5G     174.5    0.49   1.09    0.11     -28.2
5G     174.5    0.56   1.14    0.01     -35.2
5G     185.2    0.58   0.97   -0.19     -26.4
...
5G     3342     0.53   1.73    0.67      -9.7
5G     3342     0.53   1.70    0.63     -10.6
5G     3348     0.48   1.21    0.24      -7.2
(572 lines of data)

dN15 data also readily available on web
Given as depth vs. dN15
Vostok_EGT20_chronology.txt
Depth      EGT20 ice age   EGT20 Δage   EGT20 gas age
150          5.721           3.319        2.402
151          5.769           3.307        2.462
152.152      5.816           3.306        2.51
...
3347       454.327           3.106      451.221
3348       455.621           3.652      451.969
3349       456.943           4.137      452.806
(3200 lines of data)
Roughly the same number of lines of data (so presumably similar age sampling)
EGT20 chronology given; presumably different (but by how much?) from GT4
Age of air in ice can be as much as 4000 years younger than age of ice itself
Decision: Compare data at
same depth in ice (probably not so sensible)
same age (probably more sensible)
Need then to convert
dN15 depth to dN15 (gas) age
(we've found the conversion table)
(some sort of interpolation probably necessary; how much error will that introduce?)
dust GT4 age to EGT20 age
(we need to look for the conversion table)
Need to deal with the likely problem that the sampled ages of
the dust will not match the ages of the dN15
(how much error will interpolation introduce?)
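A hedged sketch of both steps using MatLab's interp1 (every variable below is a made-up stand-in for the corresponding file column, and the interpolation error still has to be assessed):

chronDepth  = (150:50:3350)';                             % chronology: depth (stand-in values)
chronGasAge = linspace(2.4, 452.8, length(chronDepth))';  % chronology: EGT20 gas age (stand-in)
dN15depth   = (170:60:3340)';                             % isotope file: sample depths (stand-in)
dN15        = 0.5 + 0.05*randn(size(dN15depth));          % isotope file: dN15 values (stand-in)
dustAge     = (4.5:2:420)';                               % dust file: ages, once on a common age scale (stand-in)
dust        = 0.02 + 0.01*rand(size(dustAge));            % dust file: concentrations (stand-in)

dN15age   = interp1(chronDepth, chronGasAge, dN15depth);  % convert depth to gas age (linear interpolation)
dustAtN15 = interp1(dustAge, dust, dN15age);              % resample dust at the dN15 ages
ok = ~isnan(dustAtN15);                                   % drop ages that fall outside the dust record
r  = corrcoef(dN15(ok), dustAtN15(ok));                   % correlation of the paired series
fprintf('correlation coefficient: %g\n', r(1,2));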
Some Thoughts
on Working With Data
Start Simple !
Don’t launch into time-consuming analyses
until you’ve spent time…
… gathering background information
... learning enough about a database to have
some sense that it really contains the
information that you want
… retrieving a small subset of the data and
looking them over carefully
Look at your data!
look at the numerical values (e.g., in spreadsheet format)
graph it in a variety of ways
you'll pick up on all sorts of useful – and scary – things
Where do I get data ?
You collect it yourself through some sort of
electronic instrumentation
You compile it manually from written sources
A colleague gives it to you
(e.g. emails you a file)
You retrieve it from some sort of data archive
(e.g. accessible through the web)
Don’t Be Afraid to Ask …
… technicians familiar with instrumentation
… authors of papers that you have read
… your colleagues
... data professionals at mission agencies
Learn how the data were
collected
What was really measured
Hidden assumptions and conversion factors
The number of steps between the
measurement and the data as it appears in
the database
Who are the people who
collected the data
Who performed the actual measurements?
How many different people over what period of
time?
What kind of quality control was performed?
Are you accessing the original database or
somebody’s copy?
What are the data’s limitations?
How much is nonsense?
(Typically 5% of data records in compilations
have errors)
What is the measurement accuracy?
(Are formal error estimates given?)
Perform sanity checks on both the data and your
understanding of it.
Compare similar data from different databases.
Identify and understand differences.
data analysis can be messy
Many files, some of them very large
Many steps in the analysis process
including rather tedious data re-formatting
Possibly many versions of an analysis
exploring different choices in analysis
Lots of output … tables, graphs, etc.
Organization very important
Well-designed directory (folder) structure
Disciplined use of filenames
Ongoing documentation of the analysis
process, and especially of the big picture
Quandary
How much to keep …
How much to delete …
Advice #1
Always keep a copy of the unaltered raw
data (and make sure that you identify it as
such)
Advice #2
Always keep anything that you type in
manually (notes, MatLab scripts,
references, etc.), on the theory that you
couldn't possibly type fast enough to
consume significant storage space.
Advice #3
Whenever possible, design and use a single
script that recreates a sensible part of your
work. You can use it to recreate anything
you've deleted, and it also documents what
you've done.
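For instance, a master script along these lines (every filename is a hypothetical placeholder) re-runs the whole chain from raw data to figures:

% redo_analysis.m – a single script that recreates this part of the work
raw = load('raw/my_raw_data.txt');        % step 1: always start from the unaltered raw data
d   = raw(raw(:,2) ~= -999, :);           % step 2: drop rows flagged as missing (hypothetical flag value)
save('derived/cleaned_data.mat', 'd');    % step 3: save derived products separately; never overwrite the raw file
plot(d(:,1), d(:,2));                     % step 4: regenerate the figures
print('-dpng', 'figures/overview.png');
% reading this script later documents exactly what was done, and in what order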
Advice #4
If you do delete a large chunk of your work,
leave the top directory in place and put a note
in it to yourself explaining what you've deleted
… and why