Transcript lecture4

EART20170 Computing,
Data Analysis &
Communication skills
Lecturer: Dr Paul Connolly (F18 – Sackville Building)
[email protected]
2. Computing (Excel statistics/modelling)
2 lectures
assessed practical work
Course notes etc:
http://cloudbase.phy.umist.ac.uk/people/connolly
Recommended reading: Cheeney. (1983)
Statistical methods in Geology. George, Allen &
Unwin
Plan
 Two more lectures plus drop-in
sessions in computer labs
 Assessment handed out today
and need to hand in by 16:00,
Tuesday December 12th.
Lecture 4
 Using Microsoft excel
 Cell referencing and naming
datasets
 Entering formulas
 Worksheet formulas
 Statistical functions and add-ins
 Analysing the Gaussian and T-
distributions in Excel.
Using Microsoft Excel
Worksheet
Cell
Cell reference
Function bar
Basic functions
Using Microsoft Excel
Entering data
Can also `import’
text files
Can name
worksheets
Cell referencing
 Several different ways of
referencing the information in a
cell.
 A1 reference style
 Difference between absolute and
relative references
The A1 reference style
To refer to
Use
The cell in column A and row 10
A10
The range of cells in column A and
rows 10 through 20
A10:A20
The range of cells in row 15 and
columns B through E
B15:E15
All cells in row 5
5:5
All cells in rows 5 through 10
5:10
All cells in column H
H:H
All cells in columns H through J
H:J
The range of cells in columns A
through E and rows 10 through 20
A10:E20
To refer to another
worksheet
 Example you can use:

=MySheet!B1:B10
 You can also use worksheet
functions in the same way

=Average(MySheet!B1:B10)
 The worksheets must be in the
same workbook
Relative, absolute and
mixed references
 A relative cell reference, such as A1, is based
on the relative position of the cell. If the
position of the cell that contains the reference
changes, the reference itself is changed.
 An absolute cell reference, such as $A$1,
always refers to a cell in a specific location. If
the position of the cell that contains the formula
changes, the absolute reference remains the
same.
 A mixed reference has either an absolute
column and relative row, or absolute row and
relative column. An absolute column reference
takes the form $A1, $B1, and so on. An
absolute row reference takes the form A$1,
B$1, and so on.
Relative references
Suppose I enter that cell b2 is
equal to a1.
If I copy cell b2 to b3, the
relative reference automatically
adjusts to be the next cell
relative to a1 – a2!
This is the default in Excel
Absolute references
If I put in an absolute cell
reference, the cell
reference does not
change when copied to
other cells
Mixed references
What happens if I copy
a relative column,
absolute row reference
to the c3 cell?
The relative column
reference adjusts, but
the absolute column
reference does not
Naming datasets
You can name a
dataset by
selecting it with
the mouse (left
clicking and
dragging) and
entering the
name in the
`reference box’
The name must
not have any
`spaces’.
The name can be
used as a
reference in that
worksheet.
Aside: An example of
plotting data in excel
There are many ways of
plotting the data that get the
same result
Usually you will do this by
trial and error (i.e. create a
plot and edit it
Say you wanted to plot two
similar things on the same
graph, one a line and one a
histogram
Start by selecting all three
columns and go to
insert->chart
Aside: An example of
plotting data in excel
In this case you want to plot
a custom line and column
plot
Select Custom types and
find the line column
graph
Aside: An example of
plotting data in excel
As you can see, Excel
wants to plot all three
columns, which we don’t
want in this case
But you can remove
the first column by
clicking on the Series
tab, highlighting the
series you want to
remove and clicking
remove
Aside: An example of
plotting data in excel
Clicking next you can
play around with
many settings and
annotate your plot
This requires some playing with to investigate all the options
Aside: An example of
plotting data in excel
As you can see the
bar chart has large
gaps which is not
what we want
You can change
this by right
clicking on the
bars and selecting
Format Data
Series…
On the next dialog,
click on the options
tab and set the Gap
widths to 0
Entering formulas
 You can use Excel just like a calculator
 This involves entering numerical
expressions
 For example: imagine I wanted to
calculate the mean of 3 numbers: 2, 3
and 7
 I could enter in cell a1:
 =(2+3+7)/3
Entering formulas
 A better way might be to enter
the values in cells, a1, a2 and a3
 This is still repetitive for large
datasets, but it is useful for
calculating things like t-values
 But Excel has many worksheet
functions…
Worksheet functions
 Functions are predefined
formulas that perform
calculations by using specific
values, called arguments, in a
particular order, or structure.
Functions can be used to
perform simple or complex
calculations. For example, the
`average’ function calculates the
mean of data in a given range.
 =average(a1:a10)
 =average(data1)
 Excel has a huge library of
functions like this. Using them
takes practice but it is best to
learn by doing examples.
Worksheet functions
 The easiest way to start using functions is to
`insert’ one into the worksheet.
 Have a cell highlighted and click on the
insert->function tab.
Worksheet functions
 The insert function dialog pops-
up.
 You can either search for a
function or find it yourself.
 Usually you will want to use a
certain category such as
`statistical’ and find the function
in that category
Worksheet functions
 Next you will get a dialog asking you to put in
cell references for the function.
 This dialog changes for different functions.
 For the above example using the `average’
function we could put in A1:A3 in the first text
box and click OK.
 The mean will appear in the cell.
Note that when you become efficient with excel you
will use shortcuts for inserting functions and wont
need to use this dialog.
Statistical worksheet
functions (hint, hint)
















Average(a1:a3)
[average(data1)]
Stdev(a1:a3) [stdev(data1)]
Var(a1:a3)
[variance]
Stdevp(a1:a3)
[stdevp(data1)]
Sqrt($a$1)
[sqrt(a1)]
^2
[to the power]
Sum(a1:a3)
[sum(data1)]
pearson(a1:a3,b1:b3)
[pearson(data1,data2)]
Rsq(a1:a3,b1:b3)
[rsq(data1,data2)]
Quartile(a1:a20,0.25)
[quartile(data1,0.5]
Mode(a1:a20) [mode(data1)]
Normdist(a1,mean,std,TRUE)
Norminv(p,mean,std)
Tdist(a1,8,2)
[students t-distribution]
Tinv(p,8)
[inverse of t-distribution]
Ttest(data1,data2,tails,type)
[comparing two means] probability of rejecting
null hypothesis (if this is less than significance level,
reject null hypothesis). Type should equal 2.
Analysing the Gaussian
(normal) distribution
A normal distribution, =0.82, =0.45
1
0.8
normdist-FALSE
normdist-TRUE
f(x)
0.6
0.4
0.2
0
-2
-1.5
-1
-0.5
0
x
0.5
1
1.5
2
 The function normdist has the following prototype:
 NORMDIST(x,mean,std,cumulative)
 It returns the value of a Gaussian distribution with




given mean and standard deviation at x.
Cumulative is either set to `FALSE’ or `TRUE’
If it is set to FALSE the function will return the actual
value of the Gaussian distribution at x.
If it is set to TRUE the function will return the
cumulative distribution at x (i.e. this is the table from
lecture 3).
Confidence level of a given interval. Eg what is the
probability of a value lying in the interval...
Analysing the Gaussian
(normal) distribution
A normal distribution, =0.82, =0.45
1
0.8
normdist-FALSE
normdist-TRUE
f(x)
0.6
0.4
0.2
0
-2
-1.5
-1
-0.5
0
x
0.5
1
1.5
 The function norminv has the following




prototype:
NORMINV(p,mean,std)
It returns the x value associated with the
cumulative probability of p
Useful for assessing levels of significance. Eg
what are the limits on x at a given confidence
level?
This method is used more frequently for the tdistribution
2
Analysing the tdistribution
A t-distribution, =5
0.5
tdist-non cumulative
tdist-cumulative
0.4
f(x)
0.3
0.2
0.1
0
0








0.5
1
1.5
2
x
2.5
3
3.5
4
You still need to remember the formulas for (1) estimating
the interval for the mean; (2) testing the significance of the
correlation coefficient; and (3) if two means are equal.
The function tdist has the following prototype:
TDIST(x,df,tails)
It returns the significance level (alpha) of a t-distribution
with given degrees of freedom.
Tails is either set to 1 or 2.
If it is set to 1 the function will return the accumulation of
probability from infinity to x.
If it is set to 2 the function will return the accumulation of
probability in both tails.
Not used too often
Analysing the tdistribution
A t-distribution, =5
0.5
tdist-non cumulative
tdist-cumulative
0.4
f(x)
0.3
0.2
0.1
0
0
0.5
1
1.5
2
x
2.5
3
3.5
4
 The function tinv has the following prototype:
 TINV(alpha,df)
 It returns the critical value for the t-distribution
corresponding to a significance level, alpha.
 By default it is a two tailed confidence level,
but for a one tailed confidence level substitute
2x(alpha) for alpha.
 Used in hypothesis testing.
Excel has a quick way of
comparing two means:
 The student t-test.
 If they have the same length we can use the TTEST
function
 But rather than giving us the critical t-value, it gives
us a critical probability for rejection.
 What if this is less than the significance value,
alpha?
A t-distribution, =5
0.5
tdist-non cumulative
tdist-cumulative
0.4
f(x)
0.3
0.2
Rejection
Acceptance
0.1
0
0
0.5
1
1.5
T-critical
We reject the null hypothesis
2
x
2.5
3
3.5
4
Random number
generation
 This is very useful in computational science.
 They are not really random numbers, they are
generated by an algorithm.
 But it is difficult to get random numbers on a
computer.
 The worksheet function rand() generates random
numbers between 0 and 1.
 Hence to generate a normally distributed random
number sequence, with a given mean and standard
deviation we can use:
 Norminv(rand(),mean,std)
 Over many generations, the variable will have the
given mean and standard deviation
A normal distribution, =0.82, =0.45
1
0.8
normdist-FALSE
normdist-TRUE
f(x)
0.6
0.4
0.2
0
-2
-1.5
-1
-0.5
0
x
0.5
1
1.5
2
Random number
generation
Over a large number of generations, the mean approaches
that of the true random variable
A normally distributed random number with =0.5, =0.3
2
true mean 0.50341
true stdev 0.30652
1.5
0.5
0
-0.5
-1
0
500
1000
1500
2000
2500
3000
Normally distributed random number
1000
800
Frequency
x
1
true mean 0.50341
true stdev 0.30652
600
400
200
0
-2
-1
0
x
1
2
3500
Next lecture
 Propagation of errors.
 Many people struggled with this on the
test.
 There is an easy way to do it with
Excel.
 Lets go back to our example
measuring bed thickness. We know
that x=12.10.3m and y=4.20.2m.
 So if we generate many values of x
and y with the above means and std,
then add all these together we could
calculate the mean and std of the
result.
Error propagation
 Using a computer we can generate
many normally distributed variables
and therefore find the distribution of
the answer.
 Hence we can directly calculate the
error (standard deviation) in the
answer.
 This is useful as it can be done for
complicated equations with ease.
 This is called the Monte Carlo method
of propagating errors.
One more thing
 Worksheet functions can also take text
arguments and return text.
 If you are doing a hypothesis test you might
want a statement that tells you to either accept
or reject the null hypothesis
 You can use the `IF’ construct
 =IF(logical test,if true,if false)
 E.g. IF(tvalue>tcrit,”Accept alternate
hypothesis”,”Accept null hypothesis”).
 IF can also return numeric values.
Homework
 Have a look at using some of the
functions in excel.
 Especially for manipulating cells and
calculating means, standard
deviations.
 Try calculating z-values and t-values
from the table from handout 3 in excel
to see if you can get them correct.
 This will help for Tuesdays practical
labs.
 REMEMBER: please check the
student notice board for your allotted
time.