2015_1_Introduction - Laboratory for Remote Sensing Hydrology

STATISTICS
Introduction
Professor Ke-Sheng Cheng
Department of Bioenvironmental Systems Engineering
National Taiwan University
• Lecture notes will be posted on class website
– https://www.space.ntu.edu.tw/navigate/s/E2DA955C12764B48B9C04F36492F48D1QQY
– Digital reference book: A modern introduction to probability and
statistics / Dekking et al. [Electronic book]
• Grades
– Homework (40%)
– Midterm (30%), Final (30%)
• The R language will be used for data analysis.
• A tutorial session is held on Thursdays (6:00 – 7:00 pm).
Attendance at the tutorial session is voluntary.
• Class attendance rule
– If you are more than 15 minutes late for the class, please do NOT enter
the classroom until the next class session.
4/9/2016
Lab for Remote Sensing Hydrology and Spatial Modeling
Department of Bioenvironmental Systems Engineering, National Taiwan University
What is “statistics”?
• Statistics is a science of “reasoning” from
data.
• A body of principles and methods for
extracting useful information from data, for
assessing the reliability of that information,
for measuring and managing risk, and for
making decisions in the face of uncertainty.
• The major difference between statistics and
mathematics is that statistics always needs
“observed” data, while mathematics does
not.
• An important feature of statistical methods is
the “uncertainty” involved in analysis.
• Statistics is the discipline concerned with
the study of variability, with the study of
uncertainty and with the study of
decision-making in the face of
uncertainty. As these are issues that are
crucial throughout the sciences and
engineering, statistics is an inherently
interdisciplinary science.
Practical Applications of Statistics
Iris recognition
– An Iris code consists of 2048 bits.
– The iris code of the same person may change at
different times and different places. Thus one has to
allow for a certain percentage of mismatching bits
when identifying a person.
– Of the 2048 bits, 266 may be considered as
uncorrelated.
Hamming distance is defined as the fraction of
mismatches between two iris codes.
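As a rough illustration of this definition, the sketch below computes the Hamming distance between two bit strings. It is written in Python rather than R (the course language) purely for illustration; the function name and the two short 8-bit codes are made up, whereas a real iris code has 2048 bits.

```python
def hamming_distance(code_a, code_b):
    """Fraction of positions at which two equal-length bit strings differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    mismatches = sum(a != b for a, b in zip(code_a, code_b))
    return mismatches / len(code_a)


# Two hypothetical 8-bit codes (illustrative only, not real iris codes).
code_1 = "10110010"
code_2 = "10011010"
print(hamming_distance(code_1, code_2))  # 2 mismatches out of 8 bits -> 0.25
```

A person would then be accepted when the distance falls below some tolerance; the slides do not state the threshold, so none is assumed here.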
A modern introduction to probability and statistics : understanding why and how / Dekking et al.
Killer Football
41 deaths on the day of the match, compared with
27.2 deaths, the average over the 5
days preceding and following the match.
• Poisson process modeling
– Occurrences of rare events
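One way to gauge how unusual a count is under a Poisson model is to compute a tail probability. The Python sketch below (Python is used only for illustration; the course uses R) assumes, purely for the sake of the example, that the daily number of deaths is Poisson distributed with mean 27.2 and asks how likely a day with 41 or more deaths would be. The function names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def poisson_tail(k_min, lam):
    """P(X >= k_min) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(k_min))


# Assumption for illustration: daily deaths ~ Poisson(27.2).
p = poisson_tail(41, 27.2)
print(f"P(X >= 41) = {p:.4f}")
```

A small tail probability would suggest that 41 deaths is hard to explain by day-to-day variation alone, under the stated assumption.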
Economic Warfare Analysis
During World War II
– In order to obtain more reliable estimates of German war
production, experts from the Economic Warfare Division of
the American Embassy and the British Ministry of Economic
Warfare started to analyze markings and serial numbers
obtained from captured German equipment.
– Each piece of enemy equipment was labeled with markings,
which included all or some portion of the following
information: (a) the name and location of the maker; (b) the
date of manufacture; (c) a serial number; and (d)
miscellaneous markings such as trademarks, mold numbers,
casting numbers, etc.
A modern introduction to probability and statistics : understanding why and how / Dekking et al.
– The first products to be analyzed were tires taken
from German aircraft shot down over Britain and from
supply dumps of aircraft and motor vehicle tires
captured in North Africa. The marking on each tire
contained the maker’s name, a serial number, and a
two-letter code for the date of manufacture.
– The first step in analyzing the tire markings involved
breaking the two-letter date code.
• It was conjectured that one letter represented the month
and the other the year of manufacture, and that there
should be 12 letter variations for the month code and 3
to 6 for the year code. This, indeed, turned out to be
true. The following table presents examples of the 12
letter variations used by four different manufacturers.
– For each month, the serial numbers could be
recoded to numbers running from 1 to some
unknown largest number N.
– The observed (recoded) serial numbers could be
seen as a subset of this.
– The objective was to estimate N for each month
and each manufacturer separately by means of the
observed (recoded) serial numbers.
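The estimation problem above can be sketched with two common textbook estimators of N, one based on the sample mean and one based on the sample maximum. The slides do not say which estimator the analysts used, so both are shown only as illustrations. The code is in Python rather than R, and the true N, the sample size, and the function names are hypothetical.

```python
import random


def estimate_from_mean(serials):
    """Estimate N as 2 * (sample mean) - 1."""
    return 2 * sum(serials) / len(serials) - 1


def estimate_from_max(serials):
    """Estimate N as (n + 1) / n * (sample maximum) - 1."""
    n = len(serials)
    return (n + 1) / n * max(serials) - 1


# Hypothetical setting: 10 serial numbers observed out of N = 1000.
rng = random.Random(2015)
true_n = 1000
observed = rng.sample(range(1, true_n + 1), 10)
print(sorted(observed))
print(estimate_from_mean(observed), estimate_from_max(observed))
```

Both estimators are unbiased when the observed numbers are a simple random sample without replacement from 1, ..., N; the one based on the maximum typically varies less from sample to sample.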
– With a sample of about 1400 tires from five
producers, individual monthly output figures were
obtained for almost all months over a period from
1939 to mid-1943.
– The following table compares the accuracy of
estimates of the average monthly production of all
manufacturers of the first quarter of 1943 with the
statistics of the Speer Ministry that became
available after the war.
A modern introduction to probability and statistics : understanding why and how / Dekking et al.
– The accuracy of the estimates can be appreciated
even more if we compare them with the figures
obtained by Allied intelligence agencies. Using
other methods, they estimated production at
between 900,000 and 1,200,000 per month!
The Monty Hall Problem
• Standard assumptions
– The host must always open a door that was not
picked by the contestant.
– The host must always open a door to reveal a goat
and never the car.
– The host must always offer the chance to switch
between the originally chosen door and the
remaining closed door.
• Assuming the car is worth 1,000,000 NTD
and a goat 5,000 NTD, the expected
winnings are
– 668,333.33 NTD when switching (2/3 × 1,000,000 + 1/3 × 5,000)
– 336,666.67 NTD when not switching (1/3 × 1,000,000 + 2/3 × 5,000).
• Simulation of the Monty Hall Problem using R.
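The slides run this simulation in R; the sketch below is a Python stand-in that follows the standard assumptions listed earlier (the host always opens a goat door that the contestant did not pick, and always offers the switch). The function names, seed, and number of games are illustrative choices.

```python
import random


def play(switch, rng):
    """Play one game; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, n_games, seed=0):
    """Fraction of simulated games won with the given strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n_games)) / n_games


CAR, GOAT = 1_000_000, 5_000  # prize values in NTD, as on the slide
for switch in (True, False):
    p = win_rate(switch, 100_000)
    print("switch" if switch else "stay", round(p, 3),
          round(p * CAR + (1 - p) * GOAT, 2))
```

With enough games, the win rates should settle near 2/3 for switching and 1/3 for staying, so the simulated average winnings should land close to the expected values on the slide.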
Ebola Outbreak in West Africa
(as of Aug. 26, 2014)
2014 West Africa Ebola
• Total cases since the beginning of the 2014
outbreak
2014 West Africa Ebola
• Total death counts since the beginning of the
2014 outbreak
2014 West Africa Ebola
• Death rate since the beginning of the 2014
outbreak
Spatial & Temporal Rainfall
Analysis
Taiwan Disaster Prevention Map | Google Crisis Map
http://www.google.org/crisismap/taiwan
Stochastic Modeling & Simulation
• Building probability models for real world
phenomena.
– No matter how sophisticated a model is, it only
represents our understanding of the complicated
natural systems.
• Generating a large number of possible
realizations.
• Making decisions or assessing risks based on
simulation results.
• Conducted by computers.
• Simulation of a two-dimensional random walk
Possible applications?
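A minimal sketch of such a simulation follows, in Python rather than R. The slide does not specify the kind of walk, so this assumes a simple lattice walk that starts at the origin and moves one unit north, south, east, or west with equal probability at each step. The function name and seed are illustrative.

```python
import random

# The four equally likely unit steps (an assumption; the slide gives no details).
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def random_walk_2d(n_steps, seed=None):
    """Return the list of visited (x, y) positions, starting at the origin."""
    rng = random.Random(seed)
    x, y = 0, 0
    path = [(x, y)]
    for _ in range(n_steps):
        dx, dy = rng.choice(STEPS)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


path = random_walk_2d(1000, seed=1)
print("final position:", path[-1])
```

Repeating the walk many times with different seeds gives a large number of realizations, from which quantities such as the distribution of the final distance from the origin can be estimated.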
Exploratory Data Analysis
• Features of data distributions
– Histograms
– Center: mean, median
– Spread: variance, standard deviation, range
– Shape: skewness, kurtosis
– Order statistics and sample quantiles
– Clusters
– Extreme observations: outliers
• Histogram: frequencies and relative
frequencies
– A sample data set X
104.838935
22.371870
24.762863
82.708815
82.535199
115.387515
64.158533
72.895810
85.553281
102.347372
265.018615
129.538575
275.440477
149.905426
150.761192
102.460651
133.663194
107.569047
96.920012
19.277535
205.279506
37.587841
70.721022
113.442704
134.931864
16.480639
139.201204
81.266071
34.202372
134.484317
146.938446
231.608794
100.717110
131.144892
174.200632
9.961515
112.180103
101.351639
45.472935
121.101643
12.577133
60.397366
33.918756
9.539663
130.360126
53.449806
105.368124
16.652365
149.996985
10.382787
• Frequency histogram
• Relative histogram
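The difference between the two histograms is only the scale of the vertical axis: frequencies are counts per class, and relative frequencies are those counts divided by the sample size. The Python sketch below (Python is used for illustration; the course uses R) shows this on the first ten values of the sample X listed above. The class width of 50 starting at 0 is an assumption, since the classes used in the slide figures are not recoverable.

```python
def histogram(values, start, width, n_bins):
    """Count the values falling in [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts


# First ten values of the sample X (a subset, for brevity).
x = [104.838935, 22.371870, 24.762863, 82.708815, 82.535199,
     115.387515, 64.158533, 72.895810, 85.553281, 102.347372]

counts = histogram(x, start=0, width=50, n_bins=3)   # classes [0,50), [50,100), [100,150)
relative = [c / len(x) for c in counts]
print(counts)    # [2, 5, 3]
print(relative)  # [0.2, 0.5, 0.3]
```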
• Measures of center
– Sample mean
– Sample median
Sample mean = 98.26067
Sample median = 101.8495
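These two values can be checked against the 50 observations of X listed earlier. The sketch below uses Python's standard `statistics` module as a stand-in for the R functions used in the course.

```python
import statistics

# The 50 observations of the sample X, in the order listed on the slide.
x = [104.838935, 22.371870, 24.762863, 82.708815, 82.535199,
     115.387515, 64.158533, 72.895810, 85.553281, 102.347372,
     265.018615, 129.538575, 275.440477, 149.905426, 150.761192,
     102.460651, 133.663194, 107.569047, 96.920012, 19.277535,
     205.279506, 37.587841, 70.721022, 113.442704, 134.931864,
     16.480639, 139.201204, 81.266071, 34.202372, 134.484317,
     146.938446, 231.608794, 100.717110, 131.144892, 174.200632,
     9.961515, 112.180103, 101.351639, 45.472935, 121.101643,
     12.577133, 60.397366, 33.918756, 9.539663, 130.360126,
     53.449806, 105.368124, 16.652365, 149.996985, 10.382787]

sample_mean = statistics.mean(x)      # sum of the values divided by n
sample_median = statistics.median(x)  # n is even: average of the two middle values

print(round(sample_mean, 5))    # 98.26067
print(round(sample_median, 4))  # 101.8495
```

Because n = 50 is even, the median is the average of the 25th and 26th ordered values (101.351639 and 102.347372).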
– One desirable property of the sample median is
that it is resistant to extreme observations, in the
sense that its value depends only on the values of the
middle observations and is quite unaffected by the
actual values of the outer observations in the
ordered list. The same cannot be said for the
sample mean: any significant change in the
magnitude of an observation results in a
corresponding change in the value of the mean.
Hence, the sample mean is said to be sensitive to
extreme observations.
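A small made-up example makes the contrast concrete. In the Python sketch below (illustrative only; the course uses R), the largest of five hypothetical values is replaced by an extreme one: the mean moves substantially while the median stays put.

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]   # hypothetical data
contaminated = values[:-1] + [5000.0]      # largest value replaced by an extreme one

print(statistics.mean(values), statistics.median(values))              # 30.0 30.0
print(statistics.mean(contaminated), statistics.median(contaminated))  # 1020.0 30.0
```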
• Measures of spread
– Sample variance and sample standard deviation
– Range
• the difference between the largest and smallest values
Sample variance = 4039.931
Sample standard deviation = 63.56045
Range = 265.9008 (275.440477 – 9.539663)
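The three measures can be written out directly from their definitions. The Python sketch below (the course uses R) assumes the usual n − 1 divisor for the sample variance, which the slide does not state explicitly, and applies the functions to the first ten values of X only, so the printed numbers are not expected to match the full-sample values above. The function names are illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


def sample_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)


# First ten values of the sample X (a subset, for brevity).
x10 = [104.838935, 22.371870, 24.762863, 82.708815, 82.535199,
       115.387515, 64.158533, 72.895810, 85.553281, 102.347372]
print(sample_variance(x10), sample_sd(x10), sample_range(x10))
```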
• Measures of shape
– Sample skewness
– Sample kurtosis
Sample skewness = 0.7110874
Sample kurtosis = 0.533141 (excess kurtosis; or 3.533141 in R, without subtracting 3)
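The two kurtosis values differ by exactly 3, which is the usual gap between kurtosis and excess kurtosis. The Python sketch below (the course uses R) computes moment-based versions of both shape measures. Whether the slide used these plain moment formulas or a bias-corrected variant is not stated, so the formulas here are an assumption and the function names are illustrative.

```python
def central_moment(values, k):
    """k-th central moment with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n


def skewness(values):
    """Third central moment divided by the 1.5th power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment divided by the square of the second."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def excess_kurtosis(values):
    """Kurtosis minus 3, so that a normal distribution scores 0."""
    return kurtosis(values) - 3.0


symmetric = [1, 2, 3, 4, 5]   # hypothetical symmetric sample
print(skewness(symmetric))    # 0.0 for a symmetric sample
print(kurtosis(symmetric), excess_kurtosis(symmetric))
```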
• Order statistics
• Sample quantiles
Linear interpolation
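Several interpolation rules for sample quantiles exist, and the slide does not say which one it illustrates. The Python sketch below implements one common rule: place the p-quantile at fractional position (n − 1)p among the ordered values (counting from 0) and interpolate linearly between the two neighbouring order statistics. This is the rule used by default in R's quantile function (type 7); the function name is illustrative.

```python
import math


def quantile(values, p):
    """Sample quantile by linear interpolation between order statistics."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)                  # the order statistics
    h = (len(ordered) - 1) * p                # fractional 0-based position
    lower = math.floor(h)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])


data = [4, 1, 3, 2]            # hypothetical sample
print(quantile(data, 0.25))    # 1.75
print(quantile(data, 0.5))     # 2.5
```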
• Box-and-whisker plot (or box plot)
– A box-and-whisker plot includes two major parts – the box
and the whiskers.
– In R's boxplot function, the parameter range determines how far
the whiskers extend out from the box. If range is positive, the whiskers
extend to the most extreme data point that is no more
than range times the interquartile range (IQR) from the box.
A value of zero causes the whiskers to extend to the data
extremes.
– Outliers are marked by points which fall beyond the
whiskers.
– Hinges and the five-number summary
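The ingredients of a box plot can be computed by hand. The Python sketch below is intended to mirror the hinge-based five-number summary and the whisker rule described above, but it is only a sketch of that behaviour, not R's own code. The function and parameter names (`five_number_summary`, `whiskers_and_outliers`, `whisker_range`) and the data are hypothetical.

```python
import math


def five_number_summary(values):
    """Minimum, lower hinge, median, upper hinge, maximum (Tukey's hinges)."""
    x = sorted(values)
    n = len(x)
    n4 = math.floor((n + 3) / 2) / 2
    depths = [1, n4, (n + 1) / 2, n + 1 - n4, n]   # 1-based positions
    # A fractional position means "average the two neighbouring order statistics".
    return [0.5 * (x[math.floor(d) - 1] + x[math.ceil(d) - 1]) for d in depths]


def whiskers_and_outliers(values, whisker_range=1.5):
    """Whisker ends and the points lying beyond them."""
    lowest, lower_hinge, _, upper_hinge, highest = five_number_summary(values)
    if whisker_range == 0:
        return lowest, highest, []                 # whiskers reach the extremes
    reach = whisker_range * (upper_hinge - lower_hinge)
    low_limit, high_limit = lower_hinge - reach, upper_hinge + reach
    inside = [v for v in values if low_limit <= v <= high_limit]
    outliers = [v for v in values if v < low_limit or v > high_limit]
    return min(inside), max(inside), outliers


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]       # hypothetical sample with one extreme value
print(five_number_summary(data))           # [1.0, 3.0, 5.0, 7.0, 100.0]
print(whiskers_and_outliers(data))         # (1, 8, [100])
```

Note that the hinges are averages of neighbouring order statistics, so they can differ slightly from quartiles obtained by linear interpolation.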
Not “linear interpolation”
– In R, a boxplot is essentially a graphical
representation determined by the five-number summary (5NS).
– The summary function in R yields six numbers: the minimum, first
quartile, median, mean, third quartile, and maximum.
– Box-and-whisker plot of X
Seasonal variation of average
monthly rainfall in the Central Dry Zone (CDZ), Myanmar
– Boxplots are based on the average monthly rainfall at
54 rainfall stations.
Random Experiment and
Sample Space
• An experiment that can be repeated under
the same (or uniform) conditions, but whose
outcome cannot be predicted in advance,
even when the same experiment has been
performed many times, is called a random
experiment.
• Examples of random experiments
– The tossing of a coin.
– The roll of a die.
– The selection of a numbered ball (1-50) from an urn
(selection with replacement).
– Occurrences of earthquakes
• The time interval between two consecutive
earthquakes stronger than scale 6.
– Occurrences of typhoons
• The amount of rainfall produced by typhoons in one
year (yearly typhoon rainfall).
• The following items are always associated
with a random experiment:
– Sample space. The set of all possible outcomes,
denoted by Ω.
– Outcomes. Elements of the sample space,
denoted by ω. These are also referred to as
sample points or realizations.
– Events. Subsets of Ω for which the probability is
defined. Events are denoted by capital Latin
letters (e.g., A, B, C).
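For a finite experiment, these three items map naturally onto sets. The Python sketch below (illustrative only; the course uses R) uses one roll of a die; the event names are made up.

```python
# Sample space for one roll of a die: the set of all possible outcomes.
sample_space = {1, 2, 3, 4, 5, 6}

# Events are subsets of the sample space.
even = {2, 4, 6}            # event A: "the roll is even"
at_least_five = {5, 6}      # event B: "the roll is 5 or 6"

outcome = 4                 # one realization (sample point)
print(outcome in even)              # True: event A occurred
print(outcome in at_least_five)     # False: event B did not occur

# Set operations build new events from old ones.
print(even & at_least_five)         # {6}: both A and B occur
print(sample_space - even)          # the complement of A: the odd rolls
```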
Definition of Probability
• Classical probability
• Frequency probability
• Probability model
Classical (or a priori) probability
• If a random experiment can result in n
mutually exclusive and equally likely
outcomes and if n_A of these outcomes have
an attribute A, then the probability of A is
the fraction n_A/n.
• Example 1.
Compute the probability of getting two
heads if a fair coin is tossed twice. (1/4)
• Example 2.
The probability that a card drawn from an
ordinary well-shuffled deck will be an ace
or a spade. (16/52)
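Both answers follow from counting equally likely outcomes. The Python sketch below (illustrative only; the course uses R) enumerates the sample spaces explicitly; the helper name `classical_probability` is made up.

```python
from fractions import Fraction
from itertools import product


def classical_probability(has_attribute, sample_space):
    """n_A / n: favourable outcomes over all equally likely outcomes."""
    favourable = [o for o in sample_space if has_attribute(o)]
    return Fraction(len(favourable), len(sample_space))


# Example 1: two tosses of a fair coin give 4 equally likely outcomes.
tosses = list(product("HT", repeat=2))
p_two_heads = classical_probability(lambda o: o == ("H", "H"), tosses)

# Example 2: one card from a 52-card deck (4 aces + 13 spades - 1 ace of spades = 16).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))
p_ace_or_spade = classical_probability(
    lambda card: card[0] == "A" or card[1] == "spades", deck)

print(p_two_heads)      # 1/4
print(p_ace_or_spade)   # 4/13, i.e. 16/52
```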
Remarks
• The probabilities determined by the classical
definition are called “a priori” probabilities
since they can be derived purely by deductive
reasoning.
• The “equally likely” assumption requires
the experiment to be carried out in such a
way that the assumption is realistic, for example
by using a balanced coin, a die that is not
loaded, a well-shuffled deck of cards,
random sampling, and so forth. The
assumption also requires that the sample
space be appropriately defined.
• Troublesome limitations in the classical
definition of probability:
– If the number of possible outcomes is infinite;
– If possible outcomes are not equally likely.
Relative frequency
(or a posteriori) probability
• We observe outcomes of a random
experiment which is repeated many times.
We postulate a number p which is the
probability of an event, and approximate p
by the relative frequency f with which the
repeated observations satisfy the event.
• Suppose a random experiment is repeated n
times under uniform conditions and event
A occurs n_A times. The relative
frequency with which A occurs is f_n(A) = n_A/n.
If the limit of f_n(A) as n approaches infinity
exists, then one can assign the probability of A
by:
$$P(A) = \lim_{n \to \infty} f_n(A).$$
• This method requires the existence of the
limit of the relative frequencies. This
property is known as statistical regularity.
This property will be satisfied if the trials
are independent and are performed under
uniform conditions.
• Example 3
A fair coin was tossed 100 times with 54
occurrences of heads. The probability of
heads for each toss is estimated
to be 54/100 = 0.54.
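The relative-frequency idea can be explored by simulation. The Python sketch below (the course uses R) tosses a simulated fair coin an increasing number of times and prints the relative frequency of heads; the seed and the numbers of tosses are arbitrary choices.

```python
import random


def relative_frequency_of_heads(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the fraction of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency_of_heads(n))
```

With only 100 tosses, estimates such as 0.54 are unremarkable; as n grows, the relative frequency should settle closer to 0.5.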
• The chain of probability definition:
Random experiment → Sample space → Event space → Probability space
Probability Model
Event and event space
An event is a subset of the sample space. The
class of all events associated with a given
random experiment is defined to be the event
space.
Remarks
• Probability is a mapping of sets to numbers.
• Probability is not a mapping of the sample
space to numbers.
– The expression P(ω) for ω ∈ Ω is not defined.
However, for a singleton event {ω}, P({ω}) is
defined.
Probability space
• A probability space is the triplet (Ω, A, P[·]), where
Ω is a sample space, A is an event space, and P[·] is
a probability function with domain A.
• A probability space constitutes a complete
probabilistic description of a random
experiment.
– The sample space Ω defines all of the possible
outcomes, the event space A defines all possible things
that could be observed as a result of an experiment,
and the probability P defines the degree of belief or
evidential support associated with the experiment.
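For a small finite experiment, the whole triplet can be written down explicitly. The Python sketch below (illustrative only; the course uses R) assumes two tosses of a fair coin with equally likely outcomes and takes the event space to be the collection of all subsets of the sample space. The names `omega`, `event_space`, and `P` are made up for this example.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space: outcomes of two tosses of a fair coin (assumed equally likely).
omega = frozenset({"HH", "HT", "TH", "TT"})


def power_set(s):
    """All subsets of s, each as a frozenset."""
    items = sorted(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return {frozenset(c) for c in subsets}


# Event space: here, every subset of the sample space is an event.
event_space = power_set(omega)


def P(event):
    """Probability function; its domain is the event space, not the sample space."""
    if event not in event_space:
        raise ValueError("P is defined only for events (sets of outcomes)")
    return Fraction(len(event), len(omega))


print(len(event_space))                  # 16 events
print(P(frozenset({"HH"})))              # singleton event: 1/4
print(P(frozenset({"HT", "TH"})))        # exactly one head: 1/2
print(P(omega), P(frozenset()))          # 1 and 0
```

Calling `P("HH")` raises an error, because a bare outcome is not a set of outcomes and so is not in the domain of P.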
Conditional probability
Bayes’ theorem
Multiplication rule
Independent events
• The property of independence of two events
A and B and the property that A and B are
mutually exclusive are distinct, though
related, properties.
• If A and B are mutually exclusive events, then
AB = ∅ (AB denotes the intersection of A and B)
and therefore P(AB) = 0, whereas if A
and B are independent events, then P(AB) =
P(A)P(B). Events A and B can be both mutually
exclusive and independent only if
P(AB) = P(A)P(B) = 0, that is, only if at least
one of A and B has zero probability.
• But if A and B are mutually exclusive events
and both have nonzero probabilities then it is
impossible for them to be independent
events.
• Likewise, if A and B are independent events
and both have nonzero probabilities then it is
impossible for them to be mutually exclusive.
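A deck of cards gives a concrete example of each case. The Python sketch below (illustrative only; the course uses R) assumes one card drawn at random from a 52-card deck, so that every card is equally likely; the helper names are made up.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = set(product(ranks, suits))          # 52 equally likely outcomes


def P(event):
    """Classical probability of an event (a set of cards)."""
    return Fraction(len(event), len(deck))


def independent(a, b):
    return P(a & b) == P(a) * P(b)


def mutually_exclusive(a, b):
    return not (a & b)                     # empty intersection


aces = {c for c in deck if c[0] == "A"}
kings = {c for c in deck if c[0] == "K"}
spades = {c for c in deck if c[1] == "spades"}

# Ace and spade: independent (1/52 = 4/52 * 13/52) but not mutually exclusive.
print(independent(aces, spades), mutually_exclusive(aces, spades))   # True False

# Ace and king: mutually exclusive, both with nonzero probability, so not independent.
print(independent(aces, kings), mutually_exclusive(aces, kings))     # False True
```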
Reading assignments
• IPSUR (Introduction to Probability and Statistics Using R)
– Chapter 2
– Chapter 3
• 3.1.1, 3.1.3, 3.1.4
• 3.3
• 3.4.3, 3.4.4, 3.4.5, 3.4.6, 3.4.7
• AMIPS (A Modern Introduction to Probability and Statistics)
– Chapter 2
– Chapter 3