Statistics and Error

How many Jelly Beans fill a
0.5-Liter Bottle?
Take 60 seconds to calculate on
your own.
Fermi Problems
•Fermi Problems challenge us to ask more questions, not just
provide “an answer.”
•Enrico Fermi (1901-1954) – Italian physicist best known for
contributions to nuclear physics and the development of quantum
theory.
•Fermi used a process of “zeroing in” on problems by saying that
the value in question was certainly larger than one number and
less than some other amount – yields a quantified answer within
identified limits.
•The goal is to get an answer to an order of magnitude by making
reasonable assumptions about the situation, not necessarily relying
upon definite knowledge for an “exact” answer.
Fermi Problems
Solutions have algorithmic approaches as well as intuitive approaches.
A very rough answer is better than no answer
What was your model for a jelly bean? For a 0.5-liter bottle?
A model is a partial rather than complete representation
The design of a model depends as much on circumstances and
constraints ($, time, materials, data, personnel, etc.) as it does on
the problem being solved
A symbolic representation is clean and powerful. It
communicates, simply and clearly, what the modeler thinks is
important, what information is needed, and how that
information is used. (model jelly bean as cylinder or box? Rounded
ends?)
What other simplifications or assumptions did you make?
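The "zeroing in" process can be sketched in a few lines of Python: compute a number the answer is surely larger than, a number it is surely smaller than, and a central estimate between them. This is only a sketch. The cylinder model, the bean dimensions, and the packing fractions are all illustrative assumptions, not measured values.

```python
import math

BOTTLE_VOLUME_CM3 = 500.0  # 0.5 liter = 500 cm^3

def cylinder_volume(length_cm, diameter_cm):
    """Volume of a jelly bean modeled as a cylinder (an assumed model)."""
    return math.pi * (diameter_cm / 2) ** 2 * length_cm

def beans_in_bottle(length_cm, diameter_cm, packing_fraction):
    """Number of beans = usable bottle volume / volume of one bean."""
    usable = BOTTLE_VOLUME_CM3 * packing_fraction
    return usable / cylinder_volume(length_cm, diameter_cm)

# Assumed bounds: big beans packed loosely vs. small beans packed tightly.
low = beans_in_bottle(length_cm=2.2, diameter_cm=1.4, packing_fraction=0.55)
high = beans_in_bottle(length_cm=1.6, diameter_cm=1.0, packing_fraction=0.75)

# The geometric mean is a common choice for order-of-magnitude estimates.
estimate = math.sqrt(low * high)

print(f"between {low:.0f} and {high:.0f} beans, estimate about {estimate:.0f}")
```

Changing the model (box instead of cylinder, rounded ends) only changes `cylinder_volume`; the bracketing logic stays the same.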
Fermi Problems
Get with a partner and come up with a final
estimate.
Describe 3 entirely different (but practical) ways for determining the
area (in cm²) of the darkened region below (design is on a piece of
paper) to within 0.1%.
1. Superimpose a finely-spaced grid over
the figure and count squares.
2. Cut out figure and weigh it. Compare
that weight to that of piece of paper. If
too light, transfer image to another
uniformly-dense material.
3. Divide figure into local regions that can
be integrated numerically.
4. Computer scan image and count pixels.
5. Build a container whose cross-section
is that of the darkened figure. Fill with
1000cc water and measure level.
6. Use a “polar planimeter” – gadget that
mechanically integrates the area
defined by a closed curve.
7. “Throw darts.” Draw rectangle (of
calculable area) that encloses image.
Pick random points within the
rectangle and count which ones fall
within the darkened figure. The ratio
can be used to estimate area.
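The "throw darts" method can be sketched in Python as follows. The darkened figure is not available here, so a disk of radius 1 stands in for it (an assumption); `in_region` is a hypothetical helper that would be replaced by a test on the scanned image.

```python
import random

def in_region(x, y):
    """Stand-in for 'is this point dark?': a disk of radius 1 (assumed shape)."""
    return x * x + y * y <= 1.0

def dart_area(n_darts, seed=0):
    """Estimate the region's area from the hit fraction inside a 2 x 2 rectangle."""
    rng = random.Random(seed)
    rect_area = 2.0 * 2.0  # rectangle [-1, 1] x [-1, 1] encloses the disk
    hits = 0
    for _ in range(n_darts):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if in_region(x, y):
            hits += 1
    # area of figure / area of rectangle is approximated by hits / darts
    return rect_area * hits / n_darts

area = dart_area(200_000)
print(f"estimated area = {area:.3f} (exact value for the disk is pi)")
```

The scatter of this estimate falls off only as one over the square root of the number of darts, so reaching 0.1% takes a very large number of random points.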
Why Study Statistics?
Statistics: A mathematical science concerned with data
collection, presentation, analysis, and interpretation.
Statistics can tell us about…
Sports
Population
Economy
[Figure: New York City, NY. Pie chart "Employment Rates: 2006" with slices Unemployed, Employed, and Other (labels 4.6%, 32.3%, 63.1%). Line chart of Estimated Population by Year, 2000 to 2006, vertical axis from 7,900,000 to 8,250,000.]
Why Study Statistics?
Statistical analysis is also an integral part of scientific research!
Are your experimental results believable?
Example: Tensile Strength of Spaghetti
Data suggests a relationship between Type (size) and breaking strength
Not perfect – have random error.
Why Study Statistics?
Responses and measurements are variable!
Due to…
Systematic Error – same error value by using an
instrument the same way
Random Error – may vary from observation to
observation
Perhaps due to inability to perform measurements in
exactly the same way every time.
Goal of statistics is to find the model that best describes a
target population by taking sample data.
Represent randomness using probability.
Probability
Experiment of chance: a phenomenon whose outcome is uncertain (rolling dice, measuring, performing an experiment, etc.).
Probabilities: chances.
Probability Model: sample space, events, probability of events.
Sample Space: Set of all possible outcomes.
Event: A set of outcomes (a subset of the sample space). An event E
occurs if any of its outcomes occurs.
Probability: The likelihood that an event occurs.
Independence: Events are independent if the occurrence of one does
not affect the probability of the occurrence of another. Why important?
Probability
Consider a deck of playing cards…
Sample Space? Set of 52 cards
Event?
R: The card is red.
F: The card is a face card.
H: The card is a heart.
3: The card is a 3.
Probability?
P(R) = 26/52
P(F) = 12/52
P(H) = 13/52
P(3) = 4/52
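The card probabilities can be reproduced by enumerating the sample space directly. The sketch below also checks independence with the product rule P(A and B) = P(A) P(B). The rank and suit names and the helper functions are illustrative choices, not part of the original slides.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # sample space: 52 (rank, suit) outcomes

def prob(event):
    """Probability of an event = favorable outcomes / equally likely outcomes."""
    favorable = [card for card in DECK if event(card)]
    return Fraction(len(favorable), len(DECK))

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_face(card):
    return card[0] in ("J", "Q", "K")

def is_heart(card):
    return card[1] == "hearts"

p_red, p_face, p_heart = prob(is_red), prob(is_face), prob(is_heart)
p_red_and_face = prob(lambda c: is_red(c) and is_face(c))
p_red_and_heart = prob(lambda c: is_red(c) and is_heart(c))

# Independent events satisfy P(A and B) = P(A) * P(B).
print("R and F independent:", p_red_and_face == p_red * p_face)
print("R and H independent:", p_red_and_heart == p_red * p_heart)
```

Knowing a card is red tells you nothing about whether it is a face card, but it does change the chance that it is a heart, which is why only the first pair passes the product test.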
Events and variables
Can be described as random or deterministic:
The outcome of a random event cannot be predicted:
The sum of two numbers on two rolled dice.
The time of emission of the ith particle from radioactive material.
The outcome of a deterministic event can be predicted:
The measured length of a table to the nearest cm.
Motion of macroscopic objects (projectiles, planets,
spacecraft) as predicted by classical mechanics.
Extent of randomness
A variable can be more random or more deterministic depending on
the degree to which you account for relevant parameters:
Mostly deterministic:
Only a small fraction of the outcome cannot be accounted for.
Length of a table:
• Temperature/humidity variation
• Measurement resolution
• Instrument/observer error
• Quantum-level intrinsic uncertainty
Mostly Random:
Most of the outcome cannot be accounted for.
• Trajectory of a given molecule in a solution
Random variables
Can be described as discrete or continuous:
• A discrete variable has a countable number of values.
Number of customers who enter a store before one
purchases a product.
• The values of a continuous variable can not be listed:
Distance between two oxygen molecules in a room.
Consider data collected for undergraduate students:
Random Variable     Possible Values
Gender              Male, Female
Class               Fresh, Soph, Jr, Sr
Height (inches)     Integer in interval [30, 90]
College             Arts, Education, Engineering, etc.
Shoe Size           3, 3.5 … 18
Is the height a discrete or continuous variable?
How could you measure height and shoe size to make them continuous
variables?
Probability Distributions
If a random event is repeated many times, it will produce a
distribution of outcomes (statistical regularity).
(Think about scores on an exam)
The distribution can be represented in two ways:
• Frequency distribution function: represents the
distribution as the number of occurrences of each outcome
• Probability distribution function: represents the
distribution as the percentage of occurrences of each
outcome
Discrete Probability Distributions
Consider a discrete random variable, X:
f(xi) is the probability distribution function
What is the range of values of f(xi)?
Therefore, Pr(X=xi) = f(xi)
Discrete Probability Distributions
Properties of discrete probabilities:
$\Pr(X = x_i) = f(x_i) \ge 0$ for all $i$
$\sum_{i=1}^{k} \Pr(X = x_i) = \sum_{i=1}^{k} f(x_i) = 1$ for $k$ possible discrete outcomes
$\Pr(a < X \le b) = F(b) - F(a) = \sum_{a < x_i \le b} f(x_i)$
Where: $F(x) = \Pr(X \le x)$
Discrete Probability Distributions
Example: Waiting for a success
Consider an experiment in which we toss a coin until heads turns up.
Outcomes, w = {H, TH, TTH, TTTH, TTTTH…}
Let X(w) be the number of tails before a heads turns up.
$f(x) = \dfrac{1}{2^{x+1}}$ for $x = 0, 1, 2, \ldots$
[Figure: bar chart of f(x) versus waiting time x = 0 to 6; vertical axis from 0 to 0.5.]
$\Pr(a < X \le b) = F(b) - F(a) = \sum_{a < x_i \le b} f(x_i)$
$\sum_{i=1}^{k} \Pr(X = x_i) = \sum_{i=1}^{k} f(x_i) = 1$
Board Example
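The coin-toss waiting distribution can be checked numerically. This sketch uses exact fractions and assumes a fair coin, so the probability of exactly x tails before the first head is (1/2)^(x+1); the function names `f` and `F` simply mirror the symbols used on the slides.

```python
from fractions import Fraction

def f(x):
    """Probability of exactly x tails before the first head: (1/2)^(x+1)."""
    return Fraction(1, 2) ** (x + 1)

def F(x):
    """Cumulative probability Pr(X <= x), summing f over 0..x."""
    return sum(f(i) for i in range(x + 1))

# The partial sums approach 1 as more outcomes are included.
partial = F(20)

# Pr(a < X <= b) = F(b) - F(a); here the chance of 2, 3, or 4 tails.
p_2_to_4 = F(4) - F(1)

print(float(partial), p_2_to_4)
```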
Cumulative Discrete Probability Distributions
$\Pr(X \le x') = F(x') = \sum_{i=1}^{j} f(x_i)$
Where $x_j$ is the largest discrete value of $X$ less than or equal to $x'$
$\Pr(X \le x_k) = 1$
Discrete Probability Distributions
Example: Distribution Function for Die/Dice
Distribution function for throwing a die:
Outcomes, w = {1, 2, 3, 4, 5, 6}; f(xi) = 1/6 for i = 1, …, 6
[Figure: bar chart of f(xi) for outcomes 1 to 6; vertical axis from 0.000 to 0.180.]
Discrete Probability Distributions
Example: Distribution Function for Die/Dice
Distribution function for the sum of two thrown dice:
f(xi) = 1/36 for x1 = 2
2/36 for x2 = 3
…
[Figure: bar chart of f(xi) for sums 2 to 12; vertical axis from 0.000 to 0.180.]
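The distribution for the sum of two dice can be built by counting outcomes. The sketch below forms the frequency distribution first (number of occurrences of each sum) and then divides by the total to get the probability distribution; it assumes two fair six-sided dice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (die 1, die 2) outcomes and count each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Frequency distribution -> probability distribution: divide by the total.
total = sum(counts.values())
pmf = {s: Fraction(n, total) for s, n in sorted(counts.items())}

for s in pmf:
    print(f"sum = {s:2d}  f = {counts[s]}/{total}")
```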
Continuous Probability Density Function
Cumulative Distribution Function (cdf): Gives the fraction of the total
probability that lies at or to the left of each x
Probability Density (Distribution) Function (pdf): Gives the density of
concentration of probability at each point x
$F(x) = \Pr(X \le x)$
Continuous Probability Distributions
Properties of the cumulative distribution function:
$F(-\infty) = 0$
$0 \le F(x) \le 1$
$F(x) = \Pr(X \le x)$
$F(\infty) = 1$
Properties of the probability density function:
$\Pr(a < X \le b) = F(b) - F(a) = \int_a^b f(x)\,dx$
Continuous Probability Distributions
For continuous variables, the events of interest are intervals rather than
isolated values.
Consider waiting time for a bus which is equally likely to be
anywhere in the next ten minutes:
Not interested in probability that the bus will arrive in
3.451233 minutes, but rather the probability that the bus
will arrive in the subinterval (a,b) minutes:
$P(a < T \le b) = F(b) - F(a) = \dfrac{b - a}{10}$
[Figure: plot of F(t) versus t, reaching F(t) = 1 at t = 10.]
Continuous Probability Distributions
Example: Gaussian (normal) distribution:
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \exp\left[-\dfrac{(x - \mu)^2}{2\sigma^2}\right]$
$Z = \dfrac{X - \mu}{\sigma}$
$X = \mu + \sigma Z$
Each member of the normal distribution family is described by
the mean (μ) and variance (σ²).
Standard normal curve: μ = 0, σ = 1.
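A short Python sketch of the Gaussian density and the standardization Z = (X − μ)/σ is given below. The cumulative probability is written with the error function from the standard `math` module; the values μ = 183 and σ = 1.5 are hypothetical numbers chosen only for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Pr(X <= x), written with the error function from the math module."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical example: X has mu = 183 and sigma = 1.5; standardize x = 184.5.
mu, sigma = 183.0, 1.5
z = (184.5 - mu) / sigma  # Z = (X - mu) / sigma

# Probability of landing within one standard deviation of the mean.
within_one_sigma = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
print(z, round(within_one_sigma, 4))
```

Because every normal variable maps onto the standard normal curve this way, the "within one sigma" probability is the same for any μ and σ.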
Normal / Gaussian Distribution
Normal (Gaussian) Distribution:
Can be used to approximately describe any variable that
tends to cluster around the mean.
Central Limit Theorem:
The sum of a (sufficiently) large number of independent
random variables will be approximately normally distributed.
Importance:
Used as a simple model for complex phenomena – statistics,
natural science, social science
e.g., Observational error assumed to follow normal distribution
Examples of experiments/measurements that will produce Gaussian distribution?
Standard Error
Deviation: $d_i = x_i - \mu$
Standard Deviation: $\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}}$
Variance: $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$
Variance is the average squared distance of the data from
the mean. Therefore, the standard deviation measures the
spread of data about the mean.
Standard Error: $\dfrac{\sigma}{\sqrt{N}}$
Standard Error
How do we reduce the size of our standard error?
1) Repeated Measurements
2) Different Measurement Strategy
Jacob Bernoulli (1713):
“For even the most stupid of men, by some instinct of nature, by
himself and without any instruction (which is a remarkable thing), is
convinced the more observations have been made, the less danger
there is of wandering from one’s goal” (Stigler, 1986).
Moments
Other values in terms of the moments:
Skewness: $\mu_3 / \mu_2^{3/2}$, where $\mu_n$ is the $n$th moment about the mean
‘lopsidedness’ of the distribution
• a symmetric distribution will have a skewness = 0
• negative skewness: the longer tail is on the left
• positive skewness: the longer tail is on the right
Kurtosis:
• Describes the shape of the distribution with respect to the
height and width of the curve (‘peakedness’)
Central Limit Theorem
As the sample size goes to infinity, the distribution function of the
standardized variable tends to the normal distribution function!
http://www.jhu.edu/virtlab/prob-distributions/
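The Central Limit Theorem can be illustrated with a small simulation. The sketch below sums uniform random numbers (a distribution that is far from bell-shaped), standardizes each sum, and checks that the results behave like a standard normal variable. The choice of 30 terms and 20,000 trials is arbitrary.

```python
import random
import statistics

def standardized_sums(n_terms, n_trials, seed=1):
    """Sum n_terms uniform(0, 1) draws, then standardize each sum."""
    rng = random.Random(seed)
    mean = n_terms * 0.5            # mean of one uniform draw is 1/2
    sd = (n_terms / 12.0) ** 0.5    # variance of one uniform draw is 1/12
    return [(sum(rng.random() for _ in range(n_terms)) - mean) / sd
            for _ in range(n_trials)]

z = standardized_sums(n_terms=30, n_trials=20_000)
frac_within_1 = sum(abs(v) <= 1.0 for v in z) / len(z)

# For a standard normal, about 68% of values fall within one unit of zero.
print(round(statistics.mean(z), 3),
      round(statistics.pstdev(z), 3),
      round(frac_within_1, 3))
```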
Moments
In physics, the moment refers to the force applied to a system at a distance
from the axis of rotation (as in a lever).
In mathematics, the moment is a measure of how far a function is from the origin.
The 1st moment about the origin (mean):
• Average value of x
The 2nd moment about the mean (variance):
• A measure of the ‘spread’ of the data
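Moments about the mean are simple averages, so they are easy to compute directly. The sketch below assumes the usual definitions (n-th central moment as the average of (x − mean)^n, and skewness as the third central moment over the second to the 3/2 power); both data sets are made up for illustration.

```python
def central_moment(data, n):
    """n-th moment about the mean: average of (x - mean)^n."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** n for x in data) / len(data)

def skewness(data):
    """Third central moment divided by the second to the 3/2 power."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 2, 2, 3, 10]  # made-up data with a long tail to the right

print(central_moment(symmetric, 2))  # variance of the symmetric sample
print(skewness(symmetric), skewness(right_tail))
```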
Two teams measure the height of a pole.
Height in cm
             Team A    Team B
             183       183.0
             182       183.5
             185       182.7
             181       182.5
             183       183.1
             184       183.3
avg =        183       183.0
std dev =    1.41      0.37
• Which team did the better job?
• Why do you think so?
If you measure something many times, you get
random error.
• The positive and negative errors
should balance out.
• The average should be closer to the
true value than any one
measurement might be.
• The deviations from the average for
individual measurements give an
indication of the reliability of that
average value.
• Standard deviation measures the
reliability of the average.
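The averages and standard deviations for the two teams can be reproduced with Python's standard `statistics` module. The tabulated values 1.41 and 0.37 match the sample form of the standard deviation, which divides by N − 1 rather than N, so that form is used here. The standard error line is an addition for comparison.

```python
import math
import statistics

team_a = [183, 182, 185, 181, 183, 184]
team_b = [183.0, 183.5, 182.7, 182.5, 183.1, 183.3]

def summarize(heights_cm):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(heights_cm)
    std = statistics.stdev(heights_cm)          # divides by N - 1
    std_err = std / math.sqrt(len(heights_cm))  # sigma / sqrt(N)
    return mean, std, std_err

for name, data in (("Team A", team_a), ("Team B", team_b)):
    mean, std, std_err = summarize(data)
    print(f"{name}: avg = {mean:.1f} cm, std dev = {std:.2f} cm, "
          f"std error = {std_err:.2f} cm")
```

Both teams get nearly the same average, but Team B's smaller spread means its average is pinned down more tightly.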
How do we make lines?
[Figure: data points and a fitted line, x from 0 to 1; the deviations of the points from the line are labeled e1 to e6.]
$y_i = (m x_i + b) + \delta y_i$
$\delta y_i = y_i - (m x_i + b)$
Plot ei vs xi
Good lines have random, uncorrelated errors
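One common way to choose the line is least squares, which picks m and b to minimize the sum of the squared errors. The slides do not state which method is intended, so least squares is an assumption here, and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b

def residuals(xs, ys, m, b):
    """e_i = y_i - (m*x_i + b) for each data point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

# Made-up data for illustration only (not the values from the slide's plot).
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [4.1, 4.3, 4.9, 5.2, 5.5, 6.1]

m, b = fit_line(xs, ys)
errors = residuals(xs, ys, m, b)
print(f"m = {m:.3f}, b = {b:.3f}")
print([round(e, 3) for e in errors])
```

Plotting `errors` against `xs` is the check described above: a visible trend or curve in that plot suggests a straight line is the wrong model.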
Error, Precision, Accuracy
• Error – difference between an observed/measured value
and a true value.
– We usually don’t know the true value
– We usually do have an estimate
• Systematic Errors
– Faulty calibration
– User bias
– Change in conditions – e.g., temperature rise
• Random Errors
– Statistical variation
– Precision of measurement
Error, Precision, Accuracy
• Accuracy – measure of how close result is to the true value
– Measure of correctness
• Precision – measure of how well the result is determined
– Measure of variation in the data, within itself, not relative
to true value
Error, Precision, Accuracy
[Figure: three targets showing (1) low precision, low accuracy; (2) low precision, high accuracy; (3) high precision, low accuracy.]
Calculators and significant digits:
Let the uncertain digit determine the precision to
which you quote a result
Calculator: 12.6892
Estimated Error: +/- 0.07
Quote: 12.69 +/- 0.07
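The quoting rule can be automated. The sketch below keeps one significant figure in the error and rounds the value to the same decimal place; the one-significant-figure convention and the helper name `quote` are assumptions, not something the slides specify.

```python
import math

def quote(value, error):
    """Round the error to one significant figure and the value to match."""
    exponent = math.floor(math.log10(abs(error)))  # place of the uncertain digit
    decimals = max(0, -exponent)
    rounded_error = round(error, -exponent)
    rounded_value = round(value, -exponent)
    return f"{rounded_value:.{decimals}f} +/- {rounded_error:.{decimals}f}"

print(quote(12.6892, 0.07))  # prints "12.69 +/- 0.07"
```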
What is an error?
• In data analysis, engineers use
– error = uncertainty
– error ≠ mistake.
• Mistakes in calculation and measurements should always be
corrected before calculating experimental error.
• Measured value of x = $x_{best} \pm \delta x$
– $x_{best}$ = best estimate or measurement of x
– $\delta x$ = uncertainty or error in the measurements
All measurements have errors.
• What are some sources of measurement errors?
– Instrument uncertainty (caliper vs. ruler)
• Use half the smallest division.
L = 9 ± 0.5 cm
L = 8.5 ± 0.3 cm
L = 11.8 ± 0.1 cm
– Measurement error (using an instrument incorrectly)
• Measuring your height without holding the ruler level.
– Variations in the size of the object (spaghetti is bumpy)
• Statistical uncertainty
If no error is given, assume half the last significant figure.
• That's why you don't write 25.367941 mm.
How do you account for errors in calculations?
• The way you combine errors depends on the math function
– added or subtracted
– multiplied or divided
– other functions
• The sum of two lengths is Leq = L1 + L2. What is the error in Leq?
• The area of a room is A = L x W. What is the error in A?
• A simple error calculation gives the largest probable error.
Sum or difference
• What is the error if you add or subtract numbers?
$x \pm \delta x$, $y \pm \delta y$, $z \pm \delta z$
$w = x + y - z$
• The absolute error is the sum of the absolute errors.
$\delta w = \delta x + \delta y + \delta z$ (upper bound)
What is the error in length of molding to put
around a room?
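• $L_1 = 5.0\ \text{cm} \pm 0.5\ \text{cm}$ and $L_2 = 6.0\ \text{cm} \pm 0.3\ \text{cm}$.
• The perimeter is
$L = L_1 + L_2 + L_1 + L_2 = 5.0\ \text{cm} + 6.0\ \text{cm} + 5.0\ \text{cm} + 6.0\ \text{cm} = 22\ \text{cm}$
• The error (upper bound) is:
$\delta L = \delta L_1 + \delta L_2 + \delta L_1 + \delta L_2 = 0.5\ \text{cm} + 0.3\ \text{cm} + 0.5\ \text{cm} + 0.3\ \text{cm} = 1.6\ \text{cm}$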
Errors can be large when you subtract similar values.
• Weight of container = 30 ± 5 g
• Weight of container plus nuts = 35 ± 5 g
• Weight of nuts?
$\text{Weight} = (35 - 30)\ \text{g} = 5\ \text{g}$
$\text{Error} = (5 + 5)\ \text{g} = 10\ \text{g}$
$\text{Result} = 5\ \text{g} \pm 10\ \text{g}$ (200%)
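The sum-or-difference rule takes only a few lines of Python. This sketch reproduces the molding and the container examples; the helper name `add_sub_error` is an illustrative choice.

```python
def add_sub_error(*errors):
    """Upper-bound error of a sum or difference: add the absolute errors."""
    return sum(abs(e) for e in errors)

# Molding around the room: L = L1 + L2 + L1 + L2.
L1, dL1 = 5.0, 0.5
L2, dL2 = 6.0, 0.3
perimeter = L1 + L2 + L1 + L2
d_perimeter = add_sub_error(dL1, dL2, dL1, dL2)

# Nuts = (container + nuts) - container; the errors still add.
nuts = 35.0 - 30.0
d_nuts = add_sub_error(5.0, 5.0)

print(f"L = {perimeter:.0f} +/- {d_perimeter:.1f} cm")
print(f"nuts = {nuts:.0f} +/- {d_nuts:.0f} g ({d_nuts / nuts:.0%} relative error)")
```

The values subtract but the errors add, which is why the difference of two similar numbers can end up with an error larger than the result itself.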
What is the error in the area of a room?
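• $L = 5.0\ \text{cm} \pm 0.5\ \text{cm}$ and $W = 6.0\ \text{cm} \pm 0.3\ \text{cm}$.
$A = L \times W = 5.0\ \text{cm} \times 6.0\ \text{cm} = 30.0\ \text{cm}^2$
• What is the relative error?
$\dfrac{\delta A}{A} = \dfrac{\delta L}{L} + \dfrac{\delta W}{W} = \dfrac{0.5\ \text{cm}}{5.0\ \text{cm}} + \dfrac{0.3\ \text{cm}}{6.0\ \text{cm}} = 0.15$ or 15%
Board Derivation
• What is the absolute error?
$\delta A = A \times 0.15 = 30.0\ \text{cm}^2 \times 0.15 = 4.5\ \text{cm}^2$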
Product or quotient
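• What is the error if you multiply or divide?
$x \pm \delta x$, $y \pm \delta y$, $z \pm \delta z$
$w = \dfrac{x \cdot y}{z}$
$w \pm \delta w = \dfrac{(x \pm \delta x) \cdot (y \pm \delta y)}{z \pm \delta z}$
• The relative error is the sum of the relative errors.
$\dfrac{\delta w}{w} = \dfrac{\delta x}{x} + \dfrac{\delta y}{y} + \dfrac{\delta z}{z}$ (upper bound)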
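The product-or-quotient rule can be sketched the same way, here applied to the room-area numbers from the earlier example (L = 5.0 ± 0.5 cm, W = 6.0 ± 0.3 cm). The helper name `mul_div_relative_error` is an illustrative choice.

```python
def mul_div_relative_error(*pairs):
    """Upper-bound relative error of a product or quotient.

    Each pair is (value, absolute_error); the relative errors add.
    """
    return sum(abs(err / value) for value, err in pairs)

# Area of the room: A = L x W.
L, dL = 5.0, 0.5
W, dW = 6.0, 0.3
area = L * W
rel = mul_div_relative_error((L, dL), (W, dW))
d_area = area * rel  # convert the relative error back to an absolute error

print(f"A = {area:.1f} cm^2, relative error = {rel:.0%}, "
      f"absolute error = {d_area:.1f} cm^2")
```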
Multiply by constant
• What if you multiply a variable x by a constant B?
$w = Bx$
• The error is the constant times the absolute error.
$\delta w = B\,\delta x$
What is the error in the circumference of a circle?
• $C = 2\pi R$
– For $R = 2.15 \pm 0.08$ cm
• $\delta C = 2\pi\,(0.08\ \text{cm}) = 0.50\ \text{cm}$
Powers and exponents
• What if you square or cube a number?
$w = x^n$
• The relative error is the exponent times the relative error.
$\dfrac{\delta w}{w} = n\,\dfrac{\delta x}{x}$
What is the error in the volume of a sphere?
• $V = \frac{4}{3}\pi R^3$
– For $R = 2.15 \pm 0.08$ cm
– $V = 41.6\ \text{cm}^3$
• $\delta V / V = 3 \times (0.08\ \text{cm} / 2.15\ \text{cm}) = 0.11$
• $\delta V = 0.11 \times 41.6\ \text{cm}^3 = 4.6\ \text{cm}^3$
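The constant-multiple and power rules can be checked against the circle and sphere examples (R = 2.15 ± 0.08 cm). This is a sketch; the helper names are illustrative, and `abs()` is used so that a negative constant or exponent still gives a positive error.

```python
import math

def constant_multiple_error(B, dx):
    """w = B*x: the absolute error is the constant (in magnitude) times dx."""
    return abs(B) * dx

def power_relative_error(n, x, dx):
    """w = x**n: the relative error is |n| times the relative error of x."""
    return abs(n) * dx / abs(x)

R, dR = 2.15, 0.08  # cm

dC = constant_multiple_error(2 * math.pi, dR)  # circumference C = 2*pi*R
V = 4.0 / 3.0 * math.pi * R ** 3               # sphere volume
rel_V = power_relative_error(3, R, dR)         # the constant 4/3*pi adds no relative error
dV = rel_V * V

print(f"dC = {dC:.2f} cm")
print(f"V = {V:.1f} cm^3, dV/V = {rel_V:.2f}, dV = {dV:.1f} cm^3")
```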
What is the error in the volume of a sphere?
Lab “Calculus of Errors” Explanation
How much error did you have in your remote
measurement result?
