Transcript Lecture 3

Fundamentals of Data Analysis
Lecture 3
Basics of statistics
Program for today
Basic terms and definitions
Discrete distributions
Continuous distributions
Normal distribution
Topics for discussion


What are the applications of statistics in modern physics?
How important is drawing conclusions based on statistical analysis?
What is statistics?
Definition of statistics:
1. A collection of quantitative data pertaining to a subject or group, for example blood pressure statistics.
2. The science that deals with the collection, tabulation, analysis, interpretation, and presentation of quantitative data.
What is statistics?
Two phases of statistics:
• Descriptive statistics:
  o Describes the characteristics of a product or process using information collected on it.
• Inferential (inductive) statistics:
  o Draws conclusions about unknown process parameters based on information contained in a sample.
  o Uses probability.
Probability

When we cannot rely on the assumption that all
sample points are equally likely, we have to
determine the probability of an event
experimentally. We perform a large number of
experiments N and count how often each of the
sample points is obtained. The ratio of the
number of occurrences of a certain sample point
to the total number of experiments is called the
relative frequency.
Probability

The probability is then assigned the relative
frequency of the occurrence of a sample point in
this long series of repetitions of the experiment.
This is based on the "law of large numbers", which
says that the relative frequency approaches the true
(theoretical) probability of the outcome as the
experiment is repeated over and over again.
Probability
$$ P(E) = \frac{n(E)}{N} $$
where n(E) is the number of times the event E took
place out of a total of N experiments. From this
definition we can see that the probability is a
number between 0 and 1. When the probability is 1,
we know that the outcome is certain.
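The relative-frequency idea can be tried out with a short simulation. This is only a sketch under stated assumptions: the experiment is taken to be the roll of a fair six-sided die, and the helper name `relative_frequency` and the fixed seed are illustrative choices, not part of the lecture.

```python
import random

def relative_frequency(event, n_experiments, rng):
    """Return n(E) / N for N simulated rolls of a fair six-sided die (assumed experiment)."""
    hits = 0
    for _ in range(n_experiments):
        outcome = rng.randint(1, 6)  # one experiment: a single die roll
        if event(outcome):
            hits += 1
    return hits / n_experiments

rng = random.Random(12345)  # fixed seed so that a run can be repeated
for n in (100, 10_000, 100_000):
    # Event E: "the die shows a six"; the theoretical probability is 1/6.
    print(n, relative_frequency(lambda face: face == 6, n, rng))
```

As N grows, the printed values should settle near 1/6, which is the behaviour the law of large numbers describes.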
Probability
For a discrete random variable, the definition of probability is
intuitive:
$$ P(x) = \frac{n(x)}{N} $$
where n(x) is the number of occurrences of the desired value of
the random variable x (successes) in N samples (N → ∞).
Probability

For a continuous random variable, this definition
requires the identification of a small range of
variation Δx (Δx → 0), for which the probability is
determined:
$$ P(x_0 \le x \le x_0 + \Delta x) = \frac{n(x_0 \le x \le x_0 + \Delta x)}{N} $$

For a continuous random variable it is preferable
to use the probability density function:
$$ f(x_0) = \frac{P(x_0 \le x \le x_0 + \Delta x)}{\Delta x} $$
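A rough numerical illustration of the density definition: count the samples that fall in a small interval starting at x0, turn the count into a relative frequency, and divide by Δx. The uniform test distribution, the helper name `estimate_density`, and the values of N and Δx are assumptions chosen only for this sketch.

```python
import random

def estimate_density(samples, x0, dx):
    """Estimate f(x0) as P(x0 <= x <= x0 + dx) / dx, with P taken as a relative frequency."""
    n_in_range = sum(1 for x in samples if x0 <= x <= x0 + dx)
    probability = n_in_range / len(samples)
    return probability / dx

rng = random.Random(2024)
# Assumed test case: uniform samples on [0, 1), whose true density is 1 everywhere.
samples = [rng.random() for _ in range(200_000)]
print(estimate_density(samples, x0=0.5, dx=0.05))
```

The estimate should come out close to 1. A smaller Δx gives a more local estimate but needs more samples to keep the statistical noise down.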
Histogram
The histogram is the most important graphical tool for
exploring the shape of data distributions and a good way
to visualize trends in population data. The more often a
particular value occurs, the taller the corresponding bar
on the histogram.
Histogram
Constructing a histogram
Step 1: Find the range of the distribution: largest value minus smallest value
Step 2: Choose the number of classes, 5 to 20
Step 3: Determine the width of the classes, stated to one
decimal place more than the data: class width =
range / number of classes
Step 4: Determine the class boundaries
Step 5: Draw the frequency histogram
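The five steps can be sketched in a few lines of Python. The data values are made up, the function name `build_histogram` is hypothetical, and the sketch assumes equal-width classes and at least two distinct data values. It does not round the class width to an extra decimal place.

```python
def build_histogram(data, n_classes):
    """Return (boundaries, counts) for equal-width classes covering the data."""
    lo, hi = min(data), max(data)
    data_range = hi - lo                      # Step 1: range
    width = data_range / n_classes            # Step 3: class width (Step 2 is n_classes)
    boundaries = [lo + i * width for i in range(n_classes + 1)]  # Step 4
    counts = [0] * n_classes
    for x in data:
        # Index of the class that contains x; the maximum goes into the last class.
        k = min(int((x - lo) / width), n_classes - 1)
        counts[k] += 1
    return boundaries, counts

data = [2.1, 2.4, 2.4, 3.0, 3.3, 3.9, 4.2, 4.8, 5.5, 6.0]  # made-up measurements
boundaries, counts = build_histogram(data, n_classes=5)
for left, right, c in zip(boundaries, boundaries[1:], counts):
    print(f"{left:5.2f} - {right:5.2f} | {'#' * c}")  # Step 5: text histogram
```

Each printed row is one class, and the number of `#` marks is its frequency.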
Histogram
Number of groups or cells
• Fewer than 100 observations: 5 to 9 cells
• Between 100 and 500 observations: 8 to 17 cells
• More than 500 observations: 15 to 20 cells
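The rule of thumb above can be written as a small lookup. The function name `suggested_cells` is hypothetical, and treating exactly 100 and exactly 500 observations as part of the middle band is an assumption, since the slide does not say which band the end points belong to.

```python
def suggested_cells(n_observations):
    """Return the (minimum, maximum) suggested number of histogram cells."""
    if n_observations < 100:
        return (5, 9)
    if n_observations <= 500:   # assumption: 100 and 500 fall in the middle band
        return (8, 17)
    return (15, 20)

for n in (50, 320, 1000):
    print(n, suggested_cells(n))
```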
Analysis of histogram
Calculating the average for ungrouped data:
$$ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} $$
and for grouped data:
$$ \bar{X} = \frac{\sum_{i=1}^{h} f_i X_i}{n} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_h X_h}{f_1 + f_2 + \dots + f_h} $$
Analysis of histogram
Boundaries   Midpoint   Frequency   Computation
23.6-26.5    25.0         4           100
26.6-29.5    28.0        36          1008
29.6-32.5    31.0        51          1581
32.6-35.5    34.0        63          2142
35.6-38.5    37.0        58          2146
38.6-41.5    40.0        52          2080
41.6-44.5    43.0        34          1462
44.6-47.5    46.0        16           736
47.6-50.5    49.0         6           294
Total                   320         11549
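The grouped-data average for this table can be checked with a few lines of Python. The midpoints and frequencies are copied from the table; the variable names are illustrative only.

```python
# Class midpoints X_i and frequencies f_i, copied from the table.
midpoints = [25.0, 28.0, 31.0, 34.0, 37.0, 40.0, 43.0, 46.0, 49.0]
frequencies = [4, 36, 51, 63, 58, 52, 34, 16, 6]

n = sum(frequencies)                                            # total frequency
weighted_sum = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f_i * X_i
grouped_mean = weighted_sum / n

print(n, weighted_sum, grouped_mean)
```

The totals reproduce the last row of the table (320 and 11549), which gives a grouped mean of about 36.09.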
Measures of dispersion

Range

Standard deviation

Variance
Measures of dispersion
The range is the simplest and easiest to
calculate of the measures of dispersion.
R = Xmax - Xmin
Measures of dispersion
Standard deviation of the sample:
$$ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} $$
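A minimal sketch of the range and the sample standard deviation (with n - 1 in the denominator), compared against Python's standard library. The data values are made up and the helper names `sample_range` and `sample_std` are illustrative.

```python
import math
import statistics

def sample_range(values):
    """R = X_max - X_min."""
    return max(values) - min(values)

def sample_std(values):
    """S = sqrt( sum (X_i - mean)^2 / (n - 1) )."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
print(sample_range(values))
print(sample_std(values), statistics.stdev(values))  # the two should agree
```

`statistics.stdev` also uses the n - 1 denominator, so it serves as an independent check of the hand-written formula.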
Measures of dispersion
For a discrete random variable, the variance is
defined as follows:
$$ V(x) = \sum_{i} \left( x_i - E(x) \right)^2 P(x_i) $$
and for a continuous random variable:
$$ V(x) = \int_a^b \left( x - E(x) \right)^2 f(x)\, dx $$
Parameters of a distribution


A parameter is a characteristic of a population;
in other words, it describes a population.
A statistic is a characteristic of a sample. It is
used to make inferences about the population
parameters, which are typically unknown, and in
that role it is called an estimator.
Parameters of a distribution

Population - Set of all items that
possess a characteristic of interest

Sample - Subset of a population
Parameters of a distribution
Expected value (EV) of a discrete random variable:
$$ E(x) = \sum_{i=1}^{k} x_i P(x_i) $$
and for a continuous random variable:
$$ E(x) = \int_a^b x f(x)\, dx $$
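As a worked example of the discrete formulas for the expected value and the variance, the sketch below uses a fair six-sided die. The die is an assumed example, not one from the lecture, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def expected_value(values, probs):
    """E(x) = sum of x_i * P(x_i)."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """V(x) = sum of (x_i - E(x))^2 * P(x_i)."""
    mean = expected_value(values, probs)
    return sum((x - mean) ** 2 * p for x, p in zip(values, probs))

faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6          # fair die: every face has probability 1/6

print(expected_value(faces, probs))   # 7/2
print(variance(faces, probs))         # 35/12
```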
Random numbers
   1     2     3     4     5     6     7     8     9    10
1534  7106  2836  7873  5574  7545  7590  5574  1202  7712
6128  8993  4102  2551  0330  2358  6427  7067  9325  2454
6047  8566  8644  9343  9297  6751  3500  8754  2913  1258
0806  5201  5705  7355  1448  9562  7514  9205  0402  2427
9915  8274  4525  5695  5752  9630  7172  6988  0227  4264
2882  7158  4341  3463  1178  5789  1173  0670  0820  5067
9213  1223  4388  9760  6691  6861  8214  8813  0611  3131
8410  9836  3899  3883  1253  1683  6988  9978  8026  6751
9974  2362  2103  4326  3825  9079  6187  2721  1489  4216
3402  8162  8226  0782  3364  7871  4500  5598  9424  3816
8188  6569  1492  2139  8823  6878  0613  7161  0241  3834
3825  7020  1124  7483  9155  4919  3209  5959  2364  2555
9801  8788  6338  5899  3309  0807  0968  0539  4205  8257
Normal distribution
Characteristics of the normal curve:
• It is symmetrical: half the cases are on one side of the
center, and the other half are on the other side.
• The distribution is single-peaked, not bimodal or multimodal.
• It is also known as the Gaussian distribution.
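Normal distribution
• Probability density function:
$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) $$
• N(μ, σ): normal distribution with mean μ and standard deviation σ
• N(0, 1): the standard normal distribution, a normal
distribution with a mean of 0 and a standard deviation
of 1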
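A short sketch of the normal density, using the standard Gaussian formula with mean μ and standard deviation σ, and checking it against `statistics.NormalDist` from the Python standard library. The function name `normal_pdf` is illustrative.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma): exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

standard = NormalDist(mu=0.0, sigma=1.0)       # N(0, 1)
print(normal_pdf(0.0), standard.pdf(0.0))      # peak of the standard normal curve
print(normal_pdf(1.0), normal_pdf(-1.0))       # equal values: the curve is symmetrical
```

The second line of output illustrates the symmetry of the curve about its mean.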
Exponential distribution

Probability density function:
$$ f(x) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0 $$
where λ > 0 is the rate parameter.

Cumulative distribution function
The cumulative distribution function is given by:
$$ F(x) = P(-\infty < X \le x) = \int_{-\infty}^{x} f(t)\, dt $$
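To make the definition concrete, the sketch below evaluates F(x) for the standard normal distribution in two ways: through the closed form based on the error function, and by numerically integrating the density from a far-left cutoff up to x. The cutoff of -8, the step count, and the function names are assumptions made for this illustration.

```python
import math

def standard_normal_cdf(x):
    """F(x) for N(0, 1), written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_by_integration(x, lower=-8.0, steps=20_000):
    """Approximate the integral of the N(0, 1) density from `lower` to x (midpoint rule)."""
    h = (x - lower) / steps
    total = 0.0
    for i in range(steps):
        t = lower + (i + 0.5) * h                    # midpoint of the i-th slice
        total += math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return total * h

for x in (-1.0, 0.0, 1.0):
    print(x, standard_normal_cdf(x), cdf_by_integration(x))
```

The two columns should agree closely, and F(0) is 0.5 because half of the probability lies on each side of the mean.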
Thanks for your attention!