Probability&statistics


Mathematical Statistics
Instructor:
Dr. Deshi Ye
Course homepage: http://www.cs.zju.edu.cn/people/yedeshi/
Course information
• What is it for?
– This course provides an elementary
introduction to mathematical statistics with
applications.
– Topics include: statistical estimation,
hypothesis testing, confidence intervals,
calculation of a P-value, nonparametric testing,
curve fitting, analysis of variance, and factorial
experimental design.
Grading
• Grades for the course will be based on the
following weighting
1) Class attendance: 10%
2) Homework assignment: 26%
3) Unit quiz: 24% (12%, 12%)
4) Final exam: 40%
Introduction
• Probability theory is devoted to the study
of uncertainty and variability
• Statistics can be described as the study of
how to make inferences and decisions in
the face of uncertainty and variability
Brief History
• Blaise Pascal and Pierre de Fermat: the
origins of probability theory are found in their work.
– concerning a popular dice game
– fundamental principles of probability theory
• Pierre de Laplace:
– Before him, probability was mainly concerned
with the analysis of games of chance
– Laplace applied probabilistic ideas to many
scientific and practical problems
A case study
• Visually inspecting data to improve
product quality
Population and Sample
• Investigations of a physical phenomenon,
production process, or manufactured unit
share some common characteristics.
• Relevant data must be collected.
• Unit: the source of each measurement.
– A single entity, usually an object or person
• Population: entire collection of units.
Examples
Population                                  Unit      Variables
All students currently enrolled in school   student   GPA, number of credits
All books in library                        book      Replacement cost
Sample
• Statistical population: the set of all
measurements corresponding to each unit
in the entire population of units about
which information is sought.
• Sample: A sample from a statistical
population is the subset of measurements
that are actually collected in the course of
investigation.
Ch2: Treatment of data
• Outline
– Pareto diagrams, dot diagrams
– Histograms (Frequency distributions)
– Stem-and-leaf display
– Box-plot (Quartiles and Percentiles)
– The calculation of the sample mean x̄ and standard deviation s
Pareto Diagram
• For a computer-controlled lathe whose
performance was below par, workers
recorded the following causes and their
frequencies:
Cause                     Frequency
power fluctuations        6
controller not stable     22
operator error            13
worn tool not replaced    2
other                     5
Minitab14
• 1. Stat->Quality tools->Pareto chart
• 2. Choose chart defects table as follows
Output
Pareto diagram
• Pareto diagram: depicts Pareto’s empirical
law that any assortment of events consists
of a few major and many minor elements.
• Typically, two or three elements will
account for more than half of the total
frequency.
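The ordering behind a Pareto diagram can be sketched in a few lines. This is only an illustration, assuming the lathe frequencies listed earlier and a plain sort by frequency; the helper name `pareto_table` is made up for this example and is not part of Minitab.

```python
# Causes and frequencies from the lathe example.
causes = {
    "power fluctuations": 6,
    "controller not stable": 22,
    "operator error": 13,
    "worn tool not replaced": 2,
    "other": 5,
}


def pareto_table(freqs):
    """Return (cause, frequency, cumulative percent) rows, largest first.

    Hypothetical helper: sorts purely by frequency, so "other" is not
    forced to the end as some charting tools do.
    """
    total = sum(freqs.values())
    rows, running = [], 0
    for cause, count in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((cause, count, 100.0 * running / total))
    return rows


rows = pareto_table(causes)
for cause, count, cum_pct in rows:
    print(f"{cause:<24}{count:>4}{cum_pct:>8.1f}%")
```

Here the two largest causes together account for 35 of the 48 recorded problems, which is consistent with the "few major elements" observation.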
Dot diagram
• Observations of the deviations of cutting
speed from the target value set by the
controller.
• Ex.: cutting speed – target speed
• 3 6 –2 4 7 4
• In Minitab: stat->dotplots->simple
Dot diagram
• This diagram visually summarizes the
information that the lathe is generally
running fast.
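A text-only dot diagram of the six deviations can be produced as below. This is a rough sketch, assuming integer-valued observations; `dot_diagram` is an illustrative helper, not a Minitab function.

```python
from collections import Counter

# Deviations (cutting speed minus target speed) from the example.
deviations = [3, 6, -2, 4, 7, 4]


def dot_diagram(values):
    """Return text lines: stacked dots above a labelled integer axis."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    lines = []
    # Draw the highest stack level first so the dots sit on the axis.
    for level in range(max(counts.values()), 0, -1):
        lines.append("".join("  ." if counts[v] >= level else "   " for v in axis))
    lines.append("".join(f"{v:>3}" for v in axis))
    return lines


lines = dot_diagram(deviations)
print("\n".join(lines))
```

Five of the six dots fall to the right of zero, which is the visual cue that the lathe tends to run fast.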
Data001.
80 measurements of sulfur oxide emissions (in tons)
from an industrial plant
• 15.8 26.4 17.3 11.2 23.9 24.8 18.7 13.9 9.0 13.2 22.7
9.8 6.2 14.7 17.5 26.1 12.8 28.6 17.6 23.7 26.8
• 22.7 18.0 20.5 11.0 20.9 15.5 19.4 16.7 10.7 19.1 15.2
22.9 26.6 20.4 21.4 19.2 21.6 16.9 19.0 18.5 23.0
• 24.6 20.1 16.2 18.0 7.7 13.5 23.5 14.5 14.4 29.6 19.4
17.0 20.8 24.3 22.5 24.6 18.4 18.1 8.3 21.9 12.3
• 22.3 13.3 11.8 19.3 20.0 25.7 31.8 25.9 10.5 15.9 27.5
18.1 17.9 9.4 24.1 20.1 28.5
Frequency distributions
• A frequency distribution is a tabular
arrangement of data whereby the data is
grouped into different intervals, and then
the number of observations that belong to
each interval is determined.
• Data that is presented in this manner are
known as grouped data.
Class limits & frequency
Class limits    Frequency
5.0 – 8.9       3
9.0 – 12.9      10
13.0 – 16.9     14
17.0 – 20.9     25
21.0 – 24.9     17
25.0 – 28.9     9
29.0 – 32.9     2
Total           80
Class limit and width
• lower class limit: The smallest value that can belong to
a given interval
• upper class limit: The largest value that can belong to
the interval.
• Class width: the difference between the upper class
limit and the lower class limit of an interval.
• When designing the intervals to be used in a frequency
distribution, it is preferable that the class widths of all
intervals be the same.
Class limits & frequency
Class limits    Frequency
[5.0, 9.0)      3
[9.0, 13.0)     10
[13.0, 17.0)    14
[17.0, 21.0)    25
[21.0, 25.0)    17
[25.0, 29.0)    9
[29.0, 33.0)    2
Total           80
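The tally for Data001 can be reproduced with a short script. This is a minimal sketch, assuming half-open classes [lower, upper) of width 4 starting at 5.0; the function name `frequency_table` and its parameters are illustrative only.

```python
# Data001: the 80 sulfur oxide emission values listed earlier.
data001 = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2, 22.7,
    9.8, 6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7, 26.8,
    22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7, 19.1, 15.2,
    22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0, 18.5, 23.0,
    24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5, 14.4, 29.6, 19.4,
    17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1, 8.3, 21.9, 12.3,
    22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8, 25.9, 10.5, 15.9, 27.5,
    18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]


def frequency_table(values, start, width, n_classes):
    """Count values in half-open classes [lower, lower + width)."""
    classes = [(start + k * width, start + (k + 1) * width) for k in range(n_classes)]
    counts = [sum(1 for x in values if lower <= x < upper) for lower, upper in classes]
    return classes, counts


classes, counts = frequency_table(data001, start=5.0, width=4.0, n_classes=7)
for (lower, upper), count in zip(classes, counts):
    print(f"[{lower:4.1f}, {upper:4.1f})  {count:3d}")
print("Total", sum(counts))
```

Using half-open intervals means a value such as 9.0 or 17.0 belongs to exactly one class, so no observation is counted twice.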
Variants of frequency distribution
• The cumulative frequency distribution is
obtained by computing the cumulative frequency,
defined as the total frequency of all values less
than the upper class limit of a particular interval,
for all intervals.
• Relative frequency: the ratio of the number of
observations in the interval to the total number of
observations
• The percentage frequency distribution is arrived
at by multiplying the relative frequencies of each
interval by 100%.
Cumulative frequency
Class limits    Frequency
Less than 5     0
Less than 9     3
Less than 13    13
Less than 17    27
Less than 21    52
Less than 25    69
Less than 29    78
Less than 33    80
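The cumulative column is simply a running sum of the class frequencies. A small sketch, assuming the seven class frequencies from the frequency table and using the standard-library `itertools.accumulate`:

```python
from itertools import accumulate

# Upper class limits and class frequencies from the frequency table.
upper_limits = [9, 13, 17, 21, 25, 29, 33]
frequencies = [3, 10, 14, 25, 17, 9, 2]

# Running total: number of observations below each upper limit.
cumulative = list(accumulate(frequencies))

for upper, cum in zip(upper_limits, cumulative):
    print(f"Less than {upper:2d}: {cum}")
```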
Percentage distribution
Class limits    Perc. Dist.    Frequency
[5.0, 9.0)      3.75%          3
[9.0, 13.0)     12.5%          10
[13.0, 17.0)    17.5%          14
[17.0, 21.0)    31.25%         25
[21.0, 25.0)    21.25%         17
[25.0, 29.0)    11.25%         9
[29.0, 33.0)    2.5%           2
Total           100%           80
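The percentage column follows from dividing each frequency by the total and multiplying by 100. A brief sketch, assuming the same seven class frequencies:

```python
# Class frequencies from the frequency table.
frequencies = [3, 10, 14, 25, 17, 9, 2]
total = sum(frequencies)

# Relative frequency = count / total; percentage = 100 * relative frequency.
relative = [f / total for f in frequencies]
percentages = [100 * f / total for f in frequencies]

for f, pct in zip(frequencies, percentages):
    print(f"{f:3d}  {pct:6.2f}%")
print(f"{total:3d}  {sum(percentages):6.2f}%")
```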
Histogram
• The most common form of graphical
presentation of a frequency distribution is
the histogram.
• Histogram: constructed of adjacent
rectangles; the heights of the rectangles are
the class frequencies, and the bases of the
rectangles extend between successive
class boundaries.
Histogram in Minitab
1. Graph->histogram->simple
2. Graph variables: c4
3. Edit bars: click the bars in the output figure; in
Binning, for Interval type select midpoint, for Interval
definition select midpoint/cutpoint, and then input 7
11 15 19 23 27 31
Density histogram
• When a histogram is constructed from a
frequency table having classes of unequal
lengths, the height of each rectangle must be
changed to
• Height = relative frequency / width.
• The area of the rectangle then represents the
relative frequency for the class and the total area
of the histogram is 1.
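The height rule can be checked numerically. The sketch below assumes a hypothetical regrouping of the emission classes into unequal widths (the first two and the last two classes merged); this regrouping is not in the original slides and is used only to illustrate the formula.

```python
# Hypothetical unequal-width classes: [5, 13) and [25, 33) are merged
# versions of the original width-4 classes, for illustration only.
classes = [(5.0, 13.0), (13.0, 17.0), (17.0, 21.0), (21.0, 25.0), (25.0, 33.0)]
frequencies = [13, 14, 25, 17, 11]

total = sum(frequencies)

# Height = relative frequency / class width.
heights = [(f / total) / (upper - lower) for f, (lower, upper) in zip(frequencies, classes)]

# Area of each rectangle = height * width = relative frequency.
areas = [h * (upper - lower) for h, (lower, upper) in zip(heights, classes)]

for (lower, upper), h in zip(classes, heights):
    print(f"[{lower:4.1f}, {upper:4.1f})  height = {h:.5f}")
print("total area =", round(sum(areas), 10))
```

Dividing by the width keeps a wide class from looking taller than it should merely because it collects more observations.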
Density histogram
Cumulative histogram
• 1) Graph->histogram->simple
• 2) Data view->
– Data display: check “symbols” only
– Smoother: check “lowess”, with “0” in degree of
smoothing and “1” in number of steps.
Stem-and-leaf Display
• A table of class limits and frequencies shows how
many observations fall in each class, but the original
data points have been lost.
• Stem-and-leaf: functions the same as a
histogram but retains the original data points.
• Example: 10 numbers:
• 12, 13, 21, 27, 33, 34, 35, 37, 40, 40
• Frequency table
Class limits    Frequency
10 – 19         2
20 – 29         2
30 – 39         4
40 – 49         2
Stem-and-leaf
Stem-and-leaf: each row has a stem and
each digit on a stem to the right of the vertical
line is a leaf.
The "stem" is the left-hand column which
contains the tens digits.
The "leaves" are the lists in the right-hand
column, showing all the ones digits for each
of the tens, twenties, thirties, and forties.
Key: “4|0” means 40
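For the ten example numbers, the display can be built by splitting each value into its tens digit (stem) and ones digit (leaf). This is a minimal sketch, assuming non-negative two-digit integers; `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group the ones digits (leaves) under their tens digits (stems)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)


display = stem_and_leaf([12, 13, 21, 27, 33, 34, 35, 37, 40, 40])
for stem, leaves in sorted(display.items()):
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Each row's length matches the class frequency, and the individual values can still be read back from the stem and leaf digits.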
Stem-and-leaf in Minitab
• The display has three columns:
– The leaves (right) - Each value in the leaf column
represents a digit from one observation.
– The stem (middle) - The stem value represents the
digit immediately to the left of the leaf digit.
– Counts (left) - If the median value for the sample is
included in a row, the count for that row is enclosed in
parentheses. The values for rows above and below
the median are cumulative.
Stem-and-leaf for DATA001
Stem-and-leaf of frequencies N = 80
Leaf Unit = 1.0
   2   0  67
   6   0  8999
  11   1  00111
  17   1  223333
  24   1  4445555
  32   1  66677777
 (13)  1  8888888999999
  35   2  0000000111
  25   2  222223333
  16   2  4444455
   9   2  66667
   4   2  889
   1   3  1
Ch2.5: Descriptive measures
• Mean: the sum of the observations divided by the
sample size.
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Median: the center, or location, of a set of data.
If the observations are arranged in an ascending
or descending order:
– If the number of observations is odd, the median is
the middle value.
– If the number of observations is even, the median is
the average of the two middle values.
Example
• 15 14 2 27 13
• Mean: $\bar{x} = \frac{15 + 14 + 2 + 27 + 13}{5} = 14.2$
• Ordering the data from smallest to largest:
• 2 13 14 15 27
• The median is the third value in the ordered list, 14
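The same example can be checked in code. The sketch below assumes the five observations above; `sample_mean` and `sample_median` are illustrative helpers, compared against Python's standard `statistics` module.

```python
import statistics

data = [15, 14, 2, 27, 13]


def sample_mean(xs):
    """Sum of the observations divided by the sample size."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


mean = sample_mean(data)
median = sample_median(data)
print("mean =", mean, "median =", median)
```

The observation 2 pulls the mean around without moving the median, which still sits at the middle ordered value.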
Sample variance
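• Deviations from the mean: $x_i - \bar{x}$. The sample variance is
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• Standard deviation s:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• Equivalent computing formula:
$s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)}$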
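Both forms of the variance formula can be compared numerically. This sketch reuses the five observations from the mean and median example (15, 14, 2, 27, 13) as an assumption; the function names are illustrative.

```python
import math

data = [15, 14, 2, 27, 13]


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Computing formula: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))


s2 = sample_variance(data)
s2_shortcut = sample_variance_shortcut(data)
s = math.sqrt(s2)
print(f"s^2 = {s2:.4f}, shortcut = {s2_shortcut:.4f}, s = {s:.4f}")
```

For these data the sum of squared deviations is 314.8, so both forms give a variance of 78.7 with n - 1 = 4 in the denominator.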
Quartiles and Percentiles
• Quartiles: values in a given set of
observations that divide the data into 4 equal parts.
• The first quartile, Q1, is a value that has one fourth,
or 25%, of the observations below its value.
• The sample 100p-th percentile is a value such
that at least 100p% of the observations are at or
below this value, and at least 100(1-p)% are at or
above this value.
Example
• Example in P34:
$Q_1 = \frac{14.7 + 15.2}{2} = 14.95$
$Q_2 = \frac{19.0 + 19.1}{2} = 19.05$
$Q_3 = \frac{22.9 + 23.0}{2} = 22.95$
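These quartiles match Data001 (n = 80) under one common textbook rule, which is assumed here rather than stated in the slides: sort the data and compute n·p; if it is an integer k, average the k-th and (k+1)-th ordered values, otherwise round up and take that ordered value. The helper `sample_percentile` is illustrative and takes the percentile as a whole-number percent to avoid floating-point rounding in n·p.

```python
# Data001: the 80 sulfur oxide emission values listed earlier.
data001 = [
    15.8, 26.4, 17.3, 11.2, 23.9, 24.8, 18.7, 13.9, 9.0, 13.2, 22.7,
    9.8, 6.2, 14.7, 17.5, 26.1, 12.8, 28.6, 17.6, 23.7, 26.8,
    22.7, 18.0, 20.5, 11.0, 20.9, 15.5, 19.4, 16.7, 10.7, 19.1, 15.2,
    22.9, 26.6, 20.4, 21.4, 19.2, 21.6, 16.9, 19.0, 18.5, 23.0,
    24.6, 20.1, 16.2, 18.0, 7.7, 13.5, 23.5, 14.5, 14.4, 29.6, 19.4,
    17.0, 20.8, 24.3, 22.5, 24.6, 18.4, 18.1, 8.3, 21.9, 12.3,
    22.3, 13.3, 11.8, 19.3, 20.0, 25.7, 31.8, 25.9, 10.5, 15.9, 27.5,
    18.1, 17.9, 9.4, 24.1, 20.1, 28.5,
]


def sample_percentile(values, percent):
    """Sample percentile for an integer percent with 0 < percent < 100.

    Assumed rule: with position = n * percent / 100, average the k-th and
    (k+1)-th ordered values when position is an integer k; otherwise take
    the ordered value at the next integer position.
    """
    ordered = sorted(values)
    k, remainder = divmod(len(ordered) * percent, 100)
    if remainder == 0:
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[k]  # 1-based position k + 1


q1 = sample_percentile(data001, 25)
q2 = sample_percentile(data001, 50)
q3 = sample_percentile(data001, 75)
print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}")
```

Other software may use interpolation rules that give slightly different quartiles for the same data.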
Boxplots
• A boxplot is a way of summarizing
information contained in the quartiles (or
on an interval)
• Box length = interquartile range = Q3 − Q1
Modified boxplot
• Outlier: an observation too far from the third
quartile (or, symmetrically, from the first quartile).
• Rule: farther than 1.5 × (interquartile range)
from the quartile.
• Modified boxplot:
identifies outliers and
reduces their effect on the
shape of the boxplot.
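The 1.5 × IQR rule can be sketched as follows, assuming the quartiles Q1 = 14.95 and Q3 = 22.95 from the quartile example and applying the rule on both sides of the box. The helper names and the reading 40.0 are hypothetical, added only to show a flagged value.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences located k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Values lying outside the fences."""
    lower, upper = outlier_fences(q1, q3, k)
    return [x for x in values if x < lower or x > upper]


# Quartiles from the quartile example.
q1, q3 = 14.95, 22.95
lower, upper = outlier_fences(q1, q3)

# 6.2 and 31.8 are the extremes of Data001; 40.0 is a made-up reading.
sample = [6.2, 19.0, 31.8, 40.0]
outliers = find_outliers(sample, q1, q3)
print(f"fences: ({lower:.2f}, {upper:.2f}), outliers: {outliers}")
```

With an interquartile range of 8.0, the fences sit 12 units beyond each quartile, so neither extreme of Data001 would be flagged under this rule.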