Statistics and Probability

Virtual University of Pakistan
Lecture No. 1
Statistics and Probability
Miss Saleha Naghmi Habibullah
• To inculcate in you an attitude of Statistical
and Probabilistic thinking.
• To give you some very basic techniques in
order to apply Statistical analysis to real-world situations/problems.
That science which enables us to draw conclusions about
various phenomena on the basis of real data collected on
A tool for data-based research
Also known as Quantitative Analysis
Any scientific enquiry in which you would like to base your
conclusions and decisions on real-life data, you need to
employ statistical techniques!
Nowadays, in the developed countries of the world, there is
an active movement for Statistical Literacy.
Application Areas
Many applications in a wide variety of
disciplines …
Agriculture, Anthropology, Astronomy,
Biology, Economics, Engineering,
Environment, Geology, Genetics, Medicine,
Physics, Psychology, Sociology, Zoology …
Virtually every single subject from
Anthropology to Zoology … A to Z!
Text and Reference Material
The primary text-book for the course is Introduction to Statistical
Theory (Sixth Edition) by Sher Muhammad Chaudhry and Shahid Kamal
published by Ilmi Kitab Khana, Lahore. Reference books for the course:
1. “ “ by Afzal Beg & Miraj Din Mirza.
2. “ “ by Mohammad Rauf Chaudhry (Polymer Publications, Urdu
Bazar, Lahore).
3. “Statistics” by James T. McClave & Frank H. Dietrich, II (Dellen
Publishing Company, California, U.S.A).
4. “Introducing Statistics” by K.A. Yeomans (Penguin Books Ltd.,
5. “Applied Statistics” by K.A. Yeomans (Penguin Books Ltd., England).
6. “Business Statistics for Management & Economics” by Wayne W.
Daniel and James C. Terrell (Houghton Mifflin Company, U.S.A.).
7. “Basic Business Statistics” by Berenson & Levine ( )
1 to 5      1 to 15     1 to 5
6 to 10     16 to 30    6 to 10
11 to 15    31 to 45    11 to 15
Upon completion of the first
segment, you will be able to:
• Appreciate the nature of statistical data.
• Understand various methods of collecting
statistical data.
• Appreciate the importance of a proper sampling
• Utilize various methods of summarizing and
describing collected data.
• Employ statistical techniques to understand the
nature of the relationship between two quantitative variables.
Upon completion of the second
segment, you will be able to:
• Understand the basic concepts of probability theory (which is
the foundation of statistical inference).
• Understand the concept of discrete probability distributions and their
mathematical properties.
• Understand the concept of continuous probability
distributions and their mathematical properties.
• Get acquainted with some of the most commonly
encountered and important discrete and continuous
probability distributions such as the binomial and the normal
Upon completion of the third
segment, you will be able to:
Understand and employ various techniques of
estimation and hypothesis-testing in order to draw
reliable conclusions necessary for decision-making
in various fields of human activity.
Through this segment, you will be able to
appreciate the purpose and the goal of the subject
of Statistics.
There will be two term exams and one final
exam. In addition, there will be 15 homework
assignments. The final examination will be
comprehensive in nature. (Approximately 25-30% of the
final exam paper will be on the material covered up to the
Mid-Term-II Exam.)
These will contribute the following percentages to the
final grade:
Final Exam:
Homework Assignments: 30%
Meaning of Statistics
Information useful for the State
The meaning of Data
The word “data” appears in many contexts
and frequently is used in ordinary conversation.
Although the word carries something of an aura of
scientific mystique, its meaning is quite simple and
It is Latin for “those that are given” (the
singular form is “datum”). Data may therefore be
thought of as the results of observation.
Data are collected in many aspects of everyday life.
• Statements given to a police officer or physician or
psychologist during an interview are data.
• The correct and incorrect answers given by a student on
a final examination.
• Almost any athletic event produces data.
• The time required by a runner to complete a marathon,
• The number of errors committed by a baseball team in
nine innings of play.
• And, of course, data are obtained in the course of
scientific inquiry:
• The positions of artifacts and fossils in an archaeological
• The number of interactions between two members of an
animal colony during a period of observation,
• The spectral composition of light emitted by a star.
Types of Data
(Non - Numeric)
A quantity that varies from an individual to
(Non - Numeric)
In statistics, an observation often means any sort
of numerical recording of information, whether it is a
physical measurement such as height or weight; a
classification such as heads or tails, or an answer to a
question such as yes or no.
A characteristic that varies from one individual or object
to another is called a variable.
For example, age is a variable as it varies from person to
person. A variable can assume a number of values. The
given set of all possible values from which the variable
takes on a value is called its Domain. If for a given
problem, the domain of a variable contains only one
value, then the variable is referred to as a constant.
Variables may be classified into quantitative and
qualitative according to the form of the characteristic of
A variable is called a quantitative variable when a
characteristic can be expressed numerically such as age,
weight, income or number of children.
On the other hand, if the characteristic is non-numerical, such as education, sex, eye-colour, quality,
intelligence, poverty, satisfaction, etc., the variable is referred
to as a qualitative variable. A qualitative characteristic is also
called an attribute.
An individual or an object with such a characteristic
can be counted or enumerated after having been assigned to
one of the several mutually exclusive classes or categories.
Continuous Variable: e.g. height, weight, etc.
Discrete Variable: e.g. number of sisters (values change by gaps or jumps)
A quantitative variable may be classified as discrete or
continuous. A discrete variable is one that can take only a discrete
set of integers or whole numbers, that is, the values are taken by
jumps or breaks. A discrete variable represents count data such as
the number of persons in a family, the number of rooms in a house,
the number of deaths in an accident, the income of an individual, etc.
A variable is called a continuous variable if it can take on any
value, fractional or integral, within a given interval, i.e. its domain is
an interval with all possible values without gaps. A continuous
variable represents measurement data such as the age of a person,
the height of a plant, the weight of a commodity, the temperature at a
place, etc.
A variable, whether countable or measurable, is generally
denoted by some symbol such as X or Y, and X_i or X_j represents the
i-th or j-th value of the variable. The subscript i or j is replaced by a
number such as 1, 2, 3, … when referring to a particular value.
Measurement Scales
Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale
By measurement, we usually mean the assigning of numbers to
observations or objects, and scaling is a process of measuring. The four
scales of measurement are briefly described below:
The classification or grouping of observations into mutually
exclusive qualitative categories or classes is said to constitute a nominal
scale. For example, students are classified as male and female. The numbers 1
and 2 may also be used to identify these two categories. Similarly, rainfall
may be classified as heavy, moderate and light. We may use the numbers 1, 2
and 3 to denote the three classes of rainfall. When numbers are
used only to identify the categories of the given scale, they carry no numerical
significance and there is no particular order for the grouping.
The ordinal scale includes the characteristics of a nominal scale
and in addition has the property of ordering or
ranking of measurements. For example, the
performance of students (or players) is rated as
excellent, good, fair or poor, etc. The numbers 1, 2, 3,
4, etc. are also used to indicate ranks. The only
relation that holds between any pair of
categories is that of “greater than” (or more
A measurement scale possessing a constant interval size
(distance) but not a true zero point is called an interval scale.
Temperature measured on either the Celsius or the Fahrenheit
scale is an outstanding example of an interval scale because the
same difference exists between 20°C (68°F) and 30°C (86°F)
as between 5°C (41°F) and 15°C (59°F). It cannot be said
that a temperature of 40 degrees is twice as hot as a
temperature of 20 degrees, i.e. the ratio 40/20 has no meaning.
The arithmetic operations of addition, subtraction, etc. are
The ratio scale is a special kind of interval scale where the scale of
measurement has a true zero point as its origin. The ratio scale
is used to measure weight, volume, distance, money, etc. The
key to differentiating the interval and ratio scales is that the zero point
is meaningful for the ratio scale.
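The difference between the interval and ratio scales can be checked numerically. The short Python sketch below is an illustration added here, not part of the lecture; the function name celsius_to_fahrenheit and the 4 kg / 2 kg weights are chosen only for this example. It shows that equal temperature differences remain equal after a change of units, whereas the ratio 40/20 does not survive the change, and that a ratio of weights is the same in kilograms and in grams.

```python
def celsius_to_fahrenheit(c):
    """Convert a temperature reading from Celsius to Fahrenheit."""
    return c * 9 / 5 + 32


# Interval scale: equal differences in Celsius remain equal in Fahrenheit.
diff_a_f = celsius_to_fahrenheit(30) - celsius_to_fahrenheit(20)
diff_b_f = celsius_to_fahrenheit(15) - celsius_to_fahrenheit(5)

# Interval scale: ratios depend on the arbitrary zero point, so they change
# when the unit changes.
ratio_c = 40 / 20
ratio_f = celsius_to_fahrenheit(40) / celsius_to_fahrenheit(20)

# Ratio scale: weight has a true zero, so the ratio is the same in any unit.
ratio_kg = 4.0 / 2.0
ratio_g = (4.0 * 1000) / (2.0 * 1000)

print("Fahrenheit differences:", diff_a_f, diff_b_f)
print("Temperature ratios (C, F):", ratio_c, ratio_f)
print("Weight ratios (kg, g):", ratio_kg, ratio_g)
```

Because the Fahrenheit ratio differs from the Celsius ratio, a statement such as "twice as hot" depends on the unit chosen, which is why it has no meaning on an interval scale.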
Chemical and manufacturing plants
sometimes discharge toxic-waste materials
such as DDT into nearby rivers and streams.
These toxins can adversely affect the plants
and animals inhabiting the river and the river
A study of fish was conducted in the Tennessee
River in Alabama and its three tributary creeks:
Flint Creek, Limestone Creek and Spring Creek.
A total of 144 fish were captured, and the
following variables were measured for each one:
1. River/Creek from where fish was captured
2. Species of fish (Channel fish, Largemouth
bass or smallmouth buffalo fish)
3. Length of fish (Centimeters)
4. Weight of fish (grams)
5. DDT concentration in the bodily system of the
fish (parts per million)
Classify each of the five variables measured
as quantitative or qualitative.
Also, identify the types of measurement
scales for each of the five variables.
The variables length, weight and DDT
concentration are quantitative variables
because each is measured on a numerical
scale (length in centimeters, weight in
grams and DDT in parts per million).
All three of these variables are being
measured on the Ratio Scale.
Whenever we speak about the weight of an
object, obviously, if our measuring instrument
reads ‘zero’, this means that the object being
measured has zero weight --- and, in this sense,
the ‘zero’ would be a true zero.
An exactly similar argument holds for the length of
an object.
As far as DDT concentration in the bodily
system of the fish is concerned, obviously, if
there is absolutely no DDT in the fish, then
the DDT concentration reads zero --- and,
this particular ‘zero’ reading will be true zero.
As explained above, the three variables
length of fish, weight of fish and DDT
concentration in the bodily system of the
fish are quantitative variables measured
on the ratio scale.
In contrast:
Data on River/Creek from which the fish
were captured, and the species of fish are
qualitative data.
Both of these variables are measured on
the Nominal Scale.
The river/creek from which the fish
were captured, and the species of fish are
qualitative data because these cannot be
measured quantitatively; they can only be
classified into categories.
(i.e. Channel fish, Largemouth bass or
smallmouth buffalo fish for the species and Tennessee
River, Flint creek, Limestone creek and Spring
The Statistical methods for describing,
reporting and analyzing data depend on
the type of data measured (i.e. whether
data are quantitative or qualitative).
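To see how the type of data decides the method of description, consider the following minimal Python sketch. It is an added illustration, not part of the lecture: the three fish records are invented purely for demonstration (they are not the study's data), and the names VARIABLE_TYPES and summarize are hypothetical. A quantitative variable is summarized by its mean, while a qualitative variable can only be counted by category.

```python
from collections import Counter
from statistics import mean

# Classification of the five variables, as given in the lecture.
VARIABLE_TYPES = {
    "river_or_creek": ("qualitative", "nominal"),
    "species": ("qualitative", "nominal"),
    "length_cm": ("quantitative", "ratio"),
    "weight_g": ("quantitative", "ratio"),
    "ddt_ppm": ("quantitative", "ratio"),
}

# Hypothetical records, invented only for illustration.
fish = [
    {"river_or_creek": "Flint Creek", "species": "Largemouth bass",
     "length_cm": 40.0, "weight_g": 900.0, "ddt_ppm": 5.0},
    {"river_or_creek": "Flint Creek", "species": "Smallmouth buffalo fish",
     "length_cm": 45.0, "weight_g": 1100.0, "ddt_ppm": 7.0},
    {"river_or_creek": "Spring Creek", "species": "Largemouth bass",
     "length_cm": 50.0, "weight_g": 1300.0, "ddt_ppm": 9.0},
]


def summarize(records, variable):
    """Return the mean for a quantitative variable, category counts otherwise."""
    values = [record[variable] for record in records]
    kind, _scale = VARIABLE_TYPES[variable]
    if kind == "quantitative":
        return mean(values)
    return Counter(values)


print(summarize(fish, "length_cm"))
print(summarize(fish, "species"))
```

Asking for the mean of the species column would be meaningless, which is the practical reason for classifying each variable before analysing it.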
Experience has shown that a continuous variable can never be
measured with perfect fineness because of certain habits and practices,
methods of measurement, instruments used, etc. The measurements are
thus always recorded correct to the nearest unit and hence are of limited
accuracy. The actual or true values are, however, assumed to exist. For
example, if a student’s weight is recorded as 60 kg (correct to the nearest
kilogram), his true weight in fact lies between 59.5 kg and 60.5 kg, whereas
a weight recorded as 60.00 kg means the true weight is known to lie
between 59.995 kg and 60.005 kg. Thus there is a difference, however small it
may be, between the measured value and the true value. This sort of
departure from the true value is technically known as the error of
measurement. In other words, if the observed value and the true value of a
variable are denoted by x and x + ε respectively, then the difference (x + ε) –
x, i.e. ε, is the error. This error involves the unit of measurement of x and is
therefore called an absolute error. An absolute error divided by the true value
is called the relative error. The relative error, when multiplied by
100, is the percentage error. These two errors are independent of the units of
measurement of x. It ought to be noted that an error has both magnitude
and direction and that the word error in statistics does not mean mistake,
which is a chance inaccuracy.
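The arithmetic in the paragraph above can be sketched in a few lines of Python. This is an added illustration under simple assumptions, not part of the lecture: the function names recorded_interval and measurement_errors are hypothetical, and the error is taken as the true value minus the observed value, following the definition just given.

```python
def recorded_interval(recorded, decimals=0):
    """Range of true values consistent with a value recorded correct to
    the given number of decimal places."""
    half_unit = 0.5 * 10 ** (-decimals)
    return recorded - half_unit, recorded + half_unit


def measurement_errors(observed, true):
    """Return the absolute, relative and percentage errors of a measurement."""
    absolute = true - observed      # carries the unit of measurement
    relative = absolute / true      # unit-free
    percentage = 100 * relative     # unit-free
    return absolute, relative, percentage


# A weight recorded as 60 kg, correct to the nearest kilogram.
print(recorded_interval(60, decimals=0))

# The same weight recorded as 60.00 kg, correct to two decimal places.
print(recorded_interval(60.00, decimals=2))

# Illustrative case: observed 60 kg when the true weight is 60.5 kg.
print(measurement_errors(60, 60.5))
```

Note that the absolute error changes if the weight is expressed in grams instead of kilograms, while the relative and percentage errors do not.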
Errors of Measurement
Biased Errors (also called Cumulative or Systematic Errors)
Random Errors (also called Compensating or Accidental Errors)
An error is said to be biased when the observed value is
consistently and constantly higher or lower than the true value.
Biased errors arise from the personal limitations of the observer,
the imperfection in the instruments used or some other conditions
which control the measurements. These errors are not revealed by
repeating the measurements. They are cumulative in nature, that
is, the greater the number of measurements, the greater would be
the magnitude of error. They are thus more troublesome. These
errors are also called cumulative or systematic errors.
An error, on the other hand, is said to be unbiased when the
deviations, i.e. the excesses and defects, from the true value tend
to occur equally often. Unbiased errors are revealed when
measurements are repeated and they tend to cancel out in the long
run. These errors are therefore compensating and are also known
as random errors or accidental errors.
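The contrast between biased and random errors can be illustrated with a small simulation. The Python sketch below is an added illustration, not part of the lecture: the function total_error, the bias of 0.2 units, and the uniform spread of ±0.5 units are all assumptions chosen only for demonstration. The same seed is used for both runs so that the only difference between them is the constant bias.

```python
import random


def total_error(n, bias=0.0, spread=0.5, seed=0):
    """Sum of n simulated measurement errors.

    Each error is a constant bias plus a random part drawn uniformly
    from [-spread, +spread].
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += bias + rng.uniform(-spread, spread)
    return total


n = 1000
random_total = total_error(n, bias=0.0, seed=1)   # unbiased (random) errors only
biased_total = total_error(n, bias=0.2, seed=1)   # same random part plus a bias

print("Average random error:", random_total / n)
print("Average biased error:", biased_total / n)
print("Extra total error due to bias:", biased_total - random_total)
```

The random part averages out to something close to zero, whereas the bias contributes 0.2 units to every measurement, so its total keeps growing with the number of measurements.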
Statistical Inference
A Statistical Inference is an estimate or
prediction or some other generalization
about a population based on information
contained in a sample.
That is, we use information contained in
the sample to learn about the larger population.
Population and Sample
Population: The collection of all individuals, items or
data under consideration in a statistical
Sample: That part of the population from which
information is collected.
Five Elements of an Inferential
Statistical Problem:
A population
One or more variables of interest
A sample
An Inference
A measure of Reliability
In order to understand the concept of
Reliability, a very important point to be
understood is that making an inference
about the population from the sample is only
part of the story.
We also need to know its reliability --- that is,
how good our inference is.
Measure of Reliability
A measure of reliability is a statement
(usually quantified) about the degree of
uncertainty associated with a statistical
The point to be noted is that the only way we
can be certain that an inference about a
population is correct is to include the entire
population in our sample.
However, because of resource constraints
(i.e. insufficient time and/or money), we
usually cannot work with the whole
population, so we base our inference on
just a portion of the population (i.e. a sample).
Consequently, whenever possible, it is
important to determine and report the
reliability of each inference made.
As such, reliability is the fifth element of
inferential statistical problems.
A large paint retailer has had numerous
complaints from customers about under-filled paint cans.
As a result, the retailer has begun inspecting
incoming shipments of paint from
Shipments with under-fill problems will be
sent back to the supplier.
A recent shipment contained 2,440 gallon-size cans.
The retailer sampled 50 cans and weighed
each on a scale capable of measuring
weight to four decimal places.
Properly filled cans weigh 10 pounds.
Describe a population
Describe a variable of interest
Describe a sample
Describe the Inference
Describe a measure of uncertainty of our
a) The population is the set of units of
interest to the retailer, which is the
shipment of 2,440 cans of paint.
b) The weight of the paint cans is the variable
the retailer wishes to evaluate.
c) The sample is a subset of the population.
In this case, it is the 50 cans of paint
selected by the retailer.
d) The inference of interest involves the
generalization of the information contained in
the sample of paint cans to the population of
paint cans.
In particular, the retailer wants to learn about
the extent of the under-fill problem (if any)
in the population.
This might be accomplished by finding the
average weight of the cans in the sample,
and using it to estimate the average weight
of the cans in the population.
e) As far as the measure of reliability of our
inference is concerned, the point to be
noted is that, using statistical methods,
we can determine a bound on the
estimation error.
Bound on the Estimation Error
This bound is simply a number that our
estimation error (i.e. the difference between
the average weight of the sample and the average
weight of the population of cans) is not likely to
This bound is a measure of the uncertainty
of our inference, or, in other words, the
reliability of the statistical inference.
The crux of the matter is that an inference is
incomplete without a measure of its reliability.
When the weights of 50 paint cans are used
to estimate the average weight of all the
cans, the estimate will not exactly mirror the
entire population.
For Example:
If the sample of 50 cans yields a mean
weight of 9 pounds, it does not follow (nor is
it likely) that the mean weight of the population
of cans is also exactly 9 pounds.
Nevertheless, we can use sound statistical
reasoning to ensure that our sampling
procedure will generate an estimate that is
almost certainly within a specified limit of the
true mean weight of all the cans.
For example, such reasoning might assure us that
the estimate of the population mean from the sample is
almost certainly within 1 pound of the actual
population mean.
The implication is that the actual mean weight of
the entire population of the cans is between
9 – 1 = 8 pounds and 9 + 1 = 10 pounds, that is,
(9 ± 1) pounds.
This interval represents a measure of reliability
for the inference.
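The estimate-plus-bound idea can be sketched in Python as follows. This is an added illustration, not the lecture's method: the lecture does not say how the bound is computed, so the sketch assumes a common rule of thumb (two standard errors of the sample mean), and the five can weights are invented for demonstration. The function names interval and estimate_with_bound are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def interval(estimate, bound):
    """Interval 'estimate ± bound' as a (lower, upper) pair."""
    return estimate - bound, estimate + bound


def estimate_with_bound(sample, multiplier=2.0):
    """Sample mean together with a rough bound on the estimation error.

    Assumption (not from the lecture): the bound is taken as
    multiplier * s / sqrt(n), i.e. about two standard errors of the mean.
    """
    n = len(sample)
    centre = mean(sample)
    bound = multiplier * stdev(sample) / sqrt(n)
    return centre, bound


# The lecture's numbers: a sample mean of 9 pounds with a bound of 1 pound.
print(interval(9, 1))

# Hypothetical weights (in pounds) of five sampled cans.
weights = [9.8, 10.1, 9.9, 10.2, 10.0]
centre, bound = estimate_with_bound(weights)
print(interval(centre, bound))
```

Reporting the pair (estimate, bound) rather than the estimate alone is what makes the inference complete in the sense described above.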
• The nature of the science of Statistics
• The importance of Statistics in various
• Some technical concepts such as
– The meaning of “data”
– Various types of variables
– Various types of measurement scales
– The concept of errors of measurement
• Concept of sampling
– Random versus non-random sampling
– Simple random sampling
– A brief introduction to other types of random sampling
• Methods of data collection
In other words, you will begin your journey in a
subject with reference to which it has been said
that “statistical thinking will one day be as
necessary for efficient citizenship as the ability to
read and write”.