chapter 2 - UniMAP Portal


CHAPTER 2: DESCRIPTIVE STATISTICS: TABULAR & GRAPHICAL PRESENTATION
2.1 Summarizing Qualitative Data
• A graphic display can reveal the main characteristics of a
data set at a glance.
• The choice of presentation depends on the nature of the data:
whether it is quantitative (e.g. income and CGPA) or
qualitative (e.g. gender and ethnic group).
• Three types of graph are used to display qualitative data:
- bar graph / column chart
- pie chart
- line chart
Example 2.1: Table 2.1 shows data on 25 UNIMAP students and
their backgrounds.
Codes used:
• For gender: 1 is male and 2 is female.
• For ethnic group: 1 is Malay, 2 is Chinese, 3 is Indian and
4 is others.
• Not much information can be obtained from the data in raw
form. It has to be summarized so that we can extract more
information.
Table 2.1 : Data of 25 UNIMAP students
• If the data from Table 2.1 are summarized by gender and ethnic
group, we obtain the frequency tables below:
Observation    Frequency
Male           28
Female         22
Total          50
Table 2.2: Frequency Table for the Gender
Observation    Frequency
Malay          33
Chinese         9
Indian          6
Others          2
Total          50
Table 2.3: Frequency Table for the Ethnic Group
2.1.1 Bar Chart
- A bar chart displays a frequency distribution in graphical
form. It consists of two orthogonal axes: one axis represents
the observations and the other represents their frequencies.
The frequency of each observation is represented by a bar.
* The bar chart in Figure 1 uses the data from Table 2.3.
Figure 1: Bar Chart of the Ethnic Group
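The figure itself is not reproduced in this transcript. As a rough stand-in, the sketch below draws a text bar chart from the Table 2.3 frequencies, with one row per observation and a bar whose length equals the frequency. The names `ethnic_freq` and `text_bar_chart` are illustrative only and do not come from the course material.

```python
# Frequencies copied from Table 2.3 (ethnic group).
ethnic_freq = {"Malay": 33, "Chinese": 9, "Indian": 6, "Others": 2}


def text_bar_chart(freq):
    """Return one text line per observation; bar length equals the frequency."""
    width = max(len(name) for name in freq)  # pad labels to a common width
    return [f"{name:<{width}} | {'#' * count} {count}"
            for name, count in freq.items()]


for line in text_bar_chart(ethnic_freq):
    print(line)
```

One axis (the rows) carries the observations and the other (the bar length) carries the frequency, which is the same structure the definition describes.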
2.1.2 Pie Chart
• A pie chart displays a frequency distribution as ratios of
the observations. It is a circle divided into sectors. Each
sector represents an observation, and the area of the sector
represents the proportion of the total frequency that belongs
to that observation.
• * The pie chart in Figure 2 uses the data from Table 2.2.
Figure 2: The Pie Chart for the Gender
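To make the "proportion of the frequencies" concrete, the sketch below converts the Table 2.2 frequencies into each sector's share of the circle and its central angle. The function name `pie_sectors` is hypothetical. The angle formula (share × 360 degrees) is the standard construction rather than something stated on the slide.

```python
# Frequencies copied from Table 2.2 (gender).
gender_freq = {"Male": 28, "Female": 22}


def pie_sectors(freq):
    """Map each observation to (share of total, central angle in degrees)."""
    total = sum(freq.values())
    return {name: (count / total, 360 * count / total)
            for name, count in freq.items()}


sectors = pie_sectors(gender_freq)
for name, (share, angle) in sectors.items():
    print(f"{name}: {share:.0%} of the circle, {angle:.1f} degrees")
```

With these counts the male sector covers 56% of the circle and the female sector 44%.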
2.1.3 Line Chart
- A line chart displays the trend of the observations. It
consists of two orthogonal axes: one axis represents the
observations and the other represents their frequencies. The
frequencies of the observations are joined by lines.
- Example:
Table 2.4 below shows the number of sandpipers recorded
from January 1989 to December 1989.
Month    Jan   Feb   Mar   Apr   May  June  July   Aug  Sept   Oct   Nov   Dec
Number    10     7     5    10    39     7   260   316   142    11     4     9
Table 2.4 : The number of sandpipers
Figure 3: Line Chart of the Number of Common Sandpipers
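A line chart is read for its trend, that is, how the count moves from one month to the next. The sketch below computes those month-to-month changes from the Table 2.4 counts and picks out the peak month. The variable names are illustrative and not part of the original material.

```python
# Monthly counts copied from Table 2.4 (sandpipers recorded in 1989).
counts = {"Jan": 10, "Feb": 7, "Mar": 5, "Apr": 10, "May": 39, "June": 7,
          "July": 260, "Aug": 316, "Sept": 142, "Oct": 11, "Nov": 4, "Dec": 9}

months = list(counts)
values = list(counts.values())

# Change from the previous month: the slope of each line segment in the chart.
changes = {months[i]: values[i] - values[i - 1] for i in range(1, len(values))}

peak_month = max(counts, key=counts.get)          # highest point on the line
steepest_rise = max(changes, key=changes.get)     # largest increase
steepest_fall = min(changes, key=changes.get)     # largest decrease

print(peak_month, counts[peak_month])
print(steepest_rise, changes[steepest_rise])
print(steepest_fall, changes[steepest_fall])
```

The counts stay low for most of the year and rise sharply in July and August, which is the pattern the line chart makes visible at a glance.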
2.2 Summarizing Quantitative Data
2.2.1 Frequency Distribution
• When summarizing large quantities of raw data, it is often
useful to distribute the data into classes. There are no
specific rules for determining the classes, but statisticians
suggest using between 5 and 20 classes. Table 2.5 shows the
classes for the students' CGPA.
CGPA (Class)    Frequency
2.50 - 2.75      2
2.75 - 3.00     10
3.00 - 3.25     15
3.25 - 3.50     13
3.50 - 3.75      7
3.75 - 4.00      3
Total           50
Table 2.5: The Frequency Distribution of the Students' CGPA
• A frequency distribution for quantitative data lists all the
classes and the number of values that belong to each class.
Data presented in the form of a frequency distribution are
called grouped data.
• For quantitative data, a class is an interval that includes all the
values that fall between two numbers, the lower and upper class limits.
The classes are listed in the first column of the frequency distribution
table.
* Classes always represent a variable and are non-overlapping; each value
belongs to one and only one class.
• The numbers listed in the second column are called frequencies; they
give the number of values that belong to each class.
Frequencies are denoted by f.
Table 2.6 : Weekly Earnings of 100 Employees of a Company
Weekly Earnings (dollars)    Number of Employees, f
801-1000                      9
1001-1200                    22
1201-1400                    39
1401-1600                    15
1601-1800                     9
1801-2000                     6
Labels in the table:
- Variable: Weekly Earnings (dollars)
- Frequency column: Number of Employees, f
- Third class (interval class): 1201-1400
- Frequency of the third class: 39
- Lower limit of the sixth class: 1801
- Upper limit of the sixth class: 2000
• The class boundary is given by the midpoint of the upper limit
of one class and the lower limit of the next class.
• The difference between the two boundaries of a class gives the
class width; also called class size.
Class Midpoint or Mark
Class midpoint or mark = (Lower Limit + Upper Limit)/2
Finding The Number of Classes
Number of classes = √n
or Number of classes = 1 + 3.3 log n
(n is the number of observations)
Finding Class Width Between Two Boundaries
c = Upper Boundary – Lower Boundary
Finding Class Width For Interval Class
Approximate class width = (Largest value – Smallest value)/Number of classes
* Any convenient number that is equal to or less than the smallest value in the
data set can be used as the lower limit of the first class.
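The formulas above can be tried on the Table 2.6 example (n = 100 employees). This is a minimal sketch under stated assumptions: the first rule is taken to be the square-root rule, the logarithm is taken to be base 10, and the class limits 801 and 2000 stand in for the smallest and largest raw values, which the source does not list. All function names are hypothetical.

```python
import math


def classes_sqrt(n):
    """Number of classes by the square-root rule (assumed reading of the slide)."""
    return math.sqrt(n)


def classes_log(n):
    """Number of classes by 1 + 3.3 log n, with log assumed to be base 10."""
    return 1 + 3.3 * math.log10(n)


def class_midpoint(lower_limit, upper_limit):
    """Class midpoint (class mark) of one class."""
    return (lower_limit + upper_limit) / 2


def approx_class_width(largest, smallest, number_of_classes):
    """Approximate class width for a chosen number of classes."""
    return (largest - smallest) / number_of_classes


n = 100
print(classes_sqrt(n))                      # square-root rule
print(classes_log(n))                       # logarithmic rule
print(class_midpoint(801, 1000))            # mark of the first class
print(approx_class_width(2000, 801, 6))     # rounds to the width of 200 used
```

The two rules give different suggestions (10 and about 7.6 classes here), and both are only guides. Table 2.6 uses 6 classes, which is still inside the suggested range of 5 to 20.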
2.2.2 Cumulative Frequency Distributions
• A cumulative frequency distribution gives the total number of values that
fall below the upper boundary of each class.
• In a cumulative frequency distribution table, each class has the same
lower limit but a different upper limit.
Table 2.7 : Class Limits, Class Boundaries, Class Width, Cumulative Frequency
Weekly Earnings   Number of      Class              Class   Cumulative
(dollars)         Employees, f   Boundaries         Width   Frequency
(Class Limit)
801-1000           9             800.5 – 1000.5     200     9
1001-1200         22             1000.5 – 1200.5    200     9 + 22 = 31
1201-1400         39             1200.5 – 1400.5    200     31 + 39 = 70
1401-1600         15             1400.5 – 1600.5    200     70 + 15 = 85
1601-1800          9             1600.5 – 1800.5    200     85 + 9 = 94
1801-2000          6             1800.5 – 2000.5    200     94 + 6 = 100
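The columns of Table 2.7 can be derived mechanically from the class limits and frequencies. The sketch below is one way to do it, assuming equally spaced classes so that every boundary lies halfway across the gap between adjacent limits. The names `classes` and `summarize` are illustrative only.

```python
from itertools import accumulate

# (lower limit, upper limit, frequency) copied from Table 2.6.
classes = [(801, 1000, 9), (1001, 1200, 22), (1201, 1400, 39),
           (1401, 1600, 15), (1601, 1800, 9), (1801, 2000, 6)]


def summarize(classes):
    """Return (lower boundary, upper boundary, width, cumulative f) per class."""
    # Gap between the upper limit of one class and the lower limit of the next.
    gap = classes[1][0] - classes[0][1]
    half = gap / 2
    cumulative = accumulate(f for _, _, f in classes)
    rows = []
    for (lower, upper, _), cum_f in zip(classes, cumulative):
        lower_b, upper_b = lower - half, upper + half
        rows.append((lower_b, upper_b, upper_b - lower_b, cum_f))
    return rows


for row in summarize(classes):
    print(row)
```

The last cumulative frequency equals the total number of employees, which is a quick check that no class was missed.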
• Quantitative data are usually presented in tabular form as a
frequency distribution, that is, a table showing the frequency
of the observations that fall inside specific classes
(intervals).
• Several graphs are available for the graphical presentation
of quantitative data. The most popular are the histogram, the
frequency polygon and the ogive.
• Histogram
The histogram looks like a bar chart, except that the
horizontal axis represents quantitative data and there are no
gaps between the bars.
• Frequency Polygon
The frequency polygon looks like a line chart, except that the
horizontal axis represents the class marks of the quantitative
data.
• Ogive
The ogive is a line graph in which the horizontal axis
represents the upper limits of the class intervals and the
vertical axis represents the cumulative frequencies.
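The difference between the frequency polygon and the ogive is easiest to see in the points each one plots. The sketch below lists those points for the weekly earnings data of Table 2.6, following the definitions above: class marks against frequencies for the polygon, and upper class limits against cumulative frequencies for the ogive. The variable names are hypothetical, and no plotting library is assumed.

```python
# (lower limit, upper limit, frequency) copied from Table 2.6.
classes = [(801, 1000, 9), (1001, 1200, 22), (1201, 1400, 39),
           (1401, 1600, 15), (1601, 1800, 9), (1801, 2000, 6)]

# Frequency polygon: x = class mark, y = class frequency.
polygon_points = [((lower + upper) / 2, f) for lower, upper, f in classes]

# Ogive: x = upper class limit, y = cumulative frequency up to that class.
ogive_points = []
running_total = 0
for _, upper, f in classes:
    running_total += f
    ogive_points.append((upper, running_total))

print(polygon_points)
print(ogive_points)
```

The polygon rises and falls with the class frequencies, while the ogive never decreases and ends at the total number of observations.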
2.3 Exploratory Data Analysis
• Exploratory data analysis (EDA) is an approach to analyzing
data for the purpose of formulating hypotheses worth testing,
complementing the tools of conventional statistics for testing
hypotheses.
• The goal of EDA is to discover patterns in data.
• EDA is an approach to data analysis that employs a variety of
techniques (mostly graphical) to:
i. Maximize insight into a data set.
ii. Uncover underlying structure.
iii. Extract important variables.
iv. Detect outliers and anomalies.
v. Test underlying assumptions.
vi. Develop parsimonious models.
vii. Determine optimal factor settings.
STEM-AND-LEAF DISPLAYS
• Another technique that is used to present quantitative data is
the stem-and-leaf display.
• An advantage of a stem-and-leaf-display over a frequency
distribution is that by preparing stem-and-leaf display, we do
not lose information on individual observations.
• A stem-and-leaf display is used only for quantitative data.
• In a stem-and-leaf display of quantitative data, each value is
divided into two portions: a stem and a leaf. The leaves for each
stem are shown separately in the display.
• Steps to construct a stem-and-leaf plot:
i. Split each data value into two parts: the first part contains
the leading digit (stem), the second part contains the trailing
digit (leaf).
ii. Draw a vertical line and write the stems on the left side,
arranged in increasing or decreasing order.
iii. Read the leaves for all data and record them next to the
corresponding stems on the right side of the vertical line.
iv. Rank the leaves for each stem in increasing order.
• Example: The following are the scores of 30 college students on a
statistics test.
75 52 80 96 65 79 71 87 93 95
69 72 81 61 76 86 79 68 50 92
83 84 77 64 71 87 72 92 57 98
Stems
5 | 2    (leaf for 52)
6 |
7 | 5    (leaf for 75)
8 |
9 |
Stem-and-leaf display
For the score of the first student, which is 75, 7 is the stem and 5 is the
leaf. For the score of the second student, which is 52, 5 is the stem and 2
is the leaf. From the data, the stems for all scores are 5, 6, 7, 8 and 9
because all scores lie in the range 50 to 98. After we have listed the stems,
we read the leaves for all scores and record them next to the corresponding
stems on the right side of the vertical line.
5 | 2 0 7
6 | 5 9 1 8 4
7 | 5 9 1 2 6 9 7 1 2
8 | 0 7 1 6 3 4 7
9 | 6 3 5 2 2 8
Stem-and-leaf display of test scores.
Now we read all the scores and write the leaves on the right
side of the vertical line in the rows of the corresponding
stems. By looking at the stem-and-leaf display of test scores,
we can observe how the data values are distributed. For
example, stem 7 has the highest frequency, followed by stems
8, 9, 6 and 5. The leaves for each stem are then ranked in
increasing order and presented as below:
5 | 0 2 7
6 | 1 4 5 8 9
7 | 1 1 2 2 5 6 7 9 9
8 | 0 1 3 4 6 7 7
9 | 2 2 3 5 6 8
Ranked stem-and-leaf display of test scores.
* Analysis: 9 out of the 30 college students scored
between 71 and 79.
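The four construction steps can be sketched in a few lines of code for the 30 test scores. This is an illustrative version for two-digit whole numbers only, where the tens digit is the stem and the units digit is the leaf. The function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# The 30 test scores from the example, in the order given.
scores = [75, 52, 80, 96, 65, 79, 71, 87, 93, 95,
          69, 72, 81, 61, 76, 86, 79, 68, 50, 92,
          83, 84, 77, 64, 71, 87, 72, 92, 57, 98]


def stem_and_leaf(values):
    """Return {stem: sorted leaves} for two-digit whole numbers."""
    leaves = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)   # step i: split into stem and leaf
        leaves[stem].append(leaf)        # step iii: record leaf beside its stem
    # step ii: stems in increasing order; step iv: rank leaves within each stem
    return {stem: sorted(leaves[stem]) for stem in sorted(leaves)}


display = stem_and_leaf(scores)
for stem, row in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in row)}")
```

Because every leaf is kept, the original scores can be read back from the display, which is the advantage over a frequency distribution noted earlier.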
Example: Distribution of porosity (the proportion of pores in the
rock, in percent) in sandstone. Use a leaf unit of 0.1.
22.1 23.5 25.3 26.6 23.9 26.0 22.8 22.3 23.1 23.0 21.0 21.8
22.0 22.2 22.3 22.4 22.4 22.4 22.3 21.6 22.1 22.6 22.1 21.9
22.3 23.9 23.2 22.5 23.7 23.3 24.4 22.6 23.9 24.2 27.6 27.9
25.2 21.7 20.0 19.8 21.5 25.6 25.3 24.1 28.6 23.7 24.0 21.8
24.9 24.2 25.0 23.7 27.3 23.0 23.8 21.2 21.1
19 | 8
20 | 0
21 | 7 5 6 2 1 0 8 9 8
22 | 1 0 3 2 3 4 5 4 4 8 3 3 6 1 6 1
23 | 5 9 2 7 9 7 3 0 8 1 9 0 7
24 | 9 2 4 1 2 0
25 | 2 3 0 6 3
26 | 6 0
27 | 3 6 9
28 | 6
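With a leaf unit of 0.1, the whole-number part of each value is the stem and the tenths digit is the leaf. The sketch below applies that split to the 57 porosity values. Unlike the display in the example, it also ranks the leaves within each stem. The function name `stem_and_leaf_tenths` is hypothetical, and the values are scaled to whole tenths first as an implementation choice to avoid floating-point rounding surprises.

```python
# The 57 porosity values (percent) from the example.
porosity = [
    22.1, 23.5, 25.3, 26.6, 23.9, 26.0, 22.8, 22.3, 23.1, 23.0, 21.0, 21.8,
    22.0, 22.2, 22.3, 22.4, 22.4, 22.4, 22.3, 21.6, 22.1, 22.6, 22.1, 21.9,
    22.3, 23.9, 23.2, 22.5, 23.7, 23.3, 24.4, 22.6, 23.9, 24.2, 27.6, 27.9,
    25.2, 21.7, 20.0, 19.8, 21.5, 25.6, 25.3, 24.1, 28.6, 23.7, 24.0, 21.8,
    24.9, 24.2, 25.0, 23.7, 27.3, 23.0, 23.8, 21.2, 21.1,
]


def stem_and_leaf_tenths(values):
    """Return {stem: sorted leaves} using a leaf unit of 0.1."""
    rows = {}
    for value in values:
        tenths = round(value * 10)        # 22.1 -> 221, as a whole number
        stem, leaf = divmod(tenths, 10)   # 221 -> stem 22, leaf 1
        rows.setdefault(stem, []).append(leaf)
    return {stem: sorted(rows[stem]) for stem in sorted(rows)}


display = stem_and_leaf_tenths(porosity)
for stem, row in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in row)}")
```

Most of the values fall on stems 22 and 23, so the bulk of the porosity measurements lie between 22.0 and 23.9 percent.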