data presentation


DATA DESCRIPTION
Units


Unit: the entity we are studying (called a subject if it is a human being).
Each unit/subject has certain parameters, e.g., a student (subject) has an age, weight, height, home address, number of units taken, and so on.
Variables


The parameters of a unit are called variables.
In statistics, variables are stored in columns, with each variable occupying one column.
Cross-sectional and time-series analyses


In a cross-sectional analysis a unit/subject will
be the entity you are studying. For example,
if you study the housing market in San Diego,
a unit will be a house, and variables will be
price, size, age, etc., of a house.
In a time-series analysis the unit is a time
unit, say, hour, day, month, etc.
Data Types




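Nominal data (categories with no natural order): male/female, colors.
Ordinal data (categories with a natural order): excellent/good/bad.
Interval data (differences are meaningful, but there is no true zero): temperature, GMAT scores.
Ratio data (differences and ratios are meaningful, with a true zero): distance to school, price.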
Two forms


GRAPHICAL form
NUMERICAL SUMMARY form
Graphical forms



Sequence plots
Histograms (frequency distributions)
Scatter plots
Sequence plots



A sequence plot is used to describe a time series.
The horizontal axis is always related to the sequence in which the data were collected.
The vertical axis is the value of the variable.
Example: sequence plot
[Figure: sequence plot of the S&P-500. Vertical axis: S&P-500, about 430 to 470. Horizontal axis: Index, 10 to 40.]
Histograms I





A histogram (frequency distribution) shows how many values fall in each of a set of ranges.
It is used for cross-sectional analysis.
The potential observation values are divided into groups (called classes).
The number of observations falling into each class is called the frequency of that class.
When we say an observation falls into a class, we mean its value is greater than or equal to the lower bound but less than the upper bound of the class.
Example: histogram
A commercial bank is studying the time a
customer spends in line. They recorded
waiting times (in minutes) of 28 customers:
5.9  4.0  1.1  1.1  7.6  1.6  8.6
6.7  5.3  7.3  4.3  5.0  9.7  8.2
1.2  4.5  1.6  8.4  3.3  9.4  3.5
6.5  2.1  6.3  7.4  8.9  8.4  6.4
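The counting rule for classes (lower bound included, upper bound excluded) can be sketched in a few lines of Python. The class boundaries 1, 3, 5, 7, 9, 11 are an assumption made for this sketch (2-minute classes covering all 28 values); the names `waiting_times`, `class_bounds` and `class_frequencies` are illustrative only.

```python
# Waiting times (minutes) of the 28 customers, as listed above.
waiting_times = [
    5.9, 4.0, 1.1, 1.1, 7.6, 1.6, 8.6,
    6.7, 5.3, 7.3, 4.3, 5.0, 9.7, 8.2,
    1.2, 4.5, 1.6, 8.4, 3.3, 9.4, 3.5,
    6.5, 2.1, 6.3, 7.4, 8.9, 8.4, 6.4,
]

# Assumed class boundaries: [1, 3), [3, 5), [5, 7), [7, 9), [9, 11).
class_bounds = [1, 3, 5, 7, 9, 11]


def class_frequencies(values, bounds):
    """Count how many values fall into each class [lower, upper)."""
    counts = []
    for lower, upper in zip(bounds, bounds[1:]):
        counts.append(sum(1 for v in values if lower <= v < upper))
    return counts


frequencies = class_frequencies(waiting_times, class_bounds)

# Print a simple text histogram, one row per class.
for lower, upper, count in zip(class_bounds, class_bounds[1:], frequencies):
    print(f"[{lower:>2}, {upper:>2}): {count} {'*' * count}")
```

With these assumed classes the frequencies come out as 6, 5, 7, 8 and 2, which add up to the 28 observations.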
Example: histogram
[Figure: histogram of the 28 waiting times.]
Histogram II


The relative frequency distribution depicts the
ratio of the frequency and the total number of
observations.
The cumulative distribution depicts the
percentage of observations that are less than
a specific value.
Example: relative frequency distribution

A “relative frequency” distribution plots the
fraction (or percentage) of observations in
each class instead of the actual number. For
this problem, the relative frequency of the first
class is 6/28=0.214. The remaining relative
frequencies are 0.179, 0.250, 0.286 and
0.071. A graph similar to the above one can
then be plotted.
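As a quick check of the arithmetic above, the relative frequencies can be recomputed from the class counts. The counts 6, 5, 7, 8 and 2 are not listed on the slide; they are assumed here because they are the counts implied by the quoted relative frequencies and the total of 28.

```python
# Assumed class counts (implied by the relative frequencies quoted above).
frequencies = [6, 5, 7, 8, 2]

total = sum(frequencies)  # 28 observations in all

# Relative frequency = class frequency / total number of observations.
relative_frequencies = [count / total for count in frequencies]

print([round(r, 3) for r in relative_frequencies])
# prints [0.214, 0.179, 0.25, 0.286, 0.071]
```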
Example: cumulative distribution

In the previous example, the fraction of observations that are less than 3 minutes is 0.214 (21.4%), the fraction less than 5 minutes is 0.214+0.179=0.393, less than 7 is 0.214+0.179+0.250=0.643, less than 9 is 0.214+0.179+0.250+0.286=0.929, and less than 11 is 1.0.
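The running totals in this example can be reproduced with a cumulative sum. This is only a sketch: the class counts 6, 5, 7, 8, 2 and the upper bounds 3, 5, 7, 9, 11 are assumed from the figures quoted in the example.

```python
from itertools import accumulate

# Assumed class counts and the upper bound of each class (minutes).
frequencies = [6, 5, 7, 8, 2]
upper_bounds = [3, 5, 7, 9, 11]

total = sum(frequencies)

# Running total of the counts, then divide by the number of observations.
cumulative_counts = list(accumulate(frequencies))
cumulative = [count / total for count in cumulative_counts]

for bound, fraction in zip(upper_bounds, cumulative):
    print(f"less than {bound:>2} minutes: {fraction:.3f}")
```

Summing the counts first and dividing once avoids accumulating rounding error from the three-decimal relative frequencies.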
Example: cumulative distribution
[Figure: cumulative distribution of the waiting times.]
Histogram III




The sum of all the relative frequencies is always 1.
The cumulative distribution is nondecreasing.
The last value of the cumulative distribution is always 1.
A cumulative distribution can be derived from the corresponding relative frequency distribution, and vice versa.
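The last point can be illustrated with a short sketch: a running sum turns relative frequencies into a cumulative distribution, and taking differences of neighbouring cumulative values turns it back. The numbers are the relative frequencies from the waiting-time example; the variable names are illustrative.

```python
from itertools import accumulate

# Relative frequencies from the waiting-time example.
relative = [0.214, 0.179, 0.250, 0.286, 0.071]

# Relative -> cumulative: running sum.
cumulative = list(accumulate(relative))

# Cumulative -> relative: first value, then differences of neighbours.
recovered = [cumulative[0]] + [
    later - earlier for earlier, later in zip(cumulative, cumulative[1:])
]

print([round(c, 3) for c in cumulative])
print([round(r, 3) for r in recovered])
```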
Probability




A random variable is a variable whose values cannot be predetermined but are governed by some random mechanism.
Although we cannot predict precisely the value of a random variable, we might be able to tell the probability of the random variable being in a certain interval.
The relative frequency can be read as an estimate of the probability of the random variable falling in the corresponding class.
In that sense, the relative frequency distribution also serves as an (estimated) probability distribution.
Scatter plots

A scatter plot shows the relationship between
two variables.
Example: scatter plot
The following are the height and foot size measurements of 8 men arbitrarily selected from students in the cafeteria. Heights and foot sizes are in centimeters.
Man     1     2     3     4     5     6     7     8
Height  155   160   149   175   182   145   177   164
Foot    23.3  21.8  22.1  26.3  28.0  20.7  25.3  24.9
Example: scatter plot
[Figure: scatter plot of the 8 men. Vertical axis: Height, cm (130 to 190). Horizontal axis: Foot size, cm (20 to 28).]
Numerical Summary Forms



Central locations: mean, median, and mode.
Dispersion: standard deviation and variance.
Correlation.
Mean

The mean (average) is the sum of the observations divided by the number of observations.
Example: 27 22 26 24 27 20 23 24 18 32

Sum = (27 + 22 + 26 + 24 + 27 + 20 + 23 + 24 + 18 + 32) = 243
Mean = 243/10 = 24.3
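The same calculation, written as a minimal Python sketch (the variable names are illustrative):

```python
# The ten observations from the example above.
observations = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]

total = sum(observations)          # sum of the observations
mean = total / len(observations)   # divided by the number of observations

print(f"Sum = {total}, Mean = {mean}")
# prints Sum = 243, Mean = 24.3
```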
Median



The median is the value of the central observation (the one in the middle) when the observations are listed in ascending or descending order.
When there is an even number of values, the median is given by the average of the middle two values.
When there is an odd number of values, the median is given by the middle number.
Example: median
18 20 22 23 24 24 26 27 27 32
There are ten values, so the median is the average of the fifth and sixth values: (24 + 24)/2 = 24.
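The even/odd rule for the median can be sketched as a small function. The function name `median` is chosen for this illustration; Python's standard library offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle one.
        return ordered[middle]
    # Even number of values: average of the two middle ones.
    return (ordered[middle - 1] + ordered[middle]) / 2


data = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
print(median(data))  # prints 24.0
```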
Compare mean and median

The median is less sensitive to outliers than the mean. Check the mean and median for the following two data sets:
18 20 22 23 24 24 26 27 27 32
18 20 22 23 24 24 26 27 27 320
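One way to carry out this check is with Python's `statistics` module; the variable names below are illustrative.

```python
from statistics import mean, median

# The two data sets differ only in their largest value.
without_outlier = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
with_outlier = [18, 20, 22, 23, 24, 24, 26, 27, 27, 320]

for label, data in (("max 32", without_outlier), ("max 320", with_outlier)):
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data):.1f}")
# prints:
# max 32: mean = 24.3, median = 24.0
# max 320: mean = 53.1, median = 24.0
```

The single outlier more than doubles the mean, while the median does not move.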
Mode

The mode is the most frequently occurring value. A data set can have more than one mode.
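As an illustration, the sketch below finds the mode(s) of the ten observations used in the mean example. It relies on `statistics.multimode` (available from Python 3.8), which returns every value that shares the highest count; reusing that sample here is a choice made for this sketch.

```python
from statistics import multimode

# Sample reused from the mean example (an illustrative choice).
observations = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]

# multimode returns all values tied for the highest frequency.
modes = multimode(observations)
print(modes)  # prints [27, 24]
```

Both 27 and 24 occur twice, so this sample has two modes.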
Symmetry and skew



A frequency distribution in which the area to the left of the mean is a mirror image of the area to the right is called a symmetrical distribution.
A distribution that has a longer tail on the right-hand side than on the left is called positively skewed or skewed to the right. A distribution that has a longer tail on the left is called negatively skewed or skewed to the left.
If a distribution is positively skewed, the mean typically exceeds the median. For a negatively skewed distribution, the mean is typically less than the median.
Range

The range is the difference between the maximum and minimum values of the observations.
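A minimal sketch of the range, using the ten observations from the mean example (the choice of sample and the variable names are illustrative):

```python
# Sample reused from the mean example (an illustrative choice).
observations = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]

# Range = maximum value minus minimum value.
data_range = max(observations) - min(observations)
print(data_range)  # prints 14
```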
Standard deviation and variance


The standard deviation is used to describe
the dispersion of the data.
The variance is the squared standard
deviation.
Calculation of S.D.




Calculate the mean.
Calculate the deviation of each observation from the mean.
Calculate the squares of the deviations and sum them up.
Divide the sum by n-1 and take the square root.
Example: S.D.
Sample      27    22    26    24    27    20    23    24    18    32
Deviation   2.7   -2.3  1.7   -0.3  2.7   -4.3  -1.3  -0.3  -6.3  7.7
Sq of Dev   7.29  5.29  2.89  0.09  7.29  18.5  1.69  0.09  39.7  59.3
Sum of squares = 7.29 + 5.29 + ... + 59.3 = 142.1
$$\text{Std. Dev.} = \sqrt{\frac{142.1}{9}} = \sqrt{15.79} = 3.97$$
$$\text{std. dev.} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
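The steps of the calculation can be followed one by one in a short sketch. The sample is the one from the S.D. example; the variable names are illustrative, and `statistics.stdev` (which also divides by n-1) is used only as a cross-check.

```python
import math
import statistics

sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
n = len(sample)

# Step 1: the mean.
mean = sum(sample) / n

# Step 2: the deviation of each observation from the mean.
deviations = [x - mean for x in sample]

# Step 3: square the deviations and sum them up.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by n - 1 and take the square root.
variance = sum_of_squares / (n - 1)
std_dev = math.sqrt(variance)

print(f"sum of squares = {sum_of_squares:.1f}")  # 142.1
print(f"variance       = {variance:.2f}")        # 15.79
print(f"std. dev.      = {std_dev:.2f}")         # 3.97

# Cross-check against the library's sample standard deviation.
library_std_dev = statistics.stdev(sample)
```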
Empirical rules




If the distribution is symmetrical and bell-shaped:
Approximately 68% of the observations will be within plus and minus one standard deviation of the mean.
Approximately 95% of the observations will be within two standard deviations of the mean.
Approximately 99.7% of the observations will be within three standard deviations of the mean.
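For an exactly normal (bell-shaped) distribution these percentages can be computed rather than memorized. The sketch below assumes a standard normal distribution and uses `statistics.NormalDist` (Python 3.8+); the helper name `share_within` is illustrative. Real data will only match these figures approximately.

```python
from statistics import NormalDist

# Assumption: an exactly normal distribution with mean 0 and s.d. 1.
standard_normal = NormalDist()


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} s.d.: {share_within(k):.4f}")
# prints:
# within 1 s.d.: 0.6827
# within 2 s.d.: 0.9545
# within 3 s.d.: 0.9973
```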
Percentiles


The 75th percentile is the value such that 75% of the numbers are less than or equal to this value and the remaining 25% are larger than this value.
The k-th percentile is the value such that k% of the numbers are less than or equal to this value and the remaining (100-k)% are larger than this value.
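With a small data set there is often no value that splits the data at exactly k%, so software uses a convention. The sketch below assumes the "nearest-rank" convention (the smallest observation with at least k% of the data less than or equal to it); this is one of several common conventions, not necessarily the one intended by the slides. The sample and the function name `percentile` are illustrative.

```python
import math


def percentile(values, k):
    """Nearest-rank k-th percentile: smallest value with at least k% of the data <= it."""
    ordered = sorted(values)
    # Rank (1-based) of the observation that covers at least k% of the data.
    rank = max(1, math.ceil(k * len(ordered) / 100))
    return ordered[rank - 1]


# Sample reused from the mean example (an illustrative choice).
observations = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]

print(percentile(observations, 75))  # prints 27
print(percentile(observations, 50))  # prints 24
```

Other conventions interpolate between neighbouring observations and can give slightly different answers for the same data.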
Correlation coefficient



The correlation coefficient measures how closely two variables are (linearly) related to each other. It has a value between -1 and +1.
A positive value indicates a positive linear relationship; a negative value indicates a negative linear relationship.
If two variables are not linearly related, the correlation coefficient will be close to zero; if they are closely related, the correlation coefficient will be close to 1 or -1.
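As an illustration, the sketch below computes the correlation coefficient for the height and foot size measurements from the scatter plot example. The slides do not give a formula, so the standard Pearson formula is assumed here: the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name `correlation` is illustrative.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Height and foot size (cm) of the 8 men in the scatter plot example.
heights = [155, 160, 149, 175, 182, 145, 177, 164]
foot_sizes = [23.3, 21.8, 22.1, 26.3, 28.0, 20.7, 25.3, 24.9]

r = correlation(heights, foot_sizes)
print(f"r = {r:.2f}")
```

The result is roughly 0.93, close to +1, which agrees with the upward trend in the scatter plot: taller men in this sample tend to have larger feet.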