Chapter 15 Exploratory data analysis: graphical summaries


CIS 2033
Based on the textbook:
F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester. A Modern Introduction to Probability and Statistics: Understanding Why and How
Instructor: Dr. Longin Jan Latecki
The set of observations is called a dataset.
By exploring the dataset we can gain insight into what probability model suits the
phenomenon.
To graphically represent univariate datasets, consisting of repeated measurements
of one particular quantity, we discuss the classical histogram, the more recently
introduced kernel density estimates and the empirical distribution function.
To represent a bivariate dataset, which consists of repeated measurements of two
quantities, we use the scatterplot.
15.2 Histograms: The term histogram appears to have
been used first by Karl Pearson.
Histogram construction and pdf
Denote a generic (univariate) dataset of size n by
$$x_1, x_2, \ldots, x_n.$$
First we divide the range of the data into intervals. These intervals are called bins
and are denoted by
$$B_1, B_2, \ldots, B_m.$$
The length of an interval $B_i$ is denoted by $|B_i|$ and is called the bin width.
We want the area under the histogram on each bin $B_i$ to reflect the number of elements
in $B_i$. Since the total area 1 under the histogram then corresponds to the total number
of elements n in the dataset, the area under the histogram on a bin $B_i$ is equal to the
proportion of elements in $B_i$:
$$\frac{\#\{x_j \in B_i\}}{n}.$$
The height of the histogram on bin $B_i$ must therefore be equal to
$$\frac{\#\{x_j \in B_i\}}{n\,|B_i|}.$$
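As we know from Ch. 13.4, the histogram approximates the pdf f. In particular,
for a bin centered at point a, $B_a = (a-h, a+h]$, we have
$$f(a) \approx \frac{\#\{x_j \in B_a\}}{n\,|B_a|} = \frac{\#\{x_j \in B_a\}}{n \cdot 2h} = H_a,$$
where $H_a$ is the height of the histogram on $B_a$.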
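In Matlab (x holds the n = 200 samples):
binwidth = 0.5;                                 % bin width |Bi|
bincenters = 0.5:binwidth:9.5;                  % centers of the bins
hx = hist(x, bincenters)/(numel(x)*binwidth);   % counts/(n*|Bi|) = histogram heights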
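The height formula, count divided by n times the bin width, can also be sketched in plain Python. This is only an illustration under stated assumptions: the function name `histogram_heights`, the bin edges, and the toy dataset are made up here, and the bins are taken as half-open intervals (left, right] given by their edges rather than by bin centers as in the Matlab call.

```python
def histogram_heights(data, edges):
    """Heights of the histogram on the bins (edges[i], edges[i+1]]."""
    n = len(data)
    heights = []
    for left, right in zip(edges, edges[1:]):
        count = sum(1 for x in data if left < x <= right)  # number of x_j in the bin
        heights.append(count / (n * (right - left)))        # count / (n * bin width)
    return heights


# Hypothetical toy dataset and bin edges, made up for illustration only.
data = [0.7, 1.2, 1.4, 2.6, 3.1, 3.3, 3.9, 4.8]
edges = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

heights = histogram_heights(data, edges)
# Total area under the histogram: sum of height * bin width over all bins.
total_area = sum(h * (b - a) for h, a, b in zip(heights, edges, edges[1:]))
print(heights)
print(total_area)
```

Because every data point falls in exactly one bin, the total area comes out as 1, which is the normalization the histogram construction asks for.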
The function g in blue is a mixture of two Gaussians. We draw 200 samples from it,
which are shown as blue dots.
We use the samples to generate the histogram (yellow)
and its kernel density estimate f (red).
The Matlab script is twoGaussKernelDensity1.m
Choice of the bin width
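Consider a histogram with bins of equal width. In that case the bins are of the
form
$$B_i = (r + (i-1)b,\; r + ib] \quad \text{for } i = 1, 2, \ldots, m,$$
where r is some reference point smaller than the minimum of the dataset and b
denotes the bin width. Mathematical research has provided some guidelines for a data-based choice for b or m:
$$m = 1 + 3.3 \log_{10}(n) \quad \text{or} \quad b = 3.49\, s\, n^{-1/3},$$
where s is the sample standard deviation.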
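As a rough sketch of a data-based choice, the snippet below applies two commonly quoted guidelines: the bin width b = 3.49 s n^(-1/3), with s the sample standard deviation, and the number of bins m = 1 + 3.3 log10(n). The function names, the rounding of m up to an integer, and the choice of the reference point r (half a bin width below the minimum) are assumptions made here for illustration.

```python
import math
import statistics


def bin_width_normal_reference(data):
    """Bin width b = 3.49 * s * n**(-1/3), with s the sample standard deviation."""
    n = len(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    return 3.49 * s * n ** (-1 / 3)


def number_of_bins(n):
    """Number of bins m = 1 + 3.3 * log10(n), rounded up to an integer (assumption)."""
    return math.ceil(1 + 3.3 * math.log10(n))


def equal_width_edges(data, b):
    """Edges r, r + b, ..., r + m*b of equal-width bins that cover the data."""
    r = min(data) - b / 2                 # hypothetical reference point r < min(data)
    m = math.ceil((max(data) - r) / b)    # enough bins to reach the maximum
    return [r + i * b for i in range(m + 1)]


# Hypothetical toy dataset, made up for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8]
b = bin_width_normal_reference(data)
edges = equal_width_edges(data, b)
print(b, number_of_bins(len(data)), edges)
```

Both rules are guidelines only; in practice it is common to try a few bin widths around the suggested value and compare the resulting histograms.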
15.3 Kernel density estimates
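A kernel K is a function $K: \mathbb{R} \to \mathbb{R}$. A kernel K typically
satisfies the following conditions:
(K1) K is a probability density, i.e., $K(u) \ge 0$ and $\int_{-\infty}^{\infty} K(u)\,du = 1$;
(K2) K is symmetric around zero, i.e., $K(u) = K(-u)$;
(K3) $K(u) = 0$ for $|u| > 1$.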
Examples of Kernel Construction
Scaling the kernel K
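Scale the kernel K into the function
$$t \mapsto \frac{1}{h} K\left(\frac{t}{h}\right),$$
where h > 0 is the bandwidth.
Then put a scaled kernel around each element $x_i$ in the dataset:
$$t \mapsto \frac{1}{h} K\left(\frac{t - x_i}{h}\right).$$
[Figures: kernel density estimates where the bandwidth is too big and where the bandwidth is too small.]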
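The construction can be sketched in a few lines of Python. The sketch assumes the usual definition of the kernel density estimate as the average of the scaled kernels, f(t) = (1/(n h)) Σ K((t - x_i)/h), and uses the Epanechnikov kernel K(u) = (3/4)(1 - u²) on [-1, 1] as one example of a kernel that is a probability density, symmetric around zero, and zero outside [-1, 1]. The function names and the toy dataset are made up for illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_density_estimate(data, h, kernel=epanechnikov):
    """Return the function t -> (1 / (n*h)) * sum_i K((t - x_i) / h)."""
    n = len(data)

    def f(t):
        # One scaled kernel is centered at each data point x; average them.
        return sum(kernel((t - x) / h) for x in data) / (n * h)

    return f


# Hypothetical toy dataset, made up for illustration only.
data = [1.0, 1.5, 2.0, 4.0]
f = kernel_density_estimate(data, h=1.0)
print(f(1.5), f(10.0))
```

A larger bandwidth h spreads each scaled kernel over a wider interval and gives a smoother estimate; a smaller h gives a more jagged estimate that follows the individual data points.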
15.4 The empirical distribution function
Another way to graphically represent a dataset is
to plot the data in a cumulative manner.
This can be done by using the
empirical cumulative distribution function.
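A minimal Python sketch of this idea follows, assuming the usual definition F_n(t) = (number of elements in the dataset that are at most t) / n. The function name `empirical_cdf` and the toy dataset are made up for illustration.

```python
def empirical_cdf(data):
    """Return the function t -> (number of x_i <= t) / n."""
    n = len(data)

    def F(t):
        return sum(1 for x in data if x <= t) / n

    return F


# Hypothetical toy dataset, made up for illustration only.
data = [3.0, 1.0, 2.0, 2.0]
F = empirical_cdf(data)
for t in [0.5, 1.0, 2.0, 2.5, 3.0]:
    print(t, F(t))
```

The resulting function is a step function: it jumps at each data point by (number of copies of that value) / n and is constant in between.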
Empirical distribution function Continued
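Example
15.6. Given the following information about a histogram, compute the value of the
empirical distribution function at the point t = 7:

Bin        Height
(0, 2]     0.245
(2, 4]     0.130
(4, 7]     0.050
(7, 11]    0.020
(11, 15]   0.005

Because (2 - 0) * 0.245 + (4 - 2) * 0.130 + (7 - 4) * 0.050 + (11 - 7) * 0.020
+ (15 - 11) * 0.005 = 1, there are no data points outside the listed bins.
Hence
$$F_n(7) = (2 - 0) \cdot 0.245 + (4 - 2) \cdot 0.130 + (7 - 4) \cdot 0.050 = 0.49 + 0.26 + 0.15 = 0.90.$$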
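The arithmetic can be checked with a short Python sketch. The bins and heights are the ones that appear in the sum above; the function name `ecdf_from_histogram` is made up here. The sketch is only valid when t is a right endpoint of a bin and the total area is 1, because a histogram does not reveal where the points lie inside a bin.

```python
# Bins (a, b] and histogram heights, read off from the sum in the example.
bins = [(0, 2), (2, 4), (4, 7), (7, 11), (11, 15)]
heights = [0.245, 0.130, 0.050, 0.020, 0.005]


def ecdf_from_histogram(t, bins, heights):
    """F_n(t) for t a right bin endpoint: total area of the bins (a, b] with b <= t."""
    return sum((b - a) * h for (a, b), h in zip(bins, heights) if b <= t)


total_area = ecdf_from_histogram(15, bins, heights)  # area over all listed bins
value_at_7 = ecdf_from_histogram(7, bins, heights)   # area over the bins up to 7
print(round(total_area, 10), round(value_at_7, 10))
```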
By: Wanwisa Smith
Relation between histogram and empirical cdf
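15.11. Given is a histogram and the empirical distribution function $F_n$
of the same dataset. Show that the height of the histogram on
a bin (a, b] is equal to
$$\frac{F_n(b) - F_n(a)}{b - a}.$$
The height of the histogram on a bin $B_i = (a, b]$ is
$$\frac{\#\{x_j \in (a, b]\}}{n\,(b - a)} = \frac{\#\{x_j \le b\} - \#\{x_j \le a\}}{n\,(b - a)}.$$
Hence, since $F_n(t) = \#\{x_j \le t\}/n$, the height is equal to
$$\frac{F_n(b) - F_n(a)}{b - a}.$$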
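The relation can also be checked numerically. The sketch below assumes the usual definitions, F_n(t) = (number of x_j ≤ t)/n and histogram height = (number of x_j in (a, b]) / (n (b - a)). The function names and the toy dataset are made up for illustration.

```python
def empirical_cdf_value(data, t):
    """F_n(t): proportion of data points that are at most t."""
    return sum(1 for x in data if x <= t) / len(data)


def histogram_height(data, a, b):
    """Height of the histogram on the bin (a, b]."""
    count = sum(1 for x in data if a < x <= b)
    return count / (len(data) * (b - a))


# Hypothetical toy dataset and bin, made up for illustration only.
data = [0.7, 1.2, 1.4, 2.6, 3.1, 3.3, 3.9, 4.8]
a, b = 1.0, 3.0

lhs = histogram_height(data, a, b)
rhs = (empirical_cdf_value(data, b) - empirical_cdf_value(data, a)) / (b - a)
print(lhs, rhs)
```

Both sides count the same points, those with a < x_j ≤ b, so the two printed values agree.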
15.5 Scatterplot
In some situations we might want to investigate the relationship between two or
more variables. In the case of two variables x and y, the dataset consists of
pairs of observations:
$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n).$$
We call such a dataset a bivariate dataset, in contrast to a univariate dataset.
The plot of the points $(x_i, y_i)$ for i = 1, 2, …, n is called a scatterplot.