Chapter 15: Exploratory data
analysis: graphical summaries
CIS 3033
15.1 Example: the Old Faithful data
Statistics: the collection, analysis, and
interpretation of data.
The set of observations is called a dataset.
Assumption: the randomness in a dataset roughly
follows a probability model.
From Data to Model (the reverse of simulation)
It is often necessary to condense the data for easy
visual comprehension of general characteristics.
The durations (in seconds) of 272 eruptions of the
Old Faithful geyser were collected.
The variety in the lengths of the eruptions indicates
that randomness is involved, but what can be said
about the distribution?
The mean of the data is 209.3. Putting the elements
in order shows that they are all in [96, 306], with 240
as the median. Such numerical summaries are covered in
detail in the next chapter.
15.2 Histograms
Graphical summary: group similar data and show their
distribution visually.
One version of the histogram is scaled so that the total
area of its bars is equal to 1, so the histogram can be seen
as an approximation of the density function.
Steps:
1. Divide the range of the data into bins (intervals),
which usually (though not necessarily) have the
same width.
2. The height of the histogram on a bin is
(the number of elements in the bin) /
[(the number of all elements) * (the width of the bin)]
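The height rule in step 2 can be sketched in a few lines of Python. The function name `histogram_heights`, the sample values, and the bin edges are made up for illustration (they are not the Old Faithful data), and each bin is assumed to be a half-open interval (left, right].

```python
def histogram_heights(data, edges):
    """Heights of an area-1 histogram; bin i is (edges[i], edges[i + 1]]."""
    n = len(data)
    heights = []
    for left, right in zip(edges, edges[1:]):
        # Number of elements that fall in this bin.
        count = sum(1 for x in data if left < x <= right)
        # Height = count / (number of all elements * bin width).
        heights.append(count / (n * (right - left)))
    return heights


# Made-up illustrative values, not the Old Faithful data.
sample = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6]
edges = [1.0, 2.0, 3.0, 5.0]  # bins of unequal width are allowed

heights = histogram_heights(sample, edges)
# Total area of the bars: sum of height * width over all bins.
area = sum(h * (r - l) for h, l, r in zip(heights, edges, edges[1:]))
print(heights)
print(area)
```

Dividing by the bin width is what makes the total area equal to 1 even when the bins differ in width, as long as every element falls in some bin.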
Let r be a reference point smaller than the minimum
of the dataset, and b the bin width. Then the bins are
B_i = (r + (i − 1)b, r + ib] for i = 1, 2, ..., m.
We may let m = 1 + 3.3 log10(n) or b = 3.49 s n^(−1/3),
where n is the number of elements and s is the sample
standard deviation.
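A rough sketch of the bin construction and the two rules of thumb follows. The function names, the sample values, and the offset used to place the reference point r below the minimum are hypothetical. Rounding m to the nearest integer is also an assumption, since the rule itself does not say how to round.

```python
import math
import statistics


def number_of_bins(n):
    """Rule m = 1 + 3.3 log10(n), rounded to the nearest integer (assumed)."""
    return round(1 + 3.3 * math.log10(n))


def bin_width(data):
    """Rule b = 3.49 * s * n^(-1/3), with s the sample standard deviation."""
    s = statistics.stdev(data)  # divides by n - 1
    return 3.49 * s * len(data) ** (-1 / 3)


def make_bins(r, b, m):
    """Bins B_i = (r + (i - 1) b, r + i b] for i = 1, ..., m, as (left, right) pairs."""
    return [(r + (i - 1) * b, r + i * b) for i in range(1, m + 1)]


# Made-up illustrative values, not the Old Faithful data.
sample = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6]

m = number_of_bins(len(sample))
r = min(sample) - 0.1            # hypothetical reference point below the minimum
b = (max(sample) - r) / m        # width chosen so that m bins reach the maximum
bins = make_bins(r, b, m)
print(m, b)
print(bins)
print(bin_width(sample))         # alternative: fix the width first, then derive m
```

The two rules are alternatives: either fix the number of bins and derive the width, or fix the width and use as many bins as are needed to cover the data.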
15.3 Kernel density estimates
Idea: “put a pile of sand” around each data element,
so as to contribute to its neighborhood continuously.
The plot is constructed by choosing a kernel K and
a bandwidth h. The kernel reflects the shape of the
"piles of sand", whereas the bandwidth determines
how wide the piles of sand will be.
A kernel K typically satisfies the following
conditions:
(K1) K is a probability density function;
(K2) K is symmetric around zero, i.e., K(u) = K(−u);
(K3) K(u) = 0 for |u| > 1.
Roughly, histograms can be seen as formed with
uniform kernels on bins.
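As an illustration of conditions (K1) to (K3), the sketch below defines two commonly used kernels, the Epanechnikov and the triangular kernel, and checks the conditions numerically. The choice of these two kernels and the helper `integral` are illustrative assumptions, not something taken from the slides.

```python
def epanechnikov(u):
    """K(u) = 3/4 (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def triangular(u):
    """K(u) = 1 - |u| on [-1, 1], zero elsewhere."""
    return 1 - abs(u) if abs(u) <= 1 else 0.0


def integral(f, a=-1.0, b=1.0, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (k + 0.5) * width) for k in range(steps))


grid = [k / 100 for k in range(-200, 201)]  # points in [-2, 2]
for kernel in (epanechnikov, triangular):
    k1 = all(kernel(u) >= 0 for u in grid) and abs(integral(kernel) - 1) < 1e-6
    k2 = all(kernel(u) == kernel(-u) for u in grid)
    k3 = all(kernel(u) == 0 for u in grid if abs(u) > 1)
    print(kernel.__name__, k1, k2, k3)
```

Because both kernels vanish outside [-1, 1], integrating over that interval is enough to check that the total area is 1.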
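Three steps to construct a kernel density estimate:
1: Scale the kernel K by the bandwidth h: (1/h) K(t/h).
2: Put a scaled kernel around each element x_i: (1/h) K((t − x_i)/h).
3: Sum the scaled kernels over all n elements and divide by n:
f_{n,h}(t) = (1/(nh)) [K((t − x_1)/h) + ... + K((t − x_n)/h)].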
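A kernel density estimate with kernel K and bandwidth h evaluates, at each point t, the average of the scaled kernels (1/h) K((t − x_i)/h) centred at the data elements x_i. The following is a minimal sketch under stated assumptions: the Epanechnikov kernel, the bandwidth 0.5, and the sample values are arbitrary illustrative choices.

```python
def epanechnikov(u):
    """Epanechnikov kernel, an assumed choice of K."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def kde(t, data, h, kernel=epanechnikov):
    """Kernel density estimate at t: (1 / (n h)) * sum of K((t - x_i) / h)."""
    n = len(data)
    return sum(kernel((t - x) / h) for x in data) / (n * h)


# Made-up illustrative values, not the Old Faithful data.
sample = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6]
h = 0.5  # arbitrary bandwidth for the illustration

# The estimate is itself a density: check numerically that its area is about 1.
lo, hi = min(sample) - h, max(sample) + h
steps = 20_000
width = (hi - lo) / steps
total_area = width * sum(kde(lo + (k + 0.5) * width, sample, h) for k in range(steps))
print(total_area)
print(kde(2.5, sample, h))
```

Each data element contributes a pile of total mass 1/n, which is why the estimate integrates to 1 whenever the kernel does.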
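Choice of the bandwidth: too small and too large are
both bad. Too small an h gives a spiky estimate with too
many bumps; too large an h oversmooths and hides features
of the data. A good choice: h = 1.06 s n^(−1/5), where s is
the sample standard deviation.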
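The bandwidth rule h = 1.06 s n^(−1/5) is easy to compute directly. In this small sketch the function name and the sample values are hypothetical.

```python
import statistics


def rule_of_thumb_bandwidth(data):
    """Bandwidth h = 1.06 * s * n^(-1/5), with s the sample standard deviation."""
    n = len(data)
    s = statistics.stdev(data)  # divides by n - 1
    return 1.06 * s * n ** (-1 / 5)


# Made-up illustrative values, not the Old Faithful data.
sample = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6]
h = rule_of_thumb_bandwidth(sample)
print(h)
```

The bandwidth scales with the spread of the data and shrinks slowly, at the rate n^(−1/5), as more observations become available.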
Choice of the kernel is less important, since
different kernels may produce similar results.
When a symmetric kernel is inappropriate, for example
because the data cannot fall below some boundary value,
a boundary kernel can be used.
15.4 The empirical distribution function
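The empirical cumulative distribution function of the
data:
F_n(x) = (number of elements in the dataset ≤ x) / n
For example, if the data is 4 3 9 1 7, then
F_n(x) = 0 for x < 1, 0.2 for 1 ≤ x < 3, 0.4 for 3 ≤ x < 4,
0.6 for 4 ≤ x < 7, 0.8 for 7 ≤ x < 9, and 1 for x ≥ 9.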
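The empirical distribution function at x is the fraction of data elements that are less than or equal to x. A minimal Python sketch for the dataset 4 3 9 1 7 follows; the function name `ecdf` and the evaluation points are chosen for illustration.

```python
import bisect


def ecdf(data):
    """Return F_n, where F_n(x) = (number of elements <= x) / n."""
    ordered = sorted(data)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted elements are <= x.
        return bisect.bisect_right(ordered, x) / n

    return F


F = ecdf([4, 3, 9, 1, 7])
points = (0, 1, 3.5, 4, 8, 9)
values = [F(x) for x in points]
print(values)  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```

The function is a step function that jumps by 1/n at each data element and is constant in between.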
15.5 Scatterplot
In the case of two variables x and y, the dataset
consists of pairs of observations: (x_1, y_1), (x_2, y_2), ...,
(x_n, y_n). In a scatterplot, each pair is shown as a point.