
5. Graphical data presentation
Sihua Peng, PhD
Shanghai Ocean University
2016.9
1
Contents
1. Introduction to R
2. Data sets
3. Introductory Statistical Principles
4. Sampling and experimental design with R
5. Graphical data presentation
6. Simple hypothesis testing
7. Introduction to Linear models
8. Correlation and simple linear regression
9. Single factor classification (ANOVA)
10. Nested ANOVA
11. Factorial ANOVA
12. Simple Frequency Analysis
2
5.1 The plot() function
• The plot() function is a generic function: its output
depends on the class of the objects passed to it as
arguments.
• The type parameter:
The type parameter takes a single-character argument
and controls how the points are presented.
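As a small illustration of the type parameter, the sketch below draws the same made-up data with six common type values. The vectors x and y and the panel layout are assumptions chosen for the example, not part of the lecture data.

```r
# Compare several values of the type parameter on the same data.
x <- 1:10
y <- x^2
types <- c("p", "l", "b", "o", "h", "s")  # points, lines, both, overplotted, histogram-like, steps

op <- par(mfrow = c(2, 3))                # 2 x 3 grid of panels
for (t in types) {
  plot(x, y, type = t, main = paste0("type = \"", t, "\""))
}
par(op)                                   # restore the previous layout
```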
3
The plot() function
4
plot() function:
The log parameter
• The log parameter indicates which axes, if any,
should be plotted on a logarithmic scale
("x", "y" or "xy").
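To see what the log parameter does, the sketch below plots an exponential curve on linear axes and then with a logarithmic y axis, where it becomes a straight line. The data are invented for the example.

```r
# Exponential growth looks curved on linear axes and straight with log = "y".
x <- 1:100
y <- exp(x / 10)

op <- par(mfrow = c(1, 2))
plot(x, y, main = "Linear axes")
plot(x, y, log = "y", main = "log = \"y\"")
par(op)
```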
5
Install the car package and load the christ data
> install.packages("car", dependencies=TRUE)
> christ <- read.table("christ.csv", header=TRUE, sep=",")
6
help(plot) example
Examples
> require(stats) # for lowess, rpois, rnorm
> plot(cars)
> lines(lowess(cars))
> plot(cars)
> lines(cars)
7
5.2 Graphical Parameters
• The graphical parameters provide consistent
control over most plotting features across a
wide range of high-level and low-level plotting
functions. Any of these parameters can be set by
passing them as arguments to the par() function.
• Once set via the par() function, they become
global graphical parameters that apply to all
subsequent functions that act on the current
graphics device.
• For more information, see pages 89-99 of the
textbook.
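A minimal sketch of setting global parameters with par() follows. It uses the built-in cars data, and the particular parameter values are arbitrary choices for illustration. Saving the value returned by par() makes it possible to restore the previous settings afterwards.

```r
# par() returns the old values of the parameters it changes,
# so they can be restored once the plots are drawn.
op <- par(mfrow = c(1, 2),        # one row, two columns of plots
          mar = c(4, 4, 2, 1),    # margins: bottom, left, top, right (in lines)
          cex = 0.8)              # shrink text and symbols

plot(cars, main = "Scatterplot")  # both plots inherit the settings above
hist(cars$speed, main = "Histogram")

par(op)                           # back to the previous settings
```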
8
5.3 Enhancing and customizing plots
with low-level plotting functions
In addition to its specific parameters, each of the following
functions accepts many of the graphical parameters.
In the function definitions, this capability is represented
by three consecutive dots (...).
Technically, ... indicates that any supplied arguments that are
not explicitly part of the definition of a function are passed
on to the relevant underlying functions (in this case, par).
• We will discuss THREE useful tools in this section:
(1) abline(); (2) Smoothers; and (3) Confidence ellipses - matlines().
For more information, see pages 99-113 of the textbook.
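The way ... forwards arguments can be sketched with a small wrapper function. The name scatter_with_mean is hypothetical and exists only for this example; any extra arguments the caller supplies are handed on to plot().

```r
# Hypothetical helper: arguments not named in the definition are
# collected by ... and passed on to plot().
scatter_with_mean <- function(x, y, ...) {
  plot(x, y, ...)                 # pch, col, xlab, etc. arrive here via ...
  abline(h = mean(y), lty = 2)    # dashed line at the mean of y
  invisible(mean(y))
}

m <- scatter_with_mean(cars$speed, cars$dist,
                       pch = 16, col = "blue",
                       xlab = "Speed", ylab = "Stopping distance")
```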
9
Adding lines and shapes within a plot
• Straight lines - abline()
The low-level plotting function abline() is used to draw
straight lines with a given intercept (a) and gradient
(b), or at single values for horizontal (h) or vertical (v)
lines.
y = a + bx
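The three ways of calling abline() can be sketched on the built-in cars data. The intercept and gradient values below are arbitrary and chosen only to show the a and b arguments.

```r
# Arbitrary intercept and gradient, for illustration only.
a_int <- 0
b_slope <- 3

plot(dist ~ speed, data = cars)
abline(a = a_int, b = b_slope, col = "red")  # the line y = a + bx
abline(h = mean(cars$dist), lty = 2)         # horizontal line at the mean of y
abline(v = mean(cars$speed), lty = 3)        # vertical line at the mean of x
```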
10
Adding lines and shapes within a plot
> plot(CWD.DENS ~ RIP.DENS, data=christ)
> # use abline to add a
> # regression trendline
> abline(lm(CWD.DENS ~ RIP.DENS, data=christ))
> # use abline to represent the
> # mean y-value
> abline(h=mean(christ$CWD.DENS), lty=2)
11
Difference between lines() and abline()
The lines() function draws a general connected line;
its input is vectors of x and y coordinates.
The abline() function draws a single straight line; its
input is an intercept and gradient, an h or v value, or a
fitted regression model object.
12
Smoothers
Smoothing functions can be useful additions to
scatterplots, particularly for assessing (non)linearity
and the nature of underlying trends. There are many
different types of smoothers (see section 8.3 and
Table 8.2).
Smoothers are added to a plot by first fitting the
smoothing function (loess(), ksmooth()) to the data
and then plotting the values predicted by this function
across the span of the data.
13
Smoothers
> plot(CWD.DENS ~ RIP.DENS, data=christ)
> # fit the loess smoother
> christ.loess <- loess(CWD.DENS ~ RIP.DENS, data=christ)
> # create a vector of the sorted
> # X values
> xs <- sort(christ$RIP.DENS)
> lines(xs, predict(christ.loess, data.frame(RIP.DENS=xs)))
> # fit and plot a kernel smoother
> christ.kern <- ksmooth(christ$RIP.DENS, christ$CWD.DENS,
    "normal", bandwidth=200)
> lines(christ.kern, lty=2)
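For readers without christ.csv, the same two smoothers can be tried on the built-in cars data. This is only a sketch: the bandwidth of 5 is an arbitrary choice for these data, not a recommended value.

```r
# Same fit-then-predict pattern as above, on a built-in data set.
plot(dist ~ speed, data = cars)

# loess smoother: fit, then predict across the sorted x values
cars.loess <- loess(dist ~ speed, data = cars)
xs <- sort(unique(cars$speed))
fit <- predict(cars.loess, data.frame(speed = xs))
lines(xs, fit)

# kernel smoother: ksmooth() returns a list with x and y components
cars.kern <- ksmooth(cars$speed, cars$dist, kernel = "normal", bandwidth = 5)
lines(cars.kern, lty = 2)
```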
14
Confidence ellipses - matlines()
The matlines() function, along with the similar
matplot() and matpoints() functions, plots multiple
columns of matrices against one another, thereby
providing a convenient means to plot predicted
trends and confidence intervals in a single
statement.
Confidence bands are added by using the value(s)
returned by a predict() function as the second
argument to the matlines() function.
15
Confidence ellipses - matlines()
> plot(CWD.DENS ~ RIP.DENS, data=christ)
> christ.lm <- lm(CWD.DENS ~ RIP.DENS, data=christ)
> xs <- with(christ, seq(min(RIP.DENS), max(RIP.DENS), length.out=1000))
> matlines(xs, predict(christ.lm, data.frame(RIP.DENS=xs),
    interval="confidence"), lty=c(1,2,2), col=1)
seq() is a base R function that generates a regular
sequence of values as a vector.
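The structure that matlines() receives is easier to see when the prediction is stored first. The sketch below repeats the idea on the built-in cars data: predict() with interval="confidence" returns a matrix with the columns fit, lwr and upr, and matlines() draws one line per column.

```r
cars.lm <- lm(dist ~ speed, data = cars)

# 100 evenly spaced x values across the observed range
xs <- seq(min(cars$speed), max(cars$speed), length.out = 100)

# matrix with columns fit (trend), lwr and upr (confidence limits)
pred <- predict(cars.lm, data.frame(speed = xs), interval = "confidence")
head(pred)

plot(dist ~ speed, data = cars)
matlines(xs, pred, lty = c(1, 2, 2), col = 1)  # solid trend, dashed limits
```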
16
5.4 Interactive graphics
• The majority of plotting functions on the majority of
graphical devices operate by sending all of the
required information to the device at the time of the
call; no additional information is required or
accepted from the user.
• The display devices (X11(), windows() and quartz()),
however, also support a couple of functions designed
to allow interactivity between the user and the current
plotting region.
• For more information, see pages 113-114 of the
textbook.
17
5.5 Exporting graphics
• Graphics can also be written to several graphical file
formats via specific graphics devices, which oversee the
conversion of graphical commands into actual graphical
elements.
• To write graphics to a file, an appropriate graphics device
must first be ‘opened’. A graphics device is opened by
issuing one of the device functions listed below, which
essentially establishes the device's global parameters and
readies the device stream for input.
• Opening such a device also creates (or overwrites) the
nominated file. As graphical commands are issued, the
input stream is evaluated and accumulated.
• The file is only written to disk when the device is closed via
the dev.off() function.
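The open, plot, close sequence described above can be sketched with the pdf() device. The file name cars_scatter.pdf is hypothetical, and the file is written to a temporary directory so that nothing in the working directory is overwritten.

```r
# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "cars_scatter.pdf")

pdf(out_file, width = 6, height = 4)    # open the device (size in inches)
plot(dist ~ speed, data = cars)         # plotting commands go to the file
abline(lm(dist ~ speed, data = cars))
dev.off()                               # close the device to finish the file

file.exists(out_file)
```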
18
Postscript - postscript() and pdf()
Postscript is device independent and scalable to any
size, and is therefore the preferred format of most
publishers.
Whilst there are many other arguments that can be
passed to the postscript() function, common use is as
follows:
> postscript(file, family, fonts = NULL, width, height,
    horizontal, paper)
19
Bitmaps - jpeg() and png()
• R also supports a range of bitmap file formats, the
range of which depends on the underlying operating
system and the availability of external applications.
> jpeg(filename, width = 480, height = 480, units = "px",
    pointsize = 12, quality = 75, bg = "white", res = NA, ...)
20
5.6 Working with multiple graphical devices
It is possible to have multiple graphical devices open
simultaneously. However, only one device can be active
(receptive to plotting commands) at a time.
Once a device has been opened (see section 5.5), the
device is given an automatically incremented
reference number in the range of 1 to 63. Device 1 will
always be a null device that cannot accept plotting
commands and is essentially just a placeholder for the
device counter.
For more information, see pages 115-116 of the
textbook.
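Switching between open devices can be sketched with dev.cur(), dev.list(), dev.set() and dev.off(). Two pdf() devices writing to temporary files are used here so that the example does not depend on a display; the file names are generated by tempfile() and are not meaningful.

```r
f1 <- tempfile(fileext = ".pdf")
f2 <- tempfile(fileext = ".pdf")

pdf(f1); d1 <- dev.cur()   # the newly opened device becomes the active one
pdf(f2); d2 <- dev.cur()   # a second device with a different number
dev.list()                 # all open devices with their numbers

dev.set(d1)                # make the first device active again
plot(dist ~ speed, data = cars)
dev.set(d2)                # switch to the second device
hist(cars$speed)

dev.off(d2)                # close each device by number
dev.off(d1)
```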
21
5.7 High-level plotting functions
for univariate (single variable) data
5.7.1 Histogram
Histograms are useful for representing the
distribution of observations for large (> 30) sample
sizes.
> set.seed(1)
> VAR <- rnorm(100,10,2)
> hist(VAR)
> hist(VAR, breaks=18, probability=TRUE)
> # OR equivalently in this case
> hist(VAR, breaks=seq(5.5,15, by=.5), probability=TRUE)
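What probability=TRUE changes can be checked from the object that hist() returns. In this sketch the histogram is computed without being drawn; the bars are then on the density scale, so their areas (height times bin width) sum to one.

```r
set.seed(1)
VAR <- rnorm(100, 10, 2)

# plot = FALSE returns the histogram object without drawing it
h <- hist(VAR, plot = FALSE)
h$breaks                    # bin boundaries
h$counts                    # heights used for a frequency histogram
h$density                   # heights used when probability = TRUE

# area of the density-scale bars
total_area <- sum(h$density * diff(h$breaks))
total_area
```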
22
5.7.2 Density functions
• Probability density functions are also useful additions
or alternatives to histograms, as they further assist in
describing the patterns of the underlying distribution.
• A density function can be plotted by passing the result
of the density() function as an argument to the
high-level generic plot() function.
> plot(density(VAR))
23
5.7.2 Density functions
• The type of smoothing kernel (gaussian, i.e. normal, by
default) can be defined by the kernel= argument and
the degree of smoothing is controlled by the bw=
(bandwidth) argument. The higher the smoothing
bandwidth, the greater the degree of smoothing.
> plot(density(VAR, bw=1))
24
5.7.2 Density functions
• The density function can also be added to a histogram
by using the density() function as an argument to the
low-level lines() function.
> set.seed(1)
> VAR1 <- rlnorm(100,2,.5)
> hist(VAR1, prob=TRUE)
> lines(density(VAR1))
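The effect of the bw= argument can be seen by overlaying two density estimates of the same data. The bandwidths 0.3 and 2 below are arbitrary values picked to exaggerate the contrast.

```r
set.seed(1)
VAR <- rnorm(100, 10, 2)

d_narrow <- density(VAR, bw = 0.3)  # little smoothing: bumpy curve
d_wide   <- density(VAR, bw = 2)    # heavy smoothing: flatter, wider curve

plot(d_narrow, main = "Effect of bandwidth")
lines(d_wide, lty = 2)
legend("topright", legend = c("bw = 0.3", "bw = 2"), lty = c(1, 2))
```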
25
5.7.3 Q-Q plots
• Q-Q normal plots can also be useful for diagnosing
departures from normality by comparing the data
quantiles to those of a standard normal distribution.
Substantial deviations from linearity indicate departures
from normality.
> qqnorm(VAR1)
> qqline(VAR1)
> A <- log(VAR1)
> qqnorm(A)
> qqline(A)
26
5.7.4 Boxplots
• For smaller sample sizes, histograms and density functions
can be difficult to interpret.
• Boxplots provide an alternative means of depicting the
location (average), variability and shape of the distribution
of data.
> set.seed(6)
> VAR2 <- rlnorm(15,2,.5)
> boxplot(VAR2)
27
5.7.4 Boxplots
The dimensions of a boxplot are defined by the five-number
summary of the data: minimum value, lower quartile (Q1),
median (Q2), upper quartile (Q3) and maximum value, with
each interval between successive values containing 25% of
the observations (see Figure 5.5).
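The five-number summary can be computed directly with fivenum() and compared with the values boxplot() uses. Note that, by default, R's boxplot() extends the whiskers only to the most extreme observations within 1.5 times the box length of the box and reports anything further out separately, so the whisker ends may differ from the minimum and maximum.

```r
set.seed(6)
VAR2 <- rlnorm(15, 2, 0.5)

# minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(VAR2)
names(five) <- c("min", "Q1", "median", "Q3", "max")
five

# the statistics boxplot() would draw, without drawing them
bp <- boxplot(VAR2, plot = FALSE)
bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # observations beyond the whiskers, if any
```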
28
5.7.4 Boxplots
• The horizontal=TRUE argument is used to produce
horizontally aligned boxplots
> boxplot(VAR2, horizontal=TRUE)
29
5.7.5 Rug charts
• Another representation of the data that can be added to
existing plots is a rug chart, which displays the values as a
series of ticks on the axis. Rug charts can be particularly
useful for revealing artifacts in the data that are "smoothed"
over by histograms, boxplots and density functions.
> set.seed(1)
> VAR <- rnorm(100,10,2)
> plot(density(VAR))
> rug(VAR, side=1)
30
5.8 Presenting relationships
When two or more continuous variables are collected,
we often intend to explore the nature of the
relationships between the variables. Such trends can
be depicted graphically in scatterplots.
Scatterplots display a cloud of points, the coordinates
of which correspond to the values of the variables that
define the horizontal and vertical axes.
31
5.8.1 Scatterplots
> install.packages("car", dependencies=TRUE)
> christ <- read.table("christ.csv", header=TRUE, sep=",")
> library(car)
> scatterplot(CWD.DENS ~ RIP.DENS, data=christ)
32
The scatterplotMatrix() function (car package) is
an extension of the regular scatterplot() function.
> library(car)
> scatterplotMatrix(~CWD.DENS + RIP.DENS + CABIN + AREA,
    data=christ, diag="boxplot")
33
3D scatterplots
• Three dimensional scatterplots can be useful for
exploring multivariate patterns between combinations
of three or more variables.
• To illustrate 3D scatterplots in R, we will make use of a
dataset by Allison and Cicchetti that compiles sleep,
morphology and life history characteristics of 62 species
of mammal, along with predation indices.
34
The scatterplot3d function (scatterplot3d package)
> allison <- read.table("allison.csv", header=TRUE, sep=",")
> library(scatterplot3d)
> with(allison, scatterplot3d(log(Gestation),
    log(BodyWt), log(LifeSpan), type="h", pch=16))
• The type="h" parameter specifies that points should be connected to
the base by a line and the pch=16 parameter specifies solid points.
• All variables were expressed as their natural logarithms using the log()
function.
35
The scatter3d function (Rcmdr package) displays
rotating three dimensional plots.
> library(Rcmdr)
> with(allison, scatter3d(log(Gestation), log(LifeSpan),
    log(BodyWt), fit="additive", rev=1))
• The fit= parameter specifies the form of surface to fit through
the data. The option selected ("additive") fits an additive
nonparametric surface through the data cloud and is useful for
identifying departures from multivariate linearity.
• The rev= parameter specifies the number of full revolutions the
plot should make. Axis rotations can also be manipulated
manually by dragging the mouse over the plot.
36
The scatter3d
37
5.9 Presenting grouped data:
5.9.1 Boxplots
> ward <- read.table("ward.csv", header=TRUE, sep=",")
> boxplot(EGGS~ZONE, data=ward,
    ylab="Number of eggs per capsule", xlab="Zone")
> furness <- read.table("furness.csv", header=TRUE, sep=",")
> boxplot(METRATE~SEX, data=furness,
    ylab="metabolic rate", xlab="Sex")
38
5.9.2 Boxplots for grouped means
• Technically, the normality and homogeneity of variance
assumptions pertain to the residuals (the differences between
the observed values and those predicted by the proposed model)
and thus to the model replicates.
• For multi-factor analysis of variance designs, the appropriate
replicates for a hypothesis test are usually the individual
observations from each combination of factors. Hence,
boxplots should also reflect this level of replication.
• To illustrate, a data set introduced in Box 11.2 of Sokal and Rohlf
on the oxygen consumption of two species of limpets under
three seawater concentrations will be used.
39
5.9.2 Boxplots for grouped means
> limpets <-read.table("limpets.csv", header=T, sep=",")
> boxplot(O2~SEAWATER*SPECIES, limpets)
40
5.9.4 Bargraphs
• Bargraphs are plots where group means are
represented by the tops of bars or columns.
• Biologists often find bargraphs to be useful
graphical summaries, and they do provide a
greater area for displaying colors and shading
to distinguish different treatment
combinations.
41
5.9.4 Bargraphs
> means <- with(ward, tapply(EGGS, ZONE, mean))
> sds <- with(ward, tapply(EGGS, ZONE, sd))
> ns <- with(ward, tapply(EGGS, ZONE, length))
> ses <- sds/sqrt(ns)
> b <- barplot(means, ylim=c(min(pretty(means-ses)),
    max(pretty(means+ses))), xpd=FALSE,
    ylab="Number of eggs per capsule")
> arrows(b, means+ses, b, means-ses, angle=90, code=3)
> box(bty="l")
tapply() applies a function separately to each group of
values in a vector, as defined by a grouping factor,
rather than to the vector as a whole.
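The same recipe can be tried without ward.csv by using the built-in InsectSprays data (insect counts under six sprays). This is a sketch under that substitution; the axis labels are illustrative, and the y axis starts at zero here rather than at the lowest error bar.

```r
# Group-wise mean, standard deviation and sample size via tapply()
means <- with(InsectSprays, tapply(count, spray, mean))
sds   <- with(InsectSprays, tapply(count, spray, sd))
ns    <- with(InsectSprays, tapply(count, spray, length))
ses   <- sds / sqrt(ns)               # standard error of each group mean

# barplot() returns the x positions of the bar midpoints
b <- barplot(means, ylim = c(0, max(pretty(means + ses))),
             xlab = "Spray", ylab = "Mean insect count")

# error bars: arrows with flat heads (angle = 90) at both ends (code = 3)
arrows(b, means + ses, b, means - ses, angle = 90, code = 3, length = 0.05)
box(bty = "l")
```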
42
Multifactor bargraphs
> means <- with(limpets, tapply(O2,
    list(SPECIES,SEAWATER), mean))
> sds <- with(limpets, tapply(O2,
    list(SPECIES,SEAWATER), sd))
> ns <- with(limpets, tapply(O2,
    list(SPECIES,SEAWATER), length))
> ses <- sds/sqrt(ns)
> b <- barplot(means, ylim=c(min(pretty(means-ses)),
    max(pretty(means+ses))), beside=TRUE, xpd=FALSE,
    ylab="Oxygen consumption",
    legend.text=rownames(means))
> arrows(b, means+ses, b, means-ses,
    angle=90, code=3, length=0.05)
> box(bty="l")
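With two grouping factors, tapply() returns a matrix of means (one row per level of the first factor, one column per level of the second), and barplot() with beside=TRUE returns a matrix of bar positions of the same shape. The sketch below shows this on the built-in ToothGrowth data, used here only as a stand-in for limpets.csv.

```r
# 2 x 3 matrices: supplement type (rows) by dose (columns)
means <- with(ToothGrowth, tapply(len, list(supp, dose), mean))
sds   <- with(ToothGrowth, tapply(len, list(supp, dose), sd))
ns    <- with(ToothGrowth, tapply(len, list(supp, dose), length))
ses   <- sds / sqrt(ns)

# beside = TRUE places the bars of each dose next to one another
b <- barplot(means, beside = TRUE,
             ylim = c(0, max(pretty(means + ses))),
             xlab = "Dose", ylab = "Mean tooth length",
             legend.text = rownames(means),
             args.legend = list(x = "topleft"))
arrows(b, means + ses, b, means - ses, angle = 90, code = 3, length = 0.05)
box(bty = "l")
```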
43
5.10 Presenting categorical data
Associations between two or more categorical
variables (such as those data modelled by contingency
tables and log-linear modelling) can be summarized
graphically by mosaic and association plots.
For more information, see pages 128-129 of the
textbook.
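As a brief sketch, base R provides mosaicplot() and assocplot() for these two plot types. The example uses the built-in HairEyeColor table, collapsed over sex to give a two-way table of hair colour by eye colour; the titles are illustrative.

```r
# Collapse the 3-way table (Hair x Eye x Sex) to a 2-way table
hair_eye <- margin.table(HairEyeColor, c(1, 2))
hair_eye

op <- par(mfrow = c(1, 2))
# mosaic plot: tile areas are proportional to the cell counts
mosaicplot(hair_eye, main = "Mosaic plot", shade = TRUE)
# association plot: bars show deviations from independence
assocplot(hair_eye, main = "Association plot")
par(op)
```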
44
5.11 Trellis graphics
Trellis graphics provide the means of plotting the
trends amongst a set of variables separately according
to the levels of other variables and can therefore be
more appropriate for exploring trends within grouped
data.
For more information, see pages 129-133.
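One common implementation of Trellis graphics in R is the lattice package, which is assumed here to be installed (it is normally distributed with R). The sketch below plots tooth length against dose in a separate panel for each supplement type, again using the built-in ToothGrowth data as a stand-in.

```r
library(lattice)

# y ~ x | g draws one panel per level of the conditioning variable g
p <- xyplot(len ~ dose | supp, data = ToothGrowth,
            type = c("p", "r"),            # points plus a regression line
            xlab = "Dose", ylab = "Tooth length")
print(p)   # lattice plots must be printed explicitly inside scripts
```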
45
46