Transcript Chapter 9
CHAPTER 9
Data Transformations
Tables, Figures, and Equations
From: McCune, B. & J. B. Grace. 2002. Analysis of
Ecological Communities. MjM Software Design,
Gleneden Beach, Oregon http://www.pcord.com
Table 9.1. Domain of input and range of output from transformations.
Transformation                      Reasonable and acceptable   Range of f(x)
                                    domain of x
MONOTONIC TRANSFORMATIONS
x^0 (power)                         all                         0 or 1 only
x^(1/2) (power)                     nonnegative                 nonnegative
log(x)                              positive                    all
(2/π) arcsin(x)                     0 ≤ x ≤ 1                   0 to 1 inclusive
(2/π) arcsin(x^(1/2))               0 ≤ x ≤ 1                   0 to 1 inclusive
SMOOTHING
Beals smoothing                     0 or 1 only                 0 to 1 inclusive
ROW/COLUMN RELATIVIZATIONS
general                             nonnegative                 0 to 1 inclusive
by maximum                          nonnegative                 0 to 1 inclusive
by mean                             all                         all
by standard deviates                all                         generally between -10 and 10
binary by mean                      all                         0 or 1 only
rank                                all                         positive integers
binary by median                    all                         0 or 1 only
ubiquity                            nonnegative                 nonnegative
information function of ubiquity    nonnegative                 nonnegative
Monotonic transformations
Power transformation
b_{ij} = x_{ij}^{p}
[Figure 9.1: curves of b against x (x from 0 to 100, b from 0 to 10) for power = 1/2, 1/3, 1/4, and 1/10.]
Figure 9.1. Effect of square root and higher root transformations, b = f(x).
Note that roots higher than three are essentially presence-absence
transformations, yielding values close to 1 for all nonzero values.
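The effect described in Figure 9.1 can be checked numerically. The following is a minimal Python sketch, not code from the book or from PC-ORD. The function name power_transform and the sample abundances are illustrative assumptions, and p = 0 is treated as a presence-absence transformation so that zeros stay zero, consistent with the "0 or 1 only" range listed for x^0 in Table 9.1.

```python
def power_transform(x, p):
    """Return x**p for a nonnegative value x (hypothetical helper)."""
    if x < 0:
        raise ValueError("power transformation expects nonnegative values")
    if p == 0:
        # Python evaluates 0**0 as 1, so presence-absence is handled explicitly.
        return 1.0 if x > 0 else 0.0
    return float(x) ** p


# Illustrative abundances, not data from the book.
abundances = [0, 1, 25, 100]
for p in (1 / 2, 1 / 3, 1 / 4, 1 / 10, 0):
    print(p, [round(power_transform(x, p), 2) for x in abundances])
```

With the tenth root, an abundance of 100 is compressed to about 1.58, which illustrates why high roots behave much like a presence-absence transformation.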
Logarithmic transformation
b_{ij} = \log(x_{ij})
If the lowest nonzero value in the data is one (as in count
data), then it is best to add one before applying the
transformation:
b_{ij} = \log(x_{ij} + 1)
If the lowest nonzero value of x differs from one by more than an order of
magnitude, use the following generalized procedure, which (a) tends to preserve
the original order of magnitudes in the data and (b) results in values of zero when
the initial value was zero. Given:
Min(x) is the smallest nonzero value in the data
Int(x) is a function that truncates x to an integer by dropping digits after
the decimal point
c = order of magnitude constant = Int(log(Min(x)))
d = decimal constant = log^{-1}(c)
then the transformation is
b_{ij} = \log(x_{ij} + d) - c
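The generalized procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the PC-ORD implementation: logarithms are taken to be base 10 (so that log^-1(c) is 10^c), the data are assumed nonnegative, and the function name generalized_log and the example values are hypothetical.

```python
import math


def generalized_log(values):
    """Apply b = log10(x + d) - c, with c and d taken from the smallest nonzero value."""
    nonzero = [v for v in values if v > 0]
    if not nonzero:
        return [0.0 for _ in values]
    c = math.trunc(math.log10(min(nonzero)))  # Int(log(Min(x))): drop digits after the decimal point
    d = 10.0 ** c                             # log^-1(c), assuming base-10 logarithms
    return [math.log10(v + d) - c for v in values]


# Smallest nonzero value 0.02 gives c = -1 and d = 0.1, so zeros map back to zero.
print(generalized_log([0, 0.02, 0.5, 3.0]))

# For count data whose smallest nonzero value is 1, c = 0 and d = 1,
# which reduces to the log(x + 1) transformation.
print(generalized_log([0, 1, 9, 99]))
```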
Arcsine transformation
b_{ij} = (2/\pi) \arcsin(x_{ij})
Arcsine squareroot transformation
b_{ij} = (2/\pi) \arcsin(\sqrt{x_{ij}})
[Figure 9.2: curves of f(x) against x, both axes from 0 to 1, for sqrt(x), arcsin(sqrt(x)), arcsin(x), and x^2.]
Figure 9.2. Effect of several transformations on proportion data.
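The two arcsine transformations can be compared directly. The sketch below is a minimal Python illustration with hypothetical function names; it assumes the input is a proportion between 0 and 1 inclusive and uses the 2/π multiplier so that the output is also scaled from 0 to 1.

```python
import math


def arcsine(x):
    """(2/pi) * arcsin(x) for a proportion x in [0, 1]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("expected a proportion between 0 and 1")
    return (2.0 / math.pi) * math.asin(x)


def arcsine_sqrt(x):
    """(2/pi) * arcsin(sqrt(x)) for a proportion x in [0, 1]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("expected a proportion between 0 and 1")
    return (2.0 / math.pi) * math.asin(math.sqrt(x))


for x in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(x, round(arcsine(x), 3), round(arcsine_sqrt(x), 3))
```

Both functions map 0 to 0 and 1 to 1. The arcsine squareroot transformation maps 0.5 to 0.5 and spreads out values near the ends of the scale.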
Beals smoothing
The index evaluates the favorability of a given sample for species i,
based on the whole data set, using the proportions of joint
occurrences between the species that do occur in the sample and
species i.
b_{ij} = \frac{1}{S_i} \sum_{k} \frac{M_{jk}}{N_k}
where
Si is the number of species in sample unit i,
Mjk is the number of sample units with both
species j and k, and
Nk is the number of sample units with species k.
Box 9.1. Example of Beals smoothing
Data matrix X before transformation (3 sample units × 5 species):

        sp1   sp2   sp3   sp4   sp5   Si
SU1      1     0     1     1     1     4
SU2      0     0     0     1     0     1
SU3      1     1     0     0     0     2
Nj       2     1     1     2     1
Si = number of species in sample unit i.
Nj = number of sample units with species j.
Construct matrix M, where Mjk = number of sample units with both species j and k.
(Note that where j = k, then Mjk = Nj).
                 Species k
Species j     1     2     3     4     5
    1         2
    2         1     1
    3         1     0     1
    4         1     0     1     2
    5         1     0     1     1     1
(M is symmetric, so only the lower triangle is shown.)
Box 9.1. (cont.) Example of Beals smoothing
Construct new matrix B containing values transformed with
Beals smoothing function:
b_{ij} = \frac{1}{S_i} \sum_{k} \frac{M_{jk}}{N_k}    for all k with x_{ik} \neq 0
Data after transformation (B):

        sp1    sp2    sp3    sp4    sp5
SU1     0.88   0.13   0.75   0.88   0.75
SU2     0.50   0.00   0.50   1.00   0.50
SU3     1.00   0.75   0.25   0.25   0.25
Box 9.1. (cont.) Example of Beals smoothing
Example for sample unit 1 and species 2:
b1,2 = 1/4 (1/2 + 0/1 + 0/2 + 0/1)
b1,2 = 0.25 (0.5)
b1,2 = 0.125 (rounded to 0.13 in matrix above)
Example for sample unit 3 and species 2:
b3,2 = 1/2 (1/2 + 1/1)
b3,2 = 0.5 (1.5)
b3,2 = 0.75
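The calculations in Box 9.1 can be reproduced with a short Python sketch. This is an illustrative implementation of the formula as written in the box, not the PC-ORD code; the function name beals_smoothing is hypothetical, and the input is assumed to be a presence-absence matrix with sample units as rows and species as columns.

```python
def beals_smoothing(x):
    """Return matrix B for a 0/1 matrix x (rows = sample units, columns = species)."""
    n_species = len(x[0])
    # N[k]: number of sample units containing species k
    N = [sum(row[k] for row in x) for k in range(n_species)]
    # M[j][k]: number of sample units containing both species j and species k
    M = [[sum(row[j] * row[k] for row in x) for k in range(n_species)]
         for j in range(n_species)]
    B = []
    for row in x:
        present = [k for k in range(n_species) if row[k] != 0]
        S = len(present)  # number of species in this sample unit
        B.append([sum(M[j][k] / N[k] for k in present) / S if S else 0.0
                  for j in range(n_species)])
    return B


# Data matrix X from Box 9.1 (SU1, SU2, SU3 by sp1..sp5).
X = [
    [1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
]
B = beals_smoothing(X)
print(B[0][1])  # sample unit 1, species 2: 0.125
print(B[2][1])  # sample unit 3, species 2: 0.75
```

Python indexes from zero, so B[0][1] corresponds to b1,2 in the box.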
Relativizations
"To relativize or not to relativize, that focuses the question."
(Shakespeare, ????)
Table 9.2. Evaluation of degree of variability in row or
column totals as measured with the coefficient of variation
of row or column totals.
CV, %      Variability among rows (or columns)
< 50       Small. Relativization usually has small effect on qualitative
           outcome of the analysis.
50-100     Moderate (with a correspondingly moderate effect on the outcome
           of further analysis).
100-300    Large. Large effect on results.
> 300      Very large.
Figure 9.3. Effect of various transformations on relative weighting of species.
Species abundance was measured on a continuous, quantitative scale. “Rank” is the
order of species ranked by their abundance.
General relativization
By rows:
b_{ij} = \frac{x_{ij}}{\left( \sum_{j=1}^{q} x_{ij}^{p} \right)^{1/p}}
By columns:
b_{ij} = \frac{x_{ij}}{\left( \sum_{i=1}^{n} x_{ij}^{p} \right)^{1/p}}
for a matrix of n rows and q columns.
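As an illustration of the general relativization, the following Python sketch divides each element by the p-th root of the sum of p-th powers of its row or column. The function names are hypothetical and the data are assumed nonnegative. With p = 1 each element is divided by its row or column total, and with p = 2 by the Euclidean norm.

```python
def relativize_rows(x, p=1):
    """Divide each element by (sum of x**p over its row) ** (1/p)."""
    out = []
    for row in x:
        norm = sum(v ** p for v in row) ** (1.0 / p)
        # Rows of all zeros are left as zeros to avoid dividing by zero.
        out.append([v / norm if norm else 0.0 for v in row])
    return out


def relativize_columns(x, p=1):
    """Same operation by columns: transpose, relativize rows, transpose back."""
    transposed = [list(col) for col in zip(*x)]
    return [list(row) for row in zip(*relativize_rows(transposed, p))]


# Illustrative matrix, not data from the book.
X = [[1, 3], [2, 2]]
print(relativize_rows(X, p=1))     # rows now sum to 1
print(relativize_columns(X, p=1))  # columns now sum to 1
print(relativize_rows([[3, 4]], p=2))
```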
Relativization by maximum
bij = xij /xmaxj
where
rows (i) are samples and
columns (j) are species,
xmaxj is the largest value in the matrix for
species j.
Adjustment to standard deviate
b_{ij} = (x_{ij} - \bar{x}_j) / s_j
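A minimal Python sketch of the adjustment to standard deviates by columns follows. The function name is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the formula does not say which form of s is intended.

```python
import statistics


def standardize_columns(x):
    """Return (x_ij - column mean) / column standard deviation for each element."""
    columns = list(zip(*x))
    means = [statistics.mean(col) for col in columns]
    # Sample standard deviation (n - 1); an assumption, see the note above.
    sds = [statistics.stdev(col) for col in columns]
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(row, means, sds)]
            for row in x]


# Illustrative matrix: two variables measured on very different scales.
X = [[1, 10], [2, 20], [3, 30]]
print(standardize_columns(X))
```

After the adjustment both columns are on the same scale, regardless of their original units.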
Binary with respect to mean
b_{ij} = 1 if x_{ij} > \bar{x}, b_{ij} = 0 if x_{ij} \leq \bar{x}
Rank adjustment
Matrix elements are assigned ranks within rows or
columns such that the row or column totals are
constant. Ties are assigned the average rank of the tied
elements. For example, the values 1, 3, 3, 9, 10 would
receive ranks 1, 2.5, 2.5, 4, 5.
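The tie-handling rule can be sketched in Python as follows. The function name average_ranks is hypothetical, and the example reuses the values given above.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


print(average_ranks([1, 3, 3, 9, 10]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```

Because ties share the average rank, the ranks for n elements always sum to n(n + 1)/2, which keeps the row or column totals constant.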
Binary with respect to median
b_{ij} = 1 if x_{ij} > median, b_{ij} = 0 if x_{ij} \leq median
Weighting by ubiquity
b_{ij} = U_j x_{ij}
where
U_j = N_j / N
If rows are samples, columns are species, and relativization is by
columns, more ubiquitous species are given more weight.
Under these conditions:
Nj is the number of samples in which species j occurs and
N is the total number of samples.
Information function of ubiquity
b_{ij} = I_j x_{ij}
where
I_j = -p_j \log(p_j) - (1 - p_j) \log(1 - p_j)
and p_j = N_j / N with N_j and N as defined above.
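Both ubiquity-based weightings can be sketched in Python. The function names and the example matrix are hypothetical, rows are assumed to be samples and columns species, and the natural logarithm is an assumption because the base of the logarithm is not stated. A species that occurs in none or all of the samples is given an information weight of zero.

```python
import math


def ubiquity_weights(x):
    """U_j = N_j / N: the fraction of samples in which species j occurs."""
    n = len(x)
    return [sum(1 for row in x if row[j] > 0) / n for j in range(len(x[0]))]


def information_weights(x):
    """I_j = -p_j log(p_j) - (1 - p_j) log(1 - p_j), with p_j = N_j / N."""
    def info(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0  # limiting value when a species is absent or everywhere
        return -p * math.log(p) - (1 - p) * math.log(1 - p)
    return [info(p) for p in ubiquity_weights(x)]


def weight_columns(x, weights):
    """Multiply every element of column j by weights[j]."""
    return [[w * v for v, w in zip(row, weights)] for row in x]


# Illustrative matrix: 4 samples by 3 species.
X = [[2, 0, 1], [4, 0, 1], [0, 3, 1], [0, 0, 1]]
print(ubiquity_weights(X))
print(information_weights(X))
print(weight_columns(X, ubiquity_weights(X)))
```

The information function peaks for species that occur in half of the samples and falls to zero for species that occur in all of them, whereas weighting by ubiquity keeps increasing with frequency.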
Double relativizations
Bray and Curtis (1957):
They first relativized by species maximum, equalizing
the rare and abundant species,
and then relativized by SU total.
"Contingency deviate" relativization
(Austin and Greig-Smith 1968):
b_{ij} = x_{ij} - \frac{\sum_{j=1}^{p} x_{ij} \sum_{i=1}^{n} x_{ij}}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}}
for a matrix of n rows and p columns.
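Both double relativizations can be sketched in Python. The function names and example matrices are hypothetical; rows are assumed to be sample units and columns species. The contingency deviate is the observed value minus the product of its row total and column total divided by the grand total.

```python
def max_then_total(x):
    """Relativize by species (column) maximum, then by sample unit (row) total."""
    col_max = [max(col) for col in zip(*x)]
    by_max = [[v / m if m else 0.0 for v, m in zip(row, col_max)] for row in x]
    out = []
    for row in by_max:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def contingency_deviates(x):
    """b_ij = x_ij - (row total * column total) / grand total."""
    row_totals = [sum(row) for row in x]
    col_totals = [sum(col) for col in zip(*x)]
    grand = sum(row_totals)
    if grand == 0:
        raise ValueError("matrix total is zero")
    return [[v - r * c / grand for v, c in zip(row, col_totals)]
            for row, r in zip(x, row_totals)]


# Illustrative matrices, not data from the cited papers.
print(max_then_total([[2, 10], [4, 5]]))
print(contingency_deviates([[1, 3], [3, 1]]))  # [[-1.0, 1.0], [1.0, -1.0]]
```

The deviates in each row and each column sum to zero, so positive values mark abundances higher than expected from the margins.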
Deleting rare species
[Figure 9.4: r-squared (0 to 1) plotted against the criterion for species removal (occurrence in % of SUs, 0 to 45), with separate curves for depth to water table, distance from stream, and elevation above stream.]
Figure 9.4. Correlation between ordination axis scores and environmental variables can
often be improved by removal of rare species. In this case, the strength of relationship
between hydrologic variables and vegetation, as measured by r^2, is maximized with
removal of species occurring in fewer than 5-15% of the sample units, depending on the
hydrologic variable. The original data set contained 88 species; 59, 35, 16, and 9 species
remained after removal of species occurring in fewer than 5, 15, 40, and 45% of the sample
units, respectively. Data are courtesy of Nick Otting (1996, unpublished).
[Figure 9.5: A (0.00 to 0.20) plotted against number of species removed (0 to 7).]
Figure 9.5. Response of A statistic (blocked MRPP) to removal of rare species from
small mammal trapping data. A measures the effect size of the treatments, in this case
different stand structures.
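Removing rare species by a frequency criterion can be sketched in Python. The function name, species names, and example matrix are hypothetical; rows are assumed to be sample units and columns species, and a species is dropped when it occurs in fewer than min_percent of the sample units.

```python
def delete_rare_species(x, names, min_percent=5.0):
    """Keep species that occur in at least min_percent of the sample units."""
    n_units = len(x)
    keep = []
    for j in range(len(names)):
        occurrences = sum(1 for row in x if row[j] > 0)
        if 100.0 * occurrences / n_units >= min_percent:
            keep.append(j)
    reduced = [[row[j] for j in keep] for row in x]
    return reduced, [names[j] for j in keep]


# Illustrative matrix: 4 sample units by 3 species.
X = [[1, 0, 3], [2, 0, 0], [1, 1, 0], [4, 0, 0]]
names = ["sp1", "sp2", "sp3"]
reduced, kept = delete_rare_species(X, names, min_percent=50.0)
print(kept)     # ['sp1']
print(reduced)  # [[1], [2], [1], [4]]
```

Rerunning the analysis over a range of criteria, as in Figure 9.4, shows how sensitive the results are to the choice of threshold.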
Difference between two dates
If aij1 and aij2 are the abundances of species j in sample
unit i at times 1 and 2, then the difference between dates
is:
bij = aij2 - aij1
First difference of time series
bij = aij,t+1 - aij,t
for a community sampled at times t and t+1.
Absolute differences, creating a matrix of species’
contributions to community change, without regard to
the direction of the change:
bij = | aij,t+1 - aij,t |
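The difference transformations can be illustrated with a short Python sketch. The function names and data are hypothetical; the input is assumed to be the record for one sample unit, with one list of species abundances per sampling time.

```python
def first_differences(series):
    """series[t][j] is the abundance of species j at time t; return values at t+1 minus values at t."""
    return [[later - earlier for earlier, later in zip(a, b)]
            for a, b in zip(series, series[1:])]


def absolute_differences(series):
    """Species' contributions to change, without regard to direction."""
    return [[abs(d) for d in row] for row in first_differences(series)]


# Illustrative abundances for one sample unit at three times (3 species).
series = [[5, 0, 2], [3, 1, 2], [6, 1, 0]]
print(first_differences(series))     # [[-2, 1, 0], [3, 0, -2]]
print(absolute_differences(series))  # [[2, 1, 0], [3, 0, 2]]
```

With only two sampling times, first_differences returns a single row, which is the difference between two dates.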
A general procedure for data adjustments
Species data
Table 9.3. Suggested procedure for data adjustments of species data matrices.
Action to be considered (with its criteria)
1. Calculate descriptive statistics. Repeat this after each step below.
   (In PC-ORD run Row & column summary)
     Beta diversity (community data sets)
     Average skewness of columns
     Coefficient of variation (CV, %)
       CV of row totals
       CV of column totals
   Criteria: Always
Example data set profile from PC-ORD 5.
****************************** Data Set Profile ******************************
Main matrix:   StreamRestoration.wk1
Second matrix: StreamRestoration2.wk1
------------------------------------------------------------------------------
                                    Main matrix          Second matrix
------------------------------------------------------------------------------
% zeros                             78.2                 11.1
Average distance                    Sorensen 55.97955    Rela.Eucl. 0.41136
------------------------------------------------------------------------------
Beta diversity, Whittaker`s         3.6                  --
Beta diversity, ave. 1/2 changes    1.2                  --
Range (orders magnitude base10)     1.3                  7.0
Lowest nonzero value                0.0436               0.1000E-03
Highest value                       0.8165               0.1000E+04
------------------------------------------------------------------------------
                        Rows           Columns        Rows           Columns
Contents:               54 Sites       67 Attribut    54 sites       14 attribut
Skewness
  Average               3.2            3.8            2.0            1.6
  Maximum               5.1            7.3            3.6            6.9
  Minimum               1.5            -0.5           1.3            -0.1
CV of totals, %         26.10          179.03         42.24          160.31
------------------------------------------------------------------------------
Potential Outliers
Distance measure:       Sorensen                      Rela.Eucl.
                        SD-Item        SD-Item        SD-Item        SD-Item
                        4.4-Pott Crk   0.0            3.9-Brushy F   2.4-Vol LWD/
                        2.5-Little W   0.0            3.2-Lindley    0.0
                        2.0-yates mi   0.0            2.4-Pott Crk   0.0
------------------------------------------------------------------------------
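Coefficient of variation of the row or column totals:
CV = 100 s / \bar{x}
where s is the standard deviation and \bar{x} is the mean of the totals.
Criteria for CV, %:
< 50 = S
50-100 = M
100-300 = L
> 300 = XL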
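The CV of row and column totals can be computed and classified with a few lines of Python. This is a sketch with hypothetical function names. The sample standard deviation (n - 1) is an assumption, as is the handling of values that fall exactly on a class boundary (a CV of exactly 50 or 100 is placed in the higher class here).

```python
import statistics


def cv_percent(totals):
    """CV = 100 * s / mean for a list of row or column totals."""
    return 100.0 * statistics.stdev(totals) / statistics.mean(totals)


def cv_class(cv):
    """Map a CV (%) to the classes S, M, L, XL."""
    if cv < 50:
        return "S"
    if cv < 100:
        return "M"
    if cv <= 300:
        return "L"
    return "XL"


# Illustrative matrix: 3 sample units by 2 species.
X = [[1, 2], [3, 4], [5, 6]]
row_totals = [sum(row) for row in X]
col_totals = [sum(col) for col in zip(*X)]
for label, totals in (("rows", row_totals), ("columns", col_totals)):
    cv = cv_percent(totals)
    print(label, round(cv, 2), cv_class(cv))
```

Applied to the example profile above, the CV values of 26.10 and 42.24 for rows fall in class S, and the values of 179.03 and 160.31 for columns fall in class L.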
2. Delete rare species (< 5% of sample units)
   Criteria: Usually applied to community data sets, unless
   contrary to study goals
Species data, cont.
3. Monotonic transformation (if applied to species,
   then usually applied uniformly to all of them, so that
   all are scaled the same)
   Criteria:
   A. Average skewness of columns (species)
   B. Data range over how many orders of magnitude?
      (Count and biomass data often are extreme.)
   C. Beta diversity. (Consider presence/absence
      transformation for community data when beta diversity is high.)
4. Row or column relativizations
   Criteria:
   What is the question?
   Are units for all variables the same?
   Is relativization built into the subsequent analysis?
   CV of row totals
   CV of column totals
   What distance measure do you intend to use?
   Note: regardless of your decision to relativize or not,
   you should state your decision and justify it briefly on
   biological grounds.
Species data, cont.
5. Check for outliers based on the average distance of
each point from all other points. Calculate standard
deviation of these average distances. Describe
outliers and take steps to reduce influence, if
necessary
Standard deviation    Degree of problem
------------------    -----------------
< 2                   no problem
2 - 2.3               weak outlier
2.3 - 3               moderate outlier
> 3                   strong outlier
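The outlier check in step 5 can be sketched in Python. This is an illustration, not the PC-ORD routine. Euclidean distance is used as a stand-in for whichever distance measure the analysis uses, the sample standard deviation is an assumption, and the function names and example points are hypothetical. Each score is the number of standard deviations by which a point's average distance to all other points exceeds the mean of those averages.

```python
import math
import statistics


def outlier_scores(points):
    """Return, for each point, its average distance to all others in standard deviation units."""
    n = len(points)
    averages = []
    for i in range(n):
        total = sum(math.dist(points[i], points[k]) for k in range(n) if k != i)
        averages.append(total / (n - 1))
    mean = statistics.mean(averages)
    sd = statistics.stdev(averages)
    return [(a - mean) / sd if sd else 0.0 for a in averages]


def degree_of_problem(score):
    """Classify a score using the cutoffs in the table above."""
    if score < 2:
        return "no problem"
    if score < 2.3:
        return "weak outlier"
    if score <= 3:
        return "moderate outlier"
    return "strong outlier"


# Illustrative sample units in a two-variable space; the last one is far from the rest.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (100, 100)]
scores = outlier_scores(points)
for point, score in zip(points, scores):
    print(point, round(score, 2), degree_of_problem(score))
```

With very few sample units the largest possible score is limited, so the cutoffs are most informative for data sets of realistic size.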
Environmental data
Table 9.4. Suggested procedure for data adjustments of quantitative variables in environmental data matrices.
Action to be considered (with its criteria)
1. Calculate descriptive statistics for quantitative variables. Repeat this
   after each step below. (In PC-ORD run Row & column summary)
     Skewness and range for each variable (column)
   Criteria: Always
2. Monotonic transformation (applied to individual variables, depending on
   need)
   Criteria: Consider log or square root transformation for variables with
   skewness > 1 or ranging over several orders of magnitude.
   Consider arcsine squareroot transformation for proportion data.
3. Column relativizations
   Criteria: Consider column relativization (by norm or standard deviates) if
   environmental variables are to be used in a distance-based
   analysis that does not automatically relativize the variables (for
   example, using MRPP to answer the question: do groups of
   sample units defined by species differ in environmental space?).
   Column relativization is not necessary for analyses that use the
   variables one at a time (e.g., ordination overlays) or for analyses
   with built-in standardization (e.g., PCA of a correlation matrix).
4. Check for univariate outliers and take corrective steps if necessary.
   Criteria: Examine scatterplots or frequency distributions or relativize by
   standard deviates ("z scores") and check for high absolute
   values.