Italy_Siena_May2005 C - Sites at Penn State


Gini and Lorenz in
Statistical Ecology and
Environmental Statistics
G. P. Patil
Center for Statistical Ecology and
Environmental Statistics
Penn State University
University Park, PA 16802 USA



Diversity Measurement and Comparison
Gini, Lorenz
Equitability Measurement and
Comparison
Lorenz, Gini
Impurity Measurement and Comparison
Gini
Diversity Measurement and Comparison
Am I a Specialist or a Generalist?



My Wife: I am a specialist… because I
do ‘something’; not cooking, not
washing, not shopping, etc.
My Grandson: I am a generalist…
because I read, play, swim, drive,
draw, etc.
Degree of Specialization/Diversification
is relative to categorization
Resource Apportionment
(Time, Energy, Biomass,
Abundance,…)


        Math   Music
John:   2/3    1/3    π = (π1, π2) = (2/3, 1/3)
Jane:   1/3    2/3    ν = (ν1, ν2) = (1/3, 2/3)





Does John have a different kind of
specialization/diversification than
Jane?
Answer: Yes…subject identity matters
Does John have a different degree of
specialization/diversification than
Jane?
Answer: No…subject identity does not
matter
Degree of specialization/diversification
is permutation-invariant
Alfred Russel Wallace - 1875
To an outside observer, variety is a most
striking feature of a diverse community.
Wallace’s description of a tropical forest is a
vivid illustration:
“If the traveler notices a particular species and
wishes to find more like it, he may turn his
eyes in vain in any direction.
Trees of varied forms, dimensions, and colors
are around him, but he rarely sees any one
of them repeated.
Time after time he goes towards a tree which
looks like the one he seeks, but a closer
examination proves it to be distinct.
He may at length, perhaps, meet with a
second specimen half a mile off, or may fail
altogether, till on another occasion he
stumbles on one by accident.”
In a diverse community, such as that
described by Wallace, the typical
species is relatively rare. Consequently,
we propose that diversity be defined as
the average rarity within a community.
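Diversity as Average
Species Rarity
Community: \( C = (s, \boldsymbol{\pi}) \), where \( \boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_s) \)
Rarity of species i: \( R(i; \boldsymbol{\pi}) \)
\[ \Delta(\boldsymbol{\pi}) = \sum_{i=1}^{s} \pi_i \, R(i; \boldsymbol{\pi}) \]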
Species Rarity
Conceptualization
Approach 1: Dichotomy
\[ R(i; \boldsymbol{\pi}) = R(\pi_i), \qquad R(\pi) \text{ defined for } 0 < \pi \le 1 \]
\[ \Delta(\boldsymbol{\pi}) = \sum_{i=1}^{s} \pi_i \, R(\pi_i) \]
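Species Rarity
Conceptualization
Approach 2: Ranking
\[ \boldsymbol{\pi}^{\#} = (\pi_1^{\#}, \pi_2^{\#}, \ldots, \pi_s^{\#}) \]
\[ R(i; \boldsymbol{\pi}^{\#}) = R(i) \]
\[ \Delta(\boldsymbol{\pi}) = \Delta(\boldsymbol{\pi}^{\#}) = \sum_{i=1}^{s} \pi_i^{\#} \, R(i) \]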
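Dichotomous Approach
\[ R(\pi) = 1 - \pi \]
\[ \Delta(\boldsymbol{\pi}) = \sum_{i} \pi_i \, R(\pi_i) = \sum_{i} \pi_i (1 - \pi_i) \]
Gini Index, Simpson Index
[Figure: R(π) plotted against π, for π from 0 to 1]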
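Dichotomous Approach
Species Count:
\[ R(\pi) = (1/\pi) - 1, \qquad \Delta(\boldsymbol{\pi}) = s - 1 \]
Shannon Index:
\[ R(\pi) = \log(1/\pi) = -\log(\pi), \qquad \Delta(\boldsymbol{\pi}) = -\sum_{i} \pi_i \log(\pi_i) \]
[Figure: R(π) plotted against π]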
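The dichotomous approach can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the five-species abundance vector is the one used in the profile figures of these slides. It evaluates the average-rarity sum with the three rarity functions given above.

```python
import math

def average_rarity(pi, rarity):
    """Diversity as average rarity: sum of pi_i * R(pi_i) over the species."""
    return sum(p * rarity(p) for p in pi if p > 0)

# The three dichotomous rarity functions from the slides.
def rarity_simpson(p):
    return 1.0 - p            # gives the Gini / Simpson index

def rarity_species_count(p):
    return 1.0 / p - 1.0      # gives s - 1

def rarity_shannon(p):
    return -math.log(p)       # gives the Shannon index

pi = [0.5, 0.2, 0.2, 0.05, 0.05]
gini_simpson = average_rarity(pi, rarity_simpson)
species_count = average_rarity(pi, rarity_species_count)
shannon = average_rarity(pi, rarity_shannon)
print(gini_simpson, species_count, shannon)
```

For this community the species-count index comes out as s - 1 = 4 and the Gini/Simpson index as 1 minus the sum of squared abundances.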
Ranking Approach
Average Rank
\[ \boldsymbol{\pi}^{\#} = (\pi_1^{\#}, \pi_2^{\#}, \ldots, \pi_s^{\#}) \]
\[ R(i) = i - 1 \]
\[ \Delta(\boldsymbol{\pi}) = \Delta(\boldsymbol{\pi}^{\#}) = \text{Average Rank} - 1 \]
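Ranking Approach
Patil-Taillie Tail-sum Diversity Profile:
Intrinsic Diversity Profile
\( C = \boldsymbol{\pi} \); ranked \( \boldsymbol{\pi} = \boldsymbol{\pi}^{\#} = (\pi_1^{\#} \ge \pi_2^{\#} \ge \cdots \ge \pi_s^{\#}) \)
j-th ranked species: standard
\[ \pi_1^{\#} \ge \pi_2^{\#} \ge \cdots \ge \pi_j^{\#} \ge \pi_{j+1}^{\#} \ge \cdots \ge \pi_s^{\#} \]
\( R(i; \boldsymbol{\pi}) = 0 \) for \( i \le j \) and \( R(i; \boldsymbol{\pi}) = 1 \) for \( i > j \)
\[ \Delta(\boldsymbol{\pi}) = \pi_{j+1}^{\#} + \pi_{j+2}^{\#} + \cdots + \pi_s^{\#} = T_j(\boldsymbol{\pi}) \]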
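Example: \( \boldsymbol{\pi} = (2/6, 3/6, 1/6) \), \( \boldsymbol{\pi}^{\#} = (3/6, 2/6, 1/6) \)
[Figure: Lorenz Profile, Tj plotted against j = 0, 1, 2, 3, with T0 = 1, T1 = 3/6, T2 = 1/6, T3 = 0]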
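The tail-sum profile for the three-species example can be reproduced with a few lines. This is a minimal sketch, not the authors' code; the function name is hypothetical and exact fractions are used so the values match the slide.

```python
from fractions import Fraction

def tail_sum_profile(pi):
    """Return [T_0, T_1, ..., T_s], where T_j is the total abundance of
    all species ranked below the j-th most abundant one."""
    ranked = sorted(pi, reverse=True)   # pi# : abundances in decreasing order
    profile = [sum(ranked)]             # T_0 = 1
    for p in ranked:
        profile.append(profile[-1] - p) # drop the next most abundant species
    return profile

pi = [Fraction(2, 6), Fraction(3, 6), Fraction(1, 6)]
profile = tail_sum_profile(pi)
print(profile)
```

The profile starts at 1 and falls to 0 at j = s; the intermediate values here are 3/6 and 1/6.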
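Patil-Taillie ∆β Diversity Profile
Small β: sensitivity to rare species
Large β: sensitivity to abundant species
\[ R_\beta(\pi) = (1 - \pi^{\beta})/\beta, \qquad -1 \le \beta \le 1 \ldots \]
\[ \Delta_\beta(\boldsymbol{\pi}) = \sum_{i} \pi_i \, R_\beta(\pi_i) \]
[Figure: ∆β plotted against β, with ticks at -1, 0, 1, 2]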
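A short sketch of the ∆β family follows, assuming the rarity function Rβ(π) = (1 - π^β)/β and treating β = 0 as the limiting case -log(π). The function name and the numerical cutoff for "β = 0" are illustrative choices, not part of the slides.

```python
import math

def delta_beta(pi, beta):
    """Average rarity with R_beta(p) = (1 - p**beta) / beta.
    beta = 0 is handled as the limit, which is the Shannon index."""
    if abs(beta) < 1e-12:
        return -sum(p * math.log(p) for p in pi if p > 0)
    return sum(p * (1.0 - p ** beta) / beta for p in pi if p > 0)

pi = [0.5, 0.2, 0.2, 0.05, 0.05]
for beta in (-1, 0, 1):
    print(beta, delta_beta(pi, beta))
```

With β = -1 the index reduces to the species count s - 1, with β = 0 to the Shannon index, and with β = 1 to the Gini/Simpson index, so the three classical indices are points on one profile.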
[Figure: ∆β Profile, Diversity plotted against Beta (-1 to 3), for π = (.5, .2, .2, .05, .05)]
[Figure: Tj Profile, Diversity plotted against j (0 to 5), for π = (.5, .2, .2, .05, .05)]
HURLBERT-SMITH GENERALIZATION TO
SPECIES AREA CURVE
\[ \Delta^{H\text{-}S}_{\omega}(\boldsymbol{\pi}) = \sum_{i} (1-\pi_i)\left[1-(1-\pi_i)^{\omega}\right] \]
Hurlbert (1971)
Smith and Grassle (1977)
The Hurlbert-Smith index of order ω is the expected number of species
obtained when ω+1 individuals are randomly selected from the
community, minus one so that a single-species community has diversity
zero.
The Gini index is obtained when ω=1 and the species count when ω=∞.
Interestingly, the Hurlbert-Smith Family can be seen as Patil-Taillie
Family with rarity measure
\[ R(\pi) = (1-\pi)\left[1 - (1-\pi)^{\omega}\right] / \pi \]
arising within the context of Intraspecific Encounters Theory.
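The two descriptions of the Hurlbert-Smith index (average rarity, and expected species count minus one) can be compared numerically. The sketch below assumes individuals are drawn independently from a very large community, so that the probability of missing species i in m draws is (1 - πi)^m; the function names are hypothetical.

```python
def hurlbert_smith(pi, omega):
    """Average-rarity form: sum of (1 - p) * [1 - (1 - p)**omega]."""
    return sum((1 - p) * (1 - (1 - p) ** omega) for p in pi)

def expected_species(pi, m):
    """Expected number of distinct species among m independent draws
    (assumes sampling with replacement from a large community)."""
    return sum(1 - (1 - p) ** m for p in pi)

pi = [0.5, 0.2, 0.2, 0.05, 0.05]
omega = 10
print(hurlbert_smith(pi, omega), expected_species(pi, omega + 1) - 1)
```

Both printed values agree, and setting omega to 1 returns the Gini index while very large omega approaches s - 1.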
[Figure: Tj Profile, Diversity plotted against j (0 to 5), for π = (.5, .2, .2, .05, .05)]
[Figure: HSG ∆ω Profile, Diversity plotted against Omega (0 to 40), for π = (.5, .2, .2, .05, .05)]
[Figure: ∆β Profile, Diversity plotted against Beta (-1 to 3), for π = (.5, .2, .2, .05, .05)]
[Figure: Sβ Profile, Diversity plotted against Beta (-1 to 3), for π = (.5, .2, .2, .05, .05)]
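EQUIVALENT NUMBER OF
SPECIES PROFILE
\[ S_\beta(\boldsymbol{\pi}) = \Bigl( 1 \Big/ \sum_{i} \pi_i^{\beta+1} \Bigr)^{1/\beta}, \qquad -1 \le \beta < \infty \]
It is the number of species that a completely even
community would need to have for its ∆β diversity
to be ∆β(π).
Gini index ∆1 for β=1 gives
\[ S_1(\boldsymbol{\pi}) = 1 \Big/ \sum_{i} \pi_i \cdot \pi_i = 1 / [1 - \text{Gini index}] \]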
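A minimal sketch of the equivalent number of species, assuming the form Sβ = (Σ πi^(β+1))^(-1/β), which is what the "completely even community" definition above implies for ∆β with Rβ(π) = (1 - π^β)/β. The β = 0 case is taken as the limit exp(Shannon index); that limit and the function name are additions for illustration.

```python
import math

def equivalent_species(pi, beta):
    """Number of equally abundant species giving the same Delta_beta."""
    if abs(beta) < 1e-12:
        # Limiting case beta -> 0: exponential of the Shannon index.
        return math.exp(-sum(p * math.log(p) for p in pi if p > 0))
    return sum(p ** (beta + 1) for p in pi if p > 0) ** (-1.0 / beta)

pi = [0.5, 0.2, 0.2, 0.05, 0.05]
print(equivalent_species(pi, -1))   # species richness
print(equivalent_species(pi, 1))    # 1 / (1 - Gini index)
```

A completely even community returns its own species number for every β, which is a quick sanity check on the formula.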
Encounter Theory and Type I and
Type II Rarity Measures
Consider again such a traveler who initially
encounters a member of species i and
subsequently encounters X additional individuals
where X is a positive integer-valued random
variable. Define the type I rarity measure to be
the probability that a new species is encountered,
i.e., the probability that at least one of the X
additional individuals belongs to species different
from i. A type II rarity measure, on the other
hand, is the probability that each of the additional
individuals belongs to species different from i.
Clearly these probabilities are large when the
species i is rare.
INTRASPECIFIC ENCOUNTER
THEORY AND DIVERSITY INDICES
Wallace story. Let Y+1 be the number of encounters
required to experience the first intraspecific encounter.
\[ P(Y = y \mid \pi_i) = \pi_i (1-\pi_i)^{y}, \qquad y = 0, 1, 2, \ldots \]
Clearly,
\[ E[Y \mid \pi_i] = (1-\pi_i)/\pi_i, \]
\[ E[Y+1 \mid \pi_i] = 1/\pi_i, \]
\[ E[1/(Y+1) \mid \pi_i] = -\pi_i \log(\pi_i)/(1-\pi_i). \]
Since large values of Y are indicative of the
rarity of the species i, the following quantities
should be reasonable measures of its rarity.
(i) \( E[Y \mid \pi_i] = (1-\pi_i)/\pi_i \), the rarity
measure for the Species Count.
(ii) \( E[Y \mid \pi_i] / E[Y+1 \mid \pi_i] = 1-\pi_i \), the rarity
measure for the Simpson index.
(iii) \( E[Y \mid \pi_i] \cdot E[1/(Y+1) \mid \pi_i] = -\log \pi_i \), the
rarity measure of the Shannon index.
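The three expectations can be verified by summing the geometric distribution directly. This is a sketch under the stated model P(Y = y) = π(1 - π)^y; the function name, the truncation length, and the choice π = 0.2 are illustrative assumptions.

```python
import math

def geometric_expectations(p, terms=5000):
    """Return (E[Y], E[1/(Y+1)]) for P(Y = y) = p * (1 - p)**y,
    using a truncated series (the neglected tail is negligible here)."""
    e_y = 0.0
    e_inv = 0.0
    for y in range(terms):
        prob = p * (1 - p) ** y
        e_y += y * prob
        e_inv += prob / (y + 1)
    return e_y, e_inv

p = 0.2
e_y, e_inv = geometric_expectations(p)
print(e_y)               # species-count rarity, (1 - p) / p
print(e_y / (e_y + 1))   # Simpson rarity, 1 - p
print(e_y * e_inv)       # Shannon rarity, -log(p)
```

Each printed value matches the corresponding closed form (i) to (iii) to within rounding error.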
The three classical diversity indices used
frequently in the ecological literature thus have a
meaningful interpretation within the average
community rarity formulation and also the
intraspecific encounter theory. Particularly, we
note that the Shannon index has an encounter
theoretic interpretation and thus should not be
singled out or criticized simply because of its
continuing use in information theory.
RAO GENERALIZATION TO
QUADRATIC ENTROPY
\[ Q = \sum_{i}\sum_{j} d_{ij}\,\pi_i \pi_j \]
Rao (1982)
Q is the average distance between two randomly
selected individuals. If \( d_{ij} = 1 \) for \( i \neq j \) and \( d_{ii} = 0 \), then
\[ Q = \sum_{i} \pi_i (1-\pi_i), \]
the Gini Index.
Q incorporates both species relative
abundances and a measure of taxonomic
or functional pairwise species distance.
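As an illustration, quadratic entropy with the 0/1 distance matrix reproduces the Gini index. The sketch below uses plain nested lists for the distance matrix; the function name and the example community are assumptions made for the demonstration.

```python
def quadratic_entropy(pi, d):
    """Rao's Q: expected distance between two individuals drawn
    independently, where d[i][j] is the distance between species i and j."""
    s = len(pi)
    return sum(d[i][j] * pi[i] * pi[j] for i in range(s) for j in range(s))

pi = [0.5, 0.2, 0.2, 0.05, 0.05]
# Distance 1 between different species, 0 within a species.
unit_distance = [[0 if i == j else 1 for j in range(len(pi))]
                 for i in range(len(pi))]
q_unit = quadratic_entropy(pi, unit_distance)
gini = sum(p * (1 - p) for p in pi)
print(q_unit, gini)
```

Replacing `unit_distance` with taxonomic or functional distances gives the general index without changing the function.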
Diversity Ordering
Let C = (s, π) and C’ = (s’, υ) be two communities.
The following statements are equivalent:
a) C’ is intrinsically more diverse than C.
b) ∆(C’) ≥ ∆(C) whenever ∆ satisfies Criterion C2.
c) ∆(C’) ≥ ∆(C) whenever ∆ satisfies Criterion C3.
d) π# majorizes υ#, i.e.
\[ \sum_{i \le k} \pi_i^{\#} \ge \sum_{i \le k} \upsilon_i^{\#}, \qquad k = 1, 2, 3, \ldots \]
e) υ# is stochastically greater than π#, i.e.
\[ \sum_{i > k} \upsilon_i^{\#} \ge \sum_{i > k} \pi_i^{\#}, \qquad k = 1, 2, 3, \ldots \]
f) υ is a convex linear combination of permutations of π.
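Condition (e) gives a direct computational test for the intrinsic diversity ordering: compare the tail sums of the ranked abundance vectors at every k. The following is a sketch with hypothetical function names and example communities; shorter vectors are padded with zeros, and a small tolerance absorbs floating-point rounding.

```python
def tail_sums(pi, length):
    """Tail sums T_1..T_length of the ranked vector, padded with zeros."""
    ranked = sorted(pi, reverse=True) + [0.0] * (length - len(pi))
    tails, remaining = [], sum(ranked)
    for p in ranked:
        remaining -= p
        tails.append(remaining)
    return tails

def intrinsically_more_diverse(nu, pi, tol=1e-12):
    """True if every tail sum of nu is at least the matching one of pi."""
    n = max(len(nu), len(pi))
    return all(a >= b - tol
               for a, b in zip(tail_sums(nu, n), tail_sums(pi, n)))

uneven = [0.5, 0.2, 0.2, 0.05, 0.05]
even = [0.2] * 5
print(intrinsically_more_diverse(even, uneven))   # True
print(intrinsically_more_diverse(uneven, even))   # False

# Profiles that cross: neither community dominates the other.
a, b = [0.6, 0.2, 0.2], [0.5, 0.45, 0.05]
print(intrinsically_more_diverse(a, b), intrinsically_more_diverse(b, a))
```

The last pair shows why this is only a partial ordering: when the profiles cross, different diversity indices can rank the two communities differently.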
Starting with members of the most abundant species, we
gradually accumulate individuals and plot, as abscissa X, the
cumulative proportion of individuals and, as ordinate Y, the cumulative
number of species. Formally, then, the intrinsic diversity profile is the
polygonal path joining the successive points
P0 = (1-T0, 0) = (0, 0)
P1 = (1-T1, 1)
P2 = (1-T2, 2)
…
Ps = (1-Ts, s) = (1, s).
With this definition, it is still the case that one community is intrinsically
more diverse than another if and only if the first community has its
intrinsic profile everywhere above that of the second community.
[Figure: Intrinsic diversity profile for a hypothetical five-species community with relative abundances 0.5, 0.2, 0.2, 0.05, 0.05. Polygonal path through P1 to P5; X axis: Cumulative Proportion of Individuals; Y axis: Cumulative Number of Species.]
Index-Free Definition of Species Equitability
Here the population units are species while the
individual organisms comprise the “commodity.” Now let
us gradually accumulate individuals starting with
members of the most abundant species. The Lorenz
curve is obtained by plotting cumulative proportions of
individuals as abscissae (X) against corresponding
cumulative proportions of species as ordinates (Y).
Formally, letting
π1# ≥ π2# ≥ … ≥ πs#
be the ordered relative abundances, the Lorenz curve is
the polygonal path joining the successive points
P0 = (0,0),
P1 = (π1#, 1/s),
P2 = (π1# + π2#, 2/s),
P3 = (π1# + π2# + π3#, 3/s),
…
Ps = (π1# + π2# + … + πs#, s/s) ≡ (1,1).
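The vertices P0, ..., Ps can be generated mechanically from an abundance vector. The sketch below is an illustration with a hypothetical function name; it uses exact fractions and the five-species community from the figures.

```python
from fractions import Fraction

def lorenz_vertices(pi):
    """Vertices (X, Y) of the Lorenz curve: X is the cumulative proportion
    of individuals, Y the cumulative proportion of species, accumulating
    from the most abundant species downward."""
    ranked = sorted(pi, reverse=True)
    s = len(ranked)
    vertices = [(Fraction(0), Fraction(0))]   # P0
    cumulative = Fraction(0)
    for k, p in enumerate(ranked, start=1):
        cumulative += p
        vertices.append((cumulative, Fraction(k, s)))
    return vertices

pi = [Fraction(1, 2), Fraction(1, 5), Fraction(1, 5),
      Fraction(1, 20), Fraction(1, 20)]
vertices = lorenz_vertices(pi)
for x, y in vertices:
    print(x, y)
```

A completely even community would place every vertex on the diagonal Y = X; the further the path bends away from the diagonal, the less equitable the community.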
[Figure: Lorenz curve for the hypothetical five-species community with relative abundances 0.5, 0.2, 0.2, 0.05, 0.05. Polygonal path through P1 to P5; X axis: Cumulative Proportion of Individuals; Y axis: Cumulative Number of Species.]
How then does the Lorenz curve effect equitability
comparisons when species richness varies? Consider
two communities which are replicates of one another in
the sense that they have the same relative abundance
vectors, but no species in common. It seems plausible
that combining these communities should give a
community with the same evenness but twice the
richness as either of the original communities. In other
words,
C: π1, π2, …, πs
and
C’: π1/2, π1/2, π2/2, π2/2, …, πs/2, πs/2
are expected to have the same evenness. The Lorenz
curves of C and C’ are indeed identical, except that the curve of C’ has “extra” vertices such as
P2.
Now suppose we wanted to compare the
evenness of two communities with, say, 3 and 5
species respectively. We could replicate the 3-species community 5 times and the 5-species
community 3 times and then carry out a diversity
comparison between the pair of resulting 15-species communities. The Lorenz curves do all
this automatically.
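The replication property can be checked directly: replicating a community adds vertices to its Lorenz curve but moves none of the original ones. This is a sketch with hypothetical function names and an arbitrary three-species community, using exact fractions.

```python
from fractions import Fraction

def lorenz_vertices(pi):
    """Lorenz vertices (cumulative individuals, cumulative species share),
    accumulating from the most abundant species downward."""
    ranked = sorted(pi, reverse=True)
    s = len(ranked)
    vertices, cumulative = [(Fraction(0), Fraction(0))], Fraction(0)
    for k, p in enumerate(ranked, start=1):
        cumulative += p
        vertices.append((cumulative, Fraction(k, s)))
    return vertices

def replicate(pi, m):
    """Combine m copies of a community that share no species."""
    return [p / m for p in pi for _ in range(m)]

c3 = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
c15 = replicate(c3, 5)                  # 15-species replicate
original = set(lorenz_vertices(c3))
replicated = set(lorenz_vertices(c15))
print(original <= replicated)           # True: original vertices survive
```

The additional vertices of the replicate lie on the straight segments of the original curve, so the two polygonal paths coincide.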
This interesting replication property was first
pointed out by Hill (1973) in his discussion of the
Ea,b measures. Hill failed to notice the
connection with the Lorenz curve however.
[Figure: Lorenz Curves for the Gamma and Lognormal Models. Two panels: Gamma Model (curves labeled by k) and Lognormal Model (curves labeled by σ²); X axis: Cumulative Proportion of Individuals; Y axis: Cumulative Number of Species.]
Describing Inequality in Plant Size or Fecundity
Damgaard and Weiner (2000)
Lorenz curves are used to describe inequality in
plant size and fecundity, where the inequality is
summarized by the Gini coefficient. Damgaard and
Weiner propose a second and complementary statistic,
the Lorenz asymmetry coefficient, which characterizes
an important aspect of the shape of the Lorenz
curve. The statistic tells which size classes
contribute most to the population’s total
inequality, which helps in interpreting its ecological
significance.
Crossings and No Crossings for
Lognormal Communities
Patil and Taillie (1979)
• If the species density functions f’ and f have the star-shaped property, then their intrinsic diversity profiles
have at most one crossing point.
• Let C’ and C be lognormal communities with respective
parameters (s’, σ’) and (s, σ). The intrinsic diversity
profiles have at most one crossing point. Further, C’ is
intrinsically more diverse than C if and only if s’ > s and
σ’ < σ.
• The parameter 1/σ² completely characterizes the
evenness of the lognormal model (Taillie, 1979).
Therefore, in view of the above, within the lognormal
family, increasing diversity is equivalent to
simultaneously increasing richness and
evenness/equitability.
Crossings and No Crossings for
Gamma Communities
Patil and Taillie (1979)
• Let C’ and C be two gamma communities with
respective parameters (α’, k’) and (α, k). The intrinsic
diversity profiles have at most one crossing point.
Further, if k>0, then C’ is intrinsically more diverse than
C if and only if either (i) k’ > k and α’/k’ > α/k or (ii) k’ < k
and α’ > α.
• The exponent k characterizes equitability within the
family of gamma models with positive exponent (Taillie,
1979).
• For lognormal and gamma communities with positive
k, the Lorenz curves never cross one another.
Random Forests for Scientific Discovery
Leo Breiman, UC Berkeley
Adele Cutler, Utah State University
The Data Avalanche
We can gather and store larger amounts of data than ever
before:
• Satellite data
• Web data
• EPOS
• Microarrays etc.
• Text mining and image recognition.
Who is trying to extract meaningful information from these data?
• Academic statisticians
• Machine learning specialists
• People in the application areas!
CART (Breiman, Friedman,
Olshen, Stone 1984)
Arguably one of the most successful tools of the
last 20 years. Why?
1. Universally applicable to both classification and
regression problems with no assumptions on the
data structure.
2. Can be applied to large datasets. Computational
requirements are of order MNlogN, where N is
the number of cases and M is the number of
variables.
3. Handles missing data effectively.
4. Deals with categorical variables efficiently.
Example: UCSD Heart Disease Study*
Goal: to predict who is at risk of a 2nd heart
attack and early death within 30 days and to
determine who should be sent to intensive
care treatment
# of subjects = 215
Outcome variable = High/Low Risk determined
by PI after 30 days follow up
# of variables available = 100
19 noninvasive clinical and lab variables were
used as the predictors
*:Gilpin, Olshen, Henning and Ross (1983)
Drawbacks of CART
• Accuracy: current methods, such as support
vector machines and ensemble classifiers,
often have 30% lower error rates than
CART.
• Instability: if we change the data a little,
the tree picture can change a lot. So the
interpretation is not as straightforward as it
appears.
Today, we can do better!
What do we want in a tool for
the sciences?
Minimum:
• Universally applicable for classification
• Unexcelled accuracy
• Capable of handling large datasets
• Effective handling of missing values
Beyond the minimum:
• Variable importance
• Interactions
• What is the shape of the data?
• Are there clusters?
• Are there novel cases or outliers?
• How does the multivariate action of the variables
separate the classes?
Random Forests
• General-purpose tool for classification and
regression
• Unexcelled accuracy: about as accurate as
support vector machines (see later)
• Capable of handling large datasets
• Effectively handles missing values
• Gives a wealth of scientifically important
insights
References
Damgaard, C., and Weiner, J. (2000). Describing inequality in plant size or fecundity. Ecology, 81, 1139-1142.
Fattorini, L., and Marcheselli, M. (1999). Inference on intrinsic diversity profiles of biological populations. Environmetrics, 10, 589-599.
Gini, C. (1936). On the measure of concentration with especial reference to income and wealth. Cowles Commission.
Gini, C. W. (1912). Variabilità e mutabilità. Studi Economico-Giuridici della R. Università di Cagliari, 3, 3-159.
Hill, M. O. (1973). Diversity and evenness: a unifying notation and its consequences. Ecology, 54, 427-432.
Hurlbert, S. H. (1971). The nonconcept of species diversity: A critique and alternative parameters. Ecology, 52, 577-586.
Lorenz, M. O. (1905). Methods of measuring the concentration of wealth. Journal of the American Statistical Association, 9, 209-219.
Patil, G. P., and Rosenzweig, M. L. (eds). (1979). Contemporary Quantitative Ecology and Related Ecometrics. International Co-operative Publishing House, Fairland, MD.
Patil, G. P., and Taillie, C. (1979a). An overview of diversity. In Ecological Diversity in Theory and Practice, J. F. Grassle, G. P. Patil, W. K. Smith, and C. Taillie, eds. International Co-operative Publishing House, Fairland, MD. pp. 3-27.
Patil, G. P., and Taillie, C. (1979b). A study of diversity profiles and orderings for a bird community in the vicinity of Colstrip, Montana. In Contemporary Quantitative Ecology and Related Ecometrics, International Co-operative Publishing House, Fairland, MD. pp. 23-47.
Patil, G. P., and Taillie, C. (1982). Diversity as a concept and its measurement. Journal of the American Statistical Association, 77, 548-567.
Rao, C. R. (1982). Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology, 21, 24-43.
Rao, C. R. (1982). Gini-Simpson index of diversity: A characterization, generalization and applications. Utilitas Mathematica, 21, 273-282.
Ricotta, C. (2006). Through the jungle of biological diversity. Acta Biotheoretica. (To appear).
Rousseau, R., VanHecke, P., Nijssen, D., and Bogaert, J. (1999). The relationship between diversity profiles, evenness and species richness based on partial ordering. Environmental and Ecological Statistics, 6, 211-223.
Smith, W. K., and Grassle, J. F. (1977). Sampling properties of a family of diversity indices. Biometrics, 33, 283-292.
If a free society cannot help
the many who are poor,
it cannot save the few
who are rich.
John F. Kennedy, Inaugural Speech, 1961
Environmental and Ecological World:
Rare Species, Abundant Species:
Mindset
If the society cannot help with
the species that are rare (poor),
it cannot help save the few
that are abundant (rich).