The next generation of identification tools: interactive programs incorporating multivariate models

Pavel B. Klimov
Barry M. OConnor
University of Michigan, Museum of
Zoology, 1109 Geddes Ave., Ann Arbor, MI
Context:
The vast majority of interactive
identification programs use a sequential approach to assign an
unknown specimen to a known group. This algorithm works well
when the distinguishing characters do not have overlapping
values. If the boundaries between taxa overlap, simultaneous
(=probabilistic, matching) methods of identification are more
likely to lead to the correct assignment, but these methods
usually require time-consuming measurements or experiments. We
discuss how the sequential approach can be enhanced by
incorporating multivariate statistical models into it.
1. INTRODUCTION
Computer-assisted interactive identification allows quick
assignment of an unknown specimen to a known taxon with
minimal costs in obtaining data and learning about the unknown.
The number of characters used in the identification is
substantially reduced compared to traditional taxonomic keys.
For example, any of 128 taxa can be identified using only eight
binary characters, or even fewer numeric or multistate
characters. There are two major approaches to identification,
sequential (=elimination, diagnostic) and simultaneous
(=probabilistic, matching). In the sequential approach, only one
character is used at each step of identification until the unknown
specimen is assigned to a particular group. In the simultaneous
approach, some or all characters are entered simultaneously, and
the probability of group membership of the unknown specimen
is calculated.
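The character counts quoted above follow from a counting argument: k characters with s states each can tell apart at most s^k taxa. The sketch below computes that lower bound; the function name `min_characters` is hypothetical, and the bound assumes every character splits the remaining taxa perfectly, which real data matrices rarely do. For 128 taxa the bound is seven binary characters, so the eight cited above leave one character of slack.

```python
def min_characters(n_taxa: int, n_states: int = 2) -> int:
    """Smallest k such that n_states**k >= n_taxa (idealized lower bound)."""
    k = 0
    capacity = 1  # number of taxa distinguishable with k characters
    while capacity < n_taxa:
        capacity *= n_states
        k += 1
    return k


print(min_characters(128))     # binary characters -> 7
print(min_characters(128, 4))  # four-state characters -> 4
```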
The advantage of the sequential algorithm, particularly its
multi-entry variant (=freedom to choose any character), is
obvious when a taxon set is large and the taxa have distinct
boundaries. At each step, taxa matching the unknown are
retained and diagnostic characters for this subset are ordered
according to their separating power. This algorithm has been
implemented in a variety of interactive identification programs
such as DELTA and Lucid, which are widely used at present. In
contrast, simultaneous methods usually require data obtained by
time-consuming measurements or experiments and offer less
freedom in choosing characters, but are
more likely to lead to the correct assignment if the boundaries
between some or all taxa overlap.
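One step of the multi-entry sequential algorithm described above can be sketched as follows. The toy matrix, the function names, and the measure of separating power (the number of taxon pairs a character tells apart) are all assumptions made for illustration; this is not the measure used by DELTA or Lucid.

```python
from itertools import combinations

# Hypothetical toy data matrix: taxon -> {character: state}
MATRIX = {
    "taxon A": {"setae": "long", "legs": "thick", "shield": "present"},
    "taxon B": {"setae": "long", "legs": "thin", "shield": "absent"},
    "taxon C": {"setae": "short", "legs": "thick", "shield": "absent"},
    "taxon D": {"setae": "short", "legs": "thin", "shield": "absent"},
}


def retain(matrix, character, state):
    """Keep only the taxa whose recorded state matches the observation."""
    return {t: chars for t, chars in matrix.items()
            if chars.get(character) == state}


def separating_power(matrix, character):
    """Count the pairs of remaining taxa that differ in this character."""
    return sum(1 for a, b in combinations(matrix.values(), 2)
               if a.get(character) != b.get(character))


def rank_characters(matrix, used=()):
    """Order unused characters from most to least separating (ties by name)."""
    chars = {c for states in matrix.values() for c in states} - set(used)
    return sorted(chars, key=lambda c: (-separating_power(matrix, c), c))


print(rank_characters(MATRIX))               # ['legs', 'setae', 'shield']
remaining = retain(MATRIX, "setae", "short")  # the user may pick any character
print(sorted(remaining))                      # ['taxon C', 'taxon D']
print(rank_characters(remaining, used=("setae",)))  # ['legs', 'shield']
```

After each answer the ranking is recomputed for the retained subset only, which is why "shield" drops to zero separating power once taxon A has been eliminated.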
When a data set is large and contains taxa that cannot be
completely separated using qualitative, univariate, or
bivariate characters, both methods of identification must be
combined, with each approach handling the data appropriate to
it.
2. MULTIVARIATE MODELS
Multivariate statistics summarizes variation in many
variables in many specimens in the form of a concise model that
contains essential and comprehensive information about the
groups and that has predictive power. We consider two
multivariate techniques that are usually used to analyze
intergroup differences: canonical variates analysis (CVA), and
binomial logistic regression (LR). Both analyses handle metric
and non-metric independent variables.
A canonical variates function is a latent variable that is
created as a linear combination of independent variables,
CV = b1*x1 + b2*x2 + ... + bn*xn + c (1),
where the b's are coefficients, the x's are independent variables, and c is a constant.
If there are n groups, n-1 CVs are calculated. For assignment
purposes, the estimated posterior probability of group
membership is calculated, or, when multivariate normality of the
independent variables is assumed, the CV scores themselves can
equivalently be used.
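As a rough illustration of equation (1) and the assignment step, the sketch below scores an unknown on a single canonical variate and converts the score to posterior probabilities. It rests on assumptions not stated in the text: the CVs are scaled to unit within-group variance, groups are normally distributed with equal covariance, and the coefficients, centroids, and priors are made-up numbers rather than results from any fitted model.

```python
import math


def cv_score(x, b, c):
    """Equation (1): CV = b1*x1 + b2*x2 + ... + bn*xn + c."""
    return sum(bi * xi for bi, xi in zip(b, x)) + c


def posteriors(scores, centroids, priors):
    """Posterior probability of each group from the unknown's CV scores.

    scores    -- CV values of the unknown, one per canonical variate
    centroids -- group -> group mean on each canonical variate
    priors    -- group -> prior probability
    Assumes CVs with unit within-group variance, so the squared Euclidean
    distance to a centroid plays the role of a Mahalanobis distance.
    """
    weights = {}
    for group, centre in centroids.items():
        d2 = sum((s - m) ** 2 for s, m in zip(scores, centre))
        weights[group] = priors[group] * math.exp(-0.5 * d2)
    total = sum(weights.values())
    return {group: w / total for group, w in weights.items()}


# Hypothetical two-group model, hence a single CV
b, c = [0.8, -0.5], -1.0
centroids = {"group 1": [1.5], "group 2": [-1.5]}
priors = {"group 1": 0.5, "group 2": 0.5}

score = cv_score([3.0, 2.0], b, c)
post = posteriors([score], centroids, priors)
print(max(post, key=post.get), post)
```

Note that the group centroids and priors have to be stored alongside the coefficients; the coefficients alone are not enough to obtain probabilities.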
Logistic regression models can be expressed by the
following equation,
P(0) = exp(b1*x1 + b2*x2 + ... + bn*xn + c)/(1 + exp(b1*x1 + b2*x2 + ... + bn*xn + c)) (2),
where P(0) is the probability of an unknown specimen being taxon 0; other notation is
the same as for CVA above.
If P(0) exceeds 0.5, then the unknown is assigned to taxon 0,
otherwise to taxon 1.
A great advantage of LR over CVA is that it is a direct
estimator of posterior probabilities: it calculates the class
posterior probabilities without ever estimating the classes'
individual density functions, which requires additional data
(group means, prior probabilities, and the value of mean square
within groups).
3. INCORPORATING THE MODELS IN THE SEQUENTIAL ALGORITHM
Both (1) and (2) can be used in any sequential
identification program as a single character, “Model classifies
the unknown specimen to”, with the character states “group 1,
group 2, …, group n”. The user, however, should be asked simply
to enter measurements or observations, x1, x2, …, xn; then the
Bayesian (posterior) probabilities of membership in each group
are calculated, and the greater of these probabilities is used to
classify the specimen.
Implementation of the new data type will require some
adjustment in the internal logic of an identification program. In
the general case, there are some characters in the identification
matrix that can separate a subset of taxa without recourse to
multivariate models. These characters, whether they are
binary, multistate, or variable, should be given more weight
than the complex character generated by a multivariate
model. The latter should also be coded only for the subset of
taxa included in the model; for all other taxa this character
should be coded as "missing". Because a multivariate model
may contain characters that are used elsewhere in the
identification matrix, these matching characters should be
cross-referenced.
Results
• When a data matrix contains both discrete and overlapping
groups, the best way to identify a specimen is to combine the
sequential and probabilistic strategies, applying each to the
data it suits.
• Canonical variates and logistic regression models can be used
in the context of the sequential approach to calculate posterior
probabilities and to classify the unknown specimen.
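A minimal sketch of how a two-taxon logistic regression model (2) might be wrapped as a single character in a sequential key is given below. The coefficients, taxon names, and function names are hypothetical, and the choice to retain taxa coded "missing" for the model character is an assumption about the key's logic, not a description of an existing program.

```python
import math

# Taxa covered by the (hypothetical) model; all others are coded "missing"
MODEL_TAXA = ("taxon 0", "taxon 1")


def logistic_p0(x, b, c):
    """Equation (2), written in a numerically stable but equivalent form."""
    z = sum(bi * xi for bi, xi in zip(b, x)) + c
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def model_character_state(x, b, c):
    """State of the character 'Model classifies the unknown specimen to'."""
    p0 = logistic_p0(x, b, c)
    state = MODEL_TAXA[0] if p0 > 0.5 else MODEL_TAXA[1]
    return state, p0


def apply_model_character(candidates, state):
    """Eliminate only model taxa that disagree with the predicted state.

    Taxa outside the model carry a missing value for this character
    and are therefore never eliminated by it (assumed behaviour).
    """
    return [t for t in candidates if t not in MODEL_TAXA or t == state]


# The user enters only the measurements; the program derives the state.
measurements = [1.0, 0.5]    # x1, x2 (made-up values)
b, c = [1.2, -0.7], -0.3     # made-up coefficients and constant
state, p0 = model_character_state(measurements, b, c)
print(state, round(p0, 3))
print(apply_model_character(["taxon 0", "taxon 1", "taxon 2"], state))
```

Because the derived character only ever removes taxa that belong to the model, it can sit in the matrix next to ordinary characters without affecting taxa it was never fitted on.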
http://insects.ummz.lsa.umich.edu/beemites/Morphometrics.html
Research supported by NSF DEB-0118766 (PEET)
and the USDA (CSREES #2002-35302-12654).