Database Representation of Phenotype: Issues and Challenges
Prakash Nadkarni
Human Phenotyping Studies

“Phenotype” means different things to clinical
researchers and to classical human or animal
geneticists.
To the latter, it has traditionally been a “syndrome”,
consisting of one or more detectable or visible
traits.
These days, it is often defined in terms of variation from the norm (for better or for worse).
The single most useful catalog of human variation
is Online Mendelian Inheritance in Man (OMIM).
Being a text database, OMIM has limited
computability.
Why standardize electronic phenotype representation?

In the post-sequencing era of genomics, electronic
publication of primary data may eventually be
mandated, like sequence data in GenBank.
Requirements:
As in publication of research papers, the description
must be detailed and unambiguous enough to allow
others to reproduce the experimental design.
 Must allow the possibility of data mining by analytical
tools. While complete “understanding” of the data by a
computer is rarely possible, the objective is to facilitate
understanding by computer-aided human interaction.

Challenges and Caveats

The data itself is highly diverse: sub-cellular, cellular, organ system, clinical. For structured data, the format of the data depends on its nature.
A single batch of data could include combinations of types of data (e.g., both alterations in enzyme function and clinical descriptors).
The potential number of data elements could range in
the hundreds of thousands across the entire life
sciences field.
Phenotype-genotype “correlation” becomes difficult in
multigenic disorders where thousands of genetic loci
have been screened. A given trait may be the result of
several interacting haplotypes.
Challenges and Caveats II

Data does not age well: if more than a few years old, it is unlikely to be of much use, because newly discovered parameters now considered essential to phenotype characterization may be missing from old data (e.g., in diabetes, insulin deficiency vs. insulin resistance at various levels).
Mining of someone else’s data only suggests hypotheses. To confirm them, one usually needs to go back to the original subjects, which is not always possible (re-consenting requirements, HIPAA).
How has Phenotype Data been represented so far?

Unstructured Data: narrative text is used to describe qualitative findings (e.g., OMIM).
Structured Data: multiple values of quantitative data are typically represented in tables/spreadsheets that are described and annotated by accompanying text (descriptors or “metadata”).
For a mix of structured and unstructured data, one
must fall back on narrative (free) text. (Even after the
standardization approaches described later, narrative
text will still be necessary, because it captures
nuances that codification cannot.)
Phenotype Database Structure

The problem of representing phenotypic data is very
similar to the problem of representing clinical patient
data in clinical patient record systems.
A vast number of clinical parameters can potentially
apply to a human subject, but for a given clinical
study, only a modest number of parameters actually
apply.
The same modeling approach, Entity-Attribute-Value (EAV), can be used; it was first used in the TMR system (Stead and Hammond) in the 1970s.
For phenotyping data, the major challenge is one of imposing an organization on the universe of attributes, i.e., standardizing the metadata.
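
A minimal sketch of the EAV idea, in Python: each row is an entity/attribute/value triple, so each subject stores only the parameters that actually apply to it. The class name, subject IDs, and the use of LOINC codes as attribute identifiers are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Observation:
    """One EAV row: who was measured, what was measured, and the result."""
    entity: str               # the subject or sample being described
    attribute: str            # a metadata-defined parameter, e.g. a vocabulary code
    value: Union[float, str]  # the observed value

# Sparse by construction: no table needs a column for every possible parameter.
observations = [
    Observation("subject-001", "LOINC:2345-7", 92.0),   # serum glucose, mg/dL
    Observation("subject-001", "LOINC:29463-7", 70.3),  # body weight, kg
    Observation("subject-002", "LOINC:2345-7", 118.0),
]

def values_for(entity: str, attribute: str) -> List[Union[float, str]]:
    """Return all recorded values of one attribute for one entity."""
    return [o.value for o in observations
            if o.entity == entity and o.attribute == attribute]

print(values_for("subject-001", "LOINC:2345-7"))  # [92.0]
```

In a relational database the same idea is a single three-column table; the attribute column is then constrained by the standardized metadata discussed next.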
Standardizing Phenotype Metadata

While the potential number of data elements is vast, the number of types of data elements is much smaller, and is possibly tractable.
E.g., there are thousands of different clinical lab tests. However, all of them belong to one type of data (“lab test”). The description of a lab test is standardized by LOINC:
 Data type (e.g., number, string)
 Source of sample; how the sample is collected (random vs. post-prandial, single vs. cumulative); how the test is performed (bibliographic reference)
 Units in which recorded; precision of the method
 Maximum and minimum legal values; normal range of values, if applicable
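
To make the LOINC-style description above concrete, here is a hypothetical descriptor record; the field names are assumptions chosen to mirror the list above, not LOINC's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LabTestDescriptor:
    """Hypothetical standardized description of one lab test."""
    test_id: str                                 # stable vocabulary identifier
    name: str
    data_type: str                               # e.g. "number", "string"
    sample_source: str                           # e.g. "serum", "whole blood"
    collection: str                              # e.g. "random", "post-prandial"
    method_ref: Optional[str]                    # bibliographic reference for the method
    units: str                                   # units in which the value is recorded
    precision: Optional[float]                   # precision of the method
    legal_range: Optional[Tuple[float, float]]   # minimum and maximum legal values
    normal_range: Optional[Tuple[float, float]]  # normal range, if applicable

glucose = LabTestDescriptor(
    test_id="LOINC:2345-7",
    name="Glucose [Mass/volume] in Serum or Plasma",
    data_type="number", sample_source="serum", collection="random",
    method_ref="glucose oxidase method", units="mg/dL", precision=1.0,
    legal_range=(0.0, 2000.0), normal_range=(60.0, 100.0),
)
```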
Objectives of Metadata Standardization

By structuring the metadata to a sufficient degree of
richness, software can compare the scientific
metadata accompanying the data items from
disparate data sets and determine whether it is safe
to combine them in a meta-analysis.
An example of data that cannot be directly combined is blood glucose done by tests detecting reducing substances (normal 80-120 mg/dl) vs. tests based on glucose oxidase methods (normal 60-100 mg/dl).
A more mundane example is weight in kilograms vs. pounds, where conversion factors can simply be employed.
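
A toy sketch of the combination check implied above: two data sets are directly poolable only if their analytical methods agree, while simple unit mismatches can be reconciled by a conversion factor. The dictionary-based descriptors and conversion table are illustrative.

```python
# Known unit conversions: multiply a value in the first unit to get the second.
CONVERSIONS = {("lb", "kg"): 0.45359237, ("kg", "lb"): 1 / 0.45359237}

def combinable(a: dict, b: dict) -> str:
    """Decide whether two described data sets can be pooled, and how."""
    if a["method"] != b["method"]:
        return "incompatible: different analytical methods"
    if a["units"] == b["units"]:
        return "directly combinable"
    if (b["units"], a["units"]) in CONVERSIONS:
        factor = CONVERSIONS[(b["units"], a["units"])]
        return f"combinable after multiplying by {factor:.6f}"
    return "incompatible: no known unit conversion"

site_a = {"method": "glucose oxidase", "units": "mg/dL"}
site_b = {"method": "reducing substances", "units": "mg/dL"}
print(combinable(site_a, site_b))  # incompatible: different analytical methods

wt_a = {"method": "scale", "units": "kg"}
wt_b = {"method": "scale", "units": "lb"}
print(combinable(wt_a, wt_b))      # combinable after multiplying by 0.453592
```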
Mechanisms of Standardization: Controlled Vocabularies

Controlled Vocabularies organize the concepts (attributes) that
describe a particular domain into a taxonomy.
Address the issues of different synonyms or lexical forms for
the same concept. (e.g., “glucose, blood” vs. “blood glucose”).
Concepts are assigned stable identifiers which do not change between versions of the vocabulary.
Using an ID as part of the annotation of a dataset minimizes ambiguity, and reduces the need to supply numerous inferable details.
Limitations: for questionnaires, e.g., those used in psychometry, each questionnaire is essentially its own vocabulary; mapping to “standard” vocabularies is rarely possible. (LOINC is trying to incorporate these questionnaires themselves, but newly devised questionnaires will be left out until a standards committee decides to incorporate them.)
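
A toy illustration of how a controlled vocabulary collapses synonyms and lexical variants onto one stable identifier; the IDs and the normalization rule are invented for the example.

```python
from typing import Optional

# Several lexical forms, one stable concept identifier.
SYNONYMS = {
    "blood glucose": "CV:000123",
    "glucose, blood": "CV:000123",
    "serum glucose": "CV:000123",
}

def concept_id(term: str) -> Optional[str]:
    """Resolve a term to its stable ID, ignoring case and extra whitespace."""
    return SYNONYMS.get(" ".join(term.lower().split()))

print(concept_id("Glucose,  Blood"))  # CV:000123
print(concept_id("hemoglobin A1c"))   # None (not in this toy vocabulary)
```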
Examples of Controlled Vocabularies / IDs in the Life Sciences

Bibliographic databases: PubMed ID, OMIM ID, etc.
Sequence and Other Biology Databases: Gene Ontology ID, GenInfo ID, EC Number, etc.
Clinical: SNOMED for clinical diagnosis, LOINC for Lab tests
and Clinical Observations.
Different controlled vocabularies are specialized for managing
different types of data. Where there is overlap, some are
known to do a better job than others.
Liberal use of vocabulary identifiers for annotation and cross-referencing in narrative text, as in OMIM (structured text).
Some degree of automated mapping of text to identifiers is
possible, but the results require curation.
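
As a rough sketch of such automated mapping, simple dictionary matching over narrative text is shown below; real concept mapping is considerably harder, which is why the results still require curation. The lexicon and IDs are invented.

```python
import re
from typing import List, Tuple

LEXICON = {
    "blood glucose": "CV:000123",
    "insulin resistance": "CV:000456",
}

def annotate(text: str) -> List[Tuple[str, str, int]]:
    """Return (matched term, concept ID, offset) for each lexicon hit."""
    hits = []
    for term, cid in LEXICON.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((m.group(0), cid, m.start()))
    return sorted(hits, key=lambda h: h[2])

print(annotate("Elevated blood glucose suggests insulin resistance."))
# [('blood glucose', 'CV:000123', 9), ('insulin resistance', 'CV:000456', 32)]
```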
Requirements of Submission Tools

Must manage a balancing act: ease and convenience of use vs. rigorous validation.
Must facilitate the (re-)use of standard definitions,
both within a submission as well as across
submissions by the same investigator, or different
investigators.
The need by the repository’s curators to define the minimum set of mandatory descriptors (cf. MAGE for microarray experiments). The value of data that is not accompanied by this minimum set may be markedly diminished for investigators other than those who originated it. (The nature of the minimum set varies with the category of data.)
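
One way a submission tool might enforce such a minimum set, sketched under the assumption of per-category checklists; the categories and required fields below are invented examples.

```python
from typing import Dict, Set

# Hypothetical per-category minimum descriptor sets (cf. MAGE-style checklists).
MINIMUM_DESCRIPTORS: Dict[str, Set[str]] = {
    "lab_test": {"units", "sample_source", "method_ref", "normal_range"},
    "questionnaire": {"instrument_name", "version", "scoring_method"},
}

def missing_descriptors(category: str, supplied: dict) -> Set[str]:
    """Return the mandatory descriptors the submitter has not yet provided."""
    required = MINIMUM_DESCRIPTORS.get(category, set())
    provided = {k for k, v in supplied.items() if v is not None}
    return required - provided

item = {"units": "mg/dL", "sample_source": "serum", "method_ref": None}
print(missing_descriptors("lab_test", item))
# {'method_ref', 'normal_range'} (set order may vary)
```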
Requirements (2)

The submission tools must provide intuitive
searching of previously created definitions, or reuse
is unlikely to occur.
The submission tools must provide access to a set of controlled vocabularies. These must be searched by curators, often in response to researchers who declare the intention to submit data. (Based on limited experience, we believe that requiring submitters to explore vocabularies during the process of submission itself may be too onerous; the addition of vocabulary mappings may therefore be done after submission.)
Structure of a Data Submission

A submission consists of two parts:
 descriptors (metadata) for each data item in the submission
 the data itself
Based on the category of phenotypic data for that
item, the minimum set of descriptors for each item
must be specified.
Some examples of descriptors that are possibly universal:
 Primary data vs. aggregated (mean, S.D.)
 Qualitative vs. quantitative (nominal, ordinal, interval, ratio); choice sets
 Raw vs. derived (e.g., a composite score in psychometry)
 Test of significance, p-value (possibly applicable)
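
A sketch of what such a two-part submission could look like; the JSON-like layout and field names are illustrative, not a mandated format.

```python
# Part 1: descriptors (metadata) for each data item; part 2: the data itself.
submission = {
    "descriptors": {
        "fasting_glucose": {
            "aggregation": "primary",    # primary vs. aggregated
            "scale": "ratio",            # nominal / ordinal / interval / ratio
            "derivation": "raw",         # raw vs. derived
            "units": "mg/dL",
        },
        "depression_score": {
            "aggregation": "primary",
            "scale": "ordinal",
            "derivation": "derived",     # composite psychometric score
            "choice_set": list(range(0, 64)),  # legal values for the score
        },
    },
    "data": [
        {"subject": "subject-001", "fasting_glucose": 92.0, "depression_score": 11},
        {"subject": "subject-002", "fasting_glucose": 118.0, "depression_score": 25},
    ],
}

print(len(submission["descriptors"]), "descriptors,", len(submission["data"]), "records")
```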
Conclusions & Summary

It is essential to define minimum descriptors for a variety of
phenotypic parameters. While these have been defined for
clinical parameters, defining them for pre-clinical parameters
is a major challenge.
Collaboration within and across research consortia, coupled
with experience, will determine the minimum set; several
iterations may be required.
Some forbearance on the part of informatics personnel (who are building the tools for electronic submission) may be required: some investigators who submit data may not want to be bothered to provide the extensive metadata needed to make that data understandable to others. Making the tools intuitive enough, and comprehensive enough, will also take several iterations.
Acknowledgments