A text-mining analysis of the human phenome
European Journal of Human Genetics (2006) 14, 535-542
Marc A van Driel1, Jorn Bruggeman2, Gert Vriend1, Han G Brunner*,3 and Jack AM Leunissen2
1Centre for Molecular and Biomolecular Informatics, Radboud University Nijmegen, the Netherlands; 2Department of Bioinformatics, Wageningen University and Research Centre; 3Department of Human Genetics, University Medical Centre Nijmegen
Speaker: Yu-Ching Fang
Advisors: Hsueh-Fen Juan and Hsin-Hsi Chen
Outline
• Introduction
• Methods
• Results
• Discussion
Introduction
• Functional annotation of genes is an
important challenge once the sequence of
a genome has been completed.
• Previous studies have correlated various
attributes of human genes with the chance
of causing a disease.
Introduction (cont.)
• However, few attempts have been made to
systematically classify relationships
between genes and proteins at the
phenotype level.
Introduction (cont.)
• The Online Mendelian Inheritance in Man
(OMIM) database contains human disease
phenotype data and record-based textual
information, one gene or one genetic
disorder per record.
• Goal: Systematic grouping of genes by
their associated phenotypes from the
OMIM database.
Methods – The OMIM database
• Full text (TX) field: 5132 disease phenotype records out of 16357 OMIM records
Methods – The OMIM database (cont.)
• Clinical synopsis (CS) field
Creation of ‘feature vectors’
• MeSH terms and their components are
concepts.
• MeSH concepts serve as phenotype
features characterizing OMIM records.
Ex: OMIM_1->[MeSH_1,MeSH_2,…]
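A minimal sketch of how such a feature vector could be built, assuming that concepts are found by simple case-insensitive, whole-word matching of MeSH concept names in the record text. The function name `build_feature_vector`, the sample text, and the three-concept vocabulary are hypothetical; the matching procedure actually used in the paper is not shown on the slide.

```python
import re
from collections import Counter


def build_feature_vector(text, mesh_concepts):
    """Count how often each MeSH concept name occurs in a record's text.

    Assumption: plain whole-word, case-insensitive matching stands in for
    whatever concept recognition the original study used.
    """
    lowered = text.lower()
    counts = Counter()
    for concept in mesh_concepts:
        pattern = r"\b" + re.escape(concept.lower()) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[concept] = hits
    return counts


# Hypothetical record text and concept vocabulary, for illustration only.
record_text = "Degeneration of the retina. The retina and the eye are affected."
vector = build_feature_vector(record_text, ["Eye", "Retina", "Brain"])
print(dict(vector))
```

Concepts that never occur in the record ('Brain' here) are simply absent from the vector, which keeps the representation sparse.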
Refinement of the feature vectors
• MeSH concepts can be very broad like
‘Eye’ or more specific like ‘Retina’.
• A concept hierarchy describes
relationships such as ‘Eye’-‘Retina’-‘Photoreceptors’.
• Retina is a hyponym of Eye.
Refinement of the feature vectors (cont.)
• To ensure that the concepts eye and retina
are recognized as similar, the MeSH
hierarchy was used to encode this
similarity in the feature vectors by
increasing the value of all hypernyms.
r_c: relevance of concept c
r_c,counted: count of concept c in a document
r_hypo's: relevances of the hyponyms of concept c
n_hypo,c: the number of hyponyms of concept c
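The formula itself did not survive in this transcript. One form consistent with the symbol definitions above is r_c = r_c,counted + (sum of r_hypo's) / n_hypo,c, applied recursively up the hierarchy. The sketch below implements that assumed form; the `children` mapping, the counts, and the function name `expand_relevance` are hypothetical, and the hierarchy is treated as a simple tree.

```python
def expand_relevance(concept, counted, children, cache=None):
    """Relevance of a concept: its own count plus the mean relevance of its hyponyms.

    Assumed form: r_c = r_c,counted + sum(r_hypo) / n_hypo,c
    """
    if cache is None:
        cache = {}
    if concept in cache:
        return cache[concept]
    hyponyms = children.get(concept, [])
    relevance = float(counted.get(concept, 0))
    if hyponyms:
        total = sum(expand_relevance(h, counted, children, cache) for h in hyponyms)
        relevance += total / len(hyponyms)
    cache[concept] = relevance
    return relevance


# Hypothetical fragment of the MeSH hierarchy (hypernym -> direct hyponyms).
children = {"Eye": ["Retina", "Lens"], "Retina": ["Photoreceptors"]}
# Hypothetical raw counts from one record.
counted = {"Photoreceptors": 2, "Retina": 1}

expanded = {
    c: expand_relevance(c, counted, children)
    for c in ["Eye", "Retina", "Lens", "Photoreceptors"]
}
print(expanded)
```

With these numbers, 'Eye' receives a non-zero relevance even though it is never mentioned, so a record about the retina and a record about the eye now share a feature.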
Refinement of the feature vectors (cont.)
• Example of concept expansion using the
MeSH hierarchical structure.
Refinement of the feature vectors (cont.)
• Not all concepts in the OMIM records are
equally informative.
• Ex: ‘retina pigment epithelium’ occurs
rarely, and thus provides more specific
information than very frequent terms
such as ‘Brain’.
• Inverse document frequency measure
gw_c: inverse document frequency or global weight
of concept c
N: the total number of records (5080)
n_c: the number of records that contain concept c
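The weighting formula is missing from this transcript. A common inverse document frequency form that matches the symbols above is gw_c = log2(N / n_c); the base of the logarithm is an assumption here. The sketch below computes it for a hypothetical four-record collection; `global_weights` is an illustrative name.

```python
import math


def global_weights(records):
    """Compute an inverse document frequency weight per concept.

    Assumed form: gw_c = log2(N / n_c), where N is the number of records
    and n_c the number of records containing concept c.
    """
    n_records = len(records)
    doc_freq = {}
    for vector in records:
        for concept in vector:
            doc_freq[concept] = doc_freq.get(concept, 0) + 1
    return {c: math.log2(n_records / n) for c, n in doc_freq.items()}


# Hypothetical feature vectors for four records.
records = [
    {"Brain": 3, "Retina": 1},
    {"Brain": 1},
    {"Brain": 2},
    {"Brain": 1},
]
weights = global_weights(records)
print(weights)
```

A concept that appears in every record ('Brain' here) gets weight 0 and no longer contributes to similarity, while a rare concept is weighted up.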
Refinement of the feature vectors (cont.)
• Not all OMIM records contain equally
extensive descriptions (record size
differences).
• These differences will make a comparison
between records difficult because the
diversity and the frequency of concepts in
the larger records will exceed those in the
smaller records.
r_c: relevance of concept c
r_mf: the frequency of the most frequently occurring MeSH
concept in that record
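The correction formula is not preserved in this transcript. A standard way to normalize by the most frequent term, consistent with the two symbols above, is r_c = 0.5 + 0.5 * (r_c / r_mf); treat this exact form as an assumption. The sketch below applies it to one hypothetical record; `normalize_by_max` is an illustrative name.

```python
def normalize_by_max(vector):
    """Rescale relevances by the record's most frequent concept.

    Assumed form: r_c = 0.5 + 0.5 * (r_c / r_mf), which maps every concept
    present in the record into the range (0.5, 1.0].
    """
    if not vector:
        return {}
    r_mf = max(vector.values())
    if r_mf == 0:
        # Nothing to rescale when all relevances are zero.
        return dict(vector)
    return {c: 0.5 + 0.5 * (r / r_mf) for c, r in vector.items()}


# Hypothetical relevances for one record.
normalized = normalize_by_max({"Brain": 4, "Retina": 1})
print(normalized)
```

After this step a long record and a short record with the same relative concept profile yield the same values, which is the point of the correction.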
Comparing OMIM records
• The similarity between OMIM records can
be quantified by comparing the feature
vectors that are expanded and corrected.
• Similarities between feature vectors were
determined by the cosines of their angles.
s(X,Y): the similarity between the
feature vectors X and Y
x_i, y_i: concept frequencies in X and Y
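The cosine of the angle between two vectors is s(X,Y) = sum(x_i * y_i) / (sqrt(sum(x_i^2)) * sqrt(sum(y_i^2))). The sketch below computes it for sparse feature vectors stored as dictionaries; the dictionary representation and the sample vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(value * y[concept] for concept, value in x.items() if concept in y)
    norm_x = math.sqrt(sum(value * value for value in x.values()))
    norm_y = math.sqrt(sum(value * value for value in y.values()))
    if norm_x == 0 or norm_y == 0:
        # An empty or all-zero vector has no direction; treat as dissimilar.
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical feature vectors.
a = {"Eye": 1.0, "Retina": 1.0}
b = {"Eye": 1.0}
c = {"Brain": 2.0}
print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Because all values are non-negative, the score lies between 0 (no shared concepts) and 1 (identical concept profiles).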
Results – Comparing OMIM records
• 5080 of the 5132 OMIM records matched one or more
MeSH terms.
• The 5080x5080 pair-wise feature vector similarities form
the phenomap (all-to-all similarities).
Most phenotype-phenotype pairs have a low similarity score.
Comparing OMIM records - The best scores for all
phenotypes in the disease phenotype data set
• For each OMIM record, the most similar of the
other 5079 records was identified.
• Moderately similar phenotype pairs might still
yield reasonable hypotheses.
Ex: ‘Fibromuscular Dysplasia of Arteries’ and
‘Cardiomyopathy, Familial Hypertrophic’ have a
similarity score of 0.31.
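The best-score search can be sketched as an all-against-all comparison that keeps, for each record, the highest-scoring other record. The record identifiers "A", "B", "C" and their vectors are hypothetical, and the brute-force loop is only an illustration; it is not claimed to be how the phenomap was computed.

```python
import math


def cosine(x, y):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(v * y[c] for c, v in x.items() if c in y)
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(
        sum(v * v for v in y.values())
    )
    return dot / norm if norm else 0.0


def best_matches(vectors):
    """For each record, return (most similar other record, similarity score).

    Requires at least two records; compares every record with every other one.
    """
    best = {}
    for record_id, vector in vectors.items():
        scored = [
            (other_id, cosine(vector, other))
            for other_id, other in vectors.items()
            if other_id != record_id
        ]
        best[record_id] = max(scored, key=lambda pair: pair[1])
    return best


# Hypothetical records and feature vectors.
vectors = {
    "A": {"Eye": 1.0, "Retina": 2.0},
    "B": {"Eye": 1.0, "Retina": 1.0},
    "C": {"Brain": 3.0, "Eye": 1.0},
}
best = best_matches(vectors)
for record_id, (other_id, score) in best.items():
    print(record_id, "->", other_id, round(score, 2))
```

Note that the best match is not symmetric: record "C" is closest to "B", but "B" is closest to "A".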
Comparing OMIM records (cont.)
• Conclusion: The more phenotypes
resemble each other, the more likely they
are to share an interaction.
Discussion
• Developed a text-mining approach to map
relationships between more than 5000
human genetic disease phenotypes from
the OMIM database.
• Phenotype clustering reflects the modular
nature of human disease genetics. Thus,
the phenomap may be used to predict
candidate genes for diseases.