Relevance Detection Approach to Gene Annotation

Relevance Detection Approach to
Gene Annotation
• Aid to automatic annotation of databases
• Annotation flow
– Extraction of molecular function of a gene from
literature
– Then annotation of this function with a term in a
controlled vocabulary
• Premise
– If the document sets retrieved by a GeneRIF and a GO
concept are similar then a link can be made between
them
Data
• GeneRIF/GO term pairs
– Paired if both reference the same MEDLINE article
– Manually filtered for obvious errors
– 550 pairs from 335 distinct genes
• GO concept = GO term + definition
• GeneRIFs and GO concepts too short for simple
keyword matching
• Treated as an IR problem
– Similar to TREC novelty track
– Compute relevance and similarity of 2 sentences
• Document set - TREC Genomics 2003 docs
• Each sentence within GeneRIF/GO concept
pair treated as IR query
• Similarity between the 2 computed based on
top 200 docs retrieved by each query
• Best Recall = 78.2% (prec = 22.1%)
• Best Precision = 66.2% (rec = 46.9%)
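The retrieval-based similarity described above can be sketched roughly as follows. The slides do not say which overlap measure was used, so Jaccard overlap is an assumption here, as are the function names, the toy document IDs and the threshold value.

```python
def top_k_overlap(ranked_a, ranked_b, k=200):
    """Jaccard overlap between the top-k document IDs of two ranked lists.

    Jaccard is an assumed measure; the slides only say that similarity is
    computed from the top 200 documents retrieved by each query.
    """
    top_a, top_b = set(ranked_a[:k]), set(ranked_b[:k])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)


def is_linked(ranked_generif, ranked_go, k=200, threshold=0.1):
    """Link a GeneRIF and a GO concept if their retrieved sets overlap enough.

    The threshold is hypothetical; raising it would trade recall for precision.
    """
    return top_k_overlap(ranked_generif, ranked_go, k) >= threshold


# Toy ranked result lists (hypothetical document IDs).
generif_hits = ["d1", "d2", "d3", "d4"]
go_hits = ["d3", "d4", "d5", "d6"]
score = top_k_overlap(generif_hits, go_hits, k=4)  # 2 shared out of 6 distinct
```

Varying the threshold is one plausible way to move between a high-recall and a high-precision operating point like the two reported above.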
GO Dependence Relations
• Previous work (PSB)
– Using substring matching between GO codes
– Derived from annotation databases, using vector space
models, co-occurrence, association rule-mining.
• ChEBI: www.ebi.ac.uk/chebi/
– Chemical Entities of Biological Interest
– Preferred names + synonyms
– IS_A (poly)hierarchy
Methods
• String matching
• If the same ChEBI entity is used within 2 GO
codes, they are in a dependence relationship
– First order relationship
– ChEBI term must be whole word or surrounded by
punctuation, e.g. carbonic anhydrase activity is not
related to carbon-oxygen lyase activity
• Also, in a dependence relationship with the
ancestors
– Second order relationship
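A minimal sketch of the first order matching rule, assuming GO term names and ChEBI entity names are available as plain strings. The identifiers `T1` to `T3`, the function names and the exact regular expression are illustrative assumptions, and the second order (ancestor) step is not covered.

```python
import re
from itertools import combinations


def contains_entity(go_name, entity):
    """True if `entity` occurs in `go_name` as a whole word.

    A whole word here means the match is not directly preceded or followed
    by a letter or digit, so punctuation such as a hyphen counts as a boundary.
    """
    pattern = r"(?<![A-Za-z0-9])" + re.escape(entity) + r"(?![A-Za-z0-9])"
    return re.search(pattern, go_name, flags=re.IGNORECASE) is not None


def first_order_pairs(go_terms, entities):
    """Pairs of GO term IDs whose names share at least one ChEBI entity."""
    pairs = set()
    for entity in entities:
        hits = sorted(go_id for go_id, name in go_terms.items()
                      if contains_entity(name, entity))
        pairs.update(combinations(hits, 2))
    return pairs


# Hypothetical IDs; the names illustrate the slide's example.
go_terms = {
    "T1": "carbonic anhydrase activity",
    "T2": "carbon-oxygen lyase activity",
    "T3": "carbon utilization",
}
pairs = first_order_pairs(go_terms, ["carbon"])
```

With this rule "carbon" matches inside "carbon-oxygen lyase activity" but not inside "carbonic anhydrase activity", so those two terms are not paired.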
Results
• 55% of GO terms contain a ChEBI entity
• 56% of dependent pairs with a ChEBI term found
in PSB study were identified in this study
• Less than 1% of GO term pairs found in this
study were identified by the PSB study
• Issues
– How to validate potential relationships?
– Usual naming/synonym ambiguity!
– Substrings not used: imidazolonepropionase
Disease Text Classification
• Task: Classification of text into one of 26
disease classes
• Used full text and weighted sections
according to information distribution
published by other groups
Data Preparation
• HTML full text documents, semi automatic
section division
• Tokenisation, Stemming, Stop word
filtering, Part of speech tagging
• Dataset: 21*25 positive full text articles, 33
negative full text articles
• 10 fold cross validation
• Nearest centroid classifier
Results
• Baseline: 56% F-score
• Additional preprocessing: 67%
– 10,000 stopword filter
– Only nouns
• Section weighting: 74%
– Abstract and Introduction weighted highest
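The combination of section weighting and a nearest centroid classifier could look roughly like the sketch below. The weight values, section names, class labels and the use of cosine similarity are all assumptions; the slides only state that Abstract and Introduction were weighted highest.

```python
import math
from collections import Counter, defaultdict

# Hypothetical weights: Abstract and Introduction count double.
SECTION_WEIGHTS = {"abstract": 2.0, "introduction": 2.0}


def doc_vector(sections, weights=SECTION_WEIGHTS):
    """Weighted bag of words; `sections` maps section name -> list of tokens."""
    vec = Counter()
    for name, tokens in sections.items():
        w = weights.get(name, 1.0)  # unlisted sections get weight 1.0
        for tok in tokens:
            vec[tok] += w
    return vec


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(labelled_docs):
    """Average the document vectors of each class into one centroid."""
    sums, counts = defaultdict(Counter), Counter()
    for label, sections in labelled_docs:
        sums[label].update(doc_vector(sections))
        counts[label] += 1
    return {label: {t: v / counts[label] for t, v in vec.items()}
            for label, vec in sums.items()}


def classify(sections, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = doc_vector(sections)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data with hypothetical class labels and tokens.
train = [
    ("asthma", {"abstract": ["airway", "inflammation"], "methods": ["cohort"]}),
    ("diabetes", {"abstract": ["insulin", "glucose"], "methods": ["cohort"]}),
]
centroids = train_centroids(train)
predicted = classify({"abstract": ["insulin"], "results": ["cohort"]}, centroids)
```

Tokens are assumed to be already stemmed and stop word filtered, as in the data preparation step.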
From Nonsense to Sense in
Healthcare Questions
• Diagnosis, Prognosis, Therapy, Prevention
• Medicine finds disease mechanisms by first
finding cures
– Currently by trial and error
• Try drug then test
– Future - test then try drug
• Biomarkers
– Normality -> dysfunction -> disease
– There are prognostic markers before any diagnostic
markers
Integrative Genomics
• Looking for hidden connections over wide
field, e.g.
– Immune system works too hard = rheumatoid
arthritis
– Immune system doesn’t work hard enough =
infectious diseases
Term Disambiguation
• 40% of genes have a homonym problem
• For 300 genes = 1 million MEDLINE articles
• After disambiguation = 60,000 articles
• 93% accuracy in assigning correct ID to ambiguous
genes
• Use contextual fingerprints:
– Experts choose 5 abstracts about a concept
– Fingerprint then created for that concept
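One way to picture the fingerprint idea is the sketch below. The slides do not describe how a fingerprint is built or compared, so the term-count vectors, the cosine comparison, the concept IDs and the toy abstracts are all assumptions (and only two abstracts per concept are used here instead of five, for brevity).

```python
import math
from collections import Counter


def fingerprint(abstracts):
    """Aggregate term counts over the expert-chosen abstracts for a concept."""
    counts = Counter()
    for text in abstracts:
        counts.update(text.lower().split())
    return counts


def cosine(a, b):
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def disambiguate(abstract, fingerprints):
    """Return the concept ID whose fingerprint best matches the abstract."""
    vec = Counter(abstract.lower().split())
    return max(fingerprints, key=lambda cid: cosine(vec, fingerprints[cid]))


# Two hypothetical senses of one ambiguous gene symbol.
fingerprints = {
    "gene:1": fingerprint(["kinase signalling in tumour cells",
                           "tumour growth and kinase activity"]),
    "gene:2": fingerprint(["muscle fibre contraction",
                           "contraction of cardiac muscle"]),
}
best = disambiguate("kinase inhibitor reduces tumour growth", fingerprints)
```

An article mentioning the ambiguous symbol is then assigned the ID of the closest fingerprint, which is how the article set could shrink after disambiguation.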