Introduction to molecular biology…


BioText Conference
Birkbeck College, London
Stephen Edwards BSc.
University of Edinburgh
BioNLP meeting 14th November 2005
Speakers
– EBIMed
– SBSS
– Astra-Zeneca
– Rob Gaizauskas (Sheffield)
– BioRAT
– Andrew Clegg (UoL)
EBIMed

Co-occurrence based IE (looking at parse methods)
Created on the fly
40% sentences retrieved useful for PPI
Includes navigation to databases (important)
Accessible data
max 10,000 (moving to full papers)

Whatizit modules





– High speed tagging modules
– Can be hooked up to any dictionary



RegExp and ML approaches should be combined
Recall: “The whole truth”
Precision: “Nothing but the truth”
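The two definitions above, and the F-measure quoted later in these notes, can be made concrete with a small sketch. The function name and the toy protein pairs are hypothetical; the F-measure here is the usual balanced harmonic mean, which the slides do not spell out.

```python
def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: how much of what was extracted is correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: how much of what is correct was extracted.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four extracted interaction pairs, five true ones.
extracted = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "E")}
truth = {("A", "B"), ("A", "C"), ("B", "D"), ("D", "E"), ("E", "F")}
p, r, f = precision_recall_f1(extracted, truth)
```

A maximum-recall tagger followed by a high-precision filter trades the first number against the second; the F-measure summarises both.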
SBSS

Business uses:
– Patent recognisers
– IP protection
– Drug design
– Author networks, competition and funding
– Marketing
NLP system, statistical linking between concepts
External/Internal databases
Includes web-sites, forums (negate false rumours!)
Microarray -> Text-Mining
User interface to add new synonyms
IBM Unstructured Information Management Architecture
Astra-Zeneca

Drug discovery process => masses of data
(chem/bio assays, clinical trials, reports etc)


Track competition, groups
GCLit
- gene summaries
- MeSH/gene co-occurrences


Produce similarity matrices for two genes
Back-dating to trap now-known associations
Rob Gaizauskas


GO tagging (19,022 terms), GOSlims
Many tools:
– GOPubmed (weighted GO->doc assignment)
– GO-KDS (commercial; assigns GO terms to PubMed; rubbish!)
– BLAST -> Lit, cluster and structure by GO code

AMBIT
– combines IR/IE
– Termino module – GO, UniProt, UMLS

DiscoveryNet
– data management software, includes Termino

Created complete GO corpus
– fuzzy match+manual to get GO complete corpus

High results achieved assigning GO to abstracts
– F-measure 0.8
– – dubious, difficult to replicate evaluation as GO codes incomplete


User view applet: GO | Abstracts
Glass ceiling: too much tinkering, more fundamental ideas needed
Bio Research Assistant
(BioRAT)



100+ words/sec
PhD grads in India – pay them!
Tagging, PPI extraction, based on GATE
(further funding 5 yrs)
– User defines concepts of interest
– program defines templates
– select or reject, most are poor, time costly

Or,
– ML sequence-aligns sentences to produce templates
– requires less effort but less reliable
NER (Andrew Clegg)
Trees – discard parts of the tree that are not needed
NER
– achieve max recall, then filter through ABNER (high precision)
– Create every possible variant: strip punctuation, substitute Greek letters, remove stop words, long/short names
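The variant-generation step in the NER notes above might look roughly like this. It is a minimal sketch, not the speaker's code: the lookup tables are hypothetical and deliberately tiny, and the long/short name expansion is left out.

```python
import re

# Hypothetical lookup tables; a real system would use far larger resources.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "κ": "kappa"}
STOP_WORDS = {"of", "the", "a", "an", "and", "in"}


def name_variants(name):
    """Return a set of lower-cased surface variants for an entity name."""
    base = name.lower()
    # Substitute Greek letters with their spelled-out names.
    for symbol, spelled in GREEK.items():
        base = base.replace(symbol, spelled)
    variants = {base}
    # Strip punctuation twice: once replaced by a space, once deleted.
    spaced = re.sub(r"[^\w\s]", " ", base)
    variants.add(" ".join(spaced.split()))
    variants.add(" ".join(re.sub(r"[^\w\s]", "", base).split()))
    # Remove stop words from every variant collected so far.
    for variant in list(variants):
        kept = [w for w in variant.split() if w not in STOP_WORDS]
        if kept:
            variants.add(" ".join(kept))
    return variants
```

Matching text against every variant pushes recall up; the high-precision filter mentioned above then removes the false hits this produces.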
MMTx – Mapping the UMLS to text
Overview

– UMLS
– MMTx
– Hypothesis generation
– milkER
– Future use


UMLS


Unified Medical Language System
Multi-source vocabulary (>60 families)
– ~2.5 million terms

Concepts in semantic network
– ~12,000,000 relations between concepts


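Lexicon
Many IDs
– AUI (atom unique identifier)
– SUI (string unique identifier)
– CUI (concept unique identifier)
– TUI (semantic type unique identifier)
Customisable (MetamorphoSys)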
MMTx

Preparatory filtering
– Relaxed (manual, lexical: 87%)
– Moderate (relaxed + type-based: 75%)
– Strict (moderate + syntactic)
Highly computationally expensive
Options
– Restrict to sources
– Restrict to semantic types
– Show CUIs, semantic types, treecodes
MMTx parsing

Parsed into noun phrases
– SPECIALIST minimal commitment parser/MedPost SKR

Variant generation
– Largely preprocessed


Candidate retrieval
Candidate evaluation
– Centrality
– Variation
– Coverage
– Cohesiveness
Mapping
– Combines candidates
– Mapping evaluation (as with candidates)
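The slide lists the four evaluation measures but not how they are combined. A minimal sketch, assuming the weighting usually described for MetaMap (coverage and cohesiveness counted twice, result scaled to 0-1000); the function name and the exact weights are assumptions here, not taken from the talk.

```python
def candidate_score(centrality, variation, coverage, cohesiveness):
    """Combine four measures (each in [0, 1]) into a 0-1000 score.

    Assumed weighting: coverage and cohesiveness count double.
    """
    for value in (centrality, variation, coverage, cohesiveness):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each measure must lie in [0, 1]")
    weighted = centrality + variation + 2 * coverage + 2 * cohesiveness
    return round(1000 * weighted / 6)
```

Under this assumption a candidate that is perfect on all four measures scores 1000, which matches the top scores seen in MMTx output.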
Sentence: 0|0|183|Progress is described on the advanced stages in design of an instrument for the study of
red blood cell aggregation and blood viscosity under near-zero gravity conditions.|11540609:1
Phrase: "Progress"
Meta Mapping (1000)
1000 C1280477:Progress [Functional Concept] {}
Phrase: "is"
Meta Candidates (0): <none>
Meta Mappings: <none>
Phrase: "described"
Meta Candidates (0): <none>
Meta Mappings: <none>
Phrase: "on the advanced stages"
Meta Mapping (888)
694 C0205179:Advanced [Qualitative Concept] {}
861 C1306673:Stages [Functional Concept] {}
Phrase: "in design"
Meta Candidates (0): <none>
Meta Mappings: <none>
Phrase: "of an instrument"
Meta Mapping (1000)
1000 C0348000:Instrument, NOS [Manufactured Object] {}
Phrase: "for the study"
Meta Mapping (1000)
1000 C0008972:Study (Clinical Research) [Research Activity] {}
Meta Mapping (1000)
1000 C0557651:Study [Manufactured Object] {}
Phrase: "of red blood cell aggregation"
Meta Mapping (916)
756 C0014792:Blood Cell, Red (Erythrocytes) [Cell] {}
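Output in the format shown above is easy to read back into a program. The following is a sketch only: the regular expression is inferred from the sample lines on this slide rather than from any MMTx specification, and the function name is hypothetical.

```python
import re

# Matches candidate lines such as:
#   1000 C1280477:Progress [Functional Concept] {}
CANDIDATE = re.compile(r"^\s*(\d+)\s+(C\d{7}):(.+?)\s+\[([^\]]+)\]")


def parse_mmtx(lines):
    """Yield (phrase, score, cui, name, semantic_type) tuples."""
    phrase = None
    for line in lines:
        line = line.strip()
        if line.startswith("Phrase:"):
            # Remember the current phrase, without its surrounding quotes.
            phrase = line.split(":", 1)[1].strip().strip('"')
            continue
        match = CANDIDATE.match(line)
        if match and phrase is not None:
            score, cui, name, sem_type = match.groups()
            yield phrase, int(score), cui, name, sem_type


sample = [
    'Phrase: "Progress"',
    "Meta Mapping (1000)",
    "1000 C1280477:Progress [Functional Concept] {}",
    'Phrase: "is"',
    "Meta Candidates (0): <none>",
    'Phrase: "for the study"',
    "Meta Mapping (1000)",
    "1000 C0008972:Study (Clinical Research) [Research Activity] {}",
]
rows = list(parse_mmtx(sample))
```

Phrases with no candidates, such as "is", simply produce no rows.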
MMTx customisation







Advised to customise
English only sources used
Removed inappropriate sources
~2 secs/sentence (~12× improved performance)
Can limit to sources, semantic types
Running on Windows, FC2 Linux
Lots of fudging required!
Hypothesis generation





Aim to extract interactions and diseases
Swanson (Fish oil – Blood viscosity – Raynaud’s disease)
Srinivasan (Turmeric – NF-κB – Crohn’s disease)
Weeber (Thalidomide – IL-4 – Pancreatitis)
Confirmed experimentally
Hypothesis generation


Open/Closed
Co-occurrence relationship extraction
A (Raynaud’s Disease) – B1, B2, B3 (Blood Viscosity), B4
Hypothesis generation
B3 (Blood viscosity) – C1, C2, C3 (Fish Oil), C4
Hypothesis generation

Closed
A – B1, B2, B3, B4
C – B5, B2, B6, B1
Need to remove known A – C relationships
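The open and closed searches can be sketched over a simple co-occurrence index. This is an illustration under stated assumptions, not the milkER implementation: documents are reduced to lists of terms, co-occurrence means sharing a document, and all function names and the toy corpus are hypothetical.

```python
from collections import defaultdict


def cooccurrence_index(documents):
    """Map each term to the set of terms it shares a document with."""
    index = defaultdict(set)
    for terms in documents:
        unique = set(terms)
        for term in unique:
            index[term] |= unique - {term}
    return index


def open_discovery(index, a_term):
    """Return {C: linking B terms} for C terms never seen with A."""
    known = index.get(a_term, set())
    candidates = defaultdict(set)
    for b_term in known:
        for c_term in index.get(b_term, set()):
            # Drop A itself and any already-known A - C relationship.
            if c_term != a_term and c_term not in known:
                candidates[c_term].add(b_term)
    return dict(candidates)


def closed_discovery(index, a_term, c_term):
    """Return the B terms that co-occur with both A and C."""
    return index.get(a_term, set()) & index.get(c_term, set())


# Toy corpus: each document is the list of terms found in it.
docs = [
    ["raynaud's disease", "blood viscosity"],
    ["raynaud's disease", "platelet aggregation"],
    ["blood viscosity", "fish oil"],
    ["platelet aggregation", "fish oil"],
]
index = cooccurrence_index(docs)
hypotheses = open_discovery(index, "raynaud's disease")
links = closed_discovery(index, "raynaud's disease", "fish oil")
```

Open discovery starts from A alone and proposes C terms; closed discovery starts from a given A and C and looks for the B terms connecting them.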
Other systems
Manjal – MeSH only, basic
LitLinker – shows associations by frequency
TransMiner – can be linked to MicroArray
DAD – Drug Adverse Drug Reactions
i-HOP – slick informative sentences (e.g. experimental evidence, synonyms, hyperlinked, BUT five species only)
EBIMed – linkouts to external databases
(Refs cited at end)
milkER program
Manjal
DAD
milkER program
Input MEDLINE A/B/C term
Extract Titles, Abstracts, MeSH, Substance Terms
Sort and count MeSH and Substance terms
Tag milk proteins/peptides or term in title and abstracts
Extract sentences containing the protein/peptide or term
Partial standardisation, remove overmatching
UMLS tagging (customised MMTx)
Variable parameters
Filter and sort terms
1. Physiological function
2. Entity
3. Location
4. Combined
5. No filter
(filter by MMTx weight?)
Group terms by concept (removes plurals and variants)
Remove terms that are too general, on the second or third level of the UMLS hierarchy
Remove parent or child terms of search term
Cluster concepts by using the head of the noun phrase (? e.g. common and right migraine)
Remove over-abundant terms, e.g. >15,000 documents
Calculate weighting of term
– TF*IDF
– Level of support of relationship (e.g. must occur in >5 titles with A term, or is spurious)
Select B terms for subsequent analysis
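The last steps of the pipeline (over-abundance cut-off, TF*IDF weighting, minimum support) could be combined as below. This is a sketch under assumptions: the slides do not say which TF*IDF variant is used, so a plain relative-frequency times log-inverse-document-frequency form is assumed, and the function names and the layout of `stats` are hypothetical. The thresholds of 5 titles and 15,000 documents are the ones given above.

```python
import math


def tf_idf(term_count, total_terms, docs_with_term, total_docs):
    """Relative term frequency times log inverse document frequency."""
    tf = term_count / total_terms
    idf = math.log(total_docs / docs_with_term)
    return tf * idf


def select_b_terms(stats, total_terms, total_docs,
                   min_support=5, max_docs=15000):
    """Rank candidate B terms, best first.

    stats maps term -> (count in the retrieved set, number of titles
    shared with the A term, documents in the whole collection that
    contain the term).
    """
    ranked = []
    for term, (count, support, doc_freq) in stats.items():
        # Must occur in more than min_support titles, else spurious.
        if support <= min_support:
            continue
        # Over-abundant terms carry little information.
        if doc_freq > max_docs:
            continue
        weight = tf_idf(count, total_terms, doc_freq, total_docs)
        ranked.append((weight, term))
    return [term for _, term in sorted(ranked, reverse=True)]


# Hypothetical counts for illustration only.
stats = {
    "blood viscosity": (12, 8, 3000),
    "platelet aggregation": (6, 6, 9000),
    "human": (40, 30, 900000),
    "rare term": (2, 1, 50),
}
b_terms = select_b_terms(stats, total_terms=200, total_docs=1000000)
```

Here "human" is dropped as over-abundant and "rare term" as insufficiently supported.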
Features


User defined gazetteer
Removes overmatches
– <prot>casein</prot> kinase
– Currently hard-coded

Some standardisation
– E.g. alpha-CN => alpha casein
– prevents loss of data from MMTx




Each sent/title given unique ID
Main MeSH terms, MeSH terms, Substance terms
Can use any PubMed query, PMID etc
Did you mean?
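The standardisation and overmatch-removal features might be sketched as follows. The two lookup tables are hypothetical stand-ins for the user-defined gazetteer and the currently hard-coded overmatch rules; only the "alpha-CN" and "casein kinase" examples come from the slide.

```python
import re

# Hypothetical tables: abbreviation -> standard form, and protein name ->
# following words that show the hit is really part of another entity.
STANDARD_FORMS = {"alpha-CN": "alpha casein"}
OVERMATCH_FOLLOWERS = {"casein": {"kinase"}}


def standardise(text):
    """Rewrite known abbreviations to their standard form."""
    for short, full in STANDARD_FORMS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    return text


def tag_proteins(text, gazetteer):
    """Wrap gazetteer hits in <prot> tags, skipping known overmatches."""
    def replace(match):
        word = match.group(0)
        # Look at the word that follows the hit, if any.
        following = re.match(r"\s+(\w+)", match.string[match.end():])
        blocked = OVERMATCH_FOLLOWERS.get(word.lower(), set())
        if following and following.group(1).lower() in blocked:
            return word  # e.g. "casein kinase" is an enzyme, not casein
        return "<prot>" + word + "</prot>"

    # Longest names first so "alpha casein" wins over "casein".
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(replace, text)


gazetteer = {"casein", "alpha casein"}
tagged = tag_proteins(standardise("alpha-CN is not casein kinase"), gazetteer)
```

Standardising before tagging means the abbreviated form is not lost when the text is later passed to MMTx.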
Comparative filtration

Compare filter combinations
– Calibrate with known link (RD – Fish Oil)
– Highest rank of blood viscosity
– Dependence on topic?

Combine type ranks
– MeSH terms
– Substance terms
– Title
– Abstract
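How the four type ranks are combined is not stated on the slide. One simple possibility, shown purely as an assumption, is to order terms by their mean rank across the four lists; the function name, the penalty for missing terms and the example rankings are all hypothetical.

```python
def combine_ranks(rankings):
    """Order terms by mean rank across several ranked lists.

    A term missing from a list is given that list's length plus one.
    """
    terms = set().union(*rankings.values())
    mean_rank = {}
    for term in terms:
        total = 0
        for ranked in rankings.values():
            if term in ranked:
                total += ranked.index(term) + 1
            else:
                total += len(ranked) + 1
        mean_rank[term] = total / len(rankings)
    # Ties are broken alphabetically to keep the output stable.
    return sorted(terms, key=lambda t: (mean_rank[t], t))


rankings = {
    "mesh": ["blood viscosity", "platelet aggregation", "vasoconstriction"],
    "substance": ["platelet aggregation", "blood viscosity"],
    "title": ["blood viscosity", "vasoconstriction"],
    "abstract": ["blood viscosity", "platelet aggregation", "vasoconstriction"],
}
combined = combine_ranks(rankings)
```

Calibrating against a known link, as suggested above, would then mean checking how high blood viscosity ranks in the combined list for each filter combination.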
Targets…

Milk proteins
– Largely digested
– Maternal regulation

Milk peptides
– Can reach blood stream, stable
– Receptor binding
– Protein binding
– Immunoresponse
Targets…

Plasmin remodelling
– Plasmin levels increase during parturition and involution
– Hypothesis: peptides involved in restructure

Extension: Are peptides involved in apoptosis, hyperplasia?
Role of the abundant proteins
– MFGL
– Xanthine Oxidase
– CD36
– β-lactoglobulin
Information kept

Defined area (milk), therefore can store detailed info., unlike a generic system
– Known assoc with strength
– Unknown assoc with strength
– LinkOuts




Main MeSH terms
MeSH terms
Substance terms
MMTx concepts
Problems



No directionality on relationships
Incorrect MMTx tagging
Peptide literature
– Small(ish) amount of named peptide data
– Need to text-mine peptides; however, this is also a strength, as the data are more disparate

Species/age differentiation (by MeSH?)
Conclusions



Co-occurrence relationships derived for milk protein/peptides and other terms
Hypothesis generation to identify new knowledge
Information stored for user access
Future work



Debug!
Species/age specificity by MeSH term?
Check incorrect MMTx tagging
– add bioactive peptides to source data



Link proteins to milkER sequence database
Finish user interface
Learn Java 
Acknowledgements




Prof. Lindsay Sawyer
Dr. Carl Holt (Hannah Research Institute, Ayr)
Prof. Bonnie Webber (Informatics)
Dr. Alistair Kerr and Gail Sinclair – technical support
Miscellaneous…





ArrayPaths, Stratagene
Huang et al., 2005 – PPI extractor program
Metis (Mitchell et al) – flags interesting sentences to user from a UniProt sequence search, crap but nice to have BLAST
MELISA (Abasolo et al) – ontology-based IE
Genomes to Systems Conference
Manchester, 22 - 24th March 2006
References
Abasolo JM, Gomez M: MELISA. An ontology-based agent for information
retrieval in medicine. ECDL Workshop on the Semantic Web 2000.
Aronson AR: Effective mapping of biomedical text to the UMLS
Metathesaurus: the MetaMap program. Proc AMIA Symp 2001:17-21.
Aronson AR: Filtering the UMLS Metathesaurus for MetaMap. 2001.
Bodenreider O: The Unified Medical Language System (UMLS):
integrating biomedical terminology. Nucleic Acids Res 2004,
32(Database issue):D267-270.
iHOP (Information Hyperlinked over Proteins)
[http://www.pdg.cnb.uam.es/UniPub/iHOP/]
Hoffmann R, Valencia A: Implementing the iHOP concept for navigation of
biomedical literature. Bioinformatics 2005, 21 Suppl 2:ii252-ii258.
Mitchell AL, Divoli A, Kim JH, Hilario M, Selimas I, Attwood TK: METIS: multiple
extraction techniques for informative sentences. Bioinformatics 2005,
21(22):4196-4197.
Narayanasamy V, Mukhopadhyay S, Palakal M, Potter DA: TransMiner: mining
transitive associations among biological objects from text. J Biomed
Sci 2004, 11(6):864-873.
Pratt W, Yetisgen-Yildiz M: LitLinker: Capturing Connections Across the
Biomedical Literature. K-CAP 2003 2003.
References (2)
Pratt W, Yetisgen-Yildiz M: A study of biomedical concept identification: MetaMap vs.
people. AMIA Annu Symp Proc 2003:529-533.
EBIMed [http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp]
Whatizit [http://www.ebi.ac.uk/Rebholz-srv/whatizit]
Shatkay H: Hairpins in bookstacks: information retrieval from biomedical text. Brief
Bioinform 2005, 6(3):222-238.
Srinivasan P: Text mining: Generating hypotheses from MEDLINE. J Am Soc Inf Sci
Technol 2004, 55(5):396-413.
Srinivasan P, Libbus B: Mining MEDLINE for implicit links between dietary
substances and diseases. Bioinformatics 2004, 20 Suppl 1:I290-I296.
Weeber M, Klein H, Aronson AR, Mork JG, de Jong-van den Berg LT, Vos R: Text-based
discovery in biomedicine: the architecture of the DAD-system. Proc AMIA Symp
2000:903-907.
Weeber M, Klein H, de Jong-van den Berg LTW, Vos R: Using concepts in literature-based discovery: Simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. J Am Soc Inf Sci Technol 2001, 52(7):548-557.
Weeber M, Vos R, Klein H, de Jong-van den Berg LTW, Aronson AR, Molema G: Generating
hypotheses by discovering implicit associations in the literature: A case report
of a search for new potential therapeutic uses for thalidomide. J Am Med Inf
Assoc 2003, 10(3):252-259.
Wren JD: Extending the mutual information measure to rank inferred literature
relationships. BMC Bioinformatics 2004, 5:145.