Molecular Biology Databases

Research in the Verspoor Lab
Karin Verspoor, Ph.D.
Faculty, Computational Bioscience Program
University of Colorado School of Medicine
http://compbio.ucdenver.edu/Hunter_lab/Verspoor
Linguistics, Lexicons, and
Biomedical Verbs
• I could go on and on and on and on
• But I probably won’t…
Biological Knowledge Discovery
GENE NORMALIZATION
Gene Normalization
• Mapping a gene or protein name to an identifier (e.g. in GenBank)
• Very important task for using extracted information (more useful than just a name)
• Ambiguity
– with English words (“to” “dunce” “wingless”)
– in naming (1168 genes in Entrez named “p60”)
– in species (949 species have a gene named “p53”)
Normalization methods
• Heuristic approach can be effective
– Edit distance is too coarse (some characters matter more
than others)
• Some heuristics that appear to help
– Ignore hyphens, commas, some other interrupting
punctuation (but not, e.g., ' )
– Ignore parenthetical elements
– Consider translations among arabic/roman numerals, and
latin/greek letters
– Special words for compound noun phrases: receptor,
precursor, mRNA, gene, protein, greek letter names, etc.
Gene Normalization:
a species-based approach
• Based on species detection (NCBI Taxonomy terms)
– Global cues:
• First (first species mention)
• Abstract (most frequent species in abstract)
• Majority (most frequent species in doc)
– Local cues, close to gene reference:
• Recency
• Window (most frequent in window)
– “mixed” strategy setting confidence
• “First” >> “Recency” >> “Window” >> “Majority”
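One way to read the mixed strategy is as a fallback chain over the cues, tried in the confidence order shown. The sketch below is an illustration under that reading, not the system's actual code: the function name `choose_species`, the token-offset representation, the window size, and the rule "accept a cue only if the gene name exists for that species" are all assumptions. The "Abstract" cue is left out because it does not appear in the ordering.

```python
from collections import Counter

def most_common(species):
    """Most frequent item, or None for an empty list (ties go to the first seen)."""
    return Counter(species).most_common(1)[0][0] if species else None

def choose_species(gene_pos, mentions, candidates, window=50):
    """Pick a species for one gene mention.

    mentions   -- (token_offset, species) pairs in document order
    candidates -- species for which the gene name has a dictionary entry
    Returns (species, cue_name), or (None, None) if no cue agrees.
    """
    before = [s for pos, s in mentions if pos <= gene_pos]
    nearby = [s for pos, s in mentions if abs(pos - gene_pos) <= window]
    cues = [  # highest confidence first, as on the slide
        ("First", mentions[0][1] if mentions else None),
        ("Recency", before[-1] if before else None),
        ("Window", most_common(nearby)),
        ("Majority", most_common([s for _, s in mentions])),
    ]
    for cue, species in cues:
        if species in candidates:
            return species, cue
    return None, None

# Toy document: one early human mention, then several mouse mentions.
mentions = [(0, "human"), (120, "mouse"), (130, "mouse"), (300, "mouse")]
print(choose_species(125, mentions, {"mouse", "rat"}))
```

Here the "First" cue proposes human, which is not a candidate for this gene name, so the choice falls through to "Recency".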
Putting it all together:
BioCreative II.5 System Architecture
[Figure: system architecture diagram. Components shown: Document; Concept Recognition (tokenization and sentence splitting; dictionary-based protein and species recognition), producing protein candidate sets and species annotations; Gene Normalization, producing filtered protein sets (INT); coordination analysis, OpenDMAP relation extraction, and interaction pair construction, producing normalized interaction pairs (IPT).]
UniProt Dictionary Match
• Trie-based data structure
• Protein names and synonyms normalized upon
insertion
– reduces number of variants
– same form we search for in the text
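A minimal sketch of such a dictionary, assuming a plain nested-dict trie and a deliberately simplified `normalize` helper (ASCII only; both names are hypothetical). It ignores token boundaries and is not the system's implementation. Q9HBI1 is the identifier used in the slides' own example.

```python
import re

def normalize(name):
    """Simplified normalization: lowercase, keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, name, identifier):
        """Store the normalized form of a name, so variants collapse to one path."""
        node = self.root
        for ch in normalize(name):
            node = node.setdefault(ch, {})
        # "$ids" cannot collide with a child key, since child keys are single characters.
        node.setdefault("$ids", set()).add(identifier)

    def longest_match(self, text, start=0):
        """Return (end_offset, ids) for the longest entry starting at text[start], or None."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "$ids" in node:
                best = (i + 1, node["$ids"])
        return best

trie = Trie()
trie.insert("Affixin", "Q9HBI1")
trie.insert("Beta-Parvin", "Q9HBI1")
print(trie.longest_match(normalize("Affixin binds ILK")))
```

Because names are normalized on insertion, the text only has to be normalized once and can then be scanned in the same form.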
Gene candidate selection
• Normalized string match against SwissProt
names and synonyms
– lowercase
– eliminating punctuation (apostrophes, hyphens,
and parentheses)
– converting Greek letters and Roman numerals to
a standard form
– removing spaces
• Left and right token boundary constraints
(right constraint relaxed for plurals)
Protein Match example
• Sentence:
Affixin/β-parvin is an integrin-linked kinase (ILK)-binding
focal adhesion protein highly expressed in skeletal muscle
and heart.
• Normalized Sentence:
affixinbparvinisanintegrinlinkedkinaseilkbindingfocaladhesionproteinhighlyexpressedinskeletalmuscleandheart
• Match Affixin to affixin (ID: Q9HBI1)
• Match β-parvin to bparvin (ID: Q9HBI1)
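The normalization steps can be sketched as below. This is an illustration only: the function name `normalize_text` and the small Greek and Roman mapping tables are assumptions, and a real system would need fuller tables.

```python
import re

# Assumed, partial mapping tables.
GREEK_CHARS = str.maketrans({"α": "a", "β": "b", "γ": "g", "δ": "d"})
GREEK_NAMES = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d"}
# Single-letter numerals (i, v, x) are left alone because they are ambiguous.
ROMAN = {"ii": "2", "iii": "3", "iv": "4", "vi": "6",
         "vii": "7", "viii": "8", "ix": "9"}

def normalize_text(text):
    # Lowercase and replace Greek characters with a Latin stand-in.
    text = text.lower().translate(GREEK_CHARS)
    # Keep runs of letters and digits; this drops hyphens, apostrophes,
    # parentheses, slashes and spaces in one step.
    tokens = re.findall(r"[^\W_]+", text)
    # Map spelled-out Greek letters and Roman numerals, token by token.
    tokens = [GREEK_NAMES.get(t, ROMAN.get(t, t)) for t in tokens]
    return "".join(tokens)

print(normalize_text("Affixin/β-parvin"))   # affixinbparvin
print(normalize_text("Beta-parvin"))        # bparvin
```

Applied to the full example sentence, this produces the normalized string shown on the slide.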
Species detection
• Dictionary lookup using UIMA Concept Mapper loaded with NCBI Taxonomy
• Match species and sub-species; traverse is-a hierarchy for sub-species
BC II.5 results
RAW
        TP     FP     FN
        105    1592   147

        P        R        F        AUC
micro   0.06187  0.41667  0.10775  0.05316
macro   0.06817  0.44374  0.11296  0.17806

Homonym/Ortholog
        TP     FP     FN
        127    454    125

        P        R        F        AUC
micro   0.21859  0.50397  0.30492  0.21285
macro   0.28334  0.55928  0.32453  0.39295
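As a check on the reported figures, the micro-averaged P, R and F follow directly from the pooled TP/FP/FN counts. A short sketch using the standard definitions (the function name `prf` is just for illustration); the macro averages and AUC need per-document results and cannot be recomputed from these counts.

```python
def prf(tp, fp, fn):
    """Precision, recall and balanced F-score from pooled counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

for label, counts in [("RAW", (105, 1592, 147)),
                      ("Homonym/Ortholog", (127, 454, 125))]:
    p, r, f = prf(*counts)
    print(f"{label}: P={p:.5f} R={r:.5f} F={f:.5f}")
# RAW: P=0.06187 R=0.41667 F=0.10775
# Homonym/Ortholog: P=0.21859 R=0.50397 F=0.30492
```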
KNoGM and KaBOB
• KNoGM: Knowledge-based
Normalization of Gene Mentions
• Strategy based on WSD methods
from Agirre and Soroa, based on
knowledge graphs
• Taking advantage of biological knowledge
resources
• KaBOB: Knowledge Base Of Biology
– Integrated resource across biological databases
Knowledge-based methods in
Word Sense Disambiguation
• Disambiguate words based on relations represented in a semantic graph
• Take advantage of connections among word senses and prefer word senses that are semantically connected
• Intuition: Spreading Activation
– Can perform static analysis of the graph to
determine most likely disambiguations based only on
the state of connections in the graph
– More effective: dynamic, consider words in context
UKB: Agirre & Soroa
knowledge-based WSD
• Knowledge-based word sense disambiguation
method
– knowledge = WordNet graph
– algorithm = (personalized) page rank
PageRank: ranks vertices in a graph according to
their relative structural importance
Personalized PageRank: bias certain vertices;
“activation” from a vertex increases
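A toy sketch of personalized PageRank by power iteration, to make the "bias certain vertices" idea concrete. This is not UKB itself: the graph, the sense labels, the damping factor and the function name are invented for illustration, and the code assumes every vertex has at least one neighbour.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """graph: dict vertex -> list of neighbours (undirected edges listed both ways).
    seeds: vertices that receive all of the teleport ("activation") mass."""
    nodes = list(graph)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Restart mass goes only to the seed vertices.
        new_rank = {n: (1 - damping) * teleport[n] for n in nodes}
        # Each vertex passes the rest of its rank evenly to its neighbours.
        for n in nodes:
            share = damping * rank[n] / len(graph[n])
            for m in graph[n]:
                new_rank[m] += share
        rank = new_rank
    return rank

# Hypothetical sense graph: which sense of "bank" fits a context about money?
graph = {
    "money#1": ["bank#finance", "deposit#1", "flow#1"],
    "deposit#1": ["money#1", "bank#finance"],
    "bank#finance": ["money#1", "deposit#1"],
    "flow#1": ["money#1", "water#1"],
    "water#1": ["flow#1", "bank#river"],
    "bank#river": ["water#1"],
}
ranks = personalized_pagerank(graph, seeds={"money#1", "deposit#1"})
best = max(["bank#finance", "bank#river"], key=ranks.get)
print(best)  # bank#finance
```

With the context senses as seeds, the sense that sits closer to them in the graph accumulates more rank and is preferred.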
Knowledge-based methods in
Gene Normalization
• Knowledge typically brought to bear based on
textual matching of concepts known to be
associated with genes
– Gene ontology concepts
– Chromosome locations
– Species names
• KNoGM takes advantage of such knowledge in
a broader relational context
KaBOB: Knowledge Base of
Biology
• Goal: construction of an integrated, broad-coverage
semantic resource of biological knowledge
– information artifacts
– abstracted biological knowledge
– RDF representation using ontological relations
• KaBOB v.0
– iRefWeb protein interaction data
– GO annotations
– Homologene
– NCBI Taxonomy
From knowledge-based WSD
to KNoGM
• knowledge: KaBOB
• dictionary: gene name → gene identifiers
• context: mentions of gene names, GO terms,
NCBI Taxonomy terms
KNoGM
Training Set 1, BCIII

System                     TP    FP    FN    Precision  Recall  F-score
Default human (baseline)   73    465   534   0.1357     0.1203  0.1275
UKB-5 (iRefWeb)            73    322   534   0.1848     0.1203  0.1457
UKB-50 (iRefWeb)           64    310   543   0.1711     0.1054  0.1305
UKB-5 (KaBOB v.0)          104   468   503   0.1818     0.1713  0.1764
UKB-25 (KaBOB v.0)         115   504   492   0.1858     0.1895  0.1876
UKB-100 (KaBOB v.0)        151   580   456   0.2066     0.2488  0.2257
Biological Knowledge Discovery
PROTEIN ACTIVE SITES
Automated validation of
high-throughput predictions
• Collaboration with Mike Wall @ LANL
• Combine structure-based predictions of active
sites on proteins with literature-based
validation
– Given a PDB protein structure, and a prediction
for residues in that structure that are active
(ligand binding sites, catalytic sites, etc.)
– Search the literature for evidence supporting the
prediction
Protein Fold vs. Function
• Many amino acids in a protein are responsible for defining the overall fold
• However, only a small fraction of the residues in a protein are directly responsible for its behavior
• The evolutionary pressures on these residues are different from other residues, and can cause mutations to be correlated with function (Lichtarge)
Functional Residues Are Often
Remote in Sequence
• Difficult to identify as motifs
>1AQM:A|PDBID|CHAIN|SEQUENCE
TPTTFVHLFEWNWQDVAQECEQYLGPKGYAAVQVSPPNEHITGSQWWTRYQPVSYELQSRGGNRAQFIDMVNRCSAAGVD
IYVDTLINHMAAGSGTGTAGNSFGNKSFPIYSPQDFHESCTINNSDYGNDRYRVQNCELVGLADLDTASNYVQNTIAAYI
NDLQAIGVKGFRFDASKHVAASDIQSLMAKVNGSPVVFQEVIDQGGEAVGASEYLSTGLVTEFKYSTELGNTFRNGSLAW
LSNFGEGWGFMPSSSAVVFVDNHDNQRGHGGAGNVITFEDGRLYDLANVFMLAYPYGYPKVMSSYDFHGDTDAGGPNVPV
HNNGNLECFASNWKCEHRWSYIAGGVDFRNNTADNWAVTNWWDNTNNQISFGRGSSGHMAINKEDSTLTATVQTDMASGQ
YCNVLKGELSADAKSCSGEVITVNSDGTINLNIGAWDAMAIHKNAKLNTSSAS
α-amylase from Alteromonas haloplanctis
Asp174, Glu200, Asp264
The Same Residues are Often
Nearby in 3D Structure
[Figure: 3D structure of 1AQM with residues Asp174, Glu200 and Asp264 labeled.]
Functional Sites
• Types of Functional Sites
– Catalytic sites
– Allosteric Sites
– Ligand-binding sites
– Protein-protein interaction sites
• Used to define motifs
– Geometric hashing and other methods (TESS,
Thornton lab)
• Targets for Drug Design
DPA Prediction of
Functional Sites
[Figure: 1AQM structure with Asp174, Glu200 and Asp264 labeled; legend: catalytic triad, predicted residues.]
NLP Validation of Protein
Active Site Predictions
• Combine structure-based predictions of active
sites on proteins with literature-based
validation
– Given a PDB protein structure, and a prediction
for residues in that structure that are active
(ligand binding sites, catalytic sites, etc.)
– Search the literature for evidence supporting the
prediction
NLP validation: approach
[Figure: validation workflow diagram. Components shown: Protein Data Bank (protein ID, protein name(s), protein structure); PubMed query, returning relevant documents; analysis pipeline, producing extracted residues; Dynamic Perturbation Analysis, producing predicted active residues; residue validation or re-ranking, producing validated active residues.]
NLP validation: NLP analysis
[Figure: NLP analysis workflow diagram. Components shown: Catalytic Site Atlas and Binding MOAD database; corpus of documents; amino acid residue pattern development; analysis pipeline (tokenization and sentence splitting; amino acid residue recognition), producing a list of residues; comparison to the known active residues for the document; P/R/F score; analysis of FPs/FNs.]
Residue mention detection,
examples
• This missense mutation converts a highly conserved glycine (Gly17 of neurophysin) to a valine residue.
• Killer of prune (Kpn) is a mutation in the awd gene which substitutes Ser for Pro at position 97 and causes dominant lethality in individuals that do not have a functional prune gene.
• Residues in both the N-terminal (Arg-66 and Glu-70) and C-terminal (Arg-200, Asp-254, Asp-255, and Asp-276) thirds of the protein are implicated in binding to cells.
• … where cysteines at positions 6, 42, 48, 90 and 393 were replaced by serine.
• Other outliers of possible functional relevance include D18, R23, R59, R390 and A391.
Patterns must handle 3-letter and 1-letter abbreviations, various connectors, mutations, linguistic constructs such as coordination, and other variations in surface forms.
Some regular expressions for
AA mentions
# Raw strings are used throughout so that \b and \d reach the regex engine intact.
# The long and short forms are lowercase; matching "Tyr85" presumably relies on re.IGNORECASE.
AA_long = (
    r"(alanine|asparagine|aspartic|cysteine|glutamic acid|glutamic|glutamine|glycine"
    r"|histidine|allo|leucine|lysine|methionine|phenylalanine|proline|serine|threonine"
    r"|tryptophane?|tyrosine|valine|arginine|alanyl|arginyl|asparaginyl|aspartyl"
    r"|cysteinyl|glutaminyl|glycyl|glutamyl|histidyl|isoleucyl|leucyl|lysyl|methionyl"
    r"|phenylalanyl|prolyl|seryl|threonyl|tryptophanyl|tyrosyl|isoleucine|valyl)"
)
AA_short = (
    r"(arg|asn|asp|cys|gln|gly|glu|his|ile|leu|lys|met|phe|pro|ser|thr|trp|tyr|val"
    r"|asx|glx|xle|xaa|ala|ctt)"
)
AA_initial = r"(A|C|D|E|F|G|H|I|K|L|M|N|P|Q|R|S|T|V|W|Y)"
# Grouped so that the word boundaries below apply to both alternatives
AA_unbounded = "(?:" + AA_long + "|" + AA_short + ")"
AA_bounded = r"\b" + AA_unbounded + r"\b"
# Position before the AA (85 tyr, 85-tyr)
AA_position_variant1 = r"(\d+)([ \-]+)" + AA_bounded
# AA plus the position (tyr85), with optional parentheses around the position (tyr(85))
AA_position_variant2 = AA_unbounded + r"[ \-]*\(?\d+\)?"
# (tyr85 to ser85, Tyr 85 Ser 85, trp27-gly360); the last alternative is a literal backslash
connection = r"[ \-]?(\-|to|\s|\\)[ \-]?"
grammatical_expressions = r"([ \-]?(to|substitution of|at position|acid)[ \-]?)"
pattern3 = AA_unbounded + r".?\d+" + connection + AA_unbounded + r".?\d+"
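To show how patterns of this kind behave on the example sentences, here is a much-reduced, self-contained sketch. It is not the patterns above: it only covers capitalized 3-letter codes and 1-letter codes followed directly by a position, and the names `find_residues`, `three_letter` and `one_letter` are made up for the illustration. Coordinated mentions ("cysteines at positions 6, 42, 48") and full residue names are deliberately not handled.

```python
import re

THREE = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
ONE = "ACDEFGHIKLMNPQRSTVWY"

# Three-letter code, optional space or hyphen, then the position: Gly17, Arg-66, Tyr 85
three_letter = re.compile(r"\b(" + THREE + r")[ -]?(\d+)\b")
# One-letter code immediately followed by the position: D18, R390
one_letter = re.compile(r"\b([" + ONE + r"])(\d+)\b")

def find_residues(sentence):
    """Return (amino_acid, position) pairs in order of appearance."""
    hits = [(m.start(), m.group(1), int(m.group(2)))
            for pattern in (three_letter, one_letter)
            for m in pattern.finditer(sentence)]
    return [(aa, pos) for _, aa, pos in sorted(hits)]

print(find_residues("Residues in both the N-terminal (Arg-66 and Glu-70) "
                    "and C-terminal (Arg-200, Asp-254, Asp-255, and Asp-276) thirds"))
print(find_residues("Other outliers of possible functional relevance include "
                    "D18, R23, R59, R390 and A391."))
```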
Current pattern performance
            Residues  Prec      Recall  F1
Corpus 1    3723      0.725933  0.993   0.84
Corpus 2    767       0.741873  1.0     0.85
Corpus 3    303       0.735436  1.0     0.85
Average               0.734415  0.998   0.85
Corpus 1: 61 full-text journal publications derived from Protein Data Bank
(PDB) records that have known functional sites
Corpus 2: 7 full-text journal publications; 5 abstracts. Derived from PDB
records that are known drug targets.
Corpus 3: 100 journal abstracts; obtained from Nagel et al (2009).
NLP analysis, refined
[Figure: refined NLP analysis workflow diagram. Components shown: Catalytic Site Atlas and Binding MOAD database; corpus of documents; amino acid residue pattern development; protein-residue association pattern development; analysis pipeline (protein recognition; amino acid residue recognition; protein-residue association using OpenDMAP patterns), producing a list of (protein, residue) pairs; comparison to the known active residues for a protein linked to the document; P/R/F score; analysis of FPs/FNs.]
Some initial results of
integration
• For 32,195 PDB entries:
– 26,829 entries map to a PubMed ID
– 14,851 unique PubMed abstracts processed
– 23,477 residues identified
• 69% match surface residues on the relevant protein
– 50% of these match predicted active sites
• 79% of PDB entries have at least one residue identified
Complicating factors
• AA numbering in sequences may not be
consistent
– Different “reference” sequences for the protein
– Mutant or other variant sequences
• Explicit mentions of mutations
• Namespace ambiguity, possibly
BioNLP
TECHNICAL AND
REPRESENTATIONAL ISSUES
NLP validation: infrastructure
• Requires scaling our architecture to process
full text publications on a large scale
– UIMA-AS (Asynchronous Scaleout)
– Cloud/cluster computing
• Take software engineering seriously
– Robust, scalable, modular architectures
– Consider the kinds of knowledge structures we
need to be able to represent and manipulate
• hierarchical controlled vocabularies
• patterns of expression
Annotation Representation
[Figure: RDF annotation graph for the text “…regulation of transcription of mouse Mapk7…”. Annotations a1 to a3 are linked by has_location to text spans t1 to t3 and by kiao:denotesResource to resources with rdfs:label values: GO:0065007 “biological regulation”, GO:0006350 “transcription”, and EG:23939 “M. musculus Mapk7”. Annotation a4 covers span t4, “regulation of transcription” (GO:0045449). Node types shown: rdfs:Resource, kiao:ResourceAnnotation, kiao:StatementSetAnnotation, rdf:Property (s p o).]
In a nutshell
• Ontologies and Semantic graph analysis
• Vocabularies and Linguistic knowledge for the biomedical domain
• Text Mining
• Information Extraction
• Addressing the needs of the biological user
• Biological data analysis integrating multiple data sources
Acknowledgements
• Larry Hunter (Lab director)
• Eneko Agirre and Aitor Soroa at EHU (UKB)
• Kevin Livingston (KaBOB)
• Kevin Cohen (NLP)
• Helen Johnson (Linguist)
• (Software engineers)
– Bill Baumgartner
– Chris Roeder
– Tom Christiansen
• Other Lab members:
– Mike Bada, Hannah Tipney, Yuriy Malenkiy, Lynne Fox
• Mike Wall and Judith Cohn at LANL
• NIH grants
– R01 LM 010120-01
– R01 LM 009254
– R01 LM 008111
– R01 GM 083649
– G08 LM 009639
– T15 LM 009451
• Guillaume Achaz for the gnome image