Research Opportunities in
Biomedical Text Mining
Kevin Bretonnel Cohen
Biomedical Text Mining Group
Lead
[email protected]
http://compbio.ucdenver.edu/Hunter_lab/Cohen
More projects than people
• Ongoing:
– Coreference resolution
– Software engineering perspectives on natural language processing
– Odd problems of full text
– Tuberculosis and translational medicine
– Discourse analysis annotation
– OpenDMAP
• In need of fresh blood:
– Metagenomics/Microbiome studies
– Temporality in clinical documents
– Translational medicine from the clinical side
– Summarization
– Negation
– Question-answering: Why?
– Nominalizations
– Metamorphic testing for natural language processing
Tuberculosis and
translational medicine
• Pathogen/host interactions
– Pathogens
– Hosts
– Genes
– ?
• Information retrieval alone is difficult if strain must be considered
• Current status: evaluating dictionary-based approaches to gene name recognition; 0.69 and climbing
Tuberculosis and
translational medicine
• How different are the eukaryotic and prokaryotic (literature) domains, really?
– Kullback-Leibler divergence to measure difference
– Log likelihood to see what makes them different
• Effector protein prediction
– Retrieve lists of effector proteins and other proteins from one or more pathogens
– Build language models for each—what would distinguish them? (A toy sketch follows.)
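A minimal sketch of the divergence idea, with single-sentence toy "corpora" standing in for the eukaryotic and prokaryotic literatures (the document lists and smoothing choice are illustrative assumptions, not the group's pipeline):

```python
# Minimal sketch, assuming toy stand-in corpora: build Laplace-smoothed
# unigram language models for two literature domains and compute the
# Kullback-Leibler divergence between them.
from collections import Counter
from math import log

def unigram_model(texts, vocab, alpha=1.0):
    # Laplace smoothing keeps every probability nonzero, so the
    # divergence is finite even for words unseen in one domain.
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    # D(P || Q) in nats; p and q share one vocabulary.
    return sum(p[w] * log(p[w] / q[w]) for w in p)

# Hypothetical one-document "corpora"; real input would be thousands of abstracts.
eukaryotic = ["the trkA gene is expressed in mouse neurons"]
prokaryotic = ["the effector protein is secreted by the pathogen"]
vocab = {w for t in eukaryotic + prokaryotic for w in t.lower().split()}
p = unigram_model(eukaryotic, vocab)
q = unigram_model(prokaryotic, vocab)
print(f"D(euk || prok) = {kl_divergence(p, q):.3f} nats")
```

Per-word log-likelihood keyness can be computed over the same counts to see which words drive the difference.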
Tuberculosis and
translational medicine
• Feature types (non-clever; see the sketch after this list):
– Words/stems
– N-grams (bigrams, trigrams)
• Feature types (clever): conceptual
– Gene Ontology terms (especially related to
effector proteins?)
– Eukaryotic domains
– Signals
– Hosts/biomes
– You tell me
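The "non-clever" feature types are easy to make concrete; this is plain Python, with no particular toolkit assumed:

```python
# Sketch of the "non-clever" feature types: bag of words plus bigrams
# and trigrams over whitespace tokens.
def ngrams(tokens, n):
    # Contiguous n-grams, e.g. ("effector", "protein") for n=2.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(text):
    tokens = text.lower().split()
    feats = set(tokens)                                # words
    feats |= {" ".join(g) for g in ngrams(tokens, 2)}  # bigrams
    feats |= {" ".join(g) for g in ngrams(tokens, 3)}  # trigrams
    return feats

print(sorted(features("the effector protein is secreted")))
```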
Translational medicine from
the clinical side
• Factors affecting inclusion/exclusion from clinical trials
• Sharpening phenotypes (7% of patients in Schwarz’s PIF study)
• ICD9-CM prioritization
• Gazillions of named entity recognition problems (drugs, assays, signs, symptoms, vital signs, …)
Translational medicine from
the clinical side
• History: foundational
• Practice: difficult—access issues
• Technical problems related to data
availability (e.g. will you have enough for
machine learning?)
– TREC EMR track: yes
– i2b2 obesity data: probably
• Time for a renaissance
• Strategy: break in via TREC and i2b2;
deadline: summer/fall
Translational medicine from
the clinical side
• Inclusion/exclusion from effectiveness studies—Text REtrieval Conference (TREC) 2011/2012
– Given:
• 110,000 clinical notes
• List of topics
– Return: relevant records
• Successful methods: pay attention to document structure; model as questions; build more training data
• Hard to beat Lucene! (We did)
Translational medicine from
the clinical side
• Patients with hearing loss
• Patients with complicated GERD who receive endoscopy
• Hospitalized patients treated for methicillin-resistant Staphylococcus aureus (MRSA) endocarditis
• Patients diagnosed with localized prostate cancer and treated with robotic surgery
• Patients with dementia
• Patients who had positron emission tomography (PET), magnetic resonance imaging (MRI), or computed tomography (CT) for staging or monitoring of cancer
• Patients with ductal carcinoma in situ (DCIS)
• Patients treated for vascular claudication surgically
• Women with osteopenia
• Patients being discharged from the hospital on hemodialysis
• Patients with chronic back pain who receive an intraspinal pain-medicine pump
Temporality in clinical
documents
• Did this patient have a headache within the past ten days?
“She had a migraine 2 weeks ago, lasting a few days with headache, dizziness, visual changes and vomiting.”
• Subtasks:
– Recognize that she had a headache
– Figure out the temporal relation to the document creation date
– Average accuracy 0.76-0.78
Temporality in clinical
documents
All varieties of time expressions in a single clinic visit note

             Conventional time                          Logical time
Anchored     July 2009; in August; in September         on the day of the visit; present
Unanchored   a few days; in three months; 2 weeks ago   since then
Temporality in clinical
documents
All varieties of time expressions relevant to current approaches to NLP in a single clinic visit note

TIMEX3: July 2009; in August; in September; a few days; in 3 months; on the day of the visit; present; since then; 2 weeks ago
Events: seen; migraine; headache; dizziness; visual changes; vomiting; period (menses); soreness; activity; malar rash
TLINKs: soreness BEFORE office visit; malar rash OVERLAP office visit; return AFTER office visit
ALINKs: continue CONTINUE medications; examination INITIATE BP
Temporality in clinical
documents
• Approach to event recognition: use structure of ontologies and definitions in ontologies to recognize that an event occurred
• (Need robust handling of negation and context)
• Regular expressions for temporal expressions
• Logic for translating from temporal expressions to within-10-days-or-not (a toy sketch of the last two bullets follows)
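An illustrative sketch only: one regular expression for relative temporal expressions plus the translation logic. A real system needs many more patterns and real TIMEX normalization; the reading of "a few" is an arbitrary assumption here.

```python
# One regex for relative temporal expressions ("2 weeks ago",
# "a few days ago") plus logic mapping each match to
# within-10-days-or-not, relative to the document creation date.
import re
from datetime import date, timedelta

RELATIVE = re.compile(r"(\d+|a few)\s+(day|week|month)s?\s+ago", re.I)
UNIT_DAYS = {"day": 1, "week": 7, "month": 30}  # crude calendar approximations

def events_within(text, doc_date, n=10):
    for m in RELATIVE.finditer(text):
        count = 3 if m.group(1).lower() == "a few" else int(m.group(1))  # arbitrary reading of "a few"
        event_date = doc_date - timedelta(days=count * UNIT_DAYS[m.group(2).lower()])
        yield m.group(0), event_date >= doc_date - timedelta(days=n)

note = "She had a migraine 2 weeks ago, lasting a few days with headache."
for expression, within_ten in events_within(note, date(2011, 3, 1)):
    print(expression, "-> within the past ten days?", within_ten)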
Summarization
• Task: Given one or more documents, produce a shorter version that preserves information (a toy extractive sketch follows)
• Difficulties (multi-document): duplication, aggregation, presentation
• Holy grail: abstraction
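To make the task concrete, a toy frequency-based extractive summarizer in the style of Luhn, not the group's system; the scoring and sentence splitter are illustrative assumptions:

```python
# Toy extractive summarizer: score each sentence by the average corpus
# frequency of its words, then keep the top k sentences in document order.
import re
from collections import Counter

def summarize(text, k=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    def score(s):
        tokens = re.findall(r"[a-z0-9]+", s.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)
    top = set(sorted(sentences, key=score, reverse=True)[:k])
    return " ".join(s for s in sentences if s in top)

doc = ("P53 has a role in apoptosis. P53 also has a role in DNA repair. "
       "The assay was run on Tuesday.")
print(summarize(doc, k=2))
```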
Extraction vs. abstraction
• “An extract is a summary consisting entirely of material copied from the input.” (Mani)
• “An abstract is a summary at least some of whose material is not present in the input.” (Mani)
• Extraction:
– “Extract” strings from the input text
– Assemble them to produce the summary
• Abstraction:
– Find meaning
– Produce text that communicates the meaning
Extract, or abstract?
Abstract
Relationship between
summarization and generation
• Natural language generation: producing textual output
• Coherence (good)
– Redundancy (bad)
– Unresolvable anaphora (bad)
– Gaps in reasoning (bad)
– Lack of organization (bad)
Summarization and
generation
• GENE: BRCA1
• SPECIES: Hs.
• DISEASE_ASSOC.: Breast cancer
• Two possible renderings:
– “BRCA1 is found in humans. BRCA1 plays a role in breast cancer.”
– “BRCA1 is found in humans. It plays a role in breast cancer.”
A multi-document summary
Caenorhabditis elegans p53: role in
apoptosis, meiosis, and stress resistance.
Bcl-2 and p53: role in dopamine-induced
apoptosis and differentiation. P53 role in
DNA repair and tumorigenesis.
Another multi-document
summary
P53: role in apoptosis, meiosis, and stress
resistance, dopamine-induced apoptosis
and differentiation, DNA repair and
tumorigenesis.
Another multi-document
summary
P53 has a role in apoptosis, meiosis, and
stress resistance. It also has a role in
dopamine-induced apoptosis and
differentiation, DNA repair, and
tumorigenesis.
Summarization and
generation
• Examples of non-coherent summaries that
wouldn’t be bad…
– A table
– A table of contents?
– An index?
– A diagram?
Unique problem in
summarization for
tuberculosis and
host/pathogen interactions
• How do you build a single summary that covers data about two different species?
• Start with relations—bridge sentences, even if extractive
• Ordering: temporal? We know the course of disease…
Negation
• Classic problem
• Reasonably well-studied in clinical domain (NegEx), but heavily restricted by semantic class (a stripped-down NegEx-style sketch follows)
• Biological domain: 0.20-0.43 F-measure
• Pattern-learning for OpenDMAP, machine learning, semantic role labelling…
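A drastically simplified NegEx-style sketch; the trigger list and scope window are toy choices, not the published NegEx implementation:

```python
# A pre-negation trigger negates a concept mention within a fixed
# window of following tokens.
import re

PRE_TRIGGERS = {"no", "not", "denies", "without"}
WINDOW = 5  # tokens of negation scope after a trigger

def is_negated(text, concept):
    tokens = re.findall(r"[\w-]+", text.lower())
    concepts = [i for i, t in enumerate(tokens) if t == concept.lower()]
    triggers = [i for i, t in enumerate(tokens) if t in PRE_TRIGGERS]
    return any(0 < c - t <= WINDOW for c in concepts for t in triggers)

print(is_negated("Patient denies headache or dizziness.", "headache"))  # True
print(is_negated("Patient reports severe headache.", "headache"))       # False
```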
Semantic role labelling
Arg1: experiencer
Arg2: origin
Arg3: distance
Arg4: destination
Figure adapted from
Haghighi et al. (2005)
Question-answering: Why?
• Why did David Koresh ask for a typewriter?
• Why did I have a Clif bar for breakfast? versus Why did I have a Clif bar for breakfast instead of cereal?
• Need for data set collection
• Need novel methods—pattern-matching doesn’t work well
Question-answering: Why?
• Overall performance is poor (MRR: see the sketch below)
– 0.00 MRR versus 0.69 on birthyear (Ravichandran and Hovy 2002)
– 0.33 MRR versus 0.75 on location (Ravichandran and Hovy 2002)
– 45% at least partially correct (Higashinaka and Isozaki 2007)
– 0.35 mean reciprocal rank (Verberne et al. 2010)
• Pattern-based approaches outperformed machine learning
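Mean reciprocal rank, the metric quoted above, is simple to compute; the ranked answer lists here are toy data:

```python
# For each question, score 1/rank of the first correct answer, 0 if none.
def mean_reciprocal_rank(ranked_lists, gold_sets):
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, answer in enumerate(ranked, start=1):
            if answer in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Q1 answered at rank 1, Q2 at rank 3, Q3 missed: (1 + 1/3 + 0) / 3 = 0.444
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z"], ["p"]],
                           [{"a"}, {"z"}, {"q"}]))
```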
Question-answering: Why?
• “…why-questions are one of the most complex types. This is mainly because the answers to why-questions are not named entities (which are in general clearly identifiable), but text passages giving a (possibly implicit) explanation” (Maybury 2002 in Verberne 2007)
• Answers to why-questions cannot be stated in a single phrase; they are passages of text that contain some form of explanation
Question-answering: Why?
• How can we improve on machine learning
methods?
– Don’t try—improve pattern learning, instead
– Apply what we’re learning about inference
and knowledge representation from
Hanalyzer-related work
– Improved recognition of semantic classes in
text (more on this later)
Nominalization
• Nominalization: noun derived from a verb
– Verbal nominalization: activation, inhibition,
induction
– Argument nominalization: activator,
inhibitor, inducer, mutant
Nominalizations are dominant in biomedical texts

Predicate       Nominalization   All verb forms
Express         2,909            1,233
Develop         1,408            597
Analyze         1,565            364
Observe         185              809
Differentiate   737              166
Describe        10               621
Compare         185              668
Lose            556              74
Perform         86               599
Form            533              511

Data from CRAFT corpus
Relevant points for text
mining
• Nominalizations are an obvious route for
scaling up recall
• Nominalizations are more difficult to
handle than verbs…
• …but can yield higher precision (Cohen et
al. 2008)
Alternations of nominalizations:
positions of arguments
• Any combination of the set of positions for each argument of a nominalization (see the sketch after this list)
– Pre-nominal: phenobarbital induction,
trkA expression
– Post-nominal: increases of oxygen
– No argument present: Induction followed
a slower kinetic…
– Noun-phrase-external: this enzyme can
undergo activation
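A hedged sketch of the position distinctions above. Spans are (start, end) token offsets; the noun phrase containing the nominalization is assumed to be given, and finding it is the hard part in practice:

```python
def argument_position(nom_span, arg_span, np_span):
    if arg_span is None:
        return "absent"                # "Induction followed a slower kinetic"
    if arg_span[1] <= nom_span[0] and arg_span[0] >= np_span[0]:
        return "pre-nominal"           # "phenobarbital induction"
    if arg_span[0] >= nom_span[1] and arg_span[1] <= np_span[1]:
        return "post-nominal"          # "increases of oxygen"
    return "noun-phrase-external"      # "this enzyme can undergo activation"

# "phenobarbital induction": argument = token 0, nominalization = token 1.
print(argument_position(nom_span=(1, 2), arg_span=(0, 1), np_span=(0, 2)))
```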
Result 1: attested alternations
are extraordinarily diverse
• Inhibition, a 3-argument predicate—
Arguments 0 and 1 only shown
Implications for system building
• Distinction between absent and noun-phrase-external arguments is crucial and difficult, and finite state approaches will not suffice; merging data from different clauses and sentences may be useful
• Pre-nominal arguments are undergoers by a ratio of 2.5:1
• For predicates with agent and patient, post/post and pre/post patterns predominate, but others are common as well
What can be done?
• External arguments:
– semantic role labelling approach
• …but, very important to recognize the
absent/external distinction, especially with
machine learning
– pattern-based approach
• …but, approaches to external arguments (RLIMS-P) are so far very predicate-specific
What can be done?
• Pre-nominal arguments:
– apply heuristic that we have identified based
on distributional characteristics
– for most frequent nominalizations, manual
encoding may be tractable
Metagenomics/microbiome
studies
• Experiments not interpretable/comparable without large amounts of metadata
• Metadata in various places:
– Fielded: GenBank isolation_source field, GOLD description fields
– Journal articles (full text)
Metagenomics/microbiome
studies
• Various standards:
– MIMS
– MIMARKS (Nat Biotech, forthcoming)
• Ontology terms
• Continuous variables
• ??
Metagenomics/microbiome studies
• “Metagenomic sequence data that lack an environmental context have no value.”
– Crucial to replication, analysis
• Do microbial gene richness and evenness patterns (at some specific sampling density) correlate with other environmental characteristics?
• Which microbial phylotypes or functional guilds co-occur with high statistical probability in different environments?
• Do specific phylotypes track particular geographic or physico-chemical clines (latitudes, isotherms, isopycnals, etc.)?
• Do specific microbial community ORFs (functionally identified or not) track specific bioenergetic gradients (solar, geothermal, digestive tracts, etc.)?
• What is the percentage of genes with a given role, as a function of some physical feature, e.g. the average temperature of the sample sites?
• Do microbial community protein families, amino acid content, or sequence motifs vary systemically as a function of habitat of origin? Are specific protein sequence motifs characteristic of specific habitats?
• What is the “resistome” in soil? (Phenotype)
• Habitat change over time, host-to-host variation, within-host variation—biodefense and forensics applications
Metagenomics/microbiome
studies
• Investigation type: eukaryote, bacteria, virus, plasmid, organelle, metagenome
• Experimental factor: Experimental Factor Ontology, Ontology for Biomedical Investigations
• Latitude, longitude, depth, elevation, humidity, CO2, CO, salinity, temperature, …
• Geographic location (country, region, sea) from Gaz Ontology
• Collection date/time
• Environment, biome and features, material: Environment Ontology
• Trophic level; aerobe/anaerobe
• Sample collection device or method
• Sample material processing: Ontology for Biomedical Investigations
• Amount or size of sample
• Targeted gene or locus name
• PCR primer, conditions
• Sequencing method
• Chemicals administered: ChEBI
• Diseases: Disease Ontology
• Body site
• Phenotype: PATO
Metagenomics/microbiome
studies
• Where do you find this stuff?
– Text fields in databases
• Isolation_source in GenBank
• Description in GOLD
• TBD in microbiome studies, but hopefully coming
• Full text of journal articles
– Marine secondary products corpus coming
(pharmacogenomics connection)
– Problem of tables
– Multiple sentences, coreference
Metamorphic testing for NLP
• Metamorphic testing motivation: situations where input/output space is intractably large and it’s not clear what would constitute right answers
• Use domain knowledge to specify broad categories of changes to output that should occur with broad categories of changes to input
Metamorphic testing for NLP
• Gene regulatory networks:
– Add an unconnected node—G should be subsumed by G′
• SeqMap:
– Given a set of sequence reads T = {t1, t2, …, tn} and a genome p, we form a new genome p′ by deleting an arbitrary portion of either the beginning or ending of p. After mapping T to both p and p′ independently, all reads in T that are unmappable to p should also be unmappable to p′. (A runnable toy version follows.)
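A runnable toy version of the SeqMap relation, with naive exact substring search standing in for a real read mapper:

```python
# Reads unmappable to genome p must stay unmappable to truncated p'.
def mappable(read, genome):
    return read in genome  # placeholder for a real alignment call

genome = "ACGTACGGTTACGATCCGGA"
reads = ["ACGT", "TTAC", "GGGG", "TCCG"]
genome_truncated = genome[5:]  # delete an arbitrary prefix of p

for read in reads:
    if not mappable(read, genome):
        # The metamorphic relation itself:
        assert not mappable(read, genome_truncated), read
print("metamorphic relation holds on this toy example")
```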
Metamorphic testing for NLP
• Non-linguistic
– Add non-informative feature, see if feature selection screens it out (sketched below)
– Subtract informative features, see if performance goes down
• Linguistic
– ?
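A sketch of the first non-linguistic test; scikit-learn and the synthetic data are assumptions here, not the group's setup:

```python
# Append a random (non-informative) feature and check that univariate
# feature selection screens it out.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
informative = (rng.random((200, 2)) > 0.5).astype(float)
y = (informative.sum(axis=1) > 1).astype(int)  # label depends only on real features
noise = rng.random((200, 1))                   # the injected non-informative feature
X = np.hstack([informative, noise])

kept = SelectKBest(f_classif, k=2).fit(X, y).get_support()
# Metamorphic expectation: the noise column (index 2) is screened out.
assert not kept[2], "feature selection failed to screen out the noise feature"
print("kept feature indices:", np.where(kept)[0])
```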
Coreference defined
• Sophia Loren says she will always be
grateful to Bono. The actress revealed
that the U2 singer helped her calm down
when she became scared by a
thunderstorm while travelling on a plane.
Coreference resolution
• Sophia Loren says she will always be
grateful to Bono. The actress revealed
that the U2 singer helped her calm down
when she became scared by a
thunderstorm while travelling on a plane.
Coreference defined
• Sophia Loren says she will always be
grateful to Bono. The actress revealed
that the U2 singer helped her calm down
when she became scared by a
thunderstorm while travelling on a plane.
Sophia Loren, she, The actress, her, she
Coreference defined
• Sophia Loren says she will always be
grateful to Bono. The actress revealed
that the U2 singer helped her calm down
when she became scared by a
thunderstorm while travelling on a plane.
Bono, the U2 singer
How do humans do this?
• Linguistic factors:
– Kevin saw Larry. He liked him.
• Knowledge about the world:
– Sophia Loren will always be grateful to Bono. The actress…
– Sophia Loren will always be grateful to Bono. The singer…
– Sophia Loren will always be grateful to Bono. The storm…
• A combination of world knowledge and linguistic factors:
– Sophia Loren says she will always be grateful to Bono…
– Sophia Loren says he will always be grateful to Bono…
Computers are bad at this
• Linguistic features don’t always help.
– Each child ate a biscuit. They were delicious.
– Each child ate a biscuit. They were delighted.
• Programming enough knowledge about
the world into a computer has proven to be
very difficult.
Our approach
• Matching semantic categories helps
– BRCA1, the gene
– Cell proliferation, leukocyte proliferation
• Minimal work on using ontologies
– WordNet (General English, mostly)
– Replacing ontology with web search
• We’re going to use ontologies, and more than anyone
• First step: broad semantic class assignment
Our approach
• Broad semantic class assignment
– Coreference resolution benefits from knowing
whether semantic classes match
– Semantic class ≈ what ontology you should
belong to
– Looking at headwords, frequent words,
informativeness measures
Why assign broad
semantic classes?
• Coreference resolution
• Information extraction
• Document classification
To be clear about what I
mean by “broad semantic
class…”
If you were going to be part of an ontology,
which ontology would you be part of?
Target semantic classes
• Chosen for relevance to mouse genomics:
– Gene Ontology
– Sequence Ontology
– Foundational Model of Anatomy
– NCBI Taxonomy
– Chemical Entities of Biological Interest
– Phenotypic Quality
– BRENDA Tissue/Enzyme Source
– Cell Type Ontology
– Gene Regulation Ontology
– Homology Ontology
– Human Disease Ontology
– Mammalian Phenotype Ontology
– Molecule Role Ontology
– Mouse Adult Gross Anatomy Ontology
– Mouse Pathology Ontology
– Protein Modification Ontology
– Protein-Protein Interaction Ontology
– Suggested Ontology for Pharmacogenomics
– Sample Processing and Separation Techniques Ontology
Method for class assignment
• Exact match
• “Stripping”
• Head noun
• Stemmed head noun
“Stripping”
• Delete all non-alphanumeric characters
– Cadmium-binding, cadmium binding → cadmiumbinding
Head nouns: two simple heuristics
• X of…
– where of represents any preposition
• Rightmost word
– Positive regulation of growth → Positive regulation → Regulation
(All four strategies are sketched in code below.)
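A minimal sketch of the four strategies against a toy term-to-ontology lookup table; the real system indexes whole ontologies, and the stemmer here is a toy substitute for a real one:

```python
import re

LOOKUP = {  # hypothetical index: term -> ontology
    "positive regulation of growth": "Gene Ontology",
    "cadmiumbinding": "Gene Ontology",
    "cell": "Cell Type Ontology",
}

def strip_term(term):
    return re.sub(r"[^a-z0-9]", "", term.lower())       # "stripping"

def head_noun(term):
    x = re.split(r"\s+(?:of|in|to|by|from)\s+", term.lower())[0]  # "X of..."
    return x.split()[-1]                                 # rightmost word of X

def stem(word):
    return re.sub(r"(ing|ed|s)$", "", word)              # toy stemmer, not Porter

def assign_class(term):
    t = term.lower()
    for candidate in (t, strip_term(t), head_noun(t), stem(head_noun(t))):
        if candidate in LOOKUP:
            return LOOKUP[candidate]
    return None

print(assign_class("Cadmium-binding"))                # via stripping
print(assign_class("positive regulation of growth"))  # via exact match
print(assign_class("cells"))                          # via stemmed head noun
```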
Evaluation
• Annotated corpus
• Ontology-against-itself
• Structured test suite
Two potential baselines
• NCBO Annotator
– Exact match and substring only—not strong
enough
• MetaMap
– Future work
CRAFT corpus
• Colorado Richly Annotated Full Text
• 97 full-text journal articles
• 597,000 words
• Evidence for MGI Gene Ontology annotations
• 119,783 annotations across five ontologies:
– Gene Ontology
– Sequence Ontology
– Cell Type Ontology
– NCBI Taxonomy
– ChEBI
Ontology against itself
• Use the terms from the ontologies themselves. Seems obvious, but…
• …every term should return its own ontology as its semantic class.
• Used the head noun technique only…
• …since exact match and stripping are guaranteed to give the right answer.
Structured test suite
• 300 canonical and non-canonical forms
• Categorized according to features of terms
and features of changes to terms
Structured test suite
Non-canonical forms:
• Ordering and other syntactic variants
• Inserted text
• Coordination
• Singular/plural variants
• Verbal versus nominal
• Adjectival versus nominal
• Unofficial synonyms
Features of terms:
• Length
• Punctuation
• Presence of stopwords
• Ungrammatical
• Presence of numerals
• Official synonyms
• Ambiguous terms
Structured test suite
• Syntax
– induction of apoptosis → apoptosis induction
• Part of speech
– cell migration → cell migrated
• Inserted text
– ensheathment of neurons → ensheathment of some neurons
Results on the CRAFT corpus when only CRAFT ontologies are used as input

Ontology                Annotations   Precision   Recall   F-measure
Gene Ontology           39,626        66.31       73.06    69.52
Sequence Ontology       40,692        63.00       72.21    67.29
Cell Type Ontology      8,383         53.58       87.27    66.40
NCBI Taxonomy           11,775        96.24       92.51    94.34
ChEBI                   19,307        70.07       90.53    79.00
Total (microaveraged)   119,783       67.06       78.49    72.32
Total (macroaveraged)   —             69.84       83.12    75.31
Accuracy on CRAFT corpus when all 20 ontologies are used

Ontology             Exact   Stripped   Head noun   Stemmed head
Gene Ontology        24.26   24.68      59.18       77.12
Sequence Ontology    44.28   47.63      56.63       73.33
Cell Type Ontology   25.26   25.80      70.09       88.38
NCBI Taxonomy        84.67   84.71      90.97       95.73
ChEBI                86.93   87.44      92.43       95.49
Results on
ontology-against-itself
• 97-100% for 18/20 ontologies (no surprise), but…
• …found much lower performance on two ontologies (Sequence Ontology and Molecule Role Ontology) due to preprocessing errors and omissions, indicating that…
• …this evaluation method is robust!
Results on structured test
suite
• Nota bene: this analysis is on the level of
individual terms, but don’t lose track of the
fact that we’re trying to recognize broad
semantic classes, not individual terms
Results on structured test
suite
• Headword technique works very well in
presence of syntactic variation
– induction of apoptosis/apoptosis induction
• Headword technique works in the
presence of inserted text
– ensheathment of neurons/ensheathment of
some neurons
Results on structured test
suite
• Headword stemming allows catching verb phrases
– cell migration/cells migrate
• Headword stemming fails when verb/noun relationship is irregular
– X growth/grows
• Stemming is always necessary for recognizing plurals, regardless of term length
• Porter stemmer fails on irregular plurals (demonstrated below)
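The last two bullets are easy to demonstrate with NLTK's Porter stemmer: regular plurals conflate with their singulars, irregular ones do not.

```python
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem
print(stem("cells"), stem("cell"))   # cell cell  -> regular plural conflates
print(stem("mice"), stem("mouse"))   # mice mous  -> irregular plural does not
```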
Results on structured test
suite
• Approach handles “ungrammatical” terms
like transposition, DNA-mediated
– Important because exact match will always
fail on these
Software engineering
perspectives on natural
language processing
Two paradigms of evaluation
• Traditional approach: use a corpus
– Expensive
– Time-consuming to produce
– Redundancy for some things…
– …underrepresentation of others (Oepen et al. 1998)
– Slow run-time (Cohen et al. 2008)
• Non-traditional approach: structured test suite
– Controls redundancy
– Ensures representation of all phenomena
– Easy to evaluate results and do error analysis
– Used successfully in grammar engineering
Structured test suite
             Canonical     Non-canonical
GO:0000133   Polarisome    Polarisomes
GO:0000108   Repairosome   Repairosomes
GO:0000786   Nucleosome    Nucleosomes
GO:0001660   Fever         Fevers
GO:0001726   Ruffle        Ruffles
GO:0005623   Cell          Cells
GO:0005694   Chromosome    Chromosomes
GO:0005814   Centriole     Centrioles
GO:0005874   Microtubule   Microtubules
Structured test suite
Features of terms:
• Length
• Punctuation
• Presence of stopwords
• Ungrammatical terms
• Presence of numerals
• Official synonyms
• Ambiguous terms
Types of changes:
• Singular/plural variants
• Ordering and other syntactic variants
• Inserted text
• Coordination
• Verbal versus nominal constructions
• Adjectival versus nominal constructions
• Unofficial synonyms
Structured test suite
• Syntax
– induction of apoptosis → apoptosis induction
• Part of speech
– cell migration → cell migrated
• Inserted text
– ensheathment of neurons → ensheathment of some neurons
Results
• No non-canonical terms were recognized
• 97.9% of canonical terms were recognized
– All exceptions contain the word in
• What would it take to recognize the error pattern with canonical terms with a corpus-based approach?
Cohen (2010)
Other uses to date
• Broad characterization of successes/failures (JULIE lab)
• Parameter tuning (Hinterberg)
• Semantic class assignment (see coreference resolution)
Weird stuff that comes up with full text: parentheses
• Background:
– Distinguishing feature of full text (Cohen et al.)
– Confusing to patients/laypeople, useful to us (Elhadad)
– Ignorable in gene names (Cohen et al.)
– Problems for parsers (Jang et al. 2006)
– Problems with hedge scope assignment (Morante and Daelemans)
– Abbreviation definition (Schwartz and Hearst)
– Gene symbol grounding (Lu)
– “Citances” (Nakov et al.)
– 17,063 in 97-document corpus
• Use cases:
– Use P-value to set weighting in networks
– Target for information extraction applications
– Coreference resolution within text
– Gene normalization
– Meta-analyses
– Table and figure mentions often indicators of assertions with experimental validation
– Mapping text to sub-figures
– Citations useful for establishing rhetorical relations between papers, synonym identification, and curation data
Weird stuff that comes up
with full text: parentheses
Category                      Use case
Gene symbol or abbreviation   Gene normalization, coreference resolution
Citation                      Summaries, high-value sentences, bibliometrics
Data value                    Information extraction
P-value                       Link weighting, meta-analysis
Figure/table pointer          Strong indicator of good evidence
List element                  Mapping sub-figures to text
Singular/plural               Distinguish from other categories
Part of gene name             Gene normalization
Parenthetical statement       Potentially ignorable, or IE target
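As a sketch for one row of the table, pulling parenthesized P-values out of running text for link weighting or meta-analysis; the pattern covers only common surface forms:

```python
import re

PVAL = re.compile(r"\(\s*[pP]\s*([<>]=?|=)\s*(\d*\.?\d+(?:[eE]-?\d+)?)\s*\)")

text = ("Expression was elevated in mutants (p < 0.001) "
        "but unchanged in controls (P = 0.38).")
for match in PVAL.finditer(text):
    print(match.group(1), float(match.group(2)))
```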
Discourse annotation
• Want to be able to follow and perform abductive reasoning
• Methods under development for labelling aspects of the structure of an argument
• Currently building large data set from the CRAFT corpus—97 articles
More projects than people
• Ongoing:
– Coreference resolution
– Software engineering perspectives on natural language processing
– Tuberculosis and translational medicine
– Discourse analysis annotation
– OpenDMAP
• In need of fresh blood:
– Metagenomics/Microbiome studies
– Temporality in clinical documents
– Translational medicine from the clinical side
– Summarization
– Negation
– Question-answering: Why?
– Nominalizations
– Metamorphic testing for natural language processing