Transcript Slide 1
Research Opportunities in Biomedical Text Mining
Kevin Bretonnel Cohen
Biomedical Text Mining Group Lead
[email protected]
http://compbio.ucdenver.edu/Hunter_lab/Cohen
More projects than people
• Ongoing:
– Coreference resolution
– Software engineering perspectives on natural language processing
– Odd problems of full text
– Tuberculosis and translational medicine
– Discourse analysis annotation
• In need of fresh blood:
– Metagenomics/Microbiome studies
– Translational medicine from the clinical side
– Summarization
– Negation
– Question-answering: Why?
– Nominalizations
– Metamorphic testing for natural language processing
Metagenomics/microbiome studies
• Experiments not interpretable/comparable without large amounts of metadata
• Metadata in various places
– (fielded)
– GenBank isolation_source field
– GOLD description fields
– Journal articles (full text)
Metagenomics/microbiome studies
• Various standards:
– MIMS
– MIMARKS (Nat Biotech, forthcoming)
• Ontology terms
• Continuous variables
• ??
Metagenomics/microbiome studies
• “Metagenomic sequence data that lack an environmental context have no value.”
– Crucial to replication, analysis
• Do microbial gene richness and evenness patterns (at some specific sampling density) correlate with other environmental characteristics?
• Which microbial phylotypes or functional guilds co-occur with high statistical probability in different environments?
• Do specific phylotypes track particular geographic or physico-chemical clines (latitudes, isotherms, isopycnals, etc.)?
• Do specific microbial community ORFs (functionally identified or not) track specific bioenergetic gradients (solar, geothermal, digestive tracts, etc.)?
• What is the percentage of genes with a given role, as a function of some physical feature, e.g. the average temperature of the sample sites?
• Do microbial community protein families, amino acid content, or sequence motifs vary systemically as a function of habitat of origin? Are specific protein sequence motifs characteristic of specific habitats?
• What is the “resistome” in soil? (Phenotype)
• Habitat change over time, host-to-host variation, within-host variation: biodefense and forensics applications
Metagenomics/microbiome studies
• Investigation type: eukaryote, bacteria, virus, plasmid, organelle, metagenome
• Experimental factor: Experimental Factor Ontology, Ontology for Biomedical Investigations
• Latitude, longitude, depth, elevation, humidity, CO2, CO, salinity, temperature, …
• Geographic location (country, region, sea) from Gaz Ontology
• Collection date/time
• Environment, biome and features, material: Environment Ontology
• Trophic level; aerobe/anaerobe
• Sample collection device or method
• Sample material processing: Ontology for Biomedical Investigations
• Amount or size of sample
• Targeted gene or locus name
• PCR primer, conditions
• Sequencing method
• Chemicals administered: ChEBI
• Diseases: Disease Ontology
• Body site
• Phenotype: PATO
Metagenomics/microbiome studies
• Where do you find this stuff?
– Text fields in databases
• Isolation_source in GenBank
• Description in GOLD
• TBD in microbiome studies, but hopefully coming
• Full text of journal articles
– Marine secondary products corpus coming (pharmacogenomics connection)
– Problem of tables
– Multiple sentences, coreference
Timeline: July 2011
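A minimal sketch of what mining a free-text field such as GenBank isolation_source could look like: a keyword lookup for ontology-style labels plus a pattern for one continuous variable (depth). The lexicon, the label strings, and the function name are hypothetical placeholders invented for this illustration; they are not real ontology identifiers or an existing tool.

```python
import re

# Hypothetical keyword-to-label lexicon. The labels are illustrative
# placeholders, NOT real Environment Ontology identifiers.
HABITAT_LEXICON = {
    "soil": "habitat:soil",
    "sea water": "habitat:marine",
    "seawater": "habitat:marine",
    "gut": "body_site:gut",
}

# One continuous variable as an example: depth given in m or cm.
DEPTH_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(m|cm)\s+depth", re.IGNORECASE)


def annotate_isolation_source(text):
    """Return placeholder labels and a depth in metres found in a free-text field."""
    lowered = text.lower()
    labels = sorted(
        {
            label
            for keyword, label in HABITAT_LEXICON.items()
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
        }
    )
    depth_m = None
    match = DEPTH_PATTERN.search(text)
    if match:
        value = float(match.group(1))
        # Normalize centimetres to metres.
        depth_m = value / 100 if match.group(2).lower() == "cm" else value
    return {"labels": labels, "depth_m": depth_m}


print(annotate_isolation_source("forest soil, 10 cm depth"))
```

A dictionary lookup like this only covers the fielded, single-phrase case; tables and multi-sentence descriptions in full text need more than string matching.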
Translational medicine from the clinical side
• Factors affecting inclusion/exclusion from clinical trials
• Sharpening phenotypes (7% of patients in Schwarz’s PIF study)
• ICD9-CM prioritization
• Gazillions of named entity recognition problems (drugs, assays, signs, symptoms, vital signs, …)
Translational medicine from the clinical side
• History: foundational
• Practice: difficult (access issues)
• Technical problems related to data availability (e.g. will you have enough for machine learning?)
– TREC EMR track: yes
– i2b2 obesity data: probably
• Time for a renaissance
• Strategy: break in via TREC; deadline: summer/fall
Summarization
• Task: Given one or more documents, produce a shorter version that preserves information
• Difficulties (multi-document): Duplication, aggregation, presentation
• Holy grail: abstraction
Extraction vs. abstraction
• Extraction:
– “Extract” strings from the input text
– Assemble them to produce the summary
• Abstraction:
– Find meaning
– Produce text that communicates the meaning
• “An abstract is a summary at least some of whose material is not present in the input.” (Mani)
• “An extract is a summary consisting entirely of material copied from the input.” (Mani)
Extract, or abstract?
Abstract
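To make the extract side of the distinction concrete, here is a minimal sketch of an extractive summarizer: it scores each sentence by the average document frequency of its content words and copies the top-scoring sentences verbatim. The stopword list, the scoring scheme, and the function names are assumptions made for this illustration, not a method from the talk.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "that", "it"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def extract_summary(text, max_sentences=1):
    """Copy the highest-scoring sentences verbatim, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = "BRCA1 is found in humans. BRCA1 plays a role in breast cancer. The weather was nice."
print(extract_summary(doc))
```

Every string in the output is copied from the input, which is what makes it an extract in Mani's sense; an abstract would require generating material that is not present in the input.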
Relationship between summarization and generation
• Natural language generation: producing textual output
• Coherence (good)
– Redundancy (bad)
– Unresolvable anaphora (bad)
– Gaps in reasoning (bad)
– Lack of organization (bad)
Summarization and generation
• GENE: BRCA1
• SPECIES:
– Hs.
• DISEASE_ASSOC.:
– Breast cancer
BRCA1 is found in humans. BRCA1 plays a role in breast cancer.
BRCA1 is found in humans. It plays a role in breast cancer.
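The two outputs above differ only in whether the second mention of the gene is pronominalized. A minimal template-based sketch of that choice follows; the record layout, the species lookup table, and the function name are assumptions made for this illustration.

```python
# Hypothetical lookup from species abbreviation to a readable name.
SPECIES_NAMES = {"Hs.": "humans"}


def realize(record, use_pronoun=True):
    """Turn a structured gene record into two sentences.

    With use_pronoun=True the repeated subject is replaced by "It",
    which removes the redundancy of naming the gene twice.
    """
    gene = record["GENE"]
    species = SPECIES_NAMES[record["SPECIES"]]
    disease = record["DISEASE_ASSOC"].lower()
    subject = "It" if use_pronoun else gene
    return f"{gene} is found in {species}. {subject} plays a role in {disease}."


record = {"GENE": "BRCA1", "SPECIES": "Hs.", "DISEASE_ASSOC": "Breast cancer"}
print(realize(record, use_pronoun=False))
print(realize(record, use_pronoun=True))
```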
A multi-document summary
Caenorhabditis elegans p53: role in apoptosis, meiosis, and stress resistance. Bcl-2 and p53: role in dopamine-induced apoptosis and differentiation. P53 role in DNA repair and tumorigenesis.
Another multi-document summary
P53: role in apoptosis, meiosis, and stress resistance, dopamine-induced apoptosis and differentiation, DNA repair and tumorigenesis.
Another multi-document summary
P53 has a role in apoptosis, meiosis, and stress resistance. It also has a role in dopamine-induced apoptosis and differentiation, DNA repair, and tumorigenesis.
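The step from the first multi-document summary to the later ones is aggregation: facts about the same entity are merged and duplicates are dropped before any text is produced. A minimal sketch, assuming the facts have already been extracted as (entity, roles) pairs; that input format and the function names are hypothetical.

```python
def aggregate_roles(facts):
    """Merge role lists per entity, keeping first-seen order and dropping duplicates."""
    merged = {}
    for entity, roles in facts:
        bucket = merged.setdefault(entity, [])
        for role in roles:
            if role not in bucket:
                bucket.append(role)
    return merged


def join_with_and(items):
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def realize(entity, roles):
    return f"{entity} has a role in {join_with_and(roles)}."


facts = [
    ("P53", ["apoptosis", "meiosis", "stress resistance"]),
    ("P53", ["dopamine-induced apoptosis", "differentiation"]),
    ("P53", ["DNA repair", "tumorigenesis", "apoptosis"]),  # "apoptosis" repeats
]
merged = aggregate_roles(facts)
print(realize("P53", merged["P53"]))
```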
Summarization and generation
• Examples of non-coherent summaries that wouldn’t be bad…
– A table
– A table of contents?
– An index?
– A diagram?
Timeline: no pressure
Negation
• Classic problem
• Reasonably well-studied in clinical domain (NegEx), but heavily restricted by semantic class
• Biological domain: 0.20-0.43 F-measure
• Pattern-learning for OpenDMAP, machine learning, semantic role labelling…
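For orientation, a minimal sketch of trigger-and-window negation detection in the general style of NegEx: a target concept counts as negated if a trigger word occurs within a few tokens before it. The trigger set, the window size, and the function name are simplifications chosen for this illustration; this is not the actual NegEx trigger list or algorithm.

```python
import re

# Illustrative trigger subset and window size (assumptions of this sketch).
NEGATION_TRIGGERS = {"no", "not", "without", "denies"}
WINDOW = 5


def is_negated(sentence, target):
    """Return True if a trigger appears within WINDOW tokens before the target phrase."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    target_tokens = target.lower().split()
    n = len(target_tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == target_tokens:
            preceding = tokens[max(0, start - WINDOW):start]
            if any(tok in NEGATION_TRIGGERS for tok in preceding):
                return True
    return False


print(is_negated("The patient denies chest pain.", "chest pain"))
print(is_negated("BRCA1 did not inhibit cell proliferation.", "cell proliferation"))
```

A fixed window like this ignores syntax and semantic class entirely, which is one plausible reason such approaches transfer poorly from clinical to biological text.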
Semantic role labelling
Arg1: experiencer
Arg2: origin
Arg3: distance
Arg4: destination
Figure adapted from
Haghighi et al. (2005)
Timeline: no pressure
Question-answering: Why?
• Why did David Koresh ask for a typewriter?
• Why did I have a Clif bar for breakfast? versus Why did I have a Clif bar for breakfast instead of cereal?
• Need for data set collection
• Need novel methods: pattern-matching doesn’t work well
Question-answering: Why?
• Overall performance is poor
– 0.00 MRR versus 0.69 on birthyear (Ravichandran and Hovy 2002)
– 0.33 MRR versus 0.75 on location (Ravichandran and Hovy 2002)
– 45% at least partially correct (Higashinaka and Isozaki 2007)
– 0.35 mean reciprocal rank (Verberne et al. 2010)
• Pattern-based approaches outperformed
Question-answering: Why?
• …why-questions are one of the most complex types. This is mainly because the answers to why-questions are not named entities (which are in general clearly identifiable), but text passages giving a (possibly implicit) explanation (Maybury 2002 in Verberne 2007)
• Answers to why-questions cannot be stated in a single phrase but they are passages of text that contain some form of
Question-answering: Why?
• How can we improve on machine learning
methods?
– Don’t try—improve pattern learning, instead
– Apply what we’re learning about inference
and knowledge representation from
Hanalyzer-related work
– Improved recognition of semantic classes in
text (more on this later)
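The why-question results in this section are reported as mean reciprocal rank (MRR): for each question, take 1/rank of the first correct answer in the system's ranked list (0 if none is correct), then average over questions. A minimal sketch of the computation; the data layout and the toy questions are assumptions for illustration.

```python
def mean_reciprocal_rank(ranked_answers, gold):
    """Compute MRR.

    ranked_answers: dict mapping question id -> list of candidate answers, best first.
    gold: dict mapping question id -> set of acceptable answers.
    """
    total = 0.0
    for qid, candidates in ranked_answers.items():
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in gold[qid]:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_answers)


ranked = {
    "q1": ["a", "b", "c"],  # correct at rank 1 -> 1.0
    "q2": ["x", "y", "z"],  # correct at rank 2 -> 0.5
    "q3": ["m", "n"],       # no correct answer -> 0.0
}
gold = {"q1": {"a"}, "q2": {"y"}, "q3": {"k"}}
print(mean_reciprocal_rank(ranked, gold))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```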
Nominalization
• Nominalization: noun derived from a verb
– Verbal nominalization: activation, inhibition,
induction
– Argument nominalization: activator,
inhibitor, inducer, mutant
Nominalizations are dominant in biomedical texts
Predicate        Nominalization   All verb forms
Express          2,909            1,233
Develop          1,408            597
Analyze          1,565            364
Observe          185              809
Differentiate    737              166
Describe         10               621
Compare          185              668
Lose             556              74
Perform          86               599
Form             533              511
Data from CRAFT corpus
Relevant points for text
mining
• Nominalizations are an obvious route for
scaling up recall
• Nominalizations are more difficult to
handle than verbs…
• …but can yield higher precision (Cohen et
al. 2008)
Alternations of nominalizations:
positions of arguments
• Any combination of the set of positions
for each argument of a nominalization
– Pre-nominal: phenobarbital induction,
trkA expression
– Post-nominal: increases of oxygen
– No argument present: Induction followed
a slower kinetic…
– Noun-phrase-external: this enzyme can
undergo activation
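A minimal sketch of how the two easiest positions, pre-nominal and post-nominal ("of" phrase), might be detected with surface patterns. The stoplist and function name are hypothetical, and the heuristic is deliberately naive.

```python
import re

# Words that can directly precede a nominalization without being its argument.
# Illustrative stoplist only (an assumption of this sketch).
NON_ARGUMENT_WORDS = {"the", "a", "an", "this", "that", "of", "by", "in", "undergo", "undergoes"}


def argument_positions(sentence, nominalization):
    """Return surface arguments found directly before or after (via 'of') the nominalization."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    lowered = [t.lower() for t in tokens]
    target = nominalization.lower()
    if target not in lowered:
        return None
    i = lowered.index(target)
    positions = {}
    if i > 0 and lowered[i - 1] not in NON_ARGUMENT_WORDS:
        positions["pre-nominal"] = tokens[i - 1]
    if i + 2 < len(tokens) and lowered[i + 1] == "of":
        positions["post-nominal"] = tokens[i + 2]
    return positions


print(argument_positions("phenobarbital induction", "induction"))
print(argument_positions("increases of oxygen", "increases"))
print(argument_positions("this enzyme can undergo activation", "activation"))
```

Note that the last call returns an empty result, exactly as a sentence with no argument would: a surface pattern of this kind cannot tell a noun-phrase-external argument ("this enzyme") from an absent one.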
Result 1: attested alternations
are extraordinarily diverse
• Inhibition, a 3-argument predicate—
Arguments 0 and 1 only shown
Implications for system-building
• Distinction between absent and noun-phrase-external arguments is crucial and difficult, and finite state approaches will not suffice; merging data from different clauses and sentences may be useful
• Pre-nominal arguments are undergoer by ratio of 2.5:1
• For predicates with agent and patient, post/post and pre/post patterns predominate, but others are common as well
What can be done?
• External arguments:
– semantic role labelling approach
• …but, very important to recognize the
absent/external distinction, especially with
machine learning
– pattern-based approach
• …but, approaches to external arguments (RLIMS-P) are so far very predicate-specific
What can be done?
• Pre-nominal arguments:
– apply heuristic that we have identified based
on distributional characteristics
– for most frequent nominalizations, manual
encoding may be tractable
Timeline: no pressure
Metamorphic testing for NLP
• Metamorphic testing motivation: situations where input/output space is intractably large and it’s not clear what would constitute right answers
• Use domain knowledge to specify broad categories of changes to output that should occur with broad categories of changes to input
Metamorphic testing for NLP
• Gene regulatory networks:
– Add an unconnected node—G should be
subsumed by G’
• SeqMap:
– Given a reference string p and a set of
sequence reads T = {t1, t2, ..., tn}, and a
genome p, we form a new genome p' by
deleting an arbitrary portion of either the
beginning or ending of p. After mapping T to
both p and p' independently, all reads in T that
are unmappable to p should also be
Metamorphic testing for NLP
• Non-linguistic
– Add non-informative feature, see if feature
selection screens it out
– Subtract informative features, see if
performance goes down
• Linguistic
–?
Timeline: no pressure
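The SeqMap-style relation can be written as an executable check: truncating the genome must never make a previously unmappable read mappable. The sketch below uses a toy exact-substring "mapper" as a stand-in for the program under test; the mapper and all function names are assumptions for illustration, not SeqMap itself.

```python
def mappable(read, genome):
    """Toy stand-in for a read mapper: exact substring match only."""
    return read in genome


def unmappable_reads(reads, genome):
    return {r for r in reads if not mappable(r, genome)}


def check_truncation_relation(reads, genome, cut, from_start=True):
    """Metamorphic relation: reads unmappable to p stay unmappable to the truncated p'.

    No gold-standard mapping is needed; only the relation between two runs is checked.
    """
    truncated = genome[cut:] if from_start else genome[:len(genome) - cut]
    before = unmappable_reads(reads, genome)
    after = unmappable_reads(reads, truncated)
    return before <= after  # subset test


genome = "ACGTACGGTCA"
reads = ["ACGT", "GGTC", "TTTT"]
print(check_truncation_relation(reads, genome, cut=3, from_start=True))
```

A linguistic analogue would need an input change whose effect on the output is predictable in the same way; finding such relations is the open question on the slide.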
Wide range of projects over the past few years
• Named entity recognition:
– Shuhei Kinoshita, K. Bretonnel Cohen, Philip V. Ogren, and Lawrence Hunter (2005). BioCreative Task 1A: entity identification with a stochastic tagger. BMC Bioinformatics 6(Suppl. 1):S4.
• Information extraction:
– William A. Baumgartner, Jr., ...K. Bretonnel Cohen, and Lawrence Hunter (submitted). Leveraging concept recognition to extract protein interaction relations from biomedical text. Genome Biology.
• Summarization:
– Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter (2006). Finding GeneRIFs via Gene Ontology annotations. Pacific Symposium on Biocomputing 11:52-63.
• Word sense disambiguation:
– William A. Baumgartner, Jr., ...K. Bretonnel Cohen, and Lawrence Hunter (2007). An integrated approach to concept recognition in biomedical text. Proceedings of BioCreative II.
• Question-answering/IR:
– J. Gregory Caporaso, William A. Baumgartner Jr., Hyunmin Kim, Zhiyong Lu, Helen L. Johnson, Olga Medvedeva, Anna Lindemann, Lynne Fox, Elizabeth White, K. Bretonnel Cohen, and Lawrence Hunter (2006). Concept recognition, information retrieval, and machine learning in genomics question-answering. Proceedings of the Fifteenth Text Retrieval Conference.
• Document classification/IR:
– J. Gregory Caporaso, William A. Baumgartner Jr., K. Bretonnel Cohen, Helen L. Johnson, Jesse Paquette, and Lawrence Hunter (2005). Concept recognition and the TREC Genomics tasks. Proceedings of the Fourteenth Text Retrieval Conference, National Institute of Standards and Technology.
• Computational lexical semantics:
– Philip V. Ogren, K. Bretonnel Cohen, George K. Acquaah-Mensah, Jens Eberlein, and Lawrence Hunter (2004). The compositional structure of Gene Ontology terms. Pacific Symposium on Biocomputing 2004, pp. 214-225.
– Philip V. Ogren, K. Bretonnel Cohen, and Lawrence Hunter (2005). Implications of compositionality in the Gene Ontology for its curation and usage. Pacific Symposium on Biocomputing 2005, pp. 174-185.
– Helen L. Johnson, K. Bretonnel Cohen, William A. Baumgartner Jr., Zhiyong Lu, Michael Bada, Todd Kester, Hyunmin Kim, and Lawrence Hunter (2006). Evaluation of lexical methods for detecting relationships between concepts from multiple ontologies. Pacific Symposium on Biocomputing 11:28-39.
• Corpus linguistics:
– K. Bretonnel Cohen, Lynne Fox, Philip Ogren, and Lawrence Hunter (2005). Empirical data on corpus design and usage in biomedical natural language processing. AMIA 2005 symposium proceedings, pp. 156-160.
– K. Bretonnel Cohen, Lynne Fox, Philip V. Ogren, and Lawrence Hunter (2005). Corpus design for biomedical natural language processing. Proceedings of the ACL-ISMB workshop on linking biological literature, ontologies and databases, pp. 38-45. Association for Computational Linguistics.
Other recent projects
• Characterizing biomedical language
– Open Access versus traditional journals
– Full text versus abstracts
– Nominalization and alternations
• Biological event extraction
• Ontology quality assurance
• Evaluation from many angles: shared task organization and participation; many angles on testing
• SciKnowMine and BASILISK evaluation (with Ellen Riloff)
• GO term recognition (with Michael and Karin)
• Grant-writing
Coreference defined
• Sophia Loren says she will always be
grateful to Bono. The actress revealed
that the U2 singer helped her calm down
when she became scared by a
thunderstorm while travelling on a plane.
Sophia Loren, she, The actress, her, she
Bono, the U2 singer
How do humans do this?
• Linguistic factors:
– Kevin saw Larry. He liked him.
• Knowledge about the world:
– Sophia Loren will always be grateful to Bono. The actress…
– Sophia Loren will always be grateful to Bono. The singer…
– Sophia Loren will always be grateful to Bono. The storm…
• A combination of world knowledge and linguistic factors:
– Sophia Loren says she will always be grateful to Bono…
– Sophia Loren says he will always be grateful to Bono…
Computers are bad at this
• Linguistic features don’t always help.
– Each child ate a biscuit. They were delicious.
– Each child ate a biscuit. They were delighted.
• Programming enough knowledge about
the world into a computer has proven to be
very difficult.
Our approach
• Matching semantic categories helps
– BRCA1, the gene
– Cell proliferation, leukocyte proliferation
• Minimal work on using ontologies
– WordNet (General English, mostly)
– Replacing ontology with web search
• We’re going to use ontologies, and more than anyone
• First step: broad semantic class assignment
Our approach
• Broad semantic class assignment
– Coreference resolution benefits from knowing
whether semantic classes match
– Semantic class ≈ what ontology you should
belong to
– Looking at headwords, frequent words,
informativeness measures
Timeline (coref, not semantic class assignment): this spring
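A minimal sketch of the idea that coreference candidates can be filtered by whether their broad semantic classes match, using only the headword of each mention. The lexicon, the class names, and the last-token headword rule are assumptions made for this illustration; in practice the classes would come from ontologies rather than a hand-written table.

```python
# Hypothetical headword-to-class lexicon (illustrative only).
HEADWORD_CLASSES = {
    "gene": "gene_or_gene_product",
    "brca1": "gene_or_gene_product",
    "proliferation": "biological_process",
    "apoptosis": "biological_process",
}


def headword(mention):
    """Assume the head of a simple English noun phrase is its last token."""
    return mention.lower().split()[-1]


def semantic_class(mention):
    return HEADWORD_CLASSES.get(headword(mention))


def classes_match(mention_a, mention_b):
    """Two mentions are coreference-compatible only if both have a known, equal class."""
    class_a = semantic_class(mention_a)
    class_b = semantic_class(mention_b)
    return class_a is not None and class_a == class_b


print(classes_match("BRCA1", "the gene"))
print(classes_match("cell proliferation", "leukocyte proliferation"))
print(classes_match("BRCA1", "leukocyte proliferation"))
```

A class match is a filter, not a decision: the second pair shares a class, but whether the two mentions actually corefer still depends on context.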
Software engineering perspectives on natural language processing
Two paradigms of evaluation
• Traditional approach: use a corpus
– Expensive
– Time-consuming to produce
– Redundancy for some things…
– …underrepresentation of others (Oepen et al. 1998)
– Slow run-time (Cohen et al. 2008)
• Non-traditional approach: structured test suite
– Controls redundancy
– Ensures representation of all phenomena
– Easy to evaluate results and do error analysis
– Used successfully in grammar engineering
Structured test suite
GO ID        Canonical     Non-canonical
GO:0000133   Polarisome    Polarisomes
GO:0000108   Repairosome   Repairosomes
GO:0000786   Nucleosome    Nucleosomes
GO:0001660   Fever         Fevers
GO:0001726   Ruffle        Ruffles
GO:0005623   Cell          Cells
GO:0005694   Chromosome    Chromosomes
GO:0005814   Centriole     Centrioles
GO:0005874   Microtubule   Microtubules
Structured test suite
Features of terms
• Length
• Punctuation
• Presence of stopwords
• Ungrammatical terms
• Presence of numerals
• Official synonyms
• Ambiguous terms
Types of changes
• Singular/plural variants
• Ordering and other syntactic variants
• Inserted text
• Coordination
• Verbal versus nominal constructions
• Adjectival versus nominal constructions
• Unofficial synonyms
Structured test suite
• Syntax
– induction of apoptosis → apoptosis induction
• Part of speech
– cell migration → cell migrated
• Inserted text
– ensheathment of neurons → ensheathment of some neurons
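A minimal sketch of how such a test suite could be generated and run: each canonical term is paired with systematically derived non-canonical variants, and every variant is fed to the recognizer under test. The variant functions are naive string operations and the exact-match recognizer is a toy stand-in, both invented for this illustration.

```python
def pluralize(term):
    """Naive plural: append 's' (enough for terms like 'Polarisome')."""
    return term + "s"


def swap_of(term):
    """'X of Y' -> 'Y X' (syntactic variant); other terms are returned unchanged."""
    head, sep, modifier = term.partition(" of ")
    return f"{modifier} {head}" if sep else term


def insert_text(term, inserted="some"):
    """'X of Y' -> 'X of some Y' (inserted-text variant)."""
    head, sep, modifier = term.partition(" of ")
    return f"{head} of {inserted} {modifier}" if sep else term


def make_exact_recognizer(dictionary):
    """Toy recognizer under test: case-insensitive exact dictionary lookup."""
    known = {entry.lower() for entry in dictionary}
    return lambda text: text.lower() in known


recognize = make_exact_recognizer(["Polarisome", "induction of apoptosis"])
suite = {
    "canonical": ["Polarisome", "induction of apoptosis"],
    "non-canonical": [pluralize("Polarisome"), swap_of("induction of apoptosis")],
}
for label, terms in suite.items():
    hits = sum(recognize(t) for t in terms)
    print(f"{label}: {hits}/{len(terms)} recognized")
```

Because every test case is labelled with the phenomenon it exercises, a failure points directly at the variant type that caused it.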
Results
• No non-canonical terms were recognized
• 97.9% of canonical terms were recognized
– All exceptions contain the word “in”
• What would it take to recognize the error pattern with canonical terms with a corpus-based approach?
Cohen (2010)
Other uses to date
• Broad characterization of successes/failures (JULIE lab)
• Parameter tuning (Hinterberg)
• Semantic class assignment (see coreference resolution)
Timeline: ongoing, lots of work already done
Weird stuff that comes up with full text: parentheses
• Background
– Distinguishing feature of full text (Cohen et al.)
– Confusing to patients/laypeople, useful to us (Elhadad)
– Ignorable in gene names (Cohen et al.)
– Problems for parsers (Jang et al. 2006)
– Problems with hedge scope assignment (Morante and Daelemans)
– Abbreviation definition (Schwartz and Hearst)
– Gene symbol grounding (Lu)
– “Citances” (Nakov et al.)
– 17,063 in 97-document corpus
• Use cases
– Use P-value to set weighting in networks
– Target for information extraction applications
– Coreference resolution within text
– Gene normalization
– Meta-analyses
– Table and figure mentions often indicators of assertions with experimental validation
– Mapping text to sub-figures
– Citations useful for establishing rhetorical relations between papers, synonym identification, and curation data
Weird stuff that comes up with full text: parentheses
Category                      Use case
Gene symbol or abbreviation   Gene normalization, coreference resolution
Citation                      Summaries, high-value sentences, bibliometrics
Data value                    Information extraction
P-value                       Link weighting, meta-analysis
Figure/table pointer          Strong indicator of good evidence
List element                  Mapping sub-figures to text
Singular/plural               Distinguish from other categories
Part of gene name             Gene normalization
Parenthetical statement       Potentially ignorable, or IE target
Timeline: Mid-March (AMIA)
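A minimal sketch of a rule-based classifier for parenthesized material, covering a few of the categories above with regular expressions. The patterns and their ordering are assumptions made for this illustration and would miss many real cases; gene symbols and data values, for instance, are not attempted.

```python
import re

# Ordered (label, pattern) rules; the first match wins. Patterns are illustrative only.
RULES = [
    ("P-value", re.compile(r"^p\s*[<>=≤]\s*\d*\.?\d+", re.IGNORECASE)),
    ("Figure/table pointer", re.compile(r"^(fig(ure)?\.?|table)\s*\d+", re.IGNORECASE)),
    ("Citation", re.compile(r"^[A-Z][A-Za-z-]+( et al\.?| and [A-Z][A-Za-z-]+)?,? \d{4}$")),
    ("Singular/plural", re.compile(r"^e?s$")),
    ("List element", re.compile(r"^([ivx]+|[a-z]|\d)$", re.IGNORECASE)),
]


def parenthesized_spans(text):
    """Return the contents of all non-nested parentheses in the text."""
    return re.findall(r"\(([^()]*)\)", text)


def classify(content):
    content = content.strip()
    for label, pattern in RULES:
        if pattern.search(content):
            return label
    return "Other"


sentence = "Expression increased (p < 0.05) in mutant mice (Figure 2A), as reported (Jang et al. 2006)."
for span in parenthesized_spans(sentence):
    print(span, "->", classify(span))
```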
Tuberculosis and translational medicine
• Pathogen/host interactions
– Pathogens
– Hosts
– Genes
–?
• Information retrieval alone is difficult if strain must be considered
• Current status: evaluating dictionary-based approaches to gene name recognition; 0.69 and climbing
Timeline: Not pressing
Discourse annotation
• Want to be able to follow and perform abductive reasoning
• Methods under development for labelling aspects of the structure of an argument
• Currently building large data set from the CRAFT corpus (97 articles)
Timeline: Soon for annotation, September for doing something with it
More projects than people
• Ongoing:
– Coreference resolution (spring)
– Software engineering perspectives on natural language processing
– Odd problems of full text (mid-March)
– Semantic classification (March)
– Tuberculosis and translational medicine
– Discourse analysis annotation (months)
– SciKnowMine and BASILISK (with Ellen Riloff)
• In need of fresh blood:
– Metagenomics/Microbiome studies (July)
– Translational medicine from the clinical side (summer work, due fall)
– Summarization
– Negation
– Question-answering: Why?
– Nominalizations
– Metamorphic testing for natural language processing