Text Mining for Bioscience
Applications:
The State of the Art
Marti Hearst
University of California, Berkeley
Outline
• Search vs. Discovery
• Why is text analysis difficult?
• Some current approaches
• Future directions
My Background
• Computer Scientist by training
– NOT a biologist
• Professor in an interdisciplinary program
– School of Information Management & Systems (SIMS)
– Affiliated with the UCSF Bioinformatics Grad Group
• Research fields are
– Computational Linguistics
– Search (Information Retrieval)
– User Interfaces and Information Visualization
• Have focused for a while on bioscience text
• Have received research support from Genentech
Search vs. Discovery
Search:
Finding hay in a haystack
Monet, Haystack with Snow, Morning
Discovery:
Creating a new
kind of hay
Search Goals
• More accurate results
• More comprehensive results
– Thesaurus expansion
• Intelligent summaries of results
• Organize results along biologically
relevant lines
• Better user interfaces
Knowledge Discovery from Text
• How to discover new information …
• … As opposed to looking up what’s
already known.
• Method:
– Create hypotheses
– Use large text collections to gather
evidence to refute or support
hypotheses
– Do lab tests to verify promising results
Discovery Goals
• Genomics
– Automatically build gene networks
– Discover gene functions
• Pharmacology
– Help determine which drugs can help cure a
disease
– Help determine which genetic traits will lead
to a reaction to a drug
• Etiology
– Discover underlying causes of disease
Why is Automated Text Analysis
Difficult?
Why is automated text analysis difficult?
“Avastin, developed by South San Francisco-based
Genentech (DNA), was approved for advanced
colorectal cancer and for patients who haven't
received other chemotherapy, according to the Food
and Drug Administration.”
– What is approved doing in this sentence?
• John was approved for advancement -> gets a promotion.
• Avastin was approved for cancer -> to fight cancer.
• Avastin was approved for patients -> to consume to fight
cancer.
– What kind of patients is it approved for?
• Ambiguous. Could be for anyone who hasn’t received
chemotherapy, or only those patients with advanced
colorectal cancer who haven’t received chemotherapy.
USA Today, 2/26/04, Szabo & Appleby
Why is automated text analysis difficult?
“This could easily be a multibillion-dollar
drug," McCamant says.
Refers to concepts mentioned in earlier
sentences.
USA Today, 2/26/04, Szabo & Appleby
Why is automated text analysis difficult?
"Avastin opens up this new gateway for
cancer care," says William Li, president
of the Angiogenesis Foundation in
Massachusetts. "It's the first in a fleet of
other drugs.”
– Is Avastin a vehicle? It opens gateways and
travels in a fleet!
USA Today, 2/26/04, Szabo & Appleby
Why is automated text analysis difficult?
• There are many indirect ways to say things:
– A two-dose combined hepatitis A and B vaccine
would facilitate immunization programs.
• The vaccine helps prevent hep B.
– These results suggest that con A-induced hepatitis
was ameliorated by pretreatment with TJ-135.
• The treatment TJ-135 helps cure hep.
– Effect of interferon on hepatitis B.
• There is an unspecified effect of interferon on hep B.
What do we do?
• Solve sub-problems
– Extract certain types of entities
• Gene/protein names
• Abbreviation definitions
– Classify the noun phrases using ontologies
• MeSH, LocusLink, GO, etc.
– Define relationship types; try to recognize them.
– Many other subproblems are actively being worked on
• Word sense disambiguation
• Co-reference resolution
Two Main Approaches
Hand-built Rules
Machine Learning
Two Main Approaches
• Hand-built rules
– Can be very accurate
– Are also very “brittle”
– Don’t scale
• Machine learning
– Usually requires labeled training data
• Unsupervised methods under development
– Can be made to scale
– Is the way of the future
Abbreviation Definition Recognition
A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, Ariel
Schwartz and Marti Hearst, PSB 2003 Kauai, Jan 2003
• Fast, simple algorithm for recognizing abbreviation
definitions.
– Simpler and faster than the rest
• Other approaches are cubic or quadratic in time
– Higher precision and recall
– Idea: Work backwards from the end
• Examples:
– In eukaryotes, the key to transcriptional regulation of the Heat
Shock Response is the Heat Shock Transcription Factor (HSF).
– Gcn5-related N-acetyltransferase (GNAT)
• In future:
– Use redundancy across abstracts to figure out abbreviation
meaning even when definition is not present.
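The "work backwards from the end" idea can be sketched as follows. This is a simplified illustration rather than the published implementation: the function names, the length limits on the short form, and the candidate window of min(|A|+5, 2|A|) words are assumptions made for this sketch.

```python
import re

PAREN = re.compile(r"\(([^()]+)\)")

def best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text."""
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must also sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Extend the match back to the beginning of its word.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]

def find_definitions(sentence):
    """Return (short form, long form) pairs for 'long form (SHORT)' patterns."""
    pairs = []
    for m in PAREN.finditer(sentence):
        short = m.group(1).strip()
        if not (2 <= len(short) <= 10) or not short[0].isalnum():
            continue
        words = sentence[:m.start()].split()
        max_words = min(len(short) + 5, len(short) * 2)  # assumed window size
        candidate = " ".join(words[-max_words:])
        long_form = best_long_form(short, candidate)
        if long_form:
            pairs.append((short, long_form))
    return pairs

print(find_definitions(
    "In eukaryotes, the key to transcriptional regulation of the Heat "
    "Shock Response is the Heat Shock Transcription Factor (HSF)."
))
print(find_definitions("Gcn5-related N-acetyltransferase (GNAT)"))
```

Each short-form character is consumed in a single right-to-left pass over the candidate, which is why this style of matching stays fast compared with approaches that search over many alignments.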
Gene name co-occurrence
A literature network of human genes for high-throughput analysis of gene
expression. Jenssen TK, Laegreid A, Komorowski J, Hovig E. Nat Genet.
2001 May;28(1):21-8.
PubGene Assumption:
If two genes are co-mentioned in a MEDLINE record,
there is an underlying biological relationship.
Example: Genes highly
upregulated at time point 6 h (6H)
in the fibroblast serum response.
Green: upregulation
Red: downregulation
Gene name co-occurrence
A literature network of human genes for high-throughput analysis of gene
expression. Jenssen TK, Laegreid A, Komorowski J, Hovig E. Nat Genet.
2001 May;28(1):21-8.
Evaluation:
29-40% of the pairs were incorrect
45% of OMIM pairs found
51% of DIP pairs found (DB of Interacting Proteins)
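The co-mention assumption can be illustrated with a minimal sketch that counts how often two gene names appear in the same record. The records below are invented sentences, not MEDLINE data, and the naive token matching is an assumption of this sketch; it is not how PubGene recognizes gene names.

```python
import re
from collections import Counter
from itertools import combinations

def co_mention_network(records, gene_names):
    """Count, for each gene pair, the number of records mentioning both."""
    edges = Counter()
    for text in records:
        tokens = set(re.findall(r"[A-Za-z0-9-]+", text))
        mentioned = sorted(g for g in gene_names if g in tokens)
        for a, b in combinations(mentioned, 2):
            edges[(a, b)] += 1
    return edges

# Toy records, made up for illustration.
records = [
    "FOS and JUN are induced by serum.",
    "JUN expression follows FOS induction.",
    "MYC is regulated independently.",
]
edges = co_mention_network(records, {"FOS", "JUN", "MYC"})
print(edges)
```

Because any co-mention creates an edge, a network built this way will include pairs with no real biological relationship, which is consistent with the error rate reported in the evaluation.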
How to find functions of
genes?
• Have the genetic sequence
• Don’t know what it does
• But …
– Know which genes it coexpresses with
– Some of these have known function
• So … infer function based on function of
co-expressed genes
– This problem was suggested by Michael Walker
and others at Incyte Pharmaceuticals
Gene Co-expression:
Role in the genetic pathway
[Diagram: pathway sketches with nodes labeled Kall., PSA, PAP, g?, and h?]
Other possibilities as well
Make use of the literature
• Look up what is known about the other
genes.
• Different articles in different collections
• Look for commonalities
– Similar topics indicated by Subject
Descriptors
– Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic
neoplasms, tumor markers, antibodies ...
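One way to "look for commonalities" is to collect the subject descriptors attached to the articles about each co-expressed gene and keep the terms shared by several genes. The sketch below assumes the descriptors have already been extracted; the assignment of terms to genes is invented for illustration and only reuses words from the slide.

```python
from collections import Counter

def shared_descriptors(articles_by_gene, min_genes=2):
    """Return descriptors that appear in the literature of at least min_genes genes."""
    counts = Counter()
    for gene, articles in articles_by_gene.items():
        # Pool all descriptors seen for this gene, counting each once per gene.
        terms = set().union(*articles)
        counts.update(terms)
    return sorted(t for t, n in counts.items() if n >= min_genes)

# Hypothetical descriptor sets, one set per article.
articles_by_gene = {
    "Kall.": [{"prostate", "tumor markers"}, {"adenocarcinoma"}],
    "PSA": [{"prostate", "prostatic neoplasms", "tumor markers"}],
    "PAP": [{"prostate", "antibodies"}],
}
print(shared_descriptors(articles_by_gene))
```

Raising `min_genes` narrows the output to terms common to more of the co-expressed genes, which gives a rough starting point for forming a hypothesis about the unknown gene.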
Formulate a Hypothesis
• Hypothesis: mystery gene has to do with
regulation of expression of genes leading to
prostate cancer
• New tack: do some lab tests
– See if mystery gene is similar in molecular
structure to the others
– If so, it might do some of the same things
they do
Etiology Example
Complementary structures in disjoint science literatures. Don
R. Swanson. In Proceedings of SIGIR ’91
• Goal: find cause of disease
– Magnesium-migraine connection
• Given
– medical titles and abstracts
– a problem (incurable rare disease)
– some medical expertise
• Find causal links among titles
– symptoms
– drugs
– results
Gathering Evidence
[Diagram labels: stress, CCB, SCD, and PA, each paired with magnesium; migraine]
Gathering Evidence
[Diagram labels: migraine; CCB, PA, SCD, stress; magnesium]
Swanson’s Linking Approach
• Two of his hypotheses have received
some experimental verification.
• His technique
– Only partially automated
– Required medical expertise
• Recently others have made progress
automating it.
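The linking idea can be sketched as a search for intermediate terms that co-occur with each of two terms that never co-occur with each other. This is an illustration only: it assumes each title has already been reduced to a set of terms, it treats the diagram labels (stress, CCB, SCD, PA) as the intermediate terms, and the toy titles are invented. Swanson's actual procedure relied on manual reading and medical expertise.

```python
def cooccurring(docs, term):
    """All terms that share at least one document with `term`."""
    out = set()
    for d in docs:
        if term in d:
            out |= d - {term}
    return out

def linking_terms(docs, a, c):
    """Terms linking a and c, provided a and c are never mentioned together."""
    if any(a in d and c in d for d in docs):
        return set()  # already directly connected; nothing to discover
    return cooccurring(docs, a) & cooccurring(docs, c)

# Toy "titles" as term sets, made up for illustration.
docs = [
    {"migraine", "stress"}, {"stress", "magnesium"},
    {"migraine", "CCB"}, {"CCB", "magnesium"},
    {"migraine", "SCD"}, {"SCD", "magnesium"},
    {"migraine", "PA"}, {"PA", "magnesium"},
    {"migraine", "aura"},
]
print(sorted(linking_terms(docs, "migraine", "magnesium")))
```

The term "aura" is dropped because it appears only on the migraine side; only terms present in both literatures survive as candidate links.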
Automating Swanson-style Discovery
Text Mining: Generating Hypotheses from MEDLINE, Padmini
Srinivasan. To appear in JASIST.
• UMLS defines Semantic Types
• Every MeSH term is assigned one or more Semantic Types
– Interferon type II falls within both:
• Immunologic Factor and
• Pharmacologic Substance
• Each PubMed article is assigned a set of MeSH terms
• The idea is to characterize a set of articles according to which
semantic types their MeSH terms fall into.
Automating Swanson-style Discovery
Text Mining: Generating Hypotheses from MEDLINE, Padmini
Srinivasan. To appear in JASIST.
Approach:
– User inputs topic T of interest
– User selects 2 sets from a small number of sets of UMLS
semantic types
– System
• Searches PubMed for articles about T
• Selects out the important MeSH terms as determined by the
user-chosen semantic type categories
• Searches PubMed for articles that contain these MeSH terms
• Combines the MeSH terms that result from these retrieved
documents
• Calls this result C
• If a PubMed search on words from T and c from C is empty, places c
as a candidate in a final result set R
• Reports those terms in R that fall into the second user-selected
semantic type set.
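The steps above can be sketched against a small in-memory collection standing in for PubMed. Everything here is an assumption made for illustration: the function name, the toy articles, and the semantic type labels (which were not looked up in UMLS). The sketch also skips the weighting that decides which MeSH terms count as "important" and keeps every term of the chosen type.

```python
def open_discovery(articles, semantic_types, topic, b_types, c_types):
    """articles: list of MeSH-term sets; semantic_types: term -> set of types."""
    def search(term):
        return [a for a in articles if term in a]

    def has_type(term, wanted):
        return bool(semantic_types.get(term, set()) & wanted)

    # Articles about topic T.
    topic_articles = search(topic)
    # MeSH terms from those articles that fall in the first semantic type set.
    b_terms = {t for a in topic_articles for t in a
               if t != topic and has_type(t, b_types)}
    # Articles containing these terms; pool their MeSH terms into C.
    c_terms = {t for b in b_terms for a in search(b) for t in a}
    # Keep c only if no article mentions both T and c (empty joint search).
    r = {c for c in c_terms
         if not any(topic in a and c in a for a in articles)}
    # Report the candidates that fall in the second semantic type set.
    return sorted(c for c in r if has_type(c, c_types))

# Toy data, invented for illustration.
articles = [
    {"Raynaud Disease", "Blood Viscosity"},
    {"Raynaud Disease", "Vasoconstriction"},
    {"Blood Viscosity", "Fish Oils"},
    {"Vasoconstriction", "Fish Oils"},
    {"Blood Viscosity", "Aspirin"},
    {"Raynaud Disease", "Aspirin"},
]
semantic_types = {
    "Raynaud Disease": {"Disease"},
    "Blood Viscosity": {"Physiologic Function"},
    "Vasoconstriction": {"Physiologic Function"},
    "Fish Oils": {"Lipid"},
    "Aspirin": {"Pharmacologic Substance"},
}
print(open_discovery(articles, semantic_types, "Raynaud Disease",
                     {"Physiologic Function"},
                     {"Lipid", "Pharmacologic Substance"}))
```

In this toy run "Aspirin" is filtered out because it already co-occurs with the topic, so only terms with no direct link to T remain as hypotheses.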
Automating Swanson-style Discovery
Text Mining: Generating Hypotheses from MEDLINE, Padmini
Srinivasan. To appear in JASIST.
• Results: have successfully reproduced the 7 examples they
tried, with very little manual intervention
• Example: input topic is Raynaud’s disease
Main Ideas for NLP Approach
• Assign Semantics using
– Statistics
– Hierarchical Lexical Ontologies to
generalize
– Redundancy in the data
• Build up Layers of Representation
– Syntactic and Semantic
– Use these in a feedback loop
Automated Relation Assignment
• Recall the problem:
– A two-dose combined hepatitis A and B vaccine
would facilitate immunization programs.
• The vaccine helps prevent hep B.
• Identified 7 relations that can hold between
Treatments and Diseases
• Used Machine Learning to address this
– Graphical models
– Neural nets
• Marked up the text with syntactic and
semantic information
– MeSH labels turn out to be very important
Automated Relation Assignment
• Results
Future Directions
• In text analysis:
– Move away from hand-built rules
– More focus on labeling with semantics
• In problems tackled
– There are so many possibilities!
– Help with automated curation
Thank you!
Visit our site:
biotext.berkeley.edu