agyinc - UC Berkeley School of Information
Automating Discovery from
Biomedical Texts
Marti Hearst & Barbara Rosario
UC Berkeley
Agyinc Visit
August 16, 2000
The LINDI Project
Linking Information for New Discoveries
Two Main Thrusts:
UIs for building and reusing hypothesis-seeking strategies.
Statistical language analysis techniques for extracting propositions.
Scenario:
Explore Functions of a Gene
Objective
– Determine the functions of a newly sequenced
Gene X.
Known facts
– Gene X is co-expressed (activated in the same
cell) with Genes A, B, and C
– The relationship of Gene A, B, C with certain
types of diseases (from medical literature)
Question
– What types of diseases are Gene X related to?
Gene Co-expression:
Role in the genetic pathway
[Diagram; labels: Kall., PSA, PAP, with unknown links marked g? and h?]
Other possibilities as well
Make use of the literature
Look up what is known about the other
genes.
Different articles in different collections
Look for commonalities
– Similar topics indicated by Subject Descriptors
– Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic
neoplasms, tumor markers, antibodies ...
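One way to look for commonalities is to count how often each subject descriptor appears across the retrieved articles. The sketch below is illustrative only: the article IDs and descriptor sets are made-up toy data, and `common_descriptors` is a hypothetical helper, not part of LINDI.

```python
from collections import Counter

# Hypothetical records: one set of subject descriptors per article.
articles = {
    "doc1": {"prostatic neoplasms", "tumor markers", "antibodies"},
    "doc2": {"prostatic neoplasms", "adenocarcinoma"},
    "doc3": {"tumor markers", "prostatic neoplasms"},
}

def common_descriptors(records, min_docs=2):
    """Return (descriptor, count) pairs seen in at least min_docs records,
    most frequent first."""
    counts = Counter(d for descriptors in records.values() for d in descriptors)
    return [(d, n) for d, n in counts.most_common() if n >= min_docs]

shared = common_descriptors(articles)
```

Descriptors shared by many articles float to the top; the same counting could be applied to words in titles and abstracts.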
Developing Strategies
Different strategies seem needed for
different situations
– First: see what is known about Kallikrein.
– 7341 documents. Too many
– AND the result with “disease” category
» If result is non-empty, this might be an
interesting gene
– Now get 803 documents
Explore Functions of New Gene X
[Diagram; labels: Medical Literature, Query, Gene-A, Keywords, Projection, Mapping]
Slide adapted from K. Patel
Developing Strategies (continued)
– AND the result with PSA
» Get 11 documents. Better!
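The narrowing strategy (AND each new constraint onto the running result) can be sketched as successive set intersections. The document IDs below are toy values chosen for illustration; they do not reproduce the 7341 / 803 / 11 counts reported on the slide, and `narrow` is a hypothetical helper.

```python
def narrow(result_sets):
    """Intersect result sets one at a time, recording the size after each step."""
    current = None
    sizes = []
    for name, docs in result_sets:
        current = set(docs) if current is None else current & set(docs)
        sizes.append((name, len(current)))
    return current, sizes

# Toy result sets (hypothetical document IDs).
steps = [
    ("kallikrein", {1, 2, 3, 4, 5, 6}),
    ("disease category", {2, 3, 4, 5, 9}),
    ("PSA", {3, 4, 10}),
]
final, sizes = narrow(steps)
```

Keeping the size after each step makes it easy to see which constraint did the most narrowing, and whether any step emptied the result.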
Explore Functions of New Gene X
[Diagram; labels: Medical Literature, Query, Gene-A, Gene-B, Gene-C, Keywords, Projection, Intersection]
Developing Strategies
Look for commonalities among these
documents
– Manual scan through ~100 category
labels
– Would have been better if
» Automatically organized
» Intersections of “important” categories
scanned for first
Explore Functions of New Gene X
[Diagram; labels: Medical Literature, Query, Gene-A, Gene-B, Gene-C, Keywords, Projection, Intersection, Slicing, Mapping]
Slide adapted from K. Patel
Try a new tack
Researcher uses knowledge of field to
realize these are related to prostate
cancer and diagnostic tests
New tack: intersect search on all three
known genes
– Hope they all talk about diagnostics and
prostate cancer
– Fortunately, 7 documents returned
– Bingo! A relation to regulation of this
cancer
Explore Functions of New Gene X
[Diagram; labels: Medical Literature, Possible Function For Gene-X, Query, Gene-A, Gene-B, Gene-C, Keywords, Projection, Intersection, Slicing, Mapping]
Slide adapted from K. Patel
Formulate a Hypothesis
Hypothesis: mystery gene has to do with
regulation of expression of genes leading
to prostate cancer
New tack: do some lab tests
– See if mystery gene is similar in
molecular structure to the others
– If so, it might do some of the same
things they do
Strategies again
In hindsight, combining all three
genes was a good strategy.
– Store this for later
Might not have worked
– Need a suite of strategies
– Build them up via experience and a good
UI
The System
Doing the same query with slightly
different values each time is time-consuming and tedious
Same goes for cutting and pasting results
– IR systems don’t support varying queries
like this very well.
– Each situation is a bit different
Some automatic processing is needed in the
background to eliminate/suggest
hypotheses
The User Interface
A general search interface should support
– History
– Context
– Comparison
– Operators: Intersection, Union, Slicing
– Operator Reuse
– Visualization (where appropriate)
We have an initial implementation
It needs lots of work
Architecture of LINDI UI
Data Layer
Annotation Layer
User Interface Layer
Data Layer
Purpose
– Hide different formats of text collections
Components
– Data: Abstractions representing records of a
text collection
– Operations: performed on the data
Data
– A set of records
– Each record is a set of tuples with types
Operations
– union, intersection, projection, mapping
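A minimal sketch of this data model, assuming each tuple is a (type, value) pair and each record is an immutable set of such pairs. The type names ("pmid", "mesh") and all function names are assumptions made for illustration; the slides do not specify the actual representation.

```python
from typing import Callable, FrozenSet, Set, Tuple

Field = Tuple[str, str]        # (type, value), e.g. ("mesh", "Prostatic Neoplasms")
Record = FrozenSet[Field]      # a record is a set of typed tuples
DataSet = Set[Record]          # a data set is a set of records

def record(*fields: Field) -> Record:
    return frozenset(fields)

def union(a: DataSet, b: DataSet) -> DataSet:
    return a | b

def intersection(a: DataSet, b: DataSet) -> DataSet:
    return a & b

def projection(data: DataSet, field_type: str) -> DataSet:
    """Keep only the tuples of one type in each record; drop records left empty."""
    projected = {frozenset(f for f in rec if f[0] == field_type) for rec in data}
    projected.discard(frozenset())
    return projected

def mapping(data: DataSet, fn: Callable[[Record], object]) -> set:
    """Apply fn to every record; fn must return a hashable value."""
    return {fn(rec) for rec in data}

# Toy records (hypothetical values).
r1 = record(("pmid", "1"), ("mesh", "Prostatic Neoplasms"), ("mesh", "Tumor Markers"))
r2 = record(("pmid", "2"), ("mesh", "Prostatic Neoplasms"))
gene_a, gene_b = {r1, r2}, {r2}
```

Because every operation takes and returns plain sets, the operations compose freely regardless of which text collection the records came from.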
Annotation Layer
Purpose
– Associate data set with operations that
produced them (history)
– History is a first class object
Advantage
– Streamline a sequence of operations
– Reuse operations
– Parameterize operations
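Treating history as a first-class object could look roughly like the following: each data set carries the list of operations that produced it, and that list can be replayed on a different input. The class `Annotated` and its methods are hypothetical names for this sketch, not the LINDI implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class Annotated:
    """A data set together with the operations that produced it."""
    data: Any
    history: List[Tuple[str, Callable]] = field(default_factory=list)

    def apply(self, name: str, op: Callable) -> "Annotated":
        # Return a new object; the original and its history are left unchanged.
        return Annotated(op(self.data), self.history + [(name, op)])

    def replay(self, new_input: Any) -> "Annotated":
        """Re-run the recorded operations on a different starting value."""
        result = Annotated(new_input)
        for name, op in self.history:
            result = result.apply(name, op)
        return result

# Toy usage: record a two-step sequence, then reuse it with a new parameter.
start = Annotated({1, 2, 3, 4, 5})
out = (start
       .apply("keep even", lambda s: {x for x in s if x % 2 == 0})
       .apply("smallest", lambda s: set(sorted(s)[:1])))
again = out.replay({6, 7, 8})
```

Replaying a stored history with a new starting value is one simple reading of "parameterize operations"; a fuller design would also let individual steps take parameters.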
User Interface
Direct manipulation of information
objects and access operations
– Query
– Intersection
– Union
– Mapping
– Slicing
Record and reuse of past operations
Parameterization of operations
Streamlining of operations
Initial Palette
Query Structure Determined
by Collection Type
Query Operation Results
Projection Operation and
Subsequent Results
[Screenshot; panels labeled GA, GB, GC]
Parameterized Query:
Repeat operations with different values
Intersection over Projected Attribute
Example Interaction with UI Prototype
1. Query on gene names
2. Project out only MeSH headings
3. Intersect the results
4. Map to create a ranking
5. Slice out the top-ranked
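The five steps above can be sketched end to end on a toy collection. Everything here is an assumption made for illustration: the in-memory `collection`, the function names, and in particular the ranking rule (total heading frequency across genes), which the slides do not specify.

```python
from collections import Counter

# Hypothetical mini-collection: gene name -> article records with MeSH headings.
collection = {
    "Gene-A": [{"mesh": ["Prostatic Neoplasms", "Tumor Markers"]},
               {"mesh": ["Prostatic Neoplasms", "Antibodies"]}],
    "Gene-B": [{"mesh": ["Prostatic Neoplasms", "Tumor Markers"]}],
    "Gene-C": [{"mesh": ["Tumor Markers", "Prostatic Neoplasms", "Adenocarcinoma"]}],
}

def query(gene):                      # 1. query on a gene name
    return collection.get(gene, [])

def project_mesh(records):            # 2. keep only the MeSH headings
    return Counter(h for rec in records for h in rec["mesh"])

def intersect(counters):              # 3. headings present for every gene
    shared = set(counters[0])
    for c in counters[1:]:
        shared &= set(c)
    return {h: sum(c[h] for c in counters) for h in shared}

def rank(counts):                     # 4. map to a ranking (assumed: by total frequency)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def slice_top(ranked, k):             # 5. slice out the top-ranked
    return ranked[:k]

projected = [project_mesh(query(g)) for g in ("Gene-A", "Gene-B", "Gene-C")]
ranked = rank(intersect(projected))
top = slice_top(ranked, 1)
```

Swapping the gene names in the last three lines repeats the whole sequence with different values, which is the kind of reuse the parameterized query is meant to support.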
Future Work on UI
As currently designed
– Better labeling
– Better layout
» Intuitive
» Scalable
– Connection to real backend
– User Testing
» Does direct manipulation work?
» What operator sequences help?
» How to improve parameterization?
More advanced
– Support for strategies
– Incorporation of NLP
Language Analysis
Component
Goals:
– Extract Propositions from Text
– Make Inferences
Language Analysis
Component
Why Extract Propositions from Text?
– Text is how knowledge at the
propositional level is communicated
– Text is continually being created and
updated by the outside world
Example:
Statistical Semantic Grammar
To detect causal relationships
between medical concepts
– Title:
Magnesium deficiency implicated in increased stress
levels.
– Interpretation:
<nutrient><reduction> related-to
<increase><symptom>
– Inference:
» Increase(stress, decrease(mg))
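To make the title-to-inference step concrete, here is a deliberately simple, non-statistical sketch: a hand-written lexicon assigns semantic categories to words, and a single template turns the category sequence into the inference shown above. The lexicon, the link-word list, and the function `interpret` are all assumptions for illustration; the approach described in this talk is meant to learn such mappings probabilistically rather than hard-code them.

```python
# Hand-written toy lexicon: word -> (semantic category, normalized value).
LEXICON = {
    "magnesium": ("nutrient", "mg"),
    "deficiency": ("reduction", "decrease"),
    "increased": ("increase", "increase"),
    "stress": ("symptom", "stress"),
}
LINK_WORDS = {"implicated"}  # words taken to signal a "related-to" link

def interpret(title):
    """Return an inference string if the title matches
    <nutrient><reduction> related-to <increase><symptom>, else None."""
    tokens = [t.strip(".,").lower() for t in title.split()]
    tagged = [LEXICON[t] for t in tokens if t in LEXICON]
    categories = [cat for cat, _ in tagged]
    linked = any(t in LINK_WORDS for t in tokens)
    if linked and categories == ["nutrient", "reduction", "increase", "symptom"]:
        nutrient, reduction, increase, symptom = (value for _, value in tagged)
        return f"{increase.capitalize()}({symptom}, {reduction}({nutrient}))"
    return None

inference = interpret("Magnesium deficiency implicated in increased stress levels.")
```

A hard-coded template like this is exactly the brittle kind of semantic grammar that motivates a statistical alternative.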
Statistical Semantic Grammars
Empirical NLP has made great strides
– But mainly applied to syntactic structure
Semantic grammars are powerful, but
– Brittle
– Time-consuming to construct
Idea:
– Use what we now know about statistical
NLP to build up a probabilistic grammar
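One standard way to attach probabilities to grammar rules is relative-frequency estimation: the probability of a rule is its count divided by the count of all rules with the same left-hand side. The sketch below shows that calculation on made-up observations; the rule names and counts are hypothetical, and the slides do not say which estimation method LINDI would use.

```python
from collections import Counter

def estimate_rule_probs(observed_rules):
    """Relative-frequency estimate:
    P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(observed_rules)
    lhs_counts = Counter(lhs for lhs, _ in observed_rules)
    return {(lhs, rhs): n / lhs_counts[lhs]
            for (lhs, rhs), n in rule_counts.items()}

# Hypothetical rule observations, e.g. read off hand-annotated titles.
observed = [
    ("CAUSE", ("nutrient", "reduction")),
    ("CAUSE", ("nutrient", "reduction")),
    ("CAUSE", ("drug", "increase")),
    ("EFFECT", ("increase", "symptom")),
]
probs = estimate_rule_probs(observed)
```

For each left-hand side the estimated probabilities sum to 1, so competing expansions of the same category can be compared directly.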
LINDI: Target Components
1. Special UI for retrieving appropriate docs
2. Language analysis on docs to detect causal relationships between concepts
3. Probabilistic representation of concepts and relationships
4. UI + User: Hypothesis creation