agyinc - UC Berkeley School of Information

Automating Discovery from
Biomedical Texts
Marti Hearst & Barbara Rosario
UC Berkeley
Agyinc Visit
August 16, 2000
The LINDI Project
Linking Information for New Discoveries
Two Main Thrusts:
– UIs for building and reusing hypothesis-seeking strategies
– Statistical language analysis techniques for extracting propositions
Scenario:
Explore Functions of a Gene

Objective
– Determine the functions of a newly sequenced
Gene X.

Known facts
– Gene X is co-expressed (activated in the same
cell) with Genes A, B, and C
– The relationships of Genes A, B, and C to certain
types of diseases (from the medical literature)

Question
– What types of diseases are Gene X related to?
Gene Co-expression:
Role in the genetic pathway
[Figure: genetic pathway diagram. Labels: Kall., PSA, PAP, with unknown positions g? and h?]
Other possibilities as well
Make use of the literature



Look up what is known about the other genes.
Different articles are in different collections.
Look for commonalities
– Similar topics indicated by Subject Descriptors
– Similar words in titles and abstracts:
adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
Developing Strategies

Different strategies seem needed for
different situations
– First: see what is known about Kallikrein.
– 7341 documents. Too many.
– AND the result with the “disease” category
» If the result is non-empty, this might be an
interesting gene
– Now get 803 documents
Explore Functions of New Gene X
[Figure: operator diagram. Labels: Medical Literature, Query, Gene-A, Keywords, Projection, Mapping]
Slide adapted from K. Patel
Developing Strategies (continued)
– AND the result with PSA
» Get 11 documents. Better!
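
The narrowing steps above can be sketched as successive ANDs over an inverted index. This is only an illustration: the index contents and document IDs below are invented, the function name narrow is hypothetical, and the real counts (7341, 803, 11) come from the actual collection, not from this toy data.

```python
# Toy inverted index: term -> set of document IDs.
# All IDs are made up for illustration; "disease" stands in for the
# "disease" category used on the slide.
INDEX = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 7, 8},
    "psa": {3, 4, 9},
}


def narrow(index, terms):
    """AND the posting sets for `terms` in order.

    Returns the final set of document IDs and a trace of
    (term, number of documents remaining) after each step.
    """
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return (result if result is not None else set()), trace


docs, trace = narrow(INDEX, ["kallikrein", "disease", "psa"])
for term, remaining in trace:
    print(f"after AND with {term!r}: {remaining} documents")
```

Keeping the trace makes it easy to see at which step a result set becomes small enough to read, or becomes empty.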
Explore Functions of New Gene X
[Figure: operator diagram. Labels: Medical Literature, Query, Gene-A, Gene-B, Gene-C, Keywords, Projection, Intersection]
Developing Strategies

Look for commonalities among these
documents
– Manual scan through ~100 category
labels
– Would have been better if the labels were
» Automatically organized
» Scanned first for intersections of
“important” categories
Explore Functions of New Gene X
[Figure: operator diagram. Labels: Medical Literature, Query, Gene-A, Gene-B, Gene-C, Keywords, Projection, Intersection, Slicing, Mapping]
Slide adapted from K. Patel
Try a new tack


Researcher uses knowledge of field to
realize these are related to prostate
cancer and diagnostic tests
New tack: intersect search on all three
known genes
– Hope they all talk about diagnostics and
prostate cancer
– Fortunately, 7 documents returned
– Bingo! A relation to regulation of this
cancer
Explore Functions of New Gene X
[Figure: operator diagram. Labels: Medical Literature, Query, Gene-A, Gene-B, Gene-C, Keywords, Projection, Intersection, Slicing, Mapping, Possible Function For Gene-X]
Slide adapted from K. Patel
Formulate a Hypothesis


Hypothesis: mystery gene has to do with
regulation of expression of genes leading
to prostate cancer
New tack: do some lab tests
– See if mystery gene is similar in
molecular structure to the others
– If so, it might do some of the same
things they do
Strategies again

In hindsight, combining all three
genes was a good strategy.
– Store this for later

Might not have worked
– Need a suite of strategies
– Build them up via experience and a good
UI
The System



Doing the same query with slightly
different values each time is time-consuming and tedious
Same goes for cutting and pasting results
– IR systems don’t support varying queries
like this very well.
– Each situation is a bit different
Some automatic processing is needed in the
background to eliminate/suggest
hypotheses
The User Interface

A general search interface should support
– History
– Context
– Comparison
– Operators: Intersection, Union, Slicing
– Operator Reuse
– Visualization (where appropriate)
We have an initial implementation
It needs lots of work
Architecture of LINDI UI
– Data Layer
– Annotation Layer
– User Interface Layer

Data Layer

Purpose
– Hide different formats of text collections

Components
– Data: Abstractions representing records of a
text collection
– Operations: performed on the data

Data
– A set of records
– Each record is a set of tuples with types

Operations
– union, intersection, projection, mapping
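
A minimal sketch of this data model, assuming a record can be represented as a frozenset of (type, value) tuples and a data set as a Python set of such records. The function names and the sample fields are illustrative assumptions, not the LINDI implementation.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Field = Tuple[str, str]      # (type, value), e.g. ("mesh", "Antibodies")
Record = FrozenSet[Field]    # a record is a set of typed tuples
DataSet = Set[Record]        # a data set is a set of records


def record(*fields: Field) -> Record:
    """Build a record from (type, value) tuples."""
    return frozenset(fields)


def union(a: DataSet, b: DataSet) -> DataSet:
    return a | b


def intersection(a: DataSet, b: DataSet) -> DataSet:
    return a & b


def projection(data: DataSet, types: Iterable[str]) -> DataSet:
    """Keep only the tuples whose type is in `types`; drop records left empty."""
    keep = set(types)
    projected = set()
    for rec in data:
        reduced = frozenset(f for f in rec if f[0] in keep)
        if reduced:
            projected.add(reduced)
    return projected


def mapping(data: DataSet, fn: Callable[[Record], Record]) -> DataSet:
    """Apply `fn` to every record."""
    return {fn(rec) for rec in data}


# Hypothetical sample records (identifiers and titles are invented).
r1 = record(("id", "1"), ("mesh", "Prostatic Neoplasms"), ("title", "PSA as a tumor marker"))
r2 = record(("id", "2"), ("mesh", "Antibodies"))
mesh_only = projection({r1, r2}, ["mesh"])
```

Because records are hashable sets, union and intersection reduce to the built-in set operators, and every operation returns a new data set rather than changing its input.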
Annotation Layer

Purpose
– Associate data set with operations that
produced them (history)
– History is a first class object

Advantage
– Streamline a sequence of operations
– Reuse operations
– Parameterize operations
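
One way to picture history as a first-class object: store each operation together with its parameters, then replay the stored sequence with different values. The class names Step and History, the helper functions, and the sample documents are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    name: str
    fn: Callable[..., Any]
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class History:
    """A recorded sequence of operations that can be replayed."""
    steps: List[Step] = field(default_factory=list)

    def then(self, name: str, fn: Callable[..., Any], **params: Any) -> "History":
        # Return a new History so earlier histories stay reusable.
        return History(self.steps + [Step(name, fn, params)])

    def run(self, data: Any, **overrides: Any) -> Any:
        # Replay every step; a keyword in `overrides` replaces the
        # recorded value of any parameter with the same name.
        for step in self.steps:
            params = {k: overrides.get(k, v) for k, v in step.params.items()}
            data = step.fn(data, **params)
        return data


def query(docs, gene):
    return [d for d in docs if gene in d["genes"]]


def project(docs, attribute):
    return [d[attribute] for d in docs]


# Invented sample documents.
DOCS = [
    {"genes": {"GA"}, "mesh": "Prostatic Neoplasms"},
    {"genes": {"GB"}, "mesh": "Antibodies"},
    {"genes": {"GA", "GB"}, "mesh": "Tumor Markers"},
]

strategy = History().then("query", query, gene="GA").then("project", project, attribute="mesh")
for_ga = strategy.run(DOCS)              # replay as recorded
for_gb = strategy.run(DOCS, gene="GB")   # same steps, different gene
```

The same stored strategy covers all three advantages listed above: the steps are streamlined into one call, reused across data sets, and parameterized through the overrides.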
User Interface

Direct manipulation of information
objects and access operations
– Query
– Intersection
– Union
– Mapping
– Slicing
Record and reuse of past operations
Parameterization of operations
Streamlining of operations

Initial Palette
Query Structure Determined
by Collection Type
Query Operation Results
Projection Operation and
Subsequent Results
[Figure labels: GA, GB, GC]
Parameterized Query:
Repeat operations with different values
Intersection over Projected Attribute
Example Interaction with UI Prototype
1. Query on gene names
2. Project out only MeSH headings
3. Intersect the results
4. Map to create a ranking
5. Slice out the top-ranked
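
The five steps can be sketched end to end as below. The documents and the function names are invented, and ranking a heading by how many retrieved documents carry it is an assumption: the slides do not say how the prototype computes its ranking.

```python
from collections import Counter

# Invented sample documents: each lists the genes it mentions and its MeSH headings.
DOCS = [
    {"genes": {"GA"}, "mesh": {"Prostatic Neoplasms", "Tumor Markers"}},
    {"genes": {"GA"}, "mesh": {"Prostatic Neoplasms", "Antibodies"}},
    {"genes": {"GB"}, "mesh": {"Prostatic Neoplasms", "Tumor Markers"}},
    {"genes": {"GB"}, "mesh": {"Adenocarcinoma"}},
    {"genes": {"GC"}, "mesh": {"Prostatic Neoplasms", "Tumor Markers", "Antibodies"}},
]


def query(docs, gene):
    """Step 1: documents that mention the gene."""
    return [d for d in docs if gene in d["genes"]]


def project_mesh(docs):
    """Step 2: keep only MeSH headings, counting documents per heading."""
    counts = Counter()
    for d in docs:
        counts.update(d["mesh"])
    return counts


def explore(docs, genes, top_k):
    per_gene = [project_mesh(query(docs, g)) for g in genes]
    if not per_gene:
        return []
    # Step 3: headings shared by every gene's result.
    shared = set.intersection(*(set(c) for c in per_gene))
    # Step 4: rank shared headings by total document count (ties broken by name).
    ranking = sorted(shared, key=lambda h: (-sum(c[h] for c in per_gene), h))
    # Step 5: slice out the top-ranked headings.
    return ranking[:top_k]


top = explore(DOCS, ["GA", "GB", "GC"], top_k=1)
print(top)
```

Swapping the sort key in step 4 is enough to try a different ranking without touching the other steps.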
Future Work on UI

As currently designed
– Better labeling
– Better layout
» Intuitive
» Scalable
– Connection to real backend
– User Testing
» Does direct manipulation work?
» What operator sequences help?
» How to improve parameterization?

More advanced
– Support for strategies
– Incorporation of NLP
Language Analysis
Component
Goals:
– Extract Propositions from Text
– Make Inferences
Language Analysis
Component
Why Extract Propositions from Text?
– Text is how knowledge at the
propositional level is communicated
– Text is continually being created and
updated by the outside world
Example:
Statistical Semantic Grammar
To detect causal relationships
between medical concepts
– Title:
Magnesium deficiency implicated in increased stress
levels.
– Interpretation:
<nutrient><reduction> related-to
<increase><symptom>
– Inference:
» Increase(stress, decrease(mg))
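
To make the title-to-inference step concrete, here is a hand-written pattern match over semantic types. It is deliberately not a statistical grammar: the lexicon, the single rule, and the function names are assumptions chosen only to reproduce the example above.

```python
# Hypothetical lexicon: word -> (semantic type, normalized symbol).
LEXICON = {
    "magnesium": ("nutrient", "mg"),
    "deficiency": ("reduction", "decrease"),
    "increased": ("increase", "increase"),
    "stress": ("symptom", "stress"),
}

# One hand-written rule: <nutrient><reduction> related-to <increase><symptom>
PATTERN = ["nutrient", "reduction", "increase", "symptom"]


def tag(title):
    """Map each known word to its (type, symbol); unknown words are skipped."""
    tokens = [w.strip(".,").lower() for w in title.split()]
    return [LEXICON[t] for t in tokens if t in LEXICON]


def infer(title):
    """Return an inference string if the title matches the rule, else None."""
    tagged = tag(title)
    if [sem_type for sem_type, _ in tagged] != PATTERN:
        return None
    nutrient, reduction, increase, symptom = (symbol for _, symbol in tagged)
    return f"{increase.capitalize()}({symptom}, {reduction}({nutrient}))"


print(infer("Magnesium deficiency implicated in increased stress levels."))
```

A rule like this is exactly the brittle, hand-built kind of semantic grammar that a probabilistic approach is meant to replace; it fails on any title whose wording falls outside the lexicon.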
Statistical Semantic Grammars

Empirical NLP has made great strides
– But mainly applied to syntactic structure

Semantic grammars are powerful, but
– Brittle
– Time-consuming to construct

Idea:
– Use what we now know about statistical
NLP to build up a probabilistic grammar
LINDI: Target Components
1. Special UI for retrieving appropriate docs
2. Language analysis on docs to detect causal relationships between concepts
3. Probabilistic representation of concepts and relationships
4. UI + User: Hypothesis creation