Finding biologically relevant information using ADIOS

ThaiBinh’s final project for CBB545:
Finding biologically relevant
information using ADIOS
April 19, 2007
The current state of affairs in
natural language processing
• NLP: Converting human language into
representations that are easier for
computers to understand
• Most natural language processing
requires a tagged training set
• Tagging = time-consuming/costly
http://en.wikipedia.org/wiki/Natural_language_processing
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42
ADIOS
• “Unsupervised learning of natural languages”
• ADIOS: Automatic distillation of structure
• Input: A corpus of characters (most likely,
untagged sentences)
• Output: A grammar
“Unsupervised learning of natural languages”, Solan, et al., PNAS vol. 102, August 2005.
A very quick primer on grammars
• A set of “rules” for making a “sentence”
• Ex.
The grammar:
S → S+S
S → 1
S → a
A possible derivation:
S
S+S
S+S+S
1+S+S
1+1+S
1+1+a
A very quick primer on grammars
• We can visualize the expansion as a tree,
and read the leaves
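The grammar:
S → S+S
S → 1
S → a
A possible derivation:
S
S+S
S+S+S
1+S+S
1+1+S
1+1+a
[Tree diagram: the root S expands to S + S; the left S yields 1, and the right S expands to S + S, whose children yield 1 and a. Leaves read left to right: 1+1+a]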
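The derivation in the primer can be replayed mechanically. The following is a minimal Python sketch, assuming the three rules S → S+S, S → 1 and S → a, and assuming that every step rewrites the leftmost S. The names `RULES`, `rewrite_leftmost` and `derive` are illustrative and do not come from the slides.

```python
# Right-hand sides of the toy grammar: S -> S+S, S -> 1, S -> a.
RULES = ("S+S", "1", "a")


def rewrite_leftmost(form, rhs):
    """Replace the leftmost nonterminal S in `form` with `rhs`."""
    if rhs not in RULES:
        raise ValueError(f"not a rule of the grammar: S -> {rhs}")
    if "S" not in form:
        raise ValueError("no nonterminal left to expand")
    return form.replace("S", rhs, 1)


def derive(choices, start="S"):
    """Apply the chosen right-hand sides in order and keep every step."""
    forms = [start]
    for rhs in choices:
        forms.append(rewrite_leftmost(forms[-1], rhs))
    return forms


steps = derive(["S+S", "S+S", "1", "1", "a"])
print(" => ".join(steps))
```

Choosing a different sequence of rules gives a different sentence of the same grammar, for example `derive(["S+S", "a", "1"])`.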
ADIOS
• The system builds a graph using the first
sentence
• With each successive sentence, it tries to
find overlapping “subpaths” (patterns)
ADIOS
• Also try to generalize the path by looking
for equivalence classes
• Search for patterns and equivalence
classes until no new ones are found
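The two searches described above can be illustrated with a toy Python sketch. It is not the ADIOS algorithm from Solan et al.: it simply counts contiguous word sequences shared by at least two sentences (a stand-in for overlapping subpaths) and groups words that fill the same slot in otherwise identical sentences (a stand-in for equivalence classes). The function names and the three-sentence corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def shared_subpaths(sentences, min_len=2):
    """Count contiguous word sequences that occur in at least two sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        seen = set()
        for i in range(len(words)):
            for j in range(i + min_len, len(words) + 1):
                seen.add(tuple(words[i:j]))
        counts.update(seen)  # each sentence contributes once per subpath
    return {path: n for path, n in counts.items() if n >= 2}


def equivalence_classes(sentences):
    """Group words that fill the same slot in otherwise identical sentences."""
    slots = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            context = (tuple(words[:i]), tuple(words[i + 1:]))
            slots[context].add(word)
    return [members for members in slots.values() if len(members) > 1]


corpus = [
    "Chong had a presentation",
    "ThaiBinh has a presentation today",
    "Hugo has a presentation today",
]
print(shared_subpaths(corpus))
print(equivalence_classes(corpus))
```

In this toy corpus the sequence (a, presentation) is shared by all three sentences, and ThaiBinh and Hugo fall into one class because they are interchangeable in the same context.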
ADIOS: A quick example
• Input a corpus of sentences
* Chong had a presentation in CBB545 #
* on Tuesday Chong had a presentation #
* next Thursday Laura has a presentation #
* ThaiBinh has a presentation in CBB545 #
* ThaiBinh has a presentation today #
* today ThaiBinh has a presentation #
* Chong had a presentation #
* Hugo has a presentation in CBB545 today #
* ThaiBinh has a presentation in CBB545 today #
* Laura has a presentation in CBB545 next Thursday #
* Hugo has a presentation today #
* Chong had a presentation on Tuesday #
* Chong had a presentation in CBB545 on Tuesday #
* Laura has a presentation next Thursday #
* in CBB545 ThaiBinh has a presentation #
* today ThaiBinh has a presentation in CBB545 #
ADIOS: A quick example
• Output is a grammar
P18 → (a,presentation)
P19 → (E20,has,P18)
E20 → {Hugo,Laura,ThaiBinh}
P21 → (Chong,had)
P22 → (in,CBB545)
P23 → (P19,P22)
P24 → (P21,P18)
P23 expanded:
(P19,P22)
= ((E20,has,P18),(in,CBB545))
= (({Hugo,Laura,ThaiBinh},has,(a,presentation)),(in,CBB545))
P24 expanded:
(P21,P18)
= ((Chong,had),(a,presentation))
P18 → (Chong,had,a)
P19 → (has,a)
P20 → (E21,P19,presentation)
E21 → {Hugo,Laura,ThaiBinh}
P22 → (in,CBB545)
P23 → (P20,P22)
P24 → (P18,presentation)
Two different grammars:
Same end result
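The "same end result" claim can be checked by expanding both grammars. The sketch below is a hedged illustration in Python: it assumes that a pattern (P..) is an ordered sequence of symbols and that an equivalence class (E..) is a choice of exactly one member. The dictionary encoding and the `expand` helper are assumptions for illustration, not the ADIOS output format.

```python
from itertools import product

# Patterns are tuples (sequences); equivalence classes are sets (choices).
GRAMMAR_A = {
    "P18": ("a", "presentation"),
    "P19": ("E20", "has", "P18"),
    "E20": {"Hugo", "Laura", "ThaiBinh"},
    "P21": ("Chong", "had"),
    "P22": ("in", "CBB545"),
    "P23": ("P19", "P22"),
    "P24": ("P21", "P18"),
}
GRAMMAR_B = {
    "P18": ("Chong", "had", "a"),
    "P19": ("has", "a"),
    "P20": ("E21", "P19", "presentation"),
    "E21": {"Hugo", "Laura", "ThaiBinh"},
    "P22": ("in", "CBB545"),
    "P23": ("P20", "P22"),
    "P24": ("P18", "presentation"),
}


def expand(symbol, grammar):
    """Return the set of word strings that `symbol` can generate."""
    if symbol not in grammar:
        return {symbol}  # a plain word
    rule = grammar[symbol]
    if isinstance(rule, (set, frozenset)):
        # Equivalence class: union over the alternatives.
        return set().union(*(expand(member, grammar) for member in rule))
    # Pattern: every combination of the parts, in order.
    parts = [expand(member, grammar) for member in rule]
    return {" ".join(words) for words in product(*parts)}


for name in ("P23", "P24"):
    print(name, expand(name, GRAMMAR_A) == expand(name, GRAMMAR_B))
```

The same `expand` helper gives a crude membership test for a new sentence: it fits a pattern if it appears in that pattern's expansion.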
ADIOS
• Able to generate sentences using the
grammar it created
• Can test whether a new sentence fits one of
the grammar rules
• Can be applied to a wide variety of domains
– Bible in various languages
– Classify protein function based on amino acid
sequence
The Project
• Use ADIOS to create grammar rules from
biomedical sentences
• Look for gene-gene associations
• Look for gene-disease associations
• Infer information about a pair of genes in
an unseen sentence based on its
sentence structure (pattern)
ABNER: find mentions of genes
MetaMap: find mentions of diseases
“The clinical effects of cortisone and ACTH (adrenocorticotropic
hormone) in the collagen diseases: acute disseminated lupus
erythematosus, periarteritis nodosa, dermatomyositis and
scleroderma; interim report.”
Phrase: "in the collagen diseases"
Meta Candidates (6)
1000 C0009326:Collagen Diseases [Disease or Syndrome]
Phrase: "periarteritis nodosa,"
Meta Candidates (4)
1000 C0031036:Periarteritis Nodosa (Polyarteritis Nodosa) [Disease or Syndrome]
Phrase: "dermatomyositis"
Meta Candidates (2)
1000 C0011633:Dermatomyositis [Disease or Syndrome]
1000 C0221056:Dermatomyositis (Dermatomyositis, Adult Type) [Disease or Syndrome]
Phrase: "scleroderma"
Meta Candidates (4)
1000 C0011644:Scleroderma [Disease or Syndrome]
1000 C0036421:Scleroderma (Systemic Scleroderma) [Disease or Syndrome]
The Project: Input
• Replace any mention of a gene with a
generic term
• Ex.
Smad7 antagonizes TGF-{beta} signaling in the nucleus
GeneOne antagonizes GeneTwo signaling in the nucleus
PTEN negatively regulates expression of cyclin D1
GeneOne negatively regulates expression of GeneTwo
The Project: Input
• Replace any mention of a gene/disease
with a generic term
• Ex.
p16 is consistently expressed in endometrial tubal metaplasia
GeneOne is consistently expressed in DiseaseOne
The expression of cyclin D1 is more often correlated with
prognosis in cancers of ampulla of vater
The expression of GeneOne is more often correlated with
prognosis in DiseaseOne
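The replacement step can be sketched in Python as follows. This is a minimal illustration that assumes the gene and disease mentions have already been found (for example by ABNER and MetaMap) and are passed in as plain strings. The `mask_mentions` helper and the `ORDINALS` table are hypothetical names, not part of either tool.

```python
# Suffixes for the generic terms; extend if a sentence has more mentions.
ORDINALS = ("One", "Two", "Three", "Four")


def mask_mentions(sentence, mentions, prefix):
    """Replace each mention with prefix+ordinal, numbered left to right."""
    # str.index raises ValueError if a mention is not in the sentence.
    spans = sorted((sentence.index(m), m) for m in mentions)
    pieces, cursor = [], 0
    for k, (start, mention) in enumerate(spans):
        pieces.append(sentence[cursor:start])
        pieces.append(prefix + ORDINALS[k])
        cursor = start + len(mention)
    pieces.append(sentence[cursor:])
    return "".join(pieces)


sentence = "p16 is consistently expressed in endometrial tubal metaplasia"
masked = mask_mentions(sentence, ["p16"], "Gene")
masked = mask_mentions(masked, ["endometrial tubal metaplasia"], "Disease")
print(masked)
```

Numbering by position in the sentence keeps GeneOne to the left of GeneTwo, which matches the examples on the slides.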
Let ADIOS work its “magic”…
Out pop patterns that describe
the sentences (the grammar)
“Tagging” the patterns
GeneOne antagonizes GeneTwo
GeneOne negatively regulates GeneTwo
GeneOne increases transcription GeneTwo
GeneOne positively regulates GeneTwo
“Tagging” the patterns
GeneOne {inhibits, antagonizes, negatively regulates} GeneTwo
GeneOne {increases transcription, positively regulates, activates} GeneTwo
Seeing a new sentence
Ras/Erk pathway positively regulates Jak1/STAT6 activity
Pattern: GeneOne {increases transcription, positively regulates, activates} GeneTwo
Match: [Ras/Erk pathway] [positively regulates] [Jak1/STAT6 activity]
Result: Ras/Erk {increases transcription, positively regulates, activates} Jak1/STAT6
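Once the patterns carry a tag, labeling an unseen sentence becomes a lookup. The Python sketch below is a simplification: it matches the relation phrases from the slides directly instead of matching ADIOS patterns, and the labels `inhibit` and `activate` as well as the `tag_relation` helper are assumptions chosen for illustration.

```python
import re

# Relation phrases from the slides, grouped under a hand-assigned tag.
TAGGED_PATTERNS = {
    "inhibit": ("inhibits", "antagonizes", "negatively regulates"),
    "activate": ("increases transcription", "positively regulates", "activates"),
}


def tag_relation(sentence):
    """Return the tag of the first relation phrase found, or None."""
    lowered = sentence.lower()
    for label, phrases in TAGGED_PATTERNS.items():
        for phrase in phrases:
            # Word boundaries keep "activates" from matching "inactivates".
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                return label
    return None


new_sentence = "Ras/Erk pathway positively regulates Jak1/STAT6 activity"
print(tag_relation(new_sentence))
```

A sentence with no known relation phrase returns `None`, which is the case where a new pattern would have to be learned or tagged by hand.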
The big picture…
Automatic extraction of regulation
Pattern: GeneOne / action / GeneTwo
Smad7 antagonizes TGF-{beta} signaling in the nucleus → Smad7 / inhibit / TGF-Beta
PTEN negatively regulates expression of cyclin D1 → PTEN / inhibit / Cyclin D1
Ras/Erk pathway positively regulates Jak1/STAT6 activity → Ras/ERK / activate / Jak1/Stat6
Pattern: GeneOne / expression / DiseaseOne
Loss of p53 Expression Correlates with… Neck Cancer → p53 / downregulated / Neck cancer
p16 is consistently expressed in endometrial tubal metaplasia → p16 / upregulated / Endo. tubal metaplasia
Potential (inevitable) problems
• The data/sentences
– Amount
• ADIOS’s data usually had thousands of sentences
– Quality
• ABNER/MetaMap (used for finding gene/disease mentions) are not always accurate
• Is it even feasible?
– Biologists/Scientists are very creative in
coming up with various ways of saying the
same thing
– Stay tuned…