Towards an Automated Analysis of Biomedical Abstracts

Barbara Gawronska, Björn Erlendsson, Björn Olsson
School of Humanities and Informatics, University of Skövde, Sweden
The goal of the project: text analysis for candidate path extraction
[Diagram labels: organism annotation database; GO graph; GO term probability calculation; model pathway database; path extraction from model database; path extraction from text (PubMed; lexical databases, grammar); parameter settings; path alignment; scored & ranked path alignments]
The characteristics of the language of biomedical texts
A typical PubMed abstract (PMID: 16301995):
The tumor suppressor gene hypermethylated in cancer 1 (HIC1), located on
human chromosome 17p13.3, is frequently silenced in cancer by epigenetic
mechanisms. Hypermethylated in cancer 1 belongs to the bric a brac/poxviruses
and zinc-finger family of transcription factors and acts by repressing target gene
expression. It has been shown that enforced p53 expression leads to increased
HIC1 mRNA, and recent data suggest that p53 and Hic1 cooperate in
tumorigenesis. In order to elucidate the regulation of HIC1 expression, we have
analysed the HIC1 promoter region for p53-dependent induction of gene
expression. (…)
Other members of the p53 family, notably TAp73beta and DeltaNp63alpha, can
also act through this HIC1.PRE to induce transcription of HIC1, and finally,
hypermethylation of the HIC1 promoter attenuates inducibility by p53.
Results of POS-tagging of two large corpora (30 million words each):
1) texts on stem cell research, and 2) general English prose
[Bar chart. Y axis: share of words, 0% to 40%. Light bars: stem cell texts; dark bars: general prose. Categories: adjective, noun, preposition, verb, determiner, conjunction, pronoun, auxiliary, modal, adverb.]
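Results of POS-tagging of a smaller sample corpus of biomedical abstracts
[Pie chart. Categories: proper nouns, nouns, verbs, closed class words, adjectives + adverbs. Slice labels as extracted: 3%, 12%, 30%, 19%, 36%; the assignment of labels to categories is not recoverable from the transcript.]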
The general architecture of the Information Extraction system
[Diagram, summarized]
Processing steps:
Biomedical abstracts
-> Normalization
-> Identification of proper nouns, acronyms, semantic and syntactic tagging
   (Named Entity Recognition, including linking acronyms to full names of biological objects; matching input words against previously stored tags; identification and classification of remaining words and symbols)
-> Identification of relevant text parts
-> Syntactic parsing
-> Extraction of biological relations from parse trees
Knowledge sources: WordNet; Closed Class Word List; domain-specific acronym and name patterns; Tag Memory Database; specialized verb lexicons
Patterns for domain-specific Named Entity Recognition
• Pattern 1: n lower case chars (n>=1) + m integers (m>=2) + optionally: any character (p53, cdc25C, bcl2)
• Pattern 2: n lower case chars (n>=1) + m upper case chars (m>=1) + k integers (k>=0) (mRNA)
• Pattern 3: integer + lower case + n integers (n>=0) (1alpha)
• Pattern 4: n integers (n>=1) + m upper case (m>=1) (7BL)
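The four patterns can be read as regular expressions. The following is a minimal sketch under that reading, not the system's actual implementation: the function names, the pattern labels and the `min_digits` parameter are assumptions made for illustration, and "lower case" in Pattern 3 is assumed to mean one or more lower-case characters. Read literally with m>=2, Pattern 1 does not cover the example bcl2, which has a single digit, so the lower bound is left adjustable.

```python
import re


def build_patterns(min_digits=2):
    """Compile the four name patterns; min_digits is the lower bound on m in Pattern 1."""
    return [
        # Pattern 1: lower-case chars + at least min_digits digits + optional extra char
        ("pattern1", re.compile(r"[a-z]+[0-9]{%d,}.?" % min_digits)),
        # Pattern 2: lower-case chars + upper-case chars + optional digits
        ("pattern2", re.compile(r"[a-z]+[A-Z]+[0-9]*")),
        # Pattern 3: one digit + lower-case chars + optional digits
        ("pattern3", re.compile(r"[0-9][a-z]+[0-9]*")),
        # Pattern 4: digits + upper-case chars
        ("pattern4", re.compile(r"[0-9]+[A-Z]+")),
    ]


def classify(token, min_digits=2):
    """Return the label of the first pattern that matches the whole token, else None."""
    for name, regex in build_patterns(min_digits):
        if regex.fullmatch(token):
            return name
    return None


for example in ["p53", "cdc25C", "mRNA", "1alpha", "7BL", "cell"]:
    print(example, classify(example))
```

With the default setting, `classify("bcl2")` returns `None`; calling `classify("bcl2", min_digits=1)` assigns it to Pattern 1.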
Linking acronyms to full names of biological objects
[Flowchart, summarized]
1. From the previous procedure: place the pointer at the first word in the sentence.
2. Find the next acronym A. If none is found, continue to the next procedure (other parts of the NER module).
3. If A is within (…): set L1 := first letter of A and N := number of letters in A; find the N:th word beginning in L1 to the left of the '(', and link that word and its right context to A.
4. Otherwise, if A is followed by '(' and L1*: mark the words inside the (…) and link them to A.
5. Return to step 2.
Examples:
There are also tumor-related genes like NF2 (neurofibromatose of type 2). p16INK4a belongs to a cell cycle regulator group called cyclin dependent kinase inhibitors (CDKI).
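The acronym-linking procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `is_acronym` is a hypothetical stand-in for the NER module's acronym test (two or more capital letters), "the N:th word beginning in L1" is read as "the word N positions to the left of the bracket, provided it begins with L1", and "L1*" is read as "a word beginning with L1".

```python
import re


def is_acronym(token):
    # Hypothetical acronym test: at least two capital letters in the token.
    return sum(ch.isupper() for ch in token) >= 2


def link_acronyms(sentence):
    """Map each acronym in the sentence to the full name it can be linked to."""
    tokens = re.findall(r"[()]|[^\s()]+", sentence)
    links = {}
    open_paren = None  # index of the currently open "(", if any
    for i, tok in enumerate(tokens):
        if tok == "(":
            open_paren = i
        elif tok == ")":
            open_paren = None
        elif is_acronym(tok):
            first = tok[0].lower()  # L1
            if open_paren is not None:
                # Acronym inside brackets: look N words to the left of "(".
                n = sum(ch.isalpha() for ch in tok)  # N
                start = open_paren - n
                if start >= 0 and tokens[start].lower().startswith(first):
                    links[tok] = " ".join(tokens[start:open_paren])
            elif tokens[i + 1:i + 2] == ["("] and ")" in tokens[i + 2:]:
                # Acronym followed by a bracketed full name starting with L1.
                close = tokens.index(")", i + 2)
                inside = tokens[i + 2:close]
                if inside and inside[0].lower().startswith(first):
                    links[tok] = " ".join(inside)
    return links


print(link_acronyms("There are also tumor-related genes like NF2 (neurofibromatose of type 2)."))
print(link_acronyms("p16INK4a belongs to a cell cycle regulator group called "
                    "cyclin dependent kinase inhibitors (CDKI)."))
```

In the second sentence p16INK4a is recognized as an acronym by the stand-in test but receives no link, because it is neither inside brackets nor followed by one.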
Sample semantico-syntactic tags
Our finding implicates that TNF-alpha released from the mesangium after IgA
deposition activates renal tubular cells.
[semcat('Our',our,[[],poss([])]),semcat(finding,find,[wnn,[]]),
semcat(implicates,implicate,[[],[speech_act_verb([1])]]),
semcat(that,that,[[],rel([])]),semcat('TNF',[propername]),
semcat(alpha,alpha,[wnn,[]]),semcat(released,release,[[],bioverb([[],production])]),
semcat(from,from,[[],prep([])]),semcat(the,the,[[],det([])]),
semcat(mesangium,mesangium,[[],[]]),
semcat(after,after,[[],prep([])]),semcat('IgA',[propername]),
semcat(deposition,deposition,[wnn,[]]),
semcat(activates,activate,[[],bioverb([[],activation])]),
semcat(renal,renal,[adj,[]]),semcat(tubular,tubular,[adj,[]]),
semcat(cells,cell,[[],cell([])]),semcat('.',[[],[]])]
Tags (occurrences) in the test set in relation to knowledge sources
[Bar chart. Y axis: number of tag occurrences, 0 to 9000. Categories: Proper Nouns, N, CCV, "Bioverbs", V, Adj+Adv, Closed Class. Series: tags learned from the training set; tags from NER, lexicons and morphology. Bar labels as extracted, assignment to bars not recoverable: 8905, 2989, 1964, 1450, 580, 420, 277, 59, 70, 61, 742, 287, 444, 257.]
The next step: finding background and foreground in abstracts
[Pipeline diagram] Biomedical abstracts -> Normalization -> Identification of proper nouns, acronyms, semantic and syntactic tagging -> Identification of relevant text parts -> Syntactic parsing -> Extraction of biological relations from parse trees
Additional label: Textual delimiters
Background/foreground in abstracts
ID: 16284406.
The transcription factors dehydration-responsive element-binding protein 1s (DREB1s)/C-repeat-binding factors (CBFs) specifically interact with the DRE/CRT cis-acting element and
control the expression of many stress-inducible genes in Arabidopsis. The genes for DREB1
orthologs, OsDREB1A and OsDREB1B from rice, are induced by cold stress, and
overexpression of DREB1 or OsDREB1 induced strong expression of stress-responsive genes
in transgenic Arabidopsis plants, resulting in increased tolerance to high-salt and freezing
stresses. In this study, we generated transgenic rice plants overexpressing the OsDREB1 or
DREB1 genes. These transgenic rice plants showed not only growth retardation under normal
growth conditions but also improved tolerance to drought, high-salt and low-temperature
stresses like the transgenic Arabidopsis plants overexpressing OsDREB1 or DREB1. We also
detected elevated contents of osmoprotectants such as free proline and various soluble
sugars in the transgenic rice as in the transgenic Arabidopsis plants. (…)
Retrieval of Relevant Text Parts
• Presence of the string this study/current study/present study/our study, or of synonyms of study in the same context (work, research, investigation)
• Presence of the pronoun we preceded or followed by a verb denoting an event in the world of the researcher (i.e., a cognition, communication, or manipulation verb) and not combined with a time adverb referring to past time, such as previously or earlier
• Presence of the string our goal/our aim
• Presence of a cognition/communication verb combined with the adverb now, presently or here
• Tense shift from present to past
Success rate: 92.5%
Retrieval of Relevant Text Parts (2)
(CCverb = cognition/communication verb)
if Foreground < 6 and word is in [study, work, research, investigation]
   and word-1 is in [this, current, present, our] -> Foreground = 6
else if Foreground < 5 and word is a CCverb and foundWe = 1
   and found{previously, earlier} = 0 -> Foreground = 5
else if Foreground < 4 and foundCCverb = 1 and foundWe = 1
   and word is not in [previously, earlier] -> Foreground = 4
else if Foreground < 3 and word is in [goal, aim] and word-1 is in [our] -> Foreground = 3
else if Foreground < 2 and word is a CCverb ->
   if found{now, presently, here} = 1 -> Foreground = 2
   else foundCCverb = 1
else if Foreground < 2 and word is in [now, presently, here] ->
   if foundCCverb = 1 -> Foreground = 2
   else found{now, presently, here} = 1
if Foreground < 1 and word indicates tense shift from present to past -> Foreground = 1
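The scoring rules can be approximated at sentence level in Python. This is a simplified sketch, not the procedure used in the system: it computes the cue flags over the whole sentence first instead of word by word, so the two "we + verb" rules (scores 5 and 4) collapse into one. The verb list `CC_VERBS` is a hypothetical mini-lexicon, and tense-shift detection, which needs tagged input, is reduced to a flag supplied by the caller.

```python
STUDY_NOUNS = {"study", "work", "research", "investigation"}
STUDY_DETERMINERS = {"this", "current", "present", "our"}
GOAL_NOUNS = {"goal", "aim"}
PAST_ADVERBS = {"previously", "earlier"}
NOW_ADVERBS = {"now", "presently", "here"}
# Hypothetical mini-lexicon; the slides do not list the actual verbs.
CC_VERBS = {"analysed", "detected", "generated", "hypothesise",
            "report", "reported", "show"}


def foreground_score(sentence, tense_shift=False):
    """Return the highest foreground score (0 to 6) triggered by the sentence."""
    words = [w.strip(".,;:()").lower() for w in sentence.split()]
    has_we = "we" in words
    has_cc_verb = any(w in CC_VERBS for w in words)
    has_past_adverb = any(w in PAST_ADVERBS for w in words)
    has_now_adverb = any(w in NOW_ADVERBS for w in words)

    score = 0
    # Pair each word with its predecessor to test two-word cues.
    for prev, word in zip([""] + words, words):
        if word in STUDY_NOUNS and prev in STUDY_DETERMINERS:
            score = max(score, 6)
        if word in GOAL_NOUNS and prev == "our":
            score = max(score, 3)
    if has_we and has_cc_verb and not has_past_adverb:
        score = max(score, 5)
    if has_cc_verb and has_now_adverb:
        score = max(score, 2)
    if tense_shift:
        score = max(score, 1)
    return score


print(foreground_score("In this study, we generated transgenic rice plants "
                       "overexpressing the OsDREB1 or DREB1 genes."))
print(foreground_score("We previously reported that p53 binds this element."))
```

The first call scores the sentence that opens the foreground part of abstract 16284406; the second shows how a past-time adverb blocks the "we + verb" cue.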
Extracting relations from syntactic trees
We hypothesise that mediators released from human mesangial cells (HMC) triggered by IgA deposition may lead to activation of proximal tubular epithelial cells (PTEC)
[Parse tree, summarized]
S: subj = we; pred = hypothesize; obj = Sdsent (Hypothesis)
  Sdsent: subj = mediators (NUX i); pred = may lead to activation; obj = PTEC
    v: activate; KEGG relation: activation
    Relcl: subj = Ref i (mediators); pred = release, pass; advl = from HMC (NUX j)
      v: release; relation type: production
      Relcl: subj = Ref j (HMC); pred = trigger, pass; advl: agent = IgA deposition
        v: trigger; KEGG relation: activation
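How relations might be read off such a tree can be sketched with nested dictionaries. Everything in this sketch is an assumption made for illustration: the dictionary layout, the names `VERB_RELATIONS`, `HYPOTHESIS_VERBS` and `extract_relations`, the normalization of "may lead to activation" to the verb activate, and the treatment of the "from HMC" adverbial as the producer in the passive release clause. Only the verb-to-relation pairs (release: production, trigger: activation, activate: activation) come from the slide.

```python
# Verb-to-relation pairs taken from the example tree.
VERB_RELATIONS = {"release": "production",
                  "trigger": "activation",
                  "activate": "activation"}
# Hypothetical list of verbs that mark their complement as a hypothesis.
HYPOTHESIS_VERBS = {"hypothesize", "hypothesise"}


def extract_relations(clause, hypothesis=False):
    """Collect (actor, relation, target, is_hypothesis) tuples from a clause tree."""
    found = []
    relation = VERB_RELATIONS.get(clause["verb"])
    if relation is not None:
        if clause.get("passive"):
            # Passive: the grammatical subject is affected, the agent acts.
            actor, target = clause.get("agent"), clause.get("subj")
        else:
            actor, target = clause.get("subj"), clause.get("obj")
        if actor and target:
            found.append((actor, relation, target, hypothesis))
    # Clauses embedded under a hypothesis verb inherit the hypothesis flag.
    inner = hypothesis or clause["verb"] in HYPOTHESIS_VERBS
    for sub in clause.get("clauses", []):
        found.extend(extract_relations(sub, inner))
    return found


tree = {
    "verb": "hypothesize", "subj": "we",
    "clauses": [{
        "verb": "activate", "subj": "mediators", "obj": "PTEC",
        "clauses": [{
            "verb": "release", "passive": True,
            "subj": "mediators", "agent": "HMC",
            "clauses": [{
                "verb": "trigger", "passive": True,
                "subj": "HMC", "agent": "IgA deposition",
            }],
        }],
    }],
}

for rel in extract_relations(tree):
    print(rel)
```

Each printed tuple corresponds to one annotated node in the summarized tree, with the final field recording that the relation was stated as a hypothesis.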
Allelic loss at TP53 seems to arise independently of LOH at the RB1 gene in carcinomas of the uterine corpus in humans
The syntactic tree after application of the tree search algorithm
A possible graphical representation of the compressed tree
[Figure labels: Allelic loss (place: TP53); LOH; No relation; appearance; Hypothesis]
Results
• Test corpus: about 15 000 words selected from PubMed using p53 as keyword
• Tagging: 95.2% recall
• Retrieval of relevant text parts: success rate 92.5%
• Syntactic parsing: 79% recall, 86% precision
• Relation retrieval: tested only manually, success rate about 94%
Current and Future Work
• A revised tagging procedure: tagging using a smaller lexicon and a domain-specific prefix list
• Parsing improvements
• Implementation of the tree search algorithm
• The question of the final output format