Processing Complex Sentences for Information Extraction

Download Report

Transcript Processing Complex Sentences for Information Extraction

Processing Complex Sentences for
Information Extraction
Deepthi Chidambaram
December 22, 2004
BY 510
Committee
Dr. Hasan Davulcu
Dr. Chitta Baral
Dr. Yoganand Balagurunathan
Outline
•
•
•
•
Information extraction – motivation.
Current Trends in Information Extraction.
Sentence structures.
Complex Sentence Processing
– A Rule Engineering Approach.
– An Automated Approach
• Experimental Results.
• Conclusion.
Information Extraction - Motivation
• Enormous amount of knowledge stored in unstructured
text.
• Lack of easy access to information for processing.
• Biomedical domain - Piping hot results that curated
databases don’t offer.
What can we get from text
• Literature
– Protein functional information.
– Interactions between genes, proteins, chemicals and
drugs.
– Signal transduction pathways.
• Clinical records
– Diagnosis patterns.
– Drugs to disease association.
The information extraction process
Text Retrieval
• Keyword retrieval (PubMed)
• Vector-Space Model
• Probabilistic models
• Latent Semantic Indexing
Text Classification
• Knowledge Engineering
– Rules specified by a domain expert.
– ‘knowledge acquisition bottleneck’.
• Machine Learning
– ‘Hard’ and ‘soft’ classification.
– Traditional classifiers with automatic feature selection
algorithms.
– Frequent word co-occurrences and/or Natural Language
patterns.
– Ontologies to identify word synonyms.
Named Entity Recognition
• Dictionary look up.
– LocusLink, Gene Ontology, HUGO, SwissProt and
UMLS provide extensive lists of biomedical terms.
– Constrained to identify names present in the
vocabularies alone.
• Regular Expressions.
– Patterns defined on words and phrases.
– E.g.. Alphanumeric words, hyphenated mixed cases.
Named Entity Recognition
• Boosted Wrapper Induction
– Name boundaries to identify entities.
• Supervised Classifiers
– Hidden Markov Models.
– Support Vector Machines.
• Supervised Rule Learners to tag entity names – the RAPIER system
– Rules based on word patterns.
• Probabilistic Rule Learner – ABGene
– Rules based on co-occurrence of POS tags and neighboring words.
Named Entity Recognition
Results from Shared task in JNLPBA, 2004
(Trained and tested on GENIA corpus)
Technique used
Precision / Recall / F-measure
Combination of Hidden Markov Models and Support Vector
Machines
76.0 / 69.4 / 72.6
Classifier trained on Local and syntactic features of entity names
71.6 / 68.6 / 70.1
Conditional Random Fields (CRF) and Novel Feature Sets
70.3 / 69.3 / 69.8
SVM + CRF
67.8 / 64.8 / 66.3
HMM model
69.1 / 61.0 / 64.8
Training SVM classifiers on linguistic knowledge
67.4 / 61.0 / 64.0
Extracting Relationships
• Word co-occurrences
– Presence of a functional word and entity
names in same sentence.
• Regular Expressions
– Regular expressions defined on part of speech
tags.
– Processing complex sentence structures using
regular expressions.
Extracting Relationships
• Grammar rules based on sentence structures –
shallow parses.
– Context Free grammars.
– Probabilistic Context Free Grammars.
– Automatic learning of grammar rules using supervised rule
learners – FOIL, PROGOL.
– Combination of Naïve Bayes Classifier and FOIL.
Extracting Relationships
• Hidden Markov Models
– Learning templates based on words and sentence structures.
• Preposition based templates – the Arizona Relation
parser
– Slot filler approach.
– Shallow parsing of sentence structure.
• Decision trees - GIS
– Wording and term distributions represented as a decision tree to
identify protein interactions.
Extracting relationships
• Manually built templates
– SUISEKI - Frame based system.
– BioRAT - Templates based on sentence structure and semantics.
– GENIES, GeneWays – Extension of the MedLEE frame based
system.
• Full sentence parses
–
–
–
–
The structure of the entire sentence is considered.
Potentially more accurate.
Rule learners based on Link grammar analysis of sentences.
Dependency sub-graph from the link grammar word-dependency
output.
Evaluation of Extraction Systems
• Precision
– Correctness of system
True Positives
True Positives + False Positives
• Recall
– Sensitivity of system
True Positives
True Positive + False Negative
• F-measure
– Combination of Precision and Recall (harmonic mean)
2*P*R
(P + R)
Common problems
• Complex sentence structures.
c-Abl tyrosine kinase activity is blocked by pRb, which
binds to the c-Abl kinase domain.
• Contextual information.
c-Abl phosphorylates tyrosines in the C-terminal domain
(CTD) of RNA polymerase II (RPase II; Km 5 0.5 mM).
Common problems
• Negation.
DNA-PK does not bind detectably to Ku in the absence of
DNA.
• Modality and hypothesis.
Cdk4 may be inhibited by tyrosine phosphorylation.
Gadd45 binds to Cdk1 and Gadd45 inhibits Cdk1 activity,
probably by displacing Cyclin B1.
Complex Sentence Processing Motivation
•
Sentences structures in text are seldom simple.
•
Simple, Complex, compound and Complex-compound
structures are addressed.
•
Dependent and independent clauses.
•
Multiple clauses used to specify interactions.
•
Multiple interactions specified in clauses.
Sentence Structures - Examples
•
Simple
Phosphorylation by ATM activates c-Abl.
•
Complex
c-Abl tyrosine kinase activity is blocked by pRb, which binds to the c-Abl kinase domain.
•
Compound
c-Abl phosphorylates tyrosines in the C-terminal domain (CTD) of RNA polymerase II
(RPase II; Km 5 0.5 mM) ; the c-Abl SH2 domain is a specificity determinant for this
reaction.
•
Complex-compound
Because the same signals that induce somitic expression of Pax-3 also induce myogenesis and
because expression of Pax-3 precedes that of MyoD and Myf-5 in vivo and that of MyoD in vitro,
we investigated whether Pax-3 might play a role in activating the myogenic program in somitic
cell.
Complex Sentence Processing
• Manual Rule Engineering approach
– Based on Part of Speech tags.
– Domain expert engineers rules for complex sentence
structures.
• Automated approach
– Based on syntactic relationships between the words in
a sentence.
– Link grammar parser.
Evaluation of Our Systems
• Precision
No. of Correct interactions
Total no. of interactions extracted
• Recall
No of correct interactions
Total no of interactions identified in text
• F-measure
(2*P*R)/P+R
Rule Engineering Approach
• Patterns learnt by interaction with domain expert.
• Sentences tagged by a natural Language
processor.
• Extraction rules written based on Part of speech
tags.
• False positives reduced by use of Ontologies.
Rule Engineering Approach
Toufeeq
Rule Engineering Approach
• System developed in XSB prolog.
• A Java GUI integrated to the back-end processing
system.
• Dictionaries coded as knowledge bases in prolog.
• Input – Sentences in a text file.
• Ouput – An interaction triplet.
– interact (upstream-gene, interaction word,
downstream-gene).
Pos patterns for simple sentences
Extract an interaction
from a (Noun, Verb,
Adjective) pattern.
extract([word([tag = NNP],_h13160),word([tag = VBZ],_h13161),
word([tag=JJ],_h13162)], interact(_h13160,_h13161,_h13162),true]).
extract([ng(_h99513),vg(_h99514),ng(_h99515)],
interact(_h99513,_h99514,_h99515),true).
extract([ng(_h108321),vg(_h108322),word([tag=NNP],_h108323)],
interact(_h108321,_h108322,_h108323),true).
Knowledge Bases Used
•
LocusLink (NCBI)
– compilation of the genes and related information such as alias names, protein
products, PubMed IDs of articles referring to the gene etc.
– Gene, protein names and their aliases.
•
UMLS (NLM)
– Electronic "Knowledge Sources" and associated lexical programs for biological
domain.
– Specialist Lexicon used to extract interaction words.
•
WordNet
– English nouns, verbs, adjectives and adverbs are organized into synonym sets,
each representing one underlying lexical concept.
– Used in conjunction with UMLS to extract interaction words.
– Interaction words are stemmed using the Porter Stemming algorithm to match
across numbers and tense.
Representation of Knowledge Bases
%gene and protein names
isa('3''-nucleotidase',gene).
isa('3'' repair exonuclease 1',protein).
isa('E2F4',gene).
isa('5''-nucleotidase, cytosolic IA',protein).
isa('M1',gene).
isa('PP2A, subunit B''',protein)
% interaction words
interaction('accumulat').
interaction('activat').
interaction('elevat').
interaction('hasten').
interaction('incite').
interaction('increas').
interaction('induc').
Complex Sentence Processing
• Extraction Methodology
– Identify the Complex Sentence patterns.
– Split the complex and compound sentences to simple
sentences based on extraction rules for the pattern.
– Process the sub-sentences using pattern matching
against simple sentence patterns.
• An iterative process was used to build the rules
and add simple sentence patterns manually.
• Training and Testing – curated text from a 1999
Molecular Biology of the Cell article by Kurtis Kohn.
Rules Written to Address Complex
Sentence Structures
Case - List of arguments - a comma separated list or a list separated by conjunctions.
Rule - Sentence is rewritten with each of the arguments. Each rewrite is processed as a
single sentence.
Sample InputDyhydrofolate reductase (DHFR) is activated via the E2F transactivation domain,
whereas B-myb, Cyclin E, E2F-1, E2F-2, and Cdc2 are regulated via the
repression domain of pRb family proteins.
Sample output interact([Dyhydrofolate,reductase],[is,activated],
[the,E2F,transactivation,domain],1.0).
interact(B-myb,[are,regulated],[the,repression,domain],0.6667).
interact([Cyclin,E],[are,regulated],[the,repression,domain],0.6667).
interact(E2F-1,[are,regulated],[the,repression,domain],0.6667).
interact(E2F-2,[are,regulated],[the,repression,domain],0.6667).
interact(Cdc2,[are,regulated],[the,repression,domain],0.6667).
Rules – Continued
extractListStart([],_).
extractListStart([X|Y],Part) :conc([X],Part,Whole),
extractComplex(Whole),
extractListStart(Y,Part).
extractComplex(S):contains([W1,word([_],','),_,wor
d([_],','),_],S),
splitList([W1],S,S1,S2),
S1=[],
getList(S2,List),
length(List,Len),
ith(Len,List,Ele),
sublist([Ele],S,Part1,Part2),
extractListStart(List,Part2),!.
Snapshot of the system GUI
Sample Run of the system
Results
Training data Testing data
Precision
(%)
Recall
(%)
Parts A, B
Parts A, B
98
88
Parts A, B
Part C
57
43
Parts A, B, C
Part E
81.25
71
Discussion
• Pronoun resolution done by hand.
• Dependent on skills of domain expert.
• Extension to other sentence structures will
involve re-engineering rules.
• Recall tends to be higher as rules are
manually created.
Automated Extraction System
• Automatic Integrated system for
–
–
–
–
Pronoun resolution.
Entity Tagging.
Complex sentence processing.
Interaction Extraction.
• Uses dependencies generated by the Link
grammar parser
• Scalable
• Less error prone
Automated Approach to CSP
Anaphora Resolution
• Anaphora Resolution
– Identification of pairs of noun phrases that refer to
each other.
• Need for Anaphora resolution
– Entities are often referred to by pronouns and referent
noun phrases.
– Identifying the noun phrases referred to improves
recall of the system.
Types of Anaphora in Literature
Pronoun Resolution Module
• Identifies and resolves pronominal anaphora.
• System written in XSB prolog.
• Interprolog Java API is used to integrate the module to
the extraction system.
• Addresses third person pronouns such as it, they.
• Handles both singular and plural pronouns as well as
reflexives (itself, themselves) and possessives
(its, their).
Pronoun Resolution - Algorithm
1. Process sentences to get POS tags for the
words.
2. Perform chunking on the words to form noun
phrases.
3. Identify pronouns in the sentence
4. Remove redundant reflexives
(Gene A, by itself remains activated.)
5. Replace identified pronouns with noun
phrases matching number and voice.
Pronoun Resolution: Walkthrough
Ku loads onto dsDNA ends and it can diffuse along the DNA in an
energy-independent manner.
Ku loads onto dsDNA ends and Ku can diffuse along the DNA in an
energy-independent manner.
When breast cancers were examined for NGAL mRNA and protein levels,
they were found to exhibit heterogeneous expression.
When breast cancers were examined for NGAL mRNA and protein levels ,
breast cancers were found to exhibit heterogeneous expression .
Pronoun Resolution: Example 2
Entity Tagging
• Combination of dictionary approach and heuristic rules.
• Dictionaries built from LocusLink, UMLS, GO and WordNet.
• Entities addressed by dictionaries - genes, proteins,
chemicals, location and interaction words.
• Regular expression developed manually using the information
from the dictionaries.
• Heuristic rules combine regular expression and POS tags to
mark entities.
Gene and Protein Name Dictionary
• Source - NCBI’s LocusLink.
• Fields in LocusLink contain information about gene
loci.
– Official nomenclature, aliases, sequence accessions, phenotypes, EC numbers,
MIM numbers, UniGene clusters, homology, map locations, related web sites,
etc.
• Perl parser written to identify gene and protein
names and their aliases.
• Names and aliases maintained in a list loaded by
the system during processing.
Gene and Protein Name Dictionary
(a) LocusLink fields considered for extraction (b) Partial LocusLink record
(c) Regular expression used for gene name matching
Gene and Protein Name Tagging
• Exact matches with dictionary
• Process of relaxation
– Mark the entire noun group as a gene if any of the word has been
identified as gene / protein name.
‘The c-abl kinase activity’ is a gene type
• Words that match a manually built regular expression.
– POS tags are used to exclude false positives such as adverbs,
modals, prepositions, etc.
• 508,477 entries in dictionary
Chemical and Location Names
• Used to tag location of interaction and agents in interaction for
modifier analysis.
• Source - UMLS Metathesaurus and Semantic Network, GO.
• Location names dictionary contains the GO cellular components
ontology.
• Semantic types identified from Semantic Network using the ‘isa’
relationship.
Chemical and Location Tagging
• Concepts categorized under the identified semantic
types are added to the corresponding dictionaries.
• A perfect match with an entry in the dictionary is tagged
as chemical name / cellular location.
• 790,000 chemical names and 118, 018 location names
were identified.
Interaction Words
• Source – Verbs from UMLS Specialist Lexicon and
WordNet.
• Interaction words are identified as
– (Specialist verbs ∩ WordNet verbs)
• Missing interaction words are added manually.
• 1400 verbs in their stemmed form are stored in the list.
• Porter Stemming Algorithm allows matches across tense
and number.
Complex Sentence Processor (CSP)
• Tool used – Link Grammar Parser (LGP) from CMU.
• Verb based extraction.
– John played the pipes.
• Linkages returned by the link grammar used to identify
syntactic constructs.
• Syntactic constructs associated with a verb form simple
sentences.
• Internal clause format representation.
• Java system with a GUI.
Link Grammar Parser
• Based on the Link Grammar for English Syntax.
• Dependency based grammar system written in C.
• Every word in a sentence is connected to every other
word directly or indirectly through links.
• Dictionary specifies linking requirements for each word.
• Linkages in a sentence should satisfy
– Connectivity.
– Planarity.
Linking Requirement
• Link Grammar Parser (LGP) trained on sentences from
conversational English (70 telephone conversations).
• Linking requirements specified using ‘&’ and ‘or’
operators.
the: ({AL-} & {@L+} & (D+ or DD+)) or DG+ or (TR- & U+);
• Each word should follow its linking requirement.
Linkage
• Words in sentence matched against dictionary
– Linkage formed using the satisfied linking requirements.
• Null linkages – ambiguous / unknown words
– LG ignores some words if an acceptable guess of the requirement is
not found.
• Linkages returned in increasing order of cost.
Using Link Grammar Parser in CSP
• LGP considers each sentence as a having one
verb.
• Provides sub-linkages for each clause in complex
sentence.
• Sub linkages are combined to a single linkage
using !union option.
• Combined linkage analyzed and split into
clauses.
Combined sub-linkages - !union
operator
Pre-processing Text for Link
Grammar
• Gene names are not in the vocabulary of LG.
– Force LGP to identify entities as nouns.
– Capitalized and hyphenated phrases with entity tags.
– E.g. ‘The C-abl protein kinase activity’ is converted to
‘Gene-The-C-abl-protein-kinase-activity’.
• Phrases / words in parentheses cause incorrect
linkages.
– Remove words within parentheses.
Preprocessing – removing
parentheses
Rapid degradation of Cyclin D1 requires phosphorylation at threonine-286 (kinase unknown,
but not Cdk2 or Cdk4); degradation is by way of the ubiquitin-proteasome pathway
Sentence crashes link grammar if processed as is, with parentheses.
Pre-processing Text for Link
Grammar
• Windows version of Link Grammar Parser
– Does not handle sentences of more than 70 words
• Mark the sentences of more than 70 words and discard
them for further processing
– Words of more than 50 characters also crash the LGP
• Mark sentences and discard them from further processing
• Sentences without two participants and one
interaction word do not contain interactions
– Discard sentences to improve processing time
– Discarding is done after pronoun resolution so
referrals are not disturbed.
Complex Sentence Processing
• Input
– Tagged and pre-processed text.
• Output
– Clauses from the sentence representing the simple
sentences.
• Internal clause format representation
Subject + Verb + Object (s) + Verb Modifying Phrases (s)
• Simple Sentences processed by Interaction
Extractor module to get interactions from text.
Complex Sentence Processor- Goal
Kohn-partC3|2|Another DMP1-regulated gene is CD13/aminopeptidase N,
which is activated cooperatively by DMP1 and c-Myb; CD13/aminopeptidase
N activation by DMP1 is inhibited by cyclin D, independent of Cdk4/6.
Complex Sentence Processing
Abstract ID | Sentence ID | Subject | Verb | Objects(s) | Verb Modifier(s)
Kohn-partC3|2|Gene-Another-DMP1-regulated-gene|is|Gene-CD13/aminopeptidaseN , which#||
Kohn-partC3|2|Gene-CD13/aminopeptidase-N |is activated||cooperatively#by GeneDMP1 and Gene-c-Myb#|
Kohn-partC3|2|Gene-N-activation by Gene-DMP1|is inhibited||by Gene-cyclin-D ,
independent of Gene-Cdk4/6#|
Our Contribution: CSP Algorithm
1. Processing complex sentences - Set ranges in
sentence
1. Identify the S links in the sentence. Each S link forms the beginning
of a simple sentence.
2. Identify the Components from each S link and expand the range to
include prepositional, adverbial and adjectival phrases.
2. Extract the components of the simple sentence
from the original sentence using the ranges as the
substring markers.
CSP - Illustration
Upon growth factor stimulation of quiescent cells, p130 declines late in
G1 and p130 is replaced by p107, which is absent in quiescent cells.
Entity Tagging and Pre-processing
Kohn-partE2|8|Upon growth factor stimulation of quiescent cells, Genep130 declines late in Gene-G1 and Gene-p130 is replaced by Genep107, which is absent in quiescent cells. |
Illustration of the CSP algorithm
Kohn-partE2|8| upon growth factor stimulation of quiescent cells, Gene-p130 |declines || late # in Gene-G1 #|
Kohn-partE2|8| Gene-p130 | is replaced || by Gene-p107 #|
Kohn-partE2|8| Gene-p107 | is absent || in quiescent cells#|
CSP Example – Pronoun resolution,
preprocessing
Experimental results
• System compared with two other systems
– BioRAT [Corney et al, 2004]
• Uses manually generated sentence structure templates.
• Partial parses of sentences.
– GeneWays [Rzhetsky et al, 2004]
• An improvement on GENIES system.
• Extraction rules are semantic grammars defined on the
sentence structures
• Partial parses of the sentences.
Evaluation - BioRAT
• DIP entries with both participants present in
SwissProt.
• 229 abstracts, 660 entries in DIP.
• Gene name dictionary restricted to SwissProt
names to emulate the BioRAT system.
• 452 interactions obtained.
• Precision of 54.39%
• Recall of 19.98%.
• F-measure of 29.18%
Comparison with BioRAT
Comparison with BioRAT and DIP was undertaken as a combined
work with Toufeeq
Comparison with DIP
Difference in Recall - the interaction is not mentioned in the abstract.
(Manual and semi-automatic curation of entries in DIP)
Comparison with GeneWays
• 2500 interactions obtained from the authors
corresponding to more than 10 citations in PubMed.
• 30 interactions with most citations were selected and
PubMed abstracts (495) obtained through NCBI’s
eutilities facility corresponding to the entities.
• Extraction system obtained 1485 interactions.
• 156 interactions occurred more than twice in the
abstracts with Precision of 76.13 %.
• Recall performed on a single full text article quoted in
GENIES.
• 42 interactions extracted from the article with a recall of
63.64%.
Comparison with GeneWays
Analysis of errors in comparison with
GeneWays
Conclusion
• Extraction system is able to extract multiple
interactions mentioned through complex sentence
structure.
• Limitations and bottlenecks
– Protein name identification.
– Link Grammar Parsing problems.
– Identification of relationships between clauses.
Thank you
• My advisors – Dr. Davulcu, Dr. Baral for their guidance
and encouragement
• Yoga, for the periodic inputs and pointers
• Dr. David Corney , for sharing the data from BioRAT
• Dr. Andrew Rzhetsky for providing data from the
GeneWays system
• Toufeeq, for sharing his code and for his contribution in
evaluation.
Questions