Extracting Semantic Predication from
Medline Citations for Pharmacogenomics
C.B. Ahlers (1), M. Fiszman (2), D.D. Fushman (1),
F.M. Lang (1) and T.C. Rindflesch (1)
(1) National Center for Biomedical Communications,
National Library of Medicine
(2) University of Tennessee, USA
(PSB 2007 12:209-220)
Abstract
This paper describes an NLP system
(Enhanced SemRep) that identifies core
assertions on pharmacogenomics in
Medline citations.
The development of the system is based
on the adaptation of an existing system
and depends on UMLS.
Preliminary evaluation: 55% recall and
73% precision.
1. Introduction (1/3)
Core research in pharmacogenomics
investigates the interaction of genes/proteins
with therapeutic substances, for example in
oncology treatment.
Current NLP for pharmacogenomics
concentrates on co-occurrence information
without specifying exact relations.
Enhanced SemRep complements that approach
by representing assertions in text as semantic
predications.
1. Introduction (2/3)
Example
These findings therefore demonstrate that
dexamethasone (a corticosteroid) is a potent inducer of
multidrug resistance-associated protein
expression in rat hepatocytes (liver cells)
through a mechanism that seems not to involve the
classical glucocorticoid receptor
pathway.
1. Dexamethasone STIMULATES Multidrug Resistance-Associated Proteins
2. Dexamethasone NEG_INTERACTS_WITH Glucocorticoid
receptor
3. Multidrug Resistance-Associated Proteins PART_OF Rats
4. Hepatocytes PART_OF Rats
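The four predications above are subject-PREDICATE-object triples, which can be sketched as a small data structure. This is only an illustration: the `Predication` class and its field names are assumptions made for this example, not part of SemRep or its output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    """A subject-PREDICATE-object triple (hypothetical helper, not SemRep's format)."""
    subject: str
    predicate: str
    obj: str

    @property
    def negated(self) -> bool:
        # A NEG_ prefix marks a relation that the sentence denies.
        return self.predicate.startswith("NEG_")

    def __str__(self) -> str:
        return f"{self.subject} {self.predicate} {self.obj}"


# The four predications listed for the dexamethasone sentence.
predications = [
    Predication("Dexamethasone", "STIMULATES", "Multidrug Resistance-Associated Proteins"),
    Predication("Dexamethasone", "NEG_INTERACTS_WITH", "Glucocorticoid receptor"),
    Predication("Multidrug Resistance-Associated Proteins", "PART_OF", "Rats"),
    Predication("Hepatocytes", "PART_OF", "Rats"),
]

# Keep only the assertions that the text negates.
negated = [str(p) for p in predications if p.negated]
print(negated)
```

Storing the negation in the predicate name, as the slide does, keeps each predication a plain triple.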
1. Introduction (3/3)
Based on two existing systems
SemRep: extracts semantic predications from clinical
text.
SemGen: developed from SemRep to identify
etiologic (causal) relations between genetic
phenomena and diseases.
Relations
Among genes, drugs, diseases, and population groups.
At the gene level, more specific genetic
phenomena (e.g., mutations, single nucleotide
polymorphisms, and haplotype information) are
not addressed.
2. Background
NLP for Biomedicine
The Unified Medical Language System
SemRep and SemGen
2.1 NLP for Biomedicine (1/2)
Co-occurrence of entities in text
(gene-disease relations, Yen et al., 2006;
drug-gene, Rindflesch et al., 2000).
Machine learning techniques
(gene-disease relations, Chun et al., 2006;
drug-gene, Chang et al., 2004).
Syntactic templates and shallow parsing
(protein interactions, Blaschke et al.,
1999).
2.1 NLP for Biomedicine (2/2)
Enhanced SemRep addresses a wide
range of syntactic structures and
specific semantic relations pertinent to
pharmacogenomics.
Example
STIMULATES
DISRUPTS
CAUSES
2.2 UMLS
Metathesaurus (more than 10^6 concepts)
Concept: fever;
Synonyms: pyrexia, febrile, hyperthermia;
Semantic Type: ‘Finding’
The Semantic Network defines allowable
relationships between semantic types
‘Gene or Genome’ PART_OF ‘Cell’
‘Pharmacologic Substance’ INTERACTS_WITH
‘Enzyme’
‘Disease or Syndrome’ CO-OCCURS_WITH
‘Neoplastic Process’
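The idea that only certain relations are allowed between semantic types can be sketched as a set lookup. The table below holds just the three example relations from this slide, and `is_allowed` is a hypothetical helper written for this illustration. It does not reflect how the UMLS distribution is actually stored or queried.

```python
# Allowed (subject type, relation, object type) triples.
# Only the three examples from the slide; the real Semantic Network is far larger.
ALLOWED_RELATIONS = {
    ("Gene or Genome", "PART_OF", "Cell"),
    ("Pharmacologic Substance", "INTERACTS_WITH", "Enzyme"),
    ("Disease or Syndrome", "CO-OCCURS_WITH", "Neoplastic Process"),
}


def is_allowed(subject_type: str, relation: str, object_type: str) -> bool:
    """Return True if the relation is licensed between the two semantic types."""
    return (subject_type, relation, object_type) in ALLOWED_RELATIONS


print(is_allowed("Gene or Genome", "PART_OF", "Cell"))
print(is_allowed("Cell", "PART_OF", "Gene or Genome"))
```

The lookup is direction-sensitive: swapping subject and object is a different triple and is rejected unless it is listed separately.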
2.3 SemRep and SemGen (1/2)
SemRep: a rule-based symbolic NLP system.
Example
Phenytoin induced gingival hyperplasia
Parse with concept mapping:
[[head(noun(phenytoin)),
metaconc(‘Phenytoin’:[orch,phsu]))],
[verb(induced)], [head(noun([‘gingival
hyperplasia’)), metaconc(‘Gingival
Hyperplasia’:[dsyn]))]]
Semantic types: phsu = ‘Pharmacologic Substance’,
dsyn = ‘Disease or Syndrome’
Semantic Network relation:
‘Pharmacologic Substance’ CAUSES ‘Disease or
Syndrome’
Argument identification yields:
Phenytoin CAUSES Gingival Hyperplasia
2.3 SemRep and SemGen (2/2)
SemGen: identifies semantic predications on the
genetic etiology of disease.
Gene and protein names are identified with ABGene.
Since the UMLS Semantic Network does not cover
molecular genetics, new semantic relations were
created:
Gene-disease interactions: (ASSOCIATED_WITH,
PREDISPOSE, and CAUSE)
Gene-gene interactions: (INHIBIT, STIMULATE, and
INTERACTS_WITH)
3. Methods (1/2)
Scrutiny of the pharmacogenomics literature to
identify relevant predications not identified by
either SemRep or SemGen.
1000 Medline citations containing drug and
gene names were retrieved.
400 sentences were selected, including genetic
(gene-disease), genomic (gene-gene), and
pharmacogenomic (drug-gene, drug-genome)
relations; in addition, relations between genes and
population groups, between diseases and population
groups, and pharmacological relations (drug-disease,
drug-pharmacological effect, drug-drug) were scrutinized.
3. Methods (2/2)
After processing these 400 sentences with
SemRep, errors were analyzed and categorized
for etiology.
The majority of errors were due to:
The Semantic Network
Errors in argument identification due to “empty” heads
Gene name identification
These led to extensive modifications for Enhanced SemRep.
Gene name identification was addressed by adding
ABGene to the machinery.
3.1 Modification of Semantic Network
for Enhanced SemRep (1/4)
Grouping semantic types: Five broader semantic
groups (Substance, Anatomy, Living Being,
Process, and Pathology) were defined to permit
predications relevant to pharmacogenomics.
Substance: ‘Amino Acid, Peptide, or Protein’,
‘Antibiotic’, ‘Carbohydrate’, ...
Anatomy: ‘Anatomical Structure’, ‘Body
Part, Organ, or Organ Component’, ‘Cell’, ‘Gene or
Genome’, ‘Neoplastic Process’, ‘Tissue’ …
3.1 Modification of Semantic Network
for Enhanced SemRep (2/4)
Living Being: ‘Animal’, ‘Archaeon’,
‘Bacterium’, ‘Fungus’, ‘Human’,
‘Invertebrate’, ‘Mammal’, ‘Organism’,
‘Vertebrate’, ‘Virus’
Process: ‘Acquired Abnormality’,
‘Anatomical Abnormality’, ‘Cell Function’, ‘Cell or
Molecular Dysfunction’, ‘Congenital
Abnormality’, ‘Laboratory Test Result’…
Pathology: ‘Acquired Abnormality’, ‘Anatomical
Abnormality’, ‘Cell or Molecular Dysfunction’,
‘Congenital Abnormality’, ‘Disease or Syndrome’,
‘Injury or Poisoning’, ‘Mental or Behavioral
Disorder’, …
3.1 Modification of Semantic Network
for Enhanced SemRep (3/4)
Define predications: categories 1-6
1: Genetic Etiology
{Substance} ASSOCIATED_WITH OR
PREDISPOSES OR CAUSES {Pathology}
2: Substance Relations
{Substance} INTERACTS_WITH OR INHIBITS OR
STIMULATES {Substance}
3: Pharmacological Effects
{Substance} AFFECTS OR DISRUPTS OR
AUGMENTS {Anatomy OR Process}
3.1 Modification of Semantic Network
for Enhanced SemRep (4/4)
4: Clinical Actions
{Substance} ADMINISTERED_TO {Living Being}
{Process} MANIFESTATION_OF {Process}
{Substance} TREATS {Living Being OR Pathology}
5: Organism Characteristics
{Anatomy OR Living Being} LOCATION_OF {Substance}
{Anatomy} PART_OF {Anatomy OR Living Being}
{Process} PROCESS_OF {Living Being}
6: Co-existence
{Substance} CO-EXISTS_WITH {Substance}
{Process} CO-EXISTS_WITH {Process}
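The group-level rules in categories 1-6 can be sketched as a lookup: map each semantic type to its groups, then collect every predicate whose subject and object groups both match. The sketch assumes a small subset of the semantic types listed on the slides. `GROUPS`, `RULES`, and `licensed_predicates` are names invented for this illustration and are not Enhanced SemRep's actual implementation.

```python
# A subset of the semantic types listed for each group (illustrative only).
GROUPS = {
    "Substance": {"Amino Acid, Peptide, or Protein", "Antibiotic", "Carbohydrate"},
    "Anatomy": {"Anatomical Structure", "Cell", "Gene or Genome", "Tissue"},
    "Living Being": {"Animal", "Human", "Mammal", "Virus"},
    "Process": {"Acquired Abnormality", "Cell Function", "Congenital Abnormality"},
    "Pathology": {"Acquired Abnormality", "Congenital Abnormality", "Disease or Syndrome"},
}

# (subject groups, predicates, object groups), transcribed from categories 1-6.
RULES = [
    ({"Substance"}, {"ASSOCIATED_WITH", "PREDISPOSES", "CAUSES"}, {"Pathology"}),
    ({"Substance"}, {"INTERACTS_WITH", "INHIBITS", "STIMULATES"}, {"Substance"}),
    ({"Substance"}, {"AFFECTS", "DISRUPTS", "AUGMENTS"}, {"Anatomy", "Process"}),
    ({"Substance"}, {"ADMINISTERED_TO"}, {"Living Being"}),
    ({"Process"}, {"MANIFESTATION_OF"}, {"Process"}),
    ({"Substance"}, {"TREATS"}, {"Living Being", "Pathology"}),
    ({"Anatomy", "Living Being"}, {"LOCATION_OF"}, {"Substance"}),
    ({"Anatomy"}, {"PART_OF"}, {"Anatomy", "Living Being"}),
    ({"Process"}, {"PROCESS_OF"}, {"Living Being"}),
    ({"Substance"}, {"CO-EXISTS_WITH"}, {"Substance"}),
    ({"Process"}, {"CO-EXISTS_WITH"}, {"Process"}),
]


def groups_of(semantic_type: str) -> set:
    """A semantic type may belong to more than one group."""
    return {group for group, types in GROUPS.items() if semantic_type in types}


def licensed_predicates(subject_type: str, object_type: str) -> set:
    """Collect every predicate whose rule matches both argument groups."""
    subject_groups = groups_of(subject_type)
    object_groups = groups_of(object_type)
    licensed = set()
    for rule_subjects, predicates, rule_objects in RULES:
        if subject_groups & rule_subjects and object_groups & rule_objects:
            licensed |= predicates
    return licensed


print(sorted(licensed_predicates("Antibiotic", "Disease or Syndrome")))
print(sorted(licensed_predicates("Cell", "Mammal")))
```

Because a type such as ‘Acquired Abnormality’ appears in both Process and Pathology, group membership is modeled as a set rather than a single label.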
3.2 Empty Heads
Example:
We saw differential activation of CYP2C9 variants by
dapsone (a drug).
“Variant” maps to the semantic type ‘Qualitative Concept’.
We want “CYP2C9 variants” to be treated as a member of the
Substance group.
Several categories of terms are enumerated as
semantically empty heads, e.g. allele,
mutation, variant, levels, expression…
Words from these lists that have been labeled as
heads are hidden, and the word to their left is
relabeled as the head.
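The empty-head rule can be sketched as follows, assuming a noun phrase is given as a list of words with the syntactic head last. The list contains only the example terms from this slide. The plural folding and the repeated application to stacked empty heads are assumptions of this sketch, and `effective_head` is a hypothetical name.

```python
# Example empty-head terms from the slide; the actual lists are longer.
EMPTY_HEADS = {"allele", "mutation", "variant", "levels", "expression"}


def _normalize(word: str) -> str:
    # Crude plural folding (assumption): "variants" and "variant" compare equal.
    return word.lower().rstrip("s")


_EMPTY = {_normalize(word) for word in EMPTY_HEADS}


def effective_head(noun_phrase: list) -> str:
    """Return the head of a non-empty noun phrase, skipping semantically empty heads.

    The phrase is a list of words with the syntactic head in last position.
    """
    index = len(noun_phrase) - 1
    # Hide an empty head and relabel the word to its left as the head.
    # Repeating this for stacked empty heads is an assumption of this sketch.
    while index > 0 and _normalize(noun_phrase[index]) in _EMPTY:
        index -= 1
    return noun_phrase[index]


print(effective_head(["CYP2C9", "variants"]))
print(effective_head(["gingival", "hyperplasia"]))
```

With the head moved to "CYP2C9", the phrase takes the gene's semantic type instead of ‘Qualitative Concept’, so it can take part in Substance predications.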
3.3 Evaluation
The test set consisted of 300 sentences randomly
selected from the set of 36,577 sentences
containing drug and gene co-occurrences found
on the Web site bionlp.stanford.edu/genedrug.
These sentences were annotated by three
physicians (CBA, DD-F, MF).
They did not mark up all assertions in the
sentences, only those representing a
predication defined in Enhanced SemRep.
A total of 850 predications were assigned by the
annotators.
4. Results
Category                                                      Recall   Precision
Genetic Etiology (associated_with, cause, predispose)         74%      74%
Substance Relations (interact_with, inhibit, stimulate)       50%      73%
Pharmacological Effects (affect, disrupt, augment)            41%      68%
Clinical Actions (administered_to, manifestation_of, treat)   54%      84%
Organism Characteristics (location_of, part_of, process_of)   63%      71%
Total                                                         55%      73%
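Recall and precision in the table compare system output against the annotators' predications. A minimal sketch of that computation, assuming both sides are sets of (subject, predicate, object) triples, is below. The two sets are toy data built from the dexamethasone example, not the paper's evaluation data, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(gold: set, system: set) -> tuple:
    """Precision = TP / system predications; recall = TP / gold predications."""
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy annotator ("gold") predications for one sentence.
gold = {
    ("Dexamethasone", "STIMULATES", "Multidrug Resistance-Associated Proteins"),
    ("Dexamethasone", "NEG_INTERACTS_WITH", "Glucocorticoid receptor"),
    ("Multidrug Resistance-Associated Proteins", "PART_OF", "Rats"),
    ("Hepatocytes", "PART_OF", "Rats"),
}

# Toy system output: two correct predications and one that misses the negation.
system = {
    ("Dexamethasone", "STIMULATES", "Multidrug Resistance-Associated Proteins"),
    ("Hepatocytes", "PART_OF", "Rats"),
    ("Dexamethasone", "INTERACTS_WITH", "Glucocorticoid receptor"),
}

precision, recall = precision_recall(gold, system)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact triple matching is the simplest criterion. How the annotators judged partial matches is not stated on the slides.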
5.1 Discussion: Error Analysis
(1/2)
Word sense ambiguity (28%)
Ticlopidine (a platelet inhibitor) inhibition of
phenytoin metabolism mediated
by potent inhibition of CYP2C19 (a gene).
“Inhibition” was wrongly mapped to the concept
‘Psychological Inhibition’, producing the false predication:
CYP2C19 AFFECTS Psychological Inhibition.
5.1 Discussion: Error Analysis
(2/2)
Processing of coordinate structures (35%)
The cytotoxic activities of
mercaptopurine (a drug) and
fluorouracil (an antineoplastic antimetabolite) are regulated
by thiopurine methyltransferase (TPMT) and
dihydropyrimidine dehydrogenase (DPD),
respectively.
Fluorouracil INTERACTS_WITH DPD gene. (correctly extracted)
mercaptopurine INTERACTS_WITH thiopurine
methyltransferase. (missed)
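To illustrate why "respectively" matters here, the sketch below pairs two coordinated argument lists position by position when the cue is present, and falls back to all combinations otherwise. This is not how Enhanced SemRep handles coordination, which the slides do not describe. `pair_arguments` is a hypothetical helper that assumes the two coordinated lists have already been identified.

```python
from itertools import product


def pair_arguments(subjects: list, objects: list, respectively: bool) -> list:
    """Pair coordinated arguments: position by position, or all combinations."""
    if respectively:
        if len(subjects) != len(objects):
            raise ValueError("'respectively' needs lists of equal length")
        return list(zip(subjects, objects))
    return list(product(subjects, objects))


drugs = ["mercaptopurine", "fluorouracil"]
enzymes = ["thiopurine methyltransferase", "dihydropyrimidine dehydrogenase"]

predications = [
    (drug, "INTERACTS_WITH", enzyme)
    for drug, enzyme in pair_arguments(drugs, enzymes, respectively=True)
]
for predication in predications:
    print(predication)
```

Without the cue, the same lists would yield four candidate pairs, two of which the sentence does not assert.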
5.2 Processing Medline Citations on
CYP2D6 (1/3)
2849 Medline citations contain variant forms of
CYP2D6.
5219 predications containing CYP2D6 as an
argument were analyzed according to two
predication categories (Genetic Etiology and
Substance Relations).
The results were compared with the relations listed
for this gene on the PharmGKB Web site
(PharmacoGenetic Knowledge Base).
5.2 Processing Medline Citations on
CYP2D6 (2/3)
Genetic Etiology
267 total predications represented CYP2D6 as an
etiologic agent for a disease.
Parkinson’s disease (35), carcinoma of
the lung (21), tardive dyskinesia (15),
Alzheimer’s disease (9),
bladder carcinoma (8).
169 TP and 4 FP; two were found not to contain the
disease name in the referenced citation.
Only carcinoma of the lung occurs in PharmGKB.
5.2 Processing Medline Citations on
CYP2D6 (3/3)
Substance Relations
1128 total predications involve CYP2D6 and a drug.
69 drugs occurred 3 or more times in those predications;
of these, 41 drugs were in PharmGKB and 28 were not.
68 were true positives.
Inhibit CYP2D6: quinidine (45), paroxetine (34), fluoxetine (27),
fluvoxamine (8), sertraline (8).
Interact_with CYP2D6: bufuralol (27), antipsychotic agents (25),
dextromethorphan (21), venlafaxine (19), debrisoquin (18).
Quinidine and sertraline are not in PharmGKB.
Bufuralol is not in PharmGKB.
SemRep failed to capture: cocaine, levomepromazine,
maprotiline, trazodone, and yohimbine.
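The tally described on this slide (drugs occurring 3 or more times with CYP2D6, split by whether a knowledge base already lists them) can be sketched with a counter. The predications and the knowledge-base set below are placeholder toy data with invented drug names. They are not the paper's output or PharmGKB content, and both function names are hypothetical.

```python
from collections import Counter


def frequent_arguments(predications: list, gene: str, min_count: int = 3) -> dict:
    """Count subjects that occur with `gene` as object at least `min_count` times."""
    counts = Counter(subj for subj, _pred, obj in predications if obj == gene)
    return {drug: n for drug, n in counts.items() if n >= min_count}


def split_by_coverage(drugs, knowledge_base: set) -> tuple:
    """Separate drugs already in the knowledge base from those that are not."""
    known = sorted(d for d in drugs if d in knowledge_base)
    novel = sorted(d for d in drugs if d not in knowledge_base)
    return known, novel


# Placeholder predications (toy data, invented drug names).
toy_predications = (
    [("drug_a", "INHIBITS", "CYP2D6")] * 3
    + [("drug_b", "INTERACTS_WITH", "CYP2D6")] * 4
    + [("drug_c", "INHIBITS", "CYP2D6")]
)

frequent = frequent_arguments(toy_predications, "CYP2D6")
known, novel = split_by_coverage(frequent, knowledge_base={"drug_a"})
print(frequent, known, novel)
```

The threshold of 3 follows the slide. The counts here pool all predicates, whereas the slide reports inhibit and interact_with separately.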
6. Conclusion
This paper applies an existing NLP system in the
pharmacogenomics domain.
The major changes for developing Enhanced SemRep
from SemRep involved modifying the semantic space
stipulated by the UMLS Semantic Network.
The outputs are semantic predications that represent
assertions from Medline citations expressing a range of
specific relations in pharmacogenomics.
The information can support advanced information
management applications for pharmacogenomics
research and clinical care.
In the future, the authors intend to adapt the summarization
and visualization techniques developed for clinical text.