Customization of Gene Taggers for BeeSpace


Customizing Gene Taggers for BeeSpace
Jing Jiang
[email protected]
March 9, 2005
Entity Recognition in BeeSpace
• Types of entities we are interested in:
  – Genes
  – Sequences
  – Proteins
  – Organisms
  – Behaviors
  – …
• Currently, we focus on genes
• Currently, we focus on genes
Mar 9, 05
BeeSpace
2
Input and Output
• Input: free text (w/ simple XML tags)
  – <?xml version="1.0" encoding="UTF-8"?><Document id="1">…We have
    cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle
    (AMUSP) and examined its responses to JH. …</Document>
• Output: tagged text (XML format)
  – <?xml version="1.0" encoding="UTF-8"?><Document id="1">
    …<Sent><NP>We</NP> have <VP>cloned</VP> and <VP>sequenced</VP>
    <NP>a cDNA encoding <Gene>Apis mellifera ultraspiracle</Gene></NP>
    (<Gene>AMUSP</Gene>) and <VP>examined</VP> <NP>its responses to
    JH</NP>.</Sent>…</Document>
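On the consuming side, a downstream component can pull the tagged entities back out of this XML. A minimal sketch in Python, assuming the tag names shown above (`Document`, `Sent`, `NP`, `VP`, `Gene`); the exact schema is an assumption:

```python
import xml.etree.ElementTree as ET

# Tagged output in the format of the example above.
tagged = (
    '<Document id="1"><Sent><NP>We</NP> have <VP>cloned</VP> and '
    '<VP>sequenced</VP> <NP>a cDNA encoding <Gene>Apis mellifera '
    'ultraspiracle</Gene></NP> (<Gene>AMUSP</Gene>) and <VP>examined</VP> '
    '<NP>its responses to JH</NP>.</Sent></Document>'
)

# Parse the document and collect every <Gene> span in document order.
root = ET.fromstring(tagged)
genes = [g.text for g in root.iter("Gene")]
print(genes)  # ['Apis mellifera ultraspiracle', 'AMUSP']
```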
Challenges
• No complete gene dictionary
• Many variations:
– Acronyms: hyperpolarization-activated ion
channel (Amih)
– Synonyms: octopamine receptor (oa1, oar,
amoa1)
– Common English words: at (arctops), by (3R-B)
• Different genes, or a gene and its protein, may share the same name/symbol
Automatic Gene Recognition:
Characteristics of Gene Names
• Capitalization (especially acronyms)
• Numbers (gene families)
• Punctuation: -, /, :, etc.
• Context:
– Local: surrounding words such as “gene”,
“encoding”, “regulation”, “expressed”, etc.
– Global: same noun phrase occurs several times
in the same article
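These characteristics translate directly into binary features. A sketch, with illustrative feature names (not taken from any particular tool):

```python
import re

# Local-context cue words listed above.
CONTEXT_WORDS = {"gene", "encoding", "regulation", "expressed"}

def gene_features(token, left_word="", right_word=""):
    """Surface features for a candidate gene name."""
    return {
        "has_capital": any(c.isupper() for c in token),
        "all_caps": token.isupper() and len(token) > 1,    # acronyms
        "has_digit": any(c.isdigit() for c in token),      # gene families
        "has_punct": bool(re.search(r"[-/:]", token)),
        "local_context": left_word.lower() in CONTEXT_WORDS
                         or right_word.lower() in CONTEXT_WORDS,
    }

print(gene_features("3R-B"))
print(gene_features("Amih", left_word="gene"))
```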
Existing Tools
• KeX (Fukuda)
– Based on hand-crafted rules
– Recognizes proteins and other entities
– Requires substantial manual effort; not easy to modify
• ABNER & YAGI (Settles)
– Based on conditional random fields (CRFs) to
learn the “rules”
– ABNER identifies and classifies different
entities including proteins, DNAs, RNAs, cells
– YAGI recognizes genes and gene products
– No support for training on new data
Existing Tools (cont.)
• LingPipe (Alias-i, Inc.)
– Uses a generative statistical model based on
word trigrams and tag bigrams
– Can be trained
– Has two trained models
• Others
– NLProt (SVM)
– AbGene (rule-based)
– GeneTaggerCRF (CRFs)
Comparison of Existing Tools
• Performance on a few manually annotated,
public data sets (protein names):
– GENIA (2000 abstracts on “human & blood cell
& transcription factor”)
– Yapex (99 abstracts on “protein binding &
interaction & molecular”)
– UTexas (750 abstracts on “human”)
• Performance on a honeybee sample data
set:
– Biosis search “apis mellifera gene”
Comparison of Existing Tools
(cont.)
           GENIA         Yapex         UTexas
KeX        P:  0.3644    P:  0.3451    P:  0.1775
           R:  0.4191    R:  0.3931    R:  0.3445
           F1: 0.3898    F1: 0.3675    F1: 0.2343
ABNER      P:  0.7876    P:  0.4351    P:  0.3916
           R:  0.7485    R:  0.4441    R:  0.4314
           F1: 0.7675    F1: 0.4396    F1: 0.4105
LingPipe   P:  0.9298    P:  0.4168    P:  0.3633
           R:  0.7388    R:  0.4619    R:  0.3918
           F1: 0.8234    F1: 0.4382    F1: 0.3770
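For reference, the P/R/F1 numbers above follow the standard definitions. A quick sketch of how they are computed from tag counts (the counts below are made up for illustration):

```python
def prf1(tp, fp, fn):
    """precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = harmonic mean."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# e.g. 75 correctly tagged gene names, 25 spurious tags, 15 missed:
p, r, f1 = prf1(75, 25, 15)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.75 0.83 0.79
```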
Comparison of Existing Tools
(cont.)
• KeX on honeybee data
– False positives: company name, country name, etc.
– Does not differentiate between genes, proteins, and
other chemicals
• YAGI on honeybee data
– False negatives: not all occurrences of the same gene name are tagged
– Errors in entity typing and boundary detection
• LingPipe on honeybee data
– Similar to YAGI
Lessons Learned
• Machine learning methods outperform hand-crafted rule-based systems
• Machine learning methods suffer from over-fitting
• Existing tools need to be customized for
BeeSpace
– LingPipe is a good choice
• There is still room for better feature selection
– E.g., global context
Customization
• Train LingPipe on a better training data set
– Use fly (Drosophila) genes
– F1 increased from 0.2207 to 0.7226 on held-out fly data
– Tested on honeybee data; results:
• Some gene names are learned (Record 13)
• Some false positives are removed (proteins, RNAs)
• Some false positives are introduced
– The noisy training data can be further cleaned
• E.g., exclude common English words
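The proposed cleaning step could be as simple as filtering the noisy gene lexicon against an English word list. A sketch, with a tiny stand-in for a real English dictionary:

```python
# Toy stand-in for a real English word list (assumption for illustration).
COMMON_ENGLISH = {"at", "by", "to", "a", "an", "the", "in", "of"}

def clean_lexicon(gene_names):
    """Drop gene names that collide with common English words."""
    return [g for g in gene_names if g.lower() not in COMMON_ENGLISH]

# "at" (arctops) and "by" (3R-B) from the Challenges slide get filtered out.
lexicon = ["Amih", "at", "oa1", "by", "AmUSP"]
print(clean_lexicon(lexicon))  # ['Amih', 'oa1', 'AmUSP']
```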
Customization (cont.)
• Exploit more features such as global
context
– Occurrences of the same word/phrase should
be tagged all positive or all negative
• Differentiate between domain-independent
features and domain-specific features
– E.g., prefix “Am” is domain-specific for Apis
mellifera
– Features can be weighted based on their
contribution across domains
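The global-context constraint can be approximated by a post-processing pass that propagates the majority label to every occurrence of a phrase in a document. A sketch; the input format here is an assumption:

```python
from collections import Counter, defaultdict

def propagate_majority(tagged_phrases):
    """tagged_phrases: list of (phrase, label) pairs from one document.
    Re-labels every occurrence of a phrase with its majority label."""
    votes = defaultdict(Counter)
    for phrase, label in tagged_phrases:
        votes[phrase.lower()][label] += 1
    majority = {p: c.most_common(1)[0][0] for p, c in votes.items()}
    return [(phrase, majority[phrase.lower()]) for phrase, _ in tagged_phrases]

# Two of three occurrences were tagged "gene", so all three become "gene".
doc = [("AmUSP", "gene"), ("amusp", "gene"), ("AmUSP", "non-gene")]
print(propagate_majority(doc))
```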
Maximum Entropy Model
for Gene Tagging
• Given an observation (a token or a noun phrase),
together with its context, denoted as x
• Predict y ∈ {gene, non-gene}
• Maximum entropy model:
P(y|x) = K exp(Σi λi·fi(x, y))
• Typical f:
– y = gene & candidate phrase starts with a capital letter
– y = gene & candidate phrase contains digits
• Estimate i with training data
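A toy version of this classifier, with made-up weights λi rather than trained values, using the two example features above:

```python
import math

def feats(x, y):
    """Binary features; both example features fire only for y = 'gene'."""
    if y != "gene":
        return []
    return [x[:1].isupper(),               # candidate starts with a capital
            any(c.isdigit() for c in x)]   # candidate contains digits

def p_gene(x, lambdas=(1.2, 0.8)):
    """P(y|x) = K exp(sum_i lambda_i f_i(x, y)), normalized over both labels."""
    scores = {}
    for y in ("gene", "non-gene"):
        scores[y] = math.exp(sum(l * f for l, f in zip(lambdas, feats(x, y))))
    z = sum(scores.values())  # z = 1/K, the normalizer
    return scores["gene"] / z

print(round(p_gene("Amih"), 3))  # 0.769
```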
Plan: Customization with Feature
Adaptation
• λi^A: trained on large set of data in domain A (e.g., human or fly)
• λi^B: trained on small set of data in domain B (e.g., bee)
• λi' = αi·λi^A + (1 - αi)·λi^B: used for domain B
• αi: based on how useful fi is across different domains
  – Large αi if fi is domain-independent
  – Small αi if fi is domain-specific
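The interpolation itself is one line per feature. A sketch with illustrative α and λ values (none of these are trained numbers):

```python
def adapt(lam_a, lam_b, alpha):
    """Per-feature interpolation: lam' = alpha*lam_A + (1 - alpha)*lam_B."""
    return [a * la + (1 - a) * lb for la, lb, a in zip(lam_a, lam_b, alpha)]

lam_a = [1.0, 2.0]   # trained on a large fly data set (domain A)
lam_b = [0.2, 1.6]   # trained on a small bee data set (domain B)
alpha = [0.9, 0.1]   # f_0 is domain-independent, f_1 is domain-specific
print(adapt(lam_a, lam_b, alpha))  # ~[0.92, 1.64]
```

With a large αi the source-domain weight dominates (domain-independent feature); with a small αi the target-domain weight dominates (domain-specific feature).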
Issues to Discuss
• Definition of gene names:
– Gene families? (e.g., cb1 gene family)
– Entities with a gene name? (e.g., Ks-1
transcripts)
• Difference between genes and proteins?
– E.g., “CREB (cAMP response element binding
protein)” and “AmCREB”?
• How to evaluate the performance on
honeybee data?
The End
• Questions?
• Thank You!