Comments on Bunescu et al and Cohen and Sarawagi papers

Download Report

Transcript Comments on Bunescu et al and Cohen and Sarawagi papers

Distance functions and IE – 5
William W. Cohen
CALD
Announcements
• Current statistics:
–
–
–
–
days
with unscheduled student talks: 5
students with unscheduled student talks: 3
Projects are due: 4/28 (last day of class)
Additional requirement: draft (for comments)
no later than 4/21
String distance metrics so far...
• Term-based (e.g. TF/IDF as in WHIRL)
– Distance depends on set of words contained in both s and t – so sensitive
to spelling errors.
– Usually weight words to account for “importance”
– Fast comparison: O(n log n) for |s|+|t|=n
• Edit-distance metrics
– Distance is shortest sequence of edit commands that transform s to t.
– No notion of word importance
– More expensive: O(n2)
• Other metrics
– Jaro metric & variants
– Monge-Elkan’s recursive string matching
– etc?
• Which metrics work best, for which problems?
Results - Overall
Combining Information Extraction
and Similarity Computations
Krauthammer et al
Background
• Common task in proteomics/genomics:
– look for (soft) matches to a query sequence in a
large “database” of sequences.
– want to find subsequences (genes) that are
highly similar (and hence probably related)
– want to ignore “accidental” matches
– possible technique is Smith-Waterman (local
alignment)
• want char-char “reward” for alignment to reflect
confidence that the alignment is not due to chance
Background
• Common task in proteomics/genomics:
– look for (soft) matches to a query sequence in a
large “database” of sequences.
– want to find subsequences (genes) that are
highly similar (and hence probably related)
– want to ignore “accidental” matches
– possible technique is Smith-Waterman (local
alignment)
• want char-char “reward” for alignment to reflect
confidence that the alignment is not due to chance
Smith-Waterman distance
c
o
h
e
n
d
o
r
f
m
0
0
0
0
0
0
0
0
0
c
1
0
0
0
0
0
0
0
0
c
0
0
0
0
0
0
0
0
0
o
0
2
1
0
0
0
2
1
0
h
0
1
4
3
2
1
1
1
0
n
0
0
3
3
5
4
3
2
1
s
0
0
2
2
4
4
3
2
1
k
0
0
1
1
3
3
3
2
1
i
0
0
0
0
2
2
2
2
1
dist=5
In general
“peaks” in the
matrix scores
indicate
highly similar
substrings.
Background
• Common task in proteomics/genomics:
– look for (soft) matches to a query sequence in a
large “database” of sequences.
– possible technique is Smith-Waterman (local
alignment)
• want char-char “reward” for alignment to reflect
confidence that the alignment is not due to chance
• based on substitutability theory/stats for amino acids
– doesn’t scale well
• BLAST and FASTA: fast approximate S-W
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the
query string.
• FASTA:
– Use inverted indices to find out where these
words appear in the DB sequence
– Use S-W only near DB sections that contain
some of these words
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the
query string.
• BLAST:
– Generate variations of these words by looking
for changes that would lead to strong
similarities
– Discard “low IDF” words (where accidental
matches are likely)
– Use expanded set of n-grams to focus search
query string
words and expansions
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the query string.
• BLAST:
– Generate variations of these words by looking for changes that
would lead to strong similarities
– Discard “low IDF” words (where accidental matches are likely)
– Use expanded set of n-grams to focus search
• The BLAST program:
–
–
–
–
Widely used,
Fast implementation,
Supports asking multiple queries against a database at once...
Can one use it find soft matches of protein names (from a
dictionary) in text?
Basic idea:
• Biomedical paper
• Protein name dictionary
• Extracted protein name
(dict. entry->text)
• IE system:
dictionaries+BLAST
(optimized for this problem)
• Protein database
• Query strings
• Proposed alignment
(query->database)
• Query algorithm:
BLAST
1) Mapping text to DNA sequences
(Q: what sort of char similarity is this?)
2) Optimizing blast
• Split protein-name database into several parts (for
short, medium-length, long protein names)
– Scoring depends on length of matched string
• Require space chars before and after “short” protein
names.
• Manually search (grid search?) for better settings
for certain key parameters for each protein-name
subdatabase
– With what data?
• Evaluate on one review article, 1162 protein names
– inter-annotator agreement not great (70-85%)
2) Optimizing blast
2) Optimizing blast
Results
Results
Overall: precision 71.1%, recall 78.8%
(optimized)
IE with Dictionaries
Cohen & Sarawagi
Finding names you know about
• Problem: given dictionary of names, find them in
email text
– Important task beyond email (biology, link analysis,...)
– Exact match is unlikely to work perfectly, due to
nicknames (Will Cohen), abbreviations (William C) ,
misspellings (Willaim Chen), polysemous words (June,
Bill), etc
– In informal text it sometimes works very poorly
– Problem is similar to record linkage (aka data
cleaning, de-duping, merge-purge, ...) problem of
finding duplicate database records in heterogeneous
databases.
Finding names you know about
• Problem: given dictionary of names, find
them in email text
– Exact match is unlikely to work well for
informal text.
– Problem is similar to record linkage
– Hard to combine state of the art similarity
metrics (as used in record linkage) with state of
the art NER system due to representational
mismatch:
• Opening up the box, modern NER systems don’t
really know anything about names....
IE as Sequential Word Classification
A trained IE system
models the relative
probability of
labeled sequences
of words.
person name
location name
background
To classify, find the most likely state sequence for the given words:
Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated “person name”
state extract as a person name:
Person name: Pedro Domingos
IE as Sequential Word Classification
Modern IE systems use a rich representation for words, and
clever probabilistic models of how labels interact in a
sequence, but do not explicitly represent the names extracted.
wt - 1
identity of word
ends in “-ski”
is capitalized
is part of a noun phrase
is “Wisniewski”
is in a list of city names
is under node X in WordNet
part of
ends in
is in bold font
noun phrase
“-ski”
is indented
O t 1
is in hyperlink anchor
last person name was female
next two words are “and Associates”
wt
wt+1
…
…
Ot
O t +1
Semi-Markov models for IE
with Sunita Sarawagi, IIT Bombay
• Train on sequences of
labeled segments, not
labeled words.
S=(start,end,label)
• Build probability
model of segment
sequences, not word
sequences
• Define features f of
segments
• (Approximately)
optimize feature
weights on training
data
f(S) = words xt...xu, length, previous
words, case information, ..., distance to
known name
m
maximize:
 log Pr(S
i 1
i
| xi )
Details: Semi-Markov model
Details: Semi-Markov model
Conditional Semi-Markov models
CMM:
CSMM:
A training algorithm for CSMM’s (1)
Review: Collins’ perceptron training algorithm
Correct tags
Viterbi tags
A training algorithm for CSMM’s (2)
Variant of Collins’ perceptron training algorithm:
voted
perceptron
learner for
TTRANS
like Viterbi
A training algorithm for CSMM’s (3)
Variant of Collins’ perceptron training algorithm:
voted
perceptron
learner for
TTRANS
like Viterbi
A training algorithm for CSMM’s (3)
Variant of Collins’ perceptron training algorithm:
voted
perceptron
learner for
TSEGTRANS
like Viterbi
Sample CSMM features
Experimental results
• Baseline algorithms:
– HMM-VP/1: tags are “in entity”, “other”
– HMM-VP/4: tags are “begin entity”, “end entity”,
“continue entity”, “unique”, “other”
– SMM-VP: all features f(w) have versions for “f(w) true for
some w in segment that is first (last, any) word of segment”
– dictionaries: like Borthwick
• HMM-VP/1: fD(w)=“word w is in D”
• HMM-VP/4: fD,begin(w)=“word w begins entity in D”,
etc, etc
• Dictionary lookup
Datasets used
Used small training sets (10% of available) in experiments.
Results
Results: varying history
Results: changing the dictionary
Results: vs CRF
Results: vs CRF