Review of previous remarks, and comments on Krauthammer et al


Distance functions and IE – 4?
William W. Cohen
CALD
Announcements
• Current statistics:
– days with unscheduled student talks: 6
– students with unscheduled student talks: 4
– Projects are due: 4/28 (last day of class)
– Additional requirement: draft (for comments) no later than 4/21
The data integration problem
String distance metrics so far...
• Term-based (e.g. TF/IDF as in WHIRL)
– Distance depends on set of words contained in both s and t – so sensitive
to spelling errors.
– Usually weight words to account for “importance”
– Fast comparison: O(n log n) for |s|+|t|=n
• Edit-distance metrics
– Distance is shortest sequence of edit commands that transform s to t.
– No notion of word importance
– More expensive: O(n²) (see the sketch below)
• Other metrics
– Jaro metric & variants
– Monge-Elkan’s recursive string matching
– etc?
• Which metrics work best, for which problems?
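As a concrete reference point for the O(n²) cost noted above, here is a minimal sketch of unit-cost Levenshtein edit distance; this is illustrative code, not SecondString's implementation:

def levenshtein(s, t):
    """Unit-cost edit distance between s and t; O(|s|*|t|) time, O(|t|) space."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute cs -> ct
        prev = curr
    return prev[-1]

print(levenshtein("william", "willliam"))  # 1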
Jaro metric
Jaro-Winkler metric
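The Jaro and Jaro-Winkler formulas on these two slides were figures; as a hedged sketch, following the usual textbook definitions (matching window, transposition count, and a prefix boost capped at 4 characters):

def jaro(s, t):
    """Jaro similarity: characters matched within a sliding window,
    discounted by transpositions."""
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_match, t_match = [False] * len(s), [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_match[j] and t[j] == c:
                s_match[i] = t_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_chars = [c for i, c in enumerate(s) if s_match[i]]
    t_chars = [c for j, c in enumerate(t) if t_match[j]]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) // 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s, t, p=0.1):
    """Winkler's variant: boost the Jaro score for a shared prefix (up to 4 chars)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("martha", "marhta"), 3))  # 0.961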
So which metric should you use?
SecondString (Cohen, Ravikumar, Fienberg):
• Java toolkit of string-matching methods from AI,
Statistics, IR and DB communities
• Tools for evaluating performance on test data
• Exploratory tool for adding, testing, combining
string distances
– e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range [0,1]
• URL – http://secondstring.sourceforge.net
• Distribution also includes several sample
matching problems.
SecondString distance functions
• Edit-distance like:
– Levenshtein – unit costs
– untuned Smith-Waterman
– Monge-Elkan (tuned Smith-Waterman)
– Jaro and Jaro-Winkler
Results - Edit Distances
Monge-Elkan is the best on average....
Edit distances
SecondString distance functions
• Term-based, for sets of terms S and T:
– TFIDF distance
– Jaccard distance: sim(S,T) = |S ∩ T| / |S ∪ T|
– Language models: construct P_S and P_T and use Jensen-Shannon distance (next slide)
SecondString distance functions
• Term-based, for sets of terms S and T:
– TFIDF distance
– Jaccard distance
– Jensen-Shannon distance (sketched below)
• smoothing toward union of S,T reduces cost of disagreeing on common terms
• unsmoothed P_S, Dirichlet smoothing, Jelinek-Mercer smoothing
– “Simplified Fellegi-Sunter”
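A minimal sketch of two of the token-based distances above: Jaccard over token sets, and Jensen-Shannon divergence between unsmoothed unigram models P_S and P_T (the Dirichlet and Jelinek-Mercer smoothing variants are omitted here):

import math
from collections import Counter

def jaccard_sim(S, T):
    """sim(S,T) = |S intersect T| / |S union T| over token sets."""
    return len(S & T) / len(S | T) if (S | T) else 0.0

def jensen_shannon(s_tokens, t_tokens):
    """JS divergence between the unsmoothed unigram models of the two token lists."""
    ps = {w: c / len(s_tokens) for w, c in Counter(s_tokens).items()}
    pt = {w: c / len(t_tokens) for w, c in Counter(t_tokens).items()}
    mix = {w: 0.5 * ps.get(w, 0.0) + 0.5 * pt.get(w, 0.0) for w in set(ps) | set(pt)}
    def kl(p):  # KL(p || mix), summed over p's support
        return sum(pw * math.log(pw / mix[w]) for w, pw in p.items())
    return 0.5 * kl(ps) + 0.5 * kl(pt)

a, b = "nf kappa b protein".split(), "nf kappa b".split()
print(jaccard_sim(set(a), set(b)), jensen_shannon(a, b))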
Results – Token Distances
SecondString distance functions
• Hybrid term-based & edit-distance based:
– Monge-Elkan’s “recursive matching scheme”,
segmenting strings at token boundaries (rather
than separators like commas)
– SoftTFIDF (sketched below)
• Like TFIDF but consider not just tokens in both S
and T, but tokens in S “close to” something in T
(“close to” relative to some distance metric)
• Downweight close tokens slightly
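A hedged sketch of the SoftTFIDF idea: cosine-style TFIDF similarity in which a token of S may be paired with its closest token of T under a secondary metric (typically Jaro-Winkler; difflib's ratio is used below as a stand-in), provided that similarity exceeds a threshold, with each pairing downweighted by the similarity. The weighting details here are illustrative; SecondString's exact scheme may differ.

import difflib, math
from collections import Counter

def tfidf_weights(tokens, idf):
    """log-TF x IDF weights, normalized to unit length (unknown tokens get IDF 1)."""
    w = {t: math.log(c + 1) * idf.get(t, 1.0) for t, c in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

def soft_tfidf(s_tokens, t_tokens, idf, sim, theta=0.9):
    """Sum w_S(a) * w_T(b*) * sim(a, b*) over tokens a in S whose best
    match b* in T has sim(a, b*) >= theta."""
    ws, wt = tfidf_weights(s_tokens, idf), tfidf_weights(t_tokens, idf)
    total = 0.0
    for a in ws:
        best, best_sim = None, theta
        for b in wt:
            s_ab = sim(a, b)
            if s_ab >= best_sim:
                best, best_sim = b, s_ab
        if best is not None:
            total += ws[a] * wt[best] * best_sim
    return total

idf = {"proteins": 3.0, "protein": 3.0}  # illustrative IDF values
sim = lambda a, b: difflib.SequenceMatcher(None, a, b).ratio()
print(soft_tfidf("nf kappa b proteins".split(), "nf kappa b protein".split(), idf, sim))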
Results – Hybrid distances
Results - Overall
Prospective test on two clustering tasks
An anomalous dataset
An anomalous dataset: census
Why?
Other results with SecondString
• Distance functions over structured data
records (first name, last name, street, house
number)
• Learning to combine distance functions (see the sketch below)
• Unsupervised/semi-supervised training for
distance functions over structured data
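For “learning to combine distance functions”, one standard recipe (an assumption here, not necessarily the papers' exact setup) is to score each candidate pair with several base metrics and train a classifier on labeled match/non-match pairs; scikit-learn is assumed available:

from sklearn.linear_model import LogisticRegression

# Each candidate record pair is described by several base distance scores,
# e.g. [Jaro-Winkler, TFIDF cosine, scaled edit similarity].
X = [[0.97, 0.88, 0.92],   # labeled matches
     [0.95, 0.80, 0.85],
     [0.40, 0.10, 0.30],   # labeled non-matches
     [0.35, 0.05, 0.20]]
y = [1, 1, 0, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.90, 0.70, 0.80]])[0, 1])  # estimated P(match)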
Combining Information Extraction
and Similarity Computations
1) Bunescu et al
2) Krauthammer et al
Experiments
• Hand-tagged 50 abstracts for gene/protein entities
(pre-selected to be about human genes)
• Collected dictionary of 40,000+ protein names
from on-line sources
– not complete
– exact matching is not sufficient
• Approach: use hand-coded heuristics to propose
likely generalizations of existing dictionary
entries.
– i.e., rather than hand-coded or off-the-shelf similarity metrics
Example name generalizations
Basic idea behind the algorithm
original dictionary (OD) -> carefully-tuned heuristics (aka hacks) -> canonical dictionary (CD)
A similar (but not identical) process is applied to word n-grams from text to do IE: extract if n-gram -> CD (i.e., if its canonical form is in the CD)
Example: canonicalizing “short names”
(different procedure for “full names” and “one-word” names)
Dictionary side: “NF-25” in the OD canonicalizes to NF<n>; recognize NF<n> and Nf<n>; NF -> CD (from pattern <x><n>)
Text side: “... NF-kappa B ...” canonicalizes to NF<g><l>; check: NF in CD? (pattern <x><g><l>)
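A toy sketch of the canonicalization illustrated above: numeric tokens generalize to <n>, Greek-letter names to <g>, single letters to <l>, leaving patterns that can be matched against the canonical dictionary. The paper's actual heuristics are more elaborate and carefully tuned.

import re

GREEK = {"alpha", "beta", "gamma", "delta", "kappa", "lambda"}  # abbreviated list

def canonicalize(name):
    """Map a short name like 'NF-25' to a pattern like 'NF<n>'."""
    parts = re.split(r"[-\s]+", name.strip())
    out = [parts[0]]  # keep the head token, e.g. 'NF'
    for p in parts[1:]:
        if p.isdigit():
            out.append("<n>")   # numbers
        elif p.lower() in GREEK:
            out.append("<g>")   # Greek letter names
        elif len(p) == 1 and p.isalpha():
            out.append("<l>")   # single letters
        else:
            out.append(p)
    return "".join(out)

print(canonicalize("NF-25"))       # NF<n>
print(canonicalize("NF-kappa B"))  # NF<g><l>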
Results
• Why is precision less than 100%?
• When should you use “similarity by normalization”?
• Could a simpler algorithm do as well?
• Is there overfitting? (50 abstracts, <750 proteins)
...
Combining Information Extraction
and Similarity Computations
1) Bunescu et al
2) Krauthammer et al
Background
• Common task in proteomics/genomics:
– look for (soft) matches to a query sequence in a
large “database” of sequences.
– want to find subsequences (genes) that are
highly similar (and hence probably related)
– want to ignore “accidental” matches
– possible technique is Smith-Waterman (local
alignment)
• want char-char “reward” for alignment to reflect
confidence that the alignment is not due to chance
Smith-Waterman distance
[Figure: the Smith-Waterman dynamic-programming score matrix, one string along each axis (the characters read roughly “c o h e n d o r f” across and “m c c o h n s k i” down); the best local score is dist=5 at the peak.]
In general, “peaks” in the matrix scores indicate highly similar substrings.
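A minimal sketch of the Smith-Waterman recurrence behind the matrix above, with illustrative unit scores (match +1, mismatch and gap -1); biological versions use amino-acid substitution matrices instead:

def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Best local-alignment score between s and t; peaks in the score matrix
    mark highly similar substrings."""
    rows, best = [[0] * (len(t) + 1)], 0
    for cs in s:
        row = [0]
        for j, ct in enumerate(t, start=1):
            score = max(0,
                        rows[-1][j - 1] + (match if cs == ct else mismatch),
                        rows[-1][j] + gap,   # gap in t
                        row[j - 1] + gap)    # gap in s
            row.append(score)
            best = max(best, score)
        rows.append(row)
    return best

print(smith_waterman("mccohn ski", "cohen dorf"))  # 4: 'cohn ' ~ 'cohen ' (5 matches, 1 gap)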
Background
• Common task in proteomics/genomics:
– look for (soft) matches to a query sequence in a
large “database” of sequences.
– possible technique is Smith-Waterman (local
alignment)
• want char-char “reward” for alignment to reflect
confidence that the alignment is not due to chance
• based on substitutability theory for amino acids
– doesn’t scale well
• BLAST and FASTA: fast approximate S-W
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the
query string.
• FASTA:
– Use inverted indices to find out where these
words appear in the DB sequence
– Use S-W only near DB sections that contain some of these words (see the sketch below)
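A toy sketch of this FASTA-style seeding step: index the character n-grams (“words”) of the database sequence, then use the query's n-grams to locate candidate regions where a full S-W pass is worthwhile. Names are illustrative:

from collections import defaultdict

def ngram_index(db, n=3):
    """Inverted index: n-gram -> positions where it occurs in the database string."""
    index = defaultdict(list)
    for i in range(len(db) - n + 1):
        index[db[i:i + n]].append(i)
    return index

def candidate_positions(query, index, n=3):
    """Database positions sharing an n-gram with the query; run S-W only near these."""
    hits = set()
    for i in range(len(query) - n + 1):
        hits.update(index.get(query[i:i + n], []))
    return hits

db = "xxxmccohnskixxx"
print(sorted(candidate_positions("cohen", ngram_index(db))))  # [5]: the shared 'coh'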
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the
query string.
• BLAST:
– Generate variations of these words by looking
for changes that would lead to strong
similarities
– Discard “low IDF” words (where accidental
matches are likely)
– Use expanded set of n-grams to focus search
query string
words and expansions
BLAST/FASTA ideas
• Find all char n-grams (“words”) in the query string.
• BLAST:
– Generate variations of these words by looking for changes that
would lead to strong similarities
– Discard “low IDF” words (where accidental matches are likely)
– Use expanded set of n-grams to focus search (expansion sketched below)
• The BLAST program:
– Widely used
– Fast implementation
– Supports asking multiple queries against a database at once...
– Can one use it to find soft matches of protein names (from a dictionary) in text?
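A toy sketch of the word-expansion step above: for each query word, generate single-substitution variants, keep those whose score against the original word stays above a threshold, and (not shown) drop low-IDF words before searching. The threshold and alphabet are illustrative:

def expand_word(word, alphabet, match=1, mismatch=-1, threshold=None):
    """Single-substitution variants of `word` scoring >= threshold against it."""
    if threshold is None:
        threshold = match * len(word) - 2  # allow roughly one mismatch
    variants = {word}
    for i in range(len(word)):
        for c in alphabet:
            v = word[:i] + c + word[i + 1:]
            score = sum(match if a == b else mismatch for a, b in zip(word, v))
            if score >= threshold:
                variants.add(v)
    return variants

print(sorted(expand_word("coh", "acho")))  # 'coh' plus its one-letter variants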
Basic idea:
(the IE problem is mapped onto BLAST's native search problem)
• Biomedical paper ↔ protein database
• Protein name dictionary ↔ query strings
• Extracted protein name (dict. entry -> text) ↔ proposed alignment (query -> database)
• IE system: dictionaries + BLAST (optimized for this problem) ↔ query algorithm: BLAST
1) Mapping text to DNA sequences
(Q: what sort of char similarity is this?)
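A hedged sketch of the encoding step: each text character gets a unique fixed-width nucleotide string (the code table below is invented, not the paper's), so BLAST over the encoded text behaves roughly like a uniform per-character match/mismatch similarity, which is one answer to the question above:

from itertools import product

def make_code(chars, width=4):
    """Assign each text character a unique fixed-width DNA string (invented code)."""
    codes = ("".join(p) for p in product("ACGT", repeat=width))
    return {c: next(codes) for c in chars}

CODE = make_code("abcdefghijklmnopqrstuvwxyz0123456789- ")

def text_to_dna(text):
    """Encode text so a DNA search tool like BLAST can scan it."""
    return "".join(CODE[c] for c in text.lower() if c in CODE)

print(text_to_dna("NF-2"))  # one 4-nucleotide block per character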
2) Optimizing BLAST
• Split protein-name database into several parts (for
short, medium-length, long protein names)
• Require space chars before and after “short”
protein names.
• Manually search (grid search?) for better settings
for certain key parameters for each protein-name
subdatabase
– With what data?
• Evaluate on one review article, 1162 protein
names
– inter-annotator agreement not great (70-85%)
Results
Overall: precision 71.1%, recall 78.8% (opt)