Transcript Slide 1

Semantic Matching and Applications
Dr. Virendrakumar (Virendra) C. Bhavsar
Professor
(Dean 2003-2008)
Faculty of Computer Science
University of New Brunswick (UNB)
Fredericton, Canada
[email protected]
(Thanks - Lu Yang, Biplab Sarker, Harold Boley)
1
Faculty of Computer Science
The First “Faculty of CS” in Canada
University of New Brunswick
Fredericton, New Brunswick
Canada
Oldest English Language University in Canada
Established in 1785
2
3
4
Outline
• Motivation
• Syntactic Matching
• Semantic Matching
-Concept Similarity in a Taxonomy
-Schema Matching
• Weighted Tree Representation
• Partonomy Tree Similarity Algorithm
• Non-semantic matching on nodes
• Semantic Matching
• Inner nodes vs. leaf nodes
• Global similarity measure (inner nodes)
• Local similarity measures (for leaf nodes)
• Conclusion
5
Syntactic Matching
Exact String Matching
• Binary result 0.0 or 1.0
Permutation of strings
“Java Programming” versus “Programming in Java”
Number of identical words
Maximum length of the two strings
Example 1
For two node labels “a b c” and “a b d e”, their similarity is:
2
= 0.5
4
6
Syntactic Matching
Example 2:
Node labels “electric chair” and “committee chair”
1
= 0.5
2
meaningful?
• Syntactic Matching does not consider additional
domain knowledge
•Semantic matching techniques are needed for the
above problems
7
Semantic Matching Applications
• Databases
• e-Business
• Case-Based Reasoning (CBR)
• Information Retrieval
• Web Services
• Natural Language Processing
• Information Integration
• Other applications
8
e-Business Applications
Main
Server
Agents
User Info
User Profiles
…
User Agents
…
…
User
Web
Browser
To other sites
(network)
e-Market
…
Matcher1
Matchern
• e-business, e-learning …
• Buyer-Seller matching
• Metadata for buyers and sellers
• Keywords/keyphrases
9
Concept Similarity in a Taxonomy
• Taxonomy
A taxonomy is a scheme that partitions a body of knowledge and
defines the relationships among the pieces
Taxonomy
A
B
Given a taxonomy and two
concepts (e.g., A and B),
find the semantic similarity
(distance) of the two
concepts
10
Semantic Matching Techniques
• Schema Matching
• Element Matching
• Concept Matching
11
Schema Matching
• Schema
A schema is simply a set of elements connected by some structure
• Schema matching
(E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal 10:334-350, 2001)
•
Takes two schemas as input and produces a mapping between
elements of the two schemas that correspond semantically to each
other
• Applications
• Schema integration
• Data warehouses
• E-Commerce
• Semantic query processing
12
Concept Similarity in a Taxonomy ─ WordNet
• Semantic knowledge base: WordNet
(G.. Miller. Nouns in WordNet: A Lexical Inheritance System. International Journal of Lexicography, 1990, 3(4):245-264.)
WordNet is the product of a research project at Princeton University
which has attempted to model the lexical knowledge of a native speaker
of English
• Information in WordNet is organized around logical groupings called
synsets
• Each synset consists of synonymous word forms and semantic
pointers that describe relationships between current synset and other
synsets
13
Concept Similarity in a Taxonomy (Cont’d)
• A taxonomy is represented as a tree or a graph
• IS-A/HAS A (Contributes 80% among all the relationships)
• Part-of/Has-Part
• Classification of the semantic concept similarity measures
• Edge-based approaches — shortest path length
• Node-based approaches — node information content
• Combination of edge-based and node-based approaches
• Edge-weight assignment
14
Concept Similarity in a Taxonomy
─ Two Classical Algorithms (I)
• Edge-based approach — shortest path length
(R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man,
and Cybernetics. 1989, 19(1):17-30)
• Applied to the medical domain
• Intuition
The more similar two concepts are, the smaller the concept
distance is between them
• Distance (A, B) = minimum number of edges separating a and b
where, A and B are two concepts represented by the nodes a and
b in an IS-A semantic net
15
Concept Similarity in a Taxonomy
─ Two Classical Algorithms (I)
• Problems with the Rada et al. algorithm
• For larger or more general domains, this algorithm leads to
less accurate results than expected (Resnik et al. 1995, 1999)
• The distances between any two adjacent nodes are not
necessarily equal (Richardson & Smeaton 1995)
• It relies on the notion that links in a taxonomy represent
uniform distances (Wu & Palmer 1995, Richardson & Smeaton 1995,
Corley & Mihalcea 2005)
• Certain sub-taxonomies might be much denser than others
• Only IS-A relationship was considered (Resnik 1999)
16
Concept Similarity in a Taxonomy
─ Two Classical Algorithms (II)
• Node-Based Approach — Node Information Content (Resnik et
al.1995, 1999)
Goal: To solve the problem that certain sub-taxonomies are much
denser than others in a taxonomy
• Intuition
One key to the similarity of two concepts is the extent to which
they share information in common, indicated in an IS-A taxonomy
by a highly specific concept that subsumes them both
• Method
Associating probabilities with concepts in the taxonomy
function p: C→[0,1], such that for any c in C, p(c) is the probability of
encountering an instance of concept c
=>
p is monotonic as one moves up the taxonomy: if c1 IS-A c2, then p(c1)
≤ p(c2)
17
Concept Similarity in a Taxonomy
─ Two Classical Algorithms (II)
• Probability computation of a concept
freq(c) =
∑ count(n)
n in words(c)
where words(c) is the set of words subsumed by concept c
p(c) = freq(c)/N
where N is the total number of words observed
• Quantification of the information content*
-logp(c)
As probability increases, informativeness decreases, so the more
abstract a concept, the lower its information content
• Similarity computation
sim(c1, c2) =
max
[-logp(c)]
c in S(c1, c2)
where S(c1, c2) is the set of concepts that subsume both c1 and c2
* S. Ross. A First course in Probability. Macmillan, 1976.
18
Concept Similarity in a Taxonomy
─ Two Classical Algorithms (II)
• Problems of the information content based approach
Example (Richardson & Smeaton, 1995)
{Produce, Green goods} 3.034
{Fruit} 3.374
{Apple} 3.945
{Boxberry} 7.576
{Berry} 4.907
{Banana} 5.267
{Cranberry} 6.285
• It does not differentiate the similarity values of any pair of concepts in
a sub-hierarchy as long as their “smallest common ancestor” is the
same
e.g. sim(Apple, Berry) = sim(Apple, Banana) = 3.374
19
Concept Similarity in a Taxonomy
─ Two Classical Algorithms (II)
• It only considers the IS-A links too
• Structure information is overlooked
• It derives coarse similarity results
20
Concept Similarity in a Taxonomy
─ Other Algorithms
length *
(1) Sim   log
2* D
length is the length of the shortest path between two
concepts and D is the maximum depth of the taxonomy
2 * N3
(2)Sim 
N1  N2  2 * N3
N3
(Wu & Palmer, 1994)
N1 and N2 are the numbers of nodes from c1 and c2 to
their most specific common superconcept c3; N3 is the
number of nodes from c3 to the root of the taxonomy
(3)Sim 
Root
2 * IC( LCS )
IC(concept1)  IC(concept2 )
c3
N1
c1
N2
c2
(Lin, 1998)
IC is the information content computed by [4]; LCS is the Least Common Subsumer
21
* C. Leacock and M. Chodorow. Combining Local context and WordNet Sense Similarity for Word Sense Disambiguation. In
WordNet, An Electronic Lexical Database. The MIT Press, 1998.
Concept Similarity in a Taxonomy
─ Other Algorithms (Cont’d)
(4)Sim 
1
IC(concept1)  IC(concept2 )  2 * IC( LCS )
(Jiang-Conrath,
1997)
• (Li-Bandar-McLean, 2003)
• The main problem of previous algorithms:
Either information sources are directly used as metric of similarity
or a method uses a particular information source without
considering the contribution of others.
• Intuition
Concepts at upper layers of the hierarchy have more general
semantics and less similarity between them, while concepts at
lower layers have more concrete semantics and stronger
similarity. This leads to the consideration of concept depth.
• Three factors that affect concept similarity
Path length, depth, and density
22
Concept Similarity in a Taxonomy
─ Other Algorithms (Cont’d)
• The similarity of two concepts is represented by the function
s(c1, c2) = f ( f1(l), f2(h), f3(d))
where l, h, and d are the shortest path length, the depth of the nearest
common subsumer of c1 and c2, and the local density of c1 and c2, respectively
f1(l )  el
α is a constant
eh  eh
f 2 (h)  h h
e e
β>0
esim(c1 ,c2 )  esim(c1 ,c2 )
f3 (sim)  sim(c ,c ) sim(c ,c )
1 2 e
1 2
e
λ>0
After integration and evaluation:
eh  eh
s(c1, c2 )  f1(l )  f 2 (h)  el  h
e  eh
was reported as the best one
23
Concept Similarity in a Taxonomy
─ Evaluation of the Algorithms
• Basic idea
Investigates the algorithms’ performance against human common sense
• Benchmark word set
A commonly used word set is from an experiment by Rubenstein and
Goodenough1 (51/65) and 26 years later by Miller and Charles2(38/30). Subjects
rated the semantic similarity values from 0 to 4
• Reference results
Similarity method
Correlation
Resnik Replication to Miller-Charles
0.9583
Li-Bandar-McLean
0.8914
Jiang-Conrath
0.8484
Lin
0.8213
Resnik (Information Content)
0.745
Rada (Edge-Counting)
0.664
24
1
2
H. Rubenstein and J. B. Goodenough. Contextual Correlates of Synonymy. Comm. ACM. 1965. 8:627-633.
G. a. Miller and W. G. Charles. Contextual Correlates of Semantic Similarity. Language and Cognitive Processes, 1991, 6(1):1-28.
Concept Similarity in a Taxonomy
─ Edge-Weight Assignment
• Most of the aspects that should be considered are the
structural characteristics (Jiang & Conrath, 1995)
• Local network density
One easy way to measure the local density is to find the
number of children of a given node. The greater the density,
the closer the distance between the nodes Node depth
Distance shrinks as one descends the hierarchy, since
differentiation is based on finer and finer details
• Type of link
• Link strength
The closeness between a child node and its parent node,
against those siblings
25
Schema Matching
• Schema representations
Entity-Relationship (ER) model, object-oriented (OO) model, XML
and directed graphs
• Element vs. structure
Match is performed for individual schema elements or for
combinations of elements
26
Schema Matching ─ Approaches
• Cupid (Madhavn et al. 2001)
• Intended to be generic across data models
• Three phrases
• Linguistic element-level matching
• Transformation from the original schema to a tree and the
computation of the structural similarity between pairs of elements
• Mapping between elements
• Similarity flooding (Melnik et al. 2002)
• A graph matching algorithm
• Schemas are converted into directed labeled graphs
• The element-level mapping is fed into the SF matcher
• Filters are applied to select relevant subsets of match results
27
Schema Matching ─ Approaches (Cont’d)
• COMA (Do & Rahm 2002)
• Represents schemas as rooted directed acyclic graphs
• Combines match algorithms in a flexible way
• Represents a generic match system supporting different applications
and multiple schema types such as XML and relational schemas
• Allows an interactive and iterative match process
• Reuses the previously obtained match results
28
Recap
• Concept similarity matching in a taxonomy
• Edge-based
• Node-based
• Considerations
• Path length
• Node information content
• Node depth
• Edge weight
• Semantic matching are applicable in many applications
• Schema matching
• Structure matching
• Element matching
29
Motivation
• More and more on-line transactions (e.g. e-Bay, kajiji, etc.)
• Buyers and sellers input key words and/or specify values
for some product features
• A list of recommended sellers (with product advertisements)
and/or buyers (with product requests) is presented
• Flat representation of products cannot represent the
hierarchical ‘part-of’ relationship of product parts
• Match-making is not precise
• Negotiation space is big
30
e-Business Applications
Main
Server
Agents
User Info
User Profiles
…
…
User
Web
Browser
User Agents
…
To other sites
(network)
e-Market
…
Matcher1
Matchern
• e-business, e-learning …
• Buyer-Seller matching
•Ranked pairings of Buyers and Sellers
• Metadata for buyers and sellers
• Keywords/keyphrases
31
Match-Making
• Web service discovery [Field and Hoffner 2003]
[Kawamura et al. 2001] [Trastour et al. 2001]
• Multi-agent systems [Bertels et al. 2004] [Decker et al.
1996] [Marsh et al. 2003] [Ogston and Vassiliadis 2001]
• Database schema, XML schema, and Data integration
[Melnik et al. 2001] [Madhavan et al. 2001]
• Ontology integration [Jiang and Conrath 1997]
[Maguitman et al. 2005] [Sussan 1993]
•……
32
Partonomy Tree Similarity Algorithm
─ Tree Representation
• Tree
representation for product/service descriptions
[Bhavsar et al. 2004]
• Characteristics of our trees
• Node-labled, arc-labled and arc-weighted
• Sibling arcs are labled in lexicographical order
• Sibling arc weights sum to 1.0
A simple example “Car” tree:
Car
Year
Color
0.3
Make
0.5
0.2
Black
Ford
2002
33
Seller Weights (1)
• Advertisements on TV, Internet, and in
newspapers
• Sellers emphasize specific product/service features
to attract buyers
• Sellers seek buyers having preferences close
to their products
34
Seller Weights (2)
• No weights
seller
Car
Color
1/3
Make
1/3
Red
Ford
Year
1/3
2002
35
Seller Weights (3)
• Weights are not specified
buyer
0.7834
Car
Year
0.8
Color
0.1
seller
Car
Color
1/3
Make
1/3
Make
0.1
Black
Ford
2002
Red
Ford
Year
1/3
2002
36
Seller Weights (4)
• Sellers give weights
• All the seller trees below are identical except the arc weights
0.925
buyer
Car
Color
0.1
Black
Make
0.1
Year
0.8
Ford
2002
seller1
Car
Color
0.05
Red
0.85
Color
0.2
Red
Make
0.05
Ford
seller3
Car
Car
Ford
2002
0.65
seller2
Make
0.2
Year
0.9
Year
0.6
2002
Color
0.6
Red
Make
0.1
Ford
Year
0.3
2002
37
Seller Weights (5)
• Using seller weights, both buyers and sellers
can find the most promising trading partners
• The negotiation space is decreased
• Sellers can always select the same weight for
all fanout branches if they do not want to
emphasize any attributes of their
products/services
38
Tree Similarity/Distance (1)
Some important algorithms in literature:
• Node-labeled tree edit distance [Lu 1979] [Wang et al.
1994] [Shasha et al. 2001]
• Insertion, deletion and substitution
• Minimum cost to transform one tree to another one
• ATreeGrep [Shasha et al. 2002]
• Approximate tree searching
• Only nodes are labeled
• Siblings cannot be identical
• Schema tree matching — Cupid [Madhavan et al. 2001]
• Transformation from the original schema to a tree and
computation of the structural similarity between
pairs of elements
•……
39
Partonomy Tree Similarity Algorithm
─ Similarity Algorithm
• Partonomy
similarity [Bhavsar et al. 2004]
Fragments of learning object trees [Boley et al. 2005] for learning object
matching (http://www.cs.unb.ca/agentmatcher)
educational
0.3334
edu-set
general
0.5
format
0.5
tec-set
platform
0.5
technical
0.7
0.3
0.3333
gen-set
0.5
general
technical
0.3333
language title
lom
t´
lom
t
gen-set
language title format
0.2
0.8
tec-set
platform
0.1
0.9
*
WinXP
en
en
Introduction HTML WinXP
to Oracle
Basic
Oracle
* : Don’t Care
 (si (wi + w'i)/2)
A(si) ≥ si
 (A(si)(wi + w'i)/2)
40
Semantic Matching
• Inner nodes versus Leaf nodes
• Inner nodes — class-oriented
Inner node labels can be classes
• Classes are located in a taxonomy tree
• Taxonomic class similarity measure (global similarity measure)
• Leaf nodes — type-oriented
• Address, currency, date, price and so on
• Type similarity measures (local similarity measures)
41
Semantic Matching (Cont'd)
Non-Semantic
Matching
Exact String
Matching
(both inner
and leaf
nodes)
String
Permutation
(both inner
and leaf
nodes)
Semantic
Matching
Taxonomic
Class Similarity
(inner nodes)
Type Similarity
(leaf nodes)
42
Semantic Matching ─ Global Similarity
• Global similarity measure (for inner nodes) [Yang et al. 2005]
Distributed Programming
Tuition
Credit
0.2 Duration Textbook0.4
0.1
0.3
2months “Introduction $800
3
to Distributed
Programming”
Object-Oriented Programming
Tuition
Credit
0.1 Duration Textbook0.2
0.5
0.2
$1000
3months “Objected-Oriented
3
Programming
Essentials”
t1
t2
Partonomy trees
43
Tree Similarity Algorithm Development
— Tree Similarity Function
• Desired General Properties
• Similarity function maps to [0.0, 1.0]
• Reflexivity
sim(t,t) =1.0
• Symmetry (could be relaxed)
sim(t,t’) = sim(t’,t)
• Negative properties
• The following two conditions do not hold
• If sim(x,y) = 1.0, then x = y
•
“Triangle Inequality”
[sim(x,y)+ sim(y,z)] sim(x,z) >= sim(x,y) sim(y,z)
44
Tree Similarity Algorithm Development
— Tree Similarity Function
• Desired Special Properties
•
•
•
•
Breadth/Depth Robustness
Subtree Aggregation
Taxonomic class similarity of inner node labels
Local similarity of leaf node labels
Select a tree, t, in the tree data set as the reference tree, the
similarity between t and the trees in the data set:
Similarity
1.0
Similarity
approaches 0.0
0.0
t
Trees
45
Tree Similarity Algorithm Development
— Tree Similarity Algorithm
Desired properties
• Realize all desired properties of the function
• (Recursive) Top-down traversal via subtree pairs under
identically labeled arcs
• Bottom-up similarity computation
• Combining weights of identically labeled arcs
• Inner node similarity
• Node-label similarity — syntactic or semantic
• Aggregation of subtree similarities
• Leaf node similarity — syntactic or semantic
46
Tree Similarity/Distance (2)
• Arc-weighted tree distance [Amenta et al. 2007]
• Point representation of trees
• Arcs can be weighted/unweighted
• Only leaf nodes are labeled
• No arc labels
• Robinson-Foulds distance is employed
47
Tree Simplicity
• Buyers and sellers may omit some product/service
features, which lead to missing (sub)trees in
product/service trees
• Calculate the simplicity of the missing (sub)tree for
the similarity computation
• The simpler the missing (sub)tree, the greater its
similarity and the empty tree
48
Tree Simplicity
•
Desired properties
A tree simplicity function on a tree t is a real valued function simp(t) → R[0.0, 1.0] with
the
following properties:
1) Breadth property
simp[bt] ]: N1→ R[0.0, 1.0]
bt is the fanout of tree t. For two trees t and t':
bt > bt' => simp(t) < simp(t')
The simplicity of a tree monotonically decreases with increasing tree breadth.
2) Depth property
simp[dt]]: N0→ R[0.0, 1.0]
dt is the depth of tree t. For two trees t and t':
dt > dt' => simp(t) < simp(t')
The simplicity of a tree monotonically decreases with increasing tree depth.
49
Semantic Matching ─ A Taxonomy Tree
• The taxonomy tree of “Programming Techniques” according
to the ACM Computing Classification System
(http://www.acm.org/class/1998/ccs98.txt)
Programming Techniques
0.6
0.7
Object-Oriented
0.9
0.8
0.5
Programming
0.5
General
Concurrent
Programming
Sequential
Applicative
Automatic 0.7
0.5
Programming
Programming Programming
Parallel
Distributed
Programming Programming
50
Semantic Matching
─ Taxonomic Class Similarity
• The arc weights can be determined by human experts or
machine learning algorithms [Singh 2005]
• Sibling arc weights do not need to add up to 1
• Three factors that affect the taxonomic class similarity
• The shortest path length between two classes
• Arc weights on the shortest path
• Level difference of two classes
51
Semantic Matching
─ Taxonomic Class Similarity
• Taxonomic class similarity computation [Yang et al. 2005]
dc1 dc2
Ns
TS (c1 , c2 )  (1 
) * M *G
Nt
where
TS(c1, c2) is the taxonomic class similarity of classes c1 and c2
Ns: the number of edges of the shortest path
Nt: the number of edges of the whole tree
M: the product of the arc weights on the shortest path
d c1  d c2
: the level difference factor where G’s value is in
(0.0, 1.0) and | dc1  dc2 | is the absolute difference of the
depths of classes c1 and c2 (We assume G=0.5 here)
G
52
Semantic Matching
─ Taxonomic Class Similarity
Example
Programming Techniques
0.7
0.6
General
0.8
0.5
0.5
Applicative
Automatic 0.7
Programming Programming
0.9
Object-Oriented
Programming
Concurrent
Programming
Sequential
0.5
Programming
Parallel
Distributed
Programming Programming
• red arrows stop at their nearest common ancestor
TS (Distributed Programming, Object - Oriented Programming)
3
2 1
 (1  ) * (0.7 * 0.5 * 0.7) * 0.5
 0.0766
8
53
Semantic Matching
─ Encoding Subtaxonomies
• Encoding subtaxonomy trees into partonomy trees
• A converse task
Computes the similarity of pairs of taxonomies
e.g. subtaxonomies of the background taxonomy, as
required in our Teclantic project (http://teclantic.cs.unb.ca)
• Allows the direct reuse of our partonomy similarity
algorithm and permits weighted (or ‘fuzzy’) taxonomic
subsumption with no added effort
54
Semantic Matching
─ Encoding Subtaxonomies
Background Taxonomy tree of “Programming Techniques” for encoding
Programming Techniques
Applicative
Concurrent
Object-Oriented
Programming
Programming
Programming
Automatic
0.1
0.1
0.15
Programming
General
0.3
0.2
*
*
*
*
*
Distributed
Parallel
Programming Programming
0.4
0.6
*
Sequential
Programming
0.15
*
*
• Sibling arc weights must sum up to 1.0
• Classes are represented as arc labels (lexicographical ordered)
• All node labels except the root node label are changed into “Don’t Care”
55
Semantic Matching
─ Encoding Subtaxonomies
Two course trees with encoded subtaxonomy trees
course
course
Classification
Tuition
Duration
Title 0.05
0.65
Credit
0.15
0.1
taxonomy
$800
0.05
2months
3
Programming
1.0
Techniques
* Sequential
Concurrent
Programming
Programming
0.3
0.7
*
*
Parallel
Distributed
Programming
Programming
0.4
0.6
*
Distributed
Programming
Classification
Tuition
Duration
0.65
Title
0.05
Credit
0.05 0.05
taxonomy
$1000
0.2
3months
3
Programming
1.0
Techniques
* Sequential
Object-Oriented
Programming
Programming
0.2
0.8
*
*
Object-Oriented
Programming
*
• Weight assignment in the "Classification" branch (two options)
• By human expert
• By machine learning
• Normalizes corresponding weights in the background taxonomy
56
Semantic Matching ─ Local Similarity
• Local similarity measures (for leaf nodes)
Special-purpose similarity measures for various data
types realizing semantics to be invoked when computing
similarity of any two of their instances
• “Price” type
• “Date” type [Yang et al. 2005]
•...
57
Semantic Matching ─ Price Matching
• Price
• Price is the omnipresent factor that determines buyers’ and
sellers’ decision-making
• Price similarity seems to be asymmetric for buyers and sellers
e.g. buyer asks $800 and seller asks $1000 — Unsuccessful
buyer asks $1000 and seller asks $800 — Successful
The similarity of $800 and $1000 is different for the above cases
58
Semantic Matching ─ Price Matching
• Transform
the asymmetry to symmetry
• Buyers and sellers always have price ranges in their minds
[Bpref, Bmax] and [Smin, Spref]
Bpref : buyer’s preferred price
Bmax : buyer’s maximum acceptable price
Smin : seller’s minimum acceptable price
Spref : seller’s preferred price
• Our price-range similarity measure is based on the
intuition that the greater the overlap between the buyer’s
and seller’s price ranges, the higher is their similarity value
59
Semantic Matching ─ Price Matching Algorithm
• Pseudo code of the price-range similarity algorithm
PriceRangeSim ([Bpref, Bmax], [Smin, Spref])
Begin
If Spref <= Bpref similarity = 1.0
else if Bmax < Smin similarity = 0.0
else if Bmax = Smin
similarity =
0.005
MAX  MIN
else
{ MIN = min{MIN, Smin}
MAX = max{MAX, Bmax}
similarity =
Bmax  Smin
MAX  MIN
}
return similarity
End.
• This algorithm can be easily adapted to the “price”-typed attributes
e.g. “salary range” in job seeking and recruiting e-Market
60
Semantic Matching ─ Date Matching
• “Date”-typed leaf node similarity measure
if | d1 – d2 | ≥ 365
0.0
{
DS(d1, d2) =
1–
| d1 – d2 |
365
otherwise
where DS(d1, d2) is the date similarity of two dates d1 and d2
Project
Project
start_date
0.5
end_date
0.5
Nov 3, 2004
May 3, 2004
t1
start_date
0.5
0.74 end_date
0.5
Jan 20, 2004
Feb 18, 2005
t2
61
Evaluation
•
Analytic study of algorithmic properties
(for special
cases)
•
Experiments to assess properties of functions
algorithms
•
Comparison with other algorithms
Arc-weighted tree distance [Amenta et al. 2007]
Weighted keywords/phrases similarity
[Marsh 2003]
Online hotel search [Niemann 2006]
-
and
62
Implementation, Testing and
Evaluation
• Implementation
• In Java
• Testing on systematically varied cases
• Evaluation
• Experiments to assess properties of algorithms
• Analytic study of algorithmic properties
• Show the behavior of the algorithms with increasing
breadth and depth
63
Implementation, Testing and
Evaluation
• Comparison with other algorithms
• Arc-weighted tree distance [Amenta et al. 2007]
• Weighted keywords/phrases [Marsh et al. 2003]
• ATreeGrep [Shasha et al. 2002]
• Node-labeled tree edit distance [Lu 1979]
64
Partonomy Tree Similarity Algorithm
─ Application
• eduSource
project
prefilter parameters
(Query URI)
user
input
(1)
End
user
UI
(Java)
Search
Results
(3)
(4) (5)
WOO (2)
RuleML file
Similarity
Engine
(Java)
prefiltered
WOO RuleML (6) (7)
CanCore files
files
Translator
(XSLT)
Prefilter
(SQL)
CanCore
files
partial
CanCore
files
CANLOM
(XML)
(c)
LOMGen
(Java)
HTML
files
(a)
LOR
(HTML)
(b)
Recommended
results (8)
Administrator
input
DATABASE
(Access)
Administrator
Keyword Table
65
Partonomy Tree Similarity Algorithm
─ Application
• Teclantic
protal http://www.teclantic.ca
•ca)
66
Conclusion
• Weighted trees for product/service descriptions
• Partonomy tree similarity algorithm
• Synchronously traverses trees top-down
• Aggregates intermediate similarity values bottom-up
• Semantic Global and Local Matching
• Taxonomic Class Similarity
• Encoding Subtaxonomies into Partonomies
• Leaf-Node Type Similarity Measures
• Future Work
• Improvement of Taxonomic Class Similarity
• Generalization of Local Similarity Measures
67
Publications
1. Lu Yang, B. K. Sarker, V. C. Bhavsar, and H. Boley. “Range
Similarity and Satisfaction Measures for Buyers and Sellers in eMarketplaces”. Journal of Intelligent Systems, pp. 61-65, 2007.
(Revised version of [2])
2. Lu Yang, Biplab K. Sarker, Virendrakumar C. Bhavsar, and Harold
Boley, "Range Similarity Measures between Buyers and Sellers in eMarketplaces", In Proceedings of the Second Indian International
Conference on Artificial Intelligence, Pune, Dec. 20-22, 2005.
3. Lu Yang, Marcel Ball, Virendrakumar C. Bhavsar, and Harold Boley,
"Weighted Partonomy-Taxonomy Trees with Local Similarity
Measures for Semantic Buyer-Seller Match-Making". Journal of
Business and Technology.
68
Publications
4. Harold Boley, Virendrakumar C. Bhavsar, David Hirtle, Anurag Singh,
Zhongwei Sun, and Lu Yang, "A Match-Making System for Learners
and Learning Objects", International Journal of Interactive
Technology and Smart Education, 2(3), 2005.
5. Jing Jin, Biplab K. Sarker, V.C. Bhavsar and H. Boley, and Lu Yang,
"Towards a Weighted-Tree Similarity Algorithm for RNA Secondary
Structure Comparison", In Proceedings of the 8th International
Conference on High Performance Computing in Asia Pacific Region,
IEEE Computer Society, December 2005.
6. Lu Yang, Marcel Ball, Virendrakumar C. Bhavsar, and Harold Boley,
69
Publications
7. Lu Yang, Biplab K. Sarker, Virendrakumar C. Bhavsar, and Harold
Boley, "A Weighted-Tree Simplicity Algorithm for Similarity Matching
of Partial Product Descriptions", In Proceedings of ISCA 14th
International Conference on Intelligent and Adaptive Systems and
Software Engineering, Toronto, 2005, pp.55-60.
8. Virendrakumar C. Bhavsar, Harold Boley, and Lu Yang, "A WeightedTree Similarity Algorithm for Multi-Agent Systems in e-Business
Environments", Computational Intelligence, 20(4), pp.584-602, 2004.
9. Riyanarto Sarno, Lu Yang, Virendrakumar C. Bhavsar, and Harold
Boley, "The AgentMatcher Architecture Applied to Power Grid
Transactions", In Proceedings of the First International Workshop on
Knowledge Grid and Grid Intelligence, Halifax, 2003, pp.92-99.
10.Virendrakumar C. Bhavsar, Harold Boley, and Lu Yang, "A
Weighted-Tree Similarity Algorithm for Multi-Agent Systems in eBusiness Environments", In Proceedings of 2003 Business Agents
and the Semantic Web (BASeWEB'03) Workshop, Halifax, Canada,
June 14, 2003.
70
Theses
Lu Yang, Weighted Tree Similarity, Ph.D. Thesis, in progress.
Anurag Singh, “LOMGenIE: A Weighted
September 2005.
Tree Metadata Extraction Tool,” MCS Thesis,
Mathhieu Sebastien, “Match-making in Bartering Scenarios,” MCS Thesis, December 2005.
Jin Jin Jing, “Similarity of Weighted Directed Acyclic Graphs,” MCS Thesis, September 2006.
56. Jie Li, “Rule-based Social Networking for Expert Finding,” (co-supervisor: Dr. Harold
Boley, NRC), MCS Thesis, September 2006.
71
Thank you !
72
Semantic Matching Review - References
[1]. R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man,
and Cybernetics. 1989, 19(1):17-30.
[2]. M. Sussna. Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network. Proceedings of the Second International
conference on Information and Knowledge Management, Arlington, Virginia, 1993.
[3]. Z. Wu and M. Palmer. Verb Semantics and Lexical Selection. Proceedings of the 32nd Annual Meeting of the Associations for Computational
Linguistics, Las Cruces, New Mexico, 1994, 133–138.
[4]. P. Resnik. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of the 14th International Joint Conference on
Artificial Intelligence, Montreal, August 1995, 1:448-453.
[5]. R. Richardson, and A. F. Smeaton. Using WordNet in a Knowledge-Based Approach to Information Retrieval. Working Paper, CA-0395, School
of Computer Applications, Dublin City University, Ireland, 1995.
[6]. J. J. Jiang and D. W. Conrath. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Proceedings of International Conference
Research on Computational Linguistics (ROCLING X), Taiwan, 1997.
[7]. D. Lin. An Information-Theoretical Definition of Similarity. Proceedings of the Fifth International Conference on Machine Learning, Morgan
Kaufmann Publishers Inc., 1998, 296-304.
[8]. P.Resnik. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language.
Journal of Artificial Intelligence Research, 1999, 11:95-130.
[9]. Y. Li, Z. A. Bandar, and D. McLean. An Approach for Measuring Semantic Similarity between Words Using Multiple Information Sources. IEEE
Transactions on Knowledge and Data Engineering, 2003, 15(4):871-882.
[10]. C. Corley and R. Mihalcea. Measuring the Semantic Similarity of Texts. Proceedings of the ACL Workshop on empirical Modeling of Semantic
Equivalence and Entailment, Ann Arbor, June 2005, 13-18.
[11]. A. G. Maguitman, F. Menczer, H. Roinestad and A. Vespingnan. Algorithmic Detection of Semantic Similarity. Proceedings of WWW 2005,
Chiba, Japan, May 10-14, 2005.
[12]. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal 10:334-350, 2001.
[13]. J. Madhavan, P. A. Bernstein and E. Rahm. Generic schema matching with Cupid. Technical Report. Microsoft Research. August, 2001.
[14]. S. Melnik, H. G. Molina and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. In
Proceedings of 18th Intl. Conf. on Data Engineering (ICDE), San Jose, CA, 2002.
[15]. H.-H. Do and E. Rahm. COMA - A System for Flexible Combination of Schema Matching Approaches. The VLDB Journal, 2002.
[16]. Institute of Electrical and Electronics Engineers. (1990) IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer
Glossaries. New York, NY.
[17] G. Miller. Nouns in WordNet: A Lexical Inheritance System. International Journal of Lexicography, 1990, 3(4):245-264.
73
References
Amenta, N., M. Godwin, N. Postarnakevich, and K. St. John. (2007) Approximating Geodesic Tree Distance. Information Processing Letters 103, 61-65.
Bertels, K., N. Panchanathan, and S. Vassiliadis. (2004) Centralized Matchmaking for Minimal Agents. Proceedings of the Conference on Parallel and
Distributed Computer Systems, November, 2004.
Decker, K., M. Williamson, and K. Sycara. (1996) Matchmaking and Brokering. Proceedings of the Second International Conference on Multi-Agent
Systems.
Do, H.-H., and E. Rahm. (2002) COMA--A System for Flexible Combination of Schema Matching Approaches. Proceedings of 28th International Conference
on Very Large Databases, Hong Kong, 610-621.
Field, S., and Y. Hoffner. (2003) Web services and matchmaking. Int. J. Networking and Virtual Organisations, 2(1), 16-32.
Jiang, J. J., and D. W. Conrath. (1997) Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Proceedings of International Conference
Research on Computational Linguistics (ROCLING X), Taiwan.
Kawamura, T., T. Hasegawa, and A. Ohsuga. (2001) Proposal of Semantics-based Web Service Matchmaking. International Conference on Computational
Intelligence and Multimedia Applications.
Lu, S. (1979) A tree-to-tree distance and its application to cluster analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2),
219-224.
Madhavan, J., P. A. Bernstein and E. Rahm. Generic schema matching with Cupid. Technical Report. Microsoft Research. August, 2001.
Maguitman, A. G., F. Menczer, H. Roinestad and A. Vespingnan. (2005) Algorithmic Detection of Semantic Similarity. Proceedings of WWW 2005, Chiba,
Japan, May 10-14, 2005.
Marsh, S., A. Ghorbani and V. C. Bhavsar. (2003) The ACORN Multi-Agent System, Web Intelligence and Agent Systems, IOS Press, Amsterdam, 1(1), 121.
Melnik, S., H. Garcia-Molina and E. Rahm. (2001) Similarity Flooding: A Versatile Graph Matching Algorithm. Extended Technical Report,
http://dbpubs.stanford.edu/pub/2001-25.
Ogston, E., and S. Vassiliadis. (2001) Matchmaking among Minimal Agents without a Facilitator. Proceedings of the 5th International Conference on
carnomous Agents, 608-615.
Shasha, D., J. Wang, and K. Zhang. (1994) Exact and Approximate Algorithm for Unordered Tree Matching. IEEE Transactions on Systems, Man and
Cybernetics, 24(4), 668-678.
Shasha, D., J. Wang, and R. Giugno. (2002) Algorithms and Applications of Tree and Graph Searching. Proc. Of the 21th ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems. 39-52.
Sussna, M. (1993) Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network. Proceedings of the Second International
conference on Information and Knowledge Management, Arlington, Virginia.
Trastour, D., C. Bartolini, and J. G-C. (2001) A Semantic Web Approach to Service Description for Matchmaking of Services. Internal report, HewlettPackard Company.
Wang, J., K. Zhang, K. Jeong, and D. Shasha (1994) A System for Approximate Tree Matching. IEEE Transactions on Knowledge and Data Engineering.
6(4), 559-570.
Yang, L., V. C. Bhavsar, and H. Boley. (2008) On Semantic Concept Similarity Methods. The 4th International Conference on Information &
Communication Technology and Systems (ICTS), Indonesia (to appear).
74
References
Rada, R., H. Mili, E. Bicknell, and M. Blettner. (1989) Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man, and
Cybernetics, 19(1), 17-30.
Lin, D. (1998) An Information-Theoretical Definition of Similarity. Proceedings of the Fifth International Conference on Machine Learning, Morgan
Kaufmann Publishers Inc., 296-304.
Jiang, J. J., and D. W. Conrath. (1997) Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Proceedings of International Conference
Research on Computational Linguistics (ROCLING X), Taiwan.
Yang, L., B. K. Sarker, V. C. Bhavsar, and H. Boley. (2007) Range Similarity and Satisfaction Measures for Buyers and Sellers in e-Marketplaces. Journal
of Intelligent Systems.
Niemann, M., Malgorzata Mochol, and Robert Tolksdrof. (2006) Improving Online Hotel Search - What Do We Need Semantics for?. Semantics: The New
Paradigm Shift in IT November 28 - 30, Vienna, Austria.
75