Link Mining Tasks - UMD Department of Computer Science

Download Report

Transcript Link Mining Tasks - UMD Department of Computer Science

Link Mining
Lise Getoor
University of Maryland, College Park
joint work with Indrajit Bhattacharya, Qing Lu and Prithviraj Sen
08/22/2004
MRDM 2004 Workshop
1
Roadmap
• Intro to Link Mining
– Link Mining Tasks
– Link Mining Challenges
• Some Current Projects
– Link-based Classification
• Link-based classification using a variety of link
descriptions
• Link-based classification using labeled and unlabeled data
– Link-based Clustering
• Entity detection
• Group Detection
• Conclusion
08/22/2004
MRDM 2004 Workshop
2
Link Mining
• Traditional machine learning and data mining
approaches assume:
– A random sample of homogeneous objects from single
relation
• Real world data sets:
– Multi-relational, heterogeneous and semi-structured
• Link Mining
– newly emerging research area at the intersection of
research in social network and link analysis, hypertext and
web mining, graph mining, relational learning and inductive
logic programming
08/22/2004
MRDM 2004 Workshop
3
Linked Data
• Heterogeneous, multi-relational data
represented as a graph or network
– Nodes are objects
• May have different kinds of objects
• Objects have attributes
• Objects may have labels or classes
– Edges are links
• May have different kinds of links
• Links may have attributes
• Links may be directed, are not required to be binary
08/22/2004
MRDM 2004 Workshop
4
Sample Domains
•
•
•
•
•
•
•
•
web data (web)
bibliographic data (cite)
epidimiological data (epi)
communication data (comm)
customer networks (cust)
collaborative filtering problems (cf)
trust networks (trust)
biological data (bio)
08/22/2004
MRDM 2004 Workshop
5
Link Mining Tasks
•
•
•
•
•
•
•
•
•
Link-based Object Classification
Object Type Prediction
Link Type Prediction
Predicting Link Existence
Link Cardinality Estimation
Object Consolidation
Group Detection
Subgraph Discovery
Metadata Mining
08/22/2004
MRDM 2004 Workshop
6
Link-based Object Classification
• Predicting the category of an object based on its
attributes and its links and attributes of linked
objects
• web: Predict the category of a web page, based on words
that occur on the page, links between pages, anchor text,
html tags, etc.
• cite: Predict the topic of a paper, based on word occurrence,
citations, co-citations
• epi: Predict disease type based on characteristics of the
patients infected by the disease
08/22/2004
MRDM 2004 Workshop
7
Object Class Prediction
• Predicting the type of an object based on its
attributes and its links and attributes of linked
objects
• comm: Predict whether a communication contact is by email,
phone call or mail.
• cite: Predict the venue type of a publication (conference,
journal, workshop)
08/22/2004
MRDM 2004 Workshop
8
Link Type Classification
• Predicting type or purpose of link based on properties
of the participating objects
• web: predict advertising link or navigational link; predict an
advisor-advisee relationship
• epi: predicting whether contact is familial, co-worker or
acquaintance
08/22/2004
MRDM 2004 Workshop
9
Predicting Link Existence
• Predicting whether a link exists between two objects
• web: predict whether there will be a link between two pages
• cite: predicting whether a paper will cite another paper
• epi: predicting who a patient’s contacts are
08/22/2004
MRDM 2004 Workshop
10
Link Cardinality Estimation I
• Predicting the number of links to an object
• web: predict the authoratativeness of a page based on the
number of in-links; identifying hubs based on the number of outlinks
• cite: predicting the impact of a paper based on the number of
citations
• epi: predicting the number of people that will be infected based
on the infectiousness of a disease.
08/22/2004
MRDM 2004 Workshop
11
Link Cardinality Estimation II
• Predicting the number of objects reached along a
path from an object
• Important for estimating the number of objects that
will be returned by a query
• web: predicting number of pages retrieved by crawling a site
• cite: predicting the number of citations of a particular author in
a specific journal
08/22/2004
MRDM 2004 Workshop
12
Object Consolidation
• Predicting when two objects are the same, based on
their attributes and their links
• aka: record linkage, duplicate elimination, identity
uncertainty
• web: predict when two sites are mirrors of each other.
• cite: predicting when two citations are referring to the same
paper.
• epi: predicting when two disease strains are the same
• bio: learning when two names refer to the same protein
08/22/2004
MRDM 2004 Workshop
13
Group Detection
• Predicting when a set of entities belong to
the same group based on clustering both
object attribute values and link structure
• web – identifying communities
• cite – identifying research communities
08/22/2004
MRDM 2004 Workshop
14
Subgraph Identification
• Find characteristic subgraphs
• Focus of graph-based data mining (Cook &
Holder, Inokuchi, Washio & Motoda,
Kuramochi & Karypis, Yan & Han)
• bio – protein structure discovery
• comm – legitimate vs. illegitimate groups
• chem – chemical substructure discovery
08/22/2004
MRDM 2004 Workshop
15
Metadata Mining
• Schema mapping, schema discovery, schema
reformulation
• cite – matching between two bibliographic sources
• web - discovering schema from unstructured or semistructured data
• bio – mapping between two medical ontologies
08/22/2004
MRDM 2004 Workshop
16
Link Mining Tasks
•
•
•
•
•
•
•
•
•
Link-based Object Classification
Object Type Prediction
Link Type Prediction
Predicting Link Existence
Link Cardinality Estimation
Object Consolidation
Group Detection
Subgraph Discovery
Metadata Mining
08/22/2004
MRDM 2004 Workshop
17
Link Mining Challenges
•
•
•
•
•
•
•
•
Logical vs. Statistical dependencies
Feature construction
Instances vs. Classes
Collective Classification
Collective Consolidation
Effective Use of Labeled & Unlabeled Data
Link Prediction
Closed vs. Open World
Challenges common to any link-based statistical model (Bayesian
Logic Programs, Conditional Random Fields, Probabilistic Relational
Models, Relational Markov Networks, Relational Probability Trees,
Stochastic Logic Programming to name a few)
08/22/2004
MRDM 2004 Workshop
18
Logical vs. Statistical Dependence
• Coherently handling two types of dependence
structures:
– Link structure - the logical relationships between
objects
– Probabilistic dependence - statistical relationships
between attributes
• Challenge: statistical models that support rich
logical relationships
• Model search complicated by the fact that
attributes can depend on arbitrarily linked
attributes -- issue: how to search this huge
space
08/22/2004
MRDM 2004 Workshop
19
Model Search
P1
I1
P3
P2
A1
?
P
08/22/2004
MRDM 2004 Workshop
20
Feature Construction
• In many cases, objects are linked to a set of
objects. To construct a single feature from
this set of objects, we may either use:
– Aggregation
– Selection
08/22/2004
MRDM 2004 Workshop
21
Aggregation
P7
P1
P4
P3
P2
I1
A1
mode
I2
08/22/2004
P92
P6
A2
?
P
P8
P51
mode
?
P
MRDM 2004 Workshop
22
Selection
P1
P3
P2
I1
A1
P
?
08/22/2004
MRDM 2004 Workshop
23
Individuals vs. Classes
• Does model refer
– explicitly to individuals
– classes or generic categories of individuals
• On one hand, we’d like to be able to model that a
connection to a particular individual may be highly
predictive
• On the other hand, we’d like our models to generalize
to new situations, with different individuals
08/22/2004
MRDM 2004 Workshop
24
Instance-based Dependencies
I1
P3
A1
Papers that
cite P3 are
likely to be
08/22/2004
MRDM 2004 Workshop
25
Class-based Dependencies
?
I1
?
A1
Papers that
cite
are
likely to be
08/22/2004
MRDM 2004 Workshop
26
Collective classification
• Using a link-based statistical model for
classification
• Inference using learned model is complicated
by the fact that there is correlation between
the object labels
08/22/2004
MRDM 2004 Workshop
27
Collective consolidation
• Using a link-based statistical model for
object consolidation
• Consolidation decisions should not be made
independently
08/22/2004
MRDM 2004 Workshop
28
Labeled & Unlabeled Data
• In link-based domains, unlabeled data provide
three sources of information:
– Helps us infer object attribute distribution
– Links between unlabeled data allow us to make use
of attributes of linked objects
– Links between labeled data and unlabeled data
(training data and test data) help us make more
accurate inferences
08/22/2004
MRDM 2004 Workshop
29
Link Prior Probability
• The prior probability of any particular link is
typically extraordinarily low
• For medium-sized data sets, we have had
success with building explicit models of link
existence
• It may be more effective to model links at
higher level--required for large data sets!
08/22/2004
MRDM 2004 Workshop
30
Closed World vs. Open World
• The majority of SRL approaches make a
closed world assumption, which assumes that
we know all the potential entities in the
domain
• In many cases, this is unrealistic
• Work by Milch, Marti, Russell on BLOG
08/22/2004
MRDM 2004 Workshop
31
Link Mining Summary
• Link Mining Tasks
– Link-based Object
Classification
– Object Type Prediction
– Link Type Prediction
– Predicting Link Existence
–
–
–
–
–
Link Cardinality Estimation
Object Consolidation
Group Detection
Subgraph Discovery
Metadata Mining
• Link Mining Challenges
– Logical vs. Statistical
dependencies
– Feature construction
– Instances vs. Classes
– Collective Classification
08/22/2004
– Collective Consolidation
– Effective Use of Labeled &
Unlabeled Data
– Link Prediction
– Closed vs. Open World
MRDM 2004 Workshop
32
Roadmap
• Intro to Link Mining
– Link Mining Tasks
– Link Mining Challenges
• Some Current Projects
Link-based Classification
• work with Qing Lu and Prithviraj Sen
• Link-based classification using a variety of link
descriptions
• Link-based classification using labeled and unlabeled data
– Link-based Clustering
• Entity detection
• Group Detection
• Conclusion
08/22/2004
MRDM 2004 Workshop
33
Object Classification
• Traditional Object
Classification
– Assume objects sampled
from a single relation
– Object Attributes (OA)
X4
X5
X3
X2
X1
08/22/2004
MRDM 2004 Workshop
34
Object Classification with Linked Data
• Traditional Object
Classification
– Assume objects sampled
from a single relation
– Object Attributes (OA)
X4
X5
X3
• Linked Data
 Links among objects
 Represented as a graph
X2
X1
08/22/2004
MRDM 2004 Workshop
35
Link-based Object Classification
• Predicting the category of an object
based on its attributes and its links
and attributes of linked objects
• Citation domain: Predict the topic of a
paper, based on word occurrence, citations
and co-citations
08/22/2004
MRDM 2004 Workshop
36
Related Work: Link-based Classification
• Hypertext Classification using Links
– Class labels of linked objects
• Soumen Chakrabarti (1998)
• Oh, et al. (1999)
– Unique document ID, Popescul et al. (2002)
– Regularities, Yang et al. (2002)
• Use of Unlabeled Data
 Co-training, Blum and Mitchel (1998)
 EM-algorithm, Nigam et al. (2000)
 Systematical investigation of EM and Co-training, Ghani
(2001)
 TSVM, Joachims (1999)
08/22/2004
MRDM 2004 Workshop
37
Our Approach
• Link-based models
– Integrate link features with object attributes
using logistic regression
– Investigate use of labeled and unlabeled data for
link-based classification
08/22/2004
MRDM 2004 Workshop
38
Features
• Object Attributes
– Notation: OA(X)
• Link Descriptions
– Notation: LD(X)
– Statistics computed from linked objects
– Computed separately for each of:
•
•
•
•
In-Links(X)
Out-Links(X)
Co-In(X)
Co-Out(X)
– Three types of Link Descriptions:
• Mode, Binary, Count
08/22/2004
MRDM 2004 Workshop
39
Link Descriptions
Categories
X
08/22/2004
MRDM 2004 Workshop
40
Link Descriptions
Categories
In-Links(X)
mode:
X
08/22/2004
MRDM 2004 Workshop
41
Link Descriptions
Categories
In-Links(X)
mode:
Out-Links(X)
mode:
X
08/22/2004
MRDM 2004 Workshop
42
Link Descriptions
Categories
In-Links(X)
mode:
Out-Links(X)
mode:
X
CO(X)
mode:
08/22/2004
MRDM 2004 Workshop
43
Link Descriptions
Categories
In-Links(X)
mode:
Out-Links(X)
mode:
X
CO(X)
mode:
08/22/2004
CI(X)
mode:
MRDM 2004 Workshop
44
Link Descriptions
Categories
In-Links(X)
mode:
binary: (1,1,1)
Out-Links(X)
mode:
binary: (1,1,0)
X
CO(X)
mode:
CI(X)
mode:
binary: (1,1,0)
08/22/2004
binary: (1,0,0)
MRDM 2004 Workshop
45
Link Descriptions
Categories
In-Links(X)
mode:
binary: (1,1,1)
count: (3,1,1)
Out-Links(X)
mode:
binary: (1,1,0)
count: (1,2,0)
X
CO(X)
mode:
CI(X)
mode:
binary: (1,1,0)
count: (2,1,0)
08/22/2004
binary: (1,0,0)
count: (2,0,0)
MRDM 2004 Workshop
46
Predictive Model for Classification
• A structured logistic regression
– Compute P(c | OA(X)) and P(c | LD(X)) separately using
separate logistic regression models
Pc | LDf x 

Pc 
f{ In,O ut,C I,C O}
ĉ(x)  arg max Pc | OAx 
c
– where OA(X) are the object attributes and LDf(X) are the
link features
08/22/2004
MRDM 2004 Workshop
47
Prediction
• category set {
}
P1
P2
P3
P5
P4
Step 1: Bootstrap using object attributes only
08/22/2004
MRDM 2004 Workshop
48
Prediction
P1
P2
P3
P5
P4
Step 2: Iteratively update the category of each
object, based on linked object’s categories
08/22/2004
MRDM 2004 Workshop
49
Data Sets
Data Set papers citations categories vocabulary
CoraI
3181
6185
7
1400
CoraII
3300
11794
10
3174
CiteSeer
3600
7522
6
3000
08/22/2004
MRDM 2004 Workshop
50
Experiment I
Content and Link Effectiveness
Content Only
Links Only
90
Content + Links
80
Avg F1 Measure
70
60
50
40
30
20
10
0
CoraI
CoraII
CiteSeer
Data Sets
08/22/2004
MRDM 2004 Workshop
51
Experiment II
Effectiveness of Different Link Types and Models
90
80
Avg F1 Measure
70
IN-links
60
OUT-links
50
CI-links
40
CO-links
30
ALL
20
10
0
Mode
Binary
CoraI
Count
Mode
Binary
Count
CoraII
Mode
Binary
Count
CiteSeer
Data Sets and Models
08/22/2004
MRDM 2004 Workshop
52
Experiment III
• Setup
– 20% data as test data
– remaining data: 20%, 40%, 60%, 80% labeled data
• Link-based classification using labeled and
unlabeled data
– Labeled-only: learn model using only labeled data
– Labeled and Unlabeled: learn model using both
labeled and unlabeled data
08/22/2004
MRDM 2004 Workshop
53
Accuracy
Learning with Labeled and
Unlabeled Data
0.82
0.8
0.78
0.76
0.74
0.72
0.7
0.68
0.66
0.64
0.62
labeled & unlabeled
labeled only
content only
20
40
60
80
Percentage Labeled Data
08/22/2004
MRDM 2004 Workshop
54
Ordering Strategies
0.85
Average Accuracy
0.83
0.81
RAND 1
0.79
DEC PP 1
0.77
DEC OUTLINKS 1
0.75
INC OUTLINKS 1
0.73
DEC PP 2
0.71
DEC OUTLINKS 2
0.69
INC OUTLINKS 2
0.67
RAND (HARD CLASSIF.)
DEC PP (HARD CLASSIF.)
0.65
0
100 200 300
400 500 600 700
800 900 1000 1100
Iteration
08/22/2004
MRDM 2004 Workshop
55
LBC: Summary
• Variety of ways of describing link neighborhoods
– Mode, Binary, Count
– In-links, Out-links, CI-links and CO-links
• In link-based classification, unlabeled data provide
useful information:
– Helps us infer object attribute distribution
– Links between unlabeled data allow us to make use of
attributes of linked objects
– Links between labeled data and unlabeled data (training data
and test data) help us make more accurate inferences
• Link-based Challenges addressed:
– Feature construction
– Collective classification
– Use of labeled and unlabeled data
08/22/2004
MRDM 2004 Workshop
56
Roadmap
• Intro to Link Mining
– Link Mining Tasks
– Link Mining Challenges
• Some Current Projects
– Link-based Classification
• Link-based classification using a variety of link
descriptions
• Link-based classification using labeled and unlabeled data
Link-based Clustering
• work with Indrajit Bhattacharya
• Entity detection
• Group Detection
• Conclusion
08/22/2004
MRDM 2004 Workshop
57
Deduplication and Group Detection
• Object Consolidation
– Observations come with noise or multiple
representations
• Multiple entries for the same person in a customer
database
• Group Detection
– Identify groups of similar entities
• Group authors by research interest
08/22/2004
MRDM 2004 Workshop
58
Terminology
Entities
References
Aho, A. V.
Alfred V Aho
Alfred Aho
AV Aho
Alfred Aho, John Hopcroft, Jeffrey Ullman
Links
AV Aho, BW Kernighan, PJ Weinberger
Entity Groups
G1
(Programming Languages)
G2
(Databases)
G3
(Algorithms)
08/22/2004
MRDM 2004 Workshop
59
Deduplication and Group Detection
• The two problems need to be addressed
together
– Goldberg and Senator, KDD 95
Knowledge
KDD Tools
DB´
DB
Consolidation
&
Link Formation
08/22/2004
MRDM 2004 Workshop
60
Related Work: Deduplication
• AI, Machine Learning
• Statistics
–
–
–
–
–
–
String similarity measures, Monge & Elkan; Cohen
Object Consolidation, ejada et al.
Learning string distances, Bilenko & Mooney
Active learning, Sarawagi et al
Coreference resolution, Mccallum & Wellner
Identity uncertainty, Pasula et al
– Blocking, Newcombe
– “Match/non-match”, Fellegi & Sunter
– EM with match variable, Winkler
• Databases
– Efficient record linkage, Hernandez & Stolfo,
Monge & Elkan
– Use of co-occurrence, Chaudhuri et al,
Ananthakrishna et al
08/22/2004
MRDM 2004 Workshop
61
Related Work: Group Detection
• Text Retrieval
– Spectral techniques, Ng, Jordan & Weiss; Dhillon et al;
Ding et al
– Probabilistic modeling with latent variables, Hofmann;
Blei, Ng & Jordan; Rosen-Zvi et al
• Hypertext Mining
– Eigen decomposition for ranking, Brin & Page; Kleinberg
– Finding web communities, Gibson et al
– “Missing link”, Cohn & Hofmann
• Probabilistic Link Modeling
– Generative model for links, Getoor et al.; Kubica et al
08/22/2004
MRDM 2004 Workshop
62
Paper Resolution Problem
• Example
– R. Agrawal, R. Srikant. Fast algorithms for mining
association rules in large databases. In VLDB-94,
1994.
– Rakesh Agrawal and Ramakrishnan Srikant. Fast
Algorithms for Mining Association Rules. In Proc. of the
20th Int'l Conference on Very Large Databases,
Santiago, Chile, September 1994.
• Traditionally, string similarity
08/22/2004
MRDM 2004 Workshop
63
Author Resolution Problem
• Given a set of papers, determine the set of
authors
• First and middle names vary
– Check common name transforms
• Problems remain
– How about ‘J. Smith’ and ‘John Smith’?
– Do two instances of ‘J. Smith’ refer to the same
author?
• Use co-author relationships
08/22/2004
MRDM 2004 Workshop
64
Author Deduplication: Example
Alfred V Aho
A V Aho
Jeffrey D Ullman
J D Ullman
S C Johnson
P1: Code generation for machines
with multiregister operations
P2: The universality of database
languages
A V Aho
Alfred V Aho
J D Ullman
Jeffrey D Ullman
P3: Optimal partial-match retrieval
when fields are independently
specified
Aho P1
Aho P2
S C Johnson
P4: Code generation for expressions
with common subexpressions
Aho P1,P2,P3,P4 Aho P3
Aho P4
Ullman P1
Ullman P2Ullman P1,P2,P3,P4Ullman P3
Ullman P4
Johnson P1
Johnson P1,P4
Johnson P4
08/22/2004
MRDM 2004 Workshop
65
Deduplication Using Links
• Cluster similar author references into
duplicates
– Problem: define appropriate distance
measure
• Weighted combination of attribute and
link distances
08/22/2004
MRDM 2004 Workshop
66
Link Distances for Deduplication
• To compare two author references, compare all their
links/relations
• Distance between two links
– How many duplicates do they share?
– d(l1,l2) = 1 – |duplicates(l1,l2)| / max(|l1|,|l2|)
• Distance between Link Sets: Link detail distance
– d(l,L) = minl’ in L d(l,l’)
– ddetail(L1,L2) = avg[avgl in L1 d(l,L1), avgl in L2 d(l,L2)]
• Distance between Link Sets: Link summary distance
– Group detail distance is costly because of pair-wise
comparison between group sets
– Maintain group summary: all unique references in the group
set
– dsumm(L1,L2)=d(sum1,sum2)
08/22/2004
MRDM 2004 Workshop
67
Group Detection: Example
Groups
Entities
PL
R. Sethi
Algorithms
Databases
A. Aho
J. Ullman
J. Hopcroft
Alfred Aho, John Hopcroft, Jeffrey Ullman, Data Structures and Algorithms
Links
AV Aho, R Sethi, J D Ullman, Compilers: Principles, Techniques and Tools
• Problem: Discover the hidden set of groups and
mapping from entities to groups
08/22/2004
MRDM 2004 Workshop
68
Group Detection Using Links
• Cluster similar links into groups
Alfred Aho, John Hopcroft, Jeffrey Ullman,
Design and Analysis of Computer Algorithms
Algorithms
Alfred Aho, John Hopcroft, Jeffrey Ullman,
Data Structures and Algorithms
AV Aho, R Sethi, J D Ullman,
Compilers: Principles, Techniques and Tools
PL &
Compilers
Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger,
The AWK Programming Language
Alfred V. Aho, Jeffrey D. Ullman,
Principles of Compiler Design
08/22/2004
MRDM 2004 Workshop
69
Link Distances for Group Detection
• Define cluster distance considering generation
probability of observed links given model M
• Probabilistic Distance dLP of two clusters
– What is the change in probability of observed data if the
two clusters are merged?
• Group summary distance dsumm
– dLP is lower for higher overlap in entity sets of two clusters
– But entity sets are the link summaries
– Use summary distance as approximation of dLP
08/22/2004
MRDM 2004 Workshop
70
Deduplication / Group Detection
Using Clustering
• Each cluster contains currently known
duplicates / links from the same group
• Initialize clusters
– Deduplication: using attribute distance only
• Select candidates for merge using threshold
• Greedily pick best candidate using d(ci,cj)
– Compare different distance measures
• Update link sets / summaries and cluster
distances and continue
08/22/2004
MRDM 2004 Workshop
71
Evaluation: Data Generator Parameters
• Structural Parameters
– Number of author-entities and groups
– Degree of overlap among the groups
• Noise parameters
– For deduplication data, generate noisy attributes
for each entity in the links
• Generative Parameters
– Number of links
– Mean size of links
08/22/2004
MRDM 2004 Workshop
72
Evaluation: Metric
• Diversity of clusters
– How many different entities does a cluster
contain?
– Links from how many different groups does a
cluster contain?
• Dispersion of entities/groups
– How many different clusters is an entity spread
over?
– How many different clusters are links from the
same group spread over?
• Measure average dispersion and diversity
08/22/2004
MRDM 2004 Workshop
73
Results: Deduplication
• Algorithm Parameters: mixing weight and threshold
– Superior results with link distances
Dispersion vs diversity for varying alpha
Dispersion vs diversity for varying threshold
1000
1000
900
800
attribute
700
summary
cluster diversity
cluster diversity
900
600
500
400
300
attribute
700
summary
600
500
400
300
200
200
100
100
0
0
1
•
800
1.2
1.4
1.6
1.8
2
entity dispersion
2.2
2.4
1
1.2
1.4
1.6 1.8
2 2.2
entity dispersion
2.4
2.6
2.8
Detailed results in DMKD ’04 paper, Iterative Record Linkage for
Cleaning and Integration, Indrajit Bhattacharya and Lise Getoor.
08/22/2004
MRDM 2004 Workshop
74
Results: Distance Measures
• Comparison of attribute distance, group
summary distance and group detail distance
Entity Dispersion
2.3
3
2.1
2.5
2
1.5
1
2200 2000 1800 1600 1400 1200 1000 800
cluster diversity
3.5
attribute
1.9
group_summary
1.7
group detail
1.5
attribute
group_summary
group_detail
1.3
2200 2000 1800 1600 1400 1200 1000 800
# clusters
08/22/2004
entity dispersion
Cluster Diversity
# clusters
MRDM 2004 Workshop
75
Deduplicating Real Data
• Machine learning papers from Citeseer
– Citations hand-matched by Steve Lawrence et al
– Author references hand-matched by Culotta &
McCallum
– 1504 paper citations
– 2892 author references
– 1167 author entities identified
• Initial results
– Link summary improves entity dispersion over
attribute clustering
– Discovered labeling errors that are hard to
identify considering attributes only
08/22/2004
MRDM 2004 Workshop
76
Deduplicating Real data
• 174 | 610 | barron_a_r | A.R. Barron
• 175 | 610 | barron_r_l | R.L. Barron
• Not the same entity
– Barron, A.R., Barron, R.L., 1988. Statistical
learning networks: a unifying view. In: 1988
Symposium on the Interface: Statistics and
Computer Science, pp. 192-203.
08/22/2004
MRDM 2004 Workshop
77
Deduplicating Real data
• 2097 | 8460 | ramakrishnan_c_r | C. R. Ramakrishnan
• 2098 | 8460 | ramakrishnan_i_v | I. V. Ramakrishnan
• Not the same entity
– A Symbolic Constraint Solving Framework for Analysis of
Logic Programs, C.R. Ramakrishnan, I.V. Ramakrishnan and
R. Sekar, ACM Conference on Partial Evaluation and
Semantics based Program Manipulation (PEPM), June 1995
08/22/2004
MRDM 2004 Workshop
78
Deduplicating Real data
• Parse Error
– 1734 | 7010 | minton_andrew_b_philips_steven |
Andrew B. Philips Steven Minton
• Same entity as
– 1735 | 7020 | minton_s | Minton , S.
08/22/2004
MRDM 2004 Workshop
79
Deduplicating Real data
• Parse Error
– 2083 | 8370 | raedt_2_l_de | 2. L. De Raedt
• Same entity as
– 2085 | 8380 | raedt_l_de | L. De Raedt
08/22/2004
MRDM 2004 Workshop
80
Deduplication and Group Detection
Summary
• Study of novel distance measures for
clustering similar entities in linked
environments
• Unified generative model for evaluating the
related problems
• Link-based clustering shows superior
performance over attribute clustering for
both tasks on synthetic data
• Link-based Challenges addressed:
– Collective consolidation
08/22/2004
MRDM 2004 Workshop
81
Roadmap
• Intro to Link Mining
– Link Mining Tasks
– Link Mining Challenges
• Some Current Projects
– Link-based Classification
• work with Qing Lu and Prithviraj Sen
• Link-based classification using a variety of link
descriptions
• Link-based classification using labeled and unlabeled data
– Link-based Clustering
• Entity detection
• Group Detection
• Conclusion
08/22/2004
MRDM 2004 Workshop
82
Link Mining Summary
• Link Mining Tasks
– Link-based Object
Classification
– Object Type Prediction
– Link Type Prediction
– Predicting Link Existence
–
–
–
–
–
Link Cardinality Estimation
Object Consolidation
Group Detection
Subgraph Discovery
Metadata Mining
• Link Mining Challenges
– Logical vs. Statistical
dependencies
– Feature construction
– Instances vs. Classes
– Collective Classification
08/22/2004
– Collective Consolidation
– Effective Use of Labeled &
Unlabeled Data
– Link Prediction
– Closed vs. Open World
MRDM 2004 Workshop
83
References
•
Deduplication and Group Detection Using Links Indrajit Bhattacharya and Lise Getoor. 10th ACM SIGKDD Workshop on Link
Analysis and Group Detection, Seattle, WA, August 2004.
•
Word Sense Disambiguation using Probabilistic Models, Indrajit Bhattacharya, Lise Getoor and Yoshua Bengio. 42nd Annual
Meeting of the Association for Computational Linguistics, Barcelona, SP, July 2004.
•
Iterative Record Linkage for Cleaning and Integration Indrajit Bhattacharya and Lise Getoor. 9th ACM SIGMOD Workshop on
Research Issues in Data Mining and Knowledge Discovery, Paris, FR, June 2004.
•
Using the Structure of Web Sites for Automatic Segmentation of Tables, Kristina Lerman, Lise Getoor, Steve Minton and Craig
Knoblock. Proceedings of ACM-SIGMOD 2004 International Conference on Management of Data, Paris, FR, June 2004.
•
Structure Discovery using Statistical Relational Learning, Lise Getoor. Data Engineering Bulletin, vol. 26, No. 3, 2003.
•
Link Mining: A New Data Mining Challenge, Lise Getoor. SIGKDD Explorations, volume 5, issue 1, 2003. Iterative Deduplication, I.
Bhattacharya, L. Getoor.
•
Link Mining: A New Data Mining Challenge, L. Getoor. SIGKDD Explorations, volume 4, issue 2, 2003.
•
Link-based Classification, Q. Lu and L. Getoor, International Conference on Machine Learning, August, 2003
•
Labeled and Unlabeled Data for Link-based Classification, Q. Lu and L. Getoor. ICML workshop on The Continuum from Labeled
to Unlabeled Data, August, 2003.
•
Link-based Classification for Text Classification and Mining, Q. Lu and L. Getoor. IJCAI workshop on Text Mining and Link
Analysis
•
IJCAI 03 Workshop: Learning Statistical Models from Relational Data SRL 2003, http://kdl.cs.umass.edu/srl2003
•
ICML 04 Workshop: Statistical Relational Learning and Connections to Other Fields, SRL 2004, http://www.cs.umd.edu/srl2004
Supported by
08/22/2004
MRDM 2004 Workshop
84