HeteroClass: A Framework for Effective
Classification from Heterogeneous Databases
Mayssam Sayyadian
University of Illinois at Urbana-Champaign
Final Project Presentation for CS512: Principles and
Algorithms for Data Mining
May 2006
Roadmap

Motivation
– From multi-relational classification → multi-DB classification
– Motivating examples

Challenges
– Heterogeneous data sources
– Attribute instability

HeteroClass framework
– Data integration techniques help data mining techniques
– Diverse ensembles to address attribute instability


Empirical evaluation
Discussion, broader impact, and conclusion
Multi-relational Classification

Classification: an old but important problem!
Most real-world data are stored in relational databases.
– To classify objects in one relation, other relations provide crucial information.
– Relational data cannot be converted into a single table without expert knowledge or losing essential information.
– Multi-relational classification: automatically classifying objects using multiple relations
Motivation

Necessity drives applications …
– From multi-relational scenarios to multi-database scenarios
– Cross-database links play an important role, but are difficult to capture automatically
[Figure: example schemas from five databases (Vendor1: Shopping Database, Vendor2–Vendor4 databases, and a Manufacturer database) with overlapping Product, Transaction, Customer, Vendor, and Listing relations linked by attributes such as product_id, UPC, vendor_id, customer_id, and creditcard_no]
Examples

In many real-world settings, data sources are
– Diverse
– Autonomous and heterogeneous
– Scattered over Web, organizations, enterprises, digital libraries, “smart”
homes, personal information systems (PIM), etc.
– … but: inter-related

Examples:
– Which UIUC graduates will donate money to the school in the future? (multiple databases in an organization)
– Which products are likely to sell in the next 30 days? (multiple databases in an enterprise)
– Which customers will use both the traveling and dining services of our company?
– Which movie scenes are captured outdoors? (multimedia/spatial databases)
– Etc.
Challenge 1: Heterogeneity of Data Sources

Heterogeneity of data sources:
– Data-level heterogeneity; e.g., Phil vs. Philippe, Jiawei Han vs. Han, J.
– Heterogeneity of schemas; e.g.
Personnel(Pno, Pname, Dept, Born) vs.
Empl(id, name, DeptNo, Salary, Birth), Dept(id, name)
– Structure/format heterogeneity; e.g.

CREATE TABLE Personnel(
  EmpNo int PRIMARY KEY,
  EmpName varchar(50),
  DeptNo int REFERENCES Dept,
  Salary dec(15,2)
);
CREATE TABLE Dept(
  DeptNo int PRIMARY KEY,
  DeptName varchar(70)
);

vs.

<Schema name="HR-Schema">
  <ElementType name="Personnel">
    <element type="EmpNo"/>
    <element type="EmpName"/>
  </ElementType>
  <ElementType name="Dept">
    <element type="id"/>
    <element type="name"/>
  </ElementType>
</Schema>
Challenge 2: Attribute Instability

In distributed databases, objects/data values are
– Horizontally distributed
– E.g., the UIUC organization: HR, OAR, Facilities, Business, Academics, …
– Vertically distributed
– E.g., comparison shopping: Yahoo Shopping, Amazon, eBay, …
– Distributed with various distributions
– E.g., spatial/multimedia databases; DBs of different training-set sizes
– Distributed with various characteristics (features)
– Important when we need to run feature selection before learning the model
HeteroClass Framework: Observations


Data integration before data mining is very costly and difficult (if not impossible)

Data warehousing before data mining is difficult and not efficient
– Security concerns
– Providers do not give full access to their data
– Efficiency concerns (large databases)

Data co-existence paradigm is the most natural
– Data sources expose standalone classifiers
– DataSpace Support Platforms (DSSPs)
– Data mining embedded in DBMSs
– It is possible (and useful) to build specialized classifiers
– No need for full integration/warehousing; partial integration is enough
HeteroClass Framework: Architecture

Join Discovery Module
– Alleviate heterogeneity
– Output useful links between databases (and their relations)

Ensemble builder
– Build diverse ensemble of classifiers

Meta-learner
– Voting
– Decision tree meta-learner
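The voting meta-learner can be sketched in a few lines. This is a hypothetical minimal version (majority vote over the predictions of site-specific classifiers), not the actual HeteroClass implementation:

```python
from collections import Counter

def vote(classifiers, x):
    # Each site-specific classifier predicts a label for x;
    # the most common label wins (ties broken by first-seen order).
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

A decision-tree meta-learner would instead treat the member predictions as features and learn how to combine them.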
Approximate Join Discovery – Step 1

Two-step process
1. Find approximate join paths based on set resemblance
– Field similarity by exact match/substring similarity
– The resemblance of two sets A, B is ρ = |A ∩ B| / |A ∪ B|
– After some algebra, |A ∩ B| = ρ (|A| + |B|) / (1 + ρ)
– There are fast algorithms to compute ρ:
– h(a) is a hash function
– m(A) = min{ h(a) : a ∈ A }
– Observation: Pr[m(A) = m(B)] = ρ
– The sets are the unique values in the fields
– The resemblance of two fields is a good measure of their similarity
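The min-hash observation above suggests a simple estimator: draw many independent hash functions and count how often the minima of the two fields agree. A minimal sketch (the salting scheme is an illustrative choice, not the algorithm from the slide's references):

```python
import random

def resemblance(a, b):
    # Exact resemblance rho = |A ∩ B| / |A ∪ B| of two value sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def minhash_estimate(a, b, num_hashes=500, seed=0):
    # Estimate rho via min-hashing: since Pr[m(A) = m(B)] = rho,
    # averaging agreement of the minima over many independently
    # salted hash functions converges to the resemblance.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    def m(values, salt):
        return min(hash((salt, v)) for v in values)
    agree = sum(m(a, salt) == m(b, salt) for salt in salts)
    return agree / num_hashes
```

The estimate's standard error shrinks as 1/sqrt(num_hashes), so a few hundred hash functions usually suffice for join discovery.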
Approximate Join Discovery – Step 2

For each set of joinable attributes in the two schemas
– Estimate their semantic distance using a schema matching tool
– Filter out join paths that are not semantically meaningful, i.e., whose semantic distance is high
– E.g.:
– (DB1.last-name, DB2.street-name)
– Might be joinable because of common substrings (Clinton St. vs. Bill Clinton, George West vs. West Main St., etc.)
– But this join is not semantically meaningful
– This happens frequently with categorical attributes (e.g., with "yes"/"no" values)
Schema Matching
Given a schema:

CREATE TABLE Personnel(
  Pno int,
  Pname string,
  Dept string);

build its graph model, with tables, columns, and SQL types as nodes connected by table, column, type, and name edges.

[Figure: graph model with internal nodes &1–&6 linking the Personnel table node to its Pno, Pname, and Dept columns and their int/string SQL types]

Given two such models, run the similarity flooding algorithm: for each pair of schema elements (a, b), output a semantic similarity distance based on an iterative graph-matching algorithm.
DI & Approximate Join Discovery

Why is it useful?
– When learning a classifier in a multi-relational setting:
– The number of possible predicates (the predicate space) grows exponentially with the number of links → pruning improves efficiency
– We eliminate spurious links → improves accuracy
– Best-effort DI
– We need only a semantic distance
– No need for exact matches (no need for human confirmation)
– No need for schema mapping
Contribution: Using schema matching and DI techniques helps in mining database structure and join discovery
Ensemble-of-Classifiers Approach

Ensemble methods:
– Use a combination of models to increase global accuracy
– The ensemble should be
– Accurate
– Build specialized (site-specific) classifiers
– Build classifiers on homogeneous regions
⇒ Improves prediction capability
– Diverse
– Each learned hypothesis should be as different as possible
– Each learned hypothesis should be consistent with the training data
– Ensure all classifiers do not make the same errors
– Exploit various attribute collections (joins/schemas/DBs)
HeteroClass: Intuition

Main observation (theoretically proven)
– To improve the accuracy of an ensemble of classifiers, they should disagree on some inputs
– Diversity/ambiguity: a measure of disagreement
– Generalization error = mean error − diversity (variance): E = Ē − D̄
– Increase ensemble diversity while maintaining the average error of ensemble members
⇒ Decrease generalization error
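The decomposition E = Ē − D̄ (due to Krogh & Vedelsby) can be checked numerically for a uniform-averaging ensemble on a toy regression task. All numbers below are synthetic, chosen only to illustrate the identity:

```python
import random

random.seed(0)
n_members, n_points = 5, 1000
target = [0.0] * n_points                      # true values (synthetic)
preds = [[random.gauss(0.1, 0.5) for _ in range(n_points)]
         for _ in range(n_members)]            # member predictions

# Ensemble prediction: uniform average of the members.
ensemble = [sum(p[i] for p in preds) / n_members for i in range(n_points)]

# E-bar: mean squared error of the individual members.
mean_err = sum((p[i] - target[i]) ** 2
               for p in preds for i in range(n_points)) / (n_members * n_points)
# D-bar: diversity, the members' variance around the ensemble prediction.
diversity = sum((p[i] - ensemble[i]) ** 2
                for p in preds for i in range(n_points)) / (n_members * n_points)
# E: squared error of the ensemble itself; equals mean_err - diversity.
ens_err = sum((ensemble[i] - target[i]) ** 2
              for i in range(n_points)) / n_points
```

Because the identity holds exactly per point for uniform averaging, raising diversity without raising the members' mean error provably lowers the ensemble error.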
HeteroClass: Algorithm
1. learn C0 = BaseLearn(training data)
2. initialize ensemble C = {C0}
3. set ε = ensemble error
4. repeat until the error converges, no training data is left, or enough ensemble members are created:
4.1. pick two (or more) site-specific classifiers and add a number of joins (J1 … Jn) that connect them
4.2. training data = training data + data accessible through J1 … Jn
4.3. C' = BaseLearn(training data)
4.4. C* = C ∪ {C'}
4.5. if the new ensemble's error increased: C = C − {C'}
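The loop above can be sketched as runnable code. BaseLearn, the per-site data, and the error measure are toy stand-ins (a majority-label learner and majority-vote error), not the actual HeteroClass components; the join step is abstracted as "append the next site's data":

```python
def base_learn(train):
    # Toy stand-in for BaseLearn: always predict the majority label.
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def ensemble_error(ensemble, data):
    # Error rate of the ensemble's majority vote on labeled data.
    wrong = 0
    for x, y in data:
        votes = [clf(x) for clf in ensemble]
        if max(set(votes), key=votes.count) != y:
            wrong += 1
    return wrong / len(data)

def heteroclass(site_data, validation, max_members=10):
    # Steps 1-3: learn C0 on the first site's data, record its error.
    train = list(site_data[0])
    ensemble = [base_learn(train)]
    err = ensemble_error(ensemble, validation)
    # Step 4: greedily grow the ensemble over the remaining sites.
    for site in site_data[1:]:
        if len(ensemble) >= max_members:
            break
        train = train + list(site)        # 4.2: data reachable via new joins
        candidate = base_learn(train)     # 4.3: learn C' on enlarged data
        new_err = ensemble_error(ensemble + [candidate], validation)
        if new_err <= err:                # 4.4/4.5: keep C' only if error
            ensemble.append(candidate)    # did not increase
            err = new_err
    return ensemble, err
```

The key design choice mirrored here is step 4.5: a candidate classifier is discarded whenever adding it hurts the ensemble's validation error.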
Evaluation Settings

Dataset 1: Inventory
– Products, Stores, Inventory records: 5 databases
– Classification task:
– Product availability in store

Dataset 2: DBLife data sources
– DBLP database (2 XML databases from Clio project @Toronto)
– Authors, Papers, Publications, Conferences
– DBLife-Researchers database:
– CS researchers, their associations, and meta-data (Department, Title, etc.)
– DBWorld-Events:
– People, Events (service, talk, etc.)
– Classification tasks:
– Research area for researchers
HeteroClass Accuracy
[Chart "Algorithms: Accuracy": classification accuracy (0–0.9) on the Inventory and DBLife datasets for CrossMine, MDBM*, and HeteroClass]
HeteroClass: Sensitivity Analysis
[Charts: "Accuracy vs. Ensemble Size" — accuracy (0.3–0.9) vs. number of classifiers in the ensemble (1–10) on Inventory and DBLife; "Join Discovery Accuracy" — accuracy (0–1) on Inventory 1–5, comparing join discovery alone with join discovery + schema matching]
Conclusion

Take-home messages:
– Data mining across multiple data sources is
– Important and necessary
– Heterogeneity is challenging
– Best-effort DI helps data mining
– Improves structure mining using schema matching
– Ensemble approaches help in heterogeneous settings with the data co-existence paradigm
– A committee of local & specialized classifiers can help in building a global classifier

HeteroClass framework is general
– Applicable to other data mining tasks:
– Clustering: build global clusters from local clusters
– Etc.
Thank You!
Backup: Similarity Flooding Algorithm
e(s, p, o) ∈ A: edge e labeled p from node s to node o in graph A

Pairwise Connectivity Graph PCG(A, B):
((x, y), p, (x', y')) ∈ PCG(A, B) ⟺ (x, p, x') ∈ A and (y, p, y') ∈ B

Induced Propagation Graph: add edges in the opposite direction
– Forward edge weights: initialize with a q-gram string-similarity matcher
– Edge weights: propagation coefficients (how similarity propagates to neighbors)
Backup: Similarity Flooding Algorithm


Input  Graph  Mapping  Filtering
Mapping: similarity flooding (SFJoin)
– Initial similarity values taken from initial mapping; e.g. a name
matcher or a fully functional schema matching tool
– In each iteration similarity of two elements affects the similarity of
their respective neighbors
– Iterate until similarity values are stable (fixpoint computation)

Filtering:
–
–
–
–
Needed to choose best candidate match
Many heuristics
Difficult to implement fully automatically (needs human intervention)
In HeteroClass: only need semantic distance  no need for filtering
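The fixpoint computation in the Mapping step can be sketched as follows, under simplifying assumptions (uniform propagation coefficients, normalization by the maximum score, and a hypothetical pair/neighbor encoding of the induced propagation graph):

```python
def similarity_flooding(pairs, neighbors, init, iterations=100, tol=1e-6):
    # sigma maps node pairs (a, b) to a similarity score; each round adds
    # similarity propagated from neighboring pairs in the induced
    # propagation graph, then normalizes by the maximum score.
    sigma = dict(init)
    for _ in range(iterations):
        new = {}
        for p in pairs:
            new[p] = sigma[p] + sum(sigma[q] for q in neighbors.get(p, []))
        top = max(new.values())
        new = {p: v / top for p, v in new.items()}  # normalize into [0, 1]
        delta = max(abs(new[p] - sigma[p]) for p in pairs)
        sigma = new
        if delta < tol:  # similarities stable: fixpoint reached
            break
    return sigma
```

Because HeteroClass only needs the resulting similarity scores as semantic distances, the filtering stage above can be skipped entirely.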