Sharad`s PPT
Download
Report
Transcript Sharad`s PPT
Overcoming the Quality Curse
Sharad Mehrotra
University of California, Irvine
Collaborators/Students (Current)
Dmitri Kalashnikov, Yasser Altowim, Hotham Altwaijry, Jeffrey Xu, Liyan
Zhang
Alumini
Stella Zhaoqi Chen, Rabia Nuray-Turan, Virag Kothari
Beyond DASFAA 2003 paper ..
Improving
Quality
Improving
Efficiency
DASFAA 2003
New
Domains
Video data
Image data
Speech data
Sensor data
Entity Search
People Search
Location Search
2
Data Cleaning – a vital component of
Enterprise Data Processing Workflow
Analysis/Min
Quality of
Analysis
ing
Quality
Data of
Data
• Historical data analyses
•
Trends, patterns, rules,
models, ..
Decisions
Quality of
Decisions
• Long term strategies
• Business decisions
ETL
Data Cleaning
OLTP
Point of sale
Organizational
customer data
Data Sources
Quality(Data) Quality(Decisions)
3
Entity Resolution Problem
Real
World
Digital
World
4
Standard Approach to Entity Resolution
Deciding if two reference u and v co-refer
Analyzing their features
J. Smith
Feature 2
u
Feature 3
[email protected]
?
?
?
?
John Smith
Feature 2
Feature 3
v
[email protected]
s (u,v) = f (u,v)
“Similarity function”
“Feature-based similarity”
(if s(u,v) > t then u and v are declared to co-refer)
5
Measuring Quality of Entity Resolution
Entity dispersion
Ideal Clustering
1
1
1
1
1
1
2
2
2
2
2
2
C1
C2
Div
H
1
0
1
0
Dis
E1
E2
1
1
Cluster diversity
for an entity, into how many clusters its
repr. are clustered, ideal is 1
for a cluster, how many distinct entities
it contains, ideal is 1
Measures:
F-Measure.
B-Cubed F-Measure.
Variation of Information (VI).
Generalized Merge Distance (GMD).
…
One Misassigned (Example 1)
1
1
1
1
1
2
2
2
2
2
2
1
C1
C2
Half Misassigned
1
1
1
2
2
2
2
2
2
1
1
1
C1
C2
Div
H
2
0.65
2
0.65
Div
H
2
1
2
1
Dis
E1
E2
2
0
2
0
Dis
E1
E2
2
2
Dis/Div cannot distinguish the two c
Entropy can:
since 0.65 < 1, first clustering is be
One Misassigned (Example 2)
1
1
2
1
2
1
1
2
1
2
2
2
C1
C2
Div
H
2
0.592
1
0
Dis
E1
E2
1
2
0
Average entropy decreases (impro
compared to Example 1
The Quality Curse --
Why Standard “Feature-based” Approach leads to Poor Results
• Significant entity dispersion.
• Significant cluster diversity.
S Mehrotra
has joined
the faculty
at University
of Illinois.
Photo Collection
of Sharad
Mehrotra
from Beijing,
He received
his
PhD2007
fromSIGMOD
UT, Austin.
China
June
Trip He got his
bachelors from IIT, Kanpur in India
S. Mehrotra, PhD from University of Illinois is visiting UT,
Sharad Mehrotra, research interests: data management,
Austin to give a talk on prefetching on multiprocessor
Professor, UC Irvine
machines. He received his bachelors from India.
Overcoming the Quality Curse (1)..
Look more carefully at data for additional
evidences
8
Exploiting Relationships among Entities
Author table (clean)
Publication table (to be cleaned)
?
A1, ‘Dave White’, ‘Intel’
P1, ‘Databases . . . ’, ‘John Black’, ‘Don White’
A2, ‘Don White’, ‘CMU’
P2, ‘Multimedia . . . ’, ‘Sue Grey’, ‘D. White’
A3, ‘Susan Grey’, ‘MIT’
P3, ‘Title3 . . .’, ‘Dave White’
A4, ‘John Black’, ‘MIT’
P4, ‘Title5 . . .’, ‘Don White’, ‘Joe Brown’
A5, ‘Joe Brown’, unknown
P5, ‘Title6 . . .’, ‘Joe Brown’, ‘Liz Pink’
A6, ‘Liz Pink’, unknown
P6, ‘Title7 . . . ’, ‘Liz Pink’, ‘D. White’
Intel
=
?
MIT
John Black
CMU
4
w2
Susan Grey
P6
=?
P3
w3 = ?
2
Dave White
w
w1 = ?
1
P2
Don White
P1
ER Graph
Liz Pink
P5
P4
Joe Brown
Context Attraction
Principle (CAP): Nodes
that are more connected
have a higher chance of
co-referring to the same
entity
9
Exploiting Relationships for ER
Ph.D. Thesis, Stella Chen
• Formalizing the CAP principle [SDM 05, IQIS 05]
• Scaling to large graphs [TODS 06]
• Self-Tuning [DASFAA 07, JCDL 07, Journal IQ 11]
– Not all relationships are equal
– E.g., mutual interest in Bruce Lee movies possibly not as
important as being colleagues at a university for predicting coauthorship.
• Merging relationship evidence with other evidences
[SIGMOD ‘09]
• Applying to People search on Web [ICDE ‘07, TDKE 08,
ICDE 09 (demo)]
10
Effectiveness of Exploiting Relationships
• WEPS
• Multimedia
11
Smart Video Surveillance
• Camera Array to track human
activities
CS Building in UC Irvine
Query/
Analysis
Event
Database
Semantic
Extraction
Surveillance
Video
Database
Video collection
12
Event Model
Query Examples:
Who was the last visitor to Mike Carey’s office yesterday?
Who spends more time in Labs – database students or
embedded computing students?
Query
/Analysis
Event model :
when
Event
Database
who
what
Temporal placement
Activity recognition
Face recognition
event
Semantic
Extraction
localization
where
Surveillance
Video
Database
extraction
Other
property
Bob
Person Identification Challenge
?
?
?
Alice
other
Event model :
Person Identification
when
what
Temporal placement
Activity recognition Who
Face recognition
who
?
event
localization
where
extraction
Other
property
14
Traditional Approach
?
?
?
Traditional
Approach
Poor
Performance
Face
Detection
Detect 70 faces/
1000 images
Face
Recognition
2~3 images/
person
15
Rationale for Poor Performance
resolution
Sampling
rate
1
1
frame/sec
frame/sec
original
performance
original
performance
Drop
to
Poor Quality of Data
Drop (original)
to
No faces
1/2
frame/sec
70%
53%
Small faces
Low resolution
Drop
to
1/3
frame/sec
Low temporal Resolution
(1/2
Drop original)
to
30%
35%
(1/3
16
Effectiveness of Exploiting Relationships
• WEPS
• Multimedia [IQ2S PERCOM 2011]
17
Results on Face Clustering [ACM ICMR 2013
Best Paper Award]
Results
High Precision,
662 clusters
31 Real Person,
631 merges
4
Times
High Precision,
203 clusters
31 Real Person,
172 merges
Overcoming the Quality Curse (2)..
Look
outside
the box
20
Exploiting Search Engine Statistics
Google Search results
of “Andrew McCallum”
Sebastian Thrun
Machine Learning
Text Retrieval
Tom Mitchell
CRF
UAI 2003
Andrew McCallum
AND
(Machine
Learning
OR
Text
Sebastian
Thrun
AND
Tom
Andrew
McCallum
AND
(Machine
Sebastian
Thrun AND
AND Tom
Retrieval)
Mitchell
Andrew
AND)
Learning
OR McCallum
Text Retrieval
Mitchell
(CRF
OROR
UAIUAI
2003)
Sebastian
Thrun
AND
(CRF
2003)
AND (CRF OR UAI 2003)
Search Engine Queries
to learn correlations
amongst contexts
• Correlations amongst
context entities provide
additional source of
information to resolve
entities
Exploiting Web Search Engine Statistics
Ph.d. Thesis, Rabia Nuray
• Web Queries to Learn correlations [SIGIR 08]
• Application to Web People Search [WePS 09]
• Cluster refinement to overcome the singleton
cluster problem [TODS 11-a]
• Making Web querying robust to server side
fluctuations [tech. report]
• Scaling up the Web Query Technique [TODS 11a]
4/1/2016
22
FB
4/1/2016
1.00
0.95
0.90
0.85
0.80
0.75
0.70
0.65
0.60
0.55
0.50
0.45
0.40
0.35
0.30
0.25
0.20
0.15
0.10
0.05
0.00
ERWeb+CR
ERWeb
SIGIR10
GRAPE
Basic
CIKM09
PolyUHK
ITC-UT
UVA
XMEDIA
UCI
UMD
FICO
LANZHOU
HAC-BIGRAMS
CASIANED
UGUELPH
HAC-TOKENS
AUG
UPM-SINT
ALL-IN-ONE
UNN
CHEAT-SYSTEM
ECNU
UNED
PRIYAVEN
ONE-IN-ONE
BUAP
Comparing with the State-of-the-art on WEPS2 Dataset
Systems
23
Observation/Conclusion…
• Additional Evidences can be exploited to
improve data quality
• BUT …it is Expensive!!
• Example: Web Queries Approach
– Number of queries : 4K2 ( ~ 40K for 100 results)
– Very large to submit to a search engine & expect realtime results
– ~6-8 minutes (network costs, search engine load)
• Solutions:
– Local Caching of the Web
– Ask only important queries
– Reduces to 1-2 min. without degrading quality much
29
(Near) Future: Addressing the Efficiency
Curse …
Two complementary approaches
– Pay as you go data cleaning –
– Progressive algorithm to obtain best quality given budget
constraint
Improving
Quality
– Query driven data cleaning –
– Perform minimal cleaning to answer query/analyses task.
Prevent having to clean unnecessary data.
Improving
Efficiency
New
Domains
DASFAA 2003
30