CS295_Intro - University of California, Irvine

CS295: Info Quality & Entity Resolution
University of California, Irvine
Fall 2010
Course introduction slides
Instructor: Dmitri V. Kalashnikov
Copyright © Dmitri V. Kalashnikov, 2010
Class Organizational Issues
• Class Webpage
– http://www.ics.uci.edu/~dvk/CS295.html
– Will put these slides there
• Rescheduling of Class Time
– Now
– Tue Thr 3:30-4:50 PM @ ICS209
– Twice a week
– New
– Thr 3:00-5:20 PM @ ?
– 10 min break in the middle
– Once a week
– Easier on students
– Is this time slot OK?
2
Class Structure
• Student Presentation-Based Class
– Students will present publications
– Papers cover recent trends (not comprehensive)
– Prepare slides
– Slides will be collected after presentation
– Discussion
• Final grade
– Quality of slides
– Quality of presentations
– Participation & attendance
– We are a small class, so please attend!
– No exams!
3
Tentative Syllabus
4
Presentation
• Topics
– All papers are split into “topics”
– One topic is covered per day
• Student Presentations
– Each student will choose a day/topic to present
– A student presents on only one day in the quarter!
– If possible
– To reduce the workload
– A student will present 2(?) papers on that day
– Please start preparing early!
5
Tentative List of Publications
6
How to present a paper
• Present “high-level” idea
– Main idea of the paper [should be clear]
• Present technical depth of techniques
1) Cover techniques in detail
2) Try to analyze the paper [if you can]
– Discuss what you like about the paper
– Criticize the technique
– Do you see flaws/weaknesses in the proposed methodology?
– Do you think the techniques can be improved?
– Do you think the authors should have included additional info/algorithms?
• Present experiments
– Explain datasets/setup/graphs (analyze results)
– Criticize experiments [if you can]
– Large enough data? More experiments needed? Unexplained trends? etc
7
Who wants to present first?
8
Talk Overview
• Class Organizational Issues
• Intro to Data Quality & Entity Resolution
9
Data Processing Flow
[Diagram: Data → Analysis → Decisions]
• Organizations & people collect large amounts of data
• Many types of data
– Textual
– Semi-structured
– Multimodal
• Data is analyzed for a variety of purposes
– Automated analysis: Data Mining
– Human in the loop: OLAP
– Ad hoc
• Analysis for Decision Making
– Business Decision Making
– Etc
10
Quality of decisions depends on quality of data
[Diagram: Quality of Data → Quality of Analysis → Quality of Decisions]
• Quality of data is critical
• $1 Billion market
– Estimated by Forrester Group
• Data Quality
– Very old research area
– But no comprehensive textbook exists yet!
11
Example of Analysis on Bad Data: CiteSeer
[Figure: CiteSeer top-k most cited authors, checked against DBLP, showing unexpected entries]
• Unexpected entries
– Let's check two people in DBLP
– “A. Gupta”
– “L. Zhang”
• Analysis on bad data can lead to incorrect results.
• Fix errors before analysis.
More than 80% of researchers working on data mining projects spend more than 40% of their project time on cleaning and preparation of data.
12
*Why* Do Data Quality Issues Arise?
Types of DQ Problems
– Ambiguity
– Uncertainty
– Erroneous data values
– Missing Values
– Duplication
– etc
13
Example of Ambiguity
– Ambiguity
– Categorical data
– “Location: Washington”
– D.C.? State? Something else?
14
Example of Uncertainty
– Uncertainty
– Numeric data
– “John’s salary is between $50K and $80K”
– Query: find all people with salary > $70K
15
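For intuition only: if we assume John's salary is uniformly distributed over the stated range (an assumption the slide does not make), a probabilistic treatment of this query could look like the following sketch.

```python
# Hypothetical sketch: probability that an interval-valued salary exceeds a
# query threshold, under an assumed Uniform(low, high) distribution.
def prob_salary_above(low, high, threshold):
    if threshold <= low:
        return 1.0
    if threshold >= high:
        return 0.0
    return (high - threshold) / (high - low)

# John: salary between $50K and $80K; query: salary > $70K
print(prob_salary_above(50_000, 80_000, 70_000))  # ~0.33, John matches with probability ~1/3
```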
Example of Erroneous Data
– Erroneous data values
– <Name: “John Smixth”, Salary: “Irvine, CA”, Loc: $50K>
16
Example of Missing Values
– Missing Values
– <Name: “John Smixth”, Salary: null, Loc: null>
17
Example of Duplication
– Duplication
– <1/01/2010, “John Smith”, “Irvine, CA”, 50k>
– <6/06/2010, “John Smith”, “CA”, 55k>
– Same? Different?
18
Inherent Problems vs Errors in Preprocessing
– Inherent Problems with Data
– The dataset contains errors (like in prev. slide)
– <Name: “John Smixth”, Salary: null, Loc: null>
– Errors in Preprocessing
– The dataset might not contain errors
– But preprocessing algo (e.g., extraction) fails
– Text: “John Smith lives in Irvine CA at 100 main st, his salary is $25K”
– Extractor: <Person: Irvine, Location: CA, Salary: $100K, Address: null>
*When* Do Data Quality Issues Arise? Past.
– Manual entering of data
– Prime reason in the past [Winkler, etc]
– People make mistakes while typing in info
– E.g. Census data
– Tools to prevent entering bad data in the first place
– E.g., field redundancy
– Trying hard to avoid the problem altogether
– Sometimes it is not possible
– E.g. missing data in census
– Tools to detect problems with data
– If-then-else and other types of rules to fix problems
– Applied by people inconsistently
– Even though strict guidelines existed
– Asking people to fix problems is not always a good idea!
20
*When* Do DQ Issues Arise? Present.
– Automated generation of DB content
– Prime reason for DQ issues nowadays
– Analyzing unstructured or semi-structured raw data
– Text / Web
– Extraction
– Merging DBs or Data sources
– Duplicate information
– Inconsistent information
– Missing data
– Inherent problems with well structured data
– As in the shown examples
21
Data Flow wrt Data Quality
[Diagram: Raw Data → Handle Data Quality → Analysis → Decisions]
Two general ways to deal with DQ problems
1) Resolve them and then apply analysis on clean data
– Classic Data Quality approach
2) Account for them in the analysis on dirty data
– E.g. put data into probabilistic DBMS
– Often not considered as DQ
22
Resolve only what is needed!
[Diagram: Raw Data → Handle Data Quality → Analysis → Decisions]
– Data might have many different (types of) problems in it
– Solve only those that might impact your analysis
– Example (see the sketch below)
– Publication DB: <paper_id, author_name, title, venue>
– <1, John Smith, “Title 1…”, SIGMxOD>
– <2, John Smith, “Title 2…”, SIGIR>
– All papers are by John Smith
– Venues might have errors
– The rest is accurate
– Task: count papers => Do not fix venues!!!
23
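A toy sketch of the point above (not from the course materials): counting papers per author never reads the venue field, so an erroneous venue such as “SIGMxOD” does not have to be fixed for this task.

```python
# Toy sketch: count papers per author from a publication table.
# The venue field (which may contain errors like "SIGMxOD") is never read,
# so it does not need to be cleaned for this task.
from collections import Counter

papers = [
    (1, "John Smith", "Title 1...", "SIGMxOD"),  # venue contains an error
    (2, "John Smith", "Title 2...", "SIGIR"),
]

papers_per_author = Counter(author for _, author, _, _ in papers)
print(papers_per_author["John Smith"])  # 2 -- correct despite the dirty venue
```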
Focus of this class: Entity Resolution (ER)
• ER is a very common Data Quality challenge
• Disambiguating uncertain references to objects
• Multiple Variations
− Record Linkage [winkler:tr99]
− Merge/Purge [hernandez:sigmod95]
− De-duplication [ananthakrishna:vldb02, sarawagi:kdd02]
− Hardening soft databases [cohen:kdd00]
− Reference Matching [mccallum:kdd00]
− Object identification [tejada:kdd02]
− Identity uncertainty [pasula:nips02, mccallum:iiweb03]
− Coreference resolution [ng:acl02]
− Fuzzy match and fuzzy grouping [@microsoft]
− Name Disambiguation [han:jcdl04, li:aaai04]
− Reference Disambiguation [km:siam05]
− Object Consolidation [mccallum:kdd03wkshp, chen:iqis05]
− Reference Reconciliation [dong:sigmod05]
• Ironically, some of them are the same (duplication)!
24
Entity Resolution: Lookup and Grouping
• Lookup
– List of all objects is given
– Match references to objects
• Grouping
– No list of objects is given
– Group references that corefer
[Figure: references “J. Smith”, “John Smith”, “Jane Smith” in text, and objects such as MIT and Intel Inc.]
25
When Does the ER Challenge Arise?
• Merging multiple data sources (even structured)
– “J. Smith” in DataBase1
– “John Smith” in DataBase2
– Do they co-refer?
• References to people/objects/organizations in raw data
– Who is “J. Smith” mentioned as an author of a publication?
• Location ambiguity
– “Washington” (D.C.? WA? Other?)
• Automated extraction from text
– “He’s got his PhD/BS from UCSD and UCLA respectively.”
– PhD: UCSD or UCLA?
• Natural Language Processing (NLP)
− “John met Jim and then he went to school”
− “he”: John or Jim?
26
Standard Approach to Entity Resolution
– Choosing features to use
– For comparing two references
– Choosing blocking functions
– To avoid comparing all pairs
– Choosing similarity function
– Outputs how similar two references are
– Choosing problem representation
– How to represent it internally, e.g. as a graph
– Choosing clustering algorithm
– How to group references
– Choosing quality metric
– In experimental work
– How to measure the quality of the results
27
Inherent Features: Standard Approach
• Deciding if two references u and v co-refer
• Analyzing their features
[Figure: record u (“J. Smith”, Feature 2, Feature 3, [email protected]) compared feature by feature with record v (“John Smith”, Feature 2, Feature 3, [email protected])]
s(u,v) = f(u,v)
“Similarity function”, “feature-based similarity”
(if s(u,v) > t then u and v are declared to co-refer; a short sketch follows below)
28
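A minimal sketch of this feature-based approach, with made-up field names, comparator, weights, and threshold; it only shows the shape of s(u,v) = f(u,v) followed by a threshold decision.

```python
# Sketch: feature-based similarity s(u, v) = f(u, v) with a decision threshold t.
# The fields, weights, string comparator, and threshold are illustrative assumptions.
from difflib import SequenceMatcher

def str_sim(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def s(u, v):
    """Weighted combination of per-attribute similarities."""
    weights = {"name": 0.6, "email": 0.4}
    return (weights["name"] * str_sim(u["name"], v["name"]) +
            weights["email"] * str_sim(u["email"], v["email"]))

u = {"name": "J. Smith",   "email": "jsmith@example.org"}  # hypothetical records
v = {"name": "John Smith", "email": "jsmith@example.org"}

t = 0.8
print(s(u, v) > t)  # declare u and v co-referent if s(u, v) > t
```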
Advanced Approach: Information Used
• External data
– Public datasets (e.g., DBLP, IMDB)
– Web
– Ontologies (e.g., DMOZ)
– Encyclopedias (e.g., Wikipedia)
– Ask a person
- Not frequently
- Might not work well
• Within the dataset
– Inherent features
– Context features
– (Condit.) functional dependencies & consistency constraints
– Entity relationship graph (social network)
[Figure: records u (“J. Smith”, …, [email protected]) and v (“John Smith”, …, [email protected]) compared using all of the above types of information]
Blocking Functions
• Comparing each reference pair
− N >> 1 references in dataset
− Each can co-refer with the remaining N-1
− Complexity N(N-1)/2 is too high…
• Blocking functions
– A fast function that finds potential matches quickly
[Figure: candidate sets for R1 among references R1-R6: ground truth, naïve all-pairs, BF1 (one extra candidate), BF2 (one lost, one extra)]
30
Blocking Functions (contd)
• Multiple BFs could be used
− Better if independent
− Use different record fields for blocking
• Examples
− From [Winkler 1984]
1) Fst3(ZipCode) + Fst4(NAME)
2) Fst5(ZipCode) + Fst6(Street name)
3) 10-digit phone #
4) Fst3(ZipCode) + Fst4(LngstSubstring(NAME))
5) Fst10(NAME)
− BF4 is #1 single
− BF1 + BF4 is #1 pair
− BF1 + BF5 is #2 pair
31
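A minimal sketch of key-based blocking in the spirit of BF1 above, Fst3(ZipCode) + Fst4(NAME); the record layout and sample records are made up.

```python
# Sketch: blocking -- only records that share a blocking key are compared,
# instead of all N(N-1)/2 pairs. Records and fields are hypothetical.
from collections import defaultdict
from itertools import combinations

def bf1(record):
    """BF1-style key: first 3 chars of ZipCode + first 4 chars of NAME."""
    return record["zip"][:3] + record["name"][:4].upper()

records = [
    {"name": "Smith, John",  "zip": "92697"},
    {"name": "Smith, J.",    "zip": "92697"},
    {"name": "Smythe, Jane", "zip": "10001"},
]

blocks = defaultdict(list)
for r in records:
    blocks[bf1(r)].append(r)          # group records by blocking key

candidate_pairs = [pair for block in blocks.values()
                   for pair in combinations(block, 2)]
print(candidate_pairs)  # only the two 92697 "Smith" records form a candidate pair
```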
BFs: Other Interpretations
• Dataset is split into smaller Blocks
− Matching operations are performed on Blocks
− Block is a clique
• Blocking
− Applying somebody else’s technique first
− It will not only find candidates
− But will also merge many (even most) cases
− Leaving only the “tough cases”
− Apply your technique on these “tough cases”
32
Basic Similarity Functions
s(u,v) = f(u,v)
[Figure: per-attribute similarities between u (“J. Smith”, Feature 2, Feature 3, [email protected]) and v (“John Smith”, Feature 2, Feature 3, [email protected]): 0.8, 0.2, 0.3, 0.0]
− How to compare attribute values
− Lots of metrics, e.g. Edit Distance, Jaro, Ad hoc
− Cross-attribute comparisons
− How to combine attribute similarities
− Many methods, e.g. supervised learning, Ad hoc
− How to mix it with other types of evidence
− Not only inherent features
33
Standardization & Parsing
• Standardization
− Converting attribute values into the same format
− For proper comparison
• Examples
− Convert all dates into MM/DD/YYYY format
− So that “Jun 1, 2010” matches with “6/01/10”
− Convert time into HH:mm:ss format
− So that 3:00PM and 15:00 match
− Convert Doctor -> Dr.; Professor -> Prof.
• Parsing
− Subdividing into proper fields
− “Dr. John Smith Jr.” becomes
− <PREF: Dr.; FNAME: John; LNAME: Smith; SFX: Jr.>
34
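A small sketch of the date standardization described above, using Python's datetime with a few assumed input patterns; real standardizers handle many more formats and fields.

```python
# Sketch: normalize several date spellings to MM/DD/YYYY so that
# "Jun 1, 2010" and "6/01/10" compare as equal.
from datetime import datetime

FORMATS = ["%b %d, %Y", "%m/%d/%y", "%m/%d/%Y"]  # assumed input patterns

def standardize_date(text):
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None  # unparseable; leave for manual review

print(standardize_date("Jun 1, 2010"))  # 06/01/2010
print(standardize_date("6/01/10"))      # 06/01/2010
```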
Example of Similarity Function
• Edit Distance (1965)
− Comparing two strings
− The min number of edits …
− Insertions
− Deletions
− Substitutions
− … needed to transform one string into another
− Ex.: “Smith” vs. “Smithx”: one deletion is needed
− Dynamic programming solution
• Ex. of Advanced Version
− Assign different costs to ins, del, sub
− Some errors are more expensive (unlikely) than others
− The distance d(s1,s2) is the min cost transformation
− Learn (supervised learning) costs from data
35
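The basic (unit-cost) version has a standard dynamic-programming implementation; a short sketch:

```python
# Dynamic-programming edit distance (Levenshtein): the minimum number of
# insertions, deletions, and substitutions to turn s1 into s2.
def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]   # d[i][j]: distance of s1[:i], s2[:j]
    for i in range(m + 1):
        d[i][0] = i                             # delete all of s1[:i]
    for j in range(n + 1):
        d[0][j] = j                             # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1    # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,               # deletion
                          d[i][j - 1] + 1,               # insertion
                          d[i - 1][j - 1] + cost)        # substitution / match
    return d[m][n]

print(edit_distance("Smith", "Smithx"))  # 1 -- one edit is needed
```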
Clustering
• Lots of methods exist
− Really a lot!
• Basic Methods
− Hierarchical
− Agglomerative
[Figure: nodes a, b, c, d, e, f connected by edges weighted +1 or -1]
− Decide threshold t
− if s(u,v) > t then merge(u,v)
− Partitioning
• Advanced Issues
− How to decide the number of clusters K
− How to handle negative evidence & constraints
− Two-step clustering & cluster refinement
− Etc, a very vast area
36
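A minimal sketch of the threshold rule above, merge(u,v) whenever s(u,v) > t, using a tiny union-find; the similarity scores below are invented.

```python
# Sketch: single-link style merging -- if s(u, v) > t, put u and v in the same cluster.
def cluster(items, sims, t):
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for (u, v), s_uv in sims.items():
        if s_uv > t:
            parent[find(u)] = find(v)       # merge(u, v)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())

items = ["a", "b", "c", "d", "e", "f"]
sims = {("a", "b"): 0.9, ("b", "c"): 0.8, ("d", "e"): 0.85, ("c", "d"): 0.2, ("e", "f"): 0.1}
print(cluster(items, sims, t=0.7))  # [['a', 'b', 'c'], ['d', 'e'], ['f']]
```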
Quality Metrics
• Purity of clusters
− Do clusters contain mixed elements? (~precision)
• Completeness of clusters
− Do clusters contain all of their elements? (~recall)
• Tradeoff between them
− A single metric that combines them (~F-measure)
[Figure: example clusterings of elements labeled 1 and 2: ideal clustering, one misassigned (Example 1), one misassigned (Example 2), half misassigned]
37
Precision, Recall, and F-measure
• Assume
− You perform an operation to find relevant (“+”) items
− E.g. Google “UCI” or some other terms
− R is the ground truth set, or the set of relevant entries
− A is the answer returned by some algorithm
• Precision
− P = |A ∩ R| / |A|
− Which fraction of the answer A are correct (“+”) elements
• Recall
− R = |A ∩ R| / |R|
− Which fraction of ground truth elements were found (in A)
• F-measure
− F = 2/(1/P + 1/R) harmonic mean of precision and recall
38
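A tiny sketch that computes the three measures directly from the set definitions above, using the R and A sets from the next slide as input.

```python
# Precision, recall, and F-measure from a ground-truth set R and an answer set A.
def precision_recall_f(A, R):
    tp = len(A & R)                             # correct ("+") elements returned
    p = tp / len(A) if A else 0.0               # precision = |A ∩ R| / |A|
    r = tp / len(R) if R else 0.0               # recall    = |A ∩ R| / |R|
    f = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean of P and R
    return p, r, f

R = {"a1", "a2", "a5", "a6", "a9", "a10"}
A = {"a1", "a3", "a5", "a7", "a9"}
print(precision_recall_f(A, R))  # (0.6, 0.5, 0.545...)
```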
Quality Metric: Pairwise F-measure
− Example
− R = {a1, a2, a5, a6, a9, a10}
− A = {a1, a3, a5, a7, a9}
− A ∩ R = {a1, a5, a9}
− Pre = |A ∩ R| / |A| = 3/5
− Rec = |A ∩ R| / |R| = 1/2
− Pairwise F-measure
− “+” are pairs that should be merged
− “-” are pairs that should not be merged
− Now, given an answer, can compute Pre, Rec, F-measure
− A widely used metric in ER
− But a bad choice in many circumstances!
− What is the good choice?
39
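To make the pairwise variant concrete, here is a sketch (my own, not from the slides) that turns two clusterings into their sets of should-merge (“+”) pairs and reuses precision/recall/F-measure over those pairs; the example clusterings are hypothetical.

```python
# Sketch of pairwise F-measure for clustering results:
# "+" pairs are pairs of elements placed in the same cluster.
from itertools import combinations

def positive_pairs(clusters):
    """All unordered pairs that a clustering puts in the same cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def pairwise_prf(predicted, truth):
    A, R = positive_pairs(predicted), positive_pairs(truth)
    tp = len(A & R)
    p = tp / len(A) if A else 0.0
    r = tp / len(R) if R else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

truth     = [["x1", "x2", "x3"], ["y1", "y2"]]   # hypothetical ground truth
predicted = [["x1", "x2"], ["x3", "y1", "y2"]]   # hypothetical answer
print(pairwise_prf(predicted, truth))            # (0.5, 0.5, 0.5)
```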
Web People Search (WePS)
1. Query Google with a person name (“John Smith”)
2. Top-K webpages
3. Task: cluster the webpages (related to any John Smith), a cluster per person
• Web domain
• Very active research area
• Many problem variations
− E.g., context keywords
[Figure: top-K webpages grouped into clusters Person 1, Person 2, Person 3, …, Person N; the number of persons is unknown beforehand]
40
Recall that…
• Lookup
– List of all objects is given
– Match references to objects
• Grouping
– No list of objects is given
– Group references that corefer
[Figure: references “J. Smith”, “John Smith”, “Jane Smith” in text, and objects such as MIT and Intel Inc.]
WePS is a grouping task
System Architecture
[Diagram: user input (“John Smith”) → Search Engine → Top-K webpages → Preprocessing (TF/IDF, NE/URL extraction, ER graph) → preprocessed webpages + auxiliary information → Clustering → Postprocessing (cluster sketches, cluster rank, webpage rank) → results (Person1, Person2, Person3) shown in the user interface]
43
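Purely as an illustration of the clustering stage in the diagram (not the actual WePS system, which also uses NE/URL extraction and an ER graph): group the top-K pages by TF/IDF cosine similarity against a threshold. The page texts and the threshold are invented.

```python
# Illustrative sketch of the clustering stage: represent each webpage by a
# TF/IDF vector and group pages whose cosine similarity exceeds a threshold.
import math
from collections import Counter

pages = [
    "john smith professor computer science irvine",
    "john smith computer science department irvine",
    "john smith plumber los angeles plumbing services",
]

def tfidf_vectors(docs):
    tf = [Counter(d.split()) for d in docs]
    df = Counter(w for t in tf for w in t)
    n = len(docs)
    return [{w: c * math.log(n / df[w]) for w, c in t.items()} for t in tf]

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = tfidf_vectors(pages)
t = 0.2                                  # invented similarity threshold
clusters = []
for i, v in enumerate(vecs):             # greedy threshold grouping
    for c in clusters:
        if cosine(v, vecs[c[0]]) > t:
            c.append(i)
            break
    else:
        clusters.append([i])
print(clusters)  # [[0, 1], [2]] -- the two Irvine pages group; the plumber page stays separate
```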