The Cimple Project on Community Information Management
AnHai Doan
University of Wisconsin-Madison
The CIM Problem
Numerous online communities
– database researchers, movie fans, legal professionals,
bioinformatics, enterprise intranets, tech support groups
Each community = many data sources + many members
Database community
– home pages, project pages, DBworld, DBLP, conference pages, ...
Movie fan community
– review sites, movie home pages, theatre listings, ...
Legal profession community
– law firm home pages
2
The CIM Problem
Members often want to discover, query, and monitor
information in the community
Database community
– what is new in the past week in the database community?
– any interesting connection between researchers X and Y?
– find all citations of this paper in the past one week on the Web
– what are current hot topics? who has moved where?
Legal profession community
– which lawyers have moved where?
– which law firms have taken on which cases?
3
The CIM Problem
To address such needs, build data portals
Starting out topic-based, now structured data portals
– DBLP, Citeseer, IMDB, GlobalSpec, etc.
Limitations of current solutions
– mostly by hand, labor intensive, error prone
– hard-to-port solutions
– few services other than browsing and keyword search
4
Cimple Project @ Wisconsin / Yahoo! Research
Develop generic solutions to create structured data portals
via extraction + integration + mass collaboration
[Figure: Cimple architecture. Data sources (researcher homepages, conference pages, group pages, the DBworld mailing list, DBLP; Web pages and text documents) are turned into an ER graph (e.g., Jim Gray give-talk SIGMOD-04) that supports services: keyword search, SQL querying, question answering, browse, mining, alert/monitor, news summary. Users personalize the system and provide feedback.]
5
The Research Team
Faculty / Vice President
– AnHai Doan
– Raghu Ramakrishnan
Current students
– Pedro DeRose
– Warren Shen
– Fei Chen
– Yoonkyong Lee
– Doug Burdick
– Mayssam Sayyadian
– Xiaoyong Chai
– Ting Chen
6
Prototype System: DBLife
Integrate data of the DB research community
1164 data sources
Crawled daily, 11000+ pages = 160+ MB / day
7
Data Extraction
8
Data Integration
Raghu Ramakrishnan
co-authors = A. Doan, Divesh Srivastava, ...
9
Resulting ER Graph
[Figure: ER graph. Nodes: the paper “Proactive Re-optimization”, Shivnath Babu, Pedro Bizarro, Jennifer Widom, David DeWitt, SIGMOD 2005. Edge labels: write, coauthor, advise, PC-member, PC-Chair.]
10
Querying The ER Graph
Query: “David DeWitt Jennifer Widom”
[Figure: three ranked answers connecting David DeWitt and Jennifer Widom: (1) a direct coauthor edge; (2) via SIGMOD 2005, with PC-member and PC-Chair edges; (3) via Shivnath Babu, with advise and coauthor edges]
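A minimal sketch, in Python, of how a keyword query naming two entities might be answered by enumerating short connecting paths in the ER graph. The edge list (including which endpoint each edge attaches to), the `connections` function, and the path-length limit are assumptions for illustration; the slides do not say how DBLife computes or ranks its answers.

```python
from collections import deque

# Hypothetical edges; endpoints are assumed, only the labels come from the slide.
EDGES = [
    ("David DeWitt", "coauthor", "Jennifer Widom"),
    ("Jennifer Widom", "advise", "Shivnath Babu"),
    ("Shivnath Babu", "coauthor", "David DeWitt"),
    ("David DeWitt", "PC-member", "SIGMOD 2005"),
    ("Jennifer Widom", "PC-Chair", "SIGMOD 2005"),
]

def build_adjacency(edges):
    """Index edges by node, treating every edge as traversable in both directions."""
    adj = {}
    for src, label, dst in edges:
        adj.setdefault(src, []).append((label, dst))
        adj.setdefault(dst, []).append((label, src))
    return adj

def connections(edges, start, goal, max_edges=2):
    """Return simple paths from start to goal, shortest first.

    A path is a list alternating node, edge label, node, ...
    """
    adj = build_adjacency(edges)
    results = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            results.append(path)
            continue
        if len(path) // 2 >= max_edges:   # number of edges already on this path
            continue
        visited = set(path[::2])          # nodes sit at the even positions
        for label, nxt in adj.get(node, []):
            if nxt not in visited:
                queue.append(path + [label, nxt])
    return results

ANSWERS = connections(EDGES, "David DeWitt", "Jennifer Widom")
for rank, path in enumerate(ANSWERS, start=1):
    print(rank, " - ".join(path))
```

Breadth-first order means shorter connections are listed first, which is one simple stand-in for ranking.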
11
Provide Services
DBLife system
12
Mass Collaboration: Example 1
Picture is removed if enough users vote “no”.
13
Mass Collaboration Meets Jeff Naughton
Jeffrey F. Naughton swears that
this is David J. DeWitt
14
Mass Collaboration: Example 2
Community Wikipedia
backed up by a structured underlying database
15
What We Have Done
Define the CIM problem / understand it a little bit
– start to talk about it in the DB community
[SIGMOD-06 tutorial, IEEE DEB-06, CIDR-07]
Build DBLife / helps clarify research issues
– live at dblife.cs.wisc.edu
– latest stuff at dblife-labs.cs.wisc.edu
Start some preliminary research
– ICDE-07a, ICDE-07b
16
What We Would Like to Do Next
Release DBLife
– as a research / education tool
– possible service to the DB community
– demo of CIM systems
– benchmark / challenge for data integration / extraction
Develop and release a generic Cimple platform
– anyone can use it to build structured data portals
Build CimBase: a hosting service
– anyone can specify a structured portal on CimBase
– we will build and host it
Continue research / expand team / build alliance
17
Research Challenges (1)
[Figure: the Cimple architecture diagram (data sources, ER graph, services, user feedback)]
Information extraction
Data integration
Mass collaboration
18
Research Challenges (2)
[Figure: the Cimple architecture diagram (data sources, ER graph, services, user feedback)]
Exploiting extracted data
Handling uncertainty / provenance / explanation
Dealing with evolving data, versioning, temporal data
19
Research Challenges (3)
[Figure: the Cimple architecture diagram (data sources, ER graph, services, user feedback)]
What is the right architecture?
What is the right data model / storage?
How to build continuously running systems?
How to build massively scalable hosting services?
How to build a generic CIM platform?
20
Rest of the Talk
The CIM problem
The Cimple solution approach
What we have done / plan to do
Research challenges
– information extraction
– data integration (focus on entity matching)
– mass collaboration
Broader perspectives
21
Declarative IE
Current IE research
– develops learning- & rule-based solutions [SIGMOD-06 tutorial]
– focuses largely on improving accuracy
[Example document: “DECLARATIVE IE / Dr. R. Ramakrishnan / This is a fun topic ...”]
Real-world IE applications
– glue multiple such solutions together, using Perl
Serious problems
– hard to develop, understand, debug, and optimize
22
Example in DBLife
Find conference name in raw text
#############################################################################
# Regular expressions to construct the pattern to extract conference names
#############################################################################
# These are subordinate patterns
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)"; # A word starting with a capital letter and ending with 0 or more spaces
# e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)";
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))"; # Conference abbreviations like "(SIGMOD'06)"
# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)";
##############################################################
# Given a <dbworldMessage>, look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);
#########################################################
# In a given <file>, look for occurrences of <pattern>
# <pattern> is a regular expression
#########################################################
sub lookForPattern {
    my ($file,$pattern) = @_;
23
Example in DBLife (cont.)
    # Only look for conference names in the top 20 lines of the file
    my $maxLines=20;
    my $topOfFile=getTopOfFile($file,$maxLines);
    # Look for the match in the top 20 lines - case insensitive, allow matches spanning multiple lines
    if($topOfFile=~/(.*?)$pattern/is) {
        my ($prefix,$name)=($1,$2);
        # If it matches, do a sanity check and clean up the match
        # Verify that the first letter is a capital letter or number
        if(!($name=~/^\W*?[A-Z0-9]/)) { return (); }
        # If there is an abbreviation, cut off whatever comes after that
        if($name=~/^(.*?$abbreviations)/s) { $name=$1; }
        # If the name is too long, it probably isn't a conference.
        # Count the non-whitespace characters: a /g match in scalar context only returns true or false.
        my $nonSpaceCount = () = $name=~/\S/g;
        if($nonSpaceCount > 100) { return (); }
        # Get the first letter of the last word (need to do this after chopping off parts of it due to abbreviation)
        my ($letter,$nonLetter)=("[A-Za-z]","[^A-Za-z]");
        # Need a space before $name to handle the first $nonLetter in the pattern if there is only one word in name
        " $name"=~/$nonLetter($letter)$letter*$nonLetter*$/;
        my $lastLetter=$1;
        if(!($lastLetter=~/[A-Z]/)) { return (); } # Verify that the first letter of the last word is a capital letter
        # Passed test, return a new crutch
        return newCrutch(length($prefix),length($prefix)+length($name),$name,"Matched pattern in top $maxLines lines","conference name",getYear($name));
    }
    return ();
}
24
Solution: Declarative, Compositional IE
Treat each solution as a “black box”
Glue black boxes using a Datalog-like language
– author(y,d) :- docs(d), name(y,d), title(x,d), distance-line(x,y)<3
– name(y,d) :- docs(d), seeds(s), namepatterns(s,p), match(p,d,y)
– title(x,d) :- docs(d), lines(x,n,d), allcaps(x), (n<5)
Example document: “DECLARATIVE IE / Dr. R. Ramakrishnan / This is a fun topic ...”
seeds(s): (Raghu, Ramakrishnan), (Divesh, Srivastava), ...
p (name patterns for a seed): Raghu Ramakrishnan, R. Ramakrishnan, Dr. Ramakrishnan, etc.
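The three rules above can be read as a small program over black-box predicates. The following Python sketch is one way to mimic them, assuming that `allcaps` means "every cased character is upper case", that `distance-line` is the difference in line numbers, and that `namepatterns` produces exactly the three variants listed on the slide. All function names and these interpretations are assumptions for illustration, not the Cimple implementation.

```python
import re

DOC = "DECLARATIVE IE\nDr. R. Ramakrishnan\nThis is a fun topic ..."
SEEDS = [("Raghu", "Ramakrishnan"), ("Divesh", "Srivastava")]

def lines(doc):
    """lines(x,n,d): every line x of document d with its line number n."""
    return [(x, n) for n, x in enumerate(doc.splitlines())]

def titles(doc):
    """title(x,d) :- docs(d), lines(x,n,d), allcaps(x), (n<5)."""
    return [(x, n) for x, n in lines(doc) if x.isupper() and n < 5]

def namepatterns(seed):
    """namepatterns(s,p): assumed variants of a (first, last) seed."""
    first, last = seed
    return [f"{first} {last}", f"{first[0]}. {last}", f"Dr. {last}"]

def names(doc, seeds):
    """name(y,d) :- docs(d), seeds(s), namepatterns(s,p), match(p,d,y)."""
    found = []
    for seed in seeds:
        for pattern in namepatterns(seed):
            for x, n in lines(doc):
                hit = re.search(re.escape(pattern), x)
                if hit:
                    found.append((hit.group(0), n))
    return found

def authors(doc, seeds):
    """author(y,d) :- docs(d), name(y,d), title(x,d), distance-line(x,y)<3."""
    return {
        y
        for y, name_line in names(doc, seeds)
        for _, title_line in titles(doc)
        if abs(title_line - name_line) < 3
    }

print(authors(DOC, SEEDS))
```

Each function plays the role of one black box, so any of them could be swapped for a learned extractor without touching the rule that composes them.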
25
IE Execution Plan
PROJECT_[y,d]
  JOIN on distance-line(x,y)<3
    match(y,p,d)
      namepatterns(p,s)
        seeds(s)
      docs(d)
    SELECT_[allcaps(x) and (n<5)]
      lines(x,n,d)
        docs(d)
Example input document: “DECLARATIVE IE / Dr. R. Ramakrishnan / This is a fun topic ...”
26
Sample Optimization: Push Down Selections
[Figure: the IE execution plan, as on the “IE Execution Plan” slide]
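As a rough illustration of why pushing selections down helps, the Python sketch below evaluates the same query two ways: joining all lines with all name matches and filtering afterwards, versus filtering the lines with `allcaps(x) and (n<5)` before the join. The row data and both function names are hypothetical, and the pair count is only a stand-in for cost.

```python
# (line text, line number) rows and (name, line number) rows for one document
LINE_ROWS = [
    ("DECLARATIVE IE", 0),
    ("Dr. R. Ramakrishnan", 1),
    ("This is a fun topic ...", 2),
]
NAME_ROWS = [("R. Ramakrishnan", 1)]

def join_then_select(line_rows, name_rows):
    """Join every line with every name, then apply all predicates."""
    pairs = [(x, nx, y, ny) for x, nx in line_rows for y, ny in name_rows]
    result = {
        y for x, nx, y, ny in pairs
        if x.isupper() and nx < 5 and abs(nx - ny) < 3
    }
    return result, len(pairs)

def select_then_join(line_rows, name_rows):
    """Apply the selection on lines first, then join only the survivors."""
    title_rows = [(x, nx) for x, nx in line_rows if x.isupper() and nx < 5]
    pairs = [(x, nx, y, ny) for x, nx in title_rows for y, ny in name_rows]
    result = {y for x, nx, y, ny in pairs if abs(nx - ny) < 3}
    return result, len(pairs)

naive_result, naive_pairs = join_then_select(LINE_ROWS, NAME_ROWS)
pushed_result, pushed_pairs = select_then_join(LINE_ROWS, NAME_ROWS)
print(naive_result == pushed_result, naive_pairs, pushed_pairs)
```

Both plans return the same answer; the second one simply builds fewer intermediate pairs.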
27
Sample Optimization: Order Operations
[Figure: the IE execution plan, as on the “IE Execution Plan” slide]
28
Sample Optimization:
Efficient Large-Scale Pattern Matching
[Figure: the IE execution plan, as on the “IE Execution Plan” slide]
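The slide does not say which technique is used for large-scale pattern matching. One common option, sketched here in Python purely as an assumption, is to compile all name patterns into a single alternation so that each document is scanned once rather than once per pattern. The pattern list and function names are hypothetical.

```python
import re

PATTERNS = [
    "Raghu Ramakrishnan", "R. Ramakrishnan", "Dr. Ramakrishnan",
    "Divesh Srivastava", "D. Srivastava", "Dr. Srivastava",
]
DOC = "DECLARATIVE IE\nDr. R. Ramakrishnan\nThis is a fun topic ..."

def compile_patterns(patterns):
    """Build one regex matching any pattern, longest alternatives first."""
    ordered = sorted(set(patterns), key=len, reverse=True)
    return re.compile("|".join(re.escape(p) for p in ordered))

def match_all(matcher, doc):
    """Return (matched text, start offset) for every hit, in one pass over doc."""
    return [(m.group(0), m.start()) for m in matcher.finditer(doc)]

MATCHER = compile_patterns(PATTERNS)
HITS = match_all(MATCHER, DOC)
print(HITS)
```

With many thousands of patterns, a dedicated multi-pattern algorithm would likely be a better fit than a regex alternation; the point of the sketch is only the single pass.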
29
Related Project: Avatar @ IBM Almaden
Person can be reached at PhoneNumber
Person followed by ContactPattern followed by PhoneNumber
Declarative Query Language
ContactPattern ← RegularExpression(Email.body, “can be reached at”)
PersonPhone ← Precedes(Precedes(Person, ContactPattern, D), Phone, D)
30
Information Extraction: Another Example
time 0: “DECLARATIVE IE / Dr. R. Ramakrishnan / This is a fun topic ...”
time 1: “DECLARATIVE IE / Dr. R. Ramakrishnan / This is a great topic ...”
time 2: “DECLARATIVE IE / Dr. R. Ramakrishnan / More will follow soon ...”
How to efficiently extract information over text streams?
31
Data Integration Research: Setting the Context
Past and current work
– build the foundation: TSIMMIS, Information Manifold, UPenn, P2P, etc.
– develop solutions for specific integration tasks: wrapping, schema matching, entity matching, adaptive QP, etc.
– branching into many app. domains: bioinformatics, PIM (e.g., semex, iMemex), etc.
– top-k, topX query processing
Our work in Cimple
– compositional solutions for schema matching, entity matching, etc. [VLDB-05a, VLDBJ-06, ICDE-07a, Tech Report-07a]
– best-effort data integration: e.g., keyword search + automatic schema matching + automatic entity matching over relational databases [ICDE-07b]
– data integration for masses [Tech Report-07b]
32
Sample Data Integration Challenge in Cimple:
Matching Mentions of Entities
[Figure: the Cimple architecture diagram (data sources, ER graph, services, user feedback)]
33
Extremely Important Problem!
Appears in numerous real-world contexts
Plagues many applications that we have seen
– Citeseer, Rexa, DBLP, InfoZoom, etc.
Why so important?
Many services rely on correct mention matching
Incorrect matching propagates errors
34
An Example
Discover related organizations
using occurrence analysis:
“J. Han ... Centrum voor Wiskunde en Informatica”
DBLife incorrectly matches this mention “J. Han” with
“Jiawei Han”, but it actually refers to “Jianchao Han”.
35
Classical Mention Matching
Applies just a single “matcher”
Focuses mainly on improving matcher accuracy
Our key observation:
A single matcher often has limited utility
36
Illustrating Example
Only one Luis Gravano; two Chen Li-s
d1: Luis Gravano’s Homepage
– L. Gravano, K. Ross. Text Databases. SIGMOD 03
– L. Gravano, J. Sanz. Packet Routing. SPAA 91
– L. Gravano, J. Zhou. Text Retrieval. VLDB 04
d2: Columbia DB Group Page
– Members: L. Gravano, K. Ross, J. Zhou
d3: DBLP
– Luis Gravano, Kenneth Ross. Digital Libraries. SIGMOD 04
– Luis Gravano, Jingren Zhou. Fuzzy Matching. VLDB 01
– Luis Gravano, Jorge Sanz. Packet Routing. SPAA 91
– Chen Li, Anthony Tung. Entity Matching. KDD 03
– Chen Li, Chris Brown. Interfaces. HCI 99
d4: Chen Li’s Homepage
– C. Li. Machine Learning. AAAI 04
– C. Li, A. Tung. Entity Matching. KDD 03
What is the best way to match mentions here?
37
A liberal matcher:
good for matching Luis Gravano,
bad for matching Chen Li
s0 matcher: two mentions match
if they share the same name.
[Figure: the four sources d1–d4 of the illustrating example]
38
A conservative matcher:
good for matching Chen Li,
bad for matching Luis Gravano
s1 matcher: two mentions match if they
share the same name and at least
one co-author name.
[Figure: the four sources d1–d4 of the illustrating example]
39
Better solution:
apply both matchers in a workflow
[Figure: a workflow that combines matchers s0 and s1 through union operators over the sources d1, d2, d3, d4 of the illustrating example]
s0 matcher: two mentions match if they share the same name.
s1 matcher: two mentions match if they share the same name and at least one co-author name.
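A minimal Python sketch of composing the two matchers. It assumes that "same name" means same first initial and same last name, and it uses one plausible reading of the workflow figure: s0 within d1 ∪ d2 and within d4, then s1 over everything including d3. The mention data is a small subset of the example, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Mention:
    source: str
    name: str
    coauthors: frozenset = frozenset()

def name_key(name):
    """Assumed notion of 'same name': first initial plus last name."""
    parts = name.replace(".", " ").split()
    return (parts[0][0].lower(), parts[-1].lower())

def s0(a, b):
    """Liberal matcher: the two mentions share the same name."""
    return name_key(a.name) == name_key(b.name)

def s1(a, b):
    """Conservative matcher: same name and at least one shared co-author name."""
    shared = {name_key(c) for c in a.coauthors} & {name_key(c) for c in b.coauthors}
    return s0(a, b) and bool(shared)

def apply_matcher(matcher, mentions):
    return {(a, b) for a, b in combinations(mentions, 2) if matcher(a, b)}

def workflow(d1, d2, d3, d4):
    """Assumed workflow: s0 on d1 ∪ d2, s0 on d4, then s1 on the union with d3."""
    pairs = apply_matcher(s0, d1 + d2)
    pairs |= apply_matcher(s0, d4)
    pairs |= apply_matcher(s1, d1 + d2 + d3 + d4)
    return pairs

def clusters(mentions, pairs):
    """Group mentions connected by matched pairs (simple union-find)."""
    parent = {m: m for m in mentions}
    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for m in mentions:
        groups.setdefault(find(m), set()).add(m)
    return list(groups.values())

D1 = [Mention("d1", "L. Gravano", frozenset({"K. Ross"})),
      Mention("d1", "L. Gravano", frozenset({"J. Zhou"}))]
D2 = [Mention("d2", "L. Gravano")]
CHEN_TUNG = Mention("d3", "Chen Li", frozenset({"Anthony Tung"}))
CHEN_BROWN = Mention("d3", "Chen Li", frozenset({"Chris Brown"}))
D3 = [Mention("d3", "Luis Gravano", frozenset({"Kenneth Ross"})), CHEN_TUNG, CHEN_BROWN]
D4 = [Mention("d4", "C. Li"), Mention("d4", "C. Li", frozenset({"A. Tung"}))]

ALL = D1 + D2 + D3 + D4
RESULT = clusters(ALL, workflow(D1, D2, D3, D4))
for group in RESULT:
    print(sorted((m.source, m.name) for m in group))
```

On this data the workflow keeps the two Chen Li-s apart while still merging every Luis Gravano mention, which neither matcher achieves when applied alone to all sources.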
40
Key Challenges
How to compose matchers, to form a space of workflows?
How to estimate the accuracy of each workflow?
How to efficiently find one with high accuracy?
[Figure: example workflow of s0, s1, and union operators over d1, d2, d3, d4]
[See ICDE-07a]
41
Mass Collaboration: The General Idea
Many applications have multiple developers / users
– how to exploit feedback from all of them?
Variants of this are known as
– collective development of system, mass collaboration,
collective curation, Web 2.0 applications, social software, etc.
Has been applied to many applications
– open-source software, bug detection, tech support group, Yahoo!
Answers, Google Co-op, and many more
Studied in some academic contexts, e.g., ESP Game
Little has been done in extraction / integration contexts
– except in industry, e.g., epinions.com
42
Sample Mass Collaboration in DBLife
43
Sample Mass Collaboration in DBLife
[Figure labels: Raw data, IE, W1, W2, ..., Wn]
44
Key Challenges
What types of extraction / integration tasks
are most amenable to mass collaboration?
– e.g., see MOBS project at Illinois [WebDB-03, ICDE-05]
How to entice people to contribute?
What can they contribute?
What is the underlying data model?
How to handle the Naughton effect?
How to propagate user contributions?
How to undo?
How to reconcile multiple conflicting editions?
– e.g., see ORCHESTRA project at Penn [Taylor & Ives, SIGMOD-06]
45
Sample Research: Summary
Information extraction
– how to do it in a declarative / compositional fashion?
– how to apply database-like optimization techniques?
Data integration
– how to do it incrementally (best effort, pay-as-you-go)?
an example of a Data Space?
– how to do it in a compositional fashion?
Human computation / mass collaboration
– new! (Though industry has been doing it for years.)
– how to do it for data management tasks?
46
Conclusions
Community Information Management
– increasingly crucial problem
The Cimple project
– sample challenges: information extraction, data integration, human computation
– extends the footprints of DB technologies to Web data
– develops new DB technologies
DBLife prototype
– research/education tool, community service, benchmark
Search “cimple wisc” for project homepage
47
Broader Perspectives
[speculation mode]
Current Web: keyword search over text
Future Web
– should have increasingly more structure
– should have more ways to exploit structure
– should be more “social”
This future Web should be great for our community
– we are the “Structure King”
– if the Web remains text-centric, that is not as good for us
How to accelerate the coming of this future Web?
– Cimple and many current projects can contribute
– but as a community we need more efforts in this direction!
48