SIGMOD2009 Overview
Web group
Li Yukun
Outline
Overview of SIGMOD2009
Overview of two selected papers
Optimizing Complex Extraction Programs over Evolving Text Data
Exploiting Context Analysis for Combining Multiple Entity Resolution
Systems
Sessions of SIGMOD2009
Research Session 1: Security I
Research Session 2: Databases on Modern Hardware
Research Session 3: Information Extraction
Research Session 4: Security II
Research Session 5: Large-Scale Data Analysis
Research Session 6: Entity Resolution
Research Session 7: Testing and Security
Research Session 8: Column Stores
Research Session 9: Data on the Web
Research Session 10: Probabilistic Databases I
Research Session 11: Database Optimization
Research Session 12: Probabilistic Databases II
Research Session 13: Skyline Query Processing
Research Session 14: Understanding Data and Queries
Research Session 15: Nearest Neighbor Search
Research Session 16: Query Processing on Semi-structured Data
Research Session 17: Data Integration
Research Session 18: Keyword Search
Research Session 19: Semi-structured Data Management
Research Session 20: Data Management Pearls
Research Session 21: Indexing
SIGMOD keynote talks
Enterprise Applications - OLTP and OLAP - Share One
Database Architecture
Hasso Plattner (Hasso-Plattner-Institute for IT Systems
Engineering)
Transforming Data Access Through Public Visualization
Fernanda B. Viegas (IBM)
Martin Wattenberg (IBM)
Web-based visualizations, ranging from political art projects to news stories, have reached
audiences of millions. Meanwhile, new initiatives in government, aimed at all citizens, point to an
era of increased transparency. The talk describes a "living laboratory" web site where people may upload their own
data, create interactive visualizations, and carry on conversations. Political discussions, citizen
activism, religious discussions, game playing, and educational exchanges all happen on the site.
To further support these scenarios, and the users they represent, will require continued innovation
in data presentation and interaction.
SIGMOD INVITED SESSIONS
Special Invited Session on Human-Computer Interaction with Information
Design for Interaction
Daniel Tunkelang (Endeca)
Voyagers and Voyeurs: Supporting Social Data Analysis
Jeffrey Heer (Stanford University)
Augmented Social Cognition
Ed H. Chi (PARC)
Special Invited Session on Systems Research and Information
Management
Storage Class Memory: Technology, Systems and Applications
Richard F. Freitas (IBM)
Distributed Data-Parallel Computing Using a High-Level Programming
Language
Michael Isard (Microsoft Research)
Yuan Yu (Microsoft Research)
SIGMOD TUTORIALS
Large-Scale Uncertainty Management Systems: Learning and
Exploiting Your Data
FPGA: What's in it for a Database?
Keyword Search on Structured and Semi-Structured Data
Database Research in Computer Games
Anonymized Data: Generation, Models, Usage
Summary
Hot words
Probabilistic, Semi-structured, Security,
Search & Query, Extraction & Resolution
DataSpace Framework
[Figure: DataSpace framework diagram. Labels: User Interaction, Browsing, Query, Kd search, Association database, Evolution, Entity, Association, Extraction, Resolution, Integration, Domain, Email, Memo, User logs, User documents, Web pages, Blogs, DB.]
Future work on DataSpace
Managing Entity and association
Entity Identification and Resolution
Data extraction and cleaning
Pay-as-you-go integration
Uncertain data mapping
Update of entity and association
Query & Search in DataSpace
Keyword search
Approximate query
Facet-based search in dataspace
Selected readings
Data integration
Top-K Generation of Integrated Schemas Based on Directed and Weighted Correspondences
Core Schema Mappings
Entity Resolution
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems
Entity Resolution with Iterative Blocking
A Grammar-based Entity Representation Framework for Data Cleaning
Data on the Web
Indexing
A Revised R*-tree in Comparison with Related Index Structures
Understanding Data and Queries
Why Not?
Query by Output
Detecting and Resolving Unsound Workflow Views for Correct Provenance Analysis
Query processing on Semi-structured data
Scalable Join Processing on Very Large RDF Graphs
Optimizing Complex Extraction Programs over Evolving Text Data
Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model
Combining Keyword Search and Forms for Ad Hoc Querying of Databases
Outline
Overview of SIGMOD2009
Two selected papers
Optimizing Complex Extraction Programs over Evolving Text Data
Exploiting Context Analysis for Combining Multiple Entity
Resolution Systems
Paper 1
Introduction
Motivation
Traditional IE method: Static
Practical conditions: Dynamic corpus
DBLife (10,000+ URLs, 120+ MB corpus snapshot)
Enterprise Intranet
Problem
How to efficiently extract information from
dynamic corpora
Problem Definition
Concepts
Data pages, Extractors, Mentions
An extractor E: p → R(a1, a2, …, an) extracts mentions of
relation R from page p. A mention of R is a
tuple (m1, m2, …, mn) such that mi is either a mention of
attribute ai or nil.
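The definition above can be made concrete with a small sketch. It assumes a toy relation talk(speaker, topic) and a regular-expression extractor; the names `Span`, `TALK`, and `extract_talks` are hypothetical and do not come from the paper.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Span(NamedTuple):
    """An attribute mention: a substring of the page plus its character offsets."""
    start: int
    end: int  # exclusive end offset (an assumption of this sketch)
    text: str


# A mention of R(a1, ..., an): slot i holds a Span for attribute ai, or None (nil).
Mention = Tuple[Optional[Span], ...]

# Hypothetical pattern for the illustrative relation talk(speaker, topic).
TALK = re.compile(r"(?P<speaker>[A-Z]\w+ [A-Z]\w+) talks about (?P<topic>[\w ]+)")


def extract_talks(page: str) -> List[Mention]:
    """Extractor E: p -> talk(speaker, topic), applied to a single page."""
    mentions: List[Mention] = []
    for m in TALK.finditer(page):
        speaker = Span(m.start("speaker"), m.end("speaker"), m.group("speaker"))
        topic = Span(m.start("topic"), m.end("topic"), m.group("topic"))
        mentions.append((speaker, topic))
    return mentions
```

Each mention keeps character offsets as well as text, because the scope and context definitions that follow are stated in terms of start and end positions.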
Examples
Assumptions
Mentions are extracted from each data page individually
Methods
Concepts
Extractor scope
Let s.start and s.end be the start and end character positions of a string s in a
page p. We say an extractor E has scope α iff for any mention m =
(m1, …, mn) produced by E, (max_i mi.end − min_i mi.start) < α, where mi.start
and mi.end are the start and end character positions of attribute mention mi in
page p.
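A minimal sketch of the scope condition, assuming each attribute mention is given as a (start, end) offset pair and nil attributes are represented by None. The helper names `mention_extent` and `within_scope` are illustrative only.

```python
from typing import Optional, Sequence, Tuple

# An attribute mention as (start, end) character offsets, or None for nil.
Offsets = Optional[Tuple[int, int]]


def mention_extent(mention: Sequence[Offsets]) -> int:
    """Return max_i mi.end - min_i mi.start over the non-nil attribute mentions."""
    present = [m for m in mention if m is not None]
    if not present:
        return 0
    return max(end for _, end in present) - min(start for start, _ in present)


def within_scope(mention: Sequence[Offsets], alpha: int) -> bool:
    """Check the scope condition (extent < alpha) for one mention."""
    return mention_extent(mention) < alpha
```

An extractor has scope α only if `within_scope` holds for every mention it can produce, so this check can refute a claimed scope on observed output but cannot prove it.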
Extractor Context
The β-context of mention m in page p is the string
p[(m.start − β)..(m.end + β)], i.e., the string of m extended on
both sides by β characters. We say extractor E has context β iff for
any m and p′ obtained by perturbing the text of p outside the
β-context of m, applying E to p′ still produces m as a mention.
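The β-context can be sketched as a simple slice. This assumes an exclusive end offset and clamps the window at the page boundaries; both choices are assumptions of the sketch rather than details taken from the paper.

```python
def beta_context(page: str, start: int, end: int, beta: int) -> str:
    """Return the mention page[start:end] extended by beta characters on each side.

    Offsets are clamped to the page, so a mention near the beginning or end
    of the page simply gets a shorter context.
    """
    lo = max(0, start - beta)
    hi = min(len(page), end + beta)
    return page[lo:hi]
```

For example, with the page "SIGMOD was held in 2009 in Providence." and the mention "2009", a β of 1 yields " 2009 ". If an extractor has context β, any edit that leaves this window untouched cannot change whether the mention is produced.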
Challenges
Matchers (find overlapping regions)
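As a rough illustration of what a matcher does, the sketch below finds text regions shared by two snapshots of a page. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; it is not one of the matchers from the paper, and the `min_len` threshold is a hypothetical parameter.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def overlapping_regions(old_page: str, new_page: str,
                        min_len: int = 20) -> List[Tuple[int, int, int]]:
    """Return (old_start, new_start, length) for each shared region.

    Regions shorter than min_len are dropped, since very short overlaps
    are unlikely to contain a whole mention together with its context.
    """
    matcher = SequenceMatcher(None, old_page, new_page, autojunk=False)
    return [(b.a, b.b, b.size)
            for b in matcher.get_matching_blocks()
            if b.size >= min_len]
```

Mentions captured from the old snapshot that fall inside a shared region, together with enough surrounding context, are candidates for reuse, so the extractor only needs to run on the remaining text.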
Solutions
CAPTURING IE RESULTS
REUSING CAPTURED IE RESULTS
Level of Reuse:
IE Results to Capture:
Storing Captured IE Results:
Scope of Mention Reuse
Overall Processing Algorithm
Identifying Reuse with Matchers
SELECTING A GOOD IE PLAN
Searching for Good Plans
Cost Model
Evaluation (Data Set)
Experimental Results
Paper 2
Introduction
John Smith
J. Smith
John.Smith
J.Smith
What is entity resolution
Motivation
to identify and group references that co-refer, that
is, refer to the same entity.
New data characteristics:
Examples
The output
a clustering of references, where each cluster is
supposed to represent one distinct entity.
Problem definition
Entity Resolution
The ER problem has been studied in several research areas under
many names, such as coreference resolution, deduplication, object
uncertainty, record linkage, and reference reconciliation. In the
past, a wide variety of techniques have been developed for the ER
problem.
Methods
Similarity (metrics, textual, attributes, etc.)
Blocking
Voting
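The three method families listed above can be sketched in a few lines. This is a generic illustration, not the approach of any specific system: the similarity measure is `difflib`'s ratio, the blocking key (last name token) is a hypothetical choice, and voting is a plain majority over the decisions of several base ER systems.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple


def similarity(a: str, b: str) -> float:
    """Textual similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_by_last_token(names: Iterable[str]) -> Dict[str, List[str]]:
    """Blocking: group names by their last token so only likely matches are compared."""
    blocks: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        key = name.replace(".", " ").split()[-1].lower()
        blocks[key].append(name)
    return blocks


def candidate_pairs(names: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield only the pairs that share a block."""
    for block in block_by_last_token(names).values():
        yield from combinations(block, 2)


def majority_vote(decisions: Sequence[bool]) -> bool:
    """Voting: a pair co-refers if a strict majority of base systems say so."""
    return sum(decisions) * 2 > len(decisions)
```

Blocking reduces the number of comparisons, similarity scores each remaining pair, and voting combines several systems' yes/no answers without looking at the context in which each system is reliable.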
Problem
Existing methods pay little attention to context features
Problem Definition
To identify the co-reference relationship between two
mentions
Context-based framework
Context features
Effectiveness
Generality
Number of clusters
Overview of the approaches
Meta-level Classification
Context-extended classification
Context-weighted Classification
Creating final clusters
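One simple way to turn pairwise co-reference decisions into final clusters is transitive closure with a union-find structure, sketched below. This is an assumed stand-in for illustration; the clustering step used in the paper may differ.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def cluster(references: Sequence[str],
            coreferent_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group references into clusters by merging every pair judged co-referent."""
    parent: Dict[str, str] = {r: r for r in references}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in coreferent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[str, List[str]] = {}
    for r in references:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())
```

Given the references John Smith, J. Smith, John.Smith, and Ed Chi, with the first three linked pairwise, this produces two clusters. Transitive closure is sensitive to a single wrong positive decision, which is one reason the number of clusters is a useful context feature.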
Experiments
Web domain
Data set from WWW'05 [Bekkerman et al.]
Contains web pages of 12 different persons
Created by searching the web using Google
RealPub domain
11682 publications
14590 authors
3084 departments
1494 organizations
Experimental results on Web domain
Summary
Managing uncertain data and unstructured
data is becoming a hot topic
It is also an important problem for DataSpace
Promising research topics can be selected on this basis
Thanks