Part II
Large-Scale
Web Database Integration Systems
Definitions
Web database (database search engine): Web-accessible database (WDB)
Characteristics:
Data are structured and are stored in database systems.
Data are accessible through a Web search interface.
Result pages are dynamically generated by wrapping data
in HTML files.
Web database integration: the process of enabling
unified access to multiple Web databases in the same
application domain.
An Example Web Database
More Examples
WDB Integration System vs. MSE
Major differences between Web databases and
regular document search engines (DSE):
DSE searches Web pages while WDB searches database
entities.
WDB usually has a complex interface while DSE usually
has a simple interface.
DSE ranks results by similarity while WDB usually ranks
results by some attribute values.
WDB Integration System Architecture
[Figure: WDB integration system architecture, organized into an Integrated Interface Generation Module and a Query Processing Module. Labeled components: Web Database Discovery, WDB List, WDB Interface Schema Extraction, WDB Clustering By Domain, WDB Cluster 1 ... WDB Cluster n, Interface Integration, Integrated Interface 1 ... Integrated Interface n, User query, Integrated Interface, Domain Mapping, Database Selection, Query Translation and Dispatch, WDB 1 ... WDB m on the World Wide Web, Result Extraction, Result Annotation, Entity Identification, Result Merging, Result.]
Main Technical Problems
WDB Search Interface Modeling
WDB Search Interface Extraction
WDB Search Interface Clustering
WDB Search Interface Integration
Global Query Mapping and Optimization
Search Result Extraction and Annotation
Online Entity Identification
Remaining Research Challenges
A Related Book
Eduard Dragut, Weiyi Meng, Clement Yu. Deep Web
Query Interface Understanding and Integration.
Morgan & Claypool Publishers, June 2012.
Table of Contents
Introduction
Query Interface Representation and Extraction
Query Interface Clustering and Categorization
Query Interface Matching
Query Interface Attribute Integration
Query Interface Integration
Summary and Future Research
WDB Query Interface Modeling
Problem: Represent the information on each interface
in a format that is suitable for integration and
query submission.
An Example WDB Interface
[Figure: a WDB search interface with one attribute highlighted.]
WDB Interface Modeling
Different models have been proposed:
WISE Three-Level Model: site-level, attribute-level,
and element-level.
Hierarchical Model: A search interface is modeled
as an ordered tree of elements.
The hierarchical model is designed to capture the order semantics
and the nested grouping of the attributes in an interface.
Querying Capability Model: Formally characterize
what kinds of queries are valid for a search
interface.
Hierarchical Model: An Example
[Figure: hierarchical model of the aa.com search interface, shown as a tree of groups and fields.]
aa.com
1. Where Do You Want to Go?
origin (From: City or Airport Code)
destination (To: City or Airport Code)
2. When Do You Want to Go?
Departure Date: depMonth, depDay, depTime
Return Date: retMonth, retDay, retTime
3. Number of Passengers
numAdult (Adults)
numChild (Children)
4. What are Your Service Preferences?
cabinClass (Class of Service)
maxiumStops (Number of Connections)
5. Choose a Carrier
carrier
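The ordered-tree idea can be sketched in a few lines of Python. This is only an illustration, not the representation used by any particular system: the class name Node and its attributes are hypothetical, and the tree below covers just two groups of the aa.com example. Children are kept in a list so that sibling order is preserved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of the interface tree: a group (internal) or a field (leaf)."""
    label: str
    field_name: Optional[str] = None  # set only for leaves (form fields)
    children: List["Node"] = field(default_factory=list)  # order is significant

    def leaves(self) -> List[str]:
        """Return field names in left-to-right (interface) order."""
        if not self.children:
            return [self.field_name] if self.field_name else []
        names: List[str] = []
        for child in self.children:
            names.extend(child.leaves())
        return names


# Hypothetical, simplified fragment of the aa.com interface.
interface = Node("aa.com", children=[
    Node("Where Do You Want to Go?", children=[
        Node("From: City or Airport Code", "origin"),
        Node("To: City or Airport Code", "destination"),
    ]),
    Node("Number of Passengers", children=[
        Node("Adults", "numAdult"),
        Node("Children", "numChild"),
    ]),
])
```

Calling interface.leaves() walks the tree depth-first and returns the fields in the order a user would see them.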
Query Interface Extraction
Automatic interface extraction: Automatically extract
information described in an interface representation
model from any given WDB interface.
Primarily two tasks:
Attribute extraction
• Extract elements and labels from the interface.
• Group elements and labels into logical attributes.
Attribute analysis
• Extract and derive meta-information about each attribute
based on the interface representation model.
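A minimal sketch of the attribute extraction task, using only Python's standard html.parser module. It assumes the easy case in which every form element is tied to its label through a label "for" attribute; many real interfaces lack this, which is why practical extractors also rely on layout and proximity. The names FormExtractor and extract_attributes are hypothetical.

```python
from html.parser import HTMLParser


class FormExtractor(HTMLParser):
    """Collect <label for=...> texts and the form elements they describe."""

    def __init__(self):
        super().__init__()
        self.labels = {}       # element id -> label text
        self.elements = []     # (tag, name, id) in document order
        self._label_for = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label":
            self._label_for = a.get("for")
        elif tag in ("input", "select", "textarea"):
            self.elements.append((tag, a.get("name"), a.get("id")))

    def handle_endtag(self, tag):
        if tag == "label":
            self._label_for = None

    def handle_data(self, data):
        # Text inside an open <label> becomes the label of the target element.
        if self._label_for and data.strip():
            self.labels[self._label_for] = data.strip()


def extract_attributes(html):
    """Group each element with its label into one logical attribute."""
    parser = FormExtractor()
    parser.feed(html)
    return [{"label": parser.labels.get(eid), "element": name, "type": tag}
            for tag, name, eid in parser.elements]
```

Each returned dictionary is one logical attribute; attribute analysis would then add meta-information such as the domain type or default value.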
WDB Query Interface Clustering
Objective: Group WDBs into different clusters such
that all WDBs in the same cluster are related to the
same domain (e.g., sell the same type of products).
Techniques:
1. First, construct a concept hierarchy.
2. Then apply one of the following techniques:
Supervised clustering (training required)
Unsupervised clustering (no training required)
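As a rough illustration of the unsupervised option, the sketch below groups interfaces by the overlap of their label terms. It is a stand-in, not the technique from the slides: it skips the concept hierarchy step, uses a simple greedy single-pass strategy with Jaccard similarity, and the function names and the threshold value 0.3 are assumptions chosen for the example.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_interfaces(interfaces, threshold=0.3):
    """Greedy single-pass clustering.

    interfaces: dict mapping an interface name to its label terms.
    Each interface joins the most similar existing cluster if the
    similarity reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"members": [...], "terms": set(...)}
    for name, terms in interfaces.items():
        terms = set(terms)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(terms, cluster["terms"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"members": [name], "terms": set(terms)})
        else:
            best["members"].append(name)
            best["terms"] |= terms
    return [c["members"] for c in clusters]
```

Because the number of clusters is not fixed in advance, this style of clustering also fits the Web-scale setting, although the result depends on the input order and the threshold.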
Query Interface Integration
It is related to database schema integration.
Schema integration has been studied since the 1980s.
Based on different data models: ER model, relational
model, object-oriented model, etc.
In different contexts: a single database during database
design, or multiple databases in multidatabase/data
warehouse systems.
Key issues: resolving name conflicts, data type conflicts,
structural conflicts, data inconsistency, etc.
Manual approach: Integration rules are manually
written.
Schema Integration vs. Interface
Integration
Comparing WDB interface integration and database schema
integration:
A WDB interface schema is simpler (one table/view versus
the multiple tables of a database schema).
Attributes in a WDB interface are more complex, as they may
consist of multiple elements.
A WDB interface mixes attributes and query conditions, while
a database schema does not.
Metadata must be extracted from a WDB interface, while
it is readily available in a database schema.
WDB interface integration needs to integrate element formats,
attribute layout, and external values, while database schema
integration does not.
Attribute Matching
A key problem in schema/interface integration is to match
attributes from different schemas/interfaces.
A general framework for attribute matching [Rahm and
Bernstein, VLDB Journal 2001].
Develop a number of matchers based on different
information.
Dictionary-level information: attribute names
Schema-level information: data type, key, foreign key, …
Instance-level data: values of attributes
Utilize auxiliary information: special dictionaries, thesauri,
user input, …
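The matcher framework can be illustrated with three small matchers whose scores are combined by a weighted sum. This is a sketch under stated assumptions: the attribute dictionaries, the function names, and the weights (0.4, 0.2, 0.4) are hypothetical, and string similarity from Python's difflib stands in for a real dictionary-level matcher.

```python
from difflib import SequenceMatcher


def name_matcher(a, b):
    """Dictionary-level: string similarity of the attribute names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def type_matcher(a, b):
    """Schema-level: 1.0 if the element types agree, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0


def instance_matcher(a, b):
    """Instance-level: Jaccard overlap of the attribute values."""
    va, vb = set(a["values"]), set(b["values"])
    return len(va & vb) / len(va | vb) if va | vb else 0.0


def match_score(a, b, weights=(0.4, 0.2, 0.4)):
    """Weighted combination of the individual matcher scores."""
    scores = (name_matcher(a, b), type_matcher(a, b), instance_matcher(a, b))
    return sum(w * s for w, s in zip(weights, scores))
```

The instance-level matcher is what lets "Class of Service" and "Cabin" match despite dissimilar names, since both offer values such as Economy and Business.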
Attribute Integration
After attribute matching, attributes are divided
into clusters such that each cluster corresponds to
a global attribute in the integrated interface.
Remaining issues:
1. Determine the name of the global attribute for
each cluster.
2. Determine the domain type of each global
attribute. The domain type will determine the
format.
3. Determine the external values of each global
attribute.
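One simple way to settle the three remaining issues for a cluster is sketched below. The policy is an assumption made for illustration: the most frequent local name becomes the global name, the external values are the order-preserving union of the local values, and the domain type is labeled "finite" when any values exist and "infinite" otherwise (these two labels are hypothetical).

```python
from collections import Counter


def integrate_cluster(attributes):
    """Derive one global attribute from a cluster of matched attributes.

    attributes: list of dicts with a "name" and an optional "values" list.
    """
    # 1. Global name: the most frequently used local name.
    names = Counter(a["name"] for a in attributes)
    global_name = names.most_common(1)[0][0]

    # 3. External values: union of local values, first-seen order kept.
    values = []
    for a in attributes:
        for v in a.get("values", []):
            if v not in values:
                values.append(v)

    # 2. Domain type: decides the format (selection list vs. text box).
    domain_type = "finite" if values else "infinite"
    return {"name": global_name, "domain_type": domain_type, "values": values}
```

A real system would also need to reconcile values that differ only in wording or granularity, which a plain union does not handle.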
Hierarchical Interface Integration (1)
An example of hierarchical schema representation
[Figure: a flight search interface and its schema tree. Interface groups: 1. Where Do You Want to Go? (From: City, To: City); 2. When Do You Want to Go? (Departure Date, Return Date); 3. Number of Passengers? (Adults, Children); 4. Class of Service (Economy, Business, First Class). Tree: Root with children Where …, When …, Number …, Class …; Where has From and To; When has Departure (Dmonth, Dday, Dtime) and Return (Rmonth, Rday, Rtime); Number has Adult, ……]
Siblings are ordered!
Hierarchical Interface Integration (2)
Simple mapping versus complex mapping
Simple mapping: 1-to-1 mapping between two
fields
Complex mapping: 1-to-m mapping between one
field in one interface and multiple fields in another
interface
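Examples of 1-to-m mappings
[Figure: a single date field ("from date" or "departure date") corresponds to the fields month, day, year; a single "No. of passengers" or "passengers" field corresponds to the fields adults, children.]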
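When a query is translated from the integrated interface to a local interface, a 1-to-m mapping means one global value must be split across several local fields. A minimal sketch follows, assuming a global "departure date" field and local month, day, year fields; the mapping table COMPLEX_MAPPINGS and the function names are hypothetical.

```python
from datetime import date


def split_date(d):
    """Split one date value into the three local fields."""
    return {"month": d.month, "day": d.day, "year": d.year}


# Hypothetical mapping table: global field -> function producing local fields.
COMPLEX_MAPPINGS = {"departure date": split_date}


def translate(global_query):
    """Translate a global query (field -> value) into local form fields."""
    local_query = {}
    for field_name, value in global_query.items():
        splitter = COMPLEX_MAPPINGS.get(field_name)
        if splitter is None:
            local_query[field_name] = value       # simple 1-to-1 mapping
        else:
            local_query.update(splitter(value))   # complex 1-to-m mapping
    return local_query
```

The reverse direction (for example, a global adults and children pair mapped to a single local passengers field) would need an aggregating function instead of a splitter.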
Hierarchical Interface Integration (3)
Tree Merging
[Figure: interface trees from American Express and Chase. Labels shown include "Please tell us about yourself", "Please tell us about your employment", Occupation, Phone, Years there, Address, Company address, Country, State, City, Street.]
How to merge?
Hierarchical Interface Integration (4)
Grouping Constraint: Given subgroups in
different user interfaces, is it possible to find a
group such that all elements in each subgroup
are in adjacent locations?
Example: The following example satisfies this
requirement:
{state, city, street}
{country, state, city, street}
{country, state}
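The grouping constraint asks for an ordering of the elements in which every subgroup occupies adjacent positions. For a handful of fields this can be checked by brute force, as sketched below. The exhaustive search over permutations is chosen only for clarity and does not scale; the function names are hypothetical.

```python
from itertools import permutations


def is_contiguous(order, group):
    """True if all elements of group sit in adjacent positions of order."""
    positions = sorted(order.index(e) for e in group)
    return positions[-1] - positions[0] == len(positions) - 1


def find_grouping(groups):
    """Return an ordering that keeps every group contiguous, or None."""
    elements = sorted(set().union(*groups))
    for order in permutations(elements):
        if all(is_contiguous(order, g) for g in groups):
            return list(order)
    return None


# The three subgroups from the example above.
groups = [
    {"state", "city", "street"},
    {"country", "state", "city", "street"},
    {"country", "state"},
]
ordering = find_grouping(groups)
```

For the example subgroups an ordering exists, with state placed between country and the pair city, street. Three pairwise groups over three elements, such as {a, b}, {b, c}, {a, c}, cannot all be contiguous, so find_grouping returns None for them.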
Hierarchical Interface Integration (5)
Preserving ancestor-descendant relationships
[Figure: the American Express tree, the Chase tree, and the integrated tree. Labels shown include "Please tell us about yourself", "Please tell us about your employment", Occupation, Phone, Years there, Address, Company address, Country, State, City, Street; in the integrated tree, Country, State, City, Street appear together under address.]
Hierarchical Interface Integration (6)
Naming attributes
Group Naming Compatibility: Names of attributes
within a group in a user interface should be
compatible.
Example: Compatible naming
{adults, children}
{adults, children, infants}
{adults, infants}
Incompatible naming:
{adults, children}
{adults, children, #infants}
{#children, #infants}
Search Result Annotation
Goal: Identify the semantic meaning of each piece of
information within each search result record (SRR).
Before result annotation, SRRs on the result pages returned
from search engines need to be extracted first.
Some approaches combine result extraction and result
annotation in one step.
Data annotation is needed for
Comparison-shopping applications: entity identification,
result merging, …
Deep Web crawling and data collection
Result Annotation: Problem Description
[Figure: a search result record whose data units are annotated with labels such as title and authors.]
Entity Identification
Problem: Automatically derive rules to determine if
two search result records from different WDBs are
in fact the same entity (product).
Entity identification is closely related to entity
matching, entity resolution, duplicate detection, and
record linkage.
It is a classical problem in federated systems that
deal with data from multiple sources.
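To make the problem concrete, here is a hand-written matching rule of the kind an entity identification component would try to derive automatically. It is only a sketch: the record fields ("isbn", "title"), the threshold 0.85, and the function names are assumptions, and the identifier values used with it are placeholders rather than real data.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def same_entity(r1, r2, title_threshold=0.85):
    """Decide whether two search result records describe the same product."""
    # Rule 1: if both records carry an identifier, trust it.
    if r1.get("isbn") and r2.get("isbn"):
        return r1["isbn"] == r2["isbn"]
    # Rule 2: otherwise fall back to title similarity.
    sim = SequenceMatcher(None, normalize(r1["title"]),
                          normalize(r2["title"])).ratio()
    return sim >= title_threshold
```

Learning such rules amounts to choosing which fields to compare and which thresholds to apply for each domain, instead of fixing them by hand as done here.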
Remaining Research Challenges (1)
1. Automatic WDB discovery
Goal: Discover Web database interfaces from the Web
automatically.
Some issues to consider:
How to identify web pages that have a search interface?
There is already some existing work on this.
How to differentiate search interfaces for Web databases
from those for text search engines?
Is the information from the search interface sufficient? Do
we need information from search results?
How to learn a classifier?
Remaining Research Challenges (2)
2. Extraction and understanding of dynamic query
interfaces
An increasing number of query interfaces are
dynamic in the sense that the query interface may
alter after certain fields are selected. Two types of
dynamic changes have been observed.
The change of values of some fields (e.g., values under a
selection list).
The structure of the query interface (e.g., some fields are
added, deleted or modified).
Current query interface models do not consider
dynamic query interfaces.
Remaining Research Challenges (3)
3. Handling boundary query interfaces in Web-scale
clustering.
There are two challenges in Web-scale clustering of
query interfaces [Madhavan et al., 2007; Mahmoud
and Aboulnaga, 2010].
The number of domains is unknown in advance, which
means that the number of clusters is unknown in advance.
There are likely many query interfaces with unclear
domains, i.e., they appear between boundaries of multiple
domains.
However, the current solutions are not sufficiently
accurate and have significant room to improve.
Remaining Research Challenges (4)
4. Web database selection
Goal: For any given user query, identify the Web
databases that are most likely to return good results.
Some issues to consider:
How to summarize the content of a Web database?
Numerical attributes
Categorical attributes
Textual attributes
Relationships among the attributes
Remaining Research Challenges (5)
Web database selection (continued)
How to obtain the summaries automatically?
How to design sample queries for each type of attribute?
How to use the summaries to do Web database
selection?
How to measure “usefulness” based on different types of
attributes?
How to combine “usefulness” across different attributes?
Remaining Research Challenges (6)
5. Automatic SRR extraction from complex result pages
Goal: Automatically identify the rules to extract search
result records from complex result pages.
Some characteristics of complex result pages:
Records contain both text and images.
SRRs may be organized into multiple columns/multiple
sections.
SRRs have a variety of formats.
Result pages have no fixed sections (i.e., some sections only appear in some
result pages).
Some SRRs are divided into multiple blocks.
Remaining Research Challenges (7)
6. Global query processing and optimization
Goal: Evaluate global queries efficiently and correctly.
Some issues to consider:
It consists of many steps:
Identify relevant Web databases (global cost)
Translate/map global queries to local queries (global cost)
Submit queries and receive results (communication cost)
Evaluate translated queries by local Web databases (local cost)
Extract search results from result pages (global cost)
Filter out unqualified results (global cost)
How to optimize the above process?
What are the differences between Web integration systems
and multidatabase/federated database systems?
The End!