A Relational Approach to Incrementally Extracting and
Download
Report
Transcript A Relational Approach to Incrementally Extracting and
A Relational Approach to Incrementally
Extracting and Querying Structure in
Unstructured Data
BY ERIC CHU, AKANKSHA BAID, TING CHEN,
ANHAI DOAN, AND JEFFREY F. NAUGHTON",
PROCEEDINGS OF THE 33RD
INTERNATIONAL CONFERENCE ON VERY
LARGE DATA BASES (VLDB'07), VIENNA,
AUSTRIA, SEPTEMBER 2007, 1045-1056.
Presentation by Andrew Zitzelberger
Problem
To find information from unstructured data users
have to rely on a combination of keyword search,
browsing, and possibly predefined search options.
Current search mechanisms do not leverage the
potentially rich set of structures embedded in text.
Vision
Allow users to load a set of documents without any
pre-processing and immediately allow the user to
begin making queries.
At first the system provides no benefit above that
provided by a traditional Information Retrieval, but
information is extracted in the background providing
incrementally better search.
Contributions
Provide
a way to store the evolving set of documents and structures
tools that can be used to query and to incrementally process
the data
a way to handle changes in our understanding of the data set
as it is processed
Schema and Data Representations
Continuing extraction of heterogeneous structures
will gradually lead to sparse data set.
Sparse – comprises a large number of attributes, but most
entities have non-null values for only small fraction of all these
attributes.
Wide Tables
Wide Tables
Forego schema design and store all objects in a single
horizontal table using the interpreted storage format.
System uses an attribute catalog to record the name, id, type,
and size of each attribute.
Each tuple starts with a header with fields such as relation-id,
tuple-id, and record length
For each non-null attribute the tuple stores the attribute’s
identifier, length field, and value
Attributes that are not in the tuple are implicitly null.
Interpreted Schema
Attribute Catalog
Tuples
Wide Tables
1NF restriction generally used in databases is
removed.
A state has many lakes; unreasonable to store them all in a
separate column.
Creating a new table for each structure would also grow too
large and a query would require too many joins.
Allow complex attributes
Lists, arrays, tables, set of tuples, etc.
The administrators must decide how to deal with complex
attributes (user defined functions), the system does not
attempt to handle them.
Wide Table
Mapping Table
Maps concepts together
Temperatures in Fahrenheit on one page and Celsius on
another. We want to be able to compare the two.
When the system determines that an attribute can map to
another, it rewrites the query appropriately.
Host name column just for presentation, not part of actual
system
Relationship Table
Records complex structures that comprise multiple
attributes
Headquarter (city, company)
This table is used to keep track of which attributes
belong to which complex structure.
Operators
Extract – extracts structure
Integrate – identifying attributes that correspond to
the same real-world concept
Cluster – clustering a set of different attributes based
on some similarity structure
Operator Requirements
1) Each operator should be able to use different
algorithms.
2) Database administrators should be able to specify
a scope for the input on which chosen methods
operate.
3) Given the output database administrators should
be able to specify what they want to do with it.
Extract
Two types
Detects structure such as entities and relationships from
natural language (DIRPE, Snowball, KnowItAll).
Extracts structured data embedded in text of known format
such as LaTex, XML, and wiki markup text.
Output is a set of structures that can be stored in a
wide table or run through one of the other operators.
Database administrators need to decide how to apply
the extract operator.
Can apply to subsets of the documents or columns of already of
extracted information to get a finer granularity.
Integrate
Input: a set of structures from the wide table or a
previous operator
Output: one or more sets of mappings over attributes
that correspond to the same real world concept
Database administrators decide what to do with each
set of mappings
Store in mapping table
Consider collapsing the table (put attributes that map to one
another under a single column)
Cluster
Input: a set of documents or a set of attributes
Output: classification of documents or attributes into
one or more clusters
Clustering can help database administrators:
To know what views to build into the database
To consider splitting wide tables into different clusters
Operator Interaction
Operators can be combined synergistically in a
“whole is greater than the sum of the parts” fashion.
Six possible pairwise combinations of distinct
operators:
Integrate-Extract
Cluster-Extract
Extract-Cluster
Integrate-Cluster
Extract-Integrate
Cluster-Integrate
Operator Interaction
Integrate-Extract
Integrate helps find new targets for extract
Example: Integrate finds a mapping between “address” and
“sent-to”, the extractor can then be used on sent-to instances
Cluster-Extract
Cluster allows database administrators to find documents in a
specific domain, allowing them to use domain specific
extractors for better results than using domain-independent
extractors.
Operator Interaction
Extract-Cluster
The extracted set of structures may provide more information
that Cluster can use to group together documents or attributes.
Example: clustering pages about cities on Wikipedia based on
section names misses short pages because they don’t have a
section name. However, after extracting the “city info-box”
structure in some of the pages, Cluster was able to recognize
them and put them in the city cluster.
Operator Interaction
Integrate-Cluster
Integrate can prevent cluster from creating multiple clusters
where logically a single cluster would be better.
Example: Given a data set with attributes {C#, Company,
FirstName, LastName, CustID, Contact, Cname}, Cluster may
find two clusters {C#, Cname, FirstName, LastName} and
{CustID, Company, Contact}.
However, if we run Integrate first, we will have the mapping {C# =
CustID, Cname = Company, FirstName + LastName = Contact}
and only end up with one cluster.
Operator Interaction
Extract-Integrate
The primary tool. Need to extract before integrating.
Cluster-Integrate
Cluster can narrow the scope of Integrate in two ways:
1) Cluster may identify a domain for a set of structures, and then
the database administrators can apply a domain-specific schema
matchers.
2) Cluster may identify two overlapping sets of attributes (e.g.,
{CustID, Cname} and {CustID, Company}), and then database
administrators may want to look for possible mappings in the
difference between the two sets of attributes (e.g.,
Cname=Company)
Case Study: Wikipedia
Case Study: Wikipedia
Desirable Qualities:
Contents are embedded in wiki markup text.
There are guidelines on how to create and edit a wiki page.
Pages generally have a consistent organizational structure
Downloaded database dump on December 5, 2005
Dump contained 4 million XML files (8.5 GB)
Each XML file contained a blob that is the content of the page
and metadata bout the page (page-id, title, revision-id,
contributor-username, etc.)
Used a control set of major American Cities (254 files), major
universities from the states of Wisconsin, New York, and
California (255 files), and top male tennis players on the ATP
tour in the “Open Era” (373 files)
Incremental Processing
Stage 1: Initial Loading
Parsed and loaded the XML files into a single table which
initially had five columns: PageId, PageText, RevisionId,
ContributorUserName, and LastModificationDate
PageId = page title
PageText = content of page
Each page corresponds to a single row in the database.
With the full-text index on PageText, users can already query
the documents using the keyword searches, even though data
processing has not yet begun.
Incremental Processing
Stage 2: Extracting SectionName(text)
Extracted SectionName(text) where SectionName represents
the name of a first-level section in the page and text is the
content of that section.
Extracting this structure allows for “focused” keyword search.
e.g., Search for “World No. 1 tennis player”
Searching over all page text gives us 86 players, including all 23
who have been ranked No. 1 (text may mention players who
defeated a No. 1 player, etc.)
Assume that this fact is generally mentioned in the introduction,
we search only introduction sections and get 67 players, 21 of
which have been No 1 (better precision, worse recall)
• In the two that were missed this fact was express as “world number
one” instead of “world No. 1”
Incremental Processing
Incremental Processing
Stage 3: Extracting info box as a blob
Ideally we would want to extract attribute-value pairs, but
initially we store it as a blob to allow focused keyword
searches.
e.g., if we want to find out which universities have “cardinal” as
their school color, we can pose the keyword query “cardinal” over
the university info boxes which return 7 results, 6 of which are
correct (one has the mascot “Cardinal Burghy”)
If we were to run “cardinal university” over the PageText
column 51 pages are returned.
Incremental Processing
Stage 4: Extracting structured data from info boxes
and wiki tables
Gather attribute-value pairs from info boxes and tables
Have wide tables store pointers to tables containing the
information.
Incremental Processing
Queries
Find the average temperature for the winter months in Madison, Wisconsin.
Find all tennis players who have been ranked number one (according to the info
box).
Find universities who have temperature below 20o F in January.
Future Work
Deciding when to split wide tables as more
understanding is gained about the concept.
Repository of user defined programs for the
operators and handling complex objects.
Questions to Answer:
How to handle updates to the unstructured data?
How to record the evolution of data?
How to help users write queries that exploit the relational
structure?
How to optimize queries when they involve attributes that
have many mappings?
Analysis
Negatives
Requires a lot of work from the database administrators
Not tailored to the average users (SQL queries)
Positives
Pages can be queries immediately
Highly customizable
Allows for focused searches
Provide compelling idea for a personal search assistant