Automatic Wrapper Generation for Bioinformatics Integration

Grid Based Data Integration with Automatic Wrapper Generation
Xuan Zhang
Gagan Agrawal
Ohio State University
1
Overall Goal

Tools for data integration driven by:
  Data explosion
    Data size & number of data sources
    New analysis tools and need for workflows
  Autonomous resources
    Heterogeneous data representation & various interfaces
2
Motivation (Contd.)

Other Issues:
  Frequent updates to data formats
  Flat-file datasets
  Ad-hoc sharing of data
3
Current Approaches

Manually written wrappers
  Problems
    O(N^2) wrappers needed, O(N) for a single update
    Portability of wrappers in a distributed environment
Mediator-based integration systems
  Problems
    Need a common intermediate format
    Unnecessary data transformation
Integration using web/grid services
  Needs all tools to be web-services (all data in XML?)
4
Our Approach

Automatically generate wrappers
  One layout descriptor per resource
  Stand-alone wrapper programs
  For integrated DBs, (grid) workflow systems
Transform data in files of arbitrary formats
  No domain- or format-specific heuristics
  Layout information provided by users
5
Our Approach (Contd.)


Help write layout descriptors using data mining techniques (DILS 2005, BIBE 2005)
Particularly attractive for
  Data grid environments and workflows
  Flat-file datasets
  Ad hoc data sharing
6
Our Approach: Advantages

Advantages:
  No need to write wrappers while integrating data or creating workflows
  Only one descriptor per resource needed
  No unnecessary transformations / storage
  New resources can be integrated on-the-fly
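
The wrapper counts behind these slides can be made concrete with a little arithmetic. The sketch below assumes that hand-written wrappers are directional, so every ordered pair of formats needs its own wrapper; the slides do not state this, and the function names are illustrative only.

```python
def pairwise_wrappers(n: int) -> int:
    """Hand-written wrappers needed if each ordered pair of n formats has one."""
    return n * (n - 1)


def wrappers_touched_by_update(n: int) -> int:
    """Wrappers that read or write one changed format: (n - 1) in, (n - 1) out."""
    return 2 * (n - 1)


def descriptors(n: int) -> int:
    """Layout descriptors needed with one descriptor per resource."""
    return n


for n in (5, 10, 50):
    print(n, pairwise_wrappers(n), wrappers_touched_by_update(n), descriptors(n))
```

Under that assumption the hand-written approach grows quadratically, while the descriptor-based approach grows linearly and a format change touches a single descriptor.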
7
Our Approach: Challenges

Description language
  Format and logical view of data in flat files
  Easy to interpret and write
Wrapper generation and execution
  Correspondence between data items
  Separating wrapper analysis and execution
Interactive tools for writing layout descriptors
  What data mining techniques to use? (DILS 2005, BIBE 2005)
8
Wrapper Generation System Overview
[Figure: system overview. Labeled components: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, DataReader, DataWriter, Synchronizer, Source Dataset, Target Dataset]
9
Suitability for a Grid Environment

Wrapper analysis can be implemented as a grid service
  Very low execution costs
Wrapper execution modules are task-independent
  Just need to port three modules on different systems
10
Assumptions for the Current Prototype

One tabular, the other semi-structured
Both datasets are stored record-wise
Order of records not disturbed
Suitable for bioinformatics
[Figure labels: Semi-structured, tabular]
11
Layout Description Language
Goal
  To describe data in arbitrary flat file format
  Easy to interpret and write
Components:
  1. Schema description
  2. Layout description
Example: FASTA
12
Layout Description Language
…
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …

Key observations on data layout
  Strings of variable length
  Delimiters widely used
  Data fields divided into variables
  Repetitive structures
Key tokens
  “constant string”
  LINESIZE
  [optional]
  <repeating>
  …
13
Layout Description Language
Component I: Schema Description
[FASTA]                 //Schema Name
ID = string             //Data type definitions
DESCRIPTION = string
SEQ = string
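
A schema description of this shape is simple to read mechanically. The following is a minimal sketch, not the authors' parser: it assumes a bracketed schema name, `NAME = type` lines, and `//` comments, which is all the slide shows. The function name `parse_schema` is hypothetical.

```python
def parse_schema(text: str) -> tuple[str, dict[str, str]]:
    """Return (schema name, {field name: type name}) from a schema description."""
    name = None
    fields = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]  # bracketed schema name
        else:
            field, _, type_name = line.partition("=")
            fields[field.strip()] = type_name.strip()
    if name is None:
        raise ValueError("schema name missing")
    return name, fields


SCHEMA = """\
[FASTA]                 //Schema Name
ID = string             //Data type definitions
DESCRIPTION = string
SEQ = string
"""

name, fields = parse_schema(SCHEMA)
print(name, fields)
```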
14
Layout Description Language
Component II: Layout Description
…
LOOP ENTRY 1:EOF:1 {
    “>” ID “ ” DESCRIPTION
    < “\n” SEQ >
    “\n” | EOF
}
…
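
To illustrate what this layout description says about a FASTA file, here is a hand-written reader that follows the same structure: each entry starts with ">" followed by ID, a space, and DESCRIPTION, then one or more SEQ lines. This is only a sketch of the reading the descriptor implies; it is written by hand rather than generated, and `read_fasta_entries` is a hypothetical name.

```python
def read_fasta_entries(text: str) -> list[dict]:
    """Split FASTA text into entries with ID, DESCRIPTION, and a list of SEQ lines."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ">" ID " " DESCRIPTION
            ident, _, description = line[1:].partition(" ")
            current = {"ID": ident, "DESCRIPTION": description, "SEQ": []}
            entries.append(current)
        elif current is not None:
            # < "\n" SEQ > : repeated sequence lines belong to the open entry
            current["SEQ"].append(line)
    return entries


SAMPLE = (
    ">seq1 comment1\n"
    "ASTPGHTIIYEAVCLHNDRTTIP\n"
    ">seq2 comment2\n"
    "ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK\n"
    "NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR\n"
    "KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES\n"
)

entries = read_fasta_entries(SAMPLE)
for entry in entries:
    print(entry["ID"], entry["DESCRIPTION"], len(entry["SEQ"]))
```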
15
Mapping Cardinality

TRANSFAC (source entry)
…
FA factor1_name            one-to-multiple data field
…
RA reference1.1_authors    one-to-one data field
…
RA reference1.2_authors
…
RA reference1.3_authors
…

Reference table (target)
…  FA            …  RA                     …
…  factor1_name  …  reference1.1_authors   …
…  factor1_name  …  reference1.2_authors   …
…  factor1_name  …  reference1.3_authors   …
…  …             …  …                      …
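
The cardinality idea on this slide can be sketched in a few lines: a value that occurs once in a source entry (here assumed to be FA) is repeated in every target row, while each RA value produces exactly one row. The code assumes that FA precedes the RA lines within an entry, as in the slide's example; the function name is hypothetical and the real TRANSFAC format has many more fields.

```python
def transfac_to_reference_rows(entry_lines: list[str]) -> list[tuple[str, str]]:
    """Map one TRANSFAC-style entry to (FA, RA) rows of a reference table."""
    factor_name = None
    rows = []
    for line in entry_lines:
        tag, _, value = line.partition(" ")
        if tag == "FA":
            factor_name = value.strip()  # kept and reused: one-to-multiple
        elif tag == "RA":
            rows.append((factor_name, value.strip()))  # one row per RA: one-to-one
    return rows


ENTRY = [
    "FA factor1_name",
    "RA reference1.1_authors",
    "RA reference1.2_authors",
    "RA reference1.3_authors",
]

rows = transfac_to_reference_rows(ENTRY)
for row in rows:
    print(row)
```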
16
Analyzing Application

Goals - WRAPINFO
  Summarize all application-related information necessary for the wrapper
  Represent the information in look-up tables and constant parameters
  Represent the information in a platform-independent format, XML
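
The slides do not show what WRAPINFO looks like, so the following is purely an illustration of the idea of storing look-up tables and constant parameters as XML. Every element and attribute name here (`wrapinfo`, `constants`, `param`, `lookup`, `field`, `cardinality`) is invented for the sketch, as are the helper functions.

```python
import xml.etree.ElementTree as ET


def build_wrapinfo(constants: dict, lookup: dict) -> ET.Element:
    """Build a hypothetical WRAPINFO document from constants and a look-up table."""
    root = ET.Element("wrapinfo")
    consts = ET.SubElement(root, "constants")
    for key, value in constants.items():
        ET.SubElement(consts, "param", name=key, value=str(value))
    table = ET.SubElement(root, "lookup")
    for source_field, info in lookup.items():
        ET.SubElement(
            table,
            "field",
            source=source_field,
            target=info["target"],
            cardinality=info["cardinality"],
        )
    return root


def load_lookup(root: ET.Element) -> dict:
    """Read the look-up table back from the hypothetical WRAPINFO document."""
    return {
        field.get("source"): {
            "target": field.get("target"),
            "cardinality": field.get("cardinality"),
        }
        for field in root.find("lookup")
    }


lookup = {
    "FA": {"target": "FA", "cardinality": "one_to_multiple"},
    "RA": {"target": "RA", "cardinality": "one_to_one"},
}
xml_text = ET.tostring(build_wrapinfo({"source_schema": "TRANSFAC"}, lookup), encoding="unicode")
restored = load_lookup(ET.fromstring(xml_text))
print(xml_text)
```

Because the analysis result is plain XML, it can be produced on one machine and consumed by wrapper modules on another.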
17
Wrapper Generated
[Figure: structure of the generated wrapper. Labels: Input dataset, Dataset buffer, DataReader, Value buffer (one_to_multiple_values, one_to_one_values; FA, RA), DataWriter, Output dataset, Synchronizer (load, run, halt, run)]
18
Wrapper Generated

Suitable for data grid
Three general modules
  DataReader
    Extract one data field value
    Write value to the value buffer if useful
  DataWriter
    Write one data field value
    Remove value from list in the value buffer
  Synchronizer
    Switch between calling DataReader and DataWriter
    Manage dataset buffer
Application specific information in WRAPINFO
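
The division of labor among the three modules can be sketched as follows. This is a simplified illustration, not the generated wrapper: the value buffer is a single queue, dataset buffer management is omitted, the `WRAPINFO` dictionary and its `useful_tags` key are hypothetical, and the `XX` input line is a made-up field that the reader is expected to skip.

```python
from collections import deque


class DataReader:
    """Extracts one data field value per call; buffers it if useful."""

    def __init__(self, lines, useful_tags):
        self._lines = iter(lines)
        self._useful = useful_tags

    def run(self, value_buffer) -> bool:
        line = next(self._lines, None)
        if line is None:
            return False  # end of input dataset
        tag, _, value = line.partition(" ")
        if tag in self._useful:
            value_buffer.append((tag, value))
        return True


class DataWriter:
    """Writes one data field value per call and removes it from the buffer."""

    def __init__(self, output):
        self._output = output

    def run(self, value_buffer) -> None:
        tag, value = value_buffer.popleft()
        self._output.append(f"{tag}\t{value}")


class Synchronizer:
    """Switches between calling DataReader and DataWriter."""

    def __init__(self, reader, writer):
        self._reader = reader
        self._writer = writer
        self._value_buffer = deque()

    def run(self) -> None:
        while self._reader.run(self._value_buffer):
            while self._value_buffer:
                self._writer.run(self._value_buffer)


WRAPINFO = {"useful_tags": {"FA", "RA"}}  # hypothetical application-specific info
source = ["XX ignored_field", "FA factor1_name", "RA reference1.1_authors"]
target = []

Synchronizer(DataReader(source, WRAPINFO["useful_tags"]), DataWriter(target)).run()
print(target)
```

Only the `WRAPINFO` contents change from one task to another in this sketch; the three classes stay the same, which mirrors the slide's point that the modules are general.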
19
Experimental Results
TRANSFAC-to-Reference Problem
[Figure: results plot; axes labeled "(in logarithm)"]
20
Experimental Results
SWISSPROT-to-FASTA Problem
21
Summary



Automatically generated wrappers can perform well
Wrapper task analysis and wrapper execution can be separated
Key Open Questions:
  How hard is it to write layout descriptors?
  Can we make the process semi-automatic?
  Data mining techniques seem quite promising (DILS 2005, BIBE 2005)
22