Automatic Wrapper Generation for Bioinformatics Integration
Grid Based Data Integration with Automatic Wrapper Generation
Xuan Zhang
Gagan Agrawal
Ohio State University
1
Overall Goal
Tools for data integration driven by:
Data explosion
Data size & number of data sources
New analysis tools and need for workflows
Autonomous resources
Heterogeneous data representation & various interfaces
2
Motivation (Contd.)
Other Issues:
Frequent updates to data formats
Flat-file datasets
Ad-hoc sharing of data
3
Current Approaches
Manually written wrappers
Problems:
O(N^2) wrappers needed, O(N) for a single update
Portability of wrappers in a distributed environment
Mediator-based integration systems
Problems:
Need a common intermediate format
Unnecessary data transformation
Integration using web/grid services
Needs all tools to be web services (all data in XML?)
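The wrapper counts for manually written wrappers can be made concrete with a little arithmetic. The sketch below assumes that every ordered (source, target) pair of formats needs its own hand-written wrapper; the function names are illustrative and do not come from the talk.

```python
def pairwise_wrappers(n_formats: int) -> int:
    """Hand-written wrappers: one per ordered (source, target) pair, O(N^2)."""
    return n_formats * (n_formats - 1)


def wrappers_touched_by_update(n_formats: int) -> int:
    """Wrappers that read or write a single changed format, O(N)."""
    return 2 * (n_formats - 1)


for n in (5, 10, 50):
    print(f"N={n}: {pairwise_wrappers(n)} wrappers in total, "
          f"{wrappers_touched_by_update(n)} to rewrite after one format change")
```

Under this assumption, 10 formats already require 90 wrappers, and one format change touches 18 of them.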
4
Our Approach
Automatically generate wrappers
One layout descriptor per resource
Stand-alone wrapper programs
For integrated DBs, (grid) workflow systems
Transform data in files of arbitrary formats
No domain- or format-specific heuristics
Layout information provided by users
5
Our Approach (Contd.)
Help write layout descriptors using data mining techniques (DILS 2005, BIBE 2005)
Particularly attractive for:
Data grid environments and workflows
Flat-file datasets
Ad hoc data sharing
6
Our Approach: Advantages
Advantages:
No need to write wrappers while
integrating data or creating workflows
Only one descriptor per resource needed
No unnecessary transformations / storage
New resources can be integrated on-the-fly
7
Our Approach: Challenges
Description language
Format and logical view of data in flat files
Easy to interpret and write
Wrapper generation and execution
Correspondence between data items
Separating wrapper analysis and execution
Interactive tools for writing layout descriptors
What data mining techniques to use? (DILS 2005, BIBE 2005)
8
Wrapper Generation System
Overview
[Figure: system overview. Components shown: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, DataReader, DataWriter, Synchronizer. The wrapper reads the Source Dataset and writes the Target Dataset.]
9
Suitability for a Grid Environment
Wrapper analysis can be implemented as a grid service
Very low execution costs
Wrapper execution modules are task-independent
Just need to port three modules to different systems
10
Assumptions for the Current Prototype
One dataset is tabular, the other semi-structured
Both datasets are stored record-wise
Order of records not disturbed
Suitable for bioinformatics
[Figure: a semi-structured dataset and a tabular dataset]
11
Layout Description Language
Goal
To describe data in an arbitrary flat-file format
Easy to interpret and write
Components:
1. Schema description
2. Layout description
Example: FASTA
12
Layout Description Language
…
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
Key observations on data layout
Strings of variable length
Delimiters widely used
Data fields divided into variables
Repetitive structures
Key tokens
"constant string"
LINESIZE
[optional]
<repeating>
…
13
Layout Description Language
Component I: Schema Description
[FASTA]              //Schema Name
ID = string          //Data type definitions
DESCRIPTION = string
SEQ = string
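To show how little machinery a schema description of this shape needs, here is a minimal Python sketch that reads it into a schema name and a field-to-type table. The grammar is assumed from the single FASTA example above (a bracketed name, `NAME = type` lines, `//` comments); `parse_schema` is a hypothetical helper, not part of the authors' system.

```python
def parse_schema(text: str) -> tuple[str, dict[str, str]]:
    """Parse a schema description into (schema name, {field: type})."""
    name = None
    fields: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]                 # e.g. [FASTA]
        else:
            field, _, type_name = line.partition("=")
            fields[field.strip()] = type_name.strip()
    if name is None:
        raise ValueError("schema name line such as [FASTA] is missing")
    return name, fields


FASTA_SCHEMA = """\
[FASTA]              //Schema Name
ID = string          //Data type definitions
DESCRIPTION = string
SEQ = string
"""

name, fields = parse_schema(FASTA_SCHEMA)
print(name, fields)
```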
14
Layout Description Language
Component II: Layout Description
…
LOOP ENTRY 1:EOF:1 {
  ">" ID " " DESCRIPTION
  < "\n" SEQ >
  "\n" | EOF
}
…
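The layout description reads as: each entry starts with ">", then ID, a space, DESCRIPTION, followed by one or more newline-separated SEQ lines. The hand-written Python reader below mirrors that structure for illustration only; it is not the generated wrapper, and `read_fasta_entries` is a hypothetical name. It assumes the ID contains no spaces and strips the trailing blanks shown in the sample.

```python
def read_fasta_entries(text: str) -> list[dict]:
    """Split FASTA text into entries with ID, DESCRIPTION, and SEQ lines."""
    entries: list[dict] = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ">" ID " " DESCRIPTION
            ident, _, description = line[1:].partition(" ")
            current = {"ID": ident, "DESCRIPTION": description, "SEQ": []}
            entries.append(current)
        elif current is not None:
            # < "\n" SEQ > : the repeating sequence lines of one entry
            current["SEQ"].append(line)
    return entries


sample = (
    ">seq1 comment1 \n"
    "ASTPGHTIIYEAVCLHNDRTTIP \n"
    ">seq2 comment2 \n"
    "ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n"
    "NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n"
    "KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n"
)

entries = read_fasta_entries(sample)
for entry in entries:
    print(entry["ID"], entry["DESCRIPTION"], len(entry["SEQ"]), "sequence line(s)")
```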
15
Mapping Cardinality
Field labels in the figure: one-to-multiple data field, one-to-one data field
TRANSFAC (source entry):
…
FA factor1_name
…
RA reference1.1_authors
…
RA reference1.2_authors
…
RA reference1.3_authors
…
Reference table (target):
… | FA | … | RA | …
… | factor1_name | … | reference1.1_authors | …
… | factor1_name | … | reference1.2_authors | …
… | factor1_name | … | reference1.3_authors | …
… | … | … | … | …
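The cardinality issue can be sketched in a few lines of Python: the single FA value of an entry is repeated in every output row, while each RA value is used exactly once. The sketch assumes that FA appears before the RA lines of its entry and that a tag is separated from its value by a space; `transfac_to_reference_rows` is a hypothetical helper, and real TRANSFAC entries carry many more fields.

```python
def transfac_to_reference_rows(entry_lines: list[str]) -> list[tuple[str, str]]:
    """Pair the entry's FA value with each of its RA values."""
    factor_name = None  # FA: one value reused across several rows
    rows = []
    for line in entry_lines:
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag == "FA":
            factor_name = value
        elif tag == "RA":
            rows.append((factor_name, value))  # RA: one value per row
    return rows


entry = [
    "FA factor1_name",
    "RA reference1.1_authors",
    "RA reference1.2_authors",
    "RA reference1.3_authors",
]

rows = transfac_to_reference_rows(entry)
for row in rows:
    print(row)
```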
16
Analyzing Application
Goals - WRAPINFO
Summarize all application-related information necessary for the wrapper
Represent the information in look-up tables and constant parameters
Represent the information in a platform-independent format, XML
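As a rough illustration of look-up tables and constant parameters stored as XML, the Python sketch below writes and reloads such a file with the standard library. The element and attribute names (`wrapinfo`, `constants`, `param`, `lookup`, `entry`) are invented for this example; the slides do not show the actual WRAPINFO layout.

```python
import xml.etree.ElementTree as ET


def build_wrapinfo(field_mapping: dict, constants: dict) -> str:
    """Serialize a look-up table and constant parameters to an XML string."""
    root = ET.Element("wrapinfo")  # hypothetical element names throughout
    consts = ET.SubElement(root, "constants")
    for key, value in constants.items():
        ET.SubElement(consts, "param", name=key, value=str(value))
    table = ET.SubElement(root, "lookup", name="field_mapping")
    for source, target in field_mapping.items():
        ET.SubElement(table, "entry", source=source, target=target)
    return ET.tostring(root, encoding="unicode")


def load_field_mapping(xml_text: str) -> dict:
    """Read the look-up table back, as a wrapper on any platform could."""
    root = ET.fromstring(xml_text)
    return {e.get("source"): e.get("target") for e in root.iter("entry")}


xml_text = build_wrapinfo(
    {"FA": "FA", "RA": "RA"},
    {"source_schema": "TRANSFAC", "target_schema": "Reference"},
)
print(xml_text)
print(load_field_mapping(xml_text))
```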
17
Wrapper Generated
[Figure: generated wrapper. Components shown: Input dataset, Dataset buffer, DataReader, Value buffer (one_to_multiple_values, one_to_one_values; example fields FA and RA), DataWriter, Output dataset, Synchronizer. Control labels: load, run, halt.]
18
Wrapper Generated
Suitable for data grid
Three general modules
DataReader
Extract one data field value
Write value to the value buffer if useful
DataWriter
Write one data field value
Remove value from list in the value buffer
Synchronizer
Switch between calling DataReader and DataWriter
Manage dataset buffer
Application-specific information in WRAPINFO
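The division of labor among the three modules can be sketched as follows. This is a simplified Python illustration, not the generated wrapper: the class and method names are assumptions, the FA/RA field roles are hard-coded instead of being read from WRAPINFO, and the writer emits a whole row at a time for brevity.

```python
from collections import deque


class DataReader:
    def __init__(self, lines, useful_tags):
        self.lines = deque(lines)       # stands in for the dataset buffer
        self.useful_tags = useful_tags  # would come from WRAPINFO

    def read_one(self, value_buffer) -> bool:
        """Extract one field value; keep it only if the target needs it."""
        if not self.lines:
            return False                # input exhausted
        tag, _, value = self.lines.popleft().partition(" ")
        if tag in self.useful_tags:
            value_buffer.setdefault(tag, deque()).append(value.strip())
        return True


class DataWriter:
    def __init__(self, out_rows):
        self.out_rows = out_rows

    def can_write(self, value_buffer) -> bool:
        return bool(value_buffer.get("FA")) and bool(value_buffer.get("RA"))

    def write_one(self, value_buffer) -> None:
        """Write one output row and remove the consumed value."""
        factor = value_buffer["FA"][-1]         # reused across rows
        authors = value_buffer["RA"].popleft()  # used exactly once
        self.out_rows.append((factor, authors))


class Synchronizer:
    def __init__(self, reader, writer):
        self.reader, self.writer = reader, writer

    def run(self) -> None:
        """Alternate between reading values and writing completed rows."""
        value_buffer = {}
        while self.reader.read_one(value_buffer):
            while self.writer.can_write(value_buffer):
                self.writer.write_one(value_buffer)


lines = [
    "FA factor1_name",
    "XX",  # a line the target does not need
    "RA reference1.1_authors",
    "RA reference1.2_authors",
    "RA reference1.3_authors",
]
rows = []
Synchronizer(DataReader(lines, {"FA", "RA"}), DataWriter(rows)).run()
print(rows)
```

Because the field roles are the only task-specific part, moving them into a configuration such as WRAPINFO would leave the three classes unchanged across conversion tasks, which is the property the slide highlights.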
19
Experimental Results
TRANSFAC-to-Reference Problem
[Figure: results plot; axes labeled "(in logarithm)"]
20
Experimental Results
SWISSPROT-to-FASTA Problem
21
Summary
Automatically generated wrappers can perform well
Wrapper task analysis and wrapper execution can be separated
Key Open Question:
How hard is it to write layout descriptors?
Can we make the process semi-automatic?
Data mining techniques seem quite promising (DILS 2005, BIBE 2005)
22