
Supporting On-the-fly Data Integration for Bioinformatics
Candidate: Xuan Zhang
Advisor: Gagan Agrawal
Road Map
• Mission Statement
• Motivation
• Implementation
• Comprehensive Examples
• Future work
• Conclusion
Mission Statement
• Enhance information integration systems on
– Functionality
• On-the-fly data incorporation
• Flat file data processing
– Usability
• Declarative interface
• Low programming requirement
Motivation
• Integration is essential for biological research
– Biological data include
• Sequences: DNA (GenBank), protein (Swiss-Prot)
• Structure: RNA (RNAbase), protein (PDB)
• Interaction: pathway (KEGG), regulation (GRBase)
• Function: disease (OMIM)
• Secondary: protein family (Pfam)
– Biological data are inter-related.
Motivation
• Challenges of bioinformatics integration
– Data volume: overwhelming
• DNA sequence: 100 gigabases (August 2005)
– Data growth: exponential
Figure provided by PDB
Motivation
• Challenges of bioinformatics integration (cont.)
– Tools: many, and growing
– Service interfaces: Variety
• Web pages
• Web service
• Grid service
Motivation
• Challenges of bioinformatics integration (cont.)
– Inter-operability: Low
• Heterogeneous data sources
– Semi-structured by nature
– Flat file, relational, object-oriented databases
• Independently developed tools
• No data exchange standard
– Little Collaboration
Road Map
• Mission Statement
• Motivation
• Implementation
– Approach Overview
– Advantages
– Components
• Future work
• Conclusion
Approach Summary
• Metadata
– Declarative description of data
– Data mining algorithms for semi-automatic writing
– Reusable by different requests on the same data
• Code generation
– Request analysis and execution separated
– General modules with plug-in data module
System Overview
[System diagram. Understand Data: the Layout Miner and Schema Miner read each data file and produce the metadata description, a stack of layout descriptors and schema descriptors (one pair per dataset). Process Data: the Request Processor takes the user request together with the descriptors; code generation then builds the program that processes the data and returns the answer.]
Advantages
• Simple interface
– Declarative, at the metadata level
• General data model
– Semi-structured data
– Flat file data
• Low human involvement
– Semi-automatic data incorporation
– Low maintenance cost
• Acceptable performance
– Linear scaling guaranteed
Road Map
• Mission Statement
• Motivation
• Implementation
– Approach Overview
– Advantages
– Components
• Future work
• Conclusion
System Components
• Understand data
– Layout mining
– Schema mining
• Process data
– Wrapper generation
– Query processing
– Query processing with indices
Layout Mining
• Goal 1: Separate delimiters from values
– D-score: location & frequency
• Goal 2: Organize delimiters and values
– NFA
[Pipeline: data file → Token Parser → tokens → Delimiter Mining → candidate delimiters → Layout Learning → layout descriptor]
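The talk only characterizes the D-score as combining location and frequency, without giving the formula. As a hedged illustration, the sketch below scores each token by how many records it appears in and how positionally stable it is within them; this is a plausible stand-in for the delimiter-mining step, not the dissertation's actual D-score, and every name in it is hypothetical.

```python
from collections import defaultdict

def d_scores(records):
    """Hypothetical D-score sketch: a token is a good delimiter
    candidate if it is frequent (appears in most records) and
    positionally stable (appears at a consistent relative location)."""
    freq = defaultdict(int)        # how many records contain the token
    positions = defaultdict(list)  # relative position of first occurrence
    for rec in records:
        tokens = rec.split()
        seen = set()
        for i, tok in enumerate(tokens):
            if tok not in seen:
                seen.add(tok)
                freq[tok] += 1
                positions[tok].append(i / max(len(tokens) - 1, 1))
    scores = {}
    n = len(records)
    for tok, cnt in freq.items():
        pos = positions[tok]
        mean = sum(pos) / len(pos)
        var = sum((p - mean) ** 2 for p in pos) / len(pos)
        scores[tok] = (cnt / n) * (1.0 - var)  # frequent AND stable wins
    return scores

records = ["ID 1 SEQ AAC", "ID 2 SEQ GGT", "ID 3 SEQ TTA"]
scores = d_scores(records)
# "ID" and "SEQ" occur in every record at fixed positions, so they
# outscore the value tokens ("1", "AAC", ...) and surface as delimiters.
```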
Schema Mining Road Map
• Schema Mining
– Overview
– Mining System
– Core Mining Algorithm
– Experiments
Schema Mining Goals
• Ultimate goal: discover the schema of an unknown flat file dataset
• Immediate goal: assign attributes meaningful labels
Our Approach
• Summarize values from the bottom up
• Use knowledge from
– Ontology
– Heuristics
• A heads-up: attribute label ≠ attribute name
– What we can mine
• date
– What we cannot do
• Creation date, last modification date, birthday, …
Schema Mining Road Map
• Schema Mining
– Overview
– Mining System
– Core Mining Algorithm
– Experiments
Schema Mining System
• Major components
– Data cleaning and summarization
– Score calculation
• Score function
• Ontology
• Heuristics
– Score clustering
[Pipeline: raw attribute values → value cleaning and summarization → attribute summaries → score calculation → scores → clustering algorithm (with cutoff values) → labeling → attribute labels]
Data Summarization
• Goal: reduce the amount of data
• Collect frequent tokens
– Approximate frequent token mining algorithm
• Token categorization by profile
– Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
– Token categories:
• Word, number, else, and other user-defined categories
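The token profile described above can be computed directly. This minimal sketch collapses digit runs to `N` and letter runs to `A`, keeps special characters as themselves, and maps profiles to the coarse categories the slide names; the category table is a simplified stand-in that user-defined categories would extend.

```python
def token_profile(token):
    """Token profile: an ordered list of N (numeric run), A (alphabetic
    run), and special characters kept as-is."""
    profile = []
    for ch in token:
        cls = "N" if ch.isdigit() else ("A" if ch.isalpha() else ch)
        if cls in ("N", "A") and profile and profile[-1] == cls:
            continue  # merge consecutive characters of the same class
        profile.append(cls)
    return "".join(profile)

def categorize(token):
    """Map a profile to a coarse category (word, number, else);
    a real system would add user-defined categories here."""
    p = token_profile(token)
    if p in ("N", "N.N"):
        return "number"
    if p == "A":
        return "word"
    return "else"

# token_profile("12-SEP-1993") == "N-A-N", a date-like profile
```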
Score Function Template
• Desired properties
– Simple
– Adjustable tradeoff between sensitivity and error tolerance
[Plot: the template score function over t, scores ranging 0.0–1.0, with parameters F_pt and B_pt controlling the temperature-like shape of the curve]
Score Clustering
• Goal: sort attributes into three groups, H (high), M (middle) and L (low), by score
• Mathematically: find two scores, score_i and score_j, from {score_1, score_2, score_3, …, score_N} that minimize the standard deviation within the resulting groups
• N (the number of attributes) is not large, so the exact answer can be found.
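Since N is small, the exact split can be found by brute force. The sketch below tries every cutoff pair; the slide only says the split "minimizes the standard deviation", so minimizing the summed within-group standard deviation is one plausible reading, not necessarily the exact objective used in the dissertation.

```python
import statistics

def cluster_scores(scores):
    """Exact three-way clustering: try every cutoff pair (i, j) over the
    sorted scores and keep the split whose within-group standard
    deviations sum to the minimum (an assumed reading of the slide's
    objective). Returns the (H, M, L) groups."""
    s = sorted(scores, reverse=True)
    n = len(s)

    def sd(group):
        return statistics.pstdev(group) if len(group) > 1 else 0.0

    best, best_split = float("inf"), None
    for i in range(1, n - 1):          # end of the H group
        for j in range(i + 1, n):      # end of the M group
            cost = sd(s[:i]) + sd(s[i:j]) + sd(s[j:])
            if cost < best:
                best, best_split = cost, (s[:i], s[i:j], s[j:])
    return best_split

high, mid, low = cluster_scores([0.95, 0.9, 0.5, 0.45, 0.1, 0.05])
# → H = [0.95, 0.9], M = [0.5, 0.45], L = [0.1, 0.05]
```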
Schema Mining Road Map
• Schema Mining
– Overview
– Mining System
– Core Mining Algorithm
• Mining with ontology
• Mining with heuristics
– Experiments
Use of Ontology
• An observation: a similarity between ontology and schema
– Both satisfy the “is-a” relation
• E.g. “Diabetes is a disease.”
• Ontology: “diabetes” is a child of “disease”
• Schema: “diabetes” is a valid instance of attribute “disease”
• Common ancestors in the ontology ~ attribute label
Real-world Complications
• To find an arbitrary value in an ontology
– Complete and comprehensive ontology?
• Selective sampling
– Error-free dataset?
• Adjustable sensitivity & fault tolerance
• Performance
Ontology Database
• Goal: to approximate a complete, comprehensive ontology database
• Approach
– “Complete”: sample popular terms
– “Comprehensive”: public ontology databases +
common facts
• Result
– 6 major categories
– 386 terms
Ontology Based Metrics (1)
1. Occurrence(term) =
– Frequent_Count[i], if term = Frequent_Token[i]
– min_{i ∈ [0, t]} Frequent_Count[i], if term = Frequent_Token[0] | … | Frequent_Token[t]
– 0, otherwise
2. Strength(term) = Occurrence(term) + Σ Strength(child_term)
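The Strength metric is a recursion down the "is-a" hierarchy: a concept accumulates its own occurrence count plus the strengths of its children. A minimal sketch, with a toy ontology and hypothetical counts:

```python
def occurrence(term, frequent_counts):
    """Occurrence(term): the term's frequent-token count (the slide
    takes the minimum count when several frequent tokens match one
    term; a simple exact-match lookup suffices for this sketch)."""
    return frequent_counts.get(term, 0)

def strength(term, ontology, frequent_counts):
    """Strength(term) = Occurrence(term) + sum of the children's
    strengths, recursing down the 'is-a' hierarchy."""
    total = occurrence(term, frequent_counts)
    for child in ontology.get(term, []):
        total += strength(child, ontology, frequent_counts)
    return total

# Toy is-a hierarchy: diabetes and cancer are diseases.
ontology = {"disease": ["diabetes", "cancer"]}
counts = {"diabetes": 7, "cancer": 4}
# strength("disease", ...) == 0 + 7 + 4 == 11: the parent concept
# "disease" scores high even though the word itself never occurs.
```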
Ontology Based Metrics (2)
• Two factors
– Relative strength compared with other concepts
– Completeness of ontology as a whole
• Ontology score = product of two factors
– Each modulated by the template score function
Mining With Heuristics (1)
• Use token profile
– “number”: {N, N.N}
– “date”: {N-A-N, N/N/N}
• Use frequent token counts
– “identification”: all Frequent_Count[i] = 1
• Use other token information
– “biological sequence”: length > 45, or in 10’s
Mining With Heuristics (2)
• Use token sequence information
– “people name”: length (2~3), separator (“,” or
“and”), profile (not number, date)
• Again, these counts are modulated by the
template function to calculate scores
Schema Mining Road Map
• Schema Mining
– Overview
– Mining System
– Core Mining Algorithm
– Experiments
Schema Mining Experiment Design
• Datasets
– GenBank, UniProt SWISSPROT and Pfam
• Cutoff values
– Exact clustering
• Evaluation
– Weighted Cohen’s Kappa: compare the groups “most”, “middle” and “little” with the true labels Y (yes), P (partial) and N (no)
Result Summary: Kappa
[Chart: weighted Kappa per attribute, banded into very good / good / moderate. Attributes: 1: cellular component, 2: database, 3: date, 4: free text, 5: ID, 6: molecule type, 7: name, 8: number, 9: organism, 10: publication method, 11: sequence. Example slides follow for Cellular Component (O), Date (H) and Organism Name (O).]
Schema Mining Summary
• According to Kappa tests, results are good or
very good
• Possible improvement
– Clustering method with better intelligence
– Better ontology database
– More involved language analysis
– Hybrid of bottom-up and top-down approaches
System Components
• Understand data
– Layout mining
– Schema mining
• Process data
– Metadata description language
– Wrapper generation
– Query Process
– Query Process with indices
Data Process Overview
• Automatic code generation approach
• Input
– Metadata about datasets involved
– Optional:
• Implicit data transformation task
• Request by users
• Indexing functions
• Output
– Executable programs
• General modules
• Task-specific data module
Metadata Description
• Two aspects of data in flat files
– Logical view of the data
– Physical data organization
• Two components of every data descriptor
– Schema description
– Layout description
• Design goals
– Powerful
– Easy for writing and interpretation
Metadata Challenges
• Examples of sequence formats
– ALN/ClustalW format
– AMPS Block file format
– ClustalW
– Codata
– EMBL
– GCG/MSF
– GDE
– GenBank
– Fasta (Pearson)
– NBRF/PIR
– PDB format
– Pfam/Stockholm format
– Phylip
– Raw
– RSF
– UniProtKB/Swiss-Prot
• Major challenges:
1. Various representations
2. Semi-structured data
[The slide overlays three format examples (list and examples provided by EMBL-EBI): a FASTA record for FOSB_MOUSE (“>FOSB_MOUSE Protein fosB. 338 bp” followed by the amino acid sequence); a GDE-style field specification (name, longname, sequence-ID, creation-date, direction, strandedness, type, offset, group-ID, creator, descrip, comments, sequence); and a GenBank record for mouse fosB mRNA (LOCUS MMFOSB 4145 bp mRNA, DEFINITION Mouse fosB mRNA., ACCESSION X14897, VERSION X14897.1, ORGANISM Mus musculus, with KEYWORDS, REFERENCE, COMMENT and FEATURES fields).]
Schema Descriptors
• Follow XML DTD standard for semi-structured
data
<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>
• Simple attribute list for relational data
[FASTA]                 //Schema name
ID = string             //Data type definitions
DESCRIPTION = string
SEQ = string
Layout Descriptors
• Overall structure (FASTA example)
DATASET “FASTAData” {          //Dataset name
  DATATYPE {FASTA}             //Schema name
  DATASPACE LINESIZE=80 {
    // ---- File layout details go here ----
  }
  DATA {osu/fasta}             //File location
}
File Layout
• Key observations on line-based biological data files
– Strings of variable length
– Delimiters widely used
– Data fields may be divided into variables
– Repetitive structures
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
Layout Descriptors
• File layout (FASTA example)
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
DATASPACE LINESIZE=80 {
  < “>” ID “ ” DESCRIPTION
    < “\n” SEQ >
    “\n” | EOF
  >
}
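The descriptor above says: a record starts at “>”, the ID runs to the first space, the DESCRIPTION to the end of the line, and SEQ is every following line until the next “>” or EOF. A minimal sketch of the scan a generated parser would perform (hand-written here for illustration; the actual system generates this logic from the descriptor):

```python
def parse_fasta(text):
    """Scan FASTA text per the layout descriptor: '>' opens a record,
    ID runs to the first space, DESCRIPTION to the newline, and SEQ is
    the remaining lines until the next '>' or EOF (wrapped sequence
    lines are concatenated)."""
    records = []
    for chunk in text.split(">"):
        if not chunk.strip():
            continue  # skip the empty piece before the first '>'
        header, _, body = chunk.partition("\n")
        ident, _, descr = header.partition(" ")
        seq = "".join(body.split())  # join wrapped sequence lines
        records.append({"ID": ident, "DESCRIPTION": descr, "SEQ": seq})
    return records

data = ">seq1 comment1\nASTPGH\n>seq2 comment2\nASQKRP\nKSAHKG\n"
recs = parse_fasta(data)
# recs[1] == {"ID": "seq2", "DESCRIPTION": "comment2", "SEQ": "ASQKRPKSAHKG"}
```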
System Components
• Understand data
– Layout mining
– Schema mining
• Process data
– Metadata description language
– Wrapper generation
– Query execution
– Query execution with indices
Wrapper Generation
Road Map
• Motivation and overview
• System structure
• Wrapper generation
• Wrapper execution
• Experiments
Wrapper Generation
Motivation
• Wrappers are essential for bioinformatics
integration
– Heterogeneous data sources
– Function: transform data
• Current solutions
– Manually written wrappers
– Scripts
Wrapper Generation
Advantages
• Wrappers generated automatically
– Stand-alone programs for integration systems and workflows
– Little human intervention; new resources can be integrated on-the-fly
– Direct transformation; no unnecessary intermediate form needed
– Only requires data description at the metadata level, one descriptor per data source
• Transform data from flat files directly
– No DB support required
– No other domain or format heuristics
Wrapper Generation
System Overview
[System diagram. Inside the wrapper generation system: the Layout Parser turns the layout descriptors into a data entry representation; the Mapping Generator reads the schema descriptors and writes a mapping file, which the Mapping Parser loads as the schema mapping; the Application Analyzer combines both into WRAPINFO. The generated wrapper (DataReader, DataWriter and Synchronizer around WRAPINFO) then transforms the source dataset into the target dataset.]
Layout Parse Tree
• FASTA example
DATASPACE LINESIZE=80 {
  < “>” ID “ ” DESCRIPTION
    < “\n” SEQ >
    “\n” | EOF
  >
}
[Parse tree: the DATASPACE root (linesize = 80) has nested environment nodes <>; the leaves are the delimiter-variable pairs “>”-ID, “ ”-DESCRIPTION, “\n”-SEQ, and “\n”-DUMMY | EOF]
• Internal node: environment
• Leaf: delimiter-variable (DLM-VAR) pair
Schema Mapping
• Algorithm: strict name matching
for field ft in target schema
  for field fs in source schema
    if ft = fs then add pair (fs, ft) to the mapping
• Output
– A list of attribute pairs
– An editable file for the user to verify and modify
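The pseudocode above is directly runnable. A minimal Python version, with hypothetical schemas for illustration:

```python
def generate_mapping(source_schema, target_schema):
    """Strict name matching, as on the slide: pair every target field
    with a source field of exactly the same name."""
    mapping = []
    for ft in target_schema:
        for fs in source_schema:
            if ft == fs:
                mapping.append((fs, ft))
    return mapping

# Hypothetical field lists; real ones come from the schema descriptors.
source = ["ID", "DE", "SQ", "AC"]
target = ["ID", "SQ", "SCORE"]
generate_mapping(source, target)
# → [("ID", "ID"), ("SQ", "SQ")]
```

The resulting pair list is what gets written to the editable mapping file, so the user can add pairs strict matching missed (e.g. differently named but equivalent fields).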
Wrapping Assumptions
• Convert semi-structured (and structured) data to structured data
• Both datasets are stored record-wise
• Order of records not disturbed after wrapping
⇒ Data can be transformed entry by entry
Application Analyzer
• Task: generate clear directions for the wrapper and organize them in WRAPINFO
• Sub-tasks
– What values to store
– How to extract values
– How to store values
– How to write values
Important Concepts (1)
• “Useful”
– An attribute is useful iff its values appear in the target
• “Reachable”
– Node b is reachable from node a if there exists a valid layout configuration in which a.DLM and b.DLM define the boundaries of a.VAR, i.e. “… a.DLM a.VAR b.DLM …”
– A value instance lies between
• Its own delimiter
• The first appearance of one of its reachable delimiters
Important Concepts (2)
• Attribute Cardinality
– Regular attribute: fixed number of values per
entry
• ID
– Semi-structured attribute: varied number of
values per entry
• References
WRAPINFO
• Contents: information to answer a particular wrapping task
• Form: XML
– 5 look-up tables
• Delimiter, Usefulness, Cardinality, Label, Reachable
– 3 parameters
• one_to_one_total, one_to_multiple_total, complete_in
• Function: plugs into the general modules to form a functional wrapper
Wrapper Generation
Road Map
• Motivation and overview of our approach
• System structure
• Wrapper generation
• Wrapper execution
• Experiments
Wrapper Overview
[Wrapper diagram: the Synchronizer loads WRAPINFO and issues run/halt calls to the DataReader and DataWriter. The DataReader fills the value buffer (one_to_one_values and one_to_multiple_values, with FA and RA sections) from the input dataset buffer; the DataWriter drains the buffer into the output dataset.]
Wrapper Structure
• One data module: WRAPINFO
• Three general action modules
– Synchronizer: central controller
– DataReader, DataWriter: interact with the datasets
• One value buffer
• Suitable for data grids
• Transforms data one entry at a time
Wrapper Execution
• DataReader
– Extract attribute value
• Delimiter table + Reachable table
– Fill value buffer: Label look-up table
• DataWriter
– Retrieve from value buffer: Label look-up table
– Write target file
• Delimiter table + Reachable table + label table
• Synchronizer
– Call DataReader on source: parameters
– Call DataWriter on target: parameters
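The read-buffer-write cycle above can be sketched end to end. This is a hypothetical, heavily simplified stand-in: the delimiter table is collapsed to label → (start, end) delimiter strings and the DataWriter to a format string, whereas the real wrapper also consults the Reachable, Cardinality and Label look-up tables.

```python
def make_reader(delimiters):
    """DataReader sketch: cut one entry into labelled values using a
    simplified delimiter table of label -> (start, end) strings."""
    def read(entry):
        buffer = {}
        for label, (start, end) in delimiters.items():
            lo = entry.find(start)
            if lo < 0:
                continue  # delimiter absent in this entry
            lo += len(start)
            hi = entry.find(end, lo)
            buffer[label] = entry[lo:] if hi < 0 else entry[lo:hi]
        return buffer
    return read

def synchronize(entries, reader, write_template):
    """Synchronizer sketch: one entry at a time, run the DataReader to
    fill the value buffer, then let the DataWriter (here just a format
    string) drain the buffer into the target form."""
    return [write_template.format(**reader(e)) for e in entries]

reader = make_reader({"ID": (">", " "), "SEQ": ("\n", "\n")})
out = synchronize([">seq1 c1\nAAAC\n"], reader, ">{ID}|{SEQ}")
# out == [">seq1|AAAC"]
```

Because each entry is processed independently, this loop also shows why the wrapper streams: memory use is bounded by one entry plus its value buffer.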
Wrapper Experiments (1)
• Analysis time: constant
• Execution time: linear
[Plot (logarithmic scale): TRANSFAC-to-Reference problem]
Wrapper Experiments (2)
• Performance comparable to handwritten code
[Plot: SWISSPROT-to-FASTA problem]
System Components
• Understand data
– Layout mining
– Schema mining
• Process data
– Metadata description language
– Wrapper generation
– Query execution
– Query execution with indices
Query Execution
Road Map
• Motivation
• System Overview
• System Implementation
– Languages
– System
• Experiments
Limitation of Wrapper
• Data wrapping = data formatting + data projection
• Other query types (new functionalities)
– Selection: value examination
– Cross product, join: multiple datasets
Advantages
• Retrieve multiple pieces of information all at
once
• Data easily available
• Declarative languages only
• High flexibility
• Low overhead
• Suitable for data grid
Enhanced System
[System diagram: the query parser reads the query and the source/target names; the descriptor parser pulls the dataset descriptors from the metadata collection into schema & layout information. In query analysis, the application analyzer combines the mappings and descriptors into QUERYINFOR. In query execution, the Synchronizer drives the DataReader and DataWriter over the source data files to produce the target data file.]
Query Execution
Road Map
• Motivation
• System Overview
• System Implementation
– Languages
• Metadata Description Language
• Query Language
– System
• Query Analysis
• Query Execution
• Experiments
Query Language
• Declarative, SQL-like
• Projection, selection, cross product and join queries
• Example:
AUTOWRAP POSTBLAST                  ← target dataset
FROM BLASTP, SWISSPROT              ← source datasets
BY BLASTP.SP_ID = SWISSPROT.ID      ← join criteria
WHERE                               ← attribute pairs
  POSTBLAST.QUERY = BLASTP.QUERY
  POSTBLAST.SP_AC = BLASTP.SP_AC
  POSTBLAST.SP_ID = BLASTP.SP_ID
  POSTBLAST.FULL_DESCR = SWISSPROT.DE
  POSTBLAST.SEQUENCE = SWISSPROT.SQ
  POSTBLAST.SCORE = BLASTP.SCORE
  POSTBLAST.E_VALUE = BLASTP.E_VALUE
Application Analyzer
Enhancement
• Constant values in the query
– Pseudo-label look-up table
• Other query information
– Parameters: comparing field pairs
• Output: QUERYINFOR
Query Execution
• Query-proc structure
[Diagram: QUERYINFOR plugs into the Synchronizer, DataReader and DataWriter, which read the source data files and write the target data file]
• DataReader and DataWriter
– Similar to the wrapper’s
• Value buffer
– Stores the useful values from one data entry of every source dataset
Enhanced Synchronizer
• Synchronizer
– Set up pseudo-attributes: Pseudo label look-up
table
– Call DataReader on source 1 and 2; Call
DataWriter on target: Parameters
– Test join conditions: Parameters
– Clean value buffer: Parameters
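The control flow above amounts to a nested-loop join: for each source-1 entry, scan source 2 for entries satisfying the join condition. A minimal sketch (hypothetical record dicts standing in for the value buffer; the `unique` flag corresponds to the UNIQUE/ALL modes used in the experiments):

```python
def join_query(source1, source2, key1, key2, unique=True):
    """Nested-loop join: for each source-1 entry, scan source 2 for
    entries whose key matches. UNIQUE halts at the first match per
    source-1 entry; ALL keeps scanning."""
    results = []
    for e1 in source1:
        for e2 in source2:
            if e1[key1] == e2[key2]:
                results.append({**e2, **e1})  # merge into one target entry
                if unique:
                    break
    return results

# Hypothetical entries shaped after the Post-BLAST query.
blast = [{"SP_ID": "FOSB_MOUSE", "SCORE": "338"}]
swiss = [{"ID": "FOSB_MOUSE", "SQ": "MFQAFP"},
         {"ID": "FOSB_HUMAN", "SQ": "MFQAFQ"}]
hits = join_query(blast, swiss, "SP_ID", "ID")
# one merged entry, carrying both the BLAST score and the Swiss-Prot sequence
```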
Post-BLAST Query
• Goal: Enhance BLAST output to FASTA format
• Query: Join query between BLAST output (source 1) and SWISSPROT (source 2)
• 2 modes
– UNIQUE: halt once a match is found in source 2
– ALL: search all source 2 entries
[Bar chart, time (sec, up to 2000), UNIQUE vs. ALL, for query sizes of 3, 5 and 12 sequences]
Chip-Supplement Query
• Goal: Look up microarray gene information into tabular format
• Query: Join query between protein array and yeast genome database
• 2 queries
– Chip-Supplement: array join genome
– Chip-Supplement-Sorted: genome join array
[Bar chart, time (sec, up to 90), UNIQUE vs. ALL, for Chip-Supplement and Chip-Supplement-Sorted]
OMIM-Plus Query
• Add reverse links of proteins to the disease database
• Join query between the OMIM and SWISSPROT databases
• Results in OMIM form
• 86.38 seconds/entry × 12,158 OMIM entries = 291.7 hours
System Components
• Understand data
– Layout mining
– Schema mining
• Process data
– Metadata description language
– Wrapper generation
– Query execution
– Query execution with indices
Query with Indices
Road Map
• Motivation and Overview
• System
• System Enhancement
– Language
– System Implementation
• Experiments
Query With Indices
Motivation
• Goal
– Improve the performance of query-proc program
• Index
– Maintain the advantages
• Flat file based
• Low requirement on programming
Challenges & Approaches
• Various indexing algorithms for various biological data
– User-defined indexing functions
– Standard function interfaces
• Flat file data
– Values parsed implicitly and ready to be indexed
– Byte offset as pointer
• Metadata about indices
– In the layout descriptor
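The "byte offset as pointer" idea can be sketched concretely: map each record's key to its byte offset in the flat file, so the retrieval function can `seek()` straight to candidate entries instead of scanning. A minimal sketch under two simplifying assumptions (one record per line, and a caller-supplied `extract_key`, both hypothetical):

```python
def build_index(path, extract_key):
    """Index generation sketch: record each line's key together with
    the byte offset where the line starts."""
    index = {}
    with open(path, "rb") as f:
        offset = f.tell()
        for line in iter(f.readline, b""):
            key = extract_key(line.decode())
            if key:
                index.setdefault(key, []).append(offset)
            offset = f.tell()  # start of the next record
    return index

def lookup(path, index, key):
    """Index retrieval sketch: seek directly to each stored offset."""
    with open(path, "rb") as f:
        for off in index.get(key, []):
            f.seek(off)
            yield f.readline().decode().rstrip("\n")
```

Because the index stores offsets rather than copies of the records, the flat file stays the single source of truth, which matches the "flat file based" goal above.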
System Revisit
[The enhanced-system diagram again, extended for indexing: the dataset descriptors now carry index metadata, and query execution consults the index file and the user-supplied index functions alongside QUERYINFOR, the DataReader, DataWriter and Synchronizer, reading the source data files and writing the target data file.]
Language Enhancement
• Describe indices
– Indexing is a property of the dataset
– Extend layout descriptors:
DATASET “name” {
  …
  INDEX {attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
    [, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc]}
}
– Maintain query format:
AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE …
• New meaning of “=”:
– If an index is available, use the index retrieving function
– Else, compare values directly
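The new “=” semantics reduce to one branch in the generated comparison code. A minimal sketch, where `index` and `retrieve` are hypothetical stand-ins for the descriptor's `index_file_loc` and `index_retr_fun`:

```python
def candidates(value, source2, field, index=None, retrieve=None):
    """New '=' semantics: if an index and its retrieving function are
    declared (via INDEX in the layout descriptor), use them to fetch
    candidate entries; else fall back to a direct value scan."""
    if index is not None and retrieve is not None:
        return retrieve(index, value)                     # indexed path
    return [e for e in source2 if e[field] == value]      # direct scan

# Hypothetical data shaped after CHIPDATA.GENE = YEASTGENOME.ID.
genome = [{"ID": "YAL001C"}, {"ID": "YAL002W"}]
index = {"YAL002W": [genome[1]]}
hits = candidates("YAL002W", genome, "ID", index, lambda idx, v: idx.get(v, []))
# same result as the unindexed scan, but without touching every entry
```

Keeping the query format unchanged means existing queries speed up automatically once an INDEX clause is added to the descriptor.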
System Enhancement
• Metadata Descriptor Parser
+ parse index information
• Application Analyzer
+ index information: index look-up table
+ test condition: compare_field_indexing
Query-Proc Enhancement
• Synchronizer
+ If an index is applicable, check the availability of the index data file
• If absent, call the index generation function
+ Load indices
+ Call the index retrieving function first for the candidate entry list
Microarray Gene Information Look-up
• Goal: gather information about genes (120)
• Query: microarray output join genome database
• Index: gene names in genome
[Bar chart, performance (sec): query analysis 0.01, index generation 0.72, query with indices 20.89, query w/o indices 81.59]
BLAST-ENHANCE Query
• Goal: Add extra information to BLAST output
• Query: BLAST output join Swiss-Prot database
• Index: protein ID in Swiss-Prot
[Bar chart, performance (sec, axis up to 1200), for query sizes 3, 5 and 12: index generation, query with indices, query w/o indices]
OMIM-PLUS Query
• Goal: add Swiss-Prot links to OMIM
• Query: OMIM join Swiss-Prot
• Index: protein ID in Swiss-Prot
[Bar chart, performance (sec, logarithmic axis up to 10,000,000): index generation, query with indices, query w/o indices]
Homology Search Query
• Goal: find similar sequences
• Query: query sequence list * sequence
database
• Indexing algorithm
– Sequence-based
– Transformation of sub-string composition
– Indexing n-D numerical values
Homology Search (1)
• Index (Singh’s algorithm)
– Data: yeast genome
– Wavelet coefficients
– Minimum bounding rectangles
[Plot, performance (sec, up to 350) vs. database size (1–5 × 9.8MB), for query sizes 10, 20 and 40; index generation shown separately]
Homology Search (2)
• Index (Ferhatosmanoglu’s algorithm)
– Data: GenBank
– Wavelet coefficients
– Scalar quantization
– R-tree
[Plot, performance (sec, up to 30) vs. database size (1–5 × 250MB), for query sizes 10, 20 and 40]
Road Map
• Mission Statement
• Motivation
• Implementation
• Comprehensive Example
• Future work
• Conclusion
Gene Name Nomenclature
• It is crucial to identify genes CORRECTLY and UNAMBIGUOUSLY
– Genes with multiple names
– Multiple genes sharing the same name
• Historically, little central control over the naming process
“…As biologists strive to make sense of the growing wealth of genomic information, this messy nomenclature is becoming a bugbear…”
— Helen Pearson, Nature, 2001
Gene Names in DBs
• Databases related to genes
– Genome databases (the main force in nomenclature)
• SGD (yeast)
• HGNC (human)
• TAIR (a plant)
• dictyBase (a one-celled amoeba)
– Curated gene databases
• Entrez Gene by NCBI
– Curated gene product databases
• Swiss-Prot by SIB and EBI
Queries About Gene Names
• Gene identifier usage across databases
– How are gene symbols in DB A used in DB B?
– How are gene aliases in DB A used in DB B?
• Nomenclature across species
– Q1-Q2: genome – Entrez Gene, Swiss-Prot
– Q3-Q4: Entrez Gene – Swiss-Prot
• Nomenclature over time
– Q5-Q7: Swiss-Prot – genome
Challenges
• Various data representations → metadata descriptors
– Line-based texts
– Tabular forms with or without titles
– Format evolves over time → format and schema learning
• Data storage → flat file processing
– Large volume
– Each file queried a limited number of times
Integration System Revisit
[The system diagram instantiated for this study. Understand Data: the data files are the genome databases, Entrez Gene and Swiss-Prot; the Layout Miner and Schema Miner produce the metadata description (one layout descriptor and one schema descriptor per dataset). Process Data: the user requests are join queries, answered by the Query Processor through code generation.]
Nomenclature Results (1)
• Across species
[Two bar charts, percentage (%), one bar group per database (SGD, HGNC, TAIR, dictyBase). Q1-Q2 (axis to 90%): usage as Entrez Gene ID, Entrez Gene Alias, Swiss-Prot ID, Swiss-Prot Alias. Q3-Q4 (axis to 60%): usage as Swiss-Prot ID, Swiss-Prot Alias.]
Nomenclature Results (2)
• Over time
– Q5: How many gene IDs in Swiss-Prot are gene IDs in the genome database?
– Q6: How many gene IDs in Swiss-Prot are aliases in the genome database?
– Q7: How many gene aliases in Swiss-Prot are gene IDs in the genome database?
Performance
• Linear w.r.t. source 1 size
Conclusion
• A framework and a set of tools for on-the-fly flat file data integration
– New data sources understood semi-automatically by data mining tools
– New data processed automatically by generated programs
• Advantages
– High-level interface, flat file based, acceptable performance, low maintenance cost