(Almost) Hands-Off
Information Integration for the Life Sciences
Ulf Leser, Felix Naumann
Presented By: Darshit Parekh
Table of Contents
• Introduction
- Introduction to Life Sciences
- Data Integration in Life Sciences
• ALADIN
- Features of ALADIN
- System Architecture for ALADIN
- System Components Description
• A Case study on Protein Data
- Comparison of ALADIN with other existing Technologies
- Advantages, Challenges and Bottlenecks in ALADIN
• Summary
- Demo on COLUMBA Project
1. Introduction to Life Sciences
- Life science is the study of living things: plants and animals.
- It helps to explain how living things relate to each other and to their surroundings.
- It is the in-depth study of living organisms.
- More specifically, the following fields are included:
  - Agrotechnology
  - Animal Science
  - Bio-Engineering
  - Bioinformatics and Biocomputing
  - Cell Biology
  - Neuroscience
- In short, it is a broad field that studies life.
2. Data Integration in Life Sciences
- Data integration in the life sciences has been a topic of intensive research.
- It is one of the areas where a large number of complex databases are freely available.
- Research in this area is important because it supports advances in medical technology and hence in human health and wellness.
- The data required for this kind of research and analysis is scattered across many heterogeneous and autonomous data sources.
- Life science databases have a number of traits that must be considered when designing a data integration system for them.
- A life science database typically stores only one primary type of object, which is described by rich and sometimes deeply nested annotations.
- An example is Swiss-Prot, which is essentially a protein database.
- The “hubs” of the life science data world provide links to a large number of databases, pointing from the primary objects (for example, proteins) to further information such as protein structure, publications, taxonomic information, the gene encoding the protein, related diseases, and known mutations.
- Each link is stored internally as a pair (target database, accession number).
- Links are presented as hyperlinks on web pages and help the end user retrieve useful information from the protein database.
- With a large number of databases, identifying duplicates is an important task that must be performed carefully.
- Data integration in the life sciences follows either a manual data curation approach or a schema mapping and integration approach.
• Manual Data Curation
- Projects of this type achieve a high standard of quality in the integrated data through manual curation; they are data focused.
- The curation work is performed by experienced professionals.
- Example: Swiss-Prot collects data on protein sequences from journal publications, submissions, personal communications, and other databases.
- Data-focused projects are typically managed by domain experts such as biologists.
- Very few database concepts or technologies are used in this case.
- The data is kept in a text-like form.
- Even where detailed schemata are developed, they cannot be used to query the database.
• Schema Focused
- Projects of this type make use of database technology and are maintained by computer scientists, database analysts, database programmers, and the like.
- These projects aim at providing integration middleware rather than building concrete databases.
- Techniques such as schema integration, schema mapping, and mediator-based query rewriting are used.
- Examples: TAMBIS, OPM.
- They require some sort of wrapper for query processing, as well as a detailed semantic mapping between the heterogeneous source schemata and a global mediated schema.
- The mappings must be specified in a special language, which makes the work very difficult for domain experts.
- Data-focused projects are very successful in the biological community, but this success comes at a price.
- Schema-focused projects are hardly used in real life science projects and have not received the attention they were expected to.
- The major reason for this failure is that they are schema centric.
- Schema mapping and integration also leave biologists in a fix, as they are not used to database technologies.
3. Introduction to ALADIN
- ALADIN is a novel combination of data and text mining, schema matching, and duplicate detection that achieves a high level of automation.
- It reveals previously unseen relationships between objects, thus directly supporting the discovery-based work of life science researchers.
- ALADIN makes two major contributions: first, it is a knowledge resource for life science research; second, it poses challenges and exposes bottlenecks for database research.
- ALADIN: Almost Automatic Data Integration.
- Its novel features include automatic integration with minimal information loss and attention to information quality.
- The proposed technique aims to improve on both the data-focused and the schema-focused techniques.
4. Features of ALADIN
- ALADIN's architecture consists of several components that together allow for the automatic integration of diverse data sources into a global, materialized repository of biological objects and the links between them.
- The databases that ALADIN integrates contain data that is semi-structured and text centric.
- ALADIN uses a relational database as its basis.
- It can integrate any type of data source for which a relational representation exists or can be generated.
- It can integrate data from XML files and flat files using appropriate parsers.
- Integration in ALADIN does not depend on predefined integrity constraints structuring a schema. Instead, it uses techniques from schema matching and from data and text mining to detect the relations containing primary objects in each data source, to infer relationships between relations and objects within one source, and to infer relationships between objects in different sources.
- The system does not rely on any particular schema for its data sources.
- Generic parsers are used, such as generic XML-to-relational mapping tools.
- ALADIN integrates data sources in a five-step process:
  - First step: the data source is imported into a relational format.
  - Second step: from the relational representation, ALADIN tries to find the relation that represents the primary objects of the data source.
  - Third step: the fields containing annotations for the primary objects are detected. Existing integrity constraints are used where available; otherwise they are guessed by analyzing the data.
  - Fourth step: links between the objects of the primary relations of different data sources are searched for. Links are generated based on the similarity of text fields.
  - Fifth step: duplicates are detected across different data sources and removed.
- Once the data is imported, the process is almost automatic.
- ALADIN also supports structured queries, detects and flags duplicate objects, and adds a wealth of additional links between objects that are undiscoverable when looking at each database in isolation.
- ALADIN can be readily browsed without any schema information.
- ALADIN is useful in scenarios where an explorative approach is necessary.
5. System Architecture for ALADIN
The Main Components of the Architecture
1. Data Import
- The data source needs to be imported into the relational database system.
- Where no downloadable import method exists, ALADIN requires human work at this point.
- This situation is rare; most of the time, parsers are readily available.
- Schema design or redesign is not required.
2. Discovery of Primary Objects
- This step identifies the primary objects, which are stored in the primary relation.
- The primary relation contains data about the main objects of interest in the source, such as “proteins” or “diseases”.
- The relations store a primary key, but no information about foreign keys is available.
3. Discovery of Secondary Objects
- Secondary objects are additional information about primary objects.
- Cardinalities of relationships are determined in this step.
- At the end of this step, the internal structure of the newly imported data source is known.
- Errors can occur in this step while identifying relationships.
- ALADIN can minimize these errors by introducing performance measure parameters.
4. Link Discovery
- This step searches for attributes that are cross-references to objects in other data sources.
- Cross-references always point to primary objects in other data sources, as these are the objects with stable public IDs.
- The output of the second step is a necessary input for determining all possible link targets.
- In principle, this step requires comparing all pairs of attributes from all sources.
5. Duplicate Detection
- This step searches for a special kind of “link” between primary objects in different data sources: links between objects that represent the same real-world object.
- Such duplicate links are established if two objects are sufficiently similar according to some similarity measure.
- Knowledge of duplicates enhances the user's browsing and querying experience.
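The duplicate detection step can be illustrated with a minimal sketch. It assumes that each primary object is represented by a single descriptive string and that a generic string similarity (Python's difflib) with a fixed threshold stands in for the similarity measure; the function names, the threshold of 0.9, and the toy records are all hypothetical and are not taken from ALADIN.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_links(source_a, source_b, threshold=0.9):
    """Compare every object of source_a with every object of source_b.

    Each source is a dict mapping an object ID to a descriptive string.
    Returns (id_a, id_b, score) for every pair scoring at or above threshold.
    """
    links = []
    for (id_a, text_a), (id_b, text_b) in product(source_a.items(), source_b.items()):
        score = similarity(text_a, text_b)
        if score >= threshold:
            links.append((id_a, id_b, score))
    return links


# Hypothetical toy records, not taken from any real database.
source_a = {"A1": "Hemoglobin subunit alpha", "A2": "Insulin precursor"}
source_b = {"B1": "hemoglobin subunit alpha", "B2": "Cytochrome c oxidase"}

duplicates = find_duplicate_links(source_a, source_b)
```

The pairwise comparison is quadratic in the number of objects, which is one reason why the choice of similarity measure and threshold matters for larger sources.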
6. Browse, Search and Query Engine
- Once the data is integrated into the system, there are three modes of access.
- Browsing displays objects and the different kinds of links that users can follow.
- Search allows users to run a full-text search on all stored data, or a focused search restricted to certain partitions of the data, such as a certain data source or a particular field.
- Querying allows full SQL queries on the schemata as imported.
- Appropriate graphical user interfaces are provided for these operations.
7. Metadata Repository.
- It contains known and discovered schemata, information about primary and secondary relations, statistical metadata, and sample data to improve discovery efficiency.
Integration Steps in ALADIN
6. System Components Description
1. Data Import
- The data source is read into the relational database.
- Integrity constraints are not necessary at this point.
- Some databases, such as Swiss-Prot and the Gene Ontology, provide direct relational dump files.
- For text-based exports, readily downloadable parsers are available.
- Examples are the BioSQL and BioPerl packages, which are able to read the Swiss-Prot and GenBank databases.
- Some databases provide parsers with their export files, such as the OpenMMS parser for the Protein Data Bank (PDB).
- Databases exported as XML files can be parsed using a generic XML shredder.
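To make the idea of a generic XML shredder concrete, here is a minimal sketch. It assumes one generic table, here called node (a hypothetical name), that stores every XML element as a row with a reference to its parent; attributes are ignored for brevity. It is not the tool used by ALADIN, only an illustration of mapping XML to relations without knowing the schema in advance.

```python
import sqlite3
import xml.etree.ElementTree as ET


def shred(xml_text, conn):
    """Store every element of an XML document as one row in a generic table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "node_id INTEGER PRIMARY KEY, parent_id INTEGER, tag TEXT, text TEXT)"
    )
    # Depth-first walk with an explicit stack of (element, parent_id) pairs.
    stack = [(ET.fromstring(xml_text.strip()), None)]
    while stack:
        element, parent_id = stack.pop()
        text = (element.text or "").strip() or None
        cursor = conn.execute(
            "INSERT INTO node (parent_id, tag, text) VALUES (?, ?, ?)",
            (parent_id, element.tag, text),
        )
        node_id = cursor.lastrowid
        # Reverse so that children are popped in document order.
        for child in reversed(list(element)):
            stack.append((child, node_id))
    conn.commit()


# Hypothetical sample document; the accession value is made up.
sample = """
<entries>
  <protein>
    <accession>X00001</accession>
    <name>example protein</name>
  </protein>
</entries>
"""

conn = sqlite3.connect(":memory:")
shred(sample, conn)
rows = conn.execute(
    "SELECT node_id, parent_id, tag, text FROM node ORDER BY node_id"
).fetchall()
```

Because the table layout is independent of the document, the same function can import any XML export; the later discovery steps then have to recover the meaningful structure from the data.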
2. Discovery of Primary Relations
- The primary objects are discovered without help from the parsers.
- Heuristic rules, together with the schema and the actual data, are used to determine the primary relation.
- The rules are derived from previous experience with data integration.
- An SQL query is run on each attribute.
- The attributes found are alphanumeric in nature and are called accession numbers.
- Foreign key relationships and cardinalities are determined.
- The detected primary relation and the set of relationships are the input to the next steps.
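The per-attribute check can be sketched as follows. This is only an illustration under stated assumptions: it treats a column as an accession number candidate if its values are unique, all have the same length, and each mixes letters and digits. The function name, the toy table, and its values are hypothetical, and the real heuristics in ALADIN may differ.

```python
import sqlite3


def accession_candidates(conn, table):
    """Return the columns of `table` whose values look like accession numbers."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    candidates = []
    for column in columns:
        total, distinct, min_len, max_len = conn.execute(
            f"SELECT COUNT(*), COUNT(DISTINCT {column}), "
            f"MIN(LENGTH({column})), MAX(LENGTH({column})) FROM {table}"
        ).fetchone()
        # Reject empty tables, non-unique columns, and varying value lengths.
        if total == 0 or distinct != total or min_len != max_len:
            continue
        values = [str(row[0]) for row in conn.execute(f"SELECT {column} FROM {table}")]
        # Every value must contain at least one letter and one digit.
        if all(any(c.isalpha() for c in v) and any(c.isdigit() for c in v) for v in values):
            candidates.append(column)
    return candidates


# Hypothetical toy data; the accession values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry (bioentry_id INTEGER, accession TEXT, taxon_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO bioentry VALUES (?, ?, ?, ?)",
    [
        (1, "P00001", 9606, "alpha"),
        (2, "P00002", 9606, "beta protein"),
        (3, "Q00003", 10090, "gamma"),
    ],
)
candidates = accession_candidates(conn, "bioentry")
```

Table and column names are interpolated into the SQL text because identifiers cannot be bound as parameters; in this sketch they come from the database's own catalog, not from user input.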
3. Discovery of Secondary Relations
- This step addresses the need to connect objects in one source to objects in other sources.
- It determines the description and annotation that are displayed together with the primary object in the web interface.
- The paths from the primary relation to each of the other relations of the data source are computed using the transitivity of relationships.
- The paths are stored in the metadata repository.
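The path computation can be sketched as a breadth-first search over the relationships found in the previous step. This is a minimal illustration, assuming that relationships are given as pairs of relation names and can be followed in both directions; the function name and the example relationships are hypothetical.

```python
from collections import deque


def paths_from_primary(relationships, primary):
    """Return one shortest path from `primary` to every reachable relation.

    `relationships` is a list of (relation_a, relation_b) pairs; each pair is
    treated as traversable in both directions.
    """
    neighbours = {}
    for a, b in relationships:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    paths = {primary: [primary]}
    queue = deque([primary])
    while queue:
        current = queue.popleft()
        for nxt in neighbours.get(current, []):
            if nxt not in paths:
                # Extend the path to `current` by one more relationship.
                paths[nxt] = paths[current] + [nxt]
                queue.append(nxt)
    del paths[primary]
    return paths


# Illustrative relationships between relations of one source (assumed).
relationships = [
    ("bioentry", "seqfeature"),
    ("seqfeature", "ontology_term"),
    ("bioentry", "biosequence"),
]
paths = paths_from_primary(relationships, "bioentry")
```

Storing the resulting paths once, as the slide notes, avoids repeating the search whenever an object and its annotations are displayed.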
4. Link Discovery
- Links are either explicit or implicit.
- Explicit links are the cross-references found in life science databases.
- Cross-references are stored as accession numbers, e.g. “ENSG00000042753” or “Uniprot:P11140”.
- String matching techniques are needed to detect them.
- Many relationships are not explicitly stored.
- Implicit relationships are discovered by searching for similar data values in the other data sources.
- Three types of comparison are taken into consideration:
  - First, DNA, RNA, or protein sequences are compared to each other.
  - Second, attributes containing longer text strings, such as textual descriptions, are analyzed using information retrieval and text mining.
  - Third, the use of standard vocabularies across the data sources is exploited.
- The discovered links are stored in the metadata repository to avoid repeated discovery and computation at query time.
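Detecting explicit cross-references can be sketched with simple string handling. The example below assumes that a cross-reference value is either a bare accession number or a "database:accession" pair, and that the accession numbers of each source's primary objects are already known from the earlier steps. The function names and the toy accession sets are hypothetical.

```python
def parse_cross_reference(value):
    """Split 'Uniprot:P11140' into ('uniprot', 'P11140').

    Values without a prefix, such as 'ENSG00000042753', yield (None, value).
    """
    prefix, sep, accession = value.strip().partition(":")
    if sep:
        return prefix.strip().lower(), accession.strip()
    return None, prefix


def resolve_links(values, known_accessions):
    """Map each cross-reference value to the sources whose accessions contain it."""
    links = {}
    for value in values:
        database, accession = parse_cross_reference(value)
        links[value] = [
            source
            for source, accessions in known_accessions.items()
            if accession in accessions and database in (None, source)
        ]
    return links


# Toy sets of primary accession numbers per source (assumed for illustration).
known_accessions = {
    "uniprot": {"P11140"},
    "ensembl": {"ENSG00000042753"},
}
links = resolve_links(
    ["ENSG00000042753", "Uniprot:P11140", "Uniprot:XXXXXX"], known_accessions
)
```

A value that matches no known accession, like the last one, simply produces no link; implicit links based on sequence or text similarity would need the more expensive comparisons listed above.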
7. A Case Study on Protein Data
- The design decisions of ALADIN were taken based on experience drawn from past integration projects in this domain.
- The paper discusses the most recent of these, the COLUMBA project.
- COLUMBA is an integrated, relational data warehouse that annotates protein structures taken from the Protein Data Bank (PDB).
- The data covers the following properties of proteins:
  - Classification of structures
  - Protein sequence and sequence features
  - Functional annotation of proteins
  - Participation in metabolic pathways
- The extraction and transformation from the initial data source schema into the target schema is currently hard-coded and requires a lot of effort.
- Understanding the source schemata is very difficult, as they are poorly documented and often use structures that are hard to understand just by looking at the schema.
- The transformation requires operations that are not defined in current schema mapping languages.
- SQL and Java are used for this purpose.
- COLUMBA annotates protein structures from the Protein Data Bank. It also includes the protein fold classification databases SCOP and CATH.
- Further functional and taxonomic annotation is available from Swiss-Prot, the Gene Ontology (GO), and the NCBI Taxonomy.
Figure: Part of the BioSQL schema. Arrows indicate candidates for primary relations and cross-references.
- In the complete schema there are three tables with an in-degree above five.
- The BioEntry table is used to store primary objects. Its columns:
  - Bioentry Id
  - Display Id
  - Identifier
  - Accession
  - Description
  - Entry Version
  - Biodatabase Id (FK)
  - Taxon Id (FK)
- The OntologyTerm table is used to store functional descriptions. Its columns:
  - Ontology Term Id
  - Term Name
  - Term Definition
  - Term_Identifier
  - Category_Id (FK)
- The SeqFeature table stores a meta-representation of sequence features. Its columns:
  - Seqfeature Id
  - Seqfeature Rank
  - Bioentry Id
  - Ontology Term Id (FK)
  - Seqfeature Source Id (FK)
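The in-degree mentioned for the complete schema can be computed from a list of foreign keys, as in the sketch below. It assumes that foreign keys are available as (referencing table, referenced table) pairs. The table names partly follow the slides, but the list itself and the threshold are made up for illustration and do not reproduce the full BioSQL schema.

```python
from collections import Counter


def in_degrees(foreign_keys):
    """Count, for each table, how many foreign keys reference it."""
    return Counter(target for _source, target in foreign_keys)


def primary_relation_candidates(foreign_keys, min_in_degree):
    """Return the tables referenced by at least `min_in_degree` foreign keys."""
    degrees = in_degrees(foreign_keys)
    return sorted(table for table, degree in degrees.items() if degree >= min_in_degree)


# Illustrative (referencing, referenced) pairs; not the complete BioSQL schema.
foreign_keys = [
    ("seqfeature", "bioentry"),
    ("biosequence", "bioentry"),
    ("comment", "bioentry"),
    ("bioentry", "biodatabase"),
    ("bioentry", "taxon"),
    ("seqfeature", "ontology_term"),
]
degrees = in_degrees(foreign_keys)
candidates = primary_relation_candidates(foreign_keys, min_in_degree=2)
```

A high in-degree alone does not identify the primary relation; in the case study it only narrows the choice, and the accession number check decides between the remaining tables.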
- The BioEntry table has an accession number candidate whose values mix letters and digits and all have the same length.
- The other fields in BioEntry are either non-unique (e.g. Taxon Id), contain no alphabetic characters (e.g. Bioentry Id), or have varying length (e.g. name).
- This table therefore qualifies in ALADIN as the primary relation.
- Primary and foreign keys are determined by analyzing the scope of the different attributes storing surrogate keys.
- In the next step in COLUMBA, protein structures are connected to annotations using existing cross-references or by matching sequence similarity.
- The BioSQL schema contains several attributes whose values are excellent candidates for finding implicit links:
  - OntologyTerm.Term_Definition, linking to biological ontologies.
  - BioEntry.Description, linking to disease- or gene-focused databases.
  - Biosequence.Biosequence_str, containing the actual DNA or protein sequence.
- Duplicate detection is an important step here, as protein structures from the PDB are available in three different flavors:
  - The original PDB files.
  - A cleansed version available as dump files.
  - A cleansed version available with a parser.
- The PDB accession number is available in all three versions, so removing redundancy is easy in this case.
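When all versions share an accession number, duplicates can be found by grouping on that number, as in this minimal sketch. The source names, the record contents, and the case-insensitive comparison of accession numbers are assumptions made for illustration.

```python
def merge_by_accession(*sources):
    """Group records from several sources by their shared accession number.

    Each source is a (source_name, records) pair, where records maps an
    accession number to a record. Accession numbers are compared
    case-insensitively (an assumption of this sketch).
    """
    merged = {}
    for source_name, records in sources:
        for accession, record in records.items():
            merged.setdefault(accession.upper(), {})[source_name] = record
    return merged


# Hypothetical records; the accession numbers are made up.
original = ("pdb_original", {"1abc": "raw entry"})
dump = ("cleansed_dump", {"1ABC": "cleansed entry"})
parsed = ("cleansed_parser", {"1ABC": "parsed entry", "2XYZ": "parsed entry"})

merged = merge_by_accession(original, dump, parsed)
```

Each key of the result then stands for one real-world structure, with the available versions side by side.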
• SQL example: fetch the accessions of all sequences from Swiss-Prot:
    SELECT DISTINCT bioentry.accession
    FROM bioentry
    JOIN biodatabase USING (biodatabase_id)
    WHERE biodatabase.name = 'swiss';  -- or 'Swiss-Prot'
• SQL example: count the unique entries in GenBank:
    SELECT COUNT(DISTINCT bioentry.accession)
    FROM bioentry
    JOIN biodatabase USING (biodatabase_id)
    WHERE biodatabase.name = 'genbank';
8. Some Related Work
- DiscoveryLink
- OPM
- TAMBIS
- These make use of schema information rather than the data.
- SRS: the primary and secondary relations need to be specified explicitly in the parsers.
- GenMapper
- BioMediator
- The project closest to the ALADIN proposal is the Revere project.
Comparison of ALADIN with the existing technologies
9. Challenges and Bottlenecks in ALADIN
- ALADIN faces a true challenge in terms of the size, number, and complexity of the data sources to be integrated.
- Incorrectly identified primary or secondary relations lead to incorrect targets for link discovery.
- Incorrect links in turn reduce the precision of duplicate detection.
- The issue of performance is not addressed in the paper.
- Integrating new data sources with the existing ones is inefficient, as it involves many calculations, sorting, and schema matching, so it takes a long time to achieve the desired results.
- Another important problem is data change: when the data in a source changes, all links need to be recomputed, which involves a lot of overhead.
10. Summary
- The ALADIN architecture and framework appear to be a novel proposal for data integration in the life sciences.
- The design is almost automatic, using text mining, schema matching, data mining, and information retrieval.
- ALADIN offers clear added value to the biological user compared to the current data landscape.
- It enables structured queries that cross several databases.
- The system suggests many new relationships interlinking all areas of the life sciences and offers ranked search capabilities across databases for users who want Google-style information retrieval.
- An example is a query for the genes of a certain species on a certain chromosome that are connected to a disease via a protein whose function is known. For each of the object types in the query, several potential data sources exist, and the system takes all of them into account, a feature not supported by any of the other integration technologies.
11. DEMO OF COLUMBA PROJECT