Semantic Web Approach to Biological Database Integration

Download Report

Transcript Semantic Web Approach to Biological Database Integration

Biological Databases, Integration,
and Semantic Web
Kei Cheung, Ph.D.
Yale Center for Medical Informatics
Genomics and Bioinformatics, December 4, 2006
Outline
• Database introduction
– Overview
– Query language
• Database integration
– Issues
• Semantic Web approach to database
integration
– Overview of Semantic Web
Introduction
• The Human Genome Project has transformed
the biological sciences into information sciences
• Advances in the biological sciences depend on:
– creation of new knowledge
– effective information management
• Future progress in biological research will be
highly dependent on the ability of the scientific
community to both deposit and utilize stored
information on-line.
• The database challenge for the future will be to
develop new ways to acquire, store and retrieve
not only biological data, but also the biological
context for these data.
Variety of Biological Databases
• Different data categories
– DNA sequence, gene expression, protein
structure, pathway, etc
• Community vs. lab-specific vs. proprietary
databases
• Mega vs. medium vs. boutique databases
• One thing in common: many of them are
Web accessible
Food for thoughts
• Will a biological database different a
biological journal?
What is a database?
• A database is a collection of records stored in a
computer in a systematic way, so that a
computer program can consult it to answer
questions.
• The items retrieved in answer to queries
become information that can be used to make
decisions.
• The computer program used to manage and
query a database is known as a database
management system (DBMS)
– E.g., Oracle, MS Access, MySQL
Database components
• The central concept of a database is that of a
collection of records, or pieces of knowledge
• For a given database, there is a structural
description of the type of facts held in that
database: this description is known as a schema
• The schema describes the objects that are
represented in the database, and the
relationships among them.
Data Model
• There are a number of different ways of
organizing a schema (i.e., of modeling the
database structure): these are known as
data models.
– Relational model
– Hierarchical model
– Network model
– Object oriented model
Query Language
• A query language is a computer languages used
to create, modify, retrieve and manipulate data
from databases
• SQL (Structured Query Language) is a wellknown query language for relational databases
– SQL is an ANSI standard language for RDBMS’s
– Different RDBMS’s vendors may provide slightly
different SQL syntax or additional proprietary
extensions that are applicable only to their systems
SQL
•
•
•
•
•
•
CREATE TABLE
INSERT
SELECT
UPDATE
DELETE
CREATE VIEW
CREATE TABLE
CREATE TABLE <tablename> (
<column1> <data type1> [<constraint1>],
<column2> <data type2> [<constraint2>],
<column3> <data type3> [<constraint3>],
…
);
Example
CREATE TABLE sgd_features(
sgd_id VARCHAR(20) NOT NULL PRIMARY KEY,
feature_type VARCHAR(20) NOT NULL DEFAULT ‘ORF’,
quality VARCHAR(20),
feature_name VARCHAR(20),
standard_name VARCHAR(20),
chromosome INT(2) NOT NULL,
start_coord INT(10) NOT NULL,
end_coord INT(10) NOT NULL,
strand CHAR(1) NOT NULL,
description VARCHAR(500)
);
INSERT
INSERT INTO <table> (<column1>, …, <columnN>)
VALUES (<value1>, …, <valueN>);
Example
…
INSERT INTO empinfo (sgd_id, feature_type, feature_name,
chromosome, start_coord, stop_coord,strand, description)
VALUES (‘S000006692’, ‘tRNA’, ‘tQ(UUG)C’,
3, 168368, 168297, ‘C’, ‘tRNA-Gln’);
…
SELECT
SELECT [DISTINCT] <col1> [as <alias1>] [, <col2> [as <alias2>], ...]
FROM <table1> [as <alias1>] [, <table2> [as <alias2>] , …]
WHERE <Boolean conditions>;
[additional clauses]
Typical Conditional Operators: =, >, >=, <, <=, <>, LIKE, IN
Additional Clauses: ORDER BY, GROUP BY, HAVING, LIMIT
Built-in functions: UPPER, LOWER, SUBSTRING, LENGTH,
COUNT, MAX, MIN, AVG, etc
DISTINCT
SELECT DISTINCT chromosome
FROM sgd_features
WHERE (feature_type=‘ORF’);
The answer is: 1,2,4,9,10,15,16,17
count function
SELECT COUNT(*)
FROM sgd_features
WHERE (start_coord < 300000) AND (feature_name LIKE ‘Y%’);
The answer is 6
string function
SELECT LENGTH(feature_name)
FROM sgd_features
WHERE (id=‘S000007274’);
The answer is 5
math function
SELECT MIN(start_coord)
FROM sgd_features
WHERE (strand=‘W’);
ORDER BY
SELECT sgd_id, feature_type, feature_name, chromosome
FROM sgd_features
WHERE (feature_name like ‘Y%’)
ORDER BY start_coord DESC;
GROUP BY
This allows aggregate function to be performed on the column(s)
SELECT feature_type, AVG(stop_coord-start_coord) as “avg_diff”
FROM sgd_features WHERE (strand = ‘W’) GROUP BY feature_type;
HAVING
This is the same as the WHERE clause except it is performed upon
the data that have already retrieved from the database
SELECT feature_type, AVG(stop_coord-start_coord) as ‘avg_diff’
FROM sgd_features WHERE (strand = ‘W’) GROUP BY feature_type;
HAVING (AVG(stop_coord-start_coord) > 1000);
LIMIT [start, ] rows
Returns only the specified number of rows.
SELECT *
WHERE (feature_type=‘ORF’)
LIMIT 3
Join
sgd_features
gene_ont
SELECT s.feature_name, s.feature_type,
s.chromosome, g.bio_function, g.bio_process,
g.cell_location
FROM sgd_features as s, gene_ont as g
WHERE (s.sgd_id=g.sgdid);
UPDATE
UPDATE <table>
SET <col1>=<val1> [,<col2>=<val2>, …]
[WHERE clause];
Example
UPDATE empinfo
SET quality=‘Verified’
WHERE sgd_id=‘S00000010’
DELETE
DELETE FROM <table> [WHERE clause];
Example
DELETE FROM sgd_features WHERE sgd_id IN
(‘S00000010’, ‘S000003599’);
DELETE FROM sgd_features;
(this deletes all data in the table)
CREATE VIEW
CREATE VIEW <viewname> [<col1>, <col2>, …]
AS SELECT …;
Example (VIEW)
CREATE VIEW sgd_features_ORF_W
AS SELECT *
FROM sgd_features
WHERE feature_type=‘ORF’ AND strand=‘W’;
Other Database Topics
• Normalization
• Query optimization
• Maintenance
Database Integration
Needs for database integration
• Biological data are more meaningful in context,
no single DB supplies a complete context for a
given biological research study
• New hypotheses are derived by generalizing
across a multitude of examples from different
DBs
• Integration of related data enables validation
and consistency checking
Example
• Find the genes for a metabolic pathway,
which are localized within a genome (e.g.,
find the clusters of genes involved in
tryptophan biosynthesis, and in the histidine
biosysthesis, in the E. coli genome)
• This query involves integrating data from
pathway databases and genome mapping
databases
Issues
Growth in the number of biological
databases (published in NAR DB issue)
Database Size
Database Complexity
(e.g., DNA Microarray Database)
Different Levels of
Heterogeneity
• This problem arises from the fact that
different databases are designed and
developed independently to address local
needs.
• There are two broad levels of
heterogeneity
– Syntactic heterogeneity
– Semantic heterogeneity
Syntactic Heterogeneity
• Technological heterogeneity
– Difference in hardware platforms, operating systems,
access methods (e.g., HTTP, ODBC, etc)
– Difference in database engines, query languages,
data access interfaces, etc
• Different data formats/models
– structured vs. un-structured data, files vs. databases,
etc
– Object-oriented vs. relational data model
Semantic Heterogeneity
•
Nomenclature problem
– Gene/protein symbols/names (based on phenotype, sequence, function,
organisms, etc)
•
•
•
•
TSC1
ABCC2
ALDH
Sonic Hedgehog
– ID proliferation
• Different ID schemes: 1OF1 (PDB ID) and P06478 (SwissProt ID) correspond to
Herpes Thymidine Kinase
• Lexcial variation: GO1234, GO:1234, GO-1234
– Synonyms vs. homonyms
•
•
•
•
•
Dopamine receptor D2: DRD2, DRD-2, D2
Armadillo (fruitflies) vs. i-catenin (mice)
PSM1 (human) = PSM2 (yeast); PSM1 (yeast) = PSM2 (human)
PSA: prostate specific antigen, puromycin-sensitive aminopeptidase, psoriatric
arthritis, professional skaters association
“Biologists would rather share their toothbrush than a gene name …
Gene nomenclature is beyond redemption”, said Michael Ashburner
Other Semantic Heterogeities
• Inconsistent values
– The same set of markers may be found in different
orders across different mapping databases (e.g., GDB
and DB/12)
• Different units of measurement
– Kb vs. bp
• Different data coding schemes
– over-expressed vs. 2-fold change
Use of Standards to Address
the Heterogeneity Issue
• Data specification standards (e.g. MIAME)
• Data representation standards
– Syntax (e.g., XML, ASN.1)
– Structure (e.g., MAGE-ML)
• Standard vocabulary/ontology (e.g.,
MGED ontology working group, gene
ontology, etc)
Semantic Web
Semantic Web
• It provides a standard framework that allows data to be
integrated and reused across application, enterprise, and
community boundaries
• It is a web of data linked up in such a way as to be easily
processable by machines, on a global scale.
• It is about two things:
– It is about common formats for interchange of data
– It is about language for recording how the data relates
to real world objects.
• It is a collaborative effort led by the World Wide Web
Consortium or W3C with participation from a large
number of researchers and industrial partners
Semantic Web for the Life Sciences
• “… the life sciences are a flagship area for the semantic
web …” (Tim Berners-Lee)
• “… Today, boundaries that inhibit data sharing limit
innovation in research and clinical settings alike, and
impede the efficient delivery of care. Semantic web
technologies give us a chance to solve this problem,
resulting, ultimately, in faster drug targeting, more
accurate reporting, and better patient outcomes.” (Susan
Hockfield, President of MIT)
• Semantic Web Health Care and Life Sciences Interest
Group (SW HCLSIG)
– http://www.w3.org/2001/sw/hcls/
Component Technologies of
Semantic Web
• Uniform Resource Identifier (URI)
– A standard means of addressing resources on the
Web
– e.g., http://en.wikipedia.org/wiki/Protein_P53
• Ontology
– Specification of a conceptualization of a knowledge
domain
• Ontological Language
– Resource Description Framework (RDF)
– Web Ontology Language (OWL)
– Both RDF and OWL have XML serialization
• Database and Tool
– RDF databases: Sesame, Kowari, Oracle
– OWL Reasoners: Racer, Pellet, FaCT
RDF Statement
A RDF statement consists of:
• Subject: resource identified by a URI
• Predicate: property (as defined in a name space identified by a
URI)
• Object: property value (literal) or a resource
For example, the dbSNP Website is a subject, creator is a
predicate, NCBI is an object.
A resource can be described by multiple statements.
Graphical & XML Representation
http://www.ncbi.nlm.nih.gov/SNP
http://purl.org/dc/elements/1.1/creator
http://www.ncbi.nlm.nih.gov
http://purl.org/dc/elements/1.1/language
en
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf=“http://www.w3.org/1999/02/22-rdf-syntax-ns#”
xmlns:dc=“http://purl.org/dc/elements/1.1”
xmlns:ex=“http://www.example.org/terms”>
<rdf:Description about=“http://www.ncbi.nlm.nih.gov/SNP”>
<dc:creator rdf:resource=“http://www.ncbi.nlm.nih.gov”></dc:creator>
<dc:language>en</dc:language>
date>
</rdf:Description>
</rdf:RDF>
RDF Schema (RDFS)
•
•
RDF Schema terms:
– Class
– Property
– type
– subClassOf
– range
– domain
Example:
<Person,type,Class>
<has_parent,type,Property>
<Family_member,subClassOf,Person>
<“Joe Smith”, type, Family_member>
<has_parent, range, Family_member>
<has_parent, domain, Family_member>
Web Ontology Language (OWL)
•
•
•
•
OWL builds on top of of RDF
It semantically extends RDFS
It is based on description logics
Three species of OWL
– OWL-Lite
– OWL-DL
– OWL-Full
Current Syntactic Web vs.
Future Semantic Web
Form Vision to Implementation:
Data Mashup
Data Mashup
• It refers to web resources that weave data
from different web resources into a new
service
• Disciplines like biosciences would benefit
greatly from data mashups
• Data mashup is difficult to create without
sharing data in a standard machinereadable format
Nature’s Data Mashup: A Google
Earth Application
Tracking of Avian Flu
The End