Transcript slides

Data Representation and the
Role of Ontologies
PHAR 201/Bioinformatics I
Philip E. Bourne
Department of Pharmacology, UCSD
• Prerequisite reading: Genome Research (2001) 11:1425-1433
PHAR 201 Lecture 4, 2012
Consider this Course a Workflow in How You Will Handle Data
(Regardless of Type) For the Rest of Your Lives
We Use Macromolecular Structure Data to Illustrate the Process
And Hence Learn Structural Bioinformatics in the Process
• Data In
• Recognize redundancy in the data
• Classify the data
• Understand the scope and complexity of the data
• Understand the methods to physically instantiate the model
• Analyze the data
• Understand the experiment to understand the errors
• Understand how to best represent (model) the data
• Discover new science from the data
Agenda
• Before there were ontologies there was mmCIF
• Briefly review the history of ontology development
• Review the Gene Ontology (GO)
– Motivation
– Features
– Related research activities around GO
The PDB Format
• A full description is here
• It was designed around an 80 column punched
card!
• It was designed to be human readable
• It is used by almost every piece of software that
deals with structural data
The PDB Format - Records
• Every PDB file may be broken into a number of lines
terminated by an end-of-line indicator. Each line in the
PDB entry file consists of 80 columns. The last character
in each PDB entry should be an end-of-line indicator.
• Each line in the PDB file is self-identifying. The first six
columns of every line contain a record name, left-justified
and blank-filled. This must be an exact match to one of the
stated record names.
• The PDB file may also be viewed as a collection of record
types. Each record type consists of one or more lines.
• Each record type is further divided into fields.
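The fixed-column layout described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a production parser: the function names are hypothetical, the column ranges for coordinate records come from the published PDB format description rather than from these slides, and the sample line is generated with a format string instead of being copied from a real entry.

```python
def record_name(line: str) -> str:
    """Return the record name held in columns 1-6 (left-justified, blank-filled)."""
    return line[:6].strip()


def parse_atom(line: str) -> dict:
    """Slice one ATOM/HETATM line by column position (comments use 1-based columns)."""
    if record_name(line) not in ("ATOM", "HETATM"):
        raise ValueError("not a coordinate record")
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }


# Build a sample line with a format string so every field lands in its column.
sample = "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   {:>8.3f}{:>8.3f}{:>8.3f}".format(
    "ATOM", 1, " N", "", "VAL", "A", 11, "", 25.360, 30.691, 11.795
)
atom = parse_atom(sample)
```

Note that the meaning of each slice lives only in the code comments; nothing in the file itself says that columns 31-38 hold an x coordinate.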
The PDB Format – An Example –
The Header
The PDB Format – An Example – The
Atomic Coordinates
The Description – Atom Records
What is Wrong with this Approach?
• The description and the data are separate
• Parsing is a nightmare – the most complex piece
of code we have in our research laboratory
probably remains the PDB parser
• There are no relationships between items of data
• Some data just cannot be parsed
• The fixed column format cannot represent some of
today’s structures …
Structures are Spread Over Multiple Files –
Most Users are Not Aware of this
PDB Format: Important Components of the Data are Lost to
All But Humans
REMARK   3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF
REMARK   3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R
REMARK   3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN
REMARK   3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL
REMARK   3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0
REMARK   3 ANGSTROMS.
REMARK   4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH
REMARK   4 THE KNOWN STRUCTURE OF ERABUTOXIN B (PROTEIN DATA BANK
REMARK   4 ENTRIES *2EBX*, *3EBX*). EA DIFFERS FROM EB BY A SINGLE
REMARK   4 SUBSTITUTION - EA ASN 26 FOR EB HIS 26. THE EA STARTING
REMARK   4 MODEL WAS OBTAINED FROM A MOLECULAR REPLACEMENT STUDY IN
REMARK   4 WHICH COORDINATES FOR 309 OF THE 475 ATOMS IN THE EB
REMARK   4 STRUCTURE (*2EBX*) WERE USED.
mmCIF Was Developed to
Address these Problems
Methods in Enzymology. 1997 277, 571-590
mmCIF – Scope of the Initial
Effort
• All PDB data should be captured
• Describe a paper’s material and methods
section
• Describe biologically active molecule
• Fully describe secondary structure but not
tertiary or quaternary
• Describe details of chemistry (inc. 2D)
• Meaningful 3D views
mmCIF - Extract from a Data File
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
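Because the item names travel with the data, a loop_ block like the one above can be read without knowing any column positions. The sketch below is a simplified illustration, assuming whitespace-separated values with no quoted strings or multi-line text fields (real mmCIF files need a full tokenizer); `parse_loop` is a hypothetical name.

```python
EXTRACT = """
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
"""


def parse_loop(text: str) -> list:
    """Return one dict per data row, keyed by the item names declared in the loop."""
    names, rows = [], []
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line or line == "loop_":
            continue
        if line.startswith("_"):
            names.append(line)  # an item name such as _atom_site.Cartn_x
        else:
            values = line.split()
            if len(values) != len(names):
                raise ValueError("row length does not match the declared item names")
            rows.append(dict(zip(names, values)))
    return rows


rows = parse_loop(EXTRACT)
x_of_first_atom = float(rows[0]["_atom_site.Cartn_x"])
```

Adding or reordering items changes the header, not the code that reads it.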
mmCIF - Extract from the Dictionary
save__atom_site.Cartn_x
_item_description.description
;
The x atom site coordinate in angstroms specified according to
a set of orthogonal Cartesian axes related to the cell axes as
specified by the description given in
_atom_sites.Cartn_transform_axes.
;
_item.name                    '_atom_site.Cartn_x'
_item.category_id             atom_site
_item.mandatory_code          no
_item_aliases.alias_name      '_atom_site_Cartn_x'
_item_aliases.dictionary      cifdic.c94
_item_aliases.version         2.0
loop_
_item_dependent.dependent_name
'_atom_site.Cartn_y'
'_atom_site.Cartn_z'
_item_related.related_name    '_atom_site.Cartn_x_esd'
_item_related.function_code   associated_esd
_item_sub_category.id         cartesian_coordinate
_item_type.code               float
_item_type_conditions.code    esd
_item_units.code              angstroms
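One way to see what the dictionary buys you: software can look up an item's type and units instead of hard-coding them. The following is a rough sketch, assuming a hand-copied subset of the entry above held in a Python dict; the `CASTERS` mapping and the function name `typed_value` are hypothetical, and a real tool would parse the dictionary file itself.

```python
# Hand-copied subset of the dictionary entry for _atom_site.Cartn_x.
DICTIONARY = {
    "_atom_site.Cartn_x": {
        "category": "atom_site",
        "mandatory": False,
        "type": "float",
        "units": "angstroms",
    },
}

# Hypothetical mapping from dictionary type codes to Python converters.
CASTERS = {"float": float, "int": int, "char": str}


def typed_value(item_name: str, raw: str):
    """Convert a raw string using the dictionary entry; return (value, units)."""
    entry = DICTIONARY.get(item_name)
    if entry is None:
        raise KeyError(f"{item_name} is not defined in the dictionary")
    if raw in (".", "?"):  # CIF placeholders for inapplicable / unknown values
        if entry["mandatory"]:
            raise ValueError(f"{item_name} is mandatory but has no value")
        return None, entry["units"]
    return CASTERS[entry["type"]](raw), entry["units"]


value, units = typed_value("_atom_site.Cartn_x", "25.360")
```

Here the description and the data are no longer separate: the same lookup that validates a value also says what it means.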
Summary
• mmCIF has provided the PDB with a robust data
representation which serves as the conceptual and
physical schema upon which the current RCSB,
PDBe and PDBj are built
• This work predated XML and XML Schema but
embodies the important concepts inherent in these
descriptions
• mmCIF was later converted exactly into XML (PDBML),
which is now used more than mmCIF itself but much less than
the old PDB format
• PDB format will be phased out over a period of
years
Agenda
• Before there were ontologies there was mmCIF
• Briefly review the history of ontology development
• Review the Gene Ontology (GO)
– Motivation
– Features
– Related research activities around GO
Formal Definitions Taken from Knowledge
Engineering ….
1. A systematic account of existence.
2. (From philosophy) An explicit formal specification of how to
represent the objects, concepts and other entities that are
assumed to exist in some area of interest and the relationships
that hold among them.
3. For AI systems, what "exists" is that which can be represented.
When the knowledge about a domain is represented in a
declarative language, the set of objects that can be represented
is called the universe of discourse. We can describe the ontology
of a program by defining a set of representational terms.
Definitions associate the names of entities in the universe of
discourse (e.g. classes, relations, functions or other objects) with
human-readable text describing what the names mean, and
formal axioms that constrain the interpretation and well-formed
use of these terms. Formally, an ontology is the statement of a
logical theory.
Formal Definitions Taken from Knowledge
Engineering Continued
4. A set of agents that share the same ontology will be able to
communicate about a domain of discourse without necessarily
operating on a globally shared theory. We say that an agent
commits to an ontology if its observable actions are consistent
with the definitions in the ontology. The idea of ontological
commitment is based on the Knowledge-Level perspective.
5. The hierarchical structuring of knowledge about things by
subcategorizing them according to their essential (or at least
relevant and/or cognitive) qualities. See subject index. This is an
extension of the previous senses of "ontology" (above) which has
become common in discussions about the difficulty of
maintaining subject indices.
We will not focus too much on
the formal definitions,
but more on how these formal
concepts have been applied to biology
The History of Ontologies from a Biological
Perspective …
• Early biological database efforts (1990s)
adopted knowledge bases as a model, e.g.
RiboWeb
• They used the products from the AI
community e.g. Ontolingua
• Some of the concepts of knowledge bases
remain – notably ontologies, but they are
now mostly cast in more familiar
commercial frameworks e.g. relational
databases
The History of Ontologies from a Biological
Perspective Continued
• The biological community in general was slow to see the
value
• The medical informatics community adopted ontologies
early
• In the late 1990s, database providers in particular began to
work together; the Gene Ontology (GO) was a major
product of this effort
• 1998-2004: ontologies were the rage, warranted
their own sessions at bioinformatics meetings and came to be
taken seriously by the biological community
• 2004 onward: accepted as part of biological data
representation and use
The History of Ontologies from a Biological
Perspective Continued
• Centers established to support the
maintenance of ontologies:
– The Open Biomedical Ontologies (OBO)
Foundry
– National Center for Biomedical Ontology
(BioPortal 2.0)
What Isn’t An Ontology?
• A database or program
– Because its formats are shared only internally; they are
not global
• A table of contents
– Because it is not a formal representation of the
concepts
• A terminology (aka controlled vocabulary)
– Because it is a set of terms without a formal
structure of how they relate
Examples of Valuable Terminologies
(Controlled Vocabularies) That Are Not
Ontologies
• ICD-9 for diseases
• SNOMED/RCD codes for symptoms
• EC Numbers (?)
• Taxonomy
• SMILES strings
Ontology As Language
• The ontology becomes the language of the
domain it describes
• The language = syntax + semantics
• While that language must be understood
by computers, human readability counts
Ontology as Contract
Purposes of Ontologies         Parties to the contract
• data exchange                programmers
• unification/translation      data admins
• calling knowledge services   programmers, netbots
• representing theories        scientists
• human communication          collaborators
Ontology Specifications
• XML – provides a syntax for structured documents
• XML Schema - a language for structuring XML
documents and adding data types
• RDF - a data model for objects and the relations
between them, commonly serialized in XML
• RDF Schema – describes properties and classes
of RDF resources with semantics to generalize
• OWL 2 – Web Ontology Language – adds more
vocabulary particularly of relationships between
classes (e.g. disjointness, cardinality)
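The RDF data model listed above boils down to (subject, predicate, object) statements. As a rough illustration only, the sketch below stores such triples as plain Python tuples and matches them against a pattern. It uses no RDF library; the `ex:` names are hypothetical identifiers, and only `rdf:type`, `rdfs:subClassOf` and `owl:Class` are terms from the actual specifications.

```python
# Each statement is a (subject, predicate, object) triple.
TRIPLES = {
    ("ex:chromosome", "rdf:type", "owl:Class"),
    ("ex:mitotic_chromosome", "rdfs:subClassOf", "ex:chromosome"),
    ("ex:telomere", "ex:part_of", "ex:chromosome"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything that is stated to relate to ex:chromosome, whatever the predicate.
related_to_chromosome = match(TRIPLES, o="ex:chromosome")
```

RDF Schema and OWL add vocabulary on top of this model so that software can reason about the predicates rather than merely store them.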
Here is Another One..
http://richard.cyganiak.de/2007/10/lod/lod-datasets_2010-09-22_colored.html
References: Ontologies in
Bioinformatics
• Bio-ontologies workshops since 1997
• Historical papers on knowledge
sharing
• mmCIF as an ontology - Westbrook
and Bourne (2000) Bioinformatics
16(2) 159-168
• Review 2006 – Bodenreider and
Stevens Briefings in Bioinformatics
Agenda
• Before there were ontologies there was mmCIF
• Briefly review the history of ontology development
• Review the Gene Ontology (GO)
– Motivation
– Features
– Related research activities around GO
References
• GO Itself - Creating the Gene Ontology Resource:
Design and Implementation Genome Research
(2001) 11:1425-1433
• Nucleic Acids Res. 2010 Jan;38(Database issue):D331-5. Epub 2009 Nov 17.
• The GO Website - http://www.geneontology.org
• Application of GO –
The Gene Ontology Annotation (GOA) project:
implementation of GO in SWISS-PROT, TrEMBL, and
InterPro Genome Res. 2003 Apr;13(4):662-72. Epub
2003 Mar 12
Brief History
• Started by Saccharomyces Genome
Database, FlyBase and the Mouse
Genome Database
• Grown to a consortium of members (see
here)
Roles of the GO Consortium
• Write and maintain the ontologies
themselves
• Associate the ontologies to genes in the
respective databases of members
• Provide tools to facilitate the development
and maintenance of ontologies
Gene Ontology (GO)
http://www.geneontology.org/
• Three levels of annotation:
– Molecular function - what a gene product does at the
biochemical level
– Biological process - a broad biological perspective; not
currently a pathway (no dynamics or dependencies)
– Cellular component - location within cellular structures (e.g. Golgi
apparatus) and macromolecular complexes (e.g. ribosome)
GO Goals
From Genome Res 2001
Aug;11(8):1425-33
Structure of GO: Directed Acyclic Graph (DAG)
Example from molecular function:
Parents: Transmembrane receptor; Protein tyrosine kinase
Child: Transmembrane receptor tyrosine protein kinase
The child is_a transmembrane receptor and is_a protein tyrosine kinase
Structure of GO: Directed Acyclic Graph (DAG)
Relationship of Child to Parent
• is_a: the child is an instance (a more specific kind) of the parent
– A mitotic chromosome is_a chromosome
• part_of: the child is a component of the parent
– A telomere is part_of a chromosome
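The DAG structure and its two relationship types can be sketched as a small graph in Python. This is an illustrative toy, assuming only the terms named on these slides; the `EDGES` layout and the function name `ancestors` are hypothetical and do not reflect how GO is actually distributed or loaded.

```python
from collections import deque

# child term -> list of (relationship, parent term); a child may have several parents,
# which is what makes this a DAG rather than a tree.
EDGES = {
    "transmembrane receptor tyrosine protein kinase": [
        ("is_a", "transmembrane receptor"),
        ("is_a", "protein tyrosine kinase"),
    ],
    "mitotic chromosome": [("is_a", "chromosome")],
    "telomere": [("part_of", "chromosome")],
}


def ancestors(term, edges=EDGES, relations=("is_a", "part_of")):
    """Collect every term reachable by following the chosen relationships upward."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in edges.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


kinase_parents = ancestors("transmembrane receptor tyrosine protein kinase")
telomere_is_a = ancestors("telomere", relations=("is_a",))
```

Restricting the traversal to is_a shows why the distinction matters: a telomere is part of a chromosome, but it is not a kind of chromosome, so `telomere_is_a` comes back empty.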
Example - Molecular Function
Example - Biological Process
Example - Cellular Location
Use of GO within the PDB
http://pdb.rcsb.org
Use of GO Within the Open Literature
Some Issues –
Levels of Granularity – Species Specificity
• Chitin metabolism is part of cuticle
synthesis in fly
• Chitin metabolism is part of cell wall
organization in yeast
Some Issues
• GO is dynamic – parent-child relationships
can change
• When does a process begin and end?
• is_a and part_of are not always clearly distinguished: is the actin
cytoskeleton is_a cytoskeleton or part_of
cytoskeleton?
• A community effort
Relationship to Gene Products
• A gene product is a protein or functional
RNA
• A gene product may have more than one
function and therefore be related to
multiple GO terms
• The name of a gene product may only
reflect one of its functions
GO is Really 3 Independent
Ontologies
• Annotation of a gene product by one ontology is
independent of its annotation by another
ontology
• Example: products of the MDH1, MDH2 and
MDH3 genes are all isoforms of malate
dehydrogenase in yeast with the same
function, but they localize to different cellular
locations and are involved in different
biochemical processes
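The independence of the three ontologies can be illustrated with a small annotation table. Treat this as a sketch under assumptions: the slide names the genes and the shared function but not the specific locations or processes, so the component and process terms below are filled in from general yeast biochemistry for illustration and should be checked against the Saccharomyces Genome Database. The function names are hypothetical.

```python
from collections import defaultdict

# (gene product, ontology, term). Component and process terms are illustrative
# assumptions, not taken from the slide.
ANNOTATIONS = [
    ("MDH1", "molecular_function", "malate dehydrogenase activity"),
    ("MDH2", "molecular_function", "malate dehydrogenase activity"),
    ("MDH3", "molecular_function", "malate dehydrogenase activity"),
    ("MDH1", "cellular_component", "mitochondrion"),
    ("MDH2", "cellular_component", "cytosol"),
    ("MDH3", "cellular_component", "peroxisome"),
    ("MDH1", "biological_process", "tricarboxylic acid cycle"),
    ("MDH2", "biological_process", "gluconeogenesis"),
    ("MDH3", "biological_process", "glyoxylate cycle"),
]


def terms_by_product(annotations, ontology):
    """Group the terms from one ontology by gene product."""
    grouped = defaultdict(set)
    for product, onto, term in annotations:
        if onto == ontology:
            grouped[product].add(term)
    return dict(grouped)


def share_all_terms(annotations, ontology, products):
    """True if every listed product has exactly the same terms in this ontology."""
    grouped = terms_by_product(annotations, ontology)
    term_sets = [grouped.get(product, set()) for product in products]
    return all(terms == term_sets[0] for terms in term_sets)


isoforms = ["MDH1", "MDH2", "MDH3"]
same_function = share_all_terms(ANNOTATIONS, "molecular_function", isoforms)
same_location = share_all_terms(ANNOTATIONS, "cellular_component", isoforms)
```

Agreement in one ontology says nothing about agreement in another, which is the point of keeping them separate.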
Evidence Codes
• The evidence for assigning a gene product
to a GO term itself has a controlled
vocabulary
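A controlled vocabulary for evidence maps naturally onto an enumeration. The sketch below lists only a handful of GO evidence codes, not the full set; the `EXPERIMENTAL` grouping is a simplification made for this example, and the function names are hypothetical.

```python
from enum import Enum


class Evidence(Enum):
    """A small subset of GO evidence codes."""
    IDA = "Inferred from Direct Assay"
    IMP = "Inferred from Mutant Phenotype"
    ISS = "Inferred from Sequence or structural Similarity"
    TAS = "Traceable Author Statement"
    IEA = "Inferred from Electronic Annotation"


# Codes in this subset that rest on an experiment (simplified for the example).
EXPERIMENTAL = {Evidence.IDA, Evidence.IMP}


def parse_evidence(code: str) -> Evidence:
    """Reject anything outside the controlled vocabulary."""
    try:
        return Evidence[code.upper()]
    except KeyError:
        raise ValueError(f"{code!r} is not in this sketch's evidence vocabulary") from None


def is_experimental(code: str) -> bool:
    return parse_evidence(code) in EXPERIMENTAL
```

Filtering annotations by evidence code is a common first step before analysis, for example keeping experimentally supported annotations and setting aside purely electronic ones.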
Research Applications of GO
PHAR 201 Lecture 4, 2012
49
PHAR 201 Lecture 4, 2012
50
Research Applications of GO
PHAR 201 Lecture 4, 2012
51
PHAR 201 Lecture 4, 2012
52
PHAR 201 Lecture 4, 2012
53