Chapter 1
Introduction
What is bioinformatics
Quantitation is essential in biology
Counting bacterial colonies
Counting animals in a natural environment
Counting genetic variability among plants and fruit flies led to the laws of Mendelian
inheritance
More complex quantitative tools involve predictions of human population growth or
enzyme kinetics
Very sophisticated tools may involve application of “game theory” to model behavior
and evolution
Non-linear partial differential equations to model cardiac blood flow or in situ
cytoplasm flow
None of these examples are bioinformatics
Bioinformatics relates to macromolecules
Earliest bioinformatics exercise: Margaret Dayhoff (1965) first protein sequence
database Atlas of Protein Sequence and Structure (now PIR)
Early 1970s: Brookhaven National Laboratory compiled the Protein Data Bank (PDB) of X-ray and NMR structures
First sequence alignment algorithm: Needleman and Wunsch, 1970
Routine sequence comparisons and database searching
First protein structure prediction algorithm Chou and Fasman 1974
1980s saw establishment of GenBank and FASTA; BLAST followed in 1990
Human Genome Project started late 1980s
Bioinformatics flourished mainly because of the enormous volumes of sequence data
being generated
Definition
Bioinformatics is the discipline that uses computers to store,
retrieve, manipulate and distribute information related to
biological macromolecules such as RNA, DNA and proteins
Computational biology encompasses all areas of biology that
involve computation
Goal
Better understand a living cell and how it functions at a
molecular level
Two major fields
1. Development of computational tools and databases
•Software for sequence analysis
•Sequence alignment, sequence database searching,
motif and pattern discovery, gene and promoter finding,
reconstruction of evolutionary relationships, genome
assembly and comparison
•Software for structural analysis
•Protein and nucleic acid structural analysis, comparison,
classification and prediction
•Software for functional analysis
•Gene expression profiling, protein-protein interaction
prediction, protein sub-cellular location prediction,
metabolic pathway reconstruction
•Construction and curation of biological databases
2. Generate biological knowledge to better understand living
systems
•Often identify new problems that require new software to
analyze
•Bioinformatics is essential for basic genomic and molecular biology research
•Major impact in biotechnology and biomedical sciences
•Knowledge-based drug design
•3D structure allows design of ligands that fit
•Reduces time and cost to develop drugs
•Forensic DNA analysis
•Bayesian statistics and likelihood-based methods
•Personalised healthcare
•Agricultural biotechnology
•Plant genome databases
•Gene expression profiles
•New crop varieties
Limitations of bioinformatics
•The results are as good as the data
•Errors in sequences
•Hypothesis independent
•Bioinformatics does not replace traditional hypothesis-driven approaches
•It complements them and identifies new questions
•Integrate gene expression and protein functions in the
cell
•Analysis at the level of systems: systems biology
•Description of a cell as a mathematical model
•Predictive value
Chapter 2
Biological Databases
What is a database?
•A database is a computerized archive used to store and
organize data so that information can be retrieved by a variety
of search criteria
•A database can be thought of as a stack of record cards, where
each record card contains defined items of information, say
Name, Address, Phone Number, Birth Date, etc.
•In a database, each such card is an entry, and each defined
item of information is a field
•Each field of each entry contains a value (can be NULL)
•Searching all entries retrieves those that contain a specific value
in a field
•This process is called making a query
•Biological databases often have higher level requirements
such as knowledge discovery, where previously unknown
relations between values are found
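The record-card picture above can be sketched in a few lines of Python. This is only an illustration: the entries, the field names and the phone number are made up for the example, and None stands in for a NULL value.

```python
# Each dictionary is one entry (record card); each key is a field.
# None plays the role of a NULL value. All values here are illustrative.
entries = [
    {"name": "Jack", "state": "Kansas", "phone": None},
    {"name": "John", "state": "Maryland", "phone": "555-0101"},
    {"name": "Jill", "state": "Maryland", "phone": None},
]


def query(entries, field, value):
    """Return every entry whose given field holds the given value."""
    return [entry for entry in entries if entry.get(field) == value]


# Making a query: which entries have "Maryland" in the "state" field?
maryland = query(entries, "state", "Maryland")
print([entry["name"] for entry in maryland])
```

The same function can search on any field, which is what "retrieved by a variety of search criteria" amounts to in this toy setting.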
Different database formats
•Flat file
•ASCII file
•Rows of comma delimited entries
•The computer has to read the entire file to find all entries or
relationships
•Many databases are distributed as flat files
•Below is a simple ASCII data file from REBASE, a database
of restriction enzyme cleavage sites
(http://rebase.neb.com/rebase/rebase.html)
REBASE version 807
strider.807
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
REBASE, The Restriction Enzyme Database
http://rebase.neb.com
Copyright (c) Dr. Richard J. Roberts, 2008.
All rights reserved.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rich Roberts
#AarI,cacctgc,4,8,
AatII,gacgt/c,
AbsI,cc/tcgagg,
AccI,gt/mkac,
Acc65I,g/gtacc,
#AceIII,cagctc,7,11,
#AciI,ccgc,-3,-1,
AclI,aa/cgtt,
#AcuI,ctgaag,16,14,
AfeI,agc/gct,
AflII,c/ttaag,
Jun 30 2008
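A minimal sketch of reading such comma-delimited records in Python is given here. It assumes only what the listing shows: a name, a recognition site that may contain a "/", and optional trailing numbers on lines that start with "#". The meaning of the "#" and of the numbers is not interpreted; the field names in the dictionary are chosen for the example.

```python
# A few records copied from the listing above.
SAMPLE = """\
#AarI,cacctgc,4,8,
AatII,gacgt/c,
AbsI,cc/tcgagg,
#AciI,ccgc,-3,-1,
"""


def parse_record(line):
    """Split one comma-delimited record into a small dictionary."""
    flagged = line.startswith("#")
    # Drop the leading '#', then discard the empty field after the last comma.
    parts = [p for p in line.lstrip("#").strip().split(",") if p]
    name, site = parts[0], parts[1]
    return {
        "name": name,
        "site": site.replace("/", ""),
        # Index of '/' within the site as written, or None if absent.
        "slash_index": site.index("/") if "/" in site else None,
        # Trailing numeric fields, kept as integers without interpretation.
        "numbers": [int(p) for p in parts[2:]],
        "flagged": flagged,
    }


records = [parse_record(line) for line in SAMPLE.splitlines() if line.strip()]


def find_by_site(records, site):
    """Linear scan: every record is inspected, as with any flat file."""
    return [r["name"] for r in records if r["site"] == site]


print(find_by_site(records, "gacgtc"))
```

The linear scan in find_by_site is the point of the example: with a flat file there is no index, so the whole file is read for every search.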
Relational database
•The name "relational" does not refer to relations
between entries
•Relation is the mathematical term for “table”
•Thus a relational database is composed of tables
•Each table is composed of rows (entries = tuple) and each
row has columns (attributes) with a value in each cell
•Where multiple tables share a common column, it is
possible to derive relationships between the columns in different
tables by combining rows with identical values in that column
Rows are entries (tuples); columns are attributes

Student Number   Name   State
1                Jack   Kansas
2                John   Maryland
3                Jill   Washington
A simple three-table relational database

Student Number   Name   Gender   State
1                Jack   M        Kansas
2                John   M        Maryland
3                Jill   F        Maryland

Student Number   Course
1                BOC314
2                BOC334
3                BOC364

Course   Description
BOC314   Biochemistry
BOC334   Proteomics
BOC364   Bioinformatics
Query: What courses do students from Maryland take?
Query: Do females take more courses in the first or second semester?
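The first query can be sketched with Python's built-in sqlite3 module. The data are the three tables from the slide; the table and column names (student, enrolment, course, student_number) are assumptions made for this example.

```python
import sqlite3

# In-memory database holding the three tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (
    student_number INTEGER PRIMARY KEY, name TEXT, gender TEXT, state TEXT);
CREATE TABLE enrolment (student_number INTEGER, course TEXT);
CREATE TABLE course (course TEXT PRIMARY KEY, description TEXT);

INSERT INTO student VALUES
    (1, 'Jack', 'M', 'Kansas'),
    (2, 'John', 'M', 'Maryland'),
    (3, 'Jill', 'F', 'Maryland');
INSERT INTO enrolment VALUES (1, 'BOC314'), (2, 'BOC334'), (3, 'BOC364');
INSERT INTO course VALUES
    ('BOC314', 'Biochemistry'),
    ('BOC334', 'Proteomics'),
    ('BOC364', 'Bioinformatics');
""")

# "What courses do students from Maryland take?"
# Rows are combined where the shared columns hold identical values.
rows = conn.execute("""
    SELECT s.name, c.course, c.description
    FROM student AS s
    JOIN enrolment AS e ON e.student_number = s.student_number
    JOIN course AS c ON c.course = e.course
    WHERE s.state = 'Maryland'
    ORDER BY s.student_number
""").fetchall()

print(rows)
```

The second query cannot be answered from these tables as they stand, because none of them records a semester; an extra column would be needed.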
Object oriented databases
•Attributes of entries are represented as members of classes
•Each member can be a member of more than one class
•This gives rise to a hierarchical relationship, very much like a tree
•Parent objects point to child objects, which, in turn, point to their
child objects
•Thus, all students from Maryland will be pointed to by the Maryland
object
•All students who do BOC364 will be pointed to by the BOC364
object
•Great care must be taken when designing an object-oriented
database to ensure efficient querying
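The pointer idea can be illustrated with plain Python objects. This is a toy sketch, not a real object-oriented database; the class names Node and Student are invented for the example.

```python
class Student:
    """A child object holding one student's data."""

    def __init__(self, name):
        self.name = name


class Node:
    """A parent object that points to its child objects."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child


jack, john, jill = Student("Jack"), Student("John"), Student("Jill")

# The Maryland object points to every student from Maryland.
maryland = Node("Maryland")
maryland.add(john)
maryland.add(jill)

# The BOC364 object points to every student taking BOC364.
boc364 = Node("BOC364")
boc364.add(jill)

# A query is answered by following pointers, not by scanning a table.
print([student.name for student in maryland.children])
print([student.name for student in boc364.children])
```

Note that the same Jill object is reached from both parents, which is how one entry can belong to more than one class.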
Biological databases
•Primary databases
•Raw sequence or structure data
•GenBank
•PDB
•Secondary databases
•Computationally processed or curated database
•SWISS-PROT
•PIR
•Specialized databases
•For specific interest groups
•FlyBase
•SGD
Primary Databases
Three major databases
GenBank (http://www.ncbi.nlm.nih.gov/Genbank/)
EMBL
DDBJ
Sequences are exchanged on a daily basis
Each database is up to date (use any one)
Deposition of data is a prerequisite to publication
Secondary databases
•Significant processing of original raw data
•Annotation
•ORFs
•Functional links
•SWISS-PROT
•Carefully curated database
•High quality
•SWISS-PROT, TrEMBL and PIR combined in UniProt
•Pfam – aligned protein sequences that define families
•BLOCKS – motifs and patterns
•DALI – protein structure comparisons to find distant evolutionary relationships
Specialized Databases
•Often focused on a specific aspect of an organism
•Curated by experts
•Highly annotated and processed data
•SGD
•FlyBase
•WormBase
Interconnection between biological databases
•Need to access both primary and secondary databases
•Provide links between databases
•Difficult to connect databases with different structures:
ASCII, Relational and Object-oriented
•Common Object Request Broker Architecture (CORBA)
•eXtensible Markup Language (XML)
Information retrieval
Entrez (Aahn-tray)
Gateway that allows text-based searches of a wide variety of data
Using “Limits” in Entrez
Preview/Index
History
Clipboard
Online Mendelian Inheritance in Man
PubMed
GenBank file format
GenBank file format continued
FASTA format
•First line starts with a “>” sign followed by any information
•Sequence continues with 60 or 80 characters per line
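The two rules above are enough for a small reader and writer, sketched here in Python. The function names and the example record are hypothetical, and real files may need more care (for instance, comment lines or unusual characters).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with a '>' header line and fixed-width sequence lines."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# Hypothetical record: the header text and sequence are made up.
example = format_fasta("example_1 hypothetical sequence", "ACGT" * 40, width=60)
print(example)
print(parse_fasta(example))
```

Passing width=80 to format_fasta gives the other common line length mentioned above.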
Abstract Syntax Notation One (ASN.1)
Sequence retrieval system (SRS)
(http://srs6.ebi.ac.uk/)
Result of SRS search