Introduction to bioinformatics

Sylvia B. Nagl
What is bioinformatics?
• an emerging interdisciplinary research area
• deals with the computational management
and analysis of biological information: genes,
genomes, proteins, cells, ecological systems,
medical information, robots, artificial
intelligence...
The Core of Bioinformatics to date
•Relationships between sequence, 3D structure, and protein functions
Example protein sequence:
TDQAAFDTNIVTLTRFVM
EQGRKARGTGEMTQLLNS
LCTAVKAISTAVRKAGIA
HLYGIAGSTNVTGDQVKK
LDVLSNDLVINVLKSSFA
TCVLVTEEDKNAIIVEPE
KRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAA
GYALYGSATMLV
•Properties and evolution of genes, genomes,
proteins, metabolic pathways in cells
•Use of this knowledge for prediction, modelling, and
design
“The holy grail of bioinformatics”
GCTCCTCACTGTCTGTGTTTATTC
TTTTAGCTTCTTCAGATCTTTTAG
TCTGAGGAAGCCTGGCATGTGCA
AATGAAGTTAACCTAA...
> 500,000 genes sequenced to date
Expected number of unique protein structures: ~700-1,000
Basic concepts
• conceptual foundations of bioinformatics:
evolution
protein folding
protein function
• bioinformatics builds mathematical models
of these processes to infer relationships between components
of complex biological systems
Information processing in cells
nucleic acids
proteins
coding regions
regulatory
sites
transcripts
One-to-many mappings!
Context-dependence!
Global approaches: Toward a new Systems Biology
Global cell state
Genome
Protein population:
proteomics
Genome activation
patterns: transcriptomics
•How does the spatial and
temporal organisation of
living matter give rise to
biological processes?
Organisation:
tissue imaging
EM
X-ray, NMR
cells
molecular complexes
Global approaches: Toward a new Systems Biology
Perturbation
Living cell
Biological knowledge
(computerised)
Sequence information
Dynamic response
•Basic principles
“Virtual cell”
Structural information
Bioinformatics
Mathematical
modelling
Simulation
•Practical
applications
We do not yet know whether the information in the genome is sufficient
to reconstruct an entire biological system. Information on the building blocks
is not enough; information on their interactions is essential.
External environment
Internal environment
Metabolic net
Genetic networks
DNA hRNA
mRNAs
proteins
Bioinformatics in context
Mathematics/
computer
science
Genomics
Molecular
biology
Ethical, legal,
and social
implications
Bioinformatics
Biophysics
Molecular
evolution
Current challenges to users
• Potential hurdles:
Methods are in flux and not fully developed
Scattered and heterogeneous resources
• Remedies: Web resources
navigation guides
integration of tools and databanks
http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html
Sequence homology search of the
genome of Plasmodium
falciparum
Target identification for antimalarial
drugs
The search for new antimalarial
drugs
• Malaria is one of the leading causes of morbidity
and mortality in the tropics.
• 300 to 500 million estimated clinical cases and 1.5
million to 2.7 million deaths per year.
• Nearly all fatal cases are caused by Plasmodium
falciparum.
• The parasite's resistance to conventional
antimalarial drugs such as chloroquine is growing
at an alarming rate.
•P. falciparum has a plastid-like organelle, called the
apicoplast, acquired by endosymbiosis of an alga.
Jomaa et al. (1999)
•Self-replicating, maternally inherited (35 kb, circular DNA).
•Comparative genome analysis: Search for orthologs.
The apicoplast contains enzymes found in plant and bacterial,
but not animal, metabolic pathways.
•Potential target for antimalarial drugs:
DOXP reductoisomerase
Jomaa et al. (1999) Science 285: 1573-1576
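The selection logic on this slide (keep parasite enzymes that have plant or bacterial orthologs but no animal counterpart) can be sketched with plain set operations. This is only an illustration: the function name and the "hypothetical enzyme" entries are made up, and a real analysis would derive the three sets from sequence homology searches rather than from hand-written names.

```python
def candidate_targets(parasite, plant_or_bacterial, animal):
    """Return parasite enzymes shared with plants/bacteria but absent from animals."""
    return sorted((parasite & plant_or_bacterial) - animal)


# Placeholder data: only DOXP reductoisomerase comes from the slides;
# the "hypothetical enzyme" names are invented for illustration.
parasite_enzymes = {
    "DOXP reductoisomerase",
    "hypothetical enzyme A",
    "hypothetical enzyme B",
}
plant_bacterial_enzymes = {"DOXP reductoisomerase", "hypothetical enzyme A"}
animal_enzymes = {"hypothetical enzyme A"}

targets = candidate_targets(parasite_enzymes, plant_bacterial_enzymes, animal_enzymes)
print(targets)
```

An enzyme that also occurs in animals is dropped because a drug against it would be more likely to affect the host as well.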
Biological databases
The challenge
(Boguski, 1999)
In 1995, the number of genes in the database started to exceed
the number of papers on molecular biology and genetics in the
literature!
Data types
primary data (primary database)
sequence: DNA (AATGCGTATAGGC), amino acid (DMPVERILEALAVE)
secondary data (secondary db)
secondary protein structure, e.g., alpha-helices, beta-strands
“motifs”: regular expressions, blocks, profiles, fingerprints
tertiary data (tertiary db)
tertiary protein structure: atomic co-ordinates
domains, folding units
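As a rough illustration of a motif stored as a regular expression, the sketch below converts a simple PROSITE-style pattern into a Python regex and scans a sequence with it. The helper names are hypothetical, only a small subset of the pattern syntax is handled (single residues, x, [..], {..}, and repeat counts), and the test sequence is made up.

```python
import re

# One pattern element: a residue, x, [allowed], or {forbidden}, plus optional (n) or (n,m).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern such as N-{P}-[ST]-{P} to a regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                      # x means any residue
            core = "."
        elif core.startswith("{"):           # {P} means any residue except P
            core = "[^" + core[1:-1] + "]"
        if low and high:                     # x(2,4) means 2 to 4 repeats
            core += "{%s,%s}" % (low, high)
        elif low:                            # x(3) means exactly 3 repeats
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (start index, matched text) for every hit, including overlapping ones."""
    scanner = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in scanner.finditer(sequence)]


# Made-up sequence; the pattern has the form of the N-glycosylation site motif.
print(find_motif("AANGSAA", "N-{P}-[ST]-{P}"))
```

The lookahead wrapper is a design choice so that overlapping occurrences are all reported; a plain search would skip any hit that starts inside a previous match.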
Primary biological databases
• Nucleic acid
EMBL
GenBank
DDBJ (DNA Data Bank of Japan)
• Protein
PIR
MIPS
SWISS-PROT
TrEMBL
NRL-3D
International nucleotide data banks
[Diagram: the three collaborating nucleotide data banks]
EMBL (Europe: EMBL, EBI); TrEMBL
GenBank (USA: NLM, NCBI); NRDB
DDBJ (Japan: NIG, CIB)
Coordination: International Advisory Meeting, Collaborative Meeting
GenBank file format
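As a rough sketch of how a GenBank flat-file record is laid out, the snippet below pulls a few fields out of one record: keyword lines such as LOCUS, DEFINITION and ACCESSION start in the first column, the sequence follows the ORIGIN line, and `//` ends the record. The record shown is entirely made up (its bases are borrowed from the DNA example earlier in the slides), the function name is hypothetical, and continuation lines and the feature table are ignored.

```python
def parse_genbank_record(text):
    """Extract a few header fields and the sequence from one GenBank-style record."""
    record = {"locus": None, "definition": None, "accession": None, "sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if in_sequence:
            # Sequence lines: a position number followed by blocks of bases.
            record["sequence"] += "".join(line.split()[1:]).upper()
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_sequence = True
    return record


# Made-up record for illustration only; not a real database entry.
EXAMPLE_RECORD = """\
LOCUS       TOY0001      24 bp    DNA
DEFINITION  Made-up example record.
ACCESSION   TOY0001
ORIGIN
        1 gctcctcact gtctgtgttt attc
//
"""

record = parse_genbank_record(EXAMPLE_RECORD)
print(record["locus"], record["accession"], record["sequence"])
```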
Swiss-Prot
SWISS-PROT file format
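A SWISS-PROT entry is a flat text record in which each line begins with a two-letter code (for example ID, AC, DE, OS, SQ), the sequence follows the SQ line, and `//` closes the entry. The sketch below groups lines by code under those assumptions; the entry is made up (its residues are the first line of the protein example earlier in the slides) and the function name is hypothetical.

```python
from collections import defaultdict


def parse_swissprot_entry(text):
    """Group the lines of one SWISS-PROT-style entry by two-letter line code."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-entry marker
            break
        if in_sequence:
            # Sequence lines are indented blocks of residues.
            sequence.append("".join(line.split()))
        else:
            code, content = line[:2], line[5:].rstrip()
            fields[code].append(content)
            if code == "SQ":                 # residues start on the next line
                in_sequence = True
    return dict(fields), "".join(sequence)


# Made-up entry for illustration only; not a real database record.
EXAMPLE_ENTRY = """\
ID   TOY_EXAMPLE    STANDARD;      PRT;    18 AA.
AC   TOY001;
DE   Made-up example entry.
OS   Hypothetical organism.
SQ   SEQUENCE   18 AA;
     TDQAAFDTNI VTLTRFVM
//
"""

fields, sequence = parse_swissprot_entry(EXAMPLE_ENTRY)
print(fields["ID"][0].split()[0], fields["AC"], sequence)
```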
Other primary protein databases
• TrEMBL (translated EMBL) in SWISS-PROT format
rapid access to sequence data from genome projects
computer-annotated supplement to SWISS-PROT
translations of all coding sequences (CDS) in EMBL
• SP-TrEMBL
• REM-TrEMBL: immunoglobulins, T-cell receptors, short
fragments, synthetic and patented sequences
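To make "translations of all coding sequences (CDS)" concrete, here is a minimal sketch of translating a CDS with the standard genetic code. It is not how TrEMBL itself is produced; the function name is hypothetical, alternative genetic codes are ignored, and unknown codons are simply reported as X.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")      # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")   # X for unknown codons
        if amino_acid == "*":                # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))
```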
Other primary protein databases
The Protein Information Resource (PIR)
• integrated system of protein sequence databases
and derived related databases, e.g., alignment
databases
• rapid searching, comparison, and pattern matching of
protein sequences
• retrieval of descriptive, bibliographic, feature, and
concurrent cross-reference information
• aims to be comprehensive and consistently
annotated
PIR: related databases
NRL-3D Sequence-Structure Database
• produced by PIR from sequence and annotation
information extracted from three-dimensional
structures in the Protein Databank (PDB)
• allows keyword and similarity searches
PIR: related databases
PATCHX integrated with PIR
• a non-redundant database of protein sequences
produced by MIPS, the European branch of PIR-International
The PIR Protein Sequence Database and PATCHX
together provide the most complete collection of
protein sequence data currently available in the
public domain.
Composite protein sequence dbs
Composite database: source databases
NRDB: PIR, SP, PDB, GenPept
OWL: PIR, SP, GenBank, NRL-3D
MIPSX (PIR+PATCHX): PIR, SP, MIPSOwn, NRL-3D, MIPSH, PIRMOD, MIPSTrn, EMTrans, GBTrans, Kabat, PseqIP
SP+TrEMBL: TrEMBL, SP
(SP = SWISS-PROT)
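The idea of a non-redundant composite can be sketched as merging several source databases in a fixed priority order and keeping each distinct sequence only once. This is a simplified illustration, not the actual build procedure of any database listed here: the function name, the priority order, and the toy entries are assumptions, and only exact duplicates are removed.

```python
def build_composite(sources):
    """Merge (db_name, {entry_id: sequence}) sources, keeping the first copy of each sequence."""
    seen = set()
    composite = []
    for db_name, entries in sources:         # earlier sources take priority
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            if key in seen:                  # exact duplicate of a kept entry
                continue
            seen.add(key)
            composite.append((db_name, entry_id, sequence))
    return composite


# Toy entries, invented for illustration.
sources = [
    ("SP", {"sp1": "MKT", "sp2": "MAA"}),
    ("PIR", {"pir1": "MKT", "pir2": "MGG"}),
]

composite = build_composite(sources)
print(composite)
```

Putting the best-annotated source first means that, when the same sequence occurs in several databases, the copy with the richest annotation is the one retained.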
OWL composite database
• By accession number
• By database code
• By text
• By sequence
• By title
• By author
• By query language
• By regular expression
OWL is released only every 6-8
weeks
Direct OWL access:
OWL Blast server
Two other useful sites
INFOBIOGEN-The Public Catalog of Databases
http://www.infobiogen.fr/services/dbcat/
KEGG-Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/
Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to
computerize current knowledge of molecular and cellular biology in
terms of the information pathways that consist of interacting molecules
or genes and to provide links from the gene catalogs produced by
genome sequencing projects.
Sequence Retrieval System (SRS)
Database browser that allows
users to
•retrieve
•link
•access
entries from all interconnected
resources.
Users can formulate queries
across a range of different
database types.
Guide to Protein Databases:
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture1/index.html
http://www.biochem.ucl.ac.uk/~robert/bioinf/lecture2/index.html
With thanks to Dr Roman Laskowski.