Basic Concepts of Bioinformatics


How Bioinformatics can change your life
Basic Concepts of
Bioinformatics
M. Alroy Mascrenghe
MBCS, MIEEE, MIT
A lecture given for the BCS Wolverhampton Branch at the University of Wolverhampton
http://www.geocities.com/mark_ai/
TOC









Introduction
Basic concepts in Molecular biology
Bioinformatics techniques
Areas in bioinformatics
Applications
Related Computer Technology
Conference in Glasgow
Acknowledgements
Reference
Introduction……
2000






A major event happened that was to change the course of human history
It was a joint British and American effort
nothing to do with IRAQ!
It was a race: who would finish first?
Race test: not whether they have taken drugs but whether they can produce them!
The human genome was sequenced
A Situ…somewhere in the
near future






A virus (not the ‘I love you’ virus) creates an epidemic
Geneticists and bioinformaticians roll up their sleeves
The genetic material of the virus is compared with the existing base of known genetic material of other viruses, as the characteristics of those viruses are known
From the genetic material, computer programs will derive the proteins necessary for the survival of the virus
When the protein (sequence and structure) is known, medicines can be designed
What is

The marriage between computer
science and molecular biology


The algorithms and techniques of computer science are being used to solve the problems faced by molecular biologists
‘Information technology applied to
the management and analysis of
biological data’

Storage and Analysis are two of the
important functions – bioinformaticians
build tools for each
[Diagram: Bioinformatics shown together with Biology, Computer Science, Chemistry and Statistics]
What is..




This is the age of information technology
However, storing info is nothing new
Information equal in volume to the Encyclopaedia Britannica is stored in each of our cells
‘Bioinformatics tries to determine
what info is biologically important’
Basics
of
Molecular Biology….
DNA & Genes



DNA is where the genetic information is stored
Traits such as blond hair and blue eyes are inherited through it
Gene - the basic unit of heredity
There are genes for characteristics, e.g. a gene for blond hair
Genes contain the information as a sequence of nucleotides
Genes are abstract concepts, like lines of longitude and latitude, in the sense that you cannot see them separately
Genes are made up of nucleotides
Nucleotide (nt)

Each nt is made up of
Sugar
Phosphate group
Base
The base it contains is the only difference between one nt and another
There are 4 different bases
G(uanine), A(denine), T(hymine), C(ytosine)
The information is in the order of the nucleotides; the order is the info
Genes can be many thousands of nt long
The complete set of genetic instructions is called the genome
Chromosomes


DNA strings make up chromosomes
Analogy
Letters - nt
Sentences - genes
Individual volumes of the Encyclopaedia Britannica - chromosomes
All volumes together - genome

Double Helix




The DNA is a double helix
Each strand has complementary information
Each base in one strand is bonded with a particular base in the other strand
G-C
A-T
For example
AATGC (one strand)
TTACG (other strand)
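The pairing rule can be written down in a few lines of Python. This is only a minimal sketch: the function name complement is an illustrative choice, and it pairs bases position by position without modelling the direction of the strands.

```python
# Pairing rules from the slide: G-C and A-T.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


print(complement("AATGC"))  # TTACG, the other strand in the example above
```

Applying the function twice gives back the original strand, which is what "complementary information" means in practice.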
Proteins

Proteins are a very important biological feature
Amino acids make up the proteins
There are 20 different amino acids
The function of a protein is dependent on the order of the amino acids


Proteins…








The information required to assemble amino acids (aa) into proteins is stored in DNA
DNA sequence determines amino acid sequence
Amino acid sequence determines protein structure
Protein structure determines protein function
A substance called RNA carries the info stored in the DNA, which in turn is used to make proteins
Storage - DNA
Information transfer - RNA
RNA is the messenger boy!
Central dogma
DNA -> RNA: transcription (RNA polymerase)
RNA -> Protein: translation (ribosomes)
Proteins…..



Since there are 20 amino acids to encode, one nt cannot correspond to one aa (only 4 possibilities), and neither can a pair of nt (4 x 4 = 16 possibilities)
So protein information is carried in triplet codes, called codons (4 x 4 x 4 = 64 possibilities)
The codons that do not correspond to an amino acid are stop codons: UAA, UAG, UGA
AUG is used as the start codon as well as to code for methionine
(RNA has U instead of T)
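Reading codons can be sketched in Python as follows. The sketch assumes the standard genetic code, written as the usual 64-letter string of one-letter amino acid codes ordered by the bases U, C, A, G. The function names transcribe and translate are illustrative, and transcription is simplified to swapping T for U on the coding strand.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Simplified transcription: copy the coding strand, writing U for T."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: UAA, UAG or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGCCTAA")))  # MA (methionine, alanine, then stop)
```

The table has 64 entries for 20 amino acids plus stop, so most amino acids are coded by more than one codon.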

Protein Structure







Shows a wide variety, as opposed to DNA, whose structure is uniform
X-ray crystallography or Nuclear Magnetic Resonance (NMR) is used to figure out the structure
Structure is related to function, or rather structure determines function
Although proteins are created as a linear chain of aa, they fold into a 3D structure
If you stretch them and let go, they will go back to this structure: this is the native structure of a protein
Proteins function well only in their native structure
Even after translation is over, the protein goes through some changes to its structure
Gene Expression






Gene expression - the process of transcribing DNA and translating RNA to make protein
Where do the genes begin in a chromosome?
How does RNA polymerase identify the beginning of a gene so that a protein can be made?
A single nt cannot be taken to mark the beginning of a gene, as each nt occurs frequently
But a particular combination of nucleotides can
Promoter sequences - the order of nt which marks the beginning of a gene
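The difference between a single nt and a combination of nt can be seen by scanning a sequence. The sketch below assumes the motif TATAAT, the consensus of the bacterial -10 promoter element, and searches for exact matches only. Real promoters vary around the consensus, so this is a simplification. The test sequence is invented for illustration.

```python
def find_motif(sequence: str, motif: str) -> list[int]:
    """Return the 0-based start positions of every occurrence of motif (overlaps included)."""
    sequence, motif = sequence.upper(), motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


# Hypothetical sequence, invented for this example.
dna = "GGCTATAATGCGATGCCTATAATGG"
print(find_motif(dna, "TATAAT"))   # [3, 17]
print(len(find_motif(dna, "A")))   # 7
```

The single base A turns up seven times in 25 nt, while the six-letter combination occurs only twice, which is why a combination is a usable marker and a single nt is not.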
Bioinformatics
Techniques…..
Prediction and Pattern
Recognition


The two main areas of bioinformatics are
Pattern recognition
‘A particular sequence or structure has been seen before’, and a particular characteristic can be associated with it
Prediction
From a sequence (what we know) we can predict the structure and function (what we don’t know)
Dot plots….



A simple way of evaluating similarity between two sequences
In a graph, one sequence is placed along one axis and the other sequence along the other axis
Where there are matches between the two sequences, the graph is marked
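A dot plot is easy to draw as text. The following is a minimal sketch in which the function name dot_plot and the '*' and '.' symbols are arbitrary choices. It marks every single-character match, whereas practical dot plot tools usually filter the noise with a sliding window.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """Return one text row per character of seq_b; '*' marks a match with seq_a."""
    return [
        "".join("*" if a == b else "." for a in seq_a)
        for b in seq_b
    ]


top, side = "TTACTATA", "TAGATA"
print("  " + top)                      # sequence along the horizontal axis
for base, row in zip(side, dot_plot(top, side)):
    print(base, row)                   # sequence along the vertical axis
```

Runs of marks along a diagonal show stretches where the two sequences agree.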
Alignments


A pairing of the characters of two or more sequences that brings out their similarity
E.g.
TTACTATA
TAGATA
There are many ways to align the above two sequences
1.
TTACTATA
TAGATA
2.
TTACTATA
 TAGATA
3.
TTACTATA
  TAGATA
So which one do we choose, and on what basis?
The solution is to provide a match score and a mismatch score
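Scoring lets the computer compare candidate alignments. The sketch below slides the shorter sequence along the longer one without gaps and scores each position. The values +1 for a match and -1 for a mismatch are assumptions chosen for illustration, as are the function names.

```python
def ungapped_score(long_seq: str, short_seq: str, offset: int,
                   match: int = 1, mismatch: int = -1) -> int:
    """Score short_seq placed under long_seq starting at the given offset."""
    window = long_seq[offset:offset + len(short_seq)]
    return sum(match if a == b else mismatch for a, b in zip(window, short_seq))


def offset_scores(long_seq: str, short_seq: str) -> list[int]:
    """Score every placement in which short_seq fits entirely under long_seq."""
    last_offset = len(long_seq) - len(short_seq)
    return [ungapped_score(long_seq, short_seq, k) for k in range(last_offset + 1)]


print(offset_scores("TTACTATA", "TAGATA"))  # [0, -2, 0]
```

With these scores the placements at offsets 0 and 2 tie, and the one at offset 1 is worse. None of them is a good alignment, which is the motivation for allowing gaps.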
Gaps

Introduce gaps, and a penalty score for gaps
TTACTATA
T_A_GATA
In gap scoring, a single indel which is two characters long is preferred to two indels which are each one character long
However, not all gaps are bad
TTGCAATCT
CAA
How do we align?
---CAA---
These end gaps are not biologically significant
Semi-global alignments do not penalise them
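One common way to prefer a single long indel over several short ones is to charge a cost for opening a gap and a smaller cost for each character in it (an affine gap penalty). The sketch below assumes an opening cost of 2 and an extension cost of 1, and the underscore as the gap character. All three are illustrative choices.

```python
import re


def gap_penalty(aligned_row: str, gap_open: int = 2, gap_extend: int = 1,
                gap_char: str = "_") -> int:
    """Total penalty: each run of gap characters costs gap_open plus gap_extend per character."""
    runs = re.findall(re.escape(gap_char) + "+", aligned_row)
    return sum(gap_open + gap_extend * len(run) for run in runs)


print(gap_penalty("T__AGATA"))  # 4: one indel, two characters long
print(gap_penalty("T_A_GATA"))  # 6: two indels, one character each
```

Both rows contain two gap characters, but the row with one indel is penalised less.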
Scoring Matrix




For DNA/protein sequence alignment we create a matrix giving a score for every pair of characters, for example
A aligned with A scores 1
A aligned with T scores -5
A aligned with C scores -1
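A scoring matrix can be held as a lookup table. In the sketch below only the three A scores come from the slide. Every other entry (1 for any identical pair, -1 for any other mismatch) is an assumption made to complete the example, and the matrix is assumed to be symmetric.

```python
BASES = "ACGT"

# Scores taken from the slide.
SLIDE_SCORES = {("A", "A"): 1, ("A", "T"): -5, ("A", "C"): -1}


def build_matrix(default_match: int = 1, default_mismatch: int = -1) -> dict:
    """Build a symmetric score table; defaults fill the pairs the slide does not give."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            matrix[(x, y)] = default_match if x == y else default_mismatch
    for (x, y), score in SLIDE_SCORES.items():
        matrix[(x, y)] = score
        matrix[(y, x)] = score  # assumed symmetric: A/T scores the same as T/A
    return matrix


def score_columns(row_a: str, row_b: str, matrix: dict) -> int:
    """Sum the matrix scores of two gap-free aligned rows, column by column."""
    return sum(matrix[(a, b)] for a, b in zip(row_a, row_b))


matrix = build_matrix()
print(score_columns("AATC", "ATTA", matrix))  # 1 - 5 + 1 - 1 = -4
```

Protein alignments use the same idea with a 20 x 20 table.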
Dynamic Programming




As the length of the query sequences increases, and as the difference in length between the two sequences increases, more gaps have to be inserted in various places
We cannot perform an exhaustive search
A combinatorial explosion occurs: there are too many combinations to search
Dynamic programming avoids this by building the best alignment from the best alignments of shorter prefixes, so each sub-problem is solved only once
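The idea can be sketched as a table in which each cell holds the best score for aligning a prefix of one sequence with a prefix of the other. This follows the general shape of the Needleman-Wunsch global alignment method. The scores (match +1, mismatch -1, gap -2) and the function name are assumptions, and only the score is returned, not the alignment itself.

```python
def global_alignment_score(seq_a: str, seq_b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score, computed by filling a prefix-by-prefix table."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # a prefix of seq_a aligned against nothing
        table[i][0] = i * gap
    for j in range(1, cols):          # a prefix of seq_b aligned against nothing
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align the two characters
                table[i - 1][j] + gap,       # gap in seq_b
                table[i][j - 1] + gap,       # gap in seq_a
            )
    return table[-1][-1]


print(global_alignment_score("AC", "AGC"))            # 0: A and C match around one gap
print(global_alignment_score("TTACTATA", "TAGATA"))
```

The table has (len(seq_a) + 1) x (len(seq_b) + 1) cells, so the work grows with the product of the two lengths instead of with the number of possible alignments.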
Databases





Sequence info is stored in databases so that it can be manipulated easily
The db (next slide) are located in different places
They exchange info on a daily basis so that they are up to date and in sync
Primary db - sequence data
Major Primary DB
Nucleic Acid
EMBL (Europe)
GenBank (USA)
DDBJ (Japan)
Protein
PIR (Protein Information Resource)
MIPS
SWISS-PROT (University of Geneva, now with EBI)
TrEMBL (a supplement to SWISS-PROT)
NRL-3D
Composite DB



As there are many db, which one should we search? Some are good in some aspects and weak in others
A composite db is the answer: it has several db as its base data
Searches on these db are indexed and streamlined so that the same stored sequence is not searched twice in different db
Composite DB

OWL has these as its primary db
SWISS-PROT (top priority)
PIR
GenBank
NRL-3D

Secondary db

Store secondary structure info
or results of searches of the
primary db
Secondary db    Primary db source
PROSITE         SWISS-PROT
PRINTS          OWL
Database Searches




We have sequenced and identified many genes, so we know what they do
The sequences are stored in databases
So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases
Since the databases hold a large number of sequences, we cannot do a full sequence alignment against each and every one
So heuristics must be used
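One widely used heuristic is to index short words (k-mers) from the database once and then look only at the stored sequences that share a word with the query. The sketch below illustrates that idea only and is not the algorithm of any particular search tool. The database contents, the word length k = 4 and the function names are all invented for illustration.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every k-letter word to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def candidate_hits(query: str, index: dict[str, set[str]], k: int) -> dict[str, int]:
    """Count, per database sequence, how many of the query's k-mers it shares."""
    counts = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            counts[name] += 1
    return dict(counts)


# Hypothetical three-sequence database.
database = {"seq1": "TTACTATAGG", "seq2": "CCCGGGCCCG", "seq3": "GGTAGATACC"}
index = build_kmer_index(database, k=4)
print(candidate_hits("TAGATA", index, k=4))  # {'seq3': 3}
```

Only the candidates found this way would then be passed to a full alignment. The price of the shortcut is that a related sequence sharing no exact word with the query is missed.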
Areas in
Bioinformatics…
Genomics


Because of the multicellular structure, each cell type does gene expression in a different way, although each cell has the same content as far as the genetic information is concerned
i.e. all the information for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them
Genomics - Finding Genes





Gene in sequence data - needle in a haystack
However, whereas a needle is different from the haystack, genes are not different from the rest of the sequence data
In a whole array of nt we try to find a set of nt and mark its borders as a gene
This is one of the challenges of bioinformatics
Neural networks and dynamic programming are being employed
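The simplest form of border marking is to look for an open reading frame: a start codon followed, in the same reading frame, by a stop codon. The sketch below does this on one strand only and is far cruder than the neural network and dynamic programming methods mentioned above. The function name and the min_codons parameter are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) positions of open reading frames on the given strand.

    start is the index of the A of ATG; end is one past the stop codon.
    min_codons counts the codons before the stop, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # candidate left border
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # right border after the stop
                start = None
    return sorted(orfs)


dna = "CCATGGCCTAAGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 11 ATGGCCTAA
```

A real gene finder also has to handle the opposite strand, non-coding regions inside genes and reading frames that occur by chance.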
Organism        Genome Size (Mb)    Gene Number    Web Site
Yeast           13.5                6,241          http://genomewww.stanford.edu/Saccharomyces
Fruit Flies     180                 13,601         http://flybase.bio.indiana.edu
Homo Sapiens    3,000               45,000         http://www.ncbi.nlm.nih.gov/genome/guide
(1 Mb = 1,000,000 bp)
Proteomics


The proteome is the sum total of an organism’s proteins
More difficult than genomics
Genomics: 4 bases, simple chemical makeup, can be duplicated
Proteomics: 20 amino acids, complex, can’t be duplicated
We are entering the ‘post genome era’
Meaning much has been done with the genes, not that the work is over
Proteomics…..





The relationship between an RNA and the protein it codes for is usually far from direct
After translation, proteins do change
So the aa sequence does not tell us anything about the post-translational changes
Proteins are not active until they are combined into a larger complex or moved to a relevant location inside or outside the cell
So the aa sequence only hints at these things
Also, proteins must be handled more carefully in labs, as they tend to change when in touch with an inappropriate material
Protein Structure Prediction


One of the biggest challenges of bioinformatics and especially biochemistry
There is as yet no algorithm that consistently predicts the structure of proteins
Structure Prediction methods

Comparative Modeling
The target protein’s structure is modelled by comparison with related proteins
Proteins with similar sequences are searched for known structures
Phylogenetics





The taxonomical system reflects evolutionary relationships
Phylogenetic trees reflect the evolutionary relationships through a picture/graph
Rooted trees have a single common ancestor
Unrooted trees just show the relationships
Phylogenetic tree reconstruction algorithms are also an area of research
Applications….
Medical Implications



Pharmacogenomics
Not all drugs work on all patients; some good drugs cause death in some patients
So by doing a gene analysis before the treatment, the offending drugs can be avoided
Also, drugs which cause death to most can be used on a minority to whose genes that drug is well suited - volunteers wanted!
Customized treatment
Gene Therapy
Replace or supply the defective or missing gene
E.g. insulin, and Factor VIII for haemophilia
BioWeapons (??)
Diagnosis of Disease






Identification of the genes which cause a disease will help detect the disease at an early stage
E.g. Huntington’s disease
Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
Death in 10-15 years
The gene responsible for the disease has been identified
It contains excessively repeated sections of CAG
So once analyzed, the couple can be counseled
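Counting repeated sections is a simple pattern search. The sketch below finds the longest unbroken run of CAG in a sequence. The function name and the test sequence are invented for illustration, and no clinical threshold is implied.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Return the number of CAG units in the longest uninterrupted CAG repeat."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


# Hypothetical fragment: three CAG in a row, then a single CAG further on.
print(longest_cag_run("TTCAGCAGCAGTTCAGAA"))  # 3
```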
Drug Design



Can take up to 15 years and $700 million
One of the goals of bioinformatics is to reduce the time and cost involved
The process
Discovery
Computational methods can improve this
Testing
Discovery
Target identification




Identifying the molecule on which the germ relies for its survival
Then we develop another molecule, i.e. a drug, which will bind to the target
So the germ will not be able to interact with the target
Proteins are the most common targets
Discovery…



For example, HIV produces HIV protease, which is a protein and which in turn cuts other proteins
This HIV protease has an active site where it binds to other molecules
So an HIV drug will go and bind to that active site
Easier said than done!
Discovery…



Lead compounds are the
molecules that go and bind to
the target protein’s active site
Traditionally this has been a trial
and error method
Now this is being moved into the
realm of computers
Related Computer
Technology………….
PERL





Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings
The default CGI language
It started out as a scripting language but has become a fully fledged language
It has everything now, even web service support
http://bio.perl.org
The place of XML & Web
Services







Various markup languages are being created (Gene Markup Language etc.) to represent sequence/gene data
Web Services: program-to-program interaction, making the web application centric as opposed to human centric
So this has to be platform and language independent
Protocols like SOAP help in this regard
In bioinformatics, various databases, platforms, languages etc. are being used
So web services help achieve platform independence and program interaction
Since sequence databases are in various formats and on various platforms, SOAP also helps in this regard
The place of GRID





GRID - new kid on the block
Using many computers to fulfill a single computational task
Bioinformatics is an ideal application, as it has to deal with a large amount of data in alignment and searches
E-science initiative in the UK
ORACLE 10g - the world’s first GRID database
Data bases and Mining



A lot of the sequence databases are publicly available
As there is a DB involved, various data mining techniques are used to pull the data out
As there is a lot of literature (articles etc.) in this area, data mining on the literature, not on the sequence data, has also become a PhD topic for many
European Molecular Biology
Network (EMBnet)



A central system for sharing, training and centralizing up-to-date bio info
Some of the EMBnet sites are:
SEQNET
http://www.seqnet.dl.ac.uk
UCL
http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
EBI - European Bioinformatics Institute
www.ebi.ac.uk
References





Dan E. Krane and Michael L. Raymer, Fundamental Concepts of Bioinformatics
Arthur M. Lesk, Introduction to Bioinformatics
T.K. Attwood and D.J. Parry-Smith, Introduction to Bioinformatics
Dr Patrick Dixon, The Genetic Revolution
Prof David Gilbert’s site, http://www.brc.dcs.gla.ac.uk/~drg/
Thank You!