Investigating Sequences - BioQUEST Curriculum Consortium
Investigating Sequences
Stephen Everse
University of Vermont
Biochemical similarities among all organisms
GENOTYPE (i.e. Aa)
• Genetic information
encoded in nucleic acids
• Protein synthesis by
ribosomes using common
genetic code*
• Many common families of
genes and proteins (rRNA,
enzymes, proteins for
transport, replication and
expression of DNA)
• All modern day cells
descended from a common
ancestor
• Evolutionary relationships
revealed by gene
sequences
PHENOTYPE (pink flower)
The tree of life
Phylogenetic relationships among organisms
determined by ribosomal RNA sequences
Red lines indicate pathogens
First cells are thought to have existed as early
as 3.8 billion years ago. They were probably prokaryotes.
The oldest eukaryotic cell fossils are about 1.8 billion years old.
J. Burke 2005
What is Bioinformatics?
• (Molecular) Bio - informatics
• One idea for a definition?
Bioinformatics is conceptualizing biology in terms of molecules (in
the sense of physical-chemistry) and then applying “informatics”
techniques (derived from disciplines such as applied math, CS, and
statistics) to understand and organize the information
associated with these molecules, on a large-scale.
• Bioinformatics is “MIS” for Molecular Biology Information. It is a
practical discipline with many applications.
[Diagram: Bioinformatics at the overlap of Bio, Math & Stats, and Comp Sci]
Alexandrov and Gerstein © 2000
Areas of current and future
development of bioinformatics
• Molecular biology and genetics
• Phylogenetic and evolutionary sciences
• Different aspects of biotechnology including
pharmaceutical and microbiological industries
• Medicine
• Agriculture
• Eco-management
Bioinformatics key areas
e.g. homology searches
organisation of knowledge (sequences, structures, functional data)
M. Nilges © 2003
Why do we want to compare
sequences?
• Relationships
– Phylogenetic trees can be constructed based on comparison of
the sequences of a molecule (example: 16S rRNA) taken from
different species
– Residues conserved during evolution play an important role
• Prediction of protein structure and function
– Proteins which are very similar in sequence generally have
similar 3D structure and function as well
– By searching a sequence of unknown structure against a
database of known proteins the structure and/or function can
in many cases be predicted
Center for Biological Sequence Analysis © 2001
Aligning Text Strings
Marc Gerstein © 1999
Mol Bio Information - Protein
Marc Gerstein © 1999
Summary
• The central dogma of biology generates material
appropriate for bioinformatic study (DNA, RNA,
proteins, phenotype, etc.)
• One form of bioinformatics is the comparison of
sequences
• But how do we bring this to the classroom?
Creating Inquiry Opportunities
[Diagram: Domain Principles, Analysis Tools, Data Sets]
Establishing a Problem Space
[Diagram: Problem Space connected to Domain Principles, Analysis Tools, Data Sets]
Creating problem spaces
that provide a rich context
for using bioinformatics
data and tools allows
students to focus on using
their understanding of
biology to investigate
meaningful questions.
Problem Spaces
• Foundation
– Introduction
– Background
– Data
– Tools
– Bibliography
– Curricular Resources
• Starting Points
Malaria
Malaria is caused by one of four
species of Plasmodium (falciparum,
vivax, malariae and ovale). Of these,
P. falciparum is the most lethal; it is
estimated to cause 200 million clinical
cases and 1-3 million deaths
(including many children) every year
worldwide.
Lifecycle
The Plasmodium falciparum Genome- A
Consortium Project
(chromosomes 1, 3-9, 13)
(chromosomes 2, 10, 11 and 14)
(chromosome 12)
Plasmodium genomics special issue
Nature 3rd October 2002
Plasmodium falciparum
Genome Project
Curation
To maximise the benefits to the scientific community of Plasmodium
genome sequencing, the Pathogen Genomics group is committed to the
curation of Plasmodium spp. This will ensure that annotation is
updated and maintained, and will form a framework that underpins
global efforts to understand the parasite and the disease it causes.
If you would like to contribute to the curation of any gene(s), please
contact the curator and visit GeneDB.
http://www.sanger.ac.uk/Projects/P_falciparum/
See how Brad Goodner at Hiram College involves his
students in curation:
http://www.hiram.edu/biology/faculty/goodner.html
Drug search …
International computing grid searches for malaria drugs
2/2/07.
Using an international computing grid spanning 27 countries,
scientists on the WISDOM project analysed an average of 80 000
possible drug compounds against malaria every hour. In total, the
challenge processed over 140 million compounds, with a UK physics
grid providing nearly half of the computing hours used.
http://malaria.wellcome.ac.uk/doc_WTX037265.html
Enabling Grids for E-sciencE (EGEE) is the largest multidisciplinary grid infrastructure in the world, which brings
together more than 120 organizations to produce a
reliable and scalable computing resource available to the
global research community. At present, it consists of 250
sites in 48 countries and more than 68,000 CPUs available
to some 8,000 users 24 hours a day, 7 days a week.
http://www.eu-egee.org/
The data
Genetic structure of Plasmodium falciparum field isolates
in eastern and north-eastern India
H. Joshi, N. Valecha, A. Verma, A. Kaul, P.K. Mallick, S.
Shalini, S.K. Prajapati, S.K. Sharma, V. Dev, S. Biswas, N.
Nanda, M.S. Malhotra, S.K. Subbarao & A.P. Dash.
Malaria Journal 2007, 6:60
http://www.malariajournal.com/content/6/1/60
The study…
• Isolates were collected from microscopically diagnosed
P. falciparum positive subjects in three Indian states
with varied malaria epidemiology;
• Merozoite surface protein-1 (MSP-1, 17 kDa) and
protein-2 (MSP-2, 46-53 kDa) of P. falciparum are
targets of the host's humoral immunity and malaria
vaccine candidates; and
• msp-1 (block 2) and msp-2 (central repeat region,
block 3) sequences from 131 P. falciparum isolates were
obtained, along with others from GenBank.
msp-1 blocks
Hoffmann et al. (2003) Malaria J. 2:24
Block 2 of msp-1
Nucleotide Sequence
aatgaagaag aaattactac aaaaggtgca agtgctcaaa
gtggtgcaag tgctcaaagt ggtgcaagtg ctcaaagtgg
tgcaagtgct caaagtggtg caagtgctca aagtggtgca
agtgctcaaa gtggtacaag tggtccaagt ggtccaagtg
gtacaagtcc atcatctcgt tcaaacactt tacctcgttc
aaatacttca tctggtgcaa gccctccagc tgatgcaagc
Amino Acid Sequence
NEEEITTKGASAQSGASAQSGASAQSGASAQSGASAQ
SGASAQSGTSGPSGPSGTSPSSRSNTLPRSNTSSGASPP
ADAS
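The link between the two block 2 sequences can be checked by translating the nucleotide sequence with the standard genetic code. What follows is a minimal sketch, not a tool from the workshop: the names `CODON_TABLE`, `translate` and `MSP1_BLOCK2` are illustrative, and it assumes the 10-base groups are ordered so that they read continuously in the first reading frame.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in reading frame 1; whitespace is ignored."""
    dna = "".join(dna.split()).upper()
    usable = len(dna) - len(dna) % 3  # drop any incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# msp-1 block 2 nucleotide sequence from the slide (240 bases),
# with the 10-base groups assumed to be in reading order.
MSP1_BLOCK2 = """
aatgaagaag aaattactac aaaaggtgca agtgctcaaa
gtggtgcaag tgctcaaagt ggtgcaagtg ctcaaagtgg
tgcaagtgct caaagtggtg caagtgctca aagtggtgca
agtgctcaaa gtggtacaag tggtccaagt ggtccaagtg
gtacaagtcc atcatctcgt tcaaacactt tacctcgttc
aaatacttca tctggtgcaa gccctccagc tgatgcaagc
"""

protein = translate(MSP1_BLOCK2)
print(protein)
```

With this ordering the 240 bases translate to the 80-residue amino acid sequence shown on the slide, including the repeated SAQSGA motif.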
Our Notation …
Subset of protein & nucleotide
sequences available (n):
– Consortium (1)
– Indian (9)
– Community (12)
– Sudan (1)
Our workspace …
• National Center for Biotechnology Information
(http://ncbi.nlm.nih.gov)
• Biology Workbench
(http://workbench.sdsc.edu)
Malaria Triad:
Genetics & Genomics
This web resource provides data and information
relevant to malaria genetics and genomics. These
resources include organism specific sequence
BLAST databases (Plasmodium falciparum only, all
Plasmodium), genome maps, linkage markers, and
information about genetic studies. Links are
provided for other malaria web sites and genetic
data on related apicomplexan parasites.
http://www.ncbi.nlm.nih.gov/projects/Malaria/
The tools …
• Session Tools ~ file folders
• Protein Tools/Nucleic Tools
– Find sequences (Ndjinn)
– Upload sequences (Add)
– Align sequences (CLUSTALW)
• Alignment Tools
– Display options (BOXSHADE, DRAWGRAM)
Let’s explore …
> Sequence 1
GAGGTAGTAATTAGATCCGAAA…
> Sequence 2
GAGGTAGTAATTAGATCTGAAA…
> Sequence 3
GAGGTAGTAATTAGATCTGTCA…
• Form groups of ~3/computer
• Look at the data
(http://bioquest.org/oakwood_2008/malaria-problem-space)
• Choose a problem/question to explore
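Before running an alignment tool, it can help to see what "comparing sequences" means at the simplest level: counting the positions at which two sequences differ. The sketch below uses only the visible part of the three example sequences (the slide truncates them with "…"), and the helper name `mismatch_positions` is made up for illustration.

```python
from itertools import combinations

# Only the visible prefixes of the example sequences; the rest is elided on the slide.
SEQUENCES = {
    "Sequence 1": "GAGGTAGTAATTAGATCCGAAA",
    "Sequence 2": "GAGGTAGTAATTAGATCTGAAA",
    "Sequence 3": "GAGGTAGTAATTAGATCTGTCA",
}


def mismatch_positions(a, b):
    """Return the 1-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


for (name_a, seq_a), (name_b, seq_b) in combinations(SEQUENCES.items(), 2):
    print(name_a, "vs", name_b, mismatch_positions(seq_a, seq_b))
```

This position-by-position count only works when the sequences are the same length and have no insertions or deletions; that limitation is what alignment programs such as CLUSTALW address.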
Favorite movie of the week …
Inside the Cell
Harvard BioVisions Video
What you are seeing is discussed here
Homology
• Homologous sequences can be divided into two
groups
– orthologous sequences: sequences that differ because
they are found in different species (e.g. human α-globin
and mouse α-globin)
– paralogous sequences: sequences that differ because of a
gene duplication event (e.g. human α-globin and human
β-globin, various versions of both)
M. Craven © 2002
So this means …
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
Search algorithms
• Smith-Waterman (1981)
– Demanding of time and memory resources
• FASTA (Pearson 1995)
– Speeds up searches by an order of magnitude compared to Smith-Waterman
– Good statistics
• BLAST (Altschul 1990, 1997)
– Extremely fast
• One order of magnitude faster than FASTA
• Two orders of magnitude faster than Smith-Waterman
– Almost as sensitive as FASTA
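To see why Smith-Waterman is demanding, here is a minimal sketch of its scoring step. It fills one cell for every pair of positions in the two sequences, so the work grows with the product of their lengths. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, and a simple linear gap penalty is used. FASTA and BLAST use heuristics to avoid filling the whole table; those heuristics are not shown here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))  # shared ACGT core
```

Only the previous row is kept, which saves memory when just the score is needed; recovering the alignment itself requires keeping the full table.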
Things to keep in mind when working
with alignments
• Pairwise alignment programs always find the optimal
alignment of two sequences
– They do so even if it does not make any sense at all to align the
two sequences
– ”Optimal” means optimal according to the substitution matrix
and gap penalties you choose – also if you choose the wrong
ones
• Generally the underlying assumptions are wrong
– The frequency of substitution is not the same at all positions
– Nor are the frequencies of insertions and deletions the same
– Affine gap penalties do not properly model indel events
Center for Biological Sequence Analysis © 2001
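The first two points can be tried out with a small global alignment scorer. This is a sketch of the Needleman-Wunsch recurrence with a linear gap penalty; the function name and the default scores are assumptions, not the settings of any particular program. It returns an "optimal" score for any pair of strings, related or not, and that score changes when the gap penalty changes.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global (Needleman-Wunsch) score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against b[:j]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against ""
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Unrelated sequences still receive an optimal score.
print(global_alignment_score("AAAA", "TTTT"))
# The same pair scored with two different gap penalties.
print(global_alignment_score("ACGT", "AGT", gap=-1))
print(global_alignment_score("ACGT", "AGT", gap=-5))
```

The numbers alone say nothing about whether the alignment is biologically meaningful; that judgement has to come from the person choosing the sequences and the parameters.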
How to score the exchange of two amino acids in an alignment?
• Simplest way: the identity matrix
• A very crude model: to use the genetic code
matrix, the number of point mutations
necessary to transform one codon into the
other.
• Other similarity scoring matrices might be
constructed from any property of amino
acids that can be quantified
– partition coefficients between hydrophobic and
hydrophilic phases
– charge
– molecular volume, etc.
• Unfortunately, all these biophysical
quantities suffer from the fact that they
provide only a partial view of the picture:
there is no guarantee that any particular
property is a good predictor for
conservation of amino acids between related
proteins.
Marc Gerstein © 1999
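The first two scoring schemes can be written down directly. The sketch below assumes the standard genetic code; the function names are illustrative. It scores a pair of amino acids either by identity or by the minimum number of point mutations needed to turn a codon for one into a codon for the other.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (one-letter code) to all of its codons.
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))


def identity_score(a, b):
    """Identity matrix: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def genetic_code_distance(a, b):
    """Fewest point mutations turning some codon for a into some codon for b."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[a]
        for c2 in CODONS[b]
    )


print(identity_score("D", "E"), genetic_code_distance("D", "E"))
print(identity_score("M", "W"), genetic_code_distance("M", "W"))
```

The identity matrix treats every mismatch alike, while the genetic code distance at least separates exchanges reachable by one mutation (D to E) from those needing more (M to W). Neither uses any information about which exchanges are actually observed in related proteins.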
Pairwise alignment of hemoglobin alpha
chain and myoglobin
24.7% identity;
Global alignment score: 130
HBA_HU VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG---
MYG_PH VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED

HBA_HU ---HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF-KLLSHCLLVTLAAHL
MYG_PH LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI-PIKYLEFISEAIIHVLHSRH

HBA_HU PAEFTPAVHASLDKFLASVSTVLTSKYR------
MYG_PH PGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Center for Biological Sequence Analysis © 2001
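The identity figure can be recomputed from the alignment rows. This sketch rests on two assumptions: percent identity is taken as identical columns divided by all alignment columns (gap columns included), and each wrapped HBA_HU row that ends in gaps is padded so both rows have the same length. The function name is illustrative.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Identical, non-gap columns as a percentage of all alignment columns."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    return 100.0 * identical / len(row_a)


# Aligned rows from the slide, joined across the three wrapped blocks.
HBA_HU = (
    "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG---"
    "---HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNF-KLLSHCLLVTLAAHL"
    "PAEFTPAVHASLDKFLASVSTVLTSKYR------"
)
MYG_PH = (
    "VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED"
    "LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI-PIKYLEFISEAIIHVLHSRH"
    "PGDFGADAQGAMNKALELFRKDIAAKYKELGYQG"
)

print(round(percent_identity(HBA_HU, MYG_PH), 1))
```

Under these assumptions, 38 of 154 columns are identical, which agrees with the 24.7% quoted above. Other definitions (for example, dividing by the length of the shorter sequence) would give a different number for the same alignment.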
Important things to remember when using
alignment to search databases
• When searching in databases, size does matter!
– Searching large databases takes a very long time
– The significance of matches drops when the database is expanded
• Doing things differently can lead to different conclusions
– Nucleotide comparison vs. protein comparison
• Think before and after you search
– The obvious thing to do is not always the right thing to do
– Conclusions based on matches should be drawn with great care
Marc Gerstein © 1999
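Why significance drops as a database grows can be illustrated with the Karlin-Altschul statistics used by BLAST. The number of chance matches expected at bit score S or better is roughly E = m × n × 2^(-S), where m is the query length and n is the total database length. The sketch below is a simplification: it ignores the effective-length corrections that real programs apply, and the function name and example sizes are made up.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: chance matches expected at this bit score or better."""
    return query_length * database_length * 2.0 ** (-bit_score)


# The same match (bit score 40, 250-residue query) against two database sizes.
small = expected_chance_hits(40, 250, 10**6)
large = expected_chance_hits(40, 250, 10**8)
print(small, large, large / small)
```

The score of the match is unchanged, but a database 100 times larger gives an E-value 100 times larger, so the same hit is less convincing.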
Why multiple alignment is better
• More sequences contain more information
• Multiple sequence alignment allows us to compare all
related proteins simultaneously
• It allows us to identify features that are conserved
among the sequences
• Using a multiple sequence alignment (a profile) one
can find more related sequences than by simple
pairwise comparison
Center for Biological Sequence Analysis © 2001
A multiple sequence alignment of
globins
HBB_HUMAN  --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
HBB_HORSE  --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
HBA_HUMAN  ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
HBA_HORSE  ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
MYG_PHYCA  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
LGB2_LUPLU --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

HBB_HUMAN  PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
HBB_HORSE  PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
HBA_HUMAN  ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
HBA_HORSE  ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
MYG_PHYCA  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
GLB5_PETMA ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
LGB2_LUPLU VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV
Center for Biological Sequence Analysis © 2001
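One thing a multiple alignment makes easy is finding columns that are identical in every sequence. The sketch below does this for the first 60-column block of the globin alignment. The function name is illustrative, and the two HBA rows are assumed to end in a gap character so that all rows have equal length.

```python
def conserved_columns(rows, gap="-"):
    """1-based columns where every sequence has the same residue and no gap."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("aligned rows must have equal length")
    return [
        i
        for i, column in enumerate(zip(*rows), start=1)
        if gap not in column and len(set(column)) == 1
    ]


# First block of the globin alignment from the slide.
GLOBINS = {
    "HBB_HUMAN": "--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST",
    "HBB_HORSE": "--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN",
    "HBA_HUMAN": "---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-",
    "HBA_HORSE": "---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-",
    "MYG_PHYCA": "---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT",
    "GLB5_PETMA": "PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT",
    "LGB2_LUPLU": "--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE",
}

rows = list(GLOBINS.values())
for col in conserved_columns(rows):
    print(col, rows[0][col - 1])
```

In this block only four columns (an L, a W, a P and an F) are identical across all seven globins. A single pairwise comparison cannot show this, because it cannot tell residues shared by chance from residues kept across the whole family.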