Bioinformatics and Molecular Evolution

BINF6201/8201: Molecular Sequence Analysis
Dr. Zhengchang Su
Office: 351 Bioinformatics Building
Email: [email protected]
Office hours: Tuesday and Thursday: 2:00~3:00pm
08-23-2010
Textbook and reading materials
 Textbook: Bioinformatics and Molecular Evolution
by Paul G. Higgs and Teresa K. Attwood,
Blackwell Publishing, 2005.
 Additional readings from the current literature may be
assigned as appropriate
 All lecture slides will be available online at
http://bioinfo.uncc.edu/zhx/binf8201/binf8201.html
Student Evaluation
 Weekly or bi-weekly homework assignments; Ph.D. students
may have additional assignments (30%).
 Two midterm exams (60%):
10/5 (Tuesday) and 12/14 (Tuesday)
 Classroom participation will count for 10% of the grade.
Sequence data explosions
 Three almost equivalent biological sequence databases
International Nucleotide Sequence Database Collaboration (INSDC)
1. GenBank at NCBI
2. European Molecular Biology Laboratory (EMBL) Nucleotide
Sequence Database at European Bioinformatics Institute (EBI)
3. DNA Data Bank of Japan (DDBJ)
 Features
1. All published biological sequences are requested to be deposited in
one of these three databases;
2. Data are exchanged among these three databases on a daily basis.
Data explosions
 Both the number/length of sequences and the number of transistors in a
CPU increase exponentially with time.
 However, the number/length of sequences increases even faster than
the number of transistors in a CPU.
$$N(t) = N_0 e^{rt}, \qquad \ln N(t) = rt + \ln N_0.$$
[Figure: plot of ln N(t) against t]
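Because ln N(t) is linear in t, the growth rate r can be estimated by fitting a straight line to the log of the counts. The following is a minimal sketch of that idea, not part of the original slides; the function name `fit_exponential` and the data points are hypothetical, generated from an assumed N_0 = 1000 and r = 0.5 per year.

```python
import math

def fit_exponential(times, counts):
    """Least-squares fit of ln N(t) = r*t + ln N0; returns (r, N0)."""
    logs = [math.log(n) for n in counts]
    k = len(times)
    mean_t = sum(times) / k
    mean_y = sum(logs) / k
    # Slope of the regression line is the growth rate r.
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    r = sxy / sxx
    # Intercept is ln N0.
    n0 = math.exp(mean_y - r * mean_t)
    return r, n0

# Hypothetical data (not real database sizes): N0 = 1000, r = 0.5 per year.
years = [0, 1, 2, 3, 4, 5]
counts = [1000 * math.exp(0.5 * t) for t in years]

r, n0 = fit_exponential(years, counts)
doubling_time = math.log(2) / r  # time for N(t) to double
print(f"r = {r:.3f} per year, N0 = {n0:.1f}, doubling time = {doubling_time:.2f} years")
```

A steeper slope on the ln N(t) plot means a shorter doubling time, which is how the growth of sequence data and of transistor counts can be compared.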
Sequence data explosions are the result of the
continuous development of new sequencing
technologies:
 Chain termination (Sanger) method (1977)
 Automation of sequence determination (late 1980s)
 Shotgun sequencing strategy (1995)
 Next-generation (NextGen) sequencing technologies (2004)
1. 454 pyrosequencing: 454 Life Sciences/Roche Diagnostics
2. Solexa sequencing: Illumina
3. SOLiD sequencing: Applied Biosystems
4. Helicos BioSciences
5. Pacific Biosciences
6. Polonator: open source
Data explosions
 Since 1995, the number of sequenced genomes has also increased
exponentially.
As of 8-19-2010 http://www.genomesonline.org
             Complete   In pipeline
Archaea            92           186
Bacteria         1135          4804
Eukaryota         133          1548
Total            1360          6538
Data explosions
 Since 2006, the number of meta-genome sequences has increased
exponentially thanks to the advent of next-generation sequencing
technologies.
 As of September 2009, about 200 meta-genomes had been sequenced or
were in the process of being sequenced.
http://www.genomesonline.org
Data explosions
 The speed of computers also increases exponentially with time.
 However, how to use these ever more powerful computers to solve
biological problems remains a very challenging task for the computer
science and biology research communities.
Data explosions
 More and more biological research relies on computational analysis.
What is genomics?
 The availability of whole genome sequences of organisms has led to
the birth of genomics, which studies organisms based on the genetic
information encoded in their genomes.
 According to the subjects of the study, genomics can be divided into:
1. Functional genomics, which is coupled with the development of
relevant high-throughput technologies, such as,
• Microarray/RNA-Seq: transcriptomics
• Mass spectrometry: Proteomics
• Nuclear magnetic resonance (NMR) and mass spectrometry:
Metabolomics
2. Comparative/evolutionary genomics
What is Bioinformatics?
 For a short answer:
“Bioinformatics is the use of computational methods to study
biological data and problems”.
 For a more detailed answer:
Bioinformatics is
1. “The development and use of computational methods for
studying the structure, function, and evolution of genes, proteins
and whole genomes;”
2. “The development and use of methods for the management and
analysis of biological information arising from genomics and
high-throughput experiments.”
Population genetics, molecular evolution and sequence
analysis
 According to evolutionary theory, biological sequences are related
to one another through heredity and variation;
 Sequence analysis methods are thus based on the principles of the
evolution of sequences.
 Therefore, to analyze sequences, we must understand
1. the dynamic changes of genes (loci) in a population of the same
species (population genetics); and
2. how gene sequences change during the course of evolution
among different species (molecular evolution).
Sequence Similarity
 The similarity of two sequences can be assessed by aligning the two
sequences using an alignment method/algorithm, such as BLAST or the
Smith-Waterman algorithm.
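To make the idea of local alignment concrete, here is a minimal sketch of Smith-Waterman scoring. It is an illustration only: the scoring values (match = 2, mismatch = -1, linear gap = -2) and the function name are assumptions chosen for simplicity, whereas real protein alignments typically use a substitution matrix and affine gap penalties. The sketch returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # shared "ACG" core
```

The floor at zero is the design choice that distinguishes local from global alignment: poorly matching flanking regions cannot drag down the score of a well-matching core.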
Two parameters to describe the similarity of two sequences
1. Identity
2. Similarity
Identities = 38/139 (27%), Similarity = 66/139 (47%), Gaps = 9/139
(6.5%)
Seq 1  LELTYIVNFGSELAVVSMLPTFFETTFDLPKATAGILASCFAFVNLVARPAGGLISDSVG
       +   Y + FG  +A  + LPT+  T +      AG   + FA   ++ARP GG +SD +
Seq 2  MSFLYAIVFGGFVAFSNYLPTYITTIYGFSTVDAGARTAGFALAAVLARPVGGWLSDRIA

Seq 1  SRKNTMGFLTAGLGVGYLVMSMIKPGTFTGTTGIAVAVVITMLASFFVQSGEGATFALVP
        R   +  L     + +       P  ++  T I +AV + +        G G  FA V
Seq 2  PRHVVLASLAGTALLAFAAALQPPPEVWSAATFITLAVCLGV--------GTGGVFAWVA

Seq 1  -LVKRRVTGQVAGLVGAYGNVG
               G V G+V A G +G
Seq 2  RRAPAASVGSVTGIVAAAGGLG
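The identity and gap counts reported with an alignment can be recomputed directly from the two aligned rows. The sketch below is illustrative (the function name `alignment_stats` is hypothetical) and is applied only to the last block of the alignment shown above. Counting "similar" positions would additionally require a substitution matrix, which is left out here.

```python
def alignment_stats(seq_a, seq_b, gap="-"):
    """Count identical columns and gap columns in a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(seq_a, seq_b))
    # A column is an identity when both residues match and neither is a gap.
    identities = sum(1 for x, y in columns if x == y and x != gap)
    # A column is a gap column when either row contains the gap symbol.
    gaps = sum(1 for x, y in columns if x == gap or y == gap)
    return {
        "length": len(columns),
        "identities": identities,
        "gaps": gaps,
        "identity_pct": 100.0 * identities / len(columns),
    }

# Last block of the alignment above.
stats = alignment_stats("-LVKRRVTGQVAGLVGAYGNVG", "RRAPAASVGSVTGIVAAAGGLG")
print(stats)
```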
Homologous Sequence
 Homology: If the similarity of two sequences is high enough, it
is highly likely that they have evolved from a common ancestor, and
we say that they are homologous to each other.
For example, if two sequences of 100 amino acids have 80% identical
residues, the probability that they share a given set of 80 identical
positions by chance is roughly (1/20)^80, on the order of 10^-104,
assuming the 20 amino acids are equally frequent.
 Homology of two sequences can only be inferred computationally;
it is difficult to test experimentally.
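The chance-similarity argument can be checked numerically. The sketch below assumes, as a simplification, that all 20 amino acids are equally frequent and that positions are independent. It computes both the probability for one fixed set of 80 matching positions and the binomial probability of at least 80 matches anywhere among 100 positions; the helper name `prob_at_least` is hypothetical.

```python
import math

def prob_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_match = 1 / 20  # assumed: uniform amino-acid frequencies

# Probability that one particular set of 80 positions all match by chance.
p_fixed = p_match ** 80

# Probability that any 80 or more of the 100 positions match by chance.
p_any = prob_at_least(100, 80, p_match)

print(f"fixed 80 positions: {p_fixed:.2e}")
print(f"at least 80 of 100: {p_any:.2e}")
```

Allowing the matches to fall at any positions raises the probability by many orders of magnitude, but it stays negligibly small, so the conclusion that such similarity reflects common ancestry is unchanged.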
Orthologs and Paralogs
There are two distinct types of homologous relationships, which differ
in their evolutionary history and functional implications.
• Orthologs: Evolutionary counterparts derived from a single ancestral
gene in the last common ancestor of the two given species. Orthologous
genes are therefore related by vertical descent and typically have the
same function.
• Paralogs: Homologous genes that evolved through duplication within the
same or an ancestral genome. Paralogous genes are therefore related by
duplication events and do not necessarily have the same function.
[Figure: gene tree with nodes labeled duplication, speciation, speciation]
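One common way to apply these definitions is to look at the event at which two genes last shared an ancestor in the gene tree: a speciation node implies orthologs, a duplication node implies paralogs. The sketch below assumes a gene tree whose internal nodes are already labeled with their event type; the `Node` class, the gene names, and the example tree are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Node:
    label: str  # gene name for leaves, event type for internal nodes
    children: List["Node"] = field(default_factory=list)

def leaves(node: Node) -> Set[str]:
    """Names of all genes below this node."""
    if not node.children:
        return {node.label}
    names: Set[str] = set()
    for child in node.children:
        names |= leaves(child)
    return names

def last_common_node(node: Node, a: str, b: str) -> Node:
    """Deepest node whose subtree contains both genes (a != b, both present)."""
    for child in node.children:
        if {a, b} <= leaves(child):
            return last_common_node(child, a, b)
    return node

def relationship(root: Node, a: str, b: str) -> str:
    event = last_common_node(root, a, b).label
    return "orthologs" if event == "speciation" else "paralogs"

# Hypothetical tree: an ancestral duplication gives copies A and B,
# then a speciation separates species 1 and 2 in each copy.
tree = Node("duplication", [
    Node("speciation", [Node("A_sp1"), Node("A_sp2")]),
    Node("speciation", [Node("B_sp1"), Node("B_sp2")]),
])

print(relationship(tree, "A_sp1", "A_sp2"))  # separated by speciation
print(relationship(tree, "A_sp1", "B_sp1"))  # separated by duplication
```

Note that under this rule two genes from different species (for example A_sp1 and B_sp2) are still paralogs if the duplication preceded the speciation.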
Divergent evolution
 When the similarity between two sequences is very low, say, 8%
identity, they could still be homologous due to divergent
evolution;
[Figure: homologues arising by speciation or duplication]
 Divergently evolved genes usually have similar biochemical functions.
Convergent evolution
 When the similarity between two sequences is very low, say, 8%,
they could be of different origins, and the observed sequence similarity
could be due to convergent evolution under functional selection during the
course of evolution. Such sequences are called analogues.
[Figure: analogues]
 Analogues may have similar biochemical functions, but they usually
share only a few amino acids, called motifs, in the active sites of
enzymes.
Horizontal gene transfer (HGT)
 During evolution, progeny obtain their genes from their ancestors
(vertical gene transfer). However, they can also obtain genes from other
species, genera, or even more distant taxa. This phenomenon is called
horizontal gene transfer or lateral gene transfer.
 HGT is pervasive, particularly in prokaryotes, and is believed
to be a major driving force of evolution.
[Figure: tree rooted at the LCA (last common ancestor) with branches for Bacteria, Archaea, and Eukaryota; labels: vertical gene transfer, horizontal gene transfer]