Alignment: pairs of sequences

Download Report

Transcript Alignment: pairs of sequences

Tutorial 2: Some problems in bioinformatics
1. Alignment
pairs of sequences
Database searching for sequences
Multiple sequence alignment
Protein classification
2. Phylogeny prediction (tree construction)
Sources:
1) "Bioinformatics: Sequence and Genome Analysis" by David W. Mount. 2001. Cold Spring
Harbor Press
2) NCBI tutorial http://www.ncbi.nlm.nih.gov/Education/ and
http://www.ncbi.nih.gov/BLAST/tutorial/Altschul-1.html
3) Brian Fristensky. Univ. of Manitoba
http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics
Alignment: pairs of sequences
DNA: A, G, C, T
TCGCA
|| ||
TC-CA
protein: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
KQTGKG
| |||
KSAGKG
DNA to RNA to protein to phenotype
DNA to RNA to protein to phenotype
Alignment: pairs of sequences
Concepts:
Similarity
Identity
Homology
Orthology
Paralog
KQTGKGV
| |||:
KSAGKGL
4/7 identical
5/7 similar
Homology is based on evolutionary history
Figure 45 Lineage-specific expansions of domains and architectures of transcription factors. Top,
specific families of transcription factors that have been expanded in each of the proteomes.
Approximate numbers of domains identified in each of the (nearly) complete proteomes representing
the lineages are shown next to the domains, and some of the most common architectures are shown.
Some are shared by different animal lineages; others are lineage-specific.
A partial alignment of globin sequences.
Proteins with very little identity (10% or less) can be recognized
as sharing a common domain if they match a pattern.
Homology, orthology and paralogy
orthologs diverged at a
speciation event
paralogs diverged at a
gene duplication event
- Fitch, W.M. 2001.
Homology: A personal view of some of the problems.
Trends Genet. 16: 227-231.
Alignment: pairs of sequences
Scoring schemes
Score = matches - mismatches - gaps
What is the best way to evaluate the contribution of each?
GKG-RRWDAKR
||| ||
GKGAKRWESAP
A partial alignment of globin sequences from Pfam.
Proteins with very little identity (10% or less) can be recognized
as sharing a common domain if they match a pattern.
Alignment: pairs of sequences
Global vs. local alignment.
(end gaps are ignored in local alignment)
Dynamic programming
TCGCA
|| ||
TC-CA
Brian Fristensky. Univ. of Manitoba
http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html
Dynamic programming
Brian Fristensky. Univ. of Manitoba
http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html
Dynamic programming
Brian Fristensky. Univ. of Manitoba
http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html
Dynamic programming
Brian Fristensky. Univ. of Manitoba
http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html
Alignment: pairs of sequences
Scoring schemes
Score = matches - mismatches - gaps
"The dynamic programming algorithm was improved in performance by Gotoh (1982) by
using the linear relationship for a gap weight wx = g + rx, where the weight for a gap
of length x is the sum of a gap opening penalty (g) and a gap extension penalty (r) times
the gap length (x), and by simplifying the dynamic programming algorithm."
D. W. Mount
GKG-RRWDAKR
||| ||
GKGAKRWESAP
VS.
KQTGKG-RRWDAKR
| |||
|||
KSAGKG-----AKR
Alignment: amino acid substitution matrices
Scoring schemes
"Any [scoring] matrix has an implicit amino acid pair frequency distribution that
characterizes the alignments it is optimized for finding. More precisely, let pi be the
frequency with which amino acid i occurs in protein sequences and let qij be the freqeuncy
with which amino acids i and j are aligned within the class of alignments sought. Then,
the scores that best distinguish these alignments from chance are given by the formula:
Sij = log (qij / pipj)
The base of the logarithm is arbitrary, affecting only the scale of the scores. Any set of
scores useful for local alignment can be written in this form, so a choice of substitution
matrices can be viewed as an implicit choice of 'target frequencies'"
- Altschul et al. 1994 (Nature Genetics 6:119)
Those frequencies are characteristic of the sequences being aligned, and are primarily a
function of their degree of divergence.
Alignment: amino acid substitution matrices
Substitution matrices -- BLOSUM 62
Henikoff and Henikoff. 1992.
Amino acid substitution matrices from protein blocks.
PNAS 89: 10915-10919.
Alignment: amino acid substitution matrices
Substitution matrices -- BLOSUM 62
Alignment: implementations
Fasta
Introduces the concept of k-tuple perfects alignment
to seed longer global alignments.
BLAST -- Basic Local Alignment Search Tool
Initiates an alignment locally and then extends that
alignment.
GKG
|||
GKG
GKG-RRW
||| ||
GKGAKRW
Alignment:
Searching databases for sequences
There are many modifications of BLAST for specific purposes.
The NCBI BLAST interface
The NCBI BLAST interface
Extreme value distribution
the expected distribution of the maximum of many
independent random variables, generally
Y = exp [-x -e-x ]
K and lambda are statistical
parameters dependent upon the
scoring system and the
background amino acid
frequencies of the sequences
being compared. While FASTA
estimates these parameters from
the scores generated by actual
database searches, BLAST
estimates them beforehand for
specific scoring schemes by
comparing many random
sequences generated using a
standard protein amino acid
composition [12].
Fasta can be run at EMBL.
The software is also available for download.
Alignment: Multiple sequence alignment
Alignment: Protein classification
Phylogeny prediction (tree construction)
root
Phylogeny prediction (tree construction)
Character-based Methods
Parsimony
Maximum Likelihood
tree that maximizes the likelihood of seeing the data
Bayesian Analysis
trees with greatest likelihoods given the data
Distance Methods
Unweighted Gap-pair method with Arithmetic Means
Neighbor joining
Within- and
between-species
variation along a
single
chromosome.
a,The interspecies relationships of five chromosome regions to corresponding DNA sequences
in a chimpanzee and a gorilla. Most regions show humans to be most closely related to
chimpanzees (red) whereas a few regions show other relationships (green and blue).
b, The among-human relationships of the same regions are illustrated schematically for five
individual chromosomes.
Tutorial III:
Open problems in bioinformatics
Tentatively:
Detection of subtle signals
promoter elements
exon splicing enhancers
noncoding RNAs
weak protein similarities
Microarrays
Protein folding and homology modeling
Thursday, June 10, 2:00 - 3:45
Microarray expression data
Statistical analysis -- what has changed
Clustering -- which genes change together
Clustering -- promoter recognition
Clustering -- database integration
Phenotype determination (e.g. cancer prognosis)
Tutorial 2: Some problems in bioinformatics
1. Alignment
pairs of sequences
Multiple sequence alignment
Database searching for sequences
Protein classification
2. Phylogeny prediction (tree construction)
3. microarray expression data
4. Protein structure
Protein folding
Structure prediction
Homology modeling
Sources:
1) "Bioinformatics: Sequence and Genome Analysis" by David W.
Mount. 2001. Cold Spring Harbor Press
2) NCBI tutorial http://www.ncbi.nlm.nih.gov/Education/
3) Cold Spring Harbor course in Computational Genomics (1999) Pearson