Bioinformatics in the Mathematics Curriculum
Download
Report
Transcript Bioinformatics in the Mathematics Curriculum
Bioinformatics in the
Mathematics Curriculum
Jennifer R. Galovich
MAA Short Course
Mathfest 2007
San Jose, CA
HELP!!!!
Need judges for Math/Bio student talks
(Janet Andersen prize)
Saturday afternoon (talks are at Fairmont)
What is Bioinformatics?
Mathematics
Molecular Biology
Computer Science
Outline
• Level I: Algorithms for sequence (DNA,
RNA, amino acid) alignment
• Level II: Getting a better handle on
microarray data
• Level III: RNA secondary structure
•
College of St. Benedict/St. John’s University
I.
Algorithms for
Sequence Alignment
Algorithms for Sequence Alignment
– Biological Context –
•
Similar sequences
•
Explore frequently occurring patterns to identify important
functional motifs
•
Starting point for phylogenetic analysis to measure variation
between species and among populations:
Similarity of sequences
Evolutionary conservation
similar structure/function.
Algorithms for Sequence Alignment
– Mathematical Learning Goals –
• Concept of an algorithm
• Matrix language and notation
• Recursions and dynamic programming
Potential Audience
• Mathematics for Liberal Arts
• Mathematics for Allied Health Professions
• Freshman Seminar
Global vs Local
• The global alignment problem: Measure
the similarity between two sequences
considered in their entirety.
[Needleman-Wunsch]
• The local alignment problem: Identify
strongly similar subsequences (and ignore
the rest)
[Smith-Waterman]
Sequences differ because of mutations
occurring over the course of evolution.
Three types of mutation:
• Substitution of one base by another
• Insertion of one or more bases
• Deletion of one or more bases
Insertion of base X into S gives S*
Deletion of base X from S* gives S
“Indel”
Scoring functions
• Reward matches
• Penalize mismatches
• Penalize indels (gap penalty)
X
X
Match
+1
X
Y
Mismatch
-1
_
X
or
Indel
-2
X
Pairwise Alignment
• Let S and T be (DNA) sequences. Insert
spaces (gaps) into S and/or T so that S
and T are the same length.
• Score the alignment according to a
previously constructed scoring function
• Reinsert gaps, as needed, to find the
maximum score alignment
Example
S:
T:
A T C T G A T
T G C A T A
A T C T G A
T G C _ A T
-1
-1
+1 -2
Can you do better?
-1
-1
T
A
-1 = -6
Naïve Sequence Alignment
If S has length n and T has length m then there are
n m
= way too many
n
possible alignments.
A better way….
Needleman-Wunsch (1970)
Definition: The ith prefix of a sequence is the
subsequence consisting of the first i letters
N-W solves the alignment problem by
constructing a DP (dynamic programming)
matrix A where A(i,j) gives the score of
the best alignment between the ith prefix
of S and the jth prefix of T, keeping track
of the best alignment as it builds.
A(i,j) = max
A
T
C
T
G
A
T
0
-2
-4
-6
-8
-10
-12
-14
A(i 1, j ) 2
A(i, j 1) 2
A(i 1, j 1) 1
Align Si with a gap -- vertical
Align Tj with a gap -- horizontal
Si /Tj match/mismatch -- diagonal
T
G
C
A
T
A
-2
-1
-1
-4
-3
-2
-6
-5
-8
-5
-10
-7
-12
-9
ETC
A
T
C
T
G
A
T
0
-2
-4
-6
-8
-10
-12
-14
S
T
Score
T
-2
-1
-1
-3
-5
-7
-9
-11
G
-4
-3
-2
-2
-4
-4
-6
-8
C
-6
-5
-4
-1
-3
-5
-5
-7
A
-8
-5
-6
-3
-2
-4
-4
-6
T
-10
-7
-4
-5
-2
-3
-5
-3
A T C T G A T
T G C A T A --1 -1 +1 -1 -1 +1 -2 = -4
A
-12
-9
-6
-5
-4
-3
-2
-4
Your turn!
Align
S: AAAC
T: AGC
1. Make a best guess
2. Use N-W to check
0
A
-2
A
-4
A
-6
C
-8
A
G
C
-2
-4
-6
A
G
C
0
-2
-4
-6
A
-2
1
-1
3
A
-4
-1
0
-2
A
-6
-3
-2
-1
C
-8
-5
-4
-1
--AGC
AAAC
A – GC
A A AC
A G--C
A AAC
Local alignment:
Smith-Waterman
Same idea, but adapt scoring function
to ignore negative scores:
A(i,j) = max
A(i 1, j ) 2
A(i, j 1) 2
A(i 1, j 1) 1
0
Align Si with a gap -- vertical
Align Tj with a gap -- horizontal
Si /Tj match/mismatch – diagonal
Start over
Example
T
A
T
C
T
G
A
T
0
0
0
0
0
0
0
0
G
0
0
1
0
1
0
0
1
C
0
0
0
0
0
2
0
0
TG
TG
A
0
0
0
1
0
0
1
0
or
AT
AT
T
0
1
0
0
0
0
1
0
A
0
0
2
0
1
0
0
2
0
1
0
1
0
0
1
0
A Riff on Gaps
So far, our gaps have been linear, but they could be…
• Affine (to penalize opening a gap differently from
extending of a gap)
• Your favorite increasing concave down function of the
length of the gap
P. Higgs and T. Attwood,
Bioinformatics and Molecular Evolution, p. 122.
Other variations on the theme
I. Multiple sequence alignment – align many
sequences in order to uncover regions
conserved across an entire group.
Progressive alignment:
All pairwise alignments
Distance matrix
Cluster diagram (guide tree)
Align clusters to form larger clusters
II. Align proteins (sequences of amino acides)
Scoring matrix for DNA is 4 X 4.
Scoring matrices for amino acids are 20 x 20,
e.g.
PAM (Point Accepted Mutation) and
BLOSUM (BLocks SUbstitution Matrices)
Both based on estimates of probabilities of
substitution of one amino acid for another,
using different data bases
Software – CLUSTALX (via MEGA)
Example: Compare mitochondrial DNA
sequences of primates
(Human, chimp, gorilla, orangutan, gibbon)
Data from
Brown et al. J. Molecular Evolution 18 (1982) 225 – 239.
Resulting phylogeny
Pan troglodytes V00672
(Chimpanzee)
Gorilla V00658
Hylobates lar V00659
(Gibbon)
P pygmaeus V00675
(Orangutan)
Homo sapiens D38112
II.
Managing Microarray
Data
Managing Microarray Data
– Biological Context –
Goal: Measure (simultaneously) the level of
expression of the genes in a cell (by
measuring concentration of mRNA)
Applications:
• Compare mRNA levels in different types of
cells
• Characterize different types of cancer
• ETC
Managing Microarray Data
– Mathematical Learning Goals–
• Matrix operations
• Reinforce theorems about and properties
of eigenvalues and eigenvectors
• Diagonalization of a symmetric matrix
What is a microarray?
• “A” microarray experiment is actually the
same experiment performed on many
genes or proteins at the same time, hence
LOTS of data to examine for trends and
other features.
• Physically, a microarray is a slide onto
which a rectangular array of spots of DN A
sequences ( aka “probes”) have been
deposited.
How to make a microarray
http://www.accessexcellence.org/RC/VL/GG/microArray.html
Microarray
Matrix
• Compute R = log2(red/green intensity ratios)
• Compare arrays for many samples (time points,
organisms, tumors,….) with all intensities computed
relative to the same reference.
• Produce an p x N matrix (p indexes the genes, N
indexes the samples) where
R < 0 : gene is down-regulated in test (red) sample compared to
reference (green) sample
R = 0: gene is equally expressed in both samples
R > 0: gene is up-regulated in test (red) sample compared to
reference (green) sample
Example
http://media.pearsoncmg.com/bc/bc_campbell_ge
nomics_2/medialib/web_art/Web_Art_Ch_6.pdf
Problem
Somehow -- Find patterns of expression in
what is typically something like thousands
of genes expressed in tens or hundreds of
different tumor cells types – the p x N
matrix.
Linear Algebra to the Rescue!!
• Goal – Engineer a projection of the highdimensional data space onto a lower
dimensional space – that is, find a point of view
from which to observe the higher dimensional
space that captures as much of the variability in
the data as possible, and ignores the “noise”.
Principal Component Analysis
The Algorithm
• Center the p x N matrix [X1 XN]:
Let M = 1 (X1 ++ XN),
N
let B = (X1-M ++ XN-M), and
let S =
1
N 1
BBT
• Diagonalize the p x p covariance matrix S.
• Since S is positive semi-definite,the
eigenvalues 1, …, p are non-negative.
• Order the eigenvalues of S in decreasing order
and let u1,…,up be the corresponding (unit)
eigenvectors. Let P = [u1,…,up]
• Define the change of variable Y = PX. Then the
variance of y1 is maximized. Thus y1 is the first
principal component; Similarly, y2, the second
principal component, is orthogonal to y1 and
maximizes the remaining variance, etc.
Bottom Line
Instead of trying to understand the data in
a p-dimensional space, reduce the
dimensionality of the data space by
choosing as many of the principal
components yi as are needed to account
for as much of the variance as desired.
Payoffs
• Identify the genes with the largest
(absolute) coefficients in the principal
components to give some biological
interpretation to the components
• Use this biological interpretation to assist
in classifying the samples
• Plot the data with respect to the principal
components to visualize clusters
Crescenzi and Giuliani, FEBS Letters 507 (2001)
Higgs and Attwood, Bioinformatics and Molecular Evolution
Extensions…
• Singular Value Decomposition (Alter et al.)
• Clustering methods – hierarchical and
otherwise, including gene shaving (Hastie, et
al.)
• Machine learning, e.g. support vector machines.
(Moore)
III.
Combinatorics
and
RNA Folding
Combinatorics and RNA Folding
– Biological Context –
Crick’s Central Dogma
DNA
RNA
Proteins
B. Subtilis RNase P RNA
GUUCUUAACGUUCGGGUAAUCGCUGCAGAUCUUGA
AUCUGUAGAGGAAAGUCCAUGCUCGCACGGUGCUG
AGAUGCCCGUAGUGUUCGUGCCUAGCGAAGUCAUA
AGCUAGGGCAGUCUUUAGAGGCUGACGGCAGGAAA
AAAGCCUACGUCUUCGGAUAUGGCUGAGUAUCCUU
GAAAGUGCCACAGUGACGAAGUCUCACUAGAAAUG
GUGAGAGUGGAACGCGGUAAACCCCUCGAGCGAGA
AACCCAAAUUUUGGUAGGGGAACCUUCUUAACGGA
AUUCAACGGAGAGAAGGACAGAAUGCUUUCUGUAG
AUAGAUGAUUGCCGCCUGAGUACGAGGUGAUGAGC
CGUUUGCAGUACGAUGGAACAAAACAUGGCUUACA
GAACG UUAGACCACU
http://www.bioinfo.rpi.edu/~zukerm/lectures/RNAfold-html/img24.gif
B. Subtilis RNase P RNA
http://www.pharmazie.uni-marburg.de/pharmchem/akhartmann/bilder/rnase_p_bsubtilis.gif
Folding
Structure
Function
Challenge:
Predict/describe RNA secondary structure
Combinatorics and RNA Folding
– Mathematical Learning Goals –
Exploration of various topics in
combinatorics:
• Catalan numbers
• Binary trees
• Non-crossing set partitions
• 3-2-1 avoiding permutations
• Permutation statistics and properties of
their distributions
RNA secondary structure as a combinatorial object
(certain non-crossing set partitions)
Definition: A secondary structure on
{1, 2, ..., n} is a simple graph (i.e., a set of unordered pairs of
elements of {1, 2, ..., n}) such that
(i) the degree of every vertex is at most 1
(ii) if (i, j) is an edge, then |i - j| >1
(iii) if i < h < j < k and (i, j) is an edge, then
(h, k) cannot be an edge.
NO…
NO…
NO…
YES!!!
Is there a formula for s(n, k), the number of
secondary structures on {1, 2, ..., n} which have
exactly k edges?
Yes.
Do we know what it is?
Yes!!
Theorem: (Schmitt and Waterman, 1994)
The number of secondary structures on {1, 2, ..., n} which
have exactly k edges (bonds) is
n
k
1
n
k
1
1
nk 1
n 2k k
Proof idea:
Secondary structures on {1, 2, ..., n}
with k edges
(Unlabelled) ordered trees
with n – k + 1 vertices and n - 2k leaves
So…Count the ordered trees.
The S-W Bijection by example:
n = 13 k = 4
0
1
2
14
13
12
7
9
11
8
3
4
6
10
5
1
2
3
4
5
6
7
8
9 10 11 12 13
A Permutation Model: Exploiting
the Catalan connection
• Recall the Catalan number:
1 2n
Cn
n 1 n
and that MANY sets of combinatorial objects
are counted by Cn:
Non-crossing set partitions of {1,2,...,n)
Ordered trees on n+1 vertices
and...a host* of others
*Current number of combinatorial interpretations of Cn: 135. See Richard Stanley’s
webpage for the latest additions.
http://www-math.mit.edu/~rstan/ec/
3-2-1 Avoidance
Definition: A 3-2-1 avoiding permutation is a
permutation with no decreasing substring of
length 3.
Example:
1 4 6 2 5 3 is not 3-2-1 avoiding.
1 4 6 2 3 5 is 3-2-1 avoiding
Important Fact:
3-2-1 avoiding permutations on n letters are counted by Cn.
Non-crossing
set partitions
3-2-1 avoiding
permutations
RNA
secondary
structures
???
Some Definitions
Let π1 π2... πn be a permutation.
πk is an excedance if k < πk
πk is a strict non-excedance if k > πk
a pair (πi, πj) is an inversion in π if i < j but πi > πj.
π has a descent at position i if πi > πi+1
Example:
k: 1 2 3 4 5 6 7
πk : 4 7 3 2 6 5 1
Let Πn be the set of all 3-2-1 avoiding permutations such that:
(i) If position i has a strict non-excedance, position i+1 does not.
(ii) If c is a strict non-excedance, then c+1 is not.
(iii) Every strict non-excedance is the second element of at least
two inversion pairs.
Let Πn,k be the set of all permutations in Πn which have
exactly k strict non-excedances.
Example: n = 13; k = 4
k: 1 2 3 4 5 6 7 8 9 10 11 12 13
πk : 2 4 5 1 7 3 8 9 11 6 12 10 13
Main Theorem
Let SSn,k be the set of all RNA
secondary structures with k bonds.
Then there is a bijection from SSn,k
to Πn,k.
n = 13, k = 4
1
2
3 4
5
6 7 8
9 10 11 12 13
2 4 5 1 7 3 8 9 11 6 12 10 13
Permutation Statistics
exc(π) = number of excedances in π
inv(π) = number of inversions in π
maj(π) = sum of the descent positions in π
Example:
k=4
exc(π) = 8
inv(π) = 12
maj(π) = 28
1 2 3
4
5
6 7 8
9 10 11 12 13
2 4 5
1
7
3 8 9 11
6 12 10 13
Two RNA SS Statistics
Tau: Let vi be the number of unpaired bases
internal to bond i. Then we define
τ(s) =
i vi
Bond Index : B(s) = sum of the positions
corresponding to left or right bonds.
τ(s) = (4 + 2 + 1 + 1) = 8
1
B(s) = (1+2+4+6+7+9+11 +12) = 52
2 3
4
5
6 7
8
9 10 11 12 13
Our example: n = 13, k = 4
s:
1
π:
2
3
4
5
6 7
8
9
10 11 12 13
2 4 5 1 7 3 8 9 11 6 12 10 13
τ(s) = 8
B(s) = 52
inv(π) = 12
maj(π) = 28
Theorem:
(1) inv = τ + k
(2) B = 2 (maj + k) - inv
Distributions
Fact: The statistics inv and maj are equidistributed
on the set of all permutations, and are both
symmetric and unimodal.
What about B and τ?
Fact: B is symmetric on SSn,k and is unimodal
when k = 1 (boring)
Conjectures: B is unimodal for any value of k.
τ is unimodal, but not symmetric.
Future directions
• Prove the unimodality conjectures
• Use the B and τ statistics to evaluate folding
algorithms or find potential novel RNA structures
or to distinguish among various types of small
RNAs
• Investigate: Is there a useful connection between
these statistics and RNA function?
• Use permutation statistics to describe/identify
RNA motifs
• Look at B statistic on experimentally verified
structures
In summary….
• Bioinformatics tools can be introduced at all
levels of the mathematics curriculum to
reinforce standard content.
• You don’t need to be an expert (in biology or
computer science) to do this!
Opportunity!!!
Invited paper session
“Mathematical Questions in Bioinformatics”
Friday, 2 - 4