Bioinformatics problems I
Download
Report
Transcript Bioinformatics problems I
BNFO 602
Lecture 2
Usman Roshan
Bioinformatics problems
• Sequence alignment: oldest and still
actively studied
• Genome-wide association studies: new
problem, great potential for
personalized medicine and personal
genomics
• Phylogenetics: understanding
evolutionary histories
Pairwise sequence alignment
• How to align two sequences?
Pairwise alignment
• How to align two sequences?
• We use dynamic programming
• Treat DNA sequences as strings over the
alphabet {A, C, G, T}
Pairwise alignment
Dynamic programming
Define V(i,j) to be the optimal pairwise alignment
score between S1..i and T1..j (|S|=m, |T|=n)
Dynamic programming
Define V(i,j) to be the optimal pairwise alignment
score between S1..i and T1..j (|S|=m, |T|=n)
Time and space complexity is O(mn)
Dynamic programming
Animation slides by Elizabeth Thomas in
Cold Spring Harbor Labs (CSHL)
http://meetings.cshl.org/tgac/tgac/flash/DynamicProgramming.swf
How do we pick gap
parameters?
Structural alignments
• Recall that proteins have 3-D structure.
Structural alignment - example
1
Alignment of thioredoxins from
human and fly taken from the
Wikipedia website. This protein
is found in nearly all organisms
and is essential for mammals.
PDB ids are 3TRX and 1XWC.
Structural alignment - example
2
Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html
Unaligned proteins.
2bbm and 1top are
proteins from fly and
chicken respectively.
Computer generated
aligned proteins
Structural alignments
• We can produce high quality manual
alignments by hand if the structure is
available.
• These alignments can then serve as a
benchmark to train gap parameters so
that the alignment program produces
correct alignments.
Benchmark alignments
• Protein alignment benchmarks
– BAliBASE, SABMARK, PREFAB,
HOMSTRAD are frequently used in studies
for protein alignment.
– Proteins benchmarks are generally large
and have been in the research community
for sometime now.
– BAliBASE 3.0
Biologically realistic scoring matrices
• PAM and BLOSUM are most popular
• PAM was developed by Margaret
Dayhoff and co-workers in 1978 by
examining 1572 mutations between 71
families of closely related proteins
• BLOSUM is more recent and computed
from blocks of sequences with sufficient
similarity
PAM
• We need to compute the probability transition
matrix M which defines the probability of
amino acid i converting to j
• Examine a set of closely related sequences
which are easy to align---for PAM 1572
mutations between 71 families
• Compute probabilities of change and
background probabilities by simple counting
Genome wide association
studies
Application of SNPs:
association with disease
• Experimental design to detect cancer
associated SNPs:
– Pick random humans with and without
cancer (say breast cancer)
– Perform SNP genotyping
– Look for associated SNPs
– Also called genome-wide association study
Case-control example
• Study of 100 people:
– Case: 50 subjects with
cancer
– Control: 50 subjects without
cancer
• Count number of alleles and
form a contingency table
#Allele1
#Allele2
Case
10
90
Control
2
98
Odds ratio
• Odds of allele 1 in
cancer = a/b = e
• Odds of allele 1 in
healthy = c/d = f
• Odds ratio of recessive
in cancer vs healthy =
e/f
#Allele1
#Allele2
Cancer
a
b
Healthy
c
d
Risk ratio (Relative risk)
• Probability of allele 1 in
cancer = a/(a+b) = e
• Probability of allele 2 in
healthy = c/(c+d) = f
• Risk ratio of recessive
in cancer vs healthy =
e/f
#Allele1
#Allele2
Cancer
a
b
No cancer
c
d
Odds ratio vs Risk ratio
• Risk ratio has a natural interpretation
since it is based on probabilities
• In a case-control model we cannot
calculate the probability of cancer given
recessive allele. Subjects are chosen
based disease status and not allele type
• Odds ratio shows up in logistic
regression models
Example
• Odds of allele 1 in case =
15/35
• Odds of allele 1 in control =
2/48
• Odds ratio of allele 1 in case
vs control = (15/35)/(2/48) =
10.3
• Risk of allele 1 in case =
15/50
• Risk of allele 2 in control =
2/50
• Risk ratio of allele 1 in case
vs control = 15/2 = 7.5
#Allele1
#Allele2
Case
15
35
Control
2
48
Odds ratios in genome-wide
association studies
• Higher odds ratio means stronger
association
• Therefore SNPs with highest odds
ratios should be used as predictors or
risk estimators of disease
• Odds ratio generally higher than risk
ratio
• Both are similar when small
Statistical test of association
(P-values)
• P-value = probability of the observed data (or
worse) under the null hypothesis
• Example:
– Suppose we are given a series of co in-tosses
– We feel that a biased coin produced the tosses
– We can ask the following question: what is the probability
that a fair coin produced the tosses?
– If this probability is very small then we can say there is a
small chance that a fair coin produced the observed tosses.
– In this example the null hypothesis is the fair coin and the
alternative hypothesis is the biased coin
Effect of population structure
on genome-wide association
studies
• Suppose our sample is drawn from a
population of two groups, I and II
• Assume that group I has a majority of allele
type I and group II has mostly the second
allele.
• Further assume that most case subjects
belong to group I and most control to group II
• This leads to the false association that the
major allele is associated with the disease
Effect of population structure
on genome-wide association
studies
• We can correct this effect if case and
control are equally sampled from all
sub-populations
• To do this we need to know the
population structure
Population structure prediction
• Treated as an unsupervised learning
problem (i.e. clustering)