Transcript Slide 1
Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Lecture 6
Sequence Analysis Basics
MBP1010
Dr. Paul C. Boutros
Winter 2014
DEPARTMENT OF
MEDICAL BIOPHYSICS
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
This workshop includes material
originally developed by Drs. Raphael Gottardo,
Sohrab Shah, Boris Steipe and others
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)
Lecture 6: Sequence Analysis Basics
bioinformatics.ca
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
Topics For This Week
• Statistics Review
• Examples
• Attendance
• Sequence Analysis: BLAST & Friends
• Sequence Analysis: Next-Generation Sequencing
Review From Lecture #1 & 2
What are good times/reasons to use a spreadsheet?
Quick data-visualization, especially of direct
tabular data. Calculator-like work. EDA.
How do you decide what your null hypothesis is?
Based on the domain knowledge?
Explain what a p-value is to your grandmother.
Evidence against the null; loosely, the probability of a FP;
the probability of seeing a value at least as extreme
by chance alone if the null is true
Review From Lecture #2
When is it right to use a parametric test?
Parametric tests have distributional assumptions
What is the t-statistic?
Signal:Noise ratio
How do we test the assumptions of the t-test?
Data sampled from normal distribution;
independence of replicates; independence of
groups; homoscedasticity
Flow-Chart For Two-Sample Tests
Is Data Sampled From a Normally-Distributed Population?
• Yes: Equal Variance (F-Test)?
  • Yes: Homoscedastic T-Test
  • No: Heteroscedastic T-Test
• No: Sufficient n for CLT (>30)?
  • Yes: continue to the Equal Variance (F-Test) check
  • No: Wilcoxon U-Test
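The flow chart above can be sketched as a small decision function. This is only an illustration: the function name and arguments are hypothetical, and it assumes that a "Yes" at the CLT step leads on to the equal-variance check.

```python
def choose_two_sample_test(is_normal: bool, equal_variance: bool, n_per_group: int) -> str:
    """Walk the two-sample flow chart and return the name of the suggested test."""
    # Non-normal data with too few samples for the CLT (n must exceed 30):
    # fall back to the rank-based test.
    if not is_normal and n_per_group <= 30:
        return "Wilcoxon U-test"
    # Otherwise pick the t-test variant from the equal-variance (F-test) result.
    if equal_variance:
        return "homoscedastic t-test"
    return "heteroscedastic t-test"


print(choose_two_sample_test(is_normal=True, equal_variance=False, n_per_group=6))
```

In practice the two boolean inputs would themselves come from diagnostic tests or plots; here they are simply passed in.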
Review From Lecture #3
What is statistical power?
Probability a test will correctly reject a false null
AKA sensitivity or 1 - false-negative rate
What does power depend on?
Significance threshold (α), effect-size and sample-size
When do you use correlations vs. hypothesis-testing?
The question is mal-posed, correlations test a
specific hypothesis: that two variables are
non-randomly associated with one another.
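To make the power question concrete, here is a rough sketch of how power moves with the threshold used to call a result significant (alpha), the effect size and the sample size. It assumes a two-sided, two-sample z-test with equal group sizes and a standardized effect size (mean difference divided by a common SD); `approx_power` is an illustrative name, not a function from the course material.

```python
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test (normal approximation)."""
    z = NormalDist()  # standard normal
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true standardized effect is effect_size.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability of landing beyond either critical value.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


for n in (10, 20, 64):
    print(n, round(approx_power(0.5, n), 3))
```

With an effect size of zero the function returns alpha itself, which is the false-positive rate rather than power in any useful sense.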
Review From Lecture #3
• Hypergeometric test
• Is a sample randomly selected from a fixed population?
• Proportion test
• Are two proportions equivalent?
• Fisher’s Exact test
• Are two binary classifications associated?
• (Pearson’s) Chi-Squared Test
• Are paired observations on two variables independent?
Review From Lecture #4
Assumptions of linear-modeling
• One variable is a response and one a predictor
• No adjustment is needed for confounding or
other between-subject variation
• Linearity
• σ² is constant, independent of x
• Predictors are independent of each other
• For proper statistical inference (CI, p-values),
errors are normally distributed
Review From Lecture #4
How do we assess the adequacy of a model?
By considering the size of the residuals (R²)
How can we test the quality of a model?
Residual plots; qq plots; prediction accuracy
Compare a one-way ANOVA to a logistic regression
Linear model where x is factorial vs. one
where y is factorial
Lots of Analyses Are Linear Regressions
Y = a0 + a1x1, x1 continuous: Linear Regression
Y = a0 + a1x1, Y factorial: Logistic Regression
Y = a0 + a1x1, x1 factorial: 1-way ANOVA
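A small sketch of why a two-level 1-way ANOVA is the same model as a regression: dummy-code the factor as 0/1 and fit Y = a0 + a1x1 by ordinary least squares. The tumour volumes are the TS and wild-type values from Example #1 in this lecture; `fit_line` is a hypothetical helper written for this illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for Y = a0 + a1*x1; returns (a0, a1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a0, a1


# Tumour volumes (cm^3) from Example #1.
wildtype = [1.1, 1.5, 2.1, 2.5, 0.3, 2.2]
ts = [3.9, 7.1, 3.1, 4.4, 5.0]

# Dummy-code the two-level factor: 0 = wild-type, 1 = TS.
x1 = [0] * len(wildtype) + [1] * len(ts)
y = wildtype + ts

a0, a1 = fit_line(x1, y)
print("intercept:", round(a0, 3), "slope:", round(a1, 3))
```

The intercept equals the wild-type mean and the slope equals the difference between the group means, which is exactly the quantity a two-group ANOVA tests.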
Including All ANOVAs
Y = a0 + a1x1, x1 factorial: 1-way ANOVA
Y = a0 + a1x1 + a2x2 + a3x1x2, x1 and x2 two-level factors: 2-way ANOVA
Example #1
You are conducting a study of osteosarcomas using mouse models. You
are using a strain of mice naturally susceptible to these tumours at ~20%
penetrance. You are studying two transgenic lines, one with deletion of a
tumour suppressor (TS), the other with amplification of an oncogene
(OG). Tumour penetrance in these is 100%.
Your hypothesis: tumours in wild-type mice are smaller than tumours in
transgenic mice. Your data:
TS (cm³)    OG (cm³)    Wildtype (cm³)
3.9         5.2         1.1
7.1         1.9         1.5
3.1         5.0         2.1
4.4         6.1         2.5
5.0         4.5         0.3
            4.8         2.2
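One possible starting point for Example #1 is the t-statistic as a signal:noise ratio. The sketch below computes the unequal-variance (Welch) form for each transgenic line against wild-type; choosing Welch over the pooled form is an assumption, and `welch_t` is an illustrative name. Turning the statistic into a p-value needs the t-distribution, which is left out here.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic: difference in means (signal) over its standard error (noise)."""
    signal = mean(a) - mean(b)
    noise = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return signal / noise


# Tumour volumes (cm^3) from the Example #1 table.
ts = [3.9, 7.1, 3.1, 4.4, 5.0]
og = [5.2, 1.9, 5.0, 6.1, 4.5, 4.8]
wildtype = [1.1, 1.5, 2.1, 2.5, 0.3, 2.2]

t_ts = welch_t(ts, wildtype)
t_og = welch_t(og, wildtype)
print("TS vs wild-type:", round(t_ts, 2))
print("OG vs wild-type:", round(t_og, 2))
```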
Example #2
You are conducting a study of osteosarcomas using mouse models. You
are using a strain of mice naturally susceptible to these tumours at ~20%
penetrance. You are studying two transgenic lines, one with deletion of a
tumour suppressor (TS), the other with amplification of an oncogene
(OG). Tumour penetrance in these is 100%. To generalize your results,
you test a less tumour-prone strain as well.
Your hypothesis: animals from either transgenic are more tumour-prone
than wild-type animals. Your data is weeks to tumour-formation (by
manual palpation or surgery) or X for no tumour.
TS (weeks)  OG (weeks)  Wildtype (weeks)
4           6           X
5           2           X
5           X           X
X           X           X
4           4           3
            3           3
                        4
Example #3
You are conducting a study of osteosarcomas using mouse models. You
are using a strain of mice naturally susceptible to these tumours at ~20%
penetrance. You are studying two transgenic lines, one with deletion of a
tumour suppressor (TS), the other with amplification of an oncogene
(OG). Tumour penetrance in these is 100%. To generalize your results,
you test a less tumour-prone strain as well.
Your hypothesis: OG animals are more likely to respond to a novel
therapeutic (NT) as assessed by a pathologist, than TS or wildtype mice.
TS (response)  OG (response)  Wildtype (response)
Yes            Yes            No
Yes            No             No
No             No             No
No             Yes            No
No             Yes            Yes
               Yes            No
                              No
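One way to approach Example #3 is Fisher's exact test on a 2x2 table of line (OG vs. everything else) by response. The sketch below pools TS and wild-type into one comparison group, which is an analysis choice made for illustration rather than something the example prescribes; `fisher_one_sided` is a hypothetical helper built on the hypergeometric tail.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(top-left cell >= a) with all margins held fixed.
    """
    row1 = a + b           # size of the first group
    col1 = a + c           # total number of responders
    total = a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1))
    return tail / comb(total, row1)


# Counts tallied from the Example #3 table:
# OG: 4 Yes / 2 No; TS and wild-type pooled: 3 Yes / 9 No.
p = fisher_one_sided(4, 2, 3, 9)
print("one-sided p:", round(p, 4))
```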
Example #4
You are conducting a study of osteosarcomas using mouse models. You
are using a strain of mice naturally susceptible to these tumours at ~20%
penetrance. You are studying two transgenic lines, one with deletion of a
tumour suppressor (TS), the other with amplification of an oncogene
(OG). Tumour penetrance in these is 100%. To generalize your results,
you test a less tumour-prone strain as well.
Your hypothesis: You now realize that you are seeing two different
subtypes of osteosarcoma (AD and SQ), and that these are prevalent at
different frequencies across the lines. You suspect tumour-size differs
between lines, but is confounded by this histology difference. Your data:
TS (cm³)    OG (cm³)    Wildtype (cm³)
3.9 (AD)    5.2 (SQ)    1.1 (AD)
7.1 (SQ)    1.9 (AD)    1.5 (SQ)
3.1 (AD)    5.0 (AD)    2.1 (SQ)
4.4 (AD)    6.1 (SQ)    2.5 (SQ)
5.0 (SQ)    4.5 (SQ)    0.3 (AD)
            4.8 (SQ)    2.2 (SQ)
Attendance Break
Sequence Alignment
     G   E   N   E   S   I   S
G   60  40  30  20  20  10   0
E   40  50  30  20  20  10   0
N   30  30  40  20  20  10   0
E   20  30  20  30  20  10   0
T   20  20  20  20  20  10   0
I    0   0   0  10   0  20   0
C   10  10  10  10  10  10   0
S    0   0   0   0  10   0  10
Alignments Tell Us About
• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism
Key (and widely mis-used) terms
Similarity
• Similarity refers to the likeness or % identity between 2 sequences
• Similarity means sharing a statistically significant number of bases or amino acids
• Similarity does not imply homology
Homology
• Homology refers to shared ancestry
• Two sequences are homologous if they are derived from a common ancestral sequence
• Homology usually implies similarity
Similarity is Quantifiable
• It is correct to say that two sequences are X%
identical
• It is correct to say that two sequences have a
similarity score of Z
• It is generally incorrect to say that two sequences
are X% similar
• In part because we do not have any standard metric of
“percent sequence similarity”; instead we measure
sequence identity.
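As a minimal sketch, percent identity can be computed from two already-aligned strings. The pair below is the gapped alignment shown on the "Sequence Similarity At Multiple Levels" slide. Dividing by the number of alignment columns (gap columns included) is one convention among several and is an assumption here; `percent_identity` is an illustrative name.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences carry the same character."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # A column counts as a match only if the characters agree and neither is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


pid = percent_identity("THESTORYOFGENESI-S", "THISBOOKONGENETICS")
print(round(pid, 1), "% identical")
```

Eleven of the eighteen columns match, so the two strings are about 61% identical under this convention.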
Homology is Binary
• If two sequences have a high % identity it is OK to
say they are homologous
• It is incorrect to say two sequences have a
homology score of Z
• It is incorrect to say two sequences are X%
homologous
Sequence Similarity At Multiple Levels
Two Character Strings:
THESTORYOFGENESIS
THISBOOKONGENETICS

Character Comparison:
THESTORYOFGENESI-S
** * *  * **** * *
THISBOOKONGENETICS

Context Comparison:
THE STORY OF GENESIS
THIS BOOK ON GENETICS
Dynamic Programming: The CS Core
Initial matrix:
     G   E   N   E   S   I   S
G   10   0   0   0   0   0   0
E    0  10   0   0   0   0   0
N    0   0  10   0   0   0   0
E    0  10   0  10   0   0   0
T    0   0   0   0   0   0   0
I    0   0   0   0   0  10   0
C    0   0   0   0   0   0   0
S    0   0   0   0  10   0  10

Summed matrix:
     G   E   N   E   S   I   S
G   60  40  30  20  20  10   0
E   40  50  30  20  20  10   0
N   30  30  40  20  20  10   0
E   20  30  20  30  20  10   0
T   20  20  20  20  20  10   0
I    0   0   0  10   0  20   0
C   10  10  10  10  10  10   0
S    0   0   0   0  10   0  10

Resulting alignment:
GENETICS
||||*| |
GENESI-S
Alignments Tell Us About
• Developed by Needleman & Wunsch (1970)
• Refined by Smith & Waterman (1981)
• Ideal for quantitative assessment
• Guaranteed to be mathematically optimal
• Slow O(N²) algorithm
• Performed in 2 stages
  • Prepare a scoring matrix using recursive function
  • Scan matrix diagonally using traceback protocol
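The two stages can be sketched in a few lines. This is a textbook-style global alignment, not the exact procedure drawn on the slides: it fills the matrix from the top-left, so the total score ends up in the bottom-right cell. The scores (match = 10, mismatch = 0, gap = 0) are assumptions chosen to mirror the GENESIS/GENETICS example, and where several alignments tie only one is returned.

```python
def needleman_wunsch(a, b, match=10, mismatch=0, gap=0):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # Stage 1: fill the scoring matrix with the recursive rule.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Stage 2: trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


total, aligned_a, aligned_b = needleman_wunsch("GENESIS", "GENETICS")
print(total)
print(aligned_a)
print(aligned_b)
```

Both the time and the memory used grow with len(a) × len(b), which is the quadratic cost the slide refers to.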
Scoring Matrix: The Key to Alignment
• An empirical model of evolution, biology and
chemistry all wrapped up in a 20 X 20 table of
integers
• Or 4x4 for DNA/RNA
• Structurally or chemically similar residues should
ideally have high diagonal or off-diagonal
numbers
• Structurally or chemically dissimilar residues
should ideally have low diagonal or off-diagonal
numbers
Most Common Scoring Matrices
• Proteins: PAM
• Developed by M.O. Dayhoff (1978)
• PAM = Point Accepted Mutation
• Matrix assembled by looking at patterns of substitutions in
closely related proteins
• 1 PAM corresponds to 1 amino acid change per 100
residues
• 1 PAM = 1% divergence or 1 million years in evolutionary
history
• Nucleotides: flat/uniform matrix
• +1 for all matches
• -3 for all mismatches
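As a quick illustration of the flat nucleotide matrix, the sketch below scores an ungapped, equal-length pairing with +1 per match and -3 per mismatch. The two sequences are made up for the example and `flat_score` is a hypothetical helper.

```python
def flat_score(seq_a: str, seq_b: str, match: int = 1, mismatch: int = -3) -> int:
    """Score an ungapped, equal-length nucleotide pairing with a flat matrix."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(match if x == y else mismatch for x, y in zip(seq_a, seq_b))


# Hypothetical sequences: seven matches and one mismatch.
score = flat_score("ACGTACGT", "ACGTTCGT")
print(score)
```

Seven matches and one mismatch give 7 - 3 = 4, so a single mismatch cancels three matches under this scheme.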
Problems with Dynamic Programming
• Great for doing pairwise global alignments
• Produces a quantitative alignment “score”
• Problems if one tries to do alignments with very large
sequences (memory requirement grows as N² or as
N × M)
• Serious problems if one tries to align one sequence
against a database (tens of hours)
• Need an alternative...
Let’s try an experiment...
ACDEAGHNKLM...
KKDEFGHPKLM...
SCDEFCHLKLM...
Align
ACDEFGHIKLM...
MCDEFGHNKLV...
QCDEFGHAKLM...
AQQQFGHIKLPI...
WCDEFGHLKLM...
SMDEFAHVKLM...
ACDEFGFKKLM...
Score Distribution?
[Figure: histogram of alignment scores; x-axis binned from <20 to >120, y-axis from 0 to 8000]
Score Distribution?
Gaussian?
Poisson?
Other?
An Extreme Value Distribution
$P(x) = e^{-x} \, e^{-e^{-x}}$
[Figure: the same histogram of alignment scores with the extreme value curve overlaid; x-axis binned from <20 to >120, y-axis from 0 to 8000]
Who Cares?
• If you can predict the usual score
distribution prior to performing an
alignment search then it is possible to
predict which alignments and which
sequences will be worth aligning
• Saves on time!
• Gives a significance value (not just a raw
score) to sequence alignments
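A minimal sketch of how an extreme value distribution turns a score into a significance value. It uses the standard form P(x) = e^{-x} e^{-e^{-x}}, so it assumes the score has already been standardized; real search tools estimate location and scale parameters for the score distribution first. The function names are illustrative.

```python
from math import exp


def evd_pdf(x: float) -> float:
    """Standard extreme value (Gumbel) density: e^(-x) * e^(-e^(-x))."""
    return exp(-x - exp(-x))


def evd_tail(x: float) -> float:
    """Probability of a standardized score at least as large as x by chance."""
    return 1.0 - exp(-exp(-x))


for x in (0.0, 2.0, 5.0):
    print(x, round(evd_pdf(x), 4), round(evd_tail(x), 4))
```

The tail shrinks rapidly as the standardized score grows, which is what allows unpromising alignments to be discarded early.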
BLAST
• Basic Local Alignment Search Tool
• Developed in 1990 and 1997 (S. Altschul)
• A heuristic method for performing local
alignments through searches of high scoring
segment pairs (HSP’s)
• 1st to use statistics to predict significance of
initial matches - saves on false leads
• Offers both sensitivity and speed
BLAST
• Looks for clusters of nearby or locally dense “similar or
homologous” k-tuples
• Uses “look-up” tables to shorten search time
• Uses larger “word size” than FASTA to accelerate the
search process
• Performs local alignment (the “L” in BLAST), not global alignment
• Fastest and most frequently used sequence alignment
tool -- THE STANDARD
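The look-up table idea can be sketched as follows: index every k-letter word of one sequence, then look up each word of the query to find seed hits. This covers exact-word seeding only; it does not include the neighbourhood words, hit extension or statistics of the real tool. The two sequences are the visible prefixes of two entries on the "Let's try an experiment" slide, and the function names are hypothetical.

```python
from collections import defaultdict


def build_lookup(subject: str, k: int) -> dict:
    """Map every k-letter word in the subject to the positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        table[subject[pos:pos + k]].append(pos)
    return table


def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where an identical k-letter word occurs."""
    table = build_lookup(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in table.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


seeds = find_seeds("ACDEFGHIKLM", "MCDEFGHNKLV")
print(seeds)
```

All four seeds fall on the same diagonal (query position equals subject position), which is the kind of locally dense cluster that would then be extended into a high-scoring segment pair.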
But the most common sequence-analysis
questions today occur using NGS
Next-Generation Sequencing
• High accuracy needed
• Misaligned reads are a source of false positive variant
calls in NGS data
• High sensitivity needed
• The aligner must allow for differences between the
individual and reference to find the correct mapping
position
• High speed needed
• With large data the informatics cost is significant
• My lab uses ~$150k of storage per year: comparable to
mouse-costs for large wet-labs.
NGS Pipelines Are Complex
How To Analyze NGS Data
• Don’t; ask an expert.
• Many, many high-profile disasters in this area.
• And likely many more coming up!
• Deciding the correct alignment strategy is very hard.
• Does a reference genome exist? How accurate is it?
• How long are your reads?
• How much computational hardware do you have?
• Even figuring out which method to use is incredibly hard.
Benchmarking Studies
• Consider one very simple cancer tumour/normal pair
• 100% tumour cellularity
• No minor sub-populations or intra-tumoural heterogeneity
• No indels or complex structural variations
• Tumour-type known
• Female (to avoid Y-chromosome challenges)
• 120 submissions from groups around the world, including:
• Broad, EMBL Heidelberg, BCCA, Baylor, UCSC, etc.
• How did they do?
Benchmarking Studies
60% of groups made > 10% error
But It isn’t Hopeless!
• While most groups did poorly...
• 60% had >10% error
• A subset of groups did extremely well
• 5% had <5% error
• These included all the major sequencing centres who competed.
Moral: Don’t analyze your own NGS data!
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)