CSC598BIL675-2016

Download Report

Transcript CSC598BIL675-2016

Sequence I/O
How to find sequence information
from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", ”fasta"))
creates Python dictionary with each entry held as a SeqRecord object in memory
Sequence I/O
Writing a file
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein
Creating seqRec
rec1 =
SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
+"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
+"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
+"SSAC", generic_protein),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]")
rec2 = SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"
\
+"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein),
id="gi|13919613|gb|AAK33142.1|",
description="chalcone synthase [Fragaria vesca subsp. bracteata]")
my_records = [rec1, rec2, rec3]
SeqIO.write(my_records, "my_example.faa", "fasta")
writing to file
Pairwise Alignment - DNA
Seq 1
A C T T G A
Seq 1
Seq 2
Seq 1
Seq 2
Single Nucleotide
Polymorphism (SNP)
Seq 2
A G T T A G A
A C T T G A
A G T T A G A
A C T T - G A
A G T T A G A
insertion/deletion
(indel)
Protein sequences
Seq 1
Seq 2
identity
similarity
or not
GDAFPLHKLEL-GPV
GDA P+H++ + GPV
GDASPVHQMTMKGPV
= the residues are identical
= the residues are physio-chemically similar
eg hydrophobic A L I V F M
= gap or different
Describing similarity
The sequences are identical
100% similarity
The sequences are similar (protein)
there is some level of similarity
this can range from 99% to 45% even 20%
use of adjectives: highly, significantly, somewhat
The sequences are homologous
implies an evolutionary relationship
ie they have a common ancestor
requires knowledge of protein or species phylogeny
Alignment algorithms
GLOBAL ALIGNMENT
Considers similarity across the
entire length of the sequences
LOCAL ALIGNMENT
Looks at regions of similarity
within the sequences,
substrings.
These are usually biologically
meaningful motifs or domains
Original manual method
Start with a 2D matrix on which you put the query sequence
(I) on the vertical, and the test sequence (J) on the
horizontal axis.
G
D
I
H
I
F
Find the common k-words.
G
GDIHIF
GDVHIF
D
V
H
I
F
Dotplot of similarity
G
D
I
H
I
F
G
D
V
H
I
F
Finding the diagonal. Visualization of similarity
Doesn’t scale well
GDIHAL
GRVHIF
G
D
I
H
A
L
G
R
V
- Visualization of similarity
not so good with distantly
related sequences
- No statistical/numerical
measure of similarity
+
H
I
F
- Slow
+
+
Computational considerations
Sequence orientation
Character substitutions
Physio-chemical properties
which strand?
variations
protein structure
=> Biological meaning?
Need a system whereby substitutions and gaps are weighted.
Different types of scoring systems can be used with different
algorithms –
The main problem is computational time.
Comparing Strings
Called “edit distance” measurements.
Hamming distance: the number of different characters between
strings of equal lengths; only allows substitutions.
Levenshtein distance: the strings do not have to be the same
length; allows insertions, deletions, substitutions.
Applied to automated spelling corrections…
Dynamic programming
A method of solving a large problem by dividing it up into smaller subproblems
and solving those, then putting them together to solve the larger problem.
Can often have more than one way of aligning two sequences,
dynamic programming allows backtracking and testing of various
options.
The path that most effectively links together the subproblems into an
optimal
selected
as of
thescoring
final alignment.
Needs
to solution
be basedison
statistics
system and probability
thresholds set
May still end up with a few solutions of equal scores –
Ultimately need a biologically relevant answer.
Dynamic programing
A recursive algorithm
To find an LCS of X and Y , we may need to find the LCSs of X and Yn-1 and
of Xm-1 and Y . Furthermore, each of these subproblems has the
subsubproblem of finding an LCS of Xm-1 and Yn-1.
c[i,j] is the length of an LCS of the sequences Xi and Yj. The optimal
substructure of the LCS problem gives the recursive formula
ì
0
ï
c[i, j] = í
c[i -1, j -1]+1
ï max(c[i, j -1], c[i -1, j])
î
if i = 0 or j = 0
if i, j > 0 and xi = yj
if i, j > 0 and xi ≠ yj
Dynamic programing
BCBA
AB C BDAB
BDCAB A
Backtracking
BCBA
AB C BDAB
BDCAB A
Step 3: Computing the length of a LCS
BCBA
AB C BDAB
BDCAB A
CSC317
16
Step 4: Constructing a LCS (Backtracking)
BCBA
AB C BDAB
BDCAB A
CSC317
17
Putting it all together
• Create a matrix - as in dotplot but “virtual”
• Calculate the match/mismatch score step by step as
you go along – how many are too much?
• Then backtrack and do it again – iterative, to find
optimal score
Needleman and Wunsch
Global alignment algorithm for sequence comparison.
Biased towards finding similarity, rather than difference.
Based on 2D matrix, gives a maximum match; ie the largest
number of residues (n or aa) of one sequence that can be
matched with another, allowing for all possible deletions
and substitutions.
Begins at beginning of sequence and ends at the end.
Smith-Waterman algorithm
Local alignment algorithm which finds shorter, locally similar
regions (substrings).
It is more sensitive to weaker sequence similarities, and to
finding conserved motifs
This is also a matrix-based approach, but each cell in the
matrix defines the end of a potential alignment.
It is the basis for many subsequent alignment algorithms,
including BLAST.
Sequence alignments
For pairwise alignments Biopython contains the Bio.pairwise2 module
What you would do:
1.) Prepare an input file of your unaligned sequences, typically this will be a
FASTA file which you might create using Bio.SeqIO.
2.) Call the command line tool to process this input file, typically via one of
Biopython’s command line wrappers (which we’ll discuss here).
3.) Read the output from the tool, i.e. your aligned sequences, typically using
Bio.AlignIO..
Pairwise sequence alignments
contains essentially the same algorithms as water
(local) and needle (global)
from Bio import pairwise2
from Bio import SeqIO
seq1 = SeqIO.read("alpha.faa", "fasta”)
seq2 = SeqIO.read("beta.faa", "fasta")
alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq)
global alignment
2 letter code (scoring matches, gaps)
x … match scores 1, x … no cost for gaps
m ... general values, s ... costs for gaps
print(pairwise2.format_alignment(*alignments[0]))
Pairwise sequence alignments
from Bio import pairwise2
from Bio import SeqIO
rom Bio.SubsMat.MatrixInfo import blosum62
costs for opening, extending gaps
seq1 = SeqIO.read("alpha.faa", "fasta”)
seq2 = SeqIO.read("beta.faa", "fasta")
alignments = pairwise2.align.localds(seq1.seq, seq2.seq, blosum62, -10, -0.5)
local alignment
substitution matrix, Blosum, PAM
print(pairwise2.format_alignment(*alignments[0]))
Problem with N-W and S-W
They are exhaustive, with stringent statistical thresholds
As the databases got larger, these began to take too long
– computational constraints
Need a program that runs an algorithm that is based on
dynamic programming (substrings) but uses a heuristic
approach (takes assumptions, quicker, not as refined).
Therefore can deal with comparing a sequence against 100
million others in a relatively short time… the bigger the
database, the more optimal the results.
Two heuristic approaches
Each is based on different assumptions and calculations of probability
thresholds (ie the statistical significance of results)
1. FASTA
All proteins in the db are equally likely to be related to the query
 probability multiplied by the number of sequences in database
2. BLAST
Query more likely to be related to a sequence its own length or
shorter
 probability multiplied by N/n
N: total number of residues in database
n: length of subject sequence
The BLAST algorithm
Basic Local Alignment Search Tool
Breaks down your sequence into smaller segments (words)
Does the same for all sequences in the database
Looks for exact matches, word by word, and expands those upand down- stream one base at a time, allowing for a certain
number of mismatches
Stops when sequence ends or statistical significance becomes too
low (too many mismatches)
Can find more than one area of similarity between two sequences
BLAST
Using online BLAST
database type
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
Blast program
sequence
or
from Bio.Blast import NCBIWWW
from Bio import SeqIO
Just sequence
record = SeqIO.read("m_cold.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
result_handle = NCBIWWW.qblast("blastn", "nt", record.format(“fasta”)
whole seq. information
BLAST
Using local BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
- First we create a command line prompt (for example):
blastn -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001
Blast
program
query
sequence
database
output
file
evalue
threshold
- then we us the BLAST Python wrappers from BioPython:
from Bio.Blast.Applications import NcbiblastxCommandline
blastn_exe = r”path to blastn"
blastn_cline = NcbiblastnCommandline(query="opuntia.fasta", db="nr", evalue=0.001, out="opuntia.xml”)
blastn_cline()
use Bio.Blast.NCBIXML.parse() to parse results in the output file
Algorithm evolution
Smith Waterman Local alignment algorithm – finds small,
locally similar regions (substrings), matrix-based, each cell in
the matrix defined the end of a potential alignment.
BLAST – start with highest scoring short pairs and extend
and down the sequence.
Great, but when you’re talking about millions of reads…