0 - For link to GMS6014, click here


Stand-alone tools 2.
Practice: the ClustalX application.
1. Download the zip file to the GMS6014 folder.
2. Unzip the files to a folder named “clustalx”.
3. Edit the MDM2_isoforms_5.fasta file with
WordPad and save.
4. Run the .exe file.
5. Load sequence file, select sequences, perform
alignment.
6. Write the alignment to a ps file.
Stand-alone tools 3.
Command line applications:
• Account for a large number of high-quality,
sophisticated programs.
Practice: (install and) run standalone BLAST
on your own computer
Pet project:
Searching for a potential ortholog of the
oncogene MDM2 in the fruit fly
genome
Practice – Install the blast program (1)
1. Download the BLAST executable file, save the
file in a folder, such as c:\GMS6014\blast\
2. Run the installation program by double-clicking it.
Inspect the folder after installation.
3. Add three more folders to your /blast directory,
“/query”, “/dbs”, and “/out”.
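The three extra folders can also be created from a script. The following is a minimal Python sketch, assuming the BLAST folder already exists or may be created; the function name make_blast_folders is illustrative and not part of BLAST.

```python
from pathlib import Path


def make_blast_folders(base):
    """Create the query, dbs and out subfolders under the given BLAST folder."""
    base = Path(base)
    created = []
    for name in ("query", "dbs", "out"):
        folder = base / name
        # parents=True also creates the base folder; exist_ok avoids an error on reruns
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

On the course machine this would be called as make_blast_folders(r"C:\GMS6014\blast").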
Practice – Install the blast program (2)
5. Inspect the contents of the doc, data, and bin
folder. Move the programs from blast\bin to the
blast folder.
6. Bring up a command (cmd) window by typing
“cmd” in the Start > Run box.
7. Go to the blast folder by typing “cd
C:\GMS6014\blast”
8. Try to run the program by typing “blastall”, read
the output.
Practice -- BLAST search in your own computer
1. Download the data file from the course web page, or from Ensembl.
Save it in the blast\dbs folder.
2. Start a CMD window and navigate to the C:\GMS6014\blast
folder.
3. At the prompt “C:\GMS6014\blast >” type the command
“formatdb -i dbs\Dm.P -p T” to format the dataset for
the program.
4. Compose the query sequence and save it as “3TNF.txt” in the
“blast\query\” folder.
5. Initiate the search by typing
“blastall -p blastp -d dbs\Dm.P -i query\4_MMD2.fasta -o out\Mdm2_DmP.html -T T”
What’s in a command?
formatdb -i dbs\Dm.P -p T
formatdb : Program, format database for search.
-i dbs\Dm.P : Feed me the input file name.
-p T : Tell me, is it a protein sequence file?
For more info, refer to the “user manual” file in the blast\doc folder.
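The same commands can be assembled from a script, which makes the role of each flag explicit. This is a minimal Python sketch, assuming the legacy formatdb and blastall executables are installed; the helper names are illustrative and not part of BLAST.

```python
import shutil
import subprocess


def formatdb_command(db_path, is_protein=True):
    """Argument list for formatdb: -i is the input file, -p T/F says protein or not."""
    return ["formatdb", "-i", db_path, "-p", "T" if is_protein else "F"]


def blastall_command(program, db_path, query_path, out_path, html=True):
    """Argument list for blastall, mirroring the search command on the slide."""
    cmd = ["blastall", "-p", program, "-d", db_path,
           "-i", query_path, "-o", out_path]
    if html:
        cmd += ["-T", "T"]  # -T T asks for HTML output
    return cmd


def run_if_installed(cmd):
    """Run cmd only if its executable is on the PATH; return True when it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True
```

For example, " ".join(formatdb_command(r"dbs\Dm.P")) gives back the command line typed in step 3.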
Advantages of Running BLAST on Your
Own Machine
• Do it at any time, with no waiting in line.
• Search for multiple sequences at once.
• Search a defined data set.
• Automate BLAST analysis.
• Combine BLAST with other analyses.
• ...
BLAST is a program implemented in C/C++
    void BlastTickProc(Int4 sequence_number, BlastThrInfoPtr thr_info)
    {
        if (thr_info->tick_callback &&
            (sequence_number > (thr_info->last_db_seq + thr_info->db_incr))) {
            NlmMutexLockEx(&thr_info->callback_mutex);
            thr_info->last_db_seq += thr_info->db_incr;
            thr_info->tick_callback(sequence_number, thr_info->number_of_pos_hits);
            thr_info->last_tick = Nlm_GetSecs();
            NlmMutexUnlock(thr_info->callback_mutex);
        }
        return;
    }
    /*
    Sends out a message every PERIOD (i.e., 60 secs.) for the index.
    This function runs as a separate thread and only runs on a threaded
    platform.
    */
Should I care?
If you care:
1.) Data structure and Algorithm
SEQ
    char: name
    char: sequence
    int: seq_length
Identify the best alignment
for two sequences (p69-73)
Seq1: MA-DSV-WC..
Seq2: MALD-IHWS..
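To make the data structure and the alignment idea concrete, here is a minimal Python sketch. The SEQ class mirrors the three fields listed above. The slide does not name the alignment algorithm, so Needleman-Wunsch global alignment is used here as an assumption, and the match, mismatch and gap values are illustrative rather than taken from the course.

```python
from dataclasses import dataclass


@dataclass
class SEQ:
    name: str
    sequence: str

    @property
    def seq_length(self):
        # derived from the sequence instead of being stored separately
        return len(self.sequence)


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (Needleman-Wunsch)."""
    # prev is the previous row of the dynamic programming table
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # best of: align the two residues, gap in b, gap in a
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]
```

Only the score is computed; recovering the aligned strings with their gaps would need a traceback through the full table.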
Programming language comparison
Translation : C++
    /* TRANSLATION: 3 or 6 frame translate cDNA sequences
    */
    //--------------------------------------------------------------------------
    #include "translation.hpp"

    int main(int argc, char **argv)
    {
        int num_seq = 0;
        char string[MAXLINE];
        DSEQ *dseq;

        infile.getline(string, MAXLINE);
        if (string[0] == '>') strncpy(dbname, string, MAXLINE);
        while (!infile.eof())
        {
            dseq = Get_Lib_Seq();
            // translate the forward or the reverse strand
            if (dseq->reverse == 0)
                Translation(&dseq->name[1], dseq->seq);
            else
                Translation(&dseq->name[1], dseq->r_seq);
            num_seq++;
            // progress report every 1000 sequences
            if (num_seq % 1000 == 0)
            {
                cout << num_seq << endl;
                cout << dseq->name << endl;
            }
            delete dseq;
        }
        infile.close();
        outfile.close();
        cout << num_seq << " translated" << endl;
        getch();
        return 0;
    }

    DSEQ* Get_Lib_Seq()
    {
        int i, n;
        char str[MAXLINE];
        DSEQ* dseq;
        n = 0;
        dseq = new DSEQ;
        strcpy(dseq->name, dbname);

Translation : Python
    # Translation -- read from fasta DNA file and translate into
    # three frames
    import string
    #
    from Bio import Fasta
    from Bio.Tools import Translate
    from Bio.Alphabet import IUPAC
    from Bio.Seq import Seq

    ifile = "S:\\Seq\\test.fasta"
    parser = Fasta.RecordParser()
    file = open(ifile)
    iterator = Fasta.Iterator(file, parser)
    cur_rec = iterator.next()
    cur_seq = Seq(cur_rec.sequence, IUPAC.unambiguous_dna)
    translator = Translate.unambiguous_dna_by_id[1]
    translator.translate(cur_seq)
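The Biopython modules used on the slide (Bio.Fasta, Bio.Tools.Translate) belong to an older release and may not be available in a current installation. As a library-free illustration of the same task, here is a minimal sketch in plain Python, assuming the standard genetic code and an unambiguous DNA sequence; the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def read_first_fasta(path):
    """Return (name, sequence) of the first record in a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    break  # a second record starts here
                name = line[1:]
            elif name is not None:
                chunks.append(line)
    return name, "".join(chunks).upper()


def translate(dna):
    """Translate one reading frame; codons not in the table become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(dna):
    """Translate the forward strand in reading frames 1, 2 and 3."""
    return [translate(dna[offset:]) for offset in range(3)]
```

Stop codons are written as "*", and a trailing incomplete codon is ignored.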
Programming languages
Efficiency, Power
Simplicity, Fast Dev.
C/C++
Perl - Bioperl
Java - Biojava
Python - Biopython
Observe: scripting is not that difficult
Example: Python and Biopython.
1. Simple python scripts.
2. Batch Blast with a Python script.
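A batch run could look roughly like the sketch below. It assumes the folder layout from the installation practice (query files in one folder, results written to another) and the legacy blastall program; batch_blast_commands is a hypothetical helper name, and only the command lines are built here.

```python
from pathlib import Path


def batch_blast_commands(query_dir, db_path, out_dir, program="blastp"):
    """Build one blastall argument list per .fasta file in query_dir."""
    commands = []
    for query in sorted(Path(query_dir).glob("*.fasta")):
        # one HTML report per query, named after the query file
        out_file = Path(out_dir) / (query.stem + ".html")
        commands.append(["blastall", "-p", program, "-d", str(db_path),
                         "-i", str(query), "-o", str(out_file), "-T", "T"])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) once blastall is on the PATH.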
Blast output
Questions after the Blast search?
Questions:
• Is this an expressed gene in the fruit fly?
- Gene prediction & gene structure
• Is this the true ortholog of MDM2?
- Fundamentals of sequence comparison
• What can we learn from the comparison of sequences?
- Protein domains/motifs.
Blast output
How to measure the similarity between
two sequences
Q: Which one is a better match to the query?
Query: M A T W L
Seq_A: M A T P P
Seq_B: M P P W I
Judging the match using “Scoring Matrix”
Q: Which one is a better match to the query?
Query:  M   A   T   W   L
Seq_A:  M   A   T   P   P
Score:  5   4   5  -4  -3
Total:  7

Query:  M   A   T   W   L
Seq_B:  M   P   P   W   I
Score:  5  -1  -1  11   2
Total:  16
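The totals above can be reproduced with a few lines of Python. This sketch stores only the residue pairs that occur in the example, with the scores shown on the slide; a real program would load a complete substitution matrix, and the function names are illustrative.

```python
# Scores for the residue pairs used in the example (values from the slide)
PAIR_SCORES = {
    ("M", "M"): 5, ("A", "A"): 4, ("T", "T"): 5, ("W", "W"): 11,
    ("W", "P"): -4, ("L", "P"): -3, ("A", "P"): -1, ("T", "P"): -1,
    ("L", "I"): 2,
}


def pair_score(a, b):
    """Look up a pair in either order, since the matrix is symmetric."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(query, subject):
    """Sum the pair scores of an ungapped alignment of equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(query, subject))


print(alignment_score("MATWL", "MATPP"))  # 7
print(alignment_score("MATWL", "MPPWI"))  # 16
```

Seq_A shares three identical residues with the query and Seq_B only two, yet Seq_B scores higher because the conserved tryptophan (W) carries a large score.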
“Scoring Matrix” assigns a score to each pair
of amino acids
BLOSUM-62 (excerpt, row L):
      A   S   T   L   I   V   K  ...   D
L    -1  -2  -1   4   2   1  -2  ...  -4
BLOSUM - Blocks Substitution Matrices
Block: a very well conserved region of a protein family
that performs the same (or a similar) function.
ASLDEFL
SALEDFL
ASLDDYL
ASIDEFY
ASIDEFY
…
\[ \mathrm{Score}(a_1/a_2) = 2 \log_2 \frac{\text{observed frequency of } a_1/a_2}{\text{predicted frequency of } a_1/a_2} \]
AA: 6
AS: 3
SS: 0
BLOSUM - Blocks Substitution Matrices
Block: a very well conserved region of a protein family
that performs the same (or a similar) function.
ASLDEFL
ASLEDFL
ASLDDYL
SALEEFL
ASLDDYL
SALEEFL
…
Score(a1/a2) > 0 when observed frequency of a1/a2 > predicted frequency of a1/a2
Score(a1/a2) = 0 when observed frequency of a1/a2 = predicted frequency of a1/a2
Score(a1/a2) < 0 when observed frequency of a1/a2 < predicted frequency of a1/a2
Observed frequency of L/I (e.g. 0.03)
> predicted frequency of L/I (e.g. 0.1 * 0.1 = 0.01)
Score (L/I) > 0
Substitution of L / I is common in
conserved sequences
Observed frequency of L/K (e.g. 0.0002)
< predicted frequency of L/K (e.g. 0.1 * 0.1 = 0.01)
Score (L/K) < 0
Substitution of L / K is rare in conserved
sequences
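The sign rule can be checked numerically with the score formula, 2 * log2(observed / predicted). This is a minimal Python sketch using the illustrative frequencies from the L/I and L/K examples. Those frequencies are teaching values, so the results are not expected to match the published BLOSUM-62 entries, which are derived from real block counts and rounded to integers.

```python
import math


def log_odds_score(observed, predicted):
    """Score = 2 * log2(observed frequency / predicted frequency)."""
    return 2 * math.log2(observed / predicted)


predicted = 0.1 * 0.1  # predicted pair frequency used in both examples

print(round(log_odds_score(0.03, predicted), 2))    # L/I: 3.17, positive
print(round(log_odds_score(0.0002, predicted), 2))  # L/K: -11.29, negative
```

A pair seen exactly as often as predicted gives log2(1) = 0, which is the zero-score case.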
Scoring matrix: BLOSUM 62