Local BLAST - For link to GMS6014, click here

Practice – file types (Cont.)
Load the “Mysequence.doc” file into Webcutter
using “Choose file” and then “Upload
sequence file”.
- Notice that the “sequence” in the sequence box is made up of
nonsense characters.
Clear the input; browse to and load the .txt file.
Run an analysis.
Always keep your sequences in .txt files for
downstream analysis.
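The Webcutter exercise fails on the .doc file because a word-processor file carries formatting bytes alongside the sequence letters. A minimal sketch of a check one might run before an analysis, assuming a DNA sequence written with the IUPAC nucleotide codes; the helper name `looks_like_plain_dna` is hypothetical and not part of any tool named in these slides.

```python
# Hypothetical helper: does this text look like a plain DNA sequence?
IUPAC_DNA = set("ACGTRYSWKMBDHVN")  # IUPAC nucleotide codes, upper case


def looks_like_plain_dna(text: str) -> bool:
    """Return True if every non-whitespace character is an IUPAC DNA code."""
    residues = "".join(text.split()).upper()  # drop spaces and line breaks
    return bool(residues) and set(residues) <= IUPAC_DNA


# Plain text passes; text containing binary-looking characters does not.
print(looks_like_plain_dna("atgc gtta\nccgg"))
print(looks_like_plain_dna("\xd0\xcf\x11\xe0 atgc"))
```

A check like this only catches stray characters; it does not replace saving the sequence as plain text in the first place.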
Representation of sequence
The need to represent associated info with sequence
• Structured data entry
• Specialized databases
3-d Structure
Mutation / Diseases
Protein family / Protein domain
Interaction
Pathway
….
Representation of sequence
The need to represent associated info with sequence
• Structured data entry
• Specialized databases
• Complex / customized data structure
- Object-oriented data representation
(Mount, p44-45)
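As a rough illustration of object-oriented data representation, the sketch below bundles a sequence with its annotations in one object. The class and field names (`SequenceRecord`, `Feature`, and so on) are assumptions made for this example, not the layout of any particular database, and the record contents are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    kind: str      # e.g. "domain"
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive
    note: str = ""


@dataclass
class SequenceRecord:
    accession: str
    name: str
    organism: str
    sequence: str
    features: List[Feature] = field(default_factory=list)
    references: List[str] = field(default_factory=list)

    def feature_sequence(self, feature: Feature) -> str:
        """Return the residues covered by a feature."""
        return self.sequence[feature.start - 1:feature.end]


# Placeholder data only; not a real database entry.
record = SequenceRecord("DEMO0001", "demo protein", "synthetic", "MKTAYIAKQR")
record.features.append(Feature("domain", 3, 6, "placeholder annotation"))
print(record.feature_sequence(record.features[0]))  # prints TAYI
```

Keeping the annotations next to the sequence means that a method such as `feature_sequence` can answer questions that a bare string cannot.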
Public Resources for Bioinformatics
•Databases
•Analysis Tools
Observe: List of databases and services at NCBI,
EBI, KEGG, and Ensembl.
Pet Project:
MDM2, or your favorite gene
What can we know about this gene?
• Search for “curated” databases.
• To prepare for future analysis, save annotated
sequence files as genename.html (in a target
folder).
• For downstream sequence analysis, save pure
sequence as FASTA format file.
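A FASTA file is a header line starting with ">" followed by the sequence on one or more lines. The sketch below shows one way to produce that layout; the function name `to_fasta`, the 60-character line width, and the example sequence are assumptions made for illustration.

```python
def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as FASTA text: '>name' then fixed-width lines."""
    residues = "".join(sequence.split()).upper()  # remove spaces/newlines
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Placeholder residues, not a real gene sequence.
fasta_text = to_fasta("my_gene_demo", "atggct gcatta ccggaa", width=10)
print(fasta_text, end="")

# To save it for downstream analysis (file name is up to you):
# with open("my_gene_demo.fasta", "w") as handle:
#     handle.write(fasta_text)
```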
Where is information on my gene available,
and how much is there?
Observe: The information content and
presentation format for the same gene in
SwissProt, NCBI protein, NCBI Gene, etc.
Public Resources (I) – Databases and
data sources
Over 1,000 in the sea of databases.
Content-specific, such as DNA, Protein,
Structure, etc.
Species-specific, such as FlyBase, WormBase,
OMIM, etc.
System-specific, such as MetaCyc, AFCS, etc.
Database concept:
Database - efficiently store, update, and
retrieve information (data).
Types of Databases – Relational DB, Object DB,
native XML DB.
Database management systems – Access, Sybase,
MySQL, Oracle, etc.
Database concept – tables in relational
databases
Query: “TNF” = TNF[All Fields], compared with TNF[Name]
Protein table
Accession   Organ.   Ref.       Name   Key words   Features
…           …        medline1   TNF    …           …
…           …        medline2   P53    …           …
Database concept – relationship between
tables
Protein table
Accession   Organ.   Ref.       Name   Key words   Features
…           …        medline1   P27    …           …
…           …        medline2   P53    …           …
Reference table
ID         title   year   author   abstract
medline1   …       1970   …        …
medline2   …       1980   …        …
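The link between the two tables can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the table and column names are assumptions, the accessions and titles are placeholders, and only the names (P27, P53), the reference IDs, and the years come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE reference (
    id    TEXT PRIMARY KEY,
    title TEXT,
    year  INTEGER
);
CREATE TABLE protein (
    accession TEXT PRIMARY KEY,
    name      TEXT,
    ref_id    TEXT REFERENCES reference(id)  -- points at a reference row
);
""")
conn.executemany(
    "INSERT INTO reference VALUES (?, ?, ?)",
    [("medline1", "placeholder title 1", 1970),
     ("medline2", "placeholder title 2", 1980)],
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [("ACC_PLACEHOLDER_1", "P27", "medline1"),
     ("ACC_PLACEHOLDER_2", "P53", "medline2")],
)


def references_for(name):
    """Follow protein.ref_id into the reference table for one protein name."""
    query = """
        SELECT r.id, r.year
        FROM protein AS p
        JOIN reference AS r ON p.ref_id = r.id
        WHERE p.name = ?
    """
    return conn.execute(query, (name,)).fetchall()


print(references_for("P53"))  # prints [('medline2', 1980)]
```

Storing the reference once and pointing to it by ID is what lets several protein entries share one citation without copying it.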
Observe/Practice
Search for MDM2 in the Gene database and the
Protein database.
Search for MDM2 in “All Text” vs. “gene name” in
the Gene database.
Compare results.
Download the human MDM2 protein sequences for
all 8 isoforms.
Public Resources (II) – Analysis tools
• Web-based analysis tools – easy to
use, but often with fewer customization
options.
• Stand-alone analysis tools – require
installation and configuration, but
provide more customization options.
• Commercial analysis tools
• Scripting for bioinformatics projects
Practice: navigating the related
resources through links
Using the “PubMed” link, search annotated
references on MDM2.
Using the “GEO Profiles” link, search gene
expression information on MDM2.
Using the “Map Viewer” link, observe the
chromosome location and gene structure of
the MDM2 locus – change the “Map Viewer”
options to include prediction of CpG
islands.
Public Resources for Bioinformatics
•Databases : how to find relevant
information.
•Analysis Tools
Web-based tools
• Identification of web-based
bioinformatics resources.
– Portals, lists
– Google search
• Organization
– Bookmarks.
– HTML page.
Web-based tools
Practice – retrieve a genomic sequence from
Ensembl and perform reverse
complementation with SMS
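What the reverse-complement step does can be sketched in a few lines of Python. This is an illustration of the operation only, not the code behind SMS; it assumes a DNA sequence made of A, C, G, T, and N.

```python
# Map each base to its complement; N (unknown base) stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna: str) -> str:
    """Complement every base, then read the strand in the opposite direction."""
    return dna.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCAAGT"))  # prints ACTTGCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check on any reverse-complement tool.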
Stand-alone tools 1.
Rules of thumb:
• Make a folder for each program.
• Make a sub-folder for input/output
if necessary.
• Link GUI-based .exe applications to the
program menu.
Stand-alone tools 2.
Practice – the ClustalX application.
1. Download the zip file to the GMS6014 folder.
2. Unzip the files to a folder named “clustalx”.
3. Edit the 3TNF file with WordPad and save.
4. Run the .exe file.
5. Load the sequence file, select sequences, perform
alignment.
6. Write the alignment to a PostScript (.ps) file.
Stand-alone tools 3.
Command line applications:
• Account for a large number of high-quality,
sophisticated programs.
Practice – (install and) run standalone BLAST
on your own computer
Pet Projects:
Identifying the ortholog of MDM2 (mouse
double minute 2 homolog) in an insect genome.
Practice – Install the blast program (1)
1. Download the BLAST executable file and save the
file in a folder, such as C:\GMS6014\blast\
2. Run the installation program by double-clicking it.
Inspect the folder following installation.
3. Add three more folders to your blast folder:
“query”, “dbs”, and “out”.
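The three working folders can also be created from a script, which helps when the same layout is needed on several machines. A minimal sketch using `pathlib`; the function name `make_blast_layout` is hypothetical, and the demonstration runs in a temporary directory rather than in C:\GMS6014\blast.

```python
import tempfile
from pathlib import Path


def make_blast_layout(root):
    """Create query, dbs, and out folders under root; return their names."""
    root = Path(root)
    for sub in ("query", "dbs", "out"):
        (root / sub).mkdir(parents=True, exist_ok=True)  # safe to re-run
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Demonstration in a scratch directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as scratch:
    print(make_blast_layout(Path(scratch) / "blast"))
```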
Practice – Install the blast program (2)
5. Inspect the contents of the doc, data, and bin
folders. Move the programs from blast\bin to the
blast folder.
6. Bring up a command (cmd) window by typing
“cmd” in the Start > Run box.
7. Go to the blast folder by typing “cd
C:\GMS6014\blast”.
8. Try to run the program by typing “blastall”, and read
the output.
Practice -- BLAST search on your own computer
1. Download the data file from the course web page or from Ensembl.
Save it in the blast\dbs folder.
2. Start a CMD window and navigate to the C:\GMS6014\blast
folder.
3. At the prompt “C:\GMS6014\blast >” type the command
“formatdb -i dbs\Dm.P -p T” to format the dataset for
the program.
4. Compose the query sequence and save it as “3TNF.txt” in the
“blast\query\” folder.
5. Initiate the search by typing “blastall -p blastp -d
dbs\Dm.P -i query\4_MMD2.fasta -o
out\Mdm2_DmP.html -T T”. The file named after -i must be a
query file that exists in the query folder.
What’s in a command?
formatdb -i dbs\Dm.P -p T
• formatdb: the program; formats the database for searching.
• -i dbs\Dm.P: “Feed me the input file name.”
• -p T: “Tell me, is it a protein sequence file?” (T = true)
For more info, refer to the “user manual” file in the blast\doc folder.
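The same anatomy can be written down as a list of arguments, which is how a script would hand the command to the operating system. The sketch below only builds and prints the command lines; the function names are hypothetical, the output file name is a placeholder, and the flags are the ones used in this practice (-i, -p, -d, -o, -T).

```python
def formatdb_command(db_file, is_protein=True):
    """Arguments for formatdb: -i input file, -p T (protein) or F (nucleotide)."""
    return ["formatdb", "-i", db_file, "-p", "T" if is_protein else "F"]


def blastall_command(program, db, query, out, html=True):
    """Arguments for blastall: -p program, -d database, -i query, -o output."""
    cmd = ["blastall", "-p", program, "-d", db, "-i", query, "-o", out]
    if html:
        cmd += ["-T", "T"]  # ask for HTML output
    return cmd


print(" ".join(formatdb_command(r"dbs\Dm.P")))
print(" ".join(blastall_command("blastp", r"dbs\Dm.P",
                                r"query\3TNF.txt", r"out\result.html")))
```

If the BLAST executables are installed and on the PATH, a list like this can be passed to `subprocess.run(cmd, check=True)` to run the command from Python.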
Advantages of Running BLAST on Your
Own Machine
• Do it at any time, with no waiting in a queue.
• Search with multiple query sequences at once.
• Search a defined data set.
• Automate BLAST analysis.
• Combine BLAST with other analyses.
• …
BLAST is a program implemented in
C/C++
Should I care?
void BlastTickProc(Int4 sequence_number, BlastThrInfoPtr thr_info)
{
    if (thr_info->tick_callback &&
        (sequence_number > (thr_info->last_db_seq + thr_info->db_incr))) {
        NlmMutexLockEx(&thr_info->callback_mutex);
        thr_info->last_db_seq += thr_info->db_incr;
        thr_info->tick_callback(sequence_number, thr_info->number_of_pos_hits);
        thr_info->last_tick = Nlm_GetSecs();
        NlmMutexUnlock(thr_info->callback_mutex);
    }
    return;
}
/*
    Sends out a message every PERIOD (i.e., 60 secs.) for the index.
    This function runs as a separate thread and only runs on a threaded
    platform.
*/
Programming language comparison
Translation : C++
/* TRANSLATION: 3 or 6 frame translate cDNA sequences
*/
//--------------------------------------------------------------------------
#include "translation.hpp"
int main(int argc, char **argv)
{
    int num_seq = 0;
    char string[MAXLINE];
    DSEQ *dseq;
    // Read the first header line of the input file.
    infile.getline(string, MAXLINE);
    if (string[0] == '>') strncpy(dbname, string, MAXLINE);
    // Translate one sequence per loop until the file is exhausted.
    while (!infile.eof())
    {
        dseq = Get_Lib_Seq();
        if (dseq->reverse == 0)
            Translation(&dseq->name[1], dseq->seq);
        else
            Translation(&dseq->name[1], dseq->r_seq);
        num_seq++;
        // Report progress every 1000 sequences.
        if (num_seq % 1000 == 0)
        {
            cout << num_seq << endl;
            cout << dseq->name << endl;
        }
        delete dseq;
    }
    infile.close();
    outfile.close();
    cout << num_seq << " translated" << endl;
    getch();
    return 0;
}
DSEQ* Get_Lib_Seq()
{
    int i, n;
    char str[MAXLINE];
    DSEQ* dseq;
    n = 0;
    dseq = new DSEQ;
    strcpy(dseq->name, dbname);
    // (the excerpt on the slide stops here)
Translation : Python
# Translation -- read from a FASTA DNA file and translate into
# three frames
from Bio import SeqIO
ifile = "S:\\Seq\\test.fasta"
# Read the first record from the FASTA file.
with open(ifile) as handle:
    cur_rec = next(SeqIO.parse(handle, "fasta"))
cur_seq = cur_rec.seq
# Translate each of the three forward frames with the standard
# genetic code (table 1).
for frame in range(3):
    sub_seq = cur_seq[frame:]
    sub_seq = sub_seq[:len(sub_seq) - len(sub_seq) % 3]  # whole codons only
    print(sub_seq.translate(table=1))
Observe: scripting is not that difficult
Example: Python and Biopython.
1. Simple Python scripts.
2. Batch BLAST with a Python script.
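A batch BLAST script can be as small as a loop over the query folder. The sketch below builds one blastall command per FASTA file and leaves the actual run as a separate step; the function names, the ".fasta" extension, and the output naming scheme are assumptions made for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def batch_blast_commands(query_dir, db, out_dir, program="blastp"):
    """Build one blastall command per *.fasta file found in query_dir."""
    commands = []
    for query in sorted(Path(query_dir).glob("*.fasta")):
        out_file = Path(out_dir) / (query.stem + ".html")
        commands.append(["blastall", "-p", program, "-d", str(db),
                         "-i", str(query), "-o", str(out_file), "-T", "T"])
    return commands


def run_all(commands):
    """Run each command in turn; needs the BLAST executables on the PATH."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Dry run in a scratch folder with two empty placeholder query files.
with tempfile.TemporaryDirectory() as scratch:
    for name in ("query_a.fasta", "query_b.fasta"):
        (Path(scratch) / name).touch()
    for cmd in batch_blast_commands(scratch, "dbs/Dm.P", "out"):
        print(" ".join(cmd))
```

Separating command building from `run_all` makes it possible to inspect the commands before anything is executed.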
Representation of sequence
The need to include annotations and functional
information with each sequence.
• Structured data entry
• GenBank
• EMBL / SwissProt
Observe: The differences in data structure
between SwissProt, NCBI protein, and NCBI
Gene.