Computational-Skills..

Download Report

Transcript Computational-Skills..

Computational Skills Course
week 3
Mike Gilchrist
NIMR May-July 2011
WEEK THREE
Integrating third party tools, and
simple scripting.
Review: SQL and ‘group by’
Using ‘group by’ in SQL to aggregate data is very useful, but can sometimes be a little unintuitive. The commonest problem is to have too high an expectation of what it can do,
and not approach aggregate queries in a very literal fashion. Often one needs to take two
steps where you would like to think you can do it in one...
Take a simple BLAST query where you may get several hits in your target database for
some query sequences, e.g.
query_id,
CA0981762
CA0981762
CA0981762
CA0981762
XP_1349.1
XP_1349.1
subject_id
chr3
99.8
chr3
100.0
chr3
94.0
chr11
89.0
chr8
100.0
chr8
98.1
120
25
510
511
1
11
670
121
550
540
867
832
12345900
12347120
7865498
5655543
87601800
85981234
12346450
12348116
7865538
5655572
87600933
85982055
0.00
1e-120
1e-23
1e-8
0.00
0.00
max(pi)
100.0
89.0
100.0
max(q_end – q_start) sum(q_end – q_start)
550
694
29
29
876
1688
group by query_id, subject_id
CA0981762
CA0981762
CA0981762
CA0981762
XP_1349.1
XP_1349.1
chr3
chr3
chr3
chr11
chr8
chr8
CA0981762 chr3
CA0981762 chr11
XP_1349.1 chr8
x 3
x 2
count(*)
3
1
2
Working with BLAST
BLAST: too for aligning some sequences against some others.
Powerful, versatile and quite accurate.
Slow for some specialised applications.
BLASTn , BLASTx, BLASTp, etc.
$prompt>blastn [options] -query [queryfile.fasta] -db [database name] [> output.txt]
The query file is your responsibility and must be plain text.
The database file is an ‘indexed’ file and is what creates the speed.
BLAST makes that for you from another plaintext sequence file.
$prompt>makeblastdb -in [database.fasta] –dbtype [DNA/protein] -out [database]
The most useful option is to create TABULAR output and redirect the
file into a text file, which you then load into a database table.
But there are many more you will want to use...
Some BLAST parameters
$prompt>blastn –outfmt 6 –evalue 1e-20 -max_target_seqs 20 –query q.fasta –db dbname
-outfmt
-evalue
-max_target_seqs
-db_soft_mask
-megablast/blastn
-wordsize
output in many forms
worst scoring alignment to report
reports best ‘n’ matches (except…)
masks repeat regions for initial lookup (only)
(different optimisation)
shorter wordsize can be more accurate but slower
BLAST makes approximations and uses ‘word’ length initial matches to
extend to find ‘best’ alignments.
Probably not the best tool for high throughput sequence data!
$prompt>blastdbcmd [parameters, etc]
[can be used to export sections of sequence data from a formatted
blastable database]
BLAST flavours
BLASTn
DNA query vs DNA database
BLASTx
(translated)DNA query vs protein database
BLASTp
protein query vs protein database
tBLASTn
protein query vs (translated)DNA database
tBLASTx
(translated)DNA query vs (translated)DNA database
Task...
Get BLAST running on your computer.
Look at the tabular output, and design database
table to hold an entire row of output data.
Run a BLAST search and load the output data into
the database table.
Query the data in the table for something
interesting...
Scripting
What is a script?
Essentially just a series of commands you could
otherwise run one after another at the prompt – you
just run the script instead.
But you can send parameters to the script.
This leads to flexibility and re-useability.
This can be used to create complex analysis
pipelines, or just simplify common tasks.
Scripting
What is a script?
Essentially just a series of commands you could
otherwise run one after another at the prompt – you
just run the script instead.
But you can send parameters to the script.
This leads to flexibility and re-useability.
This can be used to create complex analysis
pipelines, or just simplify common tasks.
Platforms
Windows batch files:
my-script.bat
rem
rem
rem
[your comments here]
query file
%1.fasta
output file
%2.txt
blastn –query %1.fasta –db frog-genome > %2.txt
LOAD DATA LOCAL INFILE %2.txt INTO TABLE blast_data
Unix shell scripts:
my-script.sh
#!/bin/sh
if [ $# = 0 ]
then
echo usage: [path] [file] [read length]
exit
fi
grep -c ">" $1$2-SAMPL-$3.fasta > $1$2-SAMPL-$3-COUNT.txt
Catches for shell scripts
Unix shell scripts
Need to make sure executable:
$prompt>ls -lh *.sh
-rw-r--r-- 1 migil sequence 2.5K Jul 22 2010 my-script.sh
$prompt>chmod +x my-script.sh
$prompt>ls -lh *.sh
-rwxr-xr-x 1 migil sequence 2.5K Jul 22 2010 my-script.sh
is . in your path? [‘.’ = ‘here’]
$prompt>./my-script.sh
Windows batch files
Cannot run internal scripts for other programs easily...
Platforms
my-script.sh
#!/bin/sh
if [ $# = 0 ]
then
echo usage: [path] [file] [read length]
exit
fi
grep -c ">" $1$2-SAMPL-$3.fasta > $1$2-SAMPL-$3-COUNT.txt
mysql -u solexa << EOSQL
use slx
truncate table blast_hits_two_names;
LOAD DATA LOCAL INFILE $1$2-SAMPL-$3-COUNT.txt INTO TABLE
blast_hits_two_names;
select count(*), count(distinct query_name)
from blast_hits_two_names;
EOSQL
Case study
Look for conserved motifs in fly human orthologs.
-ve
+ve
5432101234
---------L RC
L
M_KDCSPK_V
I HG
R I
V E
M
F
>grep [LMIV].[RKH][^CDGE][^C]S[^P][^KR].[LVIMF] fly-proteins.fasta > fly.txt
>
>grep [LMIV].[RKH][^CDGE][^C]S[^P][^KR].[LVIMF] Hs-proteins.fasta > Hs.txt
Load this data into a database
Find fly/human orthologs by reciprocal best blast
Look for ortholog pairs where both contain the motif...
Normal fasta file
>gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein [Homo sapiens]
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
NGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE
>gi|10047090|ref|NP_055147.1| small muscle protein, X-linked [Homo sapiens]
MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGA
KKLPGPAVNLSEIQNIKSELKYVPKAEQ
>gi|10047100|ref|NP_057387.1| WW domain binding protein 5 [Homo sapiens]
MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDI
HNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM
>gi|10047102|ref|NP_057388.1| ribosomal protein L24-like [Homo sapiens]
MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAG
KELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQK
VQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP
grep does not work over line ends...
So we need to flatten out the fasta files
(at some point it helps to have these guys in a database table...)
>gi|10047086|ref|NP_061821.1| MSIAGVAAQEIRVPLKTGNR…
>gi|10047090|ref|NP_055147.1| MNMSKQPVSNVRAIQANINI…
Then run grep and awk (to take only the first ‘field’ - the query string).
Advantages of scripts
Allow you to re-run with different parameters/search patters
Help keep track of what you are doing
Act as a documentary record of what you did (for publication)
Create a ‘resource’ that other people may find useful
Things to look for in MySQL
Define a column which automatically fills with sequential numbers
(AUTO_INCREMENT in MySQL, index/identity in others)
Temporary tables which ‘evaporate’ at the end of a session, so you don’t
have to clean them, or their data, up.
Indexing to speed up queries...
Logic in scripts (if, while, etc.)
A challenge....
Flatten out a fasta file !
Either >defline+’ ‘+sequence-on-one-line........
OR
>defline
sequence-on-one-line........
>gi|10047086|ref|NP_061821.1| mitogen-inducible gene 6 protein [Homo sapiens]
MSIAGVAAQEIRVPLKTGFLHNGRAMGNMRKTYWSSRSEFKNNFLNIDPITMAYSLNSSA
QERLIPLGHASKSAPMNGHCFAENGPSQKSSLPPLLIPPSENLGPHEEDQVVCGFKKLTV
NGVCASTPPLTPIKNSPSLFPCAPLCERGSRPLPPLPISEALSLDDTDCE
>gi|10047090|ref|NP_055147.1| small muscle protein, X-linked [Homo sapiens]
MNMSKQPVSNVRAIQANINIPMGAFRPGAGQPPRRKECTPEVEEGVPPTSDEEKKPIPGA
KKLPGPAVNLSEIQNIKSELKYVPKAEQ
>gi|10047100|ref|NP_057387.1| WW domain binding protein 5 [Homo sapiens]
MKSCQKMEGKPENESEPKHEEEPKPEEKPEEEEKLEEEAKAKGTFRERLIQSLQEFKEDI
HNRHLSNEDMFREVDEIDEIRRVRNKLIVMRWKVNRNHPYPYLM
>gi|10047102|ref|NP_057388.1| ribosomal protein L24-like [Homo sapiens]
MRIEKCYFCSGPIYPGHGMMFVRNDCKVFRFCKSKCHKNFKKKRNPRKVRWTKAFRKAAG
KELTVDNSFEFEKRRNEPIKYQRELWNKTIDAMKRVEEIKQKRQAKFIMNRLKKNKELQK
VQDIKEVKQNIHLIRAPLAGKGKQLEEKMVQQLQEDVDMEDAP
N.b. We note that MacOS unix version of sed does not
allow substitution of control characters (TAB,
LINEFEED, etc) – the function tr (translate) appears to
be able to overcome this limitation...