Bioinformatics Seminar 13/11/07

• Keith Satterley, Bioinformatics Division, WEHI
Summary:
• GABOS = Get A Bit Of Sequence.
• GAFEP = Get A Few Exon Primers.
• Functions and Facilities:
  – Web interface.
  – Command Line Interface.
• Data Management:
  – Genome data.
  – Result data.
• Tools Used:
  – Perl
  – HTML
  – PHP
  – Javascript
• Availability.
• Future Work.
• GABOS version 1 is at http://unix28.alpha.wehi.edu.au/bioinformatics/gabos
• WEB Page version 1 limitations:
  – Exons, DNA, Transcripts available.
  – Genomes are a hard coded list of latest version data only.
  – Annotation File is a hard coded list covering all genomes.
  – Chromosome selection was a list of the common chromosome filenames.
• Data Files Availability
  – All data has been downloaded from UCSC’s download site. It is described at:
    • http://hgdownload.cse.ucsc.edu/downloads.html and can be ftp downloaded from:
    • ftp://hgdownload.cse.ucsc.edu/goldenPath/
  – Genome data is stored on the WEHI Disk Server accessible from:
    • WEHI Unix computers – /home/users/lab0605/Bioinformatics/databases/genomes/UCSC
    • WEHI Windows computers – map a network drive to: \\unix33\bioinformatics
    • WEHI Macintoshes – Connect to Server at: smb://unix33/Bioinformatics
• Genomes at WEHI:
  – Jul 24 01:05 canFam -> canFam2
  – Jul 23 15:16 canFam1
  – Jul 23 15:16 canFam2
  – Jul 22 01:17 danRer -> danRer4
  – Jul 23 10:33 danRer3
  – Jul 23 15:20 danRer4
  – Nov  6 01:10 dm -> dm3
  – Nov  5 16:37 dm3
  – Jul 22 01:17 galGal -> galGal3
  – Jul 20 17:27 galGal2
  – Jul 23 10:11 galGal3
  – Jul 22 01:17 hg -> hg18
  – Jul 23 10:29 hg17
  – Jul 23 10:29 hg18
  – Aug 24 01:10 mm -> mm9
  – Jul 23 10:30 mm7
  – Aug 23 14:50 mm8
  – Aug 23 18:12 mm9
  – Jul 22 01:17 monDom -> monDom4
  – Jul 23 10:32 monDom4
  – Jul 22 01:17 panTro -> panTro2
  – Jul 23 10:32 panTro1
  – Jul 23 10:33 panTro2
  – Jul 22 01:17 rheMac -> rheMac2
  – Jul 25 02:32 rheMac2
  – Jul 22 01:17 rn -> rn4
  – Jul 23 10:33 rn3
  – Jul 23 10:33 rn4
• More can be downloaded as requested.
• Chromosome data Files:
  – Aug 23 14:09 chr9_random.fa
  – Aug 23 14:09 chrM.fa
  – Aug 23 14:09 chrUn_random.fa
  – Aug 23 14:14 chrX.fa
  – Aug 23 14:14 chrX_random.fa
  – Aug 23 14:14 chrY.fa
  – Aug 23 14:16 chrY_random.fa

  – Jul 23 16:11 chr9.fa
  – Jul 23 16:11 chrM.fa
  – Jul 23 16:13 chrNA_random.fa
  – Jul 23 16:14 chrUn_random.fa
  – Jul 23 16:14 md5sum.txt
  – Jul 23 16:14 README.txt
  – Jul 23 16:16 scaffoldNA_random.fa
  – Jul 23 16:16 scaffoldUn_random.fa

  – Jun 22 04:05 chr2L.fa
  – Jun 22 04:05 chr2LHet.fa
  – Jun 22 04:05 chr2R.fa
  – Jun 22 04:05 chr2RHet.fa
  – Jun 22 04:05 chr3L.fa
  – Jun 22 04:05 chr3LHet.fa
• Annotation Data Files:
• Data Management:
– Amount of data:
• How many genomes local? – currently 10 = 96GB.
– 19 Vertebrates available + 9 sequence only.
– 15 Insects, 5 Nematodes + 4 others available.
• How many versions of each? mm7, mm8, mm9?
– 2 or 3 of each?
• Chromosome data: 10-50 per genome.
• Annotation data: 5-10 per genome version
– RefSeq, genscan, mgc, xenoRef, uniGene, refFlat, ESTs, mRNAs …
– Up to date data!
• A tool is currently being written to check UCSC nightly and to download, unpack and sort annotation files.
• GABOS Sequence Retrieval Features
– Specify Search Criteria as either:
• Gene Name List
– as in Annotation Files
» NM_001037759, NM_145692, NM_027033, NM_013715 as in RefSeq.txt
» Sgk3, 4930418G15Rik, Cops5, Sulf1 as in refFlat.txt
• Chromosome Sequence Range specification.
– Chr10:13,500,000 - 14,550,000
– This will select all genes in this region that are defined in the annotation file(s) specified.
– Exons (incl. EST exons), Transcripts of Genes or straight DNA sequence can be retrieved.
• Specify either strand or both strands.
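The chromosome range search above can be sketched as follows. This is a minimal Python illustration, not the GABOS Perl code: the names `Region`, `parse_region` and `genes_in_region` are hypothetical, the gene records are invented toy values, and coordinates are assumed to be 1-based and inclusive.

```python
import re
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(spec: str) -> Region:
    """Parse a range such as 'Chr10:13,500,000 - 14,550,000'."""
    m = re.fullmatch(
        r"\s*(?:chr)?(\w+)\s*:\s*([\d,]+)\s*-\s*([\d,]+)\s*",
        spec,
        flags=re.IGNORECASE,
    )
    if m is None:
        raise ValueError(f"cannot parse region: {spec!r}")
    chrom, start, end = m.groups()
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if start_i > end_i:
        raise ValueError("region start is after region end")
    # Normalise to the 'chrN' spelling used by the chromosome file names.
    return Region("chr" + chrom, start_i, end_i)


# Hypothetical gene record: (name, chromosome, transcript start, transcript end).
Gene = Tuple[str, str, int, int]


def genes_in_region(genes: List[Gene], region: Region) -> List[str]:
    """Return names of genes that overlap the region by at least one base."""
    return [
        name
        for name, chrom, tx_start, tx_end in genes
        if chrom == region.chrom and tx_start <= region.end and tx_end >= region.start
    ]
```

A gene is selected here if any part of it falls in the range; whether GABOS requires full containment instead is not stated on the slide.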
• Extra Sequence Parameters
– Range of bases in data object (e.g. bps in an Exon)
• 1-e = all, base 1 to the end base (the default)
• 1-10 = bases 1 to 10
• 10-e = base 10 to end base in object.
– Range of objects requested (e.g. a range of Exons)
• 1-e = all exons (the default)
• 1-3 = exons 1 to 3.
• 1 = first exon only
• e = last exon only
– Possible Extensions
• (e-3)-e = last three objects (or bases)
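The range notation above (1-e, 1-3, 1, e) can be illustrated with a short Python sketch. The function names `parse_range` and `select` are hypothetical, and the proposed "(e-3)-e" extension is deliberately not handled; positions are 1-based and inclusive, as on the slide.

```python
from typing import Sequence, Tuple


def parse_range(spec: str, length: int) -> Tuple[int, int]:
    """Turn '1-e', '1-3', '1' or 'e' into 1-based inclusive (first, last)."""

    def position(token: str) -> int:
        token = token.strip()
        return length if token == "e" else int(token)

    parts = spec.split("-")
    if len(parts) == 1:
        first = last = position(parts[0])
    elif len(parts) == 2:
        first, last = position(parts[0]), position(parts[1])
    else:
        raise ValueError(f"unsupported range: {spec!r}")
    if not 1 <= first <= last <= length:
        raise ValueError(f"range {spec!r} does not fit an object of length {length}")
    return first, last


def select(items: Sequence, spec: str = "1-e"):
    """Apply a range to a list of exons or to a string of bases."""
    first, last = parse_range(spec, len(items))
    return items[first - 1:last]
```

The same helper serves both parameters: `select(exons, "1-3")` picks exons 1 to 3, and `select(bases, "10-e")` keeps base 10 to the end.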
• GABOS Extras:
– Specify the line length of the FASTA output file.
– Output Sequence Lines ONLY.
– Output Fasta Description Lines ONLY.
– Concatenate ALL Sequences.
– Concatenate ONLY Sequence from a DNA object (Each gene’s exons concatenated for example).
– String of characters to be inserted BEFORE each DNA object.
– String of characters to be inserted AFTER each DNA object.
– Specify flanking bases.
– Show co-ordinates relative to: Chromosome, Exon, Transcript
– Uses either RefSeq or Browser gene names in refFlat.txt
• GAFEP (Get a Few Exon Primers)
– Use output of GABOS to find primers around each exon.
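Three of the GABOS extras listed above (FASTA line length, concatenation, and the BEFORE/AFTER strings) can be sketched in a few lines of Python. The helper names and the default line length of 60 are assumptions for illustration only, not GABOS behaviour.

```python
from typing import Iterable


def join_objects(sequences: Iterable[str], before: str = "", after: str = "") -> str:
    """Concatenate DNA objects, inserting a marker string before and after each."""
    return "".join(before + seq + after for seq in sequences)


def wrap_fasta(description: str, sequence: str, line_length: int = 60) -> str:
    """Format one FASTA record with sequence lines of the requested length."""
    if line_length < 1:
        raise ValueError("line_length must be at least 1")
    lines = [">" + description]
    lines += [
        sequence[i:i + line_length] for i in range(0, len(sequence), line_length)
    ]
    return "\n".join(lines)
```

For example, `wrap_fasta("gene1 exons", join_objects(exons, "[", "]"), 50)` would mark each exon boundary inside one concatenated record.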
GABOS Command Line Version (CLI).
• Same code. Program detects environment and adjusts accordingly.
• CLI use of GABOS caters for programmatic use of the tool as part of other tasks.
– E.g. collecting 5000 bases before a transcript and 5000 into the transcript, to be used for promoter/regulation searching across thousands of genes.
CLI example:
gabos -afile refFlat.txt -genome mm9 -seqrange 4,482,560-4,483,185 -chr 1 -pre 420 -post 420 -fastaonly >my_results.fa
Options can be in any order. Output can be redirected to a file as shown.
A file of gene names could be used as input instead of a chromosome sequence range.
gabos -help
lists all options.
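For programmatic use, the command shown above could be assembled from another script. The Python sketch below only builds the argument list; it uses the options exactly as they appear in the slide's example and assumes nothing further about their meaning. The function name `gabos_command` is hypothetical.

```python
import shlex
from typing import List


def gabos_command(genome: str, afile: str, chrom: str, start: int, end: int,
                  pre: int = 0, post: int = 0) -> List[str]:
    """Build an argument list mirroring the CLI example on the slide."""
    return [
        "gabos",
        "-afile", afile,
        "-genome", genome,
        "-seqrange", f"{start:,}-{end:,}",  # thousands separators as in the example
        "-chr", str(chrom),
        "-pre", str(pre),
        "-post", str(post),
        "-fastaonly",
    ]


def as_shell_line(argv: List[str]) -> str:
    """Render the argument list as a single shell-safe command line."""
    return shlex.join(argv)
```

A caller could pass the list to `subprocess.run(argv, stdout=handle)` once per region, which is one way to cover thousands of ranges without retyping the command.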
• CLI additional abilities:
– Gene lists read from a file or piped in.
– Debugging options available.
– Specification of alternate locations (enables use of the program at other sites without modification) for:
• Annotation files.
• Genome data files.
– Checks if data files are the latest version and updates them if not (to be replaced with an upgraded procedure).
GABOS Command Line options:
• -addend:s
• -addstart:s
• -dna:s
• -basedir:s
• -genome:s
• -afile:s
• -adir:s
• -gdir:s
• -check!
• -name:s
• -namep:s
• -namef:s
• -chr:s
• -seqrange:s
• -strand:s
• -dataobject:s
• -objectrange:s
• -baserange
• -seqonly
• -fastaonly
• -linelength:i
• -relative:s
• -pre:i
• -post:i
• -v!
• -debug1:i
• -debug2:i
• -debug3:i
• -debug4:i
• -debug5:i
• -debug6:i
• -debugall:i
• -h|help|?
• -version
All GAFEP programs can also be run at the command line. In particular:
• Combine_overlapping_exons
• Create_primers1
• Create_primers2
• Makep3i
• P3out2tab
• Demo of GABOS version 2.
http://unix28.alpha.wehi.edu.au/bioinformatics/gabos/testing_index.php
– Improvements:
• Automatically reads genomes available.
• Automatically shows chromosome data for genome selected.
• Automatically shows Annotation data files for genome selected.
• Includes ability to read EST data files.
• Uses alternate gene name in refFlat.txt.
• Faster processing of large data files using/making presorted versions.
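Why presorted annotation files speed things up can be shown with a small Python sketch. It assumes the usual UCSC refFlat.txt column order (geneName, name, chrom, strand, txStart, txEnd, ...) and tab-separated fields; the class and function names are hypothetical, and this is not how GABOS itself is implemented.

```python
import bisect
from typing import Iterable, List, NamedTuple


class Transcript(NamedTuple):
    gene: str
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int


def parse_refflat_line(line: str) -> Transcript:
    """Read the first six tab-separated refFlat fields."""
    f = line.rstrip("\n").split("\t")
    return Transcript(f[0], f[1], f[2], f[3], int(f[4]), int(f[5]))


class SortedAnnotation:
    """Transcripts presorted by (chromosome, start) for binary search."""

    def __init__(self, transcripts: Iterable[Transcript]) -> None:
        self.records = sorted(transcripts, key=lambda t: (t.chrom, t.tx_start))
        # Keys are built once, so each later lookup costs O(log n).
        self.keys = [(t.chrom, t.tx_start) for t in self.records]

    def starting_in(self, chrom: str, start: int, end: int) -> List[Transcript]:
        """Transcripts on chrom whose start lies in [start, end]."""
        lo = bisect.bisect_left(self.keys, (chrom, start))
        hi = bisect.bisect_right(self.keys, (chrom, end))
        return self.records[lo:hi]
```

With an unsorted file every query scans all lines; after a one-off sort, only the matching slice is touched. Note that this lookup finds transcripts that start inside the window, so a transcript that begins before it and runs into it would need an extra check.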
• GAFEP = Get A Few Exon Primers.
This is a suite of programs.
1. Combines overlapping exons into one “CExon”.
2. Displays Primer3 options and collects choices.
3. Creates input files for Primer3 in the required format.
4. Runs Primer3, displays output on the web page and reformats the output suitable for pasting into Excel.
5. The same code runs from the web interface or from a Command Line Interface.
Combining Exons to reduce number of primers needed.
[Figure: seven exons, numbered 1 to 7, grouped into two CExons and one Exon.]
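Combining overlapping exons is an interval-merging problem. A minimal Python sketch follows, assuming exons are given as (start, end) pairs with inclusive coordinates on one chromosome; the function name echoes the GAFEP program name but this is an illustration, not that program's code.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def combine_overlapping_exons(exons: List[Interval]) -> List[Interval]:
    """Merge exons that share at least one base into combined 'CExons'."""
    merged: List[Interval] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if this exon reaches further.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```

Each merged interval then needs only one primer pair instead of one pair per original exon.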
Short Exons
• Short exon: 120bp.
• Pad out short exons to 300 bp: 90 + 120 + 90 = 300.
• Add a 70 bp. cushion: 70 + 90 + 120 + 90 + 70 = 440.
• Add 200bp flanks (primers in flanks): 200 + 70 + 90 + 120 + 90 + 70 + 200 = 840.
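The short-exon arithmetic above can be reproduced with a small Python sketch. The 300, 70 and 200 bp values come from the slide; the function name and the even split of padding between the two sides are assumptions.

```python
from typing import Dict


def short_exon_layout(exon_len: int, min_len: int = 300,
                      cushion: int = 70, flank: int = 200) -> Dict[str, int]:
    """Sizes of each layer built around a short exon."""
    padded = max(exon_len, min_len)           # pad out to the minimum target length
    pad_total = padded - exon_len
    left_pad = pad_total // 2                 # assumed: padding split evenly
    right_pad = pad_total - left_pad
    with_cushion = padded + 2 * cushion       # cushion on both sides
    with_flanks = with_cushion + 2 * flank    # primers are chosen in the flanks
    return {
        "left_pad": left_pad,
        "right_pad": right_pad,
        "padded": padded,
        "with_cushion": with_cushion,
        "with_flanks": with_flanks,
    }
```

For the 120 bp exon on the slide this gives 90 bp of padding each side, then 300, 440 and 840 bp for the successive layers.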
Long Exons
• Long exon: 900bp.
• Split into two 485bp pieces with a 70bp overlap.
• Add a 70 bp. cushion: 70 + 485 + 70 = 625.
• Add 200bp flanks (primers in flanks): 200 + 70 + 485 + 70 + 200 = 1025.
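Splitting a long exon into equal overlapping pieces can be sketched as below. The 70 bp overlap is from the slide; the 500 bp maximum piece length is a hypothetical threshold chosen only so that the 900 bp example yields two pieces, and positions are 0-based half-open offsets within the exon.

```python
import math
from typing import List, Tuple


def split_long_exon(exon_len: int, max_piece: int = 500,
                    overlap: int = 70) -> List[Tuple[int, int]]:
    """Split an exon into the fewest equal pieces that overlap by `overlap` bases."""
    if max_piece <= overlap:
        raise ValueError("max_piece must be larger than overlap")
    # n pieces of length p overlapping by o cover n*p - (n-1)*o bases.
    n = 1
    while n * max_piece - (n - 1) * overlap < exon_len:
        n += 1
    piece = math.ceil((exon_len + (n - 1) * overlap) / n)
    step = piece - overlap
    return [(i * step, min(i * step + piece, exon_len)) for i in range(n)]
```

For 900 bp this returns two 485 bp pieces, (0, 485) and (415, 900), sharing 70 bases; adding the 70 bp cushions and 200 bp flanks to each piece gives the 625 and 1025 bp totals shown on the slide.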
• Demonstration of GAFEP
GAFEP Output
An example application:
Ben Kile’s lab is using GABOS/GAFEP to create primers to search for sequence variations caused by ENU mutations in mice.
Random chemical mutagenesis in the mouse
N-ethyl-N-nitrosourea (ENU): H3C-CH2-N(N=O)-C(=O)-NH2
• Alkylating agent
• Point mutagen
• Efficiently mutates mouse spermatogonial stem cells
Male mice treated with ENU produce offspring heterozygous for ENU-induced mutations at the rate of 1 mutation per 1.5 megabases
Phenotyping screen: measuring platelet number
[Figure: blood test of mutant offspring; platelet counts (×10³/µL).]
Plt16 and Plt20 cause dominant thrombocytopenia
Mapping strategy for dominant mutations
[Figure: breeding scheme with labels: Affected (m), Wild-type, C57BL/6 × Balb/c, 1st Outcross, F1 Generation, 2nd Outcross, F2 Generation, Affected (m), Unaffected.]
Mapping strategy for dominant mutations
1. Genome-wide scan with 80-100 microsatellites
– 20 affected and 20 unaffected animals
– Result: mutation assigned to a chromosome
2. Fine mapping
– 200-1,000 informative meioses, genotyped with SSLPs at increasing density
– Result: candidate interval refined to 1-3 Mb
Issues
– Recombination cold spots
– Polymorphism deserts
[Figure: SNP density map of mouse chromosome 1 (C57BL/6 v 129Sv).]
Candidate intervals
• Heaven: Chromosome 2: 20-21 Mb
• Hell: Chromosome 11: 70-71 Mb
Candidate gene sequencing
Prioritize candidates for sequencing on the basis of:
• Known function
• Homology to other genes of known function
• Tissue expression pattern
• Domain structure
• Exhaustive literature searches…
Candidate gene sequencing
1. Automated PCR primer design
– Robotic liquid handling
2. Genomic PCR
– In-well template clean-up
3. Direct amplicon sequencing
4. Capillary electrophoresis
5. Sequence analysis
• Tools used to develop GABOS/GAFEP
• Perl programming language for all programs.
• Web interface
– HTML coding
– PHP – inserted into HTML and processed by the webserver before the resulting HTML is sent to the client’s browser.
– Javascript – processed by the client’s web browser (Mozilla Firefox or Safari for example)
WEHI Computing Layout
[Figure: network diagram with the following labels.]
• Unix Server = unix28; Webserver = apache; php processed here; html produced here; Unix28 disk; GABOS/GAFEP.
• Client = Mac, Windows; Browser = Firefox, IE …; html processed here; Javascript acts here in response to user; display of GABOS/GAFEP here.
• unix33; Genome DATA; UCSC.
• Links: wan/lan, nfs, ftp.
• Web Interface Debugging tools
– Firefox Error Console
– Firebug add-on for Firefox
• Future Work:
– Short term:
• Finalize GABOS version 2
– Transcript, DNA working
• Complete data download maintenance program
• Automate sorting of annotation files and modify GABOS to be
aware of sorted/non-sorted data and act accordingly.
• Include ability to retrieve RNA data
• Will run on any unix server – not just unix28.
• Web Interface available on WEHI’s public server.
• Source code will be made freely available.
– Longer Term:
• Retrieve data for UTRs, others?
• Provide web interface access to annotation files.
• Remove need for BioPerl to be installed.
Acknowledgements:
• Bioinformatics Division
– Terry Speed & Gordon Smyth for the opportunity to pursue this
project in an excellent environment.
– All others in Bioinformatics for many and varied help.
• WEHI ITS
– Nick Tan, Jakub Szarlat for Unix help.
– Dung Tran, Scott Wood for network help.
– Tri Le and John Nguyen for MS windows support.
– Tony Kyne & others in ITS for many questions answered.
• Molecular Medicine
– Doug Hilton, Ben Kile for explaining their needs.
• Users for their feedback.
– Kylie Greig, Adrienne Hilton, Greg Hather, Carolyn de Graaf …