- Lorentz Center


The Zebrafish Genome Sequencing Project
Bioinformatics resources
Kerstin Howe, Mario Caccamo, Ian Sealy
Bioinformatics resources
outline
• clone mapping, sequencing and manual annotation in Vega
• genome assemblies and automated annotation in Ensembl
• integrated ZF-Models data and tools
Clone mapping and sequencing
mapping
• 2 BAC Tuebingen libraries
• 1 BAC and 1 cosmid library from single Tuebingen double-haploid fish
• end sequencing, RH mapping, fingerprinting
• pieced together according to fingerprints, marker mapping, sequence alignment
• currently ~ 2500 ctgs
Clone mapping and sequencing
sequencing pipeline
• select clones based on position in fpc contig
• subcloning
• sequencing
• automatic assembly/pre-finishing
(back to sequencing if necessary)
• finishing
• QC
• automated analysis pipeline
• manual annotation
• submission to EMBL
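Selecting clones by their position in an fpc contig amounts to choosing a tile path: a small set of overlapping clones that covers the contig. The following is a minimal sketch in Python, assuming each clone is reduced to a (name, start, end) interval in contig coordinates. The clone names, the coordinates and the `min_overlap` parameter are hypothetical, and the greedy rule is a textbook interval-cover method, not necessarily the selection procedure used in the project.

```python
from typing import List, Tuple

# (name, start, end) in contig coordinates; an assumed, simplified layout
Clone = Tuple[str, int, int]

def tile_path(clones: List[Clone], min_overlap: int = 0) -> List[Clone]:
    """Greedy cover: start with the leftmost clone, then repeatedly take the
    clone that overlaps the covered region by at least min_overlap and
    reaches furthest to the right."""
    remaining = sorted(clones, key=lambda c: (c[1], -c[2]))
    if not remaining:
        return []
    path = [remaining[0]]
    covered_end = remaining[0][2]
    i = 1
    while True:
        best = None
        # scan every clone that starts early enough to overlap the path
        while i < len(remaining) and remaining[i][1] <= covered_end - min_overlap:
            cand = remaining[i]
            if cand[2] > covered_end and (best is None or cand[2] > best[2]):
                best = cand
            i += 1
        if best is None:
            break  # end of the contig, or a gap in the map
        path.append(best)
        covered_end = best[2]
    return path

# made-up example clones
example = [("A", 0, 100), ("B", 20, 110), ("C", 50, 180),
           ("D", 90, 150), ("E", 170, 260)]
selected = [name for name, _, _ in tile_path(example)]
```

With the made-up clones above, the path skips B and D because C reaches further to the right while still overlapping A.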
Manual annotation
unfinished sequence, finished sequence
automated analysis pipeline
• RepeatMasker
• CpG island prediction
• Genscan
• FGenesh
• halfwise (Pfam)
• EPCR
• Blast (ESTs, cDNAs, proteins)
manual annotation
• gene structures
• remarks (gene names, function, similarities)
• other features
otter
• mysql database in 'ensembl style'
• acedb or apollo front end
• open to users from the 'outside'
EMBL
Manual annotation
annotation policy
• follows guidelines for human annotation (HAVANA team, Sanger Institute)
• no "guesses", annotations solely based on supporting evidence
• annotation of: CDSs and UTRs / transcripts, splice variants, pseudogenes, poly A features, transposons, repeats
• approved nomenclature (SI:clone.number)
• collaboration with ZFIN: existing ZFIN records are reported, ZFIN provides new records for newly found genes
Manual annotation
(screenshot of the annotation view; tracks shown: repeats, DNA, CpG island, FGenesH, Genscan, proteins, mRNAs, ESTs)
vega.sanger.ac.uk
Vega
contigview
Vega
geneview
www.sanger.ac.uk/Projects/D_rerio
when to use what
go to vega.sanger.ac.uk if you need
• highly reliable sequence
• highly reliable annotation (with your input)
• ‘your gene’ stable over time (TILLING)
go to www.ensembl.org if you need
• the whole genome
• comparative data
• ZF-Models microarray or insertional mutagenesis data
• complicated searches (BioMart)
Zebrafish Genome Project
• whole genome shotgun sequencing: WGS reads → WGS assembly (contig, supercontig)
• clone mapping and sequencing: clone libraries, markers (T51) → map (fpc ctg, tile path, BACs) → sequencing → (un)finished clones
• integration (clones+ctgs) → assembly release (Zv5), 1.63 Gb
• finish clone: ~ 8,000 finished clones (~1 Gb)
• contigs
• automatic annotation
• manual annotation
WGS assembly
Phusion assembler - High Performance Assembly Group (Zemin Ning et al.)
• reads
• group reads (groups A, B, C)
• phrap: each group of reads is assembled into contigs
• read-pair tracker: contigs are joined into supercontigs, with gaps between contigs padded as NNNNNNNN
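The step from contigs to supercontigs can be illustrated with a small graph sketch: two contigs are linked when enough read pairs have one end in each. This is a minimal sketch in Python, assuming the contig of every read is already known. The function name, the input layout and the `min_links` threshold are hypothetical, and the sketch does not order or orient contigs or estimate gap sizes, which a real read-pair tracker has to do.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

def supercontigs(read_to_contig: Dict[str, str],
                 read_pairs: Iterable[Tuple[str, str]],
                 min_links: int = 2) -> List[Set[str]]:
    """Group contigs that are connected by at least min_links read pairs."""
    # count read pairs whose two ends fall in different contigs
    links: Counter = Counter()
    for r1, r2 in read_pairs:
        c1, c2 = read_to_contig.get(r1), read_to_contig.get(r2)
        if c1 is not None and c2 is not None and c1 != c2:
            links[frozenset((c1, c2))] += 1

    # union-find over contig names
    parent = {c: c for c in set(read_to_contig.values())}

    def find(c: str) -> str:
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for pair, n in links.items():
        if n >= min_links:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return sorted(groups.values(), key=lambda g: sorted(g))

# made-up placement of reads in contigs A-D and three read pairs
placement = {"r1": "A", "r2": "B", "r3": "A", "r4": "B",
             "r5": "B", "r6": "C", "r7": "D"}
pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
result = supercontigs(placement, pairs)
```

In the made-up data, A and B share two read pairs and end up in one supercontig, while the single pair between B and C is below the threshold and is ignored.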
Read grouping
• k-mer word hashing
gap hash k=12 (4x3) - dealing with variation
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGCGTGCAGTCCATGT
TGGCGTGCAGTCCATGTT
GGCGTGCAGTCCATGTTC
GCGTGCAGTCCATGTTCG
continuous base hash - k=12
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGCGTGCAGT
TGGCGTGCAGTC
GGCGTGCAGTCC
GCGTGCAGTCCA
• word distribution: frequency plotted against k-mer occurrence, with seq. errors at low occurrence, a peak at ~7, and repeats at high occurrence
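The two hashing schemes and the word distribution can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Phusion implementation: the 18-base windows on the slide suggest that "4x3" means four blocks of three sampled bases separated by two skipped bases, and the mask below encodes that reading. The mask and all function names are assumptions.

```python
from collections import Counter
from typing import Iterator, List

def continuous_kmers(seq: str, k: int = 12) -> Iterator[str]:
    """Every window of k consecutive bases."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# Assumed reading of "gap hash k=12 (4x3)": 1 = base used, 0 = base skipped.
GAP_MASK = "111001110011100111"

def gapped_kmers(seq: str, mask: str = GAP_MASK) -> Iterator[str]:
    """Every window of len(mask) bases, keeping only the unmasked positions."""
    keep = [i for i, m in enumerate(mask) if m == "1"]
    for i in range(len(seq) - len(mask) + 1):
        window = seq[i:i + len(mask)]
        yield "".join(window[j] for j in keep)

def word_occurrences(reads: List[str], k: int = 12) -> Counter:
    """How often each k-mer word occurs across all reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(continuous_kmers(read, k))
    return counts

def occurrence_histogram(counts: Counter) -> Counter:
    """Number of distinct words seen once, twice, ... (the word distribution)."""
    return Counter(counts.values())

seq = "ATGGCGTGCAGTCCATGTTCGGATCA"     # sequence from the slide
variant = seq[:3] + "A" + seq[4:]      # substitution at a skipped position
```

Under this mask, `seq` and `variant` give the same first gapped word but different first continuous words, which is one way a gapped hash can tolerate variation between reads.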
Zebrafish Genome Project
• whole genome shotgun sequencing: WGS reads → WGS assembly
• clone mapping and sequencing: clone libraries, markers (T51) → map → sequencing → (un)finished clones
• integration → assembly release (Zv5)
• ~ 7,000 finished clones (~1 Gb)
• automatic annotation
• manual annotation
Integration
• fpc contig of BACs: BX005153, BX005057.8, BX005049.6, BX005123.6
• placed on the WGS supercontig (Zv5 scaffoldn) using cDNA, bacends and marker
• resulting order: Zv5 scaffoldn.1, BX005153, Zv5 scaffoldn.3, BX005057.8, Zv5 scaffoldn.5, BX005049.6 BX005123.6, Zv5 scaffoldn.7
Assemblies
assembly | Zv5 | Zv4 | Zv3 | Zv2
release date | 27.05.05 | 12.07.04 | 27.11.03 | 03.04.03
total length [bp] | 1,630,306,866 | 1,592,025,686 | 1,459,115,486 | 1,452,210,772
scaffolds | 16,214 | 21,333 | 58,339 | 83,470
finished clones | 4,519 (699 Mb) | 2,828 (443 Mb) | 1,502 (263 Mb) | -
scaffolds in chr 1-25 | 1,749 | 1,892 | 1,490 | -
scaffolds in fpc contigs | 265 (chrU) | 694 (chrU) | 1,842 | 5,677
NA scaffolds | 14,676 | 18,747 | 54,798 | 77,793
sum(length) chr 1-25 [bp] | 1,200,129,620 (73%) | 1,097,507,810 (69%) | 718,270,423 (49%) | -
sum(length) ctgs [bp] | 183,993,739 (11%) | 176,222,396 (11%) | 365,271,659 (25%) | 1,143,459,008
sum(length) NAs [bp] | 246,183,507 (16%) | 318,295,480 (20%) | 335,615,307 (23%) | 308,751,764
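The length figures for the assemblies can be cross-checked with a little arithmetic: each category is a share of the total length, and the three categories should add up to the total. A minimal sketch in Python, with the numbers copied from the slide and Zv2 left out because it has no chromosome figure. The dictionary layout and function names are assumptions made for this example, and shares computed this way can differ by a point from the rounded percentages shown on the slide.

```python
# name: (total, chr 1-25, fpc contigs, NA scaffolds), all in bp, from the slide
ASSEMBLIES = {
    "Zv5": (1_630_306_866, 1_200_129_620, 183_993_739, 246_183_507),
    "Zv4": (1_592_025_686, 1_097_507_810, 176_222_396, 318_295_480),
    "Zv3": (1_459_115_486, 718_270_423, 365_271_659, 335_615_307),
}

def shares(name: str) -> dict:
    """Percentage of the total length in each category."""
    total, chrom, ctgs, na = ASSEMBLIES[name]
    parts = (("chr 1-25", chrom), ("ctgs", ctgs), ("NAs", na))
    return {label: 100.0 * bp / total for label, bp in parts}

def unaccounted(name: str) -> int:
    """Bases of the total length not covered by the three categories."""
    total, *parts = ASSEMBLIES[name]
    return total - sum(parts)

for name in ASSEMBLIES:
    summary = ", ".join(f"{k} {v:.1f}%" for k, v in shares(name).items())
    print(f"{name}: {summary}; unaccounted {unaccounted(name):,} bp")
```

For Zv5 and Zv4 the three categories add up exactly to the total length; for Zv3 a remainder is left over, which matches its percentages summing to less than 100.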
Automatic Annotation
• Zebrafish Proteins, Other Proteins → Genewise → Genewise genes
• Zebrafish cDNAs → Exonerate → Aligned cDNAs
• Zebrafish ESTs → Exonerate → Aligned ESTs
• Genewise genes + Aligned cDNAs → Genewise genes with UTRs
• Genewise genes with UTRs + Supported ab initio (optional) → Genebuilder → Final set
• Aligned ESTs → ClusterMerge → Ensembl EST genes
Ensembl
Contigview
Geneview
Searching Ensembl
Biomart
start
filter
output
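The three BioMart steps (pick a starting dataset, add filters, choose the output) map naturally onto a structured query. The sketch below builds such a query as XML in Python. The element layout is modelled on the XML query format used by BioMart web services, but the dataset, filter and attribute names are assumptions for illustration and should be checked against the BioMart server actually used.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

def biomart_query(dataset: str, filters: Dict[str, str],
                  attributes: List[str]) -> str:
    """start = dataset, filter = Filter elements, output = Attribute elements."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")

# hypothetical names: a zebrafish gene dataset, filtered to chromosome 1
xml_query = biomart_query(
    dataset="drerio_gene_ensembl",
    filters={"chromosome_name": "1"},
    attributes=["ensembl_gene_id", "external_gene_name"],
)
```

The resulting string is what would be sent to a BioMart service as the query; the sketch stops short of any network call.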
Do’s and Don’ts
go elsewhere (Ensembl) if you
want to know about the whole genome
need comparative data
need ZF-Models microarray or insertional mutagenesis data
need to do complicated searches
go to Vega if you
need highly reliable sequence
need highly reliable annotation
need ‘your gene’ stable over time (TILLING)
DAS
• genome browser (DAS client) with local storage
• reference sequence
• several DAS servers, each with remote storage
• data exchanged between servers and client as XML
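The client/server exchange can be illustrated from the client side: a DAS client asks a server for the features on a segment of the reference sequence and receives XML back. A minimal sketch in Python, assuming the DAS 1.x `features` command and its DASGFF response layout. The server URL, the source name and the sample features are made up, and the sample response is embedded so that no network access is needed.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

def features_url(server: str, source: str, segment: str,
                 start: int, stop: int) -> str:
    """URL of a DAS 'features' request for one segment of the reference."""
    return f"{server.rstrip('/')}/das/{source}/features?segment={segment}:{start},{stop}"

def parse_features(xml_text: str) -> List[Dict[str, object]]:
    """Extract id, type, start and end of each FEATURE in a DASGFF document."""
    root = ET.fromstring(xml_text)
    found = []
    for seg in root.iter("SEGMENT"):
        for feat in seg.iter("FEATURE"):
            found.append({
                "segment": seg.get("id"),
                "id": feat.get("id"),
                "type": feat.findtext("TYPE", default="").strip(),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
            })
    return found

# Made-up response in the DASGFF layout (hypothetical features)
SAMPLE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/zebrafish_snps/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="snp_1" label="snp_1">
        <TYPE id="SNP">SNP</TYPE>
        <START>1200</START>
        <END>1200</END>
      </FEATURE>
      <FEATURE id="indel_1" label="indel_1">
        <TYPE id="indel">indel</TYPE>
        <START>1500</START>
        <END>1503</END>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

url = features_url("http://das.example.org", "zebrafish_snps", "1", 1000, 2000)
features = parse_features(SAMPLE)
```

Because every server answers in the same XML layout, a client can overlay features from several remote servers on one reference sequence.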
SNPs and Indels
Ensembl releases
release | Zv5 | Zv4 | Zv3 | Zv2 | Human | Fugu | Tetraodon
genes | 22,877 | 23,526 | 22,409 | 20,062 | 24,194 | 22,339 | 28,005
transcripts | 32,143 | 32,071 | 30,783 | 26,587 | 35,845 | 22,102 | 28,005