Annotation of vector genomes, the Aedes aegypti model

Collective annotation of the Ixodes scapularis genome: VectorBase, MSCs and the tick community
Daniel Lawson, VectorBase
BRC6 28th October 2008
Arthropod vectors of human pathogens
Anopheles
Aedes
Culex
Ixodes
Pediculus
Rhodnius
Lutzomyia
Glossina
Phlebotomus
Deer tick Ixodes scapularis
• Vector of Lyme disease (spirochete Borrelia burgdorferi)
• Estimated genome size of 2.1 Gb
• Sequenced strain: Wikel
• 12th generation from ticks sourced from New York, Oklahoma & Connecticut
• First Chelicerate genome to be sequenced
Genome annotation cycle
ESTs, cDNAs
Repeat library (TEs etc.)
Other genomes, gene sets
Assembly
Automatic gene build
Manual annotations
Community annotations
Protein domains
Generating sequence
• Sequencing undertaken by established sequencing centres (e.g. Broad, JCVI)
• Initial assembly annotated in collaboration with the
sequencing centre(s)
• 19,300,000 trace reads generated
• Approx. 6x WGS
• 570K BAC end sequencing
• Assembly produced at JCVI
• 194K EST sequences
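As a rough consistency check on the figures above, the read count, the approximate WGS coverage and the estimated genome size together imply an average read length. This is only a back-of-the-envelope sketch: the 2.1 Gb genome size is an estimate, the 6x figure is approximate, and the calculation assumes all 19.3 million reads contribute to the WGS coverage.

```python
# Back-of-the-envelope check: coverage = reads * mean_read_length / genome_size
genome_size_bp = 2.1e9      # estimated genome size quoted in the talk
trace_reads = 19_300_000    # trace reads generated
wgs_coverage = 6.0          # approximate WGS coverage

# Rearranged for the mean read length implied by the three figures
implied_read_length_bp = wgs_coverage * genome_size_bp / trace_reads
print(f"Implied mean read length: {implied_read_length_bp:.0f} bp")
```

A value of several hundred bases is what one would expect from Sanger-era trace reads, so the quoted numbers are mutually consistent.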
Assembly statistics
• This WGS project has the project accession ABJB000000000. The current version of the project (01) has the accession number ABJB010000000, and consists of 1,141,594 scaffolds (ABJB010000001-ABJB011141594).
• Released assembly IscaW1
• 570,637 contigs
• 369,495 supercontigs
• Assembled coverage of 3.8x
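The size of the accession range quoted above can be checked against the stated scaffold count. The helper below is a hypothetical illustration (the function name is not from any GenBank tool); it assumes an accession is a letter prefix followed by a zero-padded serial number.

```python
import re

def accession_count(first: str, last: str) -> int:
    """Number of records in an inclusive accession range sharing one prefix."""
    m_first = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m_last = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share a letter prefix")
    # Inclusive range, hence the +1
    return int(m_last.group(2)) - int(m_first.group(2)) + 1

n_records = accession_count("ABJB010000001", "ABJB011141594")
print(n_records)
```

The result agrees with the 1,141,594 records stated for the project.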
Preparing for gene build
• Repeatmasking
• Analyses to identify repeat elements
• RepeatScout
• RECON
• Standard tandem-repeat & low-complexity filtering
• Collate data sets
• Transcripts (cDNA & EST data)
• Peptides (taxonomic groupings, inc. Daphnia pulex)
• Train gene predictors, mainly Augustus (JCVI)
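To illustrate what the repeatmasking step produces, here is a minimal sketch of applying repeat coordinates to a sequence. It is not how RepeatScout, RECON or any masking tool is implemented; the function name and the 0-based, half-open interval convention are assumptions made for the example.

```python
from typing import List, Tuple

def mask_repeats(seq: str, repeats: List[Tuple[int, int]], soft: bool = True) -> str:
    """Mask repeat intervals (0-based, half-open) in a sequence.

    Soft masking lower-cases the repeat bases; hard masking replaces them with N.
    """
    chars = list(seq)
    for start, end in repeats:
        # Clip each interval to the sequence bounds before masking
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)

soft_masked = mask_repeats("ACGTACGTAC", [(2, 5)])
hard_masked = mask_repeats("ACGTACGTAC", [(2, 5)], soft=False)
print(soft_masked, hard_masked)
```

Soft masking keeps the underlying bases available to downstream tools, whereas hard masking hides them from gene predictors entirely.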
Annotation plan
• First-pass gene prediction
• Focused on protein-coding genes (CDSs)
• Semi-automated approach
• This is not manual curation
• Involvement of community where possible
• Timely delivery of gene set
Gene Prediction
• Each group/centre has its own gene prediction pipeline/protocol.
• Each group produces a 1st pass ‘best guess’ set of predictions
• 0.5 sets, public release
• These sets are merged into a single set
• 1.0 set, not released
• Quality control activities
• 1.1 set, public release
• Annotated with protein features
• ... and released to the wider world
Merging gene predictions
Gene set #1
Gene set #2
Reduce to single predictions per locus
Compare exon/intron structures
Identical structures
Compatible structures
Different structures
Merge/Split structures
Complex
Add isoform predictions based on EST/Peptide data
Canonical gene set
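The comparison step in the workflow above can be sketched roughly as follows. This is an illustrative simplification, not the pipeline used for the merge: the category definitions are assumptions (here "compatible" means the same intron chain with different terminal exon boundaries), and each model is assumed to be a sorted list of exons on one scaffold and strand.

```python
from typing import List, Tuple

Exon = Tuple[int, int]   # (start, end), 1-based inclusive
Model = List[Exon]       # exons sorted by start, same scaffold and strand

def introns(model: Model) -> List[Exon]:
    """Intron coordinates implied by consecutive exons."""
    return [(left[1] + 1, right[0] - 1) for left, right in zip(model, model[1:])]

def spans_overlap(a: Model, b: Model) -> bool:
    """True if the genomic spans of two models overlap."""
    return a[0][0] <= b[-1][1] and b[0][0] <= a[-1][1]

def compare_structures(a: Model, b: Model) -> str:
    if a == b:
        return "identical"
    if introns(a) == introns(b):
        return "compatible"   # assumed: only terminal exon ends differ
    return "different"

def classify(model: Model, other_set: List[Model]) -> str:
    """Classify one prediction against the other group's gene set."""
    hits = [m for m in other_set if spans_overlap(model, m)]
    if not hits:
        return "no map"
    if len(hits) > 1:
        return "merge/split"  # one model here spans several models there
    return compare_structures(model, hits[0])

gene = [(100, 200), (300, 400)]
print(classify(gene, [[(150, 200), (300, 450)]]))
```

A fuller version would run the classification in both directions, since a merge seen from one gene set is a split seen from the other, and would route anything ambiguous to the "complex" bin for manual inspection.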
No Map
Merge annotation comparisons
Examples
Isoform-compat
Isoform-diff
Examples
Merge/Splits
Difficult
GBrowse viewer
VectorBase browser
Final gene set (IscaW1.1)
• 20,486 protein-coding genes
• 48% have Pfam domain
• 40% have supporting EST evidence
• 8,138 tRNAs
• Over-prediction of Ser (4,425) and Thr (1,527) tRNAs
• 301 ncRNAs
• Submitted to GenBank last week, release to be
coordinated in the next couple of weeks
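The percentages above translate into approximate gene counts, and the scale of the tRNA over-prediction becomes clearer as a fraction of the total. The counts derived from percentages are approximate because the percentages themselves are rounded.

```python
protein_coding = 20_486
trnas = 8_138

# Approximate gene counts implied by the rounded percentages
with_pfam = round(0.48 * protein_coding)
with_est = round(0.40 * protein_coding)

# Share of tRNA predictions that are Ser or Thr
ser_thr = 4_425 + 1_527
ser_thr_fraction = ser_thr / trnas

print(with_pfam, with_est, f"{ser_thr_fraction:.1%}")
```

Ser and Thr together account for roughly three quarters of all tRNA predictions, which is why they are flagged as over-predicted.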
Genome annotation cycle
ESTs, cDNAs
Repeat library (TEs etc.)
Other genomes, gene sets
Assembly
Automatic gene build
Manual annotations
Community annotations
Protein domains
Community annotation
Gene Build
GFF3
Web submission
CHADO
Researcher
Approval
Appraisal
Total: 13,339 entries
• An. gambiae: 9,423
• Cx. quinquefasciatus: 2,598
• Ae. aegypti: 1,281
• Ix. scapularis: 37
Community representative
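Community annotations in this workflow are exchanged as GFF3, which uses nine tab-separated columns per feature line. The sketch below shows a basic sanity check that a submission step might apply before loading; it is not VectorBase's submission code, and the class name, function name and example identifiers are hypothetical.

```python
from typing import Dict, NamedTuple
from urllib.parse import unquote

class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]

def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse one GFF3 feature line, checking column count and coordinates."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    start, end = int(cols[3]), int(cols[4])
    if start > end:
        raise ValueError("start must not exceed end")
    attributes = {}
    for field in cols[8].split(";"):
        if field:
            key, _, value = field.partition("=")
            # GFF3 attribute values are URL-escaped
            attributes[unquote(key)] = unquote(value)
    return Gff3Feature(cols[0], cols[1], cols[2], start, end,
                       cols[5], cols[6], cols[7], attributes)

# Hypothetical example record; the identifiers are made up for illustration
example = "scaffold_1\tcommunity\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=example%20gene"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["Name"])
```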
Community annotation track in browser
Lessons
• Annotation plan for sequencing and annotation of new genomes is
well established (MSC & BRC)
• Clearly defining the data release strategies (0.5, 1.0 & 1.1)
• Monthly conference calls
• Face-to-face meeting when merging 0.5 gene predictions
• Coordinated release between MSC, VectorBase and GenBank
But we can always improve
• Agreement on project/public identifiers at the start of the project
• Primarily contigs and supercontigs
• Overall nomenclature applied as final step in annotation
• More QC before the major milestones
• Better communication
Acknowledgements
EMBL-EBI
• Ewan Birney
• Martin Hammond
• Daniel Lawson
• Karyn Megy
Harvard
• Bill Gelbart
• Kathy Campbell
IMBB
• Kitsos Louis
• Pantelis Topalis
• Emmanuel Dialynas
Aedes
• Dave Severson
• Neil Lobo
Anopheles
• Frank Collins
• Neil Lobo
Culex
• Peter Atkinson
• Peter Arensburger
Imperial
• Fotis Kafatos
• George Christophides
• Bob MacCallum
• Seth Redmond
Ixodes
• Catherine Hill
• Jason Meyer
Colleagues
Sequencers { JCVI & Broad Institute }
BRCs { Pathema, ApiDB }
Ensembl { Genebuilders, Web, Compara, Core, Outreach }
Notre Dame
• Frank Collins
• Greg Madey
• Scott Emrich
• Ryan Butler
• Katie Cybulski
• Nate Konopinski
• Rob Bruggner (alumni)
• E.O. Stinson (alumni)