Harrow - Mouse Genome Informatics

Havana manual annotation
Jen Harrow
Wellcome Trust Sanger Institute
Overview
• General intro: work/aims of Havana
• Annotation tools
• Vega view
Havana Overview
• 24 team members (2 work remotely in Manchester/Glasgow)
• 3 genomes annotated with different levels of evidence
(human/mouse/zebrafish)
• Mouse EUCOMM/KOMP/NORCOMM project external
collaborators
• Feedback from experts (ENCODE/Biosapiens/Vega
website)
• Recently funded to complete human genome as part
of ENCODE scale-up
Havana Annotation
• Coding and non-coding loci
(known and novel)
• Pseudogenes
(processed/unprocessed/ transcribed)
• polyA sites/signal
• Check and submit to nomenclature databases
(MGI/HGNC/ZFIN)
• Specialised annotation/categorisation of variants
• Freely accessible via Vega/Ensembl browser
Rules for selecting CDS_type
[Flowchart; the branch layout is not recoverable from the transcript. Decision nodes and outcomes as labelled:]
• CDS?
• Identical* to Swissprot or Refseq NP_? (Yes: Known_CDS)
• Novel first or last coding exon?
• Shares >60% length of known_CDS, Swissprot or Refseq?
• Has cross-species Swissprot/Trembl or gene family support?
• Pfam domain structure identical to Known_CDS?
• Outcomes: Known_CDS, Novel_CDS, Putative_CDS
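The decision questions on this slide lend themselves to a small function. The sketch below is one possible reading only: the order of the checks, the fall-through to the cross-species/family support test, and all field names are assumptions, because the flowchart's branch layout is not preserved in the transcript.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical per-transcript evidence flags (names are illustrative)."""
    has_cds: bool
    identical_to_swissprot_or_refseq: bool = False
    novel_first_or_last_coding_exon: bool = False
    cross_species_or_family_support: bool = False
    fraction_of_known_cds_length: float = 0.0
    pfam_structure_matches_known: bool = False


def select_cds_type(ev: Evidence) -> Optional[str]:
    """Return a CDS_type label, or None when there is no CDS.

    ASSUMPTION: the branch order below is a guess at the slide's flowchart.
    """
    if not ev.has_cds:
        return None  # no CDS: left to the transcript assignment rules
    if ev.identical_to_swissprot_or_refseq:
        return "Known_CDS"
    if not ev.novel_first_or_last_coding_exon:
        # ASSUMPTION: a >60% length match with matching Pfam structure
        # is enough for Novel_CDS.
        if (ev.fraction_of_known_cds_length > 0.60
                and ev.pfam_structure_matches_known):
            return "Novel_CDS"
    # ASSUMPTION: remaining cases are decided by cross-species/family support.
    if ev.cross_species_or_family_support:
        return "Novel_CDS"
    return "Putative_CDS"
```

Keeping each question as a separate boolean field makes it easy to rearrange the checks once the real branch order is confirmed against the slide.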
Tagging transcripts
• Known CDS (Uniprot or Refseq)
• Novel CDS
• Putative CDS
• NMD
*now protein coding transcripts
• Biosapiens looked at novel/putative variants
• Found little evidence that they had same function as known variants
Transcript assignment
[Flowchart; the branch layout is only partly recoverable from the transcript. Bracketed numbers are the slide's footnote markers; the footnotes themselves are not in the transcript.]
• Transcript variant? [1] (No: CDS flowchart)
• Literature confirmed as non-coding? (Yes: Non_coding)
• Literature confirmed as antisense? (Yes: Antisense)
• >1 possible ORF? [2] (Yes: Ambiguous_ORF)
• Fully retains or runs >200 bases into intron? [3] (Yes: Retained_intron)
• Full-length transcript subject to NMD? (Yes: NMD [4])
• EST supported 2-3 exon variant? [5] (Yes: Putative)
• mRNA supported, non-canonical splice site? [6] (Yes: Artefact)
• Transcript [7]
Consortium links for ENCODE scale-up for whole human genome
CCDS:
HGNC/
Uniprot
NCBI/Refseq
Ensembl
UCSC
HSF
Manual annotation:
WTSI:Havana
Computational:
UCSC:Data centre/QC
WashU:Brent/QC
Yale:Gerstein/Pseudogene
MIT:Kellis/Comparative analysis
CRG:Guigo tracking/primer design
Sanger:Ensembl QC/tracking evidence
Experimental:
Lausanne
CRG:
Gingeras/Txfrag
(ENCODE)
Biosapiens
Goals for mouse annotation
• Funded EUCOMM project: 3 annotators, goal
~8000 KO (12000 genes, a mix of old and new; funded until 2009)
• Collaborate with CCDS mouse to annotate difficult
cases (updates highlighted in RT; not funded)
• Collaborate with WashU on KOMP project/NORCOMM
(no direct funding for annotation)
• Internal collaborations with WTSI mouse faculty
• Updates usually only on request or
Vega/Uniprot (HSF) feedback
Structure of manual annotation pipeline
[Pipeline diagram; arrows are not recoverable from the transcript. Components in the order labelled:]
• Compute farm
• Clone annotation submitted, EMBL database
• Ensembl pipeline database (clone-based raw analysis)
• incremental updates datasets possible
• Otter/loutre annotation database
• lace/spandit client
• mysql
• ana_notes for clone selection and tracking
• Convert xace
• Lace and Zmap: editing interfaces use temporary local database
• mysql
• regular QC dumps, GFF files etc.
• vega database
• http, das
• Vega/Ensembl browsers
New Annotation tool
(move from Fmap to Zmap)
Annotation Interface:
• data sets
Annotation Interface:
•Viewers and editors: Lace
Annotation Interface: Viewers and editors: Zmap overview
Clone tiling path
Zmap:viewing homologies
EST
Manual Annotation
Predictions
Repeats
RNA
Protein
Refseqs
mRNA alignments
Expand hits
“Traffic light” links
EST alignments
Vertical and horizontal splitting enables viewing of long genes
Homology hits
Annotation Interface:
•Viewers and editors: Zmap
Annotation Interface:
•Viewers and editors: Blixem
Annotation Interface:
•Viewers and editors: Dotter
Why is consistency important?
• Mouse projects EUCOMM/KOMP
(European Conditional Mouse Mutagenesis/KnockOut Mouse
Project (USA))
• 5 externally funded annotators
• 2 annotators based at WashU
• At least 3000 targets annotated /year
• If annotation isn’t reliable/consistent then
experiments can fail
Assessing scale of problem
• Test all annotators on an unannotated mouse
region (>30 loci, chr10)
• Time limit 2 days, no discussion between
team members
• Software modified to allow multiple sessions
to be opened on the same clones without being seen by
other members
• Laurens' reference annotation to compare
against
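One simple way to score such a test is per-locus agreement with the reference annotation. The sketch below assumes each annotation is reduced to a (CDS type, transcript type) pair per locus; the function name, locus names, and labels are made up for illustration and are not results from the test.

```python
def agreement_with_reference(reference, annotator):
    """Fraction of reference loci where the annotator made the same call.

    Both arguments map a locus name to a (cds_type, transcript_type) tuple.
    Loci the annotator did not annotate count as disagreements.
    """
    if not reference:
        raise ValueError("reference annotation is empty")
    matches = sum(1 for locus, call in reference.items()
                  if annotator.get(locus) == call)
    return matches / len(reference)


# Hypothetical data, for illustration only.
reference = {
    "locus_A": ("Known_CDS", "Coding"),
    "locus_B": ("Novel_CDS", "Coding"),
    "locus_C": (None, "Retained_intron"),
    "locus_D": (None, "Putative"),
}
annotator_1 = {
    "locus_A": ("Known_CDS", "Coding"),
    "locus_B": ("Putative_CDS", "Coding"),  # CDS type differs
    "locus_C": (None, "Retained_intron"),
    # locus_D not annotated
}
score = agreement_with_reference(reference, annotator_1)
print(f"{score:.2f}")
```

Scoring the CDS call and the transcript call separately would show which of the two assignments is harder to make consistently.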
Annotation consistency mouse chr10
[Figure: Mum1 locus. Track labels: reference, ensembl, CDS, transcript]
Inconsistencies between
annotators
Cirpb1
RP23-6P9.7
Efna1
Result of the test
• Guidelines needed to be updated
• Clarify variant assignment (CDS and
transcript), although assignment of
transcript is more troublesome
• Mostly, rate of annotation is attributed
to experience.
Current Vega mouse statistics
Nod region
Mouse chrX finished Nov 06
Contigview: polyA sites and signals visible
Locus report
CCDS
Date of annotation
xref
Transcript classes included
known
Retained intron
Evidence used to build transcript
Merge genes in Ensembl
Knockout transcripts:EUCOMM and KOMP
Compare KO:transcript against original transcript
KO
MIG database gene lists
Summary:mouse annotation
• Change from sequencing to KO targets or
CCDS
• Reannotation not automatic
(feedback/request)
• ENCODE project will help improve overall
Havana annotation (computational QC)
• Vega main portal to access data (updated
every 2 months)
• Move to multi-genome annotation with
improvements with Zmap
Acknowledgements
Havana:
Jeff Almeida
Clara Amid
If Barnes
Denise Carvalho-Silva
Sarah Donaldson
Adam Frankish
Elizabeth Hart
Rhoda Kinsella
Gavin Laird
Jane Loveland
Jonathan Mudge
Jeena Rajan
Harminder Sehra
Catherine Snow
Charles Steward
Marie-M. Suner
Mark Thomas
Laurens Wilming
Informatics:
James Gilbert
Chao-Kung Chen
Leo Gordon
Mustapha Larbaoui
Ensembl:
Steve Searle
Val Curwen
Acedb/Zmap:
Ed Griffiths
Roy Storey
Vega:
Stephen Trevanion
Project Coordinators:
Tim Hubbard
Gencode collaborators:
Roderic Guigo
France Denoeud
Alex Reymond
Taf1, TAF1 RNA polymerase II, TATA box binding protein (TBP)-associated factor
Phastcons track showing conservation
Annotation, vector design
NB: exon/exons deleted, not whole gene
96 BACs
3 rounds of recombineering in 96-well boxes
1: FRT bgal FRT
2: loxP
3: loxP
96 targeting constructs
Annotation system:
ANALYSIS PIPELINE: mRNA, EST, protein BLAST; Genscan, Fgenesh gene predictions; RepeatMasker; tandem repeats; CpG islands; RefSeq; Ensembl; .....
MYSQL DB: analysis data
LACE: transcript editing interface
FMAP/ZMAP: viewer
MYSQL DB: annotation
VEGA
ENSEMBL
Manual annotation:
• manual annotation of finished genomic sequence
• every exon of every transcript supported by homology
(mRNA / EST / protein)
• splice variants
• pseudogenes
• nomenclature
• gene clusters
• interpretation of problematic evidence
• examination of literature
NMD: highly sophisticated pathway
Neu-Yilik et al, Genome Biol 2004 4:1218
Consistent annotation of all coding variants essential
Critical exon
Exon missing in variant
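One reading of this slide is that a knockout's critical exon should be present in every annotated coding variant of the gene, since a variant lacking it would escape the deletion. A minimal sketch of that check follows; representing exons as (start, end) tuples and the function name are assumptions, and the coordinates are invented.

```python
def exons_in_all_variants(variants):
    """Return exons shared by every coding variant, sorted by start.

    `variants` maps a variant name to an iterable of (start, end) exon tuples.
    Exons are compared by exact coordinates (a simplifying assumption).
    """
    exon_sets = [set(exons) for exons in variants.values()]
    if not exon_sets:
        return []
    return sorted(set.intersection(*exon_sets))


# Invented coordinates, for illustration only.
variants = {
    "variant_1": [(100, 200), (300, 400), (500, 600)],
    "variant_2": [(100, 200), (500, 600)],  # exon (300, 400) missing
}
candidates = exons_in_all_variants(variants)
```

If a coding variant is missing from the annotation, this check can pass an exon that the unannotated variant skips, which is why consistent annotation of all coding variants matters here.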