Bult - Mouse Genome Informatics

Building a Unified Gene Catalog for the Mouse Reference Genome
Carol Bult
The Jackson Laboratory
Mouse Genome Annotation Summit
Bethesda, Maryland
March, 2008
How similar are the results of different
gene prediction pipelines for Build 37 of
the reference mouse genome?
Gene Unification
• Compare genome annotations from:
– NCBI (31,711 annotations)
– Ensembl (28,167 annotations)
– VEGA (14,919 annotations)
• Determine:
– Equivalent gene models
– Gene models unique to Ensembl
– Gene models unique to NCBI
– Gene models unique to VEGA
– Etc.
Method:
Genome feature overlap analysis
• Assess genome coordinate overlaps for
annotated exons
– NCBI, Ensembl and VEGA provided their annotations in
a standardized file format with Build 37 genome coordinates
– Richardson, J. “fjoin: Simple and Efficient Computation
of Feature Overlaps” J. Comp Biol 13:1457-64 (2006).
• Overlap of a single nucleotide between two
exons is sufficient to call two gene models
“equivalent”
– Overlap parameter is adjustable
– The feature type used to detect overlaps is configurable
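The overlap rule on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the fjoin implementation cited above: the `Exon` record, the `min_overlap` parameter name and the sample coordinates are all hypothetical, coordinates are treated as inclusive, and strand is ignored.

```python
from collections import namedtuple

# Hypothetical record: one annotated exon, inclusive coordinates.
Exon = namedtuple("Exon", "gene chrom start end")

def overlap_length(a, b):
    """Number of bases shared by two exons (0 if they do not overlap)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)

def equivalent_pairs(exons_a, exons_b, min_overlap=1):
    """Gene pairs with at least one exon pair sharing >= min_overlap bases."""
    pairs = set()
    for a in exons_a:              # brute force; adequate for a sketch only
        for b in exons_b:
            if overlap_length(a, b) >= min_overlap:
                pairs.add((a.gene, b.gene))
    return pairs

# Made-up exons: N1 and E1 share exactly one base (position 200).
ncbi = [Exon("N1", "11", 100, 200), Exon("N2", "11", 500, 600)]
ensembl = [Exon("E1", "11", 200, 300), Exon("E2", "11", 700, 800)]

pairs = equivalent_pairs(ncbi, ensembl)                   # default: 1 base
strict = equivalent_pairs(ncbi, ensembl, min_overlap=10)  # stricter threshold
```

With the default threshold the single shared base is enough to pair N1 with E1; raising `min_overlap` to 10 removes that pair, which is the kind of adjustment the "overlap parameter is adjustable" bullet refers to.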
Caveats
• Equivalent does not mean identical gene
structure
– The analysis does not evaluate which gene model
is “best”; it indicates only that the annotations from
different sources likely represent the same
gene or transcriptional unit
• Unique does not mean novel
– Some known genes are present in one
annotation file but not the other
Example: Ensembl and NCBI
Unification (Exon Overlap Detection) of NCBI (31711) and Ensembl (28167)
Unique to NCBI: 8678
Equivalent: 23650
Unique to Ensembl: 5248
Equivalent, by cardinality:
1:1   21528
1:n     629
n:1     788
n:m     705
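One way to arrive at the 1:1, 1:n, n:1 and n:m categories is to treat the overlap results as a graph and label each connected cluster by how many genes it contains from each source. The sketch below assumes that interpretation; the slides do not state how the categories were computed, and the function name `classify` and the sample gene identifiers are hypothetical.

```python
from collections import defaultdict

def classify(genes_a, genes_b, pairs):
    """Count clusters of overlapping genes by cardinality (A:B)."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[("A", a)].add(("B", b))
        adj[("B", b)].add(("A", a))

    def side(k, many):
        # 0 or 1 is printed as-is; anything larger becomes "n" or "m".
        return str(k) if k < 2 else many

    counts = defaultdict(int)
    seen = set()
    nodes = [("A", g) for g in genes_a] + [("B", g) for g in genes_b]
    for node in nodes:
        if node in seen:
            continue
        seen.add(node)
        stack, comp = [node], []
        while stack:                      # depth-first walk of one cluster
            cur = stack.pop()
            comp.append(cur)
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        na = sum(1 for src, _ in comp if src == "A")
        nb = len(comp) - na
        counts[f"{side(na, 'n')}:{side(nb, 'm' if na > 1 else 'n')}"] += 1
    return dict(counts)

# Made-up input: N2 overlaps two Ensembl models; N3, N4 and E4 overlap nothing.
counts = classify(
    ["N1", "N2", "N3", "N4"],
    ["E1", "E2", "E3", "E4"],
    {("N1", "E1"), ("N2", "E2"), ("N2", "E3")},
)

# Consistency check on the slide's own numbers: the four equivalent
# categories add up to the reported "Equivalent" total.
equivalent_total = 21528 + 629 + 788 + 705
```

For the made-up input, `counts` holds one 1:1 cluster, one 1:n cluster, two 1:0 genes and one 0:1 gene, and `equivalent_total` equals the 23650 shown on the slide.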
Build 37 Summary
            0:1     1:0     1:1    1:n    n:1    n:m
E vs V     4764   17923    9322    333    505    433
N vs E     5248    8678   21528    629    788    705
V vs N    20208    3409   10606    405    410    535
E = Ensembl (28167)
V = Vega (14919)
N = NCBI (31711)
E unique = 4707
N unique = 6953
V unique = 2986
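The per-source "unique" counts require combining all three annotation sets at once rather than reading them from a single pairwise row. The slides do not describe how this was done; a minimal sketch, assuming the pairwise overlap calls are merged into clusters with a union-find structure, could look like this. The function names and gene identifiers are hypothetical.

```python
def three_way(genes, pairs):
    """Cluster genes from several sources using pairwise overlap calls.

    genes: {source: [gene ids]}
    pairs: iterable of ((source, id), (source, id)) overlap calls
    """
    parent = {(s, g): (s, g) for s, ids in genes.items() for g in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # merge the two clusters

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return list(clusters.values())

def unique_to(clusters, source):
    """Genes from `source` whose cluster contains no other source."""
    return sorted(
        g for c in clusters if {s for s, _ in c} == {source} for _, g in c
    )

# Made-up input: E1, N1 and V1 describe the same locus; E2 and N2 stand alone.
genes = {"E": ["E1", "E2"], "N": ["N1", "N2"], "V": ["V1"]}
pairs = [(("E", "E1"), ("N", "N1")), (("N", "N1"), ("V", "V1"))]
clusters = three_way(genes, pairs)

# Clusters with exactly one gene from each source correspond to 1:1:1.
one_to_one_to_one = [c for c in clusters
                     if sorted(s for s, _ in c) == ["E", "N", "V"]]
```

Here E1 and V1 end up in the same cluster even though they were never compared directly, because both overlap N1. Whether the reported counts used this kind of transitive merging is an assumption.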
Equivalent (1:1:1)
11:84331455..84340462
Screenshots from MGI Mouse GBrowse
Equivalent (1:n)
1:58765343..58820514
Equivalent (n:1)
Clec2g
Clec2f
6:128876095..128986094
Some annotations masked out to improve clarity of example
Equivalent (n:m)
2:155895575..155939706
Unique to Ensembl and Vega
Some annotations in this region are masked to enhance clarity of the example.
Csmd2
Chr4:136463772..137119871
Common Issues
• Gene duplications/gene family
• Read through transcripts
• Shared first exons
Gene Duplication/Gene Family
4:145845084..145895083
Rex2
Reduced expression 2
Zinc finger protein
4:146339646..146439645
Rex2??
4:145845084..145895083
4:146339646..146439645
Rex2
Rex2
Read through Transcripts
9:20862521..20912520
Raver1 and Fdx1l
Shared Exons
1:18240353..18255926
Defb41 and novel defensin gene
10:21849916..22136785
Raet1a,b,c,d,e
Some annotations masked out to improve clarity of example
Importance of Annotation Coordination
• Genome feature identity
• Functional annotation associations
• Experimental genetics
– KOMP
Gene Identity
16:96582252..96792251
Pcp4 and Igsf5
Pcp4 – Purkinje cell protein 4 (MGI:97509)
Igsf5 – Immunoglobulin superfamily, member 5 (MGI:1919308)
There is no Igsf5 in Ensembl, but Igsf5 appears to be used as a synonym for Pcp4
Clec2g
Clec2f
Clec2f
Functional Annotations
Clec2f (MGI:3522133)
Clec2g (MGI:1918059)
KOMP
10:51199649..51217200
Gp49a and Lilrb4
www.knockoutmouse.org
In Ensembl, this gene model is associated
only with Lilrb4. In MGI we associated it with Gp49a.
11:62630999..62696530
Trim16 and Fbxw10
www.knockoutmouse.org
Acknowledgements
• Joel Richardson
• Yunzia “Sophia” Zhu
• Ken Frazer
• TBK Reddy
• Bob Sinclair
• Deb Reed
• Richard Baldarelli
NIH HG00330-P1
• Deanna Church
• Donna Maglott
• Paul Flicek
• Steve Searle
• Laurens Wilming
15:91663946..91769797
Smgc and Muc19