Affymetrix Rhesus array annotation discussion

Rhesus genome annotations
Rob Norgren
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
Conventional Approach to GeneChip Production
• Sequence millions of ESTs
• Obtain finished genomic sequences
• Cluster redundant ESTs
• Align EST clusters with genomic sequences
• Extract the last 571 bp of sequence from each transcript as the probe selection region (PSR)
• Choose 11 to 16 probes that tile across the PSR
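The last two steps above can be sketched in a few lines of Python. This is only an illustration, not the production pipeline: the 25-base probe length and the evenly spaced tiling are assumptions made here (real probe selection also scores candidate probes for quality), and the function names `extract_psr` and `tile_probes` are hypothetical.

```python
PROBE_LEN = 25  # assumption: 25-mer probes; not stated on the slides


def extract_psr(transcript: str, psr_len: int = 571) -> str:
    """Return the last psr_len bases of a transcript (all of it if shorter)."""
    return transcript[-psr_len:]


def tile_probes(psr: str, n_probes: int = 11, probe_len: int = PROBE_LEN) -> list:
    """Pick up to n_probes windows of probe_len bases, evenly spaced across the PSR.

    Even spacing is an illustrative simplification of probe selection.
    """
    last_start = len(psr) - probe_len
    if last_start < 0 or n_probes < 1:
        return []  # PSR too short to hold a single probe
    if n_probes == 1 or last_start == 0:
        return [psr[:probe_len]]
    # Evenly spaced start positions from 0 to last_start, duplicates removed
    starts = sorted({round(i * last_start / (n_probes - 1)) for i in range(n_probes)})
    return [psr[s:s + probe_len] for s in starts]


# Usage with a made-up 1,000-base transcript
transcript = "ACGT" * 250
psr = extract_psr(transcript)
probes = tile_probes(psr, n_probes=11)
```

With a 571-base PSR and 11 probes, the first probe starts at the beginning of the PSR and the last one ends at its final base.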
Problems with the conventional approach for a rhesus macaque GeneChip
• Insufficient ESTs to cover most genes
• Little finished genomic sequence (in 2005)
Strategy for targeted amplification of rhesus genes
• Identify the terminal exon and flanking sequence for every human gene
• Design primers and amplify from monkey genomic DNA
• Obtain the rhesus PSR sequences
Figure: diagram of the terminal exon and poly A site, showing the forward primer (F), the PSR, and the reverse primer (R).
PSR: probe selection region
F: forward primer
R: reverse primer
Other sources for rhesus GeneChip PSRs
• Preliminary Baylor Genomic Sequences
In silico approach: aligned human PSRs with preliminary rhesus genomic sequence.
• ESTs
Rhesus GeneChip
• Available in March 2005
• Novel design
• Whole genome expression array - 52,024 probesets for 47,000 transcripts
• Probesets include 17,093 well-annotated genes (16 probes/probeset)
• Probesets were designed for 1,099 well-annotated genes not present on the U133+2.0 human GeneChip.
Rhesus Genome
• Draft published in Science on April 13, 2007
• “The rhesus macaque genome assembly is a draft DNA sequence, and it contains many gaps.”
What does a “draft” rhesus genome mean?
• 26,907 protein coding genes for humans
• 24,038 protein coding genes for rhesus macaques
• Sounds good, but is misleading.
• 19,450 well-annotated protein coding genes for humans
• 8,744 well-annotated protein coding genes for rhesus macaques
• What does “well annotated” mean?
• No “hypothetical” genes
• Only genes with “good” gene symbols. No “Locs”.
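The two criteria above, and the counts they produce, can be expressed as a small filter. This is a rough sketch only: the `LOC` plus digits pattern and the keyword test for “hypothetical” are assumptions about how such a filter might be written, not the procedure actually used to derive the counts, and the four example records are made up for illustration.

```python
import re

# Assumption: placeholder symbols look like "LOC" followed by digits
LOC_SYMBOL = re.compile(r"^LOC\d+$")


def is_well_annotated(symbol: str, description: str) -> bool:
    """True if the gene has a non-LOC symbol and is not described as hypothetical."""
    if not symbol or LOC_SYMBOL.match(symbol):
        return False
    return "hypothetical" not in description.lower()


def percent_well_annotated(well_annotated: int, total: int) -> float:
    """Percentage of protein coding genes that are well annotated, to one decimal."""
    return round(100 * well_annotated / total, 1)


# Illustrative records only (symbol, description); not taken from any real annotation file
genes = [
    ("TP53", "tumor protein p53"),
    ("DEFB1", "defensin beta 1"),
    ("LOC000001", "hypothetical protein LOC000001"),
    ("GENE1", "hypothetical protein"),
]
kept = [symbol for symbol, description in genes if is_well_annotated(symbol, description)]

# Counts quoted on the slide
human_pct = percent_well_annotated(19450, 26907)
rhesus_pct = percent_well_annotated(8744, 24038)
```

Applied to the counts above, about 72% of human protein coding genes are well annotated, against about 36% for rhesus macaques, which is why the raw gene totals are misleading.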
Problems with GeneChip annotations
• Affymetrix relies on NCBI annotations; hence, many probesets are not annotated with “real” gene symbols
• Stopgap solution: http://www.unmc.edu/rheusgenechip
• Permanent solution requires full and complete annotation of the rhesus genome at NCBI.
What can go wrong at the genome sequencing center?
• Large gaps
• Small gaps
• Misassemblies
• Sequencing errors
What can go wrong with ab initio annotations?
• Incorrect assignment of pseudogene status
• Failure to identify genes
• Incorrect gene models (some exons right, some wrong)
• Incomplete gene models
Consequences of non-annotated genes
• A large number of databases depend on NCBI for their annotations. Example: Affymetrix GeneChips
• Errors and omissions are propagated to dependent databases
• Users are frustrated when they see “Locs” instead of a proper gene symbol
• Users can BLAST each probeset consensus sequence or ask their bioinformatics personnel to establish gene identity, but this wastes time and effort.
How to correct annotations
• Annotations must be acceptable to NCBI; if they are not, corrections will not propagate to dependent databases.
• Some gene annotations can be corrected by manual inspection.
• Some gene annotations can be corrected by human ortholog-based gene models rather than ab initio approaches.
• Some gene annotations can only be corrected by additional sequencing.
• And some gene annotations require a trip to Hell...
Defensins - the gene family from Hell
• Large family of genes
• Orthologs poorly conserved - positive selection?
• Will require focused sequencing and annotation
• May require publication before NCBI annotates most of the rhesus defensins
Acknowledgements
• Jeff Kittrell
• Joel Goodsell
• Audrey Gomel
• NCRR/NIH