28-Way vertebrate
alignment and conservation
track in the UCSC Genome
Browser
Journal club Dec. 7, 2007
Vertebrate genome
sequencing
the Broad Institute of MIT (Massachusetts Institute of
Technology) and Harvard
the Human Genome Sequencing Center at the Baylor College of
Medicine
the Genome Sequencing Center at Washington University.
the Sanger Center
the Department of Energy’s (DOE’s) Joint Genome Institute
the National Institute of Genetics in Japan.
Alignment:
Similarities & differences between
genome sequences:
1. functional noncoding regions
2. protein-coding genes
3. non-coding RNA genes
Aims
1. to more reliably identify functional
elements via sequence alignment
2. To enhance the effectiveness of the
disease-model species for experiment
3. To determine the course of evolution &
reconstruct the ancestral genome
sequence
April 2007:
From 17 to 28 species
11 species with unchanged (old) data
6 species with updated data
11 new species
>79%
Heterogeneous
mix
Coverage:
2 – >99%
16 – 5.1% ~ 8.5%
10 – ~2x
(2x – 87.5%, 5x – 99.4%)
Cloning bias…
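The 2x and 5x figures above can be compared with the idealized Lander-Waterman (Poisson) model, in which the expected fraction of a genome covered at sequencing depth c is 1 - e^(-c). The sketch below is only that textbook model; the function name is hypothetical, and the slide's own percentages presumably come from a different estimate, so the results are close to them but not identical.

```python
import math


def expected_covered_fraction(depth):
    """Expected fraction of bases covered at least once under the
    idealized Lander-Waterman model (uniform random reads, no cloning bias)."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - math.exp(-depth)


for depth in (2, 5):
    fraction = expected_covered_fraction(depth)
    print(f"{depth}x -> {fraction:.1%}")
```

Real assemblies deviate from this model because reads are not uniformly distributed, which is where cloning bias comes in.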
Applications
Application 1:
indels in protein-coding regions
Application 2:
conservation of start and stop codons
Application 3:
phylogenetic extent of alignment of
functional regions
Application 1
Did indels accumulate at a uniform rate during
evolution?
What are the phenotypic consequences of human-specific
protein indels?
Have the positions of potentially disease-associated indels
resisted substitution over evolutionary time
(interspecies conservation)?
6-bp indel near the start of PRNP
[Figure: tree of primates and glires, with labels P, G and D]
Total indels: 209
Rates shown: 2/MY and 4/MY (number of indels / number per million years)
Parametric bootstrap test: the observed rates differ significantly from the uniform-rate hypothesis
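A parametric bootstrap of this kind can be sketched as follows. This is a minimal illustration, not the procedure used in the paper: it assumes the null hypothesis that every indel falls on a branch with probability proportional to that branch's length, and it uses a chi-square-style statistic. The function name, counts and branch lengths are all hypothetical.

```python
import random


def bootstrap_uniform_rate(counts, branch_my, n_sim=2000, seed=0):
    """Bootstrap p-value for H0: indels accumulate at one uniform rate.

    counts    -- observed indels per branch (hypothetical input)
    branch_my -- branch lengths in millions of years (hypothetical input)
    """
    rng = random.Random(seed)
    total = sum(counts)
    total_my = sum(branch_my)
    # Expected counts per branch if the rate per MY is the same everywhere.
    expected = [total * t / total_my for t in branch_my]

    def stat(obs):
        return sum((o - e) ** 2 / e for o, e in zip(obs, expected))

    observed = stat(counts)
    branches = range(len(counts))
    hits = 0
    for _ in range(n_sim):
        # Simulate one data set under H0: place each indel on a branch
        # with probability proportional to branch length.
        sim = [0] * len(counts)
        for b in rng.choices(branches, weights=branch_my, k=total):
            sim[b] += 1
        if stat(sim) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)


# Two equally long branches with very unequal indel counts.
print(bootstrap_uniform_rate([50, 150], [50, 50]))
```

A small p-value means the simulated uniform-rate data sets rarely look as uneven as the observed counts, which is the sense in which the rates differ significantly from the hypothesis.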
Human-specific protein indels
SULF1: human-specific 3-bp insertion in exon 11
1. Fixed in humans
Replication slippage
2. Highly conserved region (retains 4 E residues over 2 billion years)
3. No 3D structure data available
GFM2: human-specific 6-bp insertion
1. Region is not conserved
2. The insertion occurs only in some human individuals
3. Similar protein 3D data imply no phenotypic consequence
Human disease-associated amino acid
replacement mutations are overabundant
and occur predominantly in positions essential
to the structure and function of the
proteins
Subramanian and Kumar, BMC Genomics 2006, 7:306
Disease-associated deletion
Considering more species
Data from PhenCode Locus Variants
PAH
Simplified distance: number of distinct amino acids
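The simplified distance can be illustrated by counting distinct amino acids in each column of a protein alignment. The sketch below assumes that gaps are ignored and that the count is taken per alignment column; both are assumptions, as are the function names and the toy sequences, which are not real PAH data.

```python
def distinct_residues(column, gap_chars="-."):
    """Number of distinct amino acids in one alignment column, gaps ignored."""
    return len({aa.upper() for aa in column if aa not in gap_chars})


def simplified_distances(alignment):
    """Per-column distinct-residue counts for equal-length aligned sequences."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    return [distinct_residues(column) for column in zip(*alignment)]


# Hypothetical toy alignment of three species.
toy_alignment = ["MSTA", "MSSA", "MT-A"]
print(simplified_distances(toy_alignment))
```

A column with a count of 1 is invariant across the aligned species; higher counts indicate positions that tolerate more variation.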
6
>79%
<
Drift away
It is hard to identify precise gene boundaries from comparative genomics data
Hypothesis 1:
the CpG islands that are common near gene starts are more difficult to sequence
1.65%
Hypothesis 2:
Selection at the start codon might be more relaxed in genes with multiple
promoters (alternate promoters)
4%
Hypothesis 3:
the program may not have enough surrounding
conserved sequence to reliably align the small
initial coding exon around the start codon
similar
Conclusion
A bias against CpG islands in the draft
sequence combined with difficulty in aligning
small initial coding exons does explain a great
deal of the observed unalignability of start
codons compared with stop codons
Gene models based on multiple genomic
alignments must account for the poor alignability of start codons
Background – finding
functional elements
conservation in noncoding regions is much
more subject to evolutionary turnover than in
protein-coding regions.
Evolutionary (conservation) turnover
-- Most studies tacitly equate homology of
functional elements with sequence homology.
This assumption is violated by the
phenomenon of turnover, in which
functionally equivalent elements reside at
locations that are nonorthologous at the
sequence level.
Frith et al., Genome Research 2006
Genomic data from more species gives higher resolution
251,000 coding exons of RefSeq genes
481 ultraconserved elements
94,000 predicted regulatory
regions (PRPs)
3,900 putative transcriptional regulatory
regions (pTRRs)
Alignability: the fraction that aligns with
a designated comparison species
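The definition of alignability can be sketched in a few lines. This illustration counts whole elements rather than bases, which is an assumption, and the function name, element names and species hits are hypothetical toy data rather than values from the track.

```python
def alignability(elements, species):
    """Fraction of elements that align with the designated comparison species.

    elements -- dict mapping element name to the set of species it aligns with
    """
    if not elements:
        raise ValueError("no elements given")
    aligned = sum(1 for hits in elements.values() if species in hits)
    return aligned / len(elements)


# Hypothetical toy data: element -> species with an aligned segment.
toy_elements = {
    "exon_1": {"mouse", "chicken", "fugu"},
    "exon_2": {"mouse", "chicken"},
    "uce_1": {"mouse", "chicken", "fugu"},
    "ptrr_1": {"mouse"},
}
for sp in ("mouse", "chicken", "fugu"):
    print(sp, alignability(toy_elements, sp))
```

Computing this fraction separately for each class of element and each comparison species shows how far across the phylogeny that class remains alignable.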
Human