VERTEBRATE GENOME EVOLUTION AND FUNCTION …

Using Vertebrate Genome Comparisons
to Find Gene Regulatory Elements
Penn State University, Center for Comparative Genomics and
Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton
Nekrutenko, Ross Hardison
University of California at Santa Cruz: David Haussler, Jim Kent
National Human Genome Research Institute: Laura Elnitski
Children’s Hospital of Philadelphia: Mitch Weiss
Lawrence Livermore National Laboratory: Ivan Ovcharenko
Comparative genomics to find functional sequences
[Figure: genome sizes in million base pairs (Mbp) for human, mouse and rat; figure labels: 2,900; 2,500; 2,400; 1,200]
• Find common sequences (blastZ, multiZ): all mammals, 1000 Mbp; also birds, 72 Mb
• Identify functional sequences: ~145 Mbp
Papers in Nature from mouse and rat and chicken genome consortia, 2002, 2004
Genome sequence assemblies and sources
Species     Assembly   Genome size   Assembly depth       Source
Human       hg17       2.851Gb       “finished”           International human genome sequencing consortium
Chimp       panTro1    ca. 2.8Gb     4x                   Chimpanzee sequencing consortium
Mouse       mm5        2.6Gb         1.9Gb “finished”     Mouse genome sequencing consortium
Rat         rn3        2.57Gb
Dog         canFam1    2.5Gb         7.6x                 Broad Institute and Agencourt Bioscience
Cow         bosTau1    ca. 3Gb       3x                   Baylor and collaborators
Opossum     monDom1    3.5Gb         6.5x                 Broad Institute
Platypus    ornAna0
Chicken     galGal2    1.2Gb         6.63x                Washington University Genome Seq Center
Frog        xenTro1    ca. 1.3Gb     7.4x                 DoE Joint Genome Institute
Zebrafish   Zv4        1.56Gb        5.7x                 Zebrafish Sequencing Group at the Sanger Institute
Tetraodon   tetNig1    0.385Gb       7.9x                 Genoscope and Broad Institute
Fugu        fr1        0.319Gb       5.7x                 DoE Joint Genome Institute and Singapore IMCB
Additional source labels on the slide, possibly for the Rat and Platypus rows: Baylor and collaborators; Washington University Genome Seq Center
Coverage of human by alignments with other
vertebrates ranges from 1% to 91%
[Figure: divergence times from human, in millions of years: 5.4, 91, 92, 173, 220, 310, 360, 450; additional label: 5%]
Distinctive divergence rates for different types
of functional DNA sequences
[Figure: percent of human regions not in alignments (0 to 100) versus time of divergence from common ancestor to human, Myr ago (0 to 500). Series: Genome, Coding exons, Ultraconserved (HM), Log. (Genome)]
Large divergence in cis-regulatory modules
from opossum to platypus
cis-Regulatory modules conserved from human
to fish
[Figure: divergence times of 91, 173, 310 and 450 million years]
• About 20% of CRMs
• Tend to regulate genes whose products control transcription and development
• Recent reports:
– Sandelin, A. et al. (2004). BMC Genomics 5: 99.
– Woolfe, A. et al. (2005). PLoS Biol 3: e7.
– Plessy, C., Dickmeis, T., Chalmel, F., Strahle, U. (2005) Trends Genet. 21: 207-10.
cis-Regulatory modules conserved from human
to chicken
[Figure: divergence times of 91, 173, 310 and 450 million years]
• About 40% of CRMs
• Noncoding sequences conserved from human to chicken tend to cluster in gene-poor regions
– Conservation jungles
– Hillier et al. (2004) Nature
• Stable gene deserts are conserved from human to chicken
– Ovcharenko et al. (2005) Genome Res. 15: 137-145.
• Conserved noncoding sequences in stable gene deserts tend to be long-range enhancers
– Nobrega, M.A., Ovcharenko, I., Afzal, V., Rubin, E.M. (2003) Science 302: 413.
Posters 120 (Bob Harris), 121 (Laura Elnitski), 192 (Ivan Ovcharenko)
cis-Regulatory modules conserved in eutherian
mammals (and marsupials?)
[Figure: divergence times of 91, 173, 310 and 450 million years]
• About 80-90% of CRMs
• Within aligned noncoding DNA
of eutherians, need to
distinguish constrained DNA
(purifying selection) from
neutral DNA.
Score multi-species alignments for features
associated with function
• Multiple alignment scores
– Binomial, parsimony (Margulies et al., 2003)
• PhastCons
– Siepel and Haussler, 2003; Siepel et al. 2005
– Phylogenetic Hidden Markov Model
– Posterior probability that a site is among the 10% most highly
conserved sites
– Allows for variation in rates and autocorrelation in rates
• Factor binding sites conserved in human, mouse and rat
– Tffind (from M. Weirauch, Schwartz et al., 2003)
• Score alignments by frequency of matches to patterns
distinctive for CRMs
– Regulatory potential (Elnitski et al., 2003; Kolbe et al., 2004)
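As a rough illustration of the first bullet, a binomial-style conservation score can be sketched as follows. This is a minimal sketch, not the published method of Margulies et al.: the window representation, the function names, and the `neutral_identity` parameter (the assumed chance that a column is identical across species under neutral evolution) are all hypothetical.

```python
import math


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def window_conservation_score(columns, neutral_identity):
    """-log10 of the chance of at least this many identical columns under neutrality.

    `columns` is a list of strings, one per alignment column (e.g. "GGA").
    `neutral_identity` is an assumed per-column identity probability for
    neutral DNA; in practice it would have to be estimated from data.
    """
    n = len(columns)
    # A column counts as identical if all species agree and none has a gap.
    k = sum(1 for col in columns if len(set(col)) == 1 and "-" not in col)
    return -math.log10(binomial_tail(n, k, neutral_identity))


conserved = ["AAA"] * 10   # every column identical
diverged = ["AAC"] * 10    # no column identical
print(window_conservation_score(conserved, 0.6))
print(window_conservation_score(diverged, 0.6))
```

Windows with more identical columns than expected under the neutral rate receive higher scores; a window with no identical columns scores zero.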
Computing Regulatory Potential (RP)
Alignment            column by column
seq1                 G  T  A  C  C  T  A  T  A  C  G  C  A
seq2                 G  T  G  T  C  G  7  A  G  C  C  C  A
seq3                 A  T  G  T  C  A  C  A  A  T  G  T  A
Collapsed alphabet   1  2  1  3  4  5  7  6  8  3  6  3  9
• A 3-way alignment has 124 types of columns. Collapse these to a smaller alphabet with characters s (for example, 1-9).
• Train two order t Markov models for the probability that t alignment columns are followed by a particular column in training sets:
– positive (alignments in known regulatory regions)
– negative (alignments in ancestral repeats, a model for neutral DNA)
– E.g. frequency that 3 4 is followed by 5: 0.001 in regulatory regions, 0.0001 in ancestral repeats
• RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of alignment characters in known regulatory regions vs. ancestral repeats.
\[
\mathrm{RP} = \sum_{a \,\in\, \text{segment}} \log \frac{p_{\mathrm{REG}}(s_a \mid s_{a-1} \cdots s_{a-t})}{p_{\mathrm{AR}}(s_a \mid s_{a-1} \cdots s_{a-t})}
\]
More species and better models improve
discriminatory power of RP scores
Poster 257: James Taylor
ROC curves for different RP scores, tested on a set of known regulatory regions
from the HBB gene complex
Galaxy metaserver for integrative
analysis of genomic data
• Use servers at primary data repositories (e.g.
UCSC Table Browser) to gather initial data
• Results stored and analyzed at Galaxy
• Operations
– Union, intersection, subtraction
– Clustering, proximity
• Bioinformatic tools:
– Retrieve alignments
– Ka/Ks
• Giardine, Riemer … Nekrutenko, Poster 90
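The interval operations listed above (union, intersection, subtraction) can be illustrated with a small sketch. This is not Galaxy's implementation; it assumes intervals are half-open `(start, end)` pairs on a single chromosome, and the function names are hypothetical.

```python
def union(a, b):
    """Merge two interval lists into sorted, non-overlapping intervals."""
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersection(a, b):
    """Regions covered by both interval lists."""
    a, b = union(a, []), union(b, [])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtraction(a, b):
    """Regions covered by a but not by b."""
    a, b = union(a, []), union(b, [])
    out = []
    for start, end in a:
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor:
                continue
            if b_start >= end:
                break
            if b_start > cursor:
                out.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < end:
            out.append((cursor, end))
    return out


print(union([(0, 10), (5, 15)], [(20, 30)]))
print(intersection([(0, 10)], [(5, 15)]))
print(subtraction([(0, 10)], [(3, 5)]))
```

Half-open coordinates keep the arithmetic simple: two intervals that merely touch do not intersect, and the length of an interval is `end - start`.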
How well do these alignment-based
scores work in finding cis-regulatory
modules?
RP and phastCons can discriminate most known
functional elements from neutral DNA
Genes co-expressed in late erythroid maturation
• G1E-ER cells: proerythroblast line from mice lacking the transcription factor GATA-1.
– Can restore the activity of GATA-1 by expressing an estrogen-responsive form of GATA-1
– Allows cells to mature further to erythroblasts
• Use microarray analysis of each to find genes that increase or decrease expression upon induction.
– Walsh et al., (2004) BLOOD; Image from k-means cluster, GEO:
[Figure: k-means clusters of induced and repressed genes versus time after restoration of GATA-1]
Predicting cis-regulatory modules (preCRMs)
Identify a genomic region with a regulated gene.
Find all intervals whose RP score exceeds an empirical threshold.
Subtract exons
Find all matches to GATA-1 binding sites that are conserved (cGATA-1_BS)
Intervals with RP scores above the threshold and with a cGATA-1_BS within 50bp
are preCRMs.
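The steps above can be sketched as a short filtering pipeline. This is a minimal sketch with toy data, not the authors' code: the interval representation (half-open `(start, end)` pairs on one chromosome), the function names, the threshold value, and the reading of "within 50bp" as a gap of at most 50 bp between an interval and a binding site are all assumptions.

```python
def subtract(intervals, blockers):
    """Remove the parts of each interval that overlap any blocker."""
    out = []
    for start, end in intervals:
        cursor = start
        for b_start, b_end in sorted(blockers):
            if b_end <= cursor or b_start >= end:
                continue
            if b_start > cursor:
                out.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < end:
            out.append((cursor, end))
    return out


def gap(a, b):
    """Distance in bp between two half-open intervals; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def predict_precrms(scored, threshold, exons, sites, max_gap=50):
    """Apply the slide's steps: threshold on RP, drop exons, require a nearby site."""
    high = [(s, e) for s, e, score in scored if score > threshold]
    noncoding = subtract(high, exons)
    return [iv for iv in noncoding
            if any(gap(iv, site) <= max_gap for site in sites)]


# Toy inputs: (start, end, RP score); all values are made up for illustration.
scored = [(100, 300, 0.2), (1000, 1200, 0.01), (2000, 2200, 0.3)]
exons = [(250, 400)]
cgata1_sites = [(270, 276), (5000, 5006)]

print(predict_precrms(scored, 0.05, exons, cgata1_sites))
```

In this toy run the first interval survives after its exon overlap is trimmed, the second fails the RP threshold, and the third has no conserved GATA-1 site within 50 bp.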
Test predicted cis-regulatory modules (preCRMs)
• Enhancement in transient transfections of erythroid cells
[Diagram: test fragment, HBG promoter, firefly (FF) luciferase; tk promoter, Renilla (Ren) luciferase; dual luciferase assay in K562 cells]
• Activation and induction of reporter genes after site-directed, stable integration in erythroid cells
• Chromatin immunoprecipitation (ChIP) for GATA-1
7 of 24 Zfpm1 preCRMs enhance transient expression
9 of 24 Zfpm1 preCRMs enhance after stable
integration at RL5
About half of the preCRMs are validated as
functional
Assay                      Number tested   Number positive   % validated
GATA-1 ChIPs               5               5                 100
Transient transfections    64              18                28
Site-directed integrants   54              24                44
All assays                 64              34                53
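The percentages in this table follow directly from the counts; the short check below recomputes them. The dictionary layout and function name are only illustrative.

```python
# (number tested, number positive) for each assay, taken from the slide.
results = {
    "GATA-1 ChIPs": (5, 5),
    "Transient transfections": (64, 18),
    "Site-directed integrants": (54, 24),
    "All assays": (64, 34),
}


def percent_validated(tested, positive):
    """Validation rate as a whole-number percentage."""
    return round(100 * positive / tested)


rates = {assay: percent_validated(t, p) for assay, (t, p) in results.items()}
for assay, rate in rates.items():
    print(f"{assay}: {rate}%")
```

The "All assays" row, 34 of 64 preCRMs positive in at least one assay, is the basis for the "about half" in the slide title.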
Conclusions
• Particular types of functional DNA sequences are
conserved over distinctive evolutionary distances.
• Multispecies alignments can be used to predict whether a
sequence is functional (signature of purifying selection).
• Alignments can be used to predict certain functional
regions, including some cis-regulatory elements.
• The predictions of cis-regulatory elements for erythroid
genes are validated at a good rate.
• Databases such as the UCSC Table Browser, GALA and
Galaxy provide access to these data.
• Expect improvements at all steps.
Many thanks …
Wet Lab: Yuepin Zhou, Hao Wang, Ying
Zhang, Yong Cheng, David King
Alignments, chains, nets, browsers, ideas, …
Webb Miller, Jim Kent, David Haussler
PSU Database crew: Belinda Giardine,
Cathy Riemer, Yi Zhang, Anton Nekrutenko
RP scores and other bioinformatic input:
Francesca Chiaromonte, James Taylor, Shan Yang,
Diana Kolbe, Laura Elnitski
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
Marsupial genome adds substantially to the
conserved fraction of regulatory regions
[Figure: additive contribution of each 2nd species to conservation, in percent (0 to 100). Series: Primate, Eutherian, Marsupial, Monotreme, Avian, Amphibian, Fish. Categories on the x-axis include whole genome, functional promoters, known regulatory regions, CpG islands, miRNAs, coding exons and ultraconserved elements]
All preCRMs in Gata2 are functional in at least one
assay
ChIP data are from publications from E. Bresnick’s lab.
The distal Major regulatory element of the human HBA gene
complex is conserved in opossum but not beyond
Neutral DNA “cleared out” over 200Myr
[Figure: percent of human not aligned (0 to 100) versus divergence from common ancestor to human, Myr ago (0 to 500). Points: Chimp, Dog, Cow, Mouse, Rat, Opossum, Platypus, Chick, Frog, Fish]
Most human DNA is not alignable to species separated by more than 200 Myr.
Divergence dates from Kumar and Hedges (Nature 1998) and Hedges (Nature Rev Genet 2002)