Two-Stage Association Mapping in Dogs Identifies Coat
Comparative genomics of 24 mammals
Manolis Kellis
MIT
Broad Institute of MIT and Harvard
MIT Computer Science & Artificial Intelligence Laboratory
Sequencing the mammalian phylogeny
Kerstin Lindblad-Toh, Sante Gnerre, Federica Di Palma
Broad, Baylor, WashU, Arachne, UCSC
#    Species        Center        Covg
H1   Human          Done          Full
H2   Chimp          Done          Full
H3   Rhesus         Done          Full
H4   Mouse          Done          Full
H5   Rat            Done          Full
H6   Dog            Done          Full
H7   Cow            Done          Full
1    Elephant       Broad         1.94x
2    Armadillo      Broad         1.98x
3    Tenrec         Broad         1.90x
4    Rabbit         Broad         1.95x
5    Guinea Pig     Broad         1.92x
6    Hedgehog       Broad         1.86x
7    Shrew          Broad         1.92x
8    Microbat       Broad         1.84x
9    Tree Shrew     Broad         1.89x
10   Squirrel       Broad         1.90x
11   Bushbaby       Broad         1.87x
12   Pika           Broad         1.92x
13   Mouse Lemur    Broad         1.93x
14   Horse          (not listed)  5.36x
15   Cat            Agencourt     1.87x
16   Dolphin        Baylor        2.59x
17   Hyrax          Baylor        2.19x
18   Kangaroo Rat   Baylor        1.85x
19   Megabat        Baylor        ~2x
20   Alpaca         WashU         2.34x
21   Tarsier        WashU         1.88x
22   Sloth          WashU         2.10x
23   Pangolin       x             x
24   Flying lemur   x             x
Comparative genomics of mammalian species
• Goal 1: Discover regions of increased selection
– Detect functional elements by their increased conservation
– More genomes: detect smaller elements, subtle selection
• Goal 2: Discover different classes of functional elements
– Patterns of change distinguish different types of functional elements
– Specific function → selective pressures → patterns of mutation/insertion/deletion
• Develop evolutionary signatures characteristic of each function
Protein-coding genes
Mike Lin
Evolutionary signatures for protein-coding genes
Non-synonymous substitutions
Gaps are multiples of 3
Frame-shifting gaps
Synonymous codon substitutions
• Same conservation levels, distinct patterns of divergence
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substs.)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
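Two of the signatures above (gap lengths and 3-periodic mutations) can be illustrated with a minimal Python sketch. It is not the scoring used in the talk (CSF and RFC are not reproduced here). The function names are hypothetical, and the mismatch counter assumes the reference row is gap-free and starts in frame.

```python
import re


def gap_lengths(aligned_seq):
    """Return the length of each run of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]


def frame_preserving_fraction(aligned_seq):
    """Fraction of gaps whose length is a multiple of three (None if no gaps)."""
    gaps = gap_lengths(aligned_seq)
    if not gaps:
        return None
    return sum(1 for g in gaps if g % 3 == 0) / len(gaps)


def mismatches_by_codon_position(ref, other):
    """Count mismatches at codon positions 1, 2 and 3.

    Assumption: `ref` has no gaps and starts in frame, so alignment
    column i corresponds to codon position i % 3.
    """
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(ref, other)):
        if b == "-":
            continue  # skip gapped columns in the informant species
        if a != b:
            counts[i % 3] += 1
    return counts
```

In a protein-coding region one would expect most gaps to be frame-preserving and most mismatches to fall at the third codon position; conserved non-coding sequence should show neither bias.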
Protein-coding evolution vs nucleotide conservation
Highly conserved non-coding elements
Protein-coding exons
• Evolutionary signatures specific to each function
– Distinguish protein-coding from non-coding conservation
– Genome-wide run (CSF only): 81% sens., 91% precision
– Incorporating additional signatures: RFC, single-species…
Many new genes confirmed by chromatin domains
Missed exon
Alt. spliced exon
Example: MM14qC3
• Several hundred new exons, many in clusters
Mikkelsen et al
• Supported by chromatin signatures (Guttman et al)
Genome-wide curation / experimental follow-up
PI: Tim Hubbard, Sanger Center.
HAVANA curators, experimental validation.
• Novel candidate genes and exons
– Experimental cDNA sequencing and validation
– Curation of gene structures integrating evidence
• Revising existing annotations
– Identify dubious genes with non-protein-like evolution
– Refine boundaries and exon sets of existing genes
– Curation: evaluate evidence supporting that annotation
• Unusual gene structures
– Evolutionary evidence in absence of primary signals
– Reveal new and unusual biological mechanisms
Unusual protein-coding events
Mike Lin
When primary sequence signals are ignored
• Typical gene (MEF2A). Evolutionary signal stops at the stop codon.
• Unusual gene (GPX2). Protein-coding signal continues past the stop.
• GPX2 is a known selenoprotein! Additional candidates found.
Translational read-through in neuronal proteins
Novel candidate: OPRL1, a neurotransmitter receptor
Protein-coding conservation → stop codon read-through → continued protein-coding conservation → 2nd stop codon → no more conservation
• New mechanism of post-transcriptional control.
– Conserved in both mammals (~5 candidates) and flies (~150 candidates)
– Strongly enriched for neurotransmitters and brain-expressed proteins
– Read-through stop codon (and surrounding sequence) shows increased conservation
• Many questions remain
– Role of editing? Cryptic splice sites? RNA secondary structure?
Lin et al, Genome Research 2007
Measuring excess constraint within protein-coding exons
Typical protein-coding exon (Numerous mutations, at each column)
Excess-conservation exon: conserved above and beyond the call of duty
Likely to have additional functions, overlapping selective pressures
Searching for excess-constraint coding sequence
(1) Build a model for expected substitution counts
Syn.subs. correlate w/ degeneracy & CpG
Distribution for each ancestral codon
(2) Score windows for depletion in syn. subst.
• Z-score: P(obs. subst | expected for each codon)
(3) Top candidate exons with excess constraint
• PCPB2: derived from ancestral transposon
• Hox B5 gene start: 52 AA before 1 syn.subst
• C6orf111: predicted ORF on chr. 6
• EIF4G2: overlaps spliced EvoFold prediction
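The window scoring in step (2) can be sketched as a plain Z-score for depletion of synonymous substitutions. This is only an illustration: it assumes a per-codon expected mean and variance are already available from the model in step (1), and that codons are independent. The function name and inputs are hypothetical.

```python
import math


def window_z_score(observed, expected_mean, expected_var):
    """Z-score of the observed synonymous substitution count in a window.

    observed:      observed synonymous substitutions per codon
    expected_mean: model expectation per codon (assumed given)
    expected_var:  model variance per codon (assumed given)
    Codons are treated as independent, which is a simplifying assumption.
    """
    if not (len(observed) == len(expected_mean) == len(expected_var)):
        raise ValueError("all inputs must cover the same codons")
    total_var = sum(expected_var)
    if total_var <= 0:
        raise ValueError("total variance must be positive")
    return (sum(observed) - sum(expected_mean)) / math.sqrt(total_var)
```

Strongly negative scores mark windows with fewer synonymous substitutions than expected, that is, candidates for excess constraint.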
Examples: Top candidate exons showing increased selection
• HoxB5: 52 amino acids before the first synonymous substitution
• Overlaps highly conserved RNA secondary structure
• C6orf11: Predicted ORF, protein-coding, extremely conserved
• EIF4G2: Several consecutive exons, conserved RNA struct.
microRNA genes
Alex Stark
Pouya Kheradpour
Evolutionary signatures for microRNA genes
(1) Conservation profile
Combine with 10 other features → 4,500-fold enrichment
348 reads
16 reads
Ruby, Bartel, Lai
• In fly genome: 101 hairpins above 0.95 cutoff
60 of 74 (81%) known Rfam miRNAs rediscovered
+ 24 novel, expression-validated by 454 and Solexa sequencing (Bartel/Hannon)
+ 17 additional candidates show diverse evidence of function
• In mammals: combine experimental & evolutionary info
Rely on reads for discovery, use evolutionary signal to study function
Stark et al, Genome Research (GR) 2007. Ruby et al GR 2007
Novel miRNAs validated by sequencing reads
Surprise 1: microRNA & microRNA* function
Drosophila Hox
• Both hairpin arms of a microRNA can be functional
– High scores, abundant processing, conserved targets
– Hox miRNAs miR-10 and miR-iab-4 as master Hox regulators
Stark et al, Genome Research 2007
Surprise 2: microRNA-anti-sense function
Highly conserved Hox targets
• A single miRNA locus transcribed from both strands
• The two transcripts show distinct expression domains (mutually exclusive)
• Both processed to mature miRNAs: mir-iab-4, miR-iab-4AS (anti-sense)
Stark et al, Genes&Development 2007
[Figure: wild-type (WT) wing and haltere compared with sense and antisense mis-expression; labels: sensory bristles, wing, haltere, wing w/bristles. Note: C, D, E same magnification]
miR-iab-4AS leads to homeotic transformations
• Mis-expression of mir-iab-4S & AS: halteres→wings homeotic transformation
• Stronger phenotype for AS miRNA
• Sense/anti-sense pairs as general building blocks for miRNA regulation
• 10 sense/anti-sense miRNAs in mouse
Stark et al, Genes&Development 2007
Function of miRNA* arms and anti-sense miRNAs
• Denser Hox miRNA targeting network
Measuring selection
Michele Clamp
Manuel Garber
Xiaohui Xie
Detecting Purifying Selection (ω)
Neutral sequence
Constrained sequence
Estimating intensity of constraint (ω):
• Probabilistic evolutionary model
• Maximum Likelihood (ML) estimation of ω
- sitewise (evaluate every k-long window)
- windows-based (increased power)
• Reports ω, and its log odds score (LODS).
• Theoretical p-value (LODS is distributed as χ² with df = 1)
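The theoretical p-value in the last bullet, read as a chi-square tail with one degree of freedom, can be computed without external libraries. This is a sketch under one assumption: the reported score is already on the scale that follows the chi-square distribution (commonly twice the natural-log likelihood ratio). The function name is hypothetical.

```python
import math


def lods_p_value(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For df = 1, P(X > x) = erfc(sqrt(x / 2)).
    Assumption: `statistic` is already chi-square scaled.
    """
    if statistic < 0:
        raise ValueError("statistic must be non-negative")
    return math.erfc(math.sqrt(statistic / 2.0))
```

For example, a statistic of about 3.84 corresponds to p ≈ 0.05.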
Manuel Garber, Michele Clamp, Xiaohui Xie
Detecting other constraint signatures (π)
[Figure values: 0, 0, 0.8, 0.5, 0.6, 3.2, 0, 0]
• Repeated C→G transversion
• Has happened at least 4 times.
• Very unlikely given neutral model.
• Goal: Identify sites with unlikely substitution pattern.
• Approach: Probabilistic method to detect a
stationary distribution that is different from background.
• Solution: Implement an ML estimator of this vector (π).
• Provides a Position Weight Matrix for any given k-mer in the genome.
• Scores every base in the genome (LODS).
Manuel Garber, Michele Clamp, Xiaohui Xie
Estimation of genome-wide constraint
Pilot Encode Regions (1%):
9.4% conserved
5.7% above FDR cutoff
10.5% conserved
6% above FDR cutoff
Genome-wide:
Across entire genome: 5% under selection.
Same as for Human-Mouse. What’s different?
Manuel Garber, Michele Clamp, Xiaohui Xie
More mammals: We can actually tell which 5% it is!
[Figure: constraint calculated over a 50mer and over a 12mer, each comparing 4 mammals with 21 mammals; marked thresholds: >40% FDR and 5% FDR]
Michele Clamp
Individual conserved elements match known TF sites
Example: TNNC1 (Troponin C)
[Figure: TNNC1 5’ promoter alignment with constraint score track; known TF binding sites: TATA, SP-1, CEF-2, CEF1]
Binding site resolution, even without known motif model
Michele Clamp
Binding sites for known regulators
Pouya Kheradpour
Alex Stark
Computing Branch Length Score (BLS)
[Figure: CTCF motif instances across the species tree, annotated with mutations, movement, and missing short branches; BLS = 2.23 sps (78%)]
Allows for:
1. Mutations permitted by motif degeneracy
2. Misalignment/movement of motifs within window (up to hundreds of nucleotides)
3. Missing motif in dense species tree
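A branch length score can be sketched as the total branch length of the smallest subtree connecting the species in which the motif is found, also reported as a fraction of the whole tree. The tree encoding, function name, and toy branch lengths below are illustrative assumptions, not values from the talk.

```python
def branch_length_score(children, branch_length, root, species_with_motif):
    """Return (BLS, BLS as a fraction of total tree length).

    children:      dict mapping each internal node to its child nodes
    branch_length: dict mapping each non-root node to the branch above it
    A branch counts when some, but not all, of the selected species lie
    below it, i.e. when it is part of the connecting subtree.
    """
    selected = set(species_with_motif)
    counts = {}

    def count(node):
        kids = children.get(node, [])
        n = sum(count(k) for k in kids) if kids else int(node in selected)
        counts[node] = n
        return n

    n_selected = count(root)
    score = sum(length for node, length in branch_length.items()
                if 0 < counts[node] < n_selected)
    return score, score / sum(branch_length.values())


# Toy tree ((human, chimp), (mouse, rat)) with made-up branch lengths.
toy_children = {"root": ["A", "B"], "A": ["human", "chimp"], "B": ["mouse", "rat"]}
toy_lengths = {"A": 0.2, "B": 0.4, "human": 0.1, "chimp": 0.1,
               "mouse": 0.3, "rat": 0.3}
bls, fraction = branch_length_score(toy_children, toy_lengths, "root",
                                    ["human", "mouse"])
```

Because only the set of species matters, a motif missing from a few short branches still scores highly, which matches point 3 above.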
Branch Length Score Confidence
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
2. Compute Confidence Score as the fraction of instances over noise at a given BLS (= 1 – false discovery rate)
3. Many species are needed to confidently predict instances
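Steps 1 and 2 can be sketched as follows, assuming the BLS of every motif instance and of every shuffled-control instance has already been computed. The function name and the choice to average over control motifs are illustrative assumptions.

```python
def confidence_at_bls(motif_bls, control_bls_sets, threshold):
    """Confidence (1 - FDR) for motif instances with BLS >= threshold.

    motif_bls:        BLS values of the real motif's instances
    control_bls_sets: one list of BLS values per shuffled control motif
    Returns None when the real motif has no instances at this threshold.
    """
    if not control_bls_sets:
        raise ValueError("at least one control motif is required")
    n_motif = sum(1 for b in motif_bls if b >= threshold)
    if n_motif == 0:
        return None
    # Expected number of chance instances: mean count over the controls.
    n_control = sum(
        sum(1 for b in ctrl if b >= threshold) for ctrl in control_bls_sets
    ) / len(control_bls_sets)
    return max(0.0, 1.0 - n_control / n_motif)
```

With three motif instances and an average of 0.5 control instances above the threshold, the confidence is 1 – 0.5/3, or about 0.83.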
Median number of instances (at fixed confidence)
Performance on vertebrate Transfac motifs
2.5x increase
3.5x
6.5x
1. With 18 mammals, most motifs have instances at up to 90% confidence
2. Substantial increase in the number of instances compared to using only human, mouse, rat, and dog.
Intersection with CTCF ChIP-Seq regions
ChIP data from Barski, et al., Cell (2007)
ChIP-Seq and ChIP-Chip technologies allow for identifying the binding sites of a factor experimentally
1. Conserved CTCF motif instances are highly enriched in ChIP-Seq sites
2. High enrichment does not require low sensitivity
3. Many motif instances are verified
50% motifs verified
50% confidence
≥ 50% of regions with a motif
We can accurately identify targets for many factors
Zeller, et al., PNAS (2006)
Lim, et al., Molecular Cell (2007)
Wei, et al., Cell (2006)
Lin, et al., PLoS Genetics (2007)
Robertson, et al., Nature Methods (2006)
Odom, et al., Nature Genetics (2007)
Barski, et al., Cell (2007)
Enrichment also found for other factors
Enrichment increases in conserved bound regions
1. ChIP bound regions may not be conserved
2. For CTCF we also have binding data in mouse
3. Enrichment in intersection is dramatically higher
Human: Barski, et al., Cell (2007)
Mouse: Bernstein, unpublished
Odom, et al., Nature Genetics (2007)
4. Trend persists for other factors where we have multispecies ChIP data
Motif discovery
Pouya Kheradpour
Alex Stark
Using confidence for motif discovery
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone (or due to non-motif conservation)
2. Compute Confidence Score as the fraction of instances over noise at a given BLS (= 1 – false discovery rate)
Motif discovery pipeline
1. Enumerate motif seeds (example: T G C [gap] T A G)
• Six non-degenerate characters with a variable-size gap in the middle
2. Score seed motifs
• Use a conservation ratio corrected for composition and small counts to rank seed motifs
3. Expand seed motifs (example: S R T G C Y [gap] W T A G R)
• Use the expanded nucleotide IUPAC alphabet to fill unspecified bases around the seed using hill climbing
4. Cluster to remove redundancy
• Using sequence similarity
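Step 1 and the IUPAC alphabet used in step 3 can be sketched as below. The 3 + gap + 3 split follows the slide's example seed and is otherwise an assumption, as are the function names and the maximum gap size. Scoring, hill climbing, and clustering are not shown.

```python
import itertools
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def enumerate_seeds(max_gap):
    """Yield (left, gap, right): two 3-mers separated by 0..max_gap bases."""
    for bases in itertools.product("ACGT", repeat=6):
        left, right = "".join(bases[:3]), "".join(bases[3:])
        for gap in range(max_gap + 1):
            yield left, gap, right


def motif_to_regex(left, gap, right):
    """Compile a gapped motif written in IUPAC codes into a regex."""
    def to_re(s):
        return "".join(IUPAC[c] for c in s)
    return re.compile(to_re(left) + "[ACGT]{%d}" % gap + to_re(right))


seed = motif_to_regex("TGC", 2, "TAG")            # slide's example seed
expanded = motif_to_regex("SRTGCY", 1, "WTAGR")   # slide's expanded example
```

With `max_gap = 2` this enumerates 4^6 × 3 seeds, small enough to score exhaustively before the expansion step.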
Heintzman et al, Bing Ren’s lab
Motif discovery in enhancer regions
• Collaboration with Ren, White, Posakony labs
– Predict novel enhancer / promoter / insulator elements
– Identify motifs associated with these regions
– Validate predicted regions for in vivo function
• Initial results in human genome
– Motif combinations predictive of enhancer regions (5X)
Motif discovery in 3’UTRs
1. Perform motif discovery by ranking 7-mers in 3’UTRs by the highest confidence
they reach with 100 instances.
Summary
• Measuring increased selection
– Scaling of branch lengths: ω
– Non-random stationary distribution: π
– Increased resolution: individual binding sites
• Protein-coding genes
– Distinct evolutionary signatures
– Novel genes, revised genes
– Unusual structures: read-through, increased selection
• microRNAs
– Function of miRNA/miRNA* and sense/anti-sense pairs
– Dense miRNA targeting network for Hox cluster
• Regulatory motifs
– Measure increased selection, derive confidence score
– High sensitivity / high specificity for known motifs
– Use enumeration/confidence metric for motif discovery
Acknowledgements
MIT Computer Science and AI Lab
Mike Lin, Pouya Kheradpour, Alex Stark, Matt Rasmussen
Broad Institute of MIT and Harvard
Kerstin Lindblad-Toh, Michele Clamp, Manuel Garber, Xiaohui Xie
Sante Gnerre, David Jaffe
Issao Fujiwara
Federica Di Palma
Arachne Assembly Team
Broad Sequencing Platform
Eric Lander
Sequencing Baylor, WashU, Agencourt. Funding: NHGRI
miRNAs
Julius Brennecke, Graham Ruby, Greg Hannon, David Bartel
iab-4AS
Natascha Bushati, Steve Cohen, Julius Brennecke, Greg Hannon