GenomicVariation_11-22

Download Report

Transcript GenomicVariation_11-22

MEME homework:
probability of finding GAGTCA at a given position in the yeast genome, based
on a background model of A = 0.3, T = 0.3, G = 0.2, C = 0.2
(0.2)(0.3)(0.2)(0.3)(0.2)(0.3) = 2.16 x 10-4
This is the probability that ANY 6 mer will be this sequence by chance
How many instances within 1,000 bp upstream of 6,000 genes?
Number of 6mers per 1,000 bp: 1000 bp – 5 bp (account for 6mer start position)
= 995 6mers per gene upstream region
6000 genes * 995 = 5.97 x 106 possible 6mers total
P that any one is your sequence: 2.16 x 10-4 x 5.97 x 106 = 1290 sites
BUT … can also have the reverse complement (i.e. site on other strand)
1
= 2X possible sites (because of our bg model) = 2580 possible matches
An alternative approach: Phylogenetic footprinting
Rather than look at multiple, different regulatory regions from one species,
look at one region but across multiple, orthologous regions from many species.
Hypothesis: functional regions of the genome will be conserved more than
‘nonfunctional’ regions, due to selection.
Therefore, simply look for regions of sequence that are conserved above background.
2
Simplest case: stretches of very highly conserved sequence
Kellis et al. 2003 “Sequencing and comparison of yeast species to identify genes and regulatory elements”
Sequenced 4 closely related Saccharomyces genomes & identified conserved sequences in multiple
alignments of orthologous sequences from the four species.
3
Incorporating evolutionary models can improve motif finding
Remember that evolution acts on functionally important base pairs …
Also remember from our motif finding exercise that not all contiguous base pairs
are equally important (information content).
bits
Information Profile:
Position
4
Incorporating evolutionary models can improve motif finding
Remember that evolution acts on functionally important base pairs …
Also remember from our motif finding exercise that not all contiguous base pairs
are equally important (information content).
Moses et al. 2003 “Position specific variation in the rate of evolution in transcription factor binding sites”
Rate of evolution (ie. degree of conservation) within a motif is inversely proportional to the information
5
content … important base pairs evolve slower
Multiple motif finding methods now work on multiple alignments of
regulatory regions of coregulated genes.
Given: 1) group of regulatory regions of coregulated genes
2) orthologs of each region, in the form of multiple alignments
Sinha et al. 2004 “PhyME: A probabalistic algorithm for finding motifs in
sets of orthologous sequences”
Moses et al. 2004 “Monkey: identification of transcription factor binding sites
in multiple alignments using a binding site-specific evolutionary model
Siddharthan et al. 2005 “PhyloGibbs: A Gibbs sampling motif finder that
incorporates phylogeny.”
Wang & Stormo. 2003 (PhyloCon) “Combining phylogenetic data with co-regulated
genes to identify regulatory motifs”
Prakash et al. 2004. (OrthoMEME) “Motif discovery in heterogeneous sequence data
6
Keep in mind that the relevant evolutionary models are specific
for what one is looking for (TF binding sites, ncRNA, etc)
Moses et al. 2003 “Position specific variation in the rate of evolution in transcription factor binding sites”
Rate of evolution (ie. degree of conservation) within a motif is inversely proportional to the information
7
content … important base pairs evolve slower
VISTA suite for visualizing conservation in global alignments
Pre-computed multiple global alignments of mammalian genomes, visualized
by conservation level.
-- Uses BLAT local alignment tool to find seeds of high sequence similarity,
then these seeds are used for global single- or multiple-genome alignment
Frazer et al. 2004 “VISTA: computational tools for comparative genomics”
8
9
Which species to compare?
Balance between:
-- species closely related enough that:
1) There’s enough similar sequence to get confident
pairwise alignments
2) The sequences of interest and their corresponding functions
have been conserved
-- species distantly enough related that:
1) nonfunctional sequence has had time to diverge
10
The above approaches have focused on using similarity/conservation
to identify important regions of the genome …
A large focus in genomics is understanding the differences in genome
sequences and what accounts for the vast diversity in phenotypes
within a population.
Analysis of single nucleotide polymorphisms (SNP) within populations,
Analysis of variations in gene expression within and between populations,
Analysis of quantitative trait loci (QTLs) accounting for differences in gene expression.
11
Connecting phenotype to genotype
-- Large variations in size, shape, health, etc in human populations
-- Much of that variation has to do with disease susceptibility
-- A major goal of genetics (and now genomics) is understanding the consequences
of genetic variation.
~2800 disease-associated genes known, mostly from positional cloning &
mapping studies
Done by linkage analysis: pattern of marker inheritance in families
with heritable diseases
A major force in genomics is to identify and annotate SNPs
in human populations, and identify those related to disease
12
2001
13
Array-based methods of SNP detection & Haplotype mapping
Each base-pair position on human chromosome 21 is interrogated 8 times
(4 in forward & 4 in reverse orientations)
GGAGATGAGTTCGATTACTCTTAGG
GGAGATGAGTTCAATTACTCTTAGG
GGAGATGAGTTCTATTACTCTTAGG
GGAGATGAGTTCCATTACTCTTAGG
1.7 x 108 oligos total on eight Affy wafers were used to identify SNPs on
human Chromosome 21 from 21 different individuals.
14
Each row = single SNP
Each column = Ch 21
Blue = major allele
Yellow = minor allele
Much of the chromosomal
variation is explained with
relatively limited
haplotype diversity.
80% of haplotype structure
can be captured with only
10% of the SNPs in that block
(need only 2SNPs to type)
Haplotype length can vary
from a few kb to mega bases.
15
16
Phenotypic variation (including disease susceptibility) are often linked to copy # changes
This is especially true of numerous types of cancers, where local amplifications
and translocations increase the copy number of cell proliferation regulators, etc.
17
Amplifications in breast
cancer lines increase the
copy # of specific regulators ..
18
19