L18 - Computer Science and Engineering

Download Report

Transcript L18 - Computer Science and Engineering

CSE182-L18
Population Genetics
Perfect Phylogeny
• Assume an evolutionary model in which no recombination
takes place, only mutation.
• The evolutionary history is explained by a tree in which
every mutation is on an edge of the tree. All the species in
one sub-tree contain a 0, and all species in the other contain
a 1. Such a tree is called a perfect phylogeny.
• How can one reconstruct such a tree?
The 4-gamete condition
• A column i partitions the
set of species into two sets
i0, and i1
• A column is homogeneous
w.r.t a set of species, if it
has the same value for all
species. Otherwise, it is
heterogenous.
• EX: i is heterogenous w.r.t
{A,D,E}
A
i0 B
C
D
i1 E
F
i
0
0
0
1
1
1
4 Gamete Condition
• 4 Gamete Condition
– There exists a perfect phylogeny if and only if for all pair of
columns (i,j), either j is not heterogenous w.r.t i0, or i1.
– Equivalent to
– There exists a perfect phylogeny if and only if for all pairs of
columns (i,j), the following 4 rows do not exist
(0,0), (0,1), (1,0), (1,1)
4-gamete condition: proof
• Depending on which
edge the mutation j
occurs, either i0, or i1
should be homogenous.
• (only if) Every perfect
phylogeny satisfies the
4-gamete condition
• (if) If the 4-gamete
condition is satisfied,
does a prefect
phylogeny exist?
i
i0
i1
An algorithm for constructing a
perfect phylogeny
• We will consider the case where 0 is the ancestral
state, and 1 is the mutated state. This will be
fixed later.
• In any tree, each node (except the root) has a
single parent.
– It is sufficient to construct a parent for every node.
• In each step, we add a column and refine some of
the nodes containing multiple children.
• Stop if all columns have been considered.
Inclusion Property
• For any pair of columns i,j
– i < j if and only if i1  j1
• Note that if i<j then the
edge containing i is an
ancestor of the edge
containing i
i
j
Example
1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
r
A
B
Initially, there is a single clade r, and
each node has r as its parent
C D
E
Sort columns
•
•
Sort columns according to the
inclusion property (note that the
columns are already sorted here).
This can be achieved by
considering the columns as binary
representations of numbers (most
significant bit in row 1) and sorting
in decreasing order
A
B
C
D
E
1
1
0
1
0
1
2
1
0
1
0
0
3
0
1
0
1
0
4
0
0
1
0
0
5
0
0
0
1
0
Add first column
A
B
C
D
E
• In adding column i
– Check each edge and
decide which side you
belong.
– Finally add a node if
you can resolve a clade
1 2 3 4 5
1 1 0 0 0
0 0 1 0 0
1 1 0 1 0
0 0 1 0 1
1 0 0 0 0
r
u
A
C
E
B
D
Adding other columns
A
B
C
D
E
• Add other
columns on
edges using the
ordering
property
1
1
0
1
0
1
2
1
0
1
0
0
3
0
1
0
1
0
4
0
0
1
0
0
r
1
E
3
2
B
5
4
D
C
A
5
0
0
0
1
0
Unrooted case
• Switch the values in each column, so that 0
is the majority element.
• Apply the algorithm for the rooted case
Summary :No recombination leads to
correlation between sites
A
B
00000000
3
00100000
5
8
00100001
•
•
•
00101000
The different sites are linked. A 1 in position 8 implies 0
in position 5, and vice versa.
The history of a population can be expressed as a tree.
The tree can be constructed efficiently
Recombination
• A tree is not
sufficient as a
sequence may have 2
parents
• Recombination leads
to violation of 4
gamete property.
• Recombination leads
to loss of correlation
between columns
00000000
11111111
00011111
Studying recombination
• A tree is not sufficient as a sequence may have 2
parents
• Recombination leads to loss of correlation between
columns
• How can we measure recombination?
Linkage (Dis)-equilibrium (LD)
A
0
0
0
0
1
1
1
1
B
1
1
0
0
0
0
0
0
No recombination
–Pr[A,B=0,1] = 0.25
•Linkage disequilibrium
A
0
0
0
0
1
1
1
1
B
0
1
0
0
1
0
0
0
Extensive Recombination
– Pr[A,B=(0,1)=0.125
• Linkage equilibrium
Measuring LD
• Consider two bi-allelic sites A and B, with values 0 and 1.
•
•
•
•
•
•
•
Let p1 = probability[individual has allele 1 in site A]
q1 = probability[individual has allele 1 in site B]
P11 = Prob [individual has allele 1 in site A, and B]
Linkage Disequilibrium, D = |P11-p1q1| = |P01-p0q1| =….
If D=0, sites are uncorrelated, (are in linkage equilibrium)
If |D| >>0, sites are highly correlated (have high LD)
Other measures exist, but they all measure similar
quantities.
LD can be used to map disease genes
LD
D
N
N
D
D
N
0
1
1
0
0
1
• LD decays with distance from the disease allele.
• By plotting LD, one can short list the region
containing the disease gene.
Population sub-structure can cause
problems in disease gene mapping
Population sub-structure can
increase LD
• Consider two populations
that were isolated and
evolving independently.
• They might have
different allele
frequencies in some
regions.
• Pick two regions that are
far apart (LD is very low,
close to 0)
0
0
0
1
0
0
0
0
0
..
..
..
..
..
..
..
..
..
1
1
0
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
..
..
..
..
..
..
..
..
..
0
0
0
1
0
0
0
0
0
Pop. A
p1=0.1
q1=0.9
P11=0.1
D=0.01
Pop. B
p1=0.9
q1=0.1
P11=0.1
D=0.01
Recent ad-mixing of population
• If the populations came
together recently (Ex:
African and European
population), artificial LD
might be created.
• D = 0.15 (instead of 0.01),
increases 10-fold
• This spurious LD might lead
one false associations
• Other genetic events can
cause LD to arise, and one
needs to be careful
0
0
0
1
0
0
0
0
0
..
..
..
..
..
..
..
..
..
1
1
0
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
..
..
..
..
..
..
..
..
..
0
0
0
1
0
0
0
0
0
Pop. A+B
p1=0.5
q1=0.5
P11=0.1
D=0.1-0.25=0.15
Determining population substructure
• Given a mix of people, can you sub-divide them into
ethnic populations.
• Turn the ‘problem’ of spurious LD into a clue.
– Find markers that are too far apart to show LD
– If they do show LD (correlation), that shows the
existence of multiple populations.
– Sub-divide them into populations so that LD disappears.
Determining Population sub-structure
• Same example as before:
• The two markers are too
similar to show any LD, yet
they do show LD.
• However, if you split them
so that all 0..1 are in one
population and all 1..0 are in
another, LD disappears
0
0
0
1
0
0
0
0
0
..
..
..
..
..
..
..
..
..
1
1
0
1
1
1
1
1
1
1
1
0
1
1
1
1
1
1
..
..
..
..
..
..
..
..
..
0
0
0
1
0
0
0
0
0
Iterative Algorithm for Population
Substructure
• Assume that there are 2 sub-populations
• Randomly partition the individuals into two.
• Select an individual, and compute the
probabilities Pr(x|A), and Pr (x|B)
• Assign the individual to A with probability
– Pr(x|A)/ (Pr(x|A)+Pr(x|B))
• Continue.
Iterative algorithm for population
sub-structure
• Define
• N = number of individuals (each has a single
chromosome)
• k = number of sub-populations.
• Z  {1..k}N is a vector giving the sub-population.
– Zi=k’ => individual i is assigned to population k’
• Xi,j = allelic value for individual i in position j
• Pk,j,l = frequency of allele l at position j in
population k
Example
• Ex: consider the
following assignment
• P1,1,0 = 0.9
• P2,1,0 = 0.1
1
1
1
1
1
1
1
1
1
1
0
0
0
1
0
0
0
0
0
0
..
..
..
..
..
..
..
..
..
..
1
1
0
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
1
1
0
1
1
1
1
1
1
1
..
..
..
..
..
..
..
..
..
..
0
0
0
1
0
0
0
0
0
0
Goal
•
•
•
•
X is known.
P, Z are unknown.
The goal is to estimate Pr(P,Z|X)
Various learning techniques can be
employed.
• Here a Bayesian (MCMC) scheme is
employed. We will only consider a simplified
version
Algorithm:Structure
• Iteratively estimate
– (Z(0),P(0)), (Z(1),P(1)),.., (Z(m),P(m))
• After ‘convergence’, Z(m) is the answer.
• Iteration
– Guess Z(0)
– For m = 1,2,..
• Sample P(m) from Pr(P | X, Z(m-1))
• Sample Z(m) from Pr(P | X, P(m-1))
• How is this sampling done?
Example
• Choose Z at random, so each
individual is assigned to be in one of
2 populations. See example.
• Now, we need to sample P(1) from
Pr(P | X, Z(0))
• Simply count
• Nk,j,l = number of people in pouplation
k which have allele l in position j
• pk,j,l = Nk,j,l / N
1
2
2
1
1
2
1
2
1
2
0
0
0
1
0
0
0
0
0
0
..
..
..
..
..
..
..
..
..
..
1
1
0
1
1
1
1
1
1
1
1
2
2
1
1
2
1
2
2
1
1
1
0
1
1
1
1
1
1
1
..
..
..
..
..
..
..
..
..
..
0
0
0
1
0
0
0
0
0
0
Example
• Nk,j,l = number of people in population
k which have allele l in position j
• pk,j,l = Nk,j,l / N1,1,*
• N1,1,0 = 4
• N1,1,1 = 6
• p1,1,0 = 4/10
• p1,2,0 = 4/10
• Thus, we can sample P(m)
1
2
2
1
1
2
1
2
1
2
0
0
0
1
0
0
0
0
0
0
..
..
..
..
..
..
..
..
..
..
1
1
0
1
1
1
1
1
1
1
1
2
2
1
1
2
1
2
2
1
1
1
0
1
1
1
1
1
1
1
..
..
..
..
..
..
..
..
..
..
0
0
0
1
0
0
0
0
0
0
Sampling Z
• Pr[Z1 = 1] = Pr[”01” belongs to population 1]?
• We know that each position should be in linkage
equilibrium and independent.
• Pr[”01” |Population 1] = p1,1,0 * p1,2,1
=(4/10)*(6/10)=(0.24)
• Pr[”01” |Population 2] = p2,1,0 * p2,2,1 =
(6/10)*(4/10)=0.24
• Pr [Z1 = 1] = 0.24/(0.24+0.24) = 0.5
Sampling
• Suppose, during the iteration, there
is a bias.
• Then, in the next step of sampling Z,
we will do the right thing
• Pr[“01”| pop. 1] = p1,1,0 * p1,2,1 =
0.7*0.7 = 0.49
• Pr[“01”| pop. 2] = p2,1,0 * p2,2,1
=0.3*0.3 = 0.09
• Pr[Z1 = 1] = 0.49/(0.49+0.09) = 0.85
• Pr[Z6 = 1] = 0.49/(0.49+0.09) = 0.85
• Eventually all “01” will become 1
population, and all “10” will become a
second population
1
1
1
2
1
2
1
2
1
1
0
0
0
1
0
0
0
0
0
0
..
..
..
..
..
..
..
..
..
..
1
1
0
1
1
1
1
1
1
1
2
2
2
1
2
2
1
2
2
1
1
1
0
1
1
1
1
1
1
1
..
..
..
..
..
..
..
..
..
..
0
0
0
1
0
0
0
0
0
0
Population Structure
Africa
Eurasia
East Asia
Oceania
• 377 locations (loci) were sampled in 1000 people from 52
populations.
• 6 genetic clusters were obtained, which corresponded to 5
geographic regions (Rosenberg et al. Science 2003)
America
Other topics
Assembly
Genomic
Analysis/
Pop.
Genetics
Protein
Sequence
Analysis
Sequence
Analysis
Gene Finding
ncRNA
ncRNA gene finding
• Gene is transcribed but not translated.
• What are the clues to non-coding genes?
– Look for signals selecting start of transcription and
translation. Non coding genes are transcribed by Pol III
– Non-coding genes have structure. Look for genomic
sequences that fold into an RNA structure
• Structure: Given a sequence, what is the structure
into which it can fold with minimum energy?
tRNA structure
RNA structure: Basics
• Key: RNA is single-stranded. Think of a string over 4
letters, AC,G, and U.
• The complementary bases form pairs.
• Base-pairing defines a secondary structure. The basepairing is usually non-crossing.
RNA structure: pseudoknots
Sometimes, unpaired bases in loops form
‘crossing pairs’. These are pseudoknots
A Static picture of the cell is insufficient
•
•
Each Cell is continuously
active,
– Genes are being
transcribed into RNA
– RNA is translated into
proteins
– Proteins are PT
modified and
transported
– Proteins perform
various cellular
functions
Can we probe the Cell
dynamically
Pathways
Gene
Regulation Transcript
profiling
Proteomic
profiling
Gene expression
• The expression of
transcripts and protein in
the cell is not static. It
changes in response to
signals.
• The expression can be
measured using microarrays.
• What causes the change
in expression?
Transcriptional machinery
• DNA polymerase (II) scans the genome, initiating
transcription, and terminating it.
• The same machinery is used for every gene, so while Pol II
is required, it is not sufficient to confer specificity
TF binding
•
•
•
Other transcription
factors interact with the
core machinery and
upstream DNA to
provide specificity.
TFs bind to TF binding
sites which are clustered
in upstream enhancer and
promoter elements.
The enhancer elements
may be located many kb
upstream of the corepromoter
Transcription factors
Upstream elements
TF binding sites
• TF binding sites are
weak signal (about 10
bp with 5bp
conserved)
• If two genes are coregulated, they are
likely to share binding
sites
• Discovery of binding
site motifs is an
important research
problem.
g1
g2
g3
TCAGGAG
TGAGGAG
g4
g5
TGAGGTG
TCAGGTG
TCAGGTG
http://www.gene-regulation.com/pub/databases.html#transfac
Discovering TF binding sites
• Identification of these TF binding
sites/switches is critical.
• Requires identification of co-regulated
genes (genes containing the same set of
switches).
• How do we find co-regulated genes?
Idea1: Use orthologous genes from different
species
ACGGCAGCTCGCCGCCGCGC
||||| || ||||||| ||
ACGGC-GGGCGCCGCCCCGC
1.
ACGGCAGCTCGCCGCCGC-C
| || | ||||||| |
AGTGC-GGGCGCCGCCTCAT
2.
3.
ACGGC-GC-TCGCCGCCGCGC
|
|
| || |
|
AT-ACGAAGTAGCGG-ATGGT
The species are too close (EX:
humans and chimps). Binding
& non-binding sites are both
conserved.
The species are distant. Binding
sites are conserved but not
other sequence.
The species are very distant.
Even binding sites are not
conerved. The genes have
alternative regulators.
Idea2: Microarray
•
Expression level of all genes
Pathways
•
•
•
Proteins interact to
transduce signal, catalyze
reactions, etc.
The interactions can be
captured in a database.
Queries on this database
are about looking for
interesting sub-graphs in a
large graph.
Summary
•
•
•
•
Biological databases cannot be
understood without understanding
the data, and the tools for
querying and accessing these data.
While database technology (XML,
Relational OO databases, text
formats) is used to store this data,
its use is (often) transparent for
Bioinformatics people.
In this course, we looked at various
data-streams, and pointed to
databases that store these datastreams
Nucleic Acids Research brings out
a database issue every January
2005 issue