Transcript ppt

5 Open Problems in Bioinformatics
•Pedigrees from Genomes
•Comparative Genomics of Alternative Splicing
•Viral Annotation
•Evolving Turing Patterns
•Protein Structure Evolution
From genomes to pedigrees
Coalescent
Rebombination
process
Pedigree process
Seqeunce/Individual
Boundary
Three Processes
1. Recombination
3. The Mutational Process
From Yun Song
2. Choosing Parents
Probability of Data given a pedigree.
Elston-Stewart (1971) -Temporal Peeling Algorithm:
Mother
Father
Condition on parental states
Recombination and mutation are Markovian
Lander-Green (1987) - Genotype Scanning Algorithm:
Mother
Father
Condition on paternal/maternal inheritance
Recombination and mutation are Markovian
Comment: Obvious parallel to Wiuf-Hein99 reformulation of Hudson’s 1983 algorithm
Benevolent Mutation and Recombination Process
Genomes with r and m/r --> infinity
r - recombination rate, m - mutation rate
•Counting within a small interval would reveal the length of the path connecting the two segments.
•Siblings are readily revealed, since they will have segments with 2m density of mutations
•The distribution of path lengths are readily observable between two sequences
•All embedded phylogenies are observable
From Phylogenies to Pedigrees
Mike’s counter example, linkage and individuals
Different Pedigrees
Same Phylogenies
Pedigree 1
2
1
2
Sibling Sequences come from
different parents.
Pedigree 2
Individual 1
1
Gluing Phylogenies together
1
2
1
2
?
grandparents
A recombinants’ parent are
sister sequences.
1
2
1
2
1
2
1
2
Comparative Genomics of Alternative Splicing
From Transcripts to the AS-Graph
S
S
E
E
1. How well known is the AS-graph as a function number of transcripts?
2. A family and distribution of transcripts, can they be explained an AS-graph
with probabilities at donor sites or do we need probabilities for
(donor,acceptor) pairs? Or possibly even more complicated situations. And is
sampling transcripts good enough to distinguish these situations.
Mini-project: reliability of AS-detection.
Choose Idealized AS-Graph:
1. Genome
2. Choose donor and acceptor sites in random pairs.
3. For each possible splice pair assign probability for choosing it.
This should define a probability for all transcripts.
•
Generate a set of transcripts.
•
Reconstruct AS-Graph.
Key questions:
1. How many transcripts must be sampled to detect AS.
2. How well will the AS-Graph be recovered?
Optimal DAG (directed acyclic graph) under restrictions
Finding a set of annotations:
1. Find set of paths, maximizing sum of scores.
Sub-optimal Paths:
2. The score of minimal path must be above
threshold.
Two paths must differ significantly: An enclosed
area, the maximal height must be d higher than
the boundary defining it. Height(i,j) = di,j + di,j
1. Does known AS genes have more CTO structure than non-AS genes?
2. Do the AS correspond to the CTO structure
3. Is the CTO structure evolutionary conserved?
Phylogenetically related ASGs
E
E
SS
E
E
SS
E
E
SS
1. Is ASG conserved?
2. What is conserved?
3. How is selection along position dependent on splicing status?
Virus Annotation
Classes of Gene Structures
http://www.tulane.edu/~dmsander/WWW/335/Diarrhoea.html
Diarrhoea Causing Arrangements
Illustrating the 3 main
classes of gene structures:
Unidirectional, Convergent
and Divergent.
http://www.tulane.edu/~dmsander/WWW/335/Retroviruses.html
http://www.tulane.edu/~dmsander/WWW/335/Papovaviruses.html
Retroviridae
Arrangements
Papoviridae Arrangement
The Problems of Viral Annotation
•HMM gene structure generator (McCauley)
•Gene Structure Evolution (de Groot)
•Alignment (Caldeira, Lunter, Rocco)
•Recombination (Lyngsø, Song)
•Multiple constraints: RNA secondary structure,
gene conservation, binding/transcriptional
instructional sites.
Our 8 State HMM which allows for Unidirectional
overlapping gene structures
HMM States
• Non-coding
• Coding RF1
• Coding RF2
• Coding RF3
• Coding RF1,2
• Coding RF1,3
• Coding RF2,3
• Coding RF1,2,3
Combining Levels of Selection.
Assume multiplicativity: fA,B = fA*fB
Protein-Protein
Hein & Støvlbæk, 1995
Codon Nucleotide Independence Heuristic
Jensen & Pedersen, 2001
Contagious Dependence
Protein-RNA
Singlet
Doublets
Contagious Dependence
Table illustrating the performance benefit in
Sensitivity we obtain utilizing a Phylogenetic HMM.
We extend the HMM model to include evolutionary
information from 13 aligned HIV2 sequences.
HIV2
Genomes
SS HMM
Sensitivity
PHMM
Sensitivity
Del
Sensitivity
SS HMM
Specificity
PHMM
Specificity
Del
Specificity
M30502
0.8913
0.9765
0.0852
0.9878
0.9753
-0.0125
J04542
0.8458
0.9173
0.0714
1.0000
0.9956
-.0044
D00835
0.8796
0.9432
0.0636
0.9920
0.9733
-0.0187
M15390
0.9310
0.9971
0.0661
1.0000
0.9869
-0.0131
J03654
0.8261
0.9971
0.1709
1.0000
0.9865
-0.0135
AY509259
0.8697
0.9256
0.0559
1.0000
0.9886
-0.0114
AY509260
0.8257
0.9101
0.0844
0.9928
0.9792
-0.0136
J04498
0.8961
0.9737
0.0776
1.0000
0.9911
-0.0089
AF082339
0.9074
0.9650
0.0577
0.9842
0.9773
-0.0069
U22047
0.9028
0.9874
0.0847
0.9865
0.9744
-0.0121
U27200
0.8769
0.9453
0.0684
0.9928
0.9748
-0.0180
LO7625
0.8340
0.9680
0.1340
1.0000
0.9607
-0.0393
L36874
0.8653
0.9957
0.1303
0.9980
0.9766
-0.0214
MEANS
0.8732
0.9617
0.0885
0.9949
0.9800
-0.0149
GenBank: Centralized resource for
publicly available viral sequence data.
Entrez Genomes currently contains 2120 Reference
Sequences for 1510 viral genomes and 36 Reference
Sequences for viroids.
http://www.ncbi.nlm.nih.gov/Genbank/
http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html
Properties of overlapping genes are conserved across microbial genomes.
Genome Res. 2004 Nov;14(11):2268-72.
Within microbial genomes, one third of annotated genes contain some degree
of overlap, and one third of these are either Convergent or Divergent.
Krakauer, D.C. Stability and evolution of overlapping genes.
Evolution 54: 731-739 (2000) Genome Res. 2004 Nov;14(11):2268-72.
General preponderance of overlapping gene structures is roughly a 90:9:1
ratio split across Unidirectional, Convergent and Divergent arrangements.
Turing Patterns
Mathematical models to understand biological patterns
From Maini’s Home Page: http://www.maths.ox.ac.uk/~maini
Turing Model
 u ( x, t )
2

D

u  f (u, v,{ pu })
u
 t
 v( x, t )

 Dv  2 v  g (u, v,{ pv })
 t
Different parameters lead to different patterns
Stripes: p small
Spots: p large
[From: Leppanen et al. Dimensionality effects in Turing pattern formation,
Int. J. Mod. Phys. B 17, 5541-5553 (2003)]
ut  Du  2u   (u  av - uv 2 - puv)
v  D  2 v   (bv  hu  uv 2  puv)
v
 t
3 suggestions
1. Networks and Turing Patterns
2. Stochastic Partial Differential Equations
3. Phylogenetically related Turing Patterns
Evolutionary Models of Protein Structure Evolution
?
?
?
?
Known
a-globin
Unknown
300 amino acid changes
800 nucleotide changes
1 structural change
1.4 Gyr
Known
Myoglobin
1. Given Structure what are the possible events that could happen?
2. What are their probabilities? Old fashioned substitution + indel process with bias.
Bias: Folding(Sequence Structure) & Fitness of Structure
3. Summation over all paths.
2 suggestions
A. Structure (Homology Modelling, Topology)
Folding(Sequence Structure)
As a first approximation similar structures should be compared and the
problem could be solved by comparative modelling.
Fast Homology Modelling
Using Protein Topology as Hidden Variable
Fitness of Structure – such functions are common place in guiding prediction
programs.
B. MCMC
Questions to be asked
Negative Note:
Protein Structure Analysis is much harder than Sequence Analysis.
Much of the first hand impression will remain: “Structures are either
trivially similar or highly dissimilar” – the middle ground is empty.
At Gyr scale other rearrangements occur.
Positive Note: If it works
Test of smooth/catastrophic structure evolution
Separation of analogous/homologous similarities
Protein Evolution in General
How closely linked are homologous and structurally equivalent sites?
http://www.biochem.ucl.ac.uk/bsm/cath/
http://scop.mrc-lmb.cam.ac.uk/scop/
Summary
Pedigrees from Genomes
Does infinite genomes determine pedigrees?
How many pedigrees are there?
Comparative Genomics of Alternative Splicing
How well do you know the ASG?
How do you measure selection on the ASG?
Viral Annotation
How well can you annotate viruses from observed evolution?
Evolving Turing Patterns
Turing Patterns and Networks
Stochastic Turing Patterns
Phylogenetically Related Turing Patterns
Protein Structure Evolution
Full Model of Structure Evolution
Model of Protein Topology Evolution