Phylogenetic Tree Construction using Pathway Analysis

Download Report

Transcript Phylogenetic Tree Construction using Pathway Analysis

Phylogenetic Tree Construction using
Pathway Analysis
Bioengineering 190C Project
By: Harry Choi
Nick Lin
Gabe Kwong
Li Yan
Christina Yau
Background

Traditional Approach





Comparison of single orthologs between
organisms
Distance matrix generation from similarity scores
Hierarchical Clustering
Tree Construction
Disadvantage


Sensitive to choice of gene for comparison
Possible inconsistency of trees generated
Our Approach

N annotated organisms to be clustered







Reference organism is chosen
Pathway in the reference organism is chosen
Pool of orthologs in the N organisms is generated
by BLAST
Analysis of pool of ortholog generated vector
representing each organisms
Distance calculation from vectors
Hierarchical Clustering
Tree Construction
Rationale for Approach

Pathway takes into account multiple genes


Individual differences between genes not directly
taken into account
All genes considered are related to each other
in cellular function

Better conservation of actual function than
sequence identities
 Better consistency in trees generated
Program Design

Modular design divided into following portions:






BLAST
Analysis of BLAST results
Hierarchical Clustering
Tree Construction
Design allows for reuse of components in
different applications with minor changes
Design allows individual subrountines to be used
recursively to generate desired results with
minimal changes
Step 1: BLAST
Conserved proteins from Wnt pathway
will be used as example
Wnt pathway
Wnt proteins form a family of highly conserved
secreted signaling molecules that regulate cell-to-cell
interactions during embryogenesis. Wnt genes and
Wnt signaling are also implicated in cancer. Wnt
pathway is found in many organisms such as:
Drosophila, Caenorhabditis elegans, Xenopus,
Chiecken, Mouse, Zebrafish, and Human.
Wnt pathway (cont)
Choose 6 most conserved
proteins from this pathway
as seed proteins:






Wnt
Frizzled
Dsh
Apc
Axin
Tcf
(Roel Nusse, 2002)
5 organisms





Drosophila:
Mouse:
C. elegans:
Zebrafish:
Xenopus:
54455 sequences
77143 sequences
62256 sequences
3069 sequences
5174 sequences
Strategy
Seed protein (pr1) from Organism 1 (O1)  blast against 4 other organisms:
O1
O2
O3
O4
O5
pr1
pr2
pr3
pr4
pr5
pr6
pr7
Pr8
pr9
Secondary seed proteins (pr 2, …, 9) blast against respective 4 other organism:
O1
pr1
pr10
pr11
O2
pr2
O3
O4
O5
O1
O2
pr4
pr6
pr12
pr8
pr13
pr14
pr15
pr16
pr17
pr1
pr24
pr3
pr25
pr26
O3
pr4
O4
O5
pr7
pr14
pr15
pr9
pr16
pr17
pr23
O1
O2
O3
O4
O5
O1
O2
O3
O4
O5
pr11
pr18
pr19
pr3
pr5
pr6
pr20
pr7
pr13
pr21
pr22
pr9
pr23
:
:
:
:
:
:
pr5
:
:
:
:
:
:
Example input seed sequence for BLAST
>gi|85190|pir||A29650 wingless (wg) protein precursor - fruit fly
(Drosophila melanogaster)
MDISYIFVICLMALCSGGSSLSQVEGKQKSGRGRGSMWWGIAKVGEPNNITPIMYMDPAIHS
TLRRKQRRLVRDNPGVLGALVKGANLAISECQHQFRNRRWNCSTRNFSRGKNLFGKIVDRGC
RETSFIYAITSAAVTHSIARACSEGTIESCTCDYSHQSRSPQANHQAGSVAGVRDWEWGGCS
DNIGFGFKFSREFVDTGERGRNLREKMNLHNNEAGRAHVQAEMRQECKCHGMSGSCTVKTC
WMRLANFRVIGDNLKARFDGATRVQVTNSLRATNALAPVSPNAAGSNSVGSNGLIIPQSGLV
YGEEEERMLNDHMPDILLENSHPISKIHHPNMPSPNSLPQAGQRGGRNGRRQGRKHNRYHF
QLNPHNPEHKPPGSKDLVYLEPSPSFCEKNLRQGILGTHGRQCNETSLGVDGCGLMCCGRGY
RRDEVVVVERCACTFHWCCEVKCKLCRTKKVIYTCL
(fasta format: start with “>”)
Example output file from BLAST:
15 secondary seed proteins
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_85190
wg_celegans_7508752
wg_celegans_3880389
wg_celegans_17539494
wg_zebrafish_103816
wg_zebrafish_833600
wg_zebrafish_18859559
wg_zebrafish_139740
wg_xenopus_65236
wg_xenopus_69039
wg_xenopus_139748
wg_mouse_293671
wg_mouse_387388
wg_mouse_69037
wg_mouse_13529431
wg_mouse_139744
1.70e-41
1.70e-41
1.70e-41
1.20e-80
1.20e-80
1.20e-80
1.20e-80
1.40e-76
1.40e-76
1.40e-76
2.50e-78
2.50e-78
2.50e-78
2.50e-78
2.50e-78
Example output file from BLAST (cont)
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752
wg_celegans_7508752_drosophila_6537292
wg_celegans_7508752_drosophila_12018324
wg_celegans_7508752_xenopus_422628
wg_celegans_7508752_xenopus_313268
wg_celegans_7508752_xenopus_465484
wg_celegans_7508752_mouse_202406
wg_celegans_7508752_mouse_227507
wg_celegans_7508752_mouse_111253
wg_celegans_7508752_mouse_14789729
wg_celegans_7508752_mouse_6678599
wg_celegans_7508752_mouse_14424475
wg_celegans_7508752_zebrafish_1256778
wg_celegans_7508752_zebrafish_18859567
wg_celegans_7508752_zebrafish_2501662
1.30e-90
1.30e-90
1.10e-96
1.10e-96
1.10e-96
2.40e-96
2.40e-96
2.40e-96
2.40e-96
2.40e-96
2.40e-96
2.30e-94
2.30e-94
2.30e-94
Step 2: Analysis of BLAST
Results
ie. Metric Determination
Metric Determination


Common Algorithms used to calculate a
distance metric from similarity scores
include (1-%Identity) and S = e(-d/m2)
(Shepard 1987).
A different algorithm is used for this
project.
Rules Metric must Satisfy



The distance between a gene and itself
must be zero Dii = 0.
Communitive property: Dij = Dji.
Triangular inequality: Dij + Dik  Djk.
Dij
j
Djk
k
i
Dik
Our Algorithm

Determine unique gene pool from all
the organisms that meet the threshold
for a particular gene in pathway.
Celegans_17531491
Wg-Drosophila
g1
g2
g3
Is g1 Unique?
Yes
g1
Is g2 unique?
No
g2
Gene pool
g4
g3
g2
g4
Gene Vectors
Drosophila Mouse Zebrafish...
Genepool
of entire
Wnt
pathway
g1
g2
g3
.
.
.
gn
1
0
0
.
.
.
1
0
1
1
.
.
.
0
0
0
0
.
.
.
1
No Homolog of gn found
in Mouse.
Homologous
gn found
in Zebrafish
Euclidean Distance



Vectors are in N dimensional space
Determine Euclidean Distance by taking
the root of the differences squared.
Dij =
=
(Di1-Dj1)2 + …+ (Din-Djn)2
(1-0)2 + (1-1)2 + (0-1)2 + …
Distance Matrix
O1 O2 O3 . . . . . . . . . On
O1
O2
O3
.
.
.
On
0
D21 0
D31 D32 0
Since Euclidean distances commute
Matrix is Triangular.
.
.
.
Dn1
0
Step 3: Hierarchical Clustering
Hierarchical Clustering
There are two types of clustering:
 Successive Fusions
(Agglomerative Clustering)
 Separation
(Divisive Clustering)
Hierarchical Clustering


In this project, agglomerative clustering
algorithm has been employed
Idea: The most similar objects are first
grouped. These are then merged
according to their similarities, until all are
fused into one single cluster
Hierarchical Clustering

Input of the clustering program:
Any N x N triangular
matrix containing the
pairwise distances
between the
organisms
D = {djk}
Hierarchical Clustering

Feed the NXN matrix as the input and the
clustering method will output a
(N-1)X(N-1) matrix
In this case, it will be a 4X4 matrix:
Hierarchical Clustering
New Distances are determined
between the new group and each
of the remaining organisms
Hierarchical Clustering

d
d
Continue with the
clustering until all
the organisms fused
into one cluster
= min {d (35)2, d 12} = min{7, 9} = 7
(135)4 = min {d (35)4, d 14} = min{8,6} = 6
(135)2
D(135)(24) = min {d (135)(2) , d(135)(4) }
= min {7, 6} = 6
Hierarchical Clustering
The outputs of each run:


The names of the organisms that are grouped
together
The distance between the two organisms
After N-1 number of iterations, the outputs are
saved to a file and they will be used to draw the
phylogenetics tree.
Step 4: Phylogenetic Tree
Construction
Tree Construction

Sample input

Flat file of clusters and distances
e.g. sample1.txt
A B 4.5
B C 5.2
E D 5.8
C E 12.4
Or
A
A
E
A
e.g. sample2.txt
B 4.5
BC
5.2
D 5.8
BC
E
D
12.4
Tree Construction

Sample Input (continued)

Requirements for input file:






Each line must represent one cluster
First entries are leaves in the cluster
Last entry is the distance
No more than two new leaves can be added in a cluster
Each entry must be delimited by a tab
Flexibility


File can have all leaves in the cluster or a new leaf and
any leaf from previous clusters
Subroutine can be reuse to generate tree from any file
by modifiying one line of code
Tree Construction

Method

Subclass intree





Read in file line by line
Add new leaves to Vector leaves
Array elts tracks the number of leaves added to the
vector in each cluster
Array d tracks the distances between elements in the
cluster
Subclass treed



Convert distances to pixels
Draw tree and leaves in Jframe
Draw scale of distance in Jframe
Tree Construction

Sample Output
A
B
C
D
E
1
2
3
4
5
6
7
8