
Biological Networks –
Graph Theory and Matrix Theory
Ka-Lok Ng
Department of Bioinformatics
Asia University
Content
Topological statistics of protein interaction networks
How to characterize a network?
– Graph theory and topological parameters (node degree, average path length, clustering coefficient, and node degree correlation)
– Random graphs, scale-free networks, hierarchical networks
– Evolution of biological networks
2
Biological Networks - metabolic networks
Metabolism is the most basic network of biochemical reactions: it generates the energy that drives various cell processes, and it degrades and synthesizes many different biomolecules.
3
Biological Networks
- Protein-protein interaction network (PIN)
Proteins perform distinct and well-defined functions, but little is known about how interactions among them are structured at the cellular level. Protein-protein interactions account for binding interactions and the formation of protein complexes.
- Experiment: the yeast two-hybrid method, or co-immunoprecipitation
- Limitation: no subcellular location or temporal information
- Cliques – protein complexes?
www.utoronto.ca/boonelab/proteomics.htm
4
Biological Networks - PIN
Yeast Protein-protein interaction network
- Protein-protein interactions are not random
- Highly connected proteins are unlikely to interact with each other
→ not a random network
- Data from the high-throughput two-hybrid experiment (T. Ito et al., PNAS (2001))
- The full set contains 4549 interactions among 3278 yeast proteins; 87% of the nodes are in the largest component
- kmax ~ 285!
- The figure shows nuclear proteins only
5
Biological Networks
– Gene regulation networks
In a gene regulatory network, the protein encoded by a gene can regulate
the expression of other genes, for instance, by activating or inhibiting DNA
transcription. These genes in turn produce new regulatory proteins that
control other genes.
Example of a genetic regulatory network of two genes (a and b),
each coding for a regulatory protein (A and B).
6
Biological Networks
– Gene regulation networks
Transcription regulatory network in yeast
- From the YPD database: 1276 regulations among 682 proteins by 125 transcription factors (~10 regulated genes per TF)
- Part of a bigger genetic regulatory network of 1772 regulations among 908 proteins
Transcription regulatory network in H. sapiens
- Data courtesy of Ariadne Genomics, obtained from a literature search: 1449 regulations among 689 proteins
Transcription regulatory network in E. coli
- Data (courtesy of Uri Alon) was curated from the RegulonDB database: 606 interactions between 424 operons (by 116 TFs)
7
Biological Networks
– Signal transduction networks
Hormones (first messengers) → receptor → cAMP, Ca++ (second messengers) → phosphorylation
The nuclear transcription factor NF-kB controls:
- apoptosis (cell suicide)
- the development of B and T cells
- antiviral and antibacterial responses
Example: oxidant-induced activation of NF-kB signal transduction
8
Biological Networks
– Cell cycle regulation networks
9
Biological Networks
Biological networks are not randomly connected.
Underlying architecture → clustering
How can we characterize it?
Are there universal features across different species?
10
Graph theory
- The Bridge Obsession Problem
Find a tour crossing every bridge just once (Leonhard Euler, Switzerland, 1735). It turns out to be impossible.
L. Euler, 1707-1783
Bridges of Königsberg
11
Eulerian Cycle Problem
Find a cycle that visits every edge exactly once – solvable in linear time.
More complicated Königsberg
12
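A hedged aside (not on the original slide): the linear-time claim corresponds to Hierholzer's algorithm. Below is a minimal Python sketch for an undirected multigraph; it checks the even-degree condition, assumes the edges form a single connected component, and the triangle used in the demo is a hypothetical stand-in for a bridge map.

    from collections import defaultdict

    def eulerian_cycle(edges):
        # Hierholzer's algorithm: O(|E|). `edges` is a list of undirected
        # (u, v) pairs; returns a vertex sequence or None.
        adj = defaultdict(list)
        for i, (u, v) in enumerate(edges):
            adj[u].append((v, i))
            adj[v].append((u, i))
        # An Eulerian cycle exists iff every vertex has even degree
        # (connectivity of the edge set is assumed here).
        if any(len(nbrs) % 2 for nbrs in adj.values()):
            return None
        used = [False] * len(edges)
        stack, cycle = [edges[0][0]], []
        while stack:
            v = stack[-1]
            while adj[v] and used[adj[v][-1][1]]:
                adj[v].pop()                 # drop edges already traversed
            if adj[v]:
                u, i = adj[v].pop()
                used[i] = True
                stack.append(u)              # extend the current trail
            else:
                cycle.append(stack.pop())    # dead end: emit and backtrack
        return cycle

    print(eulerian_cycle([(0, 1), (1, 2), (2, 0)]))   # e.g. [0, 1, 2, 0]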
Hamiltonian Cycle Problem
Find a cycle that visits every vertex exactly once – NP-complete.
A tour around 20 famous cities of the world: a game invented by Sir William Hamilton in 1857.
Sir W. R. Hamilton (Irish mathematician), 1805-1865
13
Mapping Problems to Graphs
Arthur Cayley (English mathematician) studied the chemical structures of hydrocarbons in the mid-1800s. He used trees (acyclic connected graphs) to enumerate structural isomers.
14
Real networks
Many networks show scale-free behavior:
- World-Wide Web
- Internet
- ecological networks (food webs)
- science collaboration networks
- movie actor collaboration networks
- cellular networks
- linguistic networks
- power grids and neural networks
- sexual contacts within a population (important for disease prevention!)
- etc.
Power-law behavior
15
Graph Theory
Binary relation – pathway – cluster – hierarchical tree
Network representation: a network (graph) consists of a set of elements (vertices) and a set of binary relations (edges).
16
Graph Theory – Basic concepts
Graphs
$G = (N, E)$, with nodes $N = \{n_1, n_2, \ldots, n_N\}$, edges $E = \{e_1, e_2, \ldots, e_M\}$, and $e_k = \{n_i, n_j\}$.
Nodes: proteins. Edges: protein interactions.
The node degrees satisfy $\sum_{n \in N} d(n) = 2|E|$.
Multigraph
$e_k = \{n_i, n_j\}$, plus duplicate edges, i.e. $e_m = \{n_i, n_j\}$ as well.
Nodes: proteins. Edges: interactions of different sorts, e.g. binding and similarity.
Hypergraphs
Hyperedge: $e_x = \{n_i, n_j, n_k, \ldots\}$.
Nodes: proteins. Edges: protein complexes.
Directed hypergraph
Hyperedge: $e_x = \{n_i, n_j, \ldots \mid n_k, n_l, \ldots\}$.
Nodes: substances. Edges: chemical reactions; for A + B → C + D, $e_X = \{A, B \mid C, D\}$.
Directed graph
$e_k = (n_i, n_j)$.
Nodes: genes and their products. Edge from A to B: gene regulation, i.e. gene A regulates the expression of gene B.
Different systems → different graphs
17
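As a small illustration (hypothetical protein names, not from the slide), the simple-graph case and the degree identity $\sum_{n \in N} d(n) = 2|E|$ look like this in Python:

    from collections import defaultdict

    edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
    adj = defaultdict(set)                # adjacency-set representation
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    degrees = {n: len(nbrs) for n, nbrs in adj.items()}
    assert sum(degrees.values()) == 2 * len(edges)   # sum of d(n) = 2|E|
    print(degrees)                 # {'P1': 2, 'P2': 2, 'P3': 3, 'P4': 1}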
Graph Theory – Basic concepts
Node degree
Components
Clustering coefficient $C_i$: if A-B and B-C, then it is highly probable that A-C.
$$C_i = \frac{2E_i}{k_i(k_i - 1)}$$
Example (node A with $k_A = 5$ neighbours, exactly one link among them):
$$C_A = \frac{2 \times 1}{5(5 - 1)} = 0.1$$
Two ways to compute $C_i$:
- $E_i$ = actual connections among the neighbours, out of $\binom{k_i}{2}$ possible connections
- the number of triangles that include $i$, divided by $k_i(k_i - 1)/2$
Complete graph (clique)
Shortest path length
Average clustering coefficient:
$$\langle C \rangle = \frac{1}{N} \sum_{i=1}^{N} C_i$$
18
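The example value $C_A = 0.1$ can be reproduced with a short sketch; the adjacency sets below are a hypothetical graph in which node A has five neighbours, exactly one pair of which is linked:

    from itertools import combinations

    def clustering_coefficient(adj, i):
        # C_i = 2*E_i / (k_i*(k_i-1)), E_i = links among the neighbours of i
        nbrs = adj[i]
        k = len(nbrs)
        if k < 2:
            return 0.0
        e_i = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        return 2.0 * e_i / (k * (k - 1))

    adj = {"A": {"B", "C", "D", "E", "F"}, "B": {"A", "C"}, "C": {"A", "B"},
           "D": {"A"}, "E": {"A"}, "F": {"A"}}
    print(clustering_coefficient(adj, "A"))   # 2*1/(5*4) = 0.1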
Graph Theory – Vertex adjacency matrix
Undirected graph
Example (node 1 linked to nodes 2, 3 and 4):
$$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & \infty & \infty \\ 1 & \infty & 0 & \infty \\ 1 & \infty & \infty & 0 \end{pmatrix}, \qquad k_i = 3, 1, 1, 1$$
- A is symmetric
- ∞ means not directly connected
- node i connectivity: $k_i$ = the number of entries $m_{ij} = 1$ in row i
Bipartite graph
$$A = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}$$
19
Graph Theory – Edge adjacency matrix
Example: graph G with vertices 1-4 and edges a, b, c, d.
$$E(G) = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix} \qquad \text{(symmetric; rows and columns indexed by the edges } a, b, c, d\text{)}$$
The edge adjacency matrix E of a graph G is identical to the vertex adjacency matrix A of the line graph of G, L(G). That is, the edges in G are replaced by vertices in L(G), and two vertices in L(G) are connected whenever the corresponding edges in G are adjacent:
$$A(L(G)) = E(G)$$
Different labelings of the same graph G are related by a similarity transformation, $P^{-1} A(G_1) P = A(G_2)$.
20
Graph Theory – average network distance
Interaction path length, or average network distance, d: the average of the distances between all pairs of nodes.
- f(j): frequency of the shortest interaction path length j
- determined by using Floyd's algorithm
The average network distance d is given by
$$d = \frac{\sum_j j\, f(j)}{\sum_j f(j)}$$
where j is the shortest path length between two nodes.
Network diameter (global) vs. average network distance (local).
21
Graph Theory – the shortest path
The shortest path
- Floyd's algorithm, an O(N³) algorithm.
At iteration n, given three nodes i, j and k, test whether it is shorter to reach j from i by passing through k:
$$M^{(n)}_{ij} = \min\{M^{(n-1)}_{ij},\ M^{(n-1)}_{ik} + M^{(n-1)}_{kj}\}$$
- This searches over all possible paths, e.g. 1-2, 1-2-3, 1-2-4, 2-3, 2-4.
22
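A direct Python rendering of the recurrence (a generic sketch, not code from the original study); the 4-node example is one plausible reading of the small graph on the slide, 0-indexed:

    import math

    def floyd_warshall(n, edges):
        # All-pairs shortest path lengths in O(N^3); entries start at 1 for
        # an edge and infinity for "not directly connected".
        M = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
        for u, v in edges:                  # unweighted, undirected
            M[u][v] = M[v][u] = 1
        for k in range(n):                  # allow k as an intermediate node
            for i in range(n):
                for j in range(n):
                    if M[i][k] + M[k][j] < M[i][j]:
                        M[i][j] = M[i][k] + M[k][j]
        return M

    M = floyd_warshall(4, [(0, 1), (1, 2), (1, 3)])
    print(M[0][2], M[0][3])   # 2 2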
Graph Theory – number of shortest paths in a graph
A nonvanishing element of A(G), $A_{ij} = 1$, represents a walk of length one between the vertices i and j. In general,
$$A_{ij} = \begin{cases} 1 & \text{if there is a walk of length one between vertices } i \text{ and } j \\ 0 & \text{otherwise} \end{cases}$$
Walks of various lengths can be found in a given graph. Thus
$$A_{ik}A_{kj} = \begin{cases} 1 & \text{if there is a walk of length two between vertices } i \text{ and } j \text{ passing through vertex } k \\ 0 & \text{otherwise} \end{cases}$$
Therefore, the expression $(A^2)_{ij} = \sum_{k=1}^{N} A_{ik} A_{kj}$ represents the total number of walks of length 2 in G between the vertices i and j.
For a walk of length L, we have
$$A_{ir}A_{rs}\cdots A_{zj} = \begin{cases} 1 & \text{if there is a walk of length } L \text{ between vertices } i \text{ and } j \text{ passing through vertices } r, s, \ldots, z \\ 0 & \text{otherwise} \end{cases}$$
23
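The identity is easy to verify numerically; the 4-node star below is a hypothetical example:

    import numpy as np

    A = np.array([[0, 1, 0, 0],      # node 0 - node 1
                  [1, 0, 1, 1],      # node 1 is the hub
                  [0, 1, 0, 0],
                  [0, 1, 0, 0]])

    A2 = np.linalg.matrix_power(A, 2)
    A3 = np.linalg.matrix_power(A, 3)
    # (A^L)[i, j] = number of walks of length L between i and j
    print(A2[0, 2])   # 1: the single length-two walk 0-1-2
    print(A3[0, 1])   # 3: the walks 0-1-0-1, 0-1-2-1, 0-1-3-1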
Graph Theory – Trace of a matrix
Trace of the N×N matrix A: $\mathrm{Tr}(A) = \sum_{i=1}^{N} A_{ii}$
In the case of the adjacency matrix of a graph without loops, Tr A = 0.
The trace of powers of A is a graph invariant:
$$\mathrm{Tr}(A^2) = \sum_{i=1}^{N} (A^2)_{ii} = \sum_{i=1}^{N} D(i) = 2M$$
$$\mathrm{Tr}(A^3) = \sum_{i=1}^{N} (A^3)_{ii} = 6C_3$$
where M is the number of edges and C₃ is the number of three-membered cycles.
In the case of a graph with loops, $\mathrm{Tr}(A) = \sum_{i=1}^{N} h_i$, where $h_i$ is the number of loops at vertex i.
24
Random Graph Theory
= Graph Theory +Probability
25
Random Graph Theory
= Graph Theory + Probability
Random graph (Erdős and Rényi, 1960)
N nodes, labeled, connected by n edges
→ $\binom{N}{2} = N(N-1)/2$ possible edges
Example, N = 4: there are $\binom{4}{2} = 6$ possible edges, hence $\binom{6}{n}$ possible graphs with n edges:
n:       1   2   3   4   5   6
C(6,n):  6  15  20  15   6   1
In general, there are $\binom{\binom{N}{2}}{n}$ possible graphs with N nodes and n edges.
The node degrees follow a binomial distribution:
$$P(k_i = k) = \binom{N-1}{k} p^k (1-p)^{N-1-k}$$
27
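To connect this with the Poisson form quoted on the next slide, here is a small simulation sketch (N = 100 and p = 0.02 are arbitrary illustrative choices): it samples G(N, p) graphs and compares the empirical degree distribution with $\langle k \rangle^k e^{-\langle k \rangle}/k!$, where $\langle k \rangle = (N-1)p$.

    import math
    import random
    from collections import Counter

    def degree_histogram(N, p, trials=50):
        counts = Counter()
        for _ in range(trials):
            deg = [0] * N
            for i in range(N):
                for j in range(i + 1, N):      # each of the C(N,2) edges
                    if random.random() < p:    # is present with probability p
                        deg[i] += 1
                        deg[j] += 1
            counts.update(deg)
        return counts, N * trials

    counts, total = degree_histogram(N=100, p=0.02)
    mean_k = 99 * 0.02
    for k in range(6):
        poisson = mean_k**k * math.exp(-mean_k) / math.factorial(k)
        print(k, round(counts[k] / total, 3), round(poisson, 3))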
Random Graph Theory
– Random networks vs. scale-free networks
Connectivity distribution P(k)
In a random network the links are randomly placed, and most of the nodes have degrees close to $\langle k \rangle = 2E/N$. The degree distribution P(k) vs. k is a Poisson distribution, i.e. $P(k) \sim \langle k \rangle^k e^{-\langle k \rangle}/k!$
In many real-life networks the degree distribution has no well-defined peak but follows a power law, $P(k) \sim k^{-\gamma}$, where γ is a constant. Such networks are known as scale-free networks.
Random network: the Log[P(k)] vs Log[k] plot has a peak
→ homogeneous nodes
→ d ~ log N
Scale-free network: the Log[P(k)] vs Log[k] plot is a line with negative slope
→ inhomogeneous nodes
→ d ~ log(log N)
Albert R. and Barabási A.-L. (2002) Rev. Mod. Phys. 74, 47.
[Figures: random network vs. scale-free network; http://physicsweb.org/box/world/]
28
Example – metabolic pathways
WIT database (43 organisms); node = substrate, edge = reaction
→ scale-free network: $P(k) \sim k^{-\gamma}$, with γ_in = 2.2 and γ_out = 2.2
→ similar scaling behavior of the connectivity distribution across organisms
→ Fig. 2d: connectivity distribution averaged over the 43 organisms
→ suggests that metabolic networks belong to the class of scale-free networks
http://ergo.integratedgenomics.com/IGwit/
It is interesting to notice that most real networks have 1 < γ < 3.
29
Random Graph, Scale-free network,
Hierarchical network
A hierarchical network combines
(1) modularity,
(2) local clustering, and
(3) scale-free behavior.
It is characterized by the node degree distribution together with the clustering-coefficient scaling $C_{ave}(k) \sim k^{-\beta}$, with β ≈ 1 for the deterministic hierarchical network model.
30
Graph Theory – Network motifs
Compare the abundance of small loops in the E. coli transcription regulatory network to its randomized counterpart.
- Treat the transcription network as a directed graph:
→ node = operon (a group of contiguous genes)
→ edge = from an operon that encodes a TF to an operon regulated by that TF
- The frequencies of occurrence of three types of motifs (feed-forward loops, single input modules, and dense overlapping regulons) are much higher than in the randomized network.
- There are 13 types of 3-node connected, directed subgraphs.
- Feed-forward loops (FFL) were significantly over-represented (40 in the real network vs 7 ± 5 in random).
Reference: S.S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, Nature Genetics, 31(1):64-68 (2002)
31
Graph Theory – Network motifs
Feed-forward loop (FFL)
- A TF X regulates a second TF Y, and both jointly regulate one or more operons Z1, ..., Zn.
Single input module (SIM)
- A single TF X regulates a set of operons Z1, ..., Zn. X is usually autoregulatory.
Dense overlapping regulons (DOR)
- A set of operons Z1, ..., Zm are each regulated by a combination of a set of TFs X1, ..., Xn.
32
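A brute-force count makes the FFL definition concrete. This sketch is illustrative only (it is not the randomization-based census of Shen-Orr et al., and the toy edge list is hypothetical): it counts ordered triples with X→Y, X→Z and Y→Z.

    from itertools import permutations

    def count_ffl(nodes, edges):
        E = set(edges)
        return sum(1 for x, y, z in permutations(nodes, 3)
                   if (x, y) in E and (x, z) in E and (y, z) in E)

    # Toy regulatory network: X regulates Y; X and Y jointly regulate Z1, Z2.
    edges = [("X", "Y"), ("X", "Z1"), ("Y", "Z1"), ("X", "Z2"), ("Y", "Z2")]
    print(count_ffl({"X", "Y", "Z1", "Z2"}, edges))   # 2 feed-forward loops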
Graph Theory – Node degree correlation
In random graph models, node degrees are uncorrelated.
- Count the frequency P(K1, K2) that two proteins with connectivities K1 and K2 are connected to each other by a link.
- Compare it to the same quantity P_R(K1, K2) measured in a randomized version of the same network.
- The average node connectivity for a fixed K1 is given by
$$\langle K_2 \rangle = \sum_{K_2} K_2 \, \frac{P(K_1, K_2)}{\langle P_R(K_1, K_2) \rangle}$$
where ⟨⟩ denotes the multiple-sampling average, and the sum runs over all K2 for a fixed K1.
- In the randomized version, the node degree of each protein is kept the same as in the original network, whereas the linking partners are chosen totally at random.
[Figure: an example link between a node with K1 = 2 and a node with K2 = 5.]
33
Input - Database of Interacting
Proteins (DIP)
DIP http://dip.doe-mbi.ucla.edu
DIP is a database that documents experimentally determined protein-protein interactions. We analyze the protein-protein interactions for seven different species: C. elegans, D. melanogaster, E. coli, H. pylori, H. sapiens, M. musculus and S. cerevisiae.
- Look for general and species-specific features of the PINs.
34
Input - Database of Interacting
Proteins (DIP)
35
Results – scale-free network study
[Figure: Log Pcum(k) vs Log k for C. elegans, fly (D. melanogaster), E. coli, H. pylori, H. sapiens, M. musculus and yeast (S. cerevisiae).]
• Large standard deviation of k
• Coefficient of determination r² = SSR/SST > 0.90
• To account for the flat-plateau and long-tail behaviors, assume a short-length-scale correction k₀ and an exponential cut-off tail at k_c:
$$P(k) \propto (k + k_0)^{-\gamma} e^{-k/k_C}$$
with γ ~ 2.1 for yeast and γ ~ 1.9 for fly.
36
Results
Highly connected proteins (k ≥ 25) – yeast (39 sequences) and fly (317 sequences).
Most of the sequences do not share high sequence similarity (E-value ≤ 0.01) → different functions.
37
Results
These highly connected proteins were pair-wise compared in an all-against-all manner using gapped BLAST (16), and none of the sequences show significant sequence similarity (E-value < 0.001), except the tryptophan protein and SEC27 protein, the nuclear pore protein, the 26S proteasome regulatory particle chain, and the DNA-directed RNA polymerase.
38
Results
[Figure: Log f(L) vs path length L for the seven species.]
Fig. 4. The logarithm of the normalized frequency distribution of connected paths vs their length for S. cerevisiae (CORE), H. pylori, E. coli, H. sapiens, M. musculus, D. melanogaster and C. elegans.
39
Results – node degree correlation
Highly connected proteins are unlikely to interact.
[Figure: plots of the node degree correlation for the seven species.]
40
Results – Hierarchical structures
[Panels: yeast and E. coli.]
$C_{ave}(k) \sim k^{-\beta}$
The plots of Log Cave(k) vs Log k for the seven species: all species exhibit a rather flat plateau for small values of k and fall rapidly for large k.
41
Results – identification of cliques
To identify protein complexes: compute the clustering coefficients, then find the cliques or pseudo-cliques.
42
Identification of cliques
1
Aij  
0
if there is a walk of length one between vertices i and j
otherwise
Theorem
Let A3ij be the (i,j)-th element of A3. Then a vertex Pi belongs to
some clique if and only if A3ii ≠0.
Example
0
1

A  0

1
1
1
0
0
1
0
0
0
0
0
0
1
1
0
0
0
1
0
0

0
0
and
2
4

3
A  0

4
3
4
2
0
3
1
0
0
0
0
0
4
3
0
2
1
3
1
0

1
0
The non-zero diagonal entries of A3 are a311, a322 and a344. Consequently, node 1, 2 and 4
belong to cliques. Since a clique must contain at least three vertices, the graph has only
one clique.
43
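The theorem translates directly into a few lines of numpy, applied here to the slide's example (0-indexed internally):

    import numpy as np

    A = np.array([[0, 1, 1, 1],      # vertices 1, 2, 4 form a triangle;
                  [1, 0, 0, 1],      # vertex 3 hangs off vertex 1
                  [1, 0, 0, 0],
                  [1, 1, 0, 0]])

    A3 = np.linalg.matrix_power(A, 3)
    print(np.diag(A3))                          # [2 2 0 2]
    print(np.flatnonzero(np.diag(A3)) + 1)      # vertices 1, 2 and 4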
Results
- protein complexes
Identification of the highest-clique-degree nodes with protein complexes.
We identified all possible cliques within the seven PINs. To identify the relation between cliques and protein complexes, we only considered cliques with the largest number of connected proteins in our preliminary study, and found that some of the cliques did correspond to protein complexes (comparing with data from the BIND database).
44
Evolution of Biological Networks
Databases – DIP and MIPS
Motif identification
- detecting all n-node subgraphs, i.e. all 2-, 3- and 4-node motifs and a set of 28 five-node motifs, in the yeast PIN
- the network of 3183 yeast proteins contains between 1,000 and 1,000,000 copies of the specific motif types
45
Evolution of Biological Networks
- Studied the conservation of 678 (47% of 1443) yeast proteins with an ortholog in each of five higher eukaryotes (A. thaliana, C. elegans, D. melanogaster, M. musculus and H. sapiens) deposited in the InParanoid database.
- 47% of the 1443 fully connected pentagons (#11) in yeast have each of their five protein components conserved in each of the five higher eukaryotes.
- This result → blocks of cohesive motifs tend to be evolutionarily conserved.
46
Evolution of Biological Networks
Growth model of a scale-free PIN
- New protein nodes are added (gene duplication)
- Preferential attachment
- Redundant links are lost (in an asymmetric fashion)
47
Evolution of Biological Networks
Growth
1. start with m₀ nodes
2. add a node with m edges
3. connect these edges to existing nodes
At time step t: t + m₀ nodes, tm edges.
Preferential attachment
The probability q of connection to node i depends on the degree $k_i$ of that node:
$$q(k_i) = \frac{k_i}{\sum_j k_j}$$
Example: m₀ = 3, m = 2.
This model leads to the power-law distribution $P(k) = 2m^2 k^{-3} \sim k^{-3}$.
48
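A minimal simulation sketch of the growth + preferential attachment rule, using the slide's parameters m₀ = 3 and m = 2 (the chain-shaped seed and the run length are arbitrary choices):

    import random
    from collections import Counter

    def barabasi_albert(m0, m, t_steps):
        edges = [(i, i + 1) for i in range(m0 - 1)]  # seed: m0 nodes in a chain
        # `ends` lists each node once per incident edge, so uniform sampling
        # from it realises q(k_i) = k_i / sum_j k_j.
        ends = [v for e in edges for v in e]
        for new in range(m0, m0 + t_steps):
            targets = set()
            while len(targets) < m:
                targets.add(random.choice(ends))
            for v in targets:
                edges.append((new, v))
                ends.extend((new, v))
        return edges

    edges = barabasi_albert(m0=3, m=2, t_steps=2000)
    deg = Counter(v for e in edges for v in e)
    print(max(deg.values()), min(deg.values()))   # a few hubs, many small nodes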
Summary
Protein-protein interaction networks
• PINs are not random networks; they have rather heterogeneous structures → highly connected proteins → blastp shows that they do not share sequence similarity
• The plots of Log[Pcum(k)] vs Log[k] indicate that PINs are well approximated by scale-free networks
• γ ~ 2 → a general biological evolution mechanism across species → growth + preferential attachment model
• The plots of Log[Pcum(k)] vs Log[k] for fly and yeast seem to deviate at small and large k → modification of the growth + preferential attachment model
• Highly connected proteins are unlikely to interact
• The hierarchical network model is a better description for certain species' PINs
[Figure: Log Cave(k) vs Log k for S. cerevisiae (CORE), with regression line.]
49
Matrix
Permutations
A one-to-one mapping of the set {1, 2, 3, ..., n} onto itself is called a permutation. We denote a permutation s by s = j₁j₂...jₙ.
The number of possible permutations is n!, and the set of them is denoted by Sₙ. For example, S₂ = {12, 21} and S₃ = {123, 132, 213, 231, 312, 321}.
Consider an arbitrary permutation s in Sₙ: s = j₁j₂...jₙ. We say s is even or odd according to whether there is an even or odd number of pairs (i, k) for which
i > k but i precedes k in s.
We then define the sign of s, written sgn s, by
$$\mathrm{sgn}(s) = \begin{cases} 1 & \text{if } s \text{ is even} \\ -1 & \text{if } s \text{ is odd} \end{cases}$$
50
Matrix
Example
Consider the permutation s = 35142 in S₅.
→ 3 and 5 precede and are greater than 1; hence the pairs (3,1) and (5,1)
→ 3, 5 and 4 precede and are greater than 2; hence (3,2), (5,2), (4,2)
→ 5 precedes and is greater than 4; hence (5,4)
→ Since exactly 6 such pairs exist, s is even and sgn(s) = 1.
51
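The inversion count in this example is straightforward to code:

    def sgn(s):
        # Sign of a permutation given as a sequence, e.g. (3, 5, 1, 4, 2):
        # count the pairs (i, k) with i > k but i preceding k.
        inversions = sum(1 for a in range(len(s)) for b in range(a + 1, len(s))
                         if s[a] > s[b])
        return 1 if inversions % 2 == 0 else -1

    print(sgn((3, 5, 1, 4, 2)))   # 1: exactly 6 inversions, so s is even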
Matrix
Determinant
The determinant of a matrix A is defined by
$$\det(A) = \sum_{s} \mathrm{sgn}(s)\, a_{1 j_1} a_{2 j_2} \cdots a_{n j_n}$$
where the factors come from successive rows, so the first subscripts are in the natural order 1, 2, ..., n. The sequence of second subscripts forms a permutation s in Sₙ, and the sum runs over all permutations s in Sₙ.
Example
In S₂, the permutation 12 is even and the permutation 21 is odd:
det(A) = a₁₁a₂₂ – a₁₂a₂₁
In S₃, the permutations 123, 231 and 312 are even, and the permutations 132, 213 and 321 are odd:
det(A) = a₁₁a₂₂a₃₃ + a₁₂a₂₃a₃₁ + a₁₃a₂₁a₃₂ – a₁₃a₂₂a₃₁ – a₁₂a₂₁a₃₃ – a₁₁a₂₃a₃₂
Permanent of A: $\mathrm{per}(A) = \sum_{s} a_{1 j_1} a_{2 j_2} \cdots a_{n j_n}$, where sgn(s) is replaced by 1.
52
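The definition can be evaluated literally, an O(n·n!) sketch that is practical only for tiny matrices; the permanent differs only in dropping the sign:

    import math
    from itertools import permutations

    def sgn(s):
        inv = sum(1 for a in range(len(s)) for b in range(a + 1, len(s))
                  if s[a] > s[b])
        return 1 if inv % 2 == 0 else -1

    def det(A):
        # Leibniz expansion: sum over s in S_n of sgn(s) a_{1j1} ... a_{njn}
        n = len(A)
        return sum(sgn(s) * math.prod(A[i][s[i]] for i in range(n))
                   for s in permutations(range(n)))

    def per(A):
        # Permanent: the same sum with sgn(s) replaced by 1
        n = len(A)
        return sum(math.prod(A[i][s[i]] for i in range(n))
                   for s in permutations(range(n)))

    A = [[1, 2], [3, 4]]
    print(det(A), per(A))   # -2 10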
Matrix
The incidence matrix
The incidence matrix T(G) of a graph G with N vertices and M edges is the N×M matrix whose rows and columns correspond to the vertices and edges of G, respectively. It is defined by
$$T_{ij} = \begin{cases} 1 & \text{if the } j\text{-th edge is incident with the } i\text{-th vertex} \\ 0 & \text{otherwise} \end{cases}$$
Example: G with vertices 1-4 and edges a (1-2), b (2-3), c (3-4), d (2-4):
$$T(G) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}$$
Since each edge is incident with exactly two vertices,
$$\sum_{i,j} T_{ij}^2 = \sum_i D_i = 2M$$
where M is the number of edges. Here $\sum_{i,j} T_{ij} = 8 = 2 \times 4$.
53
Matrix
The circuit matrix
The circuit matrix C(G) of a graph G, whose cycles (circuits) c and edges e are labeled, is a c×e matrix whose rows and columns correspond to the circuits and edges of G, respectively. It is defined by
$$C_{ij} = \begin{cases} 1 & \text{if the } i\text{-th cycle contains the } j\text{-th edge} \\ 0 & \text{otherwise} \end{cases}$$
Example: G with vertices 1-4 and edges a, b, c, d, e has cycles
C₁ = {a, d, e}, C₂ = {b, c, e}, C₃ = {a, b, c, d}
$$C(G) = \begin{pmatrix} 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 & 0 \end{pmatrix}$$
(rows C₁, C₂, C₃; columns a, b, c, d, e)
54
Principal Components Analysis (PCA)
General Outline
• Suppose you have a microarray dataset composed of 1000 genes, each of which has an expression value over 10 experiments. The dimensionality of that dataset is therefore 10 or 1000, depending on the view.
• The data, though clumped around several central points in that hyperspace, will generally tend towards one direction. If one were to draw a solid line that best describes that direction, that line is the first principal component (PC). The result is a space in which the axes are the eigenvectors of the covariance matrix of the experiments (in this space each point is a gene) or of the genes (in this space each point is an experiment).
• Any variation not captured by the first PC is captured by subsequent orthogonal PCs, each of which captures the maximum amount of variation left in the data.
• The first 3 PCs could themselves act as Cartesian axes, and the data they capture can then be plotted in terms of these axes. Hence there is a reduction of dimensionality.
• When the data are plotted in this manner, they are said to be plotted in PC-space.
References
1. http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/PCA_1.htm
2. Draghici S. (2003) Data Analysis Tools for DNA Microarrays. Chapman & Hall/CRC
55
Principal Components Analysis (PCA)
- PCA is commonly used in microarray research as a cluster analysis tool.
- It captures the variance in a dataset in terms of principal components.
- It tries to reduce the dimensionality of the data to summarize the most important (i.e. defining) parts whilst simultaneously filtering out noise.
- Normalization, however, can sometimes remove this noise and make the data less variant, which could affect the ability of PCA to capture the data structure.
- PCA can be applied to datasets to capture the cluster structure (using just the first few PCs) prior to cluster analysis (e.g. before performing k-means clustering to determine a good value for K).
[Figure: coordinate transformation (translation + rotation). Most of the variance lies along the first eigenvector; variance along the second eigenvector is probably due to noise.]
56
Principal Components Analysis (PCA)
- PCA pays attention to the dimensions that account for a large variance in the data and ignores the dimensions in which the data do not vary very much.
- The direction of a PC is determined by calculating the eigenvectors of the covariance matrix of the pattern.
- An eigenvector of a matrix A is defined as a vector z such that
$$Az = \lambda z$$
where λ is the eigenvalue.
For example,
$$A = \begin{pmatrix} -1 & 1 \\ 0 & -2 \end{pmatrix}$$
has eigenvalues λ₁ = -1 and λ₂ = -2, and eigenvectors z₁ = (1, 0) and z₂ = (1, -1):
$$Az_1 = \begin{pmatrix} -1 & 1 \\ 0 & -2 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = (-1)\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \lambda_1 z_1$$
and similarly for λ₂ = -2.
- In other words, the covariance matrix captures the shape of the data set.
- The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector.
57
Principal Components Analysis (PCA)
How to calculate the eigenvalues of a matrix? Solve
$$\det\begin{pmatrix} -1-\lambda & 1 \\ 0 & -2-\lambda \end{pmatrix} = (-1-\lambda)(-2-\lambda) = 0$$
giving λ₁ = -1 and λ₂ = -2.
How to find the eigenvectors?
For λ₁ = -1:
$$\begin{pmatrix} -1 & 1 \\ 0 & -2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = (-1)\begin{pmatrix} x \\ y \end{pmatrix} \;\Rightarrow\; -x + y = -x,\ -2y = -y \;\Rightarrow\; y = 0,\ x \text{ anything; choose } x = 1,\ y = 0.$$
For λ₂ = -2:
$$\begin{pmatrix} -1 & 1 \\ 0 & -2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = (-2)\begin{pmatrix} x \\ y \end{pmatrix} \;\Rightarrow\; -x + y = -2x,\ -2y = -2y \;\Rightarrow\; x = -y; \text{ choose } x = 1,\ y = -1.$$
Eigenvectors are determined only up to a multiplicative constant.
Normalization of eigenvectors → |x| = 1, i.e. $x^2 + y^2 = 1$.
58
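The hand calculation can be confirmed with numpy; numpy returns the normalized eigenvectors ($x^2 + y^2 = 1$), which fixes the multiplicative constant:

    import numpy as np

    A = np.array([[-1.0,  1.0],
                  [ 0.0, -2.0]])
    vals, vecs = np.linalg.eig(A)
    print(vals)   # eigenvalues -1 and -2 (order may vary)
    print(vecs)   # columns proportional to (1, 0) and (1, -1), unit length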
Eigenvalues and eigenvectors
If A is a square matrix, then λ is an eigenvalue of A if, for some nonzero vector v,
$$Av = \lambda v$$
where v is an eigenvector of A belonging to λ.
Linear dependence
The vectors v₁, ..., vₘ are said to be linearly dependent if there exist scalars a₁, ..., aₘ, not all of them 0, such that
a₁v₁ + ... + aₘvₘ = 0
Otherwise, the vectors are said to be linearly independent.
Example
The vectors u = (1, -1, 0), v = (1, 3, -1) and w = (5, 3, -2) are dependent, since 3u + 2v - w = 0.
Theorem
Nonzero eigenvectors belonging to distinct eigenvalues are linearly independent.
59
Eigenvalues and eigenvectors
Theorem
An n-square matrix A is similar to a diagonal matrix B if and only if A has n linearly independent eigenvectors. In this case the diagonal elements of B are the corresponding eigenvalues.
In the above theorem, if we let P be the matrix whose columns are the n independent eigenvectors of A, then B = P⁻¹AP.
Example: consider the matrix
$$A = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}$$
A has two independent eigenvectors $\begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$. Set
$$P = \begin{pmatrix} 2 & 1 \\ 3 & -1 \end{pmatrix}, \qquad P^{-1} = \begin{pmatrix} 1/5 & 1/5 \\ 3/5 & -2/5 \end{pmatrix}$$
Then A is similar to the diagonal matrix
$$B = P^{-1}AP = \begin{pmatrix} 1/5 & 1/5 \\ 3/5 & -2/5 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}\begin{pmatrix} 2 & 1 \\ 3 & -1 \end{pmatrix} = \begin{pmatrix} 4 & 0 \\ 0 & -1 \end{pmatrix}$$
As expected, the diagonal elements 4 and -1 of the diagonal matrix B are the eigenvalues corresponding to the given eigenvectors.
60
Characteristic Polynomial
– Cayley-Hamilton Theorem
The matrix tI - A, where I is the n-square identity matrix and t is an indeterminate, is called the characteristic matrix of A. Its determinant, $D_A(t) = \det(tI - A)$, which is a polynomial in t, is called the characteristic polynomial of A.
Cayley-Hamilton Theorem
Every matrix is a zero of its characteristic polynomial.
Example: the characteristic polynomial of the matrix $A = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}$ is
$$D(t) = (t-1)(t-2) - (-2)(-3) = t^2 - 3t - 4$$
As expected, A is a zero of D(A):
$$D(A) = \begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}^2 - 3\begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix} - 4\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$$
61
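A quick numerical check of the theorem for this example:

    import numpy as np

    A = np.array([[1, 2],
                  [3, 2]])
    # Characteristic polynomial D(t) = t^2 - 3t - 4, so D(A) should vanish:
    D_A = np.linalg.matrix_power(A, 2) - 3 * A - 4 * np.eye(2, dtype=int)
    print(D_A)   # [[0 0]
                 #  [0 0]]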
Singular Value Decomposition
• Performing PCA is equivalent to performing Singular Value Decomposition (SVD) on the covariance matrix of the data.
• Applications: image processing and compression, information retrieval, immunology, molecular dynamics, least-squares problems.
• What is SVD? (See the MIT paper at the bottom of this page for more depth.)
• For the sake of example, let A be an n×p matrix that represents gene expression experiments, such that the rows are genes and the columns are experiments (i.e. arrays). The SVD of A is the factorization
A = USVᵀ (Eq. 1)
(note that Vᵀ means 'the transpose of matrix V').
• How to determine U, V and S?
• In Eq. 1, the matrices U and V are orthogonal (their columns are orthonormal: UᵀU = I, VᵀV = I). The columns of U are called the left singular vectors (gene coefficients), and the rows of Vᵀ are called the right singular vectors (expression level vectors).
• To calculate the matrices U and V, one must calculate the eigenvectors and eigenvalues of AAᵀ and AᵀA. These multiplications of A by its transpose result in square matrices (the number of columns equals the number of rows).
62
Singular Value Decomposition
• The eigenvectors of AAᵀ (an n×n matrix) → the columns of U (an n×n matrix).
• The eigenvectors of AᵀA (a p×p matrix) → the columns of V (a p×p matrix).
• The square roots of the eigenvalues of AAᵀ or AᵀA (s₁ ≥ s₂ ≥ ...) → the diagonal of S (an n×p matrix). The diagonal entries of S are the singular values of the original matrix A (min(n, p) of them).
• Each eigenvector described above represents a principal component. PC1 (Principal Component 1) is defined as the eigenvector with the highest corresponding eigenvalue. The individual eigenvalues are numerically related to the variance they capture via the PCs: the higher the value, the more variance captured.
• Please also see Yeung & Ruzzo (2001), where it is shown that lower PCs can also capture data structure.
http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
http://www.netlib.org/lapack/lug/node53.html or http://kwon3d.com/theory/jkinem/svd.html
63
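The relations above are easy to verify numerically. The sketch below uses a random, hypothetical genes × experiments matrix; note that the exact PCA correspondence additionally requires mean-centred data, which is glossed over here:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 4))          # 6 "genes" x 4 "experiments"

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    assert np.allclose(U @ np.diag(s) @ Vt, A)     # A = U S V^T

    # Singular values = square roots of the eigenvalues of A^T A:
    evals = np.linalg.eigvalsh(A.T @ A)[::-1]      # sorted descending
    assert np.allclose(np.sqrt(evals), s)
    print(Vt[0])   # first right singular vector = the PC1 direction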
Singular Value Decomposition
Example
[Figure: the eigenvectors of AAᵀ, the eigenvectors of AᵀA, and the eigenvalues of AAᵀ or AᵀA for a worked numerical example.]
64
Yeast Two-hybrid System
pubs.acs.org/cen/coverstory/ 7831/7831scit1.html
There are two plasmids:
- a fusion of the target protein to an activation domain (AD)
- a fusion of the bait protein to a DNA-binding domain (BD)
If the target protein binds to the bait protein, it brings the AD close to the BD. The BD binds to the GAL4 promoter and brings the AD close, so it can activate transcription.
LacZ is expressed from a GAL4 promoter when the proteins interact; LacZ makes the yeast blue.
http://cmbi.bjmu.edu.cn/
65
Immunoprecipitation
(1) Antibody added to a mixture of radiolabeled (*) or unlabeled proteins binds specifically to its antigen (A) (left tube).
(2) The antibody-antigen complex is absorbed from solution through the addition of an immobilized antibody-binding protein, such as Protein A-Sepharose beads (middle panel).
(3) Upon centrifugation, the antibody-antigen complex is brought down in the pellet (right panel).
Subsequent liberation of the antigen can be achieved by boiling the sample in the presence of SDS.
66
Co-Immunoprecipitation
Co-IP vs. IP
Co-immunoprecipitation (Co-IP) is a popular technique for protein interaction discovery. Co-IP is conducted in essentially the same manner as an IP. However, in a Co-IP the target antigen precipitated by the antibody "co-precipitates" a binding partner/protein complex from a lysate (the solubilized cell extract); i.e., the interacting protein is bound to the target antigen, which is bound by the antibody, which in turn is captured on the Protein A or G gel support.
67