Transcript Slide 1

COMS 6998-06 Network Theory
Week 10
Dragomir R. Radev
Wednesdays, 6:10-8 PM
325 Pupin Terrace
Fall 2010
(16) Biological, technological and
information networks
Biological networks
• Protein interaction networks
• Metabolic networks
• Gene regulatory networks
Gene expression
• The process of converting genetic
information (DNA) into proteins
• The central dogma of cell biology:
information passes from nucleic acid to
another nucleic acid (e.g., DNA to RNA) or
from a nucleic acid (e.g., RNA) to proteins
• DNA is composed of bases: adenine (A),
thymine (T), cytosine (C), and guanine (G)
• A binds with T, C binds with G.
Gene expression (cont’d)
• DNA is copied onto RNA using an enzyme
(RNA polymerase)
• Proteins are included in cells (up to 50% of
the dry weight of most cells)
• Proteins are sequences of 20 distinct
types of amino acids. Their length can be
as high as more than 25,000 amino acids)
Sample proteins
glycine
tryptophan
alanine
[images from
http://www.biology.arizona.edu/biochemistry]
Polypeptides
• Peptide bonds are formed between two
amino acids when the amino group of one
of them reacts with the carboxyl group of
the other. Water is released in the
process.
Protein folding
• Based on the chemical properties of the amino
acids (e.g., hydrobhobic, hydrophilic, electrically
charged), the polypeptide sequence folds in 3-D.
Protein interactions
• Signal transduction – communication
between external signals and the inside of
the cell
• Proteins may carry each other
• Phosphorylation
• Protein complexes
Protein interaction networks
• Network structure
• Deletion of hubs is
lethal for the
organism (shown
for Saccharomyces
cereviciae)
a, Map of protein–protein interactions. The largest cluster, which contains 78% of all proteins, is shown. The colour of a node signifies the
phenotypic effect of removing the corresponding protein (red, lethal; green, non-lethal; orange, slow growth; yellow, unknown). b,
Connectivity distribution p(k) of interacting yeast proteins, giving the probability that a given protein interacts with k other proteins. The
exponential cut-off indicates that the number of proteins with more than 20 interactions is slightly less than expected for pure scale-free
networks. In the absence of data on the link directions, all interactions have been considered as bidirectional. The parameter controlling the
short-length scale correction has value k0 1. c, The fraction of essential proteins with exactly k links versus their connectivity, k, in the
yeast proteome. The list of 1,572 mutants with known phenotypic profile was obtained from the Proteome database. Detailed statistical
analysis, including r = 0.75 for Pearson's linear correlation coefficient, demonstrates a positive correlation between lethality and connectivity.
Nature 411, 41-42, 2001. Lethality and centrality in protein networks
H. Jeong, S. P. Mason, A.-L. Barabási and Z. N. Oltvai
Yeast
proteins
Sergei Maslov and Kim Sneppen
Specificity and stability in
topology of protein networks
Science 296, 910-913 (2002).
Joint work with Arzucan Ozgur, Thuy Vu, David States
• Challenge in extracting relevant information from vast
amount of publications
– Biomedical literature growing rapidly ( > 14 million articles in
PubMed)
– Delay in including new discoveries to curated databases
– Most information uncovered in unstructured text of biomedical
publications
• Determining gene-disease associations requires laborious
experiments over hundreds of genes
– e.g. Genetic linkage analysis
Approach: Text mining and network analysis for predicting gene-disease
associations
High Level Description
Collect known disease genes
(seed genes) from OMIM
Extract interactions of seed genes
and their neighbors from the literature
Gene
name identification (Genia
Tagger)
Gene name synonyms (HUGO)
Dependency parsing & SVM
Build disease-specific
gene interaction network
Snapshot of PMCOA used
48,245 articles
Find the central genes
degree, eigenvector, closeness,
betweenness
Hypothesis: Genes central in the disease-specific gene interaction network are
likely to be related to the disease
Seed Genes
•15 prostate cancer genes from OMIM Morbid Map
Gene
Description
AR
BRCA2
MSR1
EPHB2
KLF6
MAD1L1
HIP1
CD82
ELAC2
MXI1
PTEN
RNASEL
HPC1
CHEK2
PCAP
androgen receptor
breast cancer 2, early onset
macrophage scavenger receptor 1
EPH receptor B2
Kruppel-like factor 6
MAD1 mitotic arrest deficient-like 1 (yeast)
huntingtin interacting protein 1
CD82 molecule
elaC homolog 2 (E. coli)
MAX interactor 1
phosphatase and tensin homolog (mutated in multiple advanced cancers 1)
ribonuclease L (2',5'-oligoisoadenylate synthetase-dependent)
hereditary prostate cancer 1
CHK2 checkpoint homolog (S. pombe)
predisposing for prostate cancer
Feature Extraction from
Dependency Parse Trees
“The results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and
SasA.”
Path1: KaiC – nsubj – interacts – obj – SasA
Path2: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiA
Path3: KaiC – nsubj – interacts – obj - SasA – conj_and – KaiB
Path4: SasA – conj_and – KaiA
Path5: SasA – conj_and – KaiB
Path6: KaiA - prep_with - SasA – conj_and – KaiB
•
•
•
Path Edit Kernel
Word-based edit distance
– Minimum number of edit operations (insertion, deletion, or substitution of
a single word) to transform the first string to the second
Ex:
– 1. KaiC - subj - interacts - obj - SasA - conj - KaiA
– 2. KaiC - subj - interacts - obj - SasA - conj – KaiA
Edit distance = 2 (2 insertions)
Normalize edit distance: divide to the length (number of words) of the
longer path: 2/7 = 0.286
Converting the distance measure to a similarity function:
EditSim pi , p j
 γ EditDist  pi , p j 

= e
– Parameter γ makes the kernel matrix well-defined (positive definite) (γ = 4.5)
Performance of Gene Interaction Extraction
Data Sets:
Data Set
Sentences
+ Sentences
- Sentences
AIMED
4026
951
3075
CB
4056
2202
1854
Results: 10-fold cross validation
Precision
Recall
F-measure
AIMED
77.52
43.51
55.61
CB
85.15
84.79
84.96
Previous Results for AIMED:
Precision
Recall
F-measure
(Yakushiji et al., 2005)
33.70
33.10
33.40
(Mitsumori et al., 2006)
54.20
42.60
47.70
Constructing the Interaction Network
•
Sample extracted interaction sentences:
– PTEN is transcriptionally regulated by transcription factors such as p53
and Egr-1.
– In response to DNA damage, the cell-cycle checkpoint kinase CHEK2
can be activated by ATM kinase to phosphorylate p53 and BRCA1,
which are involved in cell-cycle control and apoptosis.
– The interactions of RAD51 with TP53, RPA and the BRC repeats of
BRCA2 are relatively well understood (see Discussion).
– The interaction of BRCA2 with HsRad51 is significantly more different to
both RadA and RecA (Figure 2c).
• The constructed graph:
BRCA1
CHEK2
p53
TP53
RAD51
BRCA2
PTEN
HsRad51
Egr-1
Gene Name Normalization
• Constructing dictionary of
protein synonyms from HUGO
Gene Nomenclature Database
(HGNC).
• Replacing gene names with
‘official HGNC symbol’.
• Combining the interactions
– AR ~ androgen receptor
AIS
NR3C4
SMAX1
HUMARA
DHTR
SBMA
– p53 ~ TP53
– RAD51 ~ HsRad51
A
B
A’
B’
A’’
B’’
Dictionary of
synonyms
Normalization
A
B
Constructing the Interaction Network
• The graph before gene name normalization:
BRCA1
CHEK2
p53
PTEN
TP53
RAD51
BRCA2
Egr-1
HsRad51
• The graph after gene name normalization:
BRCA1
CHEK2
RAD51
TP53
BRCA2
PTEN
EGR1
Graph Centrality Measures
• Measure of importance of a node in the graph
• Given a node x:
• Degree centrality: number of nodes that x is connected to.
• Eigenvector centrality: weighted sum of the centralities of
other nodes that connect to x.
• Closeness centrality: sum of the shortest distances from x to
other nodes in the network.
• Betweenness centrality: number of shortest paths between other
nodes that run through x.
Properties of the disease-specific
network
• 226 nodes
•
•
•
•
1,187 edges
Diameter = 6
Average shortest path length = 2.57
Watts-Strogatz Clustering Coefficient = 0.4497
– C(rand) = 0.0487
• Power-law exponent = 2.24
• Small-world network with power-law degree distribution
Evaluation Method for Gene-Disease
Associations
• Prostate Gene DataBase (PGDB)
– manually expert curated database for genes related to prostate
cancer
– relatively high quality, reliable, objective
• KEGG Pathway for prostate cancer
– manually drawn pathway map of currently known interaction and
reaction network for prostate cancer
– relatively high quality, reliable, objective
• Literature (published articles)
– relatively less reliable and subjective
% of top n genes associated with prostate cancer
based on Prostate Gene DataBase (PGDB)
Top n
10
20
30
40
50
75
100
125
150
175
200
226
Degree Eigenvector Betweenness
80.00
80.00
90.00
75.00
80.00
70.00
60.00
63.33
63.33
55.00
57.50
52.50
46.00
50.00
48.00
33.33
36.00
34.67
26.00
28.00
26.00
23.20
25.60
23.20
20.67
22.00
20.00
18.29
20.57
18.29
17.50
19.00
18.50
17.70
17.70
17.70
Closeness
70.00
55.00
56.67
47.50
42.00
33.33
27.00
23.30
20.00
18.29
17.00
17.70
Baseline
50.00
45.00
43.33
32.50
28.00
34.67
27.00
22.40
18.67
17.14
15.00
13.27
• Baseline
• Co-occurrence network constructed
• Metric: number of connections with seed genes
• Top 20: Degree, eigenvector, and betweenness significantly better than baseline
(Fisher's Exact Test; p-value < 0.05)
Top Ranked 20 Genes
12 genes: Prostate Gene DataBase (PGDB)
2 genes: KEGG pathway for prostate cancer and literature (MDM2 and INS)
2 genes: literature (NR3C1 and MAPK1)
7 genes: No positive or negative evidence
Evidence from the literature
• MDM2
– “MDM2 has a role in prostate cancer growth via p53 dependent
and p53-independent mechanisms” (Wang et al., 2003; Zhang et
al., 2003)
• INS
– “Polymorphism of the insulin gene is associated with increased
prostate cancer risk” (Ho et al., 2003)
References:
•
Ho, G., Melman, A., Liu, S., Li, M., Yu, H., Negassa, A., Burk, R., Hsing, A., Ghavamian, R., and
Chua, S. J. (2003). Br J Cancer, 88(2), 263–269.
•
•
Wang, H., Yu, D., Agrawal, S., and Zhang, R. (2003). Prostate, 54(3), 194–205.
Zhang, Z., Li, M., Wang, H., Agrawal, S., and Zhang, R. (2003). Proc Natl Acad Sci,
100(20), 11636–11641.
Evidence from the literature
• NR3C1
– Wei et al., 2007 show that it is differentially expressed in androgen
independent prostate cancer.
• MAPK1
– “apoptosis induced by cannabinoid receptor CB1 and CB2 agonists leads
to activation of ERK1/2 leading to G1 cell cycle arrest in prostate cancer
cells” (Sarfarez et al., 2006)
– “lysophosphatidic acid (LPA), the receptor LPA(1), ERK2 and p38alpha
are important regulators for prostate cancer cell invasion and thus could
play a significant role in the development of metastasis.” (Hao et al,.
2007)
References:
•
•
•
Hao, F., Tan, M., Xu, X., Han, J., Miller, D., Tigyi, G., and Cui, M. (2007). Biochim Biophys Acta.,
1771(7), 883–892.
Sarfaraz, S., Afaq, F., Adhami, V., Malik, A., and Mukhtar, H. (2006). J Biol Chem.,281(51), 39480–
39491.
Wei, Q., Li, M., Fu, X., Tang, R., Na, Y., Jiang, M., and Li, Y. (2007). Prostate Cancer Prostatic Dis.,
10(2), 167–174.
Summary of the results for the
top 20 genes
# of seed genes
# of inferred genes
% of inferred genes
# of confirmed inferred genes
% of confirmed inferred genes
% of confirmed genes
Degree Eigenvector Betweenness Closeness
5
6
7
2
15
14
13
18
75
70
65
90
14
13
8
13
93.33
92.86
61.54
72.22
95
95
75
75
Baseline
3
17
85
9
52.94
60
• Fisher's Exact Test:
• No significant difference between performances of degree. eigenvector,
betweenness, and closeness
• eigenvector and degree significantly better than baseline (p < 0.02)
Robustness Analysis
• Edges removed randomly
• % of top 20 genes in Prostate Gene DataBase
% of edges removed
0
5
15
25
40
Robust against random errors.
Degree Eigenvector Betweenness Closeness
75
80
70
55
80
80
70
55
80
75
70
60
75
80
65
60
65
75
65
50
Metabolic pathways
• Metabolism – chemical reactions catalyzed by enzymes
and which are related to energy and the synthesis
(anabolism) or breakup (catabolism) of molecules.
Metabolic network showing the links
between enzymes and metabolites that
interact with the Arabidopsis TCA cycle
KEGG classification M00009.
Enzymes and metabolites are the nodes
(red), interactions are the lines. In total,
43 enzymes and 40 metabolites are shown.
Metabolic pathways
• Most reactions are not reversible so the
graph is directed.
• http://www.gwu.edu/~mpb/ - all major
metabolic pathways
• Example: tryptophan synthesis:
http://www.gwu.edu/~mpb/shikimate3.htm
The large-scale organization of
metabolic networks
Authors: H. Jeong, B. Tombor,
R. Albert, Z.N. Oltvai, A.-L. Barabasi
Nature, v407 651-654 (2000)
Food webs
Marine
food web
in Alaska
[http://www.absc.usgs.gov/research/seabird_foragefish/marinehabitat/images/Food_Web3.gif]
Little Rock Lake Wisconsin
[http://userwww.sfsu.edu/%7Ewebhead/lrl.html]
See also http://www.foodwebs.org
Connectance
• X=m/n(n-1) – number of actual predations
relative to the number of possible ones.
• Sample values: 0.31 (Skipwith Pound),
0.12 (St. Martin Island), 0.03 (Silwood
Park).
IP networks
• Routers connect different networks
(autonomous systems)
• Example (using traceroute):
% traceroute www.cs.columbia.edu
traceroute to www.cs.columbia.edu (128.59.18.180), 30 hops max, 40 byte packets
1 v-si-crew.d-ccb-1.umnet.umich.edu (141.211.184.2) 0.703 ms
0.686 ms
1.877 ms
2 d-ccb1-cool.r-cool.umnet.umich.edu (141.213.156.48) 0.462 ms
0.367 ms
0.347 ms
3 l3-arbl-cool.r-arbl.umnet.umich.edu (141.211.0.129) 0.561 ms
0.454 ms
0.708 ms
4 v-bin-arbl.r-bin-arb.umnet.umich.edu (192.122.183.93) 1.041 ms
0.669 ms
0.607 ms
5 l3-barb-bseb-2.r-bin-seb.umnet.umich.edu (192.12.80.11) 0.553 ms
0.521 ms
0.477 ms
6 v-bin-seb-i2-aa.merit-aa2.umnet.umich.edu (192.12.80.33) 6.564 ms
6.509 ms
6.534 ms
7 192.122.183.30 7.125 ms
7.142 ms
7.039 ms
8 buf-7600-internet2.nysernet.net (199.109.11.1) 20.687 ms
20.535 ms
20.666 ms
9 alb-7600-buf-7600.nysernet.net (199.109.7.10) 26.564 ms
26.624 ms
26.674 ms
10 nyc-gsr-alb-7600.nysernet.net (199.109.7.98) 29.423 ms
29.542 ms
29.517 ms
11 columbia.nyc-gsr.nysernet.net (199.109.4.14) 29.760 ms
29.832 ms
29.580 ms
12 cc-core-1-x-nyser32-gw-1.net.columbia.edu (128.59.255.5) 29.956 ms
30.076 ms
29.890 ms
13 mudd-edge-1-x-cc-core-1.net.columbia.edu (128.59.255.86) 29.957 ms
29.919 ms
30.016 ms
14 radiata.cs.columbia.edu (128.59.18.180) 29.985 ms
29.890 ms *
Properties
• Data from
http://moat.nlanr.net/
http://www.caida.org/
home/
http://archive.routevie
ws.org/
• Gamma = 2.2
• CC = 0.24-0.46
(depending on the
study).
Large-scale topological and dynamical properties of Internet
Alexei Vazquez, Romualdo Pastor-Satorras and Alessandro Vespignani
cond-mat/0112400 (December 2001)
Email networks
How to search a social network. Lada A Adamic, Eytan Adar
arXiv:cond-mat/0310120v2