CSCE590/822 Data Mining Principles and Applications

Transcript CSCE590/822 Data Mining Principles and Applications

CSCE555 Bioinformatics

Lecture 18 Network Biology
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008
www.cse.sc.edu.
Outline
Biological Networks & Databases
 Background of graphs and networks
 Three types of bio-network analysis

◦ Network statistics
◦ Network based functional annotation
◦ Bio-network reconstruction/inference

Summary
7/21/2015
2
Why network analysis: Building
models from parts lists
Systems Biology view
BIOLOGICAL NETWORKS
Networks are found in biological systems of
varying scales:
1. Evolutionary tree of life
2. Ecological networks
3. Expression networks
4. Regulatory networks
- genetic control networks of organisms
5. The protein interaction network in cells
6. The metabolic network in cells
… more biological networks
Examples of Biological Networks
Metabolic Networks
 Signaling Networks
 Transcription Regulatory Networks
 Protein-Protein Interaction Networks

5
Signaling & Metabolic Pathway
Network

A Pathway can be defined as a modular unit of
interacting molecules to fulfill a cellular function.

Signaling Pathway Networks
◦ In biology a signal or biopotential is an electric quantity
(voltage or current or field strength), caused by chemical
reactions of charged ions.
◦ refer to any process by which a cell converts one kind of signal
or stimulus into another.
◦ Another use of the term lies in describing the transfer of
information between and within cells, as in signal transduction.

Metabolic Pathway Networks
◦ a series of chemical reactions occurring within a cell, catalyzed
by enzymes, resulting in either the formation of a metabolic
product to be used or stored by the cell, or the initiation of
another metabolic pathway
A Signaling Pathway Example
A Metabolic Pathway Example
Regulatory Network
Expression Network


A network representation of genomic data.
Inferred from genomic data, i.e. microarray.
Gene co-expression network. Each node is a gene. Edge: coexpression relationship
Example of a PPI Network



Yeast PPI network
Nodes – proteins
Edges – interactions
The color of a node
indicates the phenotypic
effect of removing the
corresponding protein
(red = lethal, green =
non-lethal, orange = slow
growth, yellow =
unknown).
11
How do we know that proteins
interact? (PPI Identification methods)


Data
◦
◦
◦
◦
Yeast 2 hybrid assay
Mass spectrometry
Correlated m-RNA expression
Genetic interactions
Analysis
◦
◦
◦
◦
Phylogenetic analysis
Gene neighbors
Co-evolution
Gene clusters
Also see: Comparative assessment of large-scale data sets of protein-protein
interactions – von Mering
12
Protein Interaction Databases
 Species-specific
◦
◦
◦
◦
◦
FlyNets - Gene networks in the fruit fly
MIPS - Yeast Genome Database
RegulonDB - A DataBase On Transcriptional Regulation in E. Coli
SoyBase
PIMdb - Drosophila Protein Interaction Map database
 Function-specific
◦ Biocatalysis/Biodegradation Database
◦ BRITE - Biomolecular Relations in Information Transmission and
Expression
◦ COPE - Cytokines Online Pathfinder Encyclopaedia
◦ Dynamic Signaling Maps
◦ EMP - The Enzymology Database
◦ FIMM - A Database of Functional Molecular Immunology
◦ CSNDB - Cell Signaling Networks Database
Protein Interaction Databases
 Interaction type-specific
◦ DIP - Database of Interacting Proteins
◦ DPInteract - DNA-protein interactions
◦ Inter-Chain Beta-Sheets (ICBS) - A database of protein-protein
interactions mediated by interchain beta-sheet formation
◦ Interact - A Protein-Protein Interaction database
◦ GeneNet (Gene networks)
 General
◦ BIND - Biomolecular Interaction Network Database
◦ BindingDB - The Binding Database
◦ MINT - a database of Molecular INTeractions
◦ PATIKA - Pathway Analysis Tool for Integration and Knowledge
Acquisition
◦ PFBP - Protein Function and Biochemical Pathways Project
◦ PIM (Protein Interaction Map)
Pathway Databases
 KEGG (Kyoto Encyclopedia of Genes and Genomes)
 http://www.genome.ad.jp/kegg/
 Institute for Chemical Research, Kyoto University
 PathDB
 http://www.ncgr.org/pathdb/index.html
 National Center for Genomic Resources
 SPAD: Signaling PAthway Database
 Graduate School of Genetic Resources Technology. Kyushu University.
 Cytokine Signaling Pathway DB.
 Dept. of Biochemistry. Kumamoto Univ.
 EcoCyc and MetaCyc
 Stanford Research Institute
 BIND (Biomolecular Interaction Network Database)
 UBC, Univ. of Toronto
KEGG

Pathway Database: Computerize current knowledge of
molecular and cellular biology in terms of the pathway of
interacting molecules or genes.

Genes Database: Maintain gene catalogs of all sequenced
organisms and link each gene product to a pathway component

Ligand Database: Organize a database of all chemical compounds
in living cells and link each compound to a pathway component

Pathway Tools: Develop new bioinformatics technologies for
functional genomics, such as pathway comparison, pathway
reconstruction, and pathway design
Network Properties
Properties of networks









Small world effect
Transitivity/ Clustering
Scale Free Effect
Maximum degree
Network Resilience and robustness
Mixing patterns and assortativity
Community structure
Evolutionary origin
Betweenness centrality of vertices
18
Biological Networks Properties




Power law degree distribution: Rich get richer
Small World: A small average path length
◦ Mean shortest node-to-node path
Robustness: Resilient and have strong resistance to
failure on random attacks and vulnerable to
targeted attacks
Hierarchical Modularity: A large clustering
coefficient
◦ How many of a node’s neighbors are connected
to each other
Graph Terminology
Node
Edge
Directed/Undirected
Degree
Shortest Path/Geodesic distance
Neighborhood
Subgraph
Complete Graph
Clique
Degree Distribution
Hubs
20
Graphs

Graph G=(V,E) is a set of vertices V and edges E

A subgraph G’ of G is induced by some V’  V and E’  E

Graph properties:
◦ Connectivity (node degree, paths)
◦ Cyclic vs. acyclic
◦ Directed vs. undirected
Network Measures

Degree ki

Degree distribution P(k)

Mean path length

Network Diameter

Clustering Coefficient
Network Analysis
Paths:
metabolic,
signaling pathways
Cliques:
protein complexes
Hubs:
regulatory modules
Subgraphs:
maximally weighted
Sparse vs Dense Graphs

G(V, E) where |V|=n, |E|=m the number of
vertices and edges

Graph is sparse if m~n

Graph is dense if m~n2

Complete graph when m=n2
Connected Components
G(V,E)
 |V| = 69
 |E| = 71

Connected Components
G(V,E)
 |V| = 69
 |E| = 71
 6 connected
components

Paths
A path is a sequence {x1, x2,…, xn} such that (x1,x2), (x2,x3), …, (xn-1,xn)
are edges of the graph.
A closed path xn=x1 on a graph is called a graph cycle or circuit.
Shortest-Path between nodes
Shortest-Path between nodes
Longest Shortest-Path
Network Measures: Degree
Degree Distribution
P(k) is probability of each
degree k, i.e fraction of
nodes having that degree.
For random networks, P(k)
is normally distributed.
For real networks the
distribution is often a powerlaw:
P(k) ~ k-g
Such networks are said to
be scale-free
Clustering Coefficient
The density of the network
surrounding node I, characterized as
the number of triangles through I.
Related to network modularity
nI
2n I
CI 

 k  k  k - 1
 
 2
k: neighbors of I
The center node has 8 (grey) neighbors
There are 4 edges between the neighbors
C = 2*4 /(8*(8-1)) = 8/56 = 1/7
nI: edges between
node I’s neighbors
Interesting Properties of Network
Types
Small-world Network

Every node can be reached from every other by
a small number of hops or steps

High clustering coefficient and low meanshortest path length
◦ Random graphs don’t necessarily have high clustering coefficients

Social networks, the Internet, and biological
networks all exhibit small-world network
characteristics
Small world effect
 most pairs of vertices in the network seem to be connected by
a short path
l is mean geodesic distance
dij is the geodesic distance between vertex i and vertex j
l ~ log(N)
36
Scale-Free Networks are Robust

Complex systems (cell, internet, social
networks), are resilient to component failure

Network topology plays an important role in
this robustness
◦ Even if ~80% of nodes fail, the remaining ~20% still maintain
network connectivity

Attack vulnerability if hubs are selectively
targeted

In yeast, only ~20% of proteins are lethal when
deleted, and are 5 times more likely to have
degree k>15 than k<5.
Hierarchical Networks
Detecting Hierarchical Organization
Other Interesting Features

Cellular networks are assortative, hubs tend not to
interact directly with other hubs.

Hubs tend to be “older” proteins (so far claimed for
protein-protein interaction networks only)

Hubs also seem to have more evolutionary pressure—
their protein sequences are more conserved than
average between species (shown in yeast vs. worm)

Experimentally determined protein complexes tend to
contain solely essential or non-essential proteins—
further evidence for modularity.
Network Models
 Random
Network
 Scale free Network
 Hierarchical Network
41
Random Network I
 The Erdös–Rényi (ER) model of a random network starts
with N nodes and connects each pair of nodes with
probability p, which creates a graph with approximately pN(N–
1)/2 randomly placed links
 The node degrees follow a Poisson distribution
42
Random Network II
 Mean shortest path l ~ log N, which indicates that it is
characterized by the small-world property.
 Random graphs have served as idealized models of
certain gene networks, ecosystems and the spread of
infectious diseases and computer viruses.
43
44
Scale Free Networks I
Power-law degree distribution: P(k) ~ k –γ, where γ is
the degree exponent. Usually 2-3
The network’s properties are determined by hubs
The network is often generated by a growth process
called Barabási–Albert model
45
Scale Free Networks II
 Scale-free networks with degree exponents 2<γ<3, a
range that is observed in most biological and non-biological
networks like the Internet backbone, the World Wide Web,
metabolic reaction network and telephone call graphs.
 The mean shortest path length is proportional to
log(n)/log(log(n))
46
How Scale-free networks are formed?
 PREFERENTIAL ATTACHMENT on Growth: the probability
that a new vertex will be connected to vertex i depends on the
connectivity of that vertex:
ki
(ki ) 
kj
j
In biological network, many
such networks are due to
gene duplication!
Hierarchical Networks I
 To account for the coexistence of modularity, local
clustering and scale-free topology in many real systems it has to
be assumed that clusters combine in an iterative manner,
generating a hierarchical network
The hierarchical network model seamlessly integrates a scale-free
topology with an inherent modular structure by generating a
network that has a power-law degree distribution with degree
exponent γ = 1 + ln4/ln3 = 2.26
48
Hierarchical Networks II
 It has a large system-size independent average clustering
coefficient <C> ~ 0.6. The most important signature of
hierarchical modularity is the scaling of the clustering
coefficient, which follows C(k) ~ k –1 a straight line of slope –1
on a log–log plot
 A hierarchical architecture implies that sparsely connected
nodes are part of highly clustered areas, with communication
between the different highly clustered neighborhoods being
maintained by a few hubs
Some examples of hierarchical scale free networks.
49
Problems of Network Biology
 Network Inference
 Micro Array, Protein Chips, other high throughput assay methods
 Function prediction
 The function of 40-50% of the new proteins is unknown
 Understanding biological function is important for:
 Study of fundamental biological processes
 Drug design
 Genetic engineering
 Functional module detection
 Cluster analysis
 Topological Analysis
 Descriptive and Structural
 Locality Analysis
 Essential Component Analysis
 Dynamics Analysis
 Signal Flow Analysis
 Metabolic Flux Analysis
 Steady State, Response, Fluctuation Analysis
 Evolution Analysis
 Biological Networks are very rich networks with very limited, noisy, and
incomplete information.
 Discovering underlying principles is very challenging.
50
Summary
The problem: Identify Differentially
expressed genes from Microarray data
 How to identify: t-test and Rank product
 How to evaluate significance of identified
genes

Reference & Acknowledgements

Albert Barabasi et al
◦ Network Biology: understanding the cell’s functional
organization

Jing-Dong et al
◦ Evidence for dynamically organized modularity in the
yeast protein–protein interaction network

Woochang Hwang

CSCE590/822 Data Mining Principles and Applications

Transcript CSCE590/822 Data Mining Principles and Applications

Directory