Microarray Data to Gene Networks

Download Report

Transcript Microarray Data to Gene Networks

Microarrays to Functional Genomics:
Generation of Transcriptional Networks
from Microarray experiments
Joshua Stender
December 3, 2002
Department of Biochemistry
What is a genetic network?
Gene networks are usually
represented as directed
graphs where the nodes are
defined as the genes and the
edges represent regulation.
Networks summarized a
limited relationship between
a subset of genes in both
positive and negative
feedback loops.
Jenssen et al. 2001
Why interested in Genetic
Networks?
• Drug therapies for complex diseases
• Gain insights for stimulus-response
interactions
• Identify novel pathways
• Understand cell physiology
• Understand multifactor gene-gene or geneprotein relationships in normal and disease
states
Modeling Network Framework
• Need to define a map from sequence space to
functional space
• Stage of Regulation (RNA, Protein)
• Temporal Regulation
• Spatial Regulation(Nucleus,Cytoplasm, etc)
Prazhnik et al. Gene networks:how to put the function in genomics. Trends in Biochem 20: 467-72.
Methods for Developing Gene
Networks
• Two types of experiments used for network
design: Time series and Steady-State gene knockout
• Co-expression clustering
• Cis acting elements in promoters(Amy
Creekmore)
• Reverse Engineering: use of algorithms to
generate new networks
Time-Series Approach
• Expression level of a certain gene at a time point
can be modeled as some function of previous time
points.
• Problem exists with dimensionality where more
genes then time points. Better results require
more time points
• Solution in the literature: Basic Linear Model,
Singular Value Decomposition, and Bayesian
Networks
Steady-State Approach
• Takes advantage of gene deletions or over
expression
• If gene A goes up after gene B deleted,
perhaps gene B is negative modulator of A
and so on
• Microarrays offer opportunities to identify
gene deletion consequences on entire
genomes
Genetic Network Generation Schematic
Jong Modeling and simulation of genetic regulatory systems: a literature review.
J. Comput Biol 2002;9(1):67-103
Algorithmic Approach to
Network Design
• Boolean Binary State along with co
expression clustering
• Continuous Steady-State(NonLinear):Assumes genes can have
intermediate states
• Singular Value Decomposition
Methods for Generating Gene
Networks
• D’Haeseleer et al. Genetic network inference:
from co-expression clustering to reverse
engineering. Bioinformatics 16(8): 707-26.
• Fuente et al. Linking the genes: inferring
quantitative gene networks from microarray data.
Trends in Genetics 18(8): 395-98.
• Toh et al. Inference of a genetic network by a
combined approach of cluster analysis and
graphical Gaussian modeling. Bioinformatics
18(2): 287-297.
Types of Clustering
• Non-hierarchical- clusters N objects into K
Groups until a preset threshold is
established. Examples include: K-means,
SOM, and Expectation-maximization
• Hierarchical- returns a hierarchy of nested
clusters (agglomerative vs. divisive)
Why use clustering?
• Wealth of data from microarray is overwhelming
• Cluster to limit gene list to one that has genes that
change significantly
• Inference of functional annotation
• Extraction of regulatory motifs
• Molecular signature for distinguishing cell or
tissue types
• Use of learning machines to characterize unknown
genes
Determining Distances Between
Genes
•
Majority of clustering algorithms use matrix of
pair wise distances between genes
• Distances can be calculated based on:
1. Similarity according to positive correlations
2. Similarity based on positive and negative
correlations
3. Similarity based on mutual information
Guilt-by Association(GBA)
• Gene selected at random and determine its
nearest neighbor
• Genes are clustered based on arbitrary cutoff distances in expression space
• Assumes that genes regulated in the same
pattern participate in similar processes
K-means Clustering
• Partitions N genes into K groups
• Centroids are weighted center of a cluster
• Each gene is assigned to a cluster and the
centroid is calculated
• Centroid continuously recalculated and
genes reassigned
Self-Organizing Maps (SOM)
• Very similar to K-means, however cluster
centers are placed on a grid
• At each iteration, gene pattern chosen at
random and nearest cluster neighbor and
cluster center updated
• Requires user to define number cluster and
grid size
Expectation-Maximization
• Clustering similar to K-means, however
genes assigned to multiple categories
• Membership to a cluster is based on
Gaussian distribution of probabilities
• Continuously update membership and the 3
following parameters are assigned for each
cluster: centroid, covariance, and mixture
weight
Determining which clustering
analysis to use
• Each combination of distance measure and
clustering algorithm will emphasize different types
of regularities associated with data
• Best to complement data with more than one
clustering analysis due to variety of algorithms
and the multiple functions of each gene
Construction of a Simple Network
Clustering
Brazhnik et al.
Boolean Networks
• Simplification: each gene represented in the
binary ON/Off state
• Each gene is regulated by other genes using
Boolean functions
• Most genes are in an intermediate state and
therefore are continuous
Example of a Boolean Network
Jong Modeling and simulation of genetic regulatory systems: a literature review.
J. Comput Biol 2002;9(1):67-103
Limitations of Boolean Networks
• Fail to reveal causality
• Non-Quantitative
• Does not take into account multiple gene
states
• In the future Protein-Protein interaction
maps need to be included
Graphical Gaussian Model
• Toh et al. Inference of a Genetic Network
by a Combined Approach of Cluster
Analysis and Graphical Gaussian Modeling.
Bioinformatics 18(2): 287-297.
• Goal: To establish a method to combine
Clustering and GGM for genetic network
predictions.
Graphical Gaussian Modeling
• GGM is a multivariate analysis to infer or
test a statistical model for the relationship
among a plural of variables where a partial
correlation is used
• Data: 2467 Saccharomyces cerevisiae genes
under 79 different conditions
Graphical Gaussian Method
• Genes were clustered into 34 distinct
clusters
• To reduce dimensionality, each cluster was
averaged for each condition
Stepwise iterative algorithm developed by Wermuth and Scheidht(1977)
Step 0: Complete Graph generated with M
nodes and every node connected to each other.
Step 1: Calculate partial correlation Matrix P(t)
from correlated Coefficient Matrix C(t) where
t indicates iteration.
Step 2: Find element with smallest absolute
value in P(t) and replace it with 0.
Step 3: Reconstruct C(t+1) from P(t)
Step 4: Termination is dependent on deviance
Dev1= Nlog ( | C(t+1) |/|C(0)|)
Dev2= Nlog ( | C(t+1) |/|C(t)|)
Calculate dev1 and dev2. If either dev <.05
iteration stopped. Else go to step1
Graphical Gaussian Method
Sub graph of the independence graph corresponding to partial correlation coefficient matrix
Graphical Gaussian Method
Results and conclusions
• Algorithm stopped after 189 iterations
• SUC2(sucrose hydrolyzing enzyme) was used as
model to evaluate accuracy of method: Among 40
known correlations for other genes, method
identified 3 to be of same cluster,8 to have
correlation of 0 and 29 to interact.
• Conclude that about 75% accurate.
• Could be a highly effective method for gene
network generation if combined with previous
knowledge
Linear Additive Method
• Fuente et al. “Linking the genes: inferring
quantitative gene networks from microarray data.”
Trends in Genetics 18(8): 395-8.
• Goal: To establish a method for inferring gene
networks and the corresponding gene interaction
strengths
• Represent gene networks that consider expression
levels as continuous variables
Linear Additive Method
Co-control coefficient
FR=Fluorescence Intensities
Linear Additive Method
Conclusions
• In Silico approach is useful in testing
inferred networks
• Can be used with experiments with one
gene disruption at a time
• Generated method for developing gene
networks that include quantitative
interaction strengths
New and Improved Network
Designs
• Continuous-value network inference: uses
differential equations and allows genes to be
continuous variables
• Gene Duplication: Network nodes are randomly
duplicated to help network connections evolve
• Many computer simulations are being developed
to help mimic real data to aid in the design of new
algorithms
Conclusion and Outlook
• Integration of large amount of biological data and
computational power increasing our knowledge of
complex systems
• Increasing need to standardize microarray
experiments and create databases
• Gradual improvement of cluster and gene
inference algorithms
• Addition of differential proteomics and also
incorporation of multiple regulation steps