Microarray Gene Expression Data Analysis
Functional Genomics
Chapter: 07
A.Venkatesh
CBBL
Contents
7.1 Introduction
7.2 Normalization of Microarray
Gene Expression Data.
7.3 Data Analysis.
7.4 Identification of Differentially
Expressed Genes.
7.5 Identification of Co-expressed
Genes.
7.6 Application for pathway
inference.
7.7 Summary
Introduction
Microarray technology provides a powerful tool that enables researchers to observe
simultaneously the mRNA expression levels of thousands of genes.
The expression data show directly which genes are differentially expressed
under a particular experimental condition,
thus indicating possible functional connections between the inputs and certain
components of the gene network.
The gene network may include protein-protein interactions, such as those detected using the two-hybrid system.
Normalization of Microarray
Gene Expression Data
One simple technique for detecting systematic errors is to visualize a two-dimensional (2D)
scatter plot,
in which signal intensities under two different conditions are plotted on a two-dimensional
plane, with the two intensity values giving the x-axis and y-axis coordinates.
If most genes are not differentially expressed, the points should scatter symmetrically about the diagonal y = x.
Deviation from this symmetry generally indicates systematic errors.
The same can be said about the overall intensity values of the two channels from the
same microarray slide.
To “correct” systematic errors, one generally needs to model the relationship between
the correct data and the erroneous data.
These models could be either linear or nonlinear models, depending on the complexity
of such a relationship.
Linear models have often been used to model multiplicative errors and additive errors.
The goal here is to compare a series of expression data for the same gene under different
conditions
or at different time points, and to decide how the gene’s expression levels change as a
function of time or biological conditions.
“Artificially” adjusting the expression levels clearly will not make this analysis any
easier.
One way to minimize the effects of error correction is to adjust the intensity values of
both the x and y data sets simultaneously. For example, if the y intensities are systematically
a times the x intensities, one can multiply all x intensities by √a and divide
all y values by √a.
Error correction is thereby achieved while the effects of incorrect adjustments are “minimized”.
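The symmetric adjustment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how the factor a is estimated, so the ratio of channel medians is used here as a stand-in, and the function name and intensity values are hypothetical.

```python
from math import sqrt
from statistics import median


def symmetric_rescale(x, y):
    """Split a multiplicative bias y ~ a*x evenly between both channels."""
    # Assumption (not from the slides): estimate a as the ratio of medians.
    a = median(y) / median(x)
    root = sqrt(a)
    x_adj = [v * root for v in x]   # multiply all x intensities by sqrt(a)
    y_adj = [v / root for v in y]   # divide all y intensities by sqrt(a)
    return a, x_adj, y_adj


# Hypothetical intensities in which channel y reads 1.5 times channel x.
x = [100.0, 200.0, 400.0, 800.0]
y = [150.0, 300.0, 600.0, 1200.0]
a, x_adj, y_adj = symmetric_rescale(x, y)
```

Because each channel is moved by only √a, neither data set absorbs the whole correction, which is the sense in which the effect of a wrong estimate of a is "minimized".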
Data Analysis
Data Transformation
There are many different approaches to data transformation, among which the most
commonly used in the microarray field is to take the logarithm of the expression data,
mainly for the following reasons.
First, the variation of log-transformed intensities and log-transformed ratios of intensities
is less dependent on absolute magnitude.
Log transformation can equalize variability in the wildly varying raw microarray data.
Second, log transformation can even out highly skewed distributions and thus bring the
data closer to a normal distribution.
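A small illustration of why log-transformed ratios are easier to work with, using made-up ratio values and base 2 (a common choice, assumed here rather than stated in the slides): raw ratios for up- and down-regulation are asymmetric around 1, while their logarithms are symmetric around 0.

```python
from math import log2
from statistics import mean

# Hypothetical expression ratios: 4-fold up, 2-fold up, unchanged,
# 2-fold down, 4-fold down.
ratios = [4.0, 2.0, 1.0, 0.5, 0.25]

# After a log2 transform, up- and down-regulation of equal magnitude
# get equal and opposite values: [2.0, 1.0, 0.0, -1.0, -2.0].
log_ratios = [log2(r) for r in ratios]

raw_mean = mean(ratios)       # skewed above 1 by the up-regulated values
log_mean = mean(log_ratios)   # 0.0, since the log ratios are symmetric
```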
Principal Component
Analysis
Principal component analysis (PCA) is a multivariate technique for examining
relationships among a group of data points in Euclidean space.
It has been widely used in the analysis of gene expression data for various purposes,
including the identification of outliers in a data set and the reduction of dimensionality.
The basic idea of principal component analysis with k components is to find an orthogonal transformation
of multidimensional data points into a k-dimensional space that maximizes the
scattering of the projections of the data points in the new space.
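As a rough sketch of the idea for k = 1, the direction of maximum scatter can be found as the leading eigenvector of the covariance matrix. The version below uses power iteration in plain Python; the function name, the iteration count and the sample points are assumptions made for illustration, and a real analysis would normally use a linear algebra library.

```python
from math import sqrt


def first_principal_component(points, iterations=200):
    """Return the direction of maximum variance and the projected scores."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - means[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration: repeatedly apply cov and renormalize.
    # Assumes the data are not all identical (otherwise norm would be 0).
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Project each centered point onto the component (1-D representation).
    scores = [sum(r[j] * v[j] for j in range(dim)) for r in centered]
    return v, scores


# Hypothetical points lying close to the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
direction, scores = first_principal_component(points)
```

For these points the returned direction is roughly proportional to (1, 2), and `scores` gives a one-dimensional representation of the two-dimensional data.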
Identification of Differentially
Expressed Genes
When analyzing microarray data, one needs to understand what contributes to the
observed data.
Schematic of expression profiles of three genes (represented by square, closed circle,
and diamond symbols) under six different conditions or over six time points.
• The basic idea of data clustering is to partition a data set into non-overlapping subsets (or
clusters) such that data points of the same cluster are “highly” correlated, whereas data
points from different clusters are not.
• Four classes of genes, each with a distinct gene expression profile.
Basics of Gene Expression
Data Clustering
Schematic of two types of data clustering problems. The x- and y-axes represent expression
levels at two different time points (or under two different conditions).
(a) Data set with two apparent clusters. (b) Three data clusters in a noisy background.
Clustering of Gene Expression
Data
There are a number of popular clustering techniques for gene expression data.
They include K-means clustering, hierarchical clustering and self-organizing maps.
The following is a list of a few available computer software programs for gene
expression and other data clustering, based on (or including) the K-means algorithm.
1. GeneSpring
2. Spotfire
3. Expression Profiler
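For orientation, a minimal sketch of the K-means procedure is given below. It is not taken from any of the listed programs: the helper names, the fixed iteration count, the choice of initial centers and the sample points are all assumptions made for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def kmeans(points, centers, iterations=20):
    """Alternate between assigning points and recomputing centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean_point(cl) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
              for p in points]
    return centers, labels


# Hypothetical expression levels of six genes under two conditions.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
# Assumption: start from the first and fourth points as initial centers.
centers, labels = kmeans(points, [points[0], points[3]])
```

With these inputs the first three genes end up in one cluster and the last three in the other. In practice the result depends on K and on the initial centers, so several restarts are commonly tried.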
The hierarchical data clustering.
(a) Set of data points in two-dimensional
space.
(b) Representation of a clustering tree of
the data set.
The objective of this clustering paradigm is to provide a hierarchical view of a clustering
problem at different levels of resolution.
At the highest resolution, every data point forms a cluster by itself.
At the lowest resolution, the whole data set forms one cluster. In between, each cluster at
a particular level is formed by merging the two closest clusters at a higher (resolution)
level.
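The bottom-up merging described above can be sketched as follows. The slides do not specify how the distance between two clusters is measured, so single linkage (the distance between their closest members) is assumed here; the function name and the points are hypothetical.

```python
from math import dist


def agglomerate(points):
    """Merge the two closest clusters repeatedly; return the merge history."""
    # Highest resolution: every data point is a cluster by itself.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (assumption): closest pair of members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    # Lowest resolution: the last merge joins the whole data set.
    return merges


# Hypothetical points in two-dimensional space.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = agglomerate(points)
```

Each entry of `merges` records the two clusters joined at one level and the distance at which they were joined, which is the information a clustering tree displays.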
SOM
Self-organizing Maps
Self-organizing maps represent a class of neural networks often
used for data clustering.
The SOM approach has a similar objective to that of the K-means approach.
It tries to identify a good representative for a group of nearby (similar) data points and to
group data points around these representatives.
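A toy one-dimensional SOM is sketched below to make the "representative" idea concrete. Everything in it is an assumption made for illustration: the number of units, the linearly decaying learning rate and neighbourhood width, the Gaussian neighbourhood function, the random seed and the sample points.

```python
import math
import random


def train_som(points, n_units=3, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a chain of units whose weight vectors act as representatives."""
    rng = random.Random(seed)
    # Initialize each unit from a randomly chosen data point.
    weights = [list(rng.choice(points)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays
        sigma = sigma0 * (1.0 - frac) + 0.01     # neighbourhood shrinks
        for p in rng.sample(points, len(points)):
            # Best-matching unit: the representative closest to p.
            bmu = min(range(n_units), key=lambda u: math.dist(p, weights[u]))
            for u in range(n_units):
                # Units near the BMU on the chain are pulled toward p too.
                h = math.exp(-((u - bmu) ** 2) / (2 * sigma ** 2))
                step = lr * h
                weights[u] = [w + step * (x - w)
                              for w, x in zip(weights[u], p)]
    return weights


def assign(points, weights):
    """Group each data point with its nearest representative."""
    return [min(range(len(weights)), key=lambda u: math.dist(p, weights[u]))
            for p in points]


# Hypothetical expression levels under two conditions.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
weights = train_som(points)
labels = assign(points, weights)
```

Unlike K-means, updating one representative also moves its neighbours on the map, so nearby units tend to represent similar data points.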
Summary
A great deal of information can be revealed about genes through the rational design of
microarray gene expression experiments and sensible interpretation of gene expression
patterns.
The role assignments of genes to a particular gene network are possible when additional
information from other sources, like genomic or proteomic data, is available.
A number of major efforts in building public databases for gene expression data are
underway, to facilitate the functional inference of genes and gene networks in a systematic
manner.
E.g., GEO (Gene Expression Omnibus, NCBI); Saccharomyces cerevisiae.