Microarray Analysis -- Image Processing and Filter Design

Microarray Analysis: Image Processing and Filter Design
Instructors: Dr. Ravi Sankar, Dr. Wei Qian
Student: Kun Li
Nov 2006
Introduction of Microarray Analysis

Microarray analysis is a relatively new technology in molecular
biology research. It is an excellent tool for monitoring the
transcription of thousands of genes at a time. The first step of the
technique is to spot known sequences onto a substrate, in most cases
a glass slide or a nylon membrane. Next, mRNA isolated from the
biological subjects under study is reverse-transcribed into cDNA.
During reverse transcription, the control and experimental materials
are differentially labeled; they are then pooled and hybridized to
the array. cDNA strands in the pool compete to hybridize to
complementary sequences on the array. The measured signal is then
used to assess the relative abundance of the corresponding mRNA from
the two sources.

The objectives of microarray experiments are to reveal unknown genes
and new gene functions that emerge as a result of experimental
treatments, and to find new gene expression patterns that can serve
as a basis for classifying physiological or pathological processes.
Microarray Image Processing

Microarray images from cancer patients differ in many ways from
those of healthy individuals. Analyzing microarray images can
therefore help in cancer detection and diagnosis and, more
importantly, in identifying cancer-related genes. Much research has
already been done on recognizing and comparing gene expression
patterns.
Literature Review

Microarray analysis attracts a great deal of interest from researchers,
and the literature is extensive. I found 137 papers published in 2006
alone. Twenty of them are listed below:

SE Ahnert, K Willbrand, FCS Brown, TMA Fink (2006), "Unbiased pattern detection in
microarray data series", Bioinformatics, 22(12):1471-1476.

David B Allison, Xiangqin Cui, Grier P Page, Mahyar Sabripour (2006), "Microarray data
analysis: from disarray to consolidation and consensus", Nature Reviews Genetics, 7:55-65.

Claes R Andersson, Anders Isaksson, Mats G Gustafsson (2006), "Bayesian detection of
periodic mRNA time profiles without use of training examples", BMC Bioinformatics, 7:63.

Richard P Auburn, Roslin R Russell, Bettina Fischer, Lisa A Meadows, Santiago Sevillano
Matilla, Steven Russell (2006), "SimArray: a user-friendly and user-configurable microarray
design tool", BMC Bioinformatics, 7:102.

Simon Barkow, Stefan Bleuler, Amela Prelic, Philip Zimmermann, and Eckart Zitzler (2006),
"BicAT: a biclustering analysis toolbox", Bioinformatics, 22(10):1282-1283.

Anders Bengtsson, Henrik Bengtsson (2006), "Microarray image analysis: background
estimation using quantile and morphological filters", BMC Bioinformatics, 7:96.

Henrik Bengtsson, Ola Hossjer (2006), "Methodological study of affine transformations of
gene expression data with proposed robust non-parametric multi-dimensional normalization
method", BMC Bioinformatics, 7:100.

Daniel Berrar, Ian Bradbury, Werner Dubitzky (2006), "Avoiding model selection bias in
small-sample genomic data sets", Bioinformatics, 22(10):1245-1250.

Daniel Berrar, Ian Bradbury, Werner Dubitzky (2006), "Instance-based concept learning from
multiclass DNA microarray data", BMC Bioinformatics, 7:73.

Ghislain Bidaut, Karsten Suhre, Jean-Michel Claverie, Michael F Ochs (2006),
"Determination of strongly overlapping signaling activity from microarray data", BMC
Bioinformatics, 7:99.

Jonathon Blake, Christian Schwager, Misha Kapushesky, and Alvis Brazma (2006),
"ChroCoLoc: an application for calculating the probability of co-localization of microarray
gene expression", Bioinformatics, 22:765-767.

Marta Blangiardo, Simona Toti, Betti Giusti, Rosanna Abbate, Alberto Magi, Filippo Poggi,
Luciana Rossi, Francesca Torricelli, and Annibale Biggeri (2006) "Using a calibration
experiment to assess gene-specific information: full Bayesian and empirical Bayesian models
for two-channel microarray data", Bioinformatics, 22:50-57.

Philippe Broët, Vladimir A. Kuznetsov, Jonas Bergh, Edison T. Liu, Lance D. Miller (2006),
"Identifying gene expression changes in breast cancer that distinguish early and late relapse
among uncured patients", Bioinformatics, 22(12):1477-1485.

Ljubomir J. Buturovic (2006), "PCP: a program for supervised classification of gene
expression profiles", Bioinformatics, 22:245-247.

Roger D Canales, Yuling Luo, James C Willey, Bradley Austermiller, Catalin C Barbacioru,
Cecilie Boysen, Kathryn Hunkapiller, Roderick V Jensen, Charles R Knight, Kathleen Y Lee,
Yunqing Ma, Botoul Maqsodi, Adam Papallo, Elizabeth Herness Peters, Karen Poulter,
Patricia L Ruppel, Raymond R Samaha, Leming Shi, Wen Yang, Lu Zhang, Federico M
Goodsaid (2006), "Evaluation of DNA microarray results with quantitative gene expression
platforms", Nature Biotechnology, 24:1115-1122.

Pedro Carmona-Saez, Monica Chagoyen, Andres Rodriguez, Oswaldo Trelles, Jose
M Carazo, Alberto Pascual-Montano (2006), "Integrated analysis of gene
expression by association rules discovery", BMC Bioinformatics, 7:54.

Pedro Carmona-Saez, Roberto D Pascual-Marqui, Francisco Tirado, Jose M
Carazo, Alberto Pascual-Montano (2006), "Biclustering of gene expression data by
non-smooth non-negative matrix factorization", BMC Bioinformatics, 7:78.

Yian A Chen, Cheng-Chung Chou, Xinghua Lu, Elizabeth H Slate, Konan Peck,
Wenying Xu, Eberhard O Voit, Jonas S Almeida (2006), "A multivariate prediction
model for microarray cross-hybridization", BMC Bioinformatics, 7:101.

H Chipman, R Tibshirani (2006), "Hybrid hierarchical clustering with applications
to microarray data", Biostatistics, 7(2):286-301.

A Choudhary, M Brun, J Hua, J Lowey, E Suh, ER Dougherty (2006), "Genetic test
bed for feature selection", Bioinformatics, 22(7):837-842.


Most of this research focuses on pattern recognition using
neural networks and support vector machines, on gene
identification, and on improving image processing
methods, for example by optimizing normalization and
noise reduction.
Our research differs in that it combines image
processing with signal processing and focuses on
mapping the relations between genes associated with
breast cancer.
Thinking About Our New Method


The most important aspect of cancer detection,
diagnosis and treatment is to detect the cancer and
identify its type at an early stage, before any
obvious symptoms that traditional methods can
detect have developed.
Given a new microarray image, how can we
detect its cancer development “potential”?



We believe the image pattern will give us some “hints”
for cancer detection.
Cancer development involves many genes, which means
that before a cancer gene is expressed, the expression
levels of many other genes have already changed.
So, if we can find the “implicit” relations between
cancer-related genes, the problem is solved.
We plan to design filters that can be applied to
microarray images to generate specific “signatures”
for cancer and normal samples.


It’s important to emphasize the early stage here.
Why? Because detecting cancer after a person already has it is not
as meaningful as predicting the probability of developing cancer
“before” the person gets it. The figures below show small parts of
normal and cancer microarray images.
[Figure: microarray image patches labeled Normal and Cancer]
[Figure: stage sequence Normal, In developing process (?), Cancer]
This mid-stage (developing/early) is critical: if we knew the gene
expression patterns of the mid-stage, we could accurately predict cancer
development. However, we cannot obtain these patterns, because we have
to rely on other methods to detect cancer before we can label a pattern
as cancer or normal. Once a cancer has been detected by traditional
methods, it is no longer in the mid-stage we want.
How to resolve mid-stage problem

We can assume there is a cycle as below:
[Figure: cycle Normal → In developing process (?) → Cancer → After treatment (?) → Normal]

In the cycle shown above, we can assume that the
two stages marked with question marks share some
similarities. Therefore, we can use the gene
expression patterns of “after treatment” as a
kind of control for the gene expression patterns
of “developing”. (Although they are similar, they
will not be exactly the same, so the “after
treatment” pattern can serve only as a reference.)
Assumption and Hypothesis


Assumption: the gene expression patterns of “after
treatment” and “developing” share some similarities,
and the “after treatment” pattern can be used as a
reference.
Hypothesis I: The gene expression patterns of the
“normal”, “developing”, “cancer” and “after treatment”
stages differ from one another. These differences can
be distinguished using computational methods. The
gene expression pattern of the “developing” stage can
be derived from the other three stages with relatively
high reliability.
Hypothesis II: We can design filters and apply
them to microarray images of the four stages
to generate “signatures” for them.
Hypothesis III: The “signatures” from different
stages can be used to predict the probability of
developing cancer.
Material

All images processed in this project come from
aCGH tumor data provided by Jonathan Pollack at
Stanford University. Thanks to him!
Method (we use an example to explain our method)
This is a small portion of a microarray image
containing 4800 spots.
Red = Cancer
Green = Control
Yellow = Mixed
The first step is to separate the red and
green layers.
Sample and Control Layer
Sample Layer
Control Layer
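As a rough sketch of the layer separation (not the code used for these slides), assume the image is a nested list of (R, G, B) tuples. The function name `split_layers` and the tiny test image are made up for illustration.

```python
def split_layers(rgb_image):
    """Return (sample, control) layers from an RGB image.

    Assumes the red channel holds the cancer sample and the green
    channel holds the control, as in the colour key above.
    """
    sample = [[pixel[0] for pixel in row] for row in rgb_image]
    control = [[pixel[1] for pixel in row] for row in rgb_image]
    return sample, control


# Hypothetical 2x2 image: red spot, green spot, yellow (mixed) spot, background.
image = [
    [(200, 10, 0), (15, 180, 0)],
    [(190, 185, 0), (5, 5, 0)],
]
sample_layer, control_layer = split_layers(image)
```

A yellow (mixed) spot shows up with a high value in both layers.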
Convert RGB Image To Grayscale Image
For spot finding, we
need to convert the
RGB image to a
grayscale image.
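A minimal sketch of the grayscale conversion follows. The slides do not say which weights were used; the BT.601 luma weights below (0.299, 0.587, 0.114) are an assumption, and `rgb_to_gray` is a hypothetical name.

```python
def rgb_to_gray(rgb_image):
    """Convert an RGB image to grayscale using BT.601 luma weights (assumed)."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_image
    ]


# Hypothetical 2x2 image: white, black, pure red, pure green.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 255, 0)],
]
gray = rgb_to_gray(image)
```

With these weights a pure green pixel comes out brighter than a pure red one, so a different weighting may be preferable if both channels should count equally for spot finding.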
Compute The Mean Intensity of The Image

To set up a regular grid, we compute the mean
intensity of each column of the image. This
helps us identify the centers of the spots and
the gaps between them.
Mean Intensity Profile
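The column mean profile can be sketched as below, assuming the grayscale image is a nested list of rows. The data and the name `column_mean_profile` are illustrative only.

```python
def column_mean_profile(gray):
    """Mean intensity of every column of a grayscale image."""
    n_rows = len(gray)
    # zip(*gray) iterates over columns instead of rows.
    return [sum(column) / n_rows for column in zip(*gray)]


# Hypothetical image with two bright spot columns (index 1 and index 4).
gray = [
    [0, 9, 0, 0, 8, 0],
    [0, 7, 0, 0, 6, 0],
    [0, 8, 0, 0, 7, 0],
]
profile = column_mean_profile(gray)
```

The peaks of `profile` mark spot centers and the zeros mark the gaps.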
Use Autocorrelation to Enhance the Result

Ideally the spots would be periodically spaced, but in
practice they differ in shape, size and intensity, so the
mean profile looks irregular. We can use autocorrelation to
enhance the result.
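To illustrate why autocorrelation helps, here is a small sketch that autocorrelates a mean-removed profile and reads the spot spacing from the first peak after lag 0. The profile values and both function names are hypothetical, and this direct O(n²) form is only meant for short profiles.

```python
def autocorrelation(profile):
    """Unnormalized autocorrelation of the mean-removed profile."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag))
        for lag in range(n)
    ]


def estimate_period(acf):
    """Return the lag of the first local maximum after lag 0, or None."""
    for lag in range(1, len(acf) - 1):
        if acf[lag - 1] < acf[lag] >= acf[lag + 1]:
            return lag
    return None


# Three irregular spots, nominally 5 pixels apart.
profile = [0, 5, 9, 5, 0, 0, 4, 7, 3, 0, 0, 6, 8, 4, 0]
acf = autocorrelation(profile)
period = estimate_period(acf)
```

Even though the three spots differ in height and width, the autocorrelation peaks cleanly at the common spacing.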
Peaks Segmentation

Remove background noise, then set a threshold to
segment the peaks.
Enhanced Mean Intensity Profile
Peak Segmentation
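A possible sketch of the peak segmentation step is shown below. The slides do not specify how the background is estimated or how the threshold is chosen; subtracting the minimum and thresholding at half the maximum are both assumptions, as is the name `segment_peaks`.

```python
def segment_peaks(profile, threshold):
    """Return (start, end) index pairs of runs where profile > threshold."""
    peaks, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a new peak begins
        elif value <= threshold and start is not None:
            peaks.append((start, i - 1))   # the current peak ends
            start = None
    if start is not None:                  # peak running to the last sample
        peaks.append((start, len(profile) - 1))
    return peaks


profile = [2, 7, 11, 7, 2, 2, 6, 9, 5, 2, 2, 8, 10, 6, 2]
background = min(profile)                  # crude background estimate (assumption)
cleaned = [v - background for v in profile]
threshold = 0.5 * max(cleaned)             # hypothetical threshold choice
peaks = segment_peaks(cleaned, threshold)
```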
Grid Point Locating

Each grid point should be located at the midpoint
between two adjacent peaks.
Red lines show the grid locations
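The midpoint rule can be written in a few lines. The peak list is made up, and `grid_points` is a hypothetical helper that takes each peak as a (start, end) index pair.

```python
def grid_points(peaks):
    """Grid lines at the midpoints between the centers of adjacent peaks."""
    centers = [(start + end) / 2 for start, end in peaks]
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


peaks = [(1, 3), (7, 7), (11, 12)]   # (start, end) of each segmented peak
grid = grid_points(peaks)
```

This yields only the interior grid lines; the two outer lines can be taken from the image border or extrapolated from the spacing.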
Transpose and Repeat

We have now set up the vertical grid. To set up the
horizontal grid, simply transpose the image and
repeat the process described above.
Set Up Bounding Boxes

Now we can form bounding
box regions to address each
spot individually by using
pairs of neighboring grid
points.
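Forming the boxes from neighboring grid lines might look like the sketch below. The (x0, x1, y0, y1) tuple layout, the grid values and the name `bounding_boxes` are assumptions for illustration.

```python
def bounding_boxes(x_grid, y_grid):
    """Return (x0, x1, y0, y1) boxes from sorted vertical/horizontal grid lines.

    Boxes are listed row by row, left to right.
    """
    return [
        (x0, x1, y0, y1)
        for y0, y1 in zip(y_grid, y_grid[1:])
        for x0, x1 in zip(x_grid, x_grid[1:])
    ]


x_grid = [0, 5, 10]   # vertical grid lines (column positions)
y_grid = [0, 4, 8]    # horizontal grid lines (row positions)
boxes = bounding_boxes(x_grid, y_grid)
```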
Segment Spots From Background
Apply a logarithmic
transformation and then a
global threshold.
Global Threshold
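A sketch of the log transform followed by a global threshold is given below. The cutoff rule (a fixed fraction of the log-intensity range) is an assumption, since the slides do not state how the threshold was picked; `log_global_threshold` and `fraction` are hypothetical names.

```python
import math


def log_global_threshold(gray, fraction=0.5):
    """Log-transform the image, then apply one cutoff to every pixel.

    The cutoff sits at `fraction` of the way between the smallest and
    largest log intensity (a hypothetical tuning parameter).
    """
    logged = [[math.log1p(v) for v in row] for row in gray]
    flat = [v for row in logged for v in row]
    lo, hi = min(flat), max(flat)
    cutoff = lo + fraction * (hi - lo)
    return [[v > cutoff for v in row] for row in logged]


# Strong spot in the top-right corner, weak spot (10, 12) in the lower rows.
gray = [
    [3, 4, 250, 240],
    [2, 5, 245, 255],
    [3, 10, 4, 2],
    [4, 12, 3, 5],
]
mask = log_global_threshold(gray)
```

In this toy image the strong spot is segmented cleanly while the weak spot falls below the single global cutoff.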
Since we already have
the bounding boxes,
we can also try a local
threshold.
Local Threshold
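One simple way to realize a local threshold is to give every bounding box its own cutoff. The sketch below uses the box mean as that cutoff, which is an assumption (the slides do not name the rule); boxes are hypothetical (x0, x1, y0, y1) tuples with half-open ranges.

```python
def local_threshold(gray, boxes):
    """Threshold each bounding box at its own mean intensity (assumed rule)."""
    mask = [[False] * len(row) for row in gray]
    for x0, x1, y0, y1 in boxes:
        values = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        cutoff = sum(values) / len(values)
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = gray[y][x] > cutoff
    return mask


# Left box holds a strong spot, right box a weak one.
gray = [
    [3, 250, 2, 10],
    [4, 240, 3, 12],
]
boxes = [(0, 2, 0, 2), (2, 4, 0, 2)]
mask = local_threshold(gray, boxes)
```

Because the cutoff adapts to each box, the weak spot (10, 12) is found even though it is far dimmer than the strong one.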


Advantages and disadvantages of the two
methods described above:
The log threshold works well, but it misses some
weak spots. The local threshold recovers those
weak spots, but it handles high-intensity spots
poorly.
Combine Logarithmic and Local Threshold
It is reasonable to
combine these two
methods. The result is
better.
Combined Threshold
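The slides do not say how the two masks were combined. One plausible rule, sketched here purely as an assumption, is to trust the log mask wherever it found a spot and fall back to the local mask only in boxes where the log mask is empty. All names and data are hypothetical.

```python
def combine_masks(log_mask, local_mask, boxes):
    """Use the log mask per box; fall back to the local mask if it is empty."""
    combined = [row[:] for row in log_mask]
    for x0, x1, y0, y1 in boxes:
        found = any(
            log_mask[y][x] for y in range(y0, y1) for x in range(x0, x1)
        )
        if not found:
            for y in range(y0, y1):
                for x in range(x0, x1):
                    combined[y][x] = local_mask[y][x]
    return combined


# Left box: strong spot (log mask is clean, local mask has a stray pixel).
# Right box: weak spot (log mask misses it, local mask finds it).
log_mask = [[False, True, False, False], [False, True, False, False]]
local_mask = [[True, True, False, True], [False, True, False, True]]
boxes = [(0, 2, 0, 2), (2, 4, 0, 2)]   # (x0, x1, y0, y1), half-open
combined = combine_masks(log_mask, local_mask, boxes)
```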
What We Get Now…
Sample
Control
Spot segmentation and intensity computation
Final results:
Cancer
Control
The number in each bounding box shows the intensity of
each spot.
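A sketch of the per-spot intensity computation follows, assuming the intensity of a spot is the mean of its foreground pixels inside the bounding box. The slides do not state the exact definition (a median or a background-corrected value would also be reasonable), and `spot_intensity` is a hypothetical name.

```python
def spot_intensity(layer, mask, box):
    """Mean intensity of the foreground pixels inside one bounding box."""
    x0, x1, y0, y1 = box
    values = [
        layer[y][x]
        for y in range(y0, y1)
        for x in range(x0, x1)
        if mask[y][x]
    ]
    return sum(values) / len(values) if values else 0.0


layer = [[3, 250, 2, 10], [4, 240, 3, 12]]
mask = [[False, True, False, True], [False, True, False, True]]
boxes = [(0, 2, 0, 2), (2, 4, 0, 2)]   # (x0, x1, y0, y1), half-open
intensities = [spot_intensity(layer, mask, box) for box in boxes]
```

Running this once on the sample layer and once on the control layer gives the two intensity matrices used in the rest of the analysis.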
Breast Cancer Analysis

Now we are ready to inspect the “implicit”
relations between genes “hiding” in a microarray
image. Our idea is to design filters that can be
applied to a microarray image to generate
breast cancer “signatures”.
Here’s an Example…



We use a 6×6 matrix from a breast cancer microarray image as an example.
We use each row of the normal control's intensity matrix to filter the
control and cancer microarray images. Here are some results:
red = cancer, green = control
Row Filter 1
Row Filter 2
Row Filter 3
Row Filter 4
Row Filter 5
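The slides do not spell out what "filter" means for the row filters. One reading, sketched here as an assumption, is to treat a row of the control intensity matrix as an FIR kernel and convolve it with the flattened intensity matrix of each image. The 3×3 matrices stand in for the 6×6 example, and all names are hypothetical.

```python
def convolve(signal, kernel):
    """Full 1-D discrete convolution (length len(signal) + len(kernel) - 1)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def flatten(matrix):
    """Read a matrix row by row into one list."""
    return [v for row in matrix for v in row]


# Made-up spot intensities for one control and one cancer image.
control = [[1, 2, 1], [2, 1, 2], [1, 1, 1]]
cancer = [[1, 4, 1], [2, 1, 5], [3, 1, 1]]

row_filter = control[0]                      # "Row Filter 1"
control_response = convolve(flatten(control), row_filter)
cancer_response = convolve(flatten(cancer), row_filter)
```

Plotting `control_response` against `cancer_response` for each row of the control matrix would give curves of the kind labeled Row Filter 1 to 5.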

Design the filter from randomly chosen spots:
Random Filter
Design the filter from specific (cancer-related) spots:
Specific Filter


The good news: in the processed results, the
cancer sample and the normal control differ
significantly, so breast cancer is easy to
detect.
The bad news: we already know the image comes
from a breast cancer patient. How can we detect
early-stage breast cancer in a reasonable way?

More bad news: based on our small data set, the
processed results do not appear to converge to a standard.
Red = Breast Cancer Sample 1; Blue = Breast Cancer Sample 2;
Green = Control
Optimization


Taking the first or second derivative of the processed results might help
optimize them (or might not; I think it depends on the results themselves).
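For sampled results, the derivatives can be approximated by finite differences. A minimal sketch with made-up values follows; `diff` is a hypothetical helper.

```python
def diff(values):
    """First-order finite difference: values[i + 1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]


response = [1, 4, 9, 16, 25]   # hypothetical filtered result
first = diff(response)         # approximates the first derivative
second = diff(first)           # approximates the second derivative
```

Each differencing step removes the constant offset and shortens the sequence by one sample, which is one way to see why some expression-level information is lost.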
An Example
First/Second Derivatives
Another Example
First/Second Derivative
Discussion

Taking the first or second derivative lets us focus
on the essential differences between normal and
cancer samples. However, some expression-level
information is lost.
Why don’t the results converge?

There are many reasons why the results do not
converge to a standard: the “gene map” depends
strongly on the individual; different researchers
have different habits; researchers use different
equipment and reagents; and so on.
Let’s try another method to design the filter…


We will identify genes that are strongly related to
cancer and design the filter around them.
From the cancer (red) and normal (green)
images shown earlier, we can construct two
intensity matrices, Ic and In. Subtracting In
from Ic gives another matrix, Id, which shows
the differences between them:
Id = Ic - In




For each image we get a specific Id, but for
the filter design we need a standard Id, so we
average the individual Ids:
Id = (Id1+Id2+Id3+…+Idn)/n
Now we can use the largest-magnitude values in
Id to design the filter.
Note: Id may contain negative values, because
the expression level of some genes may be higher
in normal cells than in cancer cells.
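The difference and averaging steps can be sketched as follows, using two made-up 2×2 image pairs. `subtract` and `average` are hypothetical helper names.

```python
def subtract(a, b):
    """Element-wise a - b for two equally sized matrices."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def average(matrices):
    """Element-wise mean of a list of equally sized matrices."""
    n = len(matrices)
    return [
        [sum(values) / n for values in zip(*rows)]
        for rows in zip(*matrices)
    ]


# Made-up (Ic, In) intensity matrices for two images.
pairs = [
    ([[10, 4], [6, 8]], [[4, 6], [6, 2]]),
    ([[12, 2], [8, 6]], [[4, 6], [4, 4]]),
]
ids = [subtract(ic, i_n) for ic, i_n in pairs]   # Id1, Id2
standard_id = average(ids)                       # standard Id
```

The negative entry in `standard_id` corresponds to a gene that is expressed more strongly in the normal sample.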


Here’s an example: the largest positive value in Id is
the 61st value, 150. We keep this value and set all
other values to 0, and call the new matrix F1.
Convolving F1 with Id1 gives a result that shows the
relation between the 61st gene and all other genes.
The most negative value in Id is the 32nd value, -50.
We keep this value and set all other values to 0, and
call the new matrix F2. Convolving F2 with Id1 gives a
result that shows the relation between the 32nd gene
and all other genes.
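The construction of F1 and F2 can be sketched on a short made-up vector. A 1-D convolution over the flattened matrix is assumed here, since the slides do not say whether the convolution is 1-D or 2-D; the helper names are hypothetical.

```python
def convolve(signal, kernel):
    """Full 1-D discrete convolution."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def single_value_filter(standard_id, index):
    """Keep only the value at `index` (0-based) and zero everything else."""
    return [v if i == index else 0 for i, v in enumerate(standard_id)]


standard_id = [5, -50, 20, 150, -10]              # made-up flattened standard Id
f1 = single_value_filter(standard_id, standard_id.index(max(standard_id)))
f2 = single_value_filter(standard_id, standard_id.index(min(standard_id)))

id1 = [4, -40, 25, 160, -5]                       # made-up Id of one image
r1 = convolve(id1, f1)
r2 = convolve(id1, f2)
```

Because each filter has a single nonzero entry, each result is simply Id1 shifted by the position of that entry and scaled by its value, so the position of the selected gene enters only through the shift.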
Take square root…
Take log…
Another example: linearly add the results from different filters…

This time, design the filters according to the
35th and 42nd values in the Id matrix, and call
them F1 and F2 respectively.
R1 = conv(Id1, F1)
R2 = conv(Id1, F2)
R3 = a*R1 + b*R2 (a, b are constants)
Not good enough, try another idea…


The filter design depends on the location of the
selected value in the standard Id matrix, which is
tedious and inconvenient.
Each spot in the microarray image represents a
specific gene; how can we encode this identity?
Our idea is to bind a specific frequency to each
gene. For example, bind Gene1 to sin(wt), Gene2
to sin(2wt), and so on.



The elements of Id then look like this:
[Value1*sin(wt) Value2*sin(2wt)
Value3*sin(3wt) …]
Now we convert the intensity matrix to the
frequency domain.
Why do we do this?


Advantage 1: sin(iwt) is an orthogonal series. For i != j, over one
period T = 2π/w:
$$\int_{0}^{T} \sin(iwt)\,\sin(jwt)\,dt = 0$$


So we can design a feature extraction array containing all genes
associated with cancer. For example, the array may look
like this: E = [a*sin(3wt) b*sin(17wt) c*sin(45wt) …]. Since the
frequency information is bound to the intensity, the method
no longer depends on the location of the values.



We can use this feature extraction matrix E to “scan”
Id and pick out the critical genes.
Advantage 2: we can apply an inverse Fourier transform to
move the intensity matrix to the “time” domain. I think
the physical meaning is the expression level of all
genes at some specific time.
Advantage 3: maybe we can design band-pass or band-stop
filters based on this.


Disadvantage:
Since sin(iwt) is an orthogonal series, the
process described above selects only the specific
frequencies and wipes out all others, so we
cannot see the relations between a specific
gene and other genes.
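The orthogonality argument (Advantage 1 and the Disadvantage) can be checked numerically. In this sketch each gene k is bound to sin(k·w·t), the terms are summed over one sampled period, and a single gene is recovered by projecting onto its own sine. The sample count, the made-up Id values and the names `encode` and `extract` are assumptions.

```python
import math


def encode(values, n_samples=256):
    """Sum of values[k-1] * sin(k*w*t) sampled over one period (w*T = 2*pi)."""
    return [
        sum(
            v * math.sin(2 * math.pi * (k + 1) * n / n_samples)
            for k, v in enumerate(values)
        )
        for n in range(n_samples)
    ]


def extract(signal, gene_number):
    """Project the signal onto sin(gene_number*w*t) to recover its amplitude."""
    n_samples = len(signal)
    return (2 / n_samples) * sum(
        s * math.sin(2 * math.pi * gene_number * n / n_samples)
        for n, s in enumerate(signal)
    )


id_values = [150.0, -50.0, 20.0, 0.0, 35.0]   # made-up Id entries for genes 1..5
signal = encode(id_values)
gene2 = extract(signal, 2)    # approximately -50
gene3 = extract(signal, 3)    # approximately 20
unused = extract(signal, 7)   # approximately 0: no gene bound to this frequency
```

The projection returns exactly one gene's value regardless of where it sat in the matrix, and everything else is wiped out. The recovery is only valid while the number of genes stays below half the number of samples.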
Future Work and Challenges

Although the processed results do not converge
to a standard, we can build a database that
stores as many breast cancer “signatures” as
possible. Then, when we get a new microarray
image signature, we can first try to match it
against the database, or compute the
“similarity” between the sample and cancer or
between the sample and control, to predict the
probability of developing cancer.


Finish the frequency and “time” domain
analysis.
Optimize the filter design.
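The database-matching idea in the first item above could be prototyped as below. Cosine similarity is used only as one possible similarity measure (the slides do not choose one), and the signatures, labels and function names are all made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(signature, database):
    """Return the best-scoring label and the best score for every label."""
    scores = {
        label: max(cosine_similarity(signature, stored) for stored in stored_list)
        for label, stored_list in database.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical signature database.
database = {
    "cancer": [[9, 1, 8, 2], [8, 2, 9, 1]],
    "control": [[2, 2, 3, 2], [3, 2, 2, 3]],
}
label, scores = best_match([7, 1, 9, 2], database)
```

The two scores could be turned into a rough risk indicator, but calibrating them into an actual probability would need the kind of data set described below.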

The big problem is that, since the gene map depends
strongly on the individual, it may not be a good idea
to use a healthy person to “measure” another person.
We need microarray images from the same person before
and after he or she developed cancer, and after
treatment. Such images are very hard to obtain. We
could instead use images of normal and abnormal tissue
from the same person, but we lack those images as
well.
Items Needed

We need 50 or more microarray images of the
“normal”, “cancer” and “after treatment” stages
from the same person (or from normal and
cancer tissues).
Appendix: Software We Can Use

Although many software packages are available,
those listed below are free:
F-scan
P-scan
ScanAlyze 2
TIGR Spotfinder
UCSF Spot

Thank you!!!