Microarrays - Arizona State University

Basic Introduction to Microarrays
Chitta Baral
Arizona State University
Feb 3, 2003
Basics of Microarrays
• From: The magic of microarrays, S. Friend and R. Stoughton
(Scientific American, Feb 2002)
• A microarray has thousands of spots (or array elements).
• Each spot has thousands to millions of copies of a particular single-stranded DNA
representing a particular gene.
• Thousands of genes are each assigned a unique spot.
• From 2 cell samples (say, one treated with a drug and another untreated), collect
mRNA, make more stable cDNA from it, and add fluorescent labels (green to
untreated and red to treated).
• Apply the labeled cDNAs to the chip.
– Binding occurs when cDNA from a sample finds its complementary sequence of bases on the
chip.
– Such binding means that the gene represented by the chip DNA was active, or expressed, in
the sample.
• Put the chip in a scanner.
– Calculate the ratio of red to green at each spot and generate a color-coded readout.
• Red: Gene that strongly increased activity in treated cells
• Green: Gene that strongly decreased activity in treated cells
• Yellow: Gene that was equally active in treated and untreated cells
• Black: Gene that was inactive in both groups
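The color-coded readout can be sketched as a small function. This is only an illustration, assuming background-corrected red and green intensities for each spot. The name `spot_color`, the thresholds `min_signal` and `fold`, and the example intensities are all hypothetical and do not come from the slides.

```python
import math

def spot_color(red, green, min_signal=100.0, fold=2.0):
    """Classify one spot from its red (treated) and green (untreated) intensity.

    min_signal and fold are illustrative thresholds, not values from the slides.
    """
    if red < min_signal and green < min_signal:
        return "black"   # inactive in both samples
    # Clamp to 1.0 so an empty channel cannot cause a division by zero.
    log_ratio = math.log2(max(red, 1.0) / max(green, 1.0))
    if log_ratio >= math.log2(fold):
        return "red"     # strongly increased in treated cells
    if log_ratio <= -math.log2(fold):
        return "green"   # strongly decreased in treated cells
    return "yellow"      # roughly equal in both samples

# Made-up (red, green) intensities for four spots.
spots = {
    "geneA": (5000, 400),
    "geneB": (300, 4200),
    "geneC": (2000, 2100),
    "geneD": (20, 35),
}
readout = {gene: spot_color(r, g) for gene, (r, g) in spots.items()}
```

Working on the log2 ratio makes a two-fold increase and a two-fold decrease symmetric around zero, which is why the same threshold is used in both directions.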
Determining the impact of a new drug
• Suppose the last experiment was about determining
quickly whether a potential new drug is likely to harm the
liver.
• Do the red or green genes correspond to gene functions
that reflect liver damage? I.e., do those genes
make proteins whose concentration (high for the red
ones, and low for the green ones) reflects liver damage?
• Or compare the overall expression pattern with the
patterns produced when cells react to
known liver toxins.
– Close similarity would indicate that the new drug is probably
toxic as well.
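One way to compare an overall expression pattern against known toxin patterns is a correlation score. The sketch below assumes each pattern is a list of log2(red/green) ratios over the same genes in the same order. The function name `most_similar_toxin`, the toxin names, and all numbers are made up for illustration.

```python
import numpy as np

def most_similar_toxin(new_profile, toxin_profiles):
    """Return (name, correlation) of the known pattern closest to new_profile."""
    scores = {
        name: float(np.corrcoef(new_profile, ref)[0, 1])
        for name, ref in toxin_profiles.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical log2 ratios for five genes; not real measurements.
known = {
    "toxin_X": [2.1, -1.8, 0.1, 1.5, -0.2],
    "toxin_Y": [-0.3, 0.2, 1.9, -1.7, 0.1],
}
new_drug = [1.9, -1.5, 0.0, 1.2, -0.1]

name, r = most_similar_toxin(new_drug, known)
```

A correlation close to +1 with a known toxin's pattern would be the "close similarity" that suggests the new drug is probably toxic as well. How close is close enough is a judgment the sketch does not make.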
Application of comparative cDNA hybridization
• Tissue-specific genes
– Comparative hybridization experiments (CHE) can reveal genes which are
preferentially expressed in specific tissues. Some of the genes implement the
behaviors that distinguish the cell’s tissue type, while other controlling genes
make sure that the cell only performs the function for its type.
• Regulatory gene defects in cancer
– CHE can pinpoint the transcription differences responsible for the change from
normal to cancerous cells.
– CHE can distinguish different patterns of abnormal transcription in heterogeneous
cancers.
• Cellular responses to the environment
– CHE can point to genes whose transcription changes in response to an
environmental stimulus.
– Temporal studies can also identify the order of changes, providing evidence about
which genes control the response directly and which are only indirectly affected
by it.
• Cell cycle variations
– CHE can be used to distinguish genes that are expressed at different times in the
cell cycle. Thus pathways responsible for basic life processes can be uncovered.
Microarray applications and uses
• Characterization of the temporal order of gene
expression within a cell
• Determination of the cellular location of gene
products
• Prediction of the function of resulting proteins
• Prediction of the effect of perturbations of the
cellular environment on the program of gene
expression by changing environmental
conditions or administering drugs.
Computational challenges – interpreting the scanned image
• End point of a CHE is a scanned array image.
– Intensities can be quantified by measuring the average or
integrated intensities of the spots
• Subject to noise from irregular spots, dust on the slide, and
nonspecific hybridization.
• Deciding the threshold (between spots and background) can be
difficult, especially when the spots fade gradually around the edges.
• Detection efficiency may not be uniform across the slide, leading to
excessive red intensity on one side and excessive green on the
other.
– Ratio of fluorescent intensities for a spot is interpreted as the
ratio of concentrations for its corresponding mRNA in the two cell
populations.
• Low levels of cDNA due to reverse transcription bias, sample loss,
or an inherently rare mRNA can cause large uncertainties in these
ratios.
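The ratio computation, and the reason low signals are a problem, can be sketched as follows. This assumes per-spot foreground and background intensities are already available from image analysis. The function name, the `min_signal` cutoff, and the median-centering step (which assumes most genes are unchanged) are illustrative choices, not methods taken from the slides. The sketch corrects only a global dye bias, not the position-dependent bias mentioned above.

```python
import numpy as np

def normalized_log_ratios(red, green, red_bg, green_bg, min_signal=50.0):
    """Return (log2 ratios, reliable mask) for arrays of spot intensities."""
    red = np.asarray(red, float) - np.asarray(red_bg, float)
    green = np.asarray(green, float) - np.asarray(green_bg, float)
    # Spots where either channel is near background give unstable ratios.
    reliable = (red > min_signal) & (green > min_signal)
    log_ratio = np.full(red.shape, np.nan)
    log_ratio[reliable] = np.log2(red[reliable] / green[reliable])
    # Global normalization: center the median of the reliable spots at 0.
    log_ratio[reliable] -= np.median(log_ratio[reliable])
    return log_ratio, reliable

# Made-up intensities: the red channel reads about twice as bright everywhere.
red = [1100, 2100, 4100, 130]
green = [600, 1100, 2100, 120]
background = [100, 100, 100, 100]
log_ratio, reliable = normalized_log_ratios(red, green, background, background)
```

In this toy input the last spot is barely above background, so it is flagged as unreliable instead of being given a ratio. The other three spots end up with a normalized log ratio of zero once the uniform two-fold dye bias is removed.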
A typical Microarray data set
• Includes expression levels for thousands of genes
across hundreds of conditions, such as
– Cells of different cell lines
– Cells under different conditions
– Pathological tissue specimens from different patients
– Serial time points following a stimulus to a cell or organism
• Imagine a 2D array of measurements
– Rows: measurements associated with individual genes
– Columns: measurements associated with conditions
– Profile: list of measurements along each row or column
– Features: individual expression measurements within each
profile (some features are more valuable than others, and
sometimes focusing on a subset improves results)
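The 2D layout can be pictured with a small NumPy array. The gene names, condition names, and values below are invented for illustration only.

```python
import numpy as np

genes = ["gene1", "gene2", "gene3"]
conditions = ["cell_line_A", "cell_line_B", "t=1h", "t=2h"]

# Made-up log2 expression ratios: one row per gene, one column per condition.
data = np.array([
    [ 1.2,  0.9, -0.1,  0.3],
    [-0.8, -1.1,  0.0,  0.2],
    [ 0.1,  0.0,  2.0,  2.4],
])

# A profile is one row (a gene across conditions) or one column (a condition across genes).
gene_profile = data[genes.index("gene3"), :]
condition_profile = data[:, conditions.index("t=1h")]

# A feature is a single measurement within a profile.
feature = data[genes.index("gene3"), conditions.index("t=2h")]

# Focusing on a subset of features: keep only the time-course columns.
time_cols = [conditions.index(c) for c in ("t=1h", "t=2h")]
time_course = data[:, time_cols]
```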
Analyzing microarray data
• Any dataset can be analyzed in 2 ways.
– E.g., 47 expression profiles of 4026 genes collected from lymphoma
specimens (Nature 403, 503-511, 2000)
• 47 cancer profiles with 4026 available features
• 4026 gene profiles with 47 available features
• Supervised and unsupervised methods
– Supervised: the genes or conditions are associated with labels, coming
from outside the experiment, that provide information about a
preexisting classification. The information may include knowledge of
gene function or regulation, disease subtype, or tissue origin of a cell
type.
• Classification information is used to drive the analysis
• Used for predicting accurate labels for new genes.
– Unsupervised: No additional information
• Geared towards the discovery of patterns in the data, unbiased by outside
knowledge.
• Used for exploratory tasks.
• Clustering
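The "2 ways" of looking at one data set amount to reading the same array by rows or by columns. The sketch below uses random placeholder values. Only the shape mirrors the lymphoma example, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder values only: 47 specimens by 4026 genes, as in the lymphoma example.
specimens_by_genes = rng.normal(size=(47, 4026))

# View 1: 47 specimen profiles, each with 4026 gene features.
specimen_profiles = specimens_by_genes
# View 2: 4026 gene profiles, each with 47 specimen features.
gene_profiles = specimens_by_genes.T
```

Grouping the rows of `specimen_profiles` groups cancers. Grouping the rows of `gene_profiles` groups genes.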
Unsupervised grouping: clustering
• Goal: Simplify large gene expression data sets
• Approach: Group similar profiles together based on a distance metric (a formula for
calculating the similarity of two profiles)
– Distance metrics: distance between 2 lists of numbers
• Euclidean distance (square root of the sum of squared differences)
• Statistical correlation coefficient (-1 to +1)
– Clustering strategy
• Hierarchical clustering: calculate the distance between individual data points and then group together
those that are close.
– Distances between groups are computed and used to create groups of groups.
– Easy to implement, but suffers because the decision about where to create branches and in what order to present
them is often arbitrary.
• K-means clustering: requires a parameter k (the number of expected clusters)
– Initially, cluster centers are selected randomly.
– In each iteration of the algorithm, all of the profiles are assigned to the cluster whose center they are nearest to, and
then each cluster center is recalculated based on the profiles within the cluster.
• Self-organizing maps:
– Instead of partitioning, they organize the clusters into a map where similar clusters are close to each other.
– The number and topological configuration of the clusters are pre-specified.
– Cluster centers are recalculated in each iteration using both the profiles within the cluster and the profiles in
adjacent clusters.
• Clustering is sensitive to the features used to compute the distance metric.
• Applications: identify co-regulated genes, genes with related functions, signatures of
individual signalling pathways within the data set, etc.
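The two distance metrics and the k-means iteration can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation. The function names, the fixed random seed, the stopping rule, and the toy profiles are assumptions made for the example.

```python
import numpy as np

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def correlation(a, b):
    """Pearson correlation coefficient, between -1 and +1."""
    return float(np.corrcoef(a, b)[0, 1])

def kmeans(profiles, k, n_iter=100, seed=0):
    """Plain k-means with Euclidean distance; returns (labels, centers)."""
    profiles = np.asarray(profiles, float)
    rng = np.random.default_rng(seed)
    # Initial centers: k distinct profiles chosen at random.
    centers = profiles[rng.choice(len(profiles), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every profile to the nearest center.
        dists = np.linalg.norm(profiles[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center from its members; an empty cluster keeps its center.
        new_centers = np.array([
            profiles[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments have stabilized
        centers = new_centers
    return labels, centers

# Toy profiles: two genes near 0 and two genes near 2 across three conditions.
profiles = np.array([
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.2],
    [2.0, 2.1, 1.9],
    [2.2, 1.9, 2.0],
])
labels, centers = kmeans(profiles, k=2)
```

Note that the correlation coefficient is a similarity, not a distance. A common convention is to cluster on one minus the correlation, which is not shown here.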
Supervised grouping: classification
• Approach: Take known groupings and create rules for reliably assigning
genes or conditions to these groups.
– E.g., the problem of classifying unknown genes as ribosomal or non-ribosomal
– Success depends on whether high-quality labeled sets are provided.
• Examples of methods (machine learning)
– Logistic regression
• Uses the feature values for different groups to estimate the parameters of a predictor
function (a linear log-likelihood model)
– Neural networks
• Use a set of known examples to create a multi-layered computational network
– Linear discriminant analysis
• Uses the labeled examples from each set of classified cases to estimate a probability
distribution for the values of the features in that set.
• Given a new example, it determines the closest distribution and assigns the example to
that set.
– Inductive logic programming
– Decision trees
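As one concrete instance of the methods listed, here is a minimal logistic regression fitted by gradient descent. It is a sketch under stated assumptions: the function names, learning rate, iteration count, and the toy labeled profiles are all invented for illustration, and a real analysis would normally use an established library.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Estimate weights of a linear log-odds model by gradient descent."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # last column is the intercept
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))      # predicted probability of label 1
        w -= lr * Xb.T @ (p - y) / len(y)      # gradient of the mean log-loss
    return w

def predict(w, X):
    """Assign label 1 when the linear score is positive, else 0."""
    X = np.asarray(X, float)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w > 0).astype(int)

# Made-up two-feature profiles with known labels (1 = in the group, 0 = not).
X_train = [[1.5, 1.0], [1.2, 0.8], [1.8, 1.1],
           [-1.4, -0.9], [-1.6, -1.2], [-1.1, -0.7]]
y_train = [1, 1, 1, 0, 0, 0]

w = fit_logistic(X_train, y_train)
new_labels = predict(w, [[1.0, 0.9], [-1.0, -0.8]])
```

The quality of `w` depends entirely on the labeled examples, which is the point made above about high-quality labeled sets.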
Dimension reduction
• Involves removing features from the data set
– Features are removed because they do not provide significant incremental information and can confuse and
make analysis unnecessarily complex.
– Feature selection often attempts to identify a minimum set of non-redundant features that are
useful for classification.
• E.g., don’t select co-regulated genes.
• Unsupervised dimension reduction: pruning uninformative features; several
methods such as:
– Principal component analysis (PCA)
• Automatically detects redundancies in the data and determines a new set of guaranteed non-redundant hybrid features (multiple features condensed into one).
• Advantage: makes apparent the outliers and clusters in a data set and reduces the noise in a data set
• Disadvantage: throws away weak signals that could be important
– Independent component analysis
• Supervised dimension reduction: feature selection
– Goals: selecting relevant underlying features and reducing the number of features necessary
to classify correctly
– A straightforward method
• Iteratively apply a supervised classification algorithm that reports weights on all features.
• After running the classification algorithm the first time, the feature with the lowest weight is removed
from the data set.
• The algorithm is run again to determine the second least important feature.
• This process is repeated while monitoring the classification performance on known examples.
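Both kinds of dimension reduction can be sketched with NumPy. The PCA below uses a singular value decomposition of the centered data. The elimination loop uses a least-squares linear classifier as a stand-in for "a supervised classification algorithm that reports weights", which is an assumption of this sketch and not the method named in the slides. It also assumes features on comparable scales and labels coded as -1 and +1. All function names and toy data are hypothetical.

```python
import numpy as np

def pca(X, n_components):
    """Return (scores, components, explained variance fractions)."""
    X = np.asarray(X, float)
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # new hybrid features
    explained = (S ** 2) / np.sum(S ** 2)
    return Xc @ components.T, components, explained[:n_components]

def linear_weights(X, y):
    """Least-squares linear classifier; returns (feature weights, intercept)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w[:-1], w[-1]

def backward_elimination(X, y, n_keep):
    """Repeatedly drop the feature with the smallest absolute weight."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    remaining = list(range(X.shape[1]))
    history = []                                 # (features used, training accuracy)
    while len(remaining) > n_keep:
        w, b = linear_weights(X[:, remaining], y)
        acc = float(np.mean(np.sign(X[:, remaining] @ w + b) == y))
        history.append((list(remaining), acc))
        remaining.pop(int(np.argmin(np.abs(w))))
    return remaining, history

# Toy PCA input: the second feature is exactly twice the first (fully redundant).
scores, components, explained = pca([[1, 2], [2, 4], [3, 6], [4, 8]], n_components=1)

# Toy selection input: only feature 0 carries the label; features 1 and 2 do not.
X_toy = np.array([[-2, 1, 1], [-1, -1, 1], [-1, 1, -1], [-2, -1, -1],
                  [2, 1, 1], [1, -1, 1], [1, 1, -1], [2, -1, -1]], float)
y_toy = np.array([-1, -1, -1, -1, 1, 1, 1, 1], float)
kept, history = backward_elimination(X_toy, y_toy, n_keep=1)
```

The `history` list is what "monitoring the classification performance" looks like in practice: if accuracy drops sharply after a removal, the removed feature was probably needed. Here accuracy is measured on the training examples only, which is optimistic. Held-out examples would give a fairer check.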
Further Details …
• Types of microarrays: spotted cDNA microarrays, high-density oligonucleotide microarrays
• Fluorescent dyes (for glass arrays) vs radioactive isotopes (for membrane arrays)
• Interpretations
– Identifying individual genes (regulated expression of which can explain particular biological phenomena) or
assigning potential function to new genes.
– Co-regulated genes (often identified using cluster analysis) allow functional classification (they may participate in
similar cellular processes or pathways) and
potential identification of common regulatory elements (DNA motifs) in promoter sequences.
• Comparing a large number of samples for a global view
– Assumption: Genes with closely related expression patterns may be controlled by the same regulatory mechanism.
– When one sees differential expression of a gene, one may have knowledge about its probable function (from NCBI
databases) and can make a hypothesis about the role that gene is playing in the system under study.
– One cluster of genes may relate to one pathway and another cluster to another pathway, hinting at interconnections
between such pathways.
– Gene regulatory networks
– The common control can be a mixture of all samples used.
• Combining many data sets and analyzing the whole set is very useful
– By comparing the expression profiles of tumour samples using many genes, it is possible to identify those
genes whose expression characterizes a particular tumour type.
– Compare the expression signature of a particular tumour type to data generated by measuring the
responses of closely related cell lines in culture to many different stimuli, such as hormones, growth factors,
etc. Using this strategy one can draw conclusions about which signalling pathways are activated in a
particular tumour type, leading to the identification of pathways that might provide therapeutic targets.
References: main sources used in this presentation
• Basic microarray analysis: grouping and feature reduction. Raychaudhuri et
al. Trends in Biotechnology, Vol 19, No 5, May 2001, 189-193.
• Gene expression microarrays and the integration of biological knowledge.
Noordewier and Warren. Trends in Biotechnology, Vol 19, No 10, Oct 2001,
412-415.
• http://www.cs.wustl.edu/~jbuhler/research/array
• The magic of microarrays. Friend and Stoughton. Scientific American, Feb
2002.
• Other sources that I read.
– Navigating gene expression using microarrays – a tech review. Schulze &
Downward. Nature Cell Biology, Vol 3, Aug 2001, E190.
– http://www.fargo.ars.usda.gov/ps/micr_the.htm
– Analysis of gene expression by microarrays: cell biologist’s gold mine or
minefield? Schulze & Downward. J. of Cell Science 113, 2000, 4151-4156.
– Microarrays: handling the deluge of data and extracting reliable information.
Hess et al. Trends in Biotechnology, Vol 19, No 11, Nov 2001, 463-468.