Various Career Options Available
Analysis and Management of Microarray Data
Dr G. P. S. Raghava
Major Applications
Identification of differentially expressed genes in diseased tissues (in presence of drug)
Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
Use the expression profile of a known disease to diagnose and classify unknown genes
Management of Microarray Data
Magnitude of Data
– Experiments
50 000 genes in human
320 cell types
2000 compounds
3 time points
2 concentrations
2 replicates
– Data Volume
4 × 10^11 data points
10^15 bytes = 1 petabyte of data
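The data-point count follows from multiplying the experimental dimensions listed on this slide. A minimal sketch of that arithmetic is given here; the bytes-per-point figure is not stated on the slide and is only what the two totals would imply if they describe the same experiment set (presumably counting raw images as well as numeric values, which is an assumption).

```python
# Experimental dimensions as listed on the slide.
genes = 50_000
cell_types = 320
compounds = 2_000
time_points = 3
concentrations = 2
replicates = 2

# One data point per gene per experimental condition.
data_points = (genes * cell_types * compounds
               * time_points * concentrations * replicates)

# ASSUMPTION: the 1 petabyte total refers to the same experiment set,
# so dividing it by the point count gives an implied storage cost per point.
petabyte = 10 ** 15
implied_bytes_per_point = petabyte / data_points

print(f"data points: {data_points:.2e}")
print(f"implied bytes per data point: {implied_bytes_per_point:.0f}")
```

The exact product is 3.84 × 10^11, which the slide rounds to 4 × 10^11.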
Gene expression database – a conceptual view
[Figure: a gene expression matrix (genes × samples) holding gene expression levels, accompanied by gene annotations and sample annotations]
Management of Microarray Data
Major Issues
Large volume of microarray data in last few years
– Storage and efficient access
– Comparison and integration of data
Problem of data access and exchange
– Data scattered around Internet
– Supplementary material of publications
– Difficult for users to access relevant data
Problems with existing databases
– Diverse purpose
– Developed for specific purpose
Management of Microarray Data
Specific Database
– Platform (e.g. Stanford Microarray Database; SMD)
– Organism (Yeast MA global viewer)
– Project (Life cycle database of Drosophila)
Problems with supplementary material and MA databases
– Lack of direct access
– Quality not checked
– No standard format
– Incomplete data
Comprehensive database server to manage
massive amount of Microarray Data
– Biomaterial Information
– Raw Data & Images
– Web Tools (normalization; data viewing; analysis)
Runs on local servers, allowing full management of the data
and of permissions to add and view data
Minimum Information about Microarray
Experiment (MIAME)
BASE http://bioinformatics1.uams.edu:8081:/
Public Databases
Gene Expression data is an essential
aspect of annotating the genome
Publication and data exchange for
microarray experiments
Data mining/Meta-studies
Common data format - XML
MIAME (Minimal Information About a
Microarray Experiment)
GEO at the NCBI
Microarray Data Mining
Challenges
Too few records (samples), usually < 100
Too many columns (genes), usually > 1,000
Too many columns are likely to lead to false positives
For exploration, a large set of all relevant genes is desired
for diagnostics or identification of
Analysis of Microarray Data
Analysis of images
Preprocessing of gene expression data
Normalization of data
– Subtraction of background noise
– Global/local normalization
– Housekeeping genes (or same gene)
– Expression as a log ratio (test/reference)
Differential Gene expression
– Use replicates and calculate significance (t-test)
– Assess the significance of fold change with a statistical method
Clustering
– Supervised/Unsupervised (Hierarchical, K-means, SOM)
Prediction or Supervised Machine Learning (SVM)
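The differential-expression step above (replicates plus a t-test) can be sketched as follows. This is a minimal illustration, not the presenter's method: it uses Welch's unequal-variance form of the t-test, the gene names and log2 expression values are hypothetical, and only the test statistic and degrees of freedom are computed (a p-value would need a t-distribution table or a statistics library).

```python
import math
from statistics import mean, variance

def welch_t(test, reference):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each group mean (sample variance / n).
    se2_test = variance(test) / len(test)
    se2_ref = variance(reference) / len(reference)
    t = (mean(test) - mean(reference)) / math.sqrt(se2_test + se2_ref)
    df = (se2_test + se2_ref) ** 2 / (
        se2_test ** 2 / (len(test) - 1) + se2_ref ** 2 / (len(reference) - 1)
    )
    return t, df

# HYPOTHETICAL log2 expression values, four replicates per condition.
gene_a_test, gene_a_ref = [8.1, 8.3, 7.9, 8.2], [6.0, 6.2, 5.9, 6.1]
gene_b_test, gene_b_ref = [7.0, 7.4, 6.6, 7.1], [7.1, 6.8, 7.3, 6.9]

t_a, df_a = welch_t(gene_a_test, gene_a_ref)
t_b, df_b = welch_t(gene_b_test, gene_b_ref)

# On log2 data the difference of means is the log2 fold change.
log2_fc_a = mean(gene_a_test) - mean(gene_a_ref)

print(f"gene A: log2 fold change {log2_fc_a:.2f}, t = {t_a:.1f}, df = {df_a:.1f}")
print(f"gene B: t = {t_b:.2f}, df = {df_b:.1f}")
```

With thousands of genes tested at once, a per-gene threshold such as 0.05 yields many false positives by chance alone, so a multiple-testing correction is normally applied on top of the per-gene test.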
Low Level Analysis, or Preprocessing of Gene Expression Data
Scale Transformation
Normalization and Scaling
Replicate Handling
Missing value Handling
Flat pattern filtering
Pattern standardization
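Several of the preprocessing steps listed above can be chained for a single gene profile, as in the sketch below. The specific choices are assumptions made for illustration and are not taken from the slides: log2 as the scale transformation, mean imputation for missing values, a standard-deviation cutoff of 0.5 for flat patterns, and z-scoring as the standardization.

```python
import math
from statistics import mean, pstdev

def preprocess(profile, min_sd=0.5):
    """Preprocess one gene's raw intensities across conditions.

    `profile` holds positive intensities, with None marking a missing value.
    Returns the standardized profile, or None if the pattern is flat.
    """
    # Scale transformation: log2 of each observed intensity.
    logged = [math.log2(x) if x is not None else None for x in profile]

    # Missing value handling (ASSUMPTION: impute with the gene's own mean).
    fill = mean(x for x in logged if x is not None)
    filled = [fill if x is None else x for x in logged]

    # Flat pattern filtering: drop genes that barely change (ASSUMED cutoff).
    sd = pstdev(filled)
    if sd < min_sd:
        return None

    # Pattern standardization: zero mean and unit standard deviation.
    centre = mean(filled)
    return [(x - centre) / sd for x in filled]

# HYPOTHETICAL profiles: one gene that changes, one that stays flat.
varying = preprocess([100, 400, None, 1600])
flat = preprocess([200, 210, 205, 198])

print(varying)
print(flat)
```

Replicate handling is not shown; averaging replicate log values before this step would be one simple option.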
Normalization Techniques
Global normalization
– Divide each channel value by the channel mean
Control spots
– Common spots in both channels
– House keeping genes
– Ratio of the intensities of the same gene in the two channels is used for correction
Iterative linear regression
Parametric nonlinear normalization
– log(CY3/CY5) vs log(CY5)
– Fitted log ratio – observed log ratio
General Non Linear Normalization
– LOESS
– Curve of log(R/G) vs log(sqrt(R·G))
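Two of the techniques above can be sketched in a few lines: global normalization (dividing each channel by its mean) and intensity-dependent correction of the log ratio. Note the substitution: a real LOESS fit needs a local-regression routine, so this sketch fits a single straight line to log(R/G) against log(sqrt(R·G)) as a simplified stand-in. The intensities are hypothetical, and subtracting the fitted value from the observed one is the sign convention chosen for this sketch.

```python
import math
from statistics import mean

def global_normalize(channel):
    """Divide every intensity in a channel by the channel mean."""
    m = mean(channel)
    return [x / m for x in channel]

def ma_values(red, green):
    """Return M = log2(R/G) and A = log2(sqrt(R*G)) for each spot."""
    M = [math.log2(r / g) for r, g in zip(red, green)]
    A = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return M, A

def subtract_linear_trend(M, A):
    """Fit M = intercept + slope * A by least squares; return the residuals.

    ASSUMPTION: a straight line replaces the LOESS curve for brevity.
    """
    a_bar, m_bar = mean(A), mean(M)
    slope = (sum((a - a_bar) * (m - m_bar) for a, m in zip(A, M))
             / sum((a - a_bar) ** 2 for a in A))
    intercept = m_bar - slope * a_bar
    return [m - (intercept + slope * a) for m, a in zip(M, A)]

# HYPOTHETICAL red and green channel intensities for four spots.
red = [120.0, 480.0, 1900.0, 7700.0]
green = [100.0, 400.0, 1600.0, 6400.0]

norm_red = global_normalize(red)
norm_green = global_normalize(green)
M, A = ma_values(norm_red, norm_green)
corrected = subtract_linear_trend(M, A)

print([round(m, 3) for m in corrected])
```

After the correction the log ratios are centred on zero and no longer trend with overall spot intensity, which is the purpose of the intensity-dependent methods listed above.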
Classification
Task: assign objects to classes (groups) on
the basis of measurements made on the
objects
Unsupervised: classes unknown, want to
discover them from the data (cluster
analysis)
Supervised: classes are predefined, want to
use a (training or learning) set of labeled
objects to form a classifier for classification
of future observations
Cluster analysis
Used to find groups of objects when not
already known
“Unsupervised learning”
Associated with each object is a set of
measurements (the feature vector)
Aim is to identify groups of similar
objects on the basis of the observed
measurements
Unsupervised Learning
Hierarchical clustering: merges two branches at a time until all variables (genes) are in one tree. [It does not answer the question "how many gene clusters are there?"]
K-means clustering: assumes there are K clusters. [What if this assumption is incorrect?]
Self Organizing Maps (SOM)
– Split all genes into similar sub-groups
– Finds its own groups (machine learning)
Principal Component
– Every gene is a dimension (vector); find a single dimension that
best represents the differences in the data
Model-based clustering: the number of clusters is
determined dynamically [could be one of the most promising
methods]
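Of the methods above, K-means is the simplest to sketch. The version below is a minimal illustration under stated assumptions: the expression profiles are hypothetical, the first K profiles are used as starting centroids (real implementations usually choose them more carefully), and K is supplied by the caller, which is exactly the assumption the slide questions.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster points into k groups; return (labels, centroids)."""
    # ASSUMPTION: naive initialization from the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# HYPOTHETICAL expression profiles of five genes over three conditions.
profiles = [
    (0.1, 0.2, 2.0),
    (2.1, 1.9, 0.1),
    (0.0, 0.3, 2.2),
    (1.8, 2.0, 0.0),
    (0.2, 0.1, 1.9),
]

labels, centroids = kmeans(profiles, k=2)
print(labels)
```

Running the same data with a different K still returns K groups, whether or not the data support them, which is why the choice of K needs independent justification.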
[Figure: average linkage hierarchical clustering, melanoma only; panels labelled "unclustered" and "cluster"]
Supervised Analysis
Fisher’s linear discriminant analysis
Quadratic discriminant analysis
Logistic regression (a linear discriminant method)
Neural networks
Support vector machine
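As an illustration of the first method in this list, here is a minimal sketch of Fisher's linear discriminant for two classes and two features. The samples and class labels are hypothetical, and the two-feature restriction is an assumption made so the 2×2 scatter matrix can be inverted by hand. With real microarray data, where genes far outnumber samples, the within-class scatter matrix is singular, so genes would have to be selected or the matrix regularized first.

```python
def mean_vector(rows):
    """Column-wise mean of a list of (x, y) samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    return sxx, sxy, syy

def fisher_lda(class0, class1):
    """Return the projection vector w and a decision threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    # Within-class scatter: sum of the two per-class scatter matrices.
    sxx, sxy, syy = (a + b for a, b in zip(scatter(class0, m0), scatter(class1, m1)))
    det = sxx * syy - sxy ** 2
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # w = Sw^-1 (m1 - m0), using the closed-form inverse of a 2x2 matrix.
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    # ASSUMPTION: threshold at the midpoint of the projected class means.
    threshold = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    return w, threshold

def predict(sample, w, threshold):
    """Return 1 if the projected sample lies on the class1 side, else 0."""
    return 1 if w[0] * sample[0] + w[1] * sample[1] > threshold else 0

# HYPOTHETICAL expression of two genes in labelled samples.
normal = [(1.0, 5.0), (1.5, 5.2), (0.8, 4.6), (1.3, 4.4)]
tumor = [(4.0, 2.0), (4.5, 1.6), (3.8, 2.4), (4.3, 2.2)]

w, threshold = fisher_lda(normal, tumor)
print(predict((4.1, 2.1), w, threshold), predict((1.1, 4.9), w, threshold))
```

The labelled samples play the role of the training set described on the Classification slide, and `predict` is the resulting classifier applied to future observations.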
Example: Tumor Classification
Reliable and precise classification essential for
successful cancer treatment
Current methods for classifying human malignancies
rely on a variety of morphological, clinical and
molecular variables
Uncertainties in diagnosis remain; likely that existing
classes are heterogeneous
Characterize molecular variations among tumors by
monitoring gene expression (microarray)
Hope: that microarrays will lead to more reliable
tumor classification (and therefore more appropriate
treatments and better outcomes)
Higher Level
Microarray data analysis
Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene
sequence/function/metabolic pathways databases
Discovery of common sequences in co-regulated
genes
Meta-studies using data from multiple experiments
Thanks