Various Career Options Available


Analysis and Management of Microarray Data
Dr G. P. S. Raghava
Major Applications



Identification of differentially expressed genes in diseased tissues (in presence of drug)
Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
Use expression profiles of known diseases to diagnose and classify unknown genes
Management of Microarray Data

Magnitude of Data
– Experiments






50 000 genes in human
320 cell types
2000 compounds
3 time points
2 concentrations
2 replicates
– Data Volume
4 × 10^11 data points
10^15 bytes = 1 petabyte of data
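The data-point figure above is simply the product of the experimental factors. A minimal sketch of that arithmetic follows; the bytes-per-point number it derives is an illustrative assumption about how a petabyte could be reached (for example through raw images), not something stated on the slide.

```python
# Experimental design factors taken from the slide above.
factors = {
    "genes": 50_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

data_points = 1
for count in factors.values():
    data_points *= count

print(f"{data_points:.2e} data points")  # 3.84e+11, i.e. roughly 4 x 10^11

# Assumption (not from the slide): storage per data point that would be
# needed for the whole set to occupy one petabyte (10^15 bytes).
bytes_per_point = 10**15 / data_points
print(f"about {bytes_per_point:.0f} bytes per data point")
```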
Gene expression database – a conceptual view
[Figure: a gene expression matrix of genes by samples holding gene expression levels, with gene annotations and sample annotations attached]
Management of Microarray Data
Major Issues

Large volume of microarray data in last few years
– Storage and efficient access
– Comparison and integration of data

Problem of data access and exchange
– Data scattered around Internet
– Supplementary material of publications
– Difficult for users to access relevant data

Problems with existing databases
– Diverse purpose
– Developed for specific purpose
Management of Microarray Data

Specific Database
– Platform (e.g. Stanford MA Database; SMD)
– Organism (Yeast MA global viewer)
– Project (Life cycle database of Drosophila)

Problems with supplements and MA databases
– Lack of direct access
– Quality not checked
– No standard format
– Incomplete data

Comprehensive database server to manage
massive amount of Microarray Data
– Biomaterial Information
– Raw Data & Images
– Web Tools (normalization; data viewing; analysis)



Running on local servers allows full management of the data and of permissions to add and view it
Minimum Information About a Microarray Experiment (MIAME)
BASE http://bioinformatics1.uams.edu:8081:/
Public Databases
Gene expression data is an essential aspect of annotating the genome
Publication and data exchange for microarray experiments
Data mining/meta-studies
Common data format: XML
MIAME (Minimum Information About a Microarray Experiment)

GEO at the NCBI
Microarray Data Mining
Challenges
Too few records (samples), usually < 100
Too many columns (genes), usually > 1,000
Too many columns are likely to lead to false positives
For exploration, a large set of all relevant genes is desired
For diagnostics or identification of
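The false-positive problem above can be seen in a small simulation. This is a minimal sketch under stated assumptions: 1000 simulated genes with no true difference, 5 samples per group, and a pooled two-sample t statistic compared against the tabulated two-sided 5% critical value for 8 degrees of freedom. All names and numbers here are illustrative and not from the slides.

```python
import math
import random
import statistics

def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal group sizes assumed)."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * 2 / n)

random.seed(0)
n_genes, n_per_group = 1000, 5
T_CRIT = 2.306  # two-sided 5% critical value of Student's t, 8 degrees of freedom

false_positives = 0
for _ in range(n_genes):
    # Both groups are drawn from the same distribution: no gene truly differs.
    control = [random.gauss(0, 1) for _ in range(n_per_group)]
    treated = [random.gauss(0, 1) for _ in range(n_per_group)]
    if abs(t_statistic(control, treated)) > T_CRIT:
        false_positives += 1

print(false_positives, "of", n_genes, "null genes pass the 5% threshold")

# One simple (conservative) remedy: a Bonferroni-corrected per-gene threshold.
bonferroni_alpha = 0.05 / n_genes
```

Roughly 5% of the null genes are expected to pass by chance alone, which is why per-gene significance levels are usually tightened when thousands of genes are tested.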

Analysis of Microarray Data



Analysis of images
Preprocessing of gene expression data
Normalization of data
– Subtraction of background noise
– Global/local normalization
– Housekeeping genes (or same gene)
– Expression as a ratio (test/reference) on a log scale
Differential gene expression
– Repeat experiments and calculate significance (t-test)
– Significance of fold change assessed with a statistical method


Clustering
– Supervised/Unsupervised (Hierarchical, K-means,
SOM)
Prediction or supervised machine learning (SVM)
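The log-ratio and t-test steps listed above might be sketched as follows for a single gene. The intensities are made-up, background-subtracted values, the base-2 logarithm is an assumption (the slide does not name a base), and the one-sample t-test on log ratios is one common choice rather than necessarily the method the author had in mind.

```python
import math
import statistics

def log2_ratios(test, reference):
    """Expression as log2(test/reference), one value per repeat."""
    return [math.log2(t / r) for t, r in zip(test, reference)]

def one_sample_t(values):
    """t statistic for the null hypothesis that the mean log ratio is 0."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / math.sqrt(n))

# Hypothetical intensities for one gene over 4 repeated hybridizations.
test = [820.0, 900.0, 760.0, 880.0]
reference = [400.0, 410.0, 395.0, 420.0]

m = log2_ratios(test, reference)
fold_change = 2 ** statistics.mean(m)   # back-transformed average fold change
t = one_sample_t(m)
T_CRIT = 3.182  # two-sided 5% critical value of Student's t, 3 degrees of freedom

print(f"fold change {fold_change:.2f}, t = {t:.1f}, significant: {abs(t) > T_CRIT}")
```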
Low Level Analysis, or Preprocessing of Gene Expression Data
Scale transformation
Normalization and scaling
Replicate handling
Missing value handling
Flat pattern filtering
Pattern standardization
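Several of the preprocessing steps above can be chained in a few lines. The sketch below is only one possible reading of them: log2 as the scale transformation, gene-mean imputation for missing values, a range cutoff for flat patterns, and z-scores for pattern standardization. The function name, the `min_range` parameter and the example data are hypothetical.

```python
import math
import statistics

def preprocess(profiles, min_range=1.0):
    """profiles: dict of gene -> list of raw ratios (None marks a missing value)."""
    cleaned = {}
    for gene, values in profiles.items():
        # Scale transformation: work with log2 ratios.
        logged = [math.log2(v) if v is not None else None for v in values]
        observed = [v for v in logged if v is not None]
        if len(observed) < 2:
            continue  # too little data to keep this gene
        # Missing value handling: impute with the gene's own mean.
        gene_mean = statistics.mean(observed)
        filled = [gene_mean if v is None else v for v in logged]
        # Flat pattern filtering: drop genes that barely change.
        sd = statistics.stdev(filled)
        if max(filled) - min(filled) < min_range or sd == 0:
            continue
        # Pattern standardization: mean 0 and standard deviation 1 per gene.
        cleaned[gene] = [(v - gene_mean) / sd for v in filled]
    return cleaned

profiles = {
    "geneA": [1.0, 2.0, None, 8.0],    # varies, one missing value
    "geneB": [1.0, 1.1, 0.9, 1.0],     # flat pattern
    "geneC": [None, None, None, 2.0],  # almost entirely missing
}
result = preprocess(profiles)
print(list(result))
```

Only `geneA` survives here: `geneB` is filtered as flat and `geneC` has too few observed values.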

Normalization Techniques


Global normalization
– Divide each channel value by the channel mean
Control spots
– Common spots in both channels
– Housekeeping genes
– Ratio of intensity of the same gene in the two channels is used for correction
Iterative linear regression
Parametric nonlinear normalization
– log(Cy3/Cy5) vs log(Cy5)
– Fitted log ratio – observed log ratio
General nonlinear normalization
– LOESS
– Curve of log(R/G) vs log(sqrt(R·G))
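The simpler techniques above can be sketched directly. The helper names, the toy intensities and the use of base-2 logarithms are assumptions made for illustration, and the LOESS fit itself is left out: the sketch only computes the log(R/G) and log(sqrt(R·G)) coordinates on which such a curve would be fitted.

```python
import math
import statistics

def global_normalize(channel):
    """Global normalization: divide each value by the channel mean."""
    mean = statistics.mean(channel)
    return [v / mean for v in channel]

def ma_values(red, green):
    """Per-spot pairs (log2(R/G), log2(sqrt(R*G)))."""
    return [(math.log2(r / g), math.log2(math.sqrt(r * g)))
            for r, g in zip(red, green)]

def housekeeping_correct(log_ratios, housekeeping_idx):
    """Shift log ratios so the housekeeping genes average to zero."""
    offset = statistics.mean([log_ratios[i] for i in housekeeping_idx])
    return [m - offset for m in log_ratios]

# Toy data: the red channel is uniformly twice as bright as the green one.
red = [200.0, 400.0, 800.0, 1600.0]
green = [100.0, 200.0, 400.0, 800.0]

raw_m = [m for m, _ in ma_values(red, green)]
norm_m = [m for m, _ in ma_values(global_normalize(red), global_normalize(green))]
hk_m = housekeeping_correct(raw_m, housekeeping_idx=[0, 1])

print(raw_m)   # a constant dye bias of one log2 unit
print(norm_m)  # bias removed by global normalization
print(hk_m)    # bias removed using the housekeeping spots
```

A constant bias like this is removed by either method; intensity-dependent bias is the case where the nonlinear (LOESS) approach becomes necessary.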
Classification
Task: assign objects to classes (groups) on the basis of measurements made on the objects
Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Cluster analysis
Used to find groups of objects when not already known
“Unsupervised learning”
Associated with each object is a set of measurements (the feature vector)
Aim is to identify groups of similar objects on the basis of the observed measurements

Unsupervised Learning

Hierarchical clustering: merges two branches at a time until all variables (genes) are in one tree. [It does not answer the question “how many gene clusters are there?”]
K-means clustering: assumes there are K clusters. [What if this assumption is incorrect?]
Self-Organizing Maps (SOM)
– Split all genes into similar sub-groups
– Finds its own groups (machine learning)

Principal component analysis
– Every gene is a dimension (vector); find a single dimension that best represents the differences in the data

Model-based clustering: the number of clusters is determined dynamically [could be one of the most promising methods]
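To make the K-means assumption concrete, here is a minimal sketch of the standard assign-and-update loop on toy expression profiles. The data, the fixed iteration count and the random initialisation are illustrative choices; K must be supplied by the user, which is exactly the limitation noted above.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means: assign each point to its nearest centroid, then update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # New centroid = mean of its members (keep the old one if a cluster is empty).
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical log-ratio profiles: three up-regulated and three down-regulated genes.
profiles = [
    (2.0, 2.1, 1.9), (2.2, 1.8, 2.0), (1.9, 2.0, 2.2),
    (-2.0, -1.9, -2.1), (-2.1, -2.2, -1.8), (-1.8, -2.0, -2.0),
]
centroids, clusters = kmeans(profiles, k=2)
for members in clusters:
    print(len(members), "genes:", members)
```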
[Figure: average linkage hierarchical clustering, melanoma only; panel labels: unclustered, ‘cluster’]
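Average linkage hierarchical clustering, as named in the figure, can be sketched in a naive form: repeatedly merge the two clusters with the smallest mean pairwise distance until one tree remains. The toy profiles and the function name are hypothetical, Euclidean distance is an assumption, and this brute-force version is only meant for small inputs.

```python
import math
from itertools import combinations

def average_linkage_tree(points):
    """Return the merge history as (members_a, members_b, height) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Average linkage: mean Euclidean distance over all cross-cluster pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        height = linkage(clusters[x], clusters[y])
        merges.append((clusters[x], clusters[y], height))
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)] + [merged]
    return merges

# Hypothetical profiles: indices 0-2 behave alike, as do indices 3-5.
profiles = [
    (2.0, 2.1, 1.9), (2.2, 1.8, 2.0), (1.9, 2.0, 2.2),
    (-2.0, -1.9, -2.1), (-2.1, -2.2, -1.8), (-1.8, -2.0, -2.0),
]
merges = average_linkage_tree(profiles)
for a, b, height in merges:
    print(a, "+", b, f"at height {height:.2f}")
```

The tree itself does not say how many clusters there are; here the large jump in height at the final merge is what suggests two groups.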
Supervised Analysis
Fisher’s linear discriminant analysis
Quadratic discriminant analysis
Logistic regression (a linear discriminant method)
Neural networks
Support vector machine
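As an illustration of the first method in the list, the following is a minimal two-class Fisher linear discriminant restricted to two features so that the matrix inverse can be written out by hand. The sample values, class labels and midpoint threshold are assumptions made for the example.

```python
def fisher_lda(class0, class1):
    """Two-class Fisher discriminant for samples with exactly 2 features."""
    def mean(samples):
        return [sum(s[d] for s in samples) / len(samples) for d in range(2)]

    def scatter(samples, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x in samples:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    # Within-class scatter matrix and its explicit 2x2 inverse.
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Discriminant direction w = Sw^-1 (m1 - m0).
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Threshold at the projection of the midpoint between the class means.
    threshold = sum(w[d] * (m0[d] + m1[d]) / 2 for d in range(2))
    return w, threshold

def predict(w, threshold, x):
    """Return 1 if the projection of x falls on the class1 side, else 0."""
    return int(w[0] * x[0] + w[1] * x[1] > threshold)

# Hypothetical expression of two marker genes in labeled training samples.
normal = [(1.0, 5.0), (1.5, 4.5), (0.8, 5.5), (1.2, 4.8)]
tumor = [(4.0, 1.0), (4.5, 1.6), (3.8, 0.7), (4.2, 1.3)]

w, threshold = fisher_lda(normal, tumor)
print(predict(w, threshold, (4.1, 1.2)))  # classified as tumor (1)
print(predict(w, threshold, (1.0, 5.2)))  # classified as normal (0)
```

With thousands of genes and few samples the within-class scatter matrix is typically singular, so real microarray use needs gene selection or regularisation first.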

Example: Tumor Classification

Reliable and precise classification essential for
successful cancer treatment

Current methods for classifying human malignancies
rely on a variety of morphological, clinical and
molecular variables

Uncertainties in diagnosis remain; likely that existing
classes are heterogeneous

Characterize molecular variations among tumors by
monitoring gene expression (microarray)

Hope: that microarrays will lead to more reliable
tumor classification (and therefore more appropriate
treatments and better outcomes)
Higher Level
Microarray data analysis







Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene
sequence/function/metabolic pathways databases
Discovery of common sequences in co-regulated
genes
Meta-studies using data from multiple experiments
Thanks