No Slide Title

Download Report

Transcript No Slide Title

Analysis of Molecular and Clinical Data at PolyomX
www.polyomx.org
Adrian Driga1, Kathryn Graham1, 2, Sambasivarao Damaraju1, 2, Jennifer Listgarten3, Russ Greiner4, John Mackey1, 2 , Carol Cass1, 2
1PolyomX,
PolyomX has conducted several population-based association
studies on gene expression microarray and SNP type data to
determine the contribution of genes to disease susceptibility and
clinical characteristics. For each study, we have generated datasets
for investigation from the DORA database and have analyzed the
data using software tools that have been developed by PolyomX or
are publicly available to researchers. The data analysis process can
be automated so that researchers can easily generate result reports
for various combinations of parameters and datasets.
We present the analysis process for gene expression microarray
data, focusing on prediction of clinical characteristics and survival
analysis, and describe each step of the process from data
generation to results. We use a real dataset in which the identity of
the breast cancer patients has been anonymized, and we list the
genes that are differentially expressed between the classes defined
by Nuclear Grade and ER status. We include the results generated
using a software package, BRB ArrayTools, available from the
National Cancer Institute (US). We also give details on how one
would interpret the results of the data analysis.
Microarray Data Analysis
Figure 1: Horizontally aligned microarray data for BRB ArrayTools input
Figure 2: Experiment descriptors for BRB ArrayTools input
1. Retrieve gene expression microarray data for specified set of
patients. A web form is used to retrieve microarray data form the
DORA database. In the resulting Excel file (Figure 1), each
column holds gene expression data for a patient and each row
holds gene expression data for a gene. Each column is headed
by a patient ID. Each row is headed by a gene title and a gene ID
(Unigene number). Microarray data stored in the DORA database
is filtered, normalized, and aggregated using algorithms
developed by PolyomX (1). Each gene expression value is the log
base 2 value of the ratio between the intensities of the red and
green channels from the hybridized microarray.
2. Retrieve clinical data for specified set of patients and clinical
characteristics that will be analyzed. A pre-defined query is
used to retrieve clinical data from the DORA database. In the
resulting Excel file (Figure 2), each column holds values for a
clinical characteristic and each row holds the values of all clinical
characteristics for a patient. The first column and row are
headers. The patient IDs must be listed in the same order as in
the gene expression data file. Each clinical characteristic
partitions the patients into two or more classes. The clinical data
file is called the experiment descriptors file. For survival analysis,
the clinical data must include the patient status and the number of
follow-up days since the event of reference.
3. Collate horizontally aligned microarray data and experiment
descriptors in a BRB ArrayTools project. The ‘Data import
wizard’ (Figure 3) of BRB ArrayTools links microarray expression
data and clinical data into a BRB ArrayTools project. Genes are
further filtered based on their expression values and presence in
the patient profile.
Cross Cancer Institute, Alberta Cancer Board, Depts. of 2Oncology and 4Computing Science, University of Alberta, Edmonton,
3Dept. Of Computer Science, University of Toronto, Toronto
5. Interpret the results and rerun analysis with different parameter
values if necessary. For class prediction analysis, the accuracy of a
classifier is considered statistically significant only if the p-value of that
classifier is very small, e.g., p < 0.001. The prediction accuracy of each
classifier is estimated through leave-one-out cross-validation (LOOCV).
The sensitivity and specificity are shown for each classifier and each class.
A classifier is significant if, for one of the classes, both sensitivity and
specificity are high (e.g., greater than 0.7), and sensitivity is very high (e.g.,
greater than 0.9). For survival analysis, the user specifies limits for the
number and the proportion of false discoveries. The output shows where to
cut the gene list to comply with each of the two constraints specified.
Figure 3: BRB ArrayTools: Collate microarray data and experiment descriptors
4. Run class prediction or survival analysis algorithms from BRB
ArrayTools. Run the class prediction analysis to discover a set of
genes that could accurately predict the class to which a patient
belongs given the gene expression profile of the patient. The user
chooses a clinical characteristic to analyze and a threshold
(significance level) that determines whether a gene is differentially
expressed between the classes defined by the clinical characteristic.
The software builds six classifiers based on the expression values of
the differentially expressed genes and tests these classifiers through
cross validation to determine their sensitivity and specificity. The
statistical significance of each classifier’s accuracy is determined
through permutation testing (multivariate analysis). The global p-value
generated for each classifier represents the false discovery rate of
that classifier. Survival analysis identifies genes that are significantly
associated with the survival of the patients. The software computes a
statistical significance level for each gene based on univariate
proportional hazards models (Cox proportional hazards model).
Multivariate permutation test provides 90% confidence that the rate of
false discoveries is less than a specified quantity.
Figure 6: The 3-Nearest Neighbors classifier is the most accurate classifier for ER class
prediction when using the significance level 10-6 and random univariate tests. The
classifier uses 20 differentially expressed genes. A reading of 3+ for ER is considered
Positive, otherwise it is deemed Negative. Acc. = 89%
Analysis Goal: Identify a list of genes that differentiate between patient
classes or are associated with patient survival. The sensitivity and specificity of a
classifier measure how well the classifier built using these genes can predict the
class of a new patient using the expression data of only the genes from the list.
Analysis Directions:
• Ensure that the global p-value of a classifier is small.
The p-value gives the number of times when a classifier built on
randomized data is more accurate than the classifier built on the
original data. A small p-value shows that the sensitivity and
specificity of the classifier are statistically significant.
• Test as many classifiers as possible. An underlying
association between gene expression data and a clinical
characteristic may be predicted by one classifier more accurately
than another. For different significance levels, different classifiers
may provide the highest accuracy. BRB ArrayTools includes 6
classifiers that have been shown to model microarray data
effectively.
• For biologists, a short list of differentially expressed genes is
more useful than a longer list if the most accurate classifier built
from the short lists has sensitivity and specificity values
comparable to those of the classifier built with the longer list. The
short list includes the most significant genes from the longer list.
Figure 7: Survival analysis is based on the number of days from diagnostic to last followup. The first gene is the only gene associated with survival. It is possible that the other 9
genes in the list are false discoveries.
Figure 8: With the same conditions as above, the maximum allowed number of
false-positive genes is set to 1 and the maximum allowed proportion of falsepositives genes is set to 0.01. The glutathione peroxidase 1 is again the only
gene associated with survival. The small number of genes found to be
associated with survival can be explained by the short follow-up time of the
patients in this study: avg. = 746 days, med. = 811 days. Only 8 patients out of
the 117 breast cancer patients studied are deceased.
Figure 4: Complete results for Nuclear Grade class prediction at the significance level 10-4. The
Support Vector Machine classifier has the highest sensitivity and specificity. Its overall accuracy is
74% and its global p-value is less than 5x10-4. The same 12 genes are used to build each
classifier. A reading of 3 for Nuclear Grade is considered High, otherwise it is deemed Low.
Results: We find BRB ArrayTools to be effective in discovering
statistically relevant associations between the microarray data and
the clinical characteristics of the 117 breast cancer patients
included in our study. We found genes that have significant
associations with Nuclear Grade and ER status, and one gene that
is likely to be associated with survival. Figures 4-8 give details on
the results of our analysis.
Figure 5: The Nearest Centroid classifier is the most accurate classifier for Nuclear Grade class
prediction when using the significance level 10-3 and random univariate tests. The classifier uses
29 differentially expressed genes (not shown). Acc. = 76%
Acknowledgements:
• PolyomX is supported by the Alberta Cancer Foundation and the Alberta
Cancer Board
• DORA is the molecular and clinical database developed by PolyomX
• Analyses were performed using BRB ArrayTools 3.2 developed by Dr. Richard
Simon and Amy Peng
• (1) Clinically validated benchmarking of normalization techniques for two color
oligonucleotide spotted microarray slides (2004). J. Listgarten, K. Graham, S.
Damaraju, C. Cass, J. Mackey and B. Zanke. Applied Bioinformatics: Vol 2,
219-228