In this demo the following features in LatentiX will be demonstrated:
• addition of variables from external files via the clipboard
• renaming variables
• deleting variables
• handling category variables
• colouring plots by variables and sets
• creating calibration- and validation-sets (using “Set composer”)
• creating object- and variable sets (using “Create sets”)
• variable selection (Principal Variables)
• making predictions
• plotting the prediction results
• transferring results (tables and plots) to reports
and more ...
The demo data is available from the internet:
The original paper
See also:
http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp
Comments to the paper
There might be some problems with
the experimental design ...
Response to comments
With the MATLAB® Bioinformatics Toolbox® the data are
pre-processed:
From: http://www.mathworks.com/products/bioinfo/demos.html?file=/products/demos/shipping/bioinfo/mspreprodemo.html
Let’s look at the data in LatentiX:
Load the dataset from the “Demo
datasets” menu.
Note: the number of available datasets can vary
The dataset consists of 216 objects and 4000 measured variables.
The category variable is found
in a separate Excel-file, and is
added to the instrumental
data.
Open the Excel-file Ovarian_cancer.xls, mark the range and select Copy
Return to LatentiX and select
“Add variables from clipboard ...”
Because some of the imported data are non-numeric, you can
automatically create a category variable
Change the numeric value
corresponding to “Normal” to
“0” (zero)
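Outside LatentiX, the same idea (turning text labels into a numeric category variable with “Normal” fixed at 0) can be sketched in a few lines of Python. This is only an illustration: the function name `encode_categories`, the label “Cancer” and the four example values are assumptions, not taken from the Excel-file.

```python
def encode_categories(labels, codes=None):
    """Map text labels to integer codes; unseen labels get the next free code."""
    codes = dict(codes or {})
    encoded = []
    for label in labels:
        if label not in codes:
            codes[label] = max(codes.values(), default=-1) + 1
        encoded.append(codes[label])
    return encoded, codes

# Hypothetical clipboard column with one text label per object.
status = ["Cancer", "Normal", "Cancer", "Normal"]

# Pin "Normal" to 0 up front, as done by hand in the demo.
cancer, mapping = encode_categories(status, codes={"Normal": 0})
print(cancer)   # [1, 0, 1, 0]
print(mapping)  # {'Normal': 0, 'Cancer': 1}
```

Pinning “Normal” to 0 before encoding gives the same result as editing the numeric value afterwards, and keeps the coding independent of the order in which the labels appear.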
Give the variable a better name:
Delete the variable “obj. no.”:
NOTE: Due to a bug in LatentiX, you have to import at least two
variables, and then delete the unnecessary variable afterwards.
We now have 216 objects and
4001 variables, the last one
being the category variable
“Cancer”.
It’s a good time to save the data on
the disk – here the data is saved in
the folder “C:\temp\My Latentix
files”
De-select the variable “Cancer” by
holding the Ctrl-button down while
clicking on its name in the variable
list box.
Next autoscale the instrumental
variables.
The plot is immediately updated
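For readers who want to see what autoscaling does to the numbers: each variable is mean-centred and then divided by its standard deviation, so that all variables get equal weight in the model. The sketch below is a minimal Python illustration under the assumption that the sample standard deviation (n - 1) is used; the demo does not state which form LatentiX applies, and the small matrix `X` is made up.

```python
import statistics

def autoscale(rows):
    """Centre each column to mean 0 and scale it to standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # sample std, n - 1 (assumption)
    return [
        # Constant columns (std == 0) are set to 0 instead of dividing by zero.
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Made-up data: 3 objects, 2 variables on very different scales.
X = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
Xs = autoscale(X)
print(Xs)  # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
```

After scaling, both columns carry the same values even though the raw ranges differed by a factor of 20.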
Select PLS as model type
Click on “Y”
Select “Cancer” as the Y-variable
Click “Calculate”
Choose CV: Random (repeated)
as validation method
Give the model a good name: click on “Name”
Let’s have a look at the scores-plot
It’s a good idea to take a note!
Select
“Color according to”
“Continous” and select
“Cancer”
There seems to be some discriminating power in the 4000 variables
We now create two object sets “Healthy” and “Sick”
Go to the workbench and select Objects, Create sets
(shortcut: ALT+O, C):
In “Criteria 1” select “Where Cancer == 0”
click “Create sets” and
give the set a name
“Cancer == 0” is suggested, but change it e.g. to
“Healthy” and click OK
In “Criteria 1” now select “Where Cancer == 1” and
follow the same procedure ...
We look at the scores again now colouring by
the two sets:
Select
“Color according to”
“Sets” and select the two
new sets “Healthy” and
“Sick”
We get - of course - the exact same plot, but the legends
are more meaningful
We have used all 216 objects and all 4000 variables
until now.
To avoid over-fitting when selecting “good” subsets of
variables, we will split the objects into a calibration- and
a validation set.
For that purpose, we use the “Set composer ...”
Go to the workbench and select Objects, Set composer
(shortcut: ALT+O, O):
Sort by “Data value”
Right-click in “Search result”
Select “Cancer”
NOTE: You might have to select “Sort method: Alphabetic” once
and then again “Sort method: Data value” to get the shown picture
Click “Exit”
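The exact selection rule used in the Set composer is not spelled out in this demo. One plausible scheme, shown here purely as an assumption, is to sort the objects by the category variable and send every fourth object to the validation set, so that both classes are represented in both sets. The function name `split_every_kth` and the toy data are hypothetical.

```python
def split_every_kth(values, k=4):
    """Sort object indices by value, then send every k-th one to validation."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # stable sort
    val = [idx for pos, idx in enumerate(order) if pos % k == k - 1]
    val_set = set(val)
    cal = [idx for idx in order if idx not in val_set]
    return cal, val

# Toy category variable for 8 objects (0 = healthy, 1 = sick).
cancer = [1, 0, 1, 0, 0, 1, 1, 0]
cal, val = split_every_kth(cancer)
print(cal)  # [1, 3, 4, 0, 2, 5]
print(val)  # [7, 6]
```

With 216 objects this rule gives 162 calibration and 54 validation objects, which agrees with the size of the set “CAL” used later in the demo.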
Calculate the Principal Variables.
It is very important that this is based
on the calibration set only – beware of
over-fitting!
Select matrix X to enable the menu “Principal Variables” (only available in the full version)
Select the 16 variables, which
are most descriptive for “Cancer”.
These 16 variables describe 90%
of the total variation.
Click “Select in workbench”
and close the window
When the “Principal variables” window is closed,
the 16 important variables are highlighted in the
variables box.
It is convenient to define a set of these variables.
Select:
Variables, Define set ...
in the workbench and write a name, e.g. “PV-16”
Select “Color by Cancer”
Calculate a new PLS-model using only the
calibration set “CAL” (162 objects) and only the
16 principal variables “PV-16”.
Use the same settings (validation method etc.)
as before and give the model a good name.
Choose Plot, Scores to see the scores plot
Select “Options, Lines on selected sets” to emphasise the grouping
The subset of the 16 principal variables gives a better
discrimination between sick and healthy people than did
the 4000 original variables.
A lot of the noisy and irrelevant variables have been
removed.
We will now test the model on the independent objects.
I.e. objects that have not participated in the selection of
variables nor in the PLS-model.
Go to the workbench and make a prediction of the set
“VAL” using the PLS-model based on the set “CAL”:
A clear distinction between sick and healthy is also seen
among the independent validation objects.
Thus, the selected variables are interesting and
may be worth a closer study.
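The demo judges the separation by eye from the prediction plot. A numeric counterpart can be sketched as follows: fit a PLS model on the calibration objects only, predict the validation objects, and assign each to the nearest class. The sketch uses a one-component PLS1 model written in plain Python; the 0.5 cut-off, the function names and the tiny data set are assumptions for illustration and do not reproduce LatentiX output.

```python
def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model (single y) on mean-centred data."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores, and the regression of y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_mean, y_mean, w, q

def predict(model, X_new):
    """Predict y for new objects using the calibration means and weights."""
    x_mean, y_mean, w, q = model
    return [
        y_mean + q * sum((row[j] - x_mean[j]) * w[j] for j in range(len(w)))
        for row in X_new
    ]

# Made-up calibration and validation objects (2 variables each).
X_cal = [[1.0, 0.2], [1.2, 0.1], [3.0, 0.3], [3.2, 0.2]]
y_cal = [0, 0, 1, 1]
X_val = [[0.9, 0.15], [3.1, 0.25]]

model = fit_pls1_one_component(X_cal, y_cal)
y_pred = predict(model, X_val)
# Assumed decision rule: predictions above 0.5 are called "sick" (1).
classes = [1 if value > 0.5 else 0 for value in y_pred]
print(classes)  # [0, 1]
```

Note that the validation objects are centred with the calibration means, so nothing from the validation set leaks into the model.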
You might want to make plots of loadings or regression
coefficients and copy them to the Windows clipboard or
directly to PowerPoint – see the next slides:
Make a Loadings plot and select “Tables, Current plot” (ALT+T, C). Paste into Excel.
[Figure: PLS Loadings [Model 2: PLS on CAL using 16 principal varia ...]. x-axis: Loadings PC#1 (14.016%); y-axis: Loadings PC#2 (15.881%). Labelled points: Cancer, Var #2734, Var #2332, Var #2588, Var #2162, Var #3734, Var #1891, Var #2762, Var #2814, Var #3136, Var #2351, Var #905, Var #2036, Var #3474, Var #1045, Var #3502, Var #919]
Select “Plots, Copy plot to PowerPoint” (ALT+P, O)
or “Plots, Copy plot to Clipboard” (ALT+P, D)
You could also look at the regression coefficients (the plot is pasted into Excel from the clipboard):
THE END
Note that the group Healthy is measured at day #1, whereas the
group Sick is measured at day #2 and #3.
Thus, we cannot tell whether the revealed effects are due to
human disease or to changes in the instrument - this is a problem.
The principles shown in this demo are, however, valid.