Using Random Forests to explore a complex Metabolomic data set
Susan Simmons
Department of Mathematics and Statistics
University of North Carolina Wilmington
Collaborators
• Dr. David Banks (Duke)
• Dr. Jacqueline Hughes-Oliver (NC State)
• Dr. Stan Young (NISS)
• Dr. Young Truong (UNC)
• Dr. Chris Beecher (Metabolon)
• Dr. Xiaodong Lin (SAMSI)
Large data sets
• Examples
  – Walmart
    • 20 million transactions daily
  – AT&T
    • 100 million customers; carries 200 million calls a day on its long-distance network
  – Mobil Oil
    • over 100 terabytes of data from oil exploration
  – Human genome
    • gigabytes of data
  – IRA
Dimensionality
• 3,000 metabolites
• 40,000 genes
• 100,000 chemicals
• Try to find the signal in these data sets (and not the noise)… Data mining
• Examples of data mining techniques: pattern recognition, expert systems, genetic algorithms, neural networks, random forests
Today’s talk
• Focus on classification (supervised learning… use a response to guide the learning process)
• Response is categorical (each observation belongs to a “class”)
• Interested in relationship between variables and the response
• Short, fat data (instead of long, skinny data)
Long, skinny data

 X    Y    Z
 2    8    9
 3    4    4
 7    5   46
 8    7    3
 4   56   35
 6   58   63
12    9    3
14    2   35
24    1   45
 2    7    4
13   78   25
14   56   34
18    6   89
35    8   56
Short, fat data

 X   Y   Z   S   T   V   M   N   R   Q   L   H   G   K   B   C   W
 4  36   5   8  30   4  35   7   3  78   9   3   1  40   2   5  34
 6   7  34   6   7  67   8  89   8   4   2   6   5   9   8  67   3
 7  46   2   4   5   6   7  58   9   7   9  50   4  45   7   8  45
 8   4   5  65  57  57  42   2   7  23   4   6  76   8   0  56  90
The n < p problem: more variables (p) than observations (n).
Random Forests
• Developed by Leo Breiman (Berkeley) and Adele Cutler (Utah State)
• Can handle the n < p problem
• Random forests are comparable in accuracy to support vector machines (a small comparison sketch follows this list)
• Random forests are a combination of tree predictors
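As an illustration of the two claims above (the talk does not specify the software used), here is a minimal sketch using scikit-learn: a random forest and a support vector machine are fit to a simulated short, fat data set (58 observations, 105 variables, 3 classes, the same shape as the metabolomic data discussed later) and their cross-validated accuracies are compared. The simulated data, function names, and parameter values are illustrative assumptions, not the authors' analysis.

    # Hedged sketch: scikit-learn assumed; the simulated data are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # A short, fat data set: n = 58 observations, p = 105 variables, 3 classes.
    X, y = make_classification(n_samples=58, n_features=105, n_informative=10,
                               n_classes=3, n_clusters_per_class=1, random_state=0)

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    svm = SVC(kernel="rbf", gamma="scale")

    print("Random forest CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
    print("SVM CV accuracy:          ", cross_val_score(svm, X, y, cv=5).mean())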
Constructing a tree

Observation   Gender   Height (inches)
1             F        60
2             F        66
3             M        68
4             F        70
5             F        66
6             M        72
7             F        64
8             M        67
Tree for previous data set

All observations: N = 8
  Height ≤ 66: N = 4  (Male = 0, Female = 4)
  Height > 66: N = 4  (Male = 3, Female = 1)
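A minimal sketch (scikit-learn assumed; not part of the original slides) that fits a one-split classification tree to the eight-observation gender/height table above and reproduces the split shown in the tree:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # The eight observations from the "Constructing a tree" table.
    heights = [[60], [66], [68], [70], [66], [72], [64], [67]]
    genders = ["F", "F", "M", "F", "F", "M", "F", "M"]

    tree = DecisionTreeClassifier(max_depth=1, random_state=0)
    tree.fit(heights, genders)

    # The learned threshold falls between 66 and 67 inches:
    # height <= 66 -> 4 females; height > 66 -> 3 males and 1 female.
    print(export_text(tree, feature_names=["height"]))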
Random Forest
• First, the number of trees to be grown must be specified.
• Also, the number of variables randomly selected at each node must be specified (m).
• Each tree is constructed in the following manner:
  1. At each node, randomly select m variables to split on.
  2. The node is split using the best split among the selected variables.
  3. This process is continued until each node has only one observation, or all the observations belong to the same class.
• Do this for each tree in the “forest” (a minimal parameter sketch follows)
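The two choices described above (the number of trees and the number m of variables tried at each node) correspond, in scikit-learn's implementation (used here only as an illustrative stand-in for the software behind the talk), to the n_estimators and max_features arguments; by default, trees are grown until every leaf is pure or contains a single observation.

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(
        n_estimators=500,    # number of trees to grow in the forest
        max_features=10,     # m: variables randomly selected at each node
        min_samples_leaf=1,  # grow until leaves are pure or hold one observation
        random_state=0,
    )
    # forest.fit(X, y) would then grow the 500 trees on a learning set X, y.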
Example: Cereal Data

All observations: N = 70 (40 G, 30 K)
  Calories < 100: (2 G, 15 K)
    Fat < 1: 15 K
    Fat > 1: 2 G
  Calories ≥ 100: (38 G, 15 K)
    Carbo < 12: 15 K
    Carbo ≥ 12: 38 G
Random Forest
• Another important feature is that each tree is created using a bootstrap sample of the learning set.
• Each bootstrap sample contains approximately 2/3 of the data (thus approximately 1/3 of the observations are left out of each tree); a quick simulation of this 2/3 figure appears after the example below.
• Now, we can use the trees that were built without a given observation to get an idea of the error rate (each such tree “votes” on which class that observation belongs to).
• Example

All observations: N = 70 (40 G, 30 K)
  Calories < 100: (2 G, 15 K)
    Fat < 1: 15 K
    Fat > 1: 2 G
  Calories ≥ 100: (38 G, 15 K)
    Carbo < 12: 15 K
    Carbo ≥ 12: 38 G

Observation withheld from creating this tree:
Calories   Fat   Carbo   Mfr
98         2     10      K

This tree sends the withheld cereal down the Calories < 100, Fat > 1 branch and so votes “G”, even though its true manufacturer is K; tallying such votes across all trees for which an observation was out of bag gives the out-of-bag error rate.
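As promised above, a quick simulation (NumPy assumed; not from the original slides) of the “approximately 2/3” figure: a bootstrap sample of size n leaves each observation out with probability (1 - 1/n)^n ≈ e^(-1) ≈ 0.368, so roughly 63% of the distinct observations appear in each sample.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 58          # size of the learning set (as in the metabolomic data)
    reps = 10_000   # number of bootstrap samples to simulate

    in_bag = np.mean([np.unique(rng.integers(0, n, size=n)).size / n
                      for _ in range(reps)])
    print(f"Average fraction of distinct observations per bootstrap sample: {in_bag:.3f}")
    # Prints roughly 0.63, i.e. about 2/3 of the learning set.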
Random Forest
• This gives us an “out-of-bag” error rate
• Random forests also give us an idea of which variables are important for classifying individuals
• They also give information about outliers (a brief sketch of the first two quantities appears below)
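A minimal sketch (scikit-learn and simulated data assumed) of the first two quantities: requesting the out-of-bag error and reading off the per-variable importance scores. The proximity-based outlier measure from Breiman and Cutler's implementation is not shown here.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Simulated stand-in data with the same shape as the metabolomic set.
    X, y = make_classification(n_samples=58, n_features=105, n_informative=10,
                               n_classes=3, n_clusters_per_class=1, random_state=0)

    forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    forest.fit(X, y)

    print("Out-of-bag error rate:", 1.0 - forest.oob_score_)

    # The five variables ranked most important for the classification.
    top5 = np.argsort(forest.feature_importances_)[::-1][:5]
    print("Most important variables (column indices):", top5)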
The era of the “omics” sciences

Just a few of the “omics” sciences
• Genomics
• Transcriptomics
• Proteomics
• Metabolomics
• Phenomics
• Toxicogenomics
• Phylomics
• Foldomics
• Kinomics
• Interactomics
• Behavioromics
• Variomics
• Pharmacogenomics
Functional Genomics
Genomics → Transcriptomics → Proteomics → Metabolomics
Metabolomics
• Metabolites are all the small molecules in a cell (e.g., ATP, sugar, pyruvate, urea)
• 3,000 metabolites in the human body (compared to 35,000 genes and approximately 100,000 proteins)
• Most direct measure of cell physiology
• Uses GC/MS and LC/MS to obtain measurements
Data
• Currently only have GC/MS information
• Missing values are very informative (below detection limits)
• Imputed missing values using uniform random draws between 0 and the minimum observed value for each metabolite (see the sketch after this list)
• 105 metabolites
• 58 individuals (42 “disease 1”, 6 “disease 2”, and 10 “controls”)
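A sketch of the imputation step described above (pandas/NumPy assumed; the column names and demo values are hypothetical): for each metabolite, a missing value, taken to be below the detection limit, is replaced by a uniform random draw between 0 and the smallest observed value of that metabolite.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    def impute_below_detection(df):
        """Replace NaNs in each column with Uniform(0, column minimum) draws."""
        out = df.copy()
        for col in out.columns:
            missing = out[col].isna()
            if missing.any():
                col_min = out[col].min()  # minimum of the observed values
                out.loc[missing, col] = rng.uniform(0.0, col_min, size=missing.sum())
        return out

    # Hypothetical 4-observation, 2-metabolite table with below-detection entries.
    demo = pd.DataFrame({"metab_A": [5.0, np.nan, 3.2, 7.1],
                         "metab_B": [np.nan, 0.8, 1.5, np.nan]})
    print(impute_below_detection(demo))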
Confusion matrix

      1    2    3
 1   40    1    8
 2    0    5    1
 3    2    0    1

OOB error = 20.69% (the 12 off-diagonal observations out of 58; the columns sum to the class sizes 42, 6, and 10)
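The quoted out-of-bag error rate can be checked directly from the matrix above (NumPy assumed): the diagonal holds the correctly classified observations.

    import numpy as np

    cm = np.array([[40, 1, 8],
                   [ 0, 5, 1],
                   [ 2, 0, 1]])
    oob_error = 1 - np.trace(cm) / cm.sum()   # 1 - 46/58 = 12/58
    print(f"{oob_error:.2%}")                 # 20.69%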
Outlier
Variable Importance
Visual Data
• Dostat
Conclusions
• Random forests, support vector machines, and neural networks are some of the newest algorithms for understanding large datasets.
• There is still much more to be done.
Thank you