Standards for SNPs Analysis with
Decision Trees Tools.
Linda Fiaschi
Supervisors:
Jon Garibaldi
Natalio Krasnogor
IMA Seminar 24/02/2009
Outline
• Genetic background and clinical objectives
• Disease : Pre-eclampsia
• Method of analysis
• My Methodology: ADTree, C4.5, ID3
• Results
• Conclusions
• Future Work
Genetics: SNPs
• The DNA of most people is 99.9 percent the
same.
• Single Nucleotide Polymorphisms (SNPs) are
DNA sequence variations that occur when a single
nucleotide (A, T, C, or G) is changed; they occur
approximately once every 100 to 300 bases.
• The resulting different forms of the same gene
are called alleles. People can have two identical
or two different alleles for a particular gene.
Clinical objectives on SNPs
• The majority of SNPs have no effect; others cause subtle differences in
countless characteristics, such as appearance.
• Genetic factors may also confer susceptibility or resistance to a
disease and determine the severity or progression of the disease.
• Genetic factors also affect a person's response to drug therapy.
Disease: Pre-eclampsia
• It occurs during pregnancy and the postpartum
period and affects both the mother and the unborn baby.
• Affecting at least 5-8% of all pregnancies, it is a rapidly progressive
condition characterized by high blood pressure and the presence of
protein in the urine.
• Pre-eclampsia and other hypertensive disorders of pregnancy are
responsible for 76,000 deaths globally each year.
Case-Control Analysis
Case-control studies use patients who already have a disease or
other condition and look back to see if there are characteristics of
these patients that differ from those who don’t have the disease.
Comparison: Cases (sick) vs. Controls (healthy) → Classification rules
Decision Tree Analysis
• One of the most widely used and practical forms of machine
learning and data mining
• It assigns a class to an input pattern through a sequence of tests
• Tests have mutually exclusive and exhaustive outcomes
• Tests are either multivariate or univariate
• Attributes are categorical or numeric
• Trees predict 2 classes (Boolean) or more
ADTree Algorithm
• Alternating decision trees (ADTrees) [5] are a natural generalization of
decision trees
• They are competitive with other boosted decision tree algorithms
• The rules are usually smaller in size and easier to interpret
• In addition to the classification they give a measure of confidence
• Each instance follows a multi-path: the sign of the sum of all the
prediction nodes it reaches gives the classification, and the size of
the sum gives the confidence
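The multi-path scoring described above can be sketched as follows. This is a minimal illustration, not the ADTree learning algorithm: the class names `Prediction` and `Splitter`, the attribute names, and every numeric value in the toy tree are hypothetical and were chosen only to show how the sum is formed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Prediction:
    """Prediction node: a real-valued contribution plus optional splitter children."""
    value: float
    splitters: List["Splitter"] = field(default_factory=list)


@dataclass
class Splitter:
    """Splitter node: a Boolean test with one prediction node per outcome."""
    test: Callable[[Dict[str, float]], bool]
    if_true: Prediction
    if_false: Prediction


def score(node: Prediction, x: Dict[str, float]) -> float:
    """Sum the values of every prediction node the instance reaches."""
    total = node.value
    for splitter in node.splitters:  # all splitters are followed, hence "multi-path"
        child = splitter.if_true if splitter.test(x) else splitter.if_false
        total += score(child, x)
    return total


def classify(root: Prediction, x: Dict[str, float]):
    """Return (class, confidence): class from the sign, confidence from the size."""
    s = score(root, x)
    return (1 if s > 0 else 0), abs(s)


# Hypothetical toy tree: two splitters hang off the root, so each instance
# collects three prediction values (root + one child per splitter).
root = Prediction(0.2, [
    Splitter(lambda x: x["delivery_week"] < 35.5, Prediction(0.7), Prediction(-0.4)),
    Splitter(lambda x: x["systolic"] >= 152.5, Prediction(0.5), Prediction(-0.3)),
])

early_high = classify(root, {"delivery_week": 34.0, "systolic": 160.0})
late_low = classify(root, {"delivery_week": 38.0, "systolic": 120.0})
```

In this toy tree the first instance collects 0.2 + 0.7 + 0.5 and is assigned class 1, while the second collects 0.2 - 0.4 - 0.3 and is assigned class 0.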
ID3 Algorithm
Gain measures how well a given attribute separates training
examples into the targeted classes.
Gain(S, A) = Entropy(S) - Σ((|Sv| / |S|) * Entropy(Sv))
- Σ is over each value v of all possible values of attribute A
- Sv = subset of S for which attribute A has value v
- |Sv| = number of elements in Sv
- |S| = number of elements in S
Entropy(S) = Σ(-p(I) log2 p(I))
- S is a collection of examples with c outcomes (classes)
- Σ is over the c classes I
- p(I) is the proportion of S belonging to class I
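As a worked illustration of the two formulas, the sketch below computes Entropy(S) and Gain(S, A) for a tiny made-up sample. The function names and the attribute names `snp` and `sex` are assumptions introduced only for this example; the data are not from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(S) = sum over classes I of -p(I) * log2 p(I)."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())


def gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    n = len(labels)
    subsets = {}  # value v of the attribute -> class labels of the subset Sv
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(sv) / n * entropy(sv) for sv in subsets.values())
    return entropy(labels) - remainder


# Made-up sample: "snp" separates the classes perfectly, "sex" not at all.
rows = [
    {"snp": 1, "sex": "M"},
    {"snp": 1, "sex": "F"},
    {"snp": 2, "sex": "M"},
    {"snp": 2, "sex": "F"},
]
labels = [1, 1, 0, 0]

gain_snp = gain(rows, labels, "snp")
gain_sex = gain(rows, labels, "sex")
```

ID3 would pick `snp` as the root test here, since it has the larger gain.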
ID3 Algorithm Example
Delivery week
  < 35.5: Liver measures
    < 94: 1(15\4)
    >= 94: Age
      < 36.3: 1(26\2)
      >= 36.3: 0(31\0)
  >= 35.5: Systolic Pressure
    < 152.5: 0(25\0)
    >= 152.5: 1(9\1)
From ID3 to C4.5 Algorithm
• Handling both continuous and discrete attributes
• Handling training data with missing attribute values
• Pruning trees after creation
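One way to picture the handling of a continuous attribute is a search for the binary threshold with the highest information gain. The sketch below is a simplified illustration under stated assumptions, not C4.5 itself: it scores candidates by plain information gain, tries midpoints between consecutive distinct values, and simply skips missing values (`None`). The function name `best_threshold` and the sample values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best split 'value < threshold'.

    Missing values (None) are ignored, which is a simplification.
    """
    pairs = sorted((v, y) for v, y in zip(values, labels) if v is not None)
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical values
        left, right = ys[:i], ys[i:]
        split_gain = base - (len(left) * entropy(left)
                             + len(right) * entropy(right)) / len(ys)
        if split_gain > best[1]:
            best = ((low + high) / 2, split_gain)
    return best


# Made-up delivery weeks with one missing value; class 1 = case, 0 = control.
weeks = [34, None, 35, 36, 37]
status = [1, 0, 1, 0, 0]
threshold, split_gain = best_threshold(weeks, status)
```

With these made-up values the chosen threshold is the midpoint 35.5, which separates the two classes completely.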
Methodology
A progressive analysis: detection of significant results deepened and
confirmed in the subsequent analysis.
Pre-processing of the Data
Data Analysis
Pre-processing
Data Analysis
Statistical Significance
Kappa value: proportion of agreement corrected for chance between
two judges assigning cases to a set of categories
Kappa [8]    Agreement
< 0          No agreement
0.0-0.2      Slight
0.2-0.4      Fair
0.4-0.6      Moderate
0.6-0.8      Substantial
0.8-1.0      Almost perfect
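The kappa value and the agreement scale can be sketched in a few lines. This is a minimal illustration assuming the two "judges" are two lists of labels, for example a classifier's predictions and the true classes; the function names, the sample labels, and the choice to assign boundary values to the lower band are assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa [6]: (observed - expected agreement) / (1 - expected)."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    # Chance agreement: product of the two judges' marginal proportions per category.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


def agreement_label(kappa):
    """Map a kappa value to the agreement scale of the table above."""
    if kappa < 0:
        return "No agreement"
    for upper, label in [(0.2, "Slight"), (0.4, "Fair"),
                         (0.6, "Moderate"), (0.8, "Substantial")]:
        if kappa <= upper:
            return label
    return "Almost perfect"


# Made-up labels: the judges agree on 6 of 8 cases, and chance agreement is 0.5.
predicted = [1, 1, 1, 0, 0, 0, 1, 0]
actual = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(predicted, actual)
```

Here the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5, which falls in the "Moderate" band.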
Experimental Dataset
4529 Patients
Genotype: 52 SNP attributes
• AGT gene: SNPs 1-8, alleles 1 and 2
• AGTR1 gene: SNPs 9-12, alleles 1 and 2
• TNF gene: SNPs 13-16, alleles 1 and 2
• F5 gene: SNP 17, alleles 1 and 2
• NOS3 gene: SNPs 18-22 and 24, alleles 1 and 2
• MTHFR gene: SNPs 25, 26, alleles 1 and 2
• AGTR2 gene: SNP 27
Phenotype: 53 clinical attributes
• 5 individual's identity data
• 34 maternal data: physical and physiological parameters,
pregnancy details and current treatments
• 6 fetal data: weight and gestational age at birth
• 8 medical history data of parents, partners or siblings
Results: Pre-processing I
Babies dataset (372 instances × 58 attributes)
1. Attributes: gestation at birth (day and
week), weight, disease status, live at birth
2. Class: CBC, the birth-weight centile corrected for gestation at birth, baby
sex, ethnicity, mother's height and weight, and number of pregnancies.
50 is normal weight; below 50 is underweight.
3. Missing values: we retain missing values, using the appropriate
encoding for the chosen algorithm.
4. Data balancing: the case-control ratio depends on the CBC threshold
chosen to transform the class from numeric to Boolean.
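The numeric-to-Boolean transformation in step 4 can be sketched as below. The function names and the CBC values are hypothetical, and treating "CBC strictly below the threshold" as a case is an assumption; missing values are kept as `None` rather than dropped, in the spirit of step 3.

```python
def binarize_cbc(cbc_values, threshold):
    """Label each baby 1 (case: CBC below threshold) or 0 (control); None stays None."""
    return [None if v is None else int(v < threshold) for v in cbc_values]


def case_share(labels):
    """Fraction of cases among the instances whose label is known."""
    known = [y for y in labels if y is not None]
    return sum(known) / len(known)


# Made-up centiles, including one missing value.
cbc = [3, 8, 12, 30, None, 55]

labels_at_10 = binarize_cbc(cbc, 10)
labels_at_28 = binarize_cbc(cbc, 28)
share_at_10 = case_share(labels_at_10)
share_at_28 = case_share(labels_at_28)
```

Raising the threshold moves instances from the control group to the case group, which is why the case-control ratio depends on the threshold.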
Data Analysis I
Kappa Analysis:
Results: Data Analysis II
Balancing of the data:
CBC = 6: 147 cases (39.5%) and 225 controls
CBC = 10: 177 cases (47.6%) and 195 controls
CBC = 28: 243 cases (65.3%) and 129 controls
> 33%
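The percentages above follow directly from the counts, as the short check below shows. Reading "> 33%" as "the smaller class keeps more than a third of the 372 instances at every threshold" is an interpretation, not something the slide states; the variable and function names are made up for the example.

```python
# Cases and controls per CBC threshold, copied from the slide.
counts = {6: (147, 225), 10: (177, 195), 28: (243, 129)}


def balance(cases, controls):
    """Return (case share, minority-class share) for one threshold."""
    total = cases + controls
    return cases / total, min(cases, controls) / total


case_pct = {t: round(100 * balance(*c)[0], 1) for t, c in counts.items()}
minority_share = {t: balance(*c)[1] for t, c in counts.items()}
```

Each pair of counts sums to 372, and under this reading the smaller class stays above one third even at the most unbalanced threshold (CBC = 28).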
ADTree results Analysis
Results: Data Analysis III
C4.5 Results Analysis:
Results: Data Analysis IV
Cross Analysis: common attributes between ADTree and C4.5
Results: Data Analysis V
Analysis with common attributes for CBC = 28
(ADTree Kappa = 0.41, C4.5 Kappa = 0.38):
Male babies, born after the 35th week of gestation and with:
AGT SNP3 allele2 = 1 → CBC > 28
AGT SNP3 allele2 = 2 & AGTR1 SNP11 allele2 = 1 → CBC < 28
Analysis with only gestational week and CBC = 10
(Kappa value = 0.42 for both ADTree and C4.5):
Babies delivered before week 35 or 35.5 of gestation are likely to be
underweight (CBC < 10).
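Read as classification rules, the CBC = 28 findings above could be written as a small function. This is only a sketch of how such rules are applied: the dictionary keys are hypothetical field names, "born after the 35th week" is taken to mean a gestational week greater than 35, and the function returns `None` for any baby the rules do not cover.

```python
def predict_cbc_side(baby):
    """Apply the two CBC = 28 rules; return 'above', 'below', or None if no rule fires."""
    # Both rules apply only to male babies born after the 35th week of gestation.
    if baby["sex"] != "M" or baby["gestation_week"] <= 35:
        return None
    if baby["agt_snp3_allele2"] == 1:
        return "above"  # CBC > 28
    if baby["agt_snp3_allele2"] == 2 and baby["agtr1_snp11_allele2"] == 1:
        return "below"  # CBC < 28
    return None


# Made-up instances, one per outcome.
rule_1 = predict_cbc_side({"sex": "M", "gestation_week": 38,
                           "agt_snp3_allele2": 1, "agtr1_snp11_allele2": 2})
rule_2 = predict_cbc_side({"sex": "M", "gestation_week": 38,
                           "agt_snp3_allele2": 2, "agtr1_snp11_allele2": 1})
uncovered = predict_cbc_side({"sex": "F", "gestation_week": 38,
                              "agt_snp3_allele2": 1, "agtr1_snp11_allele2": 1})
```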
Conclusions
• A guideline for data mining in the specific application of case-control
analysis for SNPs.
• Methodological point of view: attributes are discarded and the number of
instances is reduced (screening stage).
• Clinical perspective: significance of the threshold CBC = 10 and
dependency of CBC on the “week of delivery”.
Future Work
• Genotype of the mothers rather than the babies
• Recoding of the SNPs
• Redundant interaction between attributes
• Non-linear interaction between attributes
• Heritable trends that can be detected across the two generations
References
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
[2] N. M. Laird and C. Lange, “Family-based designs in the age of large-scale gene-association studies,” Nature Reviews Genetics, pp. 385–394, 2006.
[3] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, pp. 81–106, 1986.
[4] J. R. Quinlan, “C4.5: Programs for machine learning,” Machine Learning, vol. 16, no. 3, pp. 235–240, 1994.
[5] Y. Freund and L. Mason, “The alternating decision tree learning algorithm,” Proceedings of the Sixteenth International Conference on Machine Learning, pp. 124–133, 1999.
[6] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[7] D. G. Altman, Practical Statistics for Medical Research. Chapman and Hall/CRC Press, 1991.
[8] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, no. 1, pp. 159–174, 1977.