Su, Xiaogang - Optimal Tree/MLE


Tree-Augmented Regression (TAR) Analysis
and Its Applications in Health Science
Xiaogang Su
Data Mining Group
Department of Statistics and Actuarial Science
University of Central Florida
Orlando, FL 32816
Summary
1. Introduction and Main Idea of TAR
1.1 Data Mining
1.2 Recursive Partitioning (Tree-Based Methods)
1.3 Main Idea of TAR - Model Checking in Analysis of Large Data
2. Tree Method for Checking Adequacy of Functional Form
2.1 The Tree Procedure
2.2 The WCGS Example
3. Tree Method for Handling Heteroscedasticity
3.1 The Tree Procedure
4. Discussion
1.1, Introduction - Data Mining

Allocation of Current Statistical Research Efforts:
– Small Data (Statistical Genetics)
– Large Data (Data Mining, KDD)
The area of large data has been somewhat overlooked by statisticians.
– “KDD is among the top 10 technologies that are believed to change the world”
Data mining is an emerging field that integrates statistics, computer science, machine learning, artificial intelligence, etc.
1.1, Introduction - Data Mining

Tools in Data Mining
– (Generalized) linear models, cluster analysis
– Recursive partitioning, neural networks, etc.
Characteristics of Data Mining Tools
– Data Driven (vs. Theory-Driven)
– Computationally Intensive
– Problem Solving
1.2, Introduction – Recursive Partitioning

Recursive partitioning or tree-based methods
– Fit piecewise (constant) models by recursively bisecting the predictor space
– The hierarchical (binary) tree structure automatically and optimally divides data into disjoint groups.
Most Important Advances
– AID by Morgan and Sonquist (1963)
– CART by Breiman, Friedman, Olshen, and Stone (1984)
– MARS by Friedman (1990)
– Bagging (Breiman, 1996), Boosting (Freund and Schapire, 1996; Friedman, 2001), and Random Forest (Breiman, 2001).
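The idea of recursively bisecting the predictor space to fit a piecewise-constant model can be illustrated with a minimal sketch. It assumes a single numeric predictor and uses the within-node sum of squared errors as the splitting criterion (a common CART-style choice, assumed here rather than taken from the slides). The function names best_split and grow, and the stopping parameters max_depth and min_node, are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Find the cutpoint c minimizing SSE(left) + SSE(right) for the split x <= c."""
    pairs = sorted(zip(x, y))
    best_c, best_total = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cutpoint exists between tied x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best_total:
            best_c = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_total = total
    return best_c, best_total


def grow(x, y, depth=0, max_depth=2, min_node=4):
    """Recursively bisect the data; each leaf stores the node mean."""
    c = None
    if depth < max_depth and len(y) >= min_node:
        c, _ = best_split(x, y)
    if c is None:
        return {"mean": sum(y) / len(y), "n": len(y)}
    left = [(a, b) for a, b in zip(x, y) if a <= c]
    right = [(a, b) for a, b in zip(x, y) if a > c]
    return {
        "cut": c,
        "left": grow([a for a, _ in left], [b for _, b in left],
                     depth + 1, max_depth, min_node),
        "right": grow([a for a, _ in right], [b for _, b in right],
                      depth + 1, max_depth, min_node),
    }


# Toy data (hypothetical): a step function with a jump between x = 9 and x = 10.
x = list(range(20))
y = [0.0] * 10 + [5.0] * 10
cut, total = best_split(x, y)
tree = grow(x, y, max_depth=1)
print(cut, total, tree)
```

A full CART procedure would add pruning and cross-validated tree size selection, which this sketch omits.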
1.3, Introduction – TAR


Tree-Augmented Regression (TAR) analysis belongs to the broader scope of hybrid models.
Main Idea of TAR
– Fit “best” (generalized) linear models first, and then use trees as supplemental tools to provide augmentation.
– Starts with and centers around the linear model
Derived Tree Methods
1. Trees for Checking and Amending Deficiencies of Functional Form Specification (Su, Tsai, and Wang, 2003)
– Covariate-Adjusted Classification (Su, 2005)
2. Handling Heteroscedasticity or Over-Dispersion (Su, Tsai, and Yan, 2004)
3. Interaction Trees (Su, Yan, and Ji, 2005)
1.3, Assumptions

Four major assumptions involved in linear model fitting
– Linearity or functional form specification
– Homoscedasticity
– Normality
– Independence of data
1.3, Model Diagnostics
Conventional Diagnostic Tools and Their Limitations
1. Graphical Plots
– Hard to draw decisive conclusions
– May result in a solid black display when working with large data
2. Numerical Methods, e.g., grouping residuals & chi-squared test
– Sensitive to the number of groups and how the grouping is formed
– Statistical significance testing is not appealing for large data
– Do not offer valuable clues about how to make amends
2, Tree-Structured Checking for Adequacy in Functional Form Specification
2.1 Comparison of Two Approaches

Approach I: Residual-Based
– Intuitive and easy to implement
– Computationally faster
– Fails in situations where the same variables are involved in both the linear regression part and the tree part.
– Reason: bias in estimating the betas caused by underfitting or misspecification!

Approach II
– An efficient way of evaluating the splitting statistics is available
– Better performance
– Computationally manageable
– Recommended!
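The residual-based idea of Approach I can be sketched in two steps: fit the linear model, then search for a split of its residuals. The example below is a minimal illustration under stated assumptions, not the procedure from the cited papers. It uses one linear predictor x1, a separate split variable z, and SSE reduction as the splitting statistic, and all names (ols_line, best_residual_split) are hypothetical. The toy data keep the jump in z uncorrelated with x1, which is the favorable case; as noted above, the approach can fail when the same variable enters both parts. Approach II is not sketched because the slides do not give its splitting statistic.

```python
def ols_line(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_residual_split(z, resid):
    """Cutpoint on z giving the largest SSE reduction in the residuals."""
    pairs = sorted(zip(z, resid))
    r_sorted = [r for _, r in pairs]
    total = sse(r_sorted)
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cutpoint between tied z values
        gain = total - sse(r_sorted[:i]) - sse(r_sorted[i:])
        if gain > best_gain:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_cut, best_gain


# Toy data (hypothetical): y is linear in x1 plus a jump of 3 when z >= 10.
z = list(range(20))
x1 = [i % 10 for i in range(20)]
y = [1 + 2 * a + (3 if b >= 10 else 0) for a, b in zip(x1, z)]

# Step 1: fit the linear part only.
intercept, slope = ols_line(x1, y)
resid = [yi - intercept - slope * a for yi, a in zip(y, x1)]

# Step 2: look for structure left in the residuals.
cut, gain = best_residual_split(z, resid)
print(slope, cut, gain)
```

Here the split recovered from the residuals points to the threshold in z that the linear fit missed, which is the kind of clue for amending the model that the tree is meant to provide.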
2.2 Example – Cancer Mortality in WCGS Study




The Western Collaborative Group Study (WCGS) is a prospective cardiovascular epidemiological study.
In 1960 through 1961, 3,154 healthy middle-aged white males, who were free of coronary heart disease (CHD) and cancer, were drawn from 10 large California corporations and followed up for various health outcomes such as incidence of CHD, cancer, and death (Ragland and Brand, 1988).
After a 33-year follow-up through 1993, 405 deaths were known to be due to cancer; 927 participants died of unrecorded causes prior to 1993.
Outcome of interest: time to cancer death (censored survival times); the data set analyzed has 2047 observations and 9 covariates.
2.2 Example – Variable Description
2.2 WCGS – hostlty and behpatt

Two covariates, hostlty and behpatt, are particularly worth noting.

behpatt - indicator of Type A pattern behavior. Type A pattern behavior is characterized by traits such as impatience, aggressiveness, a sense of time urgency, and the desire to achieve recognition and advancement. People exhibiting Type A behavior seem to find themselves in various high-pressure scenarios.

One major subcomponent of Type A pattern behavior is hostility, denoted by hostlty. The “hostility” component is characterized by a tendency to react to unpleasant situations with responses that reflect anger, frustration, irritation, and disgust. See, e.g., Schneiderman et al. (1989) for further discussion.
2.2 WCGS – The “best” Cox PH Model

The Cox Proportional Hazards Model built by Carmelli,
Zhang, and Swan (1997)
2.2 WCGS – The Augmentation Tree Structure
2.2 WCGS – Tree-Augmented Cox PH Model

The Tree-Augmented Cox Proportional Hazards Model (Su &
Tsai, 2005)
2.2 WCGS – An Alternative Final Model
3.1 The Tree Procedure



Again, the final tree can be found following the standard practice of CART. In particular, the split-complexity pruning algorithm of LeBlanc and Crowley (1993) can be adopted to prune trees.
An iterative weighted least squares method can be used to fit the TV model (4).
A computationally more efficient tree procedure is available.
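The iterative weighted least squares idea mentioned above can be sketched as follows. Model (4) itself does not appear in this transcript, so the sketch assumes a simple stand-in: a linear mean in one predictor and an error variance that is constant within each of two groups (as a tree split would define them). The function names wls_line and fit_groupwise_variance, the group labels, and the fixed iteration count are all hypothetical.

```python
def wls_line(x, y, w):
    """Weighted simple linear regression; returns (intercept, slope)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_groupwise_variance(x, y, group, n_iter=20):
    """Alternate between a weighted mean fit and per-group variance estimates."""
    weights = [1.0] * len(x)  # start from ordinary least squares
    for _ in range(n_iter):
        intercept, slope = wls_line(x, y, weights)
        sq_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
        variance = {}
        for g in set(group):
            vals = [r for r, gi in zip(sq_resid, group) if gi == g]
            # Floor the estimate to avoid division by zero in the weights.
            variance[g] = max(sum(vals) / len(vals), 1e-12)
        weights = [1.0 / variance[g] for g in group]
    return intercept, slope, variance


# Toy data (hypothetical): same line in both groups, noisier errors in group 1.
x = list(range(20))
group = [0] * 10 + [1] * 10
noise = [(0.1 if g == 0 else 2.0) * (1 if i % 2 == 0 else -1)
         for i, g in enumerate(group)]
y = [1 + 2 * xi + e for xi, e in zip(x, noise)]

intercept, slope, variance = fit_groupwise_variance(x, y, group)
print(round(slope, 3), variance)
```

Down-weighting the high-variance group lets the low-variance observations dominate the mean fit, while the group variances themselves describe the heteroscedasticity.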
4, Discussion & Future Research




The augmentation tree structures provide covariate-adjusted classification rules, which can be useful in medical prognosis/diagnosis and screening.
The methods of MARS, bagging, and boosting are useful in augmentation trees.
Extensions:
– Heteroscedasticity is embodied as over-dispersion in generalized linear models and frailty in survival analysis.
– Interaction Trees: subgroup or cross-sectional analysis in clinical trials and drug discovery.
Thank you!