#### Transcript Su, Xiaogang - Optimal Tree/MLE

Tree-Augmented Regression (TAR) Analysis and Its Applications in Health Science

Xiaogang Su
Data Mining Group, Department of Statistics and Actuarial Science
University of Central Florida, Orlando, FL 32816
[email protected]

Summary

1. Introduction and Main Idea of TAR
   - 1.1 Data Mining
   - 1.2 Recursive Partitioning (Tree-Based Methods)
   - 1.3 Main Idea of TAR - Model Checking in Analysis of Large Data
2. Tree Method for Checking Adequacy of Functional Form
   - 2.1 The Tree Procedure
   - 2.2 The WCGS Example
3. Tree Method for Handling Heteroscedasticity
   - 3.1 The Tree Procedure
4. Discussion

1.1, Introduction - Data Mining

Allocation of Current Statistical Research Efforts:
- Small Data (Statistical Genetics)
- Large Data (Data Mining, KDD)
- The area of large data has somehow been overlooked by statisticians.

“KDD is among the top 10 technologies that are believed to change the world”

Data mining is an emerging field that integrates statistics, computer science, machine learning, artificial intelligence, etc.

1.1, Introduction - Data Mining

Tools in Data Mining
- (Generalized) linear models, cluster analysis
- Recursive partitioning, neural networks, etc.

Characteristics of Data Mining Tools
- Data-Driven (vs. Theory-Driven)
- Computationally Intensive
- Problem Solving

1.2, Introduction – Recursive Partitioning

Recursive partitioning or tree-based methods
- Fit piecewise (constant) models by recursively bisecting the predictor space
- The hierarchical (binary) tree structure automatically and optimally divides data into disjoint groups.

Most Important Advances
- AID by Morgan and Sonquist (1963)
- CART by Breiman, Friedman, Olshen, and Stone (1984)
- MARS by Friedman (1990)
- Bagging (Breiman, 1996), Boosting (Freund and Schapire, 1996; Friedman, 2001), and Random Forest (Breiman, 2001)

1.3, Introduction – TAR

Tree-Augmented Regression (TAR) belongs to the broader scope of hybrid models.
Main Idea of TAR
- Fit the “best” (generalized) linear model first, and then use trees as a supplemental tool to provide augmentation.
- Starts with and centers around the linear model

Derived Tree Methods
1. Trees for Checking and Amending Deficiencies of Functional Form Specification (Su, Tsai, and Wang, 2003)
   - Covariate-Adjusted Classification (Su, 2005)
2. Handling Heteroscedasticity or Over-Dispersion (Su, Tsai, and Yan, 2004)
3. Interaction Trees (Su, Yan, and Ji, 2005)

1.3, Assumptions

Four major assumptions involved in linear model fitting
- Linearity or functional form specification
- Homoscedasticity
- Normality
- Independence of data

1.3, Model Diagnostics

Conventional Diagnostic Tools and Their Limitations
1. Graphical Plots
   - Hard to draw decisive conclusions from
   - May result in a solid black display when working with large data
2. Numerical Methods, e.g. grouping residuals and applying a chi-squared test
   - Sensitive to the number of groups and how the grouping is formed
   - Statistical significance testing is not appealing for large data
   - Do not offer valuable clues about how to make amends

2 Tree-Structured Checking for Adequacy in Functional Form Specification

2.1 Comparison of Two Approaches

Approach I: Residual-Based
- Intuitive and easy to implement
- Computationally faster
- Fails in situations where the same variables are involved in both the linear regression part and the tree part.
- Reason: bias in estimating the betas caused by underfitting or misspecification!
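The residual-based idea of Approach I can be illustrated with a minimal sketch. This is not the speaker's implementation: the simulated data, the variable names `x1` and `x2`, the helper `best_split`, and the `min_leaf` parameter are all assumptions made for illustration. It fits an ordinary least squares model first, then searches for the single binary split of the residuals that most reduces the within-node sum of squared errors. The tree variable differs from the linear-model variable here, which is the setting in which the residual-based approach is expected to work.

```python
import numpy as np


def best_split(x, r, min_leaf=20):
    """Return (threshold, sse) for the single binary split on x that
    minimizes the total within-node sum of squared errors of r."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    n = len(rs)
    csum = np.cumsum(rs)        # running sum of residuals
    csum2 = np.cumsum(rs ** 2)  # running sum of squared residuals
    best_thr, best_sse = None, np.inf
    for i in range(min_leaf, n - min_leaf + 1):
        # Left node: first i sorted observations; right node: the rest.
        if xs[i - 1] == xs[i]:
            continue  # cannot split between tied values
        sse_left = csum2[i - 1] - csum[i - 1] ** 2 / i
        sse_right = (csum2[-1] - csum2[i - 1]) - (csum[-1] - csum[i - 1]) ** 2 / (n - i)
        sse = sse_left + sse_right
        if sse < best_sse:
            best_thr, best_sse = (xs[i - 1] + xs[i]) / 2, sse
    return best_thr, best_sse


# Simulated data (hypothetical): linear effect of x1 plus a step in x2
# that the linear model below does not include.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.uniform(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * (x2 > 0.5) + rng.normal(size=n)

# Step 1: fit the linear model on x1 only and compute residuals.
X = np.column_stack([np.ones(n), x1])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Step 2: look for structure left in the residuals with one tree split on x2.
thr, sse = best_split(x2, resid)
print("estimated split point on x2:", thr)
```

A full tree would apply `best_split` recursively within each child node and then prune; only the first split is shown to keep the sketch short.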
Approach II
- An efficient way of evaluating the splitting statistics is available
- Better performance
- Computationally manageable
- Recommended!

2.2 Example – Cancer Mortality in WCGS Study

The Western Collaborative Group Study (WCGS) is a prospective cardiovascular epidemiological study. In 1960 through 1961, 3,154 healthy middle-aged white males, who were free of coronary heart disease (CHD) and cancer, were drawn from 10 large California corporations and followed up for various health outcomes such as incidence of CHD, cancer, and death (Ragland and Brand, 1988).

After a 33-year follow-up through 1993, 405 deaths were known to be due to cancer. 927 participants died of unrecorded causes prior to 1993.

Outcome of Interest: cancer death time (censored survival times), together with 2047 observations and 9 covariates

2.2 Example – Variable Description

2.2 WCGS – hostlty and behpatt

Two covariates, hostlty and behpatt, are particularly worth noting.
- behpatt: indicator of Type A behavior pattern. Type A behavior is characterized by traits such as impatience, aggressiveness, a sense of time urgency, and the desire to achieve recognition and advancement. People exhibiting Type A behavior seem to find themselves in various high-pressure scenarios.
- hostlty: hostility, one major subcomponent of Type A behavior. Hostility is characterized by a tendency to react to unpleasant situations with responses that reflect anger, frustration, irritation, and disgust. See, e.g., Schneiderman et al. (1989) for more discussion.

2.2 WCGS – The “best” Cox PH Model

The Cox Proportional Hazards Model built by Carmelli, Zhang, and Swan (1997)

2.2 WCGS – The Augmentation Tree Structure

2.2 WCGS – Tree-Augmented Cox PH Model

The Tree-Augmented Cox Proportional Hazards Model (Su & Tsai, 2005)

2.2 WCGS – An Alternative Final Model

3.1 The Tree Procedure

Again, the final tree can be found following the standard practice of CART.
In particular, LeBlanc and Crowley's (1993) split-complexity pruning algorithm can be adopted to truncate trees.

An iterative weighted least squares method can be used to fit the TV model (4).

A computationally more efficient tree procedure is available.

4, Discussion & Future Research

- The augmentation tree structures provide covariate-adjusted classification rules, which can be useful in medical prognosis, diagnosis, and screening.
- The methods of MARS, bagging, and boosting are useful in augmentation trees.

Extensions:
- Heteroscedasticity is embodied as over-dispersion in generalized linear models and as frailty in survival analysis.
- Interaction Trees: subgroup or cross-sectional analysis in clinical trials and drug discovery.

Thank you!