Transcript 4, Section 4.6 - La Salle University

Data Mining – Algorithms:
Linear Models
Chapter 4, Section 4.6
Numeric Attributes
• Numeric prediction and/or numeric attributes as
predictors
• Linear regression is a well-established statistical
technique
– Designed to predict numeric value based on numeric
attributes
– Determines optimal set of coefficients for linear equation:
• pred = w0 + w1a1 + w2a2 + … + wnan
– Optimal means the sum of squared prediction errors is minimized
– For data mining, this would be done on training data so that it
can be tested on test data
– I hope that a CSC major could read a statistics book and then
write the code to do this
– However, there is no need to write it yourself, since this method is so
widely available, unless you are seeking to create an improved version (a
minimal least-squares sketch follows below)
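Not from the slides, but to make the idea concrete: a minimal sketch of the least-squares fit in Python with NumPy, using made-up attribute values and a made-up numeric target.

```python
# Minimal sketch: fitting the linear equation pred = w0 + w1*a1 + ... + wn*an
# by least squares with NumPy. The values below are invented for illustration.
import numpy as np

# Each row is a training instance [a1, a2]; y is the numeric value to predict.
A = np.array([[2.0, 1.0],
              [3.5, 0.0],
              [1.0, 4.0],
              [5.0, 2.5]])
y = np.array([3.1, 2.9, 4.8, 6.0])

# Prepend a column of 1s so w0 (the intercept) is learned along with w1..wn.
X = np.hstack([np.ones((A.shape[0], 1)), A])

# Least squares: choose w to minimize the sum of squared prediction errors.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction for a new instance [a1, a2]; the leading 1.0 is the intercept term.
new_instance = np.array([1.0, 2.0, 3.0])
print("weights:", w, "prediction:", new_instance @ w)
```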
Example
• <Show Basketball Spreadsheet – Baskball sheet>
• NOTE – input values, weights, prediction vs
actual
• <Show testReg sheet> – test on separate instances
• NOTE – how it did – prediction vs actual –
difference, correlation (see the evaluation sketch below)
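A minimal sketch, with invented held-out values, of the kind of test-set check described above: compare predictions to actual values via their differences and the correlation coefficient.

```python
# Sketch of evaluating numeric predictions on a test set: differences
# (error measures) and correlation. The values below are invented.
import numpy as np

actual    = np.array([4.2, 3.0, 5.5, 2.1, 6.3])
predicted = np.array([4.0, 3.4, 5.1, 2.8, 5.9])

errors = predicted - actual
mae  = np.mean(np.abs(errors))               # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))         # root mean squared error
corr = np.corrcoef(actual, predicted)[0, 1]  # correlation coefficient

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  correlation={corr:.3f}")
```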
Using Regression for Classification
• Perform regression for each class
• Set the output to be predicted = 1 for training
instances that belong to the class
• Set output to be predicted = 0 for training
instances that do NOT belong to the class
• Do this for each class, and you will have a
“membership function” equation for each class
• On test, plug the new instance into each equation,
and the highest value produced will be the
prediction to make (see the sketch below)
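A small sketch of the 0/1 membership-function idea above, on invented data: one least-squares regression per class, then pick the class whose equation produces the highest value.

```python
# One regression per class: target is 1 for members of the class, 0 otherwise.
# At prediction time, the class with the largest output wins. Data is invented.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 5.0],
              [5.0, 4.0]])
labels = np.array(["low", "low", "medium", "high", "high"])

Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # add intercept column
classes = ["low", "medium", "high"]

# One set of weights per class, fit by least squares on the 0/1 targets.
weights = {}
for c in classes:
    target = (labels == c).astype(float)        # 1 if instance belongs to c, else 0
    weights[c], *_ = np.linalg.lstsq(Xb, target, rcond=None)

# Classify a new instance: plug it into each equation, take the largest output.
new_instance = np.array([1.0, 2.5, 2.0])        # leading 1.0 = intercept term
scores = {c: new_instance @ w for c, w in weights.items()}
print(scores, "->", max(scores, key=scores.get))
```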
Example
• <Show discretized sheet>
• NOTE – prep of data – into low, medium, high (a
discretization sketch follows below)
• NOTE – weights for 3 regressions: high, med, low
• <Show Test sheet>
• NOTE – calcs for Hi, Med, Low
• (doesn’t do that well; suspect that the data may not be
from the same source (NBA), and that the discretization
was a bit of a problem (very few low))
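For illustration only, a tiny sketch of the discretization step; the cutoff values here are placeholders, not the ones used in the spreadsheet.

```python
# Sketch of discretizing the numeric target (e.g. points per minute) into
# low / medium / high bins. Cutoffs are arbitrary placeholders.
import numpy as np

points_per_minute = np.array([0.21, 0.45, 0.38, 0.62, 0.30, 0.55])
cutoffs = [0.3, 0.5]   # assumed bin boundaries, for illustration only

labels = np.array(["low", "medium", "high"])[np.digitize(points_per_minute, cutoffs)]
print(labels)   # ['low' 'medium' 'medium' 'high' 'medium' 'high']
```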
More sophisticated
• Do as many pairwise competitions as necessary
• Training – two classes against each other:
– temporarily toss training instances that are not one of the two
– Set output = 1 for the class to be predicted and –1 for the other
• Test – do all pairwise competitions, winner of each gets a
vote
– E.g. say:
– Medium beats High
– Medium beats Low
– High beats Low
– Medium wins 2-1-0 (Medium 2 votes, High 1, Low 0)
• A conservative approach would be to predict nothing if no
class dominates the voting (a pairwise-voting sketch follows below)
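A sketch of the pairwise scheme on invented data: one +1/–1 regression per pair of classes, trained only on instances of those two classes, with a vote for the winner of each competition at test time.

```python
# Pairwise (one-vs-one) classification by linear regression with voting.
# Data and class names are invented for illustration.
import numpy as np
from itertools import combinations

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],
              [4.0, 5.0], [5.0, 4.0], [2.5, 2.5]])
labels = np.array(["low", "low", "medium", "high", "high", "medium"])
classes = ["low", "medium", "high"]

Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # intercept column

# Fit one weight vector per unordered pair, ignoring instances of other classes.
pair_weights = {}
for c1, c2 in combinations(classes, 2):
    mask = (labels == c1) | (labels == c2)
    target = np.where(labels[mask] == c1, 1.0, -1.0)   # +1 for c1, -1 for c2
    pair_weights[(c1, c2)], *_ = np.linalg.lstsq(Xb[mask], target, rcond=None)

# Test: run every pairwise competition; the winner of each gets one vote.
new_instance = np.array([1.0, 3.0, 3.5])
votes = {c: 0 for c in classes}
for (c1, c2), w in pair_weights.items():
    votes[c1 if new_instance @ w > 0 else c2] += 1

print(votes, "->", max(votes, key=votes.get))
```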
In Context
• Has been used for decades for various applications (e.g.
social science research)
• Bias – only searches for linear equations – no squares,
cubes, etc.
• To work well, the data must fit a linear model – e.g. for
classification, the classes must be “linearly separable” –
able to be divided by a line (in 2D; a plane in 3D, a
hyperplane in higher dimensions)
• To work well, attributes should not be highly correlated
with each other
• Depends on numeric attributes
Let’s Look at WEKA
• Linear Regression with Basketball data
• No correctness measures for numeric prediction – instead:
– Correlation
– Error measures
• Discretize Points per minute
– Try logistic regression – a categorical prediction
approach (a rough sketch follows below)
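The demo uses WEKA's Logistic classifier; as a rough Python analogue (an assumption about tooling, not what the lecture runs), scikit-learn's LogisticRegression does the same kind of categorical prediction on a discretized target. The data below is invented.

```python
# Logistic regression on a discretized (categorical) target, as a stand-in
# for WEKA's Logistic classifier. Class labels and values are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],
              [4.0, 5.0], [5.0, 4.0], [2.5, 2.5]])
y = np.array(["low", "low", "medium", "high", "high", "medium"])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([[3.0, 3.5]]))          # predicted class label
print(clf.predict_proba([[3.0, 3.5]]))    # class membership probabilities
```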
End Section 4.6