Transcript Document

NY housing data revisited: weighted kNN, k-means, clustering, some plotting, Bayes
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 6b, March 6, 2015
Plot tools / tips
pairs, gpairs, scatterplot.matrix, clustergram, ggmap, geocoord, etc.
data()  # lists the built-in datasets
# e.g., precip, presidents, swiss, sunspot.month, environmental, ethanol, ionosphere
More script fragments in Lab6b_*_2015.R on the web site (escience.rpi.edu/data/DA)
Resetting plot space:
par(mfrow=c(1,1))
par()$mar # to view margins
par(mai=c(0.1,0.1,0.1,0.1))
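For example, a quick scatterplot matrix on one of the built-in datasets above (pairs() is base R; the choice of dataset is just illustrative):
data(swiss)
pairs(swiss, main = "swiss data")  # scatterplot matrix of every pair of variables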
Do over…
• Get Bronx.R from the escience script location
• Look through the pdf linked off the website for this week (6): Data_Analytics2015_week5.pdf ~ is the week 5b lab, but working with new libraries (thanks to Jiaju and James) – skip page 1…
• A lesson in R…
Plotting clusters (DIY)
library(cluster)
clusplot(mapmeans, mapobj$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Centroid plot against the first two discriminant functions
library(fpc)  # provides plotcluster and cluster.stats
plotcluster(mapmeans, mapobj$cluster)
• dendrogram?
• cluster.stats
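mapmeans and mapobj come from the Bronx lab script; as a self-contained stand-in (a minimal sketch using iris in place of the lab data):
library(cluster)  # clusplot
library(fpc)      # plotcluster
mapmeans <- scale(iris[, -5])             # stand-in for the lab's data matrix
mapobj   <- kmeans(mapmeans, centers = 3) # stand-in for the lab's k-means fit
clusplot(mapmeans, mapobj$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
plotcluster(mapmeans, mapobj$cluster)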
Comparing cluster fits
Use help.
> help(plotcluster)
> help(cluster.stats)
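For instance, cluster.stats() can compare two clusterings of the same data; a minimal sketch (the kmeans/hclust pairing on iris is illustrative):
library(fpc)
X  <- scale(iris[, -5])
d  <- dist(X)
km <- kmeans(X, centers = 3)
hc <- cutree(hclust(d), k = 3)
cs <- cluster.stats(d, km$cluster, alt.clustering = hc)
cs$avg.silwidth     # average silhouette width of the k-means fit
cs$corrected.rand   # corrected Rand index between the two fits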
Using a contingency table
> library(e1071)  # provides naiveBayes
> data(Titanic)
> mdl <- naiveBayes(Survived ~ ., data = Titanic)
> mdl

Naive Bayes Classifier for Discrete Predictors

Call: naiveBayes.formula(formula = Survived ~ ., data = Titanic)

A-priori probabilities:
Survived
      No      Yes
0.676965 0.323035

Conditional probabilities:
        Class
Survived        1st        2nd        3rd       Crew
     No  0.08187919 0.11208054 0.35436242 0.45167785
     Yes 0.28551336 0.16596343 0.25035162 0.29817159

        Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560

        Age
Survived      Child      Adult
     No  0.03489933 0.96510067
     Yes 0.08016878 0.91983122

Try Lab6b_9_2014.R
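To apply the fitted model, predict over the factor combinations in the table (this mirrors the e1071 help-page example):
> predict(mdl, as.data.frame(Titanic))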
Classification: Bayes
• Retrieve the abalone.csv dataset
• Goal: predicting the age of abalone from physical measurements.
• Perform naiveBayes classification to get predictors for Age (Rings) and interpret; see the sketch below.
• Compare to what you got from kknn (weighted k-nearest neighbors) in class 4b
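A minimal sketch, assuming abalone.csv carries the UCI columns; binning Rings into three age classes is an illustrative choice, not part of the assignment:
library(e1071)
abalone <- read.csv("abalone.csv")
# Bin the Rings count into age classes (the cut points are an assumption):
abalone$Age <- cut(abalone$Rings, breaks = c(0, 8, 10, Inf),
                   labels = c("young", "adult", "old"))
mdl  <- naiveBayes(Age ~ ., data = abalone[, names(abalone) != "Rings"])
pred <- predict(mdl, abalone)
table(pred, abalone$Age)  # confusion matrix to interpret the fit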
http://www.ugrad.stat.ubc.ca/R/library/mlbench/html/HouseVotes84.html
library(e1071)   # naiveBayes
require(mlbench)
data(HouseVotes84)
model <- naiveBayes(Class ~ ., data = HouseVotes84)
predict(model, HouseVotes84[1:10, -1])
predict(model, HouseVotes84[1:10, -1], type = "raw")
pred <- predict(model, HouseVotes84[, -1])
table(pred, HouseVotes84$Class)
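naiveBayes also takes a laplace argument for Laplace smoothing; a quick variant to compare against the table above:
pred2 <- predict(naiveBayes(Class ~ ., data = HouseVotes84, laplace = 3),
                 HouseVotes84[, -1])
table(pred2, HouseVotes84$Class)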
Hair, eye color
> data(HairEyeColor)
> mosaicplot(HairEyeColor)
> margin.table(HairEyeColor,3)
Sex
Male Female
279 313
> margin.table(HairEyeColor,c(1,3))
Sex
Hair Male Female
Black 56 52
Brown 143 143
Red 34 37
Blond 46 81
Construct a naïve Bayes classifier and test it!
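One possible sketch (naiveBayes accepts a contingency table through the formula interface, as with Titanic above; predicting Sex from Hair and Eye is just one choice of class variable):
> library(e1071)
> m <- naiveBayes(Sex ~ ., data = HairEyeColor)
> m
> newdat <- expand.grid(Hair = dimnames(HairEyeColor)$Hair, Eye = dimnames(HairEyeColor)$Eye)
> cbind(newdat, Sex = predict(m, newdat))  # test on every Hair/Eye combination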
Cluster plotting
source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # source_https(), from github
require(RCurl)
require(colorspace)
source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # scaling
clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width - adjust according to Y-scale
Any good?
set.seed(500)
Data2 <- scale(iris[,-5])
par(cex.lab = 1.2, cex.main = .7)
par(mfrow = c(3,2))
for (i in 1:6) clustergram(Data2, k.range = 2:8, line.width = .004, add.center.points = TRUE)
Scale? We'll look at multidimensional scaling (MDS) soon.
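As a preview, classical MDS is in base R as cmdscale(); a minimal sketch on the same scaled iris data:
par(mfrow = c(1, 1))                 # reset the plotting grid first
mds <- cmdscale(dist(Data2), k = 2)  # classical MDS down to 2 dimensions
plot(mds, col = iris$Species, pch = 19, main = "cmdscale on scaled iris")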
How can you tell it is good?
set.seed(250)
# Three well-separated 100-point Gaussian groups in 3-D, centered at 0, 1 and 2:
Data <- rbind(
  cbind(rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3)),
  cbind(rnorm(100, 1, sd = 0.3), rnorm(100, 1, sd = 0.3), rnorm(100, 1, sd = 0.3)),
  cbind(rnorm(100, 2, sd = 0.3), rnorm(100, 2, sd = 0.3), rnorm(100, 2, sd = 0.3)))
clustergram(Data, k.range = 2:5, line.width = .004, add.center.points = TRUE)
More complex…
set.seed(250)
# Four 100-point Gaussian groups in 4-D; groups 2 and 3 differ only in the third coordinate:
Data <- rbind(
  cbind(rnorm(100, 1, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3)),
  cbind(rnorm(100, 0, sd = 0.3), rnorm(100, 1, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3)),
  cbind(rnorm(100, 0, sd = 0.3), rnorm(100, 1, sd = 0.3), rnorm(100, 1, sd = 0.3), rnorm(100, 0, sd = 0.3)),
  cbind(rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 0, sd = 0.3), rnorm(100, 1, sd = 0.3)))
clustergram(Data, k.range = 2:8, line.width = .004, add.center.points = TRUE)
Hierarchical clustering
> dswiss <- dist(as.matrix(swiss))
> hs <- hclust(dswiss)
> plot(hs)
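To get flat clusters from the tree, cutree() and rect.hclust() are the usual follow-ups (k = 3 here is illustrative):
> groups <- cutree(hs, k = 3)            # assign each observation to one of 3 clusters
> table(groups)
> rect.hclust(hs, k = 3, border = "red") # outline the clusters on the dendrogram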
Another example
> A = c(1, 2.5); B = c(5, 10); C = c(23, 34)
> D = c(45, 47); E = c(4, 17); F = c(18, 4)
> df <- data.frame(rbind(A, B, C, D, E, F))
> colnames(df) <- c("x", "y")
> hc <- hclust(dist(df))
> plot(hc)
> df$cluster <- cutree(hc, k = 2)  # 2 clusters
> plot(y ~ x, df, col = cluster)
See also
• Lab5a_ctree_1_2015.R
– Try clustergram instead
– Try hclust
• Lab3b_kmeans1_2015.R
– Try clustergram instead
– Try hclust
Assignment 6 preview
• Your term projects should fall within the scope of a data analytics problem of the type you have worked with in class/labs, or know of yourself – the bigger the data the better. This means that the work must go beyond just making lots of figures. You should develop the project to show that you are thinking about and exploring the relationships and distributions within your data. Start with a hypothesis, think of a way to model and use the hypothesis, find or collect the necessary data, and do preliminary analysis, detailed modeling, and a summary (interpretation).
– Note: You do not have to come up with a positive result; disproving the hypothesis is just as good. Please use the section numbering below for your written submission for this assignment.
• Introduction (2%)
• Data Description (3%)
• Analysis (8%)
• Model Development (8%)
• Conclusions and Discussion (4%)
• Oral presentation (5%) (10 mins)
Assignments to come
• Term project (A6). Due early May. 30% (25% written, 5% oral; individual). Available before spring break.
• Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 10. 15% (written; individual).