An Introduction to R

Free software for repeatable statistics, visualisation and modeling
Dr Andy Pryke,
The Data Mine Ltd
Dr Andy Pryke - The Data Mine Ltd
Outline
1. Overview
What is R?
When to use R?
Wot no GUI?
Help and Support
2. Examples
Simple Commands
Statistics
Graphics
Modeling and Mining
SQL Database Interface
3. Going Forward
Relevant Libraries
Online Courses etc.
What is R?
• Open source, well supported, command line
driven, statistics package
• 100s of extra “packages” available free
• Large number of users - particularly in
bio-informatics and social science
• Good design - R is an implementation of “S”, for
which John Chambers received the 1998 ACM
Software System Award: Dr. Chambers' work “will
forever alter the way people analyze, visualize,
and manipulate data…”
When Should I Use R?
• To do a full cycle of:
– data import
– data pre-processing
– exploratory statistics and graphics
– modeling and data mining
– report production
– integration into other systems
• Or any one of these steps, e.g. just to
standardise pre-processing of data
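The first four steps of the cycle above might look roughly like this in base R. This is only a sketch: the CSV file is a temporary file written from the built-in iris data so that the example is self-contained, and the names `csv_path`, `flowers` and `Petal.Area` are illustrative assumptions, not part of the original slides.

```r
# Hypothetical input: write the built-in iris data to a temporary CSV
# so that there is something to import.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# 1. Data import
flowers <- read.csv(csv_path, stringsAsFactors = TRUE)

# 2. Pre-processing: drop incomplete rows and add a derived column
flowers <- na.omit(flowers)
flowers$Petal.Area <- flowers$Petal.Length * flowers$Petal.Width

# 3. Exploratory statistics
print(summary(flowers))
print(aggregate(Petal.Area ~ Species, data = flowers, FUN = mean))

# 4. Modeling
fit <- lm(Petal.Length ~ Sepal.Length, data = flowers)
print(coef(fit))
```

Because every step is a line in a script, the whole cycle can be re-run unchanged when the input data changes.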
Wot no GUI?
or “The Advantages of Scripting”
• Repeatable
• Debug-able
• Documentable
• Build on previous work
• Automation
– Report generation
– Website or system integration
– Links from Perl, Python, Java, C, TCP/IP…
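As one possible illustration of automated report generation, the sketch below writes a plain-text summary of a data frame to a file. The helper `write_report` and the temporary output path are hypothetical names introduced for this example only.

```r
# Hypothetical helper: write a plain-text summary report for a data frame.
write_report <- function(data, path) {
  lines <- c(
    paste("Report generated:", format(Sys.Date())),
    paste("Rows:", nrow(data), " Columns:", ncol(data)),
    "",
    capture.output(summary(data))   # the printed summary, one string per line
  )
  writeLines(lines, path)
  invisible(path)
}

# Generate the report for the built-in iris data and show it
report_path <- write_report(iris, tempfile(fileext = ".txt"))
cat(readLines(report_path), sep = "\n")
```

Saved as a script (say, a hypothetical `report.R`), this could be run unattended with `Rscript report.R`, for example from a scheduled job.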
Help and Support
Built in help/example system (e.g. type “?plot”)
Many tutorials available free
R-Help mailing list
- Archived online
- Key R developers respond
- Contributors understand statistical concepts
Large User Community
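A few ways into the built-in help system, as a short sketch. The topics used here ("plot", "mean", names starting with "cor") are just examples.

```r
# Open the help page for a function (equivalent to typing ?plot)
help("plot")

# Run the examples that ship with a help page
example("mean")

# Find function names matching a pattern when the exact name is unknown
hits <- apropos("^cor")
print(hits)
```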
Simple Commands
> 1+1
[1] 2
> 10*3
[1] 30
> c(1,2,3)
[1] 1 2 3
> c(1,2,3)*10
[1] 10 20 30
> x <- 5
> x*x
[1] 25
> exp(1)
[1] 2.718282
> q()
Save workspace image? [y/n/c]: n
Simple Statistics
> colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
> plot(iris$Sepal.Length, iris$Petal.Length)
> # Pearson Correlation
> cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
> # Spearman Correlation
> cor(rank(iris$Sepal.Length), rank(iris$Petal.Length))
[1] 0.8818981
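The Spearman correlation above is computed by hand as the Pearson correlation of the ranks. As a small check, the sketch below compares that with the `method = "spearman"` argument of `cor()`, which should give the same value. The variable names are illustrative.

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Spearman correlation "by hand": Pearson correlation of the ranks
rho_manual <- cor(rank(x), rank(y))

# The same statistic using the built-in method argument
rho_builtin <- cor(x, y, method = "spearman")

print(c(manual = rho_manual, builtin = rho_builtin))
print(all.equal(rho_manual, rho_builtin))
```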
Graphics
[Figure: scatterplot matrix "Edgar Anderson's Iris Data" (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)]
[Figure: mosaic plot of Hair, Eye and Sex, shaded by Pearson residuals]
[Figure: pie chart "January Pie Sales" (Cherry, Blueberry, Apple, Vanilla Cream, Other, Boston Cream)]
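Plots similar to the three figures on this slide can be drawn with base R graphics. This is a sketch rather than the original slide code: the colours are arbitrary, the mosaic plot uses base `mosaicplot()` (the slide's version may have come from another package), and the pie sales proportions are illustrative values assumed for the example.

```r
# Scatterplot matrix of the four iris measurements, coloured by species
pairs(iris[1:4],
      main = "Edgar Anderson's Iris Data",
      pch = 21,
      bg = c("red", "green3", "blue")[unclass(iris$Species)])

# Mosaic plot of the built-in HairEyeColor table (Hair x Eye x Sex),
# with cells shaded by their residuals
mosaicplot(HairEyeColor, shade = TRUE, main = "Hair, Eye and Sex")

# Pie chart; the proportions below are assumed, illustrative values
pie_sales <- c(Blueberry = 0.12, Cherry = 0.30, Apple = 0.26,
               "Boston Cream" = 0.16, Other = 0.04, "Vanilla Cream" = 0.12)
pie(pie_sales, main = "January Pie Sales")
```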
Linear Models
## Scatterplot of Sepal and Petal Length
plot(iris$Sepal.Length, iris$Petal.Length)
## Make a Model of Petals in terms of Sepals
irisModel <- lm(iris$Petal.Length ~ iris$Sepal.Length)
## plot the model as a line
abline(irisModel)
[Figure: scatterplot of iris$Petal.Length against iris$Sepal.Length with the fitted line]
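The same model can also be written with the `data =` argument, which makes it easier to inspect the fit and to predict for new values. A minimal sketch follows; the names `fit` and `new_flowers` and the new sepal lengths are made up for illustration.

```r
# Fit petal length as a linear function of sepal length
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Intercept and slope of the fitted line
print(coef(fit))

# Proportion of variance explained; for a one-predictor model this is
# the squared Pearson correlation
r2 <- summary(fit)$r.squared
print(r2)

# Predict petal length for some hypothetical sepal lengths
new_flowers <- data.frame(Sepal.Length = c(5, 6, 7))
print(predict(fit, newdata = new_flowers))
```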
Classification Trees
# ctree() comes from the "party" package
library("party")
# Model Species
irisct <- ctree(Species ~ . , data = iris)
# Show the model tree
plot(irisct)
# Compare predictions
table(predict(irisct), iris$Species)
[Figure: conditional inference tree splitting on Petal.Length (1.9), Petal.Width (1.7) and Petal.Length (4.8), with terminal nodes 2 (n = 50), 5 (n = 46), 6 (n = 8) and 7 (n = 46)]
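The table above compares predictions against the same data the tree was trained on. A common variation is to hold some rows back for testing, sketched below. The 100/50 split and the seed are arbitrary choices for this example, and the code assumes the "party" package is installed (it skips the tree otherwise).

```r
set.seed(42)                              # make the random split repeatable
train_rows <- sample(nrow(iris), 100)     # 100 rows for training
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]              # remaining 50 rows for testing

if (requireNamespace("party", quietly = TRUE)) {
  library(party)
  tree <- ctree(Species ~ ., data = train)
  predicted <- predict(tree, newdata = test)
  # Confusion table and accuracy on rows the tree has not seen
  print(table(predicted, actual = test$Species))
  cat("Hold-out accuracy:", mean(predicted == test$Species), "\n")
} else {
  message("Package 'party' is not installed; skipping the tree example.")
}
```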
SQL Interface
Connect to databases with ODBC
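library("RODBC")
# Open a connection to the ODBC data source "PostgreSQL30w"
channel <- odbcConnect("PostgreSQL30w", case="postgresql")
# Write the iris data frame to a database table
sqlSave(channel, iris, tablename="iris")
# Read the table back with an SQL query
myIris <- sqlQuery(channel, "select * from iris")
# Close the connection when finished
odbcClose(channel)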
Data Mining Libraries (i)
randomForest
– Random forests - Robust prediction
party
– Conditional inference trees - Statistically principled
– Model-based partitioning - Advanced regression
– cforest - Random forests built from ctrees
e1071
– Naïve Bayes, Support Vector Machines, Fuzzy
Clustering and more...
Data Mining Libraries (ii)
nnet
– Feed-forward Neural Networks
– Multinomial Log-Linear Models
BayesTree
– Bayesian Additive Regression Trees
gafit & rgenoud
– Genetic Algorithm based optimisation
varSelRF
– Variable selection using random forests
Data Mining Libraries (iii)
arules
– Association Rules (links to ‘C’ code)
RWeka
– Access to the many data mining algorithms found
in the open source package “Weka”
dprep
– Data pre-processing
– You can easily write your own functions too.
Bioconductor
– Multiple packages for analysis of genomic (and
biological) data
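On the point above that you can easily write your own functions: a pre-processing step can be a few lines of base R. The sketch below rescales every numeric column to the range 0 to 1. The function name `rescale01` and the variable names are hypothetical, chosen for this example.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Apply it to every numeric column of iris, leaving Species untouched
iris_scaled <- iris
num_cols <- sapply(iris, is.numeric)
iris_scaled[num_cols] <- lapply(iris[num_cols], rescale01)

print(summary(iris_scaled))
```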
Sources of Further Information
Download these slides + the examples
& find links to online courses in R here:
http://www.andypryke.com/pub/R
Editors which Link to R
• Rgui (not really a GUI)
• Emacs (with “ESS” mode)
• Rcmdr
• Tinn-R
• jgr - Ja
• SciViews
• and more...