Transcript Powerpoint

R PROGRAMMING FOR SQL DEVELOPERS
Kiran Math
Developer
Work @ : Proterra in Greenville SC
MOTIVATION
Tidy Data
Raw Sensor Data
GOAL
ZILLOW
Workflow diagram (@hadleywickham): Get & Tidy, Transform, Viz, Model
VASCO DA GAMA BRIDGE - LISBON
IN PORTUGAL
Question : What is the probability of
having seventeen or more vehicles
crossing the bridge in a particular
minute?
Pipeline: Raw Data, Processing Script, Tidy Data, R Code
Data on Web (CSV Format)
Read CSV from the Web into R
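A minimal sketch of the "read CSV from the web" step. The URL in the comment is a placeholder, not the talk's actual data source, and the inline CSV (column names and values) is made up so the snippet runs offline.

```r
# Reading from the web would look like this (hypothetical URL):
# traffic <- read.csv("https://example.com/bridge_traffic.csv")

# Offline stand-in: the same function reading inline CSV text.
csv_text <- "minute,vehicles
1,11
2,14
3,12"
traffic <- read.csv(text = csv_text)

str(traffic)            # column names and types
mean(traffic$vehicles)  # average vehicles per minute in this toy sample
```

read.csv() returns a data frame, so the result can go straight into the tidy, transform and plot steps.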
Data Visualization (R Code)
Packages used : tidyr, ggplot2, base plot
Data Communication
Code Repository (GitHub), Blog
Data Manipulation and Analysis (R Code)
Average Vehicles per min : 12
Data Model - Poisson distribution
ppois(16, lambda=12, lower=FALSE)   # upper tail
Answer : 0.10129
The probability of having seventeen or more vehicles crossing the bridge in a particular minute is 10.1%
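The upper-tail call can be cross-checked two other ways. This is a sketch using only base R: "seventeen or more" is the complement of "sixteen or fewer", so all three values below should agree. The variable names are illustrative.

```r
lambda <- 12   # average vehicles per minute, from the slide

# P(X >= 17) = 1 - P(X <= 16)
p_upper  <- ppois(16, lambda = lambda, lower.tail = FALSE)  # upper tail directly
p_compl  <- 1 - ppois(16, lambda = lambda)                  # complement of the CDF
p_summed <- 1 - sum(dpois(0:16, lambda = lambda))           # summing the probabilities

round(c(p_upper, p_compl, p_summed), 5)
```

The slide's lower=FALSE works because R partially matches it to the full argument name lower.tail.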
INSTALLATION
Comprehensive R Archive Network (CRAN) : https://www.cran.r-project.org/
RStudio : https://www.rstudio.com/

University of Auckland
ROBERT GENTLEMAN - ROSS IHAKA
R <- CORE && R <- PACKAGES
Base
Packages : sqldf, ggplot2, tidyr, RODBC, reshape2, stringr, dplyr, lubridate

FEATURES OF R

Runs on almost any standard computing platform/OS (even on the PlayStation 3)

Frequent releases (annual + bug fix releases); active development.

Quite lean, as far as software goes; functionality is divided into modular packages

Graphics capabilities very sophisticated and better than most stat packages.

Very active and vibrant user community; R-help and R-devel mailing lists and Stack Overflow

DRAWBACKS OF R

Essentially based on 40-year-old technology.

Objects must generally be stored in physical memory.
BASICS 1 - VECTOR
# Define a Variable
a <- 25
## call it
a
## [1] 25
# Do something to it
a + 10
## [1] 35
# Create a vector - Numeric
x <- c(0.5, 0.6, 0.7)
# Call a Variable
x
## [1] 0.5 0.6 0.7
# Do something to the vector
mean(x)
## [1] 0.6
BASICS 2 - MATRIX
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout.
> A = matrix(
    c(1, 2, 3, 4, 5, 6),   # the data elements
    nrow=2,                # number of rows
    ncol=3,                # number of columns
    byrow = TRUE)          # fill matrix by rows
> A                        # print the matrix
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
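A short sketch of what can be done with the matrix A once it exists, using base R only. The second matrix B is an illustration that is not on the slide: it shows the default column-wise fill when byrow is left out.

```r
A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)

dim(A)      # number of rows and columns
A[2, 3]     # element in row 2, column 3
A[, 2]      # the whole second column
t(A)        # transpose: a 3 x 2 matrix

# Without byrow = TRUE, R fills column by column (the default).
B <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
B[1, ]      # first row of the column-filled matrix
```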
BASICS 3 – CONTROL STRUCTURES
# If Statements
x <- 10
y <- if (x > 75) 'Pass' else 'Fail'
## Get the value of variable
y
## [1] "Fail"
## For loops
for (index in 1:3)
{
  print(index)
}
BASICS 4 - FUNCTIONS
Functions are blocks of code that allow R to be modular and facilitate code reuse
Funct_name <- function (arg1, arg2, ...){
  ### do something
}
## Compute the mean of the vector of numbers
meanX <- function(a_vector)
{
  s <- sum(a_vector)
  l <- length(a_vector)
  m <- s/l
  return(m)
}
### create a vector
v <- c(1,2,3,4,5)
### Find the mean
meanX(v)
## [1] 3
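A possible variation on meanX, sketched to show default argument values and the implicit return of the last expression. The name mean_x and the na_rm argument are illustrative and not from the talk.

```r
# Same idea as meanX, with an optional (hypothetical) na_rm argument.
mean_x <- function(a_vector, na_rm = FALSE) {
  if (na_rm) {
    a_vector <- a_vector[!is.na(a_vector)]   # drop missing values first
  }
  sum(a_vector) / length(a_vector)           # last expression is the return value
}

v <- c(1, 2, 3, 4, 5)
mean_x(v)                        # agrees with the built-in mean(v)
mean_x(c(v, NA))                 # NA: a missing value propagates through sum()
mean_x(c(v, NA), na_rm = TRUE)   # NA is dropped before averaging
```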
DATA FRAME
A data frame is used for storing data tables.
To retrieve data in a cell, we would enter its row and column coordinates in the single square bracket "[]" operator.
mtcars[1, 2]
[1] 6
mtcars["Mazda RX4", "cyl"]
[1] 6
Preview data frame

head(mtcars)

tail(mtcars)

View(mtcars)
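For readers coming from SQL, the same bracket operator covers WHERE, SELECT and ORDER BY. The sketch below uses base R and the built-in mtcars data. The SQL in the comments is only an analogy, and the variable name six is illustrative.

```r
# SELECT mpg, cyl FROM mtcars WHERE cyl = 6 ORDER BY mpg DESC
six <- mtcars[mtcars$cyl == 6, c("mpg", "cyl")]   # rows by condition, columns by name
six <- six[order(-six$mpg), ]                     # sort by mpg, descending

nrow(six)          # how many six-cylinder cars
rownames(six)[1]   # the six-cylinder car with the highest mpg

# SELECT cyl, AVG(mpg) FROM mtcars GROUP BY cyl
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```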
BASICS 6 - PLOTS
# Make a very simple plot
# Define Vectors
x <- c(1, 3, 6, 9, 12)
y <- c(1.5, 2, 7, 8, 15)
plot(x, y,
     xlab="x axis", ylab="y axis",
     main="my plot",
     ylim=c(0,20), xlim=c(0,20),
     pch=15, col="blue")
# add some more points to the graph
x2 <- c(0.5, 3, 5, 8, 12)
y2 <- c(0.8, 1, 2, 4, 6)
points(x2, y2, pch=16, col="green")
HOME SALE
I have home sales data for the neighborhood in a SQL Server database.
Question :
I have a 3000 sq ft house; how much will it sell for?
REGRESSION MODEL
Demo : Predict sale price of the house that is 3000 sq ft
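The demo itself is not in the transcript, so this is only a sketch of how such a prediction might look with base R's lm(). The data frame is synthetic and stands in for the SQL Server table; its column names sqft and price are assumptions, and the prices lie on an exact line so the result can be checked by hand.

```r
# Synthetic, illustrative data only (not the talk's home sales data):
# price = 50000 + 100 * sqft, so a 3000 sq ft house should come out at 350000.
sales <- data.frame(sqft = c(1000, 1500, 2000, 2500, 3500))
sales$price <- 50000 + 100 * sales$sqft

fit <- lm(price ~ sqft, data = sales)   # simple linear regression
coef(fit)                               # intercept and slope

new_home  <- data.frame(sqft = 3000)    # the house we want to price
predicted <- predict(fit, newdata = new_home)
predicted
```

With real sales data the points would scatter around the fitted line, and predict(fit, newdata = new_home, interval = "prediction") would add a range around the estimate.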
MANAGING DATA FRAMES WITH DPLYR
The dplyr package provides simple functions that can be chained together to easily and quickly manipulate data
install.packages("dplyr")
library(dplyr)
Verbs
1. filter – select a subset of the rows of a data frame
2. arrange – works similarly to filter, except that instead of filtering or selecting rows, it reorders them
3. select – select columns of a data frame
4. mutate – add new columns to a data frame that are functions of existing columns
5. summarize – summarize values
6. group_by – describe how to break a data frame into groups of rows
DEMO : DPLYR
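The demo code is not part of the transcript. As a stand-in, here is a small sketch that chains all six verbs on the built-in mtcars data. It assumes dplyr is installed, and the column names wt_lb, n_cars and avg_mpg are made up for the example.

```r
library(dplyr)

result <- mtcars %>%
  filter(mpg > 20) %>%                      # WHERE mpg > 20
  select(mpg, cyl, wt) %>%                  # SELECT mpg, cyl, wt
  mutate(wt_lb = wt * 1000) %>%             # new column: wt is in 1000 lbs
  group_by(cyl) %>%                         # GROUP BY cyl
  summarize(n_cars  = n(),                  # COUNT(*)
            avg_mpg = mean(mpg)) %>%        # AVG(mpg)
  arrange(desc(avg_mpg))                    # ORDER BY avg_mpg DESC

result
```

Each verb takes a data frame and returns a data frame, which is why the steps can be chained with %>% much like clauses in a SQL query.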
VISUALIZING DATA FRAMES WITH GGPLOT2
Grammar of Graphics
The ggplot2 package provides two workhorse functions for plotting
1. qplot()
2. ggplot()
install.packages("ggplot2")
library(ggplot2)
Building Blocks
1. Data Frame
2. Aesthetics – how data is mapped to color and size ~ aes()
3. Geoms – Geometric objects to be drawn, such as points, lines, bars, polygons and text.
4. Facets – Panels used in conditional plots
5. Stats – statistical transformations ~ binning, quantiles, smoothing
6. Scales – the coding an aesthetic map uses, like male = blue and female = red
7. Coordinate System
DEMO : GGPLOT2
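The demo code is not in the transcript, so the sketch below is an assumed example that touches each building block once, using ggplot() on the built-in mtcars data. It assumes ggplot2 is installed. The colors and labels are arbitrary choices. Recent ggplot2 releases deprecate qplot(), so only ggplot() is shown.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                       # data frame + aesthetics
  geom_point(aes(color = factor(cyl)), size = 3) +                # geom: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +       # stat: linear smoothing
  facet_wrap(~ am) +                                              # facets: one panel per am value
  scale_color_manual(values = c("4" = "blue",                     # scale: value-to-color coding
                                "6" = "darkgreen",
                                "8" = "red")) +
  coord_cartesian(xlim = c(1, 6)) +                               # coordinate system
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

print(p)   # draw the plot
```

Layers are added with +, so any single block (for example the facets) can be removed or swapped without touching the rest.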
THANK YOU