Transcript Lecture 1

Spatial Statistics and Spatial Knowledge
Discovery
First law of geography [Tobler]: Everything is related to everything else, but
near things are more related than distant things.
Drowning in Data yet Starving for Knowledge [Naisbitt -Rogers]
Lecture 1 : Introduction to R
Pat Browne
Introduction to programming in R
• R is a computer language and environment that
allows users to program algorithms and use prewritten packages. R is a free software
environment for statistical computing and
graphics (including mapping).
• There are special R-packages for handling and
analyzing spatial data. For example, the sp
package provides classes and methods for
points, lines, polygons, and grids.
• R can extract spatial data from PostgreSQL.
Also, R can be combined with SQL using PL/R.
Installing R
• R for Windows can be downloaded from
http://ftp.heanet.ie/mirrors/cran.r-project.org/bin/windows/base/R-2.14.1-win.exe
• See Lab1.doc for installation details.
Starting R
• We will look at the main features of R, see
lab1.doc for more details. This lecture also
presents an introduction to programming.
• The basic components of current languages are:
– Data types e.g. Integers, String, Polygon.
– Variables to refer to data types e.g. a <- 2
– Operations on those data types e.g. area(polygon)
– Control structures e.g. sequence, iteration, and
conditions.
– Logic is an important part of programming, but it is
often implicit and external to the language. Some
languages like SQL are quite close to logic.
Starting R: Programs consist of
Data, Operations etc.
• The basic components of current languages are:
– Concrete data types e.g. Integer, String, Polygon.
– Variables to refer to data types e.g. a <- 2
– Operations on those data types e.g. area(polygon)
– Control structures e.g. sequence, iteration, and
conditions.
– Logic is an important part of programming, but it is
often implicit and external to the language. Some
languages like SQL are quite close to logic.
Starting R: Variables
• Variables provide a means of accessing the data
stored in computer memory. R provides a
number of specialized data structures or objects
(also called data types). These objects are
referenced in your programs using variables.
Store: a <- 2      Access: a
Store: b <- "Pat"  Access: b
• These statements assign the number 2 to the variable a and the
string "Pat" to the variable b.
Starting R: Data types
• A data type represents a constraint placed upon the
interpretation of data in a type system, describing
representation, interpretation, legal operations and
structure of values.
• Data types are a way to limit the kind of data that can be
used by a particular program or stored in a database
table. Types restrict the data to a certain set of values
(e.g. 1,2,3,..for Integers).
• Data types also restrict the operations that can be applied to
values of the type (e.g. addition for Integers). R comes with a
range of standard data types that can be used to
represent strings, integers, real numbers, and dates, but
R also has types that are especially suited to statistics
such as vectors and tables.
Starting R: Data types
The c() function combines its arguments into a vector.
In R the term mode is used to describe data types. There are 4 basic types or
modes: numeric, character, complex, and logical. These can be combined to
form collections or what are called objects in R.
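As a rough illustration of the four modes, the sketch below asks R for the mode of one value of each kind. The variable names (num, chr, cpx, lgl, mixed) are made up for this example. It also shows that c() coerces mixed arguments to a single mode.

```r
# One value of each basic mode (names are illustrative only)
num <- 3.14      # numeric
chr <- "Pat"     # character
cpx <- 2+3i      # complex
lgl <- TRUE      # logical

modes <- c(mode(num), mode(chr), mode(cpx), mode(lgl))
print(modes)

# A vector holds one mode only, so mixing values forces a coercion
mixed <- c(1, "a", TRUE)
print(mode(mixed))   # all elements have become character strings
```

A vector can hold only one mode, which is why data frames (introduced later) are needed for columns of different types.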
Starting R: Data types (Objects)
Starting R: Finding data types
Starting R: Data types
• Numbers: 1, 1.4
• Strings: "ABC" or "abc"
• Vectors
• Arrays: vectors plus a dimension vector (dim)
• Factors: for nominal & ordered categorical data
• Data Frames: matrix-like for data of different types
• Tables
• One Way Tables
• Two Way Tables
Starting R: Data types- Numbers
> a <- 3
> b <- sqrt(a*a+3)
• List of the defined variables/objects:
> ls()
• We can add 1 to every element of a vector
> a <- c(1,2,3,4,5)
> a+1
• We can get the mean, variance, and standard deviation
from a vector of numbers
> mean(a)
> var(a)
> sd(a)
Starting R: Data types- Strings
> a <- "hello"
> a
[1] "hello"
> b <- c("hello","there")
> b[1]
> b[2]
Starting R: Data types-Vector
• R operates on named data structures. The simplest such
structure is the numeric vector, which is a single entity
consisting of an ordered collection of numbers. To set up
a vector named x use the R command
> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
> x[2]
• Variable assignment is written as <- in R. The
above assignment uses the function c() which can take
an arbitrary number of vector arguments and whose
value is a vector obtained by concatenating its arguments end
to end.
• A number occurring by itself in an expression is taken as
a vector of length one.
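The following sketch reuses the vector x from the slide to show a few common ways of indexing, and how c() concatenates vectors end to end. The names second, first_three, no_first and y are introduced only for illustration.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

second      <- x[2]     # a single element; indexing starts at 1
first_three <- x[1:3]   # a range of elements
no_first    <- x[-1]    # a negative index drops that element

# c() accepts vectors as well as single numbers
y <- c(x, 0, x)
print(length(y))

# A number on its own is a vector of length one
print(length(5))
```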
Starting R: Data types-Arrays
• Arrays are vectors plus the dim attribute
(dimension vector); matrices are arrays
with a dim attribute of length 2. Arrays are
stored in column-major order.
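A small sketch of what "vector plus dim" and "column-major" mean in practice. The names v and a3 are illustrative. Setting dim on a plain vector turns it into a matrix, and the values fill the first column before the second.

```r
v <- 1:6
dim(v) <- c(2, 3)    # v is now a matrix with 2 rows and 3 columns
print(v)

# Column-major: 1 and 2 fill column 1, then 3 and 4 fill column 2, and so on
print(v[2, 1])
print(v[1, 2])

# A three-dimensional array: 4 slices, each 2 x 3
a3 <- array(1:24, dim = c(2, 3, 4))
print(dim(a3))
print(a3[1, 1, 2])   # first element of the second slice
```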
Starting R: Data types-Tables
> x=c("Yes","No","No","Yes","Yes")
> table(x)
x
 No Yes
  2   3
Types of Categorical data
• Nominal: Mutually exclusive categories:
male/female, dead/alive, smoker/non-smoker,
bus/car/train. Tends to be unordered or have no
logical hierarchy
• Ordinal: Can be ranked in a meaningful order.
Distance between values is not relevant as there
is no distance information: race positions (1st,
2nd, 3rd), grouped amounts (1-5, 6-10, 11-15
per day). Unlike nominal data, ordinal data can
be compared against each other.
Starting R: Data types- Factor
• When looking at the impact of carbon
dioxide (CO2) on the growth rate of a tree
you might try to observe how different
trees grow when exposed to different
preset concentrations of CO2. The different
levels are often called categories or
factors. CO2 is measured in parts per
million by volume (ppmv). Levels could be
L1: 0-3, L2:3-6, L3:6-9, L4:9-12 ppmv
(ignoring double inclusion of boundaries).
Starting R: Data types- Factor
• Categorical data is often used to classify data
into various levels or factors. For example,
smoking data could be a factor in a broader
survey on health issues. R has a special class
for working with factors, and its functions adapt their
behaviour when they are given a factor.
> x=c("Yes","No","No","Yes","Yes")
> factor(x)
[1] Yes No No Yes Yes
Levels: No Yes
Starting R: Data types- Factor
• We will assume that your data files are stored in
C:\My-R-Dir\
• Load in the file trees91.csv.
• tree <- read.csv(file="C:\\My-R-Dir\\trees91.csv",header=TRUE,sep=",")
• The summary operation prints out the possible
values and the frequency with which they occur. Find the
summary of the chamber identification label
(CHBR)
• summary(tree$CHBR)
Starting R: Data types- Factor
• summary(tree$CHBR)
• Note that for a numeric column the output of the summary
operation includes quartiles. A quartile is one of
three points (including the median) that
divide a data set into four equal groups,
each containing a quarter of the
sampled values.
Starting R: Data types- Factor
• A nominal value is represented as a factor
in R. The factor stores the nominal values
as a vector of integers in the range [
1... k ]
– where k is the number of unique values in the
nominal variable e.g. female=1, male=2,
• and an internal vector of character strings
(the original values) mapped to these
integers.
Starting R: Data types- Factor
• Consider variable gender with 20 male entries
and 30 female
• gender <- c(rep("male",20), rep("female", 30))
• gender <- factor(gender)
• Stores gender as 20 1s and 30 2s, where
1=female, 2=male internally (alphabetically)
• R now treats gender as a nominal variable
• summary(gender)
• What does rep() do? How would you find out?
• Type ?rep or help(rep) into R and see.
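To see the internal coding described above, the sketch below builds the gender factor and looks at its levels and integer codes. The variable name codes is introduced only for this example.

```r
gender <- factor(c(rep("male", 20), rep("female", 30)))

# Levels are sorted alphabetically, so "female" comes first
print(levels(gender))

# as.integer() exposes the internal integer codes
codes <- as.integer(gender)
print(table(codes))

# summary() of a factor gives the count for each level
print(summary(gender))
```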
Starting R: Data types- Factor
• An ordered factor is used to represent an ordinal
variable. Consider a variable rating coded as
large, medium, small
rating <- c(rep("large",10), rep("medium", 10),rep("small", 10) )
rating <- ordered(rating)
• R codes rating to 1,2,3 and associates: 1=large,
2=medium, 3=small internally (levels are sorted alphabetically by default)
• R uses factor for nominal variables and
ordered for ordinal variables in statistical
procedures and graphical analyses.
• Try the command plot(rating)
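By default the levels of an ordered factor follow alphabetical order, which here gives large < medium < small. If the intended order is small < medium < large, one option is to pass the levels explicitly, as sketched below. The names r1 and r2 are illustrative.

```r
rating <- c(rep("large", 10), rep("medium", 10), rep("small", 10))

# Default: alphabetical level order (large < medium < small)
r1 <- ordered(rating)
print(levels(r1))

# Explicit level order: small < medium < large
r2 <- factor(rating, levels = c("small", "medium", "large"), ordered = TRUE)
print(levels(r2))

# Ordered factors can be compared; element 1 is "large", element 30 is "small"
print(r2[1] > r2[30])
```

The comparison on the last line depends on the level order chosen, so it is worth checking levels() before relying on it.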
Starting R: Data types- Factor
• A factor is a vector object used to specify a discrete
classification (grouping) of the components of other
vectors of the same length. R provides both ordered and
unordered factors. The main application of factors is with
model formulae. For example, the state of origin of a sample of 30 tax
accountants from all the states of Australia can be given by a character vector:
state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
"qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas", "sa",
"nt", "wa", "vic", "qld", "nsw", "nsw", "wa", "sa", "act",
"nsw", "vic", "vic", "act")
• A factor is created using the factor() function:
• statef <- factor(state)
• summary(statef)
• To find out the levels of a factor the function levels() can be used.
> levels(statef)
[1] "act" "nsw" "nt" "qld" "sa" "tas" "vic" "wa"
Starting R: Data types- Matrix
• A matrix is a collection of data elements
arranged in a two-dimensional rectangular
layout. The following is an example of a
matrix with 2 rows and 3 columns.
Starting R: Data types- Matrix
> A = matrix(
+ c(2, 4, 3, 1, 5, 7), # the data elements
+ nrow=2, # number of rows
+ ncol=3, # number of columns
+ byrow = TRUE) # fill matrix by rows
> A # print the matrix
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7
An element at the mth row, nth column of A can be accessed by the expression A[m, n].
> A[2, 3] # element at 2nd row, 3rd column
[1] 7
The entire mth row of A can be extracted as A[m, ].
> A[2, ] # the 2nd row
[1] 1 5 7
Similarly, the entire nth column of A can be extracted as A[ ,n].
> A[ ,3] # the 3rd column
[1] 3 7
Starting R: Data types- Dataframe
• A dataframe is more general than a matrix, in
that different columns can have different modes
(numeric, character, factor, etc.). It is a bit like an
SQL table.
d <- c(1,2,3,4)
e <- c("red", "white", "red", NA)
f <- c(TRUE,TRUE,TRUE,FALSE)
mydata <- data.frame(d,e,f)
names(mydata) <- c("ID","Color","Passed")
• There are a variety of ways to identify the
elements of a dataframe .
mydata[2:3] # columns 2,3 of dataframe
mydata[c("ID","Color")] # columns ID,Color
mydata$ID # column ID, accessed by name
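A brief sketch of inspecting and filtering the data frame built above. The variable name passed is introduced for illustration. str() is a quick way to check the type of each column.

```r
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f)
names(mydata) <- c("ID", "Color", "Passed")

str(mydata)                         # one line per column, with its type

# Select rows with a logical vector: keep rows where Passed is TRUE
passed <- mydata[mydata$Passed, ]
print(passed)

# A single cell: row 2, column "Color"
print(mydata[2, "Color"])
```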
Starting R: Data types- data.frame
• Here we create a data.frame called d.
L3 <- LETTERS[1:3]
(d <- data.frame(cbind(x=1, y=1:10),
fac=sample(L3, 10, replace=TRUE)))
• To view four rows: d[1:4,]
• To view a column: d$x, d$y, d$fac
• Alternative way to view a column: d[,3]
Starting R: Data types- Table
• One-way tables are created with the table
command. Its argument is a factor (or a vector
that can be treated as one), and it counts how
often each level occurs.
Starting R: Data types- one-way Table
> a <- factor(c("A","A","B","A","B","B","C","A","C"))
> results <- table(a)
> results
a
A B C
4 3 2
> attributes(results)
> attributes(results)$dimnames$a
> attributes(results)$dim
> attributes(results)$class
> summary(results)
Starting R: Data types- two-way Table
• Say we want to put the results of two questions into a
table:
• First question responses are Never, Sometimes,
Always,
• Second question responses are Yes, No, and Maybe.
Two vectors a and b contain the response for each
measurement.
• In the vectors, responses are represented by position.
The third item in a is how the third person responded to
the first question, and the third item in b is how the third
person responded to the second question.
• In the following we can see that two people who said
"Sometimes" to the first question also said "Maybe" to
the second question.
Starting R: Data types- two-way Table
a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
results <- table(a,b)
> results
           b
a           Maybe No Yes
  Always        2  0   0
  Never         0  1   1
  Sometimes     2  1   1
The rows (a) hold the answers to the first question and the columns (b) the
answers to the second question.
The table shows that two people who said Sometimes to the first question
also said Maybe to the second question.
The elements are accessed like a matrix, e.g. results[,1].
How many people responded?
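One way to answer the question above is to sum all the cells of the table. The sketch below also shows matrix-style access by position and by name. The variable name total is introduced for illustration.

```r
a <- c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes","Never")
b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
results <- table(a, b)

print(results[, 1])                    # first column (the "Maybe" answers)
print(results["Sometimes", "Maybe"])   # one cell, selected by name

total <- sum(results)                  # every respondent falls in exactly one cell
print(total)

print(rowSums(results))                # respondents per answer to the first question
```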
Useful functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
Useful functions
object # typing the name prints the object
ls(), objects() # list current objects
rm(object) # delete an object
newobject <- edit(object) # edit, copy, save
fix(object) # edit in place
data.entry(result) # GUI edit in place
mode(object) # type of the object
Starting R : Input-Output IO
• There are many ways to get data into R. We
focus on just three:
– Assignment
– Reading a CSV File (writing later)
– Loading data from PostgreSQL (later)
Starting R : IO-Assignment
• Assignment (LHS <- RHS) allows the value of an expression on the
RHS to be stored in a named object on the LHS. In R
> a <- c(3,5,7,9)
• The above assignment uses the combine command (c
means combine). This makes a vector called a. No
output is produced yet. Now we can retrieve the contents
of a just by typing it in.
• > a
• > a[3]
• The first command gives all of a; the second command gives
the third element of a. The 3 in [3] is called the index. Indexing
starts at 1; a[0] returns an empty vector of the same type as a. Try:
• b <- c("one","two","three")
Starting R : IO-Assignment
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2,
byrow=TRUE,
dimnames=list(rnames, cnames))
Type >attributes(mymatrix)
Type >help(array) to find more details on Arrays
Starting R : Input: File
• Place the file simple.csv in a directory (folder).
• Load the file into R using:
h <- read.csv(file="C:\\My-R-Dir\\simple.csv",header=TRUE,sep=",")
• View the contents of h:
> h
• Now the contents of the file are stored in R as
the object named h.
• Type >names(h)
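The sketch below shows the same read.csv() call in a form that runs without simple.csv. It writes a small file to a temporary location first. The column names and values are invented for this example and are not the contents of simple.csv.

```r
# Hypothetical data written to a temporary file, so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("trial,mass,velocity",
             "A,10,12",
             "B,20,14",
             "C,30,16"), path)

h <- read.csv(file = path, header = TRUE, sep = ",")

print(names(h))       # column names taken from the header line
print(nrow(h))        # number of data rows
print(mean(h$mass))   # numeric columns are ready for computation

unlink(path)          # remove the temporary file
```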
Starting R: Data types-Matrices
• All columns in a matrix must have the same data type
(numeric, character, etc.) and the same length. The
general format is:
mymatrix <- matrix(vector, nrow=r, ncol=c,
byrow=FALSE,
dimnames=list(char_vector_rownames,
char_vector_colnames))
• byrow=TRUE indicates that the matrix should be filled
by rows. byrow=FALSE indicates that the matrix should
be filled by columns (the default). dimnames provides
optional labels for the rows and columns.
Review - vectors, lists, matrices,
data frames
• To make vectors x, y, year, names
x <- c(2,3,7,9)
y <- c(9,7,3,2)
year <- 1990:1993
names <- c("payal", "shraddha", "kritika", "itida")
Accessing last element
y[length(y)]
• To make a list person
person <- list(name="payal", x=2, y=9, year=1990)
Accessing: person$name, names(person), person$x, person[1]
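A short sketch of the difference between single and double brackets on a list. person[1] returns a list of length one, while person[[1]] and person$name return the element itself. The names sub and val are illustrative.

```r
person <- list(name = "payal", x = 2, y = 9, year = 1990)

sub <- person[1]     # single brackets: still a list, of length one
val <- person[[1]]   # double brackets: the element itself

print(class(sub))
print(class(val))
print(identical(val, person$name))
```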
Review - vectors, lists, matrices,
data frames
• To make a matrix, pasting together the columns year , x,
y using column bind.
m <- cbind(year, x, y)
• To make a data frame, which is a list of vectors of the
same length
D <- data.frame(names, year, x, y)
nrow(D)
• Accessing one of these vectors
D$names
Accessing the last element of this vector
D$names[nrow(D)]
D$names[length(D$names)]
Finding the type and class
> g <- c(1,3,2)
> class(g)
[1] "numeric"
> typeof(g)
[1] "double"
> is(g)
[1] "numeric" "vector"
Sorting
• If the variable i is a vector of integers, then the
data frame D[i,] picks up rows from D based on
the values found in i. The order() function
returns an integer vector that gives the correct
ordering for the purpose of sorting.
• D <- data.frame(x=c(1,2,3,1), y=c(7,19,2,2))
• Sort on x
indexes <- order(D$x)
D[indexes,]
• Print out sorted dataset, sorted in reverse by y
D[rev(order(D$y)),]
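The sketch below looks at the index vector that order() returns and shows two variations: the decreasing argument, and sorting on a second column to break ties. The names idx, by_y_desc and by_x_then_y are introduced for illustration.

```r
D <- data.frame(x = c(1, 2, 3, 1), y = c(7, 19, 2, 2))

# order() returns row positions, not the sorted values
idx <- order(D$x)
print(idx)

# Descending sort on y using the decreasing argument
by_y_desc <- D[order(D$y, decreasing = TRUE), ]
print(by_y_desc)

# Sort on x, breaking ties with y
by_x_then_y <- D[order(D$x, D$y), ]
print(by_x_then_y)
```

Rows with equal keys may come out in a different relative order with rev(order(...)) than with decreasing = TRUE, so a second sort key is a safer way to control ties.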
Logical constants & variables
• TRUE and FALSE are logical constants
• T and F are logical variables
• T and F are not quite synonyms for TRUE
and FALSE but variables that have the
expected values by default
• TRUE == TRUE
• T == T
• Normally give the expected result.
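The sketch below illustrates why T and F are only variables: T can be given a new value, after which it no longer behaves like TRUE. The names shadowed and restored are introduced for illustration.

```r
print(T == TRUE)         # TRUE while T still has its default value

T <- 0                   # T is an ordinary variable, so this is allowed
shadowed <- isTRUE(T)    # T no longer means TRUE
print(shadowed)

rm(T)                    # remove our definition; the default T is visible again
restored <- isTRUE(T)
print(restored)
```

For this reason it is usually safer to write TRUE and FALSE in full in your own code.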
Missing Values : NA
• Not Available or Missing Values are represented as NA,
which is a logical constant of length 1 that contains
a missing value indicator.
• Examples
is.na(c(1, NA)) #FALSE TRUE
is.na(c(NA, NA)) #TRUE TRUE
is.na(paste(c(1, NA))) # FALSE FALSE
xx <- c(0:4)
is.na(xx) <- c(2, 4)
xx # 0 NA 2 NA 4
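A short sketch of how NA behaves in comparisons and computations. The vector v and the names m and n_missing are made up for this example.

```r
v <- c(4, NA, 7, 1)

print(v == NA)       # comparing with NA gives NA, so this cannot find missing values
print(is.na(v))      # is.na() is the test to use

print(mean(v))       # NA propagates through most computations

m <- mean(v, na.rm = TRUE)   # drop missing values before computing
print(m)

n_missing <- sum(is.na(v))   # count the missing values
print(n_missing)
```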
Writing your own functions.
• R comes with a built-in median function.
• Usage: median(x, na.rm = FALSE)
• where x is an object for which a method
has been defined, or a numeric vector
containing the values whose median is to
be computed.
• na.rm is a logical value indicating whether
NA values should be removed before the
computation proceeds.
Control - If
> if (T) print("Hello") else print("Good Bye")
[1] "Hello"
> if (F) print("Hello") else print("Good Bye")
[1] "Good Bye"
Control - Sequence
a <- c(1,2,3,4,5)
b <- c(2,3,4,5)
odd.even <- length(a) %% 2
if (odd.even == 0)
(sort(a)[length(a)/2] +
sort(a)[1 + length(a)/2])/2 else
sort(a)[ceiling(length(a)/2)]
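If we want to find the median of b we have to type the whole thing again,
recomputing odd.even for b and keeping else on the same line as the end of
the preceding expression.
> odd.even <- length(b) %% 2
> if (odd.even == 0) (sort(b)[length(b)/2] + sort(b)[1 + length(b)/2])/2 else
+ sort(b)[ceiling(length(b)/2)]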
It would be better to write a function.
User Written - Functions
a <- c(1,2,3,4,5)
b <- c(2,3,4,5)
mymedian <- function(x){
odd.even <- length(x) %% 2
if (odd.even == 0)
(sort(x)[length(x)/2] +
sort(x)[1 + length(x)/2])/2 else
sort(x)[ceiling(length(x)/2)]
}
Now we can call, run, execute or invoke mymedian on any numeric vector.
> mymedian(a)
> mymedian(b)
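A simple way to gain confidence in a user-written function is to compare it with the built-in one. The sketch below repeats mymedian (with braces added for readability) and checks it against median() on a few vectors. The list tests and its two extra vectors are invented for this example.

```r
mymedian <- function(x) {
  odd.even <- length(x) %% 2
  if (odd.even == 0) {
    # even length: average the two middle values
    (sort(x)[length(x)/2] + sort(x)[1 + length(x)/2]) / 2
  } else {
    # odd length: take the middle value
    sort(x)[ceiling(length(x)/2)]
  }
}

a <- c(1, 2, 3, 4, 5)
b <- c(2, 3, 4, 5)
print(mymedian(a))
print(mymedian(b))

# Compare with the built-in median() on some unsorted vectors
tests <- list(a, b, c(9, 1, 5), c(4, 8, 6, 2))
agree <- sapply(tests, function(v) mymedian(v) == median(v))
print(agree)
```

Note that this version does not handle NA values or empty vectors, which the built-in median() does.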
References
Lloyd: Spatial Data Analysis
Bivand, Pebesma, Gómez-Rubio: Applied Spatial Data Analysis with R
http://www.manning.com/obe/
http://www.spatial.cs.umn.edu/Book/