1 intro to R and quant analysis


Quantitative Analysis
Quantitative / Formal Methods
• objective measurement systems
• graphical methods
• statistical procedures
why bother?
• description
– esp. of populations
– ex: average height of people in room
• inference
– describe populations on the basis of samples
– test hypotheses about populations
– estimate levels of uncertainty associated with inferential description
• exploratory analysis
– pattern searching/recognition
– “data mining”
• evaluate strength of patterning…
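The difference between description and inference can be sketched in a few lines. This is a minimal illustration in Python (not R, which the course itself uses); the heights are invented, and the standard error of the mean is just one common way to express the uncertainty of an inferential description.

```python
import math
import statistics

# Hypothetical heights (cm) of five people in the room; invented for illustration.
heights = [160.0, 165.0, 170.0, 175.0, 180.0]

# Description: summarize the cases that were actually measured.
sample_mean = statistics.mean(heights)

# Inference: treat the five people as a sample and estimate how uncertain
# the sample mean is as a description of a larger population.
sample_sd = statistics.stdev(heights)            # sample standard deviation (n - 1)
std_error = sample_sd / math.sqrt(len(heights))  # standard error of the mean

print(f"mean = {sample_mean:.1f} cm, standard error = {std_error:.2f} cm")
```

If the five people are the whole population of interest, only the descriptive step applies; the standard error matters only when they stand in for a larger group.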
“Patterning”
• patterning = departures from randomness
• strength of patterning = degree of departure from randomness…
• “how likely is it that observed patterning could have occurred by chance??”
• this is a statistical question…
• “is the patterning strong enough to either require or support an explanatory argument??”
• this is usually an anthropological question…
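One way to put a number on the statistical question is an exact permutation test, sketched below in Python. The slides do not name this technique; it is swapped in here only as an illustration, and the two groups of measurements are hypothetical. The idea is to ask how many relabelings of the pooled cases produce a difference in group means at least as large as the one observed.

```python
from itertools import combinations

# Hypothetical measurements for two groups of cases; invented for illustration.
group_a = [12, 14, 15, 16]
group_b = [8, 9, 10, 11]


def mean_difference(a, b):
    """Difference between the mean of a and the mean of b."""
    return sum(a) / len(a) - sum(b) / len(b)


observed = mean_difference(group_a, group_b)

# Enumerate every way of assigning the pooled cases to a group of size len(group_a).
pooled = group_a + group_b
as_extreme = 0
total = 0
for idx in combinations(range(len(pooled)), len(group_a)):
    chosen = set(idx)
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    total += 1
    if mean_difference(a, b) >= observed:
        as_extreme += 1

# One-sided p-value: the share of relabelings at least as patterned as the data.
p_value = as_extreme / total
print(f"{as_extreme} of {total} relabelings, p = {p_value:.4f}")
```

A small p-value answers only the statistical question; whether the patterning needs an explanatory argument is still a separate judgment.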
basic vocabulary
• case
• variable
• data matrix
• attribute
• aggregation
• stratification
• accuracy
• precision
• case
– equivalent to ‘record’
– something about which we want to make/record observations…
• variable
– kinds of observations we want to make/record
– measurements of variability among cases…
cases and variables
          variable 1  variable 2  variable 3  variable 4  variable 5  …
case 1
case 2
case 3
case 4
case 5
…
(data matrix)
• attribute
– the intersection between cases and variables
– i.e., an observation about a specific case with reference to a specific variable
– ex.
• “elk”
• “strongly agree”
• “plain-ware”
– also called ‘value’, or ‘variable state’
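The vocabulary so far (case, variable, data matrix, attribute) maps directly onto a simple data structure. The sketch below uses Python rather than R; the sherds, variable names, and values are all hypothetical, and `attribute` is a helper invented for this example.

```python
# Hypothetical data matrix: one row per case, one column per variable.
variables = ["ware", "rim_diameter_cm", "decorated"]
data_matrix = {
    "sherd 1": ["plain-ware", 12.5, 0],
    "sherd 2": ["plain-ware", 18.0, 1],
    "sherd 3": ["painted", 15.2, 1],
}


def attribute(case, variable):
    """Return the value at the intersection of one case and one variable."""
    return data_matrix[case][variables.index(variable)]


print(attribute("sherd 1", "ware"))
print(attribute("sherd 3", "rim_diameter_cm"))
```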
• aggregation
– grouping cases, usually on the basis of a shared attribute
– spatial proximity, temporal proximity
– gender of interment associated with grave lots
• stratification
– dividing cases into sub-groups
– usually to carry out parallel analyses that relate to different control conditions
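Grouping cases on a shared attribute and then running the same analysis within each sub-group might look like the following Python sketch. The grave lots, the `n_goods` variable, and all values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical grave lots; the gender of the interment is the shared attribute.
grave_lots = [
    {"lot": "G1", "gender": "female", "n_goods": 4},
    {"lot": "G2", "gender": "male", "n_goods": 2},
    {"lot": "G3", "gender": "female", "n_goods": 6},
    {"lot": "G4", "gender": "male", "n_goods": 3},
]

# Divide the cases into sub-groups by the shared attribute.
groups = defaultdict(list)
for case in grave_lots:
    groups[case["gender"]].append(case)

# Carry out the same (parallel) analysis within each sub-group.
mean_goods = {
    gender: mean(case["n_goods"] for case in cases)
    for gender, cases in groups.items()
}
print(mean_goods)
```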
• accuracy
– an expression of the closeness between a measured (or computed) value and the true value
– frequently confused with precision
• precision
– has to do with replicability
– the closeness of repeated measures to the same value (not necessarily the true value)
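The two terms can be kept apart by computing them separately. In this Python sketch the true value and both sets of repeated measurements are invented; the mean offset from the true value stands in for (in)accuracy and the standard deviation of the repeats stands in for (im)precision, which is one reasonable way to quantify them, not the only one.

```python
from statistics import mean, stdev

true_value = 10.0  # hypothetical known value

# Hypothetical repeated measurements of the same object.
precise_but_inaccurate = [10.9, 11.0, 11.1, 11.0]  # tight cluster, wrong place
accurate_but_imprecise = [9.0, 11.0, 10.5, 9.5]    # right place on average, scattered


def bias(measurements):
    """Accuracy: how far the average measurement is from the true value."""
    return mean(measurements) - true_value


def spread(measurements):
    """Precision: how closely repeated measurements agree with each other."""
    return stdev(measurements)


for name, m in [("precise but inaccurate", precise_but_inaccurate),
                ("accurate but imprecise", accurate_but_imprecise)]:
    print(f"{name}: bias = {bias(m):+.2f}, spread = {spread(m):.2f}")
```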
scales of measurement
• presence / absence data
– simply whether or not the case exhibits a specific state
• nominal data
– contrasting groups, usually mutually exclusive
– sometimes referred to as ‘discrete’ or ‘categorical’ data
• ordinal data
– a logical order or ranking exists among the various categories
– no assumptions implied about the ‘measurement space’ occupied by categories
• ratio data
– also metric, continuous
– has a non-arbitrary zero
– can meaningfully compare measurements as ratios
• interval data
– distances between categories of measurement are fixed and even (unlike ordinal data)
– scale lacks a non-arbitrary ‘zero’ (unlike ratio data)
• count data
– derived from nominal data
– really a kind of ratio data created by aggregation
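Two of these distinctions are easy to check numerically. The Python sketch below uses temperature as an illustration that is not in the slides: Celsius is an interval scale (arbitrary zero) and Kelvin a ratio scale, so differences survive the conversion but ratios do not. It also shows count data arising from nominal data by aggregation; the ware labels are hypothetical.

```python
from collections import Counter


def celsius_to_kelvin(c):
    """Convert an interval-scale temperature to the ratio-scale equivalent."""
    return c + 273.15


t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale: they survive the conversion.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are not: "twice as hot" holds in Celsius only because of its arbitrary zero.
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)
print(f"differences: {diff_c:.2f} vs {diff_k:.2f}; ratios: {ratio_c:.2f} vs {ratio_k:.3f}")

# Count data: nominal attributes aggregated into counts per category.
wares = ["plain-ware", "painted", "plain-ware", "plain-ware"]  # hypothetical cases
counts = Counter(wares)
print(counts)
```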
Drennan
• distinctions are inconsistent and not too important…
• measurements vs. categories
– measurements: quantities measured along a scale
– categories: more or less equivalent to nominal data
– counts: discrete enumeration
• but, confusion does occur…
– ex. can’t use ‘goodness of fit’ tests on nominal data!
data coding
• presence / absence data
– can use 0 / 1 (but analyze with care!)
• nominal data
– OK to use integers (1, 2, 3, etc.)
– but don’t subject them to arithmetic operations
– don’t assume rules of numerical distance
• ordinal data
– use integers…
• ratio / metric data
– use integer or decimal notation
– don’t record spurious levels of accuracy or precision
– note: x = 10.2 means 10.15 < x < 10.25
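The note about x = 10.2 can be made concrete: the number of decimal places recorded fixes the interval the true value is claimed to lie in. A minimal Python sketch follows; `implied_interval` is a hypothetical helper, and it takes the value as a string so that trailing zeros (10.2 vs. 10.20) are preserved.

```python
from decimal import Decimal


def implied_interval(recorded):
    """Return the (low, high) range implied by a recorded decimal string."""
    value = Decimal(recorded)
    # One unit in the last recorded decimal place, e.g. 0.1 for "10.2".
    last_place = Decimal(1).scaleb(value.as_tuple().exponent)
    half_unit = last_place / 2
    return value - half_unit, value + half_unit


print(implied_interval("10.2"))
print(implied_interval("10.20"))
```

Recording 10.20 instead of 10.2 therefore claims a tenfold narrower interval, which is spurious unless the instrument really supports it.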
coding “missing data”
• missing data (MD) more problematic than most realize…
• may want more than one code:
1. variable state is uncertain, vs.
2. variable doesn’t apply, vs.
3. variable state is not present (not really MD)
• R gives you one coding option (“NA”)
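One way to keep the distinct kinds of missing data during recording, and collapse them to a single code only at analysis time, is sketched below in Python. The `Missing` enum, the rim diameters, and the `to_single_code` helper are hypothetical; Python's `None` plays the role that the single “NA” code plays in R.

```python
from enum import Enum


class Missing(Enum):
    """Distinct reasons a value is missing (hypothetical coding scheme)."""
    UNCERTAIN = "variable state is uncertain"
    NOT_APPLICABLE = "variable does not apply"


# Hypothetical rim diameters (cm) with two different kinds of missing data.
rim_diameters = [12.5, Missing.UNCERTAIN, 18.0, Missing.NOT_APPLICABLE]


def to_single_code(values):
    """Collapse every kind of missing data to one code, as a single NA would."""
    return [None if isinstance(v, Missing) else v for v in values]


collapsed = to_single_code(rim_diameters)
observed = [v for v in collapsed if v is not None]
print(collapsed)
print(observed)
```

The collapse is one-way: once both reasons are stored as the same code, the original distinction cannot be recovered from the data file.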
recoding data
• can readily recode “down” the scale (ex. ratio → ordinal)
– implies a loss of information and a probably wasted recording effort
• reporting apparently dubious counts as presence/absence data is not a good idea
• moving ‘up’ the scale means redoing lab work…
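Recoding down the scale is a one-line function, which is exactly why the information loss is easy to overlook. In this Python sketch the lengths and the cut points (5 cm and 10 cm) are hypothetical.

```python
def to_ordinal(length_cm):
    """Recode a ratio-scale length into an ordinal size class.

    The cut points (5 cm and 10 cm) are hypothetical.
    """
    if length_cm < 5.0:
        return 1  # small
    if length_cm < 10.0:
        return 2  # medium
    return 3      # large


lengths = [3.2, 7.5, 12.0, 9.9]  # hypothetical measurements (cm)
ranks = [to_ordinal(x) for x in lengths]
print(ranks)
```

Here 7.5 and 9.9 both end up in class 2; nothing in the recoded data allows the original measurements to be reconstructed, which is why moving back up the scale means measuring again.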
data management
• three main options for electronic storage of data:
– spreadsheet
– statistics package
– database
‘spreadsheet’
• organized by cells
• no restrictions on cell content
• most useful for short-term manipulation of small datasets
• poor for long-term storage of complex data structures
‘stat-pac’
• data forms offer less versatility than spreadsheets
• organized by case & variable
• powerful analytical tools
• poor management tools
‘database’
• best option for managing complex data structures
pottery design elements:
‘reptile eye’
‘obsidian knife’
‘cloud motif’
etc….
“multiple entry”
artifact #  design elements
ax-122      reptile eye, obsidian knife, cloud
az-01       maguey thorn, reptile eye
aa-01       jaguar paw
…           …
“flat-file” format
artifact #  D1   D2   D3
ax-122      rep  obk  cld
az-01       mgt  rep
aa-01       jgp
…

artifact #  rep  obk  cld  mgt  jgp
ax-122      1    1    1    0    0
az-01       1    0    0    1    0
aa-01       0    0    0    0    1
…
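Converting the “multiple entry” layout into the presence/absence flat file is mechanical. The Python sketch below uses the three artifacts and five abbreviations from the slides; the variable names are invented for this example.

```python
# "Multiple entry" layout: each artifact lists its design elements (abbreviated).
multiple_entry = {
    "ax-122": ["rep", "obk", "cld"],
    "az-01": ["mgt", "rep"],
    "aa-01": ["jgp"],
}

# One presence/absence column per design element.
columns = ["rep", "obk", "cld", "mgt", "jgp"]

# Flat file: 1 if the artifact carries the element, 0 otherwise.
flat_file = {
    artifact: [1 if col in elements else 0 for col in columns]
    for artifact, elements in multiple_entry.items()
}

print("artifact #", *columns)
for artifact, row in flat_file.items():
    print(artifact, *row)
```

Each new design element adds a column to every case, which is one reason this layout becomes awkward as the data structure grows.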
relational database
artifacts
ID  catNum
1   ax-122
2   az-01
3   aa-01

design element link
artID  deID
1      1
1      2
1      4
2      1
2      2
3      5

design elements
ID  element
1   reptile eye
2   obsidian knife
3   maguey thorn
4   cloud
5   jaguar paw
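[diagram: table relationships]
artifacts.ID (1) ------------ design element link.artID
design elements.ID (1) ------------ design element link.deID

design elements
ID  element         abbrev
1   reptile eye     rep
2   obsidian knife  obk
3   maguey thorn    mgt
4   cloud           cld
5   jaguar paw      jgp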
“structured query language” (SQL)
SELECT artifacts.catNum, [design elements].abbrev
FROM [design elements]
INNER JOIN (artifacts INNER JOIN [design element link]
    ON artifacts.ID = [design element link].artID)
ON [design elements].ID = [design element link].deID;
catNum  abbrev
ax-122  rep
ax-122  obk
ax-122  cld
az-01   rep
az-01   obk
aa-01   jgp
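The three tables and the join can be tried out with Python's built-in `sqlite3` module, as in the sketch below. This is an assumption-laden stand-in for whatever database the slides used: the table names are rewritten with underscores (`design_elements`, `design_element_link`) instead of bracketed names with spaces, the nested join is flattened into two successive joins, and an `ORDER BY` is added so the row order is predictable. The rows are the ones shown in the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE artifacts (ID INTEGER PRIMARY KEY, catNum TEXT);
CREATE TABLE design_elements (ID INTEGER PRIMARY KEY, element TEXT, abbrev TEXT);
CREATE TABLE design_element_link (artID INTEGER, deID INTEGER);

INSERT INTO artifacts VALUES (1, 'ax-122'), (2, 'az-01'), (3, 'aa-01');
INSERT INTO design_elements VALUES
    (1, 'reptile eye', 'rep'), (2, 'obsidian knife', 'obk'),
    (3, 'maguey thorn', 'mgt'), (4, 'cloud', 'cld'), (5, 'jaguar paw', 'jgp');
INSERT INTO design_element_link VALUES (1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (3, 5);
""")

# Follow the link table from each artifact to its design elements.
rows = conn.execute("""
SELECT artifacts.catNum, design_elements.abbrev
FROM artifacts
INNER JOIN design_element_link ON artifacts.ID = design_element_link.artID
INNER JOIN design_elements ON design_elements.ID = design_element_link.deID
ORDER BY artifacts.ID, design_elements.ID
""").fetchall()

for cat_num, abbrev in rows:
    print(cat_num, abbrev)

conn.close()
```

Adding a new design element here means adding one row to `design_elements`, not a new column to every artifact, which is the practical advantage over the flat-file layout.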