Datasets and Variables - University of California, Riverside

Datasets and Variables
We want to answer questions
We want to use data for this purpose
Observations of characteristics of cases
Case: person, city, organization, etc.
Characteristic or Variable: age, size, sector of economy, etc.
Dataset: data arranged in case-by-variable format
Datasets
Cases
Variables
Variables
Measures or observations of a case's
  traits
  characteristics
  qualities
  attributes
  amounts
  quantities
  etc.

Errors in variables
Missing values
Measurement errors
  Mistakes in reporting, remembering, recording
  Lies, etc.
Quality of answer ~ quality of data
Types of variables
Categorical: nominal (name)
Ordinal: (name and order)
Measurement: interval and ratio
  Interval (name, order, and unit of measure)
  Ratio (name, order, unit, and true zero point)

Summarizing Data
Frequency Distributions
  For measurement variables
  For categorical variables

Frequency Distribution
For Categorical variables (table)

Variable
Value    Frequency    Proportion    Percent
         (f)          (f/n)         (f/n * 100)
Dem      17           .425          42.5%
Rep       7           .175          17.5%
Ind      16           .400          40.0%
Total    n=40         1.000         100.0%
Frequency Distribution
For Measurement variables (table)

Variable
Value    Frequency    Proportion    Percent
0        19           .44           44
1        10           .23           23
…
7         2           .05            5
Total    n=43         1.00          100
Frequency Distribution
For Categorical variable (bar chart)
[Bar chart: frequency (axis 0 to 20) for each value of Pol. Pref.: Dem, Rep, Ind]
Frequency Distribution
For Measurement variable (histogram)
[Histogram: frequency (axis 0 to 20) of Quiz Score, values 0 through 7]
Frequency Distribution
Frequency, f: the number of cases that have the same value of a variable
Total cases, n: the number of all cases
Proportion, p = f/n
Percentage, % = 100 * p
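
The definitions above can be sketched in Python. This is a minimal illustration rather than anything from the original slides: it rebuilds the political-preference counts (17 Dem, 7 Rep, 16 Ind) from a list of cases. The function name `frequency_distribution` and the variable names are assumptions chosen for this example.

```python
from collections import Counter


def frequency_distribution(values):
    """Return {value: (f, p, percent)} for a list of observed values."""
    n = len(values)                      # total cases, n
    counts = Counter(values)             # frequency f of each distinct value
    table = {}
    for value, f in counts.items():
        p = f / n                        # proportion, p = f/n
        table[value] = (f, p, 100 * p)   # percentage, % = 100 * p
    return table


# Illustrative cases matching the categorical table: 17 Dem, 7 Rep, 16 Ind (n=40).
cases = ["Dem"] * 17 + ["Rep"] * 7 + ["Ind"] * 16

for value, (f, p, pct) in frequency_distribution(cases).items():
    print(f"{value:<5}{f:>4}{p:>8.3f}{pct:>8.1f}%")
```

The same function works for a measurement variable such as a quiz score, since it only counts how often each distinct value occurs.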
Dataset
Individual cases
  e.g. stats.dta dataset: characteristics of individuals: age, msat, gender
Aggregate data (groups of individual cases)
  e.g. college1.dta dataset: characteristics of individuals (age, msat, gender) averaged for groupings of students by college
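
The step from individual cases to aggregate data can be sketched as follows. This is only an illustration under stated assumptions: the student records, the `college` field, and the helper `aggregate_by` are invented for the example and do not reproduce the contents of stats.dta or college1.dta.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical individual-level cases (one dict per student); all values are invented.
students = [
    {"college": "A", "age": 19, "msat": 520},
    {"college": "A", "age": 21, "msat": 580},
    {"college": "B", "age": 20, "msat": 610},
    {"college": "B", "age": 22, "msat": 650},
    {"college": "B", "age": 24, "msat": 600},
]


def aggregate_by(cases, group_key, variables):
    """Average each variable within each group, giving one aggregate case per group."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[group_key]].append(case)
    return {
        group: {var: mean(c[var] for c in members) for var in variables}
        for group, members in groups.items()
    }


# Each college becomes a single case whose variables are group averages.
colleges = aggregate_by(students, "college", ["age", "msat"])
for college, averages in colleges.items():
    print(college, averages)
```

In the aggregate dataset the case is the college, not the student, so five individual cases collapse into two aggregate cases here.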
Populations and Samples
Population: all the relevant cases, i.e. the entire set.
Sample: some portion of the population
  haphazard (e.g., whomever you meet)
  systematic (e.g., every 10th person)
  representative (all possibilities included)
  random (every case in the population has a fixed probability of being included in the sample, often an equal probability)
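
Three of the sample types listed above can be sketched in Python. This is a rough illustration, assuming a hypothetical population of 100 numbered cases; the function names are invented for the example, and a representative sample is not shown because it depends on knowing which groups must be covered.

```python
import random

# Hypothetical population: case IDs 1..100 (invented for illustration).
population = list(range(1, 101))


def haphazard_sample(cases, k):
    """Take whichever cases come first, e.g. whomever you meet."""
    return cases[:k]


def systematic_sample(cases, step, start=0):
    """Take every `step`-th case, e.g. every 10th person."""
    return cases[start::step]


def simple_random_sample(cases, k, seed=None):
    """Give every case the same probability of being included."""
    rng = random.Random(seed)
    return rng.sample(cases, k)


print(haphazard_sample(population, 10))
print(systematic_sample(population, 10, start=9))
print(simple_random_sample(population, 10, seed=1))
```

Only the last function uses chance to pick cases; the first two are fully determined by the order in which the cases happen to be listed.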

Populations
Best information
Expensive or impossible to observe
Samples
Easier to get
Less expensive
Less accurate, but can be very accurate depending on the type of sample
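
How much accuracy depends on the type of sample can be seen in a small simulation. This is a sketch under assumptions: the population of 1,000 ordered values is invented, and the "haphazard" sample is modeled simply as the first 200 cases encountered.

```python
import random
from statistics import mean

# Hypothetical population of 1,000 values in increasing order (invented data).
population = list(range(1, 1001))
population_mean = mean(population)            # 500.5

rng = random.Random(0)                        # fixed seed so the run is repeatable
random_sample = rng.sample(population, 200)   # equal-probability random sample
haphazard_sample = population[:200]           # whichever cases come first

# Error = distance between the sample mean and the population mean.
random_error = abs(mean(random_sample) - population_mean)
haphazard_error = abs(mean(haphazard_sample) - population_mean)

print(f"population mean:        {population_mean}")
print(f"random sample error:    {random_error:.1f}")
print(f"haphazard sample error: {haphazard_error:.1f}")
```

Both samples contain 200 cases, so the difference in error comes from how the cases were chosen, not from how many were observed.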
Hints for a good grade
Do all assigned exercises and turn them in on time
Do all other exercises for yourself to be sure you understand
Read and study the text, before and after lectures, several times
Statistics is a language: learn (memorize) the vocabulary and concepts