4208 2016 week 2x

Describing Data
September 14, 2016
Updates
• This week – Lab sections begin
• Wed: 2-4pm (Today!)
• Wed: 4-6pm (Today!)
• Mon: 4-6pm
• Next week
• Eric Glass, guest speaker from DSSC (part of class)
• The following week, another speaker talking about Zotero.
Updates to assignments
• Updated LiPS assignment
Still have to do seven write-ups
One must be either Fulong Wu (Monday evening Nov 14th)
or Malo Hutson (Tuesday evening Sept. 20th)
• Assignment 2 posted to CourseWorks
Due at the start of your lab in 2 weeks. Hand in a paper copy to your TA and
post also to CourseWorks.
Today: Statistics
• Descriptive
• Describe and summarize our data to give insights
• Inferential
• Use statistics to make generalizations about a broader population
Types of Variables
• Categorical
• Nominal (not ranked)
• College major, type of property, color of car
• Ordinal (ordered or ranked)
• Useful for preferences, though no value assigned
• Dichotomous (two categories, not ranked)
• Yes/no
• Numerical
• Discrete (values are counts)
• Continuous (values are measures)
Variables
• Nominal
• Exclusive but not ordered or ranked
• Ordinal
• Ranked
• Interval
• Equally spaced variables
Nominal Examples
• Think of nominal scales as “labels”
• No quantitative value
Color            Count
Blue             10
Black            8
Red              6
blue             5
Purple           3
Green            2
Purple           2
White            2
BLUE             1
Brown            1
Burgundy         1
Gray             1
Pink             1
Red              1
Yellow           1
nav              1
orange           1
purple           1
red              1
seafoam green    1
turquoise        1
white            1
Nominal Examples
• Think of nominal scales as “labels”
• No quantitative value
• Other Examples:
• Gender
• Hair color
• Neighborhood
• When there are only two categories, we call this “dichotomous.”
• Examples – Heads/Tails, On/Off, Rural/Urban, In poverty / Not in poverty
• Q: What about gender? Is that a dichotomous variable?
Ordinal
• Ranked in order of values, but the difference between values is not
always known
• Example:
• Educational attainment
Ordinal example:
educational attainment
Interval
• Numerical scales where the order of, and the differences between, values are known
• Examples:
• Money or income
• Height
• Weight
[Figure: Hours of Sleep. x-axis: hours; y-axis: Count]
Likert items
• Allow people to respond according to some scale
• Examples:
Question: How frequently do you think you need to come to class to get a high
pass?
o Always
o Often
o Occasionally
o Rarely
o Never
Likert items
• Allow people to respond according to some scale
• Examples:
Question: I already know everything there is to know about “Planning
Techniques”
o Agree Strongly
o Agree Slightly
o Neutral
o Disagree Slightly
o Disagree Strongly
Likert items
• Allow people to respond according to some scale
• Examples – four point scale
Question: I read emails from Nick Klein
o Most of the time
o Some of the time
o Seldom
o Never
Likert items
• Allow people to respond according to some scale
• Examples – four point scale
Question: I read emails from Nick Klein
o Most of the time – ALL OF THE TIME
o Some of the time
o Seldom
o Never
Likert Scales
• What types of variables are these?
• How can we interpret them?
Descriptive stats
We need some data to describe
Lucky us!
What year were you born?
50 responses:
1993, 1991, 1960, 1993, 1994, 1992, 1989, 1992, 1993, 1993, 1994,
1991, 1990, 1992, 1987, 1989, 1994, 1992, 1989, 1992, 1994, 1985,
1994, 1991, 1991, 1992, 1993, 1993, 1993, 1992, 1991, 1985, 1992,
1992, 1992, 1985, 1994, 1993, 1995, 1991, 1985, 1993, 1990, 1992,
1994, 1994, 1994, 1994, 1992, 1990
Hard to make sense of this…
We can use a “frequency table”
Year born    Frequency    Percent
1960         1            2.00
1985         4            8.00
1987         1            2.00
1989         3            6.00
1990         3            6.00
1991         6            12.00
1992         12           24.00
1993         9            18.00
1994         10           20.00
1995         1            2.00
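A frequency table like this one can be built by counting each distinct value. The following is a minimal Python sketch, not part of the original slides; the names `responses` and `frequency_table` are chosen here for illustration.

```python
from collections import Counter

# The 50 responses to "What year were you born?", copied from the slide.
responses = [
    1993, 1991, 1960, 1993, 1994, 1992, 1989, 1992, 1993, 1993, 1994,
    1991, 1990, 1992, 1987, 1989, 1994, 1992, 1989, 1992, 1994, 1985,
    1994, 1991, 1991, 1992, 1993, 1993, 1993, 1992, 1991, 1985, 1992,
    1992, 1992, 1985, 1994, 1993, 1995, 1991, 1985, 1993, 1990, 1992,
    1994, 1994, 1994, 1994, 1992, 1990,
]


def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, c, 100 * c / n) for v, c in sorted(counts.items())]


table = frequency_table(responses)
print(f"{'Year born':<12}{'Frequency':>10}{'Percent':>10}")
for year, count, percent in table:
    print(f"{year:<12}{count:>10}{percent:>10.2f}")
```

Sorting the counted items by year keeps the rows in the same order as the table above.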
Let’s represent it another way, graphically
We can use a “dot plot” where each dot
represents a response
[Figure: dot plot. x-axis: What year were you born? (1960 to 2000); y-axis: Frequency]
This is similar to a histogram
[Figure: histogram. x-axis: What year were you born?; y-axis: Frequency]
But a histogram is more flexible
We can change the number of “bins”
[Figure: histogram with a different number of bins. x-axis: What year were you born?]
And change the y-axis to a measure of “relative frequency” rather than a count.
[Figure: histogram. x-axis: What year were you born?; y-axis: Percent]
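To make the idea of bins and relative frequency concrete, here is a small Python sketch. It is only an illustration: the bin widths (10 and 5 years) and the starting point 1960 are assumptions made here, not the settings used for the charts in the lecture, and the helper names are hypothetical.

```python
from collections import Counter

# Year -> number of responses, taken from the frequency table.
counts_by_year = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
                  1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
responses = [year for year, c in counts_by_year.items() for _ in range(c)]


def histogram(values, start, width):
    """Count values per bin [left, left + width), keyed by the bin's left edge.

    Bins with no observations are simply absent from the result.
    """
    bins = Counter(start + ((v - start) // width) * width for v in values)
    return dict(sorted(bins.items()))


def relative_frequency(bins):
    """Convert bin counts into percentages of all observations."""
    total = sum(bins.values())
    return {left: 100 * count / total for left, count in bins.items()}


by_decade = histogram(responses, start=1960, width=10)
by_five_years = histogram(responses, start=1960, width=5)
percent_by_decade = relative_frequency(by_decade)

print(by_decade)
print(by_five_years)
print(percent_by_decade)
```

Changing `width` changes the number of bins; switching from counts to `relative_frequency` changes only the y-axis, not the shape.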
Another approach is a “stem and leaf”
195. |
196. |
197. |
198. |
199. |
200. |
The stem consists of the numbers with the last digit omitted. So for our years, this means dropping the final digit of the year and keeping the decade.
So “1975” would become “197”
Another approach is a “stem and leaf”
Then add the final digits (the leaf or leaves) back in to the corresponding stem
195. |
196. | 0
197. |
198. | 55557999
199. | 00011111122222222222233333333344444444445
200. |
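The same stem-and-leaf display can be produced with a few lines of Python. This is a sketch for illustration only; it rebuilds the responses from the frequency table, and `stem_and_leaf` is a name made up here.

```python
from collections import defaultdict

# Year -> number of responses, taken from the frequency table.
counts_by_year = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
                  1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
responses = [year for year, c in counts_by_year.items() for _ in range(c)]


def stem_and_leaf(values):
    """Map each stem (the value without its last digit) to its leaves."""
    leaves = defaultdict(str)
    for v in sorted(values):
        leaves[v // 10] += str(v % 10)  # stem = decade, leaf = final digit
    return dict(leaves)


plot = stem_and_leaf(responses)
for stem in range(195, 201):
    print(f"{stem}. | {plot.get(stem, '')}")
```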
Summary Statistics
Central Tendency and Spread
• Two of the simplest and most important measures
Central Tendency
• There are a number of measures of central tendency
• The most common are:
• Mean
• Median
• Mode
• Let’s focus on the first two
Mean
• The mean is the average.
• To calculate it, we add up all the values and divide by the number of
observations.
• If we write it out as an equation, where we have n observations, we
could write it out as such:
$\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
Mean
• Let’s use five years from our data as an example:
1. 1985
2. 1992
3. 1992
4. 1992
5. 1985
$\mu = \frac{\sum_{i=1}^{n} x_i}{n}$
$\mu = \frac{1985 + 1992 + 1992 + 1992 + 1985}{5}$
$\mu = 1989.2$
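The calculation above can be checked with a short Python sketch (an illustration, not part of the original slides):

```python
years = [1985, 1992, 1992, 1992, 1985]


def mean(values):
    """Add up all the values and divide by the number of observations."""
    return sum(values) / len(values)


average = mean(years)
print(average)
```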
Median
• The median is the middle-most value
• We can identify it by placing our data in order. Let’s use the same five
values:
1985
1985
1992
1992
1992
• The mean (1989.2) and median (1992) are often different. The
median has a nice attribute in that it is generally not sensitive to
outliers.
Median
• If there are two middle-most values, we would take the average of the two
• Let’s add our outlier (1960) to our data set and figure out the median:
1960
1985
1985
1992
1992
1992
The median is now (1985 + 1992) / 2 = 1988.5
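Both cases (an odd and an even number of values) can be handled by one small function. The following Python sketch is illustrative; the function name `median` and the variable names are chosen here.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


five_values = [1985, 1992, 1992, 1992, 1985]
six_values = five_values + [1960]  # add the outlier

print(median(five_values))
print(median(six_values))
```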
Mean and Median
Mean
• Easy to understand. It’s the average
• Affected by extreme high or low values (outliers)
• May not best characterize skewed distributions
Median
• Not affected by outliers
• May better characterize skewed distributions
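One way to see the difference is to compute both measures on the full set of 50 birth years, with and without the 1960 response. This Python sketch is an illustration added here; it rebuilds the responses from the frequency table and uses the standard library's `statistics` module.

```python
import statistics

# Year -> number of responses, taken from the frequency table.
counts_by_year = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
                  1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
responses = [year for year, c in counts_by_year.items() for _ in range(c)]
without_outlier = [year for year in responses if year != 1960]

mean_all = statistics.mean(responses)
mean_trimmed = statistics.mean(without_outlier)
median_all = statistics.median(responses)
median_trimmed = statistics.median(without_outlier)

print(f"mean:   {mean_all:.2f} with 1960, {mean_trimmed:.2f} without")
print(f"median: {median_all} with 1960, {median_trimmed} without")
```

Dropping the single outlier moves the mean, while the median stays at 1992.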
What about mode?
Mode
• The most frequent value
• Less often used in social science
Percentiles
• Imagine a chart with all the observable values in a population; it
contains 100 percent of the possible values.
• The pth percentile is the value of a given distribution such that p% of
the distribution is less than or equal to that value.
• Quartiles: The 25th, 50th, and 75th percentiles
• Quintiles: The 20th, 40th, 60th, and 80th percentiles
• Deciles: 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and 90th.
• The 50th percentile is the MEDIAN
Basic descriptive statistics
[Figures: standard normal curve with the area below each percentile shaded red]
• 10 percent under curve: 10th percentile = -1.2816
• 25 percent under curve: 25th percentile = -0.67
• 50 percent under curve: 50th percentile = 0.00
• 75 percent under curve: 75th percentile = 0.6745
• 90 percent under curve: 90th percentile = 1.2816
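These percentile values can be reproduced from a standard normal distribution (mean 0, standard deviation 1), which is an assumption about what the curves in the figures show. A minimal Python sketch using `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1 (assumed here).
standard_normal = NormalDist(mu=0, sigma=1)

percentiles = {p: standard_normal.inv_cdf(p / 100) for p in (10, 25, 50, 75, 90)}
for p, z in percentiles.items():
    print(f"{p}th percentile = {z:.4f}")
```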
Percentiles from our data
[Figure: distribution of “What year were you born?” (1960 to 2000) with the percentiles marked]
25th Percentile is 1991
50th Percentile / the median value is 1992
75th Percentile is 1993
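The definition given earlier (the value such that p% of the distribution is less than or equal to it) can be sketched in Python as below. This uses the "nearest rank" convention, which is an assumption made here; statistical software often interpolates between values and can give slightly different answers.

```python
import math

# Year -> number of responses, taken from the frequency table.
counts_by_year = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
                  1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
responses = [year for year, c in counts_by_year.items() for _ in range(c)]


def percentile(values, p):
    """Smallest value with at least p% of the data less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based position
    return ordered[rank - 1]


for p in (25, 50, 75):
    print(f"{p}th percentile is {percentile(responses, p)}")
```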
Measures of Spread
How do we describe the different
distributions?
Measures
• Range
• Interquartile range
• Index of dispersion
• Standard Deviation
Interquartile Range (IQR)
• The IQR is a simple measure of spread: It is the difference between
the 25th and 75th percentile values.
• The IQR tells us about the spread from the median
Interquartile Range (IQR)
[Figure: distribution of “What year were you born?” (1960 to 2000) with the quartiles marked]
25th Percentile is 1991
50th Percentile / the median value is 1992
75th Percentile is 1993
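With those quartiles, the IQR and the range follow by subtraction. The Python sketch below is illustrative and again assumes the nearest-rank percentile convention; other conventions may give slightly different quartiles.

```python
import math

# Year -> number of responses, taken from the frequency table.
counts_by_year = {1960: 1, 1985: 4, 1987: 1, 1989: 3, 1990: 3,
                  1991: 6, 1992: 12, 1993: 9, 1994: 10, 1995: 1}
responses = [year for year, c in counts_by_year.items() for _ in range(c)]


def percentile(values, p):
    """Nearest-rank percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


iqr = percentile(responses, 75) - percentile(responses, 25)
value_range = max(responses) - min(responses)

print(f"IQR = {iqr} years")
print(f"Range = {value_range} years")
```

The range is stretched by the single 1960 response, while the IQR only reflects the middle half of the data.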
Boxplots
Standard Deviation
• Often, we will use and talk about st. dev.
• Represented by sigma : σ
• The st. dev tells us about the spread from the mean
• (The IQR tells us about the spread from the median)
Standard Deviation
• We can think of the st. dev. (σ) as a measure of the average
distance from the mean (μ).
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}}$
• And we call the part under the square root the variance:
$\mathrm{Variance} = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}$
Standard Deviation
• The st. dev. is quite cumbersome to calculate
• First, we need the average: $\mu$
• Then calculate the squared distance from the average for each value: $(x_i - \mu)^2$
• Sum them all up: $\sum_{i=1}^{n} (x_i - \mu)^2$
• Divide by n-1: $\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}$
• Take the square root of all of that: $\sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}}$
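The five steps translate directly into code. The Python sketch below is an illustration that reuses the five example years from the mean slide; the last line compares the result with `statistics.stdev`, which also divides by n-1.

```python
import math
import statistics

years = [1985, 1992, 1992, 1992, 1985]

n = len(years)
mu = sum(years) / n                                  # 1. the average
squared_distances = [(x - mu) ** 2 for x in years]   # 2. squared distance for each value
total = sum(squared_distances)                       # 3. sum them all up
variance = total / (n - 1)                           # 4. divide by n-1
sigma = math.sqrt(variance)                          # 5. take the square root

print(f"variance = {variance:.2f}, st. dev. = {sigma:.3f}")
print(f"statistics.stdev gives {statistics.stdev(years):.3f}")
```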
Standard Deviation
• But the st. dev. is really useful.
• If we have normally distributed data,
• We can expect about 68% of values to fall within 1 st. dev. of the mean
• And about 95% to fall within 2.
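These shares can be checked against the normal distribution's cumulative distribution function. A minimal Python sketch, using a standard normal (mean 0, st. dev. 1) as the assumed example; the helper name `share_within` is made up here:

```python
from statistics import NormalDist

normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k st. dev. of the mean."""
    return normal.cdf(k) - normal.cdf(-k)


within_one = share_within(1)
within_two = share_within(2)
print(f"within 1 st. dev.: {within_one:.1%}")
print(f"within 2 st. dev.: {within_two:.1%}")
```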
Other ways to describe spread
Skewness and Symmetry
• Why might data be skewed?
• Why might data be bimodal?
Skewed data example: Family Income
[Figure: histogram of family income. x-axis: income, 0 to 800,000; y-axis: 0 to 6000]
Q: Guess the mean
$71,840
Q: Guess the median
$55,000
Interpreting Tables
Elements of a Table
Table 1: Outbound Freight Destinations from the Port of New York and New Jersey
                           Port
                 Global              NYCT
Destination      Trips     Share     Trips     Share
Canada           15        0.6%      6         0.2%
Connecticut      6         0.2%      20        0.6%
Delaware         3         0.1%      3         0.1%
Massachusetts    42        1.7%      19        0.6%
Maryland         1         0.0%      1         0.0%
Maine            1         0.0%      0         0.0%
New Jersey       1,941     79.9%     2,681     82.0%
New York         156       6.4%      314       9.6%
Ohio             2         0.1%      0         0.0%
Pennsylvania     119       4.9%      153       4.7%
Texas            1         0.0%      0         0.0%
California       0         0.0%      16        0.5%
Missing          141       5.8%      56        1.7%
Total            2,428               3,269
NY & NJ          2,097     86.4%     2,995     91.6%
• Title describes content
• Sample size presented
• Actual and percentage shares presented
Table 4: Toll and Operation Cost Estimates for 20 Mile Trip from New York Area Ports, 2011
                                   U.S. Average                       To and From Global    To and From NYCT
Cost of Operations                 Average Cost   Total     % of      Total     % of        Total     % of
                                   per Mile       Costs     Costs     Costs     Costs       Costs     Costs
Vehicle Based
Fuel and Oil                       $ 0.59         $ 11.90   35%       $ 11.80   27%         $ 11.80   14%
Truck/Trailer Lease or Purchase    $ 0.19         $ 3.78    11%       $ 3.78    9%          $ 3.78    5%
Repair and Maintenance             $ 0.15         $ 3.04    9%        $ 3.04    7%          $ 3.04    4%
Truck Insurance Premiums           $ 0.07         $ 1.34    4%        $ 1.34    3%          $ 1.34    2%
Permits and Licenses               $ 0.04         $ 0.76    2%        $ 0.76    2%          $ 0.76    1%
Tires                              $ 0.04         $ 0.84    2%        $ 0.84    2%          $ 0.84    1%
Tolls: General                     $ 0.02         $ 0.34    1%        $ 0.34    1%          $ 0.34    0%
Tolls: Bridges                                                        $ 8.97    21%         $ 48.22   59%
Driver-based
Driver Wages                       $ 0.46         $ 9.20    27%       $ 9.20    21%         $ 9.20    11%
Driver Benefits                    $ 0.15         $ 3.02    9%        $ 3.02    7%          $ 3.02    4%
Total Costs                        $ 1.71         $ 34.12   100%      $ 43.09   100%        $ 83.34   100%
Note: Estimates are of overall cost of a 20 mile trip. General operating costs from 2012 ATRI Average Carrier Costs
per Mile. Calculations by Jonathan Peters.
• Assumptions stated
• Source of calculations stated
Interpreting Tables
Homicide Rates per 100,000 residents by year and treatment status in 1977
                 Year
Group        1975    1977    Total
Untreated    8.0     6.9     7.5
Treated      10.3    9.7     10.0
Total        9.6     8.8     9.2
• From Manski (2014)
• Death penalty moratorium was lifted in the U.S. in 1976
• Three ways to interpret data presented
Interpreting Tables
1) “Before and after”
• Average effect of death penalty is -0.6 (calculated as 9.7 - 10.3)
Interpreting Tables
2) Compare treated and untreated
• Assumes all else equal, e.g. propensity to kill is the same everywhere
• Average effect in 1977 is 2.8 (= 9.7 - 6.9)
Interpreting Tables
3) Difference in difference
• Compares changes over time between the two groups to account for policy changes
• Treated states declined from 10.3 to 9.7 = -0.6
• Untreated states declined from 8.0 to 6.9 = -1.1
• Effect = 0.5 = [(9.7 - 10.3) - (6.9 - 8.0)]
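The three interpretations use the same four numbers from the table. The short Python sketch below, added for illustration, computes each one; the dictionary layout and variable names are choices made here.

```python
# Homicide rates per 100,000 residents, from the table (Manski 2014).
rates = {
    ("untreated", 1975): 8.0, ("untreated", 1977): 6.9,
    ("treated", 1975): 10.3, ("treated", 1977): 9.7,
}

# 1) Before and after: change in the treated group only.
before_after = rates[("treated", 1977)] - rates[("treated", 1975)]

# 2) Treated vs. untreated in 1977.
treated_vs_untreated = rates[("treated", 1977)] - rates[("untreated", 1977)]

# 3) Difference in difference: treated change minus untreated change.
untreated_change = rates[("untreated", 1977)] - rates[("untreated", 1975)]
diff_in_diff = before_after - untreated_change

print(f"Before and after:         {before_after:+.1f}")
print(f"Treated vs. untreated:    {treated_vs_untreated:+.1f}")
print(f"Difference in difference: {diff_in_diff:+.1f}")
```

The estimates differ in sign and size because each one relies on a different assumption about what would have happened without the policy change.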
Interpreting Tables
• Before and after shows reduced homicide rates (-0.6 per 100,000)
• Comparison of treated and untreated shows a rate higher by 2.8 per 100,000
• Difference in difference shows an increase in rate of 0.5 per 100,000
• Explanations?
Presenting Data
• Tables
• Charts
• Graphs
Problems with Pie Charts
• No sample size
• Similarly sized pies suggest all
groups are equal and all
response rates are about the
same
• Were yes/no the only options?
• What are “enough
transportation options”?
When Pie Charts Are Appropriate
Bar Chart
Measures of association