Transcript Session3
Secondary Data, Measures,
Hypothesis Formulation, Chi-Square
Market Intelligence
Julie Edell Britton
Session 3
August 21, 2009
Today’s Agenda
Announcements
Secondary data quality
Measure types
Hypothesis Testing and Chi-Square
Announcements
• National Insurance Case for Sat. 8/22
– Stephen will do a tutorial today, Friday, 8/21, from 1:00
to 2:15 in the MBA PC Lab and will be available tonight
from 7 to 9 pm in the MBA PC Lab to answer
questions
– Submit slides by 8:00 am on Sat. 8/22
– 2 slides with your conclusions – you may add
Appendices to support your conclusions
Primary vs. Secondary Data
Primary -- collected anew for current purposes
Secondary -- exists already, was collected for
some other purpose
Finding Secondary Data Online @ Fuqua
http://library.fuqua.duke.edu
Evaluating Sources of
Secondary Data
If you can’t find the source of a number,
don’t use it. Look for further data.
Always give sources when writing a report.
Applies for Focus Group write-ups too
Be skeptical.
Secondary Data: Pros & Cons
Advantages
cheap
quick
often sufficient
there is a lot of data out there
Disadvantages
there is a lot of data out there
numbers sometimes conflict
categories may not fit your needs
Types of Secondary Data
            Database: can slice/dice;     Summary: can't change categories
            need more processing          or get new crosstabs
Internal    WEMBA_C                       Knowledge Management
External    IMS Health, Nielsen, IRI*     Conquistador, Simmons, IRI_factbook
*IRI = Information Resources, Inc. (http://us.infores.com/)
Secondary Data Quality:
KAD p. 120 & “What’s Behind the Numbers?”
Data consistent with other independent sources?
What are the classifications? Do they fit needs?
When were numbers collected? Obsolete?
Who collected the numbers? Bias, resources?
Why were the data collected? Self-interest?
How were the numbers generated?
Sample size
Sampling method
Measure type
Causality (MBA Marketing Timing & Internship)
It is Hard to Infer Causality from
Secondary Data
Took Core Marketing   Got Desired Marketing Internship   Did Not Get Desired Marketing Internship
Term 1                76%                                24%
Term 3                51%                                49%
Measure Types
Nominal: Unordered categories.
Example: Male = 1; Female = 2
Ordinal: Ordered categories; intervals
can't be assumed to be equal.
Examples: I-95 is east of I-85; I-80 is north of I-40; preference data
Interval: Equally spaced categories; the zero point
and the units are arbitrary.
Examples: Fahrenheit temperature (each degree is equal); attitudes
Ratio: Equally spaced categories; 0 on the
scale means 0 of the underlying quantity.
Examples: $ sales, market share
Meaningful Statistics &
Permissible Transformations
Ratio
Example: Q1 = bottles of wine
Permissible transform: Q2 = b*Q1, e.g., cases sold (b = 1/12)
Meaningful stats: all below + % change
Interval
Example: wine rating scale, 1 = Very Bad to 20 = Very Good
Permissible transform: Att2 = a + (b*Att1), e.g., 81 to 100 (a = 80, b = 1) or 80.5 to 90 (a = 80, b = .5)
Meaningful stats: all below + mean
Ordinal
Example: rank order of wines, 1 = favorite, 2 = 2nd preferred, 3 = least preferred
Permissible transform: any order preserving, e.g., 100 = favorite, 90 = 2nd preferred, 0 = least preferred
Meaningful stats: all below + median
Nominal
Example: 1 = Pinot Noir, 2 = Merlot, 3 = Chardonnay
Permissible transform: any transformation is ok, e.g., 16 = Pinot Noir, 3 = Merlot, 13 = Chardonnay
Meaningful stats: # of cases, mode
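A small sketch of why % change is listed as meaningful only for ratio scales. The sales and rating figures are hypothetical, chosen only for illustration; the transforms (b = 1/12 and a = 80, b = 1) are the ones on the slide.

```python
def pct_change(old, new):
    """Percent change from old to new."""
    return 100.0 * (new - old) / old

# Ratio scale: bottles -> cases (b = 1/12) is a permissible transform.
bottles_old, bottles_new = 1200, 1500          # hypothetical sales figures
cases_old, cases_new = bottles_old / 12, bottles_new / 12

ratio_before = pct_change(bottles_old, bottles_new)
ratio_after = pct_change(cases_old, cases_new)

# Interval scale: rating (1 to 20) -> a + b*rating with a = 80, b = 1.
rating_old, rating_new = 10, 15                # hypothetical ratings
shifted_old, shifted_new = 80 + rating_old, 80 + rating_new

interval_before = pct_change(rating_old, rating_new)
interval_after = pct_change(shifted_old, shifted_new)

print(ratio_before, ratio_after)        # same % change in bottles and cases
print(interval_before, interval_after)  # % change depends on the arbitrary zero
```

The % change survives the ratio transform but not the interval transform, so it only carries meaning when 0 on the scale means 0 of the underlying quantity.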
Means and Medians with Ordinal Data
Gender   Measure 1   Measure 2
M        1           1
M        2           2
F        3           3
F        4           4
F        5           5
F        6           6
M        7           107
M        8           108
M        9           109
F        10          110
Means
Measure 1: M=5.4 < F=5.6
Measure 2: M=65.4 > F=25.6
Medians
Measure 1: M=7 > F=5
Measure 2: M=107 > F=5
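The comparison above can be reproduced with a short sketch. The data rows are copied from the slide; the helper name `summarize` is just an illustrative choice. Measure 2 is an order-preserving recoding of Measure 1, which is permissible for ordinal data.

```python
from statistics import mean, median

# (gender, measure 1, measure 2) rows copied from the slide.
rows = [
    ("M", 1, 1), ("M", 2, 2), ("F", 3, 3), ("F", 4, 4), ("F", 5, 5),
    ("F", 6, 6), ("M", 7, 107), ("M", 8, 108), ("M", 9, 109), ("F", 10, 110),
]

def summarize(gender, column):
    """Return (mean, median) of one measure for one gender."""
    values = [r[column] for r in rows if r[0] == gender]
    return mean(values), median(values)

m1_male, m1_female = summarize("M", 1), summarize("F", 1)
m2_male, m2_female = summarize("M", 2), summarize("F", 2)

# The means flip direction between the two measures; the medians do not.
print("Measure 1 (mean, median):", m1_male, m1_female)
print("Measure 2 (mean, median):", m2_male, m2_female)
```

Because the ranking of the group means reverses under a permissible transform, the mean is not a meaningful statistic for ordinal data, while the median comparison is stable.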
Ratio Scales & Index Numbers
Index = 100 * (Per Capita Segment i) / (Per Capita Ave)
Age Group   Population (000s)   Unit Sales (000s)   Per Capita Sales   Segment Index
<25         700                 1400                2.00               70
25-34       500                 1250                2.50               88
35-44       300                 900                 3.00               105
45-54       240                 960                 4.00               140
55 +        260                 1196                4.60               161
Total       2000                5706                2.85               100
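A minimal sketch of the index calculation, using the population and unit-sales figures from the table. The variable names are illustrative.

```python
# age group: (population in 000s, unit sales in 000s), copied from the table
segments = {
    "<25":   (700, 1400),
    "25-34": (500, 1250),
    "35-44": (300, 900),
    "45-54": (240, 960),
    "55+":   (260, 1196),
}

total_pop = sum(pop for pop, _ in segments.values())
total_units = sum(units for _, units in segments.values())
per_capita_avg = total_units / total_pop   # about 2.85 units per person

# Index = 100 * (per capita sales in segment) / (per capita average)
index = {
    group: round(100 * (units / pop) / per_capita_avg)
    for group, (pop, units) in segments.items()
}

for group, value in index.items():
    print(f"{group:>6}: {value}")
```

An index above 100 means the segment buys more per person than the market average, so the 55+ group (161) buys about 61% more per capita than average.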
Today’s Agenda
Announcements
Southwestern Conquistador Beer Case
Backward Market Research
Secondary data quality
Measure types
Hypothesis Testing and Chi-Square
Cross Tabs of MBA Acceptance by
Gender
A. Raw Frequencies
        Accept   Reject   Total
M       140      860      1000
F       60       740      800
Total   200      1600     1800
B. Cell Percentages
        Accept   Reject   Total
M       .078     .478     .556
F       .033     .411     .444
Total   .111     .889     1.0
C. Row Percentages
        Accept            Reject            Total
M       140/1000 = .140   860/1000 = .860   1.00
F       60/800 = .075     740/800 = .925    1.00
D. Column Percentages
        Accept           Reject
M       140/200 = .700   860/1600 = .538
F       60/200 = .300    740/1600 = .462
Total   1.00             1.00
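The three percentage formats can be computed from the raw frequencies with a short sketch like the following. The dictionary layout and names are illustrative choices, not part of the course material.

```python
# Raw frequencies from table A.
counts = {"M": {"Accept": 140, "Reject": 860},
          "F": {"Accept": 60, "Reject": 740}}

n = sum(sum(row.values()) for row in counts.values())
row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {c: sum(counts[g][c] for g in counts) for c in ("Accept", "Reject")}

# B: each cell divided by the grand total.
cell_pct = {g: {c: v / n for c, v in row.items()} for g, row in counts.items()}
# C: each cell divided by its row (gender) total.
row_pct = {g: {c: v / row_totals[g] for c, v in row.items()} for g, row in counts.items()}
# D: each cell divided by its column (decision) total.
col_pct = {g: {c: v / col_totals[c] for c, v in row.items()} for g, row in counts.items()}

for g in counts:
    print(g, {c: round(p, 3) for c, p in row_pct[g].items()})
```

Row percentages answer "what share of each gender was accepted?" (14.0% of men, 7.5% of women), which is the comparison that matters when gender is the possible causal factor.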
Rule of Thumb
If a potential causal interpretation exists, make
numbers add up to 100% at each level of the
causal factor.
Above: it is possible that gender (row) causes
or influences acceptance (column), but not that
acceptance influences gender. Hence, row
percentages (format C) would be desirable.
Hypothesis Formulation and Testing
Hypothesis: What you believe the relationship is between the
measures.
Theory
Empirical Evidence
Beliefs
Experience
Here: Believe that acceptance is related to gender
Null Hypothesis: Acceptance is not related to gender
Logic of hypothesis testing: Negative Inference
The null hypothesis is rejected by showing that the observed data
would be quite improbable if the null hypothesis were true.
Want to see if we can reject the null.
Steps in Hypothesis Testing
1. State the hypothesis in Null and Alternative Form
– Ho: There is no relationship between gender
and MBA acceptance
– Ha1: Gender and Acceptance are related
(2-sided)
– Ha2: Fewer Women are Accepted (1-sided)
2. Choose a test statistic
3. Construct a decision rule
Chi-Square Test
Used for nominal data, to compare the observed
frequency of responses to what would be “expected”
under the null hypothesis.
Two types of tests
Contingency (or Relationship) – tests if the variables
are independent – i.e., no significant relationship
exists between the two variables
Goodness of fit test – Compare whether the data
sampled is proportionate to some standard
Chi-Square Test
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
with (r-1)*(c-1) degrees of freedom
$O_i$ = Observed number in cell i
$E_i$ = Expected number in cell i under independence
k = number of cells
r = number of rows
c = number of columns
$E_i$ = Column Proportion * Row Proportion * total number observed
MBA Acceptance Data Contingency
A. Observed Frequencies
        Accept   Reject   Total
M       140      860      1000
F       60       740      800
Total   200      1600     1800
B. Cell Percentages
        Accept   Reject   Total
M       .078     .478     .556
F       .033     .411     .444
Total   .111     .889     1.0
C. Expected Frequencies
        Accept                 Reject
M       .111*.556*1800 = 111   .889*.556*1800 = 890
F       .111*.444*1800 = 89    .889*.444*1800 = 710
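A sketch of the expected-frequency calculation under independence. It uses the equivalent form row total * column total / n, which equals column proportion * row proportion * n. Note that it works with unrounded proportions, so it gives about 888.9 and 711.1 where the slide, working from three-decimal proportions, shows 890 and 710.

```python
observed = [[140, 860],   # M: accept, reject
            [60, 740]]    # F: accept, reject

n = sum(map(sum, observed))
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# E = (row total / n) * (column total / n) * n = row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(("M", "F"), expected):
    print(label, [round(e, 1) for e in row])
```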
Chi-Square Test
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
with (r-1)*(c-1) degrees of freedom
$$\chi^2 = \frac{(140-111)^2}{111} + \frac{(860-890)^2}{890} + \frac{(60-89)^2}{89} + \frac{(740-710)^2}{710} = 19.30$$
So?
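The statistic can be checked with a few lines of code. This sketch computes it twice: once with the rounded expected counts shown on the slide, and once with unrounded expected counts (row total * column total / n), which is an addition here and not a figure from the slides.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [140, 860, 60, 740]            # M accept, M reject, F accept, F reject
slide_expected = [111, 890, 89, 710]      # rounded values from the slide
exact_expected = [1000 * 200 / 1800, 1000 * 1600 / 1800,
                  800 * 200 / 1800, 800 * 1600 / 1800]

chi_sq_slide = chi_square(observed, slide_expected)
chi_sq_exact = chi_square(observed, exact_expected)

print(round(chi_sq_slide, 2))   # matches the slide's 19.30
print(round(chi_sq_exact, 2))   # slightly lower without rounding
```

The two values differ only because of rounding in the expected counts; both are far above the critical values used in the decision rule.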
3. Construct a decision rule
Decision Rule
1. Significance level: .05
Probability of rejecting the null hypothesis when it is true
2. Degrees of freedom: the number of unconstrained data used in
calculating a test statistic. For chi-square it is (r-1)*(c-1), so
here that would be 1. When the number of cells is larger, we
need a larger test statistic to reject the null.
3. Two-tailed or one-tailed test: significance tables are (unless
otherwise specified) two-tailed tables. Chi-Sq is on pg 517
Ha1: Gender and Acceptance are related (2-sided) Critical Value = 3.84
Ha2: Fewer Women are Accepted (1-sided) Critical Value = 2.71
(2.71 is the .10 table value; it applies only when the observed
difference is in the predicted direction)
4. Decision Rule: Reject the Ho if the calculated Chi-sq value (19.3)
> the test critical value (3.84) for Ha1 or (2.71) for Ha2
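As a cross-check on the table lookup, the decision rule can be sketched in code. The closed form used for the upper-tail probability with 1 degree of freedom, erfc(sqrt(x/2)), is a standard identity and not something from the slides; the function name is illustrative.

```python
import math

def chi_sq_p_value_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

alpha = 0.05
critical_value = 3.84     # table value for df = 1 at the .05 level
chi_sq = 19.30            # statistic computed for the MBA acceptance data

p_value = chi_sq_p_value_df1(chi_sq)
reject_null = chi_sq > critical_value

print(f"p-value = {p_value:.6f}, reject Ho: {reject_null}")
print(round(chi_sq_p_value_df1(3.84), 3))   # tail area at the 2-sided critical value
print(round(chi_sq_p_value_df1(2.71), 3))   # tail area at the 1-sided critical value
```

Comparing the statistic to the critical value and comparing the p-value to the significance level are two views of the same rule; here both lead to rejecting Ho.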
Chi-Square Table
Goodness of fit – Chi-Square
Ho: Car color preferences have not shifted
Ha: Car color preferences have shifted
Color     Data (observed)   Historic Distribution   Expected # = Prob*n
Red       680               30%                     750
Green     520               25%                     625
Black     675               25%                     625
White     625               20%                     500
Tot (n)   2500
Do we observe what we expected?
Chi-Square Test
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
with (k-1) degrees of freedom
$$\chi^2 = \frac{(680-750)^2}{750} + \frac{(520-625)^2}{625} + \frac{(675-625)^2}{625} + \frac{(625-500)^2}{500} = 59.42$$
So?
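A short sketch of the goodness-of-fit calculation, using the observed counts and historic shares from the car color example. The dictionary names are illustrative.

```python
observed = {"Red": 680, "Green": 520, "Black": 675, "White": 625}
historic = {"Red": 0.30, "Green": 0.25, "Black": 0.25, "White": 0.20}

n = sum(observed.values())

# Expected count in each category = historic probability * sample size.
expected = {color: share * n for color, share in historic.items()}

chi_sq = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
df = len(observed) - 1    # k - 1 degrees of freedom

print(round(chi_sq, 2), "with", df, "degrees of freedom")
```

The largest contributions come from White (more popular than expected) and Green (less popular than expected), which is where the shift in preferences shows up.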
3. Construct a decision rule
Decision Rule
1. Significance level: .05
Probability of rejecting the null hypothesis when it is true
2. Degrees of freedom: the number of unconstrained data used in
calculating a test statistic. For chi-square it is (k-1), so here that
would be 3. When the number of cells is larger, we need a larger
test statistic to reject the null.
3. Two-tailed or one-tailed test: significance tables are (unless
otherwise specified) two-tailed tables. Chi-Sq is on pg 517
Ha: Preferences have changed (2-sided) Critical Value = 7.81
4. Decision Rule: Reject the Ho if the calculated Chi-sq value (59.42) >
the test critical value (7.81).
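The same decision can be sketched in code for 3 degrees of freedom. The closed-form tail probability used below, erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2), is a standard identity for df = 3 and is an addition here, not part of the slides; the function name is illustrative.

```python
import math

def chi_sq_p_value_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

alpha = 0.05
critical_value = 7.81     # table value for df = 3 at the .05 level
chi_sq = 59.42            # statistic computed for the car color data

p_value = chi_sq_p_value_df3(chi_sq)
reject_null = chi_sq > critical_value

print(f"p-value = {p_value:.2e}, reject Ho: {reject_null}")
print(round(chi_sq_p_value_df3(7.81), 3))   # tail area at the critical value
```

With a statistic this far above 7.81, the conclusion that color preferences have shifted does not depend on the exact significance level chosen.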
Chi-Square Table
Recap
Finding & Evaluating Secondary Data
Measure Types
permissible transformations
Meaningful statistics
Index #s
Crosstabs
Casting right direction
Chi-square statistic
Contingency Test
Goodness of Fit Test