Transcript Lecture 9
Basics of Statistical Analysis
Basics of Analysis
• The process of data analysis
Observation → (Encode) → Data → (Analysis) → Information
Example 1:
– Gift Catalog Marketer
– Mails 4 times a year to its customers
– Company has 1 million customers on its file
Example 1
• Cataloger would like to know whether new
customers buy more than old customers.
• Classify new customers as anyone who
bought within the last twelve months.
• Analyst takes a sample of 100,000
customers and notices the following.
Example 1
• 5000 orders received in the last month
• 3000 (60%) were from new customers
• 2000 (40%) were from old customers
• So it looks like the new customers are doing
better
Example 1
• Is there a catch here?
• Data at this gross level does not discriminate
between customers within either group.
– A customer who bought within the last 11 days is
treated exactly the same as a customer who bought within
the last 11 months.
Example 1
• Can we use some other variable to distinguish between old
and new Customers?
• Answer: Actual dollars spent!
• What can we do with this variable?
– Find its Mean and Variation.
• We might find that the average purchase amount for old
customers is two or three times larger than the average
among new customers
Numerical Summaries of data
• The two basic concepts are the center and
the spread of the data
• Center of data
- Mean, which is given by $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Median
- Mode
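As a minimal sketch of the three measures of center, the snippet below computes them in Python. The purchase amounts are hypothetical values chosen for illustration; they are not data from the lecture.

```python
import statistics

# Hypothetical purchase amounts in dollars (illustrative only, not lecture data)
purchases = [20.0, 35.0, 35.0, 50.0, 110.0]

n = len(purchases)
mean = sum(purchases) / n               # (1/n) * sum of x_i
median = statistics.median(purchases)   # middle value of the sorted data
mode = statistics.mode(purchases)       # most frequent value

print(mean, median, mode)
```

Note how the single large purchase pulls the mean above the median and mode, which is one reason to look at more than one measure of center.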
Numerical Summaries of data
• Forms of Variation
– Difference about the mean: $x_i - \bar{x}$
– Absolute Difference: $|x_i - \bar{x}|$
– Total Sum of Squares: $\sum_{i=1}^{n} (x_i - \bar{x})^2$
– Variance: $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
– Standard Deviation: $\sqrt{\text{Variance}}$
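The forms of variation above can be sketched step by step in Python. The purchase amounts are again hypothetical, and the variable names are chosen here for illustration.

```python
import math

# Hypothetical purchase amounts in dollars (illustrative only, not lecture data)
purchases = [20.0, 35.0, 35.0, 50.0, 110.0]

n = len(purchases)
x_bar = sum(purchases) / n

deviations = [x - x_bar for x in purchases]    # difference about the mean
abs_deviations = [abs(d) for d in deviations]  # absolute difference
total_ss = sum(d ** 2 for d in deviations)     # total sum of squares
variance = total_ss / (n - 1)                  # sample variance, divisor n - 1
std_dev = math.sqrt(variance)                  # standard deviation

print(total_ss, variance, std_dev)
```

The raw differences about the mean always sum to zero, which is why they are squared (or taken in absolute value) before being added up.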
Confidence Intervals
• In the catalog example, the analyst wants to know the average
purchase amount of customers
• He draws two samples of 75 customers each and
finds the means to be $68 and $122
• Since the difference is large, he draws another 38
samples of 75 each
• The mean of the means of the 40 samples turns out to
be $94.85
• How confident should he be of this mean of
means?
Confidence Intervals
• Analyst calculates the standard deviation of
sample means, called Standard Error (SE)
• Basic premise for confidence intervals
– 95 percent of the time, the true mean purchase amount
lies within plus or minus 1.96 standard errors of
the mean of the sample means.
• C.I. = Mean ± 1.96 × Standard Error
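A minimal sketch of this calculation follows. The function name `ci_from_sample_means` is made up for illustration, and of the sample means listed only $68 and $122 come from the lecture; the other two are hypothetical stand-ins for the remaining samples.

```python
import statistics

def ci_from_sample_means(sample_means, z=1.96):
    """Return (low, high) for mean of means +/- z * standard error."""
    grand_mean = statistics.mean(sample_means)
    # Standard error here = standard deviation of the sample means
    se = statistics.stdev(sample_means)
    return grand_mean - z * se, grand_mean + z * se

# 68 and 122 are from the lecture; 90 and 100 are hypothetical
means = [68.0, 122.0, 90.0, 100.0]
low, high = ci_from_sample_means(means)
print(f"95% C.I.: ${low:.2f} to ${high:.2f}")
```

With the analyst's full set of 40 sample means, the same function would be centered on $94.85.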
Confidence Intervals
• However, if the C.I. is calculated from only one sample,
then
Standard Error of sample mean
= (Standard deviation of sample) / $\sqrt{n}$
• Basic premise for confidence intervals with one sample
– 95 percent of the time, the true mean lies within plus or minus 1.96
standard errors of the sample mean.
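The one-sample version can be sketched the same way. The function name `ci_from_one_sample` and the data are hypothetical; with a sample this small the 1.96 multiplier is only a rough guide, and it is kept here simply to mirror the formula in the lecture.

```python
import math
import statistics

def ci_from_one_sample(sample, z=1.96):
    """Return (low, high) for sample mean +/- z * (s / sqrt(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    return mean - z * se, mean + z * se

# Hypothetical purchase amounts in dollars (illustrative only)
sample = [20.0, 35.0, 35.0, 50.0, 110.0]
low, high = ci_from_one_sample(sample)
print(f"95% C.I.: ${low:.2f} to ${high:.2f}")
```

Because the standard error shrinks with the square root of n, quadrupling the sample size halves the width of the interval.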
C.I. For Response Rates
• Standard error for response rates is
S.E. = $\sqrt{p(1 - p) / n}$
Where,
p = Sample response rate
n = sample size
Example 2:
• Test 1,000 names selected at random from a new list.
• To break even, the list must be expected to have a response
rate of 4.5 percent.
• Confidence Interval = Expected Response ± 1.96 × SE
= p ± 1.96 × SE
• In our case, SE = $\sqrt{0.045 \times 0.955 / 1000} \approx 0.00656$, so C.I. = 3.22% to 5.78%.
Thus any test response between 3.22% and 5.78% supports the hypothesis that the true
response rate is 4.5%.
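The interval in Example 2 can be reproduced with a short sketch. The function name `response_rate_ci` is chosen here for illustration; the inputs (p = 4.5%, n = 1,000) are the ones given in the example.

```python
import math

def response_rate_ci(p, n, z=1.96):
    """Return (low, high) for p +/- z * sqrt(p * (1 - p) / n)."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

low, high = response_rate_ci(0.045, 1000)
print(f"{low:.2%} to {high:.2%}")  # 3.22% to 5.78%
```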
Example 2:
• The list is mailed and actually pulls in 3.5%
• Thus, the true response rate may be 4.5%, since 3.5% lies within the interval
• What if the actual rate pulled in were 5%?
• Regression towards the mean: phenomenon of a test
result being different from the true result
• Give more thought to lists whose cutoff rates
lie within the confidence interval
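As a rough sketch of the check described in this example, the helper below asks whether an observed test response is consistent with the break-even rate. The name `supports_break_even` is hypothetical, and the interval is built around the break-even rate as in the example.

```python
import math

def supports_break_even(observed, break_even, n, z=1.96):
    """True if the observed rate lies within z standard errors of break_even."""
    se = math.sqrt(break_even * (1 - break_even) / n)
    return abs(observed - break_even) <= z * se

# Rates from the example: 4.5% break-even, 1,000 names mailed
print(supports_break_even(0.035, 0.045, 1000))  # the list pulled 3.5%
print(supports_break_even(0.050, 0.045, 1000))  # what if it pulled 5%?
```

Both 3.5% and 5% fall inside the 3.22% to 5.78% interval, so neither result alone settles whether the list truly performs above or below break-even.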
Determining List Size
• Let us assume that we expect true p = 0.035
• We want to be 95% certain that our test mailing will
tell us if the true response is between 3.3% and 3.7%
• We are saying that Precision = 1.96 × SE = 0.002 (or
0.2%)
• Hence,
– SE = 0.002/1.96 = 0.001020
– 0.001020 = $\sqrt{p(1 - p) / n}$
– 0.001020 = $(0.033775 / n)^{1/2}$
– 0.00000104 = 0.033775/n
– n = 32,437
• In general,
– $n = \frac{p(1 - p) \times 1.96^2}{\text{Precision}^2}$
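The general formula can be sketched as a small function. The name `required_list_size` is chosen here for illustration; the inputs are the ones used in the lecture (p = 0.035, Precision = 0.002).

```python
def required_list_size(p, precision, z=1.96):
    """Test size n such that z * sqrt(p * (1 - p) / n) equals precision."""
    return p * (1 - p) * z ** 2 / precision ** 2

n = required_list_size(0.035, 0.002)
print(n)  # roughly 32,437.5, which the lecture reports as 32,437
```

Because precision enters as a square, halving the desired precision (for example from 0.2% to 0.1%) quadruples the required test size.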