Class1 - NYU Stern School of Business

Statistics & Data Analysis
Course Number B01.1305
Course Section 60
Meeting Time
Monday 6-9:30 pm
CLASS #1
Class #1 Outline
 Introduction to the instructor
 Introduction to the class
• Review of syllabus
• Introduction to statistics
• Class Goals
 Types of data
 Graphical and numerical methods for univariate
series
 Minitab Tutorial
Professor S. D. Balkin -- May 20, 2002
Professor Balkin’s Info
 Ph.D. in Business Administration, Penn State
 Masters in Statistics, Penn State
 Mathematics/Economics and Music, Lafayette College
 Employment
• Pfizer Inc.
– Management Science Group; Sept. 2001 – current
• Ernst & Young
– Quantitative Economics and Statistics Group; June 1999 – August 2001
What is Statistics?
 STATISTICS: A body of principles and methods for extracting useful
information from data, for assessing the reliability of that information,
for measuring and managing risk, and for making decisions in the
face of uncertainty.
 POPULATION: set of measurements corresponding to the entire
collection of units
 SAMPLE: set of measurements that are collected from a population
 OBJECTIVES:
• To make inferences about a population from a sample, including
the extent of uncertainty
• To design the data collection process to facilitate drawing valid
inferences
Reasons for Sampling
 Typically due to prohibitive cost of contacting
millions of people or performing costly
experiments
• Election polls query about 2,000 voters to make
inferences regarding how all voters cast their ballots
 Sometimes the sampling process is destructive
• Sampling wine quality
Statistics in Everyday Life
 Monthly Unemployment Rates (BLS)
 Consumer Price Index
 Presidential Approval Rating
 Quality and Productivity Improvement
 Scientific Inquiry
• Training effectiveness
• Advertising impact
Interesting Statistical Perspectives
 “Statistical thinking will one day be as necessary for efficient
citizenship as the ability to read and write”.
– (H. G. Wells)
 “There are three kinds of lies -- Lies, damn lies, and
statistics”.
– (Benjamin Disraeli)
 “You’ve got to know when to hold ‘em, know when to fold
‘em.”
– (Kenny Rogers, in The Gambler)
 “The average U. S. household has 2.75 people in it.”
– (U. S. Census Bureau, 1980)
 “4 out of 5 dentists surveyed recommended Trident
Sugarless Gum for their patients who chew gum.”
– (Advertisement for Trident)
Semester Overview
 Understanding data
• Intro to descriptive statistics, interpreting data, and graphical
methods
 Dealing with and quantifying uncertainty
• Random variables and probability
 Using samples to make generalizations about populations
• Assessing whether a change in data is beyond random
variation
 Modeling relationships and predicting
• Using sample data to create models that give predictions for
all values of a population
Goals for this Class
 To gain an understanding of descriptive statistics,
probability, statistical inference, and regression
analysis so that it may be applied to your job
 To be able to identify when statistical procedures
are required to facilitate your business decision
making
 To be able to identify both good and poor use of
statistics in business
Goals for Me
 To teach you statistics and data analysis
effectively
 To improve my effectiveness as an instructor
My Promise To You
I will not teach you anything in this
class that is not regularly used in
business and industry
If you ask, “Where is this used?” I will
have a real example for you
Types of Data
 Qualitative / Categorical: qualitative trait only classifiable into categories
• Cable Appointment (made, missed)
• Employment Status (employed, unemployed)
• Bond Ratings (1, 2, 3, or 4 stars)
• Service Quality (poor, good, excellent)
 Quantitative / Continuous: characteristic measured on a numerical scale
• Cable Appointment Waiting Time (hours)
• Employment Tenure (months)
• Bond Return (percentage)
• Cost (dollars)
Example: Data Types
 Business Horizons (1993) conducted a comprehensive
survey of 800 CEOs who run the country's largest global
corporations. Some of the variables measured are given
below. Classify them as quantitative or qualitative.
• State of birth
• Age
• Educational Level
• Tenure with Firm
• Total Compensation
• Area of Expertise
• Gender
How Much Data
 Univariate Data: data sets with just one piece of information
• Questions: What is a typical value? How do the values vary?
• Examples: GMAT scores for students in this class; incomes in a zipcode; returns for a stock over this past year; respondent ages from market research
 Bivariate Data: data sets with two pieces of information
• Questions: Is there a relationship? How strong is the relationship? Is there a predictive relationship?
• Examples: GMAT scores and college GPA; incomes and age in a zipcode; returns and volume for a stock; MR respondent age and purchase intent
 Multivariate Data: data sets with three or more pieces of information
• Questions: Are there relationships? How strong are the relationships? Do predictive relationships exist?
• Examples: GMAT scores, salary, gender, job tenure, job category, house ownership, etc.
CHAPTER 2
Summarizing Data about One Variable
Introduction
 Unorganized mass of numbers is difficult to interpret
 First task in understanding data is summarizing it
• Graphically
• Numerically
Chapter Goals
 Distinguish between qualitative and quantitative variables
 Learn graphic representations of univariate data
 Learn numerical representations of univariate data
 Investigate data acquired over time
Distribution of Values
 A distribution describes how many times each
possible data value occurs in a set of data.
 Methods for displaying distributions
• Qualitative data
– Frequency table
– Bar charts
• Quantitative data
– Histograms
– Stem-Leaf diagrams
– Boxplots
Example: Qualitative Data
 Background: A question on a market research
survey asked 17 respondents the size of their
households
 Data: 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6
 Frequency Table
Household Size   Number of Households
1                3
2                5
3                6
4                2
5                0
6                1
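A frequency table like this one can be built by counting how often each value occurs. The following is a minimal Python sketch, not part of the original slides; the variable names are illustrative assumptions, and the data are the 17 household sizes given in the example.

```python
from collections import Counter

# Household sizes reported by the 17 survey respondents (from the example).
household_sizes = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 6]

counts = Counter(household_sizes)

# List every size from the smallest to the largest observed value so that
# sizes nobody reported (size 5 here) still get a row with frequency 0.
frequency_table = {
    size: counts.get(size, 0)
    for size in range(min(household_sizes), max(household_sizes) + 1)
}

print("Household Size   Number of Households")
for size, freq in frequency_table.items():
    print(f"{size:<16} {freq}")
```

Filling in the zero row for size 5 is a design choice: it keeps the gap in the data visible when the table is later drawn as a bar chart.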
Example: Qualitative Data (cont.)
 Bar chart: plot of the frequency with which each category
occurs in the data set
[Figure: bar chart "Number of Households"; x-axis: Household Size (1 to 6), y-axis: Frequency]
Example: Quantitative Data
 Background: Forbes magazine published data on
the best small firms in 1993. These were firms
with annual sales of more than $5 million and less than
$350 million. Firms were ranked by five-year
average return on investment. The data are the
annual salary of the chief executive officer for the
first 60 ranked firms.
 Data (in thousands):
145  621  262  208  362  424  339  736  291  368
659  234  396  300  343  536  543  206  250   21
298  350  800  726  198  213  296  317  482  155
 58  498  643  390  332  750  217  298 1103  406
254  862  204  370  536  291  808  543  149  350
242  802  200  282  573  388  250  396  572
Example: Quantitative Data (cont.)
 Histograms are constructed in the same way as
bar charts except:
• User must create classes to count frequencies
• Bars are adjacent instead of separated with space
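Creating classes and counting frequencies can be sketched in a few lines of Python. This is an illustration only, not from the original slides: the class width of 200 (thousand dollars) and the function name are assumptions, and the salaries are the 59 values that appear in this transcript.

```python
# CEO salaries in thousands of dollars, as listed in the transcript (59 values).
salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 368,
    659, 234, 396, 300, 343, 536, 543, 206, 250, 21,
    298, 350, 800, 726, 198, 213, 296, 317, 482, 155,
    58, 498, 643, 390, 332, 750, 217, 298, 1103, 406,
    254, 862, 204, 370, 536, 291, 808, 543, 149, 350,
    242, 802, 200, 282, 573, 388, 250, 396, 572,
]

CLASS_WIDTH = 200  # assumed class width; other widths give other histograms


def class_counts(values, width):
    """Count how many values fall in each class [k*width, (k+1)*width)."""
    n_classes = max(values) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[v // width] += 1
    return counts


counts = class_counts(salaries, CLASS_WIDTH)

# Text histogram: one row per class, one '#' per firm in that class.
for k, c in enumerate(counts):
    low = k * CLASS_WIDTH
    print(f"{low:>4} to {low + CLASS_WIDTH - 1:>4}: {'#' * c} ({c})")
```

The choice of classes is up to the user, which is exactly why two histograms of the same data can look different.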
Example: Quantitative Data (cont.)
[Figure: CEO Salary Histogram; x-axis: Salary (in thousands), y-axis: Frequency]
Example: Quantitative Data (cont.)
 Questions:
• What is the typical value of CEO salary?
• How much variability is there around this value?
• What is the general shape of the data?
 Histogram characteristics:
• Central tendency
• Variability
• Skewness
• Modality
• Outliers
Skewness
[Figure: three histograms of Freq vs. Data: Symmetric Distribution, Right Skewed Distribution, Left Skewed Distribution]
Modality
[Figure: two histograms of Freq vs. Data: Unimodal Distribution, Bimodal Distribution]
Outliers
[Figure: histogram of Freq vs. Data: Distribution with Outlier]
Example: Stem-Leaf Diagram
 Background: Telecom company wants to analyze the time
to complete new service orders measured in hours
 Data:
42 21 46 69 87 29 34 59 81 97 64 60 87 81 69 77 75 47
73 82 91 74 70 65 86 87 67 69 49 57 55 68 74 66 81 90
75 82 37 94
 Diagram:
2 | 19
3 | 47
4 | 2679
5 | 579
6 | 045678999
7 | 0344557
8 | 111226777
9 | 0147
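A stem-leaf diagram for two-digit data splits each value into its tens digit (the stem) and its units digit (the leaf). A minimal Python sketch of that idea follows; it is not from the original slides, the function name is an assumption, and it only handles non-negative integers with stems defined by the tens digit.

```python
from collections import defaultdict

# Hours to complete new service orders (the 40 values from the example).
hours = [
    42, 21, 46, 69, 87, 29, 34, 59, 81, 97, 64, 60, 87, 81, 69, 77, 75, 47,
    73, 82, 91, 74, 70, 65, 86, 87, 67, 69, 49, 57, 55, 68, 74, 66, 81, 90,
    75, 82, 37, 94,
]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    return {stem: "".join(digits) for stem, digits in leaves.items()}


diagram = stem_and_leaf(hours)
for stem in sorted(diagram):
    print(f"{stem} | {diagram[stem]}")
```

Unlike a histogram, the diagram keeps every original value readable: the row `2 | 19` stands for 21 and 29 hours.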
Measures of Central Tendency
 Mode: Value or category that occurs most
frequently
 Median: Middle value when the data are sorted
 Mean: Sum of measurements divided by the
number of measurements
Example: Mode
 Background: A question on a market research
survey asked 17 respondents the size of their
households
 Data: 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6
 Frequency Table
Household Size   Number of Households
1                3
2                5
3                6   <- Mode
4                2
5                0
6                1
Example: Median
 Background: A question on a market research
survey asked 17 respondents the size of their
households
 Data: 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6
 Since there are n = 17 observations,
• Median is the (n+1)/2 = 9th observation
Observation:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Household Size:  1  1  1  2  2  2  2  2  3  3  3  3  3  3  4  4  6
Median = 3 (the 9th observation)
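The mode and median of the household data can be checked with a short Python sketch. This is an illustration rather than part of the original slides; it uses the standard `statistics` module, and the position rule shown applies to an odd number of observations, as in this example.

```python
import statistics

# Household sizes reported by the 17 survey respondents (from the example).
household_sizes = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 6]

ordered = sorted(household_sizes)
n = len(ordered)

# For odd n, the median is the observation at position (n + 1) / 2,
# counting from 1. Python lists count from 0, hence the "- 1".
position = (n + 1) // 2
median_by_position = ordered[position - 1]

mode = statistics.mode(household_sizes)      # most frequent value
median = statistics.median(household_sizes)  # middle value of the sorted data

print(f"n = {n}, median position = {position}")
print(f"mode = {mode}, median = {median}")
```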
Example: Mean
 Background: Cable company wants to know how
long an installer spends at each stop. One
employee performed five installations in one day
and recorded how many minutes she was at each
location.
 Data: 45, 23, 36, 29, 52
 Mean = (45+23+36+29+52) / 5 = 37 minutes
Example: Back to the CEO’s Salaries
[Figure: CEO Salary Histogram; x-axis: Salary (in thousands), y-axis: Frequency]
 Median = 350
 Mean = 404.1695
 WHY THE DIFFERENCE?
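One way to explore the gap between the two measures is to recompute them with and without the largest salary. The Python sketch below is an illustration, not part of the original slides; it uses the 59 salary values that appear in this transcript. Dropping the single largest value moves the mean much more than the median, which is consistent with a right-skewed distribution pulling the mean upward.

```python
import statistics

# CEO salaries in thousands of dollars, as listed in the transcript (59 values).
salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 368,
    659, 234, 396, 300, 343, 536, 543, 206, 250, 21,
    298, 350, 800, 726, 198, 213, 296, 317, 482, 155,
    58, 498, 643, 390, 332, 750, 217, 298, 1103, 406,
    254, 862, 204, 370, 536, 291, 808, 543, 149, 350,
    242, 802, 200, 282, 573, 388, 250, 396, 572,
]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# Same statistics after removing the single largest salary.
trimmed = sorted(salaries)[:-1]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)

print(f"all firms:       mean = {mean_salary:.4f}, median = {median_salary}")
print(f"without largest: mean = {trimmed_mean:.4f}, median = {trimmed_median}")
```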
Measures of Variation
 Variability is a primary reason for using statistics
 If there were no variability, we would not need statistics
 Examples:
• Worker productivity
• Stock market
• Promotional expenditures
 Measures
• Standard deviation: variation around the mean
• Range: distance between smallest and largest observations
Standard Deviation
 Standard Deviation: summarizes how far away
from the mean the data values typically are.
 Calculation
• Find the deviations by subtracting the mean from
each data value
• Square these deviations, add them up, and divide
by n-1
• Take the square root of this number
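Written as a formula, the three calculation steps above amount to the following. The notation is an assumption introduced here for illustration: the data values are written x_1, ..., x_n and the sample mean is written with a bar.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}
```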
Example: Standard Deviation
 Background: Your firm spends $19 Million per
year on advertising, and management is
wondering if that figure is appropriate. Other firms
in your industry have a mean advertising
expenditure of $22.3 Million per year.
Example: Standard Deviation (cont.)
Industry advertising expenditures (Ad$, in millions of dollars):
Ad$    Deviation   Sq Dev
 8     -14.29      204.32
19      -3.29       10.85
22      -0.29        0.09
20      -2.29        5.26
27       4.71       22.15
37      14.71      216.26
38      15.71      246.67
23       0.71        0.50
23       0.71        0.50
12     -10.29      105.97
11     -11.29      127.56
32       9.71       94.20
20      -2.29        5.26
18      -4.29       18.44
23       0.71        0.50
35      12.71      161.44
11     -11.29      127.56
Mean = 22.29
St Dev = 9.18
[Figure: Industry Advertising Histogram; x-axis: Millions of Dollars, y-axis: Frequency]
Example: Standard Deviation (cont.)
 Difference from peer group average is $3.3 Million
 This difference is smaller than the industry
standard deviation of $9.18 Million
 Conclusion: Your advertising budget, while slightly
below the industry average, is typical compared
with your industry peers
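The calculation in this example can be sketched in Python as follows. This is an illustration and not part of the original slides; the variable names are assumptions, and the data are the 17 industry expenditures from the example (in millions of dollars).

```python
import math

# Industry advertising expenditures, in millions of dollars (from the example).
ad_spend = [8, 19, 22, 20, 27, 37, 38, 23, 23, 12, 11, 32, 20, 18, 23, 35, 11]
our_spend = 19  # the firm's own advertising budget, in millions of dollars

n = len(ad_spend)
mean = sum(ad_spend) / n

# Step 1: deviations from the mean.
deviations = [x - mean for x in ad_spend]
# Step 2: square the deviations, add them up, and divide by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)
# Step 3: take the square root.
st_dev = math.sqrt(variance)

gap = mean - our_spend
print(f"mean = {mean:.2f}, st dev = {st_dev:.2f}, gap = {gap:.2f}")
print("gap is within one standard deviation:", gap < st_dev)
```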
Empirical Rule
 If the histogram for a given sample is unimodal
and symmetric (mound-shaped), then the
following rule-of-thumb may be applied:
 Let $\bar{x}$ represent the sample mean and $s$ the
sample standard deviation. Then
• $\bar{x} \pm 1s$ contains approximately 68% of the measurements;
• $\bar{x} \pm 2s$ contains approximately 95% of the measurements;
• $\bar{x} \pm 3s$ contains approximately all of the measurements.
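The rule can be tried out on simulated data. The Python sketch below is an illustration only: the data are randomly generated mound-shaped values, NOT data from the slides, and the seed, sample size, and function name are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable

# Simulated unimodal, symmetric data (hypothetical, not from the slides).
data = [random.gauss(30, 1) for _ in range(10_000)]

mean = statistics.mean(data)
s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)


def share_within(values, center, spread, k):
    """Fraction of values inside center +/- k * spread."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= v <= high for v in values) / len(values)


shares = {k: share_within(data, mean, s, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} st. dev.: {share:.1%}")
```

For data that are skewed or have heavy tails, the observed shares can differ noticeably from 68% and 95%, so the rule is best treated as a rule of thumb.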
Example: Stock Market Volatility
 Description: Stock market returns are supposed to be unpredictable.
Let’s see if the empirical rule holds true
 Data: S&P-500 Daily returns; Jan 01, 1998 – May 17, 2002
 Mean = 0.0002
 St. Dev. = 0.0128
 72.8% (95.3%) of the returns fall between the sample mean plus
and minus one (two) st.dev.
[Figure: S&P-500 Daily Returns Histogram; x-axis: Daily Return, y-axis: Frequency]
Inter-Quartile Range
 Inter-Quartile Range (IQR) provides an alternative approach
to measuring variability
 Computation:
• Sort the data and find the median
• Divide the data into top and bottom halves
• Find the median of both halves. These are the 25th and
75th percentiles
• IQR = 75th percentile – 25th percentile
 Outlier Measure – Any value outside the inner fences is an
outlier candidate
• Lower inner fence = 25th percentile – 1.5 IQR
• Upper inner fence = 75th percentile + 1.5 IQR
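The computation above can be sketched in Python using the service-order times from the stem-leaf example. This is an illustration, not part of the original slides. The function name is an assumption, and so is the handling of an odd number of observations (the middle value is left out of both halves); statistical packages may use slightly different quartile conventions.

```python
import statistics

# Hours to complete new service orders (the 40 values from the stem-leaf example).
hours = [
    42, 21, 46, 69, 87, 29, 34, 59, 81, 97, 64, 60, 87, 81, 69, 77, 75, 47,
    73, 82, 91, 74, 70, 65, 86, 87, 67, 69, 49, 57, 55, 68, 74, 66, 81, 90,
    75, 82, 37, 94,
]


def quartiles_by_halves(values):
    """Return (25th, 75th) percentiles as medians of the bottom and top halves.

    Assumes at least two values. For an odd count, the middle value is
    excluded from both halves (an assumption; conventions vary).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return statistics.median(lower), statistics.median(upper)


q1, q3 = quartiles_by_halves(hours)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the inner fences is an outlier candidate.
candidates = [h for h in hours if h < lower_fence or h > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"inner fences: {lower_fence} to {upper_fence}")
print("outlier candidates:", candidates)
```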
Box-Plot – S&P-500 Example
Data: S&P-500 Daily returns; Jan 01, 1998 – May 17, 2002
[Figure: S&P-500 Daily Returns Boxplot; y-axis: Daily Return; labeled from top to bottom: upper inner fence, 75th percentile, median, 25th percentile, lower inner fence, with outliers marked beyond the fences]
Minitab Tutorial
Why Use Minitab???
 Goal of course is to learn statistical concepts
• Most statistical analyses are performed using computers
• Each company may use a different statistical package
 YES…Minitab is used in business!
• Typically in quality control and design of experiments
 EXCEL has very limited statistical functionality and is considerably
more difficult to use than Minitab
 There are many stat packages (SAS, SPSS, Systat, Splus, R,
Statistica, Mathematica, etc.)
• Minitab is the easiest program to use right away
• Excellent Help facilities
• Statistical glossary built-in
Minitab Tutorial – Case Study 1
 A hotel kept records over time of the reasons why
guests requested room changes. The frequencies
were as follows:
– Room not clean: 2
– Plumbing not working: 1
– Wrong type of bed: 13
– Noisy location: 4
– Wanted nonsmoking: 18
– Didn’t like view: 1
– Not properly equipped: 8
– Other: 6
Minitab Tutorial – Case Study 2
 Exercise 2.8 in book
• Produce graphics
• Produce descriptive statistics
Minitab Tutorial – Case Study 3
 Diversification???
 Data: S&P-500 and IBM daily returns from Jan
01, 1998 through May 17, 2002
Next Time
 Probability and Probability Distributions