Module 1 slides

Spring 2016
BUSA 3110 - Statistics for Business
Module 1: Data
Kim I. Melton, Ph.D.
2
Syllabus
• Text, MyStatLab, JMP, D2L, MS Office
• Accessing material D2L and MyStatLab
• Software Availability: JMP and MS Office
• Course Format
• Grading
• General expectations (especially deadlines, makeups, extra credit, academic integrity, phones)
• Inclement Weather
3
Data/Information/Knowledge/Wisdom
WISDOM: Evaluates knowledge/understanding; deals with values; uses judgment; answers what is best and why
KNOWLEDGE/UNDERSTANDING: Explains; provides answers to how to and why questions
INFORMATION: Describes; provides answers to who, what, where, and when questions
DATA: Symbols (raw values) that represent properties of objects/events
Doing the right things (Effectiveness)
Doing things right (Efficiency)
Based on the work of Russell Ackoff. See “From Data to Wisdom” in Ackoff’s Best, pp. 170-174, 1999.
4
Content
• Six “Modules” (Sets of Slides)
1. Data – What is it, Types of data, How can we use it
2. Summarizing Data – Visually and Quantitatively
3. Collecting “Good” Data
4. Inference Involving One Variable
5. Simple Linear Regression
6. Multiple Regression and Model Building
[Figure labels: Data, Information, Knowledge, Wisdom]
5
Grading
• MyStatLab Homework (16 points)
  • Drop the lowest two and average the rest [then take percent of 16]
• MyStatLab Quizzes (16 points)
  • Average all [then take percent of 16]
• Instructor Supplied Assignments (64 points)
  • Eight assignments each graded out of 8 [add them up]
• Preparation / Participation (10 points)
  • Total earned/total available [then take percent of 16]
• Pre-final grade = Add the points from each section
• Final (0-16 points)
  • Two problems each out of eight points
• Final Grade = Points from (HW + Quizzes + Preparation / Participation + 8 Highest Instructor Supplied Assignments)
Grade scale
90 and above: A
80 – 89: B
70 – 79: C
60 – 69: D
Below 60: F
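The point arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official course calculation: the function names are hypothetical, the two final problems are assumed to join the pool of assignment scores before the eight highest are kept, the letter cutoffs are applied to unrounded totals, and the Preparation / Participation weight is left as a parameter because the slide shows both 10 and 16 for that section.

```python
from statistics import mean


def homework_points(percent_scores, weight=16, drops=2):
    """Drop the lowest `drops` scores, average the rest, take that percent of `weight`."""
    kept = sorted(percent_scores)[drops:]
    return mean(kept) / 100 * weight


def quiz_points(percent_scores, weight=16):
    """Average all quiz scores, then take that percent of `weight`."""
    return mean(percent_scores) / 100 * weight


def best_assignment_points(assignment_scores, final_problem_scores=(), keep=8):
    """Add up the `keep` highest scores, each graded out of 8.

    Assumption: the two final problems are pooled with the assignments.
    """
    pool = sorted([*assignment_scores, *final_problem_scores], reverse=True)
    return sum(pool[:keep])


def participation_points(earned, available, weight):
    """Total earned / total available, scaled to `weight` points."""
    return earned / available * weight


def letter_grade(total_points):
    """Map total points to a letter (assumption: cutoffs apply without rounding)."""
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if total_points >= cutoff:
            return letter
    return "F"


# Hypothetical student, for illustration only.
hw = homework_points([100, 90, 80, 70, 60, 50])      # 50 and 60 are dropped
quizzes = quiz_points([80, 90])
assignments = best_assignment_points([8, 7, 6, 8, 5, 7, 8, 6], [8, 7])
participation = participation_points(9, 10, weight=10)
total = hw + quizzes + assignments + participation
print(round(total, 1), letter_grade(total))
```

With fewer than three homework scores nothing is left after the drops and `mean` raises an error, so a real gradebook would need a rule for that case.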
6
Instructor Supplied Assignment Topics
(Tentative List)
1. Fundamentals of using JMP
2. Summarizing Data
3. Collecting “Good” Data for Statistical Inference
4. Inference about One Variable
5. Equations, Graphs, Model Statements, Hypotheses
6. Simple Linear Regression
7. Multiple Regression and Testing Theories
8. Model Building and Selecting the “Best” Model
7
General Expectations
• Learning is not a divided responsibility (I teach, you
learn)—learning is a joint responsibility (we learn
together)
• My “hot buttons”
• Timeliness
• Ethical behavior
• Professional orientation toward learning
• This includes putting phones away and engaging in class
• Recognition that “true” learning involves more than
getting the right answer
8
What is/are Statistics?
• Statistics vs. statistics
• Statistics vs. Math
9
How Does Statistics (as a field of study)
Apply to these Videos? …
And to what industries?
Videos
• Think
• Business Analytics and Optimization
• Turning Data into Insight
• Business Analytics: Data Trends let Businesses
Spot New Opportunities
• THINK: A Film about Making the World Work Better
10
How (and Why) is the Field of Statistics Changing?
Source: http://www.datasciencecentral.com/profiles/blogs/data-veracity
11
Analytics
12
Impact of Analytics on the
Way we Think about ___
• Evolution vs. revolution
• Improvement vs. innovation
• 1st order change vs. 2nd order change
• change in how we do something (1st order)
vs. change in what we do (2nd order)
• Paradigm shift…makes us go back to
the most basic assumptions
13
Based on the HBR Article
Analytics 2.0
• What is this?
• Examples from your life:
Analytics 3.0
• What is this?
• Data aligned with analytics 3.0 that
you are providing companies:
14
A Word about Deadlines
(MyStatLab and D2L)
• Deadlines are set to:
• Allow you time to see assignments well before the due date
• Allow you time to complete the assignments after the
material is covered
• Provide you with as much time as possible prior to when I
will start grading
• Therefore, I will use early morning deadlines rather than late night
deadlines (giving you the option of the overnight hours to work)
• Remember, you can submit assignments before the deadline
15
Why Start with the LAST
Chapter in the Book?
(Chapter 24)
CONTEXT
• This is a second course in statistics. This
chapter lets you reflect on the tools/
techniques from the first course…and sets
the stage for this course.
24.8 The Data Mining Process
(and also applies to most any data analysis in practice)
Copyright © 2015 Pearson Education. All rights reserved.
24-16
24.8 The Data Mining Process
(and data analysis in practice)
The process must start with the Business Understanding phase.
Data Understanding is central to the entire data mining project – it is crucial to
understand the data warehouse, what it contains, and what limitations are present.
Once variables are selected and the response variable has been agreed upon, the
Data Preparation phase begins.
Following preparation is the Data Modeling phase. The more knowledge of the data
and the variables that goes into the model, the higher the chances of success for the
entire project.
Finally, if the model seems to give business insight, it’s time for the Deployment
phase – just keep in mind that the business environment changes rapidly, so models
can become stale quickly.
24.4 Data Mining Myths
Myth 1: Find answers to unasked questions.
Myth 2: Automatically monitor a database for interesting patterns.
Myth 3: Eliminate the need to understand the business.
Myth 4: Eliminate the need to collect good data.
Myth 5: Eliminate the need for good data analysis skills.
24.5 Successful Data Mining
The first step is to have a well-defined business problem,
which can help you avoid going down a lot of blind paths.
Typically, 65% to 90% of the time is spent in data
preparation – investigating missing values, correcting
wrong entries, reconciling data definitions, or creating new
variables from old ones.
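The preparation tasks named above can be sketched on a tiny made-up table. Everything here is hypothetical (the records, the field names, the `STATE_CODES` lookup), and the sketch flags a wrong entry as missing instead of guessing a corrected value, which is only one possible policy.

```python
# Hypothetical raw records; None marks a missing value.
records = [
    {"id": 1, "state": "GA", "price": 20.0, "qty": 3},
    {"id": 2, "state": "Georgia", "price": None, "qty": 1},
    {"id": 3, "state": "ga", "price": 15.0, "qty": -2},
]

# Hypothetical lookup used to reconcile different spellings of one category.
STATE_CODES = {"georgia": "GA", "ga": "GA"}


def count_missing(rows):
    """Investigate missing values: count None entries per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts


def reconcile_state(value):
    """Reconcile data definitions: map known spellings to one code."""
    return STATE_CODES.get(value.strip().lower(), value)


def prepare(rows):
    """Return cleaned copies of the rows with a derived `revenue` variable."""
    cleaned = []
    for row in rows:
        row = dict(row)  # leave the raw data untouched
        row["state"] = reconcile_state(row["state"])
        # A negative quantity is treated as a wrong entry and set to missing.
        if row["qty"] is not None and row["qty"] < 0:
            row["qty"] = None
        # Create a new variable from old ones when both inputs are present.
        if row["price"] is not None and row["qty"] is not None:
            row["revenue"] = row["price"] * row["qty"]
        else:
            row["revenue"] = None
        cleaned.append(row)
    return cleaned


missing_before = count_missing(records)
prepared = prepare(records)
```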
• Be sure that the question to be answered is specific. A goal as vague as “improving the business” is not likely to be successful.
• Be sure that the data have the potential to answer the question. Check the variables to see whether a model can reasonably be built to predict the response.
• Be aware of overfitting the data. Make sure you validate the model on a test set.
• Make sure that the data are ready to use in the data mining model. Missing values, incorrect entries, and different time scales are all challenges that need to be overcome.
• Don’t try it alone. Data mining projects require a variety of skills and a lot of work. Assemble the right team of people.
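The advice to validate on a test set can be illustrated with a simple holdout split. This is a minimal sketch, assuming simulated data and a straight-line model fit by least squares; the helper names and the 25% test fraction are choices made for the example, not something the slides prescribe.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them as a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Least-squares intercept and slope for (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(rows, intercept, slope):
    """Average squared prediction error of the fitted line on `rows`."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in rows) / len(rows)


# Simulated data (hypothetical): y = 3 + 2x plus random noise.
noise = random.Random(1)
data = [(x, 3.0 + 2.0 * x + noise.gauss(0, 1)) for x in range(40)]

train, test = train_test_split(data)
intercept, slope = fit_line(train)           # the model only sees the training rows
train_mse = mean_squared_error(train, intercept, slope)
test_mse = mean_squared_error(test, intercept, slope)
print(f"train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

A test error far above the training error is the usual warning sign of overfitting; comparable values suggest the model generalizes to data it has not seen.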
21
What do we mean by “good data”?
• Considerations when collecting data
•
•
•
• Considerations when evaluating claims based on data
•
•
•
22
Characteristics of
“Good” Data
• Accuracy of measurement
• Precision of measurement
• Uses an appropriate type of data
(level of measurement)
• Nominal, Ordinal, Interval, Ratio
• Interval and Ratio are often grouped as continuous or quantitative
[Image text: “Parking Space Reserved for Drive-Thru”]
• Aligns with the characteristic of interest
• Different numbers reflect differences in the items
measured (rather than an inability to measure
consistently)
• Measurement is a yardstick for “how we are doing”
rather than the “mission”
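The difference between accuracy and precision can be made concrete with repeated measurements of one item. A minimal sketch, assuming accuracy is summarized by bias (average reading minus the true value) and precision by the standard deviation of the readings; the two scales and their readings are invented for illustration.

```python
from statistics import mean, stdev


def accuracy_and_precision(readings, true_value):
    """Return (bias, spread): bias near 0 means accurate, small spread means precise."""
    bias = mean(readings) - true_value
    spread = stdev(readings)
    return bias, spread


true_weight = 100.0  # hypothetical known weight of the item

# Hypothetical repeated readings of the same item on two scales.
scale_a = [100.9, 99.2, 101.1, 98.8, 100.0]   # centered on the truth, but scattered
scale_b = [102.0, 102.1, 101.9, 102.0, 102.0]  # tightly grouped, but off target

bias_a, spread_a = accuracy_and_precision(scale_a, true_weight)
bias_b, spread_b = accuracy_and_precision(scale_b, true_weight)
print(f"Scale A: bias = {bias_a:.2f}, spread = {spread_a:.2f}")
print(f"Scale B: bias = {bias_b:.2f}, spread = {spread_b:.2f}")
```

Scale A is accurate but not precise, and Scale B is precise but not accurate. With Scale A, different numbers for the same item reflect an inability to measure consistently rather than differences in the items measured.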
23
Putting Data in Context (5 W’s and H)
• Who does the data describe (doesn’t have to be people)
• What characteristics are recorded (variables of interest)
• Why are we collecting data (purpose, guiding questions,…)
• How were the data collected (theory-wise and physically)
• Sampling, convenience, primary or secondary data, training for data collection
• Operational definitions describe what is “measured” and how the measurements are taken (including the measurement level/modeling type and the method of measurement), so that two people looking at the same item would come to the same conclusion about the characteristic.
• When were the data collected (date/time, across time, …)
• Where were the data collected (geographic, point in process, source…)
24
Describe, Explain, Understand, Predict,
Prescribe
• What were our sales for the month? (describing)
• How does this compare to the same month last year? (still describing)
• What’s changed that might account for the differences? (moves
toward explaining)
• Why have sales changed? (starts to move from explaining to
understanding)
• What will sales be in the future? (predicting and/or prescribing)
25
Data for Decision Making
• Major issues
• Purpose (descriptive, predictive, prescriptive)
• Measurement Level (quantitative/qualitative, nominal,
ordinal, interval, ratio)
• Variable choice and definition
• Sources of variation (population, across time, process)
• Methods of accessing (primary, secondary)
• Choice of observations (random, convenience, rational)
• External influences (ethical and practical)
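The choice of observations matters because different selection methods can give very different pictures of the same population. The sketch below contrasts a simple random sample with a convenience sample; the population of invoice amounts is hypothetical and deliberately ordered so the effect is easy to see, and rational sampling is not covered.

```python
import random
from statistics import mean


def simple_random_sample(population, n, seed=None):
    """Every set of n items has the same chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def convenience_sample(population, n):
    """Take whatever is easiest to reach: here, the first n items on file."""
    return population[:n]


# Hypothetical population: 100 invoice amounts filed from smallest to largest.
invoices = list(range(1, 101))

random_draw = simple_random_sample(invoices, 10, seed=42)
easy_draw = convenience_sample(invoices, 10)

print("population mean:", mean(invoices))
print("random sample mean:", mean(random_draw))
print("convenience sample mean:", mean(easy_draw))
```

Because the file is ordered, the convenience sample sees only the smallest invoices and badly understates the population mean, while a random sample has no such built-in tilt.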
Variable Choice and Measurement Level (Modeling Type)
Measurement Level/Modeling Type
• Nominal (Qualitative, Categorical)
• Ordinal (Qualitative, Categorical, Logical to Order the Categories)
• Interval (Quantitative, Differences have consistent meaning)
• Ratio (Quantitative, Differences and Ratios have meaning)
NOTE: JMP combines Interval and Ratio into Continuous
Identify the Level/Type
• Major
• Grade in a course
• Job title
• Year in school (Freshman,…, Senior)
• Price of a gallon of regular gas
• Salary
• Rank of your favorite college team
• Size of a house
• Gender
• Level of agreement (1, 2, …, 9, 10 where higher numbers relate to stronger agreement)
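The four measurement levels and the properties listed for each can be written down as a small lookup. This is a sketch for study purposes, with hypothetical function names; it encodes only what the definitions above state, plus the note that JMP groups Interval and Ratio as Continuous.

```python
from enum import Enum


class MeasurementLevel(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


QUANTITATIVE = (MeasurementLevel.INTERVAL, MeasurementLevel.RATIO)


def is_quantitative(level):
    """Interval and Ratio are quantitative; Nominal and Ordinal are qualitative."""
    return level in QUANTITATIVE


def order_is_meaningful(level):
    """Every level except Nominal has a logical ordering."""
    return level is not MeasurementLevel.NOMINAL


def differences_are_meaningful(level):
    """Differences have consistent meaning for Interval and Ratio data."""
    return level in QUANTITATIVE


def ratios_are_meaningful(level):
    """Ratios have meaning only for Ratio data."""
    return level is MeasurementLevel.RATIO


def jmp_modeling_type(level):
    """JMP modeling type: Interval and Ratio are combined into Continuous."""
    return "Continuous" if is_quantitative(level) else level.value.capitalize()


for level in MeasurementLevel:
    print(level.name, jmp_modeling_type(level),
          order_is_meaningful(level),
          differences_are_meaningful(level),
          ratios_are_meaningful(level))
```

Working through the "Identify the Level/Type" list then amounts to asking, for each variable, which of these properties its values actually have.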
26
Lists of Most Stolen Vehicles
Ford F-250 crew 4WD
Chevrolet Silverado 1500 crew
Chevrolet Avalanche 1500
GMC Sierra 1500 crew
Ford F-350 crew 4WD
Cadillac Escalade 4WD
Chevrolet Suburban 1500
GMC Sierra 1500 extended cab
GMC Yukon
Chevrolet Tahoe

Toyota Camry/Solara
Toyota Corolla
Chevrolet Impala
Dodge Charger
Chevrolet Malibu
Ford Fusion
Nissan Altima
Ford Focus
Chevrolet Cobalt
Honda Civic

1994 Honda Accord
1998 Honda Civic
2006 Ford Full Size Pickup
1991 Toyota Camry
2000 Dodge Caravan
1994 Acura Integra
1999 Chevrolet Full Size Pickup
2004 Dodge Full Size Pickup
2002 Ford Explorer
1994 Nissan Sentra

Dodge Charger
Pontiac G6
Chevrolet Impala
Chrysler 300
Infiniti FX35
Mitsubishi Galant
Chrysler Sebring
Lexus SC
Dodge Avenger
Kia Rio
27
[List labels on the slide: 1, 2, 3, 4]
Highway Loss Data Institute (Insurance claims $)
National Insurance Crime Bureau (thefts reported to law enforcement)
National Highway Traffic Safety Adm. (FBI data)
28
http://www.realclearpolitics.com/epolls/latest_polls/president/ (accessed 1/19/16)
29
Cross Sectional vs. Time Series
http://www.realclearpolitics.com/epolls/latest_polls/president/ (accessed 1/19/16)
30
Other Issues in Data Collection
• External influences (practical and ethical)
  • Practical (time, money, access)
  • Ethical
  • policies that can interfere with collecting good data:
    • evaluation systems that look at components separately
    • reward systems
    • quotas and arbitrary goals
    • fiscal year budgets
• Other issues to cover in Chapter 8
  • Methods of accessing (primary, secondary; survey, experiment, observational)
  • Choice of observations (random, convenience, rational)