Big Data: Many Observations on Many Variables
Big Data and Data Mining
Professor Tom Fomby
Director
Richard B. Johnson Center for Economic Studies
Department of Economics
SMU
May 23, 2013
Big Data: Many Observations on Many Variables
Data File
OBS No.     Target Var.   Var. 1   Var. 2   ...   Var. 100
1           0             63       .        ...   .
2           1             54       .        ...   .
3           0             44       .        ...   .
.           .             .        .        ...   .
1,500,000   1             32       .        ...   .
Types of Problems
• Customer and Student Retention
• Employee Churn
• Credit Scoring (Auto or Home Loans)
• Bond Ratings
• What Characteristics Make for a Successful Mary Kay Representative?
• Detection of Fraudulent Insurance Claims
• Is a Newly Introduced Product Meeting with Consumer Acceptance or Rejection?
• Who is a Likely Donor to Your Charity?
• Early Detection of a Stolen or Compromised Credit Card
Types of Problems
• What kind of genetic markers imply
certain susceptibilities to specific
diseases?
• Netflix and recommendations of
Related and Suggested Movies
• Recommendations for Book Purchases:
Amazon Side-Bars
• Clickstream Analysis for Optimal Web-Based
Design
Statistical Hypothesis Testing
Versus
Prediction
Example of Statistical Hypothesis
Testing
A Clinical Trial of 400 people – 200 randomly
selected into a Control (Placebo) Group and the
Other 200 into a Treatment Group
Question:
Does the Drug Treatment Significantly Reduce a
Person’s Cholesterol Count?
Method:
Conventional Statistical Methods, Like the t-Test
of a Significant Difference in Population Means
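A minimal sketch of such a test, using only the Python standard library. The cholesterol counts are simulated and purely hypothetical, the function name `welch_t` is illustrative, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only because each group has 200 people.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t(control, treatment):
    """Return (t statistic, approximate one-sided p-value).

    The alternative hypothesis is that the treatment mean is lower
    than the control mean.
    """
    n_c, n_t = len(control), len(treatment)
    # Standard error of the difference in sample means (unequal variances allowed).
    se = math.sqrt(variance(control) / n_c + variance(treatment) / n_t)
    t_stat = (mean(treatment) - mean(control)) / se
    # Assumption: with large groups the t distribution is close to standard
    # normal, so the normal CDF approximates the one-sided p-value.
    p_value = NormalDist().cdf(t_stat)
    return t_stat, p_value


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated cholesterol counts (hypothetical numbers, for illustration only).
    control = [rng.gauss(220, 30) for _ in range(200)]
    treatment = [rng.gauss(205, 30) for _ in range(200)]
    t_stat, p_value = welch_t(control, treatment)
    print(f"t = {t_stat:.2f}, approximate one-sided p = {p_value:.4f}")
```

A small p-value would be read as evidence that the drug lowers the mean cholesterol count. The question is about the population difference, not about predicting any one person's outcome.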
Example of a Prediction Problem
• Early Detection of a Stolen or
Compromised Credit Card
Not So Interested in How or Why the
Credit Card was Stolen but Instead
Whether Recent Transactions are
Indicative of a Stolen or Compromised
Credit Card
• Tool – Box Plot
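A box plot flags values that sit far outside the middle half of the data. The sketch below applies the usual box-plot fences (1.5 times the interquartile range beyond the quartiles) to a card's past transaction amounts. The function name, the amounts, and the choice of 1.5 are illustrative assumptions, not taken from the slides.

```python
from statistics import quantiles


def boxplot_outliers(history, recent, k=1.5):
    """Flag recent amounts that fall outside the box-plot fences of past amounts."""
    q1, _, q3 = quantiles(history, n=4)  # quartiles of the card's past transactions
    iqr = q3 - q1                        # interquartile range (the "box")
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in recent if x < lower or x > upper]


if __name__ == "__main__":
    # Hypothetical transaction amounts in dollars.
    history = [12, 15, 18, 20, 22, 25, 30, 35, 40, 45, 50]
    recent = [28, 410, 33]
    print(boxplot_outliers(history, recent))
```

Here the interest is only in whether a recent transaction looks unlike the card's history, not in why.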
Getting Gems From the Data
Crankshaft Cartoon
The Task of Constructing a Meaningful
Data Warehouse
Data Rich, Information Poor
• The Amount of Raw Data Stored in Corporate Databases is
Exploding
• Most of this information is recorded instantaneously and with
minimal cost
• Databases are measured in gigabytes and terabytes (One
terabyte = one trillion bytes. A terabyte is equivalent to about
2 million books!)
• Walmart uploads 20 million point-of-sale transactions to 500
parallel processing storage devices each day.
• Raw data by itself, however, does not provide much
information. That is where Data Mining comes in!
What is Data Mining?
• “Extracting useful information from large datasets” (Hand et
al., 2001)
• “Data mining is the process of exploration and analysis, by
automatic or semi-automatic means, of large quantities of
data in order to discover meaningful patterns and rules.”
(Berry and Linoff, 1997, 2000)
• “Data mining is the process of discovering meaningful new
correlations, patterns and trends by sifting through large
amounts of data stored in repositories, using pattern
recognition technologies as well as statistical and
mathematical techniques” (Gartner Group, 2004)
Four Distinct Characteristics of
Data Mining Projects
• Partitioning given data into Training, Validation, and Test Parts
• Cross Validation – using the Validation and Test Parts to gauge the
worthiness of competing models
• Using Ensemble Methods to increase predictive accuracy. (There is no
such thing as a correct model!)
• Continual Monitoring of a PA (predictive analytics) system to guard
against structural change and to maintain predictive accuracy
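The first and third characteristics can be sketched in a few lines. The 60/20/20 split, the function names, and the plain averaging of predictions are illustrative assumptions. The slides do not specify proportions or a particular ensemble method.

```python
import random


def partition(rows, train=0.6, valid=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_valid = round(len(rows) * valid)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])  # whatever remains is the test part


def ensemble_predict(models, x):
    """Average the predictions of several fitted models (a simple ensemble)."""
    return sum(model(x) for model in models) / len(models)


if __name__ == "__main__":
    training, validation, test = partition(range(100))
    print(len(training), len(validation), len(test))
    # Two hypothetical fitted models, represented here as plain functions.
    models = [lambda x: 2.0 * x, lambda x: 2.0 * x + 1.0]
    print(ensemble_predict(models, 3.0))
```

Competing models would be fitted on the training part, compared on the validation part, and the chosen model or ensemble judged once on the test part.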
More Detailed Discussion of Specific
Data Mining Applications
• Text Mining (Classification of Documents and Evolution of Opinions on
Blogs)
• Target Marketing
• Credit Scoring
• Bond Ratings: Calculating Default Probabilities on Bonds (Bond rating
services like Moody’s, Standard & Poor’s, Fitch, etc.)
• Fraud Detection
• Customer Retention
• Franchise Locations and Performance
• Customer Segmentation
• Affinity Analysis (i.e. “Market Basket” Analysis)
• Link Analysis (Webpage design)
• Many Other Fields including Clinical Science, Statistical Genetics, Political
Science, Real Estate Assessment, and College Admissions Practices
Text Mining
Text Mining: Converting Unstructured Data to Structured Data
Text → Frequencies of Words and Phrases → Numbers for Prediction
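A minimal sketch of that conversion: raw text goes in, and a vector of word rates (occurrences per 1,000 words) comes out, ready to be used as predictor variables. The function name, the tokenizing rule, and the example vocabulary are assumptions made for illustration.

```python
import re
from collections import Counter


def term_frequencies(text, vocabulary):
    """Convert raw text into numbers: rate per 1,000 words for each vocabulary word."""
    # Simple tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [1000 * counts[word] / total for word in vocabulary]


if __name__ == "__main__":
    vocabulary = ["the", "upon", "while"]  # hypothetical marker words
    document = "Upon the hill, the cat sat."
    print(term_frequencies(document, vocabulary))
```

Each document becomes one row of numbers, so a collection of documents takes the same observations-by-variables form as any other data file.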
Who Wrote the Federalist Papers?
Frederick Mosteller and David Wallace
“Inference in an Authorship Problem” JASA, June 1963
Comparing Two Documents (figure: Doc 1 and Doc 2)
Target Marketing
• Target Marketing is the process of choosing specific customers to
advertise to and/or to offer discounts to in order to increase the sales of
the company
• Target Marketing usually proceeds in two stages: (1) Determining the
probability that the solicited customer will purchase products from the
company once solicited and (2) Once the solicited customer decides to
purchase items from the company, estimating the profit that will likely be
generated by the customer’s purchases.
• Thus the goal is to advertise only to those potential customers that
represent expected profits that exceed the cost of advertising to the
customer
• We then need to use data mining techniques to determine (1) the
probability of purchase and (2) conditional on purchase, the expected
profit of purchase.
• Expected Profit of Purchase = (Probability of Purchase) x (Expected profits
from purchase, conditional on purchase)
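The decision rule above can be written out directly. The function names and the dollar figures in the example are hypothetical. In practice both inputs would come from fitted models.

```python
def expected_profit(p_purchase, profit_if_purchase):
    """Expected profit = P(purchase) x expected profit conditional on purchase."""
    return p_purchase * profit_if_purchase


def should_solicit(p_purchase, profit_if_purchase, cost_of_advertising):
    """Advertise only when the expected profit exceeds the cost of advertising."""
    return expected_profit(p_purchase, profit_if_purchase) > cost_of_advertising


if __name__ == "__main__":
    # Hypothetical customer: 5% chance of buying, $40 profit if they do,
    # and $1.50 to send the solicitation.
    print(expected_profit(0.05, 40.0))
    print(should_solicit(0.05, 40.0, 1.50))
```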
Credit Scoring
• Credit scoring involves using data mining tools
to determine the creditworthiness of loan applicants
• The task is determining the probability that a
potential borrower will default on his or her
obligations, given the personal characteristics of the
borrower and the macroeconomic conditions of the
economy at the time
• Some Examples: Citibank and Credit Card Issuers
reviewing applicants for credit cards; Banks
considering loaning money for mortgages
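One common way to turn borrower characteristics into a default probability is a logistic model. The sketch below assumes that form. The slides do not say which model is used, and the variable names, coefficients, and intercept are invented for illustration rather than estimated from any data.

```python
import math


def default_probability(features, coefficients, intercept):
    """Logistic score: P(default) = 1 / (1 + exp(-(intercept + sum of b_i * x_i)))."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical coefficients: a personal characteristic and a macro condition.
    coefficients = {"debt_to_income": 3.0, "unemployment_rate": 0.2}
    applicant = {"debt_to_income": 0.45, "unemployment_rate": 7.5}
    p = default_probability(applicant, coefficients, intercept=-5.0)
    print(f"Estimated probability of default: {p:.3f}")
```

A lender would then compare the estimated probability with a cutoff chosen to balance lost business against expected losses.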
Bond Ratings: Calculating Default
Probabilities on Bonds
• Given the financial characteristics of a bond issuer
and the macroeconomic conditions at the time, what
is the probability that the bond issuer will, at some
time in the future, not be able to service the
obligations of the bond?
• Bond rating services like Moody’s, Standard and
Poor’s, and Fitch build probability of default models
and use them to give bonds their credit ratings (AAA,
AA, …, BBB, etc.). The lower the probability of
default, the higher the bond rating and vice versa. In
turn, these ratings give rise to differential interest
rates paid by the bond issuers. (See Town and Gown
PPT for example.)
Fraud Detection
• Of interest to IRS, Credit Card Companies, and
Auditors
• Given a history of transactions, or a record of
“typical” income tax reports, income statements, or
balance sheets, which transactions/reports
appear to be “outliers”?
• Basic Tool: Statistical Outlier Analysis.
Roughly speaking: “What is three or more
standard deviations from the norm?”
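The three-standard-deviation rule of thumb can be sketched as follows. The norm is estimated from a history of typical values, and new values are checked against it. The function name and the figures are hypothetical.

```python
from statistics import mean, stdev


def three_sigma_outliers(history, new_values, threshold=3.0):
    """Flag new values that lie `threshold` or more standard deviations from the norm."""
    mu, sigma = mean(history), stdev(history)  # the "norm" comes from past data
    return [x for x in new_values if abs(x - mu) >= threshold * sigma]


if __name__ == "__main__":
    # Hypothetical history of typical reported amounts, then three new reports.
    history = [100, 102, 98, 101, 99]
    print(three_sigma_outliers(history, [103, 110, 95]))
```

Estimating the mean and standard deviation from history alone keeps a single extreme new value from inflating the yardstick it is measured against.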
Customer Retention
• What factors determine the loyalty displayed by a
customer?
• When is a customer likely to “jump ship”?
• Would loyalty programs be useful?
• Basic Tool: Duration Modeling. This method
determines what factors lengthen or shorten the
time that customers stay with a company.
• Purpose: To identify potential “fragile” customers
and then “incentivize” them so that they will remain
loyal
• Result: Higher profits
Facets of a Data Mining Job
1. Development of Problem Statement and
Consultation with Domain Experts
2. Data Acquisition
3. Data Preparation and Cleaning
4. Data Visualization and Summarization
5. Type of Task? Supervised Learning
(Prediction, Classification), or
Unsupervised Learning
6. Evaluation of Models (Data Partitioning
and Cross Validation)
7. Scoring of New Data
8. Continual Review of Model Usefulness
Franchise Locations and Performance
• What location factors affect the eventual
profitability and success of franchises?
• Even within a set of franchises, should the
product mix be the same for all franchises or
should franchises be treated differently?
• Can franchisees be put into “Clusters” and
treated differently so as to maximize the
profits of the entire franchise operation?
Customer Segmentation
• Suppose you are a giant publisher of magazines of various
types. How do your subscribers differ across your portfolio of
magazines?
• When soliciting advertising for your magazines, how do you
match your potential advertisers with your magazines so that
the advertisers receive the maximum benefit for their
advertising expenditures?
• Is there a niche market (customer segment) that none of your
magazines (or those of your competitors) is currently serving?
Is this niche market substantial enough to warrant introducing
a new magazine?
• Also, retailers often like to be able to distinguish between
customers with low versus high elasticities of demand for
their products so that they will know whom to offer discounts
to in order to increase their revenues and profits.
• Basic Tool: Cluster Analysis
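Cluster analysis covers many methods. The sketch below uses k-means, one common choice, as an assumption, since the slides do not name a specific algorithm. The subscriber data (age and annual spending) and the function name are hypothetical.

```python
import random


def k_means(points, k, iterations=20, seed=0):
    """Plain k-means on tuples of numbers; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    # Hypothetical subscribers: (age in decades, annual spending in $100s).
    subscribers = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                   (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
    centroids, labels = k_means(subscribers, k=2)
    print(centroids, labels)
```

In practice the variables would be standardized first so that no single one dominates the distance calculation.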
Affinity Analysis
• Given that a customer purchases a given set of items, what is
the probability that they will purchase another set of items?
That is, what does the customer’s final market basket look
like, given a partially-filled one?
• Purpose: Arrange the store shelves of a retail store so as to make
it most convenient for customers to purchase related goods
and minimize the time of search and shopping. We want the
customer to be able to shop quickly but at the same time buy
a lot!
• On book seller web pages, once you have indicated an interest
in purchasing a given book, several related books are often
brought to your attention by “advertisements” in the margins
of the page you are currently on. Affinity analysis is helpful in
generating “associated” sales on retail web pages. This
increases the profits of the web retailer.
• Major Tool: Association Rules – The Apriori Algorithm.
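Association rules are judged by two quantities: support (how often an item set appears) and confidence (the conditional probability described in the first bullet). The sketch below computes both for hypothetical baskets. It does not implement the Apriori candidate-pruning search itself, only the measures that the search evaluates.

```python
def support(baskets, items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(basket) for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


if __name__ == "__main__":
    # Hypothetical market baskets.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(support(baskets, {"bread", "milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
```

Rules with high support and high confidence are the candidates for shelf placement or margin "advertisements".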
Link Analysis
• Explores Associations between groups
(individuals, organizations, web sites, nation-states, and the like)
• Uses: To improve webpage design, to facilitate
criminal investigations, and to benefit medical
research in epidemiology and pharmacology,
among other uses
Text Mining
• To Understand Textual Content
• For Finding Interesting Regularities in Text
• Help Classify Documents by Type and Content
• Useful for Medical Science Search Engines seeking
the most current research on particular maladies seen in
patients
• Beneficial in Building Spam Filters
• Help Examine Evolution of Opinion vis-à-vis Blogs
Other Fields Where Data Mining is Used
• Clinical Science and Providing Baseline Guidance for Clinical
Treatment
• Political Science (Modeling Voting Patterns, Election
Outcomes and Appeal and Supreme Court Decisions)
• Statistical Genetics – Relating Genetic characteristics with
medical outcomes
• Real Estate Assessment Models – County Assessors using
predictive models to gauge the current value of houses for the
purpose of assessing real estate taxes
• College Admissions Practices – Which students should be
admitted and how much financial aid is needed to ensure that
the chosen student will matriculate?
Typical Data Mining Course Outline
Data Preparation & Exploration
• Sampling
• Cleaning
• Summaries
• Visualization
• Partitioning
• Dimension reduction
Prediction
• MLR
• K-Nearest Neighbor
• Regression Trees
• Neural Nets
Classification
• K-Nearest Neighbor
• Naïve Bayes
• Logistic Regression
• Classification Trees
• Neural Nets
• Discriminant Analysis
Segmentation/Clustering
Affinity Analysis/Association Rules
Model Evaluation & Selection
Deriving Insight
Figure 1.2: Data mining from a process perspective
G. Shmueli, N. R. Patel and P. C. Bruce. Data Mining for
Business Intelligence (2007).
Available Software Packages
• XLMINER (Frontline Systems)
• SAS Enterprise Miner (SAS Product)
• SPSS Modeler (IBM Product)
• R (Open Source)
• Data Mining Certificates are available for SAS
EM and SPSS Modeler
The Shortage of Trained Personnel
for Doing Data Mining
“Big data: The next frontier for innovation,
competition, and productivity” McKinsey Global
Institute, May 2011
• 140,000 – 190,000 more deep analytical talent
positions over the next decade
• 1.5 Million more data-savvy managers to take
advantage of insights offered by Data Mining
What is SMU doing about this
shortage?
• Department of Economics: MS in
Applied Economics and Predictive
Analytics – Starting Fall of 2013
• Department of Statistics: MS in
Statistics and Data Analytics –
Started Fall of 2012
• Cox School of Business: MS in
Business Analytics – Starting Fall of
2013
The Super Woman of
Predictive Analytics
The Skill Set of Super Woman
Analytics: SAS/SPSS/Statistics
Data Management: Oracle and SQL
Reporting: Cognos and Dashboards