
DSCI 4520/5240: Data Mining
Fall 2013 – Dr. Nick Evangelopoulos
Lecture 1:
Introduction to Data Mining
Some slide material based on:
Groth; Han and Kamber; Cerrito; SAS Education
slide 1
DSCI 4520/5240
DATA MINING
ITDS Résumé Book
ITDS majors (BCIS/DS), please send your résumé to
[email protected], so that we can include it in the ITDS Résumé
Book we send to our corporate partners for hiring/coop consideration. Make
sure the résumés are formatted per UNT standards. Here is a link to the
sample résumés: https://unt.optimalresume.com/
slide 2
Data (and the lack thereof)
“It is a capital mistake to theorize
before one has data.
Insensibly one begins to twist facts
to suit theories, instead of
theories to suit facts.”
(Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia")
http://www.dilbert.com/2012-12-05/
slide 3
Data (and the lack thereof)
http://www.dilbert.com/2012-12-05/
slide 4
Nobel Laureate Calls Data Mining "A Must"
In a January 1999 interview with ComputerWorld, Dr. Penzias
(winner of the 1978 Nobel Prize in Physics and vice president
and chief scientist at Bell Laboratories) identified large-scale
data mining from very large databases as the key application
for corporations in the next few years.
In response to ComputerWorld's age-old question, "What will
be the killer applications in the corporation?" Dr. Penzias
replied:
"Data mining." He then added: "Data mining will become
much more important and companies will throw away
nothing about their customers because it will be so valuable.
If you're not doing this, you're out of business," he said.
slide 5
What Is Data Mining?
Data mining (knowledge discovery in databases):

A process of identifying hidden patterns
and relationships within data (Groth)
Data mining:

Extraction of interesting (non-trivial,
implicit, previously unknown and
potentially useful) information or
patterns from data in large databases
slide 6
Motivation: “Necessity is the
Mother of Invention”
Data explosion problem

Automated data collection tools and mature database
technology lead to tremendous amounts of data stored
in databases, data warehouses and other information
repositories
Problem: We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing

Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
slide 7
Data Deluge
slide 8
Data Mining, circa 1963
IBM 7090
600 cases
“Machine storage limitations
restricted the total number of
variables which could be
considered at one time to 25.”
slide 9

Business Decision Support
Database Marketing
– Target marketing
– Customer relationship management

Credit Risk Management
– Credit scoring

Fraud Detection

Healthcare Informatics
– Clinical decision support
slide 10
Required Expertise

Domain

Data

Analytical Methods
slide 11
Multidisciplinary
[Figure: Data Mining shown at the center of overlapping fields: Statistics, Pattern Recognition, Neurocomputing, Machine Learning, AI, Databases, KDD]
slide 12
What Is Data Mining?
IT: Complicated database queries
ML: Inductive learning from examples
Stat: What we were taught not to do
slide 13
Comparing Statistics to Data Mining
(from Cerrito 2006)
slide 14
Comparing Statistics to Data Mining
(from Cerrito 2006)
slide 15
Predictive Modeling
[Figure: a data table with Cases as rows, Inputs as columns, and one Target column]
slide 16

Types of Targets
Supervised Classification
– Event/no event (binary target)
– Class label (multiclass problem)

Regression
– Continuous outcome

Survival Analysis
– Time-to-event (possibly censored)
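As a rough illustration of the target types above (not taken from the course materials), the sketch below stores a few made-up cases and one target of each kind. All names and values are hypothetical, and describe_target is only a crude heuristic that works for this toy data.

```python
# Hypothetical toy data: one row per case, one column per input.
inputs = [
    [34, 52000],  # case 1: age, income
    [51, 87000],  # case 2
    [27, 31000],  # case 3
]

# One target value per case. The kind of target drives the choice of method.
binary_target = [1, 0, 1]                        # event / no event
multiclass_target = ["gold", "basic", "silver"]  # class label
continuous_target = [310.5, 120.0, 89.9]         # continuous outcome
# Time-to-event as (time, event_observed); False marks a censored case.
survival_target = [(12, True), (30, False), (7, True)]


def describe_target(target):
    """Guess the modeling task from the target values (toy heuristic only)."""
    first = target[0]
    if isinstance(first, tuple):
        return "survival analysis"
    if isinstance(first, float):
        return "regression"
    if len(set(target)) == 2:
        return "binary classification"
    return "multiclass classification"


for name, target in [
    ("binary", binary_target),
    ("multiclass", multiclass_target),
    ("continuous", continuous_target),
    ("survival", survival_target),
]:
    print(name, "->", describe_target(target))
```

In real data the target type is a modeling decision, not something to infer from the values alone; the helper is here only to make the four cases concrete.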
slide 17
Why Data Mining? — Potential
Applications
Database analysis and decision support

Market analysis and management
– target marketing, customer relationship management, market
basket analysis, cross-selling, market segmentation

Risk analysis and management
– Forecasting, customer retention, improved underwriting,
quality control, competitive analysis

Fraud detection and management
Other Applications

Text mining (news group, email, documents) and Web
analysis.

Intelligent query answering
slide 18
Market Analysis and Management (1)
Where are the data sources for analysis?

Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing

Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits, etc.
Cross-market analysis

Associations/correlations between product sales

Prediction based on the association information
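One common way to quantify associations between product sales is to compute support, confidence, and lift for pairs of items. The sketch below does this on a handful of invented transactions; the item names, the data, and the pair_rules helper are assumptions made for illustration, not part of the lecture.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets: each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_rules(transactions):
    """Return support, confidence (a -> b), and lift for every item pair."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        rules[(a, b)] = {
            "support": support,
            "confidence": count / item_counts[a],  # P(b in basket | a in basket)
            "lift": support / ((item_counts[a] / n) * (item_counts[b] / n)),
        }
    return rules


rules = pair_rules(transactions)
for pair, stats in sorted(rules.items()):
    print(pair, stats)
```

A lift above 1 suggests the two items appear together more often than independence would predict; a lift below 1 suggests the opposite.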
slide 19
Market Analysis and Management (2)
Customer profiling

data mining can tell you what types of customers
buy what products (clustering or classification)
Identifying customer requirements

identifying the best products for different customers

use prediction to find what factors will attract new
customers
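Customer profiling by clustering can be sketched with a very small k-means loop. This is a minimal illustration under stated assumptions: the customer data (income in $1,000s, monthly spend) is invented, the starting centers are picked by hand, and the features are left unscaled, which a real analysis would usually not do.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Index of the center with the smallest squared Euclidean distance.
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # New center = mean of each coordinate; keep the old one if a cluster is empty.
        centers = [
            tuple(sum(col) / len(col) for col in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical customers: (income in $1,000s, monthly spend in $).
customers = [(30, 200), (32, 220), (35, 210), (90, 900), (95, 950), (88, 870)]
centers, clusters = kmeans(customers, centers=[customers[0], customers[3]])

for center, members in zip(centers, clusters):
    print("center", center, "members", members)
```

Each resulting center can be read as a "model" customer for its group, which is the idea behind target marketing on the previous slide.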
slide 20
Corporate Analysis and Risk Management
Finance planning and asset evaluation:
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning:
– summarize and compare the resources and spending
Competition:
– monitor competitors and market directions
– group customers into classes and set class-based pricing procedures
– set pricing strategy in a highly competitive market
slide 21
Other Applications
Sports

IBM Advanced Scout analyzed NBA game statistics (shots
blocked, assists, and fouls) to gain competitive advantage for
the New York Knicks and Miami Heat
Astronomy

JPL and the Palomar Observatory discovered 22 quasars with
the help of data mining
Internet Web Surf-Aid

IBM Surf-Aid applies data mining algorithms to Web access
logs for market-related pages to discover customer preferences
and behavior, analyze the effectiveness of Web marketing,
improve Web site organization, etc.
slide 22
On the News:
Rexer Analytics Annual Data Mining survey
The 2013 survey will become available
in Fall 2013 (stay tuned)
slide 23
Rexer Analytics 2011 Survey Overview
• SURVEY & PARTICIPANTS: 52-item survey of data miners, conducted on-line in
2011. Participants: 1,319 data miners from over 60 countries.
• FIELDS & GOALS: CRM/Marketing has been the #1 field for the past five years.
“Improving the understanding of customers”, “retaining customers” and other CRM goals
continue to be the primary goals.
• ALGORITHMS: Decision trees, regression, and cluster analysis continue to form the
top three algorithms for most data miners. A third of data miners currently use text mining
and another third plan to do so in the future.
• TOOLS: R continued its rise this year and is now being used by close to half of all data
miners (47%). R users prefer it for being free, open source, and having a wide variety of
algorithms. STATISTICA is selected as the primary data mining tool (17%). STATISTICA,
KNIME, Rapid Miner and Salford Systems received the strongest satisfaction ratings.
• ANALYTIC CAPABILITY AND SUCCESS MEASUREMENT: Only 12% of
corporate respondents rate their company as having very high analytic sophistication.
Measures of analytic success: Return on Investment (ROI), and predictive validity or
accuracy of their models. Challenges to measuring success: user cooperation and data
availability/quality.
slide 24
Where Data Miners Work
Data Mining is everywhere!
Data miners also report working in
Non-profit (6%), Hospitality /
Entertainment / Sports (3%), Military
/ Security (3%), and Other (9%).
© 2012 Rexer Analytics
slide 25
The Algorithms Data Miners Use
© 2012 Rexer Analytics
slide 26
The positive impact of Data Mining
In the 5th Annual Survey (2011) of Rexer Analytics (1,319 participant data
miners from over 60 countries) data miners shared examples of situations
where data mining is having a positive impact on society. The five areas
mentioned most often were:
Health / Medical Progress
Business Improvements
Personalized Communications & Marketing
Fraud Detection
Environmental
slide 27
The rise of Text Mining
Text mining status of respondents:
Text Miners: 34%
Plan to Start Text Mining: 33%
No Plans to Conduct Text Mining: 33%
Text Material:
Customer / market surveys: 38%
Blogs and other social media: 33%
E-mail or other correspondence: 27%
News articles: 25%
Scientific or technical literature: 23%
Web-site feedback: 22%
Online forums or review sites: 21%
Contact center notes or transcripts: 16%
Employee surveys: 15%
Insurance claims or underwriting notes: 15%
Medical records: 11%
Point of service notes or transcripts: 10%
© 2012 Rexer Analytics
slide 28
Data Mining Software
• The average data miner reports using 4 software tools.
• R is used by the most data miners (47%).
[Chart: software tool usage, shown Overall and by segment: Corporate, Consultants, Academics, NGO / Gov’t]
© 2012 Rexer Analytics
slide 29
Satisfaction with Data Mining Tools
[Chart: tool satisfaction ratings on a scale from Extremely Dissatisfied to Extremely Satisfied]
© 2012 Rexer Analytics
slide 30
Measuring Analytic Success
Question: Please share your best practices
concerning how you measure analytic project
performance / success. (text box provided for
response)
Number of respondents:
Model Performance (Accuracy, F, ROC, AUC, Lift): 53
Financial Performance (ROI, etc.): 43
Performance in Control or Other Group: 35
Feedback from User / Client / Management: 29
Cross-Validation: 14
© 2012 Rexer Analytics
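Several of the model performance measures named by respondents (accuracy, F, lift) can be computed by hand for a binary target. The sketch below is illustrative only: the actual outcomes, the model scores, and the 0.5 cutoff are invented, and the function names are hypothetical. It also assumes the data contains at least one positive case.

```python
def classification_summary(actual, predicted):
    """Accuracy, precision, recall, and F1 for a binary (0/1) target."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def lift_at(actual, scores, fraction):
    """Event rate among the top-scored fraction of cases, divided by the overall rate."""
    ranked = sorted(zip(scores, actual), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(a for _, a in ranked[:k]) / k
    base_rate = sum(actual) / len(actual)  # assumes at least one event
    return top_rate / base_rate


# Hypothetical outcomes (1 = event) and model scores for ten cases.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

summary = classification_summary(actual, predicted)
top_lift = lift_at(actual, scores, fraction=0.2)
print(summary)
print("lift in top 20%:", top_lift)
```

Financial measures such as ROI depend on costs and revenues that are specific to each project, so they are left out of this sketch.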
slide 31
Overcoming Data Mining challenges
In the four annual data miner surveys, these key
challenges have been identified by data miners more than
any others:
Dirty Data
Explaining Data Mining to Others
Unavailability of Data / Difficult Access to Data
slide 32
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
slide 33
Steps of a KDD Process
Learning the application domain:
– relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation:
– find useful features, dimensionality/variable reduction, invariant representation
Choosing data mining algorithms:
– summarization, classification, regression, association, clustering
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation:
– visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
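The steps above can be sketched as a chain of small functions, one per stage. This is a toy illustration rather than a real workflow: the records, field names, age bands, and the minimum-gap rule used for pattern evaluation are all invented for the example.

```python
# Hypothetical raw records; one has a missing value ("dirty data").
raw_records = [
    {"customer": "A", "age": 34, "spend": 120.0},
    {"customer": "B", "age": None, "spend": 80.0},
    {"customer": "C", "age": 51, "spend": 300.0},
    {"customer": "D", "age": 45, "spend": 260.0},
    {"customer": "E", "age": 29, "spend": 90.0},
]


def select(records, fields):
    """Create the target data set: keep only task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Data transformation: derive an age band feature."""
    return [
        {**r, "age_band": "40+" if r["age"] >= 40 else "under 40"}
        for r in records
    ]


def mine(records):
    """Data mining (summarization): mean spend per age band."""
    groups = {}
    for r in records:
        groups.setdefault(r["age_band"], []).append(r["spend"])
    return {band: sum(v) / len(v) for band, v in groups.items()}


def evaluate(patterns, min_gap):
    """Pattern evaluation: is the gap between groups large enough to matter?"""
    values = list(patterns.values())
    return max(values) - min(values) >= min_gap


patterns = mine(transform(clean(select(raw_records, ["age", "spend"]))))
interesting = evaluate(patterns, min_gap=100.0)
print(patterns, "interesting:", interesting)
```

Even in this tiny example most of the code deals with selecting, cleaning, and transforming data rather than with the mining step itself, which echoes the remark about preprocessing effort.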
slide 34
Data Mining and Business Intelligence
[Figure: pyramid of layers, with increasing potential to support business decisions toward the top]
Layers, from top to bottom:
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
slide 35