Lec14DataMining

Download Report

Transcript Lec14DataMining

Databases and Decision
Support Systems
CSS263 Lecture 14
LECTURE PLAN
 What is a Decision Support System?
 What is a Data Warehouse?
 Different uses for a Data Warehouse
 Problems of Data Warehousing
 What is OLAP?
 What is Data Mining?
 Data Mining Operations
 Data Mining Pit Falls!
COMPARISON OF OLTP AND DSS
ON-LINE TRANSACTION PROCESSING
 Updates to operational data
 Stores detailed data
 Repetitive processing
 Predictable pattern of usage
 Transaction driven
 Application oriented
 Supports day-to-day decisions
 Usually small changes
 Generally a large number of transactions
 Serves many operational users
COMPARISON OF OLTP AND DSS
DECISION SUPPORT SYSTEMS
 Analysis of historical data
 Ad-hoc fairly complex (read-only) queries
 Stores detailed, lightly, and highly summarised data
 Low to medium level of transaction throughput
 Analysis driven
 Subject oriented
 Supports strategic decisions
 Serves a few ‘managerial’ users
 Fast-response time required
DECISION SUPPORT SYSTEMS
SEEING DATA AS INFORMATION
• Operational Information
Users of data at the operational level of the business are
concerned with data at its highest level of detail, e.g.
particular accounts, invoices, delivery dates, etc…
• Tactical Information
Users of data at the tactical level are more interested in
aggregated historical data to assist in planning decisions.
• Strategic Information
Users of data at the strategic level are concerned with using
highly summarised data to give an overview of operations.
DATA WAREHOUSING
DATA WAREHOUSING
WHAT IS A DATA WAREHOUSE?
DEFINITION :
‘A subject-oriented, integrated, time-variant, and non-volatile collection of
data in support of management’s decision-making process’ [Inmon, 1993].
SUBJECT-ORIENTED:
The warehouse is organized around the major subjects of an enterprise
(e.g. customers, products, and sales) rather than the major application
areas (e.g. customer invoicing, stock control, and order processing).
DATA WAREHOUSING
WHAT IS A DATA WAREHOUSE?
DEFINITION :
‘A subject-oriented, integrated, time-variant, and non-volatile collection of
data in support of management’s decision-making process’ [Inmon, 1993].
INTEGRATED DATA:
The data warehouse integrates corporate application-oriented data
from different source systems, which often includes data that is
inconsistent. Such data, must be made consistent to present a unified
view of the data to the users.
DATA WAREHOUSING
WHAT IS A DATA WAREHOUSE?
DEFINITION :
‘A subject-oriented, integrated, time-variant, and non-volatile collection of
data in support of management’s decision-making process’ [Inmon, 1993].
TIME VARIANT:
Data in the warehouse is only accurate and valid at some point in time
or over some time interval. Time-variance is also shown in the
extended time that the data is held, the association of time with all
data, and the fact that data represents a series of historical snapshots.
DATA WAREHOUSING
WHAT IS A DATA WAREHOUSE?
DEFINITION :
‘A subject-oriented, integrated, time-variant, and non-volatile collection of
data in support of management’s decision-making process’ [Inmon, 1993].
NON-VOLITILE:
Data in the warehouse is not updated in real-time but is refreshed from
operational systems on a regular basis. New data is always added as a
supplement to the database, rather than a replacement.
DATA WAREHOUSING
THE USE OF A DATA WAREHOUSE
INVENTORY
DATABASE
PERSONNEL
DATABASE
NEWCASTLE
SALES DB
STEP 1: Load the Data Warehouse
STEP 2: Question the Data Warehouse
DATA
WAREHOUSE
LONDON
SALES DB
GLASGOW
SALES DB
STEP 3: Do something
with what you learn from
the Data Warehouse
DECISIONS
and ACTIONS!
DATA WAREHOUSING
CREATING A DATA WAREHOUSE
• Data Collection
There need to be extraction routines to gather data from
the various operational data sources that interface with
the Data Warehouse.
• Data Cleaning & Transformation
Data must be checked for validity and accuracy, and
differences in syntax and semantics must be resolved.
DATA WAREHOUSING
CREATING A DATA WAREHOUSE
• Data Loading
Data must be loaded into the Data Warehouse after
carrying out appropriate summarisation and aggregation.
Often this will be done using parallelism (as it could take
weeks to serially load a terabyte of data!).
• Data Refresh
Updates to base data (operational data) must periodically
be propagated to the Data Warehouse.
DATA WAREHOUSING
CREATING A DATA WAREHOUSE
• Data Storage
Appropriate storage structures must exist to allow the
Data Warehouse to support fast access for search and
analysis of differing data types (text, graphic, picture, …).
DATA WAREHOUSING
ARCHITECTURE OF A DATA WAREHOUSE
Warehouse Manager
Operational
data source
1
Meta
Data
Operational
data source
2
Highly
Summarized
Data
Lightly
Summarized
Data
Detailed
Data
Operational
data source
n
Reporting query,
A/P development
and EIS tools
DBMS
OLAP tools
Warehouse Manager
Archive/
Backup
data
Data mining tools
DATA WAREHOUSING
DATA WAREHOUSE INFORMATION FLOWS
• INFLOW - Processes associated with the extraction,
cleansing, and loading of the data from the source systems
into the data warehouse.
• UPFLOW - Processes associated with adding value to the
data in the warehouse through summarizing, packaging,
and distribution of the data.
• DOWNFLOW - Processes associated with archiving and
backing-up/recovery of data in the warehouse.
• OUTFLOW - Processes associated with making the data
available to the end-users.
• METAFLOW - Processes associated with the management
of the metadata.
DATA WAREHOUSING
DATA FLOWS IN A DATA WAREHOUSE
DATA WAREHOUSING
DATA WAREHOUSE DBMS REQUIREMENTS
•
•
•
•
•
•
•
•
•
•
Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query functionality
DATA WAREHOUSING
PROBLEMS
•
•
•
•
•
•
•
•
•
•
Underestimation of resources for data loading
Hidden problems with source systems
Required data not captured
Increased end-user demands
Data homogenization
High demand for resources
Data ownership
High maintenance
Long duration projects
Complexity of integration
ON-LINE ANALYTICAL
PROCESSING (OLAP)
OLAP
WHAT IS OLAP?
DEFINITION :
‘OLAP applications and tools are those that are designed to ask
ad hoc, complex queries of large multidimensional collections
of data. It is for this reason that OLAP is often mentioned in the
context of Data Warehouses’.
OLAP
TYPICAL OLAP QUESTIONS
• Which type of property sells for prices above the average selling
price for properties in the main cities of Great Britain and how
does this correlate to demographic data?
• What are the three most popular areas in each city for renting
property in 1997 and how does this compare with the figures for
the previous two years?
• What is the current monthly revenue for property sales at each
branch office, compared with rolling 12-monthly prior figures?
• What is the relationship between the total annual revenue
generated by each branch office and the total number of sales
staff assigned to each branch office?
OLAP
CODD’S RULES
•
•
•
•
•
•
•
•
•
•
•
•
Multi-dimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client-server architecture
Generic dimensionality
Dynamic sparse matrix handling
Multi-user support
Unrestricted cross-dimensional operations
Intuitive data manipulation
Flexible reporting
Unlimited dimensions and aggregation levels
OLAP
MULTDIMENSIONAL DATA MODEL
London
Glasgow
Newcastle
Socks
Jumpers
10
50
10
10
0
0
1
2
80
80
80
80
0
25
20
15
0
0
0
0
Spring
T-Shirts
Shorts
Pyjamas
Summer Autumn Winter
Example: Three dimensions – Product, Sales Area, and Season
OLAP
TYPICAL OLAP OPERATIONS
Drill
Down
Total Sales
Total Sales per city
Total Sales per city per store
Total Sales per city per store per month
Drill
Down
Total Sales
Total Sales per city
Total Sales per city by category
Drill Across
Drill
Up
Drill
Up
OLAP
TYPICAL ARCHITECTURE FOR MOLAP TOOLS
OLAP
TYPICAL ARCHITECTURE FOR ROLAP TOOLS
OLAP
RELATIONAL STAR SCHEMA
PRODUCTS
id
pname cat
LOCATIONS
desc
price
….
locid
city
Dimension Table
SALES
timeid
date
TIMES
week
id
month quarter year
timeid locid
…...
Dimension Table
state country …..
Dimension Table
amount
cost
Fact Table
DATA MINING
DATA MINING
WHAT IS DATA MINING
DEFINITION :
‘A set of techniques used in an automated approach to
exhaustively explore and bring to the surface complex
relationships in very large datasets’
[DBMS DATA WAREHOUSE SUPPLEMENT – AUG 1996]
DATA MINING
WHAT IS DATA MINING
SHORT DEFINITION :
‘Spot hidden gold in large collections of data’
IMPORTANT – Data Mining tools extract NEW information from
data, this information is then used to guide business decisions about
future activities
DATA MINING
DATA MINING APPLICATIONS
RETAIL/MARKETING:
• Identifying buying patterns of customers
• Finding associations among customers demographic characteristics
• Predicting response to mailing campaigns
• Market basket analysis
BANKING:
• Detecting patterns of fraudulent credit card use
• Identifying loyal customers
• Predicting customers likely to change their credit card affiliations
• Determining credit card spending by customer groups
DATA MINING
DATA MINING APPLICATIONS
INSURANCE:
• Claims analysis
• Predicting which customers will buy new policies
MEDICINE:
• Characterising patient behaviour to predict surgery visits
• Identifying successful medical therapies for different illnesses
DATA MINING
DATA MINING QUESTIONS
DISCOVERY-ORIENTED (LINK ANALYSIS):
“What are the factors that determine sales of Product X”
PREDICTIVE MODELLING:
“How much profit will this customer generate?”
“Where is the best place to build a new road?”
DATA MINING
DATA MINING OPERATIONS
DESCRIPTIVE OPERATIONS
ASSOCIATION RULES
Descriptive model that discovers rules that relate separate
classes of data items together. For example, ‘people who buy
beer also buy crisps 50% of the time’.
SEQUENCING RULES
Descriptive model that discovers sequence correlations in timesequenced data. For example, ‘People who have purchased a
VCR are 300% more likely to purchase a camcorder in the time
period 2-4 months after the VCR was purchased’
DATA MINING
DATA MINING OPERATIONS
PREDICTIVE OPERATIONS
CLASSIFICATION
Predict class membership. For example, income within one of
three categorical values: ‘Low’, ‘Middle’, or ‘High’.
REGRESSION
Predict a specific value. For example, income will be a certain
amount.
DATA MINING
CLASSIFICATION AND REGRESSION
This is the largest area where data mining is currently applied!
All techniques generate a predictive model based on historical
data. The model then predicts the outcome of new cases. This is
known as ‘Data Training’.
The data necessary to build a predictive model therefore has to
be composed of cases where the outcome is known and
included.
DATA MINING
CLASSIFICATION AND REGRESSION
EXAMPLE:
‘It may be found that if a Bank’s customer is aged between 18
and 24, and their average account balance is between £0.00
and £200.00, then they are highly likely to default on a loan.’
This rule will then be applied to predict whether it would be
wise to authorise a bank loan, for a particular customer.
DATA MINING TECHNIQUES
DATA MINING
CLASSIFICATION AND REGRESSION
DECISION TREES
DATA MINING
CLASSIFICATION AND REGRESSION
NEURAL NETWORKS
DATA MINING
CLASSIFICATION
NAÏVE-BAYES
This technique limits its inputs to categorical data, and it is
applicable only to classification. Simplicity and speed make
this an ideal exploratory tool.
The technique is based on a simple concept; conditional
probabilities derived from observed frequencies in the training
data.
DATA MINING
CLASSIFICATION
NAÏVE-BAYES - EXAMPLE
Try to predict customer turnover based on the following facts:
75% of customers who had monthly bills of between £300 and £400
have left. 68% of customers who had made more than four calls to
customer service have left.
This technique will predict that a customer who has an average
monthly bill of £380, and who has made three calls to customer
services has a high likelihood of leaving soon.
Therefore, they should be contacted and offered a discount!
DATA MINING PIT-FALLS!
DATA MINING
CORRELATIONS AND CAUSALITY
Data mining tools find correlations, not causes, and the rules
and predictions that come out of data mining tools are based
on correlation only.
EXAMPLE:
Rule: "Customers who purchase pasta are three times more likely to
purchase cheese than customers who don’t buy pasta"
Therefore:
Does buying pasta cause people to buy cheese?
Does buying cheese cause people to buy pasta?
Or is it the sudden popularity of a book called ‘You Can Lose Five
Pounds a Week Eating Pasta With Cheese!’ ?