Data Modeling, OLAP, Data Mining

Download Report

Transcript Data Modeling, OLAP, Data Mining

Business Intelligence
&
Data Warehousing
Tom A. Fürstenberg
Business Intelligence Consultant
Cap Gemini Ernst & Young
1
Leerdoelen college
• Wat is BI & DWH? (Conceptueel en
Technisch)
• Toepassing van BI & DWH
• De praktijk van een consultant iha en bij Cap
Gemini Ernst & Young ihb
2
Inhoud College
•
•
•
•
•
•
•
•
•
•
Performance Management
Business Intelligence (Performance Measurement)
OLAP
Extranets
Architectuur
Data Warehouse
ETL
Multidimensioneel Modelleren
CGE&Y Aanpak
3
Data Mining
Performance
Management
Doelgericht meten en bijsturen van
bedrijfsdoelstellingen
4
In control of a company
5
Overview
Strategie &
Missie
Conceptueel
Besturingsmodel
Verantwoordelijkheden
& Bevoegdheden
Doelen
Middelen
Randvoorwaarden
Operationeel
Besturingsmodel
Besturingssystematiek
Key Performance Indicators
Critical Succes Indicators
InformatieExternal Indicators
voorziening
Informatiemodel
Informatiesysteem
Datawarehouse
6
data
data
data
Besturings visie: Bouwstenen
voor besturing van organisaties
Wie?
Organisatie
Methoden
Systemen
Hoe?
Doelstellingen &
Prestatie indicatoren
Wat?
Strategie &Missie
Stake holders
Waarden & normen
7
Waarom ?
Methoden
Diverse Financiële modellen
Balanced Scorecard
B eleid &
strategie
M id d elen m n gt.
8
P ro c e s s e n
L e id e rs c h a p
INK managementmodel
w aard erin g
d oor
k lan ten
w aard erin g
d oor
m aatsch ap p ij
O n d e rn e m in g s re s u lta te n
w aard erin g
d oor
p erson eel
P erson eelsb eleid
Naar een operationeel Besturingsmodel
CSI
CSI
CSI
CSI
KPI
OI
OI
Na het vaststellen van de Doelen en KPI’s
Worden de Critische Succes Indicatoren bepaald
Gevolgd door het vaststellen van de Omgevings Indicatoren
9
OI
OI
CSI
CSI
CSI
Naar een operationeel Besturingsmodel
KPI
KPI
KPI
KPI
Tijd
Product
Markt
Afdeling
10 Regio
en Bevoegdheden
Multi-dimensionale
OI
OI
CSI
CSI
Verantwoordelijkheden
CSI
Van Model naar Gedragsverandering
Gegevensstructuur
KPI
KPI
KPI
KPI
Operationeel Besturingsmodel
Management
Charter
Informatievoorziening
Planning
en Commitment
Beoordeling en Sturing
11
Some Typical Mgt. Questions
PRODUCT
• How much have we sold?
• Which product gives the best profit?
• Which product has the largest sales
volume this quarter?
• Which product best meets market
needs?
• How much to produce of each product?
CUSTOMER
• Who is the most profitable customer?
• What is the satisfaction level?
• Which are the best segments?
• Which service to improve?
• How many customers have we lost last
year?
• Who are our biggest accounts?
CHANNEL
MARKETING
• Which retailer yields most by
volume and which by profit?
• What promotions will yield most
profit?
• What effect will discounts have on
the turnover?
• What are the area coverage levels?
• How many contacted people became
a customer?
• Promotions’ results?
• What is the competition doing?
12
Key Performance Indicators Top 10
P u rch asin g
# h its
P ro d u ctio n
# h its
1
developm ent of purc has e pric es
86
1
overhead c os ts
94
2
s toc k level by item or produc t line
84
2
direc t c os ts of m aterials
93
3
reliability of s uppliers ; quality (z ero-defec ts )
81
3
direc t c os ts of labour
92
4
financ ial pos ition of s uppliers
81
4
fac tory overhead ex pens es
91
5
reliability of s uppliers ; tim e to delivery
79
5
quality finis hed produc t
91
6
reliability of s uppliers ; c om pletenes s
74
6
s ic k day s
91
7
purc has ing departm ent c os ts
72
7
overtim e
90
8
num ber of goods returned
48
8
s toc k of raw m aterials
90
9
c as h dis c ounts
47
9
produc tion volum e
89
10
ac c ounts pay able as a % of purc has ing value
39
10
m aintenanc e c os ts
89
S ale s
# h its
F in an ce
# h its
1
s ales volum e or s ales volum e grow th
95
1
E arnings B efore Interes t & Tax es (E B IT)
104
2
deprec iation of ac c ounts rec eivable
92
2
G ros s m argin
103
3
num ber of buy er c om plaints
92
3
P rofit B efore Tax (P B T)
102
4
quality of deliveries
88
4
G ros s inves tm ents
102
5
m ark eting ex pens es
87
5
w ork ing c apital
101
6
freight c os ts
86
6
c as h flow
94
7
value of new orders
85
7
liquidity pos ition
92
8
ac c ounts rec eivable as a % of turnover
85
8
R eturn on C apital E m ploy ed
77
9
reliability of deliveries ; tim e to delivery
84
9
R eturn on S ales
75
10
m ark et grow th
83
10
turnover tim e of goods in s toc k
75
Source:
Results FIND! The Best benchmarkstudy conducted in
1997/1998 by Ernst & Young Consulting and VU. 103
industrial companies participated in the study.
13
En nu alleen nog even
meten…
Business Intelligence
(performance measurement)
14
The Answers
The information is there, but spread
15 everywhere!
De praktijk...
16
Problemen
•
•
•
•
•
•
•
•
•
•
•
•
(Over)belasting IT-afdeling (queries)
Lange doorlooptijd rapport-’fabricage’
Hoge kosten aan manuren
Databronnen moeilijk integreerbaar
Niet-gestandaardiseerde rapporten
Geen eenduidige definities
Foutgevoelig
Manipuleerbaar
Afhankelijkheid van ‘schakels’
Discussies over verschillen in cijfers
Beperkte analyse-mogelijkheden
Verkeerde en te late interpretaties,
17
conclusies, beslissingen
Een druk op de knop...
18
Van chaos... Naar structuur
19
Why now? Hype? Developments:
Market Pull
• Globalisation of markets
• Individualisation of customers
• Shorter life cycle of products
• Information overload
• Mergers
Technology Push
• Faster hardware
• Cheaper disk capacity
• Modern OLAP-tools
• Any access: c/s, web, mobile
20
OnLine Analytical Processing
• Gebaseerd op de syntax van managementinformatie vragen:
<meetwaarde> per <dim1> per <dim2> per ...
• KPI’s, CSI’s en OI’s zijn meetwaarden
• Produkt, Regio, Klant, Tijd, etc. zijn
dimensies (slice & dice)
• Dimensies kennen hierachiën (drill down)
21
OLAP
Product Manager’s View
Regional Manager’s View
Financial Manager’s View
Ad Hoc View
22
Introduction to Cubes
Location
Atlanta
Denver
Detroit
Grapes
Cherries
Melons
Sales
Sales
Apples
Pears
Q1
Q2 Q3
Time
23
Q4
Product
Demo
• eFashion Case
• BusinessObjects Demo
24
BusinessObjects: Semantic
Layer
25
Any Access
26
Info- & Analysis-need at 3rd parties
27
e/m-Business Intelligence: Extranets
SUPPLIERS
extr anet
Data Warehouse
extra
net
extr anet
CUSTOMERS
28
PARTNERS
Extranet demo’s
29
Business Intelligence
Theory
30
BI Definition
Business Intelligence is the process of
collection, cleansing, combining, consolidation,
analysis, interpretation and communication of
all internal and available external data, relevant
for the decision making process in the
organisation
31
BI Concept
Business Value
Decisions
Action
Analysis
Integration
Collection
Knowledge
Information
Data
32
Feedback
BI Systems
Reporting &
Query
DSS, MIS and EIS
OLAP
Data Mining
33
The Five Functional Levels
‘bunch of
reports’,
‘cube’
i.e. finding
variables
Complexity of the question
Number of users
reporting
analysis
querying
standard
reports
Static
unique
‘report’ or
question
exploring
mining
34
i.e stat. analysis,
testing
a hypothesis
Dynamic
The Five Functional Levels
Complexity of the question
Number of users
reporting
80 %
analysis
of all
users
Static/
Dynamic
querying
exploring
mining
35
interactief
Corporate Information Factory
Any Source
Applications
External data
Any Data
L
O
A
D
M
A
N
A
G
E
M
E
N
T
Any Access
Q
U
E
R
Y
Data Marts
Data Warehouse
Operational Data Store
36
M
A
N
A
G
E
M
E
N
T
LAN/
WAN
WWW
Components of the CIF
•
•
•
•
Data Warehouse
Data Mart
Operational Data Store
ETL
37
Data Warehouse
38
Definition Bill Inmon
Characteristics of a data warehouse:
• Subject-oriented
• Integrated
• Time-variant
• Non-volatile
• Both summary and detailed data
39
Data Warehouse
• Contains data that can be used to meet the
information of (part of) the organisation
• Contains integrated data extracted from one
or more sources
• Mostly contains large amounts of data
• Contains data that is clean and consistent
• May contain aggregated data
• Optimised for its use
40
Data Warehouse
Data Base
Data Warehouse
Actual
Historical
Internal
Internal and External
Isolated
Integrated
Transactions
Analysis
Normalised
Dimensional
Dirty
Clean and Consistent
Detailed
Detailed and Summary
41
Data Warehouse
Advantages
• One point of contact
• Time savings
• No loss of historical data
• OLTP’s not hampered by BI activities
• Better consistency and quality of data
• Improvement of Business Intelligence
42
Data Warehouse
Disadvantages
• Never quite up-to-date
• Requires a lot of storage space
• Requires a lot of communication,
coordination and cooperation
• Large impact on the organisation
• A data warehouse is only the beginning
43
Data Mart
• DW design does not optimise query
performance
• Data is not stored in an optimal fashion for
any given department in the DW
• Competition to get the resources required to
get inside the DW
• Costs for DSS computing facilities are high
because of the large volume in DW
44
Data Mart
Characteristics:
• Customised for a specific department
• Limited amount of history
• Summarised
• Very flexible
• Elegant presentation
• Processor dedicated to the department
45
Data Mart
Divided by:
• Business
• Geography
• Security
• Political (budget)
• Structure (data mining)
46
Data Mart
Three different kinds of data marts:
• Subset/summary
• MOLAP
• ROLAP
47
Operational Data Store
Characteristics:
• Subject-oriented
• Integrated
• Current-valued
• Volatile
• Detailed data
48
ETL: Extraction
Source selection:
• Data model is starting point: determine data
elements that are needed
• For each data element, determine available
data sources
• If more han 1 source available, select on:
–
–
–
–
Quality, reliability and integrity
Scope of data
Location and availability of data
Location and availability of expertise
49
ETL: Transformation
Processing:
• Aggregate records
• Encoding structures
• Simple reformatting
• Mathematical conversion
• Resequencing of data
• Default values
• Key conversion
• Cleansing
50
ETL: Transformation
Key transformation
Key structure A
Key structure B
Key structure A
Key structure C
Key structure A
Key structure B
New key structure
Key structure C
51
ETL: Cleansing
Data quality is critical for:
• Marketing communications
• Targeted marketing
• Customer matching
• Retail- and commercial householding
• Combining information
• Tracking retail sales
52
ETL: Cleansing
Common excuses for not cleaning:
• The data in the operational systems seem to
work just fine
• Data can be joined most of the time
• Cleansing will take place after population of
the data warehouse
• Data entry will be improved
• The users will never agree to change their
data
53
Multi
Dimensional
Data
Modeling
54
MD Modeling: Contents
•
•
•
•
•
•
•
•
E/R Modeling (Ex.)
MD Modeling (Ex.)
Star Schema
Slowly Changing Dimensions (Ex.)
Surrogate Keys
Aggregation (Ex.)
Measures & Dimensions reviewed
Other important MDM aspects
55
Exercise: E/R Modeling
How could the sales transaction database of the
eFashion retailer look like?
Loyalty Card
Products
Products Sold
Ticket
Card_nr
Ticket_nr
Ticket_nr
Product_nr
Cust_name
Store_nr
Product_nr
Bar_Code
Adress
Card-nr
#_products
Prod_Desc
Zip_code
Employee_nr
price
Actual_price
City
Time_Stamp
dicount
Weight
...
...
Store
Store_nr
Employee
Employee_nr
Store_name
Emp_name
Adress
...
Zip_code
City
State
Manager
56
...
Management Questions
• Give me the annual revenue of all my product
lines divided over all the sales regions over
the last 3 years
• Give me the top 10 of most profitable
products this year
• Give me the top 10 of most sold products of
last year
• Give me the top 10 of most profitable
customers
• Compare the YTD revenue with the one in the
same period last year and
the target
57
Why not E/R Modeling?
• End users cannot understand, remember,
navigate an E/R model (not even with a GUI)
• Software cannot usefully query an E/R model
• Use of E/R modeling doesn’t meet the DW
purpose: intuitive and high performance
querying
58
Exercise: Model the Efashion DM
•
•
•
•
Sales Revenue
Time hierarchy (Year-Quarter-Month)
Store hierarchy (Region, State, City, Store)
Product hierarchy (Line, Category, SKU)
59
eFashion Data Mart
Time
Facts
Month_nr
Month_nr
Store_nr
Month_desc
SKU_nr
SKU_nr
Quarter
SKU_desc
Sales_revenue
Year
Category
...
Product
Line
Geography
Store_nr
Store_name
City
State
Region
60
DW Modeling Components
Dimension
Tables
Geographic
Dimension
Fact Table
Geographic
Product
Product
Time
Measures
Units
$
Facts
Time
61
Using a Star Schema
Employee_Dim
EmployeeKey
EmployeeID
.
.
.
Dimension Table
Time_Dim
TimeKey
TheDate
Fact Table
Sales_Fact
Product_Dim
ProductKey
ProductID
TimeKey
EmployeeKey
ProductKey
CustomerKey
ShipperKey
.
.
.
.
.
.
$
Shipper_Dim
.
.
.
Customer_Dim
ShipperKey
ShipperID
.
.
.
CustomerKey
CustomerID
62
.
.
.
Components of a Star Schema
Employee_Dim
EmployeeKey
EmployeeID
.
.
.
Time_Dim
Sales_Fact
TimeKey
TheDate
.
.
.
Dimensional Keys
Shipper_Dim
TimeKey
TimeKey
EmployeeKey
ProductKey
CustomerKey
ShipperKey
$
.
.
.
ProductKey
ProductID
.
.
.
Multipart Key
Measures
Customer_Dim
ShipperKey
ShipperID
.
.
.
Product_Dim
CustomerKey
CustomerID
63
.
.
.
Exercise: Slowly Changing Dimensions
Suppose the product categories change from
time to time.
Model the Data Mart when the manager wants
to see historical reports against:
1. The present categories
2. The categories at the time of the sale
3. Both against the present categories and the
immediate previous categories
4. The categories at any specified time
64
SCD Exercise 1
Time
Facts
Month_nr
Month_nr
Store_nr
Month_desc
SKU_nr
SKU_nr
Quarter
SKU_desc
Sales_revenue
Year
Category
...
Product
Line
Geography
Store_nr
Store_name
City
State
Region
65
SCD Exercise 2
Time
Facts
Month_nr
Month_nr
Store_nr
Month_desc
Product_key
Product_key
Quarter
SKU_nr
Sales_revenue
Year
SKU_desc
...
Product
Category
Line
Geography
Store_nr
Store_name
City
State
Region
Most Recent
Product Key Map
Product_key
SKU_nr
66
SCD Exercise 3
Time
Facts
Month_nr
Month_nr
Store_nr
Month_desc
Product_key
Product_key
Quarter
SKU_nr
Sales_revenue
Year
SKU_desc
...
Product
Category
Category_old
Geography
Line
Store_nr
Store_name
City
State
Region
67
SCD Exercise 4
Time
Facts
Month_nr
Month_nr
Store_nr
Month_desc
SKU_nr
SKU_nr
Quarter
SKU_desc
Sales_revenue
Year
Category
...
Product
Line
Valid_from
Geography
Valid_until
Store_nr
Store_name
City
State
Region
68
Slowly Changing Dimensions
• Type 1: Overwrite the dimension record
• Type 2: Create new dimension record
• Type 3: Create an ‘old’ field in the dimension
record
• Type 4: Add a valid_from and valid_until
field in the dimension record
Ad. Type 2: requires surrogate keys, but in general, one should always use
these because of performance and flexibility
Ad. Type 4: Kimball only recognizes 3 types SCD’s
69
Always Use Surrogate Keys
• Allows DWH to assign new key versions for
SCD’s (type 2)
• Higher performance with numeric keys than
with long, alphanumeric keys
70
Exercise: Aggregation
Suppose the manager queries frequently on
product line level and finds the performance too
low.
Question: How to model the data mart when we
want to add aggregated measures on product
line level?
71
Exercise: Aggregation
Time
Facts
Month_nr
Month_nr
Store_nr
Month_desc
Product_key
Product_key
Quarter
SKU_nr
Sales_revenue
Year
SKU_desc
...
Product
Category
Line
Geography
Aggregated Facts
Store_nr
Store_name
Week_nr
City
Product_Line
Store_nr
Line_key
Line_key
Line
Sales_revenue
State
...
72
Region
Exercise: Measures
Add the following measures to the eFashion
Data Mart:
• Stock Quantity
• Product Price
• Promotion Costs (product-specific, storeindependent)
73
Exercise: Measures
Q_Stock Facts
Month
Quarter
Quarter
Store_nr
Year
SKU_nr
Stock_qty (av, eom)
Time
Facts
Month_nr
Month_nr
Store_nr
Month_desc
Product_key
Product_key
Quarter
SKU_nr
Sales_revenue
Year
SKU_desc
Stock_qty
Product
Price
Category
Line
(Valid_from
Valid_until)
Promotion Facts
Geography
Store_nr
Month_nr
Store_name
SKU_nr
City
Promotion_cost
State
Duration
Region
Promotion_type
...
74
Measures & Dimensions reviewed
The most useful measures are
• Numeric
• Additive
Dimensions are:
• The natural entry points of the facts
• I.e., used for constraints and report breaks
• Independent of each other, not hierarchically
related
75
Other Important MDM-Aspects
•
•
•
•
•
•
Cardinality
Grain
Referential Integrity
Conformed Dimensions
Drill Across
Traps
76
How to make the CIF?
Any Source
Applications
External data
Any Data
L
O
A
D
M
A
N
A
G
E
M
E
N
T
Any Access
Q
U
E
R
Y
Data Marts
Data Warehouse
Operational Data Store
77
M
A
N
A
G
E
M
E
N
T
LAN/
WAN
WWW
CGE&Y BI-Approach Overview
Metamodel
Extraction,
Transformation
Load
Development
Data Warehouse
Architecture II
Data Warehouse
Architecture I
Evolutionary Strategy
Project Management
78
Implementation
Definition Increments
Source data
Incremental Delivery
DW blueprint
Strategy & objectives
Awareness
Communication
79
Data Mining
80
Data Mining
Definition:
The process of digging intelligently into large
volumes of data to discover and analyse
previously unknown relationships or to validate
hypotheses.
81
Data Mining Versus OLAP
OLAP/Query
Data Mining
Are there some
customers from large
accounts with a high
decrease in
international calls?
Are there any
common
characteristics among
these customers?
Information
Data
82
Applications
• Risk Analysis (grant credit, investment)
• Fraud Detection (telephone charge, bank
withdrawals)
• Trouble Shooting and Diagnosis
• Process Controls (wafer fabrication)
• Promotion Analysis
• Bankruptcy Prediction (mortgage lending,
business partners)
• Customer Churn (telco)
• CRM (next slides) 83
Maximizing Customer Value
•
•
•
•
Getting more prospects in
Turning prospects into customers
Selling more products to existing customers
Getting less customers out
84
Which ones in and which ones out?
Growth
Yield
per
individual
customer
Keep
Migration
Yield per customer
Costs per customer
Highest
Lowest
Customer profitability
Out
Example: One to One Marketing
• Treat different customers differently
– differentiate message
– differentiate product offer
– differentiate channel
• Need for usable information => predict
customer behavior out of databases
86
89
90
91
92
Example: clickstream analysis
• What parts of our Web site get the most visitors?
• What parts of the Web site do we associate most frequently with actual
sales?
• What parts of the Web site are superfluous or visited infrequently?
• Which pages on our Web site seem to be "session killers," where the
remote user stops the session and leaves?
• What is the new-visitor click profile on our site?
• What is the click profile of an existing customer? A profitable
customer? A complaining customer that all too frequently returns our
product?
• What is the click profile of a customer about to cancel our service,
complain, or sue us?
• How can we induce the customer to register with our site so we learn
some useful information about that customer?
• How many visits do unregistered customers typically make with us before
they are willing to register? Before they buy a product or service?
93
Customized Customer Service
Tele-sales
94
Service desk
Example: Contact Strategy
Good
Tele-sales
Data
mining
Customer
Data
Direct
Mail
Bad
95
Channel optimisation
Sales
visit
The customer choses the channel
Operational systems
Service question
CC
App.
Complaint handling
Integration
Analysis
Leaflet request
Contact
Leaflet receipt
Order
Status
Order
service question
Organisation!
Status order
Email
Complaint
96
Data Sources for Data Mining
Collecting &
Cleansing
DATA
• Transactions (loyalty cards)
• Behaviour of existing customers
• Logfiles & cookies
• Market research
• Data suppliers
• Public data
97
Example: Affinity Grouping
• Market Basket: what items are sold together?
• Market Basket: what categories are sold with
what items?
• Market Basket: what is not sold with certain
items?
• Event Correlations: what other services are
brought in the first month after signing up for
a satellite TV subscription?
98
Data Mining Techniques
• Decision Trees, Classification Trees, Rule
Induction
• Neural Nets
• Visualisation
• Fuzzy Logic; Nearest Neighbour; Memory
Based Reasoning; Case Based Reasoning
• Proprietary Logic
• Classical Statistics
99
Predictive Power
Data Mining Techniques





Statistical analysis
Neural networks
Genetic algorithms
Decision trees
 Intuïtion




Simplicity
100
Critical Success Factors
• Data availability (large amounts of a wide
variety of data)
• Data consistency
• Data quality
• Domain expertise
• Data used/needed is allowed by privacy laws
101
Benefits
• Improved customer relationships
• More revenue from existing customers
• Market segmentation
• Differentiated products and services
• Differentiated sales channels
• More effective marketing programs
• Improved fraud detection
• Improved investments
•…
102
Demo
Decision Tree with
BusinessMiner from
BusinessObjects
103
Contact information
Tom A. Fürstenberg
Business Intelligence Consultant
Cap Gemini Ernst & Young
Sector Energy, Products & Transport
Tel +31 6 21 878 915
email: [email protected]
104