Week5 - Information Management and Systems
Download
Report
Transcript Week5 - Information Management and Systems
Data-Driven Business
Intelligence Systems:
Part I
Week 5
Dr. Jocelyn San Pedro
School of Information Management &
Systems
Monash University
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
Lecture Outline
Data-driven BIS
Data warehouse
Data warehouse architectures
Entity-Relationship Modelling
Multi-dimensional Modelling
Star Schema
An Example: Retail Trading
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
2
Learning Objectives
At the end of this lecture, the students will
Have better understanding of concepts, tools and
technology underlying data-driven business
intelligence systems
Have knowledge of multidimensional modelling and
star schema for data modelling for data warehouses
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
3
Data-Driven Business
Intelligence Systems
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
4
Data-Driven BIS
Data-driven BIS
information systems that provide BI through access
and manipulation of large databases of structured data
includes tools for
“drill down” for more detailed information
“drill up” for broader, more summarised view
“slice and dice” for a change in data dimensions
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
5
Data-Driven BIS
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
6
Data-Driven BIS
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
7
Data-Driven BIS
“Slicing” the
$225,764
cube
$201,196
$162,504
$182,500
$123,033
$72,528
$116,963
$75,048
$68,792
P
r
o
d
u
c
t
Peacock
Leverling
Fuller
Davolio
Callahan
Suyama
King
Dodsworth
Buchanan
Time
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
8
Data-Driven BIS
“Dicing” the
cube
$225,764
$201,196
$162,504
$182,500
$123,033
$72,528
$116,963
$75,048
$68,792
Q3
$23,181
Q2
$16,035
Q1
$6,858
Suyaman
King
Dodsworth
Buchanan
$22,719
Peacock
Leverling
Fuller
Davolio
Callahan
Q4
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
9
Data Warehouse
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
10
Data Warehouse
A data warehouse is a subject-oriented,
integrated, time-variant, nonvolatile collection
of data in support of management’s decision
making process – Bill Inmon (1995)
Subject-oriented: focus is on subjects related to
business or organisational activity like customers,
employees, suppliers (instead of applications-oriented
(finance, marketing, production)
Integrated: data from various databases is stored in a
consistent format through use of naming conventions,
domain constraints, physical attributes and
measurements
Time-variant: associating data with specific points in
time
Nonvolatile: data does not change once it is in the
data warehouse and stored in data warehouse
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
11
Data Warehouse
Data warehouse is a copy of transaction data
specifically structured for query and analysis
– Ralph Kimball (1996)
Data warehouse is a specific database designed
and populated to provide decision support in
an organisation - Gray and Watson (1998)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
12
Data Warehouse
Data warehousing emerged as result of
improvements in database technology – relational data
model and relational database management systems
(DBMS)
advances in computer hardware - emergence of
affordable mass storage and parallel computer
architectures
emergence of end-user computing, facilitated by
powerful, intuitive computer interfaces and tools
advances in middleware products that enable
enterprise database connectivity across
heterogeneous platforms
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
13
Data Warehouse
triggered by recognition of fundamental
differences between operational (or production)
systems and informational (or decision
support) systems
Operational system – system that is used to run a
business in real time, based on current data – e.g.
sales order processing, reservation systems, patient
registration
Informational systems – designed to support
decision making based on stable point-in-time or
historical data; for complex read-only queries or
data mining applications – e.g. sales trend analysis,
customer segmentation, human resources planning
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
14
Data Warehouse
Comparison of Operational and informational Systems –
McFadden, Hoffer and Prescott 1999
Characteristic Operational
Informational
Primary purpose Run the business on a
current basis
Support managerial
decision making
Type of data
Current representation
state of the business
Historical or point-in(snapshots)
Primary users
Clerks, salespersons,
administrators
Managers, business
analysts, customers
Scope of usage
Narrow vs. simple
and queries
Broad vs. complex
and analysis
Design goal
performance
Ease of access and use
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
15
Data Warehouse
Architectures
Generic two-level architecture
Source
(file)
Source
database)
Transformation
and
Integration
Data
warehouse
Source
(database)
Source
(database)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
16
Data Warehouse
Architectures
Three-level
architecture
Source
(file)
Source
(database)
Source
Transformation
and Integration
Enterprise
Data
warehouse
Data mart
database)
Source
(database)
Selection and
aggregation
Data mart
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
17
Data Warehouse
Architectures
Data mart
a data warehouse that is limited in scope
contains selected and summarised data to support
specific decision support applications of specific enduser group
e.g., marketing data mart, finance data mart
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
18
Data Warehouse
Architectures
Three-layer data architecture
Derived data
Data mart
Reconciled data
Enterprise data warehouse
Enterprise data model
Operational
data
Data
mart
metada
EDW
metadata
Operational
metadata
Operational systems
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
19
Data Warehouse
Architectures
Enterprise data model
Presents a total picture explaining the data
required by an organisation
Must be developed prior to designing a data
warehouse
Entity-Relationship Models – traditional
approach in relational database design
Multidimensional Models – are commonly
used in data warehouses and data marts for
faster retrieval for querying and analysis
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
20
Data Warehouse
Architectures
Operational Data
current or transient, not historical
restricted in scope to a particular application
poor quality
not normalised (there are multi-valued attributes or
repeating groups, partial dependencies, transitive
dependencies in data relations)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
21
Data Warehouse
Architectures
Sample Operational Data from Northwind database
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
22
Data Warehouse
Architectures
Reconciled Data
Detailed - rather than summarised
Historical – snapshots, periodic
Comprehensive – should reflect enterprise-wide
perspective; conform to enterprise data model
Quality controlled
Normalised – 3NF or higher
3NF – no multi-valued attributes, no partial
dependencies, no transitive dependencies
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
23
Data Warehouse
Architectures
Steps in Normalisation
Table with
multi-valued
attributes
Remove multivalued attributes
1st Normal
Form
Remove partial
dependencies
2nd Normal
Form
Remove transitive
dependencies
3rd Normal
Form
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
24
Data Warehouse
Architectures
Sales relation with sample data
Relation in 3NF
Salesperson
Cust_ID
Name
Salesperson
Region
Cust_ID
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
Name
Region
Salesperson
25
Data Warehouse
Architectures
Derived Data
selected, formatted, aggregated
provides ease of use for decision support applications
provides fast response for user queries
supports ad-hoc queries and data mining applications
data model commonly used is star schema
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
26
Data Warehouse
Architectures
Metadata
data that describe the properties or characteristics of
other data
Operational metadata – describe the data in various
operational systems (as well as external data) that feed
the EDW
EDW metadata – describe the reconciled data layer as
well as the rules for transforming operational data to
reconciled data
Data mart metadata – describe the data in derived data
layer and rules of transforming reconciled data to
derived data
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
27
Data Warehouse
Architectures
Sample data description
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
28
Data Warehouse
Architectures
Sample data description
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
29
Data Warehouse
Architectures
Data Reconciliation Process
Stage 1: Initial load, when EDW is first created
Stage 2: Subsequent updates
Steps in Data Reconciliation Process
Capture – extract relevant data from source/s
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
30
Data Warehouse
Architectures
Scrub – clean or upgrade the quality of raw data before
transformation and loading (using pattern recognition, artificial
intelligence techniques)
Track and correct errors: misspelled names, erroneous
birthdates, missing data; inconsistent data formats
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
31
Data Warehouse
Architectures
Transform - includes
converting data format or representation from
source to target system
partitioning data according to predefined criteria
aggregating data from detailed to summary level
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
32
Data Warehouse
Architectures
Load and Index
Refresh mode – filling
the EDW by bulk
rewriting of target data
Update mode – only
changes in source data
are written to the data
warehouse; at periodic
intervals, data
warehouse is rewritten,
replacing previous
contents without
overwriting or deleting
previous contents
Create necessary
indexes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
33
Entity-Relationship Modelling
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
34
Entity-Relationship Modelling
Customer
Type
groups
Customer
within
contains
makes
Product
Type
groups
Product
in
Sale
Region
located at
Store
within
Period
(based on Kimball (1996), p29, and Simsion-Bowles (1996), p2)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
35
Entity-Relationship Modelling
Entities, attributes and relationships
Rules of normalisation
3NF is typical
Protection of integrity of database by avoiding
anomalies
Every logical thing is represented only once
Separate consideration of logical and physical aspects
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
36
Entity-Relationship
Modelling
ER Model for the Northwind sample database
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
37
Entity-Relationship Modelling
Large numbers of tables
Oracle Financials - 1,800; SAP 7 up to 8,000
Commonly used
Feels natural once you get used to it
Research shows that they are not easily understood
by IT people
Especially concepts like abstraction,
generalisation, sub-types, etc.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
38
Multi-dimensional Modelling
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
39
Multi-dimensional Modelling
It is possible to conceptualise data as multidimensional
Difficult to design
Easy to use resulting reports
Advocated by Ralph Kimball (see his manifesto, and
a rebuttal, available on the web site).
A logical design technique that seeks to present data
in a standard framework that is intuitive and allows
for high-performance access.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
40
Multi-dimensional Modelling
An approach to database design that provides an easy
to understand and navigate database
The aim is to encourage understanding, exploration
and learning
Each number in a database has a set of associated
attributes
What it measures, what point of time it was created,
what location its from, what product it’s associated
with, what promotion, etc.
This makes the number meaningful.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
41
Multi-dimensional Modelling
Each attribute associated with each number
represents a dimension
Measure, time, location, product, location, etc.
Resulting views are easy to navigate and move
around
Slice and dice
Report template
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
42
Multi-dimensional Modelling
Example Widget Sales ($Million)
One
Dimension
(State):
43.6
53.4
31.4
27.5
28.3
14.7
Vic
NSW
QLD
WA
SA
TAS
State
Two Dimensions
(location x time):
48.2
53.4
31.4
28.4
25.1
15.4
2002
50.1
57.2
33.6
28.1
22.5
16.3
2001
56.3
62.3
35.1
29.4
21.5
13.3
2000
46.2
52.1
29.6
25.1
27.1
18.2
1999
43.6
53.4
31.4
27.5
28.3
14.7
1998
Vic
NSW
QLD
WA
SA
TAS
Year
State
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
43
Multi-dimensional Modelling
Three Dimensions
(location x time x product):
Year
2002
48.2
53.4
31.4
28.4
25.1
15.4
2001
50.1
57.2
33.6
28.1
22.5
16.3
2000
56.3
62.3
35.1
29.4
21.5
13.3
1999
46.2
52.1
29.6
25.1
27.1
18.2
1998
43.6
53.4
31.4
27.5
28.3
14.7
Vic
NSW
QLD
WA
SA
TAS
State
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
44
Multi-dimensional Modelling
Usually talk about information spaces as cubes, or
hyper-cubes, or n-cubes
Resulting views of databases are easy to navigate and
move around
Slicing and dicing
Report Template
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
45
Multi-dimensional Modelling
Slicing and Dicing
Select certain dimension values to examine
a set of data:
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
46
Multi-dimensional Modelling
Report Templates
One template is produced for a set of slices
Data changes, layout doesn’t
Location Drop Down Box
Product Sales: Victoria, 2001
50
Year Drop Down Box
40
30
20
10
0
Widgets
Sprockets
Flanges
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
Gaskets
47
Multi-dimensional Modelling
From Traditional Relational to Multi-dimensional
Typical relational data-base
From Pilot Software OLAP White Paper
Same data displayed in twodimensions
Easy!
(The key is to identify the continuous
and discrete variables in the flat file.)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
48
Star Schema
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
49
Star Schema
Used to implement dimensional analysis using
standard relational database technology
Very common in data warehousing
Many variations
Two components:
Fact Table – contains measurements of business,
eg. sales, purchase order, shipment
Dimension Tables – stores the textual descriptions
of the dimensions of the business, eg. product,
customer, vendor, store.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
50
Star Schema
Fact tables store the hard data
Dimension tables store all the information
about our dimensions.
The fact table has a many-to-one
relationship with each dimension table
Each dimension table has a primary key
that appears as a foreign key in the fact
table, whose primary key is a
concatenation of all of the foreign keys.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
51
Star Schema
Dimension tables in star schemas are
denormalised resulting in:
Fewer tables
Simpler for users to navigate
Reduced number of complex multi-join
tables.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
52
Star schema
Customer
Customer key
Name
Customer type
Product
Product key
Product type
weight
Sale
Time key
Store key
Customer key
Product key
Dollar sales
Unit sales
Store
Store key
Address
Region
Time
Time key
Day
Month
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
Legend:
Primary Key
Foreign key
53
Snowflake schema
“Do not snowflake your
dimensions, even if very
large. If you do snowflake
your dimensions, prepare to
live with poor performance”
Kimball (1996)
Product
Type
Product
Customer
Type
Customer
Sale
Store
Region
Time
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
54
Star Schema
Dimensions can be shared amongst fact tables.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
55
Star Schema
ER schemas are useful for data mapping to legacy
systems and for integration of the data warehouse
Star schemas are useful for the design of warehouse
databases as they are efficient and easy to understand
and use
Allow relational databases to support multidimensional data cubes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
56
Star Schema
Steps in the design process
1.
2.
3.
4.
5.
Choose a business process
Choose the grain of the fact table
Too fine > Oversized database
Too large > Loss of meaningful information
Choose the dimensions
Choose the measured facts
(usually numeric, additive quantities)
Complete the dimension tables
Kimball (1996)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
57
Extra steps in the design
process
6.
7.
8.
9.
Determine strategy for slowly changing dimensions
Create aggregations and other physical storage
components
Determine the historical duration of the database
Determine the urgency with which the data is to be
extracted and loaded into the data warehouse.
Kimball (1996)
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
58
An Example: Retail Trading
A large grocery store with approx. 500 stores
Each store has approx. 60,000 products on shelves
Need to maximise profit and keep shelves stocked
Important decisions concern pricing and promotion
Promotion types are:
Temporary price reductions
Newspaper advertisements
Shelf and end-aisle displays
Coupons
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
59
An Example: Retail Trading
1. Choose a Business Process
Daily Item Movement
2. Choose the grain of the fact table
Stock Keeping Unit (SKU) by store by promotion by
day
3. Choose the Dimensions
Time, product, store and promotion
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
60
An Example: Retail Trading
Retail Trading Dimensions
Promotion
Promotion key
Other Promotion
attributes
Sale
Product
Product key
Other product
attributes
Time key
Product key
Store Key
Promotion Key
Facts – to be detailed
next
Store
Store key
Other Store
Attributes
Time
Time key
Other Time
Attributes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
61
An Example: Retail Trading
4. Choose the measured facts
Promotion
Promotion key
Other Promotion
attributes
Sale
Product
Product key
Other product
attributes
Time key
Product key
Store Key
Promotion Key
Dollar Sales
Unit Sales
Dollar Costs
Customer Count
Store
Store key
Other Store
Attributes
Time
Time key
Other Time
Attributes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
62
An Example: Retail Trading
5. Complete the dimension tables
Promotion
Promotion key
Other Promotion
attributes
Product
Product key
SKU Description
SKU Number
Package Size
Brand
Sub Category
Department
Package Type
Diet Type
Weight
Weight unit of measure
Units per retail case
Units per ship case
Cases per pallet
Sale
Time key
Product key
Store Key
Promotion Key
Dollar Sales
Unit Sales
Dollar Costs
Customer Count
Store
Store key
Other Store
Attributes
Time
Time key
Other Time
Attributes
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
63
References
Inmon, W. H. (1996) Building the Data
Warehouse (2nd ed), Wiley, NY.
Kimball, R. (1996) The Data Warehouse
Toolkit, Wiley, NY.
McFadden, F., Hoffer, J. and Prescott, M.
(1999) Modern Database Management,
Addison-Wesley.
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
64
Questions?
[email protected]
School of Information Management and Systems,
Monash University
T1.28, T Block, Caulfield Campus
9903 2735
IMS3001 – BUSINESS INTELLIGENCE SYSTEMS – SEM 1 , 2004
65