Data - Telkom University

Download Report

Transcript Data - Telkom University

Data Warehouse & OLAP
Introduction
Sesi: 03-04
Dosen Pembina : Danang Junaedi
The Knowledge Discovery Process
Interpretation/
Evaluation
Data Mining
Transformation
Knowledge
Patterns /
Models
Preprocessing
Transformed
Data
Selection
Preprocessed
Data
Target
Data
Data
Sources
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P, From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
Data Sources
 Relational Databases
 Data Warehouses
 WWW
 Audio
 Video
 Printed Materials
:
:
Relational Databases

Multidimensional Data Cube
Han, J. & Kamber, M., Data
Mining: Concepts and
Techniques, Morgan Kaufmann,
2000
Evolution of DSS
 Transaction Processing Systems (TPS)
 Operational data stores and OLTP
 Batch reports, hard to find and analyze information, inflexible and expensive,
reprogram every new request (circa 60’s)
 MIS
 Management reporting from transactions in TPS
 Still inflexible, not integrated with desktop tools (circa 70’s)
 DSS
 Combine data with analytic models or expert rules
 Integration with desktop tools (80’s)
 Data Warehousing
 Data integrated after (cleaning and scrubbing) from multiple sources (both
internal and external to the organization)
 OLAP is the technology used to study the data in terms of operations on a
multi-dimensional data set
 Data warehousing also supports processing of data by analytic methods and
permits data mining (90’s)
Applications
 Retail - inventory management, promotions
 Manufacturing - order shipment
 Insurance – policy and claims tracking
 Telecommunications - call analysis
 Financial – account tracking
 CRM/eCRM – customer profiling, clickstream analysis
 Healthcare – disease management, patient and physician
profiling
Databases for Decision Support
 Transaction Processing systems are optimized for
performance
 Data they capture are too detailed to be of use for decision
support purposes
 Online Analytical Processing (OLAP) imposes very different
demands on databases than does Online Transaction
Processing (OLTP)
Heterogeneous Database Integration
Integration System
World
Wide
Web
Personal
Digital Libraries
Scientific Databases
Databases
• Collects and combines information from disparate
sources
• Provides integrated view, and a uniform user interface
• Supports sharing of data between entities
Why look at data in this way?
 What would be the demand for services (forecasting)?
 Who are our key customers/patients, and
 What are the margins/outcomes? (profitable
customers/satisfied patients)
 How do we market to them/treat them?
 What pricing/treatment strategy is desirable?
 What are their preferences?
 What type of customer/patient services are required?
 What services when packaged result in higher/better
sales/revenues/margins/outcomes, efficient workflow?
 Which promotion/patient education/counseling works or does not work and
why?
 What is the inventory/patient turnover?
 Which channel/technology is more effective/profitable?
 Why do margins/outcomes differ from one place to another or one patient
to another?
Data Warehousing and Industry
One of the hottest topic in IS.
Over 90% of larger companies either have a DW or
are starting one.
 Warehousing is big business
 $2 billion in 1995
 $3.5 billion in early 1997
 $8 billion in 1998 [Metagroup]
 over $200 billion over next 5 years.
Data Warehousing and Industry (2)
A 1996 study of 62 data warehousing projects
showed:
An average return on investment of 321%, with an
average payback period of 2.73 years.
WalMart has largest warehouse
900-CPU, 2,700 disk, 23 TB Teradata system
~7TB in warehouse
40-50GB per day
Why Data Warehousing?
 Advance of information technology.
 Data collected in huge amounts.
 Need to make good use of data?
 Architecture and tools to
 Bring together scattered information from multiple sources to
provide consistent data source for decision support.
 Support information processing by providing a solid platform of
consolidated, historical data for analysis.
A producer wants to know….
Which are our
lowest/highest margin
customers ?
Who are my customers
and what products
are they buying?
What is the most
effective distribution
channel?
What product prom-otions have the biggest
impact on revenue?
Which customers
are most likely to go
to the competition ?
What impact will
new products/services
have on revenue
and margins?
Data, Data everywhere yet ...
I can’t find the data I need
 data is scattered over the network
 many versions, subtle differences
I can’t get the data I need
o need an expert to get the data
I can’t understand the data I
found
o available data poorly documented
I can’t use the data I found
o results are unexpected
o data needs to be transformed from one
form to other
What are the users saying...
 Data should be integrated across the enterprise
 Summary data has a real value to the organization
 Historical data holds the key to understanding data over
time
 What-if capabilities are required
What is a Data Warehouse?
 Defined in many different ways non-rigorously.
 A DB for decision support.
 Maintained separately from an organization’s operational
database.
 A data warehouse is a subject-oriented, integrated, timevariant, and nonvolatile collection of data in support of
management’s decision-making process.— W. H. Inmon
 90% of major organizations have or are building some
kind of data warehouse.
 A decision support database that is maintained
separately from the organization’s operational
databases.
Data Warehouse
 A data warehouse is a
 subject-oriented
 integrated
 time-varying
 non-volatile
collection of data that is used primarily in organizational
decision making.
-- Bill Inmon, Building the Data Warehouse 1996
What is Data Warehousing?
Information
A process of transforming
data into information and
making it available to users
in a timely enough manner
to make a difference
[Forrester Research, April
1996]
Data
Data Warehousing -- It is a
process
 Technique for assembling and
managing data from various
sources for the purpose of
answering business questions.
Thus making decisions that were
not previous possible
 A decision support database
maintained separately from the
organization’s operational
database
Explorers, Farmers and Tourists
Tourists: Browse information
harvested by farmers
Farmers: Harvest information
from known access paths
Explorers: Seek out the unknown and
previously unsuspected rewards hiding in
the detailed data
Data Warehouse Architecture
OLAP / Decision
support/
User queries
Data cubes/ data
mining
Relational database (warehouse)
Data
extraction
Data cleaning/
programs
scrubbing
Data
Data
Data
source
source
source
Data Warehouse Architecture
Data Warehouse Architecture
Data Warehouse Architecture
Relational
Databases
Optimized Loader
ERP
Systems
Extraction
Cleansing
Data Warehouse
Engine
Purchased
Data
Legacy
Data
Metadata Repository
Analyze
Query
Data Warehouse Component
Data Warehouse Architecture
(e.g.,
AS400)
Clinical
Payroll
System
System
Data
Meta-Data
Mining Tools
Oracle
OLAP servers
Financials
on HP 9000
Billing System
Transformation
Data
Integration
Warehouse
Access,
Files
Other
Internal
Data
(Industry Reports)
External
Data
Excel
Web
Other
Data Sources
ETL Software
S
T
A
G
I
N
G
Transaction Data
Prod
IBM
Mkt
IMS
HR
Fin
VSAM
Ascential
Oracle
Extract
Acctg
Syba se
Other Internal Data
ERP
SAP
Infor mix
SAS
HarteHanks
Users
ANALYSTS
Cognos
Teradata
IBM
Load
Informatica
D
A
T
A
External Data
Demographic
A
R
E
A
O
P
E
R
A
T
I
O
N
A
L
Data Analysis
Tools and
Applications
SQL
Sagent
Web Data
Clickstream
Data Stores
S
T
O
R
E
Clean/Scrub
Trans form
Firstlogic
Data
Warehouse
Data Marts
SAS
MANAGERS
Finance
Essbase
Marketing
Queries,Reporting,
DSS/EIS,
Data Mining
EXECUTIVES
Micro Strategy
Meta
Data
Sales
Microsoft
Siebel
Business
Objects
OPERATIONAL
PERSONNEL
Web
Browser
CUSTOMERS/
SUPPLIERS
Data Warehouse Architecture
Extraction, Transformation, & Load (ETL)
 ETL is a set of tools and techniques used to populate a data
warehouse
 Extraction
 Extract data from sources (e.g., operational DBMSs, file systems,
Web pages)
 Transformation
 Clean data
 Convert from legacy/host format to warehouse format (e.g.,
convert “surname” to “last name”)
Extraction, Transformation, & Load (ETL)
 Load
 Sort, summarize, consolidate, compute views, check integrity,
build indexes, partition
 Huge volumes of data to be loaded, yet small time window
(usually at night) when the warehouse can be taken off-line
 Techniques: batch, sequential load often too slow; incremental,
parallel loading techniques may be used
 Refresh
 Propagate updates from sources to the warehouse
 When to refresh - on every update, periodically (e.g., every 24
hours), or after “significant” events
 How to refresh – full extract from base tables vs. incremental
techniques
Data Mart
 A data mart stores data for a limited number of subject areas,
such as marketing and sales data. It is used to support
specific applications.
 An independent data mart is created directly from source
systems.
 A dependent data mart is populated from a data warehouse.
Data Warehouse vs. Data Marts
 Enterprise warehouse: collects all information about
subjects (customers, products, sales, assets, personnel)
that span the entire organization.
 Requires extensive business modeling
 May take years to design and build
 Data Marts: departmental subsets that focus on selected
subjects: Marketing data mart: customer, products, sales.
 Faster roll out, but complex integration in the long run.
Baca artikel di url: http://www.martinsights.com/?p=1088
Data Warehouse for Decision Support & OLAP
 Putting Information technology to help the knowledge
worker make faster and better decisions
 Which of my customers are most likely to go to the competition?
 What product promotions have the biggest impact on revenue?
 How did the share price of software companies correlate with
profits over last 10 years?
Decision Support
 Used to manage and control business
 Data is historical or point-in-time
 Optimized for inquiry rather than update
 Use of the system is loosely defined and can be ad-hoc
 Used by managers and end-users to understand the
business and make judgements
The Complete Decision Support System
(Source:
Franconi)
Information Sources
Data Warehouse
OLAP Servers
Clients
Server
(Tier 2)
(Tier 3)
(Tier 1)
e.g., MOLAP
Analysis
Semistructured
Sources
Data
serve
Warehouse
extract
Query/Reporting
transform
serve
load
e.g., ROLAP
refresh
Operational
DB’s
etc.
Data Mining
serve
Data Marts
Data Mining works with Warehouse
Data
 Data Warehousing provides the
Enterprise with a memory
 Data Mining provides the
Enterprise with intelligence
We want to know ...
 Given a database of 100,000 names, which persons are the least
likely to default on their credit cards?
 Which types of transactions are likely to be fraudulent given the
demographics and transactional history of a particular customer?
 If I raise the price of my product by Rs. 2, what is the effect on my
ROI?
 If I offer only 2,500 airline miles as an incentive to purchase rather
than 5,000, how many lost responses will result?
 If I emphasize ease-of-use of the product as opposed to its technical
capabilities, what will be the net effect on my revenues?
 Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
Why Separate Data Warehouse?
 Performance
 Op dbs designed & tuned for known txs & workloads.
 Complex OLAP queries would degrade perf. for op txs.
 Special data organization, access & implementation methods
needed for multidimensional views & queries.
 Function
 Missing data: Decision support requires historical data, which op
dbs do not typically maintain.
 Data consolidation: Decision support requires consolidation
(aggregation, summarization) of data from many heterogeneous
sources: op dbs, external sources.
 Data quality: Different sources typically use inconsistent data
representations, codes, and formats which have to be reconciled.
Data Warehouse vs. Heterogeneous DBMS
 Traditional heterogeneous DB integration:
 Build wrappers/mediators on top of heterogeneous databases
 Query driven approach
– When a query is posed to a client site, a meta-dictionary is used to
translate the query into queries appropriate for individual
heterogeneous sites involved, and the results are integrated into a
global answer set
– Complex information filtering, compete for resources
 Data warehouse: update-driven, high performance
 Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS
 OLTP (on-line transaction processing)
 Major task of traditional relational DBMS
 Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
 Major task of data warehouse system
 Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
 User and system orientation: customer vs. market
 Data contents: current, detailed vs. historical, consolidated
 Database design: ER + application vs. star + subject
 View: current, local vs. evolutionary, integrated
 Access patterns: update vs. read-only but complex queries
RDBMS used for OLTP
 Database Systems have been used traditionally for OLTP
 clerical data processing tasks
 detailed, up to date data
 structured repetitive tasks
 read/update a few records
 isolation, recovery and integrity are critical
Operational Systems
 Run the business in real time
 Based on up-to-the-second data
 Optimized to handle large numbers
of simple read/write transactions
 Optimized for fast response to
predefined transactions
 Used by people who deal with
customers, products -- clerks,
salespeople etc.
 They are increasingly used by
customers
Examples of Operational Data
Data
Industry Usage
Technology
Volumes
Customer
File
All
Legacy application, flat
files, main frames
Small-medium
Account
Balance
Finance
Legacy applications,
hierarchical databases,
mainframe
Large
Point-ofSale data
Retail
ERP, Client/Server,
relational databases
Very Large
Call
Record
Telecommunications
Legacy application,
hierarchical database,
mainframe
ERP,
relational databases,
AS/400
Very Large
Production ManufactRecord
uring
Track
Customer
Details
Control
account
activities
Generate
bills, manage
stock
Billing
Control
Production
Medium
Application-Orientation vs. Subject-Orientation
Application-Orientation
Subject-Orientation
Operational
Database
Loans
Credit
Card
Data
Warehouse
Customer
Vendor
Trust
Savings
Product
Activity
OLTP vs. OLAP
OLTP
OLAP
users
clerk, IT professional
knowledge worker
function
day to day operations
decision support
DB design
application-oriented
subject-oriented
data
current, up-to-date
detailed, flat relational
isolated
repetitive
historical,
summarized, multidimensional
integrated, consolidated
ad-hoc
lots of scans
unit of work
read/write
index/hash on prim. key
short, simple transaction
# records accessed
tens
millions
#users
thousands
hundreds
DB size
100MB-GB
100GB-TB
metric
transaction throughput
query throughput, response
usage
access
complex query
OLTP vs. Data Warehouse
 OLTP systems are tuned for known transactions and
workloads while workload is not known a priori in a data
warehouse
 Special data organization, access methods and
implementation methods are needed to support data
warehouse queries (typically multidimensional queries)
 e.g., average amount spent on phone calls between 9AM-5PM in
Pune during the month of December
OLTP vs Data Warehouse
OLTP
 Application Oriented
 Used to run business
 Detailed data
 Current up to date
 Isolated Data
 Repetitive access
 Clerical User
Warehouse (DSS)
 Subject Oriented
 Used to analyze business
 Summarized and refined
 Snapshot data
 Integrated Data
 Ad-hoc access
 Knowledge User (Manager)
OLTP vs Data Warehouse
OLTP
 Performance Sensitive
 Few Records accessed at a
time (tens)
 Read/Update Access
 No data redundancy
 Database Size 100MB 100 GB
Data Warehouse
 Performance relaxed
 Large volumes accessed at
a time(millions)
 Mostly Read (Batch Update)
 Redundancy present
 Database Size
100 GB
- few terabytes
OLTP vs Data Warehouse
OLTP
 Transaction throughput is
the performance metric
 Thousands of users
 Managed in entirety
Data Warehouse
 Query throughput is the
performance metric
 Hundreds of users
 Managed by subsets
To summarize ...
 OLTP Systems are
used to “run” a business
 The Data Warehouse
helps to “optimize” the
business
From Tables and Spreadsheets to Data Cubes
 A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled
and viewed in multiple dimensions
 Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
 In data warehousing literature, an n-D base cube is
called a base cuboid. The top most 0-D cuboid, which
holds the highest-level of summarization, is called the
apex cuboid. The lattice of cuboids forms a data cube.
A Sample Data Cube
Total annual sales
2Qtr
3Qtr
4Qtr
sum
U.S.A
Canada
Mexico
sum
Country
TV
PC
VCR
sum
1Qtr
of TVs in U.S.A.
Date
Cube: A Lattice of Cuboids
all
time
0-D(apex) cuboid
item
time,location
location
supplier
item,location
location,supplier
time,item
time,supplier
1-D cuboids
2-D cuboids
item,supplier
time,location,supplier
3-D cuboids
time,item,location
time,item,supplier
item,location,supplier
4-D(base) cuboid
time, item, location, supplier
Data (Hyper) cubes
2-d to 3-d cube
Rotating the cube
Conceptual Modeling of Data Warehouses
 ER design techniques not appropriate
 Modeling data warehouses: dimensions & measures
 Star schema: A fact table in the middle connected to a set of
dimension tables
 Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to snowflake
 Fact constellations schema : Multiple fact tables share
dimension tables, viewed as a collection of stars, therefore
called galaxy schema or fact constellation
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Problem with ER
ER models are NOT suitable for DW?
 End user cannot understand or remember an ER Model
 Many DWs have failed because of overly complex ER
designs
 Not optimized for complex, ad-hoc queries
 Data retrieval becomes difficult due to normalization
 Browsing becomes difficult
Warehouse Models & Operators
 Data Models
 relations
 stars & snowflakes
 cubes
 Operators
 slice & dice
 roll-up, drill down
 pivoting
 other
Multidimensional Data Model
 Database is a set of facts (points) in a multidimensional space
 A fact has a measure dimension
 quantity that is analyzed, e.g., sale, budget
 A set of dimensions on which data is analyzed
 e.g. , store, product, date associated with a sale amount
 Dimensions form a sparsely populated coordinate system
 Each dimension has a set of attributes
 e.g., owner, city and county of store
 Attributes of a dimension may be related by partial order
 Hierarchy: e.g., street > county >city
 Lattice: e.g., date> month>year, date>week>year
Example: Patient profiling
 A healthcare organization needed a longitudinal view of
patients, including trends of services to patients
 Model
 Facts include Healthcare (e.g., diagnosis, procedure), Financial
(e.g., amount billed, number of claims), Resources (e.g., number
of bed-days, inpatient and outpatient visits)
 Dimensions include Time, Provider, Claim Type, Demographics,
Encounter Type, Diagnosis and Procedure, Person, Organization
 Questions answered by system:
 Which individuals are eligible for services but not obtaining them?
 Which individuals are registered for services, but not receiving
preventive healthcare?
Star Schema
 A single fact table and a single table for each dimension
 Every fact points to one tuple in each of the dimensions
and has additional attributes
 Does not capture hierarchies directly
 Straightforward means of capturing a multiple dimension
data model using relations
 Slowly Changing Dimensions
Example of Star Schema
time
item
time_key
item_key
day
Sales Fact Table
day_of_the_week
time_key
month
item_key
quarter
year
branch_key
branch
location_key
branch_key
branch_name
units_sold
branch_type
dollars_sold
avg_sales
Measures
item_name
brand
type
supplier_type
location
location_key
street
city
province_or_street
country
Slide is modified from slides provided by Han, J. &
Kamber, M., Data Mining: Concepts and
Techniques, Morgan Kaufmann, 2000.
Example of a Star Schema
Order
Product
Order No
ProductNO
Order Date
Customer
Customer No
Customer Name
Customer
Address
City
Salesperson
SalespersonID
SalespersonName
City
Quota
ProdName
Fact Table
OrderNO
SalespersonID
ProdDescr
Category
CategoryDescription
UnitPrice
CustomerNO
ProdNo
Date
DateKey
DateKey
CityName
Date
Quantity
Total Price
City
CityName
State
Country
A star schema for credit card purchases
Purchase Key
1
2
3
4
5
6
.
.
.
Purchase Dimension
Category
Supermarket
Travel & Entertainment
Auto & Vehicle
Retail
Restarurant
Miscellaneous
.
.
.
Time Dimension
Time Key Month Day Quarter Year
10
Jan
5
1
2002
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Cardholder Key Purchase Key Location Key
1
2
1
15
4
5
1
2
3
.
.
.
.
.
.
.
.
.
Cardholder Key Name
1
John Doe
2
Sara Smith
.
.
.
.
.
.
Cardholder Dimension
Gender Income Range
Male
50 - 70,000
Female
70 - 90,000
.
.
.
.
.
.
Fact Table
Time Key Amount
10
14.50
11
8.25
10
22.40
.
.
.
.
.
.
Location Key Street
10
425 Church St
.
.
.
.
.
.
Location Dimension
City
State Region
Charleston SC
3
.
.
.
.
.
.
.
.
.
The Star Schema
A multi-dimensional data model
Customer dimension
Non-normalized
Time dimension
Fact table
Non-normalized
Normalized
Non-normalized
Non-normalized
Application dimension
Problem dimension
One E-R vs. Many Stars
Transaction
Analysis
focus
focus
One E-R model for all the business
process.
One star per modeled business process.
Snowflake Schema
 Represent dimensional hierarchy directly by normalizing the
dimension tables
 Easy to maintain
 Saves storage, but may reduce effectiveness of browsing
(Kimball)
Example of Snowflake Schema
time
item
time_key
day
Sales Fact Table
day_of_the_week
time_key
month
item_key
quarter
branch_key
year
branch
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
item_key
supplier
item_name
supplier_key
brand
supplier_type
type
supplier_key
location
location_key
street
city_key
city
city_key
city
province_or_street
Slide is modified from slides provided by Han, J. &
Kamber, M., Data Mining: Concepts and
Techniques, Morgan Kaufmann, 2000.
country
Example of a Snowflake Schema
Order
Product
Order No
Category
ProductNO
Order Date
Fact Table
CategoryName
ProdName
CategoryDescr
ProdDescr
Customer
Category
Customer No
OrderNO
Category
Customer Name
SalespersonID
UnitPrice
Customer
Address
CustomerNO
City
ProdNo
DateKey
Salesperson
CityName
SalespersonID
Quantity
SalespersonName
Total Price
City
Quota
Date
DateKey
Date
Month
City
CityName
State
Country
Month
Month
Year
Year
State
StateName
Country
Year
Fact Constellation Schema
 Multiple fact tables share dimension tables.
 This schema is viewed as collection of stars hence called
galaxy schema or fact constellation.
 Sophisticated applications require such schema.
Example of Fact Constellation
time
time_key
item
day
day_of_the_week
month
Sales Fact Table
item_key
year
branch_key
branch
location_key
branch_key
units_sold
branch_name
dollars_sold
branch_type
brand
type
supplier_type
location
avg_sales
item_key
shipper_key
from_location
to_location
location_key
dollars_cost
street
units_shipped
city
Measures
time_key
item_name
time_key
quarter
item_key
Shipping Fact Table
province_or_street
country
Slide is modified from slides provided by Han, J. &
Kamber, M., Data Mining: Concepts and
Techniques, Morgan Kaufmann, 2000.
shipper
shipper_key
shipper_name
location_key
shipper_type
Example Fact Constellation Schema
Sales
Fact Table
Store Key
Shipping
Product Dimension
Fact Table
Shipper Key
Product Key
Product Key
Store Key
Period Key
Product Desc
Product Key
Units
Period Key
Price
Units
Store Dimension
Store Key
Store Name
City
State
Region
Price
A constellation schema for credit card purchases and
promotions
Promotion Key
1
.
.
.
Promotion Dimension
Description
Cost
watch promo 15.25
.
.
.
.
.
.
Time Dimension
Time Key Month Day Quarter Year
5
Dec
31
4
2001
8
Jan
3
1
2002
10
Jan
5
1
2002
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Purchase Key
1
2
3
4
5
6
Promotion Fact Table
Cardholder Key Promotion Key Time Key
1
1
5
2
1
5
.
.
.
.
.
.
.
.
.
Response
Yes
No
.
.
.
Cardholder Key Name
1
John Doe
2
Sara Smith
.
.
.
.
.
.
Purchase Dimension
Category
Supermarket
Travel & Entertainment
Auto & Vehicle
Retail
Restarurant
Miscellaneous
Purchase Fact Table
Cardholder Key Purchase Key Location Key
1
2
1
15
4
5
1
2
3
.
.
.
.
.
.
.
.
.
Cardholder Dimension
Gender Income Range
Male
50 - 70,000
Female
70 - 90,000
.
.
.
.
.
.
Time Key Amount
10
14.50
11
8.25
10
22.40
.
.
.
.
.
.
Location Key Street
5
425 Church St
.
.
.
.
.
.
Location Dimension
City
State Region
Charleston
SC
3
.
.
.
.
.
.
.
.
.
What is OLAP?
 Software tool providing multi-dimensional view of data for
business analysis
 Example of “Decision Support” or “Business Intelligence”
tool
 Fast data access and fast computations
 Interactive, flexible user interface
 “Slice, dice, drill-down”
 Excel Pivot Table and Pivot Chart are examples of simple
OLAP tools
Defining OLAP - ANALYSIS
 Business logic and statistical analysis relevant to end user
 Should not require programming for everything
 Analysis can be via vendors’ tools or link to generic
analytical platform such as spreadsheet
 Examples include time series analysis, cost allocation,
currency translation, goal seeking, ad-hoc multidimensional structural changes (cube building), nonprocedural modeling, exception alerting, and data mining.
 Capabilities vary widely by vendor and market
OLAP Operations
Common cube operations
 Pivot or Rotate – change which dimensions and/or levels within
dimensions are shown on row and column axes
 Roll-up – aggregate or combine cells within a dimension according to
some mathematical operation
 Uses a hierarchy definition for the dimension
 Commonly this is summation or count
 Drill down – examine data a greater level of detail
 Add another row or column header which is further down the concept
hierarchy
 Slice – select a subset of a cube by constraining the value of some
dimension
 Ex: Select cells for month = January in time dimension
 Dice – select a subset of a cube by constraining two or more
dimensions
 Drill through – access atomic level detail data
OLAP Operations & SQL Sample
Cube Operation (SQL)
A Few Products
 Microsoft Analysis Services
 Part of SQL Server 2005
 Create OLAP cubes, 10 data mining algorithms
 Tableau
 A new, pretty amazing pivoting tool
 Cognos
 Recently bought by IBM
 Hyperion Essbase
 Full suite of business intelligence developer and end user tools
 Purchased by Oracle
 Business Objects (Crystal)
 Full suite of business intelligence developer and end user tools
 Microstrategy
 Oracle
 Information Builders
 Home of WebFocus, a web based OLAP tool
 Pentaho
 A new open source business intelligence project
 http://www.pentaho.org/
A Data Mining Query Language, DMQL:
Language Primitives
 Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>
 Dimension Definition ( Dimension Table )
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
 Special Case (Shared Dimension Tables)
 First time as “cube definition”
 define dimension <dimension_name> as
<dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in
cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
First Example
Enrollment Data
Sumber: Dr. Mourad YKHLEF,Decision Support System, King Saud University, 2009
Enroll Table
Class
Day
Time
Prof
Enrolled
1336
T
8
Aars
14
1430
M
2
Aars
28
1430
M
11
Booth
30
1430
T
11
Booth
26
1430
T
2
Booth
23
1430
T
2
Fry
27
1430
T
12
Aars
29
1440
M
1
Aars
11
2334
M
9
Fry
27
2350
M
10
Maurer
19
3101
T
12
Grabow
16
3303
M
11
Aars
11
3324
M
8
Gaitros
20
3330
T
11
Fry
5
3331
T
12
Aars
11
3334
M
11
Hamerly
20
3335
M
2
Donahoo
17
3336
T
8
Sturgill
9
3342
T
2
Aars
10
3439
T
9
Poucher
10
Group by rollup
Select Prof, Sum(Students)
From enroll
Group by prof
With rollup
Prof
Aars
Booth
Donahoo
Fry
Gaitros
Grabow
Hamerly
Maurer
Poucher
Sturgill
Enrolled
114
79
17
59
20
16
20
19
10
9
363
Group by cube
Select Day, Time, Sum(Students)
From enroll
Group By Day,Time
With cube {rollup(Day,Time) -- all rows except ([NULL])}
Day
Time
[NULL]
[NULL]
[NULL]
[NULL]
[NULL]
[NULL]
[NULL]
M
M
M
M
M
M
M
T
T
T
T
T
T
1
2
8
9
10
11
12
1
2
8
9
10
11
2
8
9
11
12
Enrolled
363
11
105
43
37
19
92
56
183
11
45
20
27
19
61
180
60
23
10
31
56
Group by rollup
Select Day, Time, Prof, Sum(Students)
From enroll
Group by day,time, prof
With rollup
Second Example
Sales Data
Sales Cube
Product
Market
Time
Sales Example (Cont.)
Simple Cross-Tabular Report
1997
Region
Department
CassetteSales
Profit
Video Sales
Profit
Total Profit
Central
82,000
85,000
167,000
East
101,000
137,000
238,000
West
96,000
97,000
193,000
Total
279,000
319,000
598,000
Sales Example (Cont.)
Roll up – the query
SELECT Time, Region, Department, sum(Profit)
FROM
sales
GROUP BY Time, Region, Dept
WITH ROLLUP
this query returns the following sets of rows:
•
Regular aggregation rows that would be produced by GROUP BY without
using ROLLUP .
•
First-level subtotals aggregating across Department for each combination of
Time and Region .
•
Second-level subtotals aggregating across Region and Department for each
Time value .
•
A grand total row .
Sales Example (Cont.)
Roll up – the result of query
Time
Region
Dept
Profit
1996
Central
CassetteSales
75,000
1996
Central
VideoSales
74,000
1996
Central
[NULL]
149,000
1996
East
CassetteSales
89,000
1996
East
VideoSales
115,000
1996
East
[NULL]
204,000
1996
West
CassetteSales
87,000
1996
West
VideoSales
86,000
1996
West
[NULL]
173,000
1996
[NULL]
[NULL]
526,000
1997
Central
CassetteSales
82,000
1997
Central
VideoSales
85,000
1997
Central
[NULL]
167,000
1997
East
CassetteSales
101,000
1997
East
VideoSales
137,000
1997
East
[NULL]
238,000
1997
West
CassetteSales
96,000
1997
West
VideoSales
97,000
1997
West
[NULL]
193,000
1997
[NULL]
[NULL]
598,000
[NULL]
[NULL]
[NULL]
1,124,000
Sales Example (Cont.)
Roll up
Calculating Subtotals without ROLLUP
The result set could be generated by the UNION of four SELECT statements, as shown
below. This is a subtotal across three dimensions. Notice that a complete set of
ROLLUP-style subtotals in n dimensions would require n+1 SELECT statements linked
with UNION ALL.
SELECT Time, Region, Department, SUM(Profit) FROM Sales
GROUP BY Time, Region, Department UNION ALL
SELECT Time, Region, '' , SUM(Profit) FROM Sales
GROUP BY Time, Region UNION ALL
SELECT Time, '', '', SUM(Profits) FROM Sales
GROUP BY Time UNION ALL
SELECT '', '', '', SUM(Profits)
FROM Sales;
Sales Example (Cont.)
Cube - the query
SELECT Time, Region, Department, sum(Profit)
FROM
sales
GROUP BY Time, Region, Dept
WITH CUBE
Sales Example (Cont.)
Cube – the result of query
Time
Region
Dept
Profit
1996
Central
CassetteSales
75,000
1996
Central
VideoSales
74,000
1996
Central
[NULL]
149,000
1996
East
CassetteSales
89,000
1996
East
VideoSales
115,000
1996
East
[NULL]
204,000
1996
West
CassetteSales
87,000
1996
West
VideoSales
86,000
1996
West
[NULL]
173,000
1996
[NULL]
CassetteSales
251,000
1996
[NULL]
VideoSales
275,000
1996
[NULL]
[NULL]
526,000
1997
Central
CassetteSales
82,000
1997
Central
VideoSales
85,000
1997
Central
[NULL]
167,000
1997
East
CassetteSales
101,000
1997
East
VideoSales
137,000
1997
East
[NULL]
238,000
1997
West
CassetteSales
96,000
1997
West
VideoSales
97,000
1997
West
[NULL]
193,000
1997
[NULL]
CassetteSales
279,000
1997
[NULL]
VideoSales
319,000
1997
[NULL]
[NULL]
598,000
[NULL]
Central
CassetteSales
157,000
[NULL]
Central
VideoSales
159,000
[NULL]
Central
[NULL]
316,000
[NULL]
East
CassetteSales
190,000
[NULL]
East
VideoSales
252,000
[NULL]
East
[NULL]
442,000
Sales Example (Cont.)
Cube
Calculating Subtotals without CUBE
Just as for ROLLUP, multiple SELECT statements
combined with UNION statements could provide the
same information gathered through CUBE. However,
this may require many SELECT statements: for an ndimensional cube, 2n SELECT statements are needed. In
our 3-dimension example, this would mean issuing 8
SELECTS linked with UNION ALL.
Sales Example (Cont.)
Grouping
Two challenges arise with the use of ROLLUP and CUBE.
First, how can we programmatically determine which
result set rows are subtotals, and how do we find the exact
level of aggregation of a given subtotal? We will often
need to use subtotals in calculations such as percent-oftotals, so we need an easy way to determine which rows
are the subtotals we seek. Second, what happens if query
results contain both stored NULL values and "NULL"
values created by a ROLLUP or CUBE? How does an
application or developer differentiate between the two?
Sales Example (Cont.)
Grouping
To handle these issues, we have a function called
GROUPING. Using a single column as its argument,
Grouping returns 1 when it encounters a NULL value
created by a ROLLUP or CUBE operation. That is, if the
NULL indicates the row is a subtotal, GROUPING
returns a 1. Any other type of value, including a stored
NULL, will return a 0.
Sales Example (Cont.)
Grouping – the query
SELECT Time, Region, Department, SUM(Profit)
GROUPING (Time) as T,
GROUPING (Region) as R,
GROUPING (Department) as D
FROM Sales
GROUP BY Time, Region, Department
WITH ROLLUP
Sales Example (Cont.)
Grouping – the result of query
Time
Region
Dept
Profit
T
R
D
1996
Central
CassetteSales
75,000
0
0
0
1996
Central
Video Sales
74,000
0
0
0
1996
Central
[NULL]
149,000
0
0
1
1996
East
CassetteSales
89,000
0
0
0
1996
East
Video Sales
115,000
0
0
0
1996
East
[NULL]
204,000
0
0
1
1996
West
CassetteSales
87,000
0
0
0
1996
West
Video Sales
86,000
0
0
0
1996
West
[NULL]
173,000
0
0
1
1996
[NULL]
[NULL]
526,000
0
1
1
1997
Central
CassetteSales
82,000
0
0
0
1997
Central
Video Sales
85,000
0
0
0
1997
Central
[NULL]
167,000
0
0
1
1997
East
CassetteSales
101,000
0
0
0
1997
East
Video Sales
137,000
0
0
0
1997
East
[NULL]
238,000
0
0
1
Grouping
This table shows an ambiguous result set created
using the CUBE extension.
Time
Region
Profit
1996
East
200,000
1996
[NULL]
200,000
[NULL]
East
200,000
[NULL]
[NULL]
190,000
[NULL]
[NULL]
190,000
[NULL]
[NULL]
190,000
[NULL]
[NULL]
390,000
Grouping (Cont.)
We can resolve the ambiguity by using the
GROUPING and other functions in the code below
SELECT
decode(grouping(Time), 1, 'All Times', Time) as Time,
decode(grouping(region), 1, 'All Regions', Region) as
Region, sum(Profit)
FROM Sales
GROUB BY Time, Region
WITH CUBE
Grouping (Cont.)
The code result
Time
Region
Profit
1996
1996
East
All Regions
200,000
200,000
All Times
East
200,000
[NULL]
[NULL]
190,000
[NULL]
All Regions
190,000
All Times
[NULL]
190,000
All Times
All Regions
390,000
Grouping (Cont.)
Also we can use GROUPING function for this
purpose
we retrieve a subset of the subtotals created by a CUBE and none
of the base-level aggregations. The HAVING clause constrains
columns which use GROUPING functions
SELECT Time, Region, Department, SUM(Profit) AS Profit,
GROUPING (Time) AS T, GROUPING (Region) AS R,
GROUPING (Department) AS D
FROM Sales GROUP BY Time, Region, Department
WITH CUBE
HAVING (D=1 AND R=1 AND T=1)
OR (R=1 AND D=1)
OR (T=1 AND D=1)
Grouping (Cont.)
The query result
Time
Region
Department Profit
1996
1997
[NULL]
[NULL]
[NULL]
[NULL]
526,000
598,000
[NULL]
Central
[NULL]
316,000
[NULL]
East
[NULL]
442,000
[NULL]
West
[NULL]
366,000
[NULL]
[NULL]
[NULL]
1,124,000
Roll up Example
SELECT Year, Quarter, Month,
SUM(Profit) AS Profit
FROM
sales
GROUP BY Year, Quarter, Month
WITH ROLLUP
Roll up Example
The query result
Year
Quarter
Month
Profit
1997
Winter
Jan
55,000
1997
Winter
Feb
64,000
1997
Winter
March
71,000
1997
Winter
[NULL]
190,000
1997
Spring
April
75,000
1997
Spring
May
86,000
1997
Spring
June
88,000
1997
Spring
[NULL]
249,000
1997
Summer
July
91,000
1997
Summer
August
87,000
1997
Summer
September
101,000
1997
Summer
[NULL]
279,000
1997
Fall
October
109,000
1997
Fall
November
114,000
Referensi
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
Keith C.C. Chan,2003, Data Warehousing & Data Mining, The Hong
Kong Polytechnic University
Dr. Mourad YKHLEF,2009,Decision Support System, King Saud
University,
-,-, Decision Support Technology,The Heinz School
Ahmed M. Zeki, 2004, Data Mining & Data Warehousing, INFO 6630
Dan St. Clair, 2002, Lect 1 – Intro. To Data Mining & Data
Warehouses, University of Missouri-Rolla
S. Sudarshan; Krithi Ramamritham,-, Data Warehouse and Data Mining,
IIT Bombay
Chris Clifton, 2004, Data Warehousing, Purdue University
Richard J. Roiger,-,The Data Warehouse,Hugh J. Watson,-, Recent Developments in Data Warehousing,
http://www.terry.uga.edu/~hwatson/dw_tutorial.ppt, Tanggal Akses:17-092010
Mark Isken,-, Data Warehousing and Online Analytical Processing
(OLAP),Ari Cahyono,-,Introduction to Data Warehouse, Magister Teknologi
Informasi UGM
Reference Library
BI Resources
 The Data Warehousing Institute http://www.tdwi.org/
 Kimball and Associates http://www.ralphkimball.com./html/articles.html A
Dimensional Modeling Manifesto – Kimball, R.
http://www.dbmsmag.com/9708d15.html
 DSS Resources http://dssresources.com/
 Data Warehousing Information Center http://www.dwinfocenter.org/
 Intelligent Enterprise http://www.intelligententerprise.com/
 DM Review http://dmreview.com/
 KDNuggets http://www.kdnuggets.com/
 IT Toolbox http://www.ittoolbox.com/
http://businessintelligence.ittoolbox.com/
http://datawarehouse.ittoolbox.com/
 OLAP Report http://www.olapreport.com/
 Some free stuff (nice history of OLAP and commentary on industry trends
 Other stuff costs $
 http://www.mosha.com/msolap/
 Awesome set of resources from the lead developer on MS SQL Server Analysis
Server team
Some Good Books and Articles
 The Data Warehouse Toolkit – Kimball, R.
 Definitive, Microsoft SQL Server 2005 based 3rd edition now out
 OLAP Solutions – Thomsen, E.
 Definitive, abstract and dense, good
 MDX Solutions: With Microsoft SQL Server Analysis Services 2005 and
Hyperion Essbase by George Spofford
 MDX is the “SQL” for cubes
 Data Mining with SQL Server 2005 (Paperback)
by ZhaoHui Tang (Author), Jamie MacLennan
 Data Warehouse Design Solutions – Adamson and Venerable
 Multi-D DW designs from lots of different industries
 Very practical, uses realistic situations to reinforce the concepts
 Summers Rubber Company designs its data warehouse
Gorla, Narasimhaiah; Krehbiel, Steve
Interfaces; Mar/Apr 1999; 29, 2; ABI/INFORM Global
More AS Tutorials and Resources
 http://www.mosha.com/msolap/
 This is the granddaddy of MS SQL Server Analysis Services resoures.
Mosha Pasumansky is the MS development lead on AS engine.
 Site is gold mine of information and links regarding AS and related
software
 He participates in microsoft.public.sqlserver.olap
 Great Blog at http://www.sqljunkies.com/WebLog/mosha/
 Introduction to Analysis Services - by William Pearson (series of
articles) http://www.databasejournal.com/article.php/1459531/
 Very nice series of MS AS tutorials
 Best practices for Business Intelligence using the Microsoft Data
Warehousing Framework
 A white paper from Microsoft