Transcript slides

Design of DW
Remember
item, city, year, and sales_in_Euro
()
(city)
(city, item)
(item)
(city, year)
(city, item, year)
(year)
(item, year)

2 leveles of hierarchies for each dimension




Item(part,color)
City(downtown,suburb)
Year(good_year,bad_year)
 i1,i2
 c1,c2
 y1,y2
For a 3-dimensional data cube, where Li is the
number of all levels (L1,2,3=2), the total number
of cuboids that can be generated is
3
3
(2

1)

3
 27

i1







{(),
(c1),(c2),(i1),(i2),(y1),(y2),
(c1,i1),(c1,i2),(c2,i1),(c2,i2),
(c1,y1),(c1,y2),(c2,y1),(c2,y2),
(i1,y1),(i1,y2),(i2,y1),(i2,y2),
(c1,i1,y1),(c1,i1,y2),(c1,i2,y1),(c1,i2,y2),
(c2,i1,y1),(c2,i1,y2),(c2,i2,y1),(c2,i2,y2)}

DMQL Data Mining Query Language








Relational database schema
Translation into SQL query
Example, star schema, and relational data base
MDX
Multifeature cubes
Design of a Data Warehouse
Lifecycle models
Data Warehouse models
DMQL

DMQL: A Data Mining Query Language
for Relational Databases (Han et al,
Simon Fraser University)

Data warehouses and data marts can be
defined by cube definition and dimension
definition
DMQL




Create and manipulate data mining models
through a SQL-based interface (“Commanddriven” data mining)
Abstract away the data mining particulars
Data mining should be performed on data in the
database (should not need to export to a
special-purpose environment)
Approaches differ on what kinds of models
should be created, and what operations we
should be able to perform
Cube Definition Syntax in DMQL

Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]:
<measure_list>

Dimension Definition (Dimension Table)
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
time
Example of Star Schema
item
time_key
day
day_of_the_week
month
quarter
year
Sales Fact Table
time_key
item_key
branch_key
branch
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
location
location_key
street
city
state_or_province
country
Defining Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,
month, quarter, year)
define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
The star schema contains two measures
 dollars_sold and units_sold


How are the DMQL commands
interpreted to generate a specified data
cube?
Relational database schema
time(time_key,day_of_week,month,quater,year)
item(item_key,item_name,brand,type,supplier_type)
branch(branch_key,branch_name,branch_type)
location(location_key,street,city,province_or_state,country)
sales(time_key,item_key,branch_key,location_key,number_of_
units_sold,price)

The DMQL specification is translated into the following
SQL query which generates the base cuboid
SELECT s.time_key,s.item_key,s.branch_key,s.location_key,
SUM(s.number_of_units_sold*s.price),
SUM(s.number_of_units_sold)
FROM time t, item i, branch b, location l, sales s,
WHERE s.time_key=t.time_key AND s.item_key=i.item_key
AND s.branch_key=b.branch_key AND s.location_key=l.location_key
GROUP BY (s.time_key,s.item_key,s.branch_key,s.location_key);

The granularity (resolution) of each
dimension is at the join key level
A join key is the key that links a fact table
and the dimension table
 The fact table associated with a base cuboid
is sometimes referred as base fact table





By changing GROUP BY we may generate
other cuboids
The apex cuboid representing the total sum of
dollars_sold and total count of units_sold is
generated by GROUP BY ();
Other cuboids may be generated by applying
selection and projection operations on the base
cuboid
To generate a data cube we may as well use
GROUP BY CUBE
(s.time_key,s.item_key,s.branch_key,s.location_key);
Defining Snowflake Schema
in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
Defining Fact Constellation
in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold
= count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in
cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Example (Exercício 7)




Suponha um datawarehouse que contém as seguintes
quatro dimensões: Data, Espectador, Localizaçaõ e
Jogo e o facto “preço” que consiste no valor, em euros
Valor pago por um espectador quando assiste a um
determinado jogo numa data
Espectadores podem ser estudantes, adultos ou
séniores, e que cada uma destas categorias tem o seu
preço de bilhete
Dimensão data contenha o dia, mês e ano; que a
dimensão Localização contenha o nome do estádio, e
que a dimensão Jogo contenha o nome das duas
equipas defrontadas
Diagrama em estrela
para este DW
jogo(jogoId, equipa1, equipa2)
 data (dataId, dia, mes, ano)
 espectador (espId, nome, categoria)
 localizacao(localId, estadio)
 factos(jogoId, dataId, espId, localId, preco)


Escreva em SQL a interrogaçã o que
devolve o preço total pago por
espectadores estudantes para assistir
ao jogo que se realizou no Estádio da
Luz no dia 1 de Março de 2005
The corresponding SQL querry
SELECT SUM(preco)
FROM Factos F, Data D, Localizacao L, Espectador E
WHERE F.dataId = D.dataId
AND F.espId = E.espId
AND F.localId = L.localId
AND D.dia = 1
AND D.mes = 3
AND D.ano = 2005
AND L.estadio = ‘Estadio da Luz’
AND E.categoria= ‘Estudante’;

Esquema relacional que modele a mesma informação e
uma interrogaçã o SQL que devolva a mesma
informação:
jogo(jogoId, equipa1, equipa2, localId, data)
localizacao(localId, estadio)
espectador(espId, nome, categoriaId)
categoria (categoriaId, nomeC, preco)
jogoEspectador(jogoId, espId)
The corresponding SQL querry
SELECT SUM (C.preco)
FROM Categoria C, Espectador E, JogoEspectador JE, Jogo, J,
Localizacao L
WHERE C.nomeC = ‘Estudante’
AND J.data = 1/3/2005
AND L.estadio = ‘Estadio da Luz’
AND C.categoriaId = E.categoriaId
AND E.espId = JE.espId
AND JE.jogoId = J.jogoId
AND J.localId = L.localId;
Difference between both
approaches
Com o modelo em estrela, existe três joins, da
tabela Factos com cada uma das 3 dimensões
relevantes
 Com o esquema relacional, existem 4 joins

In the star schema has less joins, corresponding to
the relevant dimensions
 In multidimensional model the base cuboid is
already precomputed

MDX

Multidimensional Expressions (MDX) as a
Language

MDX emerged circa 1998, when it first began to
appear in commercial applications. MDX was
created to query OLAP databases, and has
become widely adopted within the realm of
analytical applications


Provide the total sales and total cost amounts for the years 1997
and 1998 individually for all USA-based stores (including all
products)
We are asked, moreover, to provide the information in a twodimensional grid, with the sales and cost amounts (called
measures in our data warehouse) in the rows and the years (1997
and 1998) in the columns
SELECT
{[Time].[1997],[Time].[1998]} ON COLUMNS,
{[Measures].[Warehouse Sales],[Measures].[Warehouse Cost]}
ON ROWS
FROM Warehouse
WHERE ([Store].[All Stores].[USA])




The cube that is targeted by the query (the query scope) appears in
the FROM clause of the query
The FROM clause in MDX works much as it does in SQL, where it
stipulates the tables used as sources for the query
The query syntax also uses other keywords that are common in
SQL, such as SELECT and WHERE.
Important difference is that the output of an MDX query, which uses
a cuboid as a data source, is another cuboid, whereas the output of
an SQL query (which uses a columnar table as a source) is
typically columnar

A query has one or more dimensions. The query above
has two. (The first three dimensions (=axes) that are
found in MDX queries are known as rows, columns
and pages)
SELECT{[Time].[1997].[Q1],[Time].[1997].[Q2]}ON
COLUMNS,{[Warehouse].[All Warehouses].[USA]} ON
ROWSFROM WarehouseWHERE ([Measures].[Warehouse
Sales])

Curled brackets "{}" are used in MDX to represent a set
of members of a dimension or group of dimensions
Complex Aggregation at
Multiple Granularities

Multifeature cubes compute complex queries
involving multiple aggregates at multiple
granularities (resolution)

Example



item is purchased in a sales region on a business
day (year,month,day)
The shelf life in months of a given item is stored in
shelf
The item price and sales is stored in price and sales

Find the total sales in 2000, broken down by
item, region, and month with subtotal for each
dimension

A data cube is constructed
{(item,region,month),(item,region),(item,month),
(month,region),(item),(month),(region,()}



Simple data cube, since it does not involve any
dependent aggregates
What are dependent aggregates?
Example
Grouping by all subsets (cuboids)
{item,region,month} (=data cube)
 Find maximum price for each group
(cuboid) in 2000
 Among the maximum price tuples find the
minimum and maximum shelf lives

Multifeature cube graph for the
example query
R2 cube {=MIN(R1.shelf)}
R3 cube {=MAX(R1.shelf)}
R1 cube {=MAX(price}
R0 cube
The multifeature graph illustrates the
aggregate dependencies
 R0,R1,R2,R3 are the grouping variables
 The grouping variables R2,R3 are
dependent on R1
 In extended SQL

R2 in R1
 R3 in R1

Query in extended SQL
SELECT
item,region,month,MAX(price),MIN(R1.shelf),MAX(R1.shelf)
FROM Purchases
WHERE year=2000
CUBE BY item,region,month:R1,R2,R3
SUCH THAT R1.price=MAX(price) AND
R2 IN R1 and R2.shelf=MIN(R1.shelf) AND
R3 IN R1 and R3.shelf=MAX(R1.shelf);
Design of Data Warehouse:
A Business Analysis Framework

Four views regarding the design of a data warehouse

Top-down view
• allows selection of the relevant information necessary for the data
warehouse

Data source view
• exposes the information being captured, stored, and managed by
operational systems

Data warehouse view
• consists of fact tables and dimension tables

Business query view
• sees the perspectives of data in the warehouse from the view of enduser
Data Warehouse
Design Process


Top-down, bottom-up approaches or a combination of both

Top-down: Starts with overall design and planning (mature)

Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view

Waterfall: structured and systematic analysis at each step before
proceeding to the next

Spiral: rapid generation of increasingly functional systems, short
turn around time, quick turn around
Lifecycle planning




Translation from user requirements into
software requirements
Transformation of the software requirements
into software design
Implementation of the design into programming
code
The sequence of this steps is defined by the
lifecyle model

A software lifecycle model must be
defined for every project!

The lifecycle model you choose has as
much influence over your project’s
success as any other planning decision
you make!
Pure Waterfall model
Software
Concept
Requirements
Analysis
Architecural
Design
Detailed
Design
Coding and
Debugging
System
Testing
Pure Waterfal model

Document driven model which means that the main
work products that are carried from phase to phase are
documents

The disadvantage of the pure waterfall model arise from
the difficulty of fully specifying requirements at the
beginning of the project, before any design works has
been done and before any code has been written
Salmon model
Software
Concept
Requirements
Analysis
Architecural
Design
Detailed
Design
Coding and
Debugging
System
Testing
Code-and-Fix I
System Specification
(maybe)
Code-And-Fix
Release
Code-and-Fix II
Advantages



No overhead, you don’t spend any time on planning,
documentation, quality assurance, enforcement, or
other activities than pure coding
Since you jump right into coding, you can show signs of
progress immediately
It requires little expertise
Code-and-Fix III
Maintainability and reliability decrease
with the complexity and the time
 For any kind of project other than a tiny
project, this model is dangerous. It might
have no overhead, but it also provides no
means of assessing progress, you just
code until you’re done

Spiral I
Determine objectives,
Alternatives, and
constraints
Review
Risk
analysis
Start
I
II
IV
III Prototypes
Simulation models
Reqirements
Development
Plan
Realse
code
test
Evaluate
Spiral II

The basic idea behind the diagram is that
you start on a small scale in the middle of
the spine, explore the risks, make a plan
to handle the risks, and then commit to an
approach of the next iteration. Each
iteration moves your project to a larger
scale
Spiral III

The spiral model is a risk-oriented
lifecycle model that breaks a software
project up into mini projects. Each mini
project addresses one or more major
risks until all the major risks have been
addressed
Spiral IV






Determine objectives, and constraints
Identify and resolve risks
Evaluate alternatives
Develop the deliverables for that iteration, and verify
that they are correct
Plan next iteration
Commit to an approach for the next iteration
One of the most important advantages of the spiral model is that as
costs increase, risk decrease. The more time and money you
spend, the less risk your’re taking
Data Warehouse: A Multi-Tiered Architecture
Other
sources
Operational
DBs
Metadata
Extract
Transform
Load
Refresh
Monitor
&
Integrator
Data
Warehouse
OLAP Server
Serve
Analysis
Query
Reports
Data mining
Data Marts
Data Sources
Data Storage
OLAP Engine Front-End Tools
Three Data Warehouse Models

Enterprise warehouse


collects all of the information about subjects spanning the entire
organization
Data Mart

a subset of corporate-wide data that is of value to a specific groups
of users. Its scope is confined to specific, selected groups, such as
marketing data mart
• Independent vs. dependent (directly from warehouse) data mart

Virtual warehouse


A set of views over operational databases
Only some of the possible summary views may be materialized
Data Warehouse Usage

Three kinds of data warehouse applications

Information processing
• supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs

Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling, pivoting

Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools
Data Warehouse Back-End Tools
and Utilities

Data extraction


Data cleaning


convert data from legacy or host format to warehouse format
Load


detect errors in the data and rectify them when possible
Data transformation


get data from multiple, heterogeneous, and external sources
sort, summarize, consolidate, compute views, check integrity, and build
indicies and partitions
Refresh

propagate the updates from the data sources to the warehouse

DMQL Data Mining Query Language








Relational database schema
Translation into SQL query
Example, star schema, and relational data base
MDX
Multifeature cubes
Design of a Data Warehouse
Lifecycle models
Data Warehouse models

Data Cleaning
(De)normalization(?)
 Missing Values
 ...
