Data mining - Texas Tech University

Download Report

Transcript Data mining - Texas Tech University

ISQS 6339, Data Management and Business Intelligence
Cubism – Measures and
Dimensions
Zhangxi Lin
Texas Tech University
1
Outline




Measures
Where we’ve been
Populating fact table
Types of dimensions
2
Structure and Components of Business
Intelligence
SSMS
SSIS
SSAS
SSRS
SAS
EG
SAS
EM
3
Snowflake Schema of the Data Mart
10
DimBatch
9
DimMachine
8
Manufacturingfact
3
2
1
DimProduct
5
DimMachineType
DimPlant
7
4
DimMaterial
DimCountry
6
DimProductSubType
DimProductType
4
Where we’ve been and where we are now
Exercise 1: Getting started
 Exercise 2: Creating a data mart with SSMS
 Exercise 3: Creating data mart with BIDS
 Exercise 4: Populating dimensions of a data mart
 Exercise 5: Loading fact tables
 Exercise 6: Create and customize a cube

5
What we need to do with the half-done data
mart?





Populate DimBatch dimenstion table
Populate ManufacturingFact table
Build an OLAP cube (we already did this before)
Check measures
Check dimensions
6
MEASURES
7
Facts
Facts are measurements associated with a specific
business process.
 Many facts can be derived from other facts, including
additive and semiadditive facts.
 Non-additive facts can be avoided by calculating it
from additive facts.
 Measures are clustered together in a group, called
measure group.

8
Types of measures

Three types
◦ Additive measures. Most facts are additive (calculative), such as
sum
◦ Semiadditive measures. The measures that can be added along
some dimensions, but not along others. For example, inventory
level can be added along product dimension but not time
dimension.
◦ Non-additive (such as max, average), or descriptive (e.g. factless
fact table).

Aggregate functions
◦ Additive: Sum
◦ Semiadditive: ByAccount, Count, FirstChild, FirstNonEmpty,
LastChild, LastNonEmpty, Max, Min
◦ Nonadditive: DistinctCount, None.
Measures and dimensions
Dimensions are used to aggregate
measures. Therefore, they must be
somehow related to measures
 Granularity

◦ Important for the analysis
◦ There could be missing values in the fact table
LOADING FACT TABLES
11
Exercise 5: Loading Fact Tables
Project name: MMMFactLoad-lastname
 Package name: FactLoad.dtsx
 Tasks

◦
◦
◦
◦

Create Inventory Fact table
Load Dim Batch
Load Manufacturing Fact
Load Inventory Fact
Deliverable: email a screenshot of the “green”
outcome of the ETL project to
[email protected], with a subject title “ISQS
6339 EX5 - <lastname>”
12
Inventory Fact Table

Create a Table InventoryFact in your database.
◦ Compound primary key: DateOfInventory, ProductCode, and
Material
◦ Define two foreign keys
Column Name
Data Type
Allow Nulls
InventoryLevel
Int
No
NumberOnBackorder
Int
No
DateOfInventory
Datatime
No
ProductCode
Int
No
Material
Varchar(30)
No
13
Data Sources for Loading Fact

For loading DimBatch table and ManufacturingFact table

◦ BatchInfo.CSV
For loading InventortyFact table
◦ Lin.OrderProcessingSystem Database
14
Control Flow for Loading Facts and the
Remaining Dimension

Note: to ease
debugging, you
may use three
packages and test
them one by one,
instead of doing
everything in one
package
15
Flat File Connection

Data types
◦ BatchNumber, MachinNumber: four-byte
signed integer [DT_I4]
◦ ProductCode, NumberProduced,
NumberRejected: four-byte signed integer
[DT_I4]
◦ TimeStarted, TimeStopped: database
timestamp [DT_DBTimeStamp]


Only check BatchNumber as the input of Dim
Batch
All columns are needed for fact tables
16
Some Frequently Used Nodes
Load DimBatch Data Flow
18
Load DimBatch Data Flow
Note: Because of duplication in the source file, we may insert
An Aggregate item after the Flat File Source item.
19
The Flat File Source
20
Sort Transformation
In the Aggregate item,
Define “Group-by”
BatchNumber.
In Derived column item,
Define BatchName
From BatchNumber
Use the expression
(DT_WSTR, 50)[BatchNumber]
To change the data type
Of BatchName.
21
Load Fact Data Flow
22
Derived Columns for the Fact table
23
Expressions for the Derived Columns



AcceptedProducts
◦ [NumberProduced] – [NumberRejected]
ElapsedTimeForManufacture
◦ DATEDIFF(“mi”, [TimeStarted],[TimeStopped])
DateOfManufacture
◦ (DT_DBTIMESTAMP)SUBSTRING((DT_WSTR,25)[TimeS
tarted],1,10)
 This expression converts TimeStarted into a string and
selects the first ten characters of that string. This string is
then converted back into a date time, without the time
portion.
24
OLE DB Destination
For loading the fact
table
25
Load Inventory Fact
OLE DB Source
◦ OrderProcessingSystem.InventoryFact
 OLE DB Destination
◦ MaxMinManufacturingDM-lastname.InventoryFact
 No transformation


There are two ways to loading the table
◦ Create the table and use ETL to load it
◦ Import directly from the source to the database
MaxMinManufacturingDM-lastname
26
Debugging Results
Loading DimBatch
Loading ManufacturingFact
27
BUILDING AN OLAP
CUBE
28
Exercise 6: Design a Cube
Project name: ISQS6339_EX6_2015_lastname
 Tasks

◦ Add in new date items (year, quarter, and month) to two fact
tables
◦ Create time dimension using Manufacturing Fact table
◦ Define calculated measures (Total Products, Percent Rejected)
◦ Define hierarchies of attributes in dimension tables
◦ Create a cube from the MaxMinManufacturing data mart with
hierarchical date dimension

Deliverable:
◦ Screenshots: dimension hierarchies, dimensions, relationships of
facts and dimensions, deployment result, format of measures, and
browsing results.
29
Three Steps to Create a Cube from Data
Sources
Defining data source
 Defining data source view
◦ Add in three new columns of year, quarter, and month for the
two fact tables
 Building a cube.
◦ Define a new dimension Dim Time from Manufacturing Fact
table
 Customize the cube:
◦ Link two fact tables in a cube
◦ Define new primary key for Dim Time
◦ Define calculated measures
◦ Relate dimensions to measures

30
T-SQL Expressions for DS View Definition Manufacture

YearOfManufacture
CONVERT(char(4),YEAR(DateOfManufacture))

QuarterOfManufacture
CONVERT(char(4),YEAR(DateOfManufacture)) +
CASE
WHEN MONTH (DateOfManufacture) BETWEEN 1 AND 3 THEN 'Q1'
WHEN MONTH (DateOfManufacture) BETWEEN 4 AND 6 THEN 'Q2'
WHEN MONTH (DateOfManufacture) BETWEEN 7 AND 9 THEN 'Q3'
ELSE 'Q4'
END

MonthOfManufacture
CONVERT(char(4),YEAR(DateOfManufacture)) +
RIGHT('0'+CONVERT(varchar(2), MONTH(DateOfManufacture)),2)
31
T-SQL Expressions for DS View Definition Inventory

YearOfInventory
CONVERT(char(4),YEAR(DateOfInventory))

QuarterOfInventory
CONVERT(char(4),YEAR(DateOfInventory)) +
CASE
WHEN MONTH (DateOfInventory) BETWEEN 1 AND 3 THEN 'Q1'
WHEN MONTH (DateOfInventory) BETWEEN 4 AND 6 THEN 'Q2'
WHEN MONTH (DateOfInventory) BETWEEN 7 AND 9 THEN 'Q3'
ELSE 'Q4'
END

MonthOfInventory
CONVERT(char(4),YEAR(DateOfInventory)) +
RIGHT('0'+CONVERT(varchar(2), MONTH(DateOfInventory)),2)
32
Data Source View
New columns
33
Select Measures Page
Uncheck Manufacture
Fact Count
34
The finished cube
35
Cube
Structure
36
Defining a format string
37
Inventory measures
“Number on Backorder” is also set with these two parameters
38
Calculated measures – made-up
facts



The definition of calculated measure is stored in the
OLAP cube itself.
The actual values that result from a calculated measure
are not calculated, however, until a query containing that
calculated measure is executed. The results of that
calculation are then cached in the cube. The cached
value is then delivered to any subsequent users
requesting the same calculation.
The expressions of calculation are created using a
language known as Multidimensional Expression
Language (MDX) script. MDX is different from T-SQL. It
is a special language with features designed to handle
the advanced mathematics and formulas required by
OLAP analysis. This is not found in T-SQL.
39
41
DIMENSIONS
in SQL Server
42
Types of Dimensions
Fact dimensions: the Dimensions created from attributes in a fact
table
 Parent-Child dimensions: Built on a table containing a selfreferential relationship, such as a parent attribute.
 Role playing dimensions: related to the same measure group
multiple times; each relationship represents a different role the
dimension play; for example, time dimension plays three different roles:
date of sale, data of shipment, and date of payment.

◦ To create a role playing dimension, add the dimension to the Dimension
Usage tab multiple times. Then create a relationship between each instance
of the dimension and the measure group.
Reference dimensions: Not related directly to the measure group
but to another regular dimension which in turn related to the measure
group
 Data mining dimensions: the information discovered by data mining
 Many-to-many dimensions: e.g. multiple ship to addresses
 Slowly changing dimensions

43
Slowly changing dimensions
Type 1 SCD – no track
Type 2 SCD – tracking the entire history, adding four
attributes: SCD Original ID, SCD Start Date, SCD
End Date, SCD Status
 Type 3 SCD – Similar to Type 2 SCD but only track
current state and the original state; two additional
attribute: SCD Start Date, SCD Initial Value


Add a time dimension (a fact dimension)
Rename time dimension
Date Hierarchy
Material Hierarchy & Plant Hierarchy
Product Hierarchy
Relating Dimensions in the Cube