Data mining - Texas Tech University
Download
Report
Transcript Data mining - Texas Tech University
ISQS 6339, Data Management and Business Intelligence
Cubism – Measures and
Dimensions
Zhangxi Lin
Texas Tech University
1
Outline
Measures
Where we’ve been
Populating fact table
Types of dimensions
2
Structure and Components of Business
Intelligence
SSMS
SSIS
SSAS
SSRS
SAS
EG
SAS
EM
3
Snowflake Schema of the Data Mart
10
DimBatch
9
DimMachine
8
Manufacturingfact
3
2
1
DimProduct
5
DimMachineType
DimPlant
7
4
DimMaterial
DimCountry
6
DimProductSubType
DimProductType
4
Where we’ve been and where we are now
Exercise 1: Getting started
Exercise 2: Creating a data mart with SSMS
Exercise 3: Creating data mart with BIDS
Exercise 4: Populating dimensions of a data mart
Exercise 5: Loading fact tables
Exercise 6: Create and customize a cube
5
What we need to do with the half-done data
mart?
Populate DimBatch dimenstion table
Populate ManufacturingFact table
Build an OLAP cube (we already did this before)
Check measures
Check dimensions
6
MEASURES
7
Facts
Facts are measurements associated with a specific
business process.
Many facts can be derived from other facts, including
additive and semiadditive facts.
Non-additive facts can be avoided by calculating it
from additive facts.
Measures are clustered together in a group, called
measure group.
8
Types of measures
Three types
◦ Additive measures. Most facts are additive (calculative), such as
sum
◦ Semiadditive measures. The measures that can be added along
some dimensions, but not along others. For example, inventory
level can be added along product dimension but not time
dimension.
◦ Non-additive (such as max, average), or descriptive (e.g. factless
fact table).
Aggregate functions
◦ Additive: Sum
◦ Semiadditive: ByAccount, Count, FirstChild, FirstNonEmpty,
LastChild, LastNonEmpty, Max, Min
◦ Nonadditive: DistinctCount, None.
Measures and dimensions
Dimensions are used to aggregate
measures. Therefore, they must be
somehow related to measures
Granularity
◦ Important for the analysis
◦ There could be missing values in the fact table
LOADING FACT TABLES
11
Exercise 5: Loading Fact Tables
Project name: MMMFactLoad-lastname
Package name: FactLoad.dtsx
Tasks
◦
◦
◦
◦
Create Inventory Fact table
Load Dim Batch
Load Manufacturing Fact
Load Inventory Fact
Deliverable: email a screenshot of the “green”
outcome of the ETL project to
[email protected], with a subject title “ISQS
6339 EX5 - <lastname>”
12
Inventory Fact Table
Create a Table InventoryFact in your database.
◦ Compound primary key: DateOfInventory, ProductCode, and
Material
◦ Define two foreign keys
Column Name
Data Type
Allow Nulls
InventoryLevel
Int
No
NumberOnBackorder
Int
No
DateOfInventory
Datatime
No
ProductCode
Int
No
Material
Varchar(30)
No
13
Data Sources for Loading Fact
For loading DimBatch table and ManufacturingFact table
◦ BatchInfo.CSV
For loading InventortyFact table
◦ Lin.OrderProcessingSystem Database
14
Control Flow for Loading Facts and the
Remaining Dimension
Note: to ease
debugging, you
may use three
packages and test
them one by one,
instead of doing
everything in one
package
15
Flat File Connection
Data types
◦ BatchNumber, MachinNumber: four-byte
signed integer [DT_I4]
◦ ProductCode, NumberProduced,
NumberRejected: four-byte signed integer
[DT_I4]
◦ TimeStarted, TimeStopped: database
timestamp [DT_DBTimeStamp]
Only check BatchNumber as the input of Dim
Batch
All columns are needed for fact tables
16
Some Frequently Used Nodes
Load DimBatch Data Flow
18
Load DimBatch Data Flow
Note: Because of duplication in the source file, we may insert
An Aggregate item after the Flat File Source item.
19
The Flat File Source
20
Sort Transformation
In the Aggregate item,
Define “Group-by”
BatchNumber.
In Derived column item,
Define BatchName
From BatchNumber
Use the expression
(DT_WSTR, 50)[BatchNumber]
To change the data type
Of BatchName.
21
Load Fact Data Flow
22
Derived Columns for the Fact table
23
Expressions for the Derived Columns
AcceptedProducts
◦ [NumberProduced] – [NumberRejected]
ElapsedTimeForManufacture
◦ DATEDIFF(“mi”, [TimeStarted],[TimeStopped])
DateOfManufacture
◦ (DT_DBTIMESTAMP)SUBSTRING((DT_WSTR,25)[TimeS
tarted],1,10)
This expression converts TimeStarted into a string and
selects the first ten characters of that string. This string is
then converted back into a date time, without the time
portion.
24
OLE DB Destination
For loading the fact
table
25
Load Inventory Fact
OLE DB Source
◦ OrderProcessingSystem.InventoryFact
OLE DB Destination
◦ MaxMinManufacturingDM-lastname.InventoryFact
No transformation
There are two ways to loading the table
◦ Create the table and use ETL to load it
◦ Import directly from the source to the database
MaxMinManufacturingDM-lastname
26
Debugging Results
Loading DimBatch
Loading ManufacturingFact
27
BUILDING AN OLAP
CUBE
28
Exercise 6: Design a Cube
Project name: ISQS6339_EX6_2015_lastname
Tasks
◦ Add in new date items (year, quarter, and month) to two fact
tables
◦ Create time dimension using Manufacturing Fact table
◦ Define calculated measures (Total Products, Percent Rejected)
◦ Define hierarchies of attributes in dimension tables
◦ Create a cube from the MaxMinManufacturing data mart with
hierarchical date dimension
Deliverable:
◦ Screenshots: dimension hierarchies, dimensions, relationships of
facts and dimensions, deployment result, format of measures, and
browsing results.
29
Three Steps to Create a Cube from Data
Sources
Defining data source
Defining data source view
◦ Add in three new columns of year, quarter, and month for the
two fact tables
Building a cube.
◦ Define a new dimension Dim Time from Manufacturing Fact
table
Customize the cube:
◦ Link two fact tables in a cube
◦ Define new primary key for Dim Time
◦ Define calculated measures
◦ Relate dimensions to measures
30
T-SQL Expressions for DS View Definition Manufacture
YearOfManufacture
CONVERT(char(4),YEAR(DateOfManufacture))
QuarterOfManufacture
CONVERT(char(4),YEAR(DateOfManufacture)) +
CASE
WHEN MONTH (DateOfManufacture) BETWEEN 1 AND 3 THEN 'Q1'
WHEN MONTH (DateOfManufacture) BETWEEN 4 AND 6 THEN 'Q2'
WHEN MONTH (DateOfManufacture) BETWEEN 7 AND 9 THEN 'Q3'
ELSE 'Q4'
END
MonthOfManufacture
CONVERT(char(4),YEAR(DateOfManufacture)) +
RIGHT('0'+CONVERT(varchar(2), MONTH(DateOfManufacture)),2)
31
T-SQL Expressions for DS View Definition Inventory
YearOfInventory
CONVERT(char(4),YEAR(DateOfInventory))
QuarterOfInventory
CONVERT(char(4),YEAR(DateOfInventory)) +
CASE
WHEN MONTH (DateOfInventory) BETWEEN 1 AND 3 THEN 'Q1'
WHEN MONTH (DateOfInventory) BETWEEN 4 AND 6 THEN 'Q2'
WHEN MONTH (DateOfInventory) BETWEEN 7 AND 9 THEN 'Q3'
ELSE 'Q4'
END
MonthOfInventory
CONVERT(char(4),YEAR(DateOfInventory)) +
RIGHT('0'+CONVERT(varchar(2), MONTH(DateOfInventory)),2)
32
Data Source View
New columns
33
Select Measures Page
Uncheck Manufacture
Fact Count
34
The finished cube
35
Cube
Structure
36
Defining a format string
37
Inventory measures
“Number on Backorder” is also set with these two parameters
38
Calculated measures – made-up
facts
The definition of calculated measure is stored in the
OLAP cube itself.
The actual values that result from a calculated measure
are not calculated, however, until a query containing that
calculated measure is executed. The results of that
calculation are then cached in the cube. The cached
value is then delivered to any subsequent users
requesting the same calculation.
The expressions of calculation are created using a
language known as Multidimensional Expression
Language (MDX) script. MDX is different from T-SQL. It
is a special language with features designed to handle
the advanced mathematics and formulas required by
OLAP analysis. This is not found in T-SQL.
39
41
DIMENSIONS
in SQL Server
42
Types of Dimensions
Fact dimensions: the Dimensions created from attributes in a fact
table
Parent-Child dimensions: Built on a table containing a selfreferential relationship, such as a parent attribute.
Role playing dimensions: related to the same measure group
multiple times; each relationship represents a different role the
dimension play; for example, time dimension plays three different roles:
date of sale, data of shipment, and date of payment.
◦ To create a role playing dimension, add the dimension to the Dimension
Usage tab multiple times. Then create a relationship between each instance
of the dimension and the measure group.
Reference dimensions: Not related directly to the measure group
but to another regular dimension which in turn related to the measure
group
Data mining dimensions: the information discovered by data mining
Many-to-many dimensions: e.g. multiple ship to addresses
Slowly changing dimensions
43
Slowly changing dimensions
Type 1 SCD – no track
Type 2 SCD – tracking the entire history, adding four
attributes: SCD Original ID, SCD Start Date, SCD
End Date, SCD Status
Type 3 SCD – Similar to Type 2 SCD but only track
current state and the original state; two additional
attribute: SCD Start Date, SCD Initial Value
Add a time dimension (a fact dimension)
Rename time dimension
Date Hierarchy
Material Hierarchy & Plant Hierarchy
Product Hierarchy
Relating Dimensions in the Cube