Transcript Lecture 2

BI & DM for CRM
Lecture 2
Data Warehouse and OLAP
Steps in Knowledge Discovery
Steps in KD cont...







Data Cleaning: To remove noise and inconsistent data / check data
validity & correctness – PLAY WİTH DATA (good step to know your
data)
Data İntegration: To combine multiple sources of data / integrate
various data sources on various platforms
(DBMS/flatfiles/spreedsheets etc..)
Data Selection: Retrieve relevent data from data repository for
analysis
Data Transformation: To transform or consolidate data into different
forms to deploy in DM operations. (If Data Warehouses were used,
the process would be before the data selection.)
Data Mining: To apply intelligent models for extracting patterns from
data.
Pattern Evaluation: To identify interesting patterns, result of DM
process, on some measures.
Knowledge presentation: To visualize the representation of
knowledge, driven by DM process and pattern evaluationz
DM System Architecture
So, what is Data Warehouse?
Loosely speaking;
• A decision support database that is
maintained separately from the
organization’s operational database
• Support information processing by
providing a solid platform of consolidated,
historical data for analysis.
 W.H. Inmon- ‘ A data warehouse is a
subject-oriented, integrated, time-varient
and nonvolatile collection of data in support
of management’s decision making proces.’
DW—Subject-Oriented

Organized around major subjects, such as customer,
product, sales.

Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing.

Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process.
6
DW—Integrated
Constructed by integrating multiple, heterogeneous
data sources
 relational databases, flat files, on-line transaction
records
 Data cleaning and data integration techniques are
applied.
 Ensure consistency in naming conventions,
encoding structures, attribute measures, etc. among
different data sources



E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is
converted.
7
DW—Time Variant


The time horizon for the data warehouse is significantly
longer than that of operational systems.

Operational database: current value data.

Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse

Contains an element of time, explicitly or implicitly

But the key of operational data may or may not contain
“time element”.
8
DW—Non-Volatile

A physically separate store of data transformed from
the operational environment.

Operational update of data does not occur in the data
warehouse environment.

Does not require transaction processing, recovery,
and concurrency control mechanisms

Requires only two operations in data accessing:
• initial loading of data and access of data.
9
On-line transaction processing vs On-line
Analytical Processing
OLTP
OLAP
users
clerk, IT professional
knowledge worker
function
day to day operations
decision support
DB design
application-oriented
subject-oriented
data
current, up-to-date
detailed, flat relational
isolated
repetitive
historical,
summarized, multidimensional
integrated, consolidated
ad-hoc
lots of scans
unit of work
read/write
index/hash on prim. key
short, simple transaction
# records accessed
tens
millions
#users
thousands
hundreds
DB size
100MB-GB
100GB-TB
metric
transaction throughput
query throughput, response
usage
access
complex query
10
Why seperate Database?



Main reason is high performance for both
systems – processing OLAP queries in
transactional DBs would substantially
degrade the performance of operational
tasks
Unlike OLTP, OLAP usually needs read only
access from the database. So, for OLAP
operations DBs does not require concurrent
control and recovery. If applied for OLAP this
may jeopardize the execution of OLTP
operations
Systems are mainly used for different
operations...
Data Model- Multidimentional
DW & OLAP are based on multidementional
data models...
Actually... It is all about multidimentional
data models...
2D view of data
Facebook Data...
POSTING
Time
Video
Photo
Write Up
W1
200
500
1000
W2
250
520
1023
W3
310
782
1051
3D view of Data cont
MALE
FEMALE
N/A
POSTING
POSTING
POSTING
Time
Video Photo Write Video Photo Write
Up
Up
W1
50
W2
W3
120
400
80
150
500
Video Photo Write
Up
70
230
100
LATTICE OF CUBOIDS
ALL
1-D Cuboids
Time
Time, Posting
2-D Cuboids
Time, Posting,
Gender
3-D Cuboids
Posting
Time, Gender Time, Education
0 D – Apex cuboid
Gender
Education
Posting, Time,
Gender Education
Time, Posting, Time, Gender,
Education
Education
Time, Posting, Gender, Education
Gender,
Education
Posting, Gender,
Education
4-D Cuboids
Latice & Cuboids
Any n-D data as a series of (n-1)-D “cubes”
 In data warehousing literature,
 A data cube is referred to as a cuboid
 The lattice of cuboids forms a data cube.
 The cuboid holding the lowest level of summarization is called a
 the 4-D cuboid is the base cuboid for the given four dimensions
• base cuboid.
 The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid.
 Here this is the total posting
 typically denoted by all

Conceptual Modelling of DW

Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set
of dimension tables
– Snowflake schema: A refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables, forming a shape similar to
snowflake
– Fact constellations: Multiple fact tables share dimension
tables, viewed as a collection of stars, therefore called
galaxy schema or fact constellation
STAR SCHEMA
Time
Time_key
Day
Day_of_week
Post
Post Type
Time_key
Post_key
Post_key
Post _name
Education_key
Month
Gender_key
Quarter
Total Post
Year
Gender
Education
Education_key
Education_name
Edu_type_key
Gender_key
Gender_name
Snow Flake Schema
Time
Time_key
Day
Day_of_week
FACT TABLE
Post
Post Type
Time_key
Post_key
Post_key
Post _name
Education_key
Month
Gender_key
Quarter
Total Post
Year
Gender
Gender_key
Education
Gender_name
Education_key
Education_name
Edu_type_key
Edu Type
Edu_type_key
Edu_type_name
Fact Constellation
Time
Post
Post Type
Time_key
Time_key
Post_key
Day
Post_key
Post _name
Day_of_week
Education_key
Chat
Month
Gender_key
Time_key
Quarter
Total Post
Education_key
Year
Gender_key
Total_chat_time
Education
Education_key
Education_name
Edu_type_key
Gender
Gender_key
Gender_name
Total_online_time
Categorization & Computation

How are measures computed?


Data Cube measure is a numarical function that
can be evaluated at each point in the data cube
space.
Distributive- sum(), min(), max()
How many posts are from USA?



Sum (Washington)+Sum(Nevada)....
Algebraic: sum(),count(),
standard_deviation()...
Holisatic: median(), mode()
Concept Hierarchies

What is it?
Flow of a concept (could be dimension)
from lower levels to higher levels.
More specific
More General
Example for concept hierarchy
Location
all
all
Europe
region
country
city
office
Germany
Frankfurt
...
...
...
Spain
North_America
Canada
Vancouver ...
L. Chan
...
...
Mexico
Toronto
M. Wind
23
Hierarchical vs lattice Structures
(Concept Hierarchy)
year
country
quarter
Province
or state
city
Street
week
month
day
We also have user defined concept hierarcies like Fiscal year or Academic Year

How can we use Concept Hierarchies useful in
OLAP?
 In
multi dimensional model, data are organized in multi dimensions and
each dimension contains multi level of abstraction defined by concept
hierarchies
 This type of organization provides user to view the data from various
perspectives
 Basically, OLAP provides user friendly environment for interactive data
analysis.



Roll-up (drill-up)
Drill-down
Slice and Dice: The slice operation performs a selection on
one dimension of the given cube, resulting in a subcube. The
operation defines a subcube by performing a selection on two or
more dimensions.

Pivot: visualization operation that rotates the data axes in view in
order to provide an alternative presentation of the data.

Roll-up (drill-up):Performs aggregation on a data cube, either
by climbing up a concept hierarchy for a dimension or
dimension reduction.
PC
Printer
location by city
Istanbul Ankara Berlin
Münih
20
30
50
40
15
5
10
20
Hierarchy Roll Up
Dimension Roll Up
Location by country
Location
TR
GR
PC
140
PC
50
90
Printer
50
Printer
20
30

Drill Down:Drill down is reverse of roll-up. It navigates from
general hierarchy to more specific hierarchy. Adding new dimension
to data
Year 2009
PC
100
Printer
150
Quarter
Drill
Down
Q1
Q2
Q3
Q4
PC
20
30
40
10
Printer
30
60
20
40

The slice operation performs a selection on one dimension of the
given cube, resulting in a subcube. The operation defines a
subcube by performing a selection on two or more dimensions.
Measure
Quarter
Q1
PC
20
Q2
30
40
Q1
Q4
10
Dice
Printer
30
60
Time 2010
Slice
Q3
20
Sales
PC
20
Printer
30
40
Time 2010
Q1
PC
20






drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end
relational tables (using SQL)
ranking the top N or bottom N items in lists
moving averages
growth rates
interests



Radial lines from a central point
each line represents a concept hierarchy for a
dimension
each abstraction level is called a footprint
granularities available for use by OLAP
 four radial lines for concept hierarchies
location,customer,item,time
time line has 4 footprints:
day,month,quarter,year

A Star-Net Query Model
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
PRODUCT LINE
Time
ANNUALY
QTRLY
DAILY
CITY
PRODUCT ITEM
Product
PRODUCT GROUP
SALES PERSON
COUNTRY
DISTRICT
REGION
Location
DIVISION
Each circle is
called a footprint
Promotion
Organization
Design of a Data Warehouse: A Business
Analysis Framework

Four views regarding the design of a data warehouse

Top-down view


Data source view


exposes the information being captured, stored, and
managed by operational systems
Data warehouse view


allows selection of the relevant information necessary for the
data warehouse
consists of fact tables and dimension tables
Business query view

sees the perspectives of data in the warehouse from the view
of end-user
•
•
•
Top-down, bottom-up approaches or a combination of both

Top-down: Starts with overall design and planning (mature)

Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view

Waterfall: structured and systematic analysis at each step before
proceeding to the next

Spiral: rapid generation of increasingly functional systems, short turn
around time, quick turn around
Typical data warehouse design process

Choose a business process to model, e.g., orders, invoices, etc.

Choose the grain (atomic level of data) of the business process

Choose the dimensions that will apply to each fact table record

Choose the measure that will populate each fact table record

Enterprise Warehouse
Collects

all information about subject spanning of the entire organization
Data Mart
A
subset of corporate-wide data that is valuable to specific groups of
users. Such as marketing

Virtual Warehouse
A
set of views over operational databases
Only some of the possible summary views may be materialized
Cuboids are referred as aggregations
 One factor affecting storage requirements
 Sparsity: the amount of empty cells in a cube
 The base cuboid is likely to contain many empty
cells
 it is a spares cube or array
the 0 or lower dimensional cuboids are less
spares than the higher dimensional ones
 it is not likely that they contain empty cells
 Moving along higher levels for the dimension
hierarchy
 the cuboids becomes less spares or more dense

PC
01.10.2003
DV
Two dimensional
sparse cuboid
10
02.10.2003
03.10.2003
Prt CD
1
4
04.10.2003
2
2
items
One dimensional dense
cuboid
01.10.2003
10
02.10.2003
1
03.10.2003
6
04.10.2003
2



ROLAP – Relational OLAP
MOLAP – Multidimensional OLAP
HOLAP – Hybrid OLAP






Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware to support missing pieces
query response is generally slower
low storage requirement
Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
greater scalability
appropriate for large data sets that are infrequently queried
 historical data from less recent previous years




Array-based multidimensional storage engine (sparse matrix
techniques)
fast indexing to pre-computed summarized data
a two-level storage representation
 dense subcubes are stored as array structures
 spars subcubes are stored by compression techniques
appropriate for cubes with frequent use and rapid query response




combines ROLAP and MOLAP benefiting from
 greater scalability of ROLAP
 faster computation of MOLAP
Large volumes of data base cuboid is stored in a relational
database
aggregations are stored as arrays
appropriate for for cubes that requre
 rapid query response for summaries based on a large amount
of base data

Data cubes can be viewed as lattice of cuboids
The
bottom cuboid is the base cuboid
The top is the apex cuboid
What is the number of cuboids for N dimensional data cube? 2N

OLAP computes at least some of the cuboids
For
fast response
For avoiding redundant calculation
all
0-D(apex) cuboid
product
product,date
date
country
product,country
1-D cuboids
date, country
2-D cuboids
product, date, country
3-D(base) cuboid

Three Types
Materialize
every cuboids
Huge amounts of memory space
Non materialization – Zero cube calculation
Show processing of queries
Some (partial) materialization
Trade off between storage space and response time
Selection of which cuboids to materialize
Based on size, sharing, access frequency and etc



Complete load of the cube

all dimension and fact table data is read and

all specified aggregations cuboids are calculated

process a cube when

its structure is new or

its dimensions or measures have been edited
Incrementally updating a cube

new data is added but existing data not changed and cube
structure si the same
Refreshing

data cleared and reloaded

its aggregations recalculated

faster then processing:no design of aggregation tables




Dimension member or measure whose value is
computed at run time using an expression
Only the definitions are stored but values exists
only in memory upon a query
do not increase in cube size
Ex: if sales and cost are included in the base fact
table
a
profit measure can be a calculated member
profit = sales – cost
Average_sales = sales/#_items_sold





Combination of multiple cubes in one logical
cube
can be based on a single cube to expose only selected subsets of
measures and dimensions
Require no physical space
store only the dimensions information not actual data
provide a valuable security function
limiting
the access of some users




Attribute of a dimension member
provides additional information about the member
a column in the same dimension table as the
associated members
used in queries
provide users more options when analysing cube
data
A typical time table:
(time_id,day,month,quarter,year,business
day,leap,day of the week)
dimension levels day<month<quarter<year
member properties for day:
weekend or business day:0 or 1
day of the week:1,2,3,...,7
a member property for year is
whether it is leap year or not:0 or 1



Logical dimension based on a member property of a level in a
physical dimension
enables users to analyse cube data based on the member
properties of dimension levels
add a virtual dimension to a cube only if
the
dimension that supplies its member property is also included in the
cube

adding a virtual dimension does not increase cube size
not
affect cube processing time
calculated in memory when needed
query processing time is slower



The business day column was a member
property for day level of the time dimension
the user may want to investigate sales by type
of day (business or weekend)
makes business day member property as a
virtual dimension of the sale cube

Based on two dimension table columns that
together define the lieage relationships among
the members of the dimension
Member
key column:identifies each member
Parent key column: identifies the parent of each member
Example: A HR Department
Poal West
James Smith
John Grande
Amy Joens
Jo Brown
Jill Kelly
Emp Name
Emp_id
Manager_id
James Smith
1
3
Amy Jones
2
3
Paul West
3
3
Jim Kelly
4
3
John Grande
5
1
Jo Brown
6
1
Emp_id represnts each employee
Manager_id represents parent dimention
Q? How can you represnt the same hierarchy in the traditional
concept hierarchy?
Three kinds of data warehouse applications
 Information processing
 supports querying, basic statistical analysis, and reporting using
cross-tabs, tables, charts and graphs
 Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling, pivoting
 Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
 Differences among the three tasks

Why online analytical mining?
High
quality of data in data warehouses – DW contains integrated,
consistent, cleaned data
Available information processing structure surrounding data warehouses
– ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP
tools
OLAP-based exploratory data analysis (mining with drilling, dicing,
pivoting, etc.)
On-line selection of data mining functions – integration and swapping of
multiple mining functions, algorithms, and tasks.
Q&A
???