
Data Warehousing
for BS(CS)
Lecture 1-2: DW & Need for DW
Khurram Shahzad
[email protected]
Department of Computer Science
Course Objectives

At the end of the course you will (hopefully) be able to answer these questions:
 Why exactly does the world need a data warehouse?
 How does a DW differ from traditional databases and RDBMSs?
 Where does OLAP stand in the DW picture?
 What are the different DW and OLAP models/schemas? How are they implemented and tested?
 How is ETL performed? What is data cleansing, how is it done, and what are the well-known algorithms?
 Which DW architectures have been reported in the literature? What are their strengths and weaknesses?
 Which areas of research and development are stemming out of the DW domain?
2
Course Material

Course Book
 Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley & Sons Inc., NY.
Reference Books
 W.H. Inmon, Building the Data Warehouse (Second Edition), John Wiley & Sons Inc., NY.
 Ralph Kimball and Margy Ross, The Data Warehouse Toolkit (Second Edition), John Wiley & Sons Inc., NY.
3
Assignments



Implementation/Research on important concepts.
To be submitted in groups of 2 students.
Include
1. Modeling and benchmarking of multiple warehouse schemas
2. Implementation of an efficient OLAP cube generation algorithm
3. Data cleansing and transformation of legacy data
4. Literature review paper on
 View consistency mechanisms in data warehouses
 Index design optimization
 Advanced DW applications
May add a couple more
4
Lab Work

Lab Exercises. To be submitted individually
5
Course Introduction
What is this course about?
 The Decision Support Cycle:
 Planning – Designing – Developing – Optimizing – Utilizing
6
Course Introduction
[Figure: Three-tier data warehouse architecture. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded, and refreshed into the data warehouse server and data marts (Tier 1); these serve OLAP servers (Tier 2), e.g., MOLAP and ROLAP, which in turn serve client tools (Tier 3) for analysis, query/reporting, and data mining.]
7
Operational Sources (OLTP’s)






Operational computer systems provide the information needed to run day-to-day operations and answer routine daily questions, but…
 Also called online transaction processing (OLTP) systems
 Data is read or manipulated with each transaction
 Transactions/queries are simple and easy to write
 Usually serve middle management
 Examples
 Sales systems
 Hotel reservation systems
 COMSIS
 HRM applications
 Etc.
8
Typical decision queries




Data sets are mounting everywhere, but they are not directly useful for decision support
 Decision making requires complex questions asked of integrated data
 Enterprise-wide data is desired
 Decision makers want to know, for example:
 Where should the new oil warehouse be built?
 Which market should they strengthen?
 Which customer groups are most profitable?
 What are the total sales by month/quarter/year for each office?
 Is there any relation between promotion campaigns and sales growth?
 Can OLTP answer all such questions, efficiently?
9
Information crisis

Integrated


Easily accessible with intuitive access paths and responsive for analysis
Credible


Information must be accurate and must conform to business rules
Accessible


Must have a single, enterprise-wide view
Data Integrity


*
Every business factor must have one and only one value
Timely

Information must be available within the stipulated time frame
* Paulraj 2001.
10
Data Driven-DSS*
* Farooq, lecture slides for ‘Data Warehouse’ course
11
Failure of old DSS








Inability to provide strategic information
 IT receives too many ad hoc requests, creating a large backlog
 Requests are not only numerous, they also change over time
 More reports are needed for more understanding
 Users end up in a spiral of reports
 Users have to depend on IT for information
 Cannot provide adequate performance; responses are slow
 Strategic information has to be flexible and conducive to analysis
12
OLTP vs. DSS
Trait             | OLTP                           | DSS
User              | Middle management              | Executives, decision-makers
Function          | For day-to-day operations      | For analysis & decision support
DB (modeling)     | E-R based, after normalization | Star-oriented schemas
Data              | Current, isolated              | Archived, derived, summarized
Unit of work      | Transactions                   | Complex query
Access type       | DML, read                      | Read
Access frequency  | Very high                      | Medium to low
Records accessed  | Tens to hundreds               | Thousands to millions
Quantity of users | Thousands                      | Very small number
Usage             | Predictable, repetitive        | Ad hoc, random, heuristic-based
DB size           | 100 MB–GB                      | 100 GB–TB
Response time     | Sub-seconds                    | Up to minutes
13
Expectations of new soln.









DB designed for analytical tasks
 Data from multiple applications
 Easy to use
 Ability to perform what-if analysis
 Read-intensive data usage
 Direct interaction with the system, without IT assistance
 Contents updated periodically and otherwise stable
 Current & historical data
 Ability for users to initiate reports
14
DW meets expectations






Provides enterprise view
Current & historical data available
Decision queries possible without affecting operational sources
Reliable source of information
Ability for users to initiate reports
Acts as a data source for all analytical applications
15
Definition of DW
Inmon defined
“A DW is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of decision-making”.
Kelly said
“Separately available, integrated, time-stamped, subject-oriented, non-volatile, accessible”
Four properties of DW
16
Subject-oriented





In operational sources data is organized by applications, or business processes.
 In a DW the subject is the method of organization
 Subjects vary with the enterprise
 These are the critical factors that affect business performance
Example of Manufacturing Company



Sales
Shipment
Inventory etc
17
Integrated Data


Data comes from several applications
Problems of integration comes into play



In addition to internal, external data sources





File layout, encoding, field names, systems, schema, data
heterogeneity are the issues
Bank example, variance: naming convention, attributes for data item,
account no, account type, size, currency
External companies data sharing
Websites
Others
Removal of inconsistency
So process of extraction, transformation & loading
18
Time variant






Operational data holds current values
 Comparative analysis is one of the best techniques for business performance evaluation
 Time is a critical factor for comparative analysis
 Every data structure in the DW contains a time element
 To promote a product, an analyst has to know about its current and historical values
The advantages are



Allows for analysis of the past
Relates information to the present
Enables forecasts for the future
19
Non-volatile






Data from operational systems is moved into the DW after specific intervals
 Data is persistent / not removed, i.e., non-volatile
 Individual business transactions do not update the DW
 Data in the DW is not deleted
 Nor is it changed by individual transactions
Properties summary
Subject-Oriented
Organized along the lines of the subjects of the corporation. Typical subjects are customer, product, vendor and transaction.
Time-Variant
Every record in the data warehouse is time-stamped in one form or another, i.e., has some form of time variancy attached to it.
Non-Volatile
Refers to the inability of data to be updated by routine transactions.
20
Lecture 2
DW Architecture & Dimension Modeling
Khurram Shahzad
[email protected]
21
Agenda





Data Warehouse architecture & building blocks
 ER modeling review
 Need for dimensional modeling
 Dimensional modeling and its internals
 Comparison of ER with dimensional modeling
22
Architecture of DW
[Figure: Three-tier DW architecture (as before), now showing the staging area between the information sources and the data warehouse server.]
23
Components
Major components
 Source data component
 Data staging component
 Data storage component
 Information delivery component
 Metadata component
 Management and control component
24
1. Source Data Components

Source data can be grouped into 4 components
 Production data
 Comes from the operational systems of the enterprise
 Some segments are selected from it
 Narrow scope, e.g., order details
 Internal data
 Private spreadsheets, documents, customer profiles, etc.
 E.g., customer profiles for a specific offering
 Special strategies are needed to transform it for the DW (text documents)
 Archived data
 Old data is archived
 The DW holds snapshots of historical data
 External data
 Executives depend upon external sources
 E.g., market data of competitors, or data on new car models for a car rental company; conversions have to be defined
25
Architecture of DW
[Figure: Three-tier DW architecture repeated, highlighting the staging area.]
26
2. Data Staging Components



After data is extracted, it has to be prepared
 Data extracted from the sources needs to be changed, converted, and made ready in a suitable format
 Three major functions make the data ready
 Extract
 Transform
 Load
 The staging area provides a place and a set of functions to
 Clean
 Change
 Combine
 Convert
27
Architecture of DW
[Figure: Three-tier DW architecture repeated, highlighting the data warehouse storage and data marts.]
28
3. Data Storage Components





Separate repository
 Data structured for efficient processing
 Redundancy is increased
 Updated after specific periods
 Read-only access
29
Architecture of DW
[Figure: Three-tier DW architecture repeated, highlighting the OLAP servers and client tools.]
30
4. Information Delivery Component


Delivers information to end-user tools, e.g., mining, OLAP, query/reporting
 Authentication issues
 Active monitoring services
 Performance monitoring: the DBA notes which aggregates are selected and adjusts storage accordingly
 User performance
 Aggregate awareness
31
DW Design
32
Designing DW
[Figure: Three-tier DW architecture repeated, as the reference for the design discussion.]
33
Background (ER Modeling)



For ER modeling, entities are collected from the environment
 Each entity acts as a table
 Reasons for its success
 Normalization is applied after ER, since it removes redundancy (to handle update/delete anomalies)
 But the number of tables is increased
 It is useful for fast access to small amounts of data
34
ER Drawbacks for DW / Need of Dimensional
Modeling








ER models are hard to remember, due to the increased number of tables
 Complex for queries involving multiple tables (table joins)
 Conventional RDBMSs are optimized for a small number of tables, whereas a large number of tables might be required in a DW
 Ideally there are no calculated attributes
 The DW does not need to update data the way an OLTP system does, so there is no need for normalization
 OLAP is not the only purpose of a DW; we need a model that facilitates integration of data, data mining, and historically consolidated data
 An efficient indexing scheme is needed to avoid scanning all the data
De-normalization (in DW)
 Add primary keys
 Direct relationships
 Re-introduce redundancy
35
Dimensional Modeling





Dimensional modeling focuses on subject orientation, i.e., the critical factors of the business
 Critical factors are stored in facts
 Redundancy is not a problem; it is accepted to achieve efficiency
 A logical design technique for high performance
 It is the modeling technique used for DW storage
36
Dimensional Modeling (cont.)

Two important concepts
 Fact




Numeric measurements that represent a business activity/event
 May be pre-computed and redundant (derived facts)
Example: Profit, quantity sold
Dimension


Qualifying characteristics, perspective to a fact
Example: date (Date, month, quarter, year)
37
Dimensional Modeling (cont.)





Facts are stored in a fact table
 Dimensions are represented by dimension tables
 Dimensions are the perspectives along which facts can be judged
 Each fact is surrounded by dimension tables
 The layout looks like a star, hence the name Star Schema
38
Example
TIME: time_key (PK), SQL_date, day_of_week, month
STORE: store_key (PK), store_ID, store_name, address, district, floor_type
CLERK: clerk_key (PK), clerk_id, clerk_name, clerk_grade
PRODUCT: product_key (PK), SKU, description, brand, category
CUSTOMER: customer_key (PK), customer_name, purchase_profile, credit_profile, address
PROMOTION: promotion_key (PK), promotion_name, price_type, ad_type
FACT: time_key (FK), store_key (FK), clerk_key (FK), product_key (FK), customer_key (FK), promotion_key (FK), dollars_sold, units_sold, dollars_cost
39
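A minimal SQL sketch of this star schema (the table/column names follow the figure; the lowercase *_dim/_fact names, the data types, and the restriction to three of the six dimensions are assumptions for illustration — the clerk, customer, and promotion dimensions follow the same pattern):

CREATE TABLE time_dim (
    time_key    INT PRIMARY KEY,
    sql_date    DATE,
    day_of_week VARCHAR(10),
    month       VARCHAR(10)
);

CREATE TABLE store_dim (
    store_key  INT PRIMARY KEY,
    store_id   VARCHAR(10),
    store_name VARCHAR(50),
    address    VARCHAR(100),
    district   VARCHAR(50),
    floor_type VARCHAR(20)
);

CREATE TABLE product_dim (
    product_key INT PRIMARY KEY,
    sku         VARCHAR(20),
    description VARCHAR(100),
    brand       VARCHAR(50),
    category    VARCHAR(50)
);

-- The fact table carries only foreign keys to the dimensions plus the numeric measures;
-- its primary key is the concatenation of the dimension keys.
CREATE TABLE sales_fact (
    time_key     INT REFERENCES time_dim(time_key),
    store_key    INT REFERENCES store_dim(store_key),
    product_key  INT REFERENCES product_dim(product_key),
    dollars_sold DECIMAL(12,2),
    units_sold   INT,
    dollars_cost DECIMAL(12,2),
    PRIMARY KEY (time_key, store_key, product_key)
);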
Inside Dimensional Modeling

Inside Dimension table








Key attribute of the dimension table, used for identification
 Large number of columns, i.e., a wide table
 Non-calculated, textual attributes
 Attributes are not directly related to each other
 Un-normalized in the Star schema
 Drill-down and roll-up are two ways of exploiting dimensions
 Can have multiple hierarchies
 Relatively small number of records
40
Inside Dimensional Modeling

Has two types of attributes
 Key attributes, for connections
 Facts
Inside the fact table
 Concatenated key
 Grain or level of data is identified
 Large number of records
 Fewer attributes
 Sparse data set
 Degenerate dimensions (e.g., order number, needed for measures such as average products per order)
 Fact-less fact tables
41
Star Schema Keys

Primary keys
 The identifying attribute in a dimension table
 Relationship attributes combine together to form the primary key
Surrogate keys
 Replacement for the operational primary key
 System generated
Foreign keys
 The fact table holds a collection of the primary keys of the dimension tables
Primary key of the fact table
 Collection of the dimension P.K.s (the concatenated foreign keys), or
 System generated
42
Advantage of Star Schema



Easy for users to understand
 Optimized for navigation (fewer joins → faster)
 Most suitable for query processing
Karen Corral, et al. (2006) The impact of alternative diagrams on
the accuracy of recall: A comparison of star-schema
diagrams and entity-relationship diagrams, Decision Support
Systems, 42(1), 450-468.
43
Normalization [1]


“It is the process of decomposing a relational table into smaller tables.”
Normalization goals:
1. Remove data redundancy
2. Store only related data in a table (so data dependencies make sense)
There are 5 normal forms
The decomposition must be lossless
44
1st Normal Form [2]
“A relation is in first normal form if and only if every attribute is single-valued for each tuple”

STU_ID | STU_NAME     | MAJOR        | CREDITS | CATEGORY
S1001  | Tom Smith    | History      | 90      | Comp
S1003  | Mary Jones   | Math         | 95      | Elective
S1006  | Edward Burns | CSC, Math    | 15      | Comp, Elective
S1010  | Mary Jones   | Art, English | 63      | Elective, Elective
S1060  | John Smith   | CSC          | 25      | Comp
45
1st Normal Form (Cont.)

STU_ID | STU_NAME     | MAJOR   | CREDITS | CATEGORY
S1001  | Tom Smith    | History | 90      | Comp
S1003  | Mary Jones   | Math    | 95      | Elective
S1006  | Edward Burns | CSC     | 15      | Comp
S1006  | Edward Burns | Math    | 15      | Elective
S1010  | Mary Jones   | Art     | 63      | Elective
S1010  | Mary Jones   | English | 63      | Comp
S1060  | John Smith   | CSC     | 25      | Comp
46
Another Example (composite key:
SID, Course) [1]
47
1st Normal Form Anomalies [1]



Update anomaly: We need to update all six rows for the student with ID=1 if we want to change his location from Islamabad to Karachi
 Delete anomaly: Deleting the information about a student who has graduated will remove all of his information from the database
 Insert anomaly: To insert information about a student, that student must be registered in a course
48
Solution → 2nd Normal Form
“A relation is in second normal form if and only if it is in first normal form and all the nonkey attributes are fully functionally dependent on the key” [2]
 In the previous example, the functional dependencies are [1]
SID → campus
campus → degree
49
Example in 2nd Normal Form [1]
50
Anomalies [1]


Insert anomaly: We cannot enter a program (for example PhD for the Peshawar campus) unless a student gets registered
 Delete anomaly: Deleting a row from the “Registration” table will delete all information about a student as well as the degree program
51
Solution → 3rd Normal Form
“A relation is in third normal form if it is in second normal form and no nonkey attribute is transitively dependent on the key” [2]
 In the previous example the transitive dependency is [1]
campus → degree
52
Example in 3rd Normal Form [1]
53
Denormalization [1]


“Denormalization is the process” of selectively transforming normalized relations into un-normalized form with the intention to “reduce query processing time”
 The purpose is to reduce the number of tables and thereby the number of joins in a query
54
Five techniques to denormalize
relations [1]





Collapsing tables
Pre-joining
Splitting tables (horizontal, vertical)
Adding redundant columns
Derived attributes
55
Collapsing tables (one-to-one) [1]
For example, Student_ID, Gender in Table 1 and
Student_ID, Degree in Table 2
56
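A hedged SQL sketch of collapsing the two one-to-one tables mentioned above into a single table (the names table1/table2/student_info and the CREATE TABLE ... AS SELECT form are assumptions; the syntax is supported by dialects such as MySQL and PostgreSQL):

-- Collapse two tables that share the key Student_ID into one wider table
CREATE TABLE student_info AS
SELECT t1.student_id,
       t1.gender,
       t2.degree
FROM   table1 t1
JOIN   table2 t2 ON t1.student_id = t2.student_id;

After collapsing, queries that previously needed a join can read a single table.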
Pre-joining [1]
57
Splitting tables [1]
58
Redundant columns [1]
59
Updates to Dimension Tables
60
Updates to Dimension Tables (Cont.)


Type-I changes: correction of errors, e.g.,
customer name changes from Sulman
Khan to Salman Khan
Solution to type-I updates:
 Simply update the corresponding
attribute/attributes. There is no need to
preserve their old values
61
Updates to Dimension Tables (Cont.)


Type 2 changes: preserving history
 For example, a change in the “address” of a customer: if the user wants to see orders by geographic location, you cannot simply overwrite the old address with the new value; you need to preserve the history (the old value) as well as insert the new value
62
Updates to Dimension Tables (Cont.)

Proposed solution:
63
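The slide's proposed solution is not captured in this transcript; the following is a hedged sketch of the usual Type 2 approach, in which the current row is closed and a new row with a new surrogate key is inserted (customer_dim and its effective/expiry-date and current-flag columns are assumed names):

-- Close the currently valid row for this customer
UPDATE customer_dim
SET    row_expiry_date = DATE '2024-06-30',
       is_current      = 0
WHERE  customer_id = 1001
AND    is_current  = 1;

-- Insert a new row, with a new surrogate key, holding the new address
INSERT INTO customer_dim
       (customer_key, customer_id, customer_name, address,
        row_effective_date, row_expiry_date, is_current)
VALUES (98765, 1001, 'Salman Khan', 'New address, Karachi',
        DATE '2024-07-01', DATE '9999-12-31', 1);

Fact rows loaded after the change reference the new surrogate key, so orders can still be grouped by the address that was valid when they occurred.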
Updates to Dimension Tables (Cont.)


Type 3 changes: When you want to compare
old and new values of attributes for a given
period
Please note that in Type 2 changes the old
values and new values were not comparable
before or after the cut-off date (when the
address was changed)
64
Updates to Dimension Tables (Cont.)
Solution: Add a new column that holds the old value of the attribute
65
Updates to Dimension Tables (Cont.)


What if we want to keep a whole history of
changes?
Should we add large number of attributes to
tackle it?
66
Rapidly Changing Dimension


When a dimension has a very large number of records/rows and changes are required frequently, Type 2 change handling is not recommended
 It is recommended to move the rapidly changing attributes into a separate table
67
Rapidly Changing Dimension (Cont.)



“For example, an important attribute for customers
might be their account status (good, late, very late,
in arrears, suspended), and the history of their
account status” [4]
“If this attribute is kept in the customer dimension
table and a type 2 change is made each time a
customer's status changes, an entire row is added
only to track this one attribute” [4]
“The solution is to create a separate account_status
dimension with five members to represent the
account states” [4] and join this new table or
dimension to the fact table.
68
Example
69
Junk Dimensions


Sometimes there are some informative flags
and texts in the source system, e.g., yes/no
flags, textual codes, etc.
If such flags are important then make their
own dimension to save the storage space
70
Junk Dimension Example [3]
71
Junk Dimension Example (Cont.) [3]
72
The Snowflake Schema




Snowflaking involves normalization of the dimensions in a Star schema
 Reasons:
 To save storage space
 To optimize some specific queries (for attributes with low cardinality)
73
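A hedged sketch of snowflaking the product dimension from the earlier star schema: the low-cardinality category attribute is moved into its own normalized table (the table and column names are illustrative assumptions):

CREATE TABLE category_dim (
    category_key  INT PRIMARY KEY,
    category_name VARCHAR(50)
);

-- The snowflaked product dimension now references the category table
CREATE TABLE product_dim_sf (
    product_key  INT PRIMARY KEY,
    sku          VARCHAR(20),
    description  VARCHAR(100),
    brand        VARCHAR(50),
    category_key INT REFERENCES category_dim(category_key)
);

Category text is stored once per category instead of once per product, at the cost of an extra join at query time.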
Example 1 of Snowflake Schema
74
Example 2 of Snowflake Schema
75
Aggregate Fact Tables


Use aggregate fact tables when summarizing the required results involves too many rows of the base fact table
 The objective is to reduce query processing time
76
Example
Total Possible Rows = 1825 * 300 * 4000 * 1 = 2 billion
77
Solution
Make aggregate fact tables: since you may be summarizing over some dimensions and not others, there is no need to keep the highest level of granularity for dimensions whose detail is not required.
 For example: sales of a product per year, OR total number of items sold by category on a daily basis
78
A way of making aggregates
Example:
79
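The slide's worked example is not in the transcript; as a hedged illustration, an aggregate fact table that sums daily sales up to product category (dropping store and clerk detail) could be built as follows, using the table names assumed earlier:

CREATE TABLE sales_by_category_daily AS
SELECT t.sql_date,
       p.category,
       SUM(f.units_sold)   AS units_sold,
       SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   time_dim    t ON f.time_key    = t.time_key
JOIN   product_dim p ON f.product_key = p.product_key
GROUP BY t.sql_date, p.category;

Queries about category-level sales then scan this much smaller table instead of the base fact table.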
Making Aggregates

But first determine what is required from your
data warehouse then make aggregates
80
Families of Stars
81
Families of Stars (Cont.)

Transaction (day to day) and snapshot tables (data after
some specific intervals)
82
Families of Stars (Cont.)

Core and custom tables
83
Families of Stars (Cont.)

Conformed Dimension: The attributes of a dimension
must have the same meaning for all those fact tables
with which the dimension is connected.
84
Extract, Transform, Load (ETL)



Extract only relevant data from the internal or external source systems, instead of dumping all data (avoiding a “data junkhouse”)
 ETL can take up to 50-70% of the total effort of developing a data warehouse
 The ETL effort depends on various factors, which will be elaborated as we proceed in the lectures on ETL
85
Major steps in ETL
86
Data Extraction


Data can be extracted using third-party tools or in-house programs or scripts
 Data extraction issues:
1. Identify the sources
2. Method of extraction for each source (manual, automated)
3. When and how frequently data will be extracted from each source
4. Time window
5. Sequencing of extraction processes
87
How data is stored in operational
systems


Current value: values continue to change as daily transactions are performed. We need to monitor these changes to maintain history for the decision-making process, e.g., bank balance, customer address, etc.
 Periodic status: sometimes the history of changes is maintained in the source system itself
88
Example
89
Data Extraction Method

Static data extraction:
1. Extract the data as it is at a certain point in time
2. It includes all transient data and periodic data along with its time/date status at the extraction time point
3. Used for initial data loading
Data of revisions:
1. Data is loaded in increments, thus preserving the history of both changing and periodic data
90
Incremental data extraction


Immediate data extraction: involves data extraction in (near) real time.
Possible options:
1. Capture through transaction logs
2. Capture through triggers / stored procedures
3. Capture via the source application
4. Capture on the basis of time and date stamps
5. Capture by comparing files
91
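A hedged sketch of option 4 (time/date-stamp based capture): only rows changed since the previous extraction run are pulled from the source (the orders table, its last_updated column, and the literal cut-off value are assumed for illustration):

-- Pull only the rows modified after the last successful extraction
SELECT *
FROM   orders
WHERE  last_updated > TIMESTAMP '2024-07-01 00:00:00';

In practice the cut-off value would be read from a control table that records when the previous extraction finished.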
Data Transformation


Transformation means to integrate or consolidate data from the various sources
Major tasks:
1. Format conversions (change in data type, length)
2. Decoding of fields (1, 0 → male, female)
3. Calculated and derived values (e.g., profit from units sold, price, and cost)
4. Splitting of single fields (e.g., “House no 10, ABC Road, 54000, Lahore, Punjab, Pakistan” into its components)
5. Merging of information (information from different sources regarding an entity or attribute)
6. Character set conversion
92
Data Transformation (Cont.)
8. Conversion of units of measure
9. Date/time conversion
10. Key restructuring
11. De-duplication
12. Entity identification
13. Multiple source problem
93
Data Loading


Determine when (time) and how (as a whole or in chunks) to load data
Four modes to load data:
1. Load: removes old data if present, otherwise loads the data
2. Append: the old data is not removed; the new data is appended to it
3. Destructive merge: if the primary key of the new record matches the primary key of an old record, update the old record
4. Constructive merge: if the primary key of the new record matches the primary key of an old record, do not update the old record; just add the new record and mark it as the superseding record
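A hedged sketch of a destructive merge using the standard SQL MERGE statement (the staging table name and columns are assumed; some engines, e.g., MySQL, use INSERT ... ON DUPLICATE KEY UPDATE instead of MERGE):

-- Matched rows are overwritten (destructive merge); unmatched rows are inserted
MERGE INTO customer_dim tgt
USING customer_staging src
ON (tgt.customer_id = src.customer_id)
WHEN MATCHED THEN
    UPDATE SET tgt.customer_name = src.customer_name,
               tgt.address       = src.address
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, address)
    VALUES (src.customer_id, src.customer_name, src.address);

A constructive merge would instead keep the matched old row and insert the new row flagged as the superseding record.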
Data Loading (Cont.)

Data Refresh Vs. Data Update
Full refresh reloads whole data after deleting old data and
data updates are used to update the changing attributes
Data Loading (Cont.)

Loading for dimension tables: you need to define a mapping between the source system key and the system-generated (surrogate) key in the data warehouse; otherwise you will not be able to load/update data correctly
Data Loading (Cont.)

Updates to dimension table
Data Quality Management



It is important to ensure that the data is correct in order to make the right decisions
 Imagine that a user working on the operational system is entering wrong region codes for customers. Imagine the business has never sent an invoice using these region codes (so nobody notices). What will happen if the data warehouse uses these codes to make decisions?
 You need to put in proper time and effort to ensure data quality
Data Quality
What is Data: An abstraction/representation/
description of something in reality
 What is Data Quality: Accuracy + Data must
serve its purpose/user expectations

Indicators of quality of data



Accuracy: Correct information, e.g., address
of the customer is correct
Domain Integrity: Allowable values, e.g.,
male/female
Consistency: The content and its form are the same across all source systems, e.g., if the product code of product ABC is 1234 in one system, it must be 1234 for that particular product in the other systems as well
Indicators of quality of data (Cont.)




Redundancy: Data is not redundant, if for some
reason for example efficiency the data is redundant
then it must be identified accordingly
Completeness: There are no missing values in any
field
Conformance to Business rules: Values are
according to the business constraints, e.g., loan
issued cannot be negative
Well defined structure: Whenever the data item can
be divided in components it must be stored in terms
of components/well structure, e.g., Muhammad
Ahmed Khan can be structured as first name, middle
name, last name. Similar is the case with addresses
Indicators of quality of data (Cont.)




Data anomaly: A field must contain only the kind of value for which it was created, e.g., a State field cannot take a city name
Proper Naming convention
Timely: timely data updates as required by
user
Usefulness: The data elements in data
warehouse must be useful and fulfill the
requirements of the users otherwise data
warehouse is not of any value
Indicators of quality of data (Cont.)

Entity and Referential Integrity: Entity integrity
means every table must have a primary key
and it must be unique and not null.
Referential integrity enforces parent child
relationship between tables, you can not
insert a record in child table unless you have
a corresponding record in parent table

SQL Server Integration Services (SSIS)
Problems due to quality of data

Businesses ranked data quality as the biggest problem
during data warehouse designing and usage
Possible causes of low quality data



Dummy values: For example, to pass a check
on postal code, entering dummy or not
precise information such as 4444 (dummy) or
54000 for all regions of Lahore
Absence of data values: For example not a
complete address
Unofficial use of field: For example writing
comments in the contact field of the customer
Possible causes of low quality data


Cryptic information: At one time the operational system was using ‘R’ for “remove”, later for “reduced”, and at some other point for “recovery”
 Contradicting values: compatible fields must not contradict each other, e.g., the two fields ZIP code and City can have the values 54000 and Lahore, but not some other city name for ZIP code 54000
Possible causes of low quality data


Violation of business rules: e.g., an issued loan is negative
 Reused primary keys: For example, a business has a 2-digit primary key, so it can have at most 100 customers. When the 101st customer comes, the business might archive the old customers and assign the new customer a primary key from the start. It might not be a problem for the operational system, but you need to resolve such issues because the DW keeps historical data.
Possible causes of low quality data



Non-unique identifiers: For example different
product codes in different departments
Inconsistent values: one system is using
male/female to represent gender while other
system is using 1/0
Incorrect values: Product Code: 466, Product Name: “Crystal vase”, Height: “500 inch”. The product and height values are not compatible: either the product name or the height is wrong, or maybe the product code as well
Possible causes of low quality data

Erroneous integration: A person might be both a buyer and a seller to your business. Your customer table might store this person with ID 222 while the seller table uses ID 500. In the data warehouse you might need to integrate this information, and records with the same ID in both tables might not refer to the same person
Sources of data pollution



System migrations or conversions: For example, from a manual system → flat files → hierarchical databases → relational databases…
 Integration of heterogeneous systems: more heterogeneity means more problems
 Poor database design: for example, lack of business rules, lack of entity and referential integrity
Sources of data pollution (Cont.)


Incomplete data entry: For example, no city information
 Wrong information during entry: For example, “Unted Kingdom” entered instead of “United Kingdom”
Data Cleansing
Examples of data cleansing tools
 Data Ladder, http://www.dataladder.com/products.html
 Human Inference, http://www.humaninference.com/master-data-management/dataquality/data-cleansing
 Google Refine, http://code.google.com/p/google-refine/
 Wrangler, http://vis.stanford.edu/wrangler/
 Text Pipe Pro, http://www.datamystic.com/textpipe.html#.UKjm9eQ3vyo
Information Supply



It deals with how you will provide information
to the users of the data warehouse.
It must be easy to use, appealing, and
insightful
For successful data warehouse you need to
identify users, and when, what, where, and
how (in which form) they need information
Data warehouse modes of usage


Verification mode: The users interact with the
system to prove or disprove a hypothesis
Discovery mode: Without any prior
hypothesis. The intention is to discover new
things or knowledge about the business. The
data mining techniques are used here.
Approaches of interaction with data
warehouse

Informational approach:
1. With queries and reporting tools
2. The data might be summarized
3. Can perform statistical analysis using current and historical data
Analytical approach:
1. Used for analysis in terms of drill-up, drill-down, slice, and dice
Data mining approach:
1. Used for knowledge discovery of hidden patterns
Classes of users


We need to identify the classes of users in order to establish a firm basis for what information to provide to them and how
 A typical example of classifying users
User classes (cont.)



Naive users
Regular users: daily users but cannot make
queries and reports themselves. They need
query templates and predefined reports
Power users: Technically sound users, who
can make queries, reports, scripts, import
and export data themselves
User classes from other perspective




High-level Executives and Managers: Need
standard pre-processed reports to make strategic
decisions
Technical Analysts: complex analysis, statistical
analysis, drill-down, slice, dice
Business Analysts: comfortable with technology but
cannot make queries, can modify reports to get
information from different angles.
Business-Oriented users: Predefined GUIs, and
might be support for some ad-hoc queries
The ways of interaction




Preprocessed reports: routine reports which are
delivered at some specific interval
Predefined queries and templates: The users can
use own parameters with predefined queries
templates and reports with predefined format
Limited ad-hoc access: few and simple queries
which are developed from scratch
Complex ad-hoc access: complicated queries and
analysis. Can be used as a basis for predefined
reports and queries
Information delivery framework
Online Analytical Processing (OLAP)
OLAP is used for fast and complex analysis on a data warehouse
 It is not a database design technique but only a category of applications
 Definition by Dr. E. F. Codd
“On-Line Analytical Processing (OLAP) is a category of software
technology that enables analysts, managers and executives to
gain insight into data through fast, consistent, interactive
access in a wide variety of possible views of information that
has been transformed from raw data to reflect the real
dimensionality of the enterprise as understood by the user.”
 So what is the difference between data warehouse and
OLAP?

OLAP (cont.)
What is the solution?



Should we write all queries in advance?
Should we ask users to write SQL
themselves?
The solution is to facilitate all kinds of
aggregates across multi-dimensions
OLAP as FASMI [1] (by Nigel Pendse)





Fast: delivery of information at a fairly constant rate; most queries are answered within 5 seconds
 Analysis: support for analysis
 Shared: ensures security in a multi-user environment
 Multi-dimensional
 Information: access to all information necessary for the application
Major features of OLAP

1.
2.




Dimensional Analysis
What are cubes?
What are hyper-cubes?
Drill-down and Roll-up
Drill through
Slice and Dice
Pivoting
Dimensional Analysis


OLAP supports multi-dimensional analysis
 A cube has three dimensions (X-axis, Y-axis, Z-axis)
 The same view can be shown on a spreadsheet
Multidimensional Analysis (Spreadsheet
view) / Hypercube
Drill-down and Roll-up [5]
Drill through
Slicing [5]
Dicing [5]
Pivoting [5]
A SQL server view for cube [6]
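The figures for these operations are not in the transcript; as a hedged ROLAP-style illustration, roll-up and slice can be expressed as plain SQL over the star schema sketched earlier (an OLAP tool or MDX front end would normally generate such queries):

-- Roll-up: sales summed up from day to month, by product category
SELECT t.month, p.category, SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   time_dim    t ON f.time_key    = t.time_key
JOIN   product_dim p ON f.product_key = p.product_key
GROUP BY t.month, p.category;

-- Slice: fix one member of the store dimension and analyse the rest
SELECT t.month, p.category, SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   time_dim    t ON f.time_key    = t.time_key
JOIN   product_dim p ON f.product_key = p.product_key
JOIN   store_dim   s ON f.store_key   = s.store_key
WHERE  s.store_name = 'Store 1'
GROUP BY t.month, p.category;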
OLAP Implementations [1]




MOLAP (uses multi-dimensional structure OR
cubes)
ROLAP (uses relational database approach
for OLAP)
HOLAP (hybrid approach of MOLAP and
ROLAP)
DOLAP (OLAP for desktop decision support
system)
MOLAP



Uses proprietary file formats to store cubes physically
 Does not use SQL; uses proprietary interfaces such as Microsoft’s MDX (to support drill-down, roll-up, pivoting, etc.)
 Remember that cubes store all possible aggregates, calculated across dimensions and their hierarchies, e.g., the Time dimension and its possible hierarchy: day, week, month, quarter, year
MOLAP advantages [1]


Fast query processing because the
aggregates are already pre-calculated
Valuable functions such as % change,
ranking, etc.
MOLAP disadvantages [1]



Long loading time due to pre-aggregation, especially when there are many dimensions or dimensions with large hierarchies
 Wastage of storage due to sparse cubes; imagine sales of heaters during summer in Sibbi
 Maintenance of cubes with updated values (insert, delete, update): the aggregates need to be recalculated
MOLAP disadvantages (cont.) [1]

Storage issues: A lot of space is needed to store multiple cubes (possibly hundreds), because aggregates are calculated across dimensions and the hierarchies within them. If the dimensions have a large number of rows, physically storing the cubes becomes even more difficult.
Solution to storage problem [1]
But at the expense of processing time, so partition out the non-frequent joins
Solution to storage problem [1]
Virtual cubes: a cube created by joining one or more cubes. It is similar to creating a “view” in relational databases: it is not physically stored but requires processing time when invoked.
Example: one cube stores the number of visitors, a second cube stores purchases on the site; a virtual cube can then calculate the average sales per visit.

Relational OLAP (ROLAP) [1]




ROLAP is facilitated by Star Schema
Results are retrieved using SQL
Cube in ROLAP is supported by Star Schema
(fact tables and dimension tables)
ROLAP is simply querying the star schema,
querying pre-aggregates (aggregate fact
tables), and calculating summaries (at higher
level) of pre-aggregates at run time
Creating tables using SQL queries [1]
The required queries are [1]:
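The slide's queries are not captured in this transcript; the following are illustrative stand-ins (table and column names as assumed earlier) for pre-aggregates at increasing levels of the time hierarchy:

-- 1. Daily sales per product (most detailed pre-aggregate)
CREATE TABLE sales_daily AS
SELECT time_key, product_key, SUM(dollars_sold) AS dollars_sold
FROM   sales_fact
GROUP BY time_key, product_key;

-- 2. Monthly sales per product
CREATE TABLE sales_monthly AS
SELECT t.month, f.product_key, SUM(f.dollars_sold) AS dollars_sold
FROM   sales_fact f
JOIN   time_dim t ON f.time_key = t.time_key
GROUP BY t.month, f.product_key;

-- 3. A yearly table would follow the same pattern (from a year attribute of the time dimension).
-- As the next slide notes, the coarser tables could instead be computed from sales_daily
-- rather than re-scanning the fact table each time.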
Problems with this approach [1]



The number of queries increases as the number of
aggregates increases due to rise in number of
dimensions
Therefore avoid wasteful queries. For example we
could have saved the results of first query in
previous example and could have computed the
other two queries using that result.
Moreover, avoid calculating and saving all possible
aggregates. Such methodology will consume the
storage space and will create the maintenance
problem (same problems as in MOLAP)
Solution to avoid saving all possible
aggregates [1]


Compute dynamically the less detailed
aggregates (e.g., year sales) from the
detailed aggregates (e.g., daily sales) at run
time
In this approach, the query run time might be
slow but will save the storage space
ROLAP problems [1]


Maintenance (as with MOLAP)
Divergence from simple hierarchies: It is
possible that simple hierarchies are not
followed (e.g., product category, product subcategory, product) but instead some other
unconventional hierarchy is followed (e.g.,
sales of green items vs. yellow items OR
sales of items with national flag vs. items
without flag)
ROLAP problems (cont.) [1]

Different semantics:
1. For example, what constitutes a year: 1st July to 30th June, or 1st January to 31st December?
2. Another example is the week: does it end on Monday or Saturday? (In various departments, e.g., at the University of Lahore, the CS department takes its holiday on Monday while others take Saturday off.)
3. Different semantics will force you into creating many pre-aggregates. You have to figure out the best solution on your own (either calculating the pre-aggregates or convincing the various departments to follow the same convention).
ROLAP problems (Cont.) [1]


Storage problem: The combinatorial
explosion, unconventional hierarchies,
different semantics force you to create
various pre-aggregates which consumes a lot
of storage space
Aggregation problem: As a general rule of
thumb make aggregate to higher level if there
is considerable reduction (approx. 10% or
more) in size as compared to the detailed
level table
Reducing pre-aggregates tables [1]



Calculating summaries at run time
Enhancing performance of common queries
at coarser level
Using smart wizards for a trade-off between
performance and storage
Hybrid OLAP (HOLAP) [1]


HOLAP = MOLAP + ROLAP
 Its purpose is to take advantage of both approaches: it answers queries using MOLAP for dimensions with low cardinality, and if the user requires more detailed analysis it uses ROLAP. This intelligent shift between ROLAP and MOLAP is transparent to the users.
Desktop OLAP (DOLAP) [1]

It is used for the remote users or for the
machines with limited processing power. A
subset is copied to the remote machine
Parallel Execution OR Parallelism [1]




Basically, parallelism is used to speed up the response time of queries in a data warehouse
 The processing of a query is done by many processes instead of one
 For example, a query for whole-year sales can be partitioned into four quarters and processed by four processes individually, instead of one process executing sequentially
 However, the divided processes should be independent of each other
Benefits of Parallelism [1]





Large table scans and joins
Creation of large indexes
Partitioned index scans
Bulk inserts, updates, and deletes
Aggregations and copying
When we can parallelize? [1]





“Symmetric multi-processors (SMP), clusters, or Massively Parallel (MPP) systems AND
Sufficient I/O bandwidth AND
Underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%) AND
Sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers”
Speed-Up & Amdahl’s Law [1]





Remember, if the dependencies between processes are high then the performance or speed-up will degrade
 Amdahl’s Law: S = 1 / (f + (1 - f)/N)
 where f = fraction of the task that must be executed sequentially
 N = number of processors
 S = speed-up
Amdahl’s Law (Cont.) [1]



This formula basically computes the
maximum expected speed-up from the
proportion of task that must be executed
sequentially and available resources
Example-1: f = 5% and N = 100 then S = 16.8
Example-2: f = 10% and N = 200 then S =
9.56
Amdahl’s Law (Cont.) [1]
Parallel hardware architecture [1]

Symmetric Multiprocessing (SMP): shares
memory and I/O disks. In following figure p1,
p2, p3, p4 are processors. Supported by
UNIX and Windows NT
Parallel hardware architecture (cont.) [1]

Distributed Memory or Massively Parallel
Processing (MPP)
Non-uniform Memory Access (NUMA)

NUMA deals with a shared memory architecture in which memory levels have different access timings: L1 cache → L2 cache → L3 cache → local memory → remote memory. This way the data bus is not overloaded as it is in the case of SMP.
Parallel Software Architectures [1]

Shared disk RDBMS Architecture (Shared
Everything)
Parallel Software Architectures (cont.) [1]


Shared disk RDBMS Architecture Advantage: It
does not fail as a unit as compared to shared
memory architecture. Remember, in case of failure
in the RAM the Symmetric Multiprocessing (SMP)
fails.
Shared disk RDBMS Architecture
Disadvantage: It requires some distributed locking
management, so that the data remains consistent
when two or more processes ask for updates to
the same data (data buffer). This requirement
leads to serialization (degradation in performance)
as more and more locking is required
Parallel Software Architectures (cont.) [1]

Shared Nothing RDBMS Architectures
Parallel Software Architectures (cont.) [1]
Shared Nothing RDBMS Architectures
Basically, the tables are partitioned logically (e.g., a horizontal split) and stored on separate disks. When a query requires data from more than one disk, the data is joined at run time and shown to the user. Each machine is responsible for accessing and locking its own data individually.

Parallel Software Architectures (cont.) [1]


Shared Nothing RDBMS Architecture
Advantage: No distributed locking of data or
serialization problem
Shared Nothing RDBMS Architecture Disadvantage: Failure of one node causes part of the whole system to fail.
Partitioning Techniques [1]



Ranges: e.g., >2000 and <=2000
 List: e.g., Europe, Asia, North America…
 Hashing: use of a hashing function (implemented by the DBMS) to distribute rows across partitions. It is used when rows cannot be divided using ranges or lists, e.g., in MySQL (example taken from [7]):
CREATE TABLE employees (
    id        INT NOT NULL,
    fname     VARCHAR(30),
    lname     VARCHAR(30),
    hired     DATE NOT NULL DEFAULT '1970-01-01',
    separated DATE NOT NULL DEFAULT '9999-12-31',
    job_code  INT,
    store_id  INT
)
PARTITION BY HASH (store_id)
PARTITIONS 4;
Partitioning Techniques (cont.) [1]

Round-Robin: Equal distribution of rows
across partitions. For example if we have four
partitions. Then first row to first partition,
second to the second partition, third to the
third partition, and fourth to fourth partition
and then repeat the same procedure again
which means fifth row to first partition, sixth to
second and so on.
Some concepts to understand about
parallelism [1]
Some concepts to understand about
parallelism (cont.) [1]

To get maximum speed:
1. Enough resources are needed (CPUs, memory, and data bus)
2. The query coordinator must be quick in preparing the final result
3. The processing nodes should ideally finish their tasks in equal time, otherwise the query coordinator has to wait for the node taking the most time
Spatial parallelism OR temporal
parallelism OR Pipelining [1]

The basic idea is to break a complex task into smaller sub-tasks and execute them concurrently
Spatial parallelism OR temporal
parallelism (cont.) [1]
Spatial parallelism OR temporal
parallelism (cont.) [1]

The speed-up can be calculated as S = (N * T) / (N + T - 1)
 where S = speed-up
 N = number of tasks
 T = time for the execution of one sequence of tasks (i.e., the number of unit-time sub-tasks in the pipeline)
Pipelining limitations



The speed-up can reach a maximum of about 10, due to the nature of relational operators
 Pipelining is not effective when all sub-tasks must be completed before the final result can be produced, such as in sorting and aggregation
 The sub-tasks can take different amounts of time to complete
Partitions with respect to queries



Full table scan
Point queries
Range queries
Partitions with respect to queries (cont.)

In the context of Round-Robin partitioning:
1. The load will be even, so a full table scan will take roughly equal time on each partition
2. For point queries it is not good, because the records are not partitioned using values or ranges
Partitions with respect to queries (cont.)




In the context of the hashing technique:
 If the hashing is uniform, the number of records will be approximately the same on all nodes; otherwise the full table scan becomes skewed
 Point-based queries will execute well if the hashing is done on the queried column; similarly, joins on that column will execute well
 Range queries will not execute well, because the data is not distributed by range across the partitions
Partitions with respect to queries (cont.)

In the context of range partitioning:
1. If the partitions are made properly, meaning the load is the same across partitions, then scan queries will run well
2. Point-based and range-based queries also run well
Indexing




Indexing is an efficient way to access any item
 It speeds up query processing by examining a minimum number of records instead of scanning whole tables
 It provides speed-up without additional, expensive hardware
 Imagine searching for a book by walking the shelves of a library vs. using the catalog
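A minimal hedged sketch (index and table names as assumed earlier): creating an index on the fact table's time key so that date-restricted queries need not scan the whole table.

-- B-tree index on the time foreign key of the fact table
CREATE INDEX idx_sales_time ON sales_fact (time_key);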
Conventional Indexing techniques



Dense Index
Sparse Index
Multilevel Index (B-Tree)
Dense Indexing


For each record, store the key of that record and a pointer to where the record is actually placed on disk
 If the index fits in memory, a lookup requires one I/O operation; if not, performance degrades
Sparse Index


The index is kept for a block of data items
It takes less space but at the expense of
some efficiency in terms of time
Multilevel Indexing


It uses a little more space but it increases efficiency in
terms of time
It is good for queries posing some conditions or range
conditions
Example of B-Tree
B-Tree limitations




Inefficient for a big table (many rows) with few unique values
 “Insertion can be very expensive”
 Poor for text searches
 Does not use a combination of indexes on a table
Hash based indexing


Instead of keeping sequential indexes as in a B-Tree, the index entries are stored using hashes
 It is good for exact-match lookups on a column
Hash indexing (Cont.)
Hash indexing (Cont.)
Example of B-Tree limitation
Conventional indexes disadvantages




Conventional indexing comes in two types, B-Tree and hash indexing. Their disadvantages are as follows:
 They map columns of a table into an index but do not provide any information about joining this column with other tables
 They consume a lot of memory when the number of rows is very large
 These indexes are made for OLTP systems; they do not take advantage of the stability of data in a data warehouse under bulk updates
Questions?
193
References





[1] Abdullah, A.: “Data Warehousing Handouts”, Virtual University of Pakistan.
[2] Ricardo, C. M.: “Database Systems: Principles, Design and Implementation”, Macmillan Coll Div.
[3] Junk Dimension, http://www.1keydata.com/datawarehousing/junkdimension.html
[4] Advanced Topics of Dimensional Modeling, https://mis.uhcl.edu/rob/Course/DW/Lectures/Advanced%20Dimensional%20Modeling.pdf
[5] http://en.wikipedia.org/wiki/OLAP_cube
194
References


[6] http://www.mssqltips.com/sqlservertutorial/2011/processing-dimensions-and-cube/
[7] http://dev.mysql.com/doc/refman/5.1/en/partitioninghash.html