Building Data Warehousing - The Institute of Finance Management

Download Report

Transcript Building Data Warehousing - The Institute of Finance Management

Building Data Warehousing
CIT858: Data Mining & Data Warehousing
Bajuna Salehe
The Institute of Finance Management:
Computing and IT Dept.
Data Transformation
• Data extracted from transactional databases
must go through several kinds of data
transformation on its way to a data
warehouse:
– Data from different transactional databases being
merged to form the data warehouse tables
– Data will often be aggregated as it is being
extracted from the transactional databases and
prepared for the data warehouse
– Units of measure used for attributes in different
transactional databases must be reconciled as
they are being merged into common data
warehouse tables
Data Transformation
– Coding schemes used for attributes in different
transactional databases must be reconciled as
they are being merged into common data
warehouse tables
– Sometimes values from different attributes in
transactional databases are combined into a single
attribute in the data warehouse (e.g., employee
name)
Data Loading
• After all of the extracting, cleaning, and
transforming, the data is ready to be loaded
into the data warehouse
• Data will be loaded into a “loading” or
working area in the database
– Some of the previous steps may have been done
in the database
– Data may have to go through a number of stages
dividing up the data and merging with other data
– When the above has been done the Star Schemas
are populated with the new, time specific data
Data Loading (cont…)
• A schedule for regularly updating the data
warehouse must be put in place
– Frequency of updates is important
– Time taken to get to this point is important
Data Warehouse Queries
• Types of queries that a data warehouse is
expected to answer ranges from the relatively
simple to the highly complex and is dependent on
the type of end-user access tools used
• End-user access tools include:
–
–
–
–
Reporting, query, and application development tools
Executive information systems (EIS)
OLAP tools
Data mining tools
Steps in Building DW
• Users specify information needs
• Analysts and users create a logical and
physical design
• Sources of data is scrubbed, extracted and
transformed
• Data is transferred and loaded into the
warehouse periodically
• Users are given the access to warehouse
• The warehouse is maintained in terms of
changing requirements
Data Warehouse Schema/Dimensional
Modeling
• The entity-relationship data model (ERD) is
commonly used in the design of relational
databases, where a database schema consists
of a set of entities and the relationships
between them.
• Such a data model is appropriate for on-line
transaction processing (OLTP)
Data Warehouse Schema/Dimensional
Modeling
• A data warehouse, however, requires a
concise, subject-oriented schema that
facilitates on-line data analysis (OLAP).
• The most popular data model for a data
warehouse is a multidimensional model.
Dimensional Modeling
• Dimensional modeling/Multidimensional
Modeling is a new method of designing
database for analytical purpose.
– In dimensional model design; simplicity and
understandability of data is crucial to help
business users (e.g CEOs and other top leaders) to
analyse the business performance and make right
decision.
Dimensional Modeling
• Data warehouses and OLAP tools are based on
a dimensional data model.
• This model views data in the form of a data
cube.
• A data cube allows data to be modeled and
viewed in multiple dimensions. It is defined by
dimensions and facts.
Dimensional Modeling
• For example, AllElectronics may create a sales
data warehouse in order to keep records of
the store’s sales with respect to the
dimensions time, item, branch, and location.
• These dimensions allow the store to keep
track of things like monthly sales of items and
the branches and locations
Dimensional Modeling
• A multidimensional data model is typically
organized around a central theme, like sales,
for instance.
• This theme is represented by a fact table.
Facts are numerical measures which look like
quantities by which we want to analyze
relationships between dimensions.
Dimensional Modeling
• Examples of facts for a sales data warehouse
include shillings sold (sales amount in
shillings), units sold (number of units sold),
and amount budgeted.
• The fact table contains the names of the facts,
or measures, as well as keys to each of the
related dimension tables.
Dimensional Modeling
• Data cube can be of 2 – D or 3 – D geometric
structures.
• For example in the company AllElectronics
sales data for items sold per quarter in the city
of Arusha. The data is in 2 – D representation.
• The sales for Arusha are shown with respect
to the time dimension (organized in quarters)
and the item dimension (organized according
to the types of items sold).
Dimensional Modeling
• The fact or measure displayed is shillings sold
(in thousands).
• You can view the sales data with the third
dimension. For instance, you can view data
according to time and item as well as location
for the cities Dar, Tanga, Mwanza, Kagera and
Lindi.
Dimensional Modeling
• These 3 – D data are represented as a series of
2-D tables or in the form of a 3-D data cube.
Data Warehouse Schema/Dimensional
Modeling
• Dimensional model can exist in the following
forms
– a star schema
– a snowflake schema
– a fact constellation schema.
• The major focus will be on the star schema
which is commonly used in the design of many
data warehouse.
Star Schema
• This is the most common modeling paradigm
for designing data warehouse.
• In this model a data warehouse consists of:
– a large central table (fact table) containing the
bulk of the data, with no redundancy
– a set of smaller attendant tables (dimension
tables), one for each dimension.
• The diagram below show an example of star
schema
Star Schema
• Star schema of a data warehouse for sales.
Star Schema
• A star schema for AllElectronics sales is shown in
Figure in the above slide. Sales are considered along
four dimensions namely, time, item, branch, and
location.
• The schema contains a central fact table for sales
that contains keys to each of the four dimensions,
along with two measures: dollars sold and units sold.
• To minimize the size of the fact table, dimension
identifiers (such as time key and item key) are
system-generated identifiers.
Star Schema
• Notice that in the star schema, each
dimension is represented by only one table,
and each table contains a set of attributes.
• For example, the location dimension table
contains the attribute set {location key, street,
city, province or state, country}
Note on Dimensional Modeling
• In doing dimensional modeling for designing
DW there are generally four steps:– Select the business process to model.
– Select the grain for business process
– Choose the dimension that apply to each fact
table.
– Identify the numeric facts/measurements that will
populate each fact table row.
Typical Data Warehouse Queries
• Examples include:
– What was total IFM revenue in 3rd quarter of 2006?
– What was total revenue for property sales for each
type of property in Tanzania in 2006?
– What are the three most popular areas in each city for
the renting of property in 2003 and how does this
compare with the figures for the previous two years?
– What would be effect on property sales in the
different regions of Europe if legal costs went up by
3.5% and Government taxes went down by 1.5% for
properties over €250,000?
– What is monthly revenue for property sales at each branch
office, compared with rolling 12-monthly prior figures?
Benefits Of Data Warehousing
• Gives the data you want, in a suitable
format
• Removes inconsistency of reporting
• Gives one consistent picture of the data.
i.e. It provides single manageable
structure for decision support data.
• Potential high returns on investment
Benefits Of Data Warehousing
• Enable users to run complex queries on
data that traverses a number of business
areas.
• Competitive advantage
• Increased productivity of corporate
decision-makers