The Data Warehouse

Download Report

Transcript The Data Warehouse

Lecture 1
Themes in this session
• An introduction to the data warehouse
• Decision making in organisations
• The use of data and information in organisations
An introduction to the data
warehouse
The Data Warehouse
Definition: ” A single, complete and consistent source of data
obtained from a variety of sources and made available to end
users in a way that they can understand and use in a business
context” - Barry Devlin
The data warehouse is an architecture for the provision
of information !
The data contained in the data warehouse is...
– subject oriented
– integrated
– non-volatile
– time variant
– transformed
– accessible
Subject-oriented
Operational Systems
Informational Systems
Sales
System
Customer
Data
Payroll
System
Employee
Data
Purchasing
System
Vendor
Data
Integrated
Operational Systems
Informational Systems
Marketing
System
Order
System
Billing
System
Customer
Data
Time variant
Operational Systems
Informational Systems
Order
System
Customer
Data
60-90 days
5-10 years
Non-volatile
Operational Systems
Create
Update
Informational Systems
Delete
Order
System
Load
Insert
Customer
Data
Access
The Data Related Problems Behind the Data
Warehouse
Data in organisations often has the following
characteristics...
– massive volume
– dispersed
– difficult to access
– badly integrated
– complex data structures
– not suitable for high level business queries
The Information Needs Behind the Data
Warehouse
Organisations need information which is...
–
–
–
–
–
–
more holistic in its coverage of the business
selected and enriched
easily accessible
more easily understandable
of a high quality
directly applicable to the decision situation
The integration problem
DB
DBMS
DB
DBMS
End user
application A
End user
application B
DB
DB
DBMS
DBMS
End user
application C
The integrated data warehouse
DB
End user
application A
DB
DB
Extraction
DB
DW
DBMS
End user
application B
DB
External
DB
End user
application C
Characteristics of the integrated data
warehouse
• Contained data represents a holistic view of the
organisation
• All contained data is for use in ”knowledge discovery”
– current data
– historical data
– aggregated data
•
•
•
•
Data usually time-stamped
Data often pre-processed for expected queries
Does not burden operational systems
Own database management environment
Some important terminology
• granularity - the level of detail of data contained in the
data warehouse
• aggregation - a summation of detailed data to a less
detailed level
• de-normalisation - the abandonment of the rules of
normalisation and allowance of repetitive storage of
data
• dimension - a business perspective from which data is
looked upon
A generic five-tier data warehouse
architecture
Data Flow
DB
DB
Extraction
DB
DB
DW
DBMS
Middleware
DB
Legacy
Layer
Extraction
Layer
Database
Layer
Middleware Applications
Layer
Layer
The legacy layer
DB
DB
DB
DB
DB
DB
Common legacy layer characteristics...
• online source data often in OLTP systems
• often dispersed
• discrepancies between different data sources
• often contains inconsistencies
• ”twinkling databases”
• already taxed by the normal operations of the
organisation
• number of different types of data structures present
The extraction layer
Often the most complex layer in the architecture,
involves...
• data selection
• retrieval and conversion of data
• data cleansing and data scrubbing
• data enrichment
• data aggregation
DB
DB
DB
DB
DB
DB
Extraction
The database layer
The database layer is characterised by...
DW
DB
• storage of data
DBMS
• processing of queries
• data warehouse DBMS
– relational database management system (RDBMS)
– modified relational database management system
– multi-dimensional database management system
(MDDBMS)
– object database management system (ODBMS)
• the storage of metadata
The middleware layer
A technological platform for accessing the database,
involves the distribution and control of enquiries and
resulting data flows...
•
•
•
•
client-server
LAN
WAN
Intranet
DB
DW
DBMS
Middleware
The application layer
Interaction with system users through a number of
decision support applications...
• standard reporting
• ad-hoc query processing
• statistical analysis
• simulations - ”what if”
Middle• OLAP
ware
• data mining
The data mart issue
A small-subject oriented warehouse
• provides information for a specific sub-group of the
enterprise’s business analyses
• far narrower scope than an enterprise wide data warehouse
• far smaller data volumes
• more manageable
• faster and easier to fine tune the design of the data mart
• allows a “piecemeal” approach to some of the enormous
integration problems involved in creating an enterprise wide
data model
Data mart evolution
Operational Systems
Sales
System
Payroll
System
Purchasing
System
Marketing
System
Order
System
Billing
System
Data Mart
Customer
Data
Data marts based on an enterprise wide data
warehouse
Data Mart
Operational Systems
Sales
System
Payroll
System
Purchasing
System
Marketing
System
Order
System
Billing
System
Data
Warehouse
Avoid the data mart explosion
Data
Mart
Data
Mart
Data
Mart
Alternative DBMS’s for the database layer
A closer look at the three of the four major DBMS
alternatives, namely...
• relational database management system (RDBMS)
• modified relational database management system
• multi-dimensional database management system
(MDDBMS)
DB
DW
DBMS
The relational database management
system (RDBMS)
• based on standard, normalised relational tables
• known technology, many supporting applications,
portable
• standard query interface (SQL)
• supports easy summations and calculations
• can support very large databases
• can be slow when processing complex queries
• established suppliers
DB
DW
DBMS
The modified RDBMS
•
•
•
•
•
•
uses star-join schema based data structure
expanded SQL, good for business queries
provides a more readily understandable interface
specially designed for quick access and fast calculations
highly indexed
often used in data marts (ie, dw with a limited subject
area)
• good market reputation
• demands good knowledge of users information needs
DB
DW
DBMS
Star-join schema in a modified
RDBMS
Time Dimension
Product Dimension
time_key
time_attributa A
time_attribute B
product_key
product_attribute A
product_attribute B
Sales Fact
Promotion Dimension
promotion_key
promo_attribute A
promo_attribute B
time_key
promotion_key
store_key
product_key
dollar_sales
units_sales
dollar_cost
customer_count
Store Dimension
store_key
store_attribute A
store_atribute B
Star schema representing a data perspective
Dimension table or
single dimensional
data subject
Fact table or
multidimensional data
subject
The Multi-Dimensional DBMS
(MDDBMS)
• Uses a ”metacube” as standard data structure
• data stored as an array with any number of possible
dimension
• Optimised for OLAP applications
• often only compatible with proprietary systems
• no branch standards
• demands new competence
• can have bad performance with large data volumes
DB
DW
DBMS
Metadata
Data about data
Main functions are to give...
• data definitions
• the origin of data
• the structure of data
• rules for the selection and transfer of data
• qualitative and quantitative data about data
The metadata repository
An integrated complete source of metadata
• is at the heart of the data warehouse architecture
• supports the information needs of...
– system developers
– data administrators
– system administrators
– users
– applications on the data warehouse
• very complex data structure
• must contain full version history
• must always be up to date
Decision making in organisations
The basics of organisational decision
making
• decisions involve choices between alternative courses of
action in a given decision situation
• decision making relates to organisational purposes and
should serve the interests of the organisation
• decisions are made on behalf of the organisation by a
decision making unit, these can be anything from an
individual to an entire organisation
• decision making occurs in a specific environment which
has a great impact on the decision making process and
the choices made
Decision making models
• Provide a framework for interpreting decision
making in an organisation
• help articulate the organisational aspects of the
environment in which decision making occurs
• can help suggest what form of decision support is
best suited to an organisation
4 organisational decision making
models
•
•
•
•
The rational model
The political/competitive model
The garbage can model
The program model
Rational
Political
competitive
Garbage
can
Program
The rational model
• Involves organisational units using information rationally
in order to make choices on behalf of the organisation
• decisions based on sound analysis of possible outcomes
and their expected value to the organisation
• decision unit chooses those options that lead to the
outcome with the maximum expected value
• the model is widely spread but there is some scepticism
as to the applicability of the model in complex,
organisational decision situations
Rational
Political
competitive
Garbage
can
Program
The political/competitive model
• Decision making occurs in an environment where
decisions are consequences of strategies and tactics of
decision units trying to influence decisions so that they
result in choices favourable to themselves
• different stakeholders in the same decision process
have different goals
• decision units promote the choice of options that they
believe will result in an outcome which best serves their
own purposes
• the interests of the organisation are often subordinate
to the interests of the decision
Political
Rational
competitive
units involved in the process
Garbage
can
Program
The garbage can model
• Organisational decisions are consequences of intersections
of problems looking for solutions, solutions looking for
problems, and opportunities for making decisions
• participants in the decision making process look for
opportunities to apply tried and tested solutions to problems
they are confronted with
• the organisational memory records which solutions are
available and to what sorts of problems they were applicable
• this model highlights the role of chance and timing in
organisational decision making and is quite different from
the rational model
Rational
Political
competitive
Garbage
can
Program
The program model
• A model based on the presumption that decision makers
are to some extent incapable and unwilling to take
decisions
• decision making is affected by standard operating
procedures, group norms, and action-constraining
organisational programs
• these standard practices result in a “programming” of
decision makers
• decision making behaviour at time t will probably be the
same as decision making behaviour at time t-1
Rational
Political
competitive
Garbage
can
Program
Problems with organisational decision
making
• limited rationality in the choice of alternatives
• inconsistent preferences for decision outcomes
• chaotic, fragmented, and coincidental decision
processes
• decision making as a sacred ritual
• conflict of interest
The organisational decision making
environment
• Available knowledge - more and increasing
– knowledge will become more available and more
proliferous
• Complexity - more and increasing
– specialization and diversity in systems accompanied
by increasing interdependence
• Turbulence - more and increasing
– the rate of change in the environment is continually
increasing
The difficulties of managerial problem
solving
• managerial problems are by nature indefinite
– not easily identified and in some cases not even possible to
agree on what the problem is
• most managerial problems are interconnected
– acting towards the solution of one problem may adversely
affect another problem, the sub-optimisation problem
• managerial problems come in an infinite variety of
shapes and sizes
• managerial problems most often have no clear cut
solution
• the factors which affect managerial problems are often
unbounded
The use of data and information in
organisations
Retrieving information
• Scanning (exploration)
– general browsing of several data sources in order to
spot variations or find a specific piece of information
• Focused search (exploitation)
– searching for data in order to answer a specific
question or solve a given problem
The rational use of information
• Information’s main role is reducing uncertainty
– ignorance about future consequences of current
actions
– ignorance of the knowledge possessed by others and
their probable actions
• The cost of information should never exceed the
value of the benefits it delivers
– this can /theoretically) be achieved by comparing the
situation with a specific piece of information as
compared to the situation without that same
information
The real (ab)use of information
• Information overload and the distribution of worthless
information
• Information is used for ex-ante substantiation of
decisions
• Information is power and power is rarely shared
• Information seeking is a ritual where the organisations
decision making history decided which information is
“right”
• Information can be imbued with multiple meanings
depending on the intentions of the person citing the
information
• Soft is best
Some of the problems of providing
information
• What is the organisations “culture” of information
use?
• What decision situations are going to be supported
by the information and which factors are “relevant”
to these decisions?
• How will the decision situations evolve and how will
this affect the information requirements?
• How can the information be presented so as to be
minimally unambiguous and maximally accessible ?