Transcript Lecture IV
Decision Support Systems
1201311
Data Warehousing
Chattrakul Sombattheera
Agenda
•
•
•
•
•
•
•
Definitions and Concepts
Process Overview
Architecture
Data Integration and the ETL Processes
Development
Real-Time Data Warehousing
Administration and Security Issues
Definitions and Concepts
• A data warehouse (DW) is a pool of data produced to
support decision making; it is also a repository of current
and historical data of potential interest to managers
throughout the organization.
• Data are usually structured to be available in form ready
for analytical processing activities (e.g., online analytical
processing [OLAP], data mining, querying, reporting,
other decision support applications).
• A data warehouse is a subject-oriented, integrated, timevariant, nonvolatile collection of data in support of
management’s decision-making process.
Characteristics of DW
• Subject-oriented. Data are organized by detailed subject, such as
sales, products, or customers, containing only information relevant
for decision support that allows decision makers to both know how
their business performs and why.
– A DW differs from typical DBs that they are more product-oriented and
are tuned to handle transactions of DBs. Subject orientation allows a
more comprehensive view of organization
• Integrated. Data are from different sources and must be in a
consistent form. DWs must deal with naming conflicts and
discrepancies among units of measure.
• Time variant (time series). A DW usually contains historical data
(containing multiple time points, e.g. daily, weekly, monthly), except
in real-time systems. They detect trends, deviations, long-term
relationships for forecasting and comparisons, leading to decision
making.
• Nonvolatile. Users cannot change data in DWs. Obsolete data are
discarded, and changes are recorded as new data.
Additional DW Characteristics
• Web-based. DWs are typically web-based
application.
• Relational/Multidimensional. DWs use either a
relational structure or multidimensional structure.
• Client/server. DWs use client/server architecture
to provide easy access to end users.
• Real-time. Newer DWs provide real-time or
active, data access and analysis capabilities.
• Include metadata. DWs contain metadata (data
about data) about how data organized and how
to effectively use them.
Parts of DWs.
• Data Marts. A data mart is a subset of a DW, typically consisting of a
single subject area (e.g. marketing, operations).
• Operational Data Stores (ODS). An ODS provides a fairly recent
form of customer information file (CIF) and is used for short-term
decision involving mission-critical applications. The contents of an
ODS are updated through the course of business operations.
• Enterprise Data Warehouses (EDW). An EDW is a large scale DW
that is used across the enterprise for decision support. EDW are
used to provide data for many types of DSS, including customer
relation management (CRM), supply chain management (SCM),
business performance management (BPM), business activity
monitoring (BAM), product lifecycle management (PLM), revenue
management, knowledge management systems (KMS), etc.
DW Process Overview
•
•
•
•
•
•
Data sources. Data come from various sources including legacy systems,
external data providers, online transaction processing (OLTP), enterprise
resource planning (ERP) system, Web logs, etc.
Data extractions. Data are extracted using custom-written or commercial
software (ETL).
Data loading. Data are loaded into staging area, where they are
transformed and cleansed.
Comprehensive database. EDWs support all decision analysis by providing
relevant summarised and detailed information originating from many
different sources.
Metadata. Medata are maintained so that they can be accessed by IT
personnel and users. Metadata include software programs about data and
rules for organizing data summaries that are easy to index and search.
Middleware tools. Middleware tools enable access to the data warehouse.
Middleware tools include SQL, Business Object, applications (data mining,
OLAP, reporting tools, data visualize tools), etc.
DW Process Overview
Applications
Custom-suit
applications
Access
Replication
Legacy
Select
OLTP
Transform
Metadata
reports
Integrate
External
Maintain
Preparation
Operational
systems/data
Enterprise data
warehouse
Target database
[HDB, MDDB]
Data
mart
Data
mart
Data
mart
API Middleware
Extract
Production
reporting
tools
Relational
query tools
OLAP/
hOLAP
Web
browsers
Data
mining
Information
visualization
DW Architectures
Application Server
Database Server
Client Workstation
Three-tier DW
DW Architectures
Database
Application & Database Server
Client Workstation
Two-tier DW
DW Architectures
Web
browser
Client
Web pages
Application
Server
Web server
Data
warehouse
Internet, intranet,
and/or extranet
Web-based DW
DW Architectures
Source Systems
ETL
Central
metadata
Central data
warehouse
Data
analysis
Data mart
Enterprise Data Warehousing
Architecture
RDBMS
Data
analysis
DW Architecture
Source Systems
ETL
RDBMS
Sales Data
mart
Local Metadata
RDBMS
Financial
Data mart
Local Metadata
Data Mart Architecture
MDB
Human
Resource
Data mart
Local Metadata
DW Architectures
Source Systems
Central
metadata
ETL
RDBMS
RDBMS
Data
analysis
Local Metadata
Data
analysis
Local Metadata
Hub and Spoke Data Mart
Architecture
MDB
Data
analysis
Local Metadata
DW Architectures
Source Systems
ETL
Central data
warehouse
Data
analysis
RDBMS
Data mart
Data
analysis
Enterprise Warehouse and
Operational Data Store
Central
metadata
Operational
data store
OLTP tools
DW Architectures
Source Systems
ETL
Central data
warehouse
Data
analysis
RDBMS
Data mart
Data
analysis
Enterprise Warehouse and
Operational Data Store
Central
metadata
Operational
data store
OLTP tools
DW Architectures
Source Systems
ETL
ETL
ETL
Data mart
Central DW
Data
analysis
Data
analysis
Distributed Data Warehouse
Architecture
10 Factors for choosing DW
architecture
• Information interdependence between organizational
units
• Upper management’s information needs
• Urgency of need for a data warehouse
• Nature of end-user tasks
• Constraints on resources
• Strategic view of the data warehouse prior to
implementation
• Compatibility with existing systems
• Perceived ability of the in-house IT staff
• Technical issues
• Social/Political factors
Data Integration
• Data integration comprises three major
processes that, when correctly implemented,
permit data to be accessed and made
accessible to an array of ETL and analysis tools
and data warehousing environment:
– Data access (i.e., the ability to access and extract
data from any data source),
– Data federation (i.e., the integration of business views
across multiple data stores), and
– Change capture (i.e., based on the indentification,
cpature, and delivery of the changes made to
enterprise data sources).
Data Integration Techniques
• Enterprise application integration (EAI) provides a vehicle for
pushing data from source systems into the data warehouse.
• EAI involves integrating application functionality and is focused on
sharing functionality (rather than data) across systems, thereby
enabling flexibility and reuse.
• Traditional EAP focuses on enabling application reuse at the
programming level, whereas modern EAP uses services-oriented
architecture (SOA).
• Enterprise information integration (EII) proposes real-time data
integration from a variety of sources, such as relational databases,
Web services, and multidimensional databases.
• EII tools use predefined metadata to populate views that make
integrated data appear relational to end users. XML seems to be the
most appropriate tool to define metadata.
Extraction, Transformation, and
Load (ETL)
•
•
ETL is the heart of DW.
ETL is composed of
– Extraction: reading data from one or more databases,
– Transformation: converting the extracted data from its previous form into the form
in which it needs to be so that it can be placed into a data warehouse or simply
another database, and
– Load: putting the data into datawarehouse.
•
•
•
Transformation occurs by using rules or lookup tables or by combining the
data with other data.
The three database functions are integrated into one tool to pull data out of
one or more databases and place them into another, consolidated database
or a data warehouse.
ETL tools transport data between sources and targets, document how data
elements (e.g. metadata) change as the move between source and target,
exchange metadata with other application as needed, and administer all
runtime processes and operations (e.g., scheduling, error management,
audit logs, statistics).
ETL Process
Transient data
source
Data
warehouse
Packaged
application
Legacy
system
Extract
Transform
Cleanse
Other internal
applications
Load
Data mart
ETL Process
DW Development Approaches
• The inmon Model: The EDW Approach
• The Kimbell Model: The Data Mart
Approach
DW Structure: Star Schema
Driver
Automotive
Claim Information
Location
Time
Dimension: How data will be
accessed (e.g. by location,
time period, type of
automobile or driver
Facts: Central table that
contains summarized
(usually) information; also
contains foreign keys to
access each dimension table
Star Shema
DW Implementation Issues
The following is a list of major tasks that could be performed:
• Establishment of service-level agreements and data-refresh
requirements
• Identification of data sources and their governance policies
• Data quality planning
• Data model design
• ETL tool selection
• Relational database software and platform selection
• Data transport
• Data conversion
• Reconciliation process
• Purge and archive planning
• End-user support
DW Implementation Issues
Some of the best practices for implementing a data warehouse:
• The project must fit with corporate strategy and business objectives.
• There must be complete buy-in to the project by executives,
managers, and users.
• It is important to manage user expectation about the completed
project.
• The DW must be built incrementally.
• Build in adaptability.
• The project must be managed by both IT and business
professionals.
• Develop a business/supplier relationship.
• Only load data that have been cleansed and are of a quality
understood by the organization.
• Do not overlook training requirements.
• Be politically aware.
DW Implementation issues
Risks in implementing DW:
•
No mission or objective
•
Quality of source data unknown
•
Skills not in place
•
Inadequate budget
•
Lack of supporting software
•
Source data not understood
•
Weak sponsor
•
Users not computer literate
•
Political problems or turf wars
•
Unrealistic user expectations
•
Architectural and design risks
•
Scope creep and changing requirements
•
Vendors out of control
•
Multiple platforms
•
Key people leaving the project
•
Loss of the sponsor
•
Too much new technology
•
Having to fix an operating system
•
Geography distributed environment
•
Team geography and language culture
DW Implementation Issues
When developing a DW, the followings are “not to” things:
• Starting with the wrong sponsorship chain
• Setting expectations that you cannot meet and frustrating executives at the
moment of truth
• Engaging in politically naïve behavior
• Loading the warehouse with information just because it is available
• Believing that data warehousing database design is the same as
transactional database design
• Choosing a data warehouse manager who is technology oriented rather
than user oriented.
• Focusing on traditional internal record-oriented data and ignoring the value
of external data and of text, images, and, perhaps, sound and video.
• Delivering data with overlapping and confusing definitions.
• Believing promises of performance, capacity, and scalability.
• Believing that your problems are over when the data warehouse is up and
running.
• Focusing on ad hoc data mining and periodic reporting instead of alerts.
Massive DW and Scalability
• Scalability is the ability of a system to maintain its level
of performance when the size of input grows.
• When the size of input in DWs grows, DWs must be able
to handle the number of users and the complexity of their
queries.
• Wal-Mart use hundreds of terabytes of data to study
sales trends and other tasks.
• US Defence use 5 petabyte DW to maintain the medical
records of its 9 million personnel.
• Good scalability of DW means that queries and other
data access functions will grow (ideally) linearly with the
size of the DW.
Real-Time Data Warehousing
• Traditionally, DWs work mainly on historical data to support strategic
and tactical decision making.
• For many business, making fast and consistent decisions across the
enterprise requires real-time data warehousing. Decision support
has become operational.
• Real-time data warehouse (RDW), also known as active data
warehouse (ADW), is the process of loading and providing data via
the data warehouse as they become available.
• At a basic level, ADW simply report what happened. At the next
level, some analysis occurs. As the system evolves, it provides
prediction capabilities, which lead to the next level of
operationalization.
• At its highest evolution, ADW is capable of making events happen
(e.g. activities such as creating sales and marketing compaigns or
idnetifying and exploiting opportunities).
Traditional vs Active DW
•
•
•
•
•
•
Strategic decision only
Results sometimes hard to
measure
Daily, weekly, monthly data
currency acceptable; summaries
often appropriate
Moderate user concurrency
•
•
•
•
Highly restrictive reporting used to •
confirm or check existing
processes and patterns; often uses
predeveloped summary tables or
data marts
•
Power users, knowledge workers,
internal users
Strategic, tactic and Operational
decisions
Results measured with operations
Only comprehensive detailed data
available within minutes is
acceptable
High number (1000 or more) of
users accessing and querying the
system simultaneously
Flexible ad hoc reporting, as well
as machine-assisted modeling
(e.g., data mining) to discover new
hypotheses and relationships
Operational staffs, call centers,
external users
DW Administration and Security
Issues
• Establishing effective corporate security policies and
procedures. An effective security policy should start at
the top, with executive management, and should be
communicated to all individuals within the organization.
• Implementing logical security procedures and techniques
to restrict access. This includes user authentication,
access controls, and encryption technology.
• Limiting physical access to the data center environment.
• Establishing an effective internal control review process
with an emphasis on security and privacy.