Transcript Lecture III

Decision Support Systems
1201311
Data Warehousing
Chattrakul Sombattheera
Agenda
•
•
•
•
•
•
•
Definitions and Concepts
Process Overview
Architecture
Data Integration and the ETL Processes
Development
Real-Time Data Warehousing
Administration and Security Issues
Definitions and Concepts
• A data warehouse (DW) is a pool of data produced to
support decision making; it is also a repository of current
and historical data of potential interest to managers
throughout the organization.
• Data are usually structured to be available in form ready
for analytical processing activities (e.g., online analytical
processing [OLAP], data mining, querying, reporting,
other decision support applications).
• A data warehouse is a subject-oriented, integrated, timevariant, nonvolatile collection of data in support of
management’s decision-making process.
Characteristics of DW
• Subject-oriented. Data are organized by detailed subject, such as
sales, products, or customers, containing only information relevant
for decision support that allows decision makers to both know how
their business performs and why.
– A DW differs from typical DBs that they are more product-oriented and
are tuned to handle transactions of DBs. Subject orientation allows a
more comprehensive view of organization
• Integrated. Data are from different sources and must be in a
consistent form. DWs must deal with naming conflicts and
discrepancies among units of measure.
• Time variant (time series). A DW usually contains historical data
(containing multiple time points, e.g. daily, weekly, monthly), except
in real-time systems. They detect trends, deviations, long-term
relationships for forecasting and comparisons, leading to decision
making.
• Nonvolatile. Users cannot change data in DWs. Obsolete data are
discarded, and changes are recorded as new data.
Additional DW Characteristics
• Web-based. DWs are typically web-based
application.
• Relational/Multidimensional. DWs use either a
relational structure or multidimensional structure.
• Client/server. DWs use client/server architecture
to provide easy access to end users.
• Real-time. Newer DWs provide real-time or
active, data access and analysis capabilities.
• Include metadata. DWs contain metadata (data
about data) about how data organized and how
to effectively use them.
Parts of DWs.
• Data Marts. A data mart is a subset of a DW, typically consisting of a
single subject area (e.g. marketing, operations).
• Operational Data Stores (ODS). An ODS provides a fairly recent
form of customer information file (CIF) and is used for short-term
decision involving mission-critical applications. The contents of an
ODS are updated through the course of business operations.
• Enterprise Data Warehouses (EDW). An EDW is a large scale DW
that is used across the enterprise for decision support. EDW are
used to provide data for many types of DSS, including customer
relation management (CRM), supply chain management (SCM),
business performance management (BPM), business activity
monitoring (BAM), product lifecycle management (PLM), revenue
management, knowledge management systems (KMS), etc.
DW Process Overview
•
•
•
•
•
•
Data sources. Data come from various sources including legacy systems,
external data providers, online transaction processing (OLTP), enterprise
resource planning (ERP) system, Web logs, etc.
Data extractions. Data are extracted using custom-written or commercial
software (ETL).
Data loading. Data are loaded into staging area, where they are
transformed and cleansed.
Comprehensive database. EDWs support all decision analysis by providing
relevant summarised and detailed information originating from many
different sources.
Metadata. Medata are maintained so that they can be accessed by IT
personnel and users. Metadata include software programs about data and
rules for organizing data summaries that are easy to index and search.
Middleware tools. Middleware tools enable access to the data warehouse.
Middleware tools include SQL, Business Object, applications (data mining,
OLAP, reporting tools, data visualize tools), etc.
DW Process Overview
Applications
Custom-suit
applications
Access
Replication
Legacy
Select
OLTP
Transform
Metadata
reports
Integrate
External
Maintain
Preparation
Operational
systems/data
Enterprise data
warehouse
Target database
[HDB, MDDB]
Data
mart
Data
mart
Data
mart
API Middleware
Extract
Production
reporting
tools
Relational
query tools
OLAP/
hOLAP
Web
browsers
Data
mining
Information
visualization
DW Architectures
Application Server
Database Server
Client Workstation
Three-tier DW
DW Architectures
Database
Application & Database Server
Client Workstation
Two-tier DW
DW Architectures
Web
browser
Client
Web pages
Application
Server
Web server
Data
warehouse
Internet, intranet,
and/or extranet
Web-based DW
DW Architectures
Source Systems
ETL
Central
metadata
Central data
warehouse
Data
analysis
Data mart
Enterprise Data Warehousing
Architecture
RDBMS
Data
analysis
DW Architecture
Source Systems
ETL
RDBMS
Sales Data
mart
Local Metadata
RDBMS
Financial
Data mart
Local Metadata
Data Mart Architecture
MDB
Human
Resource
Data mart
Local Metadata
DW Architectures
Source Systems
Central
metadata
ETL
RDBMS
RDBMS
Data
analysis
Local Metadata
Data
analysis
Local Metadata
Hub and Spoke Data Mart
Architecture
MDB
Data
analysis
Local Metadata
DW Architectures
Source Systems
ETL
Central data
warehouse
Data
analysis
RDBMS
Data mart
Data
analysis
Enterprise Warehouse and
Operational Data Store
Central
metadata
Operational
data store
OLTP tools
DW Architectures
Source Systems
ETL
Central data
warehouse
Data
analysis
RDBMS
Data mart
Data
analysis
Enterprise Warehouse and
Operational Data Store
Central
metadata
Operational
data store
OLTP tools
DW Architectures
Source Systems
ETL
ETL
ETL
Data mart
Central DW
Data
analysis
Data
analysis
Distributed Data Warehouse
Architecture
10 Factors for choosing DW
architecture
• Information interdependence between organizational
units
• Upper management’s information needs
• Urgency of need for a data warehouse
• Nature of end-user tasks
• Constraints on resources
• Strategic view of the data warehouse prior to
implementation
• Compatibility with existing systems
• Perceived ability of the in-house IT staff
• Technical issues
• Social/Political factors
Data Integration
• Data integration comprises three major
processes that, when correctly implemented,
permit data to be accessed and made
accessible to an array of ETL and analysis tools
and data warehousing environment:
– Data access (i.e., the ability to access and extract
data from any data source),
– Data federation (i.e., the integration of business views
across multiple data stores), and
– Change capture (i.e., based on the indentification,
cpature, and delivery of the changes made to
enterprise data sources).
Data Integration Techniques
• Enterprise application integration (EAI) provides a vehicle for
pushing data from source systems into the data warehouse.
• EAI involves integrating application functionality and is focused on
sharing functionality (rather than data) across systems, thereby
enabling flexibility and reuse.
• Traditional EAP focuses on enabling application reuse at the
programming level, whereas modern EAP uses services-oriented
architecture (SOA).
• Enterprise information integration (EII) proposes real-time data
integration from a variety of sources, such as relational databases,
Web services, and multidimensional databases.
• EII tools use predefined metadata to populate views that make
integrated data appear relational to end users. XML seems to be the
most appropriate tool to define metadata.
Extraction, Transformation, and
Load (ETL)
•
•
ETL is the heart of DW.
ETL is composed of
– Extraction: reading data from one or more databases,
– Transformation: converting the extracted data from its previous form into the form
in which it needs to be so that it can be placed into a data warehouse or simply
another database, and
– Load: putting the data into datawarehouse.
•
•
•
Transformation occurs by using rules or lookup tables or by combining the
data with other data.
The three database functions are integrated into one tool to pull data out of
one or more databases and place them into another, consolidated database
or a data warehouse.
ETL tools transport data between sources and targets, document how data
elements (e.g. metadata) change as the move between source and target,
exchange metadata with other application as needed, and administer all
runtime processes and operations (e.g., scheduling, error management,
audit logs, statistics).
ETL Process
Transient data
source
Data
warehouse
Packaged
application
Legacy
system
Extract
Transform
Cleanse
Other internal
applications
Load
Data mart
ETL Process
DW Development Approaches
• The inmon Model: The EDW Approach
• The Kimbell Model: The Data Mart
Approach
DW Structure: Star Schema
Driver
Automotive
Claim Information
Location
Time
Dimension: How data will be
accessed (e.g. by location,
time period, type of
automobile or driver
Facts: Central table that
contains summarized
(usually) information; also
contains foreign keys to
access each dimension table
Star Shema