Architecture and Infrastructure
Download
Report
Transcript Architecture and Infrastructure
Components and
Architecture
CS 543 – Data Warehousing
Architecture
What are the key components of a data warehouse?
Architecture is the structure that binds the components
into an integrated whole
DW architecture provides
the overall framework for
developing and deploying DW solutions
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
2
Architectural Components
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
3
Distinguishing Characteristics
Different objectives and scope
Data content
Complex analysis and quick response
Flexible and dynamic
Metadata driven
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
4
Architecture Supporting Flow of Data
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
5
Technical Architecture
The technical architecture of a DW is the complete set
of functions and services provided within its
components
Functions
Services
Rules
and procedures
Data stores
Tools are the means to implement an architecture
Architecture comes
first, then the tools; select the appropriate
tools based on the architecture
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
6
Data Acquisition (1)
This component includes
Extraction
Transfer
into staging area
Preparation for loading (transformation, cleansing, and
integration)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
7
Data Acquisition (2)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
8
Data Acquisition – Functions and Services (1)
Data extraction
Select data sources and determine the types of filters to apply to
individual sources
Generate automatic extract files from operational systems using
replication and other techniques
Create intermediary files to store selected data to be merged later
Transport extracted files from multiple platforms
Provide automated job control services for creating extract files
Reformat input from outside sources, departmental files, databases, and
spreadsheets
Resolve inconsistencies for common data elements from multiple sources
Generate common application code for data extraction
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
9
Data Acquisition – Functions and Services (2)
Data transformation
Map
input data to data for DW repository
Clean data, remove duplicates, merge/purge
De-normalize extracted data structures as required by the
dimensional model of the DW
Convert data types
Calculate and derive attribute values
Check for referential integrity
Aggregate data as needed
Resolve missing values
Consolidate and integrate data
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
10
Data Acquisition – Functions and Services (3)
Data staging
Provide
backup and recovery for staging area repository
Sort and merge files
Create files as input to make changes to dimension tables
If staging area storage is a relational database, create and
populate database
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
11
Data Storage
This architectural component covers the process of
loading the prepared data from the data staging area
into the data warehouse repository
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
12
Data Storage – Functions and Services
Load data for full refreshes of DW tables
Perform incremental loads at regular prescribed intervals
Support loading into multiple tables at the detailed and
summarized levels
Optimize the loading process
Provide automated job control services for loading the data
warehouse
Provide backup and recovery for the DW database
Provide security
Monitor and fine-tune the database
Periodically archive data from the database according to preset
conditions
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
13
Information Delivery (1)
This architectural component spans a broad spectrum
of many different methods of making information
available to the users of the DW
To the users, information delivery is the DW; it is the
front-end through which the users retrieve information
from the DW
Information
Online
queries and interactive analyses
Regular and ad-hoc reports
Specialized applications (e.g. executive information system)
Data mining
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
14
Information Delivery (2)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
15
Information Delivery – Functions and Services
Provide security to control information access
Monitor user access to improve service and for future enhancements
Allow users to browse data warehouse content
Simplify access by hiding internal complexities of data storage from users
Automatically reformat queries for optimal execution
Enable queries to be aware of aggregate tables for faster results
Govern queries and control runaway queries
Provide self-service report generation for users
Store result sets for queries and reports for future use
Provide multiple levels of data granularity
Provide event triggers to monitor data loading
Make provision for the users to perform complex analysis
Enable data feeds to downstream, specialized data support systems such as
EIS and data mining
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
16
Infrastructure Supporting Architecture
The architecture defines the functions and services; the
infrastructure defines the elements to support the
architecture
Infrastructure is the foundation supporting the
architecture
Hardware
servers
OSs
Data
management systems
Networking elements
Supporting tools and applications
People
Procedures
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
17
Operational Infrastructure
Operational infrastructure includes
People
Procedures
Training
Management
software
Operational infrastructure are the people and
procedures that keep the DW functioning, and not
those who develop the DW
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
18
Physical Infrastructure (1)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
19
Physical Infrastructure (2)
Physical infrastructure includes
Computing
hardware (e.g. server)
OS and utilities
Networking hardware and software
Software tools
Decisions about the physical infrastructure are critical
for a DW. Two principles
Leverage
as much of the existing physical infrastructure
Keep the infrastructure as modular as possible
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
20
Hardware and Operating System
Hardware
Scalability
Support
Vendor reference
Vendor stability
Operating system
Compatibility
Scalability
Security
Reliability
Availability
Preemptive multitasking
Multi-threaded approach
Memory protection
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
21
Single Platform Option
Simplest option, where all functions and services are
performed by a single computing platform
Typically used by small to medium sized companies
who have mainframes or large Unix servers already in
use with capacity to spare
Some shortcomings of using mainframes
Stretched to
capacity
Non availability of tools
Multiple legacy platforms
Company’s migration policy
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
22
Hybrid Option
Most companies opt for the hybrid option where
multiple platforms are used for data warehousing (data
acquisition, data storage, information delivery)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
23
Data Extraction
Data extraction
Initial reformatting and merging
Performed on the staging area platform
Validation and final quality check
Also performed on source system platform
Transformation and consolidation
Best performed on each source system’s own computing platform
Extract files are reformatted and merged into a smaller number of files
performing verification against the source system
Initial data cleansing
Best performed on each source system’s own computing platform
Performed on the staging area platform
Creation of load images
Performed on the staging area platform
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
24
Options for the Data Staging Area
In one of the legacy platforms
On the data storage platform
On a separate optional platform
You
can optimize the platform for complex transformations
and cleaning
Install specialized tools for transformations and cleaning
Keep track of entire data content in the staging area
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
25
Data Movement
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
26
Client/Server Architecture (1)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
27
Client/Server Architecture (2)
Application server (middle tier)
To
run middleware and establish connectivity
To execute management and control software
To handle data access from the Web
To manage metadata
For authentication
As front end
For managing and running standard reports
For sophisticated query management
For OLAP applications
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
28
Maturing of the Infrastructure
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
29