data warehouse

Download Report

Transcript data warehouse

DATAWAREHOUSE
Introduction
www.notesvillage.com
What is Data Warehousing?
Information
A process of transforming data
into information and making it
available to users in a timely
enough manner to make a
difference
[Forrester Research]
Data
www.notesvillage.com
What are the users saying...
 Data should be integrated across the
enterprise
 Summary data has a real value to the
organization
 Historical data holds the key to
understanding data over time
 What-if capabilities are required
www.notesvillage.com
What is a Data Warehouse?
A single, complete and consistent
store of data obtained from a
variety of different sources made
available to end users in a what
they can understand and use in a
business context.
[Barry Devlin]
www.notesvillage.com
Data, Data everywhere
yet ...
 I can’t find the data I need
 data is scattered over the network
 many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from
www.notesvillage.com
one form to other
A producer wants to know….
Which are our
lowest/highest margin
customers ?
Who are my customers
and what products
are they buying?
What is the most
effective distribution
channel?
What product prom-otions have the biggest
impact on revenue?
www.notesvillage.com
Which customers
are most likely to go
to the competition ?
What impact will
new products/services
have on revenue
and margins?
Necessity
 Need for Strategic Information
 Retain consumer base
 Increase customer base by 20%
 Enhance consumer satisfaction level
 Launch two new products
 Information Crisis
 How to maintain lots of data in the organization
 IT resources and strategy is not effective for data
 Opportunities and Risks:
 Quick decisions
 Managerial analysis
 Crisis management
www.notesvillage.com
History of DSS
 Ad hoc reports
 Small applications
 Information centers
 Decision support system
www.notesvillage.com
Disadvantages
 Many adhoc requests .
 Supplementary report problems
 No interactivity of the user
 Flexible informations.
www.notesvillage.com
OPERATIONAL VS INFORMATIONAL
 Current Values
 Archived value
 Optimized for transaction
 Optimized for complex
 High frequency
 Read Update Delete

 Large user

 Response time is very fast


www.notesvillage.com
queries
Low
Read
Adhoc,Random
Relatively more several
seconds
Scenerio 1
 ABC Pvt Ltd is a company with branches at Mumbai, Delhi,
Chennai and Banglore. The Sales Manager wants quarterly
sales report. Each branch has a separate operational system.
www.notesvillage.com
Scenario 1 : ABC Pvt Ltd.
Mumbai
Delhi
Sales per item type per branch
for first quarter.
Chennai
Banglore
www.notesvillage.com
Sales
Manager
Solution 1:ABC Pvt Ltd
 Extract sales information from each database.
 Store the information in a common repository at a single
site.
www.notesvillage.com
Solution
1:ABC
Pvt
Ltd
.
Mumbai
Report
Delhi
Chennai
Banglore
www.notesvillage.com
Data
Warehouse
Query &
Analysis tools
Sales
Manager
Scenario 2 : One Stop Shopping
Data Entry Operator
Report
Wait
Data Entry Operator
www.notesvillage.com
Operational
Database
Management
Solution 2
•Extract data needed for analysis from operational
database.
•Store it in warehouse.
•Refresh warehouse at regular interval so that it
contains up to date information for analysis.
•Warehouse will contain data with historical
perspective.
www.notesvillage.com
Solution 2
Data Entry
Operator
Report
Transaction
Operational
database
Data Entry
Operator
www.notesvillage.com
Extract
data
Data
Warehouse
Manager
Scenario 3
Cakes & Cookies is a small,new company.President of the
company wants his company should grow.He needs
information so that he can make correct decisions.
www.notesvillage.com
Solution 3
•Improve the quality of data before
loading it into the warehouse.
•Perform data cleaning and
transformation before loading the data.
•Use query analysis tools to support
adhoc queries.
www.notesvillage.com
Solution 3
Expansio
n
sales
Data
Warehouse
Query and Analysis
tool
President
time
Improvemen
t
www.notesvillage.com
Definitions of a Data Warehouse
“A subject-oriented, integrated, time-variant and
1.
non-volatile collection of data in support of
management's decision making process”
- W.H. Inmon
2.
“A copy of transaction data, specifically
structured for query and analysis”
- Ralph Kimball
www.notesvillage.com
Data Warehouse
 For organizational learning to take place, data from many
sources must be gathered together and organized in a
consistent and useful way – hence, Data Warehousing
(DW)
 DW allows an organization (enterprise) to remember
what it has noticed about its data
 Data Mining techniques make use of the data in a DW
www.notesvillage.com
 What is Data Warehouse??
 Inmons’s definition
A data warehouse is
-subject-oriented,
-integrated,
-time-variant,
-nonvolatile
collection of data in support of management’s
decision making process
www.notesvillage.com
Subject-oriented
•Data warehouse is organized around
subjects such as sales, product, customer.
•It focuses on modeling and analysis of data
for decision makers.
•Excludes data not useful in decision support
process.
www.notesvillage.com
Integrated Data
 Integration
 Data Warehouse is constructed by integrating multiple
heterogeneous sources.
 Data Preprocessing are applied to ensure consistency.
RDBMS
Legacy
System
www.notesvillage.com
Flat File
Data
Warehouse
Integration
In terms of data.
encoding structures.
Measurement of
attributes.
physical attribute.
of data
naming conventions.
Data type format
www.notesvillage.com
Time-variant
•Provides information from historical perspective
e.g. past 5-10 years
•Data is stored as snapshots
•Every key structure contains either implicitly or
explicitly an element of time
•Allows analysis for the past
•Relates information to the present
www.notesvillage.com
•Enables
forecasts for the future
Nonvolatile
•Data once recorded cannot be updated.
•Data warehouse requires two operations in
data accessing
•Initial loading of data
•Access of data
www.notesvillage.com
Data Granularity
 Operational system is kept at the lowest level
 Data is summarized at different levels
 The lower the level the more data granularity.
www.notesvillage.com
Operational v/s Information System
Features
Operational
Information
Characteristics
Operational
processing
Informational processing
Orientation
Transaction
Analysis
User
Clerk,DBA,database
professional
Knowledge workers
Function
Day to day operation
Decision support
Data
Current
Historical
View
Detailed,flat relational
Summarized,
multidimensional
DB design
Application oriented
Subject oriented
Unit of work
Short ,simple
transaction
Complex query
Read/write
Mostly read
www.notesvillage.com
Access
Data Mart
 A Data Mart is a smaller, more focused Data Warehouse – a
mini-warehouse.
 A Data Mart typically reflects the business rules of a specific
business unit within an enterprise.
www.notesvillage.com
Data Warehouse to Data Mart
Data
Warehouse
www.notesvillage.com
Data Mart
Decision
Support
Information
Data Mart
Decision
Support
Information
Data Mart
Decision
Support
Information
www.notesvillage.com
Data Marts vs Data Warehouses
Internet/Intranet Layer 11
direct queries
virtual queries
ad hoc queries
Virtual DW
Coarse DW
Operational Data
Layer 2a
Central DW
Distributed DW
Core DW Layer 3
North A merica
External Data
Layer
2b
Unit ed St at es
$11, 000
Sa le s
Un it e d St a t e s
by
S a le s
$ 1 0 , 3 t4$o01 0 , 3 5 (0 1 )
$ 8 , 7 3t $0o1 0 , 3 4 (0 2 )
$ 4 , 3 2t 0$
o 8 , 7 3 0( 2 )
$ 1 , 1 0t 0$
o 4 , 3 2 0( 1 )
$ 7 3 0t $
o 1 , 1 0 0( 3 )
Presentation/
Desktop Access
Layer 1
Data Mart
Layer 4
Data Feed/
Data Mining/
Indexing Layer 6
Data Staging and
Quality Layer
Meta-data Repository Layer 8
Warehouse Management Layer 9
Application Messaging (Transport) Layer 10
www.notesvillage.com
5
Data
Access
Layer 7
Non-operational
Data
Layer 2c
Operational vs. Informational
Systems
Operational
Systems
Data
Marts
Data
Information
Delivery System
Warehouse
Informational
Systems
www.notesvillage.com
Virtual Date Warehouse
 A Virtual Data Warehouse approach is often chosen when
there are infrequent demands for data and management
wants to determine if/how users will use operational
data.
 One of the weaknesses of a Virtual Data Warehouse
approach is that user queries a made against operational
DBs.
 One way to minimize this problem is to build a “Query
Monitor” to check the performance characteristics of a
query
before executing it.
www.notesvillage.com
• A Coarse Data Warehouse is often chosen when the
organization has a relatively clean/new operational
system and management wants to make the operational
data more easily available for just that system.
• A Central Data Warehouse is often chosen when the
organization has a clear understanding about it
Information Access needs and wants to provide
“quality”, “integrated” , information to its knowledge
workers
• A Distributed Data Warehouse is similar in most respects
to a Central Data Warehouse, except that the data is
distributed to separate mini-Data Warehouses (Data
Marts )on local or specialized servers
www.notesvillage.com
Central Data Warehouse
Internet/Intranet Layer 11
direct queries
virtual queries
ad hoc queries
Virtual DW
Coarse DW
Operational Data
Layer 2a
Central DW
Distributed DW
Core DW Layer 3
North A merica
External Data
Layer
2b
Unit ed St at es
$11, 000
Sa le s
Un it e d St a t e s
by
S a le s
$ 1 0 , 3 t4$o01 0 , 3 5 (0 1 )
$ 8 , 7 3t $0o1 0 , 3 4 (0 2 )
$ 4 , 3 2t 0$
o 8 , 7 3 0( 2 )
$ 1 , 1 0t 0$
o 4 , 3 2 0( 1 )
$ 7 3 0t $
o 1 , 1 0 0( 3 )
Presentation/
Desktop Access
Layer 1
Data Mart
Layer 4
Data Feed/
Data Mining/
Indexing Layer 6
Data Staging and
Quality Layer
Meta-data Repository Layer 8
Warehouse Management Layer 9
Application Messaging (Transport) Layer 10
www.notesvillage.com
5
Data
Access
Layer 7
Non-operational
Data
Layer 2c
Top-down approach
 Single, Central storage of
 Needs high level cross
data about the content
 Centralized Rule and
Content
 May seek quick result if
implemented with Iteration
functional skills
 High outlay with proof of
concepts
 High exposure to
risk/Failures
www.notesvillage.com
www.notesvillage.com
Data Warehouse Architectures
 Generic Two-Level Architecture
 Independent Data Mart
 Dependent Data Mart and Operational Data Store
 Logical Data Mart and Real-Time Data Warehouse
 Three-Layer architecture
All involve some form of extraction, transformation and loading (ETL)
www.notesvillage.com
Process Architecture
 These interrelated sub-processes are sometimes referred to as an
“ETL” process.
 1)Extract- Since data for the data warehouse can come from different
sources and may be of different types, the plan to extract the data along with
appropriate compression and encryption techniques is an important
requirement for consideration.
 2)Transform- Transformation of data with appropriate conversion,
aggregation and cleaning besides de-normalization and surrogate key
management is also an important process to be planned for building a data
warehouse.
 3)Load- Steps to be considered to load data with optimization by considering
the multiple areas where the data is targeted to be loaded and retrieved is
also an important part of the data warehouse architecture plan
www.notesvillage.com
Data Model Architecture
 In Data Model Architecture (also known as Dimensional
Data Model), there are 3 main data modeling styles for
enterprise warehouses:
 3rd Normal Form - Top Down Architecture, Top Down
Implementation
 Federated Star Schemas - Bottom Up Architecture,
Bottom Up Implementation
 Data Vault - Top Down Architecture, Bottom Up
Implementation
.
www.notesvillage.com
Technology Architecture
 Scalability and flexibility is required in all facets. The extent of these features are
largely depend upon organizational size, business requirements, nature of
business etc.
 Technology or Technical architecture primary evolved from derivations from
the process architecture, meta data management requirements based on
business rules and security levels implementations and technology tool specific
evaluation.
 Besides these, the Technology architecture also looks into the various
technology implementation standards in database management, database
connectivity protocols (ODBC, JDBC, OLE DB etc), Middleware (based on
ORB, RMI, COM/DOM etc.), Network protocols (DNS, LDAP etc) and other
related technologies.
www.notesvillage.com
Information Architecture
 Information Architecture is the process of translating the
information from one form to another in a step by step
sequence so as to manage the storage, retrieval, modification
and deletion of the data in the data warehouse.
www.notesvillage.com
Resource Architecture
 Resource architecture is related to software architecture in
that many resources come from software resources.
Resources are important because they help determine
performance.
 Workload is the other part of the equation. If you have
enough resources to complete the workload in the right
amount of time, then performance will be high.
www.notesvillage.com
Figure 11-2: Generic two-level data warehousing architecture
L
T
One,
companywide
warehouse
E
Periodic extraction  data is not completely current in warehouse
www.notesvillage.com
Scrub/Cleanse…uses pattern recognition and AI
techniques to upgrade data quality
Figure 11-10:
Steps in data
reconciliation
(cont.)
Fixing errors: misspellings,
erroneous dates, incorrect field
usage, mismatched addresses,
missing data, duplicate data,
inconsistencies
www.notesvillage.com
Also: decoding, reformatting,
time stamping, conversion, key
generation, merging, error
detection/logging, locating
missing data
Figure 11-3 Independent data mart
data warehousing architecture
Data marts:
Mini-warehouses, limited in scope
L
T
E
Separate ETL for each
independent data mart
www.notesvillage.com
Data access complexity
due to multiple data marts
Figure 11-4 Dependent data mart with
ODS provides option for
operational data store: a three-level architecture obtaining current data
L
T
E
Single ETL for
enterprise data warehouse
(EDW)
www.notesvillage.com
Simpler data access
Dependent data marts
loaded from EDW
Figure 11-5 Logical data mart and real
time warehouse architecture
ODS and data warehouse
are one and the same
L
T
E
Near real-time ETL for
Data Warehouse
www.notesvillage.com
Data marts are NOT separate databases,
but logical views of the data warehouse
 Easier to create new data marts
Figure 11-6 Three-layer data architecture for a data warehouse
www.notesvillage.com