Database Administration:
The Complete Guide to Practices and Procedures
Chapter 20
Data Warehouse Administration
Agenda
• What is a Data Warehouse?
– designed and optimized for a specific type
of processing
• Administering the Data Warehouse
• Questions
What is a Data Warehouse?
• Data Warehouse
– A collection of integrated, subject-oriented
databases designed to support the DSS (Decision
Support System) function, where each unit of data
is specific to some moment of time. The data
warehouse contains atomic data and lightly
summarized data.
Breaking Down the Definition
• Subject-oriented. Data relate to a particular subject
instead of the many subjects relevant to the
company’s ongoing operations.
• Integrated. Although the data can originate from a
variety of disparate sources, the data within a data
warehouse is merged into a coherent whole.
• Time-variant. Data stored within a data warehouse is
identified with a particular time period.
• Nonvolatile. Data is stable in a data warehouse.
Although data can be added, data cannot be removed.
This provides a consistent picture of the business for
analysis.
Data Warehouse Definition
• For our purposes, data warehousing can be
defined as the process of extracting,
integrating, transforming, and cleansing data
and storing it in a combined (unified) database.
Once combined and made available, the data
warehouse becomes the only data source that
management should access for decision
making.
What is a Data Mart?
• A data mart is basically a departmental data warehouse
defined for a single business unit or area.
– There is no universally agreed-on difference between a
data warehouse and a data mart.
– A data mart is not necessarily smaller than an enterprise
data warehouse. It may be smaller, but size is determined
based on business needs.
• Departmental analysis at the business-unit level may require more
historical information than cross-department, enterprise-wide
analysis.
• Both data marts and data warehouses need to be
treated differently than traditional database systems
from a DBA perspective.
Analytical vs. Transaction Processing
• The traditional database is designed primarily
for transaction processing—that is, supporting
the transactions that are required for business
operations to be conducted.
• Data warehouses are designed for analytical
processing.
– Includes decision support systems (DSS), business
intelligence (BI), advanced analytics, OLAP, data
mining, and information center queries.
OLAP
OLAP stands for online analytical processing.
• With OLAP technology, end users can navigate through
a data warehouse to derive intelligence from data
through interactive and iterative queries against the
warehoused data.
• OLAP uses a multidimensional view of detail, summary,
and aggregate data to access information for further
analysis.
• The key term here is multidimensional.
– A dimension is a structural attribute viewed as similar by
the end user.
– For example, years, quarters, months, days, and so on
make up the time dimension.
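To make the idea of dimensions concrete, here is a minimal sketch in Python (the data and names are hypothetical): each cell of a tiny sales "cube" is keyed by one member from each of three dimensions, and an OLAP-style aggregate sums across two of them.

```python
# Sketch (hypothetical data): sales measured along three dimensions.
# Each cell of the "cube" is keyed by one member from each dimension.
sales = {
    ("P123", "Reg1", "Qtr1"): 100,
    ("P123", "Reg1", "Qtr2"): 120,
    ("P123", "Reg2", "Qtr1"): 80,
    ("P124", "Reg1", "Qtr1"): 50,
}

# Slicing along a dimension: total sales for product P123 across all
# regions and quarters (an OLAP-style aggregate over two dimensions).
p123_total = sum(v for (prod, _, _), v in sales.items() if prod == "P123")
print(p123_total)  # 300
```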
OLAP versus OLTP
Characteristic            OLAP                         OLTP
Scope of work             Entire database              Single transaction
Amount of data            Groups of rows               Individual rows
Nature of data            Derivative                   Primitive
Volatility of data        Low—rarely updated           High—regularly updated
Expected response time    Minutes, perhaps hours       Subsecond
Machine usage             Dynamic and unpredictable    Stable and predictable
Processing priorities     User flexibility             High performance and availability
Type of access            Undefined and dynamic        Predefined and static
Data Mining
• Data mining is the process of discovering heretofore unknown information and patterns lurking within
an organization’s data.
• Data mining processes apply heuristic algorithms to
historical data to automatically uncover patterns in the
data.
• The quantity and quality of available data and the quality of
the pattern discovery algorithms determine the value of
the data mining applications.
• Consequently, organizations that deploy data mining tend
to store more data for longer periods.
• Both OLAP and data mining are methods for searching out
patterns in corporate data.
– OLAP is user driven, while data mining is computer driven.
Conceptual Structure of Data Warehouse
• Data warehouse processing involves
– Cleaning and reformatting of data
– OLAP
– Data mining
– Back flushing
[Figure: conceptual structure of a data warehouse. Databases and other data inputs feed data cleaning and reformatting; the data warehouse and its metadata support OLAP, DSS/EIS, and data mining; updates/new data flow into the warehouse, and back flushing returns cleansed data to the sources.]
Administering the Data Warehouse
• Many issues of importance when managing a data
warehouse differ from those of traditional database
administration.
– Some of the tasks are equivalent, but the implementation
and expectations are different.
• Data warehouses are accessed mostly by read
operations for analytical processing, whereas the
traditional database is accessed by a mix of read and
write operations for transactional processing.
• The data warehouse is strategic—it allows users to
discover hidden trends and new ways of conducting
business. The traditional database is tactical—it
allows the business to conduct day-to-day
operations.
Data Warehouse Administration Issues
• Be aware of the complexity of data warehouses before
undertaking any implementation project.
– Detailed knowledge of the applications accessing the
databases that feed the data warehouse must be available.
– Be sure to allot development time for learning the
complexities of the source systems.
– Furthermore, the documentation for production systems
is often inadequate or nonexistent, so additional time can
be lost just trying to understand what already exists.
– Analyzing the source data to determine what level of data
scrubbing is required is a time-consuming task.
• All of these issues must be factored into the task of
data warehouse administration.
DW Administration Issues
• Too Much Focus on Technology?
• Data Warehouse Design
• Data Movement
• Data Cleansing
• Data Warehouse Scalability
• Data Warehouse Performance
• Data Freshness
• Data Content
• Data Usage
• Financial Chargeback
• Backup & Recovery
• Do Not Operate in a Vacuum!
Too Much Focus on Technology?
• Be sure to include tools, people, and methods in
your warehouse blueprint.
• Too often, the focus is solely on the technology
and tools.
• A data warehouse project requires more than just
sound technology.
• It needs careful planning and implementation
(methods) as well as a means to learn from the
efforts of others (people) through mentoring,
consulting, education, seminars, and user groups.
Data Warehouse Design
• When designing the data warehouse, be sure to create
a physical design with the special needs of the data
warehouse in mind, instead of just mimicking the
design of similar production OLTP databases.
• Common physical design techniques used for OLTP
databases can inhibit an OLAP database from
performing well.
• This is so because the data access requirements for file
and table structures, indexes, and clustering sequence
in a data warehouse differ from those of OLTP
databases, as do the access requirements and patterns.
Data Modeling for Data Warehouses
• Traditional databases generally deal with two-dimensional
data (similar to a spreadsheet).
– However, query performance in a multi-dimensional data
storage model is much more efficient.
Data Modeling for Data Warehouses
• Example of Two-Dimensional vs. Multi-Dimensional
[Figure: a two-dimensional model is a flat table of products (P123, P124, P125, P126, ...) by region (REG1, REG2, REG3); a three-dimensional data cube adds fiscal quarter (Qtr 1 through Qtr 4) as a third axis, with product, region, and fiscal quarter as its dimensions.]
Data Modeling for Data Warehouses
• Advantages of a multi-dimensional model
– Multi-dimensional models lend themselves readily
to hierarchical views in what is known as roll-up
display and drill-down display.
– The data can be directly queried in any
combination of dimensions, bypassing complex
database queries.
pivoting
• The data can be queried directly in any combination of
dimensions, bypassing complex database queries.
• Tools exist for viewing data according to the user’s choice of
dimensions.
• Changing from one-dimensional hierarchy (orientation) to
another is easily accomplished in a data cube with a
technique called pivoting (also called rotation).
pivoting
• In this technique, the data cube can be thought of as rotating
to show a different orientation of the axes.
• For example, you might pivot the data cube to show regional
sales revenues as rows, the fiscal quarter revenue totals as
columns, and the company’s products in the third dimension
(Figure 29.4).
• Hence, this technique is equivalent to having a regional sales
table for each product separately, where each table shows
quarterly sales for that product region by region.
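The pivot described above can be sketched in plain Python (hypothetical figures): the same cells are reoriented so that regions become rows and quarters become columns for one product.

```python
# A pivot reorients the same cells: here regions become rows and
# quarters become columns for one product slice (hypothetical data).
cells = {
    ("P123", "Reg1", "Qtr1"): 100, ("P123", "Reg1", "Qtr2"): 120,
    ("P123", "Reg2", "Qtr1"): 80,  ("P123", "Reg2", "Qtr2"): 90,
}

def pivot(cells, product):
    """Return {region: {quarter: amount}} for one product slice."""
    table = {}
    for (prod, region, quarter), amount in cells.items():
        if prod == product:
            table.setdefault(region, {})[quarter] = amount
    return table

view = pivot(cells, "P123")
print(view["Reg1"]["Qtr2"])  # 120
```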
roll-up display
• Multidimensional models lend themselves readily to
hierarchical views in what is known as roll-up display and
drill-down display.
• A roll-up display moves up the hierarchy, grouping into larger
units along a dimension (for example, summing weekly data by
quarter or by year). Figure 29.5 shows a roll-up display that
moves from individual products to a coarser grain of product
categories.
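A roll-up is just a grouped aggregation along a dimension hierarchy. A minimal sketch, with hypothetical monthly figures rolled up to quarters:

```python
# Roll-up: move up a hierarchy by grouping a finer dimension into a
# coarser one. Here monthly sales roll up to quarters (hypothetical data).
monthly = {"Jan": 10, "Feb": 12, "Mar": 8, "Apr": 20, "May": 15, "Jun": 5}
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
                    "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

quarterly = {}
for month, amount in monthly.items():
    q = month_to_quarter[month]
    quarterly[q] = quarterly.get(q, 0) + amount

print(quarterly)  # {'Q1': 30, 'Q2': 40}
```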
drill-down display
• Shown in Figure 29.6, a drill-down display
provides the opposite capability, furnishing a
finer-grained view, perhaps disaggregating
country sales by region, then regional sales
by subregion, and also breaking up products
by styles.
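Drill-down works in the opposite direction: exposing the finer-grained rows behind a summary figure. A sketch with hypothetical country and region data:

```python
# Drill-down: the inverse of roll-up. Starting from a country total, we
# expose the finer-grained regional rows that compose it (hypothetical data).
detail = [
    {"country": "US", "region": "East", "sales": 60},
    {"country": "US", "region": "West", "sales": 40},
    {"country": "CA", "region": "East", "sales": 25},
]

def drill_down(rows, country):
    """Return the per-region breakdown behind one country's total."""
    return {r["region"]: r["sales"] for r in rows if r["country"] == country}

us_total = sum(r["sales"] for r in detail if r["country"] == "US")
us_by_region = drill_down(detail, "US")
print(us_total, us_by_region)  # 100 {'East': 60, 'West': 40}
```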
Multi-dimensional Schemas
• Multi-dimensional schemas are specified
using:
– Dimension table
• It consists of tuples of attributes of the dimension.
– Fact table
• Each tuple is a recorded fact. This fact contains some
measured or observed variable(s) and identifies it with
pointers to dimension tables. The fact table contains
the data, and the dimensions to identify each tuple in
the data.
Multi-dimensional Schemas
• Two common multi-dimensional schemas are
– Star schema:
• Consists of a fact table with a single table for each
dimension
– Snowflake Schema:
• It is a variation of star schema, in which the
dimensional tables from a star schema are organized
into a hierarchy by normalizing them.
Multi-dimensional Schemas
• Star schema:
– Consists of a fact table with a single table for each
dimension.
Multi-dimensional Schemas
• Snowflake Schema:
– It is a variation of star schema, in which the
dimensional tables from a star schema are
organized into a hierarchy by normalizing them.
Star Schema Benefits
• The design is flexible enough to facilitate quick
changes or additions as the data warehouse
grows or the application changes.
• The design is easy to understand for both
developers and end users.
• The design mimics the way end users typically
think about and use their business data.
• The design is easily implemented as a physical
database and results in efficient queries because
many DBMSs are built to recognize and optimize
star joins.
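As a sketch of the star-join idea in the last bullet, here is a minimal star schema built with Python's sqlite3 (table names and data are hypothetical): one fact table joined to each dimension it references, grouped at the dimension level.

```python
import sqlite3

# A minimal star schema: one fact table and two dimension tables
# (names and data are hypothetical).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        region_id  INTEGER REFERENCES dim_region(region_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'P123'), (2, 'P124');
    INSERT INTO dim_region  VALUES (1, 'Reg1'), (2, 'Reg2');
    INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 80.0), (2, 1, 50.0);
""")

# A star join: the fact table joined to each of its dimension tables.
rows = con.execute("""
    SELECT p.name, r.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_region  r ON r.region_id  = f.region_id
    GROUP BY p.name, r.name
    ORDER BY p.name, r.name
""").fetchall()
print(rows)
```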
Normalized vs. Denormalized
The Goal!
Data Movement
• A big part of any data warehousing effort is
establishing the data requirements and then
creating methods to capture and move the
data into the warehouse.
• Without a clear understanding of end-user
needs, the DBA will have great difficulty
determining what data to move, never mind
how to move it.
• Consider the data movement technology to be used (for
example, ETL tools or replication).
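The capture-and-move process above can be sketched as a toy extract-transform-load pass (all field names and cleansing rules here are hypothetical):

```python
# A toy ETL pass: pull rows from a "source", cleanse and reshape them,
# then append them to a "target" (names and rules are hypothetical).

def extract(source_rows):
    # In practice this would read from operational files or databases.
    return list(source_rows)

def transform(rows):
    # Cleanse and integrate: trim and uppercase names, coerce amounts
    # to floats, and drop rows with no amount at all.
    out = []
    for r in rows:
        if r.get("amount") is None:
            continue
        out.append({"name": r["name"].strip().upper(),
                    "region": r["region"].strip(),
                    "amount": float(r["amount"])})
    return out

def load(target, rows):
    target.extend(rows)

source = [{"name": " widget ", "region": "Reg1", "amount": "10"},
          {"name": "gadget",   "region": "Reg2", "amount": None}]
warehouse = []
load(warehouse, transform(extract(source)))
print(warehouse)  # [{'name': 'WIDGET', 'region': 'Reg1', 'amount': 10.0}]
```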
Data Cleansing
• The quality of the data in the data warehouse is of utmost
importance to the overall usefulness of the data
warehouse.
– A data warehouse project requires data cleansing and scrubbing
tasks to be defined and performed as the data is moved from
other sources to the data warehouse.
• Never cover up data integrity problems.
– Document them and bring them to the attention of your
manager and the users who rely on the data.
• Small data discrepancies can become statistically irrelevant
when large volumes of data are averaged.
– But averaging is not the only analytical function that is
employed by analytical data warehouse queries.
• Standardization of “unknown” values.
• Data must be treated as a corporate asset.
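The "standardization of unknown values" bullet can be sketched as follows (the marker list and sentinel are hypothetical): many source-system spellings of "unknown" collapse to one agreed value before loading.

```python
# Sketch: standardizing the many spellings of "unknown" that arrive
# from different source systems into one sentinel (rules hypothetical).
UNKNOWN_MARKERS = {"", "n/a", "na", "none", "unknown", "?", "-"}

def standardize(value, sentinel="UNKNOWN"):
    if value is None or str(value).strip().lower() in UNKNOWN_MARKERS:
        return sentinel
    return value

print([standardize(v) for v in ["Reg1", "n/a", None, " ? "]])
# ['Reg1', 'UNKNOWN', 'UNKNOWN', 'UNKNOWN']
```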
Data Quality Issues
• Maintaining data quality will be an ongoing
concern.
• Data problems will be discovered not only
throughout the development phase of the data
warehouse but also throughout its useful life.
• Be sure to establish a policy for how data
anomalies are to be reported and corrected
before the data warehouse is made generally
available.
• Do not underestimate the amount of time and
effort that will be required to clean up dirty data.
Operational Problems
Previously undetected operational problems may be
uncovered. You can:
• Ignore the problem, with the understanding that the
problem will exist in the data warehouse too. (Not the
best idea!)
• Fix the problem in the operational production system.
• Fix the problem during the data transformation phase
of data warehouse population, if possible.
But do not mix operational needs into the data
warehouse project plan (fixing operational problems on
project time is time consuming).
Data Warehouse Scalability
• As a data warehouse becomes accepted within the
organization, demand for its services will grow.
• The need for new reports and summary tables increases,
and the data warehouse can explode to several times its
original size.
• Some industry surveys indicate that 60% to 70% of data
warehouses are filled with duplicate or redundant data
such as summary tables and indexes.
• This can more than double the disk storage required to
store the data warehouse.
• It is important, therefore, to architect the system so that it
will be able to scale linearly with demand.
• The DBA can choose to use: parallel processors, parallel
databases, bitmapped indexes, data compression, and
other techniques to provide better scalability.
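To illustrate one of those techniques, here is a toy sketch of the idea behind a bitmapped index (data hypothetical): for a low-cardinality column, one bit per row per distinct value lets predicates become cheap bitwise operations.

```python
# Sketch of a bitmapped index: for a low-cardinality column, keep one
# bit per row per distinct value, so predicates become bitwise ops
# (data hypothetical).
regions = ["Reg1", "Reg2", "Reg1", "Reg3", "Reg2", "Reg1"]

# Build one bitmap (stored as a Python int) per distinct value.
bitmaps = {}
for i, v in enumerate(regions):
    bitmaps[v] = bitmaps.get(v, 0) | (1 << i)

# "WHERE region = 'Reg1' OR region = 'Reg3'" is a single bitwise OR.
mask = bitmaps["Reg1"] | bitmaps["Reg3"]
matching_rows = [i for i in range(len(regions)) if mask & (1 << i)]
print(matching_rows)  # [0, 2, 3, 5]
```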
Data Warehouse Performance
• Extract performance—how smoothly data is
updated and refined in the warehouse
• Data management—quality of the data
warehouse design and database maintenance
• Query performance—how well the database is
tuned for analytical processing and how well
each query is written
• Server performance—the configuration and
performance of the hardware
Aggregation
• Aggregating records in a data warehouse can
enhance the performance of large data
warehousing applications.
• This is true because the amount of detailed
information that exists in most data
warehouses cannot be easily understood.
• Summarizing the data into logical groups
makes it easier to be assimilated and analyzed
in a cogent manner.
Automatic Summary Tables
(aka Materialized Query Tables)
• Automatic summary tables, also known as materialized query
tables or materialized views, build an automatically aggregated table
from a base table by using standard SQL functions such as COUNT,
SUM, AVG, MIN, or MAX.
– The result is a table that is aggregated at a higher-level dimension.
– Users do not need to even know that summary tables exist. The DBMS
optimizer can determine when using a summary table can optimize a
query written against the detail data and then automatically rewrite
the query to use the summary table. Furthermore, in most cases, the
DBMS automatically maintains aggregates.
– In other words, when underlying data changes, the DBMS
automatically updates the summary table data.
• The DBA controls the creation of summary tables.
• Data needs to be periodically refreshed.
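The refresh behavior described above can be sketched by maintaining a summary aggregate by hand (a real DBMS would rewrite queries against it and refresh it automatically; the data here is hypothetical):

```python
# Sketch: a hand-maintained summary table over detail rows. When the
# underlying data changes, the aggregate is refreshed (data hypothetical).
detail = [("P123", 100.0), ("P123", 80.0), ("P124", 50.0)]

def refresh_summary(rows):
    """Rebuild the per-product SUM aggregate from the detail rows."""
    summary = {}
    for product, amount in rows:
        summary[product] = summary.get(product, 0.0) + amount
    return summary

summary = refresh_summary(detail)
detail.append(("P124", 25.0))          # underlying data changes...
summary = refresh_summary(detail)      # ...so the aggregate is refreshed
print(summary)  # {'P123': 180.0, 'P124': 75.0}
```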
Data Freshness
• Data warehouse processes must be set up to
keep data up-to-date.
• Data latency is the difference between the data in
the data warehouse and the data in the
production system.
• Business needs will drive the desired amount of
latency.
– Some data warehouses require very fresh data (that is,
low latency), whereas other data warehouses can
tolerate high latency.
Data Content
• The data warehouse administrator may need to provide
metadata on the contents of the data warehouse.
• This metadata could include the types of data being stored,
which types of data change most frequently, and how fast
certain types of data are growing.
• Keeping track of growth patterns helps to determine the
areas of the data warehouse that may require attention in
terms of storage allocation, reorganization, or redesign.
• Develop methods to analyze and verify data integrity and
relationships in the data warehouse.
• The primary value of the data warehouse is in its data
content.
– It is good practice for the DBA to gain as much understanding of
that data content as possible.
Data Usage
• A data warehouse management task that is not associated with
traditional database administration is the identification of unused
data.
• Maintaining certain data in the warehouse does not make sense
when certain thresholds or conditions are met. This situation may
occur because of:
– Usage reasons (data is no longer used by the business)
– Technology reasons (reaching a capacity limit)
– Regulatory reasons (change in regulations or laws)
– Business reasons (restructuring data or instituting different processes)
• If large amounts of unused data are identified, the DBA may decide
to remove the data from the warehouse. Such action should not be
taken indiscriminately.
• The unused data may then be purged from the data warehouse or
perhaps moved to a cheaper storage media.
• The capability to track data warehouse activity is a useful
administrative goal.
Financial Chargeback
• In most organizations, data warehouse projects are
managed by multiple departments, each of which has its
own financial goals.
• Data warehouse DBAs should ensure that they can charge
back appropriate costs to business units and users so that
they can meet financial reporting requirements.
• For chargeback purposes, you should consider measuring
and tracking every request made to the data warehouse,
including who requested the data, the number of rows
returned by the request, the CPU cycles spent to return the
data, and which tables were accessed.
– Even if your organization chooses not to actually bill
departments by usage, it can make sense to establish a financial
chargeback system to demonstrate the value of the data
warehouse to the business units that use it the most.
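The request-tracking idea above can be sketched by aggregating a usage log per department (the log fields and the unit billed are hypothetical):

```python
# Sketch: aggregating per-department usage for chargeback reporting.
# The request-log fields are hypothetical; here each department is
# billed by total CPU milliseconds consumed.
requests = [
    {"dept": "sales",   "rows": 1000, "cpu_ms": 50},
    {"dept": "sales",   "rows": 200,  "cpu_ms": 10},
    {"dept": "finance", "rows": 5000, "cpu_ms": 300},
]

def chargeback(log):
    """Total CPU milliseconds consumed per department."""
    bill = {}
    for r in log:
        bill[r["dept"]] = bill.get(r["dept"], 0) + r["cpu_ms"]
    return bill

print(chargeback(requests))  # {'sales': 60, 'finance': 300}
```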
Backup & Recovery
• The DBA must create a backup and recovery plan for the data warehouse even
though most access is read-only.
– Some novices mistakenly view backup and recovery as a non-issue for data warehouses,
but this is simply not the case.
– Data is moved into the data warehouse on a consistent basis—some of it from external
sources that might not be easily duplicated.
• If warehouse data is lost due to some type of failure, the production data from
which it was gathered may no longer exist in the same form.
• Data may have been cleansed, transformed, or aggregated before being populated
into the data warehouse. Such data may be difficult or impossible to recreate.
• Furthermore, data warehouses often contain external data that, if lost, may have
to be purchased again, creating an unnecessary financial drain on the company.
• So, for all of the reasons you need to provide a backup and recovery plan for
traditional databases, you need to do the same for the data warehouse databases.
– The backups need only be taken whenever data is added to the data warehouse.
• Do not overlook disaster recovery planning for the data warehouse.
Don’t Operate in a Vacuum
• The DBA must be aware of the business requirements of the data
warehouse and ensure that these needs are met.
• As business needs change, operational systems change. When
operational databases and files change, the data warehouse will be
affected as well.
• Before any changes are made to the operational system, the data
warehouse team must be prepared to accept the new format as
input to the data warehouse and to either maintain multiple data
formats for the changed data element or to implement a conversion
mechanism as part of the data transformation process.
– Conversion can result in lost or confusing data.
– When undertaking a conversion, it is wise to plan thoroughly and
automate the process as much as possible with tested, trusted tools
and scripts.
– Relying solely on human conversion will almost surely result in errors
to the data.
Questions