Data Warehousing - Augmented Intel


Data Warehousing
Course Outline for Intro to Business Intelligence Unit 1 by Don Krapohl
• Overview
– Basic Definitions
– Normalization
• Entity Relationship Diagrams (ERDs)
• Normal Forms
• Many to Many relationships
– Warehouse Considerations
• Dimension Tables
• Fact Tables
• Star Schema
• Snowflake Schema
• Further Warehouse Design Considerations
– Changing Dimensions
– Conformed Dimensions
• Data warehouse
• OLTP - On-Line Transaction Processing
– A data warehouse is a copy of transaction data specifically structured for
querying and reporting.
– “a collection of computerized data that is organized to most optimally
support reporting and analysis activity”
– OLTP describes a type of processing that databases are designed to
support.
– OLTP applications need to support a high number of transactions per unit of
time.
– A transaction is a set of Insert, Update, and sometimes Delete statements
that must succeed or fail as a unit. Transactions typically perform such
functions as recording orders, depleting inventory, etc.
– Electronic banking and order processing are common OLTP applications.
• OLAP - On-Line Analytical Processing
– In its broadest usage, the term "OLAP" is used as a synonym for "data
warehousing".
– The term "On-Line Analytical Processing" was developed to distinguish data
warehousing activities from On-Line Transaction Processing.
– In a narrower usage, the term OLAP is used to refer to the tools used for
Multidimensional Analysis…
• Sample Star Schema:
… but when people speak of OLAP they may
properly be referring to a schema like this one
in a relational database.
• Database Normalization
– Normalization reduces redundant data storage by organizing data
efficiently.
– There are many ways to normalize a database consistently within
a set of business requirements.
– Normalization reduces the potential for anomalies during data
manipulation operations.
– Non-normalized databases are vulnerable to data anomalies when
they store data redundantly.
• If data is stored in two locations, but is later updated in only
one location, then the data becomes inconsistent; this is
referred to as an update anomaly.
• To avoid data anomalies, non-primary key data in a
normalized database are stored in only one location.
– If you need a Department’s physical location, you should
need to look in the Department Table.
• Unnormalized Table
• We could design a database so that each record for a specific type of
business object would have all the information we’d typically need about
that object type.
But this schema is more typical of a normalized database.
• We could generate the information on the previous page with this
query:
Select
e.EmployeeID, e.LastName, e.FirstName,
d.DeptID, d.Name, d.Location
From
Department d
Inner Join Employee e
on d.DeptID = e.DeptID
When we normalize, we’re building a logical hierarchy.
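As a runnable sketch of that join (using SQLite with invented department and employee rows, since the slide doesn't give sample data):

```python
import sqlite3

# In-memory database with the normalized Department/Employee design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (DeptID INTEGER PRIMARY KEY, Name TEXT, Location TEXT);
    CREATE TABLE Employee   (EmployeeID INTEGER PRIMARY KEY, LastName TEXT,
                             FirstName TEXT, DeptID INTEGER REFERENCES Department);
    INSERT INTO Department VALUES (1, 'Accounting', 'Bldg A'), (2, 'IT', 'Bldg B');
    INSERT INTO Employee   VALUES (10, 'Smith', 'Ann', 1), (11, 'Jones', 'Bob', 2);
""")

# The join from the slide: each employee with their department's details.
rows = conn.execute("""
    SELECT e.EmployeeID, e.LastName, e.FirstName, d.DeptID, d.Name, d.Location
    FROM Department d
    INNER JOIN Employee e ON d.DeptID = e.DeptID
    ORDER BY e.EmployeeID
""").fetchall()
for row in rows:
    print(row)
```

Because a department's location lives only in the Department table, changing it once updates what every employee row reports through the join.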
• Entities
– classes of objects that are of interest from a business standpoint, about
which information needs to be maintained
– In the process of modeling, they evolve into database tables.
• Entities are always nouns in business narratives (but not all nouns in business
narratives are entities).
– Examples: Employee, Department, Project
• Entities must have attributes, or properties, that need to be known, which
become columns.
– Employee:
• Name, Birth Date, Salary
– Department:
• Name, Number, Location
• Each entity represents a class of objects, and each instance of a well-formed entity maps to a row in a table.
Each instance of an entity must be uniquely distinguishable from other instances
of the same entity.
– An attribute or set of attributes that uniquely identify an entity is called a
Unique Identifier (UID).
• Relationship
– A bi-directional, significant association between two entities, or
between an entity and itself
– Each (direction of a) relationship has:
• Name
• Optionality
– Either Must Be or May Be
• Degree/Cardinality/Ordinality
– 1:1 or 1:M ( or M:M)
– Degree = 0 is expressed as “may be.”
• Each employee must be assigned to one and only one
department.
• Each department may be responsible for one or more
employees.
• Our definitions for entities, attributes, and relationships must have
equal validity for every instance, not just the typical case.
• This point is critically important.
• First Normal Form
Each cell, the intersection of a row and a column, can contain only one value.
– 1NF requires that each attribute store only one value.
– There can be no repeating groups ( = no “multivalued
attributes”).
– Each attribute of the table is said to be “atomic”.
• For example, each record in the Home table below should
have only one owner.
Mention the PK
convention.
• Unnormalized Entity
What if some homes
have more than three
owners? How would we
write stored procs to
read from this table?
• To support multiple owners we need another entity:
– This will always be the case when an entity has a 1:M relationship
with one of its attributes.
Both entities are now in 1NF.
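The 1NF decomposition can be sketched like this (home address and owner names are invented): the repeating Owner columns become one row per owner in a separate entity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical 1NF design: the repeating Owner columns move to their own
# table, one row per (home, owner) pair, so any number of owners works.
conn.executescript("""
    CREATE TABLE Home  (HomeID INTEGER PRIMARY KEY, Address TEXT);
    CREATE TABLE Owner (OwnerID INTEGER PRIMARY KEY,
                        HomeID INTEGER REFERENCES Home, Name TEXT);
    INSERT INTO Home  VALUES (1, '12 Elm St');
    INSERT INTO Owner VALUES (1, 1, 'Alice'), (2, 1, 'Bob'),
                             (3, 1, 'Carol'), (4, 1, 'Dave');
""")
# Four owners for one home -- impossible with three fixed Owner columns.
owners = [r[0] for r in conn.execute(
    "SELECT Name FROM Owner WHERE HomeID = 1 ORDER BY OwnerID")]
print(owners)
```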
• Second Normal Form
– To be in 2NF, a table must be in 1NF.
– In addition, each non-key attribute must be dependent on all
parts of its primary key.
• There must be no “partial key dependencies”.
– In the previous example:
– The Home entity is not in 2NF.
» The Mayor attribute doesn’t depend on the entire
primary key.
» We need a new entity.
– The Owner entity is not in 2NF.
» The Price of Tea does not depend on the Owner.
» “We decide not to track this attribute.”
• In normalizing to 2NF, we attempt to reduce the amount of redundant
data in a table by extracting it, placing it in new tables, and creating
relationships between tables.
Tables are now in 2NF.
• Third Normal Form
– To be in 3NF, a table must be in 2NF.
– Additionally, all attributes that are not wholly dependent upon
the primary key must be remodeled.
• Each table attribute can depend on nothing other than its
primary key.
• 3NF = “Every non-key attribute must depend on the
key, the whole key, and nothing but the key.”
– In the previous example:
• Sun sign depends on birth date, so it should be stored in a
different table.
• A general modeling principle we see here is that when an
attribute depends on another attribute, a new table will be
necessary to model the relationship.
Entities are now in 3NF.
• Modeling the M:M relationship
– How do we record the owners of individual homes?
• We need an intermediate table that has a M:1
relationship with each of its parent tables.
• The query below shows the name of each home’s
owner(s).
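A minimal sketch of the intermediate (junction) table, with invented homes and owners; HomeOwner has an M:1 relationship with each parent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Home      (HomeID INTEGER PRIMARY KEY, Address TEXT);
    CREATE TABLE Owner     (OwnerID INTEGER PRIMARY KEY, Name TEXT);
    -- The junction table: composite key, one row per home/owner pairing.
    CREATE TABLE HomeOwner (HomeID INTEGER REFERENCES Home,
                            OwnerID INTEGER REFERENCES Owner,
                            PRIMARY KEY (HomeID, OwnerID));
    INSERT INTO Home      VALUES (1, '12 Elm St'), (2, '9 Oak Ave');
    INSERT INTO Owner     VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO HomeOwner VALUES (1, 1), (1, 2), (2, 1);
""")
# Name of each home's owner(s); Alice owns both homes, Bob co-owns one.
rows = conn.execute("""
    SELECT h.Address, o.Name
    FROM Home h
    INNER JOIN HomeOwner ho ON ho.HomeID = h.HomeID
    INNER JOIN Owner o      ON o.OwnerID = ho.OwnerID
    ORDER BY h.HomeID, o.OwnerID
""").fetchall()
print(rows)
```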
• General Remarks:
• The definitions of normal forms provide guidelines for
relational database design. Occasionally, it is necessary to
stray from them to meet practical business requirements in an
OLTP environment.
• There is not a single best way to normalize a database to
conform with a specific set of business requirements.
• Insert, Update, and Delete operations run more quickly in a
normalized database.
• Complex Select statements run more slowly.
• Reasons to denormalize
• The fundamental reason to denormalize is to improve query
performance.
• Consider the case of City, State, and CityStateZip tables.
– These tables can be designed to conform to the third normal form.
– But each time you need to write a query to extract Customer
data, you will need to join data from four tables.
• If no valid business reason exists to divide city, state, and ZIP
Code information into separate tables, then it may make
sense to denormalize.
• Dimension tables in a star schema are intentionally
denormalized.
• Normalized database:
– Many “narrow” tables (i.e. fewer columns)
– Optimized for Insert, Update, and Delete operations
– Slower Select statements because of the need for frequent
join operations
– Few indexes
– Necessary for large OLTP applications
• Non-normalized database:
– Fewer (but “wider”) tables
– Faster Select statements because we don’t need to join as
often
– Transactions are more problematic because of the need to
maintain redundant instances of data during Insert,
Update, and Delete operations
– Many indexes because data is relatively static
– Necessary for large relational OLAP applications
• Data Warehouses
• Data warehouses and data marts are storage mechanisms for
read-only, historical, aggregated data.
• Consider this example: we sell 2 products, dog food and cat food. Each
day, we record the sales of each product. Here is some sample OLTP
data for a couple of days:
• Our data warehouse would usually not record this level of detail.
– Instead, in a warehouse we would summarize, or aggregate, the
data to daily totals. Our records in the data warehouse might look
something like this:
– Here we have reduced the number of records by aggregating the
individual transaction records into daily records that show the
number of each product purchased each day.
– We can certainly generate this data set from the OLTP system by
running a query…
• … but if we want to view our data as aggregated numbers broken
down along a series of criteria (i.e. so-called “by conditions”), then
query performance will improve if we store data in a denormalized
format.
– That’s exactly what we do when implementing a star schema.
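The aggregation step itself is an ordinary GROUP BY. As a sketch (the dates and quantities below are invented, since the slide's sample table isn't in the transcript):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical transaction-level OLTP data: one row per individual sale.
conn.executescript("""
    CREATE TABLE Sales (SaleDate TEXT, Product TEXT, Quantity INTEGER);
    INSERT INTO Sales VALUES
        ('1999-03-01', 'Dog Food', 2), ('1999-03-01', 'Dog Food', 1),
        ('1999-03-01', 'Cat Food', 4), ('1999-03-02', 'Dog Food', 3),
        ('1999-03-02', 'Cat Food', 1), ('1999-03-02', 'Cat Food', 2);
""")
# Roll the individual transactions up to the daily totals a warehouse
# would typically store.
daily = conn.execute("""
    SELECT SaleDate, Product, SUM(Quantity) AS Units
    FROM Sales
    GROUP BY SaleDate, Product
    ORDER BY SaleDate, Product
""").fetchall()
for row in daily:
    print(row)
```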
• It’s important to realize that OLTP is not meant to be the basis of a
decision support system. OLTP applications are optimized for activities
such as recording (high numbers of) orders, etc.
• A system optimized for processing transactions is not optimized to
perform complex analyses designed to uncover hidden trends.
• Therefore, rather than tie up our OLTP system by performing
expensive queries, we should build a less normalized structure
that conforms better to our query needs.
• The Warehouse
– Typical business questions that drive warehouse design:
• How many units did we sell last week?
• Are overall sales of individual products or product categories
higher or lower this year than in previous years?
• On a quarterly or monthly basis, are sales for some
products/categories cyclical?
• In what regions are sales down this year?
– What products/categories in those regions account for the
greatest percentage of the decrease?
– Some characteristics of warehouse business questions:
• Many concern the element of time.
• Many questions require the aggregation of data; sums and
counts are important in an OLAP environment, whereas
individual transactions are important in an OLTP environment.
• Each question looks at data in terms of “by” conditions.
– “On a quarterly and then monthly basis, are Dairy Product
sales cyclical?”
= “We need to see total sales of Dairy Products by
quarter and by month.”
These “by” conditions drive the design
of our star schema.
Each “by” condition is represented by a
Dimension table.
• Dimension Tables – General Remarks
• Product and Geography are common dimensions.
• Date/Time information is almost always stored in a Dimension
table.
• If our data happen to start on a particular date, do we care
what sales have been since that date, or do we care more
about how one year’s sales compare to other years’?
– Comparing one year to another is a common form of trend
analysis accomplished through the use of a star schema.
• Dimension Table Structure
– Dimension tables should have a single-field primary key. This key is often an
identity column.
• The value of the primary key is irrelevant; our information is stored in the
other fields in the table.
• Because the fields are the full descriptions, the dimension tables are often
“wide”, i.e. they contain many large fields.
For example, if we have a
Product dimension, then we’ll
have fields in it that contain
the description, the category
name, the sub-category name,
etc. These fields do not
contain codes that link us to
other tables.
Dimension tables are often small
in terms of row count relative to
Fact tables.
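A sketch of such a dimension table in SQLite (the column names are illustrative, not from any particular warehouse): a surrogate identity key plus full descriptive text, with no codes pointing at other tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ProductDimension (
        ProductID   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key; its value is irrelevant
        Description TEXT,   -- full text, not a code
        Category    TEXT,
        SubCategory TEXT
    )
""")
# The key is generated; we insert only the descriptive fields.
conn.execute(
    "INSERT INTO ProductDimension (Description, Category, SubCategory) "
    "VALUES ('M1652 Refrigerator', 'Major Appliance', 'Refrigerator')")
row = conn.execute("SELECT * FROM ProductDimension").fetchone()
print(row)
```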
• Dimensional Hierarchies (Denormalization):
• In a star schema, the entire hierarchy for a dimension is stored in its
corresponding Dimension table in the data warehouse.
• The product dimension, for example, contains individual products.
– Products are normally grouped into categories, and these categories may
contain sub-categories.
• For example, a product with a product number of M1652 may be a
refrigerator. Thus it belongs in the major appliance category, and in
the refrigerator sub-category.
– We may have more levels of sub-categories to further classify
each product.
– In an OLAP environment, it is preferable to maintain the product hierarchy
in a single table, although this hierarchy would certainly be distributed
among Product, Category, and SubCategory tables in an OLTP
environment.
• This hierarchy allows us to perform “drill-down” functions on the data.
We can perform a query that performs sums by category. We can then
drill-down into that category by calculating sums for the subcategories
for that category. We can then calculate the sums for the individual
products in a particular subcategory.
– The actual sums we are calculating are based on numbers stored in the
fact table.
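The drill-down described above can be sketched against a toy star schema (all product and dollar values invented): the same fact rows are summed first by category, then by sub-category within one category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductDimension (ProductID INTEGER PRIMARY KEY,
                                   Category TEXT, SubCategory TEXT);
    CREATE TABLE SalesFact (ProductID INTEGER, SalesDollars REAL);
    INSERT INTO ProductDimension VALUES
        (1, 'Major Appliance', 'Refrigerator'),
        (2, 'Major Appliance', 'Range'),
        (3, 'Electronics',     'Television');
    INSERT INTO SalesFact VALUES (1, 500.0), (2, 300.0), (3, 700.0);
""")
# Level 1: totals by category.
by_cat = conn.execute("""
    SELECT pd.Category, SUM(sf.SalesDollars)
    FROM SalesFact sf JOIN ProductDimension pd ON pd.ProductID = sf.ProductID
    GROUP BY pd.Category ORDER BY pd.Category
""").fetchall()
# Level 2: drill down into one category by sub-category.
by_sub = conn.execute("""
    SELECT pd.SubCategory, SUM(sf.SalesDollars)
    FROM SalesFact sf JOIN ProductDimension pd ON pd.ProductID = sf.ProductID
    WHERE pd.Category = 'Major Appliance'
    GROUP BY pd.SubCategory ORDER BY pd.SubCategory
""").fetchall()
print(by_cat, by_sub)
```

Because the whole hierarchy lives in one dimension table, each drill-down level is just a different GROUP BY column, with no extra joins.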
• Fact Tables
• When we talk about the way we want to look at data, we usually want
to see some sort of aggregated data. These data are called measures.
• Measures are numeric values that are measurable and additive.
– Sales dollars are a very common measure.
– The Number of Customers we have is also a typical measure.
– We’d probably track both of these by day.
• Fact tables are used to store measures, or facts, which are numeric and
additive across some or all dimensions.
• In the following star schema, sales dollars are numeric, and we can
examine total sales in terms of product, category, and time period.
• Fact tables are “narrow” in the sense that they contain few (and
numeric) columns, but they do contain large numbers of rows.
– Fact tables are responsible for most of the disk space used in a
warehouse.
• Fact Table Granularity
– Granularity refers to the level of detail in a fact table
and is one of the most important design decisions in
data warehouse planning.
• Granularity is often determined by the time dimension.
– For example, you may elect to store only weekly or
monthly totals for sales dollars.
• Granularity determines how far we can drill down
without recourse to the source OLTP data.
– Many if not most OLAP systems have daily grain in
the Time dimension.
• Selecting a finer grain results in more records in the
fact table.
• Choose data types for fact table columns that keep the
table as small as possible.
• Aggregations
– Fact table data consists of aggregations that
are based on the fact table’s granularity.
– Frequently we’ll want to aggregate to a
higher level.
• We may choose to keep total sales dollars at a
quarterly or monthly level.
• We may be interested in only a particular product
or category in this case.
• A better alternative is to build a cube structure…
Simple Star Schema:
• To obtain total sales for all major appliances during March of 1999:
Select Sum (sf.SalesDollars) as TotalSales
From SalesFact sf
Inner Join TimeDimension td
On td.TimeID = sf.TimeID
Inner Join ProductDimension pd
On pd.ProductID = sf.ProductID
Where pd.Category = 'Major Appliance'
And td.Month = 3 And td.Year = 1999
• Snowflake Schemas
• Sometimes dimension tables have hierarchies broken out into
separate tables. This will result in a different schema type known as a
snowflake.
– This is a more normalized structure, but leads to more difficult
queries and slower response times.
– It does conserve more disk space than a star schema that
contains the same data.
Graphical comparison of Star and Snowflake schemas
Star Schema
Snowflake Schema
• Further Warehouse Design Considerations
– Changing Dimensions
– In the schema below, consider a scenario in which we have
realigned some of our stores, placing them in different territories
and regions.
– In the StoreDimension table, we have each store in a
particular region, territory, and zone.
– If we simply update the StoreDimension table with new
territory/region information, and then examine historical
sales for a region, the numbers will no longer be accurate.
– To address this issue, consider creating new records
for affected stores.
• Every new record will contain each store’s new region,
but leaves old store records intact along with the old
regional sales data.
• This approach, however, prevents us from
comparing this store’s current sales to its
historical sales unless we keep track of its
previous StoreID. This may require an extra field
called PreviousStoreID or something similar.
– There are no right and wrong answers. Each case may
require a different solution.
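The new-record approach (essentially what Kimball calls a Type 2 slowly changing dimension) can be sketched like this, with an invented store and regions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical StoreDimension. Rather than update Store 42's region in
# place, we add a new row with a new surrogate key that points back to
# the old row via PreviousStoreID.
conn.executescript("""
    CREATE TABLE StoreDimension (
        StoreID         INTEGER PRIMARY KEY,
        StoreName       TEXT,
        Region          TEXT,
        PreviousStoreID INTEGER   -- links the new record to its predecessor
    );
    INSERT INTO StoreDimension VALUES (1, 'Store 42', 'East', NULL);
    -- Store 42 is realigned into the West region: insert, don't update.
    INSERT INTO StoreDimension VALUES (2, 'Store 42', 'West', 1);
""")
rows = conn.execute(
    "SELECT StoreID, Region, PreviousStoreID FROM StoreDimension ORDER BY StoreID"
).fetchall()
# The old East record survives, so historical regional sums stay accurate.
print(rows)
```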
• When building an enterprise warehouse from local data marts:
– It is necessary to produce a set of conformed dimensions.
• It will also be necessary to standardize the definitions of facts.
– A conformed dimension is a dimension that means the same thing with
every possible fact table to which it can be joined.
– Generally, this means that a conformed dimension is identical in
each data mart.
– The conformed Product dimension is the enterprise’s agreed-upon master
list of products, including all product attributes and all product rollups such
as category, subcategory, and department.
– The conformed Calendar dimension will almost always be a table of
individual days, spanning a decade or more. Each day will have many useful
attributes drawn from the legal calendars of the various states and
countries the enterprise deals with, as well as special fiscal calendar periods
and marketing seasons relevant only to internal managers.
– Most conformed dimensions will naturally be defined at the most granular
level possible.
• The grain of the Customer dimension will be the individual customer.
Simplified Star Schema with Conformed Dimensions
• Permissible Variations of Conformed Dimensions
– It is possible to create a subset of a conformed
dimension table for certain data marts if you know
that the domain of the associated fact table only
contains that subset.
– For example, the Product table for a specific data
mart may be restricted so as to include only those
products manufactured at that location, if the data
mart in question pertains to that location only.
Links:
• Wikipedia page on normalization
• Databases.About.Com page on normalization
• MSDN Glossary
• Oracle-specific site where I got some schema diagrams
• Ralph Kimball's Data Warehousing site
• Kimball on Fact and Dimension Tables
• BI and Data Warehouse Glossary