Data Warehousing

Download Report

Transcript Data Warehousing

Data Warehousing - 3
ISYS 650
Snowflake Schema
one or more dimension tables do not join directly to the fact
table but must join through other dimension tables.
Location
Dimension
LocationCode
State
City
Can group by State, City
Product
Category
CategoryID
Description
FactTable
LocationCode
PeriodCode
Rating
PID
Qty
Amount
Product
Dimension
PID
Pname
CategoryID
CustomerRating
Dimension
Rating
Description
Period
Dimension
PeriodCode
Year
Quarter
Multiple Fact Tables
• A star schema can contain multiple fact tables.
• Case 1: Unrelated facts:
– Multiple fact tables exist because they contain
unrelated facts; for example, invoices and sales.
• Case 2: Multiple fact tables are often used to hold
various levels of aggregated (summary) data,
particularly when the amount of aggregation is
large; for example, daily sales, monthly sales, and
yearly sales.
– Improve performance
Surrogate Key
• A surrogate key is the primary key for a dimension
table and is independent of any keys provided by
source data systems.
• Surrogate keys are created and maintained in the data
warehouse and should not encode any information
about the contents of records; automatically increasing
integers make good surrogate keys.
• The original key for each record is carried in the
dimension table but is not used as the primary key.
• Surrogate keys provide the means to maintain data
warehouse information when dimensions change.
Slowly Changing Dimensions
• A characteristic of dimensions is that
dimension data is relatively stable.
• Of particular concern is the potential effect of
a change to a dimension attribute on how
historical data is tracked and summarized.
• "Slowly changing dimensions" is the
customary term used for discussions of issues
associated with the impact of changes to
dimension attributes.
Three Ways of Dealing with Slowly
Changing Dimension
• Type 1: Overwrite the dimension record.
• Type 2: Add a new dimension record.
• Type 3: Create new fields in the dimension
record.
– Not recommended
Type 1
• Type 1 changes cause history to be rewritten,
which may affect analysis results if an
attribute is changed that is used to group data
for summarization.
• Changes to a dimension attribute that is never
used for analysis can be managed by simply
changing the data to the new value.
Type 2
• Type 2 changes cause history to be partitioned
at the event that triggered the change. Data
prior to the event continues to be summarized
and analyzed as before; new data is
summarized and analyzed in accordance with
the new value of the data.
• Surrogate keys on the dimension table are
required for Type 2 solutions.
Extraction in Data Warehouses
http://download.oracle.com/docs/cd/B10501_01/server.920/a
96520/extract.htm
• Full Extraction
– The data is extracted completely from the source
system.
• Incremental Extraction
– At a specific point in time, only the data that has
changed since a well-defined event back in history
will be extracted.
Change Data Capture
• Timestamps
– The tables in some operational systems have
timestamp columns.
• Partitioning
– Some source systems might use Oracle range
partitioning, such that the source tables are
partitioned along a date key, which allows for easy
identification of new data.
• Triggers
– Triggers can be created in operational systems to keep
track of recently updated records.
Operational Data Store, ODS
• The ODS is designed to support tactical decision-making.
• It has elements of both data warehouse and a
transaction system.
– Like a data warehouse, the ODS typically contains data
consolidated from multiple systems and grouped by subject
area.
– Like a transaction system, the ODS may be updated by
business users, and contains relatively little historical data.
• ODS: 30 to 60 days of information
• Data warehouse typically contains years of data.
– ODS is designed to quickly perform relatively simply queries
on smaller volumes of data such as finding orders of a
customer or looking for available items in the retails store.
While data warehouse queries typically are complex and on
high volumes of data.
A business case for an operational data
store
• Customer Call Center
– Call center operators have little need for broad
analytical queries that reveal trends in customer
behavior.
– Need up-to-date information about all
transactions involving the complaining customer.
– Data may be a replicated mirror of the transaction
system, or data warehouse star schema.
Real-time/Active DW at Teradata
Traditional vs Active DW Environment
Enterprise Decision Evolution and DW
Data Warehouse Appliance
• A data warehouse appliance is a combination hardware
and software product that is designed specifically for
analytical processing. An appliance allows the purchaser
to deploy a high-performance data warehouse right out
of the box.
• It comes with its own operating system, storage,
database management system and software. It uses
massively parallel processing and distributes data across
integrated disk storage, allowing independent processors
to query data in parallel.
• Data warehouse appliances use Open Database
Connectivity (ODBC), Java Database Connectivity (JDBC),
and OLE DB interfaces to integrate with other extracttransform-load (ETL) tools and business intelligence
applications.
A Web-based DW Architecture
Web pages
Client
(Web browser)
Internet/
Intranet/
Extranet
Application
Server
Web
Server
Data
warehouse
SaaS BI/DW
• Growing demand for SaaS BI/DW tools:
– Especially among small and medium-sized businesses
that lack the IT infrastructure to support an onpremise BI installment.
– SaaS BI is also useful for quick, one-off BI projects
– As workspaces for power users that want to
experiment with more advanced analytics.
• Drawbacks:
– Integrating large data volumes from the cloud creates
latency problems.
– It makes it more difficult to control the data and
maintain "a single version of the truth."
Bitmap Index
1. Rows - possible values of the attribute
2. Columns - table rows
3. Bit indicates whether the attribute of a row has
the values
The bitmap index is used where the values of a field repeats very
frequently(not many distinct values), it is not used for primary key index.
Bitmap Index for Data Warehouse’s
Fact Table
• The number of rows in a dimension tables is
much smaller than the number of rows in the
fact table.
• The dimension keys have a relatively small
number of distinct values compared to the
number of rows in the fact table.