Data Warehousing

Download Report

Transcript Data Warehousing

Data Warehousing
Definition
• Data Warehouse:
– A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of
management decision-making processes
– Subject-oriented: e.g. customers, patients,
students, products
– Integrated: Consistent naming conventions,
formats, encoding structures; from multiple data
sources
– Time-variant: Contain a time dimenstion so that it
may be used to study trends and changes
– Nonupdatable: Read-only, periodically refreshed
• Data Mart:
– A data warehouse that is limited in scope
Need for Data Warehousing
• Integrated, company-wide view of high-quality
information (from disparate databases)
• Separation of operational and informational (decision
support) systems and data (for improved
performance)
Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
All involve some form of extraction, transformation
and loading (ETL)
Figure 11-2: Generic two-level data warehousing architecture
L
T
One,
companywide
warehouse
E
Periodic extraction  data is not completely current in warehouse
Figure 11-3 Independent data mart
data warehousing architecture
Data marts:
Mini-warehouses, limited in scope
L
T
E
Separate ETL for each
independent data mart
Data access complexity
due to multiple data marts
The ETL Process
• Capture/Extract
• Scrub or data cleansing
• Transform:
– Convert data from the format of the source to
the format of the data warehouse.
• Load and Index
ETL = Extract, transform, and load
Load/Index= place transformed data
into the warehouse and create indexes
Figure 11-10:
Steps in data
reconciliation
(cont.)
Refresh mode: bulk rewriting
of target data at periodic intervals
Update mode: only changes
in source data are written to data
warehouse
Index
• Bitmap index
• Join index
Figure 6-8
Bitmap index
index
organization
Bitmap saves on space requirements
Rows - possible values of the attribute
Columns - table rows
Bit indicates whether the attribute of a row has the values
Figure 6-9 Join Indexes–speeds up join operations
Star Schema for Data
Warehouse
• Objectives
– Ease of use for decision support applications
– Fast response to predefined user queries
– Customized data for particular target audiences
Also called “dimensional model”
• Dimension:
– A dimension is a term used to describe any
category used in analyzing data, such as time,
geography, and product line.
Figure 11-13 Components of a star schema
Fact tables contain factual
or quantitative data
1:N relationship between
dimension tables and fact tables
Dimension tables are denormalized to
maximize performance
Dimension tables contain descriptions
about the subjects of the business
Excellent for ad-hoc queries, but bad for online transaction processing
Figure 11-14 Star schema example
Fact table provides statistics for sales
broken down by product, period and
store dimensions
Figure 11-15 Star schema with sample data
On-Line Analytical Processing (OLAP) Tools
• The use of a set of graphical tools that
provides users with multidimensional views of
their data and allows them to analyze the
data using simple windowing techniques
• Relational OLAP (ROLAP)
– Traditional relational representation
• Multidimensional OLAP (MOLAP)
– Cube structure
• OLAP Operations
– Cube slicing–come up with 2-D view of data
– Drill-down–going from summary to more detailed
views
Figure 11-23 Slicing a data cube
Figure 11-24
Example of drill-down
Starting with summary
data, users can obtain
details for particular
cells
Summary report
Drill-down with
color added
Data Mining and Visualization
• Knowledge discovery using a blend of statistical, AI, and
computer graphics techniques
• Goals:
– Explain observed events or conditions
– Confirm hypotheses
– Explore data for new or unexpected relationships
• Techniques
–
–
–
–
–
–
–
–
–
Statistical regression
Decision tree induction
Clustering and signal processing
Affinity
Sequence association
Case-based reasoning
Rule discovery
Neural nets
Fractals
• Data visualization–representing data in graphical/multimedia
formats for analysis
Pivot Table
• Excel:
– Drill Down, Roll Up
• Access CrossTab query
SQL GROUPING SETS
• GROUPING SETS
– SELECT CITY,RATING,COUNT(CID) FROM HCUSTOMERS
– GROUP BY GROUPING SETS(CITY,RATING,(CITY,RATING),())
– ORDER BY CITY;
• Note: () indicates that an overall total is desired.
SQL CUBE
• Perform aggregations for all possible
combinations of columns indicated.
– SELECT CITY,RATING,COUNT(CID) FROM HCUSTOMERS
– GROUP BY CUBE(CITY,RATING)
– ORDER BY CITY, RATING;
SQL ROLLUP
• The ROLLUP extension causes
cumulative subtotals to be calculated for
the columns indicated. If multiple columns
are indicated, subtotals are performed for
each of the columns except the far-right
column.
– SELECT CITY,RATING,COUNT(CID) FROM HCUSTOMERS
– GROUP BY ROLLUP(CITY,RATING)
– ORDER BY CITY, RATING