Data Warehousing

Download Report

Transcript Data Warehousing

Data Warehousing
On-Line Analytical Processing (OLAP) Tools
• The use of a set of graphical tools that provides
users with multidimensional views of their data
and allows them to analyze the data using simple
windowing techniques
• Relational OLAP (ROLAP)
– Traditional relational representation
• Multidimensional OLAP (MOLAP)
– Cube structure
• OLAP Operations
– Cube slicing–come up with 2-D view of data
– Drill-down–going from summary to more detailed
views
– Roll-up – the opposite direction of drill-down
– Reaggregation – rearrange the order of dimensions
Slicing a data cube
Summary report
Example of drill-down
Starting with summary
data, users can obtain
details for particular
cells
Drill-down with
color added
Excel’s Pivot Table
• Data/Pivot Table
– Drilldown, rollup, reaggregation
Access Pivot Form
Drill Down
Data Warehouse
• A subject-oriented, integrated, time-variant,
non-updatable collection of data used in
support of management decision-making
processes
– Subject-oriented: e.g. customers, employees,
locations, products, time periods, etc.
• Dimensions for data analysis
– Integrated: Consistent naming conventions,
formats, encoding structures; from multiple data
sources
– Time-variant: Can study trends and changes
– Nonupdatable: Read-only, periodically refreshed
• Data Mart:
– A data warehouse that is limited in scope
Need for Data Warehousing
• Integrated, company-wide view of high-quality
information (from disparate databases)
• Separation of operational and informational systems
and data (for improved performance)
Generic two-level data warehousing architecture
L
T
One,
companywide
warehouse
E
Periodic extraction  data is not completely current in warehouse
The ETL Process
•
•
•
•
Capture/Extract
Scrub or data cleansing
Transform
Load and Index
ETL = Extract, transform, and load
Capture/Extract…obtaining a snapshot of a chosen subset
of the source data for loading into the data warehouse
Static extract = capturing
a snapshot of the source
data at a point in time
Incremental extract =
capturing changes that
have occurred since the last
static extract
Data Warehouse Design
- Star Schema • Also called “dimensional model”
• Fact table
– contain detailed business data
• Dimension tables
– contain descriptions about the subjects of the
business such as customers, employees, locations,
products, time periods, etc.
– A dimension is a term used to describe any category
used in analyzing data, such as time, geography, and
product line.
Star schema example
Fact table provides statistics for sales
broken down by product, period and
store dimensions
Dimension tables contain descriptions about the subjects of the business
Star schema with sample data
Example:
Order Processing System
CID
Cname
City
OID
ODate
Rating
SalesPerson
Customer
1
Has
M
Order
M
Qty
Has
M
Product
Price
PID
Pname
Star Schema
Location
Dimension
LocationCode
State
City
Can group by State, City
Product
Category
CategoryID
Description
(Snowflake model)
FactTable
LocationCode
PeriodCode
Rating
PID
Qty
Amount
Product
Dimension
PID
Pname
CategoryID
CustomerRating
Dimension
Rating
Description
Period
Dimension
PeriodCode
Year
Quarter
From SalesDB to
MyDataWarehouse
• Extract data from SalesDB:
– Create query to get the data
– Download to MyDataWareHouse
• File/Import/Save as Table
• Data scrub/cleasing,and transform:
– Transform City to Location
– Transform Odate to Period
• Load data to FactTable
Figure 6-8
Bitmap index
index
organization
Bitmap saves on space requirements
Rows - possible values of the attribute
Columns - table rows
Bit indicates whether the attribute of a row has the values
Figure 6-9 Join Indexes–speeds up join operations
Data Mining and Visualization
• Knowledge discovery using a blend of statistical, AI, and
computer graphics techniques
• Goals:
– Explain observed events or conditions
– Confirm hypotheses
– Explore data for new or unexpected relationships
• Techniques
–
–
–
–
–
–
–
–
–
Statistical regression
Decision tree induction
Clustering and signal processing
Affinity
Sequence association
Case-based reasoning
Rule discovery
Neural nets
Fractals
• Data visualization–representing data in graphical/multimedia
formats for analysis
SQL GROUPING SETS
• GROUPING SETS
– SELECT CITY,RATING,COUNT(CID) FROM HCUSTOMERS
– GROUP BY GROUPING SETS(CITY,RATING,(CITY,RATING),())
– ORDER BY CITY;
• Note: () indicates that an overall total is desired.
SQL CUBE
• Perform aggregations for all possible
combinations of columns indicated.
– SELECT CITY,RATING,COUNT(CID) FROM HCUSTOMERS
– GROUP BY CUBE(CITY,RATING)
– ORDER BY CITY, RATING;
SQL ROLLUP
• The ROLLUP extension causes
cumulative subtotals to be calculated for
the columns indicated. If multiple columns
are indicated, subtotals are performed for
each of the columns except the far-right
column.
– SELECT CITY,RATING,COUNT(CID) FROM HCUSTOMERS
– GROUP BY ROLLUP(CITY,RATING)
– ORDER BY CITY, RATING