Data Mining and Data Warehouse

Download Report

Transcript Data Mining and Data Warehouse

Data Warehousing
Data Warehousing
• Make the right decisions for your organization
– Rapid access to all kinds of information
– Research and analyze the past data
– Identify and predict future trends
• The construction of data warehouse
– Involve data cleaning and data integration
– Provide on-line analytical processing (OLAP) tools for the
interactive analysis of data
• W.H. Inmon
– A data warehouse is a subject oriented, integrated, timedependent and non-volatile collection of data in support
of management’s decision making process
Characteristics of Data
Warehouse
• Subject-oriented
– Data warehouse is designed for decision
support and around major subject, such as
customer and sales
– Not all information in the operational database
is useful
• Integrated
– Integrate multiple heterogeneous sources and
make it consistent
– The data from different sources may use
different names for the same entities
Characteristics of Data
Warehouse
• Time dependent
– Record the information and the time when it
was entered
– Data mining can be done from the data in some
period of time
• Non-volatile
– Data in a data warehouse is never updated
Data Warehousing
• Data warehousing
– The process of constructing and using data warehouse
• Two types of databases
– Operational database
• Large database in operation
• Built for high speed and large number of users
– Data warehouse
• Designed for decision support
• Contain vast amounts of historical data
• Data mart
– A department subset of the data warehouse that focuses on
selected subjects, and its scope is department-wide
OLTP & OLAP System
• OLTP (On-Line Transaction Processing)
System
– The major task of operational database is to
perform on-line transaction and query
processing
• OLAP (On-Line Analytical Processing)
System
– Data warehouse system serves users on data
analysis and decision making
Differences ~ OLTP & OLAP
• Characteristic
– OLTP: operational processing
– OLAP: informational processing
• Orientation
– OLTP: transaction-oriented
– OLAP: analysis-oriented
• User
– OLTP: customer, DBA
– OLAP: manager, analyst
• Function
– OLTP: day-to-day operations
– OLAP: information requirement, decision support
Differences ~ OLTP & OLAP
• DB design
– OLTP: ER based, application-oriented
– OLAP: star/snowflake, subject-oriented
• Data
– OLTP: current; guaranteed up-to date
– OLAP: historical
• Unit of work
– OLTP: short, simple query
– OLAP: complex query
• Access
– OLTP: read/write
– OLAP: mostly read
• DB size
– OLTP: 100 MB to GM
– OLAP: 100 GB to TB
Differences ~ OLTP & OLAP
Data Warehousing
Relational Data Model
Relational Schema
Multidimensional Data Model
Star Schema or Snowflake Schema
Model & Schema for
Relational Database
Relational Data Model
Relational
Schema
Multidimensional Data Model
• Example: AllElectronics creates a sales
data warehouse in order to keep records
of the store’s sales
– Fact Table
• sales amount in dollars and number of units sold
(measure)
– Dimension Tables
• time, item, branch, and location
• Multidimensional data model views data in
the form of a data cube
Two Dimensions
• 2-D view of sales data for item sold per
quarter in the city of Vancouver. The
measure is dollars_sold (in thousands)
Dimensions
Measures
Three Dimensions
• 3-D view of sales data according to the
dimensions time, item and location. The
measure is dollars_sold (in thousands)
Dimensions
Measures
Three Dimensions
• 3-D data cube representation according to
the dimensions time, item and location. The
measure is dollars_sold (in thousands)
Four Dimensions
• 4-D data cube of sales data according to the
dimensions supplier, time, item and location.
The measure is dollars_sold (in thousands)
Schemas for Multidimensional
Data Model
• Star Schema
• Snowflake Schema
• Fact Constellation Schema
Star Schema
Snowflake Schema
Snowflake Schema
• Some dimension tables are normalized to
reduce redundancies and save storage space
• Reduce the effectiveness of browsing since
more join will be needed to execute a query
• This saving of space is negligible in
comparison to the magnitude of the fact
table
• Snowflake schema is not as popular as the
start schema in data warehouse design
Fact Constellation Schema
• Multiple fact tables share dimension tables
OLAP Technologies
Concept Hierarchies
Three-Tier DW Architecture
Case Study in Data
Warehousing
公司簡介
公司簡介
公司簡介
背景資料
• A公司利用傳統的E-R Model 來建立其關
聯式資料庫系統
• A公司發現此種資料庫系統無法即時地滿
足高階主管對有效資訊的取得與分析,
進而做出決策
– 傳統的E-R Model資料模型的設計對資料的
一致性 (Consistency) 及避免資料的重複
(Duplication) 上有最佳的效率
– 對於Multi-constraint及Multi-join的多維度查
詢除了會拉長查詢的時間外,還會搶奪系統
資源,造成系統負荷過重而產生瓶頸
背景資料
• A公司決定利用多維度資料模型
(Multidimensional Data Model) 所設計的
資料庫系統來解決上述的問題
– 建立資料倉儲 (Data Warehousing)
– 一次滿足所有的限制,而不需大量的合併動
作,同時使用者介面也較為和善
建立多維度資料庫的步驟
• 了解作業流程與需求,以作為設計時的基礎知
識,此部份可藉由與客戶的訪談、閱讀交易系
統文件、分析現有作業流程而得知
• 界定Fact Table內要有哪些組成?此部份要注意
到是否能滿足第一步驟所定義的需求
• 找出用戶的思考觀點及每一個思考觀點的層級
關係,也就是Dimension Table
• 定義Fact Table的Measure,這些Measure是各個
維度所可能會取用的值
因
果
關
係
圖
因
果
關
係
圖
因果關係圖
因
果
關
係
圖
多維度資料庫的建立
多維度資料庫的建立
多維度資料庫的建立
多維度資料庫的建立
多維度資料庫的建立
多維度資料庫的建立
多維度資料庫的建立
多維度資料庫的建立
多維度資料庫的建立
多維度資料庫的建立
• 其餘表格依此類推。
• 最後共產生共20個 Fact Tables及數十個
Dimension Tables 。
• 這些表格為OLAP系統或資料探勘(Data
Mining)系統的輸入(Input)。
• 利用這些系統我們才能得到更進一步的
統計及知識的輸出(Output)。
Design of Data Warehouse
• How can I design a data warehouse ?
– Top-down approach
– Bottom-up approach
– Combination of both
• In general, the warehouse design process consists
of the following steps
–
–
–
–
Choose a business process to model
Choose the gain of the business process
Choose the dimensions
Choose the measures
其它應用實例
其它應用實例
其它應用實例