Chapter 15 - Data Miners

Download Report

Transcript Chapter 15 - Data Miners

Chapter 15
Data Warehousing, OLAP, and
Data Mining
Introduction
•
•
•
•
Data, data, data…everywhere!
Information…that’s another story!
Especially, the right information @ the right time!
Data warehousing’s goal is to make the right
information available @ the right time
• Data warehousing is a data store (eg., a
database of some sort) and a process for
bringing together disparate data from throughout
an organization for decision-support purposes
2
Introduction
• Data warehouses are natural allies for
data mining (work together well)
• Data mining can help fulfill some of the
goal of data warehouses – right
information @ the right time
• Relational database management systems
(RDBMS), such as Oracle, DB2, Sybase,
Informix, Focus, SQL Server, etc. are often
used for data warehousing
3
Definitions of a Data Warehouse
“A subject-oriented, integrated, time-variant and
1.
non-volatile collection of data in support of
management's decision making process”
- W.H. Inmon
2.
“A copy of transaction data, specifically
structured for query and analysis”
- Ralph Kimball
4
Data Warehouse
• For organizational learning to take place, data
from many sources must be gathered together
and organized in a consistent and useful way –
hence, Data Warehousing (DW)
• DW allows an organization (enterprise) to
remember what it has noticed about its data
• Data Mining techniques make use of the data in
a Data Warehouse
5
Data Warehouse
Enterprise
“Database”
Customers
Orders
Transactions
Etc…
Vendors
Etc…
Data Miners:
• “Farmers” – they know
• “Explorers” - unpredictable
Copied,
organized
summarized
Data
Warehouse
Data Mining
6
Data Warehouse
 A data warehouse is a copy of transaction data
specifically structured for querying, analysis, reporting,
and more rigorous data mining
 Note that the data warehouse contains a copy of the
transactions which are not updated or changed later by
the transaction system
 Also note that this data is specially structured, and may
have been transformed when it was copied into the data
warehouse
7
Data Mart
• A Data Mart is a smaller, more focused
Data Warehouse – a mini-warehouse.
• A Data Mart typically reflects the business
rules of a specific business unit within an
enterprise.
8
Data Warehouse to Data Mart
Data
Warehouse
Data Mart
Decision
Support
Information
Data Mart
Decision
Support
Information
Data Mart
Decision
Support
Information
9
Generic Architecture of Data
(synonym) Transaction data
10
Transaction (Operational) Data
• Operational (production) systems create (massive
number of) transactions, such as sales, purchases,
deposits, withdrawals, returns, refunds, phone calls, toll
roads, web site “hits”, etc…
• Transactions are the base level of data – the raw
material for understanding customer behavior
• Unfortunately, operational systems change due to
changing business needs
• Fortunately, operational systems can usually be changed
to support changing business needs
• Data warehousing strategies need to be aware of
operational system changes
11
Operational Summary Data
Summaries are for a
specific time period
and utilize the
transaction data for
that time period
Other Examples???
12
Decision Support Summary Data
• The data that are used to help make decisions about the
business
– Financial Data, such as:
• Income Statements (Profit & Loss)
• Balance Sheets (Assets – Liabilities = Net Worth)
– Sales summaries
– Other examples???
• Data warehouses maintain this type of data, however
financial data “of record” (for audit purposes) usually
comes from databases and not the data warehouse
(confusing???)
• Generally, it is a bad idea to use the same system for
analytic and operational purposes
13
Database Schema
• Database schema defines the structure of data,
not the values of the data (e.g., first name, last
name = structure; Ron Norman = values of the
data)
• In RDBMS:
– Columns = fields = attributes (A,B,C)
– Rows = records = tuples (1-7)
14
Logical & Physical Database Schema
• Describes data in a
way that is familiar to
business users
• Describes the data
the way it will be
stored in an RDBMS
which might be
different than the way
the logical shows it
15
Metadata
• General definition: Data about data !!!
– Examples:
• A library’s card catalog (metadata) describes publications (data)
• A file system maintains permissions (metadata) about files (data)
• A form of system documentation including:
–
–
–
–
–
Values legally allowed in a field (e.g., AZ, CA, OR, UT, WA, etc.)
Description of the contents of each field (e.g., start date)
Date when data were loaded
Indication of currency of the data (last updated)
Mappings between systems (e.g., A.this = B.that)
• Invaluable, otherwise have to research to find it
16
Business Rules
• Highest level of abstraction from operational
(transaction) data
• Describes why relationships exist and how they are
applied
• Examples:
– Need to have 3 forms of ID for credit
– Only allow a maximum daily withdrawal of $200
– After the 3rd log-in attempt, lock the log-in screen
– Accept no bills larger than $20
– Others???
17
General Architecture for Data Warehousing
• Source systems
• Extraction, (Clean),
Transformation, &
Load (ETL)
• Central repository
• Metadata repository
• Data marts
• Operational feedback
• End users (business)
18
Where does OLAP fit in?
19
OLAP Overview
gender
• Interactive, exploratory analysis of
multidimensional data to discover patterns
ts
n
e
age
id
c
c
a
20
OLAP Architecture
21
Server Options
• Single processor
• Symmetric
multiprocessor (SMP)
• Massively parallel
processor (MPP)
22
OLAP Server Options
• ROLAP (Relational)
• MOLAP (Multidimensional)
• HOLAP (Hybrid)
23
OLAP – Online Analytical Processing
• A definition:
• Data representation is in the form of a CUBE
• OLAP goes beyond SQL with its analysis
capabilities
• Key feature of OLAP: Relevant multi-dimensional
views such as products, time, geography
24
OLAP Cube - 1
25
OLAP Cube - 2
26
OLAP Cube - 3
• Star Structure (quite common)
27
OLAP Cube - 4
The
Cube
28
OLAP Cube - 5
ThreeDimensional
Cube
Display
Page
Region:
North
Columns
Sales
Red
blob
Rows
Year
Blue
blob
Total
1996
1997
Total
29
OLAP Cube - 6
SixDimensional
Cube
Dimension
Brand
Store
Customer segment
Product group
Period
Variable
Example
Mt. Airy
Atlanta
Business
Desks
January
Units sold
30
Rotation (Pivot Table)
31
Drill Down
Region
Africa
Asia
Europe
North America
Pacific
South America
Nation
China
Japan
India
Singapore
Sales variance
105%
57%
122%
97%
85%
163%
Sales variance
123%
52%
87%
95%
32
OLAP Examples
• http://perso.wanadoo.fr/bernard.lupin/english/example.htm
• Excel Pivot Table example (similar to OLAP cube)
33
Sample of OLAP products
Just a snippet from http://www.olapreport.com/ProductsIndex.htm ; not an endorsement
34
Data Mining versus OLAP
35
Data Mining versus OLAP
• OLAP - Online Analytical
Processing
– Provides you with a very
good view of what is
happening, but can not
predict what will happen
in the future or why it is
happening
36
Results of Data Mining Include:
• Forecasting what may happen in the future
• Classifying people or things into groups by
recognizing patterns
• Clustering people or things into groups
based on their attributes
• Associating what events are likely to occur
together
• Sequencing what events are likely to lead
to later events
37
End of Chapter 15
38