Transcript Document

CR32 Knowledge Management
and Adaptive Systems
09: Data Warehousing
based on an online presentation by
Ronald J Norman
http://www.grossmont.edu/ronaldnorman/
by Professor Ronald J Norman of
Grossmont College, CA, USA
• Prof Norman used these slides in his data
mining course based on Data Mining
Techniques (Second Edition) By Michael
J. A. Berry and Gordon S. Linoff
2004 John Wiley & Sons
• Management-oriented textbook...
http://www.data-miners.com/companion/dmt.html
2
Introduction
•
•
•
•
Data, data, data…everywhere!
Information…that’s another story!
Especially, the right information @ the right time!
Data warehousing’s goal is to make the right
information available @ the right time
• Data warehousing is a data store (eg., a filestore
or database of some sort) and a process for
bringing together disparate data from throughout
an organization for decision-support purposes
3
Introduction
• Data warehouses are natural allies for data
mining (work together well)
• Data mining can help fulfill some of the goal of
data warehouses – right information @ the right
time
• Relational database management systems
(RDBMS), such as Oracle, DB2, Sybase,
Informix, Focus, SQL Server, etc. can be used
for data warehousing; or just store as text/HTML
4
Definitions of a Data Warehouse
“A subject-oriented, integrated, time-variant and
1.
non-volatile collection of data in support of
management's decision making process”
- W.H. Inmon
2.
“A copy of transaction data, specifically
structured for query and analysis”
- Ralph Kimball
5
CW corpus as a Data Warehouse
Subject-oriented: English terminology on WWW
Integrated: harvested from many sources, into a
1.
single standard format and file-store
Time-variant : WWW pages change!
- W.H. Inmon
Non-volatile : corpus is a static snap-shot
2.
Copy of transaction data: cache
structured for query and analysis: raw text
yields word-frequency list
- Ralph Kimball
6
Data Warehouse
• For organizational learning to take place, data
from many sources must be gathered together
and organized in a consistent and useful way –
hence, Data Warehousing (DW)
• DW allows an organization to archive snapshots
of its data, and what it has noticed about its data
• Data Mining techniques make use of the data in
a Data Warehouse
7
Data Warehouse
Enterprise
“Database”
Customers
Orders
Transactions
Etc…
Vendors
Etc…
Data Miners:
• “Farmers” – they know
• “Explorers” - unpredictable
Copied,
organized
summarized
Data
Warehouse
Data Mining
8
Data Warehouse
 A data warehouse is a copy of transaction data
specifically structured for querying, analysis, reporting,
and more rigorous data mining
 Note that the data warehouse contains a copy of the
transactions which are not updated or changed later by
the transaction system
 Also note that this data is specially structured, and may
have been transformed when it was copied into the data
warehouse
9
Data Mart
• A Data Mart is a smaller, more focused Data
Warehouse – a mini-warehouse.
• A Data Mart typically reflects the business rules
of a specific business unit within an enterprise.
• Which English dominates the WWW, UK or US:
each student captured a Data Mart for 1 domain.
10
Data Warehouse to Data Mart
Data
Warehouse
Data Mart
Decision
Support
Information
Data Mart
Decision
Support
Information
Data Mart
Decision
Support
Information
11
open source Data Warehouses
• A company may keep its DW private!
• Large data-sets are valuable Gold
Standards for research and development
• Some Universities host “public” DWs
• Eg ICAME: International Computer
Archive of Modern English
• ICAME also runs CORPORA forum
• Martin Krallinger etc on UK v US English:
http://www.uib.no/mailman/public/corporaarchive/2006-November/003387.html
12
Other Data repositories
• UPenn: Linguistic Data Consortium
http://www.ldc.upenn.edu/
• European equivalents:
• ELRA http://www.elra.info/
• ELDA http://www.elda.org/
• Leeds Electronic Text Centre
http://etext.leeds.ac.uk/
• Leeds Centre for Translation Studies
http://corpus.leeds.ac.uk/
13
Generic Architecture of Data
(synonym) Transaction data
14
Transaction (Operational) Data
• Operational (production) systems create
(massive number of) transactions, such as
sales, purchases, deposits, withdrawals, returns,
refunds, phone calls, toll roads, web site “hits”,
web site text, etc…
• Transactions are the base level of data – the raw
material for understanding customer behavior
• Unfortunately, operational systems change, eg
new formats, due to changing business needs
• Data warehousing strategies need to be aware
of operational system changes
15
Operational Summary Data
Summaries are for a
specific time period
and utilize the
transaction data for
that time period
Other Examples???
16
Decision Support Summary Data
• The data that are used to help make decisions about the
business
– Financial Data, such as:
• Income Statements (Profit & Loss)
• Balance Sheets (Assets – Liabilities = Net Worth)
– Sales summaries
– Other examples???
• Data warehouses maintain this type of data, however
financial data “of record” (for audit purposes) usually
comes from databases and not the data warehouse
(confusing???)
• Generally, it is a bad idea to use the same system for
analytic and operational purposes
17
Database Schema
• Database schema defines the structure of data,
not the values of the data (e.g., first name, last
name = structure; Ron Norman = values of the
data)
• In RDBMS:
– Columns = fields = attributes (A,B,C)
– Rows = records = tuples (1-7)
18
Logical & Physical Database Schema
• Describes data in a
way that is familiar to
business users
• Describes the data
the way it will be
stored in an RDBMS
which might be
different than the way
the logical shows it
19
Metadata
• General definition: Data about data !!!
– Examples:
• A library’s card catalog (metadata) describes publications (data)
• A file system maintains permissions (metadata) about files (data)
• A form of system documentation including:
–
–
–
–
–
Values legally allowed in a field (e.g., AZ, CA, OR, UT, WA, etc.)
Description of the contents of each field (e.g., start date)
Date when data were loaded
Indication of currency of the data (last updated)
Mappings between systems (e.g., A.this = B.that)
• Invaluable, otherwise have to research to find it
20
Business Rules
• Highest level of abstraction from operational
(transaction) data
• Describes why relationships exist and how they are
applied
• Examples:
– Need to have 3 forms of ID for credit
– Only allow a maximum daily withdrawal of $200
– After the 3rd log-in attempt, lock the log-in screen
– Accept no bills larger than $20
– Others???
21
OLAP – Online Analytical Processing
• A definition:
• Data representation for ease of visualization
• OLAP goes beyond SQL with its analysis
capabilities
• Key feature of OLAP: Relevant multi-dimensional
views such as products, time, geography
22
OLAP Architecture
23
General Architecture for Data Warehousing
• Source systems
• Extraction, (Clean),
Transformation, &
Load (ETL)
• Central repository
• Metadata repository
• Data marts
• Operational feedback
• End users: analysis,
OLAP, Data-Mining
24
DM vs. OLAP
• Data Mining:
– can handle complex data types of the
attributes and their aggregations
– a more automated process
• Online Analytic Processing (visualization):
– restricted to a small number of dimension and
measure types
– user-controlled process
CS490D
25
25
DM + visualization
• Data Mining:
– can handle complex data types of the
attributes and their aggregations
– “reduces” data to smaller number of patterns
• Visualization:
– restricted to a small number of patterns
– user-controlled process to select patterns
which are “interesting” or “useful”
CS490D
26
26
Q: Is it a Data Warehouse?
•
•
•
•
•
Is ANY data-set a Data Warehouse?
SIS?
Library Catalogue?
VLE?
Text in a textbook?
27
Definitions of a Data Warehouse
“A subject-oriented, integrated, time-variant and
1.
non-volatile collection of data in support of
management's decision making process”
- W.H. Inmon
2.
“A copy of transaction data, specifically
structured for query and analysis”
- Ralph Kimball
28
CW corpus as a Data Warehouse
Subject-oriented: English terminology on WWW
Integrated: harvested from many sources, into a
1.
single standard format and file-store
Time-variant : WWW pages change!
- W.H. Inmon
Non-volatile : corpus is a static snap-shot
2.
Copy of transaction data: cache
structured for query and analysis: raw text
yields word-frequency list
- Ralph Kimball
29