Chapter 4 Data Management: Warehousing, Access and

Download Report

Transcript Chapter 4 Data Management: Warehousing, Access and

Data Warehousing and DSS at
Group Health Cooperative
2-3 million data records are processed monthly





How to use for decision support?
How to hold down costs?
How to improve customer service?
How to utilize resource effectively?
How to improve service quality?
Answers


Develop a comprehensive database (data warehouse)
and DSS approach
Very effective
Data Warehousing, Access,
Analysis and Visualization
What to do with all the data that
organizations collect, store and use?
Information overload!
Solution






Data warehousing
Data access
Data mining
Online analytical processing (OLAP)
Data visualization
Data sources
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson
Copyright 1998, Prentice Hall, Upper Saddle River, NJ
4.3 The Nature and Sources of
Data



Data: Raw
Information: Data organized to convey meaning
Knowledge: Data items organized and processed to
convey understanding, experience, accumulated
learning, and expertise
DSS Data Items
–
–
–
–
–
–
Documents
Pictures
Maps
Sound
Animation
Video
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson
Copyright 1998, Prentice Hall, Upper Saddle River, NJ
Data Sources



Internal
External
Personal
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson
Copyright 1998, Prentice Hall, Upper Saddle River, NJ
The Internet and Commercial
Database Services
For External Data
 The Internet: Major supplier of external data
 Commercial Data “Banks”: Sell access to
specialized databases
Can add external data to the MSS in a timely
manner and at a reasonable cost
DSS In Focus 4.1: Sources of External Data--A Sampler
M any sources of public data exist, some of which are available on the I nternet. Common data
sources are:
Federal Publications
Survey of Current Business (Department of Commerce) (continues Business Conditions Digest in
short form)--monthly, general business conditions
Monthly Labor Review (Department of Labor)--monthly employment statistics (a journal with
articles)
Employment and Earnings (Department of Labor)--monthly, more detailed than M onthly Labor
Review
Other
I nternational Monetary Fund--report of balance of payments, including currency rates, for
participating countries
Moody's--a series of manuals including abstracted information and balance sheets of most large
U.S. corporations, intended for investors
Standard & Poor's--periodically updated report of financial stability of most U.S. corporations
Advertising Age--marketing newspaper, with a great deal of data on marketing
Annual Editor & Publisher Market Guide--annual report of marketing information by SM SA
(standard metropolitan statistical area)
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson
Copyright 1998, Prentice Hall, Upper Saddle River, NJ
TABLE 4.2 Representative Commercial Database (Data Bank)
Services.
CompuServe and The Source. Personal computer networks providing statistical data banks
(business and financial market statistics) as well as bibliographic data banks (news, reference,
library, and electronic encyclopedias). CompuServe is the largest supplier of such services to
personal computer users.
Compustat. Provides financial statistics about more than 12,000 corporations.
Data Resources, Inc. offers statistical data banks in agriculture, banking, commodities,
demographics, economics, energy, finance, insurance, international business, and the steel
and transportation industries. DRI economists maintain a number of these data banks.
Standard & Poor's is also a source. It offers services under the U.S. Central Data Bank.
Dow Jones Information Service. Provides statistical data banks on stock market and other
financial markets and activities, and in-depth financial statistics on all corporations listed on the
New York and American stock exchanges, plus 800 other selected companies. Its Dow Jones
News/Retrieval system provides bibliographic data banks on business, financial, and general
news from The Wall Street Journal, Barron's, the Dow Jones News Service, Wall Street Week,
and the 21-volume American Academic Encyclopedia.
Interactive Data Corporation. A statistical data bank distributor covering agriculture,
automobiles, banking, commodities, demographics, economics, energy, finance, international
business, and insurance. Its main suppliers are Chase Econometric Associates, Standard &
Poor's, and Value Line.
Lockheed Information Systems. The largest bibliographic distributor. Its DIALOG system
offers extracts and summaries of more than 150 different data banks in agriculture, business,
economics, education, energy, engineering, environment, foundations, general new
publications, government, international business, patents, pharmaceuticals, science, and
social sciences. It relies on many economic research firms, trade associations, and
governmental groups for data.
Mead Data Central. This data bank service offers two major bibliographic data banks. Lexis
provides legal research information and legal articles. Nexis provides a full-text (not abstract)
bibliographic database of over 100 newspapers, magazines, and newsletters, news services,
government documents, and so on. It includes full text and abstracts from the New York Times
and the complete 29-volume Encyclopedia Britannica. Also provided is the Advertising &
Marketing Intelligence (AMI) data bank, and the National Automated Accounting Research
System.
Source: Based on Standard & Poor's Compustat Services, Inc., statistics on 6,000
companies’ financial reports.
The Internet/Web and
Corporate Databases and
Systems
Use Web Browsers to





Access vital information by employees
and customers
Implement executive information systems
Implement group support systems (GSS)
Database management systems provide
data in HTML
Web-browsers as DBMS front-ends
Database Management
Systems in DSS

DBMS: Software program for entering (or
adding) information into a database;
updating, deleting, manipulating, storing, and
retrieving information

A DBMS combined with a modeling language
is a typical system development pair, used in
constructing DSS or MSS

DBMS are designed to handle large amounts
of information
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson
Copyright 1998, Prentice Hall, Upper Saddle River, NJ
Database Organization and
Structure





Relational Databases
Hierarchical Databases
Network Databases
Object-oriented Databases
Multimedia-based Databases
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson
Copyright 1998, Prentice Hall, Upper Saddle River, NJ
Data and Applications
Application 1
OS
Application 2
Application 3
Application 1
OS
DBMS
Application 2
Application 3
Application 1
OS
O-O
DBMS
Application 2
Application 3
Traditional File Systems

Advantages
– simple data design to support single or
small group of applications
– fast data access
– inexpensive

Disadvantages
–
–
–
–
lack of data relation
redundancy
lack of standards
low application development productivity
DBMS

Advantages
–
–
–
–
integration, sharing of data
increased data accessibility
minimized redundancy
easier application development and
maintenance
– improved data security
– logical/physical data independence

Disadvantage
– complex data design
– slow access
– expensive
External view
Users view of data
Logical,
integrated view
Conceptual View
• Data Definition Language
• Data Manipulation Language
• Query Language
Internal view
Physical View
Physical storage
structure of data
3 level DB Architecture
Relational Database
Invoice#, Inv.Date
Customer#, Cname, Caddress
Item, Item-Type, Item-Color, Item-Price, Quantity
Customer
Item
Cust# CName
CAddress
12398 Sid B.
Ontario St
Invoice
Invoice# Cust# Inv.Date
Item#
Type
Color
Line Item
Invoice# Item# Qty
Price
Relational Database

Primary Key
– duplicate rows not allowed
– cannot have missing (NULL) value(s) for PK

Foreign Key
– defines relationship between tables
– FK values either reference existing PK values or
they are NULL
(referential integrity)
FK
Customer
Cust# CName
CAddress
12398 Sid B.
Ontario St
Invoice
Invoice# Cust# Inv.Date
Relational Database

Relational Operators
– Select: subset of rows
– Project: subset of columns
– Join: creates new table by linking on common attributes
Select Items with Price > $100
Show Item#, Type, and Price
Item#
Type
Color
Price
I34
I459
I678
abc
prq
xyz
Blue
Red
Purple
$123.50
$57.23
$109.45
All Invoices for Customer Sid B.
Customer
Cust# CName
12398 Sid B.
Invoice
Invoice# Cust# Inv.Date
CAddress
Ontario St
JOIN on Cust#
78912
79710
79923
12398 6/28/97
36743 8/15/97
12398 7/3/98
Data Warehousing






Physical separation of operational and decision
support environments
Purpose: to establish a data repository making
operational data accessible
Uses TPS data needed for decision support
Data are transformed and integrated into a
consistent structure
Data warehousing (or information warehousing): a
solution to the data access problem
End users perform ad hoc query, reporting analysis
and visualization
Data Warehouse:
A Decision Support Focus

DW technology
A set of methods, techniques and tools that
may be leveraged together to produce a
vehicle that delivers data to end users on an
integrated platform
A framework to support the merging of
operational data, informational data,
external data, and personal data
Issue is one of applying the technology to solve
a business problem
What is a data warehouse?
Databases that support decision making and that are
• subject oriented
• time-variant
• integrated
• non volatile
– organized around the essential business entities (customer, product,
policy, claim, order, etc.)
– contains data that has been cleansed, transformed, integrated
– data organized by various time periods; often summarized on time;
data is time-stamped
– not updated in real time; not updated by users
Order processing
•2 second response time
Daily c
•Last 6 months orders
Data
Warehouse
losed o
rders
•Last 5 years data
ry
Product Price/inventory
uc
prod
y
l
k
Wee
•10 second response time
•Last 10 price changes
e
t pric
•Last 20 inventory transactions
Marketing
•30 second response time
We
e
r
ma
y
kl
nto
/Inve
ing
ke t
•Response time 2 seconds
to 60 minutes
•Data is not modified
ra
rog
ms
p
•Last 2 years programs
•Different performance requirements
•Combine data from multiple applications
•Data is mostly non-volatile
•Data saved for a long time period
Figure 3. Reasons for moving data outside the operations systems
Order processing
Customer
orders
Product
price
Available Inventory
Data
Warehouse
Customers
Product Price/inventory
Product
price
Product
Inventory
Products
Orders
Product Inventory
Product Price changes
Product Price
Marketing
Customer
Profile
Product
price
Marketing programs
•No data model restrictions of the source application
•Data warehouse model has business entities
Figure 5. Data warehouse entities align with the business structure
Order Processing
System
Order
Data
Warehouse
Orders (Closed)
Inventory snapshot 1
Up
Inventory snapshot 2
Inventory
• Operational state information is not carried to the data warehouse
• Data is transferred to the data warehouse after all state changes
• Or, data is transferred with period snapshots
Figure 6. Transformation of the operational state information
Transformation
Operational
System A
----------------------cust, cust_id, borrower
>> customer ID
----------------------“1” >> “M”
“2” >> “F”
Operational
System B
----------------------Missing >>> “……..”
Data Warehouse
System
Summarized Data
Detailed
Data
•Uniform business terms
•Single physical definition of an attribute
•Consistent use of entity attributes
•Default and missing values
Figure 8. Physical transformation of application data
Data Warehouse
System
Predefined
reports and
queries
Summarized Data
Queries against
summary data
Detailed
Data
Data mining in
detail data
Other
Applications
•Standard reports and queries
•Queries against summary tables
•Data mining in the detail data
•Interface with other applications
Figure 10. Data warehouse analysis processes
What is a data warehouse?
Kinds of information in data warehouse
• old detail data
• current detail data
• lightly summarized data
• highly summarized data
• meta-data
Meta Data

“data on data”
source, history, and many other aspects of data.

Business meta data
definitions,descriptions and rules used for reporting.

Technical meta data
structures and mapping rules for the data extraction and
staging process.

Allows information stored in warehouse to be used
effectively for reporting and analysis, and ensure that
all users have “one version of the truth”.
Issue
Operational
Warehouse
How built
One application at a time
One or more subject areas at a time
Requirements
Known
Vague
Critical to
Daily business operation
Management decisions
Data access
Smaller number of rows
retrieved in a single
transaction
Tuned for frequent access
to small amounts of data
Volume needed for daily
operation
Large sets of data scanned to retrieve
results
High performance
Flexibility
Tuning
Data volume
Data retention
Data currency
Data
availability
Unit of work
Design priority
Tuned for infrequent access to large
amounts of data
Larger volume needed to support
statistical analysis, forecasting, ad hoc
reporting, querying
To meet daily requirement Retained longer to support historical
reporting, comparison, etc.
Must be up to the minute Represents a static point in time;
important that data does not change in
real time
High availability may be
Same level of availability usually not
needed
needed.
Small, manageable,
Large, unpredictable, highly variable
predictable unit of work
unit of work
Data Warehousing Benefits






Increase in knowledge worker productivity
Supports all decision makers’ data
requirements
Provide ready access to critical data
Insulates operation databases from ad hoc
processing
Provides high-level summary information
Provides drill down capabilities
Yields
–
–
–
–
–
Improved business knowledge
Competitive advantage
Enhances customer service and satisfaction
Facilitates decision making
Help streamline business processes
DW Suitability
For organizations where





Data are in different systems
Information-based approach to management in
use
Large, diverse customer base
Same data have different representations in
different systems
Highly technical, messy data formats
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson
Copyright 1998, Prentice Hall, Upper Saddle River, NJ
PC Files
LAN Servers Mainframe OLTP Databases External Sources
Data Loader
Data Converter
Data Scrubber
Data Transformer
Data Warehouse
OLAP Server
OLAP Interface
Data Marts


Data warehouse designed to meet the needs of a
specific group of users
Should (but may not) be designed with
corporate standards and accessibility in mind
– incorporate standards for hardware, software,
networking, DBMS, naming conventions, etc.
– vendor’s attempt to bypass IT and sell directly
to end-users?
Operational Data Store


Used for operational processing, may be used to
feed the DW
An architectural construct that is
• subject-oriented
• integrated
• volatile
• current valued
• comprised of only corporate detailed data


Multiple applications may use the data, with
updating in one place
Effective in organizations trying to move legacy
systems to integrated environment
OLAP: Data Access and Mining,
Querying and Analysis
Online Analytical processing (OLAP)
– DSS and EIS computing done by end-users in
online systems
– Versus online transaction processing (OLTP)
OLAP Activities
– Generating queries
– Requesting ad hoc reports
– Conducting statistical analyses
OLAP
(On-Line Analytical Processing)



To gain insight into data through fast, interactive
access to a wide variety of possible views of
information that has been transformed from raw data
view and analyze data across multiple dimensions
allows flexible and easy “slicing and dicing” of data,
drill down capabilities
move from a general view to one which is more detailed (known as
"drill-down"), or from a very detailed level to one which is more
aggregated (“roll-up”).
view data from a different perspective by introducing a completely
different analysis criterion ("dicing" or changing view).
OLAP

Multidimensional OLAP vs. Relational OLAP
– MOLAP: data stored in multi-dimensional arrays; use of sparse
matrix techniques
– ROLAP: data stored in relational DBMS; use of star-schema
design
ROLAP: Star Schema Design
Dimension Table 1
Dimension Key 1
Description 1
Aggregation Lvl 1.1
Aggregation Lvl 1.2
Aggregation Lvl 1.3
Fact Table
Dimension Key 1
Dimension Key 2
Dimension Key 3
…….
Fact 1
Fact 2
Fact 3
…….
Dimension Table 2
Dimension Key 2
Description 2
Aggregation Lvl 2.1
Aggregation Lvl 2.2
Aggregation Lvl 2.3
Dimension Table 3
Dimension Key 3
Description 3
Aggregation Lvl 3.1
Aggregation Lvl 3.2
Star-Schema example
Sales Rep ID
Sales Rep Name
Store ID
Store Name
Store Location
Distribution Channel
Fact Table
Sales Rep ID
Product Code
Cust Zip Code
Customer Type
Sales Period Date
Total Qty
Total $
Quota Qty
Returned Qty
Promotion Qty
Product Code
Product Name
Category
Product Type
measures
ZIP Code
City
State/Province
Country
Customer Type
Cust Type Desc
Cust Category
Cust Category Desc
Dimension Tables
Dimension Tables
Multi-dimensional data
OLTP vs OLAP schema
Company
Ship-from
PO
Ship-to
PO-item
• Efficient create, update
and processing of orders
Item
Company
• Query: list purchases by companies,
cost of items, source and destination
Ship-from
Purchases
Ship-to
Item
Date
Dimensional hierarchies
Company
#Units
$-value
Name
City
State
Region
Purchases
Date
Item
Name
Item-Type
Day
Week
Month
Qtr
Product Category
Year
Query examples:
•Purchases by Item
•Purchases by Item and Date
•Purchases by Item, Date and Company
•Purchases by Item by Week
•Purchases by Item-Type by Qtr
•Purchases by Item-Type by State by Year
Further OLAP queries






Compare average purchase-$ in 1998 to that in 1997
Compare average monthly purchase-$ by Region
What are the total 1st-quarter purchase-units by
company over the past 5 years
What is the deviation in weekly purchase-$ by
company over the past year
Model weekly purchase-units by Item-Category over
the past 5 years
Analytical queries - requirements beyond
traditional SQL-type querying constructs
OLAP uses the data warehouse
and a set of tools, usually with
multidimensional capabilities

Query tools

Spreadsheets

Data mining tools

Data visualization tools
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson
Copyright 1998, Prentice Hall, Upper Saddle River, NJ
D a ta
S o u rc e s
Bu sin ess
Com m u n ica t ion
Qu er yin g
In t er n a l
Da t a
Sou r ces
E xt er n a l
Da t a
Sou r ces
Da t a
Acqu isit ion ,
E xt r a ct ion ,
Deliver y
Tr a n sfor m a t ion
Repor t
Gen er a t ion
Da t a
Wa r eh ou se
Spr ea dsh eet
F or eca st in g
An a lysis
Modelin g
Da t a
P r esen t a t ion
and
Visu a liza t ion
Mu lt im edia
E IS,
Ot h er s
On lin e An a lyt ica l P r ocessin g
FIGURE 4.1 Data Warehousing and Online Analytical Processing (OLAP).
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson
Copyright 1998, Prentice Hall, Upper Saddle River, NJ