Data Management for Decision Support
Download
Report
Transcript Data Management for Decision Support
Data Management for Decision
Support
Session-2
Prof. Bharat Bhasker
What is a Data Warehouse?
A Practitioners Viewpoint
“A data warehouse is simply a single,
complete, and consistent store of data
obtained from a variety of sources and made
available to end users in a way they can
understand and use it in a business context.”
-- Barry Devlin, IBM Consultant
Introduction- Business Problem Definition
Business Application domain of Decision Support
• Customer Retention- modeling those who defected to
identify patterns that led their defection. Apply it to present
set and devise prevention.
• Sales & Customer Service - By aggregating the proper
information to frontline sales and service professionals.
Based on customer profile, rule based recommendation of
products, service plans. Cross-selling
• Marketing- Accurate information for retention campaigns,
lifetime value analysis, trending, targeted promotions
• Risk assessment and fraud detection- Accessible customer
base reduces the risk. A mail-order company can identify
payment patterns from a mailing address. Insurance
company can aggregate all drawn out policies, thus total
exposure. A bank can identify risky companies
Introduction- Business Problem Definition
Class of Problems
• Retrospective Analysis - Analysis of past and
present events. Analysis of zone wise, product wise
sales
• Predictive Analysis- Predicting certain events or
behavior based on historical data. Attrition rate of
customers due to competition
Introduction- Business Problem Classifications
Application Techniques for Business Problems
• Classification- based on predefined criteria, examples credit ratings of
individuals
• Clustering and Segmentation - Segment a database in subsets or clusters
based on a set of attributes. In the process of understanding customer
base, an organization may segment the known population. Thus,
discover hitherto unknown attributes (school attended, # of vacations
per year). Clustering may utilize statistical or AI techniques
• Associations- These techniques identify affinity among the collection
reflected in the examined records. These affinities are often referred to
as rules. Foe example, 60% of all record that contain A & B also
contain C & D. Product affinity in Market basket analysis
• Sequencing- Identifies patterns over time,. For example a analysis of
customer visits may reveal that a person who buys engine oil & filter in
one visit, buys gas additive in next visit.
====>
Introduction- Business Problem Classifications
The organizations are faced with a wealth of data stored in archives,. It
is the inability to discover, often previously unknown, information
hidden in data prevents them from deriving the knowledge and value
for the organization. The objective, therefore, should be to extract valid,
previously unknown, and comprehensible knowledge/information from
large database and use it for profit (Competitive advantage). To meet
these objectives-
– Capture and integrate both external and internal data into a
comprehensive view that encompasses whole organization
– Organize the present data and extracted information in ways that
expedites the decision making
– Mine the integrated data for information
Introduction- Operational Data Stores
For variety of On Line Transaction Processing (OLTP) systems
(Financial, Order processing, Point of sale applications) the
organizations maintain an operational data store. For example, bank
withdrawals and deposits.
–
–
–
–
–
–
–
–
Organized by application
Daily business processing on a detailed transactional level
Update intensive
Current data
Optimized for high performance
Access is few records per transaction, often on primary key
Large number of short duration transactions
Large number of concurrent applications
Introduction- Informational Data Stores
Need to organized around subjects such as customers, vendors,
products. Focuses on answering questions like “ What two products
resulted in most frequent calls on hotline?”. Tend to have redundant
and non-updateable. Characteristics exhibited include:
– Data Model- to meet End-use analysis needs. No focus on ACID
– Data Access- Ad hoc queries
– Time base- Recent, Aggregated, derived and historical data
– Data changes- Periodic, scheduled batch updates.
– Unit of Work- Queries rather than concurrent updates
– Record range accessed- Millions for a query vs. few for operational
– Number of Concurrent users- Low for IDS vs. High for ODS.
– Transaction Volumes- Low for IDS vs. High for ODS.
– Type of users- Analytical & Managerial
– Number of Indexes- Often many complex, compound vs. few,
simple.
Introduction- Managing Data for DS
Data Warehouse
Common characteristics:
– Database designed to meet analytical tasks comprising of data from
multiple applications
– Small number of users with intense and long interactions
– Read intensive usage
– Periodic updates to the contents
– Consists of current as well as historical data
– Relatively fewer but large tables
– Queries results is large results sets, involving full table scan and
joins spanning several tables
– Aggregation, vector operation and summarization are common
– The data frequently resides in external heterogeneous sources
Introduction- Terminology
Current Detail Data- data acquired directly from operational databases,
often representing entire enterprise
Old Detail Data- Aged current detail data, historical data organized by
subjects, it helps in trend analysis
Data Marts- A large data store for informational needs where scope is
limited to a department, SBUs etc., In a phased implementation data
marts are a way to build a warehouse.
Summarized Data- Aggregated data along the lines required for
executive reporting,trend analysis and decision support.
Drill Down- Ability of knowledge worker to perform business analysis
in a top down fashion,starting from the highly summarized data to the
current and old detail data.
Metadata- It is data about the data, description of contents, location,
structure, end-user views, identification of authoritative data, history of
updates, security authorizations
Introduction- Architecture
Management
External
Meta data
Current
Information
Delivery System
Report, Query & EIs
Data Mining Tools
Extract,
Cleanup
& Load
Realized
or Virtual
MDDB
Currentl
Repository
OLAP Tools
What is a Data Warehouse?
An Alternative Viewpoint
“A DW is a
–
–
–
–
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in
organizational decision making.”
-- W.H. Inmon, Building the Data Warehouse, 1992
A Data Warehouse is...
• Stored collection of diverse data
– A solution to data integration problem
– Single repository of information
• Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
• Optimized differently from transactionoriented db
• User interface aimed at executive
… Cont’d
• Large volume of data (Gb, Tb)
• Non-volatile
– Historical
– Time attributes are important
• Updates infrequent
• May be append-only
• Examples
– All transactions ever at WalMart
– Complete client histories at insurance firm
– Stockbroker financial information and portfolios
Summary
Business
Information Guide
Data
Warehouse
Catalog
Business Information
Interface
Data
Warehouse
Data Warehouse
Population
Enterprise
Modeling
Operational Systems
Warehouse is a Specialized DB
•
•
•
•
•
•
•
Standard DB
Mostly updates
Many small transactions
Mb - Gb of data
Current snapshot
Index/hash on p.k.
Raw data
Thousands of users (e.g.,
clerical users)
•
•
•
•
•
•
•
Warehouse
Mostly reads
Queries are long and complex
Gb - Tb of data
History
Lots of scans
Summarized, reconciled data
Hundreds of users (e.g., decisionmakers, analysts)
Warehousing and Industry
• Warehousing is big business
– $2 billion in 1995
– $3.5 billion in early 1997
– $8 billion in 1998
• WalMart has largest warehouse
– 900-CPU, 2,700 disk, 23 TB Teradata system
– ~7 TB in warehouse
– 40-50GB per day