Transcript Outline
Decision Support Systems for E-commerce
Decision support systems for EC
DSS: help the knowledge worker (executive, manager, analyst) make faster and better decisions
Data Warehousing: enables on-line analytical processing (OLAP)
– What were the sales volumes by region and product category for the last year?
– How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
– Will a 10% discount increase sales volume sufficiently?
OLAP is a component of a decision support system
Data mining
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
– Data mining is a powerful, high-performance data analysis tool for decision support.
Potential Applications of Data
Warehousing and Mining in EC
Analysis of user access patterns and buying patterns
Customer segmentation and target marketing
Cross-selling and improved Web advertisement
Personalization
Association (link) analysis
Customer classification and prediction
Time-series analysis
Typical event sequence and user behavior pattern analysis
Transition and trend analysis
Data Warehousing
The phrase “data warehouse” was coined by William Inmon in 1990
A data warehouse is a decision support database that is maintained separately from the organization’s operational database
Definition: a DW is a repository of integrated information from distributed, autonomous, and possibly heterogeneous information sources, for query, analysis, decision support, and data mining purposes
Characteristics (cont’d)
Integrated
– There is no consistency in encoding, naming conventions, etc. among application-oriented data from different legacy systems and heterogeneous data sources
– When data is moved into the warehouse, it is consolidated, converted, and encoded consistently
Non-volatile
– New data is always appended to the database rather than replacing existing data
– The database continually absorbs new data, integrating it with the previous data
– In contrast, operational data is regularly accessed and manipulated a record at a time, and updates are applied in place in the operational environment
Characteristics (cont’d)
Time-variant
– The time horizon for the data warehouse is significantly longer than that of operational systems
– Operational databases contain current-value data; data warehouse data is a sophisticated series of snapshots, each taken as of some moment in time
– Operational data is valid only at the moment of access, capturing a moment in time; within seconds, it may no longer be a valid description of current operations
– Operational data may or may not contain an element of time, but informational data always has a time dimension: each data point is associated with a point in time, and data points can be compared along that axis
Reasons to Separate DW from Operational
Systems
Performance:
– special data organization, access methods, and implementation methods are needed to support multidimensional views and operations typical of OLAP
– complex OLAP queries would degrade performance for operational transactions; the DW thus avoids interrupting operational processing at the underlying information sources
– concurrency control and recovery in OLTP mode are not compatible with OLAP analysis
– provides fast access to integrated information
Reasons to Separate DW from Operational Systems (cont’d)
Decision support requires
– historical data, which operational databases do not typically maintain
– consolidation of data from heterogeneous sources: operational databases, external sources
• different sources typically use inconsistent data representations, codes, and formats which have to be reconciled
– aggregation, summarization, and annotation of raw data
System Architecture
[Figure: end users run analysis, query/report, and data mining tools against the warehouse; change detectors sit between the warehouse and the underlying information sources (legacy systems, flat files, RDBMS, OODBMS, ...)]
DW Components
Underlying information sources
– often the operational systems, providing the lowest level of data
– designed for operational use, not for decision support, and the data reflect this fact
– multiple data sources are often from different systems, run on a wide range of hardware, and much of the software is built in-house or highly customized
– multiple data sources introduce a large number of issues, such as semantic conflicts
– distributed, autonomous, and possibly heterogeneous
DW Components (cont’d)
Integrator
– receives updates
– makes the data conform to the conceptual schema used by the warehouse
– integrates the changes into the warehouse
– merges the data with existing data already present
– resolves possible update anomalies
– modifies warehouse views accordingly
User interface
– tools to query and perform data analysis and data mining
DW Components (cont’d)
Change detectors/propagators
– Refresh the warehouse by detecting changes of interest at an information source and propagating the updates on source data to the data stored in the warehouse
– when to refresh
• determined by usage, type of data source, etc.
– how to refresh
• data shipping: use triggers to update a snapshot log table and propagate the updated data to the warehouse (triggers can be defined in a full-functionality DBMS)
• transaction shipping: ship the updates recorded in the transaction log (examine the updates in the log file)
• write custom programs for legacy systems
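The data-shipping approach above can be sketched with SQLite, whose triggers are available from Python's standard library. The table and column names here are illustrative, not from the slides.

```python
# Sketch of "data shipping": a trigger maintains a snapshot log table
# that a refresh job later ships to the warehouse.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(lot_no INTEGER, amount REAL);
CREATE TABLE sales_log(op TEXT, lot_no INTEGER, amount REAL);

-- The trigger captures every insert into the snapshot log table.
CREATE TRIGGER sales_ins AFTER INSERT ON sales
BEGIN
    INSERT INTO sales_log VALUES ('I', NEW.lot_no, NEW.amount);
END;
""")

con.execute("INSERT INTO sales VALUES (1, 100.0)")
con.execute("INSERT INTO sales VALUES (2, 250.0)")

# A refresh job would read the log, propagate the changes to the
# warehouse, then clear the shipped entries.
changes = con.execute("SELECT op, lot_no, amount FROM sales_log").fetchall()
print(changes)  # [('I', 1, 100.0), ('I', 2, 250.0)]
```

Transaction shipping instead mines the DBMS's own log file, which avoids the trigger overhead on the operational system.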
Multidimensional Data
Sales volume as a function of product, time, and geography
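A minimal model of such multidimensional data is a mapping from one coordinate per dimension to a measure; the products, quarters, and figures below are made up for illustration.

```python
# Sales volume keyed by (product, time, geography); values illustrative.
sales = {
    ("TV", "1Qtr", "China"): 120,
    ("TV", "2Qtr", "China"): 150,
    ("PC", "1Qtr", "India"): 300,
    ("PC", "2Qtr", "India"): 280,
}

# A cell of the cube is addressed by one value per dimension:
print(sales[("TV", "2Qtr", "China")])  # 150

# Aggregating over the time dimension yields a lower-dimensional view:
tv_china = sum(v for (p, t, g), v in sales.items()
               if p == "TV" and g == "China")
print(tv_china)  # 270
```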
OLAP Servers
Relational OLAP (ROLAP)
– extended relational DBMS that maps operations on multidimensional data to standard relational operations
Multidimensional OLAP (MOLAP)
– special-purpose server that directly implements multidimensional data and operations
Hybrid OLAP (HOLAP)
– gives users/system administrators the freedom to select different partitions
Warehouse Design: Conceptual Modeling
Star schema
– a single object (fact table) in the middle connected to a number of objects (dimension tables)
Snowflake schema
– a refinement of the star schema in which the dimensional hierarchy is represented explicitly by normalizing the dimension tables
Fact constellation schema
– multiple fact tables share dimension tables
A Multidimensional Fact Table Schema
Example of The Star Schema
Example of the Snowflake Schema
Example of the Fact Constellation Schema
Sales Data
A Sample Data Cube
[Figure: a 3-D cube of sales with dimensions Date (1Qtr–4Qtr, sum), Product (TV, PC, VCR, sum), and Country (China, India, Japan, sum); the highlighted cell holds the total annual sales of TVs in China]
OLAP Operations
roll-up
– aggregating along a specific dimension, i.e., summarizing data
– e.g., total sales volume last year by product category by region
drill-down
– also called roll-down or drill-through
– the inverse of roll-up: go from a higher-level summary to a lower-level summary or to detailed data
– e.g., for a particular product category, find the detailed sales data for each salesperson by date
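Roll-up and drill-down reduce to grouping and filtering over the detailed records; the categories, regions, and amounts below are invented for the sketch.

```python
# Roll-up: summarize detailed sales along the date and salesperson
# dimensions, keeping product category and region.
from collections import defaultdict

detail = [  # (category, region, month, salesperson, amount)
    ("beverages", "West", "1998-01", "Ann", 10),
    ("beverages", "West", "1998-02", "Bob", 20),
    ("snacks",    "East", "1998-01", "Ann", 5),
]

rollup = defaultdict(int)
for cat, region, month, sp, amt in detail:
    rollup[(cat, region)] += amt
print(dict(rollup))  # {('beverages', 'West'): 30, ('snacks', 'East'): 5}

# Drill-down: for one category, recover the detail by salesperson/date.
drill = [(sp, month, amt) for cat, region, month, sp, amt in detail
         if cat == "beverages"]
print(drill)  # [('Ann', '1998-01', 10), ('Bob', '1998-02', 20)]
```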
OLAP Operations (cont’d)
slicing
– projecting data along a subset of dimensions with an equality selection on the other dimensions
– e.g., sales of beverages in the West for Jan 98
dicing
– similar to slicing, except that a range selection rather than an equality selection is used on the other dimensions
– e.g., sales of beverages in the West over the last 6 months
pivot
– reorient the cube
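The difference between slicing and dicing is just equality versus range selection on the fixed dimensions, which a sketch over made-up records makes concrete:

```python
# Slice vs. dice on the same (illustrative) records.
rows = [  # (product, region, month, amount)
    ("beverages", "West", "1998-01", 10),
    ("beverages", "West", "1998-03", 20),
    ("beverages", "East", "1998-01", 7),
]

# slice: beverages in the West for Jan 98 (equality on each fixed dim)
slice_ = [r for r in rows
          if r[0] == "beverages" and r[1] == "West" and r[2] == "1998-01"]

# dice: beverages in the West over a range of months
dice = [r for r in rows
        if r[0] == "beverages" and r[1] == "West"
        and "1998-01" <= r[2] <= "1998-06"]

print(len(slice_), len(dice))  # 1 2
```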
Cube Operation
SELECT date, product, customer, SUM (amount)
FROM SALES
CUBE BY date, product, customer
Need to compute the following group-bys:
(date, product, customer),
(date, product), (date, customer), (product, customer),
(date), (product), (customer), and the grand total ()
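The cube over k dimensions is the union of group-bys over every subset of the dimensions, 2^k of them including the empty grand total; enumerating them is a one-liner:

```python
# Enumerate every grouping the CUBE of (date, product, customer)
# must compute: all subsets of the dimension set.
from itertools import combinations

dims = ("date", "product", "customer")
groupings = [subset
             for r in range(len(dims), -1, -1)
             for subset in combinations(dims, r)]
print(groupings)
# [('date', 'product', 'customer'), ('date', 'product'),
#  ('date', 'customer'), ('product', 'customer'),
#  ('date',), ('product',), ('customer',), ()]
print(len(groupings))  # 8  (2**3 subsets)
```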
Cube Computation -- Array Based Algorithm
An MOLAP approach: the base cuboid is stored as a multidimensional array.
Read in a number of cells at a time to compute partial cuboids.
[Figure: lattice of cuboids over dimensions A, B, C: {ABC}; {AB}, {AC}, {BC}; {A}, {B}, {C}; and the apex {}]
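The lattice can be walked bottom-up: each smaller cuboid is obtained by summing out one dimension of an already-computed cuboid rather than recomputing from the base. Sizes and cell values below are illustrative.

```python
# Array-based cuboid computation sketch over a 2x2x2 base cuboid {ABC}.
A, B, C = 2, 2, 2
abc = [[[a * 4 + b * 2 + c for c in range(C)]  # base cuboid {ABC}
        for b in range(B)] for a in range(A)]

# {AB}: sum out dimension C
ab = [[sum(abc[a][b]) for b in range(B)] for a in range(A)]
# {A}: sum out B from {AB} (cheaper than starting from {ABC} again)
a_only = [sum(ab[a]) for a in range(A)]
# {}: the apex of the lattice, the grand total
apex = sum(a_only)

print(ab, a_only, apex)  # [[1, 5], [9, 13]] [6, 22] 28
```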
ROLAP versus MOLAP
ROLAP
– exploits the services of the relational engine effectively
– provides additional OLAP services
• design tools for DSS schema
• performance analysis tool to pick aggregates to materialize
– SQL gets in the way of sequential processing and column aggregation
– some queries are hard to formulate and can often be time-consuming to execute
ROLAP versus MOLAP
MOLAP
– the storage model is an n-dimensional array
– front-end multidimensional queries map to server capabilities in a straightforward way
– direct addressing abilities
– handling sparse data in the array representation is expensive
– poor storage utilization when the data is sparse
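The sparsity point can be made with a back-of-the-envelope sketch: a dense array materializes every cell of the cube, while a dictionary-of-keys representation (one option ROLAP-style storage effectively gives you) holds only the populated cells. The sizes below are invented.

```python
# Dense array vs. sparse dictionary-of-keys storage for a 100^3 cube
# with only 3 non-empty cells.
n = 100
dense_cells = n ** 3                       # every cell is materialized
sparse = {(1, 2, 3): 10.0, (5, 5, 5): 7.5, (9, 0, 1): 3.2}

print(dense_cells, len(sparse))  # 1000000 3
```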
Example
Designed, developed, and implemented QDMS (Quality Database Management System), a working system that
– provides fast access to integrated production and inspection data
– provides complex data analysis for decision support
– isolates data analysis processing from operational systems
– encourages manufacturers to evaluate and improve their performance
Example (Cont’d)
Information Sources
– Heterogeneous information sources
• flat files
• RDBS: Oracle, Sybase, Paradox, MS Access, FoxPro
• non-relational DBS: IBM IMS
• others: Lotus Notes
– Data
• uniform in some cases, e.g., Lot_no; Product id: NSN
• non-uniform in other cases, e.g., Defect id
• temporal ordering for production records
Actual Application
Query: “overall & detail production performance”
• manufacturer: Com.1
• products: all products
• date interval: 01-Jan-1994 until 01-Jan-1999
• source: USDA
[Table: results for Com.1: Lot#1 (Contract Number 1), Lot#2 (Contract Number 2), Lot#3 (Contract Number 3)]
Data Mining
Characterization and Comparison
– generalize, summarize, and possibly contrast data characteristics, e.g., dry vs. wet regions
Association
– finding rules like buys(x, diapers) → buys(x, milk)
Classification and Prediction
– classify data based on the values of a classifying attribute, e.g., classify countries based on climate, or classify cars based on gas mileage
– predict unknown or missing attribute values based on other information
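A rule such as buys(x, diapers) → buys(x, milk) is typically scored by support and confidence over market-basket transactions; the baskets below are made up for the sketch.

```python
# Support and confidence of the rule diapers -> milk.
baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

with_diapers = [b for b in baskets if "diapers" in b]
both = [b for b in with_diapers if "milk" in b]

support = len(both) / len(baskets)          # fraction of all baskets
confidence = len(both) / len(with_diapers)  # P(milk | diapers)
print(support, confidence)  # 0.5 0.666...
```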
Data Mining (Cont’d)
Clustering
– group data to form new classes, e.g., cluster houses to find distribution patterns
Time-series analysis
– trend and deviation analysis: find and characterize evolution trends, sequential patterns, similar sequences, and deviation data, e.g., stock analysis
– similarity-based pattern-directed analysis: find and characterize user-specified patterns in large databases
– cyclicity/periodicity analysis: find segment-wise or total cycles or periodic behavior in time-related data
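One of the simplest trend-analysis tools is a moving average, which smooths a series so the trend stands out from day-to-day deviation; the price series here is invented.

```python
# Moving-average trend over a made-up stock price series.
prices = [10, 12, 11, 13, 15, 14, 16]

def moving_average(xs, w):
    """Mean of each window of w consecutive values."""
    return [sum(xs[i:i + w]) / w for i in range(len(xs) - w + 1)]

trend = moving_average(prices, 3)
print(trend)  # [11.0, 12.0, 13.0, 14.0, 15.0]
```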
Classification
Data categorization based on a set of training objects
– applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis, etc.
– example: classify a set of diseases and provide the symptoms which describe each class or subclass
The classification task: based on the features present in the class-labeled training data, develop a description or model for each class. It is used for
– classification of future test data,
– better understanding of each class, and
– prediction of certain properties and behaviors.
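As a minimal sketch of the classification task, a 1-nearest-neighbour classifier labels a test object by the closest class-labeled training object; the credit-approval features and labels below are invented.

```python
# 1-nearest-neighbour classification over class-labeled training data.
def classify(train, x):
    """Return the label of the training point closest to x."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    label, _ = min(((lbl, dist2(feat, x)) for feat, lbl in train),
                   key=lambda t: t[1])
    return label

train = [  # ((income, debt), decision)
    ((80, 10), "approve"),
    ((75, 20), "approve"),
    ((20, 60), "reject"),
    ((30, 70), "reject"),
]
print(classify(train, (70, 15)))  # approve
print(classify(train, (25, 65)))  # reject
```

Real systems use richer models (decision trees, Bayesian classifiers), but the task is the same: learn a description of each class from labeled examples, then apply it to future test data.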