cse4701chap25x - University of Connecticut
Chapter 25, 6e - 24, 5e
Distributed Databases
CSE
4701
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT 06269-3155
[email protected]
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
A portion of these slides are being used with the permission of Dr. Ling Lui,
Associate Professor, College of Computing, Georgia Tech.
Remaining slides represent new material.
Classical and Distributed Architectures
Classic/Centralized DBMS Dominated the
Commercial Market from 1970s Forward
Problems of this Approach
Difficult to Scale w.r.t. Performance Gains
If DB Overloaded, replace with a Faster Computer
This can Only Go So Far - Disk Bottlenecks
Distributed DBMS have Evolved to Address a
Number of Issues
Improved Performance
Putting Data “Near” Location where it is Needed
Replication of Data for Fault Tolerance
Vertical and Horizontal Partitioning of DB Tuples
Common Features of Centralized DBMS
Data Independence
High-Level Representation via Conceptual and
External Schemas
Physical Representation (Internal Schema) Hidden
Program Independence
Multiple Applications can Share Data
Views/External Schema Support this Capability
Reduction of Program/Data Redundancy
Single, Unique, Conceptual Schema
Shared Database
Almost No Data Redundancy
Controlled Data Access Reduces Inconsistencies
Programs Execute with Consistent Results
Common Features of Centralized DBMS
Promote Sharing: Automatically Provided via CC
No Longer Programmatic Issue
Most DBMS Offer Locking for Key Shared Data
Oracle Allows Locks on Data Item (Attributes)
For Example, Controlling Access to Shared Identifier
Coherent and Central DB Administration
Semantic DB Integrity via the Automatic Enforcement
of Data Consistency via Integrity Constraints/Rules
Data Resiliency
Physical Integrity of Data in the Presence of Faults
and Errors
Supported by DB Recovery
Data Security: Control Access to Sensitive
Data by Authorized Users
Shared Nothing Architecture
In this Architecture, Each DBMS
Operates Autonomously
There is No Sharing
Three Separate DBMSs on
Three Different Computers
Applications/Clients Must Know
About the External Schemas of
all Three DBMSs for
Database Retrieval and
Client Processing
Complicates the Client
Different DBMS Platforms
(Oracle, Sybase, Informix, ...)
Different Access Modes
(Query, Embedded, ODBC)
Difficult for SWE to Code
Difficulty in Access – Manage Multiple APIs
Each Platform has a Different API
API1, API2, …, APIn
An App Programmer Must Utilize All Three APIs, which
could Differ by PL – C++, C, Java, REST, etc.
Any Interactions Across the 3 DBs must be
Programmatically Handled without DB Capabilities
NW Architecture with Centralized DB
High-Speed NWs/WANs Spawned Centralized DB
Accessible Worldwide
Clients at Any Site can Access Repository
Data May be “Far” Away - Increased Access Time
In Practice, Each Remote Site Needs only Portion
of the Data in DB1 and/or DB2
Inefficient, no Replication w.r.t. Failure
Fully Distributed Architecture
The Five Sites (Chicago, SF, LA, NY, Atlanta) each
have a “Portion” of the Database - it's Distributed
Replication is Possible for Fault Tolerance
Queries at one Site May Need to Access Data at
Another Site (e.g., for a Join)
Increased Transaction Processing Complexity
Distributed Database Concepts
A transaction can be executed by multiple networked
computers in a unified manner.
A distributed database (DDB) processes a Unit of
execution (a transaction) in a distributed manner.
A distributed database (DDB) can be defined as a
collection of multiple logically related databases
distributed over a computer network.
A distributed database management system (DDBMS) is a
software system that manages a distributed
database while making the distribution transparent
to the user.
Goals of DDBMS
Support User Distribution Across Multiple Sites
Remote Access by Users Regardless of Location
Distribution and Replication of Database Content
Provide Location Transparency
Users Manipulate their Own Data
Non-Local Sites “Appear” Local to Any User
Provide Transaction Control Akin to Centralized Case
Transaction Control Hides Distribution
CC and Serializability - Must be Extended
Minimize Communications Cost
Optimize Use of Network - a Critical Issue
Distribute DB Design Supported by Partitioning
(Fragmentation) and Replication
Goals of DDBMS
Improve Response Time for DB Access
Use a More Sophisticated Load Control for
Transaction Processing
However, Synchronization Across Sites May
Introduce Additional Overhead
System Availability
Site Independence in the Presence of Site Failure
Subset of Database is Always Available
Replication can Keep All Data Available, Even
When Multiple Sites Fail
Modularity
Incremental Growth with the Addition of Sites
Dedicate Sites to Specific Tasks
Advantages of DDBMS
There are Four Major Advantages
1. Transparency
Distribution/NW Transparency
User Doesn’t Know about NW Configuration (Location
Transparency)
User can Find Object at any Site (Naming
Transparency)
Replication Transparency (see next slide)
User Doesn’t Know Location of Data
Replicas are Transparently Accessible
Fragmentation Transparency
Horizontal Fragmentation (Distribute by Row)
Vertical Fragmentation (Distribute by Column)
Data Distribution and Replication
Other Advantages of DDBMS
2. Increased Reliability and Availability
Reliability - System Always Running
Availability - Data Always Present
Achieved via Replication and Distribution
Ability to Make Single Query for Entire DDBMS
3. Improved Performance
Sites Able to Utilize Data that is Local for
Majority of Queries
4. Easier Expansion
Improve Performance of Site by
Upgrading Processor of Computer
Adding Additional Disks
Splitting a Site into Two or More Sites
Expansion over Time as Business Grows
Challenges of DDBMS
Tracking Data - Meta Data More Complex
Must Track Distribution (where is the Data)
V & H Fragmentation (How is Data Split)
Replication (Multiple Copies for Consistency)
Distributed Query Processing
Optimization, Accessibility, etc., More Complex
Block Analysis of Data Size Must also Now
Consider the NW Transmitting Time
Distributed Transaction Processing
TP Potentially Spans Multiple Sites
Submit Query to Multiple Sites
Collect and Collate Results
Distributed Concurrency Control Across Nodes
Challenges of DDBMS
Replicated Data Management
TP Must Choose the Replica to Access
Updates Must Modify All Replica Copies
Distributed Database Recovery
Recovery of Individual Sites
Recovery Across DDBMS
Security
Local and Remote Authorization
During TP, be Able to Verify Remote Privileges
Distributed Directory Management
Meta-Data on Database - Local and Remote
Must maintain Replicas of this - Every Site Tracks
the Meta-Data for All Sites
A Complete Schema with Keys ...
Keys Allow us to
Establish Links
Between Relations
…and Corresponding DB Tables
which Represent Tuples/Instances of Each Relation
…with Remaining DB Tables
What is Fragmentation?
Fragmentation Divides a DB Across Multiple Sites
Two Types of Fragmentation
Horizontal Fragmentation
Given a Relation R with n Total Tuples, Spread Entire
Tuples Across Multiple Sites
Each Site has a Subset of the n Tuples
Essentially Fragmentation is a Selection
Vertical Fragmentation
Given a Relation R with m Attributes and n Total
Tuples, Spread the Columns Across Multiple Sites
Essentially Fragmentation is a Projection
Not Generally Utilized in Practice
In Both Cases, Sites can Overlap for Replication
Horizontal Fragmentation
A horizontal fragment is a subset of a relation containing
those tuples that satisfy a selection condition.
Consider the Employee relation with condition DNO = 5
All tuples satisfying this condition form a subset that is
a horizontal fragment of the Employee relation.
A selection condition may be composed of several
conditions connected by AND or OR.
Derived horizontal fragmentation:
Partitioning of secondary relations based on their
foreign-key relationship to a primary relation.
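A horizontal fragment is just a selection over the relation, so it can be sketched in a few lines of Python. The Employee tuples and attribute names below are invented for illustration:

```python
# Horizontal fragmentation: each fragment is a selection (a predicate) over
# the relation, holding entire tuples. Hypothetical Employee tuples.
employees = [
    {"ssn": "111", "name": "Smith",  "dno": 5},
    {"ssn": "222", "name": "Wong",   "dno": 5},
    {"ssn": "333", "name": "Zelaya", "dno": 4},
]

def horizontal_fragment(relation, predicate):
    """Return the tuples of `relation` satisfying `predicate` (a selection)."""
    return [t for t in relation if predicate(t)]

# The fragment stored at the site tracking Dept. 5, and the one for Dept. 4.
emp5 = horizontal_fragment(employees, lambda t: t["dno"] == 5)
emp4 = horizontal_fragment(employees, lambda t: t["dno"] == 4)

# Here the fragments are disjoint and their union reconstructs the relation.
assert emp5 + emp4 == employees
```

Derived horizontal fragmentation would apply the same idea to a secondary relation (e.g., Works_On), selecting tuples whose foreign key references tuples in a given Employee fragment.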
Horizontal Fragmentation
Site 2 Tracks All Information Related to Dept. 5
Horizontal Fragmentation
Site 3 Tracks All Information Related to Dept. 4
Note that an Employee Could be Listed in Both Cases,
if s/he Works on a Project for Both Departments
Refined Horizontal Fragmentation
Further Fragment from Site
2 based on Dept. that
Employee Works in
Notice that G1 + G2 + G3 is
the Same as WORKS_ON5 -
there is no Overlap
Refined Horizontal Fragmentation
Further Fragment from Site
3 based on Dept. that
Employee Works in
Notice that G4 + G5 + G6 is
the Same as WORKS_ON4
Note Some Fragments can
be Empty
Vertical Fragmentation
Subset of a relation created via a subset of columns.
A vertical fragment of a relation contains the
values of the selected columns.
There is no selection condition used in vertical
fragmentation.
A strict vertical slice/partition
Consider the Employee relation.
A vertical fragment can be created by keeping
the values of Name, Bdate, Sex, and Address.
Since no condition is used to create a vertical fragment,
each fragment must include the primary key
attribute of the parent relation Employee.
This way, all vertical fragments of a relation stay connected.
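A vertical fragment is a projection that always keeps the primary key, so the fragments can be rejoined into the original relation. A minimal Python sketch with invented Employee tuples:

```python
# Vertical fragmentation: each fragment projects the relation onto a subset
# of columns, always including the primary key so fragments can be rejoined.
employees = [
    {"ssn": "111", "name": "Smith", "bdate": "1965-01-09", "salary": 30000, "dno": 5},
    {"ssn": "222", "name": "Wong",  "bdate": "1955-12-08", "salary": 40000, "dno": 5},
]

def vertical_fragment(relation, attrs, key="ssn"):
    """Project `relation` onto `attrs`, always including the key attribute."""
    keep = [key] + [a for a in attrs if a != key]
    return [{a: t[a] for a in keep} for t in relation]

emp_demo = vertical_fragment(employees, ["name", "bdate"])   # demographic slice
emp_work = vertical_fragment(employees, ["salary", "dno"])   # work-related slice

# Rejoin on the shared primary key to reconstruct the full tuples.
rejoined = [{**d, **w} for d in emp_demo for w in emp_work if d["ssn"] == w["ssn"]]
assert rejoined == employees
```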
Vertical Fragmentation Example
Partition the Employee Table as Below
Notice Each Vertical Fragment Needs Key Column
EmpDemo
EmpSupvrDept
Homogeneous DDBMS
Homogeneous
Identical Software (w.r.t. Database)
One DB Product (e.g., Oracle, DB2, Sybase) is
Distributed and Available at All Sites
Uniformity w.r.t. Administration, Maintenance,
Client Access, Users, Security, etc.
Interaction by Programmatic Clients is Consistent
(e.g., JDBC or ODBC or REST API …)
Non-Federated Heterogeneous DDBMS
Non-Federated Heterogeneous
Different Software (w.r.t. Database)
Multiple DB Products (e.g., Oracle at One Site,
MySQL at another, Sybase, Informix, etc.)
Replicated Administration (e.g., Users Need
Accounts on Multiple Systems)
Varied Programmatic Access - SWEs Must Know
All Platforms/Client Software Complicated
Very Close to Shared Nothing Architecture
Federated DDBMS
Federated
Multiple DBMS
Platforms Overlaid
with a Global
Schema View
Single External
Schema Combines
Schemas from all
Sites
Multiple Data Models
Relational in one
Component DBS
Object-Oriented in
another DBS
Hierarchical in a
3rd DBS
Federated DBMS Issues
Differences in Data Models
Reconcile Relational vs. Object-Oriented Models
Each Different Model has Different Capabilities
These Differences Must be Addressed in Order to
Present a Federated Schema
Differences in Constraints
Referential Integrity Constraints in Different DBSs
Different Constraints on “Similar” Data
Federated Schema Must Deal with these Conflicts
Differences in Query Languages
SQL-89, SQL-92, SQL2, SQL3
Specific Types in Different DBMS (Oracle BLOBs)
Differences in Key Processing & Timestamping
Heterogeneous Distributed Database Systems
Federated: Each site may run a different database system, but
data access is managed through a single conceptual schema.
The degree of local autonomy is minimal.
Each site must adhere to a centralized access policy
There may be a global schema.
Multi-database: There is no single conceptual global schema
For data access, a schema is constructed dynamically as
needed by the application software.
(Figure: five heterogeneous sites joined by a communications network – object-oriented, relational, hierarchical, and network DBMSs running on Unix, Windows, and Linux platforms)
Query Processing in Distributed Databases
Issues
Cost of transferring data (files and results) over the network.
This cost is usually high so some optimization is necessary.
Example relations: Employee at Site 1 and Department at Site 2
– Employee at Site 1: 10,000 rows, row size = 100 bytes, table size =
10^6 = 1,000,000 bytes.
Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
– Department at Site 2: 100 rows, row size = 35 bytes, table size =
3,500 bytes.
Department(Dname, Dnumber, Mgrssn, Mgrstartdate)
Q: For each employee, retrieve the employee name and the name of the
department where the employee works.
Q: π Fname,Lname,Dname (Employee ⋈ Dno = Dnumber Department)
Query Processing in Distributed Databases
Result
The result of this query will have 10,000 tuples,
assuming that every employee is related to a
department.
Suppose each result tuple is 40 bytes long.
The query is submitted at site 3 and the result is
sent to this site.
Problem: Employee and Department relations are
not present at site 3.
Query Processing in Distributed Databases
Strategies:
1. Transfer Employee and Department to site 3.
Total transfer bytes = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3.
Query result size = 40 * 10,000 = 400,000 bytes. Total
transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
Total bytes transferred = 400,000 + 3500 = 403,500 bytes.
Optimization criteria: minimizing data transfer.
Preferred approach: strategy 3.
Query Processing in Distributed Databases
Consider the query
Q’: For each department, retrieve the department
name and the name of the department manager
Relational Algebra expression:
Q’: π Fname,Lname,Dname (Employee ⋈ Mgrssn = SSN Department)
Query Processing in Distributed Databases
Result of the query has 100 tuples, assuming that every
department has a manager. The execution strategies are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
Total bytes transferred = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3. Query result size = 40 * 100 = 4000
bytes.
Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1
and send the result to site 3.
Total transfer size = 4000 + 3500 = 7500 bytes.
Preferred strategy: Choose strategy 3.
Query Processing in Distributed Databases
Now suppose the result site is site 2. Possible strategies:
1. Transfer Employee relation to site 2, execute the
query and present the result to the user at site 2.
Total transfer size = 1,000,000 bytes for both queries Q
and Q’.
2. Transfer Department relation to site 1, execute join
at site 1 and send the result back to site 2.
Total transfer size for Q = 400,000 + 3500 = 403,500
bytes and for Q’ = 4000 + 3500 = 7500 bytes.
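The byte counts in these strategies are simple arithmetic, and can be checked mechanically. A minimal Python sketch using the sizes from the slides (the strategy labels are ours):

```python
# Data-transfer cost (in bytes) of the three join strategies, with the
# result collected at site 3. Sizes are taken from the example relations.
EMP_SIZE = 10_000 * 100    # Employee: 10,000 rows * 100 bytes = 1,000,000
DEPT_SIZE = 100 * 35       # Department: 100 rows * 35 bytes = 3,500
RESULT_ROW = 40            # each result tuple is 40 bytes

def strategy_costs(result_rows):
    result = result_rows * RESULT_ROW
    return {
        "1: ship both relations to site 3": EMP_SIZE + DEPT_SIZE,
        "2: ship Employee to site 2, result to site 3": EMP_SIZE + result,
        "3: ship Department to site 1, result to site 3": DEPT_SIZE + result,
    }

q_costs = strategy_costs(10_000)   # Q: one result tuple per employee
qp_costs = strategy_costs(100)     # Q': one result tuple per department

# Strategy 3 minimizes data transfer for both queries.
assert min(q_costs, key=q_costs.get).startswith("3")
assert min(qp_costs, key=qp_costs.get).startswith("3")
```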
DDBS Concurrency Control and Recovery
Distributed Databases encounter a number of
concurrency control and recovery problems which are
not present in centralized databases, including:
Dealing with multiple copies of data items
How are they All Updated if Needed?
Failure of individual sites
How are Queries Restarted or Rerouted?
Communication link failure
Network Failure
Distributed commit
How to Know All Updates Done at all Sites?
Distributed deadlock
How to Detect and Recover?
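The distributed-commit problem above is classically handled by two-phase commit (2PC): the coordinator commits only if every participant votes YES in the prepare phase. A toy, single-process Python sketch of the idea (the site names and class shape are invented; a real 2PC must also log decisions and handle timeouts):

```python
# Two-phase commit sketch: phase 1 collects votes, phase 2 broadcasts the
# global decision. Unanimous YES => commit; any NO => global abort.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "init"

    def prepare(self):              # phase 1: vote YES/NO
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):       # phase 2: apply the global decision
        if self.state != "aborted":
            self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1
    decision = all(votes)                         # commit only if unanimous
    for p in participants:                        # phase 2
        p.finish(decision)
    return decision

sites = [Participant("NY"), Participant("LA"), Participant("Chicago")]
assert two_phase_commit(sites) is True
assert all(p.state == "committed" for p in sites)
```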
The Next Big Challenge
Interoperability
Heterogeneous Distributed Databases
Heterogeneous Distributed Systems
Autonomous Applications
Scalability
Rapid and Continuous Growth
Amount of Data
Variety of Data Types
Dealing with personally identifiable information (PII)
and personal health information (PHI)
Emergence of Fitness and Health Monitoring Apps
Google Fit and Apple HealthKit
New Apple ResearchKit for Medical Research
Interoperability: A Classic View
(Figure: a simple federation, where an FDB global schema is produced by federated integration of the local schemas, and a multiple nested federation, where federations FDB 1 and FDB 3 are themselves integrated into a global schema FDB 4)
Java Client with Wrapper to Legacy Application
Interactions Between Java Client
and Legacy Appl. via C and RPC
C is the Medium of Info. Exchange
Java Client with C++/C Wrapper
(Figure: the Java client's application code calls wrapper mapping classes in the Java layer, which invoke native functions (C++) and RPC client stubs (C) in the native layer to reach the legacy application over the network)
COTS and Legacy Appls. to Java Clients
(Figure: a COTS application and a legacy application, each wrapped by native-layer functions that map to the application plus Java-layer mapping classes exposed through a Java network wrapper; Java clients reach both wrappers over the network)
Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers
Java Client to Legacy App via RDBS
(Figure: data is extracted and generated from the legacy application, transformed, and stored in a relational database system (RDS); the Java client reads the transformed legacy data and writes updated data back)
Database Interoperability in the Internet
Technology
Web/HTTP, JDBC/ODBC, CORBA (ORBs +
IIOP), XML, SOAP, REST API, WSDL
Architecture
Information Broker
Mediator-Based Systems
Agent-Based Systems
JDBC
JDBC API Provides DB Access Protocols for Open,
Query, Close, etc.
Different Drivers for Different DB Platforms
(Figure: a Java application calls the JDBC API; the driver manager dispatches to per-platform drivers, e.g., Oracle, Access, and Sybase drivers)
Connecting a DB to the Web
(Figure: a browser reaches a Web server over the Internet; the Web server invokes a CGI script or JDBC call against the DBMS)
Web Servers are
Stateless
DB Interactions Tend
to be Stateful
Invoking a CGI
Script on Each DB
Interaction is Very
Expensive, Mainly
Due to the Cost of
DB Open
Connecting More Efficiently
(Figure: the Web server's CGI scripts or JDBC invocations connect to helper processes that keep the DBMS connection open)
To Avoid Cost of
Opening Database, One
can Use Helper
Processes that Always
Keep Database Open
and Outlive Web
Connection
Newly Invoked CGI
Scripts Connect to a
Preexisting Helper
Process
System is Still Stateless
DB-Internet Architecture
(Figure: WWW clients (Netscape, Internet Explorer, HotJava) reach an HTTP server over the Internet; a DBWeb dispatcher routes requests among multiple DBWeb gateways)
EJB Architecture
Technology Push
Computer/Communication Technology (Almost Free)
Plenty of Affordable CPU, Memory, Disk,
Network Bandwidth
Next Generation Internet: Gigabit Now
Wireless: Ubiquitous, High Bandwidth
Information Growth
Massively Parallel Generation of Information on
the Internet and from New Generation of Sensors
Disk Capacity on the Order of Peta-bytes
Small, Handy Devices to Access Information
The focus is to make information
available to users, in the right form, at
the right time, in the appropriate place.
Research Challenges
Ubiquitous/Pervasive
Many computers and information
appliances everywhere,
networked together
Inherent Complexity:
Coping with Latency (Sometimes
Unpredictable)
Failure Detection and Recovery
(Partial Failure)
Concurrency, Load Balancing,
Availability, Scale
Service Partitioning
Ordering of Distributed Events
“Accidental” Complexity:
Heterogeneity: Beyond the Local
Case: Platform, Protocol, Plus All
Local Heterogeneity in Spades.
Autonomy: Change and Evolve
Autonomously
Tool Deficiencies: Language Support
(Sockets, RPC), Debugging, Etc.
Infosphere
Problem: too many sources, too much information
Internet:
Information Jungle
Infopipes
Clean, Reliable,
Timely Information,
Anywhere
Digital
Earth
Personalized
Filtering &
Info. Delivery
Sensors
Current State-of-Art – Has Mobile Changed This?
(Figure: thin clients connect to a Web server backed by a mainframe database server)
Infosphere Scenario – Where Does Mobile Fit?
(Figure: infotaps and fat clients draw on many sources – sensors, a variety of servers, and database servers)
Heterogeneity and Autonomy
Heterogeneity:
How Much can we Really Integrate?
Syntactic Integration
Different Formats and Models
XML/JSON/RDF/OWL/SQL Query Languages
Semantic Interoperability
Basic Research on Ontology, Etc.
DoD Maps (Grid, True, and Magnetic North)
Autonomy
No Central DBA on the Net
Independent Evolution of Schema and Content
Interoperation is Voluntary
Interface Technology: DCOM (Microsoft Standard),
CORBA, Etc.
Security and Data Quality
Security
System Security in the Broad Sense
Attacks: Penetrations, Denial of Service
System (and Information) Survivability
Security Fault Tolerance
Replication for Performance, Availability, and
Survivability
Data Quality
Web Data Quality Problems
Local Updates with Global Effects
Unchecked Redundancy (Mutual Copying)
Registration of Unchecked Information
Spam on the Rise
Legacy Data Challenge
Legacy Applications and Data
Definition: Important and Difficult to Replace
Typically, Mainframe Mission Critical Code
Most are OLTP and Database Applications
Evolution of Legacy Databases
Client-server Architectures
Wrappers
Expensive and Gradual in Any Case
Potential Value Added/Jumping on Bandwagon
Sophisticated Query Capability
Combining SQL with Keyword Queries
Consistent Updates
Atomic Transactions and Beyond
But Everything has to be in a Database!
Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions
Interoperable Query Processing
Extended Transaction Updates
Commodity DB Software
A Little Help is Still Good If it is Cheap
Internet Facilitates Software Distribution
Databases as Middleware
Concluding Remarks
Goals of Distributed DBS
Support User Distribution Across Multiple Sites
Provide Location Transparency
Provide Transaction Control Akin to Centralized
Case
Minimize Communications Cost
Advantages of Distributed DBS
Transparency
Increased Reliability and Availability
Improved Performance
Easier Expansion
Data Mining/Warehousing
Data Warehousing and Data Mining
Data Warehousing
Provide Access to Data for Complex Analysis,
Knowledge Discovery, and Decision Making
Underlying Infrastructure in Support of Mining
Provides Means to Interact with Multiple DBs
OLAP (on-Line Analytical Processing) vs. OLTP
Data Mining
Discovery of Information in Vast Data Sets
Search for Patterns and Common Features
Discover Information not Previously Known
Medical Records Accessible Nationwide
Research/Discover Cures for Rare Diseases
Relies on Knowledge Discovery in DBs (KDD)
What is the Purpose of a Data Warehouse?
Traditional databases are not optimized purely for data access; they
must balance fast access against the requirement of ensuring integrity.
Most data warehouse users need only read access, but need that
access to be fast over a large volume of data.
Most of the data required for data warehouse analysis comes
from multiple databases, and these analyses are recurrent and
predictable enough that software can be designed to meet the requirements.
Critical for tools that provide decision makers with information
to make decisions quickly and reliably based on historical data.
The aforementioned characteristics are achieved by data
warehousing and online analytical processing (OLAP)
W. H. Inmon characterized a data warehouse as:
“A subject-oriented, integrated, nonvolatile, time-variant
collection of data in support of management’s decisions.”
Data Warehousing and OLAP
A Data Warehouse
Database is Maintained Separately from an
Operational Database
“A Subject-Oriented, Integrated, Time-Variant,
and Non-Volatile Collection of Data in Support of
Management’s Decision Making Process”
[W. H. Inmon]
OLAP (on-Line Analytical Processing)
Analysis of Complex Data in the Warehouse
Attempt to Attain “Value” through Analysis
Relies on Trained and Adept Skilled Knowledge
Workers who Discover Information
Data Mart
Organized Data for a Subset of an Organization
Building a Data Warehouse
Option 1
Leverage Existing
Repositories
Collate and Collect
May Not Capture All
Relevant Data
Option 2
Start from Scratch
Utilize Underlying
Corporate Data
(Figure: Option 1 consolidates existing data marts into the corporate data warehouse; Option 2 builds the warehouse from scratch on top of corporate data)
Comparison with Traditional Databases
Data Warehouses are mainly optimized for appropriate
data access
Traditional databases are transactional and
optimized for both access mechanisms and
integrity assurance measures.
Data warehouses emphasize historical data, as their
main use is time-series and trend analysis.
Compared with transactional databases, data
warehouses are nonvolatile.
In transactional databases, a transaction is the
mechanism of change to the database.
In a warehouse, data is relatively coarse grained and the
refresh policy is carefully chosen, usually incremental.
Classification of Data Warehouses
Generally, Data Warehouses are an order of
magnitude larger than the source databases.
The sheer volume of data is an issue, based on which
Data Warehouses could be classified as follows.
Enterprise-wide data warehouses
Huge projects requiring massive investment of time and
resources.
Virtual data warehouses
Provide views of operational databases that are
materialized for efficient access.
Data marts
Generally targeted to a subset of organization, such as a
department, and are more tightly focused.
Data Warehouse Characteristics
Utilizes a “Multi-Dimensional” Data Model
Warehouse Comprised of
Store of Integrated Data from Multiple Sources
Processed into Multi-Dimensional Model
Warehouse Supports of
Times Series and Trend Analysis
“Super-Excel” Integrated with DB Technologies
Data is Less Volatile than Regular DB
Doesn’t Dramatically Change Over Time
Updates at Regular Intervals
Specific Refresh Policy Regarding Some Data
Three Tier Architecture
(Figure: external data sources and operational databases are monitored and integrated via Extract, Transform, Load, and Refresh into the data warehouse, with its metadata and data marts; the warehouse serves an OLAP server supporting query/report, summarization/report, and data mining tools)
Data Modeling for Data Warehouses
Traditional databases generally deal with two-dimensional data (similar to a spreadsheet).
However, querying performance on a multi-dimensional data storage model is much more
efficient.
Data warehouses can take advantage of this feature, as
generally they are
Non volatile
The degree of predictability of the analysis that
will be performed on them is high.
What is a Multi-Dimensional Data Cube?
Representation of Information in Two or More
Dimensions
Typical Two-Dimensional - Spreadsheet
In Practice, to Track Trends or Conduct Analysis,
Three or More Dimensions are Useful
Aggregate Raw Data!
Data Warehouse Design
Most Data Warehouses use a Star Schema to
Represent the Multi-Dimensional Data Model
Each Dimension is Represented by a Dimension
Table that Provides its Multidimensional Coordinates
and Stores Measures for those Coordinates
A Fact Table Connects All Dimension Tables with a
Multiple Join
Each Tuple in the Fact Table Represents One Fact
(e.g., a Sale) Measured Across the Dimensions
Each Tuple in the Fact Table Consists of a Pointer
to Each of the Dimensional Tables
Links Between the Fact Table and the Dimensional
Tables Form a Shape Like a Star
Sample Fact Tables
Example of Star Schema
Sale Fact Table: Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
Date Dimension: Date, Month, Year
Product Dimension: ProductNo, ProdName, ProdDesc, Category
Store Dimension: StoreID, City, State, Country, Region
Customer Dimension: CustID, CustName, CustCity, CustCountry
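A star schema like this can be sketched in a few lines of SQLite: a fact table whose tuples point at the dimension tables, queried with a "star join". The table and column names follow the slide; the rows are invented sample data:

```python
# Minimal star-schema sketch: Sale fact table + Product and Store dimensions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Product(ProductNo INTEGER PRIMARY KEY, ProdName TEXT);
    CREATE TABLE Store(StoreID INTEGER PRIMARY KEY, City TEXT);
    CREATE TABLE Sale(Date TEXT, Product INTEGER, Store INTEGER,
                      Unit_Sales INTEGER, Dollar_Sales REAL);
    INSERT INTO Product VALUES (1, 'Beer'), (2, 'Nuts');
    INSERT INTO Store VALUES (10, 'Atlanta'), (20, 'NY');
    INSERT INTO Sale VALUES ('1999-07-03', 1, 10, 5, 25.0),
                            ('1999-07-03', 2, 10, 3, 9.0),
                            ('1999-07-04', 1, 20, 2, 10.0);
""")

# The "star join": the fact table joined to each dimension it points at,
# then aggregated by the dimensional coordinates of interest.
rows = con.execute("""
    SELECT p.ProdName, s.City, SUM(f.Dollar_Sales)
    FROM Sale f
    JOIN Product p ON f.Product = p.ProductNo
    JOIN Store s   ON f.Store = s.StoreID
    GROUP BY p.ProdName, s.City
    ORDER BY p.ProdName, s.City
""").fetchall()
```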
A Second Example of Star Schema …
and Corresponding Snowflake Schema
Multi-dimensional Schemas
Fact Constellation
A fact constellation is a set of fact tables that share some
dimension tables.
However, fact constellations limit the possible
queries for the warehouse.
Data Warehouse Issues
Data Acquisition
Extraction from Heterogeneous Sources
Reformatted into Warehouse Context - Names,
Meanings, Data Domains Must be Consistent
Data Cleaning for Validity and Quality
is the Data as Expected w.r.t. Content? Value?
Transition of Data into Data Model of Warehouse
Loading of Data into the Warehouse
Other Issues Include:
How Current is the Data? Frequency of Update?
Availability of Warehouse? Dependencies of Data?
Distribution, Replication, and Partitioning Needs?
Loading Time (Clean, Format, Copy, Transmit,
Index Creation, etc.)?
Processing in a Data Warehouse
Processing Types are Varied and Include:
Roll-up: Data is summarized with increasing
generalization
Drill-Down: Increasing levels of detail are
revealed
Pivot: Cross tabulation is performed
Slice and dice: Performing projection operations
on the dimensions.
Sorting: Data is sorted by ordinal value.
Selection: Data is available by value or range.
Derived attributes: Attributes are computed by
operations on stored derived values.
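Roll-up and slice-and-dice, the first of these operations, can be sketched over a small in-memory cube. A Python sketch with invented (product, city, month) → sales data:

```python
# Toy OLAP operations over a (product, city, month) -> sales cube stored as
# a dict. Roll-up aggregates away dimensions; a slice fixes one dimension.
cube = {
    ("Beer", "Atlanta", "Jan"): 10, ("Beer", "Atlanta", "Feb"): 12,
    ("Beer", "NY", "Jan"): 7,       ("Nuts", "Atlanta", "Jan"): 4,
}

def roll_up(cube, keep):
    """Sum sales over the dimensions NOT in `keep` (0=product, 1=city, 2=month)."""
    out = {}
    for key, v in cube.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0) + v
    return out

def slice_cube(cube, dim, value):
    """Fix dimension `dim` to `value`, dropping it from the keys."""
    return {tuple(k for i, k in enumerate(key) if i != dim): v
            for key, v in cube.items() if key[dim] == value}

by_product = roll_up(cube, keep=[0])                 # roll up city and month
atlanta = slice_cube(cube, dim=1, value="Atlanta")   # slice on city = Atlanta
```

Drill-down is the inverse of roll-up: it requires the finer-grained data (e.g., the original cube) to re-expand an aggregate.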
On-Line Analytical Processing
Data Cube
A Multidimensonal Array
Each Attribute is a Dimension
In Example Below, the Data Must be Interpreted so
that it Can be Aggregated by Region/Product/Date
Product | Store | Date | Sale
acron | Rolla, MO | 7/3/99 | 325.24
budwiser | LA, CA | 5/22/99 | 833.92
large pants | NY, NY | 2/12/99 | 771.24
3’ diaper | Cuba, MO | 7/30/99 | 81.99
(Figure: a three-dimensional cube with Product (Pants, Diapers, Beer, Nuts), Region (West, East, Central, Mountain, South), and Date (Jan, Feb, March, April) axes)
Examples of Data Mining
The Slicing Action
A Vertical or Horizontal Slice Across Entire Cube
(Figure: from a Products × Months × Cities cube, slicing on city = Atlanta yields a two-dimensional Products × Months sales slice)
Examples of Data Mining
The Dicing Action
A Slice First Fixes One Dimension
A Selection of Any Sub-Cube within the Slice then
Essentially Constrains All Three Dimensions
(Figure: dicing on Electronics and Atlanta picks out the March 2000 / Electronics / Atlanta sub-cube from the Products × Months cube)
Examples of Data Mining
Drill Down - Takes a Facet (e.g., Q1)
and Decomposes it into Finer Detail
Roll Up - Combines Multiple Dimensions,
e.g., From Individual Cities to a State
(Figure: drilling down on Q1 expands a Products × Quarters (Q1 Q2 Q3 Q4) sales view into Jan/Feb/March; rolling up on location aggregates city-level sales to the (State, USA) level)
Mining Other Types of Data
Analysis and Access Dramatically More Complicated!
Spatial databases
Multimedia databases
World Wide Web
Time series data
Geographical and Satellite Data
Advantages/Objectives of Data Mining
Descriptive Mining
Discover and Describe General Properties
60% People who buy Beer on Friday also have
Bought Nuts or Chips in the Past Three Months
Predictive Mining
Infer Interesting Properties based on Available
Data
People who Buy Beer on Friday usually also Buy
Nuts or Chips
Result of Mining
Order from Chaos
Mining Large Data Sets in Multiple Dimensions
Allows Businesses, Individuals, etc. to Learn about
Trends, Behavior, etc.
Impact on Marketing Strategy
Why is Data Mining Popular?
Technology Push
Technology for Collecting Large Quantity of Data
Bar Code, Scanners, Satellites, Cameras
Technology for Storing Large Collection of Data
Databases, Data Warehouses
Variety of Data Repositories, such as Virtual Worlds,
Digital Media, World Wide Web
Corporations want to Improve Direct Marketing and
Promotions - Driving Technology Advances
Targeted Marketing by Age, Region, Income, etc.
Exploiting User Preferences/Customized Shopping
Requirements & Challenges in Data Mining
Security and Social
What Information is Available to Mine?
Preferences via Store Cards/Web Purchases
What is Your Comfort Level with Trends?
User Interfaces and Visualization
What Tools Must be Provided for End Users of
Data Mining Systems?
How are Results for Multi-Dimensional Data
Displayed?
Performance Guarantees
Range from Real-Time for Some Queries to
Long-Term for Other Queries
Data Sources of Complex Data Types or Unstructured
Data - Ability to Format, Clean, and Load Data Sets
Successful Data Mining Applications
Business Data Analysis and Decision Support
Marketing, Customer Profiling, Market Analysis
and Management, Risk Analysis and Management
Fraud Detection
Detecting Telephone Fraud, Automotive and
Health Insurance Fraud, Credit-card Fraud,
Suspicious Money Transactions (Money
Laundering)
Text Mining
Message Filtering (Email, Newsgroups, Etc.)
Newspaper Articles Analysis
Sports
IBM Advanced Scout Analyzed NBA Game
Statistics (Shots Blocked, Assists and Fouls) to
Gain Competitive Advantage