cse4701chap25x - University of Connecticut
Chapter 25, 6e - 24, 5e
Distributed Databases
CSE
4701
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT 06269-3155
[email protected]
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
A portion of these slides are being used with the permission of Dr. Ling Lui,
Associate Professor, College of Computing, Georgia Tech.
Remaining slides represent new material.
Classical and Distributed Architectures
Classic/Centralized DBMS Dominated the
Commercial Market from 1970s Forward
Problems of this Approach
Difficult to Scale w.r.t. Performance Gains
If DB Overloaded, replace with a Faster Computer
This can Only Go So Far - Disk Bottlenecks
Distributed DBMS have Evolved to Address a
Number of Issues
Improved Performance
Putting Data “Near” Location where it is Needed
Replication of Data for Fault Tolerance
Vertical and Horizontal Partitioning of DB Tuples
Common Features of Centralized DBMS
Data Independence
High-Level Representation via Conceptual and
External Schemas
Physical Representation (Internal Schema) Hidden
Program Independence
Multiple Applications can Share Data
Views/External Schema Support this Capability
Reduction of Program/Data Redundancy
Single, Unique, Conceptual Schema
Shared Database
Almost No Data Redundancy
Controlled Data Access Reduces Inconsistencies
Programs Execute with Consistent Results
Common Features of Centralized DBMS
Promote Sharing: Automatically Provided via CC
No Longer Programmatic Issue
Most DBMS Offer Locking for Key Shared Data
Oracle Allows Locks on Data Item (Attributes)
For Example, Controlling Access to Shared Identifier
Coherent and Central DB Administration
Semantic DB Integrity via the Automatic Enforcement
of Data Consistency via Integrity Constraints/Rules
Data Resiliency
Physical Integrity of Data in the Presence of Faults
and Errors
Supported by DB Recovery
Data Security: Control Access to Sensitive
Data by Authorized Users
Shared Nothing Architecture
In this Architecture, Each DBMS
Operates Autonomously
There is No Sharing
Three Separate DBMSs on
Three Different Computers
Applications/Clients Must Know
About the External Schemas of
all Three DBMSs for
Database Retrieval and
Client Processing
Complicates the Client
Different DBMS Platforms
(Oracle, Sybase, Informix, ...)
Different Access Modes
(Query, Embedded, ODBC)
Difficult for SWE to Code
Difficulty in Access – Manage Multiple APIs
Each Platform has a Different API
API1, API2, …, APIn
An App Programmer Must Utilize All Three APIs, which
could Differ by PL – C++, C, Java, REST, etc.
Any Interactions Across the 3 DBs must be
Programmatically Handled without DB Capabilities
NW Architecture with Centralized DB
High-Speed NWs/WANs Spawned Centralized DB
Accessible Worldwide
Clients at Any Site can Access Repository
Data May be “Far” Away - Increased Access Time
In Practice, Each Remote Site Needs only Portion
of the Data in DB1 and/or DB2
Inefficient, no Replication w.r.t. Failure
Fully Distributed Architecture
The Five Sites (Chicago, SF, LA, NY, Atlanta) each
have a “Portion” of the Database - it's Distributed
Replication is Possible for Fault Tolerance
Queries at one Site May Need to Access Data at
Another Site (e.g., for a Join)
Increased Transaction Processing Complexity
Distributed Database Concepts
A transaction can be executed by multiple networked
computers in a unified manner.
A distributed database (DDB) processes a Unit of
execution (a transaction) in a distributed manner.
A distributed database (DDB) can be defined as a
collection of multiple logically related databases
distributed over a computer network.
A distributed database management system (DDBMS) is a
software system that manages a distributed
database while making the distribution transparent
to the user.
Goals of DDBMS
Support User Distribution Across Multiple Sites
Remote Access by Users Regardless of Location
Distribution and Replication of Database Content
Provide Location Transparency
Users Manipulate their Own Data
Non-Local Sites “Appear” Local to Any User
Provide Transaction Control Akin to Centralized Case
Transaction Control Hides Distribution
CC and Serializability - Must be Extended
Minimize Communications Cost
Optimize Use of Network - a Critical Issue
Distribute DB Design Supported by Partitioning
(Fragmentation) and Replication
Goals of DDBMS
Improve Response Time for DB Access
Use a More Sophisticated Load Control for
Transaction Processing
However, Synchronization Across Sites May
Introduce Additional Overhead
System Availability
Site Independence in the Presence of Site Failure
Subset of Database is Always Available
Replication can Keep All Data Available, Even
When Multiple Sites Fail
Modularity
Incremental Growth with the Addition of Sites
Dedicate Sites to Specific Tasks
Advantages of DDBMS
There are Four Major Advantages
1. Transparency
Distribution/NW Transparency
User Doesn’t Know about NW Configuration (Location
Transparency)
User can Find Object at any Site (Naming
Transparency)
Replication Transparency (see next slide)
User Doesn’t Know Location of Data
Replicas are Transparently Accessible
Fragmentation Transparency
Horizontal Fragmentation (Distribute by Row)
Vertical Fragmentation (Distribute by Column)
Data Distribution and Replication
Other Advantages of DDBMS
2. Increased Reliability and Availability
Reliability - System Always Running
Availability - Data Always Present
Achieved via Replication and Distribution
Ability to Make Single Query for Entire DDBMS
3. Improved Performance
Sites Able to Utilize Data that is Local for
Majority of Queries
4. Easier Expansion
Improve Performance of Site by
Upgrading Processor of Computer
Adding Additional Disks
Splitting a Site into Two or More Sites
Expansion over Time as Business Grows
Challenges of DDBMS
Tracking Data - Meta Data More Complex
Must Track Distribution (where is the Data)
V & H Fragmentation (How is Data Split)
Replication (Multiple Copies for Consistency)
Distributed Query Processing
Optimization, Accessibility, etc., More Complex
Block Analysis of Data Size Must also Now
Consider the NW Transmitting Time
Distributed Transaction Processing
TP Potentially Spans Multiple Sites
Submit Query to Multiple Sites
Collect and Collate Results
Distributed Concurrency Control Across Nodes
Challenges of DDBMS
Replicated Data Management
TP Must Choose the Replica to Access
Updates Must Modify All Replica Copies
Distributed Database Recovery
Recovery of Individual Sites
Recovery Across DDBMS
Security
Local and Remote Authorization
During TP, be Able to Verify Remote Privileges
Distributed Directory Management
Meta-Data on Database - Local and Remote
Must maintain Replicas of this - Every Site Tracks
the Meta-Data for All Sites
A Complete Schema with Keys ...
Keys Allow us to
Establish Links
Between Relations
…and Corresponding DB Tables
which Represent Tuples/Instances of Each Relation
…with Remaining DB Tables
What is Fragmentation?
Fragmentation Divides a DB Across Multiple Sites
Two Types of Fragmentation
Horizontal Fragmentation
Given a Relation R with n Total Tuples, Spread Entire
Tuples Across Multiple Sites
Each Site has a Subset of the n Tuples
Essentially Fragmentation is a Selection
Vertical Fragmentation
Given a Relation R with m Attributes and n Total
Tuples, Spread the Columns Across Multiple Sites
Essentially Fragmentation is a Projection
Not Generally Utilized in Practice
In Both Cases, Sites can Overlap for Replication
Horizontal Fragmentation
A horizontal fragment is a subset of a relation containing
those tuples that satisfy a selection condition.
Consider the Employee relation with condition DNO = 5
All tuples satisfying this condition form a subset that is
a horizontal fragment of the Employee relation.
A selection condition may be composed of several
conditions connected by AND or OR.
Derived horizontal fragmentation:
Partitioning of secondary relations based on their
foreign-key relationship to a primary relation.
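A horizontal fragment is just a selection over the relation, so it can be sketched in a few lines of Python. The Employee tuples and attribute names below are invented for illustration:

```python
# Horizontal fragmentation: each fragment is a selection (a predicate) over
# the relation, holding entire tuples. Hypothetical Employee tuples.
employees = [
    {"ssn": "111", "name": "Smith",  "dno": 5},
    {"ssn": "222", "name": "Wong",   "dno": 5},
    {"ssn": "333", "name": "Zelaya", "dno": 4},
]

def horizontal_fragment(relation, predicate):
    """Return the tuples of `relation` satisfying `predicate` (a selection)."""
    return [t for t in relation if predicate(t)]

# The fragment stored at the site tracking Dept. 5, and the one for Dept. 4.
emp5 = horizontal_fragment(employees, lambda t: t["dno"] == 5)
emp4 = horizontal_fragment(employees, lambda t: t["dno"] == 4)

# Here the fragments are disjoint and their union reconstructs the relation.
assert emp5 + emp4 == employees
```

Derived horizontal fragmentation would apply the same idea to a secondary relation (e.g., Works_On), selecting tuples whose foreign key references tuples in a given Employee fragment.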
Horizontal Fragmentation
Site 2 Tracks All Information Related to Dept. 5
Horizontal Fragmentation
Site 3 Tracks All Information Related to Dept. 4
Note that an Employee Could be Listed in Both Cases,
if s/he Works on a Project for Both Departments
Refined Horizontal Fragmentation
Further Fragment from Site
2 based on Dept. that
Employee Works in
Notice that G1 + G2 + G3 is
the Same as WORKS_ON5 -
there is no Overlap
Refined Horizontal Fragmentation
Further Fragment from Site
3 based on Dept. that
Employee Works in
Notice that G4 + G5 + G6 is
the Same as WORKS_ON4
Note Some Fragments can
be Empty
Vertical Fragmentation
Subset of a relation created via a subset of columns.
A vertical fragment of a relation contains the
values of the selected columns.
There is no selection condition used in vertical
fragmentation.
A strict vertical slice/partition
Consider the Employee relation.
A vertical fragment can be created by keeping
the values of Name, Bdate, Sex, and Address.
Since no condition is used to create a vertical fragment,
each fragment must include the primary key
attribute of the parent relation Employee.
This way, all vertical fragments of a relation stay connected.
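A vertical fragment is a projection that always keeps the primary key, so the fragments can be rejoined into the original relation. A minimal Python sketch with invented Employee tuples:

```python
# Vertical fragmentation: each fragment projects the relation onto a subset
# of columns, always including the primary key so fragments can be rejoined.
employees = [
    {"ssn": "111", "name": "Smith", "bdate": "1965-01-09", "salary": 30000, "dno": 5},
    {"ssn": "222", "name": "Wong",  "bdate": "1955-12-08", "salary": 40000, "dno": 5},
]

def vertical_fragment(relation, attrs, key="ssn"):
    """Project `relation` onto `attrs`, always including the key attribute."""
    keep = [key] + [a for a in attrs if a != key]
    return [{a: t[a] for a in keep} for t in relation]

emp_demo = vertical_fragment(employees, ["name", "bdate"])   # demographic slice
emp_work = vertical_fragment(employees, ["salary", "dno"])   # work-related slice

# Rejoin on the shared primary key to reconstruct the full tuples.
rejoined = [{**d, **w} for d in emp_demo for w in emp_work if d["ssn"] == w["ssn"]]
assert rejoined == employees
```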
Vertical Fragmentation Example
Partition the Employee Table as Below
Notice Each Vertical Fragment Needs Key Column
EmpDemo
EmpSupvrDept
Homogeneous DDBMS
Homogeneous
Identical Software (w.r.t. Database)
One DB Product (e.g., Oracle, DB2, Sybase) is
Distributed and Available at All Sites
Uniformity w.r.t. Administration, Maintenance,
Client Access, Users, Security, etc.
Interaction by Programmatic Clients is Consistent
(e.g., JDBC or ODBC or REST API …)
Non-Federated Heterogeneous DDBMS
Non-Federated Heterogeneous
Different Software (w.r.t. Database)
Multiple DB Products (e.g., Oracle at One Site,
MySQL at another, Sybase, Informix, etc.)
Replicated Administration (e.g., Users Need
Accounts on Multiple Systems)
Varied Programmatic Access - SWEs Must Know
All Platforms/Client Software Complicated
Very Close to Shared Nothing Architecture
Federated DDBMS
Federated
Multiple DBMS
Platforms Overlaid
with a Global
Schema View
Single External
Schema Combines
Schemas from all
Sites
Multiple Data Models
Relational in one
Component DBS
Object-Oriented in
another DBS
Hierarchical in a
3rd DBS
Federated DBMS Issues
Differences in Data Models
Reconcile Relational vs. Object-Oriented Models
Each Different Model has Different Capabilities
These Differences Must be Addressed in Order to
Present a Federated Schema
Differences in Constraints
Referential Integrity Constraints in Different DBSs
Different Constraints on “Similar” Data
Federated Schema Must Deal with these Conflicts
Differences in Query Languages
SQL-89, SQL-92, SQL2, SQL3
Specific Types in Different DBMS (Oracle BLOBs)
Differences in Key Processing & Timestamping
Heterogeneous Distributed Database Systems
Federated: Each site may run a different database system, but
data access is managed through a single conceptual schema.
The degree of local autonomy is minimal.
Each site must adhere to a centralized access policy
There may be a global schema.
Multi-database: There is no single conceptual global schema
For data access, a schema is constructed dynamically as
needed by the application software.
(Figure: five heterogeneous sites joined by a communications network – object-oriented, relational, hierarchical, and network DBMSs running on Unix, Windows, and Linux platforms)
Query Processing in Distributed Databases
Issues
Cost of transferring data (files and results) over the network.
This cost is usually high so some optimization is necessary.
Example relations: Employee at Site 1 and Department at Site 2
– Employee at Site 1: 10,000 rows, row size = 100 bytes, table size =
10^6 = 1,000,000 bytes.
Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
– Department at Site 2: 100 rows, row size = 35 bytes, table size =
3,500 bytes.
Department(Dname, Dnumber, Mgrssn, Mgrstartdate)
Q: For each employee, retrieve the employee name and the name of the
department where the employee works.
Q: π Fname,Lname,Dname (Employee ⋈ Dno = Dnumber Department)
Query Processing in Distributed Databases
Result
The result of this query will have 10,000 tuples,
assuming that every employee is related to a
department.
Suppose each result tuple is 40 bytes long.
The query is submitted at site 3 and the result is
sent to this site.
Problem: Employee and Department relations are
not present at site 3.
Query Processing in Distributed Databases
Strategies:
1. Transfer Employee and Department to site 3.
Total transfer bytes = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3.
Query result size = 40 * 10,000 = 400,000 bytes. Total
transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
Total bytes transferred = 400,000 + 3500 = 403,500 bytes.
Optimization criteria: minimizing data transfer.
Preferred approach: strategy 3.
Query Processing in Distributed Databases
Consider the query
Q’: For each department, retrieve the department
name and the name of the department manager
Relational Algebra expression:
Q’: π Fname,Lname,Dname (Employee ⋈ Mgrssn = SSN Department)
Query Processing in Distributed Databases
Result of the query has 100 tuples, assuming that every
department has a manager. The execution strategies are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
Total bytes transferred = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3. Query result size = 40 * 100 = 4000
bytes.
Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1
and send the result to site 3.
Total transfer size = 4000 + 3500 = 7500 bytes.
Preferred strategy: Choose strategy 3.
Query Processing in Distributed Databases
Now suppose the result site is site 2. Possible strategies:
1. Transfer Employee relation to site 2, execute the
query and present the result to the user at site 2.
Total transfer size = 1,000,000 bytes for both queries Q
and Q’.
2. Transfer Department relation to site 1, execute join
at site 1 and send the result back to site 2.
Total transfer size for Q = 400,000 + 3500 = 403,500
bytes and for Q’ = 4000 + 3500 = 7500 bytes.
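The byte counts in these strategies are simple arithmetic, and can be checked mechanically. A minimal Python sketch using the sizes from the slides (the strategy labels are ours):

```python
# Data-transfer cost (in bytes) of the three join strategies, with the
# result collected at site 3. Sizes are taken from the example relations.
EMP_SIZE = 10_000 * 100    # Employee: 10,000 rows * 100 bytes = 1,000,000
DEPT_SIZE = 100 * 35       # Department: 100 rows * 35 bytes = 3,500
RESULT_ROW = 40            # each result tuple is 40 bytes

def strategy_costs(result_rows):
    result = result_rows * RESULT_ROW
    return {
        "1: ship both relations to site 3": EMP_SIZE + DEPT_SIZE,
        "2: ship Employee to site 2, result to site 3": EMP_SIZE + result,
        "3: ship Department to site 1, result to site 3": DEPT_SIZE + result,
    }

q_costs = strategy_costs(10_000)   # Q: one result tuple per employee
qp_costs = strategy_costs(100)     # Q': one result tuple per department

# Strategy 3 minimizes data transfer for both queries.
assert min(q_costs, key=q_costs.get).startswith("3")
assert min(qp_costs, key=qp_costs.get).startswith("3")
```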
DDBS Concurrency Control and Recovery
Distributed Databases encounter a number of
concurrency control and recovery problems which are
not present in centralized databases, including:
Dealing with multiple copies of data items
How are they All Updated if Needed?
Failure of individual sites
How are Queries Restarted or Rerouted?
Communication link failure
Network Failure
Distributed commit
How to Know All Updates Done at all Sites?
Distributed deadlock
How to Detect and Recover?
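The distributed-commit problem above is classically handled by two-phase commit (2PC): the coordinator commits only if every participant votes YES in the prepare phase. A toy, single-process Python sketch of the idea (the site names and class shape are invented; a real 2PC must also log decisions and handle timeouts):

```python
# Two-phase commit sketch: phase 1 collects votes, phase 2 broadcasts the
# global decision. Unanimous YES => commit; any NO => global abort.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "init"

    def prepare(self):              # phase 1: vote YES/NO
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):       # phase 2: apply the global decision
        if self.state != "aborted":
            self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1
    decision = all(votes)                         # commit only if unanimous
    for p in participants:                        # phase 2
        p.finish(decision)
    return decision

sites = [Participant("NY"), Participant("LA"), Participant("Chicago")]
assert two_phase_commit(sites) is True
assert all(p.state == "committed" for p in sites)
```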
The Next Big Challenge
Interoperability
Heterogeneous Distributed Databases
Heterogeneous Distributed Systems
Autonomous Applications
Scalability
Rapid and Continuous Growth
Amount of Data
Variety of Data Types
Dealing with personally identifiable information (PII)
and personal health information (PHI)
Emergence of Fitness and Health Monitoring Apps
Google Fit and Apple HealthKit
New Apple ResearchKit for Medical Research
Interoperability: A Classic View
(Figure: a simple federation, where an FDB global schema is produced by federated integration of the local schemas, and a multiple nested federation, where federations FDB 1 and FDB 3 are themselves integrated into a global schema FDB 4)
Java Client with Wrapper to Legacy Application
Interactions Between Java Client
and Legacy Appl. via C and RPC
C is the Medium of Info. Exchange
Java Client with C++/C Wrapper
(Figure: the Java client's application code calls wrapper mapping classes in the Java layer, which invoke native functions (C++) and RPC client stubs (C) in the native layer to reach the legacy application over the network)
COTS and Legacy Appls. to Java Clients
(Figure: a COTS application and a legacy application, each wrapped by native-layer functions that map to the application plus Java-layer mapping classes exposed through a Java network wrapper; Java clients reach both wrappers over the network)
Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers
Java Client to Legacy App via RDBS
(Figure: data is extracted and generated from the legacy application, transformed, and stored in a relational database system (RDS); the Java client reads the transformed legacy data and writes updated data back)
Database Interoperability in the Internet
Technology
Web/HTTP, JDBC/ODBC, CORBA (ORBs +
IIOP), XML, SOAP, REST API, WSDL
Architecture
Information Broker
Mediator-Based Systems
Agent-Based Systems
JDBC
JDBC API Provides DB Access Protocols for Open,
Query, Close, etc.
Different Drivers for Different DB Platforms
(Figure: a Java application calls the JDBC API; the driver manager dispatches to per-platform drivers, e.g., Oracle, Access, and Sybase drivers)
Connecting a DB to the Web
(Figure: a browser reaches a Web server over the Internet; the Web server invokes a CGI script or JDBC call against the DBMS)
Web Servers are
Stateless
DB Interactions Tend
to be Stateful
Invoking a CGI
Script on Each DB
Interaction is Very
Expensive, Mainly
Due to the Cost of
DB Open
Connecting More Efficiently
(Figure: the Web server's CGI scripts or JDBC invocations connect to helper processes that keep the DBMS connection open)
To Avoid Cost of
Opening Database, One
can Use Helper
Processes that Always
Keep Database Open
and Outlive Web
Connection
Newly Invoked CGI
Scripts Connect to a
Preexisting Helper
Process
System is Still Stateless
DB-Internet Architecture
(Figure: WWW clients (Netscape, Internet Explorer, HotJava) reach an HTTP server over the Internet; a DBWeb dispatcher routes requests among multiple DBWeb gateways)
EJB Architecture
Technology Push
Computer/Communication Technology (Almost Free)
Plenty of Affordable CPU, Memory, Disk,
Network Bandwidth
Next Generation Internet: Gigabit Now
Wireless: Ubiquitous, High Bandwidth
Information Growth
Massively Parallel Generation of Information on
the Internet and from New Generation of Sensors
Disk Capacity on the Order of Peta-bytes
Small, Handy Devices to Access Information
The focus is to make information
available to users, in the right form, at
the right time, in the appropriate place.
Research Challenges
Ubiquitous/Pervasive
Many computers and information
appliances everywhere,
networked together
Inherent Complexity:
Coping with Latency (Sometimes
Unpredictable)
Failure Detection and Recovery
(Partial Failure)
Concurrency, Load Balancing,
Availability, Scale
Service Partitioning
Ordering of Distributed Events
“Accidental” Complexity:
Heterogeneity: Beyond the Local
Case: Platform, Protocol, Plus All
Local Heterogeneity in Spades.
Autonomy: Change and Evolve
Autonomously
Tool Deficiencies: Language Support
(Sockets, RPC), Debugging, Etc.
Infosphere
Problem: too many sources, too much information
Internet:
Information Jungle
Infopipes
Clean, Reliable,
Timely Information,
Anywhere
Digital
Earth
Personalized
Filtering &
Info. Delivery
Sensors
Current State-of-Art – Has Mobile Changed This?
(Figure: thin clients connect to a Web server backed by a mainframe database server)
Infosphere Scenario – Where Does Mobile Fit?
(Figure: infotaps and fat clients draw on many sources – sensors, a variety of servers, and database servers)
Heterogeneity and Autonomy
Heterogeneity:
How Much can we Really Integrate?
Syntactic Integration
Different Formats and Models
XML/JSON/RDF/OWL/SQL Query Languages
Semantic Interoperability
Basic Research on Ontology, Etc.
DoD Maps (Grid, True, and Magnetic North)
Autonomy
No Central DBA on the Net
Independent Evolution of Schema and Content
Interoperation is Voluntary
Interface Technology: DCOM (Microsoft Standard),
CORBA, Etc.
Security and Data Quality
Security
System Security in the Broad Sense
Attacks: Penetrations, Denial of Service
System (and Information) Survivability
Security Fault Tolerance
Replication for Performance, Availability, and
Survivability
Data Quality
Web Data Quality Problems
Local Updates with Global Effects
Unchecked Redundancy (Mutual Copying)
Registration of Unchecked Information
Spam on the Rise
Legacy Data Challenge
Legacy Applications and Data
Definition: Important and Difficult to Replace
Typically, Mainframe Mission Critical Code
Most are OLTP and Database Applications
Evolution of Legacy Databases
Client-server Architectures
Wrappers
Expensive and Gradual in Any Case
Potential Value Added/Jumping on Bandwagon
Sophisticated Query Capability
Combining SQL with Keyword Queries
Consistent Updates
Atomic Transactions and Beyond
But Everything has to be in a Database!
Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions
Interoperable Query Processing
Extended Transaction Updates
Commodity DB Software
A Little Help is Still Good If it is Cheap
Internet Facilitates Software Distribution
Databases as Middleware
Concluding Remarks
Goals of Distributed DBS
Support User Distribution Across Multiple Sites
Provide Location Transparency
Provide Transaction Control Akin to Centralized
Case
Minimize Communications Cost
Advantages of Distributed DBS
Transparency
Increased Reliability and Availability
Improved Performance
Easier Expansion
Data Mining/Warehousing
Data Warehousing and Data Mining
Data Warehousing
Provide Access to Data for Complex Analysis,
Knowledge Discovery, and Decision Making
Underlying Infrastructure in Support of Mining
Provides Means to Interact with Multiple DBs
OLAP (on-Line Analytical Processing) vs. OLTP
Data Mining
Discovery of Information in Vast Data Sets
Search for Patterns and Common Features
Discover Information not Previously Known
Medical Records Accessible Nationwide
Research/Discover Cures for Rare Diseases
Relies on Knowledge Discovery in DBs (KDD)
What is the Purpose of a Data Warehouse?
Traditional databases are not optimized purely for data access; they
must balance fast access against the requirement of ensuring integrity.
Most data warehouse users need only read access, but need that
access to be fast over a large volume of data.
Most of the data required for data warehouse analysis comes
from multiple databases, and these analyses are recurrent and
predictable enough that software can be designed to meet the requirements.
Critical for tools that provide decision makers with information
to make decisions quickly and reliably based on historical data.
The aforementioned characteristics are achieved by data
warehousing and online analytical processing (OLAP)
W. H. Inmon characterized a data warehouse as:
“A subject-oriented, integrated, nonvolatile, time-variant
collection of data in support of management’s decisions.”
Data Warehousing and OLAP
A Data Warehouse
Database is Maintained Separately from an
Operational Database
“A Subject-Oriented, Integrated, Time-Variant,
and Non-Volatile Collection of Data in Support of
Management’s Decision Making Process”
[W. H. Inmon]
OLAP (on-Line Analytical Processing)
Analysis of Complex Data in the Warehouse
Attempt to Attain “Value” through Analysis
Relies on Trained and Adept Skilled Knowledge
Workers who Discover Information
Data Mart
Organized Data for a Subset of an Organization
Building a Data Warehouse
Option 1
Leverage Existing
Repositories
Collate and Collect
May Not Capture All
Relevant Data
Option 2
Start from Scratch
Utilize Underlying
Corporate Data
(Figure: Option 1 consolidates existing data marts into the corporate data warehouse; Option 2 builds the warehouse from scratch on top of corporate data)
Comparison with Traditional Databases
Data Warehouses are mainly optimized for appropriate
data access
Traditional databases are transactional and
optimized for both access mechanisms and
integrity assurance measures.
Data warehouses emphasize historical data, as their
main use is time-series and trend analysis.
Compared with transactional databases, data
warehouses are nonvolatile.
In transactional databases, a transaction is the
mechanism of change to the database.
In a warehouse, data is relatively coarse grained and the
refresh policy is carefully chosen, usually incremental.
Classification of Data Warehouses
Generally, Data Warehouses are an order of
magnitude larger than the source databases.
The sheer volume of data is an issue, based on which
Data Warehouses could be classified as follows.
Enterprise-wide data warehouses
Huge projects requiring massive investment of time and
resources.
Virtual data warehouses
Provide views of operational databases that are
materialized for efficient access.
Data marts
Generally targeted to a subset of organization, such as a
department, and are more tightly focused.
Data Warehouse Characteristics
Utilizes a “Multi-Dimensional” Data Model
Warehouse Comprised of
Store of Integrated Data from Multiple Sources
Processed into Multi-Dimensional Model
Warehouse Supports of
Times Series and Trend Analysis
“Super-Excel” Integrated with DB Technologies
Data is Less Volatile than Regular DB
Doesn’t Dramatically Change Over Time
Updates at Regular Intervals
Specific Refresh Policy Regarding Some Data
Three Tier Architecture
(Figure: external data sources and operational databases are monitored and integrated via Extract, Transform, Load, and Refresh into the data warehouse, with its metadata and data marts; the warehouse serves an OLAP server supporting query/report, summarization/report, and data mining tools)
Data Modeling for Data Warehouses
Traditional databases generally deal with two-dimensional data (similar to a spreadsheet).
However, querying performance on a multi-dimensional data storage model is much more
efficient.
Data warehouses can take advantage of this feature, as
generally they are
Non volatile
The degree of predictability of the analysis that
will be performed on them is high.
What is a Multi-Dimensional Data Cube?
Representation of Information in Two or More
Dimensions
Typical Two-Dimensional - Spreadsheet
In Practice, to Track Trends or Conduct Analysis,
Three or More Dimensions are Useful
Aggregate Raw Data!
Data Warehouse Design
Most Data Warehouses use a Star Schema to
Represent the Multi-Dimensional Data Model
Each Dimension is Represented by a Dimension
Table that Provides its Multidimensional Coordinates
and Stores Measures for those Coordinates
A Fact Table Connects All Dimension Tables with a
Multiple Join
Each Tuple in the Fact Table Represents One Fact
(e.g., a Sale) Measured Across the Dimensions
Each Tuple in the Fact Table Consists of a Pointer
to Each of the Dimensional Tables
Links Between the Fact Table and the Dimensional
Tables Form a Shape Like a Star
Sample Fact Tables
Example of Star Schema
Sale Fact Table: Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
Date Dimension: Date, Month, Year
Product Dimension: ProductNo, ProdName, ProdDesc, Category
Store Dimension: StoreID, City, State, Country, Region
Customer Dimension: CustID, CustName, CustCity, CustCountry
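A star schema like this can be sketched in a few lines of SQLite: a fact table whose tuples point at the dimension tables, queried with a "star join". The table and column names follow the slide; the rows are invented sample data:

```python
# Minimal star-schema sketch: Sale fact table + Product and Store dimensions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Product(ProductNo INTEGER PRIMARY KEY, ProdName TEXT);
    CREATE TABLE Store(StoreID INTEGER PRIMARY KEY, City TEXT);
    CREATE TABLE Sale(Date TEXT, Product INTEGER, Store INTEGER,
                      Unit_Sales INTEGER, Dollar_Sales REAL);
    INSERT INTO Product VALUES (1, 'Beer'), (2, 'Nuts');
    INSERT INTO Store VALUES (10, 'Atlanta'), (20, 'NY');
    INSERT INTO Sale VALUES ('1999-07-03', 1, 10, 5, 25.0),
                            ('1999-07-03', 2, 10, 3, 9.0),
                            ('1999-07-04', 1, 20, 2, 10.0);
""")

# The "star join": the fact table joined to each dimension it points at,
# then aggregated by the dimensional coordinates of interest.
rows = con.execute("""
    SELECT p.ProdName, s.City, SUM(f.Dollar_Sales)
    FROM Sale f
    JOIN Product p ON f.Product = p.ProductNo
    JOIN Store s   ON f.Store = s.StoreID
    GROUP BY p.ProdName, s.City
    ORDER BY p.ProdName, s.City
""").fetchall()
```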
A Second Example of Star Schema …
and Corresponding Snowflake Schema
Multi-dimensional Schemas
Fact Constellation
A fact constellation is a set of fact tables that share some
dimension tables.
However, fact constellations limit the possible
queries for the warehouse.
Data Warehouse Issues
Data Acquisition
Extraction from Heterogeneous Sources
Reformatted into Warehouse Context - Names,
Meanings, Data Domains Must be Consistent
Data Cleaning for Validity and Quality
is the Data as Expected w.r.t. Content? Value?
Transition of Data into Data Model of Warehouse
Loading of Data into the Warehouse
Other Issues Include:
How Current is the Data? Frequency of Update?
Availability of Warehouse? Dependencies of Data?
Distribution, Replication, and Partitioning Needs?
Loading Time (Clean, Format, Copy, Transmit,
Index Creation, etc.)?
Processing in a Data Warehouse
Processing Types are Varied and Include:
Roll-up: Data is summarized with increasing
generalization
Drill-Down: Increasing levels of detail are
revealed
Pivot: Cross tabulation is performed
Slice and dice: Performing projection operations
on the dimensions.
Sorting: Data is sorted by ordinal value.
Selection: Data is available by value or range.
Derived attributes: Attributes are computed by
operations on stored derived values.
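Roll-up and slice-and-dice, the first of these operations, can be sketched over a small in-memory cube. A Python sketch with invented (product, city, month) → sales data:

```python
# Toy OLAP operations over a (product, city, month) -> sales cube stored as
# a dict. Roll-up aggregates away dimensions; a slice fixes one dimension.
cube = {
    ("Beer", "Atlanta", "Jan"): 10, ("Beer", "Atlanta", "Feb"): 12,
    ("Beer", "NY", "Jan"): 7,       ("Nuts", "Atlanta", "Jan"): 4,
}

def roll_up(cube, keep):
    """Sum sales over the dimensions NOT in `keep` (0=product, 1=city, 2=month)."""
    out = {}
    for key, v in cube.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0) + v
    return out

def slice_cube(cube, dim, value):
    """Fix dimension `dim` to `value`, dropping it from the keys."""
    return {tuple(k for i, k in enumerate(key) if i != dim): v
            for key, v in cube.items() if key[dim] == value}

by_product = roll_up(cube, keep=[0])                 # roll up city and month
atlanta = slice_cube(cube, dim=1, value="Atlanta")   # slice on city = Atlanta
```

Drill-down is the inverse of roll-up: it requires the finer-grained data (e.g., the original cube) to re-expand an aggregate.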
On-Line Analytical Processing
Data Cube
A Multidimensonal Array
Each Attribute is a Dimension
In Example Below, the Data Must be Interpreted so
that it Can be Aggregated by Region/Product/Date
Product | Store | Date | Sale
acron | Rolla, MO | 7/3/99 | 325.24
budwiser | LA, CA | 5/22/99 | 833.92
large pants | NY, NY | 2/12/99 | 771.24
3’ diaper | Cuba, MO | 7/30/99 | 81.99
(Figure: a three-dimensional cube with Product (Pants, Diapers, Beer, Nuts), Region (West, East, Central, Mountain, South), and Date (Jan, Feb, March, April) axes)
Examples of Data Mining
The Slicing Action
A Vertical or Horizontal Slice Across Entire Cube
(Figure: from a Products × Months × Cities cube, slicing on city = Atlanta yields a two-dimensional Products × Months sales slice)
Examples of Data Mining
The Dicing Action
A Slice First Fixes One Dimension
A Selection of Any Sub-Cube within the Slice then
Essentially Constrains All Three Dimensions
(Figure: dicing on Electronics and Atlanta picks out the March 2000 / Electronics / Atlanta sub-cube from the Products × Months cube)
Examples of Data Mining
Drill Down - Takes a Facet (e.g., Q1)
and Decomposes it into Finer Detail
Roll Up - Combines Multiple Dimensions,
e.g., From Individual Cities to a State
(Figure: drilling down on Q1 expands a Products × Quarters (Q1 Q2 Q3 Q4) sales view into Jan/Feb/March; rolling up on location aggregates city-level sales to the (State, USA) level)
Mining Other Types of Data
Analysis and Access Dramatically More Complicated!
Spatial databases
Multimedia databases
World Wide Web
Time series data
Geographical and Satellite Data
Advantages/Objectives of Data Mining
Descriptive Mining
Discover and Describe General Properties
60% People who buy Beer on Friday also have
Bought Nuts or Chips in the Past Three Months
Predictive Mining
Infer Interesting Properties based on Available
Data
People who Buy Beer on Friday usually also Buy
Nuts or Chips
Result of Mining
Order from Chaos
Mining Large Data Sets in Multiple Dimensions
Allows Businesses, Individuals, etc. to Learn about
Trends, Behavior, etc.
Impact on Marketing Strategy
Why is Data Mining Popular?
Technology Push
Technology for Collecting Large Quantity of Data
Bar Code, Scanners, Satellites, Cameras
Technology for Storing Large Collection of Data
Databases, Data Warehouses
Variety of Data Repositories, such as Virtual Worlds,
Digital Media, World Wide Web
Corporations want to Improve Direct Marketing and
Promotions - Driving Technology Advances
Targeted Marketing by Age, Region, Income, etc.
Exploiting User Preferences/Customized Shopping
Requirements & Challenges in Data Mining
Security and Social
What Information is Available to Mine?
Preferences via Store Cards/Web Purchases
What is Your Comfort Level with Trends?
User Interfaces and Visualization
What Tools Must be Provided for End Users of
Data Mining Systems?
How are Results for Multi-Dimensional Data
Displayed?
Performance Guarantees
Range from Real-Time for Some Queries to
Long-Term for Other Queries
Data Sources of Complex Data Types or Unstructured
Data - Ability to Format, Clean, and Load Data Sets
Successful Data Mining Applications
Business Data Analysis and Decision Support
Marketing, Customer Profiling, Market Analysis
and Management, Risk Analysis and Management
Fraud Detection
Detecting Telephone Fraud, Automotive and
Health Insurance Fraud, Credit-card Fraud,
Suspicious Money Transactions (Money
Laundering)
Text Mining
Message Filtering (Email, Newsgroups, Etc.)
Newspaper Articles Analysis
Sports
IBM Advanced Scout Analyzed NBA Game
Statistics (Shots Blocked, Assists and Fouls) to
Gain Competitive Advantage