cse4701chap26and28and29x
Chapters 26, 28 & 29, 6e - 24, 26 & 27 5e
Database System Architectures, Data Mining/Warehousing, Web DB
CSE 4701
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
191 Auditorium Road, Box U-155
Storrs, CT 06269-3155
[email protected]
http://www.engr.uconn.edu/~steve
(860) 486 - 4818
A portion of these slides are being used with the permission of Dr. Ling Lui,
Associate Professor, College of Computing, Georgia Tech.
Remaining slides represent new material.
Chaps26.28.29-1
Classical and Distributed Architectures
Classic/Centralized DBMS Dominated the
Commercial Market from 1970s Forward
Problems of this Approach
Difficult to Scale w.r.t. Performance Gains
If DB Overloaded, replace with a Faster Computer
This can Only Go So Far - Disk Bottlenecks
Distributed DBMS have Evolved to Address a
Number of Issues
Improved Performance
Putting Data “Near” Location where it is Needed
Replication of Data for Fault Tolerance
Vertical and Horizontal Partitioning of DB Tuples
Chaps26.28.29-2
Common Features of Centralized DBMS
Data Independence
High-Level Representation via Conceptual and
External Schemas
Physical Representation (Internal Schema) Hidden
Program Independence
Multiple Applications can Share Data
Views/External Schema Support this Capability
Reduction of Program/Data Redundancy
Single, Unique, Conceptual Schema
Shared Database
Almost No Data Redundancy
Controlled Data Access Reduces Inconsistencies
Programs Execute with Consistent Results
Chaps26.28.29-3
Common Features of Centralized DBMS
Promote Sharing: Automatically Provided via CC
No Longer Programmatic Issue
Most DBMS Offer Locking for Key Shared Data
Oracle Allows Locks on Data Item (Attributes)
For Example, Controlling Access to Shared Identifier
Coherent and Central DB Administration
Semantic DB Integrity via the Automatic Enforcement
of Data Consistency via Integrity Constraints/Rules
Data Resiliency
Physical Integrity of Data in the Presence of Faults
and Errors
Supported by DB Recovery
Data Security: Control Access by Authorized Users
to Sensitive Data
Chaps26.28.29-4
Shared Nothing Architecture
In this Architecture, Each DBMS
Operates Autonomously
There is No Sharing
Three Separate DBMSs on
Three Different Computers
Applications/Clients Must Know
About the External Schemas of
all Three DBMSs for
Database Retrieval
Client Processing
Complicates Client
Different DBMS Platforms
(Oracle, Sybase, Informix, ..)
Different Access Modes
(Query, Embedded, ODBC)
Difficult for SWE to Code
Chaps26.28.29-5
Difficulty in Access – Manage Multiple APIs
Each Platform has a Different API
API1, API2, …, APIn
An App Programmer Must Utilize All three APIs which
could differ by PL – C++, C, Java, REST, etc.
Any interactions Across 3 DBs – must be
programmatically handled without DB Capabilities
Chaps26.28.29-6
NW Architecture with Centralized DB
High-Speed NWs/WANs Spawned Centralized DB
Accessible Worldwide
Clients at Any Site can Access Repository
Data May be “Far” Away - Increased Access Time
In Practice, Each Remote Site Needs only Portion
of the Data in DB1 and/or DB2
Inefficient, no Replication w.r.t. Failure
Chaps26.28.29-7
Fully Distributed Architecture
The Five Sites (Chicago, SF, LA, NY, Atlanta) each
have a “Portion” of the Database - it is Distributed
Replication is Possible for Fault Tolerance
Queries at one Site May Need to Access Data at
Another Site (e.g., for a Join)
Increased Transaction Processing Complexity
Chaps26.28.29-8
Distributed Database Concepts
A transaction can be executed by multiple networked
computers in a unified manner.
A distributed database (DDB) processes a Unit of
execution (a transaction) in a distributed manner.
A distributed database (DDB) can be defined as a
collection of multiple logically related databases
distributed over a computer network.
A distributed database management system (DDBMS) is a
software system that manages a distributed
database while making the distribution transparent
to the user.
Chaps26.28.29-9
Goals of DDBMS
Support User Distribution Across Multiple Sites
Remote Access by Users Regardless of Location
Distribution and Replication of Database Content
Provide Location Transparency
Users Manipulate their Own Data
Non-Local Sites “Appear” Local to Any User
Provide Transaction Control Akin to Centralized Case
Transaction Control Hides Distribution
CC and Serializability - Must be Extended
Minimize Communications Cost
Optimize Use of Network - a Critical Issue
Distributed DB Design Supported by Partitioning
(Fragmentation) and Replication
Chaps26.28.29-10
Goals of DDBMS
Improve Response Time for DB Access
Use a More Sophisticated Load Control for
Transaction Processing
However, Synchronization Across Sites May
Introduce Additional Overhead
System Availability
Site Independence in the Presence of Site Failure
Subset of Database is Always Available
Replication can Keep All Data Available, Even
When Multiple Sites Fail
Modularity
Incremental Growth with the Addition of Sites
Dedicate Sites to Specific Tasks
Chaps26.28.29-11
Advantages of DDBMS
There are Four Major Advantages
Transparency
Distribution/NW Transparency
User Doesn’t Know about NW Configuration (Location
Transparency)
User can Find Object at any Site (Naming
Transparency)
Replication Transparency (see next PPT)
User Doesn’t Know Location of Data
Replicas are Transparently Accessible
Fragmentation Transparency
Horizontal Fragmentation (Distribute by Row)
Vertical Fragmentation (Distribute by Column)
Chaps26.28.29-12
Data Distribution and Replication
Chaps26.28.29-13
Other Advantages of DDBMS
Increased Reliability and Availability
Reliability - System Always Running
Availability - Data Always Present
Achieved via Replication and Distribution
Ability to Make Single Query for Entire DDBMS
Improved Performance
Sites Able to Utilize Data that is Local for
Majority of Queries
Easier Expansion
Improve Performance of Site by
Upgrading Processor of Computer
Adding Additional Disks
Splitting a Site into Two or More Sites
Expansion over Time as Business Grows
Chaps26.28.29-14
Challenges of DDBMS
Tracking Data - Meta Data More Complex
Must Track Distribution (where is the Data)
V & H Fragmentation (How is Data Split)
Replication (Multiple Copies for Consistency)
Distributed Query Processing
Optimization, Accessibility, etc., More Complex
Block Analysis of Data Size Must also Now
Consider the NW Transmitting Time
Distributed Transaction Processing
TP Potentially Spans Multiple Sites
Submit Query to Multiple Sites
Collect and Collate Results
Distributed Concurrency Control Across Nodes
Chaps26.28.29-15
Challenges of DDBMS
Replicated Data Management
TP Must Choose the Replica to Access
Updates Must Modify All Replica Copies
Distributed Database Recovery
Recovery of Individual Sites
Recovery Across DDBMS
Security
Local and Remote Authorization
During TP, be Able to Verify Remote Privileges
Distributed Directory Management
Meta-Data on Database - Local and Remote
Must maintain Replicas of this - Every Site Tracks
the Meta-Data for All Sites
Chaps26.28.29-16
A Complete Schema with Keys ...
Keys Allow us to
Establish Links
Between Relations
Chaps26.28.29-17
…and Corresponding DB Tables
which Represent Tuples/Instances of Each Relation
Chaps26.28.29-18
…with Remaining DB Tables
Chaps26.28.29-19
What is Fragmentation?
Fragmentation Divides a DB Across Multiple Sites
Two Types of Fragmentation
Horizontal Fragmentation
Given a Relation R with n Total Tuples, Spread Entire
Tuples Across Multiple Sites
Each Site has a Subset of the n Tuples
Essentially Fragmentation is a Selection
Vertical Fragmentation
Given a Relation R with m Attributes and n Total
Tuples, Spread the Columns Across Multiple Sites
Essentially Fragmentation is a Projection
Not Generally Utilized in Practice
In Both Cases, Sites can Overlap for Replication
Chaps26.28.29-20
Horizontal Fragmentation
A horizontal fragment is a subset of the tuples of a relation:
those tuples which satisfy a selection condition.
Consider Employee relation with condition DNO = 5
All tuples satisfying this create a subset which will be
a horizontal fragment of Employee relation.
A selection condition may be composed of several
conditions connected by AND or OR.
Derived horizontal fragmentation:
Applies the partitioning of a primary relation to
secondary relations which are related to it by Foreign
keys.
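A minimal Python sketch of horizontal fragmentation as a selection. The toy Employee rows, the three-site split, and the helper name `horizontal_fragment` are assumptions made for illustration; only the DNO = 5 condition comes from the slide.

```python
# Toy Employee relation; rows and values are made up for illustration.
employee = [
    {"ssn": "111", "lname": "Smith", "dno": 5},
    {"ssn": "222", "lname": "Wong", "dno": 5},
    {"ssn": "333", "lname": "Zelaya", "dno": 4},
    {"ssn": "444", "lname": "Borg", "dno": 1},
]

def horizontal_fragment(relation, predicate):
    """Return the tuples that satisfy the selection condition."""
    return [row for row in relation if predicate(row)]

# One fragment per site, each defined by a selection condition on DNO.
site2 = horizontal_fragment(employee, lambda r: r["dno"] == 5)
site3 = horizontal_fragment(employee, lambda r: r["dno"] == 4)
rest = horizontal_fragment(employee, lambda r: r["dno"] not in (4, 5))

# The union of disjoint, complete fragments reconstructs the relation.
reconstructed = site2 + site3 + rest
```

Because the three conditions are disjoint and together cover every tuple, no tuple is lost or duplicated when the fragments are unioned.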
Chaps26.28.29-21
Horizontal Fragmentation
Site 2 Tracks All Information Related to Dept. 5
Chaps26.28.29-22
Horizontal Fragmentation
Site 3 Tracks All Information Related to Dept. 4
Note that an Employee Could be Listed in Both Cases,
if s/he Works on a Project for Both Departments
Chaps26.28.29-23
Refined Horizontal Fragmentation
Further Fragment from Site
2 based on Dept. that
Employee Works in
Notice that G1 + G2 + G3 is
the Same as WORKS_ON5 and
that there is no Overlap
Chaps26.28.29-24
Refined Horizontal Fragmentation
Further Fragment from Site
3 based on Dept. that
Employee Works in
Notice that G4 + G5 + G6 is
the Same as WORKS_ON4
Note Some Fragments can
be Empty
Chaps26.28.29-25
Vertical Fragmentation
Subset of a relation created via a subset of columns.
A vertical fragment of a relation will contain
values of selected columns.
There is no selection condition used in vertical
fragmentation.
A strict vertical slice/partition
Consider the Employee relation.
A vertical fragment of Employee can be created by keeping
the values of Name, Bdate, Sex, and Address.
Since there is no condition for creating a vertical fragment,
each fragment must include the primary key
attribute of the parent relation Employee.
In this way, all vertical fragments of a relation are connected.
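A minimal Python sketch of vertical fragmentation as a projection that always keeps the primary key, followed by a join on that key to rebuild the original tuples. The sample values and the helper names `vertical_fragment` and `rejoin` are assumptions for illustration.

```python
# Toy Employee relation; attribute names follow the slides, values are made up.
employee = [
    {"ssn": "111", "name": "Smith", "bdate": "1965-01-09", "dno": 5, "superssn": "333"},
    {"ssn": "222", "name": "Wong", "bdate": "1955-12-08", "dno": 5, "superssn": "444"},
]

def vertical_fragment(relation, key, attributes):
    """Project each tuple onto the key plus the chosen attributes."""
    columns = [key] + [a for a in attributes if a != key]
    return [{c: row[c] for c in columns} for row in relation]

emp_demo = vertical_fragment(employee, "ssn", ["name", "bdate"])
emp_supvr_dept = vertical_fragment(employee, "ssn", ["superssn", "dno"])

def rejoin(left, right, key):
    """Reconstruct tuples by joining two fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]

reconstructed = rejoin(emp_demo, emp_supvr_dept, "ssn")
```

Without the shared key column there would be no way to tell which row of one fragment belongs with which row of the other.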
Chaps26.28.29-26
Vertical Fragmentation Example
Partition the Employee Table as Below
Notice Each Vertical Fragment Needs Key Column
EmpDemo
EmpSupvrDept
Chaps26.28.29-27
Homogeneous DDBMS
Homogeneous
Identical Software (w.r.t. Database)
One DB Product (e.g., Oracle, DB2, Sybase) is
Distributed and Available at All Sites
Uniformity w.r.t. Administration, Maintenance,
Client Access, Users, Security, etc.
Interaction by Programmatic Clients is Consistent
(e.g., JDBC or ODBC or REST API …)
Chaps26.28.29-28
Non-Federated Heterogeneous DDBMS
Non-Federated Heterogeneous
Different Software (w.r.t. Database)
Multiple DB Products (e.g., Oracle at One Site,
Access another, Sybase, Informix, etc.)
Replicated Administration (e.g., Users Need
Accounts on Multiple Systems)
Varied Programmatic Access - SWEs Must Know
All Platforms/Client Software Complicated
Very Close to Shared Nothing Architecture
Chaps26.28.29-29
Federated DDBMS
Federated
Multiple DBMS
Platforms Overlaid
with a Global
Schema View
Single External
Schema Combines
Schemas from all
Sites
Multiple Data Models
Relational in one
Component DBS
Object-Oriented in
another DBS
Hierarchical in a
3rd DBS
Chaps26.28.29-30
Federated DBMS Issues
Differences in Data Models
Reconcile Relational vs. Object-Oriented Models
Each Different Model has Different Capabilities
These Differences Must be Addressed in Order to
Present a Federated Schema
Differences in Constraints
Referential Integrity Constraints in Different DBSs
Different Constraints on “Similar” Data
Federated Schema Must Deal with these Conflicts
Differences in Query Languages
SQL-89, SQL-92, SQL2, SQL3
Specific Types in Different DBMS (Oracle Blobs )
Differences in Key Processing & Timestamping
Chaps26.28.29-31
Heterogeneous Distributed Database Systems
Federated: Each site may run different database system but the
data access is managed through a single conceptual schema.
The degree of local autonomy is minimum.
Each site must adhere to a centralized access policy
There may be a global schema.
Multi-database: There is no one conceptual global schema
For data access a schema is constructed dynamically as
needed by the application software.
[Figure: Sites 1 to 5 connected by a communications network. Site labels include Unix, Window, and Linux platforms and relational, object-oriented, hierarchical, and network DBMSs.]
Chaps26.28.29-32
Query Processing in Distributed Databases
Issues
Cost of transferring data (files and results) over the network.
This cost is usually high so some optimization is necessary.
Example relations: Employee at site 1 and Department at Site 2
– Employee at site 1. 10,000 rows. Row size = 100 bytes. Table size =
10^6 bytes.
Fname | Minit | Lname | SSN | Bdate | Address | Sex | Salary | Superssn | Dno
– Department at Site 2. 100 rows. Row size = 35 bytes. Table size =
3,500 bytes.
Dname | Dnumber | Mgrssn | Mgrstartdate
Q: For each employee, retrieve employee name and department
name where the employee works.
Q: π Fname,Lname,Dname (Employee ⋈ Dno=Dnumber Department)
Chaps26.28.29-33
Query Processing in Distributed Databases
Result
The result of this query will have 10,000 tuples,
assuming that every employee is related to a
department.
Suppose each result tuple is 40 bytes long.
The query is submitted at site 3 and the result is
sent to this site.
Problem: Employee and Department relations are
not present at site 3.
Chaps26.28.29-34
Query Processing in Distributed Databases
Strategies:
1. Transfer Employee and Department to site 3.
Total transfer bytes = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3.
Query result size = 40 * 10,000 = 400,000 bytes. Total
transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
Total bytes transferred = 400,000 + 3500 = 403,500 bytes.
Optimization criteria: minimizing data transfer.
Preferred approach: strategy 3.
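The arithmetic behind the three strategies can be sketched in Python. The sizes are the ones stated on the slides; the function name `transfer_costs` and the constant names are assumptions for illustration, and the only cost considered is bytes moved over the network.

```python
# Sizes taken from the slides (bytes).
EMPLOYEE_BYTES = 10_000 * 100      # 10,000 rows of 100 bytes at site 1
DEPARTMENT_BYTES = 100 * 35        # 100 rows of 35 bytes at site 2
RESULT_ROW_BYTES = 40

def transfer_costs(result_rows):
    """Bytes moved by each strategy when the result is needed at site 3."""
    result_bytes = result_rows * RESULT_ROW_BYTES
    return {
        1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,   # ship both relations to site 3
        2: EMPLOYEE_BYTES + result_bytes,       # join at site 2, ship result
        3: DEPARTMENT_BYTES + result_bytes,     # join at site 1, ship result
    }

costs = transfer_costs(result_rows=10_000)
best = min(costs, key=costs.get)
print(costs, "-> strategy", best)
```

Calling `transfer_costs` with a different `result_rows` value shows how the ranking depends on the size of the join result relative to the two relations.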
Chaps26.28.29-35
Query Processing in Distributed Databases
Consider the query
Q’: For each department, retrieve the department
name and the name of the department manager
Relational Algebra expression:
π Fname,Lname,Dname (Employee ⋈ Mgrssn=SSN Department)
Chaps26.28.29-36
Query Processing in Distributed Databases
Result of query has 100 tuples, assuming that every
department has a manager, the execution strategies are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
Total bytes transferred = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3. Query result size = 40 * 100 = 4000
bytes.
Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1
and send the result to site 3.
Total transfer size = 4000 + 3500 = 7500 bytes.
Preferred strategy: Choose strategy 3.
Chaps26.28.29-37
Query Processing in Distributed Databases
Now suppose the result site is 2. Possible strategies :
1. Transfer Employee relation to site 2, execute the
query and present the result to the user at site 2.
Total transfer size = 1,000,000 bytes for both queries Q
and Q’.
2. Transfer Department relation to site 1, execute join
at site 1 and send the result back to site 2.
Total transfer size for Q = 400,000 + 3500 = 403,500
bytes and for Q’ = 4000 + 3500 = 7500 bytes.
Chaps26.28.29-38
DDBS Concurrency Control and Recovery
Distributed Databases encounter a number of
concurrency control and recovery problems which are
not present in centralized databases, including:
Dealing with multiple copies of data items
How are they All Updated if Needed?
Failure of individual sites
How are Queries Restarted or Rerouted?
Communication link failure
Network Failure
Distributed commit
How to Know All Updates Done at all Sites?
Distributed deadlock
How to Detect and Recover?
Chaps26.28.29-39
Data Warehousing and Data Mining
Data Warehousing
Provide Access to Data for Complex Analysis,
Knowledge Discovery, and Decision Making
Underlying Infrastructure in Support of Mining
Provides Means to Interact with Multiple DBs
OLAP (on-Line Analytical Processing) vs. OLTP
Data Mining
Discovery of Information in Vast Data Sets
Search for Patterns and Common Features based
Discover Information not Previously Known
Medical Records Accessible Nationwide
Research/Discover Cures for Rare Diseases
Relies on Knowledge Discovery in DBs (KDD)
Chaps26.28.29-40
What is Purpose of a Data Warehouse?
Traditional databases are not optimized for data access alone; they have
to balance the requirement of data access with the need to ensure integrity.
Most data warehouse users need only read access, but need the
access to be fast over a large volume of data.
Most of the data required for data warehouse analysis comes
from multiple databases, and these analyses are recurrent and
predictable enough that software can be designed to meet the requirements.
Critical for tools that provide decision makers with information
to make decisions quickly and reliably based on historical data.
Aforementioned Characteristics are achieved by Data
Warehousing and Online analytical processing (OLAP)
W. H. Inmon characterized a data warehouse as:
“A subject-oriented, integrated, nonvolatile, time-variant
collection of data in support of management’s decisions.”
Chaps26.28.29-41
Data Warehousing and OLAP
A Data Warehouse
Database is Maintained Separately from an
Operational Database
“A Subject-Oriented, Integrated, Time-Variant,
and Non-Volatile Collection of Data in Support of
Management’s Decision Making Process”
[W.H. Inmon]
OLAP (on-Line Analytical Processing)
Analysis of Complex Data in the Warehouse
Attempt to Attain “Value” through Analysis
Relies on Trained and Adept Skilled Knowledge
Workers who Discover Information
Data Mart
Organized Data for a Subset of an Organization
Chaps26.28.29-42
Conceptual Structure of Data Warehouse
Data Warehouse processing involves
Cleaning and reformatting of data
OLAP
Data Mining
[Figure: conceptual structure of a data warehouse; labels: Databases, Other Data Inputs, Updates/New Data, Cleaning, Reformatting, Back Flushing, Data Warehouse, Metadata, OLAP, DSSI, EIS, Data Mining]
Chaps26.28.29-43
Building a Data Warehouse
Option 1
Leverage Existing
Repositories
Collate and Collect
May Not Capture All
Relevant Data
Option 2
Start from Scratch
Utilize Underlying
Corporate Data
[Figure: Option 1 consolidates the Data Marts into the Corporate data warehouse; Option 2 builds the Corporate data warehouse from scratch out of Corporate data]
Chaps26.28.29-44
Comparison with Traditional Databases
Data Warehouses are mainly optimized for appropriate
data access
Traditional databases are transactional
Optimized for both access mechanisms and
integrity assurance measures.
Data warehouses emphasize historical data, as they
support time-series and trend analysis.
Compared with transactional databases, data
warehouses are nonvolatile.
In transactional databases, the transaction is the
mechanism of change to the database.
In warehouse, data is relatively coarse grained and
refresh policy is carefully chosen, usually incremental.
Chaps26.28.29-45
Classification of Data Warehouses
Generally, Data Warehouses are an order of
magnitude larger than the source databases.
The sheer volume of data is an issue, based on which
Data Warehouses could be classified as follows.
Enterprise-wide data warehouses
Huge projects requiring massive investment of time and
resources.
Virtual data warehouses
Provide views of operational databases that are
materialized for efficient access.
Data marts
Generally targeted to a subset of organization, such as a
department, and are more tightly focused.
Chaps26.28.29-46
Data Warehouse Characteristics
Utilizes a “Multi-Dimensional” Data Model
Warehouse Comprised of
Store of Integrated Data from Multiple Sources
Processed into Multi-Dimensional Model
Warehouse Supports
Time Series and Trend Analysis
“Super-Excel” Integrated with DB Technologies
Data is Less Volatile than Regular DB
Doesn’t Dramatically Change Over Time
Updates at Regular Intervals
Specific Refresh Policy Regarding Some Data
Chaps26.28.29-47
Three Tier Architecture
[Figure: three-tier architecture; labels: Operational databases, External data sources, Extract/Transform/Load/Refresh, monitor, integrator, metadata, Data Warehouse, Data marts, serve, OLAP Server, Query report, Summarization report, Data mining]
Chaps26.28.29-48
Data Modeling for Data Warehouses
Traditional Databases generally deal with two-dimensional data (similar to a spread sheet).
However, querying performance in a multi-dimensional data storage model is much more
efficient.
Data warehouses can take advantage of this feature as
generally these are
Non volatile
The degree of predictability of the analysis that
will be performed on them is high.
Chaps26.28.29-49
What is a Multi-Dimensional Data Cube?
Representation of Information in Two or More
Dimensions
Typical Two-Dimensional - Spreadsheet
In Practice, to Track Trends or Conduct Analysis,
Three or More Dimensions are Useful
Aggregate Raw Data!
Chaps26.28.29-50
Multi-Dimensional Schemas
Supporting Multi-Dimensional Schemas Requires
Two Types of Tables:
Dimension Table: Tuples of Attributes for Each
Dimension
Fact Table: Measured/Observed Variables with
Pointers into Dimension Table
Star Schema
Characterizes Data Cubes by having a Single Fact
Table with a Single Table for Each Dimension
Snowflake Schema
Dimension Tables from Star Schema are
Organized into Hierarchy via Normalization
Both Represent Storage Structures for Cubes
Chaps26.28.29-51
Data Modeling for Data Warehouses
Advantages of a multi-dimensional model
Multi-dimensional models lend themselves readily
to hierarchical views in what is known as
roll-up display &
drill-down display.
The data can be directly queried in any
combination of dimensions, bypassing complex
database queries.
Chaps26.28.29-52
Data Warehouse Design
Most Data Warehouses use a Star Schema to
Represent the Multi-Dimensional Data Model
Each Dimension is Represented by a Dimension
Table that Describes that Dimension's Attributes
A Fact Table Connects All Dimension Tables with a
Multiple Join
Each Tuple in the Fact Table Represents One
Recorded Fact
Each Tuple in the Fact Table Consists of a Pointer
to Each of the Dimension Tables (its Multidimensional
Coordinates) and Stores the Measures for those Coordinates
Links Between the Fact Table and the Dimension
Tables Form a Shape Like a Star
Chaps26.28.29-53
Sample Fact Tables
Chaps26.28.29-54
Example of Star Schema
Sale Fact Table: Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
Date: Date, Month, Year
Product: ProductNo, ProdName, ProdDesc, Category
Store: StoreID, City, State, Country, Region
Customer: CustID, CustName, CustCity, CustCountry
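A minimal sketch of this star schema using Python's built-in `sqlite3` module. Only the Product and Store dimensions are modeled, with a subset of their columns; the table names `product`, `store`, and `sale_fact`, and every inserted row, are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (ProductNo INTEGER PRIMARY KEY, ProdName TEXT, Category TEXT);
CREATE TABLE store   (StoreID INTEGER PRIMARY KEY, City TEXT, State TEXT);
CREATE TABLE sale_fact (
    Date TEXT,
    Product INTEGER REFERENCES product(ProductNo),
    Store INTEGER REFERENCES store(StoreID),
    Unit_Sales INTEGER,
    Dollar_Sales REAL
);
INSERT INTO product VALUES (1, 'Laptop', 'Electronics'), (2, 'Bread', 'Food');
INSERT INTO store VALUES (10, 'Atlanta', 'GA'), (20, 'Chicago', 'IL');
INSERT INTO sale_fact VALUES
    ('2000-03-01', 1, 10, 2, 2000.0),
    ('2000-03-02', 2, 10, 5, 10.0),
    ('2000-03-02', 1, 20, 1, 1000.0);
""")

# Each fact row points at one row per dimension; the join resolves the pointers.
rows = conn.execute("""
    SELECT p.Category, s.City, SUM(f.Dollar_Sales)
    FROM sale_fact AS f
    JOIN product AS p ON f.Product = p.ProductNo
    JOIN store   AS s ON f.Store = s.StoreID
    GROUP BY p.Category, s.City
    ORDER BY p.Category, s.City
""").fetchall()
print(rows)
```

The measures (Unit_Sales, Dollar_Sales) live only in the fact table, while the descriptive attributes used for grouping come from the dimension tables.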
Chaps26.28.29-55
A Second Example of Star Schema …
Chaps26.28.29-56
and Corresponding Snowflake Schema
Chaps26.28.29-57
Multi-dimensional Schemas
Fact Constellation
Fact constellation is a set of fact tables that share some
dimension tables.
However, fact constellations limit the possible
queries for the warehouse.
Chaps26.28.29-58
Fact Table i2b2 (Integrating Biology & Bedside)
Chaps26.28.29-59
Data Warehouse Issues
Data Acquisition
Extraction from Heterogeneous Sources
Reformatted into Warehouse Context - Names,
Meanings, Data Domains Must be Consistent
Data Cleaning for Validity and Quality
is the Data as Expected w.r.t. Content? Value?
Transition of Data into Data Model of Warehouse
Loading of Data into the Warehouse
Other Issues Include:
How Current is the Data? Frequency of Update?
Availability of Warehouse? Dependencies of Data?
Distribution, Replication, and Partitioning Needs?
Loading Time (Clean, Format, Copy, Transmit,
Index Creation, etc.)?
Chaps26.28.29-60
OLAP Strategies
OLAP Strategies
Roll-Up: Summarization of Data
Drill-Down: from the General to Specific (Details)
Pivot: Cross Tabulate the Data Cubes
Slice and Dice: Projection Operations Across
Dimensions
Sorting: Ordering Result Sets
Selection: Access by Value or Value Range
Implementation Issues
Persistent with Infrequent Updates (Loading)
Optimization for Performance on Queries is More
Complex - Across Multi-Dimensional Cubes
Recovery Less Critical - Mostly Read Only
Temporal Aspects of Data (Versions) Important
Chaps26.28.29-61
Knowledge Discovery
Data Warehousing Requires Knowledge Discovery to
Organize/Extract Information Meaningfully
Knowledge Discovery
Technology to Extract Interesting Knowledge
(Rules, Patterns, Regularities, Constraints) from a
Vast Data Set
Process of Non-trivial Extraction of Implicit,
Previously Unknown, and Potentially Useful
Information from Large Collection of Data
Data Mining
A Critical Step in the Knowledge Discovery
Process
Extracts Implicit Information from Large Data Set
KDD: Knowledge Discovery and Data Mining
Chaps26.28.29-62
Steps in a KDD Process
Learning the Application Domain (goals)
Gathering and Integrating Data
Data Cleaning
Data Integration
Data Transformation/Consolidation
Data Mining
Choosing the Mining Method(s) and Algorithm(s)
Mining: Search for Patterns or Rules of Interest
Analysis and Evaluation of the Mining Results
Use of Discovered Knowledge in Decision Making
Important Caveats
This is Not an Automated Process!
Requires Significant Human Interaction!
Chaps26.28.29-63
Processing in a Data Warehouse
Processing Types are Varied and Include:
Roll-up: Data is summarized with increasing
generalization
Drill-Down: Increasing levels of detail are
revealed
Pivot: Cross tabulation is performed
Slice and dice: Performing projection operations
on the dimensions.
Sorting: Data is sorted by ordinal value.
Selection: Data is available by value or range.
Derived attributes: Attributes are computed by
operations on stored and derived values.
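The processing types above can be sketched in plain Python over a small list of fact rows. All data values and the helper name `aggregate` are assumptions for illustration; slice and dice are modeled here as fixing one or several dimension values, which is one common reading of those operations.

```python
from collections import defaultdict

# Toy fact rows: (city, state, product, month, sales). Values are made up.
facts = [
    ("Atlanta", "GA", "Electronics", "Jan", 100),
    ("Atlanta", "GA", "Electronics", "Feb", 150),
    ("Atlanta", "GA", "Clothing", "Jan", 80),
    ("Savannah", "GA", "Electronics", "Jan", 40),
    ("Chicago", "IL", "Electronics", "Jan", 120),
]

def aggregate(rows, key):
    """Sum sales grouped by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[4]
    return dict(totals)

# Roll-up: summarize from city level up to state level.
by_state = aggregate(facts, lambda r: r[1])
# Drill-down: break one state back out into its cities.
ga_by_city = aggregate([r for r in facts if r[1] == "GA"], lambda r: r[0])
# Slice: fix a single dimension (city = Atlanta).
atlanta = [r for r in facts if r[0] == "Atlanta"]
# Dice: constrain several dimensions at once.
atlanta_electronics_jan = [r for r in atlanta if r[2] == "Electronics" and r[3] == "Jan"]
# Pivot: cross-tabulate product against month.
pivot = aggregate(facts, lambda r: (r[2], r[3]))
```

Roll-up and drill-down are inverses along the city-to-state hierarchy: the first loses detail by summing, the second restores it by grouping on the finer attribute.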
Chaps26.28.29-64
On-Line Analytical Processing
Data Cube
A Multidimensional Array
Each Attribute is a Dimension
In Example Below, the Data Must be Interpreted so
that it Can be Aggregated by Region/Product/Date
Product | Store | Date | Sale
acron | Rolla,MO | 7/3/99 | 325.24
budwiser | LA,CA | 5/22/99 | 833.92
large pants | NY,NY | 2/12/99 | 771.24
3’ diaper | Cuba,MO | 7/30/99 | 81.99
[Figure: data cube with dimensions Product (Pants, Diapers, Beer, Nuts), Region (West, East, Central, Mountain, South), and Date (Jan, Feb, March, April)]
Chaps26.28.29-65
Examples of Data Mining
The Slicing Action
A Vertical or Horizontal Slice Across Entire Cube
[Figure: Multi-Dimensional Data Cube of Products, Sales, and Months; Slice on city Atlanta]
Chaps26.28.29-66
Examples of Data Mining
The Dicing Action
A Slice First Identifies One Dimension
A Selection of Any Cube within the Slice which
Essentially Constrains All Three Dimensions
[Figure: cube of Products, Sales, and Months; Dice on Electronics and Atlanta for March 2000]
Chaps26.28.29-67
Examples of Data Mining
Drill Down - Takes a Facet (e.g., Q1)
and Decomposes into Finer Detail
Roll Up: Combines Multiple Dimensions
From Individual Cities to State
[Figure: Products/Sales cube by quarter (Q1 Q2 Q3 Q4); Drill down on Q1 gives Jan, Feb, March; Roll Up on Location (State, USA)]
Chaps26.28.29-68
Mining Other Types of Data
Analysis and Access Dramatically More Complicated!
Spatial databases
Multimedia databases
World Wide Web
Time series data
Geographical and Satellite Data
Chaps26.28.29-69
Advantages/Objectives of Data Mining
Descriptive Mining
Discover and Describe General Properties
60% of People who Buy Beer on Friday also have
Bought Nuts or Chips in the Past Three Months
Predictive Mining
Infer Interesting Properties based on Available
Data
People who Buy Beer on Friday usually also Buy
Nuts or Chips
Result of Mining
Order from Chaos
Mining Large Data Sets in Multiple Dimensions
Allows Businesses, Individuals, etc. to Learn about
Trends, Behavior, etc.
Impact on Marketing Strategy
Chaps26.28.29-70
Data Mining Methods
Association
Discover the Frequency of Items Occurring
Together in a Transaction or an Event
Example
80% of Customers who Buy Milk also Buy Bread
Hence - Bread and Milk Adjacent in Supermarket
50% of Customers Forget to Buy Milk/Soda/Drinks
Hence - Available at Register
Prediction
Predicts Some Unknown or Missing Information
based on Available Data
Example
Forecast Sale Value of Electronic Products for Next
Quarter via Available Data from Past Three Quarters
Chaps26.28.29-71
Association Rules
Motivated by Market Analysis
Rules of the Form
Item1 ^ Item2 ^ … ^ Itemk → Itemk+1 ^ … ^ Itemn
Example
“Beer ^ Soft Drink → Pop Corn”
Problem: Discovering All Interesting Association
Rules in a Large Database is Difficult!
Issues
Interestingness
Completeness
Efficiency
Basic Measurement for Association Rules
Support of the Rule
Confidence of the Rule
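A minimal Python sketch of the two measures, using their standard definitions: support is the fraction of transactions containing every item in the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The transactions and the function names `support` and `confidence` are assumptions for illustration.

```python
# Toy market-basket transactions; made up for illustration.
transactions = [
    {"beer", "soft drink", "pop corn"},
    {"beer", "soft drink"},
    {"beer", "soft drink", "pop corn", "milk"},
    {"milk", "bread"},
    {"beer", "pop corn"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

lhs, rhs = {"beer", "soft drink"}, {"pop corn"}
rule_support = support(lhs | rhs)
rule_confidence = confidence(lhs, rhs)
```

Here two of the five baskets contain all three items, and three contain the left-hand side, so the rule has support 2/5 and confidence 2/3.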
Chaps26.28.29-72
Data Mining Methods
Classification
Determine the Class or Category of an Object
based on its Properties
Example
Classify Companies based on the Final Sale Results in
the Past Quarter
Clustering
Organize a Set of Multi-dimensional Data Objects
in Groups to Minimize Inter-group Similarity
and Maximize Intra-group Similarity
Example
Group Crime Locations to Find Distribution Patterns
Chaps26.28.29-73
Classification
Classification is the process of learning a model that
is able to describe different classes of data.
Learning is supervised as the classes to be learned are
predetermined.
Learning is accomplished by using a training set of
pre-classified data.
The model produced is usually in the form of a
decision tree or a set of rules.
Chaps26.28.29-74
One Classification Example
Rule extracted from the decision tree of Figure 28.7.
IF 50K > salary >= 20K
AND age >=25
THEN class is “yes”
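The rule can be written as a small Python function. This is only a sketch of the single rule shown: the function name `classify` is an assumption, salaries are taken to be in plain units (20K = 20,000), and the other branches of the decision tree are not modeled.

```python
def classify(salary, age):
    """Apply the single extracted rule; other branches of the tree are not modeled."""
    if 20_000 <= salary < 50_000 and age >= 25:
        return "yes"
    return None  # rule does not fire; the slide gives no class for this case
```

Note the boundaries: a salary of exactly 20K satisfies the rule, while a salary of exactly 50K does not.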
Chaps26.28.29-75
Classification
Two Stages
Learning Stage: Construction of a Classification
Function or Model
Classification Stage: Prediction of Classes of
Objects Using the Function or Model
Tools for Classification
Decision Tree
Bayesian Network
Neural Network
Regression
Problem
Given a Set of Objects whose Classes are Known
(Training Set), Derive a Classification Model
which can Correctly Classify Future Objects
Chaps26.28.29-76
An Example
Attributes
Class Attribute - Play/Don’t Play the Game
Training Set
Values that Set the Condition for the Classification
What are the Patterns Below?
Attribute | Possible Values
outlook | sunny, overcast, rain
temperature | continuous
humidity | continuous
windy | true, false
Outlook | Temperature | Humidity | Windy | Play
sunny | 85 | 85 | false | No
overcast | 83 | 78 | false | Yes
sunny | 80 | 90 | true | No
sunny | 72 | 95 | false | No
sunny | 72 | 70 | false | Yes
… | … | … | … | ...
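One way to look for a pattern is to propose a rule and check it against the training rows. The Python sketch below reads the five rows shown as (outlook, temperature, humidity, windy, play) and tests a hypothetical rule; the rule itself and its humidity threshold of 75 are assumptions, and the rows elided on the slide are not covered.

```python
# The five training rows shown on the slide: (outlook, temperature, humidity, windy, play).
training = [
    ("sunny", 85, 85, False, "No"),
    ("overcast", 83, 78, False, "Yes"),
    ("sunny", 80, 90, True, "No"),
    ("sunny", 72, 95, False, "No"),
    ("sunny", 72, 70, False, "Yes"),
]

def candidate_rule(outlook, temperature, humidity, windy):
    """Hypothetical rule: play unless it is sunny and humid (threshold assumed)."""
    if outlook == "sunny" and humidity > 75:
        return "No"
    return "Yes"

# Rows the candidate rule gets wrong, and the resulting accuracy on these rows.
errors = [row for row in training if candidate_rule(*row[:4]) != row[4]]
accuracy = 1 - len(errors) / len(training)
```

A rule that fits five rows is not thereby a good classifier; it would still need to be checked against the remaining training data and against unseen objects.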
Chaps26.28.29-77
Data Mining Methods
Summarization
Characterization (Summarization) of General
Features of Objects in the Target Class
Example
Characterize People’s Buying Patterns on the Weekend
Potential Impact on “Sale Items” & “When Sales Start”
Department Stores with Bonus Coupons
Discrimination
Comparison of General Features of Objects
Between a Target Class and a Contrasting Class
Example
Comparing Students in Engineering and in Art
Attempt to Arrive at Commonalities/Differences
Chaps26.28.29-78
Summarization Technique
Attribute-Oriented Induction
Generalization using Concept Hierarchy (Taxonomy)
barcode | category | brand | content | size
14998 | milk | Dairyland | Skim | 2L
12998 | mechanical | MotorCraft | valve 23a | 12in
… | … | … | … | ...
Concept hierarchy: food → Milk (Skim milk … 2% milk; brands Lucern … Dairyland), bread (White bread … whole wheat; brands Wonder … Safeway)
Category | Content | Count
milk | skim | 280
milk | 2% | 98
… | … | ...
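The generalization step can be sketched in Python as dropping the overly specific attributes and counting the rows that collapse into each generalized tuple. Apart from the first row, the item data is made up, and the function name `generalize` is an assumption; this shows one simplified step, not a full attribute-oriented induction algorithm.

```python
from collections import Counter

# Toy item rows: (barcode, category, brand, content, size). Values are made up,
# apart from the first row, which follows the slide.
items = [
    (14998, "milk", "Dairyland", "skim", "2L"),
    (14999, "milk", "Lucern", "skim", "1L"),
    (15000, "milk", "Dairyland", "2%", "2L"),
    (15001, "bread", "Wonder", "white", "1 loaf"),
]

def generalize(rows):
    """Drop barcode, brand, and size; count rows per (category, content)."""
    return Counter((category, content) for _, category, _, content, _ in rows)

summary = generalize(items)
```

The resulting counts play the role of the Count column in the generalized table: each entry says how many original rows were merged into that (category, content) pair.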
Chaps26.28.29-79
Building A Data Warehouse
The builders of Data warehouse should take a broad
view of the anticipated use of the warehouse.
The design should support ad-hoc querying
An appropriate schema should be chosen that
reflects the anticipated usage.
The Design of a Data Warehouse involves the following
steps.
Acquisition of data for the warehouse.
Ensuring that Data Storage meets the query
requirements efficiently.
Giving full consideration to the environment in
which the data warehouse resides.
Chaps26.28.29-80
Building A Data Warehouse
Acquisition of data for the warehouse
The data must be extracted from multiple,
heterogeneous sources.
Data must be formatted for consistency within the
warehouse.
The data must be cleaned to ensure validity.
The cleaning process is difficult to automate.
Backflushing: upgrading the source data with the cleaned data.
The data must be fitted into the data model of the
warehouse.
The data must be loaded into the warehouse.
The design of a proper refresh policy should be considered.
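The acquisition steps above (extract, format, clean, fit, load) can be sketched as a small pipeline. Everything in this example is assumed for illustration: the two sources, their field names, the validity rule, and the sales_fact table. An in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical records from two heterogeneous sources with different
# field names, units, and date formats.
SOURCE_A = [{"cust": "Ann", "amount": "12.50", "date": "2015-03-01"}]
SOURCE_B = [
    {"customer_name": "bob ", "total_cents": 990, "day": "03/02/2015"},
    {"customer_name": "", "total_cents": 100, "day": "03/03/2015"},
]


def extract_and_format():
    """Extract from both sources and convert to one consistent format."""
    for r in SOURCE_A:
        yield {"customer": r["cust"], "amount": float(r["amount"]), "day": r["date"]}
    for r in SOURCE_B:
        month, day, year = r["day"].split("/")
        yield {
            "customer": r["customer_name"],
            "amount": r["total_cents"] / 100,
            "day": f"{year}-{month}-{day}",
        }


def clean(rows):
    """Normalize names and drop rows that fail a simple validity check."""
    for r in rows:
        name = r["customer"].strip().title()
        if name and r["amount"] >= 0:
            yield {**r, "customer": name}


def load(rows):
    """Fit the rows into the warehouse's data model and load them."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales_fact (customer TEXT, amount REAL, day TEXT)")
    db.executemany("INSERT INTO sales_fact VALUES (:customer, :amount, :day)", list(rows))
    db.commit()
    return db


warehouse = load(clean(extract_and_format()))
loaded = warehouse.execute(
    "SELECT customer, amount, day FROM sales_fact ORDER BY day"
).fetchall()
print(loaded)  # [('Ann', 12.5, '2015-03-01'), ('Bob', 9.9, '2015-03-02')]
```

The row with an empty customer name is rejected by the cleaning step. A real warehouse would log such rows so that they can be corrected at the source.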
Chaps26.28.29-81
Building A Data Warehouse
Storing the data according to the data model of the
warehouse
Creating and maintaining required data structures
Creating and maintaining appropriate access paths
Providing for time-variant data as new data are added
Supporting the updating of warehouse data.
Refreshing the data
Purging data
Chaps26.28.29-82
Why is Data Mining Popular?
Technology Push
Technology for Collecting Large Quantities of Data
Bar Code, Scanners, Satellites, Cameras
Technology for Storing Large Collections of Data
Databases, Data Warehouses
Variety of Data Repositories, such as Virtual Worlds,
Digital Media, World Wide Web
Corporations want to Improve Direct Marketing and
Promotions - Driving Technology Advances
Targeted Marketing by Age, Region, Income, etc.
Exploiting User Preferences/Customized Shopping
Chaps26.28.29-83
Requirements & Challenges in Data Mining
Security and Social
What Information is Available to Mine?
Preferences via Store Cards/Web Purchases
What is Your Comfort Level with Trends?
User Interfaces and Visualization
What Tools Must be Provided for End Users of
Data Mining Systems?
How are Results for Multi-Dimensional Data
Displayed?
Performance Guarantees
Range from Real-Time for Some Queries to Long-Term for Other Queries
Data Sources of Complex Data Types or Unstructured
Data - Ability to Format, Clean, and Load Data Sets
Chaps26.28.29-84
Data Mining Visualization
Leverage Improving 3D Graphics and Increased PC
Processing Power for Displaying Results
Significant Research in Visualization w.r.t. Displaying
Multi-Dimensional Data
Chaps26.28.29-85
Successful Data Mining Applications
Business Data Analysis and Decision Support
Marketing, Customer Profiling, Market Analysis
and Management, Risk Analysis and Management
Fraud Detection
Detecting Telephone Fraud, Automotive and
Health Insurance Fraud, Credit-card Fraud,
Suspicious Money Transactions (Money
Laundering)
Text Mining
Message Filtering (Email, Newsgroups, Etc.)
Newspaper Articles Analysis
Sports
IBM Advanced Scout Analyzed NBA Game
Statistics (Shots Blocked, Assists and Fouls) to
Gain Competitive Advantage
Chaps26.28.29-86
Select Data Mining Products
Chaps26.28.29-87
Databases on WWW
Web has changed the way we do Business & Research
Facts:
Industry Saw an Opportunity, knew it had to Move
Quickly to Capitalize
Lots of Action, Lots of Money, Lots of Releases
Line Between Research and Development is Very Narrow
Many Researchers Moved to Industry (Trying to Return
to Academia)
Emergence of Java
Java changed the way that Software was Designed,
Developed, and Utilized
Particularly w.r.t. Web-Based Applications,
Database Interoperability, Web Architectures, etc.
Emergence of Enterprise Computing
Chaps26.28.29-88
Internet and the Web
A Major Opportunity for Business
A Global Marketplace
Business Across State and Country Boundaries
A Way of Extending Services
Online Payment vs. VISA, Mastercard
A Medium for Creation of New Services
Publishers, Travel Agents, Teller, Virtual Yellow
Pages, Online Auctions …
A Boon for Academia
Research Interactions and Collaborations
Free Software for Classroom/Research Usage
Opportunities for Exploration of Technologies in
Student Projects
Chaps26.28.29-89
WWW: Three Market Segments
[Figure: the three market segments (Business to Business, Intranet, Internet), each drawn as servers on corporate networks. Labels in the figure: decision support, Mfg. system monitoring, corporate repositories, workgroups, information sharing, ordering info./status, targeted electronic commerce, sales, marketing, information services.]
Chaps26.28.29-90
Information Delivery Problems on the Net
Everyone can Publish Information on the Web
Independently at Any Time
Consequently, there is an Information Explosion
Identifying Information Content More Difficult
There are too Many Search Engines but too Few
Capable of Returning High Quality Data
Is this Still True?
Most Search Engines are Useful for Ad-hoc Searches
but Awkward for Tracking Changes
Is this Still True?
Chaps26.28.29-91
Example Web Applications
Scenario 1: World Wide Wait
A Major Event is Underway and the Latest, Up-to-the-Minute Results are Being Posted on the Web
You Want to Monitor the Results for this
Important Event, so you Fire up your Trusty Web
Browser, Pointing at the Result Posting Site, and
Wait, and Wait, and Wait …
What is the Problem?
The Scalability Problems are the Result of a
Mismatch Between the Data Access Characteristics
of the Application and the Technology Used to
Implement the Application
Changed with Emergence of Mobile Computing?
Chaps26.28.29-92
Example Web Applications
Scenario 2:
Many Applications Today have the Need for
Tracking Changes in Local and Remote Data
Sources and Notifying Users of Changes If Some Condition
Over the Data Source(s) is Met
If You Want to Monitor the Changes on the Web, You
Need to Fire up Your Trusty Web Browser from Time
to Time, Cache the Most Recent Result, and Compute
the Difference Manually Each Time You Poll the
Data Source(s) …
What is the Problem?
Pure Pull is Not the Answer to All Problems
Changed with Emergence of Mobile Computing?
Chaps26.28.29-93
What is the Problem?
Applications are Asymmetric but the Web is Not
Computation Centric vs. Information Flow Centric
Types of Asymmetry
Network Asymmetry
Satellite, CATV, Mobile Clients, Etc.
Client to Server Ratio
Too Many Clients can Swamp Servers
Data Volume
Mouse and Key Click vs. Content Delivery
Update and Information Creation
Clients Need to be Informed or Must Poll
What have we Seen re. Cell Networks Over Time?
Chaps26.28.29-94
Useful Solutions
Combination/Interleave of Pull and Push Protocols
User-initiated, Comprehensive Search-based
Information Delivery (Pull)
Server-initiated Information Dissemination (Push)
Provide Support for a Variety of Data Delivery
Protocols, Frequencies, and Delivery Modes
Information Delivery Frequencies
Periodic, Conditional, Ad-Hoc
Information Delivery Modes
Information Delivery Protocols (IDP)
Request/Respond, Polling, Publish/Subscribe,
Broadcast
Information Delivery Styles (IDS)
Pull, Push, Hybrid
Chaps26.28.29-95
Information Delivery Frequencies
Periodic
Data is Delivered from a Server to Clients
Periodically
Period can be Defined by System-default or by
Clients Using their Profiles
Period can be Influenced by Client and Bandwidth
Mobile Device vs. PC w/Modem
PC w/DSL vs. PC w/Cable Modem
Multiple Mobile Devices of All Types
Streaming of Videos, Live Streaming of Events
Conditional (Aperiodic)
Data is Delivered from a Server when Conditions
Installed by Clients in their Profiles are Satisfied
Ad-hoc (or Irregular)
Chaps26.28.29-96
Information Delivery Modes
Uni-cast
Data is Sent from a Data Source (a Single Server)
to Another Machine
1-to-n
Data is Sent by a Single Data Source and Received
by Multiple Machines
Multicast vs. Broadcast
Multicast: Data is Sent to a Specific Set of Clients
Broadcast: Data is Sent Over a Medium on which an
Unidentified or Unbounded Set of Clients can Listen
Chaps26.28.29-97
IDP: Request/Respond
Semantics of Request/Respond
Clients Send their Requests to Servers to Ask for the
Information of Interest to them
Servers Respond to the Client Request by
Delivering the Information Requested
Client can Wait (Synchronous) or Not
Applications
Most Database Systems and Web Search Engines
are Using the Request/Respond Protocol for
Client-Server Communication
What has Changed with Mobile Computing?
Chaps26.28.29-98
IDP: Programmed Polling vs. User Polling
Semantics:
Programmed Polling: a System Periodically
Sends Requests to Other Sites to Obtain Status
Information or Detect Changed Values
User Polling: a User or Application Periodically
or Aperiodically Polls the Data Sites and Obtains
the Changes
Applications
Programmed Polling: Saves Users from
having to Click, but does Nothing to Solve the
Scalability Problems Caused by the
Request/Respond Mechanism
What do Today’s Mobile Devices Use?
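Programmed polling can be sketched as a loop that issues a request every period, caches the last result, and reports only the changes. The classes and values below are hypothetical and exist only for this illustration. The loop counter stands in for a timer.

```python
class DataSource:
    """Hypothetical data site that only answers requests (request/respond)."""

    def __init__(self, value):
        self.value = value
        self.requests_served = 0

    def get(self):
        self.requests_served += 1
        return self.value


class Poller:
    """Programmed polling: caches the last result and reports only changes."""

    def __init__(self, source):
        self.source = source
        self.cached = None

    def poll_once(self):
        current = self.source.get()
        changed = current != self.cached
        self.cached = current
        return current if changed else None


source = DataSource("score 0-0")
poller = Poller(source)
seen = []
for tick in range(5):  # stands in for a timer firing once per period
    if tick == 3:
        source.value = "score 1-0"  # the data changes at the source
    update = poller.poll_once()
    if update is not None:
        seen.append(update)

print(seen)                    # ['score 0-0', 'score 1-0']
print(source.requests_served)  # 5
```

The user sees only two updates, but the source still serves all five requests. This is the scalability problem that the request/respond mechanism leaves unsolved.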
Chaps26.28.29-99
IDP: Publish/Subscribe
Semantics: Servers Publish/Clients Subscribe
Servers Publish Information Online
Clients Subscribe to the Information of Interest
(Subscription-based Information Delivery)
Data Flow is Initiated by the Data Sources
(Servers) and is Aperiodic
Danger: Subscriptions can Lead to Other
Unwanted Subscriptions
Applications
Unicast: Database Triggers and Active Databases
1-to-n: Online News Groups
How is this Utilized in Mobile Devices?
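A minimal in-process sketch of publish/subscribe follows. Clients register interest in a topic once, and the server then initiates every delivery. The Publisher class, the topic names, and the list-based inboxes are assumptions made for the example. A real system would deliver over a network.

```python
from collections import defaultdict


class Publisher:
    """Hypothetical server that pushes each published item to its subscribers."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        """A client registers interest in a topic once."""
        self.subscribers[topic].append(callback)

    def publish(self, topic, item):
        """The server initiates delivery; returns the number of clients reached."""
        for deliver in self.subscribers[topic]:
            deliver(item)
        return len(self.subscribers[topic])


inbox_a, inbox_b = [], []
server = Publisher()
server.subscribe("news", inbox_a.append)
server.subscribe("news", inbox_b.append)
server.subscribe("sports", inbox_b.append)

reached = server.publish("news", "headline 1")  # 1-to-n delivery
server.publish("sports", "final score")         # reaches a single client

print(reached)  # 2
print(inbox_a)  # ['headline 1']
print(inbox_b)  # ['headline 1', 'final score']
```

Unlike polling, no client request is issued per update. The data flows only when the source has something to publish.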
Chaps26.28.29-100
Information Delivery Styles
Pull-Based System
Transfer of Data from Server to Client is Initiated
by a Client Pull
Clients Determine when to Get Information
Potential for Information to be Old Unless Client
Periodically Pulls
Push-Based System
Transfer of Data from Server to Client is Initiated
by a Server Push
Clients may get Overloaded if Push is Too
Frequent
Hybrid
Pull and Push Combined
Pull First and then Push Continually
Chaps26.28.29-101
Summary: Pull vs. Push
[Table: delivery styles (Pure Pull, Pure Push, Hybrid) versus delivery protocols (Request/Respond, Publish/Subscribe, Broadcast) and delivery frequencies (Periodic, Conditional, Ad-hoc); supported combinations are marked Y or Y*.]
Chaps26.28.29-102
Design Options for Nodes
Three Types of Nodes:
Data Sources
Provide Base Data which is to be Disseminated
Clients
Who are the Net Consumers of the Information
Information Brokers
Acquire Information from Other Data Sources, Add
Value to that Information and then Distribute this
Information to Other Consumers
By Creating a Hierarchy of Brokers, Information
Delivery can be Tailored to the Need of Many Users
How has this Changed with Today’s Mobile
Computing?
Chaps26.28.29-103
The Next Big Challenge
Interoperability
Heterogeneous Distributed Databases
Heterogeneous Distributed Systems
Autonomous Applications
Scalability
Rapid and Continuous Growth
Amount of Data
Variety of Data Types
Dealing with personally identifiable information (PII)
and personal health information (PHI)
Emergence of Fitness and Health Monitoring Apps
Google Fit and Apple HealthKit
New Apple ResearchKit for Medical Research
Chaps26.28.29-104
Interoperability: A Classic View
[Figure: two forms of federation. Simple Federation: Local Schemas combined through Federated Integration into an FDB Global Schema. Multiple Nested Federation: Local Schemas and Federations (FDB 1, FDB3) combined through Federated Integration into FDB Global Schema 4.]
Chaps26.28.29-105
Java Client with Wrapper to Legacy Application
Interactions Between Java Client and Legacy Appl. via C and RPC
C is the Medium of Info. Exchange
Java Client with C++/C Wrapper
[Figure: the Java Client holds the Java Application Code and a WRAPPER made of a JAVA LAYER (Mapping Classes) and a NATIVE LAYER (Native Functions (C++), RPC Client Stubs (C)); it is connected over the Network to the Legacy Application.]
Chaps26.28.29-106
COTS and Legacy Appls. to Java Clients
Java is Medium of Info. Exchange - C/C++ Appls with Java Wrappers
[Figure: two parallel stacks, one for a COTS Application and one for a Legacy Application. Each stack shows Java Application Code, a NATIVE LAYER (Native Functions that Map to the COTS or Legacy Appl), and a JAVA LAYER (Mapping Classes, JAVA NETWORK WRAPPER), connected over the Network to a Java Client.]
Chaps26.28.29-107
Java Client to Legacy App via RDBS
[Figure: the Legacy Application and the Java Client exchange data through a Relational Database System (RDS). Labels in the figure: Extract and Generate Data, Transform and Store Data, Transformed Legacy Data, Updated Data.]
Chaps26.28.29-108
Database Interoperability in the Internet
Technology
Web/HTTP, JDBC/ODBC, CORBA (ORBs +
IIOP), XML, SOAP, REST API, WSDL
Architecture
Information Broker
•Mediator-Based Systems
•Agent-Based Systems
Chaps26.28.29-109
JDBC
JDBC API Provides DB Access Protocols for Open,
Query, Close, etc.
Different Drivers for Different DB Platforms
[Figure: a Java Application calls the JDBC API; the Driver Manager sits above one Driver per DB platform (Oracle, Access, and Sybase are shown).]
Chaps26.28.29-110
Connecting a DB to the Web
Web Servers are Stateless
DB Interactions Tend to be Stateful
Invoking a CGI Script on Each DB Interaction is Very Expensive, Mainly Due to the Cost of DB Open
[Figure: Browser, Internet, Web Server, CGI Script Invocation or JDBC Invocation, DBMS.]
Chaps26.28.29-111
Connecting More Efficiently
To Avoid Cost of Opening Database, One can Use Helper Processes that Always Keep Database Open and Outlive Web Connection
Newly Invoked CGI Scripts Connect to a Preexisting Helper Process
System is Still Stateless
[Figure: Browser, Internet, Web Server, CGI Script or JDBC Invocation, Helper Processes, DBMS.]
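The helper-process idea can be sketched as an object that opens the database once and then serves many short-lived requests. This is a single-process stand-in only. The Helper class, the hits table, and the cgi_script function are hypothetical, and an in-memory SQLite database takes the place of the DBMS.

```python
import sqlite3


class Helper:
    """Stand-in for a helper process: opens the database once and keeps it open."""

    opens = 0  # counts how many times the database has been opened

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        Helper.opens += 1
        self.db.execute("CREATE TABLE IF NOT EXISTS hits (page TEXT)")

    def handle(self, page):
        """Serve one request; no per-client state is kept between calls."""
        self.db.execute("INSERT INTO hits VALUES (?)", (page,))
        self.db.commit()
        row = self.db.execute(
            "SELECT COUNT(*) FROM hits WHERE page = ?", (page,)
        ).fetchone()
        return row[0]


helper = Helper()  # started once; outlives every individual web request


def cgi_script(page):
    """Each newly invoked script reuses the preexisting helper."""
    return helper.handle(page)


results = [cgi_script("/index"), cgi_script("/index"), cgi_script("/about")]
print(results)       # [1, 2, 1]
print(Helper.opens)  # 1
```

Three requests are served with a single database open. The requests stay stateless because each call carries everything the helper needs.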
Chaps26.28.29-112
DB-Internet Architecture
[Figure: WWW Clients (Netscape, Info. Explore, HotJava) connect over the Internet to an HTTP Server, which is backed by a DBWeb Dispatcher and several DBWeb Gateways.]
Chaps26.28.29-113
EJB Architecture
Chaps26.28.29-114
Technology Push
Computer/Communication Technology (Almost Free)
Plenty of Affordable CPU, Memory, Disk,
Network Bandwidth
Next Generation Internet: Gigabit Now
Wireless: Ubiquitous, High Bandwidth
Information Growth
Massively Parallel Generation of Information on
the Internet and from New Generation of Sensors
Disk Capacity on the Order of Petabytes
Small, Handy Devices to Access Information
The focus is to make information
available to users, in the right form, at
the right time, in the appropriate place.
Chaps26.28.29-115
Research Challenges
Ubiquitous/Pervasive
Many computers and information
appliances everywhere,
networked together
Inherent Complexity:
Coping with Latency (Sometimes
Unpredictable)
Failure Detection and Recovery
(Partial Failure)
Concurrency, Load Balancing,
Availability, Scale
Service Partitioning
Ordering of Distributed Events
“Accidental” Complexity:
Heterogeneity: Beyond the Local
Case: Platform, Protocol, Plus All
Local Heterogeneity in Spades.
Autonomy: Change and Evolve
Autonomously
Tool Deficiencies: Language Support
(Sockets, RPC), Debugging, Etc.
Chaps26.28.29-116
Infosphere
Problem: Too Many Sources, Too Much Information
[Figure: Infosphere overview. Labels in the figure: Internet: Information Jungle; Infopipes; Sensors; Digital Earth; Personalized Filtering & Info. Delivery; Clean, Reliable, Timely Information, Anywhere.]
Chaps26.28.29-117
Current State-of-Art – Has Mobile Changed This?
[Figure: current architecture. Labels in the figure: Web Server, Mainframe, Database Server, Thin Client.]
Chaps26.28.29-118
Infosphere Scenario – Where Does Mobile Fit?
[Figure: Infosphere scenario. Labels in the figure: Infotaps & Fat Clients, Sensors, Variety of Servers, Many sources, Database Server.]
Chaps26.28.29-119
Heterogeneity and Autonomy
Heterogeneity:
How Much can we Really Integrate?
Syntactic Integration
Different Formats and Models
XML/JSON/RDF/OWL/SQL Query Languages
Semantic Interoperability
Basic Research on Ontology, Etc.
Autonomy
No Central DBA on the Net
Independent Evolution of Schema and Content
Interoperation is Voluntary
Interface Technology: DCOM (Microsoft Standard),
CORBA, Etc.
Chaps26.28.29-120
Security and Data Quality
Security
System Security in the Broad Sense
Attacks: Penetrations, Denial of Service
System (and Information) Survivability
Security Fault Tolerance
Replication for Performance, Availability, and
Survivability
Data Quality
Web Data Quality Problems
Local Updates with Global Effects
Unchecked Redundancy (Mutual Copying)
Registration of Unchecked Information
Spam on the Rise
Chaps26.28.29-121
Legacy Data Challenge
Legacy Applications and Data
Definition: Important and Difficult to Replace
Typically, Mainframe Mission Critical Code
Most are OLTP and Database Applications
Evolution of Legacy Databases
Client-server Architectures
Wrappers
Expensive and Gradual in Any Case
Chaps26.28.29-122
Potential Value Added/Jumping on Bandwagon
Sophisticated Query Capability
Combining SQL with Keyword Queries
Consistent Updates
Atomic Transactions and Beyond
But Everything has to be in a Database!
Only If we Stick with Classic DB Assumptions
Relaxing DB Assumptions
Interoperable Query Processing
Extended Transaction Updates
Commodity DB Software
A Little Help is Still Good If it is Cheap
Internet Facilitates Software Distribution
Databases as Middleware
Chaps26.28.29-123
Concluding Remarks
Four-Fold Objective
Distributed Database Processing
Data Warehouses
Data Mining of Vast Information Repositories
Web-Based Architectures for DB Interoperability
All Four are Tightly Related
DDBMS can Improve Performance of Mining
Repositories as Backend Database Processors
Web-Based Architectures Provide Access Means
for DDBMS or Mining
Warehouses are Infrastructure to Facilitate Mining
Geographic Information Systems, Deductive DBMS,
Multi-Media DBMS, Mobile DBMS, Embedded/Real-Time DBMS, etc.
Chaps26.28.29-124