Databases - The University of Texas at Dallas
Download
Report
Transcript Databases - The University of Texas at Dallas
Data and Knowledge
Management
Sumit Sarkar & Varghese S. Jacob
University of Texas at Dallas
Sumit Sarkar & Varghese Jacob, UT Dallas
Overview of Session
Data
Management
Technical considerations: Databases and
database management systems
Design Issues
Managerial issues
Data
Warehousing
Knowledge Discovery and Data Mining
Knowledge Management
Sumit Sarkar & Varghese Jacob, UT Dallas
What Should Managers Know
About Data Management?
Organizations
store and use large
quantities of data
Sheer volume of data alone means that
proper management is essential
Data are a valuable resource that must be
managed
Value can be derived by capturing,
validating, and protecting the data
Sumit Sarkar & Varghese Jacob, UT Dallas
What Should Managers Know
About Database Management?
Management
of database is an
important issue
No longer just a technical issue, the
management of databases has
become increasingly important
throughout all types of
organizations.
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Issues
Amount
of data we capture doubles
approximately every 18 months.
How do we capture it?
Where do we store it?
What do we do with it?
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Approach
Database:
A non redundant collection of logically
related records or files. It enables a common pool of
data records to serve many processing applications.
Abstraction of the real world.
Database Management Systems (DBMS): A set of
computer programs that handle all access to the
database. It allows one to define a collection of
related files, manipulate this collection (i.e. insert,
delete, and update) and read records. E.g. Access,
Oracle
Data occupies the central position; referenced as
needed.
Sumit Sarkar & Varghese Jacob, UT Dallas
DBMS: Database Management Systems
Reports
Database
Database
Management
Systems
Application
Programs
Q ueries
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Approach (cont’d)
Data Sharing: data is not the property of one
person; one representation for each piece of data;
avoids (minimizes) redundancy.
User Views: allows a user to have a view of the
database that is different from the view used by
others; user is isolated from changes in the
data/programs.
Query Language: An English-like language
designed for end-users to query the database. E.g.,
Structured Query Language (SQL)
Database Administrator: Specialist who manages
the database.
Sumit Sarkar & Varghese Jacob, UT Dallas
Major Database Activities
Database
development - design and implement the
database
Database interrogation - query the database
Database maintenance - keep it operational
Application development - write application
programs against it
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Administrator
Coordinates
all the activities of the database
management system
Must have a good understanding of the enterprise’s
information resources and needs
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Administrator
Duties
include:
DB Planning
Design & Development
Performance Analysis
Liaison with Users
Security and Authorization
Backup & Recovery
Maintenance
Training
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Users
Differentiated by the way they expect to interact with
the system
Application programmers – interact with system
through query languages and other programs
Sophisticated users – form requests in a database
query language
Naive users – invoke one of the permanent
application programs that have been written
previously
Sumit Sarkar & Varghese Jacob, UT Dallas
What Should Managers Know
about Database Technology?
How
data is represented can
significantly affect the ways in which
data
can be manipulated to assist in decisionmaking
can be controlled and managed
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Models
• Hierarchical
• Network
• Relational
• Object Oriented
• Multi-Dimensional
Sumit Sarkar & Varghese Jacob, UT Dallas
Which Data Model is Best?
Depends!
Transaction Processing (OLTP)
• Relational
• Hierarchical
Decision Support (OLAP)
• Multi-Dimensional
• Relational
Multi-Media
Object Oriented
• Relational
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Model
Most
widely used type of database
management system
Examples:
Oracle (Oracle), DB2 (IBM),
Informix (Informix), SQL Server
(Microsoft), MS Access (Microsoft), etc.
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Model
One
basic construct: the relation
Relations represent both entities and
relationships
Data Manipulation Language is English-like
(e.g., SQL)
Note - A relationship occurs during
conceptual design (E-R Model) and should
not be confused with a relation
Sumit Sarkar & Varghese Jacob, UT Dallas
Book Relation
Column
Record
Field
Sumit Sarkar & Varghese Jacob, UT Dallas
Notation
Relations
can be expressed concisely by
eliminating the sample data and showing just
the relation name and the column names. For
example:
Book: [title-id, title, type, price, pub_id]
(Note that the key is underlined)
Sumit Sarkar & Varghese Jacob, UT Dallas
Properties of relations
Each
row is unique. No two rows may have
identical values.
Each
column must have a unique name.
The
sequence of columns (left to right) is
insignificant.
The
sequence of rows (top to bottom) is
insignificant.
Attribute values are atomic
Sumit Sarkar & Varghese Jacob, UT Dallas
Relations and keys
A primary key is a column or group of columns
in a relation whose values uniquely identify each
row in a relation.
A foreign
key is a column or group of columns
whose values are the same as the primary key of
another relation. You can think of a foreign key
as a copy of the primary key from another
relation. Note that you do not underline a foreign
key.
Sumit Sarkar & Varghese Jacob, UT Dallas
Example
Patient Bill
Patient No
Location
3249
137-2
Patient Name
Date Admitted
5/6/99
Date Discharged 5/9/99
Item_Code
200
205
307
Description
Room Semi-pr
Television
X-Ray
Mary Baker
Charge
$150.00
$10.00
$100.00
Total:
$260.00
Sumit Sarkar & Varghese Jacob, UT Dallas
Example (Cont.)
Patient & Charges Data
P_No
Patient_ Patient
Date
Date
Item
Desc
Charge
Fname
Lname
Admit
Disch
Code
1379
1379
John
John
Cribbs
Cribbs
4/23/99
4/23/99
4/28/99
4/28/99
413
307
Lab Test
X-Ray
$40.00
$100.00
3249
Mary
Baker
5/6/99
5/9/99
307
X-Ray
$100.00
3249
Mary
Baker
5/6/99
5/9/99
205
Television $10.00
3249
Mary
Baker
5/6/99
5/9/99
200
Room S-pr$150.00
6213
David
Rose
5/2/99
5/6/99
205
Television $10.00
Redundancies and Anomalies
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Databases (RDBMS)
PATIENT TABLE
Patient_
No
1379
3249
6213
…
…
Patient_LName
Cribbs
Baker
Rose
…
…
Patient_
FName
John
Mary
David
…
…
Date_Admitted Date_Discharged
4/23/99
4/28/99
5/6/99
5/9/99
5/2/99
5/6/99
…
…
…
…
ITEM TABLE
Primary Key
Item_Code
200
207
307
…
…
Description
Room S-Pr
Television
X-Ray
…
…
Amount
150
10
100
…
…
Foreign Key
CHARGES TABLE
Patient_No
1379
1379
3249
…
Item_Code
413
307
307
…
Sumit Sarkar & Varghese Jacob, UT Dallas
Design Steps
User Requirements
Conceptual Design
(E.g. entity-relationship model)
Logical Design
(E.g. relational model)
Physical Design
(optimize for performance)
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Operations
Database
operators (operations)
represent ways in which data can be
manipulated to assist in decisionmaking
Without some sense of the
possibilities of queries and reports,
people often have a misconception of
what they can expect from a database.
Sumit Sarkar & Varghese Jacob, UT Dallas
What does the user want?
• Patients admitted after February 26, 2001
• Items with Charge Amount > $100
Select
• Admit Date for all Patients
Project
• All Patients and associated Charges
Join
• Patients in Semi Private rooms
Combination
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Databases (RDBMS)
PATIENT TABLE
Patient_
No
1379
3249
6213
…
…
Patient_LName
Cribbs
Baker
Rose
…
…
Patient_
FName
John
Mary
David
…
…
Date_Admitted Date_Discharged
4/23/99
4/28/99
5/6/99
5/9/99
5/2/99
5/6/99
…
…
…
…
ITEM TABLE
Item_Code
200
207
307
…
…
Description
Room S-Pr
Television
X-Ray
…
…
Amount
150
10
100
…
…
CHARGES TABLE
Patient_No
1379
1379
3249
…
Item_Code
413
307
307
…
Sumit Sarkar & Varghese Jacob, UT Dallas
Security Features
Four Dimensions
•Who (User)
•What (Data Objects)
•Where (Point of Access)
•How (Nature of Access)
Sumit Sarkar & Varghese Jacob, UT Dallas
Questions?
Sumit Sarkar & Varghese Jacob, UT Dallas
In Class Exercise
XYZ is a luxury hotel chain with several branches across the US. The clientele
consists of corporate travelers and well-heeled individuals. A key to XYZ’s
success is in providing personalized attention to its guests. XYZ relies on two
information systems. One is GLOBAL, which is a centralized reservation
system owned by an international travel organization TRV. The other is
LOCAL, which keeps records of reservations for each individual hotel and
preferences for guests who have stayed in each hotel.
Given their clientele, XYZ’s management feels that it is imperative to remember
their repeat guests, and their preferences. While LOCAL allows a hotel to
easily access its own guest list, it is currently not possible to locate the
preferences of a guest who may have stayed in a different hotel of the chain.
A related problem is that when a repeat guest with a common name (e.g.,
John Doe) calls, all the John Doe’s in the system would be displayed, and the
reservation agents would have no way of knowing who that caller is.
Furthermore, if John Doe had been listed previously as John Dow, the system
would not recognize this as the same person. Consequently, a new record
would be created for John Doe.
Sumit Sarkar & Varghese Jacob, UT Dallas
In Class Exercise (cont.)
Your group has been designated to address the above problems.
What could be done to enable a reservation agent to access data
from some other hotel? What issues must you consider in
making such a decision?
What can help in identifying a caller with a common name?
What can be done to resolve the “duplicate records” problem?
In your preliminary discussions with the CEO, the possibility of
moving to a completely new system has been considered. You
believe that it would be easier to justify a new system if
additional value could be generated from the data on your
guests. What could be possible ways to achieve that?
Lead discussion group: 0153
Sumit Sarkar & Varghese Jacob, UT Dallas
DATA WAREHOUSING
Repository of Data for Decision Support Applications
End Users: Business Analysts, Executive Management,
etc.
Goal: Deliver needed data (information) when user
wants, in form that user understands, to user’s preferred
location
Sumit Sarkar & Varghese Jacob, UT Dallas
DATA WAREHOUSING
Working Definitions:
Technique to assemble and manage data from
various sources in order to gain a single,
detailed view of a business (Stephen R.
Gardner, NCR)
An integrated and consistent store of subject
oriented data, structured for query and
retrieval in order to support management
decision making
Sumit Sarkar & Varghese Jacob, UT Dallas
TYPICAL APPLICATIONS
Marketing Applications
Scanner Data
Market Basket analyses
One-to-one Marketing
Customer Support
Logistics
Financial
Services
Health Care
Sumit Sarkar & Varghese Jacob, UT Dallas
DRIVERS FOR DATA
WAREHOUSING
Business Drivers
• Quick reaction to business needs
• Extensive analysis requirements
• E.g, Understanding customer behavior,
Assessing performance in different
markets, etc.
Sumit Sarkar & Varghese Jacob, UT Dallas
DRIVERS FOR DATA
WAREHOUSING
Data Drivers
• Patterns hidden in transactional data
• Need for “Unified View” of data from
“Islands” of data
External data needs
Disparate platforms and formats
Multiple database technologies
Different application priorities
Sumit Sarkar & Varghese Jacob, UT Dallas
DATA WAREHOUSE
ARCHITECTURE
O perational
databases
Re ports
Data
Extraction
Routine s
Exte rnal data
Data
Warehouse
Q uering and
Analysis
Software
Q ue rie s
Sumit Sarkar & Varghese Jacob, UT Dallas
KEY FUNCTIONS: FRONT
END
Data Retrieval and Analysis
• Querying Tools
• Multidimensional Analysis
• Data Visualization
• Knowledge Discovery and Data Mining
Sumit Sarkar & Varghese Jacob, UT Dallas
KEY FUNCTIONS: BACK END
Data Preparation
• Extraction
• Transformation
• Cleansing
• Data Integration/Reconciliation
• Aggregation
• Summarization
• Loading
Sumit Sarkar & Varghese Jacob, UT Dallas
MULTIDIMENSIONAL
ANALYSIS
Business Analysis based on
• Products
Trends
• Customers
in Sales of Product P23 (Jazz Drives)
Purchase
• Dates
•
habits of Customer C1234 (Jill)
Seasonality in
sales
Combination of dimensions (Product & Date)
Sales of Product P23 in Spring
On-Line Analytical Processing (OLAP): Techniques and
tools used to analyze data in warehousing environments
Sumit Sarkar & Varghese Jacob, UT Dallas
MULTIDIMENSIONAL
DATABASES
Product
e
m
i
T
Multi
Dimensional Data
Customer
OLAP tools: On Line Analytical Processing
• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
Sumit Sarkar & Varghese Jacob, UT Dallas
RELATIONAL: Example
Customer
Product
Month
Quantity
Jill
Jill
Joe
Joe
Joe
John
Jaz
Floppy
Jaz
Zip
Floppy
Jaz
9
9
9
9
9
9
2
1
1
2
7
15
Sumit Sarkar & Varghese Jacob, UT Dallas
MULTIDIMENSIONAL: Example
e Quantity Sold in September
m
Ti
2
1
15
Zip
0
2
0
Floppy
1
7
0
Product
Jaz
Jill
Joe
John
Customer
Implications for viewing data from an end-user standpoint: Slice the data on
relevant dimensions
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Warehouse Vs Operational
(Transactional) Database
Data Warehouse
Operational DB
Integrated
Limited Integration
Non-volatile
Continuously Updated
Ad-hoc Retrieval
Predictable Retrieval
Historical data
Current data
Sumit Sarkar & Varghese Jacob, UT Dallas
Development Approaches
Data
Centered
Variety of Applications Supported
Application
Centered (Data Marts)
Quick implementation
Low cost (relatively)
High payoffs
Staged
Data Marts
Sumit Sarkar & Varghese Jacob, UT Dallas
DATA WAREHOUSING LIFE
CYCLE
Define Business Strategy
Identify Scope
Design Architecture/Infrastructure
Determine Data Model
Identify Data Extraction Routines
Implementation & Testing
On-going application development
Process, not Product
Sumit Sarkar & Varghese Jacob, UT Dallas
KNOWLEDGE DISCOVERY & DATA
MINING (KDD)
Process of Identifying
• Novel
• Valid
• Potentially Useful
Patterns
in Data
Sumit Sarkar & Varghese Jacob, UT Dallas
KDD: Concepts
Pattern
If Income > $20,000 and House-owner then person will
not default payment
Novel
Relationships previously unknown to users
Valid
Discovered patterns should be true for new situations in
general
Potentially
Useful
Interesting
Actionable
Sumit Sarkar & Varghese Jacob, UT Dallas
Sample Examples of Data
Mining
Campaign management by propensity modeling
Risk management by credit scoring
People who bought Data Mining books are likely to buy Data Warehousing
books
Customized products and care
Customers with a credit score of 700 or less default at an average rate of 5%
Cross-selling by affinity prediction
Young Males are 75% more likely to respond to sports memorabilia offers
than young girls
Residents in 90210 prefer Geo Metros
Assignment of call center agents
People who call on Monday Mornings call due to billing problems
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Mining: Analytical
Techniques
Regression
(Linear and Non-Linear)
Multivariate Analysis
Discriminant Analysis
Clustering
Multi-Dimensional Analysis (MDA)
Decision
Tree Induction
Neural Networks
Dependency Graphs & Belief Networks
Genetic Algorithms
Sumit Sarkar & Varghese Jacob, UT Dallas
Web Mining: Emerging Area
Types of Web mining
Web content mining
Web structure mining
Web usage mining
Interesting problems on Web mining
Mining what Web search engine finds
Weblog mining (usage, access)
Web document classification
New Applications
Personalization of web sites
Targeted advertising on web sites
Product recommendations (Books, Movies etc.)
Sumit Sarkar & Varghese Jacob, UT Dallas
COLLABORATIVE
FILTERING
Very
popular in Recommendation systems
E.g., when a customer of Amazon.com logs
on, they are provided recommendations for
books, music, etc. How is it done?
Case-based approach used to identify other
customers with similar profiles (e.g.,
common purchases, demographics, etc.)
Recommendation made based on purchases
of other such customers
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Mining Myths
Data
Mining is a Magic Wand
All patterns are useful
You don’t need to understand statistical
tools to use data mining
Sumit Sarkar & Varghese Jacob, UT Dallas
Knowledge Management
Case - KPMG Peat Marwick U.S.: One Giant Brain
Lead discussion group 0224
•
•
•
Which primary activity of KPMG’s value chain is most significantly
impacted by it’s Knowledge Management initiative?
Why was the Shadow Partner initiative not implemented? Was that a
good decision?
How should KPMG approach the technology and systems
implementation issue?
Lead discussion group 0225
How should knowledge and intellectual assessment be performed?
• How can KPMG integrate knowledge management activities into its
business processes?
• What cultural changes are necessary? How can they be achieved?
Sumit Sarkar & Varghese Jacob, UT Dallas