Databases - The University of Texas at Dallas

Download Report

Transcript Databases - The University of Texas at Dallas

Data and Knowledge
Management
Sumit Sarkar & Varghese S. Jacob
University of Texas at Dallas
Sumit Sarkar & Varghese Jacob, UT Dallas
Overview of Session
 Data



Management
Technical considerations: Databases and
database management systems
Design Issues
Managerial issues
 Data
Warehousing
 Knowledge Discovery and Data Mining
 Knowledge Management
Sumit Sarkar & Varghese Jacob, UT Dallas
What Should Managers Know
About Data Management?
 Organizations
store and use large
quantities of data
 Sheer volume of data alone means that
proper management is essential
 Data are a valuable resource that must be
managed
 Value can be derived by capturing,
validating, and protecting the data
Sumit Sarkar & Varghese Jacob, UT Dallas
What Should Managers Know
About Database Management?
Management
of database is an
important issue
 No longer just a technical issue, the
management of databases has
become increasingly important
throughout all types of
organizations.
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Issues
Amount
of data we capture doubles
approximately every 18 months.
 How do we capture it?
 Where do we store it?
 What do we do with it?
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Approach
 Database:
A non redundant collection of logically
related records or files. It enables a common pool of
data records to serve many processing applications.
 Abstraction of the real world.
 Database Management Systems (DBMS): A set of
computer programs that handle all access to the
database. It allows one to define a collection of
related files, manipulate this collection (i.e. insert,
delete, and update) and read records. E.g. Access,
Oracle
 Data occupies the central position; referenced as
needed.
Sumit Sarkar & Varghese Jacob, UT Dallas
DBMS: Database Management Systems
Reports
Database
Database
Management
Systems
Application
Programs
Q ueries
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Approach (cont’d)
Data Sharing: data is not the property of one
person; one representation for each piece of data;
avoids (minimizes) redundancy.
 User Views: allows a user to have a view of the
database that is different from the view used by
others; user is isolated from changes in the
data/programs.
 Query Language: An English-like language
designed for end-users to query the database. E.g.,
Structured Query Language (SQL)
 Database Administrator: Specialist who manages
the database.

Sumit Sarkar & Varghese Jacob, UT Dallas
Major Database Activities
 Database
development - design and implement the
database
 Database interrogation - query the database
 Database maintenance - keep it operational
 Application development - write application
programs against it
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Administrator
 Coordinates
all the activities of the database
management system
 Must have a good understanding of the enterprise’s
information resources and needs
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Administrator
 Duties
include:
 DB Planning
 Design & Development
 Performance Analysis
 Liaison with Users
 Security and Authorization
 Backup & Recovery
 Maintenance
 Training
Sumit Sarkar & Varghese Jacob, UT Dallas
Database Users

Differentiated by the way they expect to interact with
the system
 Application programmers – interact with system
through query languages and other programs
 Sophisticated users – form requests in a database
query language
 Naive users – invoke one of the permanent
application programs that have been written
previously
Sumit Sarkar & Varghese Jacob, UT Dallas
What Should Managers Know
about Database Technology?
How
data is represented can
significantly affect the ways in which
data
 can be manipulated to assist in decisionmaking
 can be controlled and managed
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Models
• Hierarchical
• Network
• Relational
• Object Oriented
• Multi-Dimensional
Sumit Sarkar & Varghese Jacob, UT Dallas
Which Data Model is Best?
 Depends!
 Transaction Processing (OLTP)
• Relational
• Hierarchical
 Decision Support (OLAP)
• Multi-Dimensional
• Relational
 Multi-Media
 Object Oriented
• Relational
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Model
Most
widely used type of database
management system
Examples:
Oracle (Oracle), DB2 (IBM),
Informix (Informix), SQL Server
(Microsoft), MS Access (Microsoft), etc.
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Model
 One
basic construct: the relation
 Relations represent both entities and
relationships
 Data Manipulation Language is English-like
(e.g., SQL)
 Note - A relationship occurs during
conceptual design (E-R Model) and should
not be confused with a relation
Sumit Sarkar & Varghese Jacob, UT Dallas
Book Relation
Column
Record
Field
Sumit Sarkar & Varghese Jacob, UT Dallas
Notation
 Relations
can be expressed concisely by
eliminating the sample data and showing just
the relation name and the column names. For
example:
Book: [title-id, title, type, price, pub_id]
(Note that the key is underlined)
Sumit Sarkar & Varghese Jacob, UT Dallas
Properties of relations
 Each
row is unique. No two rows may have
identical values.
 Each
column must have a unique name.
 The
sequence of columns (left to right) is
insignificant.
 The
sequence of rows (top to bottom) is
insignificant.

Attribute values are atomic
Sumit Sarkar & Varghese Jacob, UT Dallas
Relations and keys

A primary key is a column or group of columns
in a relation whose values uniquely identify each
row in a relation.
 A foreign
key is a column or group of columns
whose values are the same as the primary key of
another relation. You can think of a foreign key
as a copy of the primary key from another
relation. Note that you do not underline a foreign
key.
Sumit Sarkar & Varghese Jacob, UT Dallas
Example
Patient Bill
Patient No
Location
3249
137-2
Patient Name
Date Admitted
5/6/99
Date Discharged 5/9/99
Item_Code
200
205
307
Description
Room Semi-pr
Television
X-Ray
Mary Baker
Charge
$150.00
$10.00
$100.00
Total:
$260.00
Sumit Sarkar & Varghese Jacob, UT Dallas
Example (Cont.)
Patient & Charges Data
P_No
Patient_ Patient
Date
Date
Item
Desc
Charge
Fname
Lname
Admit
Disch
Code
1379
1379
John
John
Cribbs
Cribbs
4/23/99
4/23/99
4/28/99
4/28/99
413
307
Lab Test
X-Ray
$40.00
$100.00
3249
Mary
Baker
5/6/99
5/9/99
307
X-Ray
$100.00
3249
Mary
Baker
5/6/99
5/9/99
205
Television $10.00
3249
Mary
Baker
5/6/99
5/9/99
200
Room S-pr$150.00
6213
David
Rose
5/2/99
5/6/99
205
Television $10.00
Redundancies and Anomalies
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Databases (RDBMS)
PATIENT TABLE
Patient_
No
1379
3249
6213
…
…
Patient_LName
Cribbs
Baker
Rose
…
…
Patient_
FName
John
Mary
David
…
…
Date_Admitted Date_Discharged
4/23/99
4/28/99
5/6/99
5/9/99
5/2/99
5/6/99
…
…
…
…
ITEM TABLE
Primary Key
Item_Code
200
207
307
…
…
Description
Room S-Pr
Television
X-Ray
…
…
Amount
150
10
100
…
…
Foreign Key
CHARGES TABLE
Patient_No
1379
1379
3249
…
Item_Code
413
307
307
…
Sumit Sarkar & Varghese Jacob, UT Dallas
Design Steps
User Requirements
Conceptual Design
(E.g. entity-relationship model)
Logical Design
(E.g. relational model)
Physical Design
(optimize for performance)
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Operations
Database
operators (operations)
represent ways in which data can be
manipulated to assist in decisionmaking
 Without some sense of the
possibilities of queries and reports,
people often have a misconception of
what they can expect from a database.
Sumit Sarkar & Varghese Jacob, UT Dallas
What does the user want?
• Patients admitted after February 26, 2001
• Items with Charge Amount > $100
Select
• Admit Date for all Patients
Project
• All Patients and associated Charges
Join
• Patients in Semi Private rooms
Combination
Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Databases (RDBMS)
PATIENT TABLE
Patient_
No
1379
3249
6213
…
…
Patient_LName
Cribbs
Baker
Rose
…
…
Patient_
FName
John
Mary
David
…
…
Date_Admitted Date_Discharged
4/23/99
4/28/99
5/6/99
5/9/99
5/2/99
5/6/99
…
…
…
…
ITEM TABLE
Item_Code
200
207
307
…
…
Description
Room S-Pr
Television
X-Ray
…
…
Amount
150
10
100
…
…
CHARGES TABLE
Patient_No
1379
1379
3249
…
Item_Code
413
307
307
…
Sumit Sarkar & Varghese Jacob, UT Dallas
Security Features
Four Dimensions
•Who (User)
•What (Data Objects)
•Where (Point of Access)
•How (Nature of Access)
Sumit Sarkar & Varghese Jacob, UT Dallas
Questions?
Sumit Sarkar & Varghese Jacob, UT Dallas
In Class Exercise
XYZ is a luxury hotel chain with several branches across the US. The clientele
consists of corporate travelers and well-heeled individuals. A key to XYZ’s
success is in providing personalized attention to its guests. XYZ relies on two
information systems. One is GLOBAL, which is a centralized reservation
system owned by an international travel organization TRV. The other is
LOCAL, which keeps records of reservations for each individual hotel and
preferences for guests who have stayed in each hotel.
Given their clientele, XYZ’s management feels that it is imperative to remember
their repeat guests, and their preferences. While LOCAL allows a hotel to
easily access its own guest list, it is currently not possible to locate the
preferences of a guest who may have stayed in a different hotel of the chain.
A related problem is that when a repeat guest with a common name (e.g.,
John Doe) calls, all the John Doe’s in the system would be displayed, and the
reservation agents would have no way of knowing who that caller is.
Furthermore, if John Doe had been listed previously as John Dow, the system
would not recognize this as the same person. Consequently, a new record
would be created for John Doe.
Sumit Sarkar & Varghese Jacob, UT Dallas
In Class Exercise (cont.)
Your group has been designated to address the above problems.
 What could be done to enable a reservation agent to access data
from some other hotel? What issues must you consider in
making such a decision?
 What can help in identifying a caller with a common name?
 What can be done to resolve the “duplicate records” problem?
 In your preliminary discussions with the CEO, the possibility of
moving to a completely new system has been considered. You
believe that it would be easier to justify a new system if
additional value could be generated from the data on your
guests. What could be possible ways to achieve that?

Lead discussion group: 0153
Sumit Sarkar & Varghese Jacob, UT Dallas
DATA WAREHOUSING
 Repository of Data for Decision Support Applications
 End Users: Business Analysts, Executive Management,
etc.
 Goal: Deliver needed data (information) when user
wants, in form that user understands, to user’s preferred
location
Sumit Sarkar & Varghese Jacob, UT Dallas
DATA WAREHOUSING
 Working Definitions:

Technique to assemble and manage data from
various sources in order to gain a single,
detailed view of a business (Stephen R.
Gardner, NCR)

An integrated and consistent store of subject
oriented data, structured for query and
retrieval in order to support management
decision making
Sumit Sarkar & Varghese Jacob, UT Dallas
TYPICAL APPLICATIONS
 Marketing Applications




Scanner Data
Market Basket analyses
One-to-one Marketing
Customer Support
 Logistics
 Financial
Services
 Health Care
Sumit Sarkar & Varghese Jacob, UT Dallas
DRIVERS FOR DATA
WAREHOUSING
 Business Drivers
• Quick reaction to business needs
• Extensive analysis requirements
• E.g, Understanding customer behavior,
Assessing performance in different
markets, etc.
Sumit Sarkar & Varghese Jacob, UT Dallas
DRIVERS FOR DATA
WAREHOUSING
 Data Drivers
• Patterns hidden in transactional data
• Need for “Unified View” of data from
“Islands” of data
External data needs
Disparate platforms and formats
Multiple database technologies
Different application priorities
Sumit Sarkar & Varghese Jacob, UT Dallas
DATA WAREHOUSE
ARCHITECTURE
O perational
databases
Re ports
Data
Extraction
Routine s
Exte rnal data
Data
Warehouse
Q uering and
Analysis
Software
Q ue rie s
Sumit Sarkar & Varghese Jacob, UT Dallas
KEY FUNCTIONS: FRONT
END
 Data Retrieval and Analysis
• Querying Tools
• Multidimensional Analysis
• Data Visualization
• Knowledge Discovery and Data Mining
Sumit Sarkar & Varghese Jacob, UT Dallas
KEY FUNCTIONS: BACK END
 Data Preparation
• Extraction
• Transformation
• Cleansing
• Data Integration/Reconciliation
• Aggregation
• Summarization
• Loading
Sumit Sarkar & Varghese Jacob, UT Dallas
MULTIDIMENSIONAL
ANALYSIS
 Business Analysis based on
• Products
 Trends
• Customers
in Sales of Product P23 (Jazz Drives)
 Purchase
• Dates
•
habits of Customer C1234 (Jill)
 Seasonality in
sales
Combination of dimensions (Product & Date)
 Sales of Product P23 in Spring
 On-Line Analytical Processing (OLAP): Techniques and
tools used to analyze data in warehousing environments
Sumit Sarkar & Varghese Jacob, UT Dallas
MULTIDIMENSIONAL
DATABASES
Product
e
m
i
T
Multi
Dimensional Data
Customer
 OLAP tools: On Line Analytical Processing
• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
Sumit Sarkar & Varghese Jacob, UT Dallas
RELATIONAL: Example
Customer
Product
Month
Quantity
Jill
Jill
Joe
Joe
Joe
John
Jaz
Floppy
Jaz
Zip
Floppy
Jaz
9
9
9
9
9
9
2
1
1
2
7
15
Sumit Sarkar & Varghese Jacob, UT Dallas
MULTIDIMENSIONAL: Example
e Quantity Sold in September
m
Ti
2
1
15
Zip
0
2
0
Floppy
1
7
0
Product
Jaz
Jill

Joe
John
Customer
Implications for viewing data from an end-user standpoint: Slice the data on
relevant dimensions
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Warehouse Vs Operational
(Transactional) Database
Data Warehouse
Operational DB
Integrated
Limited Integration
Non-volatile
Continuously Updated
Ad-hoc Retrieval
Predictable Retrieval
Historical data
Current data
Sumit Sarkar & Varghese Jacob, UT Dallas
Development Approaches
 Data
Centered
 Variety of Applications Supported
 Application
Centered (Data Marts)
 Quick implementation
 Low cost (relatively)
 High payoffs
 Staged
Data Marts
Sumit Sarkar & Varghese Jacob, UT Dallas
DATA WAREHOUSING LIFE
CYCLE
 Define Business Strategy
 Identify Scope
 Design Architecture/Infrastructure
 Determine Data Model
 Identify Data Extraction Routines
 Implementation & Testing
 On-going application development
Process, not Product
Sumit Sarkar & Varghese Jacob, UT Dallas
KNOWLEDGE DISCOVERY & DATA
MINING (KDD)
 Process of Identifying
• Novel
• Valid
• Potentially Useful
Patterns
in Data
Sumit Sarkar & Varghese Jacob, UT Dallas
KDD: Concepts

Pattern
 If Income > $20,000 and House-owner then person will
not default payment

Novel
 Relationships previously unknown to users

Valid
 Discovered patterns should be true for new situations in
general
 Potentially
Useful
 Interesting
 Actionable
Sumit Sarkar & Varghese Jacob, UT Dallas
Sample Examples of Data
Mining

Campaign management by propensity modeling


Risk management by credit scoring


People who bought Data Mining books are likely to buy Data Warehousing
books
Customized products and care


Customers with a credit score of 700 or less default at an average rate of 5%
Cross-selling by affinity prediction


Young Males are 75% more likely to respond to sports memorabilia offers
than young girls
Residents in 90210 prefer Geo Metros
Assignment of call center agents

People who call on Monday Mornings call due to billing problems
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Mining: Analytical
Techniques
 Regression
(Linear and Non-Linear)
 Multivariate Analysis
 Discriminant Analysis
 Clustering
 Multi-Dimensional Analysis (MDA)
 Decision
Tree Induction
 Neural Networks
 Dependency Graphs & Belief Networks
 Genetic Algorithms
Sumit Sarkar & Varghese Jacob, UT Dallas
Web Mining: Emerging Area

Types of Web mining
 Web content mining
 Web structure mining
 Web usage mining

Interesting problems on Web mining
 Mining what Web search engine finds
 Weblog mining (usage, access)
 Web document classification
 New Applications
 Personalization of web sites
 Targeted advertising on web sites
 Product recommendations (Books, Movies etc.)
Sumit Sarkar & Varghese Jacob, UT Dallas
COLLABORATIVE
FILTERING
 Very
popular in Recommendation systems
 E.g., when a customer of Amazon.com logs
on, they are provided recommendations for
books, music, etc. How is it done?
 Case-based approach used to identify other
customers with similar profiles (e.g.,
common purchases, demographics, etc.)
 Recommendation made based on purchases
of other such customers
Sumit Sarkar & Varghese Jacob, UT Dallas
Data Mining Myths
 Data
Mining is a Magic Wand
 All patterns are useful
 You don’t need to understand statistical
tools to use data mining
Sumit Sarkar & Varghese Jacob, UT Dallas
Knowledge Management


Case - KPMG Peat Marwick U.S.: One Giant Brain
Lead discussion group 0224
•
•
•

Which primary activity of KPMG’s value chain is most significantly
impacted by it’s Knowledge Management initiative?
Why was the Shadow Partner initiative not implemented? Was that a
good decision?
How should KPMG approach the technology and systems
implementation issue?
Lead discussion group 0225
 How should knowledge and intellectual assessment be performed?
• How can KPMG integrate knowledge management activities into its
business processes?
• What cultural changes are necessary? How can they be achieved?
Sumit Sarkar & Varghese Jacob, UT Dallas