Decision Tree Construction - Department of Computer Science

download report

Transcript Decision Tree Construction - Department of Computer Science

The Software Infrastructure
for Electronic Commerce
Databases and Data Mining
Lecture 2: Data Warehousing
Johannes Gehrke
[email protected]
http://www.cs.cornell.edu/johannes
Overview
• Conceptual design
• Querying relational data
• Dimensional data modeling: OLTP versus
decision support
We will design and analyze a data mart with
click-stream data as an illustrative
example.
The Database Design Process
• Requirement analysis
• Conceptual design using the entityrelationship (ER) model
• Schema refinement
• Normalization
• Physical tuning
Overview of Database Design
• Conceptual design: (ER Model is used at this
stage.)
• What are the entities and relationships in the
enterprise?
• What information about these entities and
relationships should we store in the database?
• What are the integrity constraints or business rules
that hold?
• A database `schema’ in the ER Model can be
represented pictorially (ER diagrams).
• Can map an ER diagram into a relational schema.
ER Model Basics
name
• Entity:
ssn
lot
Real-world object
distinguishable from
Employees
other objects. An
entity is described (in
DB) using a set of attributes.
• Entity Set: A collection of similar entities. E.g.,
all employees.
• All entities in an entity set have the same set of
attributes.
• Each entity set has a key.
• Each attribute has a domain.
ER Model Basics (Contd.)
• Relationship: Association among two or
more entities. E.g., Johannes works in
the computer science department.
• Relationship set: Collection of similar
relationships.
since
name
ssn
dname
lot
Employees
did
Works_In
budget
Departments
ER-Model Basics (Contd.)
• An n-ary relationship
set R relates n entity
sets E1 ... En; each
relationship in R
involves entities
e1, ..., en
• Same entity set could
participate in different
relationship sets, or in
different “roles” in
same set.
name
ssn
address
Employees
supervisor
subordinate
Reports_To
Key Constraints
• Consider Works_In: An employee can work in
many departments; a department can have
many employees.
• In contrast, each deptartment has at most one
manager, according to the key constraint on
Manages.
since
name
ssn
dname
lot
Employees
did
Manages
budget
Departments
Key Constraints (Contd.)
• Several types of key-constraints:
1-to-1
1-to Many
Many-to-1
Many-to-Many
Key constraints: Examples
• Example Scenario 1: An inventory database
contains information about parts and
manufacturers. Each part is constructed by
exactly one manufacturer.
• Example Scenario 2: A customer database
contains information about customers and sales
persons. Each customer has exactly one primary
sales person.
• What do the ER diagrams look like?
Participation Constraints
Does every department have a manager? If so, this is a
participation constraint: The participation of
Departments in Manages is said to be total (vs. partial).
(Compare with foreign key constraints.)
since
name
ssn
dname
did
lot
Employees
Manages
Works_In
since
budget
Departments
Participation Constraints: Examples
• Example Scenario 1 (Contd.): Each part is
constructed by exactly one manufacturer.
• Example Scenario 2: Each customer has
exactly one primary sales person.
ER Modeling: Case Study
Drugwarehouse.com has offered you a free life-time supply
of prescription drugs (no questions asked) if you design
its database schema. Given the rising cost of health
care, you agree. Here is the information that you
gathered:
• Patients are identified by their SSN, and we also store
their names and age.
• Doctors are identified by their SSN, and we also store
their names and specialty.
• Each patient has one primary care physician, and we
want to know since when the patient has been with her
primary care physician.
• Each doctor has at least one patient.
ER Modeling: Summary
• After the requirement analysis, the conceptual
design develops a high-level description of the
data
• Main components:
•
•
•
•
Entities
Relationships
Attributes
Integrity constraints: Key constraints and
participation constraints
• We covered only a subset
Querying Relational Databases
• “The” relational query language: SQL
Structured Query Language
SELECT target-list
FROM relation-list
WHERE qualifications
• relation-list: A list of relation names
• target-list: A list of attributes of relations in relation-list
• qualification: Comparisons (Attr op const or Attr1 op
Attr2, where op is one of <,>,=,<=,>=,<>) combined
using AND, OR and NOT.
Recall: Customer Relation
• Relation schema:
Customers(cid: integer, name: string, byear: integer,
state: string)
• Relation instance:
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
Example Query
• Example Schema:
Customers(
cid: integer,
name: string,
byear: integer,
state: string)
• Query:
SELECT
Customers.cid,
Customers.name,
Customers.byear,
Customers.state
FROM Customers
WHERE cid = 1950
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
cid
3
name
Smith
byear
1950
state
NY
Example Query
SELECT
Customers.cid,
Customers.name,
Customers.byear,
Customers.state
FROM Customers
WHERE cid = 1960
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
cid
1
name
Jones
byear
1960
state
NY
Range Variables
SELECT
Customers.cid,
Customers.name,
Customers.byear,
Customers.state
FROM Customers C
WHERE cid = 1960
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
cid
1
name
Jones
byear
1960
state
NY
Example Query
A range variable is a substitute for a relation name.
• Query:
SELECT C.cid, C.name
FROM Customers C
WHERE C.byear = 1960
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
cid
1
name
Jones
Common Shortcuts
• “*” is a shortcut for
all fields
• Query:
SELECT *
FROM Customers C
WHERE
C.name=“Smith”
cid
1
2
3
name
Jones
Smith
Smith
cid
2
3
name
Smith
Smith
byear
1960
1974
1950
byear
1974
1950
state
NY
CA
NY
state
CA
NY
Example Query
SELECT C.state
FROM Customers C
cid
1
2
3
name
Jones
Smith
Smith
state
NY
CA
NY
byear
1960
1974
1950
state
NY
CA
NY
DISTINCT Keyword
DISTINCT:
Eliminates duplicates in the output
Query:
SELECT DISTINCT C.state
FROM Customers C
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
state
NY
CA
Example Query
• Query:
SELECT C.cid, C.name, C.byear, C.state
FROM Customers C
WHERE C.name=“Smith”
• Answer:
cid
2
3
name
Smith
Smith
byear
1974
1950
state
CA
NY
Selection Predicates
• Query:
SELECT C.cid, C.name, C.byear, C.state
FROM Customers C
WHERE C.name=“Smith” AND C.state=“NY”
• Answer:
cid
3
name
Smith
byear
1950
state
NY
Combining Relations: Joins
SELECT P.pid, P.pname
FROM Products P, Transactions T
WHERE P.pid = T.tid AND T.tdate < “2/1/2000”
pid
1
2
3
4
pname
Intel PIII-700
MS Office Pro
IBM DB2
Thinkpad 600E
price
300.00
500.00
5000.00
5000.00
pid
1
2
4
category
hardware
software
software
hardware
pname
Intel PIII-700
MS Office Pro
Thinkpad 600E
tid
1
1
2
3
3
tdate
1/1/2000
1/1/2000
1/1/2000
2/1/2000
2/1/2000
cid pid
1
1
1
2
1
4
2
3
2
4
Example Query
• Query: “Find the names
and ids of customers who
have made purchases
before February 1, 2000.”
• SQL:
SELECT C.name, C.id
FROM Customers C,
Transactions T
WHERE C.cid = T.cid AND
T.tdate < “2/1/2000”
cid
1
2
3
tid
1
1
2
3
3
name
Jones
Smith
Smith
byear
1960
1974
1950
tdate
1/1/2000
1/1/2000
1/1/2000
2/1/2000
2/1/2000
state
NY
CA
NY
cid pid
1
1
1
2
1
4
2
3
2
4
Example Query
• Query: “Find the names and ids of the
customers who have purchased MS Office
Pro.”
• SQL:
SELECT C.name, C.id
FROM Customers C, Transactions T,
Products P
WHERE C.cid = T.cid AND T.pid = P.pid
AND P.pname = “MS Office Pro”
Aggregate Operators
• SQL allows computation of summary
statistics for a collection of records.
• Operators:
•
•
•
•
•
MAX (maximum value)
MIN (minimum value)
SUM
AVG (average)
COUNT (distinct number)
Example Queries
• Query: “Tell me the minimum and maximum
price of all products.”
• SQL:
SELECT MAX(P.price), MIN(P.price)
FROM Products P
• Query: “How many different products do we
have?”
• SQL:
SELECT COUNT(*)
FROM Products
GROUP BY and HAVING
• Instead of applying aggregate operators to all
(qualifying) tuples, apply aggregate operators to
each of several groups of tuples.
• Example: For each year, show the number of
customers who are born that year.
• Conceptually, many queries, one query per year.
• Suppose we know that years are between 1900 and
2000, we can write 1001 queries that look like this:
SELECT COUNT(*)
FROM Customers C
WHERE C.byear = i
Queries With GROUP BY and HAVING
• Extended SQL Query Structure:
SELECT
[DISTINCT] target-list
FROM
relation-list
WHERE
tuple-qualification
GROUP BY grouping-list
HAVING
group-qualification
Example: GROUP BY
• Example: For each year, show the number
of customers who are born that year.
• SQL:
SELECT C.byear, COUNT(*)
FROM Customers C
GROUP BY C.byear
Example
• Query: “For each customer, list the price
of the most expensive product she
purchased.”
• SQL:
SELECT C.cid, C.name, MAX(P.price)
FROM Customers C, Transactions T, Products P
WHERE C.cid = T.cid and T.pid = P.pid
GROUP BY C.cid, C.name
Example
• Query: “For each product that has been
sold at least twice, output how often it
has been sold so far.”
• SQL:
SELECT P.pid, P.pname, COUNT(*)
FROM Products P, Transactions T
WHERE P.pid = T.pid
GROUP BY P.pid, P.pname
HAVING COUNT(*) > 1
Notes on GROUP BY and HAVING
SELECT
FROM
WHERE
GROUP BY
HAVING
[DISTINCT] attribute-list, aggregate-list
relation-list
record-qualification
grouping-list
group-qualification
• The query generates one output record
per group. A group is a set of tuples that
have the same value for all attributes in
grouping-list.
Notes on GROUP BY and HAVING
SELECT
FROM
WHERE
GROUP BY
HAVING
[DISTINCT] attribute-list, aggregate-list
relation-list
record-qualification
grouping-list
group-qualification
• The attribute list must be a subset of the
grouping-list. Why? Each answer record
corresponds to a group, and there must be a
single value per group.
• The aggregate-list generates one value per
group.
• What about the group-qualification?
Summary: SQL
• Powerful query language for relational database
systems
• End-users usually do not write SQL, but
graphical user front-ends generate SQL queries
• SQL completely isolates users from the physical
structure of the DBMS  You can tune your
DBMS for performance and your applications do
not change (This is physical data
independence!)
From OLTP To The Data Warehouse
• Traditionally, database systems stored data
relevant to current business processes
• Old data was archived or purged
• Your database stores the current snapshot of
your business:
•
•
•
•
Your current customers with current addresses
Your current inventory
Your current orders
My current account balance
The Data Warehouse
• The data warehouse is a historical collection of
all your data for analysis purposes
• Examples:
• Current customers versus all customers
• Current orders versus history of all orders
• Current inventory versus history of all shipments
• Thus the data warehouse stores information
that might be useless for the operational part of
your business
Terminology
•
•
•
•
OLTP (Online Transaction Processing)
DSS (Decision Support System)
DW (Data Warehouse)
OLAP (Online Analytical Processing)
OLTP Architecture
Clients
OLTP
DBMSs
Cash
Register
Product Purchase
Inventory Update
DW Architecture
Clients
Information Sources
Data Warehouse
Server
OLAP Servers
MOLAP
OLTP
DBMSs
Analysis
Query/Reporting
Other Data Extract
Clean
Sources
Transform
Aggregate
Load
Update
Data Mining
Data Marts
ROLAP
OLTP Versus Data Warehousing
OLTP
Data Warehouse
Typical user
Clerical
Management
System usage
Regular business
Analysis
Workload
Read/Write
Read only
Types of queries
Predefined
Ad-hoc
Unit of interaction
Transaction
Query
Level of isolation required
High
Low
No of records accessed
<100
>1,000,000
No of concurrent users
Thousands
Hundreds
Focus
Data in and out
Information out
The Data Warehouse Market
• Market forecast for warehousing tools in
2002: $8 billion (IDC 7/1999)
• Revenue forecast for data warehouse
front-end tools:
1998 $1.6 billion
1999 $2.1 billion
2000 $2.8 billion
(Gartner Group 2/1999)
Dimensional Data Modeling
• Recall: The relational model.
The dimensional data model:
• Relational model with two different types of attributes
and tables.
• Attribute level: Facts (numerical, additive, dependent)
versus dimensions (descriptive, independent).
• Table level: Fact tables (large tables with facts and
foreign keys to dimensions) versus dimension tables
(small tables with dimensions).
Dimensional Modeling (Contd.)
• Fact (attribute):
Measures performance of
a business.
• Example facts:
• Sales, budget, profit,
inventory
• Example fact table:
• Transactions (timekey,
storekey, pkey, promkey,
ckey, units, price)
• Dimension (attribute):
Specifies a fact.
• Example dimension:
• Product, customer
data, sales person,
store
• Example dimension
table:
• Customer (ckey,
firstname, lastname,
address, dateOfBirth,
occupation, …)
OLTP Versus The Data Warehouse
OLTP
• Regular relational schema
• Update queries change
data in the database:
One instance of a
customer with a unique
customerID
• Queries return
information about the
current state of affairs
The data warehouse
• Dimensional model
• Update queries create
new records in the
database:
Several instances of the
same customer (with
different data), in case
the customer moved
• Queries return aggregate
information about
historical facts
Example: Dimensional Data Modeling
ckey
cid
name byear state
timekey Day
ckey timekey
pkey
Month
#units
pkey pid pname price
Customers:
Dimension Table
Year
$price
category
Time:
Dim. Table
Transactions:
Fact Table
Products:
Dim. Table
Another View: Star Schema
Time
Customers
Transactions
(timekey,
storekey,
pkey,
promkey,
ckey,
units,
price)
Promotions
Store
Products
Grain
• The grain defines the level of resolution of a
single record in the fact table.
• Example fact tables:
• Transactions (timekey, storekey, pkey, promkey, ckey,
units, price); grain is individual item
• Transactions(timekey, storekey, ckey, units, price);
grain is one market basket
• CustomerSessions(timekey, ckey, pagekey,
sessionkey, sessionSeconds, pagesVisited);
what is the grain?
Tips
• Fact tables are usually very large; they
can grow to several hundred GB and TB
• Dimension tables are usually smaller
(although can grow large, e.g., Customers
table), but they have many fields
• Queries over fact tables usually involve
many records
• Indexes usually increase the size of each
table a factor of three or four
Typical Queries
• SQL:
SELECT
FROM
WHERE
GROUP BY
HAVING
D1.d1, …, Dk.dk, agg1(F.f1,)
Dimension D1, …,
Dimension Dk, Fact F
D1.key = F.key1 AND … AND
Dk.keyk = F.keyk AND
otherPredicates
D1.d1, …, Dk.dk
groupPredicates
• This query is called a “Star Join”.
Example Query
• “Break down sales by year and category for the
last two years; show only categories with more
than $1M in sales.”
• SQL:
SELECT T.year, P.category, SUM(X.units * X.price)
FROM Time T, Products P, Transactions X
WHERE T.year = 1999 OR T.year = 2000
GROUP BY T.year, P.category
HAVING SUM(X.units * X.price) > 1000000
Design of the Clickstream Schema
• Dimensions:
• Date. One record for each calendar day.
• Time. One record per second (is that accurate
enough?).
• Customer. One record per customer. Several groups
of customers, depending on knowledge about the
customer:
• Anonymous web site visitor with cookie ID
• Know name, address, and customer ID is assigned
• Know demographics
• Can you think of other dimensions?
Schema Design (Contd.)
Often used “clickstream” dimensions:
• Page. Grain: One record per page or one record per
page type?
Page(pagekey, pageFunction, pageType, contentsType)
• Event. Captures what the users does at a page (enter
data, click on a link, etc.).
Event(eventkey, eventtype)
• Session. Attributes that characterize short-term behavior.
Session(sessionkey, sessiontype, sequencetype, context,
status)
• Referral. Captures how the user arrived at our site.
Referral(referralkey, referraltype, url, site, domain)
Schema Design (Contd.)
• Example clickstream fact table schema for
analyzing sessions:
Sessionfacts(datekey, timekey, ckey, totalSessionTime,
numPages, numItems, orderAmount)
• Foreign keys: datekey, timekey, ckey
• Facts: totalSessionTime, number of pages visited,
number of items ordered, total order amount
• Example clickstream fact table schema for
analyzing page use:
Pagefacts(datekey, timekey, ckey, pagekey, pagetime,
numItems)
• Foreign keys: datekey, timekey, ckey, pagekey
• Facts: pagetime, number of items orderd,
Building a Data Warehouse: Tips
• A data warehouse is a collection of data marts.
A data mart contains one dimensional star
schema that captures one business aspect.
• Notes:
• It is crucial to centralize the logical definition and
format of dimensions and facts (political challenge;
assign a dimension authority to each dimension).
Everything else is a distributed effort throughout your
company (technical challenge).
• Each data mart will have its own fact table, but we
will duplicate dimension tables over several data
marts.
Summary: Dimensional Design
• A dimensional data model is similar to a
relational data model, but we plan for
historical data
• Facts versus dimensions
• Queries are in star join format
Questions?
(In the third lecture: Data analysis)