Decision Tree Construction

Download Report

Transcript Decision Tree Construction

The Software Infrastructure
for Electronic Commerce
Databases and Data Mining
Lecture 1:
A Manager’s View of Database Technology
Johannes Gehrke
[email protected]
http://www.cs.cornell.edu/johannes
Goal Of Lectures One and Two
• Understand the basics of modern data
management
• DBMS technology
• Data models
• Transaction processing versus decision
support
• An data management architecture of an
e-business
Goals of the DBMS Lectures (Contd.)
• Gain technical understanding to feel
confident to ask questions.
• Learn to ask the right questions.
The Big Picture
WWW Site
Visitor
Internal User
INTRANET,
VPN
THE WEB
Public Web Server
Business
Transaction
Server
Main
Memory
Cache
DBMS
Internal
Web Server
Data
Warehouse
Application
Server
Web Pages with Database Contents
• Web pages contain the results of database
queries. How do we generate such pages?
• Web server creates a new process for a
program interacts with the database.
• Web server communicates with this program
via some data exchange protocol
• Program generates result page with content
from the database
Business Transaction Server
• Application server: Piece of software between
the web server and the applications
• Functionality:
• Hold a set of pre-forked threads or processes for
performance
• Database connection pooling (reuse a set of existing
connections)
• Integration of heterogeneous data sources
• Transaction management involving several data
sources
• Session management
Other Server-Side Processing
• Java Servlets: Java programs that run on
the server and interact with the server
through a well-defined API
• JavaBeans: Reusable software
components written in Java
• Java Server Pages and Active Server
Pages: Code inside a web page that is
interpreted by the web server
What Happens If You Click On A Link?
www.company.com
1
Company web server
www.company.com
Johannes Gehrke
2
3
Hidden link
CLIENT SIDE
4
Banner web server
www.banners.com
Additional log server
www.mycookies.com
SERVER SIDE
Web Server Log Data
Common Log Format:
host ident authuser date request status bytes
•
•
•
•
host: domain name or IP address of the client
date: date and time of the request
request: the request line from the client
status: three digit status code returned to the
client
• bytes: number of bytes in the object returned to
the client
Web Server Log Data (Contd.)
Usually, even more data available:
• URL of the referring server
• Name and version of the browser
• Name of the file requested
• Time to serve the request
• IP address of the requesting browser
• Cookie
Cookies
• The communication protocol (HTTP) between
your browser and the web server is stateless.
(Compare to a vending machine.)
• Remedy: Store information (called a cookie) at
the browser of the user
• Example cookie (from www.google.com):
PREFID=3415aaaf73b7bfe3,TM=956724506|google.com/|0|261887
833632111634|2662722800|29339450|*
(name=value|domain|secure|expiration date|expiration time|last
used date|last used time)
Cookies (Contd.)
• A cookie is always associated with a specific
domain (amazon.com, bn.com,
preferences.com, doubleclick.net, cornell.edu)
• Cookies have expiration dates
• The secrets are in the (name=value) pairs
(usually encrypted):
PREFID=3415aaaf73b7bfe3,TM=956724506
• Cookies have their own life on your computer:
• \Documents and Settings\UserName\Cookies,
\Windows\Cookies
\Windows\Profiles\UserName\Cookies
\ProgramFiles\Netscape\Users\Default\cookies.txt
Cookies (Contd.)
• Applications of cookies:
• Shopping carts
• “Log in once” (example: New York Times)
• Personalization (amazon.com: Welcome back,
Johannes)
• General tracking of user behavior
• User privacy
• Other personalization/tracking techniques:
Hidden fields in html pages, unique page names
• Online demonstration
And Then You Click Purchase …
(Simplified process)
• Insert customer data into the database/check
customer data
• Check order availability
• Insert order data into the database
• Return order confirmation number to the
customer
All this data is stored in a database system
(DBMS).
Why Store Data in a DBMS?
• Benefits
• Transactions (concurrent data access,
recovery from system crashes)
• High-level abstractions for data access,
manipulation, and administration
• Data integrity and security
• Performance and scalability
Transactions
• A transaction is an atomic sequence of
database actions (reading, writing or updating a
database object).
• Each transaction must leave the DB in a
consistent state (if DB is consistent when the
transaction starts).
• The ACID Properties:
•
•
•
•
Atomicity
Consistency
Isolation
Durability
Example Transaction: Online Store
Your purchase transaction:
• Atomicity: Either the complete purchase
happens, or nothing
• Consistency: The inventory and internal
accounts are updated correctly
• Isolation: It does not matter whether other
customers are also currently making a purchase
• Durability: Once you have received the order
confirmation number, your order information is
permanent, even if the site crashes
Transactions (Contd.)
A transaction will commit after completing
all its actions, or it could abort (or be
aborted by the DBMS) after executing
some actions.
Example Transaction: ATM
You withdraw money from the ATM machine
• Atomicity
• Consistency
• Isolation
• Durability
Commit versus Abort?
What are reasons for commit or abort?
Concurrency Control
(Start: A=$100; B=$100)
Consider two transactions:
• T1: START, A=A+100, B=B-100, COMMIT
• T2: START, A=1.06*A, B=1.06*B, COMMIT
The first transaction is transferring $100 from B’s
account to A’s account. The second transaction
is crediting both accounts with a 6% interest
payment.
Example (Contd.)
(Start: A=$100; B=$100)
• Consider a possible interleaving (schedule):
T1: A=A+$100, B=B-$100 COMMIT
T2:
A=1.06*A, B=1.06*B COMMIT
End result: A=$106; B=$0
• Another possible interleaving:
T1: A=A+100,
B=B-100 COMMIT
T2:
A=1.06*A, B=1.06*B COMMIT
End result: A=$106; B=$6
The second interleaving is incorrect! Concurrency control of a database
system makes sure that the second schedule does not happen.
Ensuring Atomicity
• DBMS ensures atomicity (all-or-nothing
property) even if the system crashes in
the middle of a transaction.
• Idea: Keep a log (history) of all actions
carried out by the DBMS while executing :
• Before a change is made to the database, the
corresponding log entry is forced to a safe
location.
• After a crash, the effects of partially executed
transactions are undone using the log.
Recovery
• A DBMS logs all elementary events on
stable storage. This data is called the log.
• The log contains everything that changes
data: Inserts, updates, and deletes.
• Reasons for logging:
• Need to UNDO transactions
• Recover from a systems crash
Recovery: Example
(Simplified process)
• Insert customer data into the database
• Check order availability
• Insert order data into the database
• Write recovery data (the log) to stable
storage
• Return order confirmation number to the
customer
Tips
• Transactions, concurrency control, and recovery
are important aspects of the functionality of a
database system
• Tips for capacity planning:
• Load influences level of concurrency 
Determines hardware requirements
• Insufficient resources for concurrency and recovery
can force serialization of transactions 
Bad performance
• Need ample space for the log, often mirrored onto
two disks at the same time
Why Store Data in a DBMS?
• Benefits
• Transactions (concurrent data access,
recovery from system crashes)
• High-level abstractions for data access,
manipulation, and administration
• Data integrity and security
• Performance and scalability
Data Independence
Applications should be insulated from how data is
structured and stored. Thus the DBMS needs to
provide high-level abstractions to applications!
View 1
View 2
View 3
Conceptual Schema
Physical Schema
Data Model
• A data model is a collection of concepts
for describing data.
• Examples:
• ER model (used for conceptual modeling)
• Relational model, object-oriented model,
object-relational model (actually implemented
in current DBMS)
The Relational Data Model
A relational database is a set of relations. Turing
Award (Nobel Price in CS) for Codd in 1980 for
his work on the relational model
• Example relation:
Customers(cid: integer, name: string, byear: integer, state: string)
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
The Relational Model: Terminology
•
•
•
•
Relation instance and schema
Field (column)
Record or tuple (row)
Cardinality
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
Customer Relation (Contd.)
• In your enterprise, you are more likely to have a
schema similar to the following:
Customers(cid, identifier, nameType, salutation,
firstName, middleNames, lastName,
culturalGreetingStyle, gender, customerType, degrees,
ethnicity, companyName, departmentName, jobTitle,
primaryPhone, primaryFax, email, website, building,
floor, mailstop, addressType, streetNumber, streetName,
streetDirection, POBox, city, state, zipCode, region,
country, assembledAddressBlock, currency,
maritalStatus, bYear, profession)
Product Relation
• Relation schema:
Products(pid: integer, pname: string, price: float,
category: string)
• Relation instance:
pid
1
2
3
4
pname
Intel PIII-700
MS Office Pro
IBM DB2
Thinkpad 600E
price
300.00
500.00
5000.00
5000.00
category
hardware
software
software
hardware
Transaction Relation
• Relation schema:
Transactions(
tid: integer,
tdate: date,
cid: integer,
pid: integer)
• Relation instance:
tid
1
1
2
3
3
tdate
1/1/2000
1/1/2000
1/1/2000
2/1/2000
2/1/2000
cid pid
1
1
1
2
1
4
2
3
2
4
TIPS
• Any enterprise has an abundance of
different relations in its DBMS. Good
management of this meta-data is crucial:
•
•
•
•
Documentation
Evolution
Assignment of responsibilities
Security
• (ERP packages usually create several
thousand relations)
The Relational DBMS Market
Market Share Of RDBMS In 1998
(Gartner Group 4/1999)
70.0%
60.0%
50.0%
40.0%
30.0%
20.0%
10.0%
0.0%
NT
Other
NCR
Informix
Sybase
ASA/ASA
IBM DB2
Microsoft
SQL
Oracle 7/8
and Lite
Unix
The Relational DBMS Market (Contd.)
Best-Selling Client/Server DBMS
(Computer Reseller News 8/1999)
50%
43%
41%
40%
32%
30%
25%
22%
16%
20%
10%
6%6%
4%5%
Sybase
IBM
0%
Microsoft
Oracle
Others
Jan-Jun 98
Jan-Jun 99
The Relational DBMS Market (Contd.)
Market Share Of Database Revenues
(Computer Reseller News, 5/1999)
35.0%
30.0%
25.0%
20.0%
In 1997
15.0%
In 1998
10.0%
5.0%
0.0%
IBM
Oracle Microsoft Informix
Sybase
Other
The Object-Oriented Data Model
• Richer data model. Goal: Bridge impedance
mismatch between programming languages and
the database system.
• Example components of the data model:
Relationships between objects directly as
pointers.
• Result: Can store abstract data types directly in
the DBMS
•
•
•
•
Pictures
Geographic coordinates
Movies
CAD objects
Object-Oriented DBMS
• Advantages: Engineering applications
(CAD and CAM and CASE computer aided
software engineering), multimedia
applications.
• Disadvantages:
• Technology not as mature as relational DMBS
• Not suitable for decision support, weak
security
• Vendors are much smaller companies and
their financial stability is questionable.
Object-Oriented DBMS (Contd.)
Vendors:
• Gemstone (www.gemstone.com)
• Objectivity (www.objy.com)
• Object Design (www.odi.com)
• POET (www.poet.com)
• Versant (www.versant.com)
Organizations:
• OMDG: Object Database Management Group
(www.omdg.org)
• OMG: Object Management Group (www.omg.org)
The OO DBMS Market
Forecast Revenues For OO Systems Software
Worldwide (IDC 3/1999)
1,000.0
Million $
800.0
600.0
400.0
200.0
0.0
1997
1998
1999
2000
Year
2001
2002
Object-Relational DBMS
• Mixture between the object-oriented and
the object-relational data model
• Combines ease of querying with ability to
store abstract data types
• Conceptually, the relational model, but every
field
• All major relational vendors are currently
extending their relational DBMS to the
object-relational model
Query Languages
We need a high-level language to describe and
manipulate the data
Requirements:
• Precise semantics
• Easy integration into applications written in
C++/Java/Visual Basic/etc.
• Easy to learn
• DBMS needs to be able to efficiently evaluate
queries written in the language
Relational Query Languages
• The relational model supports simple,
powerful querying of data.
• Precise semantics for relational queries
• Efficient execution of queries by the DBMS
• Independent of physical storage
SQL: Structured Query Language
• Developed by IBM (System R) in the
1970s
• ANSI standard since 1986:
•
•
•
•
SQL-86
SQL-89 (minor revision)
SQL-92 (major revision, current standard)
SQL-99 (major extensions)
Why Store Data in a DBMS?
• Benefits
• Transactions (concurrent data access,
recovery from system crashes)
• High-level abstractions for data access,
manipulation, and administration
• Data integrity and security
• Performance and scalability
Integrity Constraints
• Integrity Constraints (ICs): Condition that
must be true for any instance of the
database.
• ICs are specified when schema is defined.
• ICs are checked when relations are modified.
• A legal instance of a relation is one that
satisfies all specified ICs.
• DBMS should only allow legal instances.
• Example: Domain constraints.
Primary Key Constraints
• A set of fields is a superkey for a relation if no
two distinct tuples can have same values in all
key fields.
• A set of fields is a key if the set is a superkey,
and none of its subsets is a superkey.
• Example:
• {cid, name} is a superkey for Customers
• {cid} is a key for Customers
• Where do primary key constraints come from?
Primary Key Constraints (Contd.)
• Can there be more than one key for a
relation?
• What is the maximum number of
superkeys for a relation?
• What is the primary key of the Products
relation? How about the Transactions
relation?
Foreign Keys, Referential Integrity
• Foreign key : Set of fields in one relation that is
refers to a unique tuple in another relation.
(The foreign key must be a superkey of the
second relation.)
• Example: The field cid in the Transactions
relation is a foreign key referring to Customers.
• If all foreign key constraints are enforced, we
say that referential integrity is achieved.
• No dangling references.
• Compare to links in HTML.
Foreign Keys: Example
• The pid field of the Transactions relation
refers to the cid field of the Customer
relation.
tid
1
1
2
3
3
tdate
1/1/2000
1/1/2000
1/1/2000
2/1/2000
2/1/2000
cid pid
1
1
1
2
1
4
2
3
2
4
cid
1
2
3
name
Jones
Smith
Smith
byear
1960
1974
1950
state
NY
CA
NY
Enforcing Referential Integrity
• What should be done if a Transaction tuple with
a non-existent Customer id is inserted? (Reject
it!)
• What should be done if a Customer tuple is
deleted?
• Also delete all Transaction tuples that refer to it.
• Disallow deletion of a Customer tuple that has
associated Transactions.
• Set cid in Transactions tuples to a default or special
cid.
• SQL supports all three choices
Where do ICs Come From?
• ICs are based upon the semantics of the realworld enterprise that is being described in the
database relations.
• We can check a database instance to see if an
IC is violated, but we can NEVER infer that an
IC is true by looking at an instance.
• An IC is a statement about all possible instances!
• From example, we know state cannot be a key, but
the assertion that cid is a key is given to us.
• Key and foreign key ICs are very common; a
DBMS supports more general ICs.
Security
• Secrecy: Users should not be able to see
things they are not supposed to.
•
E.g., A student can’t see other students’
grades.
• Integrity: Users should not be able to
modify things they are not supposed to.
•
E.g., Only instructors can assign grades.
• Availability: Users should be able to see
and modify things they are allowed to.
Discretionary Access Control
• Based on the concept of access rights or
privileges for objects (tables and views), and
mechanisms for giving users privileges (and
revoking privileges).
• Creator of a table or a view automatically gets
all privileges on it.
• DMBS keeps track of who subsequently gains and
loses privileges, and ensures that only requests from
users who have the necessary privileges (at the time
the request is issued) are allowed.
• Users can grant and revoke privileges
Role-Based Authorization
• In SQL-92, privileges are actually assigned to
authorization ids, which can denote a single
user or a group of users.
• In SQL:1999 (and in many current systems),
privileges are assigned to roles.
• Roles can then be granted to users and to other
roles.
• Reflects how real organizations work.
• Illustrates how standards often catch up with “de
facto” standards embodied in popular systems.
Security Is Important
Financial estimate of database losses, by activity
in 1998
(ENT/CSI/FBI Computer Crime Survey, 5/1999)
Activity
Theft of Proprietary Information
System Penetration by Outsiders
Financial Fraud
Unauthorized Insider Access
Laptop Theft
Total
Loss in million $
28.51
13.39
11.24
50.57
0.16
103.87
Why Store Data in a DBMS?
• Benefits
• Transactions (concurrent data access,
recovery from system crashes)
• High-level abstractions for data access,
manipulation, and administration
• Data integrity and security
• Performance and scalability
Why Parallel Access To Data?
At 10 MB/s
1.2 days to scan
1 Terabyte
10 MB/s
1,000 x parallel
1.5 minute to scan.
1 Terabyte
Parallelism:
Divide a big problem
into many smaller ones
to be solved in parallel.
Parallel DBMS: Intro
• Parallelism is natural to DBMS processing
• Pipeline parallelism: many machines each doing one step in a
multi-step process.
• Partition parallelism: many machines doing the same thing to
different pieces of data.
• Both are natural in DBMS!
Pipeline
Partition
Any
Sequential
Program
Sequential
Any
Sequential
Sequential
Program
Any
Sequential
Program
Any
Sequential
Program
outputs split N ways, inputs merge M ways
DBMS: The || Success Story
• DBMSs are the most (only?) successful
application of parallelism.
• Every major DBMS vendor has some || server
• Workstation manufacturers now depend on || DB
server sales.
• Reasons for success:
•
•
•
•
Bulk-processing (= partition ||-ism).
Natural pipelining.
Inexpensive hardware can do the trick!
Users/app-programmers don’t need to think in ||
• More resources means
proportionally less
time for given amount
of data.
• Scale-Up
• If resources increased
in proportion to
increase in data size,
time is constant.
sec./Xact
(response time)
• Speed-Up
Xact/sec.
(throughput)
Some || Terminology
Ideal
degree of ||-ism
Ideal
degree of ||-ism
Architecture Issue: Shared What?
Shared Memory
(SMP)
CLIENTS
Shared Disk
Shared Nothing
(network)
CLIENTS
CLIENTS
Processors
Memory
Hard to program
Cheap to build
Easy to scaleup
Easy to program
Expensive to build
Difficult to scaleup
Sybase
Oracle
Compaq, Teradata, SP2
Distributed Database Systems
• Data is stored at several sites, each managed by
a DBMS that can run independently.
• Distributed data independence: Users should
not have to know where data is located
(extends physical and logical data independence
principles).
• Distributed transaction atomicity: Users should
be able to write transactions accessing multiple
sites just like local transactions.
Types of Distributed Databases
• Homogeneous: Every site runs same type
of DBMS.
• Heterogeneous: Different sites run
different DBMSs (different RDBMSs or
even non-relational DBMSs).
Gateway
DBMS1
DBMS2
DBMS3
Distributed DBMS Architectures
QUERY
• Client-Server
Client ships query
to single site. All query
processing at server.
- Thin vs. fat clients.
- Set-oriented
communication,
client side caching.
v
Collaborating-Server
Query can span multiple
sites.
CLIENT
SERVER
CLIENT
SERVER
SERVER
SERVER
SERVER
QUERY
SERVER
DBMS and Performance
• Efficient implementation of all database
operations
• Indexes: Auxiliary structures that allow
fast access to the portion of data that a
query is about
• Smart buffer management
• Query optimization: Finds the best way to
execute a query
• Automatic query parallelization
Performance Tips
• Bad connection management is the number one
scalability problem (can make two order of
magnitude difference)
• Reuse once-executed queries as much as
possible
• Physical performance database tuning is crucial
for good performance
• Performance tuning is an art and a science
• Recent trend: DBMS include tuning wizards for
automatic performance tuning
• Include space for indexes during space planning
Summary Of DBMS Benefits
• Transactions
• ACID properties, concurrency control, recovery
• High-level abstractions for data access
• Data models
• Data integrity and security
• Key constraints, foreign key constraints, access
control
• Performance and scalability
• Parallel DBMS, distributed DBMS, performance tuning
The Future of Data Exchange: Markup
From HTML to XML
Markup Languages: HTML
• Simple markup language
• Text is annotated with language
commands called tags, usually consisting
of a start tag and an end tag
HTML Example: Book Listing
<HTML><BODY>
Fiction:
<UL><LI>Author: Milan Kundera</LI?
<LI>Title: Identity</LI>
<LI>Published: 1998</LI>
</UL>
Science:
<UL><LI>Author: Richard Feynman</LI>
<LI>Title: The Character of Physical Law</LI>
<LI>Hardcover</LI>
</UL></BODY></HTML>
Beyond HTML: XML
• Extensible Markup Language (XML):
“Extensible HTML”
• Confluence of SGML and HTML: The
power of SGML with the simplicity of
HTML
• Allows definition of new markup
languages, called document type
declarations (DTDs)
XML: Language Constructs
• Elements
• Main structural building blocks of XML
• Start and end tag
• Must be properly nested
• Element can have attributes that provide
additional information about the element
• Entities: like macros, represent common text.
• Comments
• Document type declarations (DTDs)
Booklist Example in XML
<?XML version=“1.0” standalone=“yes”?>
<!DOCTYPE BOOKLIST SYSTEM “booklist.dtd”>
<BOOKLIST>
<BOOK genre=“Fiction”>
<AUTHOR>
<FIRST>Milan</FIRST><LAST>Kundera</LAST>
</AUTHOR>
<TITLE>Identity</TITLE>
<PUBLISHED>1998</PUBLISHED>
<BOOK genre=“Science” format=“Hardcover”>
<AUTHOR>
<FIRST>Richard</FIRST><LAST>Feynman</LAST>
</AUTHOR>
<TITLE>The Character of Physical Law</TITLE>
</BOOK></BOOKLIST>
XML: DTDs
• A DTD is a set of rules that defines the
elements, attributes, and entities that are
allowed in the document.
• An XML document is well-formed if it does
not have an associated DTD but it is
properly nested.
• An XML document is valid if it has a DTD
and the document follows the rules in the
DTD.
An Example DTD
<!DOCTYPE BOOKLIST [
<!ELEMENT BOOKLIST (BOOK)*>
<!ELEMENT BOOK (AUTHOR, TITLE, PUBLISHED?)>
<!ELEMENT AUTHOR (FIRST, LAST)>
<!ELEMENT FIRST (#PCDATA)>
<!ELEMENT LAST (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT PUBLISHED (#PCDATA)>
<!ATTLIST BOOK genre (Science|Fiction) #REQUIRED>
<!ATTLIST BOOK format (Paperback|Hardcover) “Paperback”>
]>
Domain-Specific DTDs
• Development of standardized DTDs for
specialized domains enables data exchange
between heterogeneous sources
• Example: Mathematical Markup Language
(MathML)
• Encodes mathematical material on the web
• In HTML: <IMG SRC=“xysquare.gif” ALT=“(x+y)^2”>
• In MathML:
<apply> <power/>
<apply> <plus/> <ci>x</ci> <ci>y</ci> </apply>
<cn>2</cn>
</apply>
XML: The Future
• Seamless integration of heterogeneous B2B data
sources and business processes
• XML-specified services, discovered by
autonomous software agents
• Industry-specific DTDs that simplify data
exchange
• “Only XML”-industry revenue forecasts:
1999: $31 millions, 2001: $93 millions
(Interactive Week, 4/1999)
Summary Of Lecture One
• Database systems are powerful, but also
complex pieces of software
• Transactions (concurrent data access, recovery from
system crashes)
• High-level abstractions for data access, manipulation,
and administration
• Data integrity and security
• Performance and scalability
• XML is the Future of Data Exchange
In The Second Lecture
• Database design
• A short introduction to SQL
• Transaction processing versus decision support:
On to the data warehouse!
Questions?