Slide Template - Text Summarization
Download
Report
Transcript Slide Template - Text Summarization
Chapter 10
Managing Databases with Oracle 9i
SII 654
Fall 2005
Fundamentals, Design,
and Implementation, 9/e
Introduction
Oracle is the world’s most popular DBMS
It is a powerful and robust DBMS that runs on
many different operating systems
Oracle DBMS engine: Personal Oracle and
Enterprise Oracle
Example of Oracle products
– SQL*Plus: a utility for processing SQL and creating
components like stored procedures and triggers
• PL/SQL is a programming language that adds programming
constructs to the SQL language
– Oracle Developer (Forms & Reports Builder)
– Oracle Designer
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/2
Creating an Oracle Database
Installing Oracle
– Install Oracle 9i Client to use an already
created database
– Install Oracle 9i Personal Edition to create your
own databases
Three ways to create an Oracle database
– Via the Oracle Database Configuration
Assistant
– Via the Oracle-supplied database creation
procedures
– Via the SQL CREATE DATABASE command
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/3
SQL*Plus
Oracle SQL*Plus or the Oracle Enterprise
Manager Console may be used to manage an
Oracle database
SQL*Plus is a text editor available in all Oracle
Except inside quotation marks of strings, Oracle
commands are case-insensitive
The semicolon (;) terminates a SQL statement
The right-leaning slash (/) executes SQL
statement stored in Oracle buffer
SQL*Plus can be used to
– Enter SQL statements
– Submit SQL files created by text editors, e.g., notepad, to
Oracle
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/4
Example: SQL*Plus Prompt
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/5
SQL*Plus Buffer
SQL*Plus keeps the current statements in
a multi-line buffer without executing it
LIST is used to see the contents of the buffer
– LIST [line_number] is used to change the current line
CHANGE/astring/bstring/ is used to change the
contents of the current line
– astring = the string you want to change
– bstring = what you want to change it to
Example: change/Table_Name/*/
– ‘Table_Name’ is replaced with ‘*’
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/6
Example: SQL*Plus Buffer
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/7
Creating Tables
Some of the SQL-92 CREATE TABLE statements
need to be modified for Oracle
– Oracle does not support a CASCADE UPDATE
constraint
– Int data type is interpreted by Oracle as Number(38)
– Varchar data type is interpreted as VarChar2
– Money or currency is defined in Oracle using the Numeric
data type
Oracle sequences must be used for surrogate
keys
DESCRIBE or DESC command is used to view
table status
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/8
Oracle Data Types
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/9
Oracle Sequences
A sequence is an object that generates a
sequential series of unique numbers
It is the best way to work with surrogate keys in
Oracle
Two sequence methods
– NextVal provides the next value in a sequence
– CurrVal provides the current value in a sequence
Using sequences does not guarantee valid
surrogate key values because it is possible to
have missing, duplicate, or wrong sequence value
in the table
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/10
Example: Sequences
Creating sequence
CREATE SEQUENCE CustID INCREMENT BY 1 START
WITH 1000;
Entering data using sequence
INSERT INTO CUSTOMER
(CustomerID, Name, AreaCode, PhoneNumber)
VALUES (CustID.NextVal, ‘Mary Jones’, ‘350’, ‘555–1234);
Retrieving the row just created
SELECT *
FROM CUSTOMER
WHERE CustomerID = CustID.CurrVal
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/11
DROP and ALTER Statements
Drop statements may be used to remove
structures from the database
– DROP TABLE MYTABLE;
• Any data in the MYTABLE table will be lost
– DROP SEQUENCE MySequence;
ALTER statement may be used to drop (add) a
column
– ALTER TABLE MYTABLE DROP COLUMN MyColumn;
– ALTER TABLE MYTABLE ADD C1 NUMBER(4);
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/12
TO_DATE Function
Oracle requires dates in a particular format
TO_DATE function may be used to identify the
format
– TO_DATE(‘11/12/2002’,’MM/DD/YYYY’)
• 11/12/2002 is the date value
• MM/DD/YYYY is the pattern to be used when interpreting
the date
TO_DATE function can be used with the INSERT
and UPDATE statement to enter data
– INSERT INTO T1 VALUES (100, TO_DATE (‘01/05/02’,
‘DD/MM/YY’);
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/13
Creating Indexes
Indexes are created to
– Enforce uniqueness on columns
– Facilitate sorting
– Enable fast retrieval by column values
Good candidates for indexes are columns that are
frequently used with equal conditions in WHERE
clause or in a join
Example:
– CREATE INDEX CustNameIdx ON CUSTOMER(Name);
– CREATE UNIQUE INDEX WorkUniqueIndex ON
WORK(Title, Copy, ArtistID);
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/14
Restrictions On Column
Modifications
A column may be dropped at any time and
all data will be lost
A column may be added at any time as
long as it is a NULL column
To add a NOT NULL column
– Add a NULL column
– Fill the new column in every row with data
– Change its structure to NOT NULL
• ALTER TABLE T1 MODIFY C1 NOT NULL;
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/15
Creating Views
SQL-92 CREATE VIEW command can be used to
create views in SQL*Plus
Oracle allows the ORDER BY clause in view
definitions
Only Oracle 9i supports the JOIN…ON syntax
Example:
CREATE VIEW CustomerInterests AS
SELECT C.Name as Customer, A.Name as Artist
FROM CUSTOMER C JOIN CUSTOMER_ARTIST_INT I
ON C.CustomerID = I.CustomerID JOIN ARTIST A
ON I.ArtistID = A.ArtistID;
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/16
Enterprise Manager Console
The Oracle Enterprise Manager Console
provides graphical facilities for managing
an Oracle database
The utility can be used to manage
– Database structures such as tables and views
– User accounts, passwords, roles, and privileges
The Manager Console includes a SQL
scratchpad for executing SQL statements
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/17
Application Logic
Oracle database application can be
processed using
– Programming language to invoke Oracle DBMS
commands
– Stored procedures
– Start command to invoke database commands
stored in .sql files
– Triggers
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/18
Stored Procedures
A stored procedure is a PL/SQL or Java program
stored within the database
Stored procedures are programs that can
–
–
–
–
Have parameters
Invoke other procedures and functions
Return values
Raise exceptions
A stored procedure must be compiled and stored
in the database
Execute or Exec command is used to invoke a
stored procedure
– Exec Customer_Insert (‘Michael Bench’, ‘203’, ‘5552014’, ‘US’);
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/19
Example: Stored Procedure
Insert Figure 10-20
IN signifies input parameters
OUT signifies an output parameter
IN OUT signifies a parameter used for
both input and output
Variables are declared after the
keyword AS
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/20
Triggers
Oracle triggers are PL/SQL or Java
procedures that are invoked when
specified database activity occurs
Triggers can be used to
–
–
–
–
–
Enforce a business rule
Set complex default values
Update a view
Perform a referential integrity action
Handle exceptions
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/21
Triggers (cont.)
Trigger types
– A command trigger will be fired once per SQL
command
– A row trigger will be fired once for every row
involved in the processing of a SQL command
• Three types of row triggers: BEFORE, AFTER, and
INSTEAD OF
• BEFORE and AFTER triggers are placed on tables
while INSTEAD OF triggers are placed on views
• Each trigger can be fired on insert, update, or delete
commands
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/22
Data Dictionary
Oracle maintains a data dictionary of
metadata
The metadata of the dictionary itself are
stored in the table DICT
SELECT Table_Name, Comments
FROM DICT
WHERE Table_Name LIKE (‘%TABLES%’);
USER_TABLES contains information about
user or system tables
DESC USER_TABLES;
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/23
Example Oracle Metadata
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/24
Concurrency Control
Oracle processes database changes by
maintaining a System Change Number (SCN)
– SCN is a database-wide value that is incremented by
Oracle when database changes are made
With SCN, SQL statements always read a
consistent set of values; those that were
committed at or before the time the statement was
started
Oracle only reads committed changes; it will never
reads dirty data
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/25
Oracle Transaction Isolation
Oracle supports the following transaction
isolation levels
– Read Committed: Oracle’s default transaction
isolation level since it never reads uncommitted
data changes
– Serializable: Dirty reads are not possible,
repeated reads yield the same results, and
phantoms are not possible
– Read Only: All statements read consistent data.
No inserts, updates, or deletions are possible
– Explicit locks: Not recommended
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/26
Oracle Security
Oracle security components:
– An ACCOUNT is a user account
– A PROFILE is a set of system resource maximums that
are assigned to an account
– A PRIVILEGE is the right to perform a task
– A ROLE consists of groups of PRIVILEGEs and other
ROLEs
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/27
Account System Privileges
Each ACCOUNT can be allocated many SYSTEM
PRIVILEGEs and many ROLEs
An ACCOUNT has all the PRIVILEGEs
– That have been assigned directly
– Of all of its ROLEs
– Of all of its ROLEs that are inherited through ROLE
connections
A ROLE can have many SYSTEM PRIVILEGEs and
it may also have a relationship to other ROLEs
ROLEs simplify the administration of the database
– A set of privileges can be assigned to or removed from a
ROLE just once
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/28
Account Authentication
Accounts can be authenticated by
– Password
– The host operating system
Password management can be
specified via PROFILEs
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/29
Oracle Recovery Facilities
Three file types for Oracle recovery:
– Datafiles contain user and system data
– ReDo log files contain logs of database
changes
• OnLine ReDo files are maintained on disk and
contain the rollback segments from recent database
changes
• Offline or Archive ReDo files are backups of the
OnLine ReDo files
– Control files describe the name, contents, and
locations of various files used by Oracle
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/30
Oracle Recovery Facilities
(cont.)
Oracle can operate in either ARCHIVELOG
or NOARCHIVELOG mode
– If running in ARCHIVELOG mode, Oracle logs
all changes to the database
– When the OnLine ReDo files fill up, they are
copied to the Archive ReDo files
The Oracle Recovery Manager
(RMAN) is a utility program used to
create backups and to perform
recovery
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/31
Types of Failure
Oracle recovery techniques depend on the type of
failure
– An application failure due to application logic errors
– An instance failure occurs when Oracle itself fails due to
an operating system or computer hardware failure
• Oracle can recover from application and instance
failure without using the archived log file
– A media failure occurs when Oracle is unable to write to a
physical file because of a disk failure or corrupted files
• The database is restored from a backup
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/32
Oracle Backup Facilities
Two kinds of backups
A consistent backup: Database activity must be
stopped and all uncommitted changes have been
removed from the datafiles
– Cannot be done if the database supports 24/7 operations
An inconsistent backup: Backup is made while
Oracle is processing the database
– An inconsistent backup can be made consistent by
processing an archive log file
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/33
Chapter 10
Managing Databases with Oracle 9i
Instructor: Dragomir R. Radev
Winter 2005
Fundamentals, Design,
and Implementation, 9/e
KDD and Data Mining
Instructor: Dragomir R. Radev
Winter 2005
Fundamentals, Design,
and Implementation, 9/e
The big problem
Billions of records
A small number of interesting patterns
“Data rich but information poor”
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/36
Data mining
Knowledge discovery
Knowledge extraction
Data/pattern analysis
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/37
Types of source data
Relational databases
Transactional databases
Web logs
Textual databases
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/38
Association rules
65% of all customers who buy beer
and tomato sauce also buy pasta and
chicken wings
Association rules: X Y
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/39
Association analysis
IF
20 < age < 30
AND
20K < INCOME < 30K
THEN
– Buys (“CD player”)
SUPPORT = 2%, CONFIDENCE =
60%
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/40
Basic concepts
Minimum support threshold
Minimum confidence threshold
Itemsets
Occurrence frequency of an itemset
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/41
Association rule mining
Find all frequent itemsets
Generate strong association rules
from the frequent itemsets
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/42
Support and confidence
Support (X)
Confidence (X Y) = Support(X+Y) /
Support (X)
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/43
Example
TID
T100
T200
T300
T400
T500
T600
T700
T800
T900
List of item IDs
I1, I2, I5
I2, I4
I2, I3
I1, I2, I4
I1, I3
I2, I3
I1, I3
I1, I2, I3, I5
I1, I2, I3
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/44
Example (cont’d)
Frequent itemset l = {I1, I2, I5}
I1 AND I2 I5
C = 2/4 = 50%
I1 AND I5 I2
I2 AND I5 I1
I1 I2 AND I5
I2 I1 AND I5
I3 I1 AND I2
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/45
Example 2
TID
date
items
T100
10/15/99
{K, A, D, B}
T200
10/15/99
{D, A, C, E, B}
T300
10/19/99
{C, A, B, E}
T400
10/22/99
{B, A, D}
min_sup = 60%, min_conf = 80%
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/46
Correlations
Corr (A,B) = P (A OR B) / P(A) P (B)
If Corr < 1: A discourages B (negative
correlation)
(lift of the association rule A B)
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/47
Contingency table
Game
^Game
Sum
Video
4,000
3,500
7,500
^Video
2,000
500
2,500
Sum
6,000
4,000
10,000
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/48
Example
P({game}) = 0.60
P({video}) = 0.75
P({game,video}) = 0.40
P({game,video})/(P({game})x(P({video
})) = 0.40/(0.60 x 0.75) = 0.89
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/49
Example 2
hotdogs
^hotdogs Sum
hamburgers
2000
500
2500
^hamburgers
1000
1500
2500
Sum
3000
2000
5000
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/50
Classification using
decision trees
Expected information need
I (s1, s2, …, sm) = -
S
pi log (pi)
s = data samples
m = number of classes
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/51
RID
Age
Income
student
credit
buys?
1
<= 30
High
No
Fair
No
2
<= 30
High
No
Excellent
No
3
31 .. 40
High
No
Fair
Yes
4
> 40
Medium
No
Fair
Yes
5
> 40
Low
Yes
Fair
Yes
6
> 40
Low
Yes
Excellent
No
7
31 .. 40
Low
Yes
Excellent
Yes
8
<= 30
Medium
No
Fair
No
9
<= 30
Low
Yes
Fair
Yes
10
> 40
Medium
Yes
Fair
Yes
11
<= 30
Medium
Yes
Excellent
Yes
12
31 .. 40
Medium
No
Excellent
Yes
13
31 .. 40
High
Yes
Fair
Yes
14
> 40
Medium
no
excellent
no
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/52
Decision tree induction
I(s1,s2)
= I(9,5) =
= - 9/14 log 9/14 – 5/14 log 5/14 =
= 0.940
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/53
Entropy and information gain
•E(A) =
S
S1j + … + smj
s
I (s1j,…,smj)
Entropy = expected information based on the partitioning into
subsets by A
Gain (A) = I (s1,s2,…,sm) – E(A)
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/54
Entropy
Age <= 30
s11 = 2, s21 = 3, I(s11, s21) = 0.971
Age in 31 .. 40
s12 = 4, s22 = 0, I (s12,s22) = 0
Age > 40
s13 = 3, s23 = 2, I (s13,s23) = 0.971
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/55
Entropy (cont’d)
E (age) =
5/14 I (s11,s21) + 4/14 I (s12,s22) + 5/14 I
(S13,s23) = 0.694
Gain (age) = I (s1,s2) – E(age) = 0.246
Gain (income) = 0.029, Gain (student) =
0.151, Gain (credit) = 0.048
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/56
Final decision tree
age
> 40
31 .. 40
student
credit
yes
no
yes
no
yes
excellent
no
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
fair
yes
Chapter 10/57
Other techniques
Bayesian classifiers
X: age <=30, income = medium,
student = yes, credit = fair
P(yes) = 9/14 = 0.643
P(no) = 5/14 = 0.357
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/58
Example
P (age < 30 | yes) = 2/9 = 0.222
P (age < 30 | no) = 3/5 = 0.600
P (income = medium | yes) = 4/9 = 0.444
P (income = medium | no) = 2/5 = 0.400
P (student = yes | yes) = 6/9 = 0.667
P (student = yes | no) = 1/5 = 0.200
P (credit = fair | yes) = 6/9 = 0.667
P (credit = fair | no) = 2/5 = 0.400
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/59
Example (cont’d)
P (X | yes) = 0.222 x 0.444 x 0.667 x 0.667 =
0.044
P (X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
P (X | yes) P (yes) = 0.044 x 0.643 = 0.028
P (X | no) P (no) = 0.019 x 0.357 = 0.007
Answer: yes/no?
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/60
Predictive models
Inputs (e.g., medical history, age)
Output (e.g., will patient experience
any side effects)
Some models are better than others
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/61
Principles of data mining
Training/test sets
Error analysis and overfitting
error
test
training
Cross-validation
input size
Supervised vs. unsupervised methods
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/62
Representing data
Vector space
credit
pay off
default
salary
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/63
Decision surfaces
credit
pay off
default
salary
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/64
Decision trees
credit
pay off
default
salary
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/65
Linear boundary
credit
pay off
default
salary
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/66
kNN models
Assign each element to the closest
cluster
Demos:
– http://www2.cs.cmu.edu/~zhuxj/courseproject/knnd
emo/KNN.html
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/67
Other methods
Decision trees
Neural networks
Support vector machines
Demos
– http://www.cs.technion.ac.il/~rani/LocBo
ost/
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/68
arff files
@data
sunny,85,85,FALSE,no
@relation weather
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/69
Weka
http://www.cs.waikato.ac.nz/ml/weka
Methods:
rules.ZeroR
bayes.NaiveBayes
trees.j48.J48
lazy.IBk
trees.DecisionStump
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/70
kMeans clustering
http://www.cc.gatech.edu/~dellaert/html/sof
tware.html
java weka.clusterers.SimpleKMeans -t
data/weather.arff
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/71
More useful pointers
http://www.kdnuggets.com/
http://www.twocrows.com/booklet.htm
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/72
More types of data mining
Classification and prediction
Cluster analysis (clustering)
Outlier analysis
Evolution analysis
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/73
Clustering
Exclusive/overlapping clusters
Hierarchical/flat clusters
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/74
Methods
Single-linkage
– One common pair is sufficient
– disadvantages: long chains
Complete-linkage
– All pairs have to match
– Disadvantages: too conservative
Average-linkage
Centroid-based (online)
– Look at distances to centroids
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/75
k-means
Needed: small number k of desired
clusters
hard vs. soft decisions
Example: Weka
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/76
k-means
1 initialize cluster centroids to arbitrary vectors
2 while further improvement is possible do
3 for each document d do
4
find the cluster c whose centroid is closest to d
5
assign d to cluster c
6 end for
7 for each cluster c do
8
recompute the centroid of cluster c based on its
documents
9 end for
10 end while
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/77
Example
Cluster the following vectors into two
groups:
– A = <1,6>
– B = <2,2>
– C = <4,0>
– D = <3,3>
– E = <2,5>
– F = <2,1>
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/78
Demos
http://vivisimo.com/
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_h
tml/AppletKM.html
http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_kmeans
http://www.cs.washington.edu/research/imagedatabase/dem
o/kmcluster
http://www.cc.gatech.edu/~dellaert/html/software.html
http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
http://www.ece.neu.edu/groups/rpl/projects/kmeans/
% cd /data2/tools/weka-3-3-4
% export CLASSPATH=/data2/tools/weka-3-3-4/weka.jar
% java weka.clusterers.SimpleKMeans -t data/weather.arff
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/79