Slide Template - Text Summarization

Download Report

Transcript Slide Template - Text Summarization

Chapter 10
Managing Databases with Oracle 9i
SII 654
Fall 2005
Fundamentals, Design,
and Implementation, 9/e
Introduction
 Oracle is the world’s most popular DBMS
 It is a powerful and robust DBMS that runs on
many different operating systems
 Oracle DBMS engine: Personal Oracle and
Enterprise Oracle
 Example of Oracle products
– SQL*Plus: a utility for processing SQL and creating
components like stored procedures and triggers
• PL/SQL is a programming language that adds programming
constructs to the SQL language
– Oracle Developer (Forms & Reports Builder)
– Oracle Designer
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/2
Creating an Oracle Database
 Installing Oracle
– Install Oracle 9i Client to use an already
created database
– Install Oracle 9i Personal Edition to create your
own databases
 Three ways to create an Oracle database
– Via the Oracle Database Configuration
Assistant
– Via the Oracle-supplied database creation
procedures
– Via the SQL CREATE DATABASE command
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/3
SQL*Plus
 Oracle SQL*Plus or the Oracle Enterprise
Manager Console may be used to manage an
Oracle database
 SQL*Plus is a text editor available in all Oracle
 Except inside quotation marks of strings, Oracle
commands are case-insensitive
 The semicolon (;) terminates a SQL statement
 The right-leaning slash (/) executes SQL
statement stored in Oracle buffer
 SQL*Plus can be used to
– Enter SQL statements
– Submit SQL files created by text editors, e.g., notepad, to
Oracle
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/4
Example: SQL*Plus Prompt
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/5
SQL*Plus Buffer
 SQL*Plus keeps the current statements in
a multi-line buffer without executing it
 LIST is used to see the contents of the buffer
– LIST [line_number] is used to change the current line
 CHANGE/astring/bstring/ is used to change the
contents of the current line
– astring = the string you want to change
– bstring = what you want to change it to
 Example: change/Table_Name/*/
– ‘Table_Name’ is replaced with ‘*’
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/6
Example: SQL*Plus Buffer
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/7
Creating Tables
 Some of the SQL-92 CREATE TABLE statements
need to be modified for Oracle
– Oracle does not support a CASCADE UPDATE
constraint
– Int data type is interpreted by Oracle as Number(38)
– Varchar data type is interpreted as VarChar2
– Money or currency is defined in Oracle using the Numeric
data type
 Oracle sequences must be used for surrogate
keys
 DESCRIBE or DESC command is used to view
table status
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/8
Oracle Data Types
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/9
Oracle Sequences
 A sequence is an object that generates a
sequential series of unique numbers
 It is the best way to work with surrogate keys in
Oracle
 Two sequence methods
– NextVal provides the next value in a sequence
– CurrVal provides the current value in a sequence
 Using sequences does not guarantee valid
surrogate key values because it is possible to
have missing, duplicate, or wrong sequence value
in the table
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/10
Example: Sequences
 Creating sequence
CREATE SEQUENCE CustID INCREMENT BY 1 START
WITH 1000;
 Entering data using sequence
INSERT INTO CUSTOMER
(CustomerID, Name, AreaCode, PhoneNumber)
VALUES (CustID.NextVal, ‘Mary Jones’, ‘350’, ‘555–1234);
 Retrieving the row just created
SELECT *
FROM CUSTOMER
WHERE CustomerID = CustID.CurrVal
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/11
DROP and ALTER Statements
 Drop statements may be used to remove
structures from the database
– DROP TABLE MYTABLE;
• Any data in the MYTABLE table will be lost
– DROP SEQUENCE MySequence;
 ALTER statement may be used to drop (add) a
column
– ALTER TABLE MYTABLE DROP COLUMN MyColumn;
– ALTER TABLE MYTABLE ADD C1 NUMBER(4);
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/12
TO_DATE Function
 Oracle requires dates in a particular format
 TO_DATE function may be used to identify the
format
– TO_DATE(‘11/12/2002’,’MM/DD/YYYY’)
• 11/12/2002 is the date value
• MM/DD/YYYY is the pattern to be used when interpreting
the date
 TO_DATE function can be used with the INSERT
and UPDATE statement to enter data
– INSERT INTO T1 VALUES (100, TO_DATE (‘01/05/02’,
‘DD/MM/YY’);
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/13
Creating Indexes
 Indexes are created to
– Enforce uniqueness on columns
– Facilitate sorting
– Enable fast retrieval by column values
 Good candidates for indexes are columns that are
frequently used with equal conditions in WHERE
clause or in a join
 Example:
– CREATE INDEX CustNameIdx ON CUSTOMER(Name);
– CREATE UNIQUE INDEX WorkUniqueIndex ON
WORK(Title, Copy, ArtistID);
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/14
Restrictions On Column
Modifications
 A column may be dropped at any time and
all data will be lost
 A column may be added at any time as
long as it is a NULL column
 To add a NOT NULL column
– Add a NULL column
– Fill the new column in every row with data
– Change its structure to NOT NULL
• ALTER TABLE T1 MODIFY C1 NOT NULL;
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/15
Creating Views
 SQL-92 CREATE VIEW command can be used to
create views in SQL*Plus
 Oracle allows the ORDER BY clause in view
definitions
 Only Oracle 9i supports the JOIN…ON syntax
 Example:
CREATE VIEW CustomerInterests AS
SELECT C.Name as Customer, A.Name as Artist
FROM CUSTOMER C JOIN CUSTOMER_ARTIST_INT I
ON C.CustomerID = I.CustomerID JOIN ARTIST A
ON I.ArtistID = A.ArtistID;
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/16
Enterprise Manager Console
 The Oracle Enterprise Manager Console
provides graphical facilities for managing
an Oracle database
 The utility can be used to manage
– Database structures such as tables and views
– User accounts, passwords, roles, and privileges
 The Manager Console includes a SQL
scratchpad for executing SQL statements
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/17
Application Logic
 Oracle database application can be
processed using
– Programming language to invoke Oracle DBMS
commands
– Stored procedures
– Start command to invoke database commands
stored in .sql files
– Triggers
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/18
Stored Procedures
 A stored procedure is a PL/SQL or Java program
stored within the database
 Stored procedures are programs that can
–
–
–
–
Have parameters
Invoke other procedures and functions
Return values
Raise exceptions
 A stored procedure must be compiled and stored
in the database
 Execute or Exec command is used to invoke a
stored procedure
– Exec Customer_Insert (‘Michael Bench’, ‘203’, ‘5552014’, ‘US’);
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/19
Example: Stored Procedure




Insert Figure 10-20
IN signifies input parameters
OUT signifies an output parameter
IN OUT signifies a parameter used for
both input and output
 Variables are declared after the
keyword AS
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/20
Triggers
 Oracle triggers are PL/SQL or Java
procedures that are invoked when
specified database activity occurs
 Triggers can be used to
–
–
–
–
–
Enforce a business rule
Set complex default values
Update a view
Perform a referential integrity action
Handle exceptions
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/21
Triggers (cont.)
 Trigger types
– A command trigger will be fired once per SQL
command
– A row trigger will be fired once for every row
involved in the processing of a SQL command
• Three types of row triggers: BEFORE, AFTER, and
INSTEAD OF
• BEFORE and AFTER triggers are placed on tables
while INSTEAD OF triggers are placed on views
• Each trigger can be fired on insert, update, or delete
commands
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/22
Data Dictionary
 Oracle maintains a data dictionary of
metadata
 The metadata of the dictionary itself are
stored in the table DICT
SELECT Table_Name, Comments
FROM DICT
WHERE Table_Name LIKE (‘%TABLES%’);
 USER_TABLES contains information about
user or system tables
DESC USER_TABLES;
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/23
Example Oracle Metadata
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/24
Concurrency Control
 Oracle processes database changes by
maintaining a System Change Number (SCN)
– SCN is a database-wide value that is incremented by
Oracle when database changes are made
 With SCN, SQL statements always read a
consistent set of values; those that were
committed at or before the time the statement was
started
 Oracle only reads committed changes; it will never
reads dirty data
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/25
Oracle Transaction Isolation
 Oracle supports the following transaction
isolation levels
– Read Committed: Oracle’s default transaction
isolation level since it never reads uncommitted
data changes
– Serializable: Dirty reads are not possible,
repeated reads yield the same results, and
phantoms are not possible
– Read Only: All statements read consistent data.
No inserts, updates, or deletions are possible
– Explicit locks: Not recommended
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/26
Oracle Security
 Oracle security components:
– An ACCOUNT is a user account
– A PROFILE is a set of system resource maximums that
are assigned to an account
– A PRIVILEGE is the right to perform a task
– A ROLE consists of groups of PRIVILEGEs and other
ROLEs
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/27
Account System Privileges
 Each ACCOUNT can be allocated many SYSTEM
PRIVILEGEs and many ROLEs
 An ACCOUNT has all the PRIVILEGEs
– That have been assigned directly
– Of all of its ROLEs
– Of all of its ROLEs that are inherited through ROLE
connections
 A ROLE can have many SYSTEM PRIVILEGEs and
it may also have a relationship to other ROLEs
 ROLEs simplify the administration of the database
– A set of privileges can be assigned to or removed from a
ROLE just once
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/28
Account Authentication
 Accounts can be authenticated by
– Password
– The host operating system
 Password management can be
specified via PROFILEs
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/29
Oracle Recovery Facilities
 Three file types for Oracle recovery:
– Datafiles contain user and system data
– ReDo log files contain logs of database
changes
• OnLine ReDo files are maintained on disk and
contain the rollback segments from recent database
changes
• Offline or Archive ReDo files are backups of the
OnLine ReDo files
– Control files describe the name, contents, and
locations of various files used by Oracle
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/30
Oracle Recovery Facilities
(cont.)
 Oracle can operate in either ARCHIVELOG
or NOARCHIVELOG mode
– If running in ARCHIVELOG mode, Oracle logs
all changes to the database
– When the OnLine ReDo files fill up, they are
copied to the Archive ReDo files
 The Oracle Recovery Manager
(RMAN) is a utility program used to
create backups and to perform
recovery
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/31
Types of Failure
 Oracle recovery techniques depend on the type of
failure
– An application failure due to application logic errors
– An instance failure occurs when Oracle itself fails due to
an operating system or computer hardware failure
• Oracle can recover from application and instance
failure without using the archived log file
– A media failure occurs when Oracle is unable to write to a
physical file because of a disk failure or corrupted files
• The database is restored from a backup
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/32
Oracle Backup Facilities
 Two kinds of backups
 A consistent backup: Database activity must be
stopped and all uncommitted changes have been
removed from the datafiles
– Cannot be done if the database supports 24/7 operations
 An inconsistent backup: Backup is made while
Oracle is processing the database
– An inconsistent backup can be made consistent by
processing an archive log file
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/33
Chapter 10
Managing Databases with Oracle 9i
Instructor: Dragomir R. Radev
Winter 2005
Fundamentals, Design,
and Implementation, 9/e
KDD and Data Mining
Instructor: Dragomir R. Radev
Winter 2005
Fundamentals, Design,
and Implementation, 9/e
The big problem
 Billions of records
 A small number of interesting patterns
 “Data rich but information poor”
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/36
Data mining
 Knowledge discovery
 Knowledge extraction
 Data/pattern analysis
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/37
Types of source data




Relational databases
Transactional databases
Web logs
Textual databases
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/38
Association rules
 65% of all customers who buy beer
and tomato sauce also buy pasta and
chicken wings
 Association rules: X Y
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/39
Association analysis
 IF
20 < age < 30
AND
20K < INCOME < 30K
 THEN
– Buys (“CD player”)
 SUPPORT = 2%, CONFIDENCE =
60%
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/40
Basic concepts
 Minimum support threshold
 Minimum confidence threshold
 Itemsets
 Occurrence frequency of an itemset
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/41
Association rule mining
 Find all frequent itemsets
 Generate strong association rules
from the frequent itemsets
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/42
Support and confidence
 Support (X)
 Confidence (X  Y) = Support(X+Y) /
Support (X)
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/43
Example
TID
T100
T200
T300
T400
T500
T600
T700
T800
T900
List of item IDs
I1, I2, I5
I2, I4
I2, I3
I1, I2, I4
I1, I3
I2, I3
I1, I3
I1, I2, I3, I5
I1, I2, I3
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/44
Example (cont’d)







Frequent itemset l = {I1, I2, I5}
I1 AND I2  I5
C = 2/4 = 50%
I1 AND I5  I2
I2 AND I5  I1
I1  I2 AND I5
I2  I1 AND I5
I3  I1 AND I2
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/45
Example 2
TID
date
items
T100
10/15/99
{K, A, D, B}
T200
10/15/99
{D, A, C, E, B}
T300
10/19/99
{C, A, B, E}
T400
10/22/99
{B, A, D}
min_sup = 60%, min_conf = 80%
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/46
Correlations
 Corr (A,B) = P (A OR B) / P(A) P (B)
 If Corr < 1: A discourages B (negative
correlation)
 (lift of the association rule A  B)
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/47
Contingency table
Game
^Game
Sum
Video
4,000
3,500
7,500
^Video
2,000
500
2,500
Sum
6,000
4,000
10,000
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/48
Example




P({game}) = 0.60
P({video}) = 0.75
P({game,video}) = 0.40
P({game,video})/(P({game})x(P({video
})) = 0.40/(0.60 x 0.75) = 0.89
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/49
Example 2
hotdogs
^hotdogs Sum
hamburgers
2000
500
2500
^hamburgers
1000
1500
2500
Sum
3000
2000
5000
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/50
Classification using
decision trees
 Expected information need
 I (s1, s2, …, sm) = -
S
pi log (pi)
 s = data samples
 m = number of classes
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/51
RID
Age
Income
student
credit
buys?
1
<= 30
High
No
Fair
No
2
<= 30
High
No
Excellent
No
3
31 .. 40
High
No
Fair
Yes
4
> 40
Medium
No
Fair
Yes
5
> 40
Low
Yes
Fair
Yes
6
> 40
Low
Yes
Excellent
No
7
31 .. 40
Low
Yes
Excellent
Yes
8
<= 30
Medium
No
Fair
No
9
<= 30
Low
Yes
Fair
Yes
10
> 40
Medium
Yes
Fair
Yes
11
<= 30
Medium
Yes
Excellent
Yes
12
31 .. 40
Medium
No
Excellent
Yes
13
31 .. 40
High
Yes
Fair
Yes
14
> 40
Medium
no
excellent
no
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/52
Decision tree induction
 I(s1,s2)
= I(9,5) =
= - 9/14 log 9/14 – 5/14 log 5/14 =
= 0.940
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/53
Entropy and information gain
•E(A) =
S
S1j + … + smj
s
I (s1j,…,smj)
Entropy = expected information based on the partitioning into
subsets by A
Gain (A) = I (s1,s2,…,sm) – E(A)
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/54
Entropy
 Age <= 30
s11 = 2, s21 = 3, I(s11, s21) = 0.971
 Age in 31 .. 40
s12 = 4, s22 = 0, I (s12,s22) = 0
 Age > 40
s13 = 3, s23 = 2, I (s13,s23) = 0.971
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/55
Entropy (cont’d)
 E (age) =
5/14 I (s11,s21) + 4/14 I (s12,s22) + 5/14 I
(S13,s23) = 0.694
 Gain (age) = I (s1,s2) – E(age) = 0.246
 Gain (income) = 0.029, Gain (student) =
0.151, Gain (credit) = 0.048
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/56
Final decision tree
age
> 40
31 .. 40
student
credit
yes
no
yes
no
yes
excellent
no
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
fair
yes
Chapter 10/57
Other techniques
 Bayesian classifiers
 X: age <=30, income = medium,
student = yes, credit = fair
 P(yes) = 9/14 = 0.643
 P(no) = 5/14 = 0.357
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/58
Example
 P (age < 30 | yes) = 2/9 = 0.222
P (age < 30 | no) = 3/5 = 0.600
P (income = medium | yes) = 4/9 = 0.444
P (income = medium | no) = 2/5 = 0.400
P (student = yes | yes) = 6/9 = 0.667
P (student = yes | no) = 1/5 = 0.200
P (credit = fair | yes) = 6/9 = 0.667
P (credit = fair | no) = 2/5 = 0.400
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/59
Example (cont’d)
 P (X | yes) = 0.222 x 0.444 x 0.667 x 0.667 =
0.044
 P (X | no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
 P (X | yes) P (yes) = 0.044 x 0.643 = 0.028
 P (X | no) P (no) = 0.019 x 0.357 = 0.007
 Answer: yes/no?
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/60
Predictive models
 Inputs (e.g., medical history, age)
 Output (e.g., will patient experience
any side effects)
 Some models are better than others
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/61
Principles of data mining
 Training/test sets
 Error analysis and overfitting
error
test
training
 Cross-validation
input size
 Supervised vs. unsupervised methods
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/62
Representing data
 Vector space
credit
pay off
default
salary
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/63
Decision surfaces
credit
pay off
default
salary
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/64
Decision trees
credit
pay off
default
salary
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/65
Linear boundary
credit
pay off
default
salary
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/66
kNN models
 Assign each element to the closest
cluster
 Demos:
– http://www2.cs.cmu.edu/~zhuxj/courseproject/knnd
emo/KNN.html
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/67
Other methods




Decision trees
Neural networks
Support vector machines
Demos
– http://www.cs.technion.ac.il/~rani/LocBo
ost/
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/68
arff files
@data
sunny,85,85,FALSE,no
@relation weather
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/69
Weka
http://www.cs.waikato.ac.nz/ml/weka
Methods:
rules.ZeroR
bayes.NaiveBayes
trees.j48.J48
lazy.IBk
trees.DecisionStump
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/70
kMeans clustering
 http://www.cc.gatech.edu/~dellaert/html/sof
tware.html
 java weka.clusterers.SimpleKMeans -t
data/weather.arff
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/71
More useful pointers
 http://www.kdnuggets.com/
 http://www.twocrows.com/booklet.htm
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/72
More types of data mining




Classification and prediction
Cluster analysis (clustering)
Outlier analysis
Evolution analysis
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/73
Clustering
 Exclusive/overlapping clusters
 Hierarchical/flat clusters
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/74
Methods
 Single-linkage
– One common pair is sufficient
– disadvantages: long chains
 Complete-linkage
– All pairs have to match
– Disadvantages: too conservative
 Average-linkage
 Centroid-based (online)
– Look at distances to centroids
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/75
k-means
 Needed: small number k of desired
clusters
 hard vs. soft decisions
 Example: Weka
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/76
k-means
1 initialize cluster centroids to arbitrary vectors
2 while further improvement is possible do
3 for each document d do
4
find the cluster c whose centroid is closest to d
5
assign d to cluster c
6 end for
7 for each cluster c do
8
recompute the centroid of cluster c based on its
documents
9 end for
10 end while
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/77
Example
 Cluster the following vectors into two
groups:
– A = <1,6>
– B = <2,2>
– C = <4,0>
– D = <3,3>
– E = <2,5>
– F = <2,1>
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/78
Demos
 http://vivisimo.com/
 http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_h
tml/AppletKM.html
 http://cgm.cs.mcgill.ca/~godfried/student_projects/bonnef_kmeans
 http://www.cs.washington.edu/research/imagedatabase/dem
o/kmcluster
 http://www.cc.gatech.edu/~dellaert/html/software.html
 http://www-2.cs.cmu.edu/~awm/tutorials/kmeans11.pdf
 http://www.ece.neu.edu/groups/rpl/projects/kmeans/

% cd /data2/tools/weka-3-3-4
% export CLASSPATH=/data2/tools/weka-3-3-4/weka.jar
% java weka.clusterers.SimpleKMeans -t data/weather.arff
Copyright © 2004 Database Processing: Fundamentals, Design, and Implementation, 9/e
by David M. Kroenke
Chapter 10/79