Transcript COMP5331

COMP5331
Knowledge Discovery in Databases
Overview
Prepared by Raymond Wong
Presented by Raymond Wong
raywong@cse
COMP5331
1
Course Details

Reference books/materials:



Papers
Data Mining: Concepts and Techniques.
Jiawei Han and Micheline Kamber. Morgan
Kaufmann Publishers (3rd edition)
Introduction to Data Mining. Pang-Ning
Tan, Michael Steinbach, Vipin Kumar
Boston : Pearson Addison Wesley (2006)
COMP5331
2
Area


DB or AI
This course can count towards one of
the areas ONLY and cannot be double
counted towards the required credits
COMP5331
3
Course Details

Grading Scheme:



Assignment 30%
Project 30%
Final Exam 40%
COMP5331
4
Assignment

If the students can answer the selected
questions in class correctly,



for each corrected answer,
I will give him/her a coupon
This coupon can be used to waive one
question in an assignment
which means that s/he can get full marks for
this question without answering this question
COMP5331
5
Assignment

Guideline

For each assignment, each student can waive at most
one question only.



s/he can waive any question he wants and obtain full marks
for this question (no matter whether s/he answer this
question or not)
s/he may also answer this question. But, we will also mark it
but will give full marks to this question.
When the student submits the assignment,


please staple the coupon to the submitted assignment
please write down the question no. s/he wants to waive on
the coupon
COMP5331
6
Project




Each project is completed by a group.
The number of students in a group
depends on the class size.
The duration of each presentation
depends on the class size.
It will be announced soon.
COMP5331
7
Project

Project Type (One of the following)

Survey

Implementation-oriented Project
Your group only needs to
read about 2~5 papers
Your group only needs to
read about 1~2 papers

Research-oriented Project
You can read some papers
and conduct research
COMP5331
8
Project

Project Type (One of the following)
1.
2.
3.
Proposal
Presentation
Final report

Survey

Implementation-oriented Project

Research-oriented
1.
2.
3.
COMP5331
4.
1.
2.
Project 3.
4.
Full Score = 80%
Full Score = 90%
Proposal
Presentation
Final report
Coding
Proposal
Full Score = 100%
Presentation
Final report (containing your
proposed methodology)
9
Coding (if any)
Project

Project Topic



Some pre-selected topics/papers
Your own choice
For fairness, please do not choose the
topic which is closely related to your
own research
COMP5331
10
Exam


You are allowed to bring a calculator
with you.
Please remember to prepare a
calculator for the exam
COMP5331
11
Major Topics
1.
2.
3.
4.
5.
6.
7.
Association
Clustering
Classification
Data Warehouse
Data Mining over Data Streams
Web Databases
Multi-criteria Decision Making
COMP5331
12
1. Association
Customer
Apple
Orange
Raymond
Apple
Orange
Ada
Grace
Orange
Apple
Orange
…
…
…
Items/Itemsets
Frequency
Apple
2
Orange
3
Milk
1
{Apple, Orange}
2
{Orange,
Milk}
COMP5331
1
Milk
Milk
We are interested in
the items/itemsets
with frequency >= 2
…
Frequent Pattern
(or Frequent Item)
Frequent Pattern
(or Frequent Item)
Frequent Pattern
(or Frequent Itemset)
13
1. Association
Customer
Apple
Orange
Raymond
Apple
Orange
Ada
Grace
Orange
Apple
Orange
…
…
…
Items/Itemsets
Frequency
Apple
2
Orange
33
Milk
1
{Apple, Orange}
22
Milk
Milk
We are interested in
the items/itemsets
with frequency >= 2
Association Rule:
…Apple  Orange
1.
( 100% customers who buy
apple will probably buy orange.)
2. Orange  Apple
( 67%
customer who buy
orange will probably buy apple.)
Problem:
toMilk}
find all frequent
{Orange,
1 patterns and association rules
COMP5331
14
Major Topics
1.
2.
3.
4.
5.
6.
7.
Association
Clustering
Classification
Data Warehouse
Data Mining over Data Streams
Web Databases
Multi-criteria Decision Making
COMP5331
15
2. Clustering
Raymond
Louis
Wyman
…
Computer
100
History
40
90
20
45
95
…
Cluster 2
(e.g. High Score in History
and Low Score in Computer)
History
…
Cluster 1
(e.g. High Score in Computer
and Low Score in History)
Computer
Problem: to find all clusters
COMP5331
16
Major Topics
1.
2.
3.
4.
5.
6.
7.
Association
Clustering
Classification
Data Warehouse
Data Mining over Data Streams
Web Databases
Multi-criteria Decision Making
COMP5331
17
3. Classification
Suppose there is a person.
Race
Income
Child
Insurance
white
high
no
?
child=yes
root
child=no
100% Yes
0% No
Income=high
100% Yes
0% No
Income=low
0% Yes
100% No
Decision tree
COMP5331
18
Major Topics
1.
2.
3.
4.
5.
6.
7.
Association
Clustering
Classification
Data Warehouse
Data Mining over Data Streams
Web Databases
Multi-criteria Decision Making
COMP5331
19
4. Warehouse
Query
Users
Databases
Need to wait for a long time
(e.g., 1 day to 1 week)
Databases
Data
Warehouse
Users
Pre-computed results
COMP5331
20
Major Topics
1.
2.
3.
4.
5.
6.
7.
Association
Clustering
Classification
Data Warehouse
Data Mining over Data Streams
Web Databases
Multi-criteria Decision Making
COMP5331
21
5. Data Mining over Static
Data
1. Association
2. Clustering
3. Classification
Static
Data
COMP5331
Output
(Data Mining Results)
22
5. Data Mining over Data
Streams
1. Association
2. Clustering
3. Classification
…
Unbounded Data
COMP5331
Output
(Data Mining Results)
Real-time Processing
23
Major Topics
1.
2.
3.
4.
5.
6.
7.
Association
Clustering
Classification
Data Warehouse
Data Mining over Data Streams
Web Databases
Multi-criteria Decision Making
COMP5331
24
6. Web Databases
Raymond Wong
COMP5331
25
How to rank the webpages?
COMP5331
26
Major Topics
1.
2.
3.
4.
5.
6.
7.
Association
Clustering
Classification
Data Warehouse
Data Mining over Data Streams
Web Databases
Multi-criteria Decision Making
COMP5331
27
7. Multi-criteria Decision
Making
Suppose we want to look for a hotel which is close to a beach.
3 hotels
Hotel
Price
Distance to beach (km)
a
1000
4
b
2400
5
c
3000
1
COMP5331
Suppose we compare hotel
a and hotel b
We know that hotel a is “better”
than hotel b
because
1. Price of hotel a is smaller
2. Distance of hotel a is smaller
We have two attributes.
Which hotel should we
select?
28
7. Multi-criteria Decision
Making
Suppose we want to look for a hotel which is close to a beach.
3 hotels
Hotel
Price
Distance to beach (km)
a
1000
4
b
2400
5
c
3000
COMP5331
Suppose we compare hotel
a and hotel c
We cannot determine hotel a is “better”
1
than hotel c (wrt two attributes).
We cannot determine hotel c is “better”
than hotel a (wrt two attributes)..
This is because
1. Price of hotel a is smaller
2. Distance of hotel c is smaller
We have two attributes.
Which hotel should we
select?
29