Transcript COMP1942
COMP1942
Exploring and Visualizing Data
Overview
Prepared by Raymond Wong
Presented by Raymond Wong
raywong@cse
Course Details
Instructor
Dr. Raymond Wong
TA
Kai Ho CHAN
Dandan LIN
Junqiu WEI
Course Details
Webpage
http://course.cse.ust.hk/comp1942/
Course Details
Lecture
Time: Monday (1:30pm - 2:50pm) and Friday (9:00am - 10:20am)
Venue: G010 (CYT Building)
Tutorial
Tutorials will be announced via email.
Time: Monday (12:30pm-1:20pm)
Venue: Room 5583 (LT 29-30) (Academic Building) or CSE Lab 3 (Rm 4213, Academic Building)
Time: Tuesday (12:30pm-1:20pm)
Venue: Rm 2302 (LT 17-18) (Academic Building) or CSE Lab 3 (Rm 4213, Academic Building)
Course Details
Textbook
Data Mining for Business Intelligence:
Concepts, Techniques, and Applications in
Microsoft Office Excel with XLMiner. Galit
Shmueli, Nitin R. Patel and Peter C. Bruce,
Wiley 2010 (2nd edition)
Course Details
Reference books/materials:
Data Mining: Concepts and Techniques.
Jiawei Han, Micheline Kamber and Jian PEI.
Morgan Kaufmann Publishers (3rd edition)
Introduction to Data Mining. Pang-Ning
Tan, Michael Steinbach, Vipin Kumar
Boston : Pearson Addison Wesley (2006)
Common Core Requirement
My ability to use quantitative methods to
define, analyze and solve problems in daily life
has been enhanced.
I am more able to process quantitative data
and to use the data to reach a conclusion in a
logical way.
The course has aroused my interest in learning
more about mathematical models or
quantitative methods.
Course Details
Grading Scheme:
Assignment 10%
Project 20%
In-class Participation 10%
Mid-Term Exam 20%
Final Exam 40%
Assignment
2 assignments
Assignment 1: Content before the mid-term exam
Assignment 2: Content after the mid-term exam
NOTE: No late submissions are allowed.
Assignment
If a student answers a selected question in class correctly,
I will give him/her one coupon for each correct answer.
Each coupon can be used to waive one question in an assignment,
which means that s/he gets full marks for that question
without answering it.
Assignment
Guideline
For each assignment, each student can waive at most one question.
S/he can waive any question s/he wants and obtain full marks
for it (no matter whether s/he answers it or not).
S/he may still answer the waived question. We will mark it,
but will give full marks for it in any case.
When submitting the assignment,
please staple the coupon to the submitted assignment
please write down on the coupon the number of the question to be waived
Project
Phase 1 (Excel file)
Phase 2 (Design Report)
Phase 3 (Final Report and Output files)
Project
You are required to form a group.
Each group contains 1 or 2 members.
Groups of 3 members are NOT allowed.
Please fill in the following information for each member at the
link
https://goo.gl/forms/WguJdkHO8TkpOFYn1
student ID
student name
Email
Each group needs to submit the grouping information only ONCE.
The group forming deadline is 15 Feb (Wed) 1pm.
Project
Data Mining Tool: XLMiner (in MS Excel)
Installed in CSE Lab 3 (Rm 4213)
All non-CSE and non-CPEG students
need to apply for a CSE account.
You can find the details on our course
webpage.
Project
In Phase 3 (the last phase), you are
required to hand in some output files
We will check the output files
You can use at most one coupon to
obtain full marks for all output files
Each group can use at most one coupon
Please staple your coupon to your
final report.
In-class Participation
In each lesson, you are required to bring
one of the following with you.
your smart phone installed with iPRS (Internet
enabled Personal Response System) or
your PRS device
In-class Participation
If you have a smart phone (Android/iOS),
please install the app called “HKUST iLearn” on it.
If you do not have a smart phone,
you have to borrow a PRS device.
Please visit the ITSC Service Desk at Rm 2021 (Lift 2)
to borrow one.
In-class Participation
In each lesson, you may be asked
some multiple-choice questions
(e.g., 1-3 questions).
You have to use your iPRS to answer
the questions.
In-class Participation
You can obtain 1 unit for in-class
participation when you answer a
question in class with your iPRS (no
matter whether you answer it correctly
or not)
Those questions may be in the midterm exam and the final exam.
In-class Participation
In some cases, your answer may not be recorded. For example:
Some students may be absent from class for some reason.
The iPRS system may fail to record your answer
(e.g., your smart device or the iPRS system crashes).
Therefore, you are required to obtain only 20 units in order to
obtain the full score (10%) for in-class
participation.
We will give at least 25 questions in the course.
Midterm and Final Exam
You are allowed to bring a calculator
with you.
Please remember to prepare a
calculator for the exam
Midterm Exam
In-class Midterm
Date: 17 March, 2017 (Fri)
Time: 9:00-10:20
Venue: G010 (CYT Building)
Rm 5619 (LT 31/32) (Academic Building)
Major Topics
In this course, you are expected to
learn something related to “Exploring
and Visualizing Data”.
Not only this!
In this course, you are expected to
learn how to solve problems and how to
analyze problems.
This is very important to your future.
Major Topics
1. Association
2. Clustering
3. Classification
4. Data Warehouse
5. Dimension Reduction
6. Web Databases
1. Association
Customer   Apple   Orange   Milk
Raymond    Apple   Orange
Ada                Orange   Milk
Grace      Apple   Orange
…          …       …        …

Items/Itemsets    Frequency
Apple             2           Frequent Pattern (or Frequent Item)
Orange            3           Frequent Pattern (or Frequent Item)
Milk              1
{Apple, Orange}   2           Frequent Pattern (or Frequent Itemset)
{Orange, Milk}    1
…

We are interested in the items/itemsets with frequency >= 2
1. Association
Association Rule:
1. Apple → Orange
(100% of customers who buy apple will probably buy orange.)
2. Orange → Apple
(67% of customers who buy orange will probably buy apple.)
Problem: to find all frequent patterns and association rules
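The counts and percentages on this slide can be reproduced with a short sketch. It is only an illustration, not the method taught later in the course: it enumerates every candidate itemset by brute force. The assignment of Milk to Ada is an assumption chosen to match the listed frequencies, and the function names `frequency`, `frequent_itemsets` and `confidence` are made up for this example.

```python
from itertools import combinations

# Purchases per customer. Ada buying Milk is an assumption that is
# consistent with the frequencies listed on the slide.
transactions = {
    "Raymond": {"Apple", "Orange"},
    "Ada": {"Orange", "Milk"},
    "Grace": {"Apple", "Orange"},
}


def frequency(itemset):
    """Number of customers whose purchases contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)


def frequent_itemsets(min_freq=2):
    """Brute-force search: test every non-empty combination of items."""
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, size):
            count = frequency(candidate)
            if count >= min_freq:
                result[candidate] = count
    return result


def confidence(lhs, rhs):
    """Fraction of customers buying lhs who also buy rhs."""
    return frequency(set(lhs) | set(rhs)) / frequency(lhs)


print(frequent_itemsets(2))
print(confidence({"Apple"}, {"Orange"}))            # rule 1: Apple -> Orange
print(round(confidence({"Orange"}, {"Apple"}), 2))  # rule 2: Orange -> Apple
```

Brute-force enumeration is fine for three items but grows exponentially with the number of items, which is why finding all frequent patterns is posed as a problem in its own right.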
2. Clustering
           Computer   History
Raymond    100        40
Louis      90         45
Wyman      20         95
…          …          …

[Scatter plot: x-axis Computer, y-axis History]
Cluster 1 (e.g. High Score in Computer and Low Score in History)
Cluster 2 (e.g. High Score in History and Low Score in Computer)
Problem: to find all clusters
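As a rough illustration of what "finding clusters" can mean, the sketch below groups the three students with a plain k-means loop. The slide does not name a clustering method, so k-means, the choice of two clusters, and the starting centroids are all assumptions made for this example. The scores are read as (Computer, History) pairs: Raymond (100, 40), Louis (90, 45), Wyman (20, 95).

```python
from math import dist

# Scores as (Computer, History) pairs, read from the slide's table.
scores = {"Raymond": (100, 40), "Louis": (90, 45), "Wyman": (20, 95)}


def k_means(points, initial_centroids, max_iter=100):
    """Alternate between assigning points and moving centroids until stable."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so we are done
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


points = list(scores.values())
# Assumed starting centroids: the first and the last student.
labels, centroids = k_means(points, [points[0], points[-1]])
clusters = dict(zip(scores, labels))
print(clusters)
print(centroids)
```

With these starting centroids, Raymond and Louis end up in one group (high Computer, low History) and Wyman in the other, matching Cluster 1 and Cluster 2 on the slide.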
3. Classification
Suppose there is a person.
Race    Income   Child   Insurance
white   high     no      ?

Decision tree
root
  child=yes: 100% Yes, 0% No
  child=no
    Income=high: 100% Yes, 0% No
    Income=low: 0% Yes, 100% No
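Reading the decision tree from the root down can be written as a few nested checks. The sketch below hard-codes the tree from the slide; it does not learn a tree from data. The function name `predict_insurance` and the dictionary keys are hypothetical names chosen for this example.

```python
def predict_insurance(person):
    """Follow the slide's decision tree from the root to a leaf."""
    if person["child"] == "yes":
        return "Yes"  # leaf: 100% Yes, 0% No
    # child == "no": the tree splits again on income.
    if person["income"] == "high":
        return "Yes"  # leaf: 100% Yes, 0% No
    return "No"       # leaf: 0% Yes, 100% No


# The person from the slide: race is not used by this tree.
person = {"race": "white", "income": "high", "child": "no"}
print(predict_insurance(person))
```

For the person on the slide, the path is root, child=no, Income=high, so the predicted value for Insurance is Yes.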
4. Data Warehouse
[Diagram: Users send a Query to the Databases]
Need to wait for a long time
(e.g., 1 day to 1 week)
[Diagram: Databases, Data Warehouse, Users; the Data Warehouse holds pre-computed results]
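The idea of pre-computed results can be sketched in a few lines. Everything here is hypothetical: the `sales` rows, the function names, and the choice of "total quantity per item" as the query are invented for illustration and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical raw records in the databases: (item, customer, quantity).
sales = [
    ("Apple", "Raymond", 3),
    ("Orange", "Raymond", 2),
    ("Orange", "Ada", 5),
    ("Milk", "Ada", 1),
    ("Apple", "Grace", 4),
]


def query_databases(item):
    """Answer a query by scanning every raw record (slow on large data)."""
    return sum(qty for name, _, qty in sales if name == item)


def build_warehouse(rows):
    """Scan the raw records once and store the total quantity per item."""
    totals = defaultdict(int)
    for item, _, qty in rows:
        totals[item] += qty
    return dict(totals)


warehouse = build_warehouse(sales)  # done ahead of time
print(query_databases("Orange"))    # scans all records at query time
print(warehouse["Orange"])          # simple lookup of a pre-computed result
```

Both calls return the same answer; the difference is that the warehouse pays the scanning cost once in advance, so each later query is only a lookup.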
5. Dimension Reduction
Suppose we have the
following data set
According to the data, we find the
following vectors (marked in red)
[Figure: the data set with vectors e1 and e2 marked in red]
Consider that the data points are
projected on e1
Suppose all data points are projected
on vector e1
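[Figure: the data points and their projections on e1, with vectors e1 and e2]
The offset between one data point and its projection on e1
corresponds to the information loss.
The offset between another data point and its projection
corresponds to another information loss.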
After all data points are projected on
vector e1
[Figure: the data points projected on e1, with vectors e1 and e2]
Thus, the total information loss is
small.
We can use only 1 dimension to
represent all data points (i.e., vector e1)
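The projection idea can be tried on a small example. The slides do not say how e1 is found, so this sketch assumes it is the first principal direction (as in principal component analysis) and computes it in closed form for 2-D data. The data points and the function names are hypothetical, and "information loss" is measured here as the squared distance from each point to the line through the mean along e1.

```python
from math import sqrt

# Hypothetical 2-D data lying close to a straight line.
data = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]


def first_principal_direction(points):
    """Return the mean and the unit vector e1 of largest spread (2-D only)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Largest eigenvalue of the 2x2 scatter matrix [[sxx, sxy], [sxy, syy]].
    lam1 = (sxx + syy) / 2 + sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    if sxy != 0:
        vx, vy = sxy, lam1 - sxx  # eigenvector belonging to lam1
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = sqrt(vx * vx + vy * vy)
    return (mx, my), (vx / norm, vy / norm)


def project(points, mean, e1):
    """Return the 1-D coordinates along e1 and the total squared loss."""
    coords, loss = [], 0.0
    for x, y in points:
        dx, dy = x - mean[0], y - mean[1]
        t = dx * e1[0] + dy * e1[1]        # coordinate of the point along e1
        coords.append(t)
        loss += dx * dx + dy * dy - t * t  # squared distance to the e1 line
    return coords, loss


mean, e1 = first_principal_direction(data)
coords, loss = project(data, mean, e1)
total = sum((x - mean[0]) ** 2 + (y - mean[1]) ** 2 for x, y in data)
print(e1, coords)
print(loss / total)  # share of the total spread lost by keeping only e1
```

For data like this, the lost share is well under 1%, which is the sense in which one dimension is enough to represent all the points.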
6. Web Databases
Raymond Wong
How to rank the webpages?