Transcript COMP1942

COMP1942
Exploring and Visualizing Data
Overview
Prepared by Raymond Wong
Presented by Raymond Wong
raywong@cse
Course Details

Instructor
- Dr. Raymond Wong
TA
- Kai Ho CHAN
- Dandan LIN
- Junqiu WEI
Course Details

Webpage

http://course.cse.ust.hk/comp1942/
Course Details


Lecture
- Time: Monday (1:30pm - 2:50pm) and Friday (9:00am - 10:20am)
- Venue: G010 (CYT Building)

Tutorial
Tutorials will be announced via email.
- Time: Monday (12:30pm - 1:20pm)
  Venue: Room 5583 (LT 29-30) (Academic Building) or CSE Lab 3 (Rm 4213 (Academic Building))
- Time: Tuesday (12:30pm - 1:20pm)
  Venue: Rm 2302 (LT 17-18) (Academic Building) or CSE Lab 3 (Rm 4213 (Academic Building))
Course Details

Textbook

Data Mining for Business Intelligence:
Concepts, Techniques, and Applications in
Microsoft Office Excel with XLMiner. Galit
Shmueli, Nitin R. Patel and Peter C. Bruce,
Wiley 2010 (2nd edition)
Course Details

Reference books/materials:
- Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber and Jian PEI. Morgan Kaufmann Publishers (3rd edition)
- Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Boston: Pearson Addison Wesley (2006)
Common Core Requirement



My ability to use quantitative methods to
define, analyze and solve problems in daily life
has been enhanced.
I am more able to process quantitative data
and to use the data to reach a conclusion in a
logical way.
The course has aroused my interest in learning
more about mathematical models or
quantitative methods.
Course Details

Grading Scheme:





Assignment 10%
Project 20%
In-class Participation 10%
Mid-Term Exam 20%
Final Exam 40%
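
As a rough illustration of how these weights combine, the sketch below computes a weighted total. The weights are taken from the grading scheme above; the function name and the example student's component scores are hypothetical and only serve the illustration.

```python
# Weights from the grading scheme (percent of the overall score).
WEIGHTS = {
    "assignment": 10,
    "project": 20,
    "participation": 10,
    "midterm": 20,
    "final": 40,
}

def overall_score(component_scores):
    """Weighted total, where each component score is a fraction in [0, 1]."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)

# Hypothetical student, for illustration only (not from the slides).
example = {
    "assignment": 0.9,
    "project": 0.8,
    "participation": 1.0,
    "midterm": 0.7,
    "final": 0.75,
}
print(overall_score(example))
```

Since the final exam carries 40% of the total, a change in the final exam score moves the overall score four times as much as the same change in the assignment score.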
Assignment

2 assignments
- Assignment 1: content before the mid-term exam
- Assignment 2: content after the mid-term exam
NOTE: No late submissions are allowed.
Assignment

If a student answers a selected question in class correctly,
- I will give him/her a coupon for each correct answer.
- This coupon can be used to waive one question in an assignment, which means that s/he can get full marks for this question without answering it.
Assignment

Guideline

For each assignment, each student can waive at most one question.
- S/he can waive any question s/he wants and obtain full marks for this question (no matter whether s/he answers this question or not).
- S/he may also answer this question. We will still mark it, but will give full marks to this question.
When the student submits the assignment,
- please staple the coupon to the submitted assignment
- please write down the question no. s/he wants to waive on the coupon
Project



Phase 1 (Excel file)
Phase 2 (Design Report)
Phase 3 (Final Report and Output files)
Project




You are required to form a group.
Each group contains 1 or 2 members.
A 3-member group is NOT allowed.
Please fill in the following information for each member at the link https://goo.gl/forms/WguJdkHO8TkpOFYn1
- student ID
- student name
- email
Each group needs to submit the grouping information ONCE.
The group forming deadline is 15 Feb (Wed) 1pm.
Project




Data Mining Tool: XLMiner (in MS Excel)
Installed in CSE Lab 3 (Rm 4213)
All non-CSE students and all non-CPEG
students need to apply for the CSE
account.
You can find the details on our course webpage.
Project





In Phase 3 (the last phase), you are
required to hand in some output files
We will check the output files
You can use at most one coupon to
obtain full marks for all output files
Each group can use at most one coupon
Please staple your coupon with your
final report.
In-class Participation

In each lesson, you are required to bring
one of the following with you.


your smart phone installed with iPRS (Internet
enabled Personal Response System) or
your PRS device
In-class Participation

If you have a smart phone (Android/iOS),
- please install an app called “HKUST iLearn” on your smart phone.
If you do not have a smart phone,
- you have to borrow a PRS device.
- please visit the ITSC Service Desk at Rm 2021 (Lift 2) to borrow your PRS device.
In-class Participation


In each lesson, you may be asked some multiple-choice questions (e.g., 1-3 questions).
You have to use your iPRS to answer the questions.
In-class Participation


You can obtain 1 unit for in-class
participation when you answer a
question in class with your iPRS (no
matter whether you answer it correctly
or not)
Those questions may be in the midterm exam and the final exam.
In-class Participation

In some cases,
- some students may be absent from class for some reason
- the iPRS system may not be able to record your answer
  (e.g., your smart device and the iPRS system crash)
You are required to obtain 20 units in order to obtain the full score (10%) for the in-class participation.
We will give at least 25 questions in the course.
Midterm and Final Exam


You are allowed to bring a calculator
with you.
Please remember to prepare a
calculator for the exam
Midterm Exam




In-class Midterm
Date: 17 March, 2017 (Fri)
Time: 9:00-10:20
Venue: G010 (CYT Building)
Rm 5619 (LT 31/32) (Academic Building)
Major Topics

In this course, you are expected to
learn something related to “Exploring
and Visualizing Data”.
Not only this!

In this course, you are expected to
learn how to solve problems and how to
analyze problems.
This is very important to your future.
Major Topics
1. Association
2. Clustering
3. Classification
4. Data Warehouse
5. Dimension Reduction
6. Web Databases
1. Association
Customer   Apple   Orange   Milk
Raymond    Apple   Orange
Ada                Orange   Milk
Grace      Apple   Orange
…          …       …        …

Items/Itemsets     Frequency
Apple              2
Orange             3
Milk               1
{Apple, Orange}    2
{Orange, Milk}     1

We are interested in the items/itemsets with frequency >= 2.
- Apple: Frequent Pattern (or Frequent Item)
- Orange: Frequent Pattern (or Frequent Item)
- {Apple, Orange}: Frequent Pattern (or Frequent Itemset)
1. Association
We are interested in the items/itemsets with frequency >= 2.

Association Rule:
1. Apple -> Orange
   (100% of the customers who buy apple will probably buy orange.)
2. Orange -> Apple
   (67% of the customers who buy orange will probably buy apple.)

Problem: to find all frequent patterns and association rules
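
The frequencies and the two percentages in this example can be reproduced with a short sketch. It assumes the three customers bought {Apple, Orange}, {Orange, Milk} and {Apple, Orange}, which matches the frequencies in the example. It enumerates itemsets by brute force, which is fine for three customers but is not one of the efficient algorithms taught later. The names `transactions`, `MIN_FREQUENCY` and `confidence` are illustrative.

```python
from collections import Counter
from itertools import combinations

# Purchases of the three customers in the example (assumed assignment).
transactions = {
    "Raymond": {"Apple", "Orange"},
    "Ada": {"Orange", "Milk"},
    "Grace": {"Apple", "Orange"},
}
MIN_FREQUENCY = 2  # threshold used in the example

# Count every itemset of size 1 and 2 contained in each purchase.
frequency = Counter()
for items in transactions.values():
    for size in (1, 2):
        for itemset in combinations(sorted(items), size):
            frequency[frozenset(itemset)] += 1

# Frequent patterns: itemsets whose frequency reaches the threshold.
frequent = {s: f for s, f in frequency.items() if f >= MIN_FREQUENCY}

def confidence(lhs, rhs):
    """Fraction of the customers buying `lhs` who also buy `rhs`."""
    both = frozenset(lhs) | frozenset(rhs)
    return frequency[both] / frequency[frozenset(lhs)]

print(round(confidence({"Apple"}, {"Orange"}), 2))  # 1.0
print(round(confidence({"Orange"}, {"Apple"}), 2))  # 0.67
```

The rule Apple -> Orange gets 100% because both customers who buy apple also buy orange, while Orange -> Apple gets 67% because only two of the three orange buyers also buy apple.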
2. Clustering
           Computer   History
Raymond    100        40
Louis      90         45
Wyman      20         95
…          …          …

[Figure: scatter plot of the students, with axes Computer and History]
Cluster 1 (e.g. High Score in Computer and Low Score in History)
Cluster 2 (e.g. High Score in History and Low Score in Computer)

Problem: to find all clusters
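
One possible way to find such clusters is sketched below. It uses k-means with two clusters, which is an assumption made for illustration since the slide does not name an algorithm. The scores assume Raymond = (100, 40), Louis = (90, 45) and Wyman = (20, 95) as (Computer, History) pairs, and the choice of starting centroids is also an assumption.

```python
import math

# (Computer, History) scores of the three students in the example.
scores = {"Raymond": (100, 40), "Louis": (90, 45), "Wyman": (20, 95)}

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update centroids."""
    centroids = list(centroids)
    assignment = {}
    for _ in range(iterations):
        assignment = {
            name: min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for name, p in points.items()
        }
        for c in range(len(centroids)):
            members = [points[n] for n, a in assignment.items() if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment

# Assumption: start from two of the data points as the initial centroids.
clusters = kmeans(scores, [scores["Raymond"], scores["Wyman"]])
print(clusters)
```

With these starting points, Raymond and Louis (high Computer, low History) end up in one cluster and Wyman (high History, low Computer) in the other.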
3. Classification
Suppose there is a person.

Race    Income   Child   Insurance
white   high     no      ?

Decision tree

root
  child=yes: 100% Yes, 0% No
  child=no
    Income=high: 100% Yes, 0% No
    Income=low: 0% Yes, 100% No
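
Reading a prediction off this decision tree can be sketched as a few nested tests. The sketch assumes the tree first splits on child and then, for child=no, on income, with income taking only the values high and low. The function and dictionary keys are illustrative names.

```python
def predict_insurance(person):
    """Follow the decision tree from the root to a leaf and return its majority answer."""
    if person["child"] == "yes":
        return "Yes"  # leaf: 100% Yes, 0% No
    if person["income"] == "high":
        return "Yes"  # leaf under child=no: 100% Yes, 0% No
    return "No"       # Income=low leaf: 0% Yes, 100% No

# The person in the example: race is not used by this tree.
person = {"race": "white", "income": "high", "child": "no"}
print(predict_insurance(person))  # Yes
```

For the person in the example (child=no, Income=high), the tree reaches the leaf with 100% Yes, so the missing Insurance value is predicted as Yes.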
4. Data Warehouse
[Diagram: Users send a Query to the Databases]
Need to wait for a long time (e.g., 1 day to 1 week)
[Diagram: Databases, Data Warehouse, Users]
The Data Warehouse stores pre-computed results
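
The idea of answering queries from pre-computed results can be illustrated with a small sketch. The sales records, store names and function names below are hypothetical and not from the slides. The point is only the contrast between scanning the raw data for every query and looking up a total that was computed ahead of time.

```python
from collections import defaultdict

# Hypothetical raw records (store, amount), for illustration only.
sales = [("A", 10), ("B", 5), ("A", 7), ("C", 3), ("B", 8)]

def total_by_scan(store):
    """Answer a query by scanning every raw record (slow on large databases)."""
    return sum(amount for s, amount in sales if s == store)

# Pre-compute the totals once, before any query arrives.
precomputed = defaultdict(int)
for store, amount in sales:
    precomputed[store] += amount

def total_from_warehouse(store):
    """Answer the same query with a lookup into the pre-computed results."""
    return precomputed.get(store, 0)

print(total_by_scan("A"), total_from_warehouse("A"))
```

Both functions return the same answer, but the second one does no work proportional to the size of the raw data at query time.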
5. Dimension Reduction
Suppose we have the following data set.
[Figure: scatter plot of the data points]

According to the data, we find the following vectors, e1 and e2 (marked in red).
[Figure: the data points with vectors e1 and e2]

Suppose all data points are projected on vector e1.
[Figure: the gap between a data point and its projection on e1 corresponds to the information loss; the gap for another data point corresponds to another information loss]

After all data points are projected on vector e1, the total information loss is small.
Thus, we can use only 1 dimension to represent all data points (i.e., vector e1).
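
The projection idea can be sketched numerically. The data points below are hypothetical (the slide only shows a figure), and the sketch assumes that e1 is the direction of largest variance of the data and e2 is perpendicular to it, as in principal component analysis. For 2-dimensional data this direction has a closed form, so only the standard library is needed.

```python
import math

# Hypothetical 2-D data points lying roughly along a line (not from the slides).
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / n
syy = sum(y * y for _, y in centered) / n
sxy = sum(x * y for x, y in centered) / n

# Direction of largest variance (e1) of a symmetric 2x2 matrix, in closed form.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
e1 = (math.cos(theta), math.sin(theta))
e2 = (-math.sin(theta), math.cos(theta))  # perpendicular to e1

# 1-dimensional representation: the coordinate of each point along e1.
coords = [x * e1[0] + y * e1[1] for x, y in centered]

# Information loss of each point: squared distance to its projection on e1,
# which equals its squared coordinate along e2.
losses = [(x * e2[0] + y * e2[1]) ** 2 for x, y in centered]
total_loss = sum(losses)
total_spread = sum(x * x + y * y for x, y in centered)
print(f"fraction of spread lost: {total_loss / total_spread:.4f}")
```

Because the points lie close to a line, almost all of the spread is kept by the single coordinate along e1, which is why one dimension is enough to represent them.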
6. Web Databases
Raymond Wong
How to rank the webpages?