KDDCUP Survey - Department of Computer Science and Engineering

KDD-Cup: A Survey, 1997-2012
Special thanks to Prof. Qiang YANG for his course materials!
(Partly based on Xinyue Liu’s slides @SFU and Nathan Liu’s slides @HKUST)
Hong Kong University of Science and Technology
About ACM KDDCUP

- ACM KDD: the premier conference in knowledge discovery and data mining
- ACM KDDCUP: a worldwide competition held in conjunction with the ACM KDD conferences
- It aims at:
  - Showcasing the best methods for discovering higher-level knowledge from data
  - Helping to close the gap between research and industry
  - Stimulating further KDD research and development
Statistics

- Participation in KDD Cup grew steadily
- Average person-hours per submission: 204
- Maximum person-hours per submission: 910

Year        | 1997 | 1998 | 1999 | 2000 | 2005 | 2011
Submissions | 16   | 21   | 24   | 30   | 32   | 1000+
KDD Cup 97

- A classification task: predict direct mail response in the financial services industry
- Winners
  - Charles Elkan, a professor at UC San Diego, with his Boosted Naive Bayesian (BNB) classifier
  - Silicon Graphics, Inc. with their software MineSet
  - Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System
MineSet (Silicon Graphics Inc.)

A KDD tool that combines data access, transformation,
classification, and visualization.
KDD Cup 98: CRM Benchmark

- URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
- A classification task: analyze fund-raising mail responses to a non-profit organization
- Winners
  - Urban Science Applications, Inc. with their software GainSmarts
  - SAS Institute, Inc. with their software SAS Enterprise Miner™
  - Quadstone Limited with their software Decisionhouse™
KDDCUP 1998 Results

[Figure: profit curves for GainSmarts, SAS/Enterprise Miner, and Quadstone/Decisionhouse. Axes: profit ($0 to $70,000) and percentage (0% to 100%).]
- Maximum possible profit line: $72,776 in profits with 4,873 mailed
- Mail-to-everyone solution: $10,560 in profits with 96,367 mailed
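
The two reference lines on the 1998 results chart can be compared per mailed piece. The sketch below is a minimal illustration using only the totals quoted on the slide; `profit_per_piece` is a hypothetical helper introduced here, not part of any competition tooling.

```python
def profit_per_piece(total_profit: float, pieces_mailed: int) -> float:
    """Average profit per mailed piece (hypothetical helper, not from the slides)."""
    if pieces_mailed <= 0:
        raise ValueError("pieces_mailed must be positive")
    return total_profit / pieces_mailed


# Totals as reported on the KDDCUP 1998 results slide.
max_possible = profit_per_piece(72_776, 4_873)    # maximum possible profit line
mail_everyone = profit_per_piece(10_560, 96_367)  # mail-to-everyone solution

print(f"Maximum possible: ${max_possible:.2f} per piece")
print(f"Mail to everyone: ${mail_everyone:.2f} per piece")
```

The gap between the two figures is the room a targeting model has to work with: mailing fewer, better-chosen recipients raises the profit per piece.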
ACM KDD Cup 1999

- URL: www.cse.ucsd.edu/users/elkan/kdresults.html
- Problem: detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders
- Data: from DoD
- Winners
  - SAS Institute Inc. with their software Enterprise Miner
  - Amdocs with their Information Analysis Environment
KDDCUP 2000: Data Set and Goal

- Data collected from Gazelle.com, a legwear and legcare Web retailer
  - Pre-processed
  - Training set: 2 months
  - Test sets: one month
  - Data collected includes click streams and order information
- The goal: design models to support website personalization and to improve the profitability of the site by increasing customer response
- Questions: when given a set of page views,
  - characterize heavy spenders
  - characterize killer pages
  - characterize which product brand a visitor will view in the remainder of the session
KDD Cup 2001

- 3 bioinformatics tasks
- Dataset 1: Prediction of Molecular Bioactivity for Drug Design
  - Half a gigabyte when uncompressed
- Dataset 2: Prediction of Gene/Protein Function (task 2) and Localization (task 3)
  - Dataset 2 is smaller and easier to understand
  - 7 megabytes uncompressed
- A total of 136 groups participated, producing a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization
2001 Winners

- Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce)
  - Bayesian network learner and classifier
- Task 2, Function: Mark-A. Krogel (University of Magdeburg)
  - Inductive logic programming
- Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo)
  - K nearest neighbor
- About Task 2:
  - The genes of one particular type of organism
  - A gene/protein can have more than one function, but only one localization

KDD Cup 2002

- Molecular biology: two tasks
  - Task 1: Document extraction from biological articles
  - Task 2: Classification of proteins based on gene deletion experiments
- Winners:
  - Task 1: ClearForest and Celera, USA (Yizhar Regev and Michal Finkelstein)
  - Task 2: Telstra Research Laboratories, Australia (Adam Kowalczyk and Bhavani Raskutti)
2003 KDDCUP

- Information retrieval/citation mining of scientific research papers, based on a very large archive of research papers
  - First task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference
  - Second task: build a citation graph of a large subset of the archive from only the LaTeX sources
  - Third task: estimate each paper's popularity based on partial download logs
  - Last task: participants devise their own questions
2004 Tasks and Results

- Tasks: particle physics, plus protein homology prediction
- Winners of the two tasks:
  - David S. Vogel, Eric Gottschalk, and Morgan C. Wang
  - Bernhard Pfahringer, Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao
Past KDDCUP Overview: 2005-2010

Year | Host | Task | Technique | Winner
2005 | Microsoft | Web query categorization | Feature engineering, ensemble | HKUST (Shen, Yang, etc.)
2006 | Siemens | Pulmonary emboli detection | Multi-instance, non-IID sample, cost sensitive, class imbalance, noisy data | AT&T, Budapest University of Technology & Economics
2007 | Netflix | Consumer recommendation | Collaborative filtering, time series, ensemble | IBM Research, Hungarian Academy of Sciences
2008 | Siemens | Breast cancer detection from medical images | Ensemble, class imbalance, score calibration | IBM Research, National Taiwan University
2009 | Orange | Customer relationship prediction in telecom | Feature selection, ensemble | IBM Research, University of Melbourne
2010 | PSLC DataShop | Student performance prediction in e-learning | Feature engineering, ensemble, collaborative filtering | National Taiwan University (CJ Lin, S. Lin, etc.)
KDDCUP’11 Dataset

- 11 years of data
- Rated items are
  - Tracks
  - Albums
  - Artists
  - Genres
- Items are arranged in a taxonomy
- Two tasks

Count    | Track 1 | Track 2
#ratings | 263M    | 63M
#items   | 625K    | 296K
#users   | 1M      | 249K
Items in a Taxonomy
Track 1 Details
Track 1 Highlights





- Largest publicly available dataset
- Large number of items (50 times more than Netflix)
- Extreme rating sparsity (20 times sparser than Netflix)
- The taxonomy can help with sparsely rated items
- Fine-grained time stamps with both date and time allow sophisticated temporal modeling
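
The sparsity claim can be made concrete with the counts from the KDDCUP’11 dataset slide. This is a rough sketch, assuming density is defined as ratings divided by (users times items), a common convention that the slides do not state; `density` is a hypothetical helper, and the rounded counts (263M, 625K, 1M, and so on) are used exactly as they appear on the slide.

```python
def density(n_ratings: float, n_users: float, n_items: float) -> float:
    """Fraction of the user-item matrix that holds a rating (assumed definition)."""
    return n_ratings / (n_users * n_items)


# Rounded counts from the dataset slide.
track1 = density(263e6, 1e6, 625e3)
track2 = density(63e6, 249e3, 296e3)

print(f"Track 1 density: {track1:.4%}")
print(f"Track 2 density: {track2:.4%}")
```

Under these assumptions, fewer than one in a thousand user-item pairs carries a rating in either track.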
Track 2 Details
Track 2 Highlights



- The performance metric focuses on ranking/classification, which differs from traditional collaborative filtering
- No validation data is provided; participants need to self-construct binary labeled data from the rating data
- Unlike track 1, track 2 removed time stamps to focus more on long-term preference than on short-term behavior
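
One way to picture the self-construction of binary labels is to threshold the ratings. This is a minimal sketch under stated assumptions: the threshold of 80, the (user, item, rating) tuple layout, and the function name `to_binary_labels` are all illustrative choices that do not come from the slides, and the sample ratings are made up.

```python
from typing import Iterable, List, Tuple

Rating = Tuple[int, int, float]  # (user_id, item_id, rating); layout is an assumption
Labeled = Tuple[int, int, int]   # (user_id, item_id, label in {-1, +1})


def to_binary_labels(ratings: Iterable[Rating], threshold: float = 80.0) -> List[Labeled]:
    """Label a rating +1 if it is at or above the threshold, otherwise -1.

    The threshold is hypothetical; choose it to match the task definition.
    """
    return [(u, i, 1 if r >= threshold else -1) for u, i, r in ratings]


# Made-up ratings, for illustration only.
sample = [(1, 10, 90.0), (1, 11, 30.0), (2, 10, 80.0)]
labels = to_binary_labels(sample)
print(labels)
```

A held-out slice of such labels could then serve as the validation set that the competition did not provide.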
Submission Stats
Winners

Place     | Track 1 | Track 2
1st place | National Taiwan University | National Taiwan University
2nd place | Commendo (Netflix Prize winner) | Chinese Academy of Sciences, Hulu Labs
3rd place | Hong Kong University of Science and Technology, Shanghai Jiao Tong University | Commendo (Netflix Prize winner)
Chinese Teams at KDDCUP (NTU, CAS, HKUST)
Nathan Liu: HKUST CSE PhD student
KDDCUP 2012

- Host: Tencent
- Task 1: Micro-blog (Weibo) user recommendation
  - Recommend a popular person, an organization, or a group to a user
- Task 2: Ad click-through rate prediction from search logs
  - How often will an ad be clicked by a user?
Task 1: User recommendation UI
Popular user recommendation
Task 2: Ad click-through rate prediction
Task 1 Data: User-Item Matrix

rec_log_train.txt / rec_log_test.txt

UserID   ItemID   ?followed  TimeStamp
2088948  1760350  -1         1318348785
2088948  1774722  -1         1318348785
2088948  786313   -1         1318348785
601635   1775029  -1         1318348785
601635   1902321  -1         1318348785
601635   462104   -1         1318348785
1529353  1774509  -1         1318348786

- ~75M records in the training data
- ?followed: 1 or -1, whether the user accepts the recommendation or not
- In the test data, ?followed is filled with 0, to be predicted as -1 or 1
- TimeStamp: Unix timestamp, in seconds since 1970-01-01 00:00:00 UTC
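
To make the Task 1 record layout concrete, the sketch below parses one log line into typed fields and converts the timestamp to UTC. It assumes whitespace-separated fields in the order UserID, ItemID, ?followed, TimeStamp; the `RecLogEntry` class and `parse_rec_log_line` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RecLogEntry:
    user_id: int
    item_id: int
    followed: int   # 1 = accepted, -1 = not accepted, 0 = to be predicted (test data)
    when: datetime  # Unix timestamp converted to a timezone-aware UTC datetime


def parse_rec_log_line(line: str) -> RecLogEntry:
    """Parse one 'UserID ItemID followed TimeStamp' record (assumed layout)."""
    user_id, item_id, followed, ts = line.split()
    return RecLogEntry(
        user_id=int(user_id),
        item_id=int(item_id),
        followed=int(followed),
        when=datetime.fromtimestamp(int(ts), tz=timezone.utc),
    )


# First sample record from the slide.
entry = parse_rec_log_line("2088948 1760350 -1 1318348785")
print(entry.user_id, entry.item_id, entry.followed, entry.when.isoformat())
```

Passing `tz=timezone.utc` keeps the conversion independent of the machine's local time zone, which matters when records are compared across systems.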
Task 2 Data: Main Data Table

- Extremely large training data: ~150M records
- 10 GB raw CSV file + keywords + user profiles
- Predicting CTR helps the search provider rank and price ads correctly
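
Click-through rate is conventionally the number of clicks divided by the number of impressions. The following is a minimal sketch of that ratio; the function name and the example counts are illustrative assumptions, not values from the competition data.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return clicks / impressions, or 0.0 when the ad was never shown."""
    if clicks < 0 or impressions < 0:
        raise ValueError("counts must be non-negative")
    if impressions == 0:
        return 0.0
    return clicks / impressions


# Made-up counts, for illustration only: 3 clicks out of 200 impressions.
example_ctr = click_through_rate(3, 200)
print(f"CTR: {example_ctr:.3%}")
```

Returning 0.0 for an ad with no impressions is one possible convention; a smoothed estimate would be another reasonable choice for rarely shown ads.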

Winners

Place     | Track 1 | Track 2
1st place | Shanghai Jiao Tong University | National Taiwan University
2nd place | Steffen Rendle, University of Konstanz | Opera Solutions
3rd place | Team FICO Model Builder | Steffen Rendle, University of Konstanz
Summary

- Placing at the top of KDDCUP requires
  - Teamwork
  - Expertise in domain knowledge as well as mathematical tools
  - Top places are often taken by world-famous institutes and companies
- Recent trends:
  - Datasets are increasingly realistic
  - Participants are increasingly professional
  - Tasks are increasingly difficult
Summary

- KDD Cup is an excellent source for learning state-of-the-art KDD techniques
- KDDCUP datasets often become the standard benchmarks for future research, development, and teaching
- Top winners are highly regarded and respected
- References: http://www.sigkdd.org/kddcup/index.php
