KDDCUP Survey - Department of Computer Science and Engineering
KDD-Cup A Survey: 1997-2012
Special Thanks to Prof. Qiang YANG’s
course materials!
(partly based on Xinyue Liu’s slides @SFU, and
Nathan Liu’s slides @hkust)
Hong Kong University of Science and Technology
About ACM KDDCUP
ACM KDD: Premier conference in knowledge discovery and data mining
ACM KDDCUP:
Worldwide competition held in conjunction with ACM KDD conferences.
It aims at:
Showcasing the best methods for discovering higher-level knowledge from data
Helping to close the gap between research and industry
Stimulating further KDD research and development
Statistics
Participation in KDD Cup grew steadily
Average person-hours per submission: 204
Max person-hours per submission: 910
Year          1997   1998   1999   2000   2005   2011
Submissions     16     21     24     30     32   1000+
KDD Cup 97
A classification task: predict direct mail response in the financial services industry
Winners
Charles Elkan, a professor at UC San Diego, with his Boosted Naive Bayesian (BNB)
Silicon Graphics, Inc. with their software MineSet
Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System
MineSet (Silicon Graphics Inc.)
A KDD tool that combines data access, transformation,
classification, and visualization.
KDD Cup 98: CRM Benchmark
URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html
A classification task: to analyze fund raising mail responses to a non-profit organization
Winners
Urban Science Applications, Inc. with their software GainSmarts
SAS Institute, Inc. with their software SAS Enterprise Miner™
Quadstone Limited with their software Decisionhouse™
KDDCUP 1998 Results
[Figure: profit curves for GainSmarts, SAS/Enterprise Miner, and Quadstone/Decisionhouse. Axes: profit ($0 to $70,000) and percentage (0% to 100%).]
Maximum Possible Profit Line: $72,776 in profits with 4,873 mailed
Mail to Everyone Solution: $10,560 in profits with 96,367 mailed
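The two reference lines in the results chart (mail to everyone versus the maximum possible profit) suggest how an entry is judged: rank candidates by model score, mail the top of the list, and total the profit. The following is a minimal sketch of that idea, not the official evaluation procedure. The function name, the toy donations, and the `cost_per_mail` value are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def best_mailing_cutoff(
    scored: List[Tuple[float, float]], cost_per_mail: float
) -> Tuple[int, float]:
    """Return (number mailed, profit) at the most profitable cutoff.

    scored holds one (model_score, actual_donation) pair per candidate.
    Candidates are mailed in order of decreasing model score.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best_n, best_profit, running = 0, 0.0, 0.0
    for n, (_, donation) in enumerate(ranked, start=1):
        running += donation - cost_per_mail  # net gain of mailing one more person
        if running > best_profit:
            best_n, best_profit = n, running
    return best_n, best_profit


# Toy data (hypothetical): five candidates, two of whom donate.
candidates = [(0.9, 20.0), (0.8, 0.0), (0.7, 15.0), (0.2, 0.0), (0.1, 0.0)]
n_mailed, profit = best_mailing_cutoff(candidates, cost_per_mail=1.0)
mail_everyone = sum(donation - 1.0 for _, donation in candidates)
print(n_mailed, profit, mail_everyone)
```

On the toy data, stopping after the third candidate earns more than mailing all five, which mirrors the gap between the two reference lines in the chart.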
ACM KDD Cup 1999
URL: www.cse.ucsd.edu/users/elkan/kdresults.html
Problem
To detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders
Data: from DoD
Winners
SAS Institute Inc. with their software Enterprise Miner
Amdocs with their Information Analysis Environment
KDDCUP 2000: Data Set and Goal
Data collected from Gazelle.com, a legwear and legcare Web retailer
Pre-processed
Training set: 2 months
Test sets: one month
Data collected includes:
Click streams
Order information
The goal: to design models that support website personalization and improve the profitability of the site by increasing customer response.
Questions: when given a set of page views,
characterize heavy spenders
characterize killer pages
predict which product brand a visitor will view in the remainder of the session
KDD Cup 2001
3 Bioinformatics Tasks
Dataset 1: Prediction of
Molecular Bioactivity for
Drug Design
half a gigabyte when
uncompressed
Dataset 2: Prediction of
Gene/Protein Function (task
2) and Localization (task 3)
Dataset 2 is smaller and
easier to understand
7 megabytes uncompressed
A total of 136 groups
participated to produce
a total of 200
submitted predictions
over the 3 tasks: 114
for Thrombin, 41 for
Function, and 45 for
Localization.
2001 Winners
Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce).
Bayesian network learner and classifier
Task 2, Function: Mark-A. Krogel (University of Magdeburg).
Inductive logic programming
Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo).
K nearest neighbor
Task 2 data: the genes of one particular type of organism
A gene/protein can have more than one function, but only one localization.
KDD Cup 2002
Molecular biology: two tasks
Task 1: Document extraction from biological articles
Task 2: Classification of proteins based on gene deletion experiments
Winners:
Task 1: ClearForest and Celera, USA (Yizhar Regev and Michal Finkelstein)
Task 2: Telstra Research Laboratories, Australia (Adam Kowalczyk and Bhavani Raskutti)
2003 KDDCUP
Information Retrieval/Citation Mining of scientific research papers, based on a very large archive of research papers
First Task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference
Second Task: build a citation graph of a large subset of the archive from only the LaTeX sources
Third Task: estimate each paper's popularity based on partial download logs
Last Task: participants devise their own questions
2004 Tasks and Results
(Particle physics; plus protein homology prediction)
Winners of the two tasks:
David S. Vogel, Eric Gottschalk, and Morgan C. Wang
Bernhard Pfahringer, Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao.
Past KDDCUP Overview: 2005-2010
Year | Host | Task | Technique | Winner
2005 | Microsoft | Web query categorization | Feature Engineering, Ensemble | HKUST (Shen, Yang, etc.)
2006 | Siemens | Pulmonary emboli detection | Multi-instance, Non-IID sample, Cost sensitive, Class Imbalance, Noisy data | AT&T, Budapest University of Technology & Economics
2007 | Netflix | Consumer recommendation | Collaborative Filtering, Time series, Ensemble | IBM Research, Hungarian Academy of Sciences
2008 | Siemens | Breast cancer detection from medical images | Ensemble, Class imbalance, Score calibration | IBM Research, National Taiwan University
2009 | Orange | Customer relationship prediction in telecom | Feature selection, Ensemble | IBM Research, University of Melbourne
2010 | PSLC Data Shop | Student performance prediction in E-Learning | Feature engineering, Ensemble, Collaborative filtering | National Taiwan University (CJ Lin, S. Lin, etc.)
KDDCUP’11 Dataset
11 years of data
Rated items are
Tracks
Albums
Artists
Genres
Items arranged in a taxonomy
Two tasks
         | Track 1 | Track 2
#ratings | 263M    | 63M
#items   | 625K    | 296K
#users   | 1M      | 249K
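The counts above give a feel for how sparse the rating matrices are. As a rough worked calculation, the density of a user-item matrix is the number of ratings divided by the number of possible (user, item) pairs. The sketch below uses only the rounded figures from the table, so the results are approximate; the `density` helper is a hypothetical name.

```python
def density(n_ratings: float, n_users: float, n_items: float) -> float:
    """Fraction of the user-item matrix that actually holds a rating."""
    return n_ratings / (n_users * n_items)


# Rounded counts taken from the dataset table.
track1 = density(263e6, 1e6, 625e3)
track2 = density(63e6, 249e3, 296e3)

print(f"Track 1 density: {track1:.4%}")
print(f"Track 2 density: {track2:.4%}")
```

Both tracks fill well under a tenth of a percent of the matrix.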
Items in a Taxonomy
Track 1 Details
Track 1 Highlights
Largest publicly available dataset
Large number of items (50 times more than
Netflix)
Extreme rating sparsity (20 times more
sparse than Netflix)
Taxonomy can help in combating sparsely
rated items.
Fine time stamps with both date and time
allow sophisticated temporal modeling.
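One simple way a taxonomy can help with sparsely rated items is back-off: if a user has not rated a track, fall back to their rating of its album, then its artist. The sketch below illustrates that idea only. It is not a description of any winning entry, and the identifiers (`backoff_score`, the `track:`/`album:`/`artist:` keys, the default value) are hypothetical.

```python
from typing import Dict, Optional, Set


def backoff_score(
    item: str,
    user_ratings: Dict[str, float],
    parent: Dict[str, str],
    default: float,
) -> float:
    """Walk up the taxonomy until an item the user has rated is found."""
    node: Optional[str] = item
    visited: Set[str] = set()  # guards against accidental cycles
    while node is not None and node not in visited:
        if node in user_ratings:
            return user_ratings[node]
        visited.add(node)
        node = parent.get(node)  # None once the top of the taxonomy is reached
    return default


# Toy taxonomy (hypothetical): two tracks on one album by one artist.
parent = {"track:1": "album:A", "track:2": "album:A", "album:A": "artist:X"}
ratings = {"album:A": 80.0, "artist:X": 60.0}

print(backoff_score("track:1", ratings, parent, default=50.0))
```

In practice such a back-off value would more likely serve as one feature among many than as a final prediction.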
Track 2 Details
Track 2 Highlights
Performance metric focuses on ranking/classification, which differs from traditional collaborative filtering.
No validation data provided; participants need to self-construct binary labeled data from rating data.
Unlike Track 1, Track 2 removed time stamps to focus more on long-term preference rather than short-term behaviors.
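Constructing binary labeled data from ratings could look like the sketch below: treat highly rated items as positives and sample items the user never rated as negatives. This is only one plausible construction. The threshold, the sampling ratio, the 1/0 label encoding, and all names are assumptions, not details given on the slide.

```python
import random
from typing import Dict, List, Tuple


def build_binary_examples(
    user_ratings: Dict[str, Dict[str, float]],
    all_items: List[str],
    like_threshold: float,
    negatives_per_positive: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Return (user, item, label) triples: 1 = highly rated, 0 = sampled unrated."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    examples: List[Tuple[str, str, int]] = []
    for user, ratings in sorted(user_ratings.items()):
        positives = sorted(i for i, r in ratings.items() if r >= like_threshold)
        unrated = [i for i in all_items if i not in ratings]
        examples.extend((user, item, 1) for item in positives)
        n_neg = min(len(unrated), negatives_per_positive * len(positives))
        examples.extend((user, item, 0) for item in rng.sample(unrated, n_neg))
    return examples


# Toy data (hypothetical): one user, four items, two of them rated.
toy_ratings = {"u1": {"a": 90.0, "b": 30.0}}
examples = build_binary_examples(toy_ratings, ["a", "b", "c", "d"], like_threshold=80.0)
print(examples)
```

Holding out part of these constructed examples gives a local validation set where the competition supplies none.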
Submission Stats
Winners
          | Track 1 | Track 2
1st place | National Taiwan University | National Taiwan University
2nd place | Commendo (Netflix Prize Winner) | Chinese Academy of Science, Hulu Labs
3rd place | Hong Kong University of Science and Technology, Shanghai Jiaotong University | Commendo (Netflix Prize Winner)
Chinese Teams at KDDCUP (NTU,
CAS, HKUST)
Nathan Liu:
HKUST CSE
PhD student
KDDCUP 2012
Tencent
Task 1: Micro-blog (Weibo) User Recommendation
Recommend a popular person, an organization, or a group to a user
Task 2: Ad click-through rate prediction from search log
How often will an ad be clicked by a user?
Task1: User recommendation UI
Popular user
recommendation
Task2: Ad click-through rate
prediction
Ad click-through rate prediction
Task1 Data – User-Item Matrix
rec_log_train.txt / rec_log_test.txt
UserID   ItemID   ?followed  TimeStamp
2088948  1760350  -1         1318348785
2088948  1774722  -1         1318348785
2088948  786313   -1         1318348785
601635   1775029  -1         1318348785
601635   1902321  -1         1318348785
601635   462104   -1         1318348785
1529353  1774509  -1         1318348786
~75M records in training data
?followed: 1 if the user accepted the recommendation, -1 if not
In test data, it is filled with 0, to be predicted as -1/1.
TimeStamp: Unix timestamp, in seconds since 1970-01-01 00:00:00 UTC
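As a small illustration of the record layout above, the sketch below parses one log line into typed fields and converts the Unix timestamp to a UTC datetime. It assumes the four fields are whitespace-separated, which the slide does not state. `RecLogRecord` and `parse_rec_log_line` are hypothetical names.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class RecLogRecord(NamedTuple):
    user_id: int
    item_id: int
    followed: int        # 1 = accepted, -1 = not accepted, 0 = unknown (test data)
    timestamp: datetime  # timezone-aware, UTC


def parse_rec_log_line(line: str) -> RecLogRecord:
    """Parse 'UserID ItemID followed TimeStamp' (assumed whitespace-separated)."""
    user_id, item_id, followed, ts = line.split()
    return RecLogRecord(
        int(user_id),
        int(item_id),
        int(followed),
        datetime.fromtimestamp(int(ts), tz=timezone.utc),
    )


# First sample row from the slide.
record = parse_rec_log_line("2088948 1760350 -1 1318348785")
print(record.user_id, record.followed, record.timestamp.isoformat())
```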
Task2 Data – Main Data Table
Extremely large training data: ~150M records
10 GB raw CSV file + keywords + user profiles
Predicting CTR helps the search provider rank and price ads correctly
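Click-through rate is the number of clicks divided by the number of impressions. With so many rarely shown ads, a raw ratio is noisy, so a common baseline shrinks it toward a prior. The sketch below shows such a smoothed estimate and a ranking based on it. The prior values, the ad identifiers, and the function name are hypothetical, and this is not taken from any winning solution.

```python
from typing import Dict, Tuple


def smoothed_ctr(
    clicks: int,
    impressions: int,
    prior_ctr: float = 0.05,
    prior_strength: float = 100.0,
) -> float:
    """Shrink the raw clicks/impressions ratio toward prior_ctr.

    prior_strength acts like a number of pseudo-impressions at the prior rate.
    """
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)


# Toy data (hypothetical): ad id -> (clicks, impressions).
ad_stats: Dict[str, Tuple[int, int]] = {
    "ad_a": (30, 1000),
    "ad_b": (1, 2),
    "ad_c": (0, 0),
}
ranked = sorted(ad_stats, key=lambda ad: smoothed_ctr(*ad_stats[ad]), reverse=True)
print(ranked)
```

Note that `ad_b`, with one click in two impressions, is pulled far below its raw ratio of 0.5, and an ad with no history at all simply receives the prior.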
Winners
          | Track 1 | Track 2
1st place | Shanghai Jiao Tong University | National Taiwan University
2nd place | Steffen Rendle, University of Konstanz | Opera Solutions
3rd place | Team FICO Model Builder | Steffen Rendle, University of Konstanz
Summary
Placing at the top of KDDCUP requires
Teamwork
Expertise in domain knowledge as well as mathematical tools
Top places often go to world-famous institutes and companies
Recent trends:
Datasets are increasingly realistic
Participants are increasingly professional
Tasks are increasingly difficult
Summary
KDD Cup is an excellent source for learning state-of-the-art KDD techniques
KDDCUP datasets often become standard benchmarks for future research, development and teaching
Top winners are highly regarded and respected
References:
http://www.sigkdd.org/kddcup/index.php