KDD Cup Intro - Department of Computer Science and Engineering

Download Report

Transcript KDD Cup Intro - Department of Computer Science and Engineering

ACM KDD Cup
A Survey: 1997-2011
Qiang Yang 杨强
(partly based on Xinyue Liu’s slides @SFU, and
Nathan Liu’s slides @hkust)
Hong Kong University of Science and
Technology
香港科大
1
About KDD Cup (1997 – 2011)

Competition is a strong
mover for Science and
Engineering:

ACM Programming
Contest


World College level
Programming skills
ROBOCUP

World Robotics
Competition
2
About ACM KDDCUP


ACM KDD: Premiere Conference in knowledge discovery
and data mining
ACM KDDCUP:


Worldwide competition in conjunction with ACM KDD
conferences.
It aims at:



showcase the best methods for discovering higher-level
knowledge from data.
Helping to close the gap between research and industry
Stimulating further KDD research and development
3
Statistics
 Participation in KDD Cup grew steadily
 Average person-hours per submission: 204
Max person-hours per submission: 910
Year
Submissions
97 98
16
21
99
24
2000 2005 2011
30
32 1000+
4
Algorithms (up to 2000)
5
KDD Cup 97


A classification task – to
predict financial services
industry (direct mail
response)
Winners



Charles Elkan, a Prof from
UC-San Diego with his
Boosted Naive Bayesian
(BNB)
Silicon Graphics, Inc with
their software MineSet
Urban Science Applications,
Inc. with their software gain,
Direct Marketing Selection
System
6
MineSet (Silicon Graphics Inc.)

A KDD tool that combines data access, transformation,
classification, and visualization.
7
KDD Cup 98: CRM Benchmark



URL:
www.kdnuggets.com/meetings/kd
d98/kdd-cup-98.html
A classification task – to analyze
fund raising mail responses to a
non-profit organization
Winners



Urban Science Applications,
Inc. with their software
GainSmarts.
SAS Institute, Inc. with their
software SAS Enterprise Miner
™
Quadstone Limited with their
software Decisionhouse ™
8
KDDCUP 1998 Results
$70,000
$65,000
$60,000
$55,000
$50,000
$45,000
$40,000
$35,000
$30,000
$25,000
$20,000
$15,000
$10,000
$5,000
$-
Maximum Possible Profit Line
($72,776 in profits with 4,873 mailed)
100%
90%
80%
Mail to Everyone Solution
($10,560 in profits with 96,367 mailed)
70%
60%
50%
GainSmarts
SAS/Enterprise Miner
Quadstone/Decisionhouse
40%
30%
20%
10%
0%
ACM KDD Cup 1999




URL:
www.cse.ucsd.edu/users/elkan/
kdresults.html
Problem
To detect network intrusion
and protect a computer network
from unauthorized users,
including perhaps insiders
Data: from DoD
Winners
 SAS Institute Inc. with their
software Enterprise Miner.
 Amdocs with their
Information Analysis
Environment
10
KDDCUP 2000: Data Set and Goal:
Data collected from

Gazelle.com, a legwear
and legcare Web retailer
 Pre-processed
Training set: 2 months
 Test sets: one month
 Data collected includes: 


Click streams
Order information
The goal – to design
models to support website personalization and
to improve the
profitability of the site by
increasing customer
response.
Questions - When given
a set of page views,



characterize heavy
spenders
characterize killer pages
characterize which product
brand a visitor will view in
the remainder of the
session?
11
KDDCUP 2000: The Winners



Question 1 & 5 Winner:
Amdocs
Question 2 & 3 Winner:
Salford Systems
Question 4 Winner: esteam
12
KDD Cup 2001

3 Bioinformatics Tasks

Dataset 1: Prediction of
Molecular Bioactivity for
Drug Design


half a gigabyte when
uncompressed
Dataset 2: Prediction of
Gene/Protein Function (task
2) and Localization (task 3)


Dataset 2 is smaller and
easier to understand
7 megabytes uncompressed

A total of 136 groups
participated to produce
a total of 200
submitted predictions
over the 3 tasks: 114
for Thrombin, 41 for
Function, and 45 for
Localization.
13
2001 Winners

Task 1, Thrombin:



Jie Cheng (Canadian Imperial
Bank of Commerce).
Bayesian network learner and
classifier
Task 2, Function: Mark-A.
Krogel (University of
Magdeburg).



Task 2:


the genes of one particular
type of organism
A gene/protein can have
more than one function, but
only one localization.
Inductive Logic programming
Task 3, Localization: Hisashi
Hayashi, Jun Sese, and
Shinichi Morishita (University
of Tokyo).

K nearest neighbor
14

molecular
biology : Two tasks


Task 1: Document
extraction from
biological articles
Task 2: Classification of
proteins based on gene
deletion experiments

Winners:

Task 1: ClearForest and
Celera, USA


Yizhar Regev and Michal
Finkelstein
Task 2: Telstra
Research Laboratories
, Australia

Adam Kowalczyk and
Bhavani Raskutti
15
2003 KDDCUP

Information
Retrieval/Citation Mining of
Scientific research papers





based on a very large
archive of research papers
First Task: predict how many
citations each paper will receive
during the three months
leading up to the KDD 2003
conference
Second Task: a citation graph
of a large subset of the archive
from only the LaTex sources
Third Task: each paper's
popularity will be estimated
based on partial download logs
Last Task: devise their own
questions
16
2003 KDDCUP: Results

Task 1:



Task 2:



1st place: David Vogel
AI Insight Inc.
Task 3:



Claudia Perlich, Foster Provost,
Sofus Kacskassy
New York University
Janez Brank and Jure Leskovec
Jozef Stefan Institute, Slovenija
Task 4:





Amy McGovern, Lisa Friedland,
Michael Hay,
Brian Gallagher, Andrew Fast,
Jennifer Neville,
and David Jensen
University of Massachusetts
Amherst, USA
17
2004 Tasks and Results

粒子物理学和同调蛋白
质预测(Particle
physics; plus protein
homology prediction)

两个子任务的冠军分别
为:David S. Vogel, Eric
Gottschalk, and
Morgan C. Wang以及
Bernhard Pfahringer,
Yan Fu (付岩),
RuiXiang Sun, Qiang
Yang (杨强), Simin He,
Chunli Wang, Haipeng
Wang, Shiguang Shan,
Junfa Liu, Wen Gao.
18
Past KDDCUP Overview: 2005-2010
Year
Host
Task
Technique
Winner
2005
Microsoft
Web query
categorization
Feature Engineering,
Ensemble
HKUST (沈抖,
杨强,等)
2006
Siemens
Pulmonary emboli
detection
Multi-instance, Non-IID
sample, Cost sensitive,
Class Imbalance, Noisy
data
AT&T, Budapest
University of
Technology &
Economics
2007
Netflix
Consumer
recommendation
Collaborative Filtering,
Time series, Ensemble
IBM Research,
Hungarian
Academy of
Sciences
2008
Siemens
Breast cancer
detection from
medical images
Ensemble, Class
imbalance, Score
calibration
IBM Research,
National Taiwan
University
2009
Orange
Customer
relationship
prediction in telecom
Feature selection,
Ensemble
IBM Research,
University of
Melbourne
2010
PSLC Data
Shop
Student performance
prediction in ELearning
Feature engineering,
Ensemble,
Collaborative filtering
National Taiwan
University (CJ
Lin, S. Lin, etc.)
KDDCUP’11 Dataset




11 years of data
Rated items are
 Tracks
 Albums
 Artists
 Genres
Items arranges in a taxonomy
Two tasks
Track 1
Track 2
#ratings
263M
63M
#items
625K
296K
#users
1M
249K
Items in a Taxonomy
Track 1 Details
Track 1 Highlights





Largest publicly available dataset
Large number of items (50 times more than
Netflix)
Extreme rating sparsity (20 times more
sparse than Netflix)
Taxonomy can help in combating sparsely
rated items.
Fine time stamps with both date and time
allow sophisticated temporal modeling.
Track 2 Details
Track 2 Highlights



Performance metric focus on ranking/
classification, which differs from traditional
collaborative filtering.
No validation data provided, need to selfconstruct binary labeled data from rating
data.
Unlike track 1, track 2 removed time stamps
to focus more than long term preference
rather than short term behaviors.
Submission Stats
Winners
Track 1
Track 2
1st place
National Taiwan University
National Taiwan University
2nd place
Commendo (Netflix Prize
Winnder)
Chinese Academy of Science,
Hulu Labs
3rd place
Hong Kong University of
Science and Technology,
Shanghai Jiaotong University
Commendo (Netflix Prize
Winnder)
Chinese Teams at KDDCUP (NTU,
CAS, HKUST)
Key Techniques

Track 1:






Blending of multiple techniques
Matrix factorization models
Nearest neighbor models
Restricted Bolzmann machines
Temporal modelings
Track 2:



Importance sampling of negative instances
Taxonomical modelings
Use of pairwise ranking objective functions
Summary

To place on top of KDDCUP requires




Team work
Expertise in domain knowledge as well as mathematical
tools
Often done by world famous institutes and companies
Recent trends:



Dataset increasingly more realistic
Participants increasingly more professional
Tasks are increasingly more difficult
30
Summary



KDD Cup is an excellent source to
learn the state-of-art KDD techniques
KDDCUP dataset often becomes the
standard benchmark for future
research, development and teaching
Top winners are highly regarded and
respected
31
References
Elkan C. (1997). Boosting and Naive Bayesian Learning.
Technical Report No. CS97-557, September 1997, UCSD.
Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze
Miner Award. Retrieved March 15, 2001 from
http://www.kdnuggets.com/meetings/kdd98/quadstone/i
ndex.html
Urbane Science (1998). Urbane Science wins the KDD-98 Cup.
Retrieved March 15, 2001 from
http://www.kdnuggets.com/meetings/kdd98/gainkddcup98-release.html
Georges, J. & Milley, A. (1999). KDD’99 Competition:
Knowledge Discovery Contest. Retrieved March 15, 2001
from http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf
Rosset, S. & Inger A. (1999). KDD-Cup 99 : Knowledge
Discovery In a Charitable Organization’s Donor Database.
Retrieved March 15, 2001 from
http://www.cse.ucsd.edu/users/elkan/KDD2.doc
32
References
(Cont.)
Sebastiani P., Ramoni M. & Crea A. (1999). Profiling your
Customers using Bayesian Networks. Retrieved March 15,
2001 from
http://bayesware.com/resources/tutorials/kddcup99/kddcu
p99.pdf
Inger A., Vatnik N., Rosset S. & Neumann E. (2000). KDD-Cup
2000: Question 1 Winner’s Report. Retrieved March 18, 2000
from
http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt
Neumann E., Vatnik N., Rosset S., Duenias M., Sasson I. & Inger
A. (2000). KDD-Cup 2000: Question 5 Winner’s Report.
Retrieved March 18, 2000 from
http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt
Salford System white papers:
http://www.salford-systems.com/whitepaper.html
Summary talk presented at KDD (2000)
http://robotics.stanford.edu/~ronnyk/kddCupTalk.ppt
33
References (cont)



http://www.cs.wisc.edu/~dpage/kddcup2001/Cheng.pdf
http://www.cs.wisc.edu/~dpage/kddcup2001/Krogel.pdf
http://www.cs.wisc.edu/~dpage/kddcup2001/Hayashi.pdf
34