Transcript: Slide 1

Predicting the Winner of the Cy Young Award
Advisor: Dr. 黃三益
Team members:
尹川
陳隆賢
陳偉聖
1
Introduction

Baseball in Taiwan
  CPBL (Chinese Professional Baseball League)
Baseball in the USA
  MLB (Major League Baseball)
Cy Young Award, since 1956
  Voted on by the Baseball Writers' Association of America
  Weighted scores
  Each league has one winner per year.
2
Measurements


There are no definite rules for judging the award.
Nevertheless, many measurements can be used
to judge whether a pitcher is good or not:
  Wins
  ERA
  WHIP
  G/F
  etc.
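As a rough illustration of the measurements listed above, the sketch below computes ERA, WHIP and a ground-out to fly-out ratio from a season line. The formulas are the standard ones (ERA is earned runs per nine innings, WHIP is walks plus hits per inning pitched); reading G/F as ground outs over fly outs is an assumption, and the season numbers and field names are invented for illustration.

```python
def era(earned_runs: int, innings_pitched: float) -> float:
    """Earned run average: earned runs allowed per nine innings."""
    return 9.0 * earned_runs / innings_pitched


def whip(walks: int, hits: int, innings_pitched: float) -> float:
    """Walks plus hits per inning pitched."""
    return (walks + hits) / innings_pitched


def ground_fly_ratio(ground_outs: int, fly_outs: int) -> float:
    """Ground outs divided by fly outs (assumed meaning of G/F)."""
    return ground_outs / fly_outs


# Hypothetical season line, not taken from the study's data.
# IP is given in decimal innings here.
season = {"ER": 60, "BB": 50, "H": 150, "IP": 200.0, "GO": 240, "AO": 200}

print("ERA :", round(era(season["ER"], season["IP"]), 2))
print("WHIP:", round(whip(season["BB"], season["H"], season["IP"]), 2))
print("G/F :", round(ground_fly_ratio(season["GO"], season["AO"]), 2))
```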

3
Aim of the study



To analyze the historical statistics of pitchers.
To build a predictive model.
To predict the Cy Young Award winner in
future years.
4
Data mining procedure

Ten data mining methodology steps
5
Step 1: Translate the Problem

Directed data mining problem
  Target variable: Cy Young Award
  Classification
  Decision tree
Purposes
  Gambling game
  Predictive activities
6
Step 2: Select Appropriate Data

MLB statistics only (1871~2006)
  Cy Young Award seasons: 1956~2006
  21456 records in total
  List of Cy Young Award winners
“Time” factor
  1999 as the dividing year, because some statistics
  were only recorded from then on
Variables: remove the items that are not
representative of a pitcher.
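The selection described above can be sketched as a pair of year filters: one keeps the seasons in which the award existed, the other keeps only the seasons from the dividing year on. This is a minimal sketch; the field names `Player`, `Year` and `W` and the three sample rows are assumptions made for illustration, not the study's actual data layout.

```python
from typing import Dict, List

AWARD_START = 1956    # first Cy Young Award season, per the slides
DIVIDING_YEAR = 1999  # year from which the newer statistics have values


def select_award_era(records: List[Dict]) -> List[Dict]:
    """Keep only seasons in which a Cy Young Award was given."""
    return [r for r in records if r["Year"] >= AWARD_START]


def select_recent(records: List[Dict]) -> List[Dict]:
    """Keep only seasons from the dividing year onward."""
    return [r for r in records if r["Year"] >= DIVIDING_YEAR]


# Hypothetical rows for illustration only.
records = [
    {"Player": "Pitcher A", "Year": 1920, "W": 20},
    {"Player": "Pitcher B", "Year": 1975, "W": 22},
    {"Player": "Pitcher C", "Year": 2003, "W": 17},
]

award_era = select_award_era(records)
recent = select_recent(records)
print(len(award_era), len(recent))
```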
7
Step 3: Get to Know the Data

All the data we used come from
the official MLB site.
These data have been publicly available for
many years.
The quality of the data is very good.
Some attributes have values only since 1999.
8
Step 4: Create a Model Set

We divide the data into training data and
testing data.
We do not create a balanced sample.
The MLB records are not seasonal
data.
We use the data from 1999 onward.
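A minimal sketch of an unbalanced split, under stated assumptions: the slides do not give the split method or ratio, so the random shuffle, the 30% test fraction and the seed are all hypothetical. The class counts (92 winners, 21364 non-winners) are the row totals of the 1956~2006 confusion matrices in this deck.

```python
import random
from collections import Counter
from typing import List, Tuple


def split(items: List[str], test_fraction: float = 0.3,
          seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle and split without any rebalancing of the classes."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Only the class labels are modelled here; real records would carry
# the pitching statistics as well.
labels = ["WINNER"] * 92 + ["NONWINNER"] * 21364

train, test = split(labels)
for name, part in (("train", train), ("test", test)):
    counts = Counter(part)
    share = counts["WINNER"] / len(part)
    print(name, len(part), "winner share: %.4f" % share)
```

Because no balancing is done, winners stay well under one percent of either part, which is worth keeping in mind when reading the accuracy figures later on.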
9
Step 5: Fix Problems with the Data

These data are taken from the official MLB site.
No missing values
Single source
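One way to check a claim like "no missing values" is to count blank cells per attribute. The sketch below does that for a list of records; the attribute names and the two sample rows are hypothetical, and treating `None` and the empty string as blank is an assumption about how blanks would appear after loading.

```python
from typing import Dict, List


def blank_counts(records: List[Dict]) -> Dict[str, int]:
    """Count blank (None or empty string) values for every attribute."""
    counts: Dict[str, int] = {}
    for row in records:
        for name, value in row.items():
            counts.setdefault(name, 0)
            if value is None or value == "":
                counts[name] += 1
    return counts


# Hypothetical rows: the older season lacks a value for OBA.
records = [
    {"Year": 1975, "W": 20, "ERA": 2.50, "OBA": None},
    {"Year": 2003, "W": 15, "ERA": 3.10, "OBA": 0.240},
]

print(blank_counts(records))
```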
10
Step 6: Transform Data to Bring
Information to the Surface

There are no combinations of attributes.
We delete some attributes.
We add an attribute: Year.
We add an attribute (CyYoungAward_Winner)
for classification.
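The transformations above might look roughly like the following sketch: drop unwanted attributes, then stamp each row with its `Year` and a `CyYoungAward_Winner` label. The label values WINNER and NONWINNER follow the confusion matrices in this deck; the `Player` and `Jersey` fields, the sample rows and the winner set are hypothetical placeholders.

```python
from typing import Dict, Iterable, List, Set


def drop_attributes(rows: List[Dict], unwanted: Iterable[str]) -> List[Dict]:
    """Remove attributes that do not describe pitching performance."""
    unwanted = set(unwanted)
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


def label_season(rows: List[Dict], year: int, winner_names: Set[str]) -> List[Dict]:
    """Add the Year attribute and the classification target to each row."""
    labelled = []
    for row in rows:
        new_row = dict(row)
        new_row["Year"] = year
        is_winner = row["Player"] in winner_names
        new_row["CyYoungAward_Winner"] = "WINNER" if is_winner else "NONWINNER"
        labelled.append(new_row)
    return labelled


# Hypothetical season rows; "Jersey" stands in for a non-representative item.
season_rows = [
    {"Player": "Pitcher A", "W": 19, "Jersey": 57},
    {"Player": "Pitcher B", "W": 12, "Jersey": 31},
]

cleaned = drop_attributes(season_rows, ["Jersey"])
labelled = label_season(cleaned, 2006, {"Pitcher A"})
for row in labelled:
    print(row)
```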
11
Step 7: Build Models





Tools Used
Weka Crash Problem
Blank Attributes
Build Model
Handling Blank Attributes
12
Tools Used
13
Weka Crash Problem

Raw data
  21456 data instances
  42 attributes
Weka crashed during model construction
Give Weka more memory
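Weka runs on the Java virtual machine, so "more memory" usually means a larger maximum heap, set with the JVM's `-Xmx` option. The sketch below only builds such a command line; the 1024 MB figure and the `weka.jar` path are assumptions, since the slides do not say what setting was used.

```python
from typing import List


def weka_command(heap_mb: int, jar_path: str = "weka.jar") -> List[str]:
    """Build a command that starts Weka with a larger maximum Java heap."""
    return ["java", "-Xmx%dm" % heap_mb, "-jar", jar_path]


cmd = weka_command(1024)
print(" ".join(cmd))

# To actually launch Weka (requires Java and the jar at that path):
# import subprocess
# subprocess.run(cmd, check=True)
```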
14
Blank Attributes
15
Build Model

MLB 1956~2006, with blank attributes
  ADTree
MLB 1956~2006, without blank attributes
  ADTree
MLB 1999~2006
  ADTree
16
Handling Blank Attributes
17
1956~2006, with blank attributes, ADTree
18
1956~2006, with blank attributes, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
    21343      21   NONWINNER
       58      34   WINNER
19
1956~2006, without blank attributes, ADTree
20
1956~2006, without blank attributes, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
    21350      14   NONWINNER
       62      30   WINNER
21
1999~2006, ADTree
22
1999~2006, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
     5090       3   NONWINNER
       13       3   WINNER
23
Step 8: Assess Models (1/2)

Not good enough for gambling
1956~2006, without blank attributes, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
    21350      14   NONWINNER
       62      30   WINNER
1999~2006, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
     5090       3   NONWINNER
       13       3   WINNER
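To make "not good enough" concrete, the sketch below derives accuracy, precision and recall for the WINNER class from the two confusion matrices on this slide. It assumes the usual Weka layout, in which rows are the actual class and columns the predicted class; the always-NONWINNER baseline is added here for comparison and is not part of the original slides.

```python
from typing import Dict


def winner_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Metrics for the WINNER class from one 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Accuracy of a model that always answers NONWINNER.
        "baseline_accuracy": (tn + fp) / total,
    }


# Row NONWINNER gives (tn, fp); row WINNER gives (fn, tp).
models = {
    "1956~2006, without blank attributes": winner_metrics(21350, 14, 62, 30),
    "1999~2006": winner_metrics(5090, 3, 13, 3),
}

for name, m in models.items():
    print(name)
    for key, value in m.items():
        print("  %-18s %.4f" % (key, value))
```

Overall accuracy is above 99% in both cases, but so is the baseline. Recall for actual winners is roughly 0.33 and 0.19, and precision roughly 0.68 and 0.50, which is the sense in which the models fall short for betting.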
24
Step 8: Assess Models (2/2)

Some attributes are more important than others
Number of appearances of each attribute in the different models
W BB WPCT OBA WHIP K/9 ERA GF
1956~2006
ADTree
2
1
1956~2006 Without Blank Attributes
ADTree
2
1
1999~2006
ADTree
2
1
1956~2006 Without Blank Attributes
J48
3
2
3
1
1
1
1
1
1
1
1
1
1
25
Step 9: Deploy Models


Implement a computer program around the
built model.
Use it to predict the Cy Young Award winner more
easily.
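A deployed program could be as small as the sketch below: score every pitcher, then report the top-scoring pitcher in each league, since each league has one winner per year. Everything concrete here is an assumption: `toy_score` is a hand-written placeholder and NOT the trained ADTree, and the `League`, `Player`, `W` and `ERA` fields and the sample rows are invented for illustration.

```python
from typing import Callable, Dict, List


def predict_winners(candidates: List[Dict],
                    score: Callable[[Dict], float]) -> Dict[str, str]:
    """Return the highest-scoring pitcher for each league."""
    best: Dict[str, tuple] = {}
    for row in candidates:
        s = score(row)
        league = row["League"]
        if league not in best or s > best[league][0]:
            best[league] = (s, row["Player"])
    return {league: player for league, (_, player) in best.items()}


def toy_score(row: Dict) -> float:
    """Placeholder scoring rule; a real deployment would call the built model."""
    return row["W"] - row["ERA"]


# Hypothetical candidates for one season.
candidates = [
    {"Player": "Pitcher A", "League": "AL", "W": 19, "ERA": 2.8},
    {"Player": "Pitcher B", "League": "AL", "W": 15, "ERA": 3.5},
    {"Player": "Pitcher C", "League": "NL", "W": 16, "ERA": 3.1},
    {"Player": "Pitcher D", "League": "NL", "W": 21, "ERA": 3.4},
]

prediction = predict_winners(candidates, toy_score)
print(prediction)
```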
26
Step 10: Assess Results


Compare the predicted winner directly with the
actual Cy Young Award winner.
The motivation is “interest”, not “business”.

Assessment relies on personal judgment.
27
Conclusions





We used classification techniques to
build a predictive model.
The built model's accuracy in identifying winners
is not high.
There are some factors that we did not consider.
It cannot be used where real stakes
are involved.
It is just for fun.
28