Predicting the Winner of the Cy Young Award
Advisor: Dr. 黃三益
Team members:
尹川
陳隆賢
陳偉聖
1
Introduction
Baseball in Taiwan
MLB (Major League Baseball)
CPBL (Chinese Professional Baseball League)
Baseball in the USA
Cy Young Award, presented since 1956
Voted on by the Baseball Writers' Association of America
Winner decided by weighted scores
Since 1967, each league has one winner per year.
2
Measurements
There are no definite rules for judging the award.
Nevertheless, many measurements can be used to judge whether a pitcher is good or not.
Wins
ERA
WHIP
G/F
etc.
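The rate statistics listed above follow standard baseball definitions, which can be sketched as below. The function names and the sample stat line are illustrative assumptions and do not come from the slides; innings pitched are assumed to be true decimals (for example 200.0), not the ".1/.2" box-score notation for thirds of an inning.

```python
def era(earned_runs: int, innings_pitched: float) -> float:
    """Earned run average: earned runs allowed per nine innings."""
    return 9.0 * earned_runs / innings_pitched


def whip(walks: int, hits: int, innings_pitched: float) -> float:
    """Walks plus hits allowed per inning pitched."""
    return (walks + hits) / innings_pitched


def k_per_9(strikeouts: int, innings_pitched: float) -> float:
    """Strikeouts per nine innings (K/9)."""
    return 9.0 * strikeouts / innings_pitched


def win_pct(wins: int, losses: int) -> float:
    """Winning percentage (WPCT); 0.0 when the pitcher has no decisions."""
    decisions = wins + losses
    return wins / decisions if decisions else 0.0


# Hypothetical season line, used only to exercise the functions.
line = {"W": 18, "L": 6, "IP": 200.0, "ER": 60, "BB": 50, "H": 170, "SO": 200}

print("ERA ", era(line["ER"], line["IP"]))
print("WHIP", whip(line["BB"], line["H"], line["IP"]))
print("K/9 ", k_per_9(line["SO"], line["IP"]))
print("WPCT", win_pct(line["W"], line["L"]))
```

Lower is better for ERA and WHIP, while higher is better for Wins, K/9, and WPCT.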
3
Aim of the study
To analyze the historical statistics of pitchers.
To build a predictive model.
To predict future Cy Young Award winners.
4
Data mining procedure
Ten data mining methodology steps
5
Step 1: Translate the Problem
Directed data mining problem
Target variable: Cy Young Award
Classification
Decision tree
Purposes
Gambling game
Predictive activities
6
Step 2: Select Appropriate Data
MLB statistics only (1871 ~ 2006)
Cy Young Award: 1956 ~ 2006
“Time” factor
1999 as the dividing year, because some statistical items only appear from then on.
21456 records in total
List of Cy Young Award winners
Variables: remove the items that are not representative of a pitcher.
7
Step 3: Get to Know the Data
All the data we used come from the official MLB site.
These data have been publicly available for many years.
The data quality is very good.
Some attributes have values only from 1999 onward.
8
Step 4: Create a Model Set
We divide the data into training data and testing data.
We do not create a balanced sample.
The MLB records are not seasonal data.
We also select the data from 1999 onward.
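The slides do not say how the training/testing split was made. The following is a minimal sketch of one option, assuming a stratified random split that keeps the original (unbalanced) class ratio in both parts. The `Year` and `CyYoungAward_Winner` field names follow the attributes named in the slides; the function names, the 30% test fraction, and the toy records are assumptions.

```python
import random


def select_since(records, first_year):
    """Keep only records from first_year onward (the slides use 1999)."""
    return [r for r in records if r["Year"] >= first_year]


def stratified_split(records, label_key="CyYoungAward_Winner",
                     test_fraction=0.3, seed=0):
    """Split records into (train, test), preserving the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)

    train, test = [], []
    for rows in by_class.values():
        rows = rows[:]          # copy so the caller's list is not reordered
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


# Toy records standing in for the real pitcher-season table.
toy = [
    {"Year": 1990 + i % 20,
     "CyYoungAward_Winner": "WINNER" if i % 10 == 0 else "NONWINNER"}
    for i in range(100)
]

train, test = stratified_split(select_since(toy, 1999))
print(len(train), "training records,", len(test), "testing records")
```

Because winners are so rare, a purely random split could leave very few winners in the test set; stratifying by class is one way to avoid that without rebalancing the sample.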
9
Step 5: Fix Problems with the Data
The data are taken from the official MLB site.
No missing values
Single source
10
Step 6: Transform Data to Bring Information to the Surface
There are no combinations of attributes.
We delete some attributes.
We add an attribute: Year.
We add an attribute (CyYoungAward_Winner) for classification.
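The transformation described above can be sketched as follows. Only the attribute names `Year` and `CyYoungAward_Winner` and the class values WINNER / NONWINNER come from the slides; the `Player` field, the `label_records` helper, the dropped attribute, and the sample rows are hypothetical.

```python
def label_records(records, year, winners, drop=()):
    """Return copies of one season's rows with Year and the class label added.

    records: list of dicts, one per pitcher (hypothetical layout)
    winners: set of pitcher names who won the award that year
    drop:    attribute names to delete as not representative of a pitcher
    """
    labeled = []
    for rec in records:
        row = {k: v for k, v in rec.items() if k not in drop}
        row["Year"] = year
        row["CyYoungAward_Winner"] = (
            "WINNER" if rec["Player"] in winners else "NONWINNER"
        )
        labeled.append(row)
    return labeled


# Placeholder rows; names and numbers are made up for illustration.
season = [
    {"Player": "Pitcher A", "W": 20, "ERA": 2.50, "Team": "X"},
    {"Player": "Pitcher B", "W": 9, "ERA": 4.10, "Team": "Y"},
]

for row in label_records(season, 2005, winners={"Pitcher A"}, drop=("Team",)):
    print(row)
```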
11
Step 7: Build Models
Tools Used
Weka Crash Problem
Blank Attributes
Build Model
Handling Blank Attributes
12
Tools Used
13
Weka Crash Problem
Raw data
21456 data instances
42 attributes
Weka crashed during model construction
Give Weka more memory
14
Blank Attributes
15
Build Model
MLB 1956~2006, with blank attributes: ADTree
MLB 1956~2006, without blank attributes: ADTree
MLB 1999~2006: ADTree
16
Handling Blank Attributes
17
1956~2006, with blank attributes, ADTree
18
1956~2006, with blank attributes, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
    21343      21   NONWINNER
       58      34   WINNER
19
1956~2006, without blank attributes, ADTree
20
1956~2006, without blank attributes, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
    21350      14   NONWINNER
       62      30   WINNER
21
1999~2006, ADTree
22
1999~2006, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
     5090       3   NONWINNER
       13       3   WINNER
23
Step 8: Assess Models (1/2)
Not good enough for gambling
1956~2006, without blank attributes, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
    21350      14   NONWINNER
       62      30   WINNER
1999~2006, ADTree
=== Confusion Matrix ===
NONWINNER  WINNER   <-- classified as
     5090       3   NONWINNER
       13       3   WINNER
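To make "not good enough" concrete, the two confusion matrices above can be turned into standard metrics. The counts are taken directly from the slides; the `summarize` helper and the comparison against an "always predict NONWINNER" baseline are additions for illustration.

```python
def summarize(tn, fp, fn, tp):
    """Compute basic metrics for the WINNER class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Accuracy of a trivial model that labels every pitcher NONWINNER.
        "baseline_accuracy": (tn + fp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


models = {
    "1956~2006, without blank attributes": dict(tn=21350, fp=14, fn=62, tp=30),
    "1999~2006": dict(tn=5090, fp=3, fn=13, tp=3),
}

for name, counts in models.items():
    stats = summarize(**counts)
    print(name)
    for metric, value in stats.items():
        print(f"  {metric}: {value:.4f}")
```

Overall accuracy looks high only because winners are rare. The 1956~2006 model finds 30 of the 92 actual winners (recall of roughly 33%), and the 1999~2006 model finds 3 of 16, with an accuracy no better than always predicting NONWINNER.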
24
Step 8: Assess Models (2/2)
Some attributes are more important than others.
Number of Appearances of Attributes in Different Models
W BB WPCT OBA WHIP K/9 ERA GF
1956~2006
ADTree
2
1
1956~2006 Without Blank Attributes
ADTree
2
1
1999~2006
ADTree
2
1
1956~2006 Without Blank Attributes
J48
3
2
3
1
1
1
1
1
1
1
1
1
1
25
Step 9: Deploy Models
Implement a computer program based on the built model.
Predict the Cy Young Award winner more easily.
26
Step 10: Assess Results
Compare the predicted and the actual Cy Young Award winners directly.
Not “business” but “interest”.
Assessment relies on human judgment.
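A direct comparison of predicted and actual winners could look like the sketch below. The `(year, league)` keys, the `compare_winners` helper, and the placeholder pitcher names are all assumptions made for illustration; the slides do not describe an implementation.

```python
def compare_winners(predicted, actual):
    """Compare predictions with results; both map (year, league) -> pitcher name.

    Returns (rows, hits), where each row is (key, predicted, actual, matched).
    """
    rows = []
    for key in sorted(actual):
        guess = predicted.get(key)  # None when no prediction was made
        rows.append((key, guess, actual[key], guess == actual[key]))
    hits = sum(1 for row in rows if row[3])
    return rows, hits


# Placeholder names only; no real results are implied.
predicted = {(2007, "AL"): "Pitcher A", (2007, "NL"): "Pitcher C"}
actual = {(2007, "AL"): "Pitcher A", (2007, "NL"): "Pitcher D"}

rows, hits = compare_winners(predicted, actual)
for key, guess, winner, matched in rows:
    print(key, guess, winner, "match" if matched else "miss")
print(f"{hits} of {len(rows)} predictions matched")
```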
27
Conclusions
We used classification techniques to build a predictive model.
We find that the accuracy of the built model is not high.
There are some factors that we did not consider.
The model should not be used where substantial benefits are at stake.
Just for fun
28