
Using data mining methods to identify
college freshmen who need special assistance
in their academic performance.
Advisor: Prof. 呂學毅
Student: 陳彥璋
Background
[Diagram: colleges → instruction → grades]
Background (cont.)
[Diagram: students → influencing factors → grades]
Background (cont.)
[Diagram: a lower grade affects physical and mental health and can lead to drop-out]
Background (cont.)
[Diagram: college freshmen have more adaptation problems than students in higher grades; a lower grade affects physical and mental health and affects grades in the following semesters]
Background (cont.)
Guidance in college for students with low academic achievement covers learning, career planning, and emotional and life problems.
※ Taking Taiwan's National Yunlin University of Science and Technology as an example: since academic year 94 (2005) the university has implemented the 「強化學生輔導新體制工作計畫」 (plan to strengthen the new student guidance system) and has established an academic achievement warning policy for students.
Motivation (cont.)
[Timeline diagram]
The general practice: the list of students who need special assistance is compiled only after the final exam, once the test scores come out.
This study: the list is compiled when the freshmen enter, before the final exam and the release of test scores.
Motivation
[Diagram: factors that influence grades]
• Emotion, personality, …
• Learning motivation, learning engagement, …
• Family, intelligence, sex, …
Objective
The aim of this study is to construct a model with data mining tools to predict which college freshmen will have low academic achievement, to identify students who need special assistance in their academic performance, and to help them improve their academic performance through guidance as early as possible.
The negative effects of low academic achievement.
Author | Year | Finding
家扶基金會 | 2012 | Students' major source of stress is academic stress, more than three times greater than any other source; academic stress brings many negative effects.
Kaplan, D. S., Liu, R. X., & Kaplan, H. B. | 2005 | Low academic achievement brings academic stress, and academic stress affects physical and mental health.
張慧儀 | 2003 | Lower grades affect physical and mental health.
The problems of college freshmen with low academic achievement.
Author | Year | Finding
業邵國, 何英奇, 陳舜芬 | 2007 | Because the college environment is more complex than high school, freshmen entering the new environment encounter many adaptation problems.
潘正德 | 2007 | College freshmen with poor academic achievement may be affected for several subsequent semesters.
黃春枝 | 1999 | College freshmen entering the new environment encounter more adaptation problems than students in higher grades.
The relationship between personality (emotion) and grade.
Author | Year | Finding
McIlroy & Bunting | 2002 | Students with good personality and behavior tend to perform better academically.
Busato, Prins, Elshout, & Hamaker | 2000 | Intelligence, personality, motivation, and academic achievement are positively correlated.
Yeh et al. | 2007 | Students with anxiety or depression have their academic achievement affected.
Parker et al. | 2004 | Emotional traits and academic achievement are correlated.
Forecasting model construction process
Coding Data
Attributes of the primary data
Attribute name | Values | Type | Number of values
Sex (性別) | male; female | binary | 2
College (學院) | Engineering; Management; Design; Humanities and Science | categorical | 4
Sociability (社交性) | 0–20 points | numeric | –
Dominance (主導性) | 0–20 points | numeric | –
Initiative (行動力) | 0–20 points | numeric | –
Thoughtfulness (思考性) | 0–20 points | numeric | –
Activity (活動性) | 0–20 points | numeric | –
Aggressiveness (攻擊性) | 0–20 points | numeric | –
Criticalness (挑剔性) | 0–20 points | numeric | –
Objectivity (客觀性) | 0–20 points | numeric | –
Nervousness (神經質) | 0–20 points | numeric | –
Inferiority feelings (自卑感) | 0–20 points | numeric | –
Emotional variability (情緒轉變性) | 0–20 points | numeric | –
Depressive tendency (憂鬱性) | 0–20 points | numeric | –
Depression level (憂鬱程度) | normal; mild; marked; severe | categorical | 4
List result (名單結果) | regular student; high-concern student | binary | 2
Feature selection
Some attributes are noisy or redundant. This noise makes it more difficult to discover meaningful patterns in the data.
Dash (1997) proposed Sequential Backward Selection, using Shannon's entropy as the criterion for identifying the attributes with the greatest explanatory power.
Sequential Backward Selection:
    T = original variable set
    for k = 1 to M - 1 {                 /* iteratively remove one variable at a time */
        for every variable v in T {      /* determine which variable to remove */
            Tv = T - {v}
            calculate E_Tv on D using Eqn. (1)
        }
        let vk be the variable whose removal minimizes E_Tv
        T = T - {vk}                     /* remove vk as the least important variable */
        output vk
    }
Feature selection (cont.)
Shannon's entropy:

E = -\sum_{i=1}^{N} \sum_{j=1}^{N} \left[ S_{ij} \log S_{ij} + (1 - S_{ij}) \log(1 - S_{ij}) \right]   (1)

S_{ij} = e^{-\alpha \times D_{ij}}

where N is the number of records; S_{ij} is the similarity between any two records in the data set, with a value between 0 and 1; D_{ij} is the distance between records x_i and x_j; and α is an experimental parameter.
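A minimal Python sketch of this entropy-based backward selection, assuming a numeric data matrix, Euclidean distances, and an illustrative value of α (none of these choices are fixed by the slides):

```python
import numpy as np

def shannon_entropy(X, alpha=0.5):
    """Entropy of a data set from pairwise similarities, as in Eqn. (1)."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))        # pairwise distances D_ij
    S = np.exp(-alpha * D)                       # similarities S_ij
    S = np.clip(S, 1e-12, 1 - 1e-12)             # keep log() finite
    return -np.sum(S * np.log(S) + (1 - S) * np.log(1 - S))

def sequential_backward_selection(X, feature_names, alpha=0.5):
    """Remove, at each step, the variable whose removal minimizes the entropy
    of the remaining data; variables removed earlier are less important."""
    remaining = list(range(X.shape[1]))
    removal_order = []
    while len(remaining) > 1:
        scores = [(shannon_entropy(X[:, [c for c in remaining if c != v]], alpha), v)
                  for v in remaining]
        _, least_important = min(scores)
        remaining.remove(least_important)
        removal_order.append(feature_names[least_important])
    return removal_order
```

Under these assumptions, the variables still left in `remaining` at the end are the ones with the most explanatory power.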
Data mining
Data mining: K-fold cross-validation
K-fold cross-validation is mainly used when the goal is prediction, to estimate how accurately a predictive model will perform in practice (a sketch of the procedure follows below).
• One round of cross-validation involves partitioning a sample of data into complementary subsets.
• The analysis is performed on one subset (the training set).
• The analysis is validated on the other subset (the testing set).
• To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
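A minimal sketch of this procedure with scikit-learn; the choice of classifier and of k = 10 folds here are illustrative assumptions, not choices stated in the slides:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def k_fold_evaluate(X, y, model, k=10):
    """Average test accuracy over k complementary train/test partitions."""
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        model.fit(X[train_idx], y[train_idx])        # analysis on the training set
        predictions = model.predict(X[test_idx])     # validation on the testing set
        scores.append(accuracy_score(y[test_idx], predictions))
    return np.mean(scores)                           # average over the rounds

# Example: mean_accuracy = k_fold_evaluate(X, y, DecisionTreeClassifier(), k=10)
```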
Data mining: C4.5 decision trees
C4.5 (Quinlan, 1993) is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification.
[Diagram of a decision tree: internal nodes test attributes, branches carry the attribute values, and leaf nodes assign classes]
Data mining: C4.5 decision trees (cont.)
At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other.
• Its criterion is the normalized information gain (gain ratio) that results from choosing an attribute to split the data.
• The attribute with the highest normalized information gain is chosen to make the decision at that node.
• The C4.5 algorithm then recurses on the smaller sublists.

Information\ Gain(A) = I(p, n) - E(A)
I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}
E(A) = \sum_{i=1}^{v}\frac{p_i + n_i}{p + n} I(p_i, n_i)
Gain\ ratio = \frac{Information\ Gain(A)}{E(A)}

where I(p, n) is the information content before the attribute test and E(A) is the information content after the test; p and n are the numbers of records in the two classes; and v is the number of distinct values of attribute A.
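A small Python sketch of these quantities for one categorical attribute, following the formulas as they appear on this slide (note that the denominator here is E(A) as written above; standard C4.5 divides by the split information instead):

```python
import numpy as np

def info(p, n):
    """I(p, n): information content of a node with p and n records in the two classes."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:
            frac = count / total
            result -= frac * np.log2(frac)
    return result

def gain_ratio(attribute, labels):
    """Information gain and gain ratio of one categorical attribute (labels are 0/1)."""
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    p, n = np.sum(labels == 1), np.sum(labels == 0)
    before = info(p, n)                                   # I(p, n), before the test
    after = 0.0                                           # E(A), after the test
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        p_i, n_i = np.sum(subset == 1), np.sum(subset == 0)
        after += (p_i + n_i) / (p + n) * info(p_i, n_i)
    gain = before - after
    return gain, (gain / after if after > 0 else float("inf"))
```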
Data mining: Naïve Bayes classifier
A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naïve) independence assumptions.

NBC(X) = \arg\max_{y \in Y} P(y \mid x) = \arg\max_{y \in Y} \frac{P(x \mid y)\,P(y)}{P(x)} = \arg\max_{y \in Y} P(y)\prod_i P(x_i \mid y)

odd\ ratio = \frac{P(C=1 \mid X = x_1 x_2 \ldots x_i \ldots x_n)}{P(C=0 \mid X = x_1 x_2 \ldots x_i \ldots x_n)} = \frac{P(C=1)}{P(C=0)} \cdot \frac{\prod_{i=1}^{n} P_i(x_i \mid C=1)}{\prod_{i=1}^{n} P_i(x_i \mid C=0)}

\log\frac{P(C=1 \mid X = x_1 x_2 \ldots x_n)}{P(C=0 \mid X = x_1 x_2 \ldots x_n)} = \log\frac{P(C=1)}{P(C=0)} + \sum_{i=1}^{n}\log\frac{P_i(x_i \mid C=1)}{P_i(x_i \mid C=0)}

Decision rule: \frac{P(C=1 \mid X = x_1 x_2 \ldots x_n)}{P(C=0 \mid X = x_1 x_2 \ldots x_n)} \ge \theta

where P(x) is the probability of the attribute set x, P(y) is the probability of class y, and P(x | y) is the probability of the attribute set x given class y; the odd ratio is the posterior odds ratio; θ is an experimental parameter.
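A minimal sketch of this classifier for categorical attributes; the Laplace smoothing added here is an assumption to keep the log-odds finite and is not mentioned on the slide:

```python
import numpy as np

def train_naive_bayes(X, y, smoothing=1.0):
    """Estimate the class priors P(C) and the conditionals P_i(x_i | C)."""
    X, y = np.asarray(X), np.asarray(y)
    model = {"prior": {c: np.mean(y == c) for c in (0, 1)}, "cond": []}
    for i in range(X.shape[1]):
        values = np.unique(X[:, i])
        table = {}
        for c in (0, 1):
            column = X[y == c, i]
            table[c] = {v: (np.sum(column == v) + smoothing)
                           / (len(column) + smoothing * len(values))
                        for v in values}
        model["cond"].append(table)
    return model

def log_odds(model, x):
    """log of the posterior odds ratio P(C=1 | x) / P(C=0 | x)."""
    score = np.log(model["prior"][1] / model["prior"][0])
    for i, x_i in enumerate(x):
        score += np.log(model["cond"][i][1][x_i] / model["cond"][i][0][x_i])
    return score

def classify(model, x, theta=1.0):
    """Flag a student as high-concern (1) when the posterior odds ratio >= theta."""
    return int(log_odds(model, x) >= np.log(theta))
```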
Data mining: MLP artificial neural network
An ANN is a class of models whose members are obtained by varying parameters such as the connection weights or specifics of the architecture, for example the number of neurons or their connectivity.
[Diagram of a single neuron: weighted inputs are summed by an adder and passed through an activation function to produce the output y]
Data mining: MLP artificial neural network (cont.)
A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one.
[Diagram: input layer, hidden layer, output layer]
Data mining: MLP artificial neural network (cont.)
Activation function:

f(x) = \frac{1}{1 + e^{-x}}

Learning through backpropagation:

y_j = f(net_j)
net_j = \sum_i w_{ij} x_i^{n-1} - \theta_j
E = \frac{1}{2}\sum_j (T_j - y_j)^2
\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}

where x_i^{n-1} is the output of the i-th unit in layer n-1, i.e. an input to unit j in layer n; θ_j is the threshold of unit j; T_j is the target output of unit j; E is the error function, used to reduce the gap between the network output and the target output; w_{ij} is the connection weight between the i-th processing unit in layer n-1 and the j-th processing unit in layer n; and η is the learning rate, which controls the step size of each steepest-descent minimization of the error function.
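A compact sketch of one backpropagation step for a one-hidden-layer MLP with the sigmoid activation above; the layer sizes and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyMLP:
    """One-hidden-layer perceptron trained by gradient descent on E = 1/2 * sum (T - y)^2."""

    def __init__(self, n_in, n_hidden=8, n_out=1, eta=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.5, (n_in, n_hidden))   # input -> hidden weights
        self.W2 = rng.normal(0, 0.5, (n_hidden, n_out))  # hidden -> output weights
        self.eta = eta

    def forward(self, x):
        self.h = sigmoid(x @ self.W1)        # hidden activations
        self.y = sigmoid(self.h @ self.W2)   # network output
        return self.y

    def backprop(self, x, target):
        y = self.forward(x)
        # delta terms from dE/dnet, using the sigmoid derivative y * (1 - y)
        delta_out = (y - target) * y * (1 - y)
        delta_hid = (delta_out @ self.W2.T) * self.h * (1 - self.h)
        # weight updates: delta_w = -eta * dE/dw
        self.W2 -= self.eta * np.outer(self.h, delta_out)
        self.W1 -= self.eta * np.outer(x, delta_hid)
        return 0.5 * np.sum((target - y) ** 2)   # current error E
```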
Model evaluation: confusion matrix

                 | predicted True | predicted False
actual Positive  |       TP       |       FN
actual Negative  |       FP       |       TN

Accuracy = (TP + TN) / (TP + FN + FP + TN)
Sensitivity (true positive rate) = TP / (TP + FN)
Specificity = TN / (FP + TN)
False positive rate = FP / (FP + TN)
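A short sketch computing these metrics from actual and predicted labels (label 1 is assumed here to mean a high-concern student):

```python
import numpy as np

def confusion_metrics(actual, predicted):
    """Accuracy, sensitivity, specificity, and false positive rate from 0/1 labels."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    tp = np.sum((actual == 1) & (predicted == 1))
    fn = np.sum((actual == 1) & (predicted == 0))
    fp = np.sum((actual == 0) & (predicted == 1))
    tn = np.sum((actual == 0) & (predicted == 0))
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (fp + tn),
        "false_positive_rate": fp / (fp + tn),
    }
```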
Model evaluation: Receiver Operating Characteristic (ROC)
The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies.
Expected result
• This study expects to construct a forecasting model from college freshmen's data.
• The forecasting model will be built with data mining methods, selected from three classifiers (C4.5 decision trees, Naïve Bayes classifier, MLP artificial neural network).
• The forecasting model can identify college freshmen who need special assistance in their academic performance.
• Colleges can use the model to help students improve their academic performance through guidance as early as possible.
Gantt Chart
Q&A