
Educational Data Mining
March 3, 2010
Today’s Class
• EDM
• Assignment #5
• Mega-Survey
Educational Data Mining
• “Educational Data Mining is an emerging discipline, concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students, and the settings which they learn in.”
– www.educationaldatamining.org
Classes of EDM Method
(Romero & Ventura, 2007)
• Information Visualization
• Web mining
– Clustering, Classification, Outlier Detection
– Association Rule Mining/Sequential Pattern
Mining
– Text Mining
Classes of EDM Method
(Baker & Yacef, 2009)
• Prediction
• Clustering
• Relationship Mining
• Discovery with Models
• Distillation of Data for Human Judgment
Prediction
• Develop a model which can infer a single
aspect of the data (predicted variable) from
some combination of other aspects of the
data (predictor variables)
• Which students are using CVS?
• Which students will fail the class?
Clustering
• Find points that naturally group together,
splitting full data set into set of clusters
• Usually used when nothing is known about
the structure of the data
– What behaviors are prominent in domain?
– What are the main groups of students?
Relationship Mining
• Discover relationships between variables in a
data set with many variables
– Association rule mining
– Correlation mining
– Sequential pattern mining
– Causal data mining
• The Beck & Mostow (2008) article is a great example of this
Discovery with Models
• Pre-existing model (developed with EDM
prediction methods… or clustering… or
knowledge engineering)
• Applied to data and used as a component in
another analysis
Distillation of Data for Human Judgment
• Making complex data understandable by
humans to leverage their judgment
• Text replays are a simple example of this
Focus of today’s class
• Prediction
• Clustering
• Relationship Mining
• Discovery with Models
• Distillation of Data for Human Judgment
• There will be a term-long class on this, taught by
Joe Beck, in coordination with Carolina Ruiz’s
Data Mining class, in a future year
– Strongly recommended
Prediction
• Pretty much what it says
• A student is using a tutor right now.
Is he gaming the system or not?
• A student has used the tutor for the last half hour.
How likely is it that she knows the knowledge
component in the next step?
• A student has completed three years of high school.
What will be her score on the SAT-Math exam?
Two Key Types of Prediction
• Classification
• Regression
This slide adapted from slide by Andrew W. Moore, Google
http://www.cs.cmu.edu/~awm/tutorials
Classification
• General Idea
• Canonical Methods
• Assessment
• Ways to do assessment wrong
Classification
• There is something you want to predict (“the
label”)
• The thing you want to predict is categorical
– The answer is one of a set of categories, not a number
– CORRECT/WRONG (sometimes expressed as 0,1)
– HELP REQUEST/WORKED EXAMPLE
REQUEST/ATTEMPT TO SOLVE
– WILL DROP OUT/WON’T DROP OUT
– WILL SELECT PROBLEM A,B,C,D,E,F, or G
Classification
• Associated with each label are a set of
“features”, which maybe you can use to
predict the label
Skill           pknow   time   totalactions   right
ENTERINGGIVEN   0.704    9     1              WRONG
ENTERINGGIVEN   0.502   10     2              RIGHT
USEDIFFNUM      0.049    6     1              WRONG
ENTERINGGIVEN   0.967    7     3              RIGHT
REMOVECOEFF     0.792   16     1              WRONG
REMOVECOEFF     0.792   13     2              RIGHT
USEDIFFNUM      0.073    5     2              RIGHT
….
Classification
• The basic idea of a classifier is to determine
which features, in which combination, can
predict the label
Classification
• Of course, usually there are more than 4
features
• And more than 7 actions/data points
• I’ve recently done analyses with 800,000 student actions and 26 features
• 5 years ago that would’ve been a lot of data
• These days, in the EDM world, it’s just a
medium-sized data set
Classification
• One way to classify is with a Decision Tree
(like J48)
PKNOW
├─ < 0.5 → TIME
│    ├─ < 6 s. → RIGHT
│    └─ >= 6 s. → WRONG
└─ >= 0.5 → TOTALACTIONS
     ├─ < 4 → RIGHT
     └─ >= 4 → WRONG
Classification
• One way to classify is with a Decision Tree
(like J48)
PKNOW
├─ < 0.5 → TIME
│    ├─ < 6 s. → RIGHT
│    └─ >= 6 s. → WRONG
└─ >= 0.5 → TOTALACTIONS
     ├─ < 4 → RIGHT
     └─ >= 4 → WRONG

Skill          pknow   time   totalactions   right
COMPUTESLOPE   0.544    9     1              ?
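A minimal sketch of how this could be run in code, using scikit-learn's DecisionTreeClassifier as a stand-in for J48 (scikit-learn's trees are CART-style rather than C4.5, but the idea is the same), trained on the toy table above using the numeric features only:

```python
# Sketch: decision-tree classification of the toy table above.
from sklearn.tree import DecisionTreeClassifier

# pknow, time, totalactions for the seven actions in the table
X = [[0.704, 9, 1],
     [0.502, 10, 2],
     [0.049, 6, 1],
     [0.967, 7, 3],
     [0.792, 16, 1],
     [0.792, 13, 2],
     [0.073, 5, 2]]
y = ["WRONG", "RIGHT", "WRONG", "RIGHT", "WRONG", "RIGHT", "RIGHT"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Classify the unseen COMPUTESLOPE action from the slide
print(tree.predict([[0.544, 9, 1]]))
```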
Classification
• Another way to classify is with step
regression
• Linear regression (discussed later), with a cutoff
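A quick sketch of that idea; the 0.5 cutoff here is my assumption, since the slide does not fix a value:

```python
# Sketch: "step regression" = linear regression on a 0/1 label,
# then a cutoff turns the numeric prediction into a class.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.704, 9, 1], [0.502, 10, 2], [0.049, 6, 1],
              [0.967, 7, 3], [0.792, 16, 1], [0.792, 13, 2],
              [0.073, 5, 2]])
y = np.array([0, 1, 0, 1, 0, 1, 1])  # WRONG = 0, RIGHT = 1

reg = LinearRegression().fit(X, y)
labels = (reg.predict(X) >= 0.5).astype(int)  # assumed cutoff of 0.5
print(labels)
```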
And of course…
• There are lots of other classification algorithms you can use...
• SMO (support vector machine)
• In your favorite Machine Learning package
– WEKA
– RapidMiner
– KEEL
Comments? Questions?
How can you tell if a classifier is any good?
• What about accuracy?
– Accuracy = (# correct classifications) / (total number of classifications)
• 9200 actions were classified correctly, out of 10000 actions = 92% accuracy, and we declare victory.

What are some limitations of accuracy?
Biased training set
• What if the underlying distribution that you
were trying to predict was:
• 9200 correct actions, 800 wrong actions
• And your model predicts that every action is
correct
• Your model will have an accuracy of 92%
• Is the model actually any good?
What are some alternate metrics you could use?
• Kappa
– Kappa = (Accuracy – Expected Accuracy) / (1 – Expected Accuracy)
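A minimal sketch of both metrics so far, with expected accuracy computed from the confusion marginals, as in Cohen's kappa:

```python
# Sketch: accuracy and kappa for lists of actual/predicted labels.
from collections import Counter

def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def kappa(actual, predicted):
    n = len(actual)
    a_counts, p_counts = Counter(actual), Counter(predicted)
    # Chance agreement implied by the two label distributions
    expected = sum(a_counts[c] * p_counts[c] for c in a_counts) / (n * n)
    return (accuracy(actual, predicted) - expected) / (1 - expected)

actual    = ["RIGHT", "RIGHT", "WRONG", "RIGHT", "WRONG"]
predicted = ["RIGHT", "WRONG", "WRONG", "RIGHT", "RIGHT"]
print(accuracy(actual, predicted))  # 0.6
print(kappa(actual, predicted))     # ~0.17
```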
What are some alternate metrics
you could use?
• A’
• The probability that if the model is given an
example from each category, it will accurately
identify which is which
Comparison
• Kappa
– easier to compute
– works for an unlimited number of categories
– wacky behavior when things are worse than
chance
– difficult to compare two kappas in different data
sets (K=0.6 is not always better than K=0.5)
Comparison
• A’
– more difficult to compute
– only works for two categories (without
complicated extensions)
– meaning is invariant across data sets (A’=0.6 is
always better than A’=0.55)
– very easy to interpret statistically
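Following the definition above, A' can be computed by checking every (category-1, category-2) pair of examples; a minimal sketch, with ties in model confidence counted as half (a common convention, not stated on the slide):

```python
# Sketch: A' = probability the model ranks a random positive example
# above a random negative example, given its confidence scores.
def a_prime(confidences, labels):
    pos = [c for c, l in zip(confidences, labels) if l == 1]
    neg = [c for c, l in zip(confidences, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(a_prime([0.9, 0.7, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))  # ~0.83
```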
Comments? Questions?
What data set should you generally test
on?
• A vote…
– Raise your hands as many times as you like
What data set should you generally test
on?
• The data set you trained your classifier on
• A data set from a different tutor
• Split your data set in half (by students), train on
one half, test on the other half
• Split your data set in ten (by actions). Train on
each set of 9 sets, test on the tenth. Do this ten
times.
• Votes?
• What are the benefits and drawbacks of each?
The dangerous one
(though still sometimes OK)
• The data set you trained your classifier on
• If you do this, there is serious danger of overfitting
The dangerous one
(though still sometimes OK)
• You have ten thousand data points.
• You fit a parameter for each data point.
• “If data point 1, RIGHT. If data point 78,
WRONG…”
• Your accuracy is 100%
• Your kappa is 1
• Your model will neither work on new data, nor
will it tell you anything.
The dangerous one
(though still sometimes OK)
• The data set you trained your classifier on
• When might this one still be OK?
K-fold cross validation (standard)
• Split your data set in ten (by action). Train on
each set of 9 sets, test on the tenth. Do this
ten times.
• What can you infer from this?
– Your detector will work with new data from the
same students
K-fold cross validation (student-level)
• Split your data set in half (by student), train on
one half, test on the other half
• What can you infer from this?
– Your detector will work with data from new
students from the same population (whatever it
was)
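Both schemes are easy to set up in scikit-learn; a minimal sketch on random stand-in data (X, y, and the student assignments here are synthetic placeholders):

```python
# Sketch: action-level vs. student-level cross-validation.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 3))             # features, one row per action
y = rng.integers(0, 2, 200)          # label per action
students = rng.integers(0, 20, 200)  # which student produced each action

clf = DecisionTreeClassifier(max_depth=3)

# Split by action: estimates performance on new data from the SAME students
action_scores = cross_val_score(clf, X, y,
                                cv=KFold(n_splits=10, shuffle=True))

# Split by student: whole students held out, so the estimate speaks to
# NEW students from the same population
student_scores = cross_val_score(clf, X, y,
                                 cv=GroupKFold(n_splits=10), groups=students)
```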
A data set from a different tutor
• The most stringent test
• When your model succeeds at this test, you
know you have a good/general model
• When it fails, it’s sometimes hard to know why
An interesting alternative
• Leave-out-one-tutor-cross-validation
(cf. Baker, Corbett, & Koedinger, 2006)
– Train on data from 3 or more tutors
– Test on data from a different tutor
– (Repeat for all possible combinations)
– Good for giving a picture of how well your model
will perform in new lessons
Comments? Questions?
Regression
Regression
• There is something you want to predict (“the
label”)
• The thing you want to predict is numerical
– Number of hints student requests
– How long student takes to answer
– What will the student’s test score be
Regression
• Associated with each label are a set of
“features”, which maybe you can use to
predict the label
Skill           pknow   time   totalactions   numhints
ENTERINGGIVEN   0.704    9     1              0
ENTERINGGIVEN   0.502   10     2              0
USEDIFFNUM      0.049    6     1              3
ENTERINGGIVEN   0.967    7     3              0
REMOVECOEFF     0.792   16     1              1
REMOVECOEFF     0.792   13     2              0
USEDIFFNUM      0.073    5     2              0
….
Regression
• The basic idea of regression is to determine
which features, in which combination, can
predict the label’s value
Linear Regression
• The most classic form of regression is linear
regression
– Alternatives include Poisson regression, Neural
Networks...
Linear Regression
• The most classic form of regression is linear
regression
• Numhints = 0.12*Pknow + 0.932*Time – 0.11*Totalactions

Skill          pknow   time   totalactions   numhints
COMPUTESLOPE   0.544    9     1              ?
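Working the row through the learned function, as an arithmetic check (the answer is not given on the original slide):

Numhints = 0.12(0.544) + 0.932(9) – 0.11(1) ≈ 0.065 + 8.388 – 0.11 ≈ 8.34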
Linear Regression
• Linear regression only fits linear functions
(except when you apply transforms to the
input variables, which RapidMiner can do for
you…)
Linear Regression
• However…
• It is blazing fast
• It is often more accurate than more complex models,
particularly once you cross-validate
– Data Mining’s “Dirty Little Secret”
• It is feasible to understand your model
(with the caveat that the second feature in your model
is in the context of the first feature, and so on)
Example of Caveat
• Let’s study a classic example
• Drinking too much prune nog at a party, and
having an emergency trip to the Little
Researcher’s Room
Data
[Scatterplot: Number of emergencies (0–1) on the y-axis vs. Number of drinks of prune nog (0–10) on the x-axis]
Data
[Same scatterplot, annotated:] Some people are resistant to the deleterious effects of prunes and can safely enjoy high quantities of prune nog!
Learned Function
• Probability of “emergency” = 0.25 * (Drinks of nog last 3 hours) – 0.018 * (Drinks of nog last 3 hours)²
• But does that actually mean that (Drinks of nog last 3 hours)² is associated with fewer “emergencies”?
• No!
Example of Caveat
[Scatterplot: Number of emergencies vs. Number of drinks of prune nog]

• (Drinks of nog last 3 hours)² is actually positively correlated with emergencies!
– r=0.59
Example of Caveat
[Same scatterplot]

• The relationship is only in the negative direction when (Drinks of nog last 3 hours) is already in the model…
Example of Caveat
• So be careful when interpreting linear
regression models (or almost any other type
of model)
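This caveat is easy to reproduce. Below is a sketch on synthetic data shaped like the prune-nog example (the data-generating numbers are made up to mimic the learned function above): the squared term correlates positively with the outcome on its own, yet takes a negative coefficient once the linear term is in the model.

```python
# Sketch: a feature can correlate positively with the outcome, yet get a
# negative weight once a correlated feature is already in the model.
import numpy as np

rng = np.random.default_rng(0)
drinks = rng.uniform(0, 10, 500)
p = np.clip(0.25 * drinks - 0.018 * drinks**2, 0, 1)
emergencies = rng.binomial(1, p)

# On its own, drinks**2 is positively correlated with emergencies
print(np.corrcoef(drinks**2, emergencies)[0, 1])   # positive

# Jointly fit: the coefficient on drinks**2 comes out negative
X = np.column_stack([drinks, drinks**2, np.ones_like(drinks)])
coef, *_ = np.linalg.lstsq(X, emergencies, rcond=None)
print(coef)   # roughly [0.25, -0.018, ~0]
```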
Comments? Questions?
Discovery with Models
Why do Discovery with Models?
• Let’s say you have a model of some construct
of interest or importance
– Knowledge
– Meta-Cognition
– Motivation
– Affect
– Inquiry Skill
– Collaborative Behavior
– Etc.
Why do Discovery with Models?
• You can use that model to
– Find outliers of interest by finding out where the
model makes extreme predictions
– Inspect the model to learn what factors are involved
in predicting the construct
– Find out the construct’s relationship to other
constructs of interest, by studying its
correlations/associations/causal relationships with
data/models on the other constructs
– Study the construct across contexts or students, by
applying the model within data from those contexts or
students
– And more…
Most frequently
• Done using prediction models
• Though other types of models (in particular
knowledge engineering models) are amenable
to this as well!
Boosting
• Let’s say that you have 300 labeled actions randomly sampled
from 600,000 overall actions
– Not a terribly unusual case, in these days of massive data sets, like
those in the PSLC DataShop
• You can train the model on the 300, cross-validate it, and then
apply it to all 600,000
• And then analyze the model across all actions
– Makes it possible to study larger-scale problems than a human could
do without computer assistance
– Especially nice if you have some unlabeled data set with nice
properties
• For example, additional data such as questionnaire data
(cf. Baker, Walonoski, Heffernan, Roll, Corbett, & Koedinger, 2008)
However…
• To do this and trust the result,
• You should validate that the model can
transfer across students, populations, and to
the learning software you’re using
– As discussed earlier
A few examples…
Middle School Gaming Detector
                  HARDEST SKILLS (pknow < 20%)   EASIEST SKILLS (pknow > 90%)
GAMED HURT        12% of the time                2% of the time
GAMED NOT HURT    2% of the time                 4% of the time
Skills from the Algebra Tutor
(L0 = initial probability of knowing the skill; T = probability of learning the skill at each opportunity)

skill                                      L0      T
AddSubtractTypeinSkillIsolatepositiveIso   0.01    0.01
ApplyExponentExpandExponentsevalradicalE   0.333   0.497
CalculateEliminateParensTypeinSkillElimi   0.979   0.001
CalculatenegativecoefficientTypeinSkillM   0.953   0.001
Changingaxisbounds                         0.01    0.01
Changingaxisintervals                      0.01    0.01
ChooseGraphicala                           0.001   0.306
combineliketermssp                         0.943   0.001
Which skills could probably be
removed from the tutor?
Which skills could use better
instruction?
Comments? Questions?
A lengthier example
(if there’s time)
• Applying Baker et al.’s (2008) gaming detector across contexts
Research Question
• Do students game the system because of state or trait
factors?
• If trait factors are the main explanation, differences between
students will explain much of the variance in gaming
• If state factors are the main explanation, differences between
lessons could account for many (but not all) state factors, and
explain much of the variance in gaming
• So: is the student or the lesson a better predictor of gaming?
Application of Detector
• After validating its transfer
• We applied the gaming detector across 35
lessons, used by 240 students, from a single
Cognitive Tutor
• Giving us, for each student in each lesson, a
gaming frequency
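A sketch of that last step, turning per-action detector output into a per-student, per-lesson gaming frequency (the column names here are hypothetical):

```python
# Sketch: aggregate 0/1 gaming predictions into gaming frequencies.
import pandas as pd

actions = pd.DataFrame({
    "student": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "lesson":  ["Scatterplot", "Scatterplot", "Percents",
                "Scatterplot", "Percents", "Percents"],
    "gamed":   [1, 0, 0, 0, 1, 1],   # detector's per-action prediction
})

# Gaming frequency = proportion of actions classified as gaming
gaming_freq = actions.groupby(["student", "lesson"])["gamed"].mean()
print(gaming_freq)
```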
Model
• Linear Regression models
• Gaming frequency = Lesson + a0
• Gaming frequency = Student + a0
Model
• Categorical variables transformed to a set of binaries (see the sketch after this list)
• i.e. Lesson = Scatterplot becomes
– 3DGeometry = 0
– Percents = 0
– Probability = 0
– Scatterplot = 1
– Boxplot = 0
– Etc…
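This transformation is one-hot encoding; a minimal pandas sketch (the "lesson" column name is hypothetical):

```python
# Sketch: expand a categorical lesson variable into 0/1 binaries.
import pandas as pd

df = pd.DataFrame({"lesson": ["Scatterplot", "Percents", "3DGeometry"]})
binaries = pd.get_dummies(df["lesson"], dtype=int)
print(binaries)   # one 0/1 column per lesson
```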
Metrics
r²
• The correlation, squared
• The proportion of variability in the data set that is accounted for by a statistical model
r²
• However, a limitation
• The more variables you have, the more variance you should expect to predict, just by chance
r²
• We should expect 240 students to predict gaming better than 35 lessons
– Just by overfitting
So what can we do?
BIC
• Bayesian Information Criterion (Raftery, 1995)
• Makes a trade-off between goodness of fit and flexibility of fit (number of parameters)
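For reference, the standard form (not spelled out on the slide) is BIC = k ln(n) – 2 ln(L̂), where k is the number of parameters, n the number of data points, and L̂ the maximized likelihood; lower is better. The BIC' values on the next slides are scaled relative to the null model, so negative values indicate a model better than chance.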
Predictors
The Lesson
• Gaming frequency = Lesson + a0
• 35 parameters
• r² = 0.55
• BIC' = -2370
– Model is significantly better than chance would predict given model size & data set size
The Student
• Gaming frequency = Student + a0
• 240 parameters
• r² = 0.16
• BIC' = 1382
– Model is worse than chance would predict given model size & data set size!
[Figure: standard deviation bars, not standard error bars]
Comments? Questions?
EDM – where?
[Framework quadrant: Holistic, Existentialist, Essentialist, Entitative]
Today’s Class
• EDM
• Assignment #5
• Mega-Survey
Any questions?
Mega-Survey
• I need a volunteer to bring these surveys to
Jim Doyle after class
• *NOT THE REGISTRAR*
Mega-Survey Additional Questions (See back)
• #1: In future years, should this class be given
– 1: In half a semester, as part of a unified semester class, along with Professor Skorinko’s Research Methods class
– 3: Unsure/neutral
– 5: As a full-semester class, with Professor Skorinko’s class as a prerequisite
• #2: Are there any topics you think should be dropped from this class? [write your answer in the space to the right]
• #3: Are there any topics you think should be added to this class? [write your answer in the space to the right]