Transcript pptx

Special Topics in
Educational Data Mining
HUDK5199
Spring, 2013
April 24, 2012
Today’s Class
• Sequential Pattern Mining
Two categories of SPM
(That we’ll discuss today)
• Approaches Coming from Association Rule
Mining
• MOTIF Extraction
Association Rule Mining
• Try to automatically find if-then rules within
the data set
Sequential Pattern Mining
• Try to automatically find temporal patterns
within the data set
ARM Example
• If person X buys diapers,
• Person X buys beer
• Purchases occur at the same time
SPM Example
• If person X buys novel Foundation now,
• Person X buys novel Second Foundation in a
later transaction
• Conclusion: recommend Second Foundation
to people who have previously purchased
Foundation
SPM Example
• Many customers rent Star Wars, then the
Empire Strikes Back, then Return of the Jedi
• Doesn’t matter if they rent other stuff inbetween
SPM Example
• Many customers buy flowers, and then buy
diapers AND diaper cream several months
later
SPM Example
• Many learners become confused, then game
the system, then become frustrated, then
complete gaming the system, then become reengaged
Different Constraints than ARM
• If-then elements do not need to occur in the
same data point
• Instead
– If-then elements should have same user (or other
organizing variable)
– If elements can be within a certain time window
of each other
– Then element time should be within a certain
window after if times
Sequential Pattern Mining
• Find all subsequences in data with high
support
• Support calculated as number of sequences
that contain subsequence, divided by total
number of sequences
Sequential Pattern Mining
• What are some subsequences with high
support? (What is their support?)
•
•
•
•
Chuck: a, abc, ac, de, cef
Darlene: af, ab, acd, dabc, ef
Egoberto: aef, ab, aceh, d, ae
Francine: a, bc, acf, d, abeg
Questions? Comments?
Algorithms for SPM
GSP (Generalized Sequential Pattern)
• Classic Algorithm
• (Srikant & Agrawal, 1996)
Data pre-processing
• Data transformed from individual actions to
sequences by user
• E.g.
• Bob: {GAMING and BORED, OFF-TASK and
BORED, ON-TASK and BORED, GAMING and
BORED, GAMING and FRUSTRATED, ON-TASK
and BORED}
Data pre-processing
• In some cases, time also included
• E.g.
• Bob: {GAMING and BORED 5:05:20, OFF-TASK
and BORED 5:05:40, ON-TASK and BORED
5:06:00, GAMING and BORED 5:06:20,
GAMING and FRUSTRATED 5:06:40, ON-TASK
and BORED 5:07:00}
Algorithm
• Take the whole set of sequences of length 1
– May include “ANDed” combinations at same time
• Find which sequences of length 1 have support over
pre-chosen threshold
• Compose potential sequences out of pairs of
sequences of length 1 with acceptable support
• Find which sequences of length 2 have support over
pre-chosen threshold
• Compose potential sequences out of triplets of
sequences of length 1 and 2 with acceptable support
• Continue until no new sequences found
Let’s execute GPS algorithm
• With min support = 50%
Let’s execute GPS algorithm
• With min support = 50%
•
•
•
•
Chuck: a, abc, ac, de, cef
Darlene: af, ab, acd, dabc, ef
Egoberto: aef, ab, aceh, d, ae
Francine: a, bc, acf, d, abeg
Other algorithms
• Free-Span
• Prefix-Span
• Select sub-sets of data to search within
• Faster, but same basic idea as in GPS
Uses in educational domains
Perera et al. (2009)
• What were the three ways that Perera et al.
(2009) used sequential pattern mining?
• What did they learn, and how did they use the
information?
Perera et al. (2009)
1. Overall uses of collaborative tools by groups
2. Sequences of collaborative tool use by different
group members
3. Sequences of access of specific resources by
different group members
• In all cases, they found common patterns and
then looked at how support differed for
successful and unsuccessful groups
Perera et al. (2009):
Important Findings
1. Overall uses of collaborative tools by groups
– Successful groups used ticketing system more
than the wiki; weaker groups used wiki more
– Patterns were particularly strong for group
leaders
Perera et al. (2009):
Important Findings
2. Sequences of collaborative tool use by
different group members
– Successful groups characterized by leader
opening ticket and other student working on
ticket
– Successful groups characterized by students
other than leader opening ticket, and other
students working on ticket
Perera et al. (2009):
Important Findings
3. Sequences of access of specific resources by
different group members
– The best groups had interactions around the
same resource by multiple students
– The poor groups did no work on tickets before
closing them
Questions? Comments?
MOTIF Extraction
Motif
• Short, recurring pattern in a sequence of
categories occurring over time
Motif in Music
• Short, recurring pattern of notes in a musical
composition
Motif in Music
• What’s the motif?
• http://www.youtube.com/watch?v=rRgXUFnf
KIY
• How many times does the motif occur?
Motif in Music
• What’s the motif?
• http://www.youtube.com/watch?v=rRgXUFnf
KIY
• How many times does the motif occur?
– Depends on how you define it, right?
– And that’s part of the challenge…
Motif in Language
• Short, recurring pattern of characters in a
sequence of characters occurring over time
Motif in Genetics
• Short, recurring pattern of genes in a
sequence of genes occurring over time
• Typically written as letters
Goal of Motif Extraction
• Discern a common pattern of characters in a
large corpus of characters
• The characters may vary slightly from case to
case
Can you find the motif?
Can you find the motif?
UBSWWDFKLWPRHUC
INUSUNSGDAAICAV
JBDPXBDVEJVMBKK
XRZZWCDXOVZZJKQ
VBDWNLROFVUBFFW
OWIFTIENDOXJXIOB
AUAAOOXZAABZSBT
VOVCROMCJTOLXYU
MUAWSNTVZXSFHMI
LFQRKUTFRIENDOV
ONJORIFCGAUGIRN
HUVRYFREENDOBBGC
AQJBVXJCAJLEMAU
PJGCHBDQIWJJTMQ
LOMTPOQHJVYYMFJ
LWGJMVPKYOZNMSA
IQYQHKKBNBVDFPV
JJLHWPZAYZIGGEH
RUPMFOHPVSPPVPT
BAZXVFTPQFQJVBM
IGJZRMAAWJBESSS
HLPMOKUOXGRIENDO
IRPWYIRJISLFVFF
JXZFRIEMDOVZRBJY
How would you describe the motif?
UBSWWDFKLWPRHUC
INUSUNSGDAAICAV
JBDPXBDVEJVMBKK
XRZZWCDXOVZZJKQ
VBDWNLROFVUBFFW
OWIFTIENDOXJXIOB
AUAAOOXZAABZSBT
VOVCROMCJTOLXYU
MUAWSNTVZXSFHMI
LFQRKUTFRIENDOV
ONJORIFCGAUGIRN
HUVRYFREENDOBBGC
AQJBVXJCAJLEMAU
PJGCHBDQIWJJTMQ
LOMTPOQHJVYYMFJ
LWGJMVPKYOZNMSA
IQYQHKKBNBVDFPV
JJLHWPZAYZIGGEH
RUPMFOHPVSPPVPT
BAZXVFTPQFQJVBM
IGJZRMAAWJBESSS
HLPMOKUOXGRIENDO
IRPWYIRJISLFVFF
JXZFRIEMDOVZRBJY
Finding motifs
• Several algorithms
Finding motifs
• Variant on PROJECTION algorithm (Tompa &
Buhler, 2001) used in (Shanabrook et al.,
2010)
– Only example of motif extraction in educational
data mining so far
Big idea
• For each character string C that could be a
motif example (e.g. all character strings of
desired length)
– Create a set of projections, random variations of C
that vary in one or more ways
Big idea
• For each pair of strings C1 and C2, see how many
overlaps there are between their projection
matrices
• Take the pair with the most matches and combine
into a motif
– Creating multi-example motif if 3+ get added together
• Repeat until goal number of motifs is found, or
until new motif is below criterion goodness
Motif in Education
• Short, recurring pattern of behaviors in a
sequence of behaviors occurring over time
• Written as letters in Shanabrook et al. (2010)
Detail for education
• How do you segment student behavior?
• Could use student’s interaction on an entire problem,
and compute letters across whole problem
– Might make more sense in tutors with shorter problems
• Could use student’s interaction on an entire problem,
and define letters differently for context within whole
problem
– Approach used by Shanabrook et al. (2010)
• Could use “sliding window” of N actions
Behaviors in Shanabrook et al.
• “hints (a, b, c) – Hints is a measure of the number of hints viewed
for this problem. Although each problem has a maximum number
of hints, the hint count does not have an upper bound because
students can repeat hints and the count will increase at each
repeated view. The three categories for hints are: (a) no hints,
meaning that thestudent did not use the hint facility for that
problem, (b) meaning the student used the hint facility, but was
not given the solution, and (c) last hint solved, meaning that the
student was given the solution to the problem by the last hint. As
described above, this metric combines two values logged by the
tutor: the count of hints seen, and an indicator that the final hint
giving the answer was seen. The data could have been simply
binned low, medium, high hints; however, this would have missed
the significance of zero hints and using hints to reveal the problem
solution.”
Behaviors in Shanabrook et al.
• “secFirst (d, e, f) – The seconds to first attempt is an
important measure as it is during this time that the student
is reading the problem and formulating their response. In
previous research [6], five seconds was determined to be a
threshold for this metric representing gaming: students
who make a first attempt in less than five seconds are
considered not working on-task. We divide secFirst into
three bins: (d) less than 5 sec, (e) 5 to 30 sec, (f) greater
than 30 sec. (d) represents students who are gaming the
system, (e) represents a moderate time to the first attempt,
(f) represents a long time to the first attempt. The cut at 30
seconds was chosen because it equalizes the distribution of
bins (e and f), representing a division between a moderate
and a long time to the first attempt.”
Behaviors in Shanabrook et al.
• “secOther (g, h, i, j, k) – This variable represents actions related to
answering the problem after the first attempt was made. While the first
attempt includes the problem reading and solution time, subsequent
solution attempts could be much quicker and the student could still be
making good effort. secOther is categorized in five bins: (g) skip, (h) solved
on first, (i) 0 to 1.2 sec, (j) 1.2 to 2.9 sec, (k) greater than 2.9 sec. First,
there are two categorical bins, skip and solve on first attempt. These are
each determined from an indicator in the log data for that problem.
Skipping a problem implies only that students never clicked on a correct
answer; they could have worked on the problem and then given up, or
immediately skipped to the next problem with only a quick look. Solved
on first attempt indicates correctly solving the problem. If neither of the
first two bins are indicated in the logs, then the secOther metric measures
the mean time for all attempts after the first. The divisions of 1.2 sec and
2.9 sec for the latter three bins were obtained using the mean and one
standard deviation above the mean for all tutor usage; (i) less than 1.2
seconds would indicate guessing, (j) would indicate normal attempts, and
(k) would indicate a long time between attempts.”
Behaviors in Shanabrook et al.
• “numIncorrect – (o, p, q) - Each problem has four
or five possible answer choices, that we divide
into three groups: (o) zero incorrect attempts,
indicates either solved on first attempt, skipped
problem, or last hint solves problem (defined by
the other metrics); (p) indicates choosing the
correct answer in the second or third attempt,
and (q) obtaining the answer by default in a four
answer problem or possibly guessing when there
is five answer problem.”
What other constructs
could be used?
• What other kinds of constructs could be used
for the atoms of motif analyses in educational
analyses?
– At this grain-size (e.g. specific actions)
What other constructs
could be used?
• What other kinds of constructs could be used
for the atoms of motif analyses in educational
analyses?
– At other grain-sizes?
Common Motifs
• {adgo, adip, adiq}
• {aeho, afho}
• {ceho}
• {adgo, aeho}
• {aeiq aeho aeho aekp aeho aeiq aeho aeip
aeho aeip}
Interpretations
(Shanabrook et al., 2010)
• {adgo, adip, adiq} – gaming the system
• {aeho, afho} – “This student is using the tutor
appropriately, but not being challenged.”
• {ceho} – problem is too difficult
• {adgo, aeho} – student is skipping problems
• {aeiq aeho aeho aekp aeho aeiq aeho aeip aeho
aeip} – working on-task
Do you agree with interpretations?
• {adgo, adip, adiq} – gaming the system
• {aeho, afho} – “This student is using the tutor
appropriately, but not being challenged.”
• {ceho} – problem is too difficult
• {adgo, aeho} – student is skipping problems
• {aeiq aeho aeho aekp aeho aeiq aeho aeip aeho
aeip} – working on-task
How can researchers form good
interpretations?
Questions? Comments?
What else?
• What else could sequential pattern mining
and motif extraction be used for in education?
– Beyond Perera et al. and Shanabrook et al.
Asgn. 8
• Solutions
Asgn. 9
• Questions?
• Comments?
Next Class
• Monday, April 29
• Learnograms
• Guest lecturer: Arnon Hershkovitz
– World-famous inventor of learnograms
• Readings: None
• Assignments Due: None
The End