Towards Assessing Students’ Fine-Grained Knowledge: Using an Intelligent Tutor for Assessing
Mingyu Feng
August 18th, 2009
Ph.D. Dissertation Committee:
Prof. Neil T. Heffernan (WPI)
Prof. Joseph E. Beck (WPI)
Prof. Carolina Ruiz (WPI)
Prof. Kenneth R. Koedinger (CMU)
Worcester Polytechnic Institute
Motivation – the need
Concerns about poor student performance on new state tests
High-stakes standards-based tests are required by the No Child Left Behind (NCLB) Act
Student performance is not satisfactory
Massachusetts (2003: 20% failed 10th grade math on the first try)
Worcester
Secondary teachers are asked to be data-driven
MCAS test reports
Formative assessment and practice tests
Provided by Northwest Evaluation Association; Measured Progress;
Pearson Assessments, etc.
Motivation – the problems
I: Formative assessment takes time from
instruction
NCLB or NCLU (No Child Left Untested)?
Every hour spent assessing students is an hour lost
from instruction
Limited classroom time compels teachers to make
a choice
Motivation – the problems
II: Performance reports are not satisfactory
Teachers want more frequent and more detailed reports
Confrey, J., Valenzuela, A., & Ortiz, A. (2002). Recommendation to the Texas State Board of Education on
the Setting of TAKS Standards: A Call to Responsible Action. At http://www.syrce.org/State_Board.htm
Main Contributions
Improved assessment by taking into account how much assistance students need (WWW’06; ITS’06; EDM’08; UMUAI Journal’09, nominated for the James Chen award)
Established a way to track and predict performance longitudinally over multiple years (WWW’06; EDM’08)
Rigorously evaluated the effectiveness of skill models of various granularities (AAAI’06 EDM Workshop; TICL’07; IEEE Journal’09)
Used a data-mining approach to evaluate the effectiveness of individual content (AIED’09)
Used data mining to refine existing skill models (EDM’09; in preparation)
Developed an online reporting system deployed and used by real teachers (AIED’05; Book chapter’07; TICL Journal’06; JILR Journal’07)
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
Dynamic assessment
Longitudinal modeling
Cognitive diagnostic modeling
Conclusion & general implications
ASSISTments System
A web-based tutoring system that assists students
in learning mathematics and gives teachers
assessment of their students’ progress
Teachers like ASSISTments
Students like ASSISTments
An ASSISTment
We break multi-step items (original questions) into scaffolding questions
Attempt: a student takes an action to answer a question
Response: the correctness of the student’s answer (1/0)
Hint messages: given on demand; they hint at what step to do next
Buggy message: a context-sensitive feedback message
Skill: a piece of knowledge required to answer a question
Facts about ASSISTments
5000+ students have used the system regularly
More than 10 million data records collected
Other features
Learning experiments; authoring tools, account and
class management toolkit …
The dissertation uses data from about 1,000 students who used ASSISTments during 2004-2006
AIED’05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.), Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 555-562. Amsterdam: IOS Press.
Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., &
Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds). Intelligent
Educational Machines within the Intelligent Systems Engineering Book Series . pp.23-49. Springer Berlin / Heidelberg.
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
Dynamic assessment
Longitudinal modeling
Cognitive diagnostic modeling
Conclusion & general implications
A Grade Book Report
Where does this score come from?
JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of
Interactive Learning Research. 18 (2), pp. 207-230. Chesapeake, VA: AACE.
TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition,
and Learning Journal. Vol. 3. Old City Publishing, Philadelphia, PA. 2006.
Automated Assessment
Big idea: use the data collected while a student uses ASSISTments to assess that student
Lots of types of data available
(the last screen just used % correct on original questions)
Lots of other possible measures
Why should we be more complicated?
A Grade Book Report
Static – does not distinguish “Tom” and “Jack”
Average – ignores development over time
Uninformative – offers no guidance for classroom instruction
Dynamic assessment
Longitudinal modeling
Cognitive diagnostic assessment
Dynamic Assessment – the idea
Dynamic testing began before computerized testing (Brown,
Bryant, & Campione, 1983).
Brown, A. L., Bryant, N.R., & Campione, J. C. (1983). Preschool children’s learning and transfer of matrices problems:
Potential for improvement. Paper presented at the Society for Research in Child Development meetings, Detroit.
Dynamic vs. Static Assessment
Developing dynamic testing metrics (computed from tutor interaction logs; see the sketch below):
# attempts
# minutes to come up with an answer; # minutes to complete an ASSISTment
# hint requests; # hint-before-attempt requests; # bottom-out hints
% correct on scaffolds
# problems solved
“Static” measure
correct/wrong on original questions
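To make those metrics concrete, here is a minimal sketch, assuming a hypothetical interaction log with invented column names (not the actual ASSISTments schema), of how the per-student dynamic metrics could be aggregated:

```python
# Minimal sketch: aggregate per-student dynamic-assessment metrics from a
# hypothetical interaction log. Column names and values are illustrative.
import pandas as pd

log = pd.DataFrame({
    "student":     ["s1", "s1", "s1", "s2", "s2"],
    "item":        [1, 1, 2, 1, 3],
    "correct":     [0, 1, 1, 1, 0],        # response (1/0)
    "attempts":    [2, 1, 1, 1, 3],        # attempts on the question
    "seconds":     [40, 25, 30, 20, 90],   # time to answer
    "hints":       [1, 0, 0, 0, 2],        # hint requests
    "is_scaffold": [False, True, False, False, False],
})

per_student = log.groupby("student").agg(
    problems_solved=("item", "nunique"),
    avg_attempt=("attempts", "mean"),
    avg_hint_request=("hints", "mean"),
    avg_question_time=("seconds", "mean"),
)
# % correct on scaffolding questions only (NaN if a student saw none)
per_student["scaffold_percent_correct"] = (
    log[log["is_scaffold"]].groupby("student")["correct"].mean()
)
print(per_student)
```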
Dynamic Assessment – data
2004-2005 Data
Sept, 2004 – May, 2005
391 students
Online data
267 minutes (sd. = 79); 9 days; 147 items (sd. = 60)
8th grade MCAS scores (May, 2005)
2005-2006 Data
Sept, 2005 – May, 2006
616 students
Online data
196 minutes (sd. = 76); 6 days; 88 items (sd. = 42)
8th grade MCAS scores (May, 2006)
Dynamic Assessment - modeling
Three linear stepwise regression models, each predicting the MCAS score:
The standard test model: 1-parameter IRT proficiency estimate
The assistance model: all online metrics
The mixed model: 1-parameter IRT proficiency estimate + all online metrics
1-parameter IRT: one-parameter item response theory model
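A minimal sketch of the three-model comparison, using plain least-squares fits on synthetic data in place of the actual stepwise regressions (variable selection and the real datasets are omitted; all names and numbers are invented):

```python
# Fit the standard, assistance, and mixed models on synthetic data and compare
# their in-sample mean absolute deviation (MAD). Stepwise selection omitted.
import numpy as np

rng = np.random.default_rng(0)
n = 200
irt_theta = rng.normal(size=(n, 1))              # 1-parameter IRT estimates
online = rng.normal(size=(n, 4))                 # stand-ins for the online metrics
mcas = (30 + 5 * irt_theta[:, 0]
        + online @ np.array([1.0, -2.0, 0.5, 1.5])
        + rng.normal(size=n))                    # synthetic MCAS scores

def fit_predict(X, y):
    X1 = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ beta

for name, X in [("standard test model", irt_theta),
                ("assistance model", online),
                ("mixed model", np.column_stack([irt_theta, online]))]:
    pred = fit_predict(X, mcas)
    print(f"{name}: MAD = {np.mean(np.abs(pred - mcas)):.2f}")
```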
Dynamic Assessment - evaluation
Bayesian Information Criterion (BIC)
Widely used model selection criterion
Resolves the overfitting problem by introducing a penalty term for the number of parameters
Formula: BIC = -2 ln(L) + k ln(n), with maximized likelihood L, k parameters, and n observations
Prefer the model with the lower BIC
Mean Absolute Deviation (MAD)
Cross-validated prediction error
Function: MAD = (1/n) Σ |predicted - actual|
Prefer the model with the lower MAD
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111-163.
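Both criteria are simple to compute; a small sketch using the standard textbook definitions above (the numbers passed in are toy values):

```python
# BIC and MAD helpers matching the definitions above; lower is better for both.
import math

def bic(log_likelihood: float, k: int, n: int) -> float:
    """BIC = -2 ln L + k ln n (k parameters, n observations)."""
    return -2.0 * log_likelihood + k * math.log(n)

def mad(predicted, actual) -> float:
    """Mean absolute deviation between predictions and true scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

print(bic(log_likelihood=-150.0, k=5, n=391))       # toy values
print(mad([25.1, 30.2, 21.4], [25.0, 32.0, 22.0]))
```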
Dynamic Assessment - results

Model                     BIC    MAD    Correlation with 2005 8th grade MCAS
The standard test model   -295   6.40   0.733
The assistance model      -402   5.46   0.821
The mixed model           -450   5.04   0.841

(each adjacent MAD comparison is significant at p = 0.001)
Dynamic Assessment – what variables are important?
Dynamic Assessment - robustness
See if model can generalize
Test model on other year’s data
Compare Models from Two Years
[Table: stepwise regression coefficients for the 2004-2005 and 2005-2006 models over the candidate predictors (Constant), IRT_Proficiency_Estimate, Scaffold_Percent_Correct, Avg_Question_Time, Avg_Attempt, Avg_Hint_Request, Question_Count, Avg_Item_Time, Total_Attempt; 2004-2005 coefficients: 32.414, 26.8, 20.427, -0.17, -10.5, -3.217, 3.284; 2005-2006 coefficients: 32.944, 21.327, -0.102, 0.072, 0.045, -0.044]
Which metrics are stable across years?
Dynamic Assessment - conclusion
ASSISTments data enables us to assess more
accurately
The relative success of the assistance model over the
standard test model highlights the power of the
dynamic measures
Feng, M., Heffernan, N.T., Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference, pp. 307-316. New York, NY: ACM Press. Best Student Paper Nominee.
Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online System
that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research
(UMUAI journal). 19(3), 2009.
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
Dynamic assessment
Longitudinal modeling
Cognitive diagnostic modeling
Conclusion & general implications
Can we have our cake and eat it, too?
Most large standardized tests are
unidimensional or low-dimensional.
Yet, teachers need fine-grained diagnostic reports (Militello, Sireci, & Schweid, 2008; Wylie & Ciofalo, 2008; Stiggins, 2005)
Can we have our cake and eat it, too?
Militello, M., Sireci, S., & Schweid, J. (2008). Intent, purpose, and fit: An examination of formative assessment systems in
school districts. Paper presented at the American Educational Research Association, New York City, NY.
Wylie, E. C., & Ciofalo, J. (2008). Supporting teachers' use of individual diagnostic items. Teachers College Record.
Retrieved from http://www.tcrecord.org/PrintContent.asp?ContentID=15363 on October 13, 2008.
Stiggins, R. (2005). From formative assessment to assessment FOR learning: A path to success in standards-based schools.
Phi Delta Kappan, 87(4), 324-328.
Cognitive Diagnostic Assessment
McCalla & Greer (1994) pointed out that the ability to
represent and reason about knowledge at various
levels of detail is important for robust tutoring.
Gierl, Wang & Zhou (2008) proposed that one direction for future research is to increase understanding of how to select an appropriate grain size or level of analysis
Can we use MCAS test results to help select the right grain-sized model from a series of models of different granularities?
McCalla, G. I. and Greer, J. E. (1994). Granularity- based reasoning and belief revision in student models. In Greer, J. E.
and McCalla, G. I., (eds), Student Modeling: The Key to Individualized Knowledge-Based Instruction, pages 39-62.
Springer-Verlag, Berlin.
Gierl, M.J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about
examinees’ cognitive skills in Algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6).
Building Skill Models
[Skill hierarchy diagram: the WPI-1 model (Math) splits into the WPI-5 strands (Patterns, Relations, and Algebra; Data Analysis, Statistics and Probability; Geometry; Number Sense and Operations; Measurement); these split into WPI-39 skills (e.g. understanding-and-applying-pattern, setting-up-and-solving-equation, understanding-data-presentation-techniques, understanding-congruence-and-similarity, using-measurement-formulas-and-techniques, converting-from-one-measure-to-another, understanding-number-representations); these split into WPI-78 skills (e.g. equation-solving, equation-concept, inducing-function, circle-graph, XY-graph, plot-graph, congruence, similar-triangles, area, perimeter, unit-conversion, equivalent-fractions-decimals-percents, ordering-fractions)]
Cognitive Diagnostic Assessment – data
2004-2005 Data
Sept, 2004 – May, 2005
447 students
Online data: 7.3 days; 87 items (sd. = 35)
Item-level responses on the 8th grade MCAS test (May, 2005)
2005-2006 Data
Sept, 2005 – May, 2006
474 students
Online data: 5 days; 51 items (sd. = 24)
Item-level responses on the 8th grade MCAS test (May, 2006)
All online and MCAS items have been tagged with all
four skill models
Cognitive Diagnostic Assessment - modeling
Fit a mixed-effects logistic regression model
Longitudinal model (e.g. Singer & Willett, 2003), of the form
logit(Pr(X_ijkt = 1)) = (β00 + β0k + β0i) + (β10 + β1k + β1i) · Month_t
-- X_ijkt is the 0/1 response of student i on question j tapping skill k in month t
-- Month_t is the elapsed month in the study; 0 for September, 1 for October, and so on
-- β0k and β1k: respective fixed effects for the baseline and the rate of change in the probability of correctly answering a question tapping skill k
-- β00 and β10: the group-average incoming knowledge level and rate of change
-- β0i and β1i: the baseline level of achievement and rate of change of student i
Predict MCAS score (see the sketch below)
Extrapolate the fitted model in time to the month of the MCAS test
Obtain the probability of getting each MCAS question correct, based upon the skill tagging of the MCAS item
Sum up the probabilities to get the total score
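A minimal sketch of that prediction step, with invented per-skill coefficients and item tagging (the real values come from the fitted mixed-effects model and the MCAS skill tagging):

```python
# Extrapolate each skill's fitted logit trajectory to the MCAS month, turn it
# into a per-item probability via the item's skill tag, and sum the
# probabilities to get the expected total score. All numbers are invented.
import math

MCAS_MONTH = 8                       # September = 0, so the May test is ~8
fitted = {                           # skill -> (baseline logit, monthly slope)
    "equation-solving": (-0.4, 0.15),
    "congruence":       (0.2, 0.05),
}
mcas_item_skills = ["equation-solving", "congruence", "equation-solving"]

def p_correct(skill: str, month: int) -> float:
    b0, b1 = fitted[skill]
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * month)))

predicted_total = sum(p_correct(s, MCAS_MONTH) for s in mcas_item_skills)
print(round(predicted_total, 2))     # expected number of items correct
```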
How do I Evaluate Models?
04-05 Data:

          Real MCAS  ASSISTment Predicted Score      Absolute Difference
Student   score      WPI-1  WPI-5  WPI-39  WPI-78    WPI-1  WPI-5  WPI-39  WPI-78
Mary      25.00      23.31  22.85  22.18   20.47     1.69   2.15   2.82    4.53
Tom       32.00      29.66  29.15  28.67   27.13     2.34   2.85   3.33    4.87
…
Sue       29.00      28.46  28.23  27.85   26.26     0.54   0.77   1.15    2.74
Dick      28.00      27.41  26.70  26.12   24.30     0.59   1.30   1.88    3.70
Harry     22.00      23.33  22.58  22.02   20.14     1.33   0.58   0.02    1.86

MAD                                                  4.42   4.37   4.22    4.11
%Error                                               13.00% 12.85% 12.41%  12.09%

Paired two-sample t-test
Comparing Models of Different Granularities

04-05 Data   WPI-1     WPI-5     WPI-39    WPI-78    1-parameter IRT model
MAD          4.42   >  4.37   >  4.22   >  4.11      4.36
%Error       13.00% >  12.85% >  12.41% >  12.09%    12.83%
(adjacent comparisons, left to right: p = 0.006, p < 0.001, p = 0.21; WPI-78 vs. IRT: p = 0.10)

05-06 Data   WPI-1     WPI-5     WPI-39    WPI-78    1-parameter IRT model
MAD          6.58      6.51      4.83      4.99      4.67
%Error       19.37%    19.14%    15.10%    14.70%    13.70%
(adjacent comparisons, left to right: p < 0.001, p < 0.001, p < 0.001; WPI-78 vs. IRT: p = 0.03)
The Effect of Scaffolding - hypothesis
Only using original questions makes it hard to
decide which skill to “blame”
Scaffolding questions aid in diagnosis by
directly assessing a single skill
Hypotheses
Using responses to scaffolding questions will
improve prediction accuracy
Scaffolding questions are more useful for fine
grained models
The Effect of Scaffolding - results

04-05 Data   Only original questions used   Original + scaffolding questions used
WPI-1        14.91%                         13.00%
WPI-5        14.06%                         12.85%
WPI-39       15.29%                         12.41%
WPI-78       17.75%                         12.09%

05-06 Data   Only original questions used   Original + scaffolding questions used
WPI-1        20.05%                         19.37%
WPI-5        19.88%                         19.14%
WPI-39       18.68%                         15.10%
WPI-78       16.91%                         14.70%
Cognitive Diagnostic Assessment - usage
Results are presented in a nested structure of different granularities to serve a variety of stakeholders
Cognitive Diagnostic Assessment - conclusion
Fine-grained models do the best job of estimating student skill level overall
Not necessarily the best for all consumers (e.g. principals)
Need the ability to diagnose (e.g. via scaffolding questions)
Scaffolding questions
Help improve overall prediction accuracy
Are more useful for fine-grained models
Feng, M., Heffernan, N.T., Mani, M. & Heffernan, C. (2006). Using Mixed-Effects Modeling to Compare Different Grain-Sized Skill Models. In Beck, J., Aimeur, E., & Barnes, T. (Eds.), Educational Data Mining: Papers from the AAAI Workshop. Menlo Park, CA: AAAI Press. pp. 57-66.
Feng, M., Heffernan, N., Heffernan, C. & Mani, M. (2009). Using mixed-effects modeling to analyze different grain-sized skill models. IEEE Transactions on Learning Technologies, Special Issue on Real-World Applications of Intelligent Tutoring Systems. (Featured article of the issue)
Pardos, Z., Feng, M., Heffernan, N. T. & Heffernan-Lindquist, C. (2007). Analyzing fine-grained skill models using Bayesian and mixed-effect methods. In Luckin & Koedinger (Eds.), Proceedings of the 13th Conference on Artificial Intelligence in Education. Amsterdam, Netherlands: IOS Press. pp. 626-628.
Future Work - Skill Model Refinement
We found that WPI-78 predicts a state test better than some less fine-grained models
However, WPI-78 may contain some mis-taggings
Expert-built models are subject to the risk of the “expert blind spot”
Ours was a best guess from a 7-hour coding session
A best-guess model should be iteratively tested and refined
Skill Model Refinement - approaches
Having human experts manually update hand-crafted models
(1,000+ items) × (100+ skills)
Not practical to do often
Data mining can help, by flagging:
Skills or items with high residuals
Skills consistently over-predicted or under-predicted
“Un-learned” skills (i.e. negative slopes from mixed-effects models)
Feng, M., Heffernan, N., Beck, J., & Koedinger, K. (2008). Can we predict which groups of questions students will learn from? In Beck & Baker (Eds.), Proceedings of the 1st International Conference on Educational Data Mining. Montreal, 2008.
Skill Model Refinement - approaches
Searching for better models automatically
Learning Factor Analysis (LFA) (Koedinger & Junker, 1999)
A semi-automated method: humans identify difficulty factors through task analysis, and automated methods search for better models based upon those factors
Three parts:
Difficulty factors associated with problems
A combinatorial search space, built by applying operators (add, split, merge) to the base model
A statistical model that evaluates how well a model fits the data
Can we increase the efficiency of LFA? (see the sketch below)
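A toy sketch of that search loop, with a stubbed-in scoring function where LFA would fit its statistical model (skill names, the factor, and the use of item IDs from the backup slide are illustrative assumptions):

```python
# Generate candidate skill models by applying a split operator to a base
# model, score each candidate, and keep the best. The score() stub stands in
# for fitting the statistical model and computing BIC.
def split(model, skill, factor_items):
    """Split `skill`: items exhibiting the difficulty factor get a new skill."""
    new = {k: list(v) for k, v in model.items()}
    hard = [i for i in new[skill] if i in factor_items]
    rest = [i for i in new[skill] if i not in factor_items]
    if hard and rest:
        new[skill] = rest
        new[skill + "-with-factor"] = hard
    return new

def score(model):
    # Placeholder: LFA fits a statistical model and returns its BIC (lower is
    # better). A fake score keeps the sketch runnable end to end.
    return sum(len(items) ** 0.5 for items in model.values())

base = {"circle-area": [894, 41, 4673, 117]}   # toy item IDs
candidates = [base, split(base, "circle-area", {894})]
print(min(candidates, key=score))
```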
Suggesting Difficulty Factors
Some items in a random sequence cause significantly less learning than others
Hypothesis
Problems that “don’t help” students learn might be teaching a different skill(s)
Create factor tables
Preliminary results show some validity

Skill         Factor
Circle-area   High
Circle-area   High
Circle-area   High
Circle-area   Low
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Brighton, UK. Amsterdam, Netherlands: IOS Press.
Roadmap
Motivation
Contributions
Background - ASSISTments
Using tutoring system as an assessor
Dynamic assessment
Longitudinal modeling
Cognitive diagnostic modeling
Conclusion & general implications
Conclusion of the Dissertation
The dissertation establishes novel assessment
methods to better assess students in tutoring
systems
Assess students better by analyzing their learning behaviors while using the tutor
Assess students longitudinally by tracking learning over time
Assess students diagnostically by modeling fine-grained skills
Comments from the Education Secretary
Secretary of Education Arne Duncan weighed in (in Feb 2009) on the NCLB Act and called for continuous assessment
Duncan says he is concerned about overtesting but
he thinks states could solve the problem by
developing better tests. He also wants to help them
develop better data management systems that help
teachers track individual student progress. "If you
have great assessments and real-time data for
teachers and parents that say these are [the
student's] strengths and weaknesses, that's a real
healthy thing," he says.
Ramírez, E., & Clark, K. (Feb., 2009). What Arne Duncan Thinks of No Child Left Behind: The new education secretary talks
about the controversial law and financial aid forms. (Electronic version) Retrieved on March 8th, 2009 from
http://www.usnews.com/articles/education/2009/02/05/what-arne-duncan-thinks-of-no-child-left-behind.html.
General implication
Continuous assessment systems are possible to build (we built one)
Save classroom instruction time by assessing students during tutoring
Track individual progress and help stakeholders get student performance information
Provide teachers with fine-grained, cognitively diagnostic feedback so they can be “data-driven”
A metaphor for this shift
Businesses don’t close down periodically to take inventory of stock any more
Bar codes; auto-checkout
Business never stops
Richer information
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds.). Knowing What Students Know: The Science and Design of Educational Assessment. Committee on the Foundations of Assessment, Board on Testing and Assessment, Center for Education, National Research Council. (p. 284)
Acknowledgement
My advisor
Neil Heffernan
Committee members
Ken Koedinger
Carolina Ruiz
Joe Beck
The ASSISTment team
My family
Many more…
Thanks!
Questions?
Backup slides
Motivation – the problems
III: The “moving” target problem
Testing and instruction have been separate fields
of research with their own goals
Psychometric theory assumes a fixed target for
measurement
ITS wants student ability to “move”
More Contributions
Working systems
www.ASSISTment.org
The reporting system that gives cognitive diagnostic
reports to teachers in a timely fashion
Established an easy approach to detect the effectiveness of individual tutoring content
AIED’05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey, S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project: Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.), Proceedings of the 12th International Conference on Artificial Intelligence in Education, pp. 555-562. Amsterdam: IOS Press.
Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., & Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds.), Intelligent Educational Machines, within the Intelligent Systems Engineering Book Series, pp. 23-49. Springer Berlin / Heidelberg.
JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of Interactive Learning Research, 18(2), pp. 207-230. Chesapeake, VA: AACE.
TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition, and Learning Journal, Vol. 3. Old City Publishing, Philadelphia, PA.
AIED’09: Feng, M., Heffernan, N.T., Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, and Graesser (Eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009), pp. 523-530. Amsterdam, Netherlands: IOS Press.
Evidence
[Chart: 62%, 50%, 37%, 37%]
Evidence
1. Congruence
2. Perimeter
3. Equation-Solving
Terminology
MCAS
Item/question/problem
Response
Original question
Scaffolding question
Hint message
Bottom-out hint
Buggy message
Attempt
Skill/knowledge component
Skill model/cognitive model/Q-matrix
Single mapping model
Multi-mapping model
The reporting system
I developed the first reporting system for ASSISTments in 2004; it is online, live, and gives detailed feedback at a grain size suited to guiding instruction
The grade book
“It’s spooky; he’s watching everything we do.” – a student
Identifying difficult steps
Informing hard skills
Linear Regression Model
An approach to modeling the relationship between a dependent variable (y) and one or more independent variables (X)
y depends linearly on X
How does linear regression work? By minimizing the sum of squared residuals
[Figure: example of linear regression with one independent variable]
Stepwise regression
Forward; backward; combination
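A minimal one-variable example of the least-squares fit described above (invented data points):

```python
# Ordinary least squares with one independent variable: choose the intercept
# and slope that minimize the sum of squared residuals.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
A = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"y ≈ {b0:.2f} + {b1:.2f}·x")
```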
1-Parameter IRT Model
An item response theory (IRT) model relates the probability of an examinee’s response to a test item to an underlying ability through a logistic function
1-PL IRT model: P(X_ni = 1) = exp(β_n - δ_i) / (1 + exp(β_n - δ_i)), where β_n is the ability of person n and δ_i is the difficulty of item i
I used BILOG-MG to run the model and obtain estimates of student ability and item difficulty
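The 1-PL response probability as code (parameter values are toy examples):

```python
# Rasch (1-PL) model: the probability person n answers item i correctly
# depends only on the gap between ability beta_n and difficulty delta_i.
import math

def p_correct(beta_n: float, delta_i: float) -> float:
    return 1.0 / (1.0 + math.exp(-(beta_n - delta_i)))

print(p_correct(beta_n=0.5, delta_i=-0.2))   # able student, easy item: p > 0.5
```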
Dynamic assessment - The models
Dynamic assessment - Validation
Longitudinal Modeling - data
Average %correct on original questions over time (FAKE data)
What does our real data look like?
[Figure: panels of individual students’ trajectories, MCAS score (0.00-54.00) plotted against centered month (0-8); one panel per student, e.g. students 239-248, 314-331, 666-669, 805-810]
Longitudinal Modeling - methodology
What do we get from (linear) mixed-effects models?
Average population trajectory for the specified group
Trajectory indicated by two parameters
intercept: γ00
slope: γ10
The average estimated score for the group at time j is Ŷ_j = γ00 + γ10 · TIME_j
Each student gets two parameters to vary from the group average
One trajectory for every single student
Intercept: γ00 + γ0i
Slope: γ10 + γ1i
The estimated score for student i at time j is Ŷ_ij = (γ00 + γ0i) + (γ10 + γ1i) · TIME_j
Singer, J. D. & Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Occurrence. Oxford
University Press, New York.
Longitudinal Modeling - results
BIC: Bayesian Information Criterion
(the lower, the better)
Feng, M., Heffernan, N.T., Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web Conference, pp. 307-316. New York, NY: ACM Press. Best Student Paper Nominee.
Feng, M., Heffernan, N.T., Koedinger, K.R. (2006b). Predicting State Test Scores Better with Intelligent Tutoring Systems: Developing Metrics to Measure Assistance Required. In Ikeda, Ashley & Chan (Eds.), Proceedings of the 8th International Conference on Intelligent Tutoring Systems, pp. 31-40. Springer-Verlag: Berlin.
Mixed effects models
Individuals in the population are assumed to have their own subject-specific mean response trajectories over time
The mean response is modeled as a combination of population characteristics (fixed effects) and subject-specific effects that are unique to a particular individual (random effects)
It is possible to predict how individual response trajectories change over time
Flexibility in accommodating imbalance in longitudinal data
Methodological features: 1) three or more waves of data; 2) an outcome (dependent) variable whose values change systematically over time; 3) a sensible metric for time as the fundamental predictor in the longitudinal study
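A minimal sketch of such a growth model on simulated data, using statsmodels’ MixedLM in place of the R lme4 fits mentioned later in these slides (all data and parameter values are invented):

```python
# Linear mixed-effects growth model: fixed population trajectory plus
# per-student random intercepts and slopes over time.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
students = np.repeat(np.arange(40), 8)            # 40 students, 8 monthly waves
month = np.tile(np.arange(8), 40)
intercepts = rng.normal(30, 4, 40)[students]      # subject-specific baselines
slopes = rng.normal(1.0, 0.3, 40)[students]       # subject-specific growth rates
score = intercepts + slopes * month + rng.normal(0, 2, students.size)

df = pd.DataFrame({"student": students, "month": month, "score": score})
model = smf.mixedlm("score ~ month", df, groups=df["student"], re_formula="~month")
print(model.fit().summary())
```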
Sample longitudinal data
Comparison of Approaches
Ayers & Junker (2006)
Estimate student proficiency using
1-PL IRT model
LLTM (linear logistic test model)
Main question difficulty decomposed into K skills
1-PL IRT fits dramatically better
Only main questions used
Additive, non-temporal
WinBUGS
Comparison of Approaches
Pardos et al. (2006)
Conjunctive Bayes nets
Non-temporal
Scaffolding used
Bayes Net Toolbox (Murphy, 2001)
DINA model
(Anozie, 2006)
Comparison of Approaches
Feng, Heffernan, Mani & Heffernan (2006)
Logistic mixed-effects model (Generalized Linear Mixed-effects Model, GLMM)
Temporal
X_ijkt is the 0/1 response of student i on question j tapping KC k in month t
Month_t is the elapsed month in the study; β0k and β1k are the respective fixed effects for the baseline and the rate of change in the probability of correctly answering a question tapping KC k
R lme4 library
Comparison of Approaches
Comparing to LLTM in Ayers & Junker (2006)
Student proficiency depends on time
Question difficulty depends on KC and time
Assign only the most difficult skill instead of full Q-matrix
mapping of multiple skills as in LLTM
Scaffolding used to gain identifiability
Ayers & Junker (2006) use regression to predict MCAS after obtaining an estimate of student ability (θ) (MAD = 10.93%)
No such regression process in my work
logit(p = 1) = θ - 0; estimated score = full score × p
Higher MAD, but provides diagnostic information
Comparison of Approaches
Comparing to Bayes nets and conjunctive models
Bayes: probability reasoning; conjunctive
GLMM: linear learning; max-difficulty reduction
Computationally much easier and faster
Results are still comparable
GLMM is better than Bayes nets when WPI-1 or WPI-5 is used
GLMM is comparable with Bayes nets when WPI-39 or WPI-78 is used
WPI-39: GLMM 12.41%, Bayes 12.05%
WPI-78: GLMM 12.09%, Bayes 13.75%
Cognitive Diagnostic Assessment – BIC results

BIC          WPI-1      WPI-5      WPI-39     WPI-78
04-05 Data   173445.2   170359.9   170581.7   165711.4
(adjacent differences: +3085, -222, +4870)
05-06 Data   39210.57   39174.29   54696.4    54299.54
(adjacent differences: +36, -15522, +399)

#data points are different
Items tagged with more than one skill are duplicated in the data
Finer-grained models have more multi-mappings, and thus more data points (higher BIC)
WPI-5 better than WPI-1; WPI-78 better than WPI-39
Calculate MAD as the evaluation gauge
Analyzing Instructional Effectiveness
Detect relative instructional effectiveness among items in the same GLOP using learning decomposition (see the sketch below)

Prior-encounter data (t1..t4 = prior encounters with each item type in the GLOP):
Student   Correct?   t1   t2   t3   t4
Tom       1          0    0    0    0
Tom       0          1    0    0    0
Tom       0          1    0    1    0
Tom       1          1    1    1    0

ln( P(correct) / (1 - P(correct)) ) = a_Student + a_Item + B1·t1 + B2·t2 + B3·t3 + B4·t4
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED-2009). Brighton, UK. Amsterdam, Netherlands: IOS Press.
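A minimal sketch of fitting the learning-decomposition regression on invented data (the real model also includes the per-student and per-item intercepts shown in the equation above, omitted here for brevity):

```python
# Learning decomposition as logistic regression: the coefficients on t1..t4
# estimate the relative instructional value of each item type's prior
# encounters. Data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0, 0, 0],       # columns: prior encounters t1, t2, t3, t4
              [1, 0, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 1, 0],
              [2, 1, 1, 1],
              [2, 2, 1, 1]])
y = np.array([1, 0, 0, 1, 1, 1])  # correct?

clf = LogisticRegression().fit(X, y)
print("B1..B4:", clf.coef_[0].round(2))
```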
Searching Results
Among 38 GLOPs, LFA found significantly better models for 12
Shall I be happy?
“Sanity” check: randomly assigned factor tables

#items in GLOP (#GLOPs)   Learning-suggested factors   Random factor table
2 (11)                    5                            5
4 (7)                     3                            1
5-11 (15)                 4 (5, 6, 8, 9)               1 (5); 3 (5)

Further work needs to be done
Quantitatively measure whether and how data-analysis results can help subject-matter experts
Explore the automatic factor-assigning approach on more data from other systems
Contrast with human experts as a controlled condition
Guess which item is the most difficult one?

Item ID   Squareroot   FactorHigh
894       1            0
41        1            1
4673      1            1
117       1            1

                                 One-skill model   Two-skill model
Log likelihood                   -532.6            -524
Bayesian Information Criterion   1,079.2           1,065.99
Num of skills                    1                 2
Num of parameters                2                 4
Coefficients                     1.099, 0.137      1.841, 0.100; -0.927, 0.055