Text Block #1 - Title - Worcester Polytechnic Institute

Towards Assessing Students’ Fine-Grained Knowledge:
Using an Intelligent Tutor for Assessing
Mingyu Feng
August 18th, 2009
Ph.D. Dissertation Committee:
Prof. Neil T. Heffernan (WPI)
Prof. Joseph E. Beck (WPI)
Prof. Carolina Ruiz (WPI)
Prof. Kenneth R. Koedinger (CMU)
Worcester Polytechnic Institute
Motivation – the need

Concerns about poor student performance on new state
tests


High-stakes standards-based tests are required by the No
Child Left Behind (NCLB) Act
Student performance is not satisfactory



Massachusetts (2003, 20% failed 10th grade math on the first try)
Worcester
Secondary teachers are asked to be data-driven


MCAS test reports
Formative assessment and practice tests

Provided by Northwest Evaluation Association; Measured Progress;
Pearson Assessments, etc.
2
Motivation – the problems

I: Formative assessment takes time from
instruction



NCLB or NCLU (No Child Left Untested)?
Every hour spent assessing students is an hour lost
from instruction
Limited classroom time compels teachers to make
a choice
3
Motivation – the problems

II: Performance reports are not satisfactory

Teachers want more frequent and more detailed reports
4
Confrey, J., Valenzuela, A., & Ortiz, A. (2002). Recommendation to the Texas State Board of Education on
the Setting of TAKS Standards: A Call to Responsible Action. At http://www.syrce.org/State_Board.htm
Main Contributions






Improved assessment system by taking into account how
much assistance students need (WWW’06; ITS’06;
EDM’08; UMUAI Journal’09 (nominated for James Chen
award))
Established a way to track and predict performance
longitudinally over multiple years (WWW’06; EDM’08)
Rigorously evaluated the effectiveness of the skill models of
various granularities (AAAI’06 EDM Workshop; TICL’07;
IEEE Journal’09)
Used data mining approach to evaluate effectiveness of
individual contents (AIED’09)
Used data mining to refine existing skill models (EDM’09;
in preparation)
Developed an online reporting system deployed and used by
real teachers (AIED’05; Book chapter’07; TICL Journal’06;
JILR Journal’07)
5
Roadmap
Motivation
 Contributions
 Background - ASSISTment
 Using tutoring system as an assessor





Dynamic assessment
Longitudinal modeling
Cognitive diagnostic modeling
Conclusion & general implications
6
ASSISTments System

A web-based tutoring system that assists students
in learning mathematics and gives teachers
assessment of their students’ progress

Teachers like ASSISTments

Students like ASSISTments
7
An ASSISTment
We break multi-step items
(original question) into
scaffolding questions
 Attempt: a student takes an action to
answer a question
 Response: the correctness of
the student's answer (1/0)
 Hint Messages: given on demand,
hinting at what step to
do next
 Buggy Message: a context-sensitive
feedback message
 Skill: a piece of knowledge
required to answer a question

8
Facts about ASSISTments
5000+ students have used the system regularly
 More than 10 million data records collected
 Other features



Learning experiments; authoring tools, account and
class management toolkit …
The dissertation uses data of about 1000 students
who used ASSISTments during 2004-2006
AIED’05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey,
S., Livak, T., Mercado, E., Turner, T.E., Upalekar, R., Walonoski, J.A., Macasek, M.A., Rasmussen, K.P. (2005). The Assistment Project:
Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.) Proceedings of the 12th International
Conference on Artificial Intelligence in Education, pp. 555-562. Amsterdam: IOS Press.
Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., &
Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds). Intelligent
Educational Machines within the Intelligent Systems Engineering Book Series . pp.23-49. Springer Berlin / Heidelberg.
Roadmap
Motivation
 Contributions
 Background - ASSISTments
 Using tutoring system as an assessor





Dynamic assessment
Longitudinal modeling
Cognitive diagnostic modeling
Conclusion & general implications
10
A Grade Book Report
Where does this
score come
from?
11
JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the Assistment System. Journal of
Interactive Learning Research. 18 (2), pp. 207-230. Chesapeake, VA: AACE.
TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System. Technology, Instruction, Cognition,
and Learning Journal. Vol. 3. Old City Publishing, Philadelphia, PA. 2006.
Automated Assessment

Big idea: use data collected while a student
uses ASSISTment to assess him

Lots of types of data available



(last screen just used % correct on original
questions)
Lots of other possible measures
Why should we be more complicated?
12
A Grade Book Report



Static – does not distinguish “Tom”
and “Jack”
Average – ignores development
over time
Uninformative – not informative for
classroom instruction
Dynamic assessment
Longitudinal modeling
Cognitive diagnostic assessment
13
Dynamic Assessment – the idea

Dynamic testing began before computerized testing (Brown,
Bryant, & Campione, 1983).
14
Brown, A. L., Bryant, N.R., & Campione, J. C. (1983). Preschool children’s learning and transfer of matrices problems:
Potential for improvement. Paper presented at the Society for Research in Child Development meetings, Detroit.
Dynamic vs. Static Assessment

Developing dynamic testing metrics






# attempts
# minutes to come up with an answer; # minutes to
complete an ASSISTment
# hint requests; # hint-before-attempt requests;
#bottom-out hints
% correct on scaffolds
# problems solved
“Static” measure

correct/wrong on original questions
15
Dynamic Assessment – data

2004-2005 Data



Sept, 2004 – May, 2005
391 students
Online data



267 minutes (sd. = 79); 9 days; 147 items (sd. = 60)
8th grade MCAS scores (May, 2005)
2005-2006 Data



Sept, 2005 – May, 2006
616 students
Online data


196 minutes (sd. = 76); 6 days; 88 items (sd. = 42)
8th grade MCAS scores (May, 2006)
16
Dynamic Assessment - modeling

Three linear stepwise regression models, each predicting the MCAS score

Model                      Predictors
The standard test model    1-parameter IRT proficiency estimate
The assistance model       All online metrics
The mixed model            1-parameter IRT proficiency estimate + all online metrics
17
1-parameter IRT: One parameter item response theory model
Dynamic Assessment - evaluation

Bayesian Information Criterion (BIC)





Widely used model selection criterion
Resolves overfitting problem by introducing a penalty term
for the number of parameters
Formula
Prefer model with lower BIC
Mean Absolute Deviation (MAD)

Cross-validated prediction error
Function

Prefer model with lower MAD
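The formula images for both criteria did not survive in this transcript. As a minimal sketch, assuming the regression form of BIC given in the cited Raftery (1995), BIC' = n ln(1 - R^2) + p ln(n), and MAD taken as the mean absolute difference between real and predicted scores, the two gauges could be computed as follows. The function names and the sample numbers are illustrative assumptions, not taken from the dissertation.

```python
import math

def bic_from_r_squared(r_squared, n, p):
    """Raftery-style BIC for a regression with p predictors fit on n cases.

    Assumed form: n * ln(1 - R^2) + p * ln(n). More negative is better.
    """
    return n * math.log(1.0 - r_squared) + p * math.log(n)

def mean_absolute_deviation(actual, predicted):
    """Average absolute gap between real and predicted test scores."""
    gaps = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(gaps) / len(gaps)

# Hypothetical inputs, for illustration only.
print(bic_from_r_squared(r_squared=0.5, n=100, p=2))
print(mean_absolute_deviation([25.0, 32.0], [23.31, 29.66]))
```

Under this form, adding a predictor lowers BIC only when the gain in R^2 outweighs the ln(n) penalty, which is the overfitting guard the slide describes.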

18
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111-163.
Dynamic Assessment - results
Model                     Predictors                                                  MAD    BIC    Correlation with 2005 8th grade MCAS
The standard test model   1-parameter IRT proficiency estimate                        6.40   -295   0.733
The assistance model      All online metrics                                          5.46   -402   0.821
The mixed model           1-parameter IRT proficiency estimate + all online metrics   5.04   -450   0.841

Each of the three between-model comparisons marked on the slide has p=0.001
19
Dynamic Assessment – what variables are
important?
20
Dynamic Assessment - robustness

See if model can generalize

Test model on other year’s data
21
Compare Models from Two Years
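[Table: regression coefficients of the models fit on 2004-2005 data and on 2005-2006 data; column alignment lost]
Predictors: (Constant), IRT_Proficiency_Estimate, Scaffold_Percent_Correct, Avg_Question_Time, Avg_Attempt, Avg_Hint_Request, Question_Count, Avg_Item_Time, Total_Attempt
Coefficient values in extraction order: 32.414, 26.8, 20.427, -0.17, -10.5, -3.217, 3.284, 32.944, 21.327, -0.102, 0.072, 0.045, -0.044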
Which metrics are stable across years?
22
Dynamic Assessment - conclusion

ASSISTments data enables us to assess more
accurately

The relative success of the assistance model over the
standard test model highlights the power of the
dynamic measures
Feng, M., Heffernan, N.T, Koedinger, K.R. (2006a). Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web
Conference. pp. 307-316. New York, NY: ACM Press. 2006. Best Student Paper Nominee.
Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online System
that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research
(UMUAI journal). 19(3), 2009.
Roadmap
Motivation
 Contributions
 Background - ASSISTments
 Using tutoring system as an assessor

Dynamic assessment
 Longitudinal modeling
 Cognitive diagnostic modeling


Conclusion & general implications
24
Can we have our cake and eat it, too?
Most large standardized tests are
unidimensional or low-dimensional.
 Yet, teachers need fine grained diagnostic
reports (Militello, Sireci, & Schweid, 2008;
Wylie, & Ciofalo, 2008; Stiggins, 2005)


Can we have our cake and eat it, too?
Militello, M., Sireci, S., & Schweid, J. (2008). Intent, purpose, and fit: An examination of formative assessment systems in
school districts. Paper presented at the American Educational Research Association, New York City, NY.
Wylie, E. C., & Ciofalo, J. (2008). Supporting teachers' use of individual diagnostic items. Teachers College Record.
Retrieved from http://www.tcrecord.org/PrintContent.asp?ContentID=15363 on October 13, 2008.
Stiggins, R. (2005). From formative assessment to assessment FOR learning: A path to success in standards-based schools.
Phi Delta Kappan, 87(4), 324-328.
Cognitive Diagnostic Assessment
McCalla & Greer (1994) pointed out that the ability to
represent and reason about knowledge at various
levels of detail is important for robust tutoring.
 Gierl, Wang & Zhou (2008) proposed that one direction
for future research is to increase understanding of how
to select an appropriate grain size or level of analysis


Can we use MCAS test results to help select the right
grain-sized model from a series of models of different
granularities?
McCalla, G. I. and Greer, J. E. (1994). Granularity-based reasoning and belief revision in student models. In Greer, J. E.
and McCalla, G. I., (eds), Student Modeling: The Key to Individualized Knowledge-Based Instruction, pages 39-62.
Springer-Verlag, Berlin.
Gierl, M.J., Wang, C., & Zhou, J. (2008). Using the attribute hierarchy method to make diagnostic inferences about
examinees’ cognitive skills in Algebra on the SAT. Journal of Technology, Learning, and Assessment, 6(6).
Building Skill Models
[Diagram: hierarchy of four skill models, from coarsest to finest]
WPI - 1: Math
WPI - 5: Patterns, Relations, and Algebra; Data Analysis, Statistics and Probability; Geometry; Measurement; Number Sense and Operations
WPI - 39 (examples shown): Setting-up-and-solving-equation; Understanding-pattern; Understanding-data-presentation-techniques; Understanding-and-applying-congruence-and-similarity; Using-measurement-formulas-and-techniques; Converting-from-one-measure-to-another; Understanding-number-representations; …
WPI - 78 (examples shown): Equation-solving; Equation-concept; Inducing-function; Plot-graph; XY-graph; Circle-graph; Congruence; Similar-triangles; Perimeter; Area; Unit-conversion; Equivalent-Fractions-Decimals-Percents; Ordering-fractions; …
27
Cognitive Diagnostic Assessment – data

2004-2005 Data

Sept, 2004 – May, 2005
447 students
Online data: 7.3 days; 87 items (sd. = 35)
Item level response of 8th grade MCAS test (May, 2005)

2005-2006 Data

Sept, 2005 – May, 2006
474 students
Online data: 5 days; 51 items (sd. = 24)
Item level 8th grade MCAS scores (May, 2006)
All online and MCAS items have been tagged with all
four skill models
29
Cognitive Diagnostic Assessment - modeling

Fit mixed-effects logistic regression model
Longitudinal model
(e.g. Singer & Willett,
2003)
-- X_ijkt is the 0/1 response of student i on question j tapping skill k in month t
-- Month_t is the elapsed month in the study: 0 for September, 1 for October, and so on
-- β_0k and β_1k: respective fixed effects for baseline and rate of change in probability of
correctly answering a question tapping skill k
-- β_00 and β_10: the group average incoming knowledge level and rate of change
-- β_0 and β_1: the baseline level of achievement and rate of change of the student
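
The model equation itself was an image and is not in this transcript. One form consistent with the definitions above, offered as an assumption and not as the dissertation's exact specification, puts the student-level and skill-level terms on the log-odds scale. The deviation symbols ζ_0i and ζ_1i are introduced here for illustration only.

```latex
\begin{align*}
\operatorname{logit}\,\Pr(X_{ijkt} = 1) &= (\beta_{0} + \beta_{0k}) + (\beta_{1} + \beta_{1k})\,\mathrm{Month}_t \\
\beta_{0} &= \beta_{00} + \zeta_{0i}, \qquad \beta_{1} = \beta_{10} + \zeta_{1i}
\end{align*}
```

Here ζ_0i and ζ_1i stand for student i's deviations from the group average baseline β_00 and group average rate of change β_10.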

Predict MCAS score



Extrapolate the fitted model in time to the month of the MCAS test
Obtain probability of getting each MCAS question correct, based upon
skill tagging of the MCAS item
Sum up probabilities to get total score
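
The three prediction steps above can be sketched as follows. This is a minimal illustration under an assumed logistic form in which the log-odds of a correct answer is the sum of a student baseline and a skill baseline, plus the elapsed month times the sum of the student and skill rates of change. All function names, skill effects, and student parameters are hypothetical.

```python
import math

def p_correct(student_b0, student_b1, skill_b0, skill_b1, month):
    """Probability of a correct response under the assumed logistic form."""
    log_odds = (student_b0 + skill_b0) + (student_b1 + skill_b1) * month
    return 1.0 / (1.0 + math.exp(-log_odds))

def predict_test_score(student, skill_effects, test_item_skills, test_month):
    """Extrapolate to the test month and sum per-item probabilities."""
    b0, b1 = student
    total = 0.0
    for skill in test_item_skills:
        k0, k1 = skill_effects[skill]
        total += p_correct(b0, b1, k0, k1, test_month)
    return total

# Hypothetical fitted values, for illustration only.
skill_effects = {"Area": (-0.2, 0.05), "Perimeter": (0.3, 0.02)}
student = (0.1, 0.04)                        # (baseline, monthly rate of change)
test_items = ["Area", "Area", "Perimeter"]   # skill tag of each test item
# With September as month 0, a May test falls in month 8.
print(round(predict_test_score(student, skill_effects, test_items, test_month=8), 2))
```

The predicted total is an expected score, so it is generally not a whole number even though the real test score is.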
30
How do I Evaluate Models?
04-05 Data

         Real MCAS   ASSISTment Predicted Score             Absolute Difference
         score       WPI-1    WPI-5    WPI-39   WPI-78      WPI-1    WPI-5    WPI-39   WPI-78
Mary     25.00       23.31    22.85    22.18    20.47       1.69     2.15     2.82     4.53
Tom      32.00       29.66    29.15    28.67    27.13       2.34     2.85     3.33     4.87
…
Sue      29.00       28.46    28.23    27.85    26.26       0.54     0.77     1.15     2.74
Dick     28.00       27.41    26.70    26.12    24.30       0.59     1.30     1.88     3.70
Harry    22.00       23.33    22.58    22.02    20.14       1.33     0.58     0.02     1.86
MAD                                                         4.42     4.37     4.22     4.11
%Error                                                      13.00%   12.85%   12.41%   12.09%
Paired two-sample t-test
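
As a hedged illustration of the paired comparison, the sketch below computes a paired t statistic on the per-student absolute differences of two skill models. It uses only the five example students shown on this slide, so its result is not the dissertation's; the reported p-values come from the full sample. The function name is an assumption, and turning the statistic into a p-value would additionally need a t distribution, which is left out here.

```python
import math
import statistics

def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic for per-student errors of two models.

    Returns (t, degrees_of_freedom). A positive t means model B has
    larger errors than model A on average.
    """
    diffs = [b - a for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Absolute differences for Mary, Tom, Sue, Dick, and Harry from the slide.
wpi_1 = [1.69, 2.34, 0.54, 0.59, 1.33]
wpi_78 = [4.53, 4.87, 2.74, 3.70, 1.86]
t, df = paired_t_statistic(wpi_1, wpi_78)
print(f"t = {t:.2f}, df = {df}")
```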
31
Comparing Models of Different Granularities
04-05 Data
          WPI-1       WPI-5       WPI-39      WPI-78     1-parameter IRT model
MAD       4.42    >   4.37    >   4.22    >   4.11       4.36
%Error    13.00%  >   12.85%  >   12.41%  >   12.09%     12.83%
P-values marked on the slide, in extraction order: P =0.006, P <0.001, P =0.21, P =0.10

05-06 Data
          WPI-1       WPI-5       WPI-39      WPI-78     1-parameter IRT model
MAD       6.58        6.51        4.83        4.99       4.67
%Error    19.37%      19.14%      15.10%      14.70%     13.70%
P-values marked on the slide, in extraction order: P <0.001, P <0.001, P <0.001, P =0.03
32
The Effect of Scaffolding - hypothesis
Only using original questions makes it hard to
decide which skill to “blame”
 Scaffolding questions aid in diagnosis by
directly assessing a single skill
 Hypotheses



Using responses to scaffolding questions will
improve prediction accuracy
Scaffolding questions are more useful for
fine-grained models
33
The Effect of Scaffolding - results
%Error of the predicted MCAS score

04-05 Data   Only original questions used   Original + Scaffolding questions used
WPI-1        14.91%                         13.00%
WPI-5        14.06%                         12.85%
WPI-39       15.29%                         12.41%
WPI-78       17.75%                         12.09%

05-06 Data   Only original questions used   Original + Scaffolding questions used
WPI-1        20.05%                         19.37%
WPI-5        19.88%                         19.14%
WPI-39       18.68%                         15.10%
WPI-78       16.91%                         14.70%
34
Cognitive Diagnostic Assessment - usage
Results presented in a nested structure of different
granularities to serve a variety of stake-holders
35
Cognitive Diagnostic Assessment - conclusion

Fine-grained models do the best job estimating student
skill level overall



Not necessarily the best for all consumers (e.g.
principals)
Need the ability to diagnose (e.g. scaffolding questions)
Scaffolding questions


Helps improve overall prediction accuracy
More useful for fine-grained models
Feng, M., Heffernan, N.T, Mani, M. & Heffernan C. (2006). Using Mixed-Effects Modeling to Compare Different Grain-Sized Skill Models. In
Beck, J., Aimeur, E., & Barnes, T. (Eds). Educational Data Mining: Papers from the AAAI Workshop. Menlo Park, CA: AAAI Press. pp. 57-66.
Feng, M, Heffernan, N., Heffernan, C. & Mani, M. (2009). Using mixed-effects modeling to analyze different grain-sized skill models. IEEE
Transactions on Learning Technologies Special Issue on Real-World Applications of Intelligent Tutoring Systems. (Featured article of the issue)
Pardos, Z., Feng, M., Heffernan, N. T. & Heffernan-Lindquist, C. (2007). Analyzing fine-grained skill models using Bayesian and mixed effect
methods. In Luckin & Koedinger (Eds.) Proceedings of the 13th Conference on Artificial Intelligence in Education. Amsterdam, Netherlands: IOS
Press. pp. 626-628.
Future Work - Skill Model Refinement

We found that WPI-78 predicts a state test better
than some less fine-grained models

However, WPI-78 may have some mis-taggings



Expert-built models are subject to the risk of “expert blind
spot”
Our best-guess in a 7-hour coding session
A best guess model should be iteratively tested and
refined
37
Skill Model Refinement - approaches

Human experts manually update hand-crafted models



(1,000+ items ) * (100+ skills)
Not practical to do it often
Data mining can help



Skills or items with high residuals
Skills consistently over-predicted or under-predicted
“Un-learned” skills (i.e. negative slopes from mixed-effects models)
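
The three data mining signals listed above could be screened for with something like the sketch below. It assumes per-skill slope estimates and per-response residuals (observed minus predicted) are already available from a fitted model; the threshold, skill values, and function name are hypothetical.

```python
from statistics import mean

def flag_skills(slopes, residuals, threshold=0.2):
    """Return {skill: [reasons]} for skills whose tagging deserves a second look."""
    flagged = {}
    for skill, slope in slopes.items():
        reasons = []
        if slope < 0:
            reasons.append("negative slope")  # an "un-learned" skill
        skill_residuals = residuals.get(skill, [])
        if skill_residuals:
            bias = mean(skill_residuals)  # observed minus predicted
            if bias > threshold:
                reasons.append("under-predicted")
            elif bias < -threshold:
                reasons.append("over-predicted")
        if reasons:
            flagged[skill] = reasons
    return flagged

# Hypothetical estimates, for illustration only.
slopes = {"Area": 0.05, "Perimeter": -0.03, "Congruence": 0.02}
residuals = {
    "Area": [0.4, 0.3, 0.5],
    "Perimeter": [0.1, -0.1],
    "Congruence": [-0.05, 0.05],
}
print(flag_skills(slopes, residuals))
```

A flag is only a prompt for a subject-matter expert to re-examine the tagging; it does not by itself say what the right tagging is.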
38
Feng, M., Heffernan, N., Beck, J, & Koedinger, K. (2008). Can we predict which groups of questions students will learn from? In Beck &
Baker (Eds.). Proceedings of the 1st International Conference on Education Data Mining. Montreal, 2008.
Skill Model Refinement - approaches

Searching for better models automatically

Learning Factor Analysis (LFA) (Koedinger, & Junker,
1999)
A semi-automated method
Humans identify difficulty factors through task analysis
Auto-methods search for better models based upon factors
 Three parts

Difficulty factors associated with problems
A combinatorial search space by applying operators (add, split,
merge) on the base model
A statistical model that evaluates how well a model fits the data

Can we increase the efficiency of LFA?
39
Suggesting Difficulty Factors
Some items in a random
sequence cause significantly
less learning than others
 Hypothesis


Problems that “don’t help”
students learn might be
teaching a different skill(s)
Create factor tables
 Preliminary results show
some validity

Skill         Factor
Circle-area   High
Circle-area   High
Circle-area   High
Circle-area   Low
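
To make the role of a factor table concrete, here is a minimal sketch of a split operator that divides one skill into two according to a factor column. It is an illustration of the general idea only, not the LFA implementation; the item IDs and function name are hypothetical, and the High/Low values mirror the Circle-area example above.

```python
def split_skill(item_skill, item_factor, skill):
    """Split `skill` into one sub-skill per factor value.

    item_skill maps item -> skill tag; item_factor maps item -> factor value.
    Items tagged with other skills, or lacking a factor value, are unchanged.
    """
    new_mapping = {}
    for item, tagged in item_skill.items():
        if tagged == skill and item in item_factor:
            new_mapping[item] = f"{skill}-{item_factor[item]}"
        else:
            new_mapping[item] = tagged
    return new_mapping

# Hypothetical item IDs, for illustration only.
item_skill = {"item_a": "Circle-area", "item_b": "Circle-area",
              "item_c": "Circle-area", "item_d": "Circle-area"}
item_factor = {"item_a": "High", "item_b": "High",
               "item_c": "High", "item_d": "Low"}
print(split_skill(item_skill, item_factor, "Circle-area"))
```

The split model would then be refit and compared with the base model using a criterion such as BIC.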
40
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment
system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial
Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
Roadmap
Motivation
 Contributions
 Background - ASSISTments
 Using tutoring system as an assessor





Dynamic assessment
Longitudinal modeling
Cognitive diagnostic modeling
Conclusion & general implications
41
Conclusion of the Dissertation

The dissertation establishes novel assessment
methods to better assess students in tutoring
systems



Assess students better by analyzing their learning
behaviors when using the tutor
Assess students longitudinally by tracking learning
over time
Assess students diagnostically by modeling fine-grained skills
42
Comments from the Education Secretary

Secretary of Education Arne Duncan weighed
in (in Feb 2009) on the NCLB Act and called
for continuous assessment
Duncan says he is concerned about overtesting but
he thinks states could solve the problem by
developing better tests. He also wants to help them
develop better data management systems that help
teachers track individual student progress. "If you
have great assessments and real-time data for
teachers and parents that say these are [the
student's] strengths and weaknesses, that's a real
healthy thing," he says.
43
Ramírez, E., & Clark, K. (Feb., 2009). What Arne Duncan Thinks of No Child Left Behind: The new education secretary talks
about the controversial law and financial aid forms. (Electronic version) Retrieved on March 8th, 2009 from
http://www.usnews.com/articles/education/2009/02/05/what-arne-duncan-thinks-of-no-child-left-behind.html.
General implication

Continuous assessment systems are possible to
build (we built one)



Save classroom instruction time by assessing
students during tutoring
Track individual progress and help stakeholders
get student performance information
Provide teachers with fine-grained, cognitively
diagnostic feedback to be “data-driven”
44
A metaphor for this shift

Businesses don’t close down periodically to take
inventory of stock any more
Bar code; auto-checkout
 Non-stop business
 Richer information

Committee on the Foundations of Assessment
Board on Testing and Assessment
Center for Education
National Research Council
James W. Pellegrino
Naomi Chudowsky
Robert Glaser
(page 284).
45
Acknowledgement

My advisor


Neil Heffernan
Committee members



Ken Koedinger
Carolina Ruiz
Joe Beck
The ASSISTment team
 My family
 Many more…

46
Thanks!
Questions?
Backup slides
48
Motivation – the problems

III: The “moving” target problem



Testing and instruction have been separate fields
of research with their own goals
Psychometric theory assumes a fixed target for
measurement
ITS wants student ability to “move”
49
More Contributions

Working systems



www.ASSISTment.org
The reporting system that gives cognitive diagnostic
reports to teachers in a timely fashion
Establish an easy approach to detect the effectiveness
of individual tutoring content
AIED’05: Razzaq, L., Feng, M., Nuzzo-Jones, G., Heffernan, N.T., Koedinger, K. R., Junker, B., Ritter, S., Knight, A., Aniszczyk, C., Choksey,
S., Livak, T., Mercado, E., Turner, T.E., Upalekar. R, Walonoski, J.A., Macasek. M.A., Rasmussen, K.P. (2005). The Assistment Project:
Blending Assessment and Assisting. In C.K. Looi, G. McCalla, B. Bredeweg, & J. Breuker (Eds.) Proceedings of the 12th International
Conference on Artificial Intelligence in Education, pp. 555-562. Amsterdam: IOS Press.
Book Chapter: Razzaq, L., Feng, M., Heffernan, N., Koedinger, K., Nuzzo-Jones, G., Junker, B., Macasek, M., Rasmussen, K., Turner, T., &
Walonoski, J. (2007). Blending Assessment and Instructional Assistance. In Nedjah, Mourelle, Borges and Almeida (Eds). Intelligent
Educational Machines within the Intelligent Systems Engineering Book Series . pp.23-49. Springer Berlin / Heidelberg.
JILR Journal: Feng, M. & Heffernan, N. (2007). Towards Live Informing and Automatic Analyzing of Student Learning: Reporting in the
Assistment System. Journal of Interactive Learning Research. 18 (2), pp. 207-230. Chesapeake, VA: AACE.
TICL Journal: Feng, M., Heffernan, N.T. (2006). Informing Teachers Live about Student Learning: Reporting in the Assistment System.
Technology, Instruction, Cognition, and Learning Journal. Vol. 3. Old City Publishing, Philadelphia, PA. 2006.
AIED’09: Feng, M., Heffernan, N.T., Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment
system. In Dimitrova, Mizoguchi, du Boulay, and Graesser (Eds), Proceedings of the 14th International Conference on Artificial
Intelligence in Education (AIED-2009). pp. 523-530. Amsterdam, Netherlands: IOS Press.
Evidence
62%
50%
37%
37%
51
Evidence
1.
2.
3.
Congruence
Perimeter
Equation-Solving
52
Terminology
MCAS
 Item/question/problem
 Response
 Original question
 Scaffolding question
 Hint message
 Bottom-out hint
 Buggy message

Attempt
 Skill/knowledge
component
 Skill model/cognitive
model/Q-matrix
 Single mapping model
 Multi-mapping model

53
54
The reporting system

I developed the first reporting system for
ASSISTments in 2004 that

is online, live, and gives detailed feedback at a
grain size for guiding instruction
55
The grade book
“It’s spooky; he’s watching everything we do”. – a student
56
Identifying difficult steps
57
Informing hard skills
58
Linear Regression Model
An approach to modeling the relationship between a dependent
variable (y) and one or more independent variables (X)
 Y depends linearly on X


How does linear regression work?



Minimizing sum-of-squares
Example of linear regression
with one independent variable
Stepwise regression

Forward; backward; Combination
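
For the one-variable case mentioned above, minimizing the sum of squared errors has a closed-form solution. The sketch below shows it with made-up data; the function name and numbers are illustrative, and stepwise selection of several predictors is not covered.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.

    The slope is the covariance of x and y divided by the variance of x,
    which is the choice that minimizes the sum of squared residuals.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying exactly on y = 1 + 2x.
intercept, slope = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(intercept, slope)
```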
59
1-Parameter IRT Model
Item response theory (IRT) model relates the
probability of an examinee's response to a test
item to an underlying ability in a logistic
function
 1-PL IRT model

where βn is the ability of person n and δi is the difficulty of item i.

I used BILOG-MG to run the model and obtain
estimates of student ability and item difficulty
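
The equation on this slide was an image. In its standard form, the 1-PL (Rasch) model gives the probability that person n answers item i correctly as exp(βn - δi) / (1 + exp(βn - δi)). A minimal sketch of that function follows; the function name and the sample ability and difficulty values are assumptions for illustration, and this is not how BILOG-MG estimates the parameters.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the 1-PL (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty the probability is one half.
print(rasch_probability(0.0, 0.0))
# A more able student has a higher chance on the same item.
print(rasch_probability(1.0, 0.0) > rasch_probability(-1.0, 0.0))
```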
60
Dynamic assessment - The models
61
Dynamic assessment - The models
62
Dynamic assessment – The models
63
Dynamic assessment - Validation
64
Longitudinal Modeling - data
Average %correct on original questions over time (FAKE data)
What does our real data look like?
65
[Figure: one panel per student (student IDs such as 239, 240, 243, 244, 245, 246, 247, 248, 314, 315, 316, 320, 321, 327, 331, 666, 667, 668, 669, 805, 806, 807, 809, 810); y-axis: MCAS Score (0.00 to 54.00); x-axis: Centered Month (0 to 8)]
Longitudinal Modeling - methodology

What do we get from (linear) mixed effects
models?

Average population trajectory for the specified group

Trajectory indicated by two parameters
intercept: $\beta_{00}$
slope: $\beta_{10}$

The average estimated score for a group at time j is
$\hat{Y}_{j} = \beta_{00} + \beta_{10} \cdot TIME_{j}$

One trajectory for every single student

Each student gets two parameters to vary from
the group average
intercept: $\beta_{00} + \zeta_{0i}$
slope: $\beta_{10} + \zeta_{1i}$

The estimated score for student i at time j is
$\hat{Y}_{ij} = (\beta_{00} + \zeta_{0i}) + (\beta_{10} + \zeta_{1i}) \cdot TIME_{j}$
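
As a small worked example of the two trajectories described on this slide, the sketch below evaluates a group trajectory and one student's trajectory. The student's line is the group intercept and slope, each shifted by a student-specific deviation. All numbers and the function name are hypothetical.

```python
def predicted_score(group_intercept, group_slope, time,
                    dev_intercept=0.0, dev_slope=0.0):
    """Linear trajectory: (intercept + deviation) + (slope + deviation) * time.

    With both deviations at zero this is the group average trajectory.
    """
    return (group_intercept + dev_intercept) + (group_slope + dev_slope) * time

# Hypothetical values: the group starts at 20 points and gains 1.5 per month.
group_at_month_8 = predicted_score(20.0, 1.5, time=8)
# A student who starts 3 points higher but gains 0.5 points less per month.
student_at_month_8 = predicted_score(20.0, 1.5, time=8,
                                     dev_intercept=3.0, dev_slope=-0.5)
print(group_at_month_8, student_at_month_8)
```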
68
Singer, J. D. & Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Occurrence. Oxford
University Press, New York.
68
Longitudinal Modeling - results
BIC: Bayesian Information Criterion
(the lower, the better)
Feng, M., Heffernan, N.T, Koedinger, K.R. (2006a) Addressing the Testing Challenge with a Web-Based E-Assessment System that Tutors as it Assesses. In Proceedings of the 15th International World Wide Web
Conference. pp. 307-316. New York, NY: ACM Press. 2006. Best Student Paper Nominee.
Feng, M., Heffernan, N.T, Koedinger, K.R. (2006b). Predicting State Test Scores Better with Intelligent
Tutoring Systems: Developing Metrics to Measure Assistance Required. In Ikeda, Ashley & Chan (Eds.).
Proceedings of the 8th International Conference on Intelligent Tutoring Systems. Springer-Verlag: Berlin.
pp. 31-40. 2006.
Mixed effects models
Individuals in the population are assumed to have their own
subject-specific mean response trajectories over time
 The mean response is modeled as a combination of
population characteristics (fixed effects) and subject-specific
effects that are unique to a particular individual (random
effects)
 It is possible to predict how individual response trajectories
change over time
 Flexibility in accommodating imbalance in longitudinal data


Methodological features: 1) 3 or more waves of data 2) an
outcome variable (dependent variable) whose values change
systematically over time 3) A sensible metric for time that is
the fundamental predictor in the longitudinal study
70
Sample longitudinal data
71
Comparison of Approaches

Ayers & Junker (2006)

Estimate student proficiency using
1-PL IRT model
 LLTM (linear logistic test model)






Main question difficulty decomposed into K skills
1-PL IRT fits dramatically better
Only main questions used
Additive, non-temporal
WinBUGS
72
Comparison of Approaches

Pardos et al. (2006)




Conjunctive Bayes nets
Non-temporal
Scaffolding used
Bayes Net Toolbox (Murphy, 2001)
DINA model
(Anozie, 2006)

73
Comparison of Approaches

Feng, Heffernan, Mani & Heffernan (2006)



Logistic mixed-effects model (Generalized Linear Mixed-effects Model, GLMM)
Temporal
X_ijkt is the 0/1 response of student i on question j tapping
KC k in month t,
Month_t is elapsed month in the study; β_0k and β_1k are respective fixed effects for baseline and rate
of change in probability of correctly answering a question tapping KC k.

R lme4 library
74
Comparison of Approaches

Comparing to LLTM in Ayers & Junker (2006)

Student proficiency depends on time

Question difficulty depends on KC and time

Assign only the most difficult skill instead of full Q-matrix
mapping of multiple skills as in LLTM
Scaffolding used to gain identifiability
Ayers & Junker (2006) use regression to predict MCAS after
obtaining estimate of student ability (θ) (MAD= 10.93%)
No such regression process in my work





logit(p=1) = θ – 0; estimated score = full score * p
Higher MAD, but provide diagnostic information
75
Comparison of Approaches

Comparing to Bayes nets and conjunctive models
Bayes: probability reasoning; conjunctive
 GLMM: linear learning; max-difficulty reduction
 Computationally much easier and faster
 Results are still comparable



GLMM is better than Bayes nets when WPI-1, WPI-5 used
GLMM is comparable with Bayes nets when WPI-39 or WPI-78 used
 WPI-39: GLMM 12.41%, Bayes: 12.05%
 WPI-78: GLMM 12.09%, Bayes: 13.75%
76
Cognitive Diagnostic Assessment – BIC results

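BIC by skill model (the difference between adjacent models is shown beneath each pair)

Model         WPI-1        WPI-5        WPI-39       WPI-78
04-05 Data    173445.2     170359.9     170581.7     165711.4
  difference         3085         -222         4870
05-06 Data    39210.57     39174.29     54696.4      54299.54
  difference         36           -15522       399

The number of data points differs across models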
Items tagged with more than one skill will be duplicated in
the data
 Finer grained models have more multi-mappings, and
thus, more data points (higher BIC)
 WPI-5 better than WPI-1; WPI-78 better than WPI-39


Calculate MAD as the evaluation gauge
77
Analyzing Instructional Effectiveness

Detect relative instructional effectiveness
among items in the same GLOP using
learning decomposition.
Student   Item   Correct?   Prior encounters
                            t1   t2   t3   t4
Tom              1          0    0    0    0
Tom              0          1    0    0    0
Tom              0          1    0    1    0
Tom              1          1    1    1    0

$\ln \frac{P(correct)}{1 - P(correct)} = a + Student + Item + B_1 \cdot t_1 + B_2 \cdot t_2 + B_3 \cdot t_3 + B_4 \cdot t_4$
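
The prior-encounter indicators in the table can be generated from the order in which a student saw the items. The sketch below assumes t_k is 1 when item k was seen earlier in the sequence and 0 otherwise. The item order 1, 3, 2, 4 is an inference that reproduces the four rows shown for Tom; the Item column itself did not survive in this transcript, and the function name is hypothetical.

```python
def prior_encounter_rows(item_order, all_items):
    """For each attempted item, mark which items the student had already seen."""
    seen = set()
    rows = []
    for item in item_order:
        rows.append([1 if other in seen else 0 for other in all_items])
        seen.add(item)
    return rows

# Assumed order in which Tom met the four items of one GLOP.
rows = prior_encounter_rows([1, 3, 2, 4], [1, 2, 3, 4])
for row in rows:
    print(row)
```

Each row would then be one observation in the logistic regression, with the fitted coefficient B_k indicating how much a prior encounter with item k helps on later items.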
78
Feng, M., Heffernan, N., & Beck, J. (2009). Using learning decomposition to analyze instructional effectiveness in the ASSISTment
system. In Dimitrova, Mizoguchi, du Boulay, & Graesser (Eds), Proceedings of the 14th International Conference on Artificial
Intelligence in Education (AIED-2009). Amsterdam, Netherlands: IOS Press. Brighton, UK.
Searching Results
Among 38 GLOPs, LFA
found significantly better
models for 12
 Shall I be happy?

“Sanity” check: randomly
assigned factor tables

#items in GLOP (#GLOPs)   Learning-suggested factors   Random factor table
2 (11)                    5                            5
3 (5)
4 (7)                     3                            1
5-11 (15)                 4 (5, 6, 8, 9)               1 (5)

Further work needs to be done



Quantitatively measure whether and how data analysis
results can be helpful for subject-matter experts
Explore the automatic factor assigning approach on
more data for other systems
Contrast with human experts as controlled condition
79

Guess which item is the
most difficult one?

Item ID   Squareroot   FactorHigh
894       1            0
41        1            1
4673      1            1
117       1            1

Log likelihood                   -532.6         -524
Bayesian Information Criterion   1,079.2        1,065.99
Num of skills                    1              2
Num of parameters                2              4
Coefficients                     1.099, 0.137   1.841, 0.100; -0.927, 0.055
80
