Applications of Natural Language Processing



Course 7 – 05 April 2012
Diana Trandabăț
[email protected]
1

NLP in eLearning
◦ Generating test questions
◦ Keyword identification
◦ Extraction of definitions
2




eLearning comprises all forms of electronically
supported learning and teaching.
eLearning 2.0 - with the emergence of Web 2.0
Conventional e-learning systems were based on
instructional packets, which were delivered to
students using assignments. Assignments were
evaluated by the teacher.
In contrast, the new e-learning places increased
emphasis on social learning and use of social
software such as blogs, wikis, podcasts etc.
3

NLP techniques in educational applications
working with textual data:
◦ intelligent tutoring systems
◦ automatic generation of exercises
◦ assessment of learner generated discourse
◦ reading and writing assistance
These applications require an adaptation of NLP
techniques to various types of discourse, e.g.
tutoring dialogues, which are different from
typical task-oriented spoken dialogue systems.
Moreover, educational applications place strong
requirements on NLP systems, which have to be
robust yet accurate.
4
Educational Natural Language Processing
 eLearning: computer assisted learning/instruction
 NLP: analysis and use of language by machines
5

Definition:
◦ Field of research exploring the use of NLP
techniques in educational contexts

Why?
◦ Large text repositories with user generated
discourse and user generated metadata are
created
◦ These repositories need advanced information
management and NLP to be efficiently accessed
◦ Using these repositories to create structured
knowledge bases can improve NLP
6
 Definition: All forms of assessment delivered
with the help of computers
 Also called Computer Assisted/Aided Assessment (CAA)
 Adequate question types for CAA (McKenna &
Bull, 1999):
◦ Multiple choice questions (MCQs)
◦ True/False questions
◦ Matching questions
◦ Ranking questions
◦ Sequencing questions
◦ etc.
7

Generation of questions and exercises
◦ Writing test questions, especially objective test
items, is an extremely difficult and time consuming
task for teachers
◦ Use of NLP to automatically generate objective test
items, esp. for language learning

Assessment and evaluation of answers to
subjective test items
◦ Use of NLP to automatically:
 Diagnose errors in short-answer essays
 Grade essays
8

Source data
◦ Corpora: texts should be chosen according to
 the learner model (level, mastered vocabulary)
 the instructor model (target language, word category)
◦ Lexical semantic resources, e.g. WordNet

Tools
◦ Tokeniser and sentence splitter
◦ Lemmatiser
◦ Conjugation and declension tools
◦ POS tagger
◦ Parser and chunker
9

Choose the correct answer among a set of
possible answers:
◦ Who was voted the best international footballer for
2004? (question focus)
(a) Henry (distractor)
(b) Beckham (distractor)
(c) Ronaldinho (correct answer / key)
(d) Ronaldo (distractor)
 Usually 3 to 5 alternative answers
10
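The anatomy above (question focus, key, distractors) maps naturally onto a small data structure. A minimal sketch in Python; the class and field names are illustrative, not from the slides:

```python
import random
from dataclasses import dataclass

@dataclass
class MCQItem:
    stem: str          # the question focus
    key: str           # the correct answer
    distractors: list  # usually 3 to 5 incorrect alternatives

    def options(self):
        """Return the key and distractors in shuffled order,
        so the key does not always occupy the same position."""
        opts = [self.key] + list(self.distractors)
        random.shuffle(opts)
        return opts

item = MCQItem(
    stem="Who was voted the best international footballer for 2004?",
    key="Ronaldinho",
    distractors=["Henry", "Beckham", "Ronaldo"],
)
```

Shuffling at presentation time addresses the guideline (next slide) that the key should not always occur at the same position.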

Distractors (also distracters) are the incorrect
answers presented as a choice in a multiple-choice test
◦ Challenge: Generation of "good" distractors
 Ensure that there is only one correct response for
single-response MCQs
 The key should not always occur at the same
position in the list of answers
 Distractors should be grammatically parallel with
each other and approximately equal in length
 Distractors should be plausible and attractive
 However, distractors should not be too close to
the correct answer and risk confusing students
11
1. Selection of the key
 Unknown words that appear in a reading
 Domain-specific terms
2. Generation of the question focus
 Constrained patterns
 Transformation of source clauses to question
focuses.
Transitive verbs require objects → Which kind
of verbs require objects?
12
3. Generation of the distractors
 WordNet concepts which are semantically close to the key,
e.g. hypernyms and co-hyponyms
◦ "Which part of speech serves as the most central element in a
clause?"
◦ Key: "verb",
◦ Distractors: "noun", "adjective", "preposition"
 Same POS
 Similar frequency range
 For grammar questions, use a declension or a conjugation
tool to generate different forms of the key, e.g. change
case, number, person, mode, tense, etc.
 Common student errors in the given context
 Collocations: frequent co-occurrence with either the left or
the right context
13
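The co-hyponym strategy for step 3 can be sketched in a few lines. A toy hypernym table stands in for WordNet here, so the data is illustrative only; with NLTK, the same lookup would go through synset hypernyms and their hyponyms:

```python
# Toy stand-in for WordNet: maps a hypernym to its hyponyms.
TOY_TAXONOMY = {
    "part_of_speech": ["verb", "noun", "adjective", "preposition"],
}

def co_hyponym_distractors(key, taxonomy, n=3):
    """Return up to n co-hyponyms of `key` (words sharing its
    hypernym) to serve as semantically close distractors."""
    for hypernym, hyponyms in taxonomy.items():
        if key in hyponyms:
            return [w for w in hyponyms if w != key][:n]
    return []

# Distractors for the key "verb":
print(co_hyponym_distractors("verb", TOY_TAXONOMY))
# → ['noun', 'adjective', 'preposition']
```

In a real system the co-hyponyms would additionally be filtered by POS and frequency range, as listed above.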
 Consists of a portion of text with certain
words removed
 The student is asked to "fill in the blanks"
 Challenges:
◦ Phrase the question so that only one correct answer
is possible (e.g. verb to be conjugated)
14
 1. Selection of an input corpus
 2. POS tagging
 3. Selection of the blanks in the input corpus
◦ Every "n-th" (e.g. fifth or eighth) word in the text
◦ Words in specified frequency ranges, e.g. only high
frequency or low frequency words
◦ Words belonging to a given grammatical category
◦ Open-class words, given their POS
◦ Machine learning, based on a pool of input questions
used as training data
 4. Where needed, provide some information
about the word in the blank, e.g. verb lemma
when the test targets verb conjugation
15
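The simplest blank-selection strategy above (every n-th word) can be sketched directly; the function name and interface are assumptions for illustration:

```python
def make_cloze(text, n=5):
    """Blank out every n-th word of `text`; return the cloze
    text and the list of removed words (the answer key)."""
    tokens = text.split()
    answers = []
    for i in range(n - 1, len(tokens), n):
        answers.append(tokens[i])
        tokens[i] = "____"
    return " ".join(tokens), answers

cloze, answers = make_cloze("the quick brown fox jumps over the lazy dog", n=3)
# cloze   → "the quick ____ fox jumps ____ the lazy ____"
# answers → ['brown', 'over', 'dog']
```

The frequency-based and POS-based variants would replace the fixed stride with a per-token test (frequency range, grammatical category), keeping the same blanking loop.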

Short answer assessment
◦ Input: the learner's response, one or more target
responses, the question, and the source reading
passage
◦ Linguistic analysis:
annotation, alignment,
diagnosis
 Essays
 Plagiarism detection
 Speech generation
16
 Related techniques: summarisation and
sentence compression
 Syntactic simplification:
◦ Removal or replacement of difficult syntactic
structures, using hand-built transformational rules
applied to dependency and parse trees
 Lexical simplification:
◦ Replace difficult words with simpler ones
◦ Difficult words are identified using the number of
syllables and/or frequency counts in a corpus
◦ Choose the simplest synonym for difficult words in
WordNet
17
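The lexical simplification steps above (flag words with many syllables, swap in a simpler synonym) can be sketched as follows. The syllable counter is a crude vowel-group heuristic, and a toy synonym table stands in for WordNet; both are assumptions for illustration:

```python
import re

# Toy synonym table standing in for WordNet synsets.
SYNONYMS = {"utilise": ["use"], "endeavour": ["try", "attempt"]}

def count_syllables(word):
    """Crude syllable estimate: count groups of adjacent vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def simplify(text, max_syllables=2):
    """Replace 'difficult' words (too many syllables) with the
    simplest listed synonym, when one is available."""
    out = []
    for word in text.split():
        if count_syllables(word) > max_syllables and word in SYNONYMS:
            word = min(SYNONYMS[word], key=count_syllables)
        out.append(word)
    return " ".join(out)

print(simplify("we endeavour to utilise simple words"))
# → "we try to use simple words"
```

A corpus-frequency check (slide above) would be a second condition alongside the syllable count.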

Overall goal: support vocabulary acquisition
during reading for:
◦ children, who learn to read
◦ foreign language learners, who read texts in a foreign
language
 Problem: a word's context may not provide
enough information about its meaning
 Solution: augment documents with dynamically
generated annotations about (problematic)
words
18
A grammar is created for the automatic identification
of definitions in texts
Types of definitions

“is_def” – “HTML este tot un protocol folosit de
World Wide Web.” (HTML is also a protocol used
by World Wide Web).

“verb_def” – “Poşta electronică reprezintă
transmisia mesajelor prin intermediul unor reţele
electronice.” (Electronic mail represents sending
messages through electronic networks).

“punct_def” – “Bit – prescurtarea pentru binary
digit” (Bit – shortcut for binary digit)
19
 layout_def
Ro: Organizarea datelor
Cel mai simplu mod de organizare este cel secvenţial.
En: Data organizing
The simplest method is the sequential one.
 “pron_def” – “…definirii conceptului de baze de
date. Acesta descrie metode de ….” (…defining the
database concept. It describes methods of ….)
 “other_def” – “triunghi echilateral, adică cu toate
laturile egale” (equilateral triangle, i.e. having all
sides equal).
Type         Manual    %     Automatic    %
is_def          70    33.8       204     32.8
verb_def       116    56.0       272     43.8
punct_def       15     7.2       124     20.0
layout_def       2     1.0        21      3.4
pron_def         4     2.0         0      0.0
Total          207               621
 Simple grammar rules
 Composed grammar rules
 “is_def” grammar rule:
<rule name="may_be_term">
<seq>
<query match="tok[@base='fi' and
substring(@ctag,1,5)='vmip3']"/>
<first>
<ref name="UndefNominal" />
<ref name="DefNominal" />
</first>
</seq>
</rule>

Lxtransduce (Tobin 2005) is used to
match the grammar in files
Definition type   Matching level    P        R        F2
is_def            Sentence-level   0.5366   1.0      0.7765
is_def            Token-level      0.0648   0.3328   0.14
verb_def          Sentence-level   0.7561   1.0      0.9029
verb_def          Token-level      0.0471   0.1422   0.085
punct_def         Sentence-level   0.1463   1.0      0.3396
punct_def         Token-level      0.0025   0.1163   0.0072
layout_def        Sentence-level   0.0488   1.0      0.1333
layout_def        Token-level      0.0007   0.1020   0.0022
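The F2 scores reported above are consistent with a weighted F-measure F_beta using beta^2 = 2, i.e. recall weighted twice as heavily as precision (a plausible choice for extraction at 100% recall; the exact formula used is an assumption checked against the reported numbers):

```python
def f_measure(p, r, beta_sq=2.0):
    """Weighted harmonic mean of precision and recall:
    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta_sq) * p * r / (beta_sq * p + r)

# Sentence-level is_def: P = 0.5366, R = 1.0
print(round(f_measure(0.5366, 1.0), 4))  # → 0.7765
# Sentence-level verb_def: P = 0.7561, R = 1.0
print(round(f_measure(0.7561, 1.0), 4))  # → 0.9029
```

Both reproduce the reported values to four decimal places.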

According to the answer type, we have
the following types of questions
(Harabagiu, Moldovan 2007):
◦ Factoid – “Who discovered oxygen?” or
“When did Hawaii become a state?” or “What
football team won the World Cup in 1992?”.
◦ List – “What countries export oil?” or “What are
the regions preferred by the Americans for
holidays?”.
◦ Definition – “What is a quasar?” or “What is a
question-answering system?”
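A naive surface-pattern classifier along these lines can separate the three types; the rules below are illustrative guesses, not the classifier of Harabagiu and Moldovan:

```python
def question_type(q):
    """Very rough typing of a question as factoid/list/definition,
    based only on its opening words."""
    ql = q.lower()
    if ql.startswith("what is"):
        return "definition"
    if ql.startswith(("what countries", "what are", "list")):
        return "list"
    return "factoid"

print(question_type("What is a quasar?"))                # → definition
print(question_type("What countries export oil?"))       # → list
print(question_type("When did Hawaii become a state?"))  # → factoid
```

Real systems classify by expected answer type (PERSON, DATE, etc.), as the (Cine, zeus, PERSON) triple on the next slide suggests.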
 Question: Cine este Zeus? (Who is Zeus?)
 (Cine, zeus, PERSON)
 Snippet: 0026#10014#1.0#Zeus#Zeus\zeus\NP
este\fi\V3\ cel\cel\TSR\ mai\mai\R\
puternic\puternic\ASN\ dintre\dintre\S\
olimpieni\olimpieni\NPN\ ,\,\COMMA\
socotit\socoti\VP\ drept\drept\S\
stăpânul\stăpân\NSRY\ suprem\suprem\ASN\
al\al\TS\ oamenilor\om\NPOY\ şi\şi\CR\ al\al\TS\
zeilor\zeu\NPOY\ .\.\PERIOD\
Our pattern for “is_def”
(\zeus\.*\NP .*\fi\V3\ (.*)) matches the snippet
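A Python rendering of that match over the surface\lemma\TAG token format could look like this; the slide's pattern notation is schematic, so the exact regex escaping here is an assumption:

```python
import re

# Tokens are formatted as surface\lemma\TAG, as in the snippet above.
snippet = ("Zeus\\zeus\\NP este\\fi\\V3\\ cel\\cel\\TSR\\ mai\\mai\\R\\ "
           "puternic\\puternic\\ASN\\ dintre\\dintre\\S\\ "
           "olimpieni\\olimpieni\\NPN\\")

# Look for the question term (lemma "zeus", tag NP), then a form of
# "fi" (to be, tag V3), and capture the rest as the candidate definition.
pattern = re.compile(r"\\zeus\\NP\s+\S+\\fi\\V3\\\s+(.*)")
m = pattern.search(snippet)
if m:
    print(m.group(1).split("\\")[0])  # → cel (first word of the definition)
```

The captured group ("cel mai puternic dintre olimpieni…") is the text returned as the answer.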
 Using a training corpus of documents
annotated with keywords
 Measuring the distribution of manually marked
keywords over documents
Language     # of annotated   Average length   # of       Average # of
             documents        (# of tokens)    keywords   keywords per doc.
Bulgarian          55             3980           3236          77
Czech             465              672           1640           3.5
Dutch              72             6912           1706          24
English            36             9707           1174          26
German             34             8201           1344          39.5
Polish             25             4432           1033          41
Portuguese         29             8438            997          34
Romanian           41             3375           2555          62
 Did the human annotators annotate keywords
or domain terms?
 Was the task adequately contextualised?
 Good keywords have a typical, non-random
distribution in and across documents
 Keywords tend to appear more often at
certain places in texts (headings etc.)
 Keywords are often highlighted /
emphasised by authors
 Keywords express / represent the topic(s)
of a text
 Linguistic filtering of KW candidates, based
on part of speech and morphology
 Distributional measures are used to identify
unevenly distributed words
◦ TFIDF
 Knowledge of text structure used to identify
salient regions (e.g., headings)
 Layout features of texts used to identify
emphasised words and weight them higher
 Finding chains of semantically related words
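The TFIDF measure mentioned above scores a term highly when it is frequent in one document but rare across the collection. A minimal sketch (one common tf and idf variant; production systems vary the weighting):

```python
import math

def tfidf(docs):
    """Per-document TF-IDF scores; `docs` is a list of token lists.
    tf = term count / doc length, idf = log(N / document frequency)."""
    n = len(docs)
    df = {}                       # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

# Words occurring in every document get weight 0 (idf = log 1);
# words concentrated in one document score higher.
docs = [["learning", "style", "style"], ["learning", "theory"]]
scores = tfidf(docs)
```

In the keyword extractor, these distributional scores would be combined with the structural and layout weights listed above.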
 Treating multi-word keywords
 Assigning a combined weight which takes
into account all the aforementioned factors
 Multilinguality: finding good settings for all
languages, balancing language-dependent
and language-independent features
 Keyphrases have to be restricted with respect to
length (max 3 words) and frequency (min 2
occurrences)
 Keyphrase patterns must be restricted with respect
to linguistic categories (style of learning is
acceptable; of learning styles is not)
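The length and frequency restrictions can be sketched as a simple filter; the linguistic-pattern check (which would reject "of learning styles" by its POS shape) is omitted here, so this covers only the first bullet:

```python
from collections import Counter

def filter_keyphrases(candidates, max_len=3, min_freq=2):
    """Keep candidate phrases of at most max_len words that
    occur at least min_freq times in the candidate list."""
    counts = Counter(candidates)
    return sorted({c for c, f in counts.items()
                   if f >= min_freq and len(c.split()) <= max_len})

cands = ["style of learning", "style of learning",
         "of learning styles in schools", "neural network"]
print(filter_keyphrases(cands))  # → ['style of learning']
```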
 Human annotators marked n keywords in
document d
 The first n choices of the keyword extractor (KWE)
for document d are extracted
 Measure the overlap between both sets
 Also measure partial matches
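The overlap measurement can be sketched as follows; "partial match" is taken here to mean a predicted phrase sharing at least one token with some gold keyword, which is one plausible reading of the slide, not necessarily the definition used in the evaluation:

```python
def keyword_overlap(gold, predicted):
    """Count exact matches and partial (shared-token) matches
    between gold and predicted keyword sets."""
    gold, predicted = set(gold), set(predicted)
    exact = gold & predicted
    partial = {p for p in predicted - exact
               if any(set(p.split()) & set(g.split()) for g in gold)}
    return len(exact), len(partial)

gold = ["machine learning", "neural network"]
pred = ["machine learning", "learning rate", "dataset"]
print(keyword_overlap(gold, pred))  # → (1, 1)
```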

Resources:
◦ Lexical semantic resources, e.g. WordNet
◦ Web 2.0 resources, e.g. Wikipedia, Wiktionary

Tools:
◦ Tokeniser and sentence splitting
◦ Morphological analysis
◦ Part of speech tagging
◦ Parsing and chunking
◦ Word sense disambiguation
◦ Summarisation
◦ Keyword extraction
35

To assist instructors
◦ Automatic generation of questions and exercises
◦ Assessment of learner-generated discourse

To assist learners
◦ Reading and writing assistance
◦ Electronic career guidance
◦ Educational question answering

For all users in the Web 2.0
◦ NLP for wikis
◦ Quality assessment of user generated contents
36
 Computer-Assisted Language Learning
 Intelligent Tutoring Systems
 Information search for eLearning
 Educational blogging
 Annotations and social tagging
 Analysing collaborative learning processes
automatically
 Learners' corpora and resources
 eLearning standards, e.g. SCORM
37

1a) Extract definitions from a given Wikipedia page
1b) Generate questions such as “what is …" or “what is
the meaning of …" from the list above

2) Automatic generation of “fill in the blanks” questions

 Dacă nu ai nimic planificat diseară, hai __ teatru.
 (a) la (b) de (c) pentru (d) null
◦ Input: a sentence and the key
 Dacă nu ai nimic planificat diseară, hai la teatru.
(If you have nothing planned for tonight, let's go to the theatre.)
 Key: la
◦ Output: generate three distractors using different approaches:
 baseline: word frequencies
 Collocations
 "creative" method, devised by the students
38
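The frequency baseline for distractor generation could work like this: among candidates of the same class as the key, pick those whose corpus frequency is closest to the key's. The preposition frequencies below are invented for illustration:

```python
# Toy corpus frequencies for Romanian prepositions (illustrative numbers).
FREQ = {"la": 900, "de": 1200, "pentru": 400, "cu": 850, "pe": 700, "din": 650}

def frequency_baseline_distractors(key, freq, n=3):
    """Pick the n candidates whose corpus frequency is closest
    to the key's, so alternatives feel equally plausible."""
    candidates = [w for w in freq if w != key]
    return sorted(candidates, key=lambda w: abs(freq[w] - freq[key]))[:n]

print(frequency_baseline_distractors("la", FREQ))  # → ['cu', 'pe', 'din']
```

The collocation-based method would instead score candidates by how often they co-occur with the blank's left/right context (here "hai __ teatru").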
 Jill Burstein: Opportunities for Natural Language
Processing Research in Education. Proceedings of
CICLing '09, the 10th International Conference on
Computational Linguistics and Intelligent Text
Processing, Springer-Verlag, Berlin/Heidelberg, 2009.
 Paola Monachesi, Eline Westerhout: What can NLP
techniques do for eLearning? Presented at INFOS
2008, 27-29 March.
 Adrian Iftene, Diana Trandabăţ, Ionuţ Pistol:
Grammar-based Automatic Extraction of
Definitions and Applications for Romanian. RANLP
2007 workshop: Natural Language Processing and
Knowledge Representation for eLearning
Environments.
39

Educational Applications of NLP
http://www.ets.org/research/topics/as_nlp/educational_applications
40
41
(1) Plagiarism of authorship: the direct case of putting
your own name to someone else’s work
(2) Word-for-word plagiarism: copying of phrases or
passages from published text without quotation or
acknowledgement.
(3) Paraphrasing plagiarism: words or syntax are
changed (rewritten), but the source text can still be
recognized.
(4) Plagiarism of the form of a source: the structure of an
argument in a source is copied (verbatim or rewritten)
(5) Plagiarism of ideas: the reuse of an original thought
from a source text without dependence on the words
or form of the source
(6) Plagiarism of secondary sources: original sources are
referenced or quoted, but obtained from a secondary
source text without looking up the original.
42