Supporting e-learning with automatic glossary extraction
Supporting e-learning with automatic glossary extraction
Experiments with Portuguese
Rosa Del Gaudio, António Branco
RANLP, Borovets 2007
Presentation Plan
● LT4eL project
● ILIAS
● Corpus
● Tool
● Grammars
● Copula
● Other Verbs
● Punctuation
● Results
● Conclusion
LT4eL
● Improve retrieval and accessibility of Learning Objects (LOs) in learning management systems
● Employ language technology resources and tools for the semi-automatic generation of descriptive metadata
● Develop new functionalities such as a keyword extractor, a glossary candidate detector and semantic search, tuned for the various languages addressed in the project (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian)
ILIAS
Objective
● Build a glossary automatically to support the e-learning process. In practice this means extracting definitions from unstructured text (scientific papers, encyclopedias, web pages)
● Better access to information for students
● Accelerate the work of the tutor
ILIAS: Glossary Candidate Detector
The Corpus
• 274,000 tokens
• Tutorials
• PhD theses
• Scientific papers
• 3 domains evenly represented
  • e-learning
  • Technology for non-experts
  • Calimera
XML format
<definingText continue="y" def="m147" def_type1="is_def" id="d5">
<markedTerm dt="y" id="m147" kw="y">
<tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok>
</markedTerm>
<tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
<tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
<tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
<tok base="desenvolver,desenvolvido" class="word" ctag="PPA" id="t9036" msd="fs"
sp="y">desenvolvida</tok>
<tok base="para" class="word" ctag="PREP" id="t9037" sp="y">para</tok>
<tok base="processamento" class="word" ctag="CN" id="t9038" msd="ms"
sp="y">processamento</tok>
<tok base="de" class="word" ctag="PREP" id="t9039" sp="y">de</tok>
<tok base="informação" class="word" ctag="CN" id="t9040" msd="fp"
sp="y">informações</tok>
<tok base="em" class="word" ctag="PREP" id="t9041" sp="y">em</tok>
<tok base="uma" class="word" ctag="UM" id="t9042" msd="fs" sp="y">uma</tok>
<tok base="empresa" class="word" ctag="CN" id="t9043" msd="fs" sp="y">empresa</tok>
<tok base="ou" class="word" ctag="CJ" id="t9044" sp="y">ou</tok>
<tok base="organização" class="word" ctag="CN" id="t9045" msd="fs">organização</tok>
<tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
</definingText>
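The annotation can be processed directly. A minimal Python sketch, assuming a corpus file named corpus.xml (a placeholder) with the element and attribute names shown above:

import xml.etree.ElementTree as ET

tree = ET.parse("corpus.xml")  # placeholder file name
for deftext in tree.iter("definingText"):
    # tokens of the defined term (inside markedTerm)
    term = [tok.text for mt in deftext.iter("markedTerm") for tok in mt.iter("tok")]
    # tokens of the whole defining passage
    sentence = [tok.text for tok in deftext.iter("tok")]
    # joining on spaces ignores the sp attribute, so spacing is only approximate
    print(" ".join(term), "->", " ".join(sentence))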
LxTransduce
• Matches trees using elements
• Quick
• Unicode friendly
• Freeware
• Easy to integrate into other tools (Java)
• Input: plain text or XML
• Regular expressions
• Substitution and markup
• Outputs the same file with the changes applied
Rules in lxtransduce
<rule name="PARopen">
<query match="tok[.~'^\($']"/>
</rule>
<rule name="PARcl">
<query match="tok[.~'^\($']"/>
</rule>
<rule name="parenthetic">
<seq>
<ref name="PARopen"/>
<repeat-until name="tok">
<ref name="PARcl"/>
</repeat-until>
<ref name="PARcl"/>
</seq>
</rule>
<rule name="Conj">
<query match="tok[@ctag =
'CJ']"/>
</rule>
<rule name="Coor"> <!-Conjunctions or comma -->
<first>
<query match="tok[. = ',']"/>
<ref name="Conj" mult="+"/>
</first>
</rule>
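Roughly: PARopen and PARcl match opening and closing parenthesis tokens, parenthetic consumes everything between them, and Coor matches a comma or one or more conjunctions. A sketch of the same matching logic in Python over (text, ctag) token pairs, as an illustration only (this is not lxtransduce):

def is_par_open(tok):
    return tok[0] == "("

def is_par_close(tok):
    return tok[0] == ")"

def match_parenthetic(tokens, i):
    # return the index just past a "( ... )" span starting at position i, or None
    if i >= len(tokens) or not is_par_open(tokens[i]):
        return None
    j = i + 1
    while j < len(tokens) and not is_par_close(tokens[j]):
        j += 1  # repeat-until: skip tokens up to the closing parenthesis
    return j + 1 if j < len(tokens) else None

def match_coor(tok):
    # Coor: a comma or a conjunction (ctag CJ)
    return tok[0] == "," or tok[1] == "CJ"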
First development phase

        Precision   Recall   F2
Gr 00   0.14        0.44     0.26
Gr 01   0.31        0.20     0.22

● Less than 50% of the corpus
● Focus on the verb
● Precision: automatically extracted definitions that are manually marked / all automatically extracted
● Recall: automatically extracted definitions that are manually marked / all manually marked
● F2 = 3 * (precision * recall) / (2 * precision + recall)
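A quick check of the figures above in Python, using the F2 definition on this slide (weighted towards recall):

def f2(precision, recall):
    return 3 * precision * recall / (2 * precision + recall)

print(round(f2(0.14, 0.44), 2))  # Gr 00 -> 0.26
print(round(f2(0.31, 0.20), 2))  # Gr 01 -> 0.23 (0.22 on the slide, presumably from rounded P and R)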
Second development phase
• 75% of the corpus for developing
• 25% of the corpus for testing
• Specific grammar/rules for each type
Copula baseline grammar
Verb “to be”, third person singular or plural, present indicative
<rule name="SERdef">
<best>
<ref name="Ser3"/>
<ref name="PoderSer"/>
</best>
</rule>
<rule name="euristic">
<seq>
<repeat-until name="tok">
<ref name="SERdef" mult="+"/>
</repeat-until>
<ref name="SERdef" mult="+"/>
<not>
<ref name="PPA"/>
</not>
<ref name="tok" mult="*"/>
<end/>
</seq>
</rule>
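In rough terms, the euristic rule above accepts a sentence if it contains such a form of “ser” that is not immediately followed by a past participle (PPA), which would signal a passive. A Python paraphrase of that idea, assuming (lemma, ctag, msd) tokens as in the corpus annotation and leaving aside the PoderSer alternative:

def is_ser_3(tok):
    # 3rd-person present indicative form of "ser" ("é" / "são")
    lemma, ctag, msd = tok
    return ctag == "V" and lemma == "ser" and msd.startswith("pi-3")

def copula_candidate(tokens):
    for i, tok in enumerate(tokens):
        if is_ser_3(tok):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is None or nxt[1] != "PPA":  # exclude "ser" + past participle
                return True
    return False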
Copula baseline results
• Sentence-level results
• Problem with precision
Copula Grammar
Rules for is_type
<!-- To be, 3rd person sg and pl -->
<rule name="SERdef">
  <query match="tok[@ctag = 'V' and @base = 'ser' and
    (@msd[starts-with(.,'fi-3')] or @msd[starts-with(.,'pi-3')])]"/>
</rule>
....
<rule name="copula1">
  <seq>
    <ref name="SERdef"/>
    <best>
      <seq>
        <ref name="Art"/>
        <ref name="adj|adv|prep" mult="*"/>
        <ref name="Noun" mult="+"/>
      </seq>
      ....
    </best>
    <ref name="tok" mult="*"/>
    <end/>
  </seq>
</rule>
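The copula1 pattern, in rough terms: a form of “ser” followed by an article, optionally some adjectives, adverbs or prepositions, and at least one noun. A Python paraphrase over a list of ctags; the tag names used here (DA, UM, ADJ, ADV, PREP, CN) are assumptions based on the corpus excerpt:

def copula1_match(ctags, i):
    # ctags[i] is the position of the "ser" token
    j = i + 1
    if j >= len(ctags) or ctags[j] not in ("DA", "UM"):  # article
        return False
    j += 1
    while j < len(ctags) and ctags[j] in ("ADJ", "ADV", "PREP"):
        j += 1  # optional modifiers
    return j < len(ctags) and ctags[j] == "CN"  # at least one common noun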
Comparing Results
Include the patterns that were excluded.
Gather the syntactic patterns of non-definitions and compare them with the syntactic patterns of definitions.
Other_Verbs grammar
• Collect verbs in a lexicon
• Three different categories: reflexive, active, passive
• 22 different verbs
<lex word="chamar">
<cat>ref</cat>
</lex>
<lex word="chamar,chamado">
<cat>pas</cat>
</lex>
<rule name="Vpas">
<seq>
<ref name="tok"/>
<not>
<ref name="not"/>
</not>
<ref name="tok" mult="?"/>
<query match="tok[mylex(@base)
and (@ctag='PPA')]"
constraint="mylex(@base)/cat='
pas'"/>
</seq>
</rule>
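A minimal Python sketch of the lexicon plus the Vpas constraint above: a token counts as a passive defining verb if its lemma is listed with category pas and it is tagged as a past participle. Only the two lexicon entries shown on the slide are included; the rest is illustrative:

VERB_LEXICON = {
    "chamar": {"ref"},            # reflexive use ("chama-se X")
    "chamar,chamado": {"pas"},    # passive use ("é chamado X")
    # ... remaining entries of the 22-verb lexicon
}

def is_passive_defining_verb(tok):
    lemma, ctag = tok
    return ctag == "PPA" and "pas" in VERB_LEXICON.get(lemma, set())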
Results for verb_type
• Analyze each verb separately, as with is_type
• Richer syntactic patterns
Punctuation Grammar
Preliminary work
● Definitions introduced by a colon (the most frequent pattern)
<rule name="punct_def">
  <seq>
    <start/>
    <ref name="CompmylexSN" mult="+"/>
    <query match="tok[.~'^:$']"/>
    <ref name="tok" mult="+"/>
    <end/>
  </seq>
</rule>
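In rough terms, punct_def accepts sentences that open with a nominal sequence, then a colon, then more text. A simplified Python paraphrase over (text, ctag) pairs; the nominal test here is only a stand-in for the CompmylexSN rule:

def punct_candidate(tokens):
    for i, (text, ctag) in enumerate(tokens):
        if text == ":":
            before, after = tokens[:i], tokens[i + 1:]
            # a nominal element before the colon and some text after it
            return any(ct == "CN" for _, ct in before) and len(after) > 0
    return False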
All-in-one
• Combination of the previous grammars
• The definition type is not taken into account when calculating precision and recall
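A minimal sketch of that combination, assuming each grammar returns a set of candidate sentence ids: the all-in-one extractor is simply their union, scored against the manually marked definitions:

def combine(copula_ids, verb_ids, punct_ids):
    return set(copula_ids) | set(verb_ids) | set(punct_ids)

def precision_recall(candidates, gold):
    correct = candidates & gold
    return len(correct) / len(candidates), len(correct) / len(gold)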
Conclusions and Future Work
• Overall results: Recall 86%, Precision 14%
• Differences among domains: the style of a document influences the results
• Improve the rules for verb_type and punct_type
• Combine with other techniques such as machine learning (ML)