javaConLib
GSLT: Java Development for HLT
Leif Grönqvist – [email protected]
11. June 2002 10:30
What have I done?
 I have implemented a library for various context-based word sense disambiguation methods
 From the beginning I have had a test method that tries to provoke errors in each part of the implementation
 A command-line application that uses the library and implements Yarowsky (1995)
 I have tried to write final-quality code from the start
11 June 2002
Java Development for HLT: Leif
Grönqvist
2
What is left to do?
 One very simple test implementation
 Tutorial-based documentation
 Fix the things Lars pointed out in the last iteration
 Write an Ant build script
 The final report
Project Background
 There are several methods for word sense disambiguation based on context. For example:
 Yarowsky's unsupervised algorithm from 1995 is based on two general observations:
 One sense per collocation: nearby words provide strong and consistent clues to the sense
 One sense per discourse: the sense of a target word is highly consistent within any given document
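
The two observations can be sketched in a few lines of Java. This is an illustration only, not code from javaConLib: the class name, the method names and the smoothing constant alpha are assumptions made for this example. The collocation score is a smoothed log-likelihood ratio of how often a clue word co-occurs with sense A versus sense B, and the discourse step is a plain majority vote over the labels assigned within one document.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of javaConLib.
class YarowskyObservations {

    /**
     * One sense per collocation: a smoothed log-likelihood ratio.
     * Positive values favour sense A, negative values favour sense B.
     * alpha is an assumed smoothing constant that avoids division by zero.
     */
    public static double collocationScore(int countA, int countB, double alpha) {
        return Math.log((countA + alpha) / (countB + alpha));
    }

    /**
     * One sense per discourse: the most frequent label in a document.
     * Returns null when the list is empty or the top count is tied.
     */
    public static String majoritySense(List<String> labels) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String label : labels) {
            freq.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int count = e.getValue();
            if (count > bestCount) {
                best = e.getKey();
                bestCount = count;
                tie = false;
            } else if (count == bestCount) {
                tie = true;
            }
        }
        return tie ? null : best;
    }
}
```

A clue with a large absolute score is a strong candidate for a rule, and the majority sense can be used to relabel the remaining occurrences in the same document.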
A much simpler supervised approach
 Start with a disambiguated set of occurrences
 For each sense, count all word types within a ±5-word context
 To disambiguate a new occurrence: compare its context to the distributions of the possible senses
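
A minimal sketch of the supervised approach above, under stated assumptions: the class and method names are invented for this example and are not javaConLib's API, and the comparison step uses add-one smoothed log probabilities, which is only one of several reasonable ways to compare a context to the per-sense distributions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical class, not part of javaConLib.
class WindowSenseClassifier {
    private final int window;
    // sense -> (word type -> count); TreeMap keeps iteration order deterministic
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();

    public WindowSenseClassifier(int window) {
        this.window = window;
    }

    /** Words within +-window tokens of position pos, excluding the target itself. */
    static List<String> context(List<String> tokens, int pos, int window) {
        List<String> ctx = new ArrayList<>();
        int from = Math.max(0, pos - window);
        int to = Math.min(tokens.size() - 1, pos + window);
        for (int i = from; i <= to; i++) {
            if (i != pos) {
                ctx.add(tokens.get(i));
            }
        }
        return ctx;
    }

    /** Adds the context of one sense-tagged occurrence to the counts for that sense. */
    public void train(List<String> tokens, int pos, String sense) {
        Map<String, Integer> senseCounts = counts.computeIfAbsent(sense, k -> new TreeMap<>());
        for (String w : context(tokens, pos, window)) {
            senseCounts.merge(w, 1, Integer::sum);
        }
    }

    /** Returns the sense whose distribution gives the context the highest smoothed score. */
    public String classify(List<String> tokens, int pos) {
        Set<String> vocab = new HashSet<>();
        for (Map<String, Integer> m : counts.values()) {
            vocab.addAll(m.keySet());
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int total = 0;
            for (int c : e.getValue().values()) {
                total += c;
            }
            double score = 0.0;
            for (String w : context(tokens, pos, window)) {
                int c = e.getValue().getOrDefault(w, 0);
                // Add-one smoothing (an assumption of this sketch)
                score += Math.log((c + 1.0) / (total + vocab.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Training calls train once per tagged occurrence, and classify is then called with the token list and the position of the ambiguous word.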
javaConLib
 These two algorithms have a lot in common
 There are many more similar algorithms
 javaConLib includes classes that make implementation and tuning much simpler
 High-level, intuitive methods: the main class will read more like an algorithm description
Typical parts of a main class
 Yarowsky y = new Yarowsky(5);
 Corpus trainCorp = new Corpus("train.txt");
 SenseSet s1 = new SenseSet("äger|ägde", "Abs", y.posl1);
 DecisionList decList = y.train95(s1, s2, "rum", trainCorp);
 ContextList testCont = y.test95(decList, testCorpus, s1, s2, word);
 print(testCont.toString());
The Classes
 Context: An array of words of a specific size, with the main word at position 0.
 ContextList: A set of Contexts around a certain word type, extracted from a corpus.
 Corpus: Basically a vector containing the words read from a file.
 Decision: Contains a word, a position, and a score indicating how reliably it determines the sense of the main word in a context.
 DecisionList: A decision list like the one used in Yarowsky's algorithm from 1995.
 FreqList: A frequency list for strings in a corpus.
 Positions: Holds a list of positions (integers) relative to the center word when working with words and contexts.
 SenseSet: A set of the components needed for each sense when using the Yarowsky 1995 algorithm for word sense disambiguation.
 Yarowsky: A class with some structures and classes useful when implementing Yarowsky's disambiguation algorithm from 1995 and similar algorithms.
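
To make the roles of Decision and DecisionList concrete, here is a small sketch of how a decision list might be applied to a context. It is not the javaConLib implementation: the class names, the sense field on each rule, and the representation of a context as a map from relative position (0 = main word) to word are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical classes, not the javaConLib API.
class DecisionListSketch {

    /** If `word` occurs at relative `position`, choose `sense`; `score` ranks the rule. */
    static final class Rule {
        final String word;
        final int position;
        final String sense;
        final double score;

        Rule(String word, int position, String sense, double score) {
            this.word = word;
            this.position = position;
            this.sense = sense;
            this.score = score;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Adds a rule and keeps the list sorted with the highest score first. */
    public void add(Rule rule) {
        rules.add(rule);
        rules.sort((a, b) -> Double.compare(b.score, a.score));
    }

    /**
     * Applies the first (strongest) rule that matches the context.
     * The context maps a relative position (0 = main word) to the word found there.
     */
    public String classify(Map<Integer, String> context, String fallback) {
        for (Rule r : rules) {
            if (r.word.equals(context.get(r.position))) {
                return r.sense;
            }
        }
        return fallback;
    }
}
```

Only the strongest matching rule decides the sense, which is what distinguishes a decision list from a scheme that sums the evidence from all matching clues.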
We are done
And probably out of time