javaConLib
GSLT: Java Development for HLT
Leif Grönqvist – [email protected]
11 June 2002, 10:30
What have I done?
I have implemented a library useful for
various word sense disambiguation
algorithms based on contexts
From the beginning I have had a test
method trying to provoke errors in each
part of the implementation
A command line application using the
library, implementing Yarowsky 1995
I have tried to write final-quality code from the start
11 June 2002
Java Development for HLT: Leif Grönqvist
What is left to do?
One very simple test implementation
Tutorial-based documentation
Address the things Lars pointed out in the last
iteration
Make an Ant build script
The final report
Project Background
There are several methods for word sense disambiguation
based on context. For example:
Yarowsky’s unsupervised algorithm from 1995
is based on two general observations:
One sense per collocation: nearby words provide
strong and consistent clues
One sense per discourse: the sense for a target
word is highly consistent within any document
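The second observation can be illustrated with a small sketch: within one document, every occurrence of the target word gets the sense that most of its already labelled occurrences have. This is only an illustration under assumptions. The class OneSensePerDiscourse, its methods, and the sense labels "A" and "B" are hypothetical and are not part of javaConLib.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of javaConLib.
final class OneSensePerDiscourse {

    /** Returns the most frequent non-null sense label, or null if nothing is labelled. */
    static String majoritySense(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : labels) {
            if (label == null) continue; // occurrence not yet disambiguated
            int c = counts.merge(label, 1, Integer::sum);
            if (c > bestCount) { // on ties, the label that reached the count first wins
                best = label;
                bestCount = c;
            }
        }
        return best;
    }

    /** Relabels every occurrence in one document with the document's majority sense. */
    static List<String> propagate(List<String> labels) {
        String majority = majoritySense(labels);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < labels.size(); i++) {
            result.add(majority);
        }
        return result;
    }

    public static void main(String[] args) {
        // Five occurrences of the target word in one document; null means "unlabelled".
        List<String> doc = Arrays.asList("A", null, "A", "B", null);
        System.out.println(propagate(doc)); // prints [A, A, A, A, A]
    }
}
```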
A much simpler supervised approach
Start with a disambiguated set of
occurrences
Count all word types within a ±5-word
context for each sense
To disambiguate a new occurrence:
compare its context to the distribution
for each possible sense
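The steps above can be sketched roughly as follows. The slide does not say how a context is compared to a sense's distribution, so this sketch assumes a very simple score: the summed relative frequency of the context words under each sense. The class ContextCountClassifier and the English example sentences are hypothetical and are not taken from javaConLib.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; class and method names are not taken from javaConLib.
final class ContextCountClassifier {
    private static final int WINDOW = 5; // the +-5 word context

    // sense -> (context word type -> count); TreeMap keeps tie-breaking deterministic
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();
    // sense -> total number of context tokens counted for that sense
    private final Map<String, Integer> totals = new HashMap<>();

    /** Adds the context of one disambiguated occurrence to the counts of its sense. */
    void train(List<String> tokens, int target, String sense) {
        Map<String, Integer> senseCounts = counts.computeIfAbsent(sense, k -> new HashMap<>());
        int from = Math.max(0, target - WINDOW);
        int to = Math.min(tokens.size() - 1, target + WINDOW);
        for (int i = from; i <= to; i++) {
            if (i == target) continue; // skip the ambiguous word itself
            senseCounts.merge(tokens.get(i), 1, Integer::sum);
            totals.merge(sense, 1, Integer::sum);
        }
    }

    /** Returns the sense with the highest summed relative frequency (assumed score). */
    String classify(List<String> tokens, int target) {
        int from = Math.max(0, target - WINDOW);
        int to = Math.min(tokens.size() - 1, target + WINDOW);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int total = totals.getOrDefault(e.getKey(), 0);
            double score = 0.0;
            for (int i = from; i <= to && total > 0; i++) {
                if (i == target) continue;
                score += e.getValue().getOrDefault(tokens.get(i), 0) / (double) total;
            }
            if (score > bestScore) {
                best = e.getKey();
                bestScore = score;
            }
        }
        return best; // null if nothing has been trained
    }

    public static void main(String[] args) {
        ContextCountClassifier c = new ContextCountClassifier();
        c.train(Arrays.asList("the", "bank", "of", "the", "river", "was", "muddy"), 1, "river");
        c.train(Arrays.asList("she", "deposited", "money", "in", "the", "bank", "today"), 5, "money");
        System.out.println(c.classify(Arrays.asList("he", "put", "money", "in", "the", "bank"), 5)); // prints money
    }
}
```

A real implementation would probably want smoothing and some weighting of very frequent words such as "the"; both are left out here to keep the sketch short.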
javaConLib
These two algorithms have a lot in
common
There are many more similar algorithms
javaConLib includes classes that greatly simplify
implementation and tuning
Higher order and intuitive methods – the
main class will look more like an
algorithm description
Typical parts of a main class
Yarowsky y = new Yarowsky(5);
Corpus trainCorp = new Corpus("train.txt");
SenseSet s1 = new SenseSet("äger|ägde", "Abs", y.posl1);
// s2, testCorpus and word are set up in the same way (not shown on this slide)
DecisionList decList = y.train95(s1, s2, "rum", trainCorp);
ContextList testCont = y.test95(decList, testCorpus, s1, s2, word);
print(testCont.toString());
The Classes
Context: An array of words of a specific size, with the main word at position 0
ContextList: A set of Contexts around a certain word type, extracted from a corpus
Corpus: Basically a vector containing the words read from a file
Decision: A word, a position, and a score indicating how reliably that word at that position decides the sense of the main word in a context
DecisionList: A decision list like the one used in Yarowsky's algorithm from 1995
FreqList: A frequency list for strings in a corpus
Positions: A list of positions (integers) relative to the center word, used when working with words and contexts
SenseSet: The set of components needed for each sense when using the Yarowsky 1995 algorithm for word sense disambiguation
Yarowsky: Structures and classes useful when implementing Yarowsky's disambiguation algorithm from 1995, and similar algorithms
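To make the idea behind Decision and DecisionList more concrete, here is a minimal sketch of a decision list. Each rule holds a word, a relative position, and a score, and the strongest matching rule decides the sense. The score is the absolute log-likelihood ratio used to rank rules in Yarowsky (1995). The classes DecisionListSketch and Rule, the method names, and the smoothing constant are assumptions made for this illustration and do not show the actual javaConLib API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; this is not the javaConLib DecisionList class.
final class DecisionListSketch {

    /** One rule: "word at relative position => sense", with a strength score. */
    static final class Rule {
        final String word;
        final int position; // relative to the target word, e.g. -1 = word to the left
        final String sense;
        final double score;

        Rule(String word, int position, String sense, double score) {
            this.word = word;
            this.position = position;
            this.sense = sense;
            this.score = score;
        }
    }

    private static final double ALPHA = 0.1; // smoothing constant, chosen arbitrarily here

    private final List<Rule> rules = new ArrayList<>();

    /** Adds a rule from how often the word was seen at this position with each sense. */
    void addEvidence(String word, int position,
                     String senseA, int countA, String senseB, int countB) {
        double llr = Math.log((countA + ALPHA) / (countB + ALPHA));
        rules.add(new Rule(word, position, llr >= 0 ? senseA : senseB, Math.abs(llr)));
        // Keep the list ordered with the most reliable rule first.
        rules.sort(Comparator.comparingDouble((Rule r) -> r.score).reversed());
    }

    /** Returns the sense of the strongest matching rule, or null if no rule matches. */
    String classify(String[] tokens, int target) {
        for (Rule r : rules) {
            int i = target + r.position;
            if (i >= 0 && i < tokens.length && tokens[i].equals(r.word)) {
                return r.sense;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        DecisionListSketch list = new DecisionListSketch();
        list.addEvidence("river", -1, "A", 10, "B", 0);
        list.addEvidence("the", -1, "A", 3, "B", 4);
        list.addEvidence("account", 1, "A", 0, "B", 7);
        System.out.println(list.classify(new String[] {"the", "bank", "account"}, 1)); // prints B
    }
}
```

In the example, the weak rule for "the" also matches, but the much stronger rule for "account" comes first in the list and decides the sense.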
We are done
And probably out of time