94.2% Correctness By: Myles Maxfield and Jonathan Reed

Download Report

Transcript 94.2% Correctness By: Myles Maxfield and Jonathan Reed

To quantitatively test the quality of the spell checker, the program was executed on predefined “test beds”
of words for numerous trials, ranging from thousands to millions of times. The test beds varied widely in
content because they were freshly constructed from the latest statistical data of the most commonly searched
words on two products of our sponsor company. To ensure that the time efficiency results were accurate, the
data from each trial was recorded and averaged. In addition, each corrected result and the frequency of said
result were recorded to show the accuracy and consistency of the program.
Spellchecker Running Time (Trial 5) for Words In the Test Bed.
500
500
450
450
400
400
Time (Milliseconds)
Time (Milliseconds)
Spellchecker Running Time (Trial 4) for Words In the Test Bed.
350
300
250
200
150
350
300
250
200
150
100
100
50
50
gr
an
bu
bu
gr
an
t
gr s
ea
sy
ne nts
ss
co pla
m
rp
n
on
or
a
co
ey
rp tion
sp
en ore
at
di
io
ng
n
pa
tte
rn
co s
n
pa tra
trn ts
er
sh
re
ip
al
es
ta
no tie
n
p
tra ro
de fitf
m
aa
rk
he
lth
im
po
in
rt
sh
ur
en
c
ly e
ce
re nse
st
au
r
a
go
dv ant
ve
er
rn
tis
m
en ing
tg
in
co ran
in
co rpo ts
ra
m
tio
e
st
n
at
em
en
zo t
ba nni
nk ng
ru
pt
co sie
py
wr
re
gu ite
la
en tio
n
st
at s
ut
io
ns
0
t
gr s
ea
sy
ne nts
ss
co pla
m
rp
n
on
or
a
c
ey
or
t
po ion
sp
en
re
at
di
io
ng
n
pa
tte
rn
co s
n
pa tra
trn ts
er
sh
re
ip
al
es
ta
no tie
n
p
tra ro
de fitf
m
aa
rk
he
lth
im
po
in
rt
sh
ur
en
c
ly e
ce
re nse
st
au
ad r an
go
ve
t
ve
rti
rn
s
m
en ing
tg
in
co ran
in
co rpo ts
ra
m
tio
e
st
n
at
em
en
zo t
ba nnin
nk
g
ru
pt
co sie
py
wr
re
gu ite
la
en tio
n
st
at s
ut
io
ns
0
Input Word
Input Word
94.2% Correctness
The spell checker works relatively well.The theoretical run time using Big-Oh notation is O(m*n2), where m
is a small constant (0<m<1) and n is the length of the inputted word. The scalar m is used, because the Kd tree
filters out many impossible matches and therefore cuts down on the data size of the algorithms. The n2 comes
from the use of the Damerau-Levenshtein edit distance algorithm. The runtime of the algorithm on our sponsor
company’s servers is even less, and is therefore within the acceptable range.
The correctness of the algorithm is relatively high. The algorithm gets the correct answer for all but four
cases of the 70-word test-bed that the algorithm was run on. The algorithm did not find an answer for two of
the four test cases that did not get the right answer. Because the spell checker output will not be displayed if it
does not return an answer, the spell checker will only suggest an incorrect word 2.9% of the time.
The purpose of the project was to create a
spell checker that could check through a corpus
specific dictionary of at least 100,000 words
long and correct a search query from a user in
500 milliseconds or less. The program was
required to identify incorrectly spelled phrases
of words and to use words in the phrase to guess
the correct spelling of the rest of the words.
The spelling mistakes of a person can be
divided into two categories: typos and guesses.
Typos are random mistakes made by accident.
Guesses are attempts at spelling a word by
sounding it out. With typos, the correct word
can be found by using string matching
algorithms, such as the Demarau-Levenstein edit
distance algorithm, to search a dictionary and
find the closest word that matches.
Correcting guesses requires the words to be
translated into their phonetic equivalents, which
can be done using programming libraries such as
soundex and metaphone. The closest match of
the misspelled word can be found by comparing
the phonetic equivalents using the exact same
string matching algorithms used for typos.
However, checking through an entire
dictionary for each misspelled word is
inefficient and time costly. Therefore, the
process can be optimized by filtering out words
in the dictionary that are cannot be matches. A
Kd tree is a special data structure that
intelligently organizes the dictionary so only
words that could possibly match with the
misspelled word are found. This acts like a filter
for the core algorithm and eliminates excess
calculations.
In addition to checking individual words for
mismatches, words that are commonly used
together as phrases can be utilized to “hint” at
what the correct spelling of the rest of the phrase
is. This can be used to reduce excess work even
more and increase efficiency.
The accuracy and time efficiency results
shown for the final spell checker program
exceeded the performance of the spell checker
previously used by our sponsor company and
also versions using the Landau-Vishkin
algorithm and a modified-Landau-Vishkin
algorithm. The previous spell checker program
used was an “out of the box” spell checker built
into the coding of Lucene, an open source
software package that was used to store and
access webpage data. The lucene Spell Checker
works by ranking words using the Levenshtein
edit distance. The Levenshtein edit distance
algorithm operates in O(M*N) efficiency and
requires M*N member space, where M and N
are the lengths of the two strings being
compared. The algorithm is very simple and is
extremely fast when comparing short strings, but
inefficient for long strings (Black, 2006). The
Damerau-Levenshtein edit distance algorithm
implemented in the final spell checker relies on
the same method and is very similar; however,
the Damerau-Levenshtein algorithm considers
transposed letters as one mistake, and the
Levenshtein algorithm treats them as two. The
Lucene spell checker provides quick results for
small test cases and is easy to use, but was
unsatisfactory because of the low accuracy in
returned results and poor time efficiency for
long test cases (Black, 2006). The spell checker
produced by the project far exceeds the Lucene
spell checker in speed for longer test cases.
This study was designed to create a spell
checker for an Internet-based search engine.
Different algorithms were created and unified
under a common spell checker program. The
resulting program returned the correct spelling
of an inputted word most of the time, and the
program ran fast enough that the user would not
notice a delay. Users now can search the
Internet using our sponsor company’s search
engine and have their spelling corrected.