Presentation - ICT for Emerging Regions
Authors
N.A.K.B.D.Gunasekara
Mr. W.V.Welgama
Dr.A.R.Weerasinghe
Overview
Introduction
Literature Review
Aims & Objectives
Methodology
Design & Implementation
Results & Evaluation
Conclusion
Future Work
Introduction
What is Part of Speech Tagging?
The process of assigning a corresponding POS tag, such as noun, verb or preposition, to every token in the text.
e.g. ඇය ගීයක් සෙමින් මුමුණයි.
ඇය - Pronoun
ගීයක් - Noun
සෙමින් - Adverb
මුමුණයි - Verb
Introduction
Motivation
An important preprocessing task in many NLP areas, such as:
Information retrieval (stemming, selection of high content words)
Word sense disambiguation
Speech synthesis (e.g. Text-to-Speech)
Speech recognition
Machine translation
Literature Review
Different POS tagging approaches
POS Tagging
  Supervised
    Rule-based
    Stochastic: HMM, CRF, MEMM
    Neural
  Unsupervised
    Rule-based
    Stochastic: Baum-Welch
    Neural
Literature Review
Related Work
Hidden Markov Model Based Part of Speech Tagger for Sinhala Language (2014)
  HMM with N-gram probabilities
  90% accuracy using a test set of 1,024 words
Learning a Stochastic Part of Speech Tagger for Sinhala (2013)
  HMM with tri-gram probabilities
  62% accuracy for both known & unknown words
A Stochastic Part of Speech Tagger for Sinhala (2004)
  HMM with bi-gram probabilities
  Tagging error below 60% when the unknown word percentage is 100%
Aims & Objectives
Aims of the Research
To find out whether a hybrid approach that combines stochastic and rule-based methods gives better POS tagging accuracy than a solely stochastic approach for the Sinhala language.
To analyze how Sinhala POS tagging can be improved by improving the tag set
“UCSC TagSet Version 1” - “LTRL/UCSC POS TAG SET FOR SINHALA” developed in 2007
“UCSC TagSet Version 2” - An improved version of “UCSC TagSet Version 1”
“UCSC TagSet Version 3” - “UCSC NEW SINHALA TAGSET” developed by LTRL of UCSC in 2015
Aims & Objectives
Objectives
To carry out a comparative analysis between the HMM-based approach and the hybrid approach, which incorporates both stochastic and rule-based approaches.
To re-annotate the corpus with improved versions of the tag set
“UCSC Annotated Corpus Version 1” annotated by “UCSC TagSet Version 1” - Collected from LTRL of UCSC
“UCSC Annotated Corpus Version 2” annotated by “UCSC TagSet Version 2” - Contribution of this research
“UCSC Annotated Corpus Version 3” annotated by “UCSC TagSet Version 3” - Contribution of this research
Methodology
1. Implementation of HMM Tagger
2. Integration of Stemmer
3. Extend the HMM tagger to come up with a hybrid Tagger
4. Evaluation of taggers using different Tag set versions
5. Comparative analysis on two POS tagging approaches
Design & Implementation
Architecture of the HMM Tagger
Incorporation of the Stemmer
Architecture of the Hybrid Tagger
Tag guessing using Suffix Rules
Architecture of the HMM Tagger
[Diagram: Trainer - POS tag annotated corpus, Probability Calculator, Transition & Emission Probabilities. Preprocessor - testing input, tokenization. Tagger - Viterbi path finder, back tracer, tagger output. Stemmer - receives an unknown word and returns the stem of the unknown word.]
Incorporation of the Stemmer
First approach - in both the training phase and the tagging phase
Training phase - before calculating the emission probabilities
Tagging phase - to stem the input given to the tagger
[Diagram: the training set contains ගසෙන්, ගෙට and ගෙකට; the unseen word ගෙක් is passed to the stemmer, which returns ගෙ; the emission probability shown is PE(ගෙ|NNN).]
Incorporation of the Stemmer
Issues in integrating the stemmer in the training phase:
Changes in part of speech due to stemming
Changes in meaning due to stemming
e.g. the stemmer reduces වැසියන් to වැසි
වැසියන් - appropriate tag is NNM
වැසි - appropriate tag is NNN
Second approach - stemmer used only in the tagging phase, on unseen words
- Increases the tagger accuracy
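A minimal sketch of the second approach, assuming the emission table is keyed by (word, tag) and the stemmer is available as a plain function. The names `emission_prob` and `toy_stem`, and the toy English data, are illustrative assumptions and not the implementation used in this research.

```python
from typing import Callable, Dict, Set, Tuple


def emission_prob(
    word: str,
    tag: str,
    emission: Dict[Tuple[str, str], float],
    vocabulary: Set[str],
    stem: Callable[[str], str],
) -> float:
    """Return P(word | tag), stemming the word only if it was never seen in training."""
    if word in vocabulary:
        # Known word: use the unstemmed surface form, exactly as seen in training.
        return emission.get((word, tag), 0.0)
    stemmed = stem(word)
    if stemmed in vocabulary:
        # Unseen word whose stem is known: borrow the stem's emission probability.
        return emission.get((stemmed, tag), 0.0)
    # Neither the word nor its stem is known.
    return 0.0


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 1 else word


# Toy training statistics (assumed values, for illustration only).
emission = {("tree", "NNN"): 0.02, ("walk", "VFM"): 0.01}
vocabulary = {"tree", "walk"}

print(emission_prob("tree", "NNN", emission, vocabulary, toy_stem))    # known word
print(emission_prob("trees", "NNN", emission, vocabulary, toy_stem))   # unseen, stem known
print(emission_prob("rivers", "NNN", emission, vocabulary, toy_stem))  # unseen, stem unknown
```

Because the training data is left unstemmed in this sketch, the tags learned for surface forms are not affected by the part-of-speech changes described above.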
Architecture of the Hybrid Tagger
[Diagram: Trainer - POS tag annotated corpus, Probability Calculator, Transition & Emission Probabilities. Preprocessor - testing input, tokenization. Tagger - Viterbi path finder, back tracer, tagger output. Stemmer - receives an unknown word, followed by a "Stem is known?" decision with YES and NO branches and a "Tag guessing using morph rules" step.]
Tag guessing using Suffix Rules
Affix - A morpheme that is attached to a stem/root of a word to form a new word
Prefix - An element that is placed at the beginning of a root word
Suffix - An element that is placed at the end of a root word
e.g. Suffixes in Sinhala Language
In nouns : කම - මනුේෙ කම, අහංකාර කම
In adjectives: මික - කාර් මික, ධාර් මික
In verbs : මින් - නට මින්, කර මින්
Tag guessing using Suffix Rules
Categorize all open class words by their POS tags into separate files
e.g. All NNN words in the annotated corpus are written to a file called NNN.txt
Tag guessing using Suffix Rules
Extract the common suffixes in each category
Common suffix - a suffix that occurs in more than five distinct words
Tag guessing using Suffix Rules
Calculate the probability of each listed suffix within its tag category
e.g.
No. of distinct NNM words in which “කු” appears as a suffix = 180
Total no. of distinct words tagged as NNM in the training set = 1,457
Probability that the suffix “කු” appears in a word tagged as NNM = 180/1,457 = 0.12
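The arithmetic on this slide can be reproduced with a few lines of Python. The helper `suffix_probability` is a hypothetical name, and the sketch assumes the probability is taken over the distinct words of one tag category, as the slide states.

```python
from typing import Iterable


def suffix_probability(suffix: str, words_with_tag: Iterable[str]) -> float:
    """Fraction of the distinct words in one tag category that end with the suffix."""
    distinct = set(words_with_tag)
    if not distinct:
        return 0.0
    # A word must be longer than the suffix, so that some stem remains.
    matches = sum(1 for w in distinct if len(w) > len(suffix) and w.endswith(suffix))
    return matches / len(distinct)


# Figures reported on the slide for the suffix "කු" in the NNM category.
nnm_words_with_suffix = 180
nnm_distinct_words = 1457
print(round(nnm_words_with_suffix / nnm_distinct_words, 2))  # 0.12

# Toy usage on made-up English words (illustration only).
print(suffix_probability("s", ["cats", "dogs", "fish", "bird"]))  # 0.5
```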
Tag guessing using Suffix Rules
Create one file that includes all the common suffixes, each tagged with the tag that has the highest probability
When the hybrid tagger comes to a previously unseen word, it analyses the word's suffix and predicts a tag
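The suffix-rule steps above can be sketched as follows. This is an illustration under stated assumptions, not the tagger built in this research: the maximum suffix length, the longest-suffix-first lookup, the fallback to `UNK`, and all function names are assumptions. The sketch also slices by code point, which is only an approximation for Sinhala, where one written syllable may span several code points.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

MAX_SUFFIX_LEN = 3  # assumption: longest suffix considered, in characters
MIN_WORDS = 5       # a suffix is common if it occurs in MORE than this many distinct words


def common_suffixes(words: Iterable[str]) -> Dict[str, int]:
    """Count, per suffix, the distinct words ending with it; keep only common suffixes."""
    counts: Dict[str, int] = defaultdict(int)
    for word in set(words):
        for n in range(1, min(MAX_SUFFIX_LEN, len(word) - 1) + 1):
            counts[word[-n:]] += 1
    return {suffix: c for suffix, c in counts.items() if c > MIN_WORDS}


def build_suffix_table(words_by_tag: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map each common suffix to the tag in which it has the highest probability."""
    best: Dict[str, Tuple[str, float]] = {}
    for tag, words in words_by_tag.items():
        distinct = set(words)
        for suffix, count in common_suffixes(distinct).items():
            prob = count / len(distinct)
            if suffix not in best or prob > best[suffix][1]:
                best[suffix] = (tag, prob)
    return {suffix: tag for suffix, (tag, _prob) in best.items()}


def guess_tag(word: str, table: Dict[str, str], default: str = "UNK") -> str:
    """Guess a tag for an unseen word, trying the longest suffix first."""
    for n in range(min(MAX_SUFFIX_LEN, len(word) - 1), 0, -1):
        tag = table.get(word[-n:])
        if tag is not None:
            return tag
    return default


# Toy data: made-up English words standing in for the per-tag word files.
words_by_tag = {
    "NNN": ["kindness", "darkness", "sadness", "illness", "fitness", "weakness"],
    "VNF": ["running", "walking", "eating", "singing", "reading", "jumping"],
}
table = build_suffix_table(words_by_tag)
print(guess_tag("happiness", table))  # NNN
print(guess_tag("talking", table))    # VNF
print(guess_tag("xyz", table))        # UNK
```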
Results & Evaluation
HMM Tagger based on UCSC Tagset Version 1
HMM Tagger based on UCSC Tagset Version 2
HMM Tagger based on UCSC Tagset Version 3
Hybrid Tagger based on UCSC Tagset Version 2
Hybrid Tagger based on UCSC Tagset Version 3
Summary of Results
Results & Evaluation
Evaluation of taggers
Training set - Total words: 75,830; Distinct words: 14,027
Test set - Total words: 25,087; Distinct words: 6,954
HMM Tagger based on UCSC Tagset Version 1
“LTRL/UCSC POS TAG SET FOR SINHALA” developed in 2007
Contains 22 tags
Total no. of words in the input = 25,087
No. of correctly tagged words = 17,689
Accuracy of the tagger = 70.51%
HMM Tagger based on UCSC Tagset Version 2
New tag called “UNK” for all the words which do not fall into any tag category
Total no. of words in the input = 25,087
No. of correctly tagged words = 17,688
Accuracy of the tagger = 70.51%
Difference in accuracy rates between Tagset version 1 and Tagset version 2 with the addition of the UNK tag:
Accuracy of the question mark tag increased to 100%
Accuracy increased in 10 tag categories
HMM Tagger based on UCSC Tagset Version 3
“UCSC NEW SINHALA TAGSET” developed by LTRL of UCSC in 2015
Contains 29 tags including the UNK tag
Total no. of words in the input = 25,087
No. of correctly tagged words = 17,548
Accuracy of the tagger = 69.95%
Hybrid Tagger based on UCSC Tagset Version 2
Increased overall accuracy
Increased accuracy rates of open class tag categories
New words are often added to open class categories
Hybrid approach is a good solution for the unknown word problem
Total no. of words in the input = 25,087
No. of correctly tagged words = 18,098
Accuracy of the tagger = 72.14%
Hybrid Tagger based on UCSC Tagset Version 3
Increased overall accuracy
Total no. of words in the input = 25,087
No. of correctly tagged words = 17,657
Accuracy of the tagger = 70.38%
Summary of Results
Tag set | HMM Tagger | Hybrid Tagger
TagSet Version 2 | 70.51% | 72.14%
TagSet Version 3 | 69.95% | 70.38%
Overall accuracy of the hybrid tagger is higher than that of the HMM tagger.
The gain from the hybrid tagger is larger with “UCSC Tagset Version 2” (1.63 percentage points) than with “UCSC Tagset Version 3” (0.43 percentage points)
“UCSC Tagset Version 3” is more descriptive, so the number of collisions among tag categories is high
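The reported accuracies follow directly from the counts on the results slides. The short check below recomputes them from the number of correctly tagged words and the test set size; the dictionary layout and function names are illustrative assumptions.

```python
TEST_SET_WORDS = 25_087

# Correctly tagged word counts, as reported on the results slides.
CORRECT = {
    ("HMM", "V1"): 17_689,
    ("HMM", "V2"): 17_688,
    ("HMM", "V3"): 17_548,
    ("Hybrid", "V2"): 18_098,
    ("Hybrid", "V3"): 17_657,
}


def accuracy_percent(correct: int, total: int = TEST_SET_WORDS) -> float:
    """Accuracy as a percentage, rounded to two decimal places."""
    return round(100 * correct / total, 2)


ACCURACY = {key: accuracy_percent(count) for key, count in CORRECT.items()}


def hybrid_gain(tagset: str) -> float:
    """Percentage-point gain of the hybrid tagger over the HMM tagger for one tag set."""
    return round(ACCURACY[("Hybrid", tagset)] - ACCURACY[("HMM", tagset)], 2)


for (tagger, tagset), value in ACCURACY.items():
    print(f"{tagger} tagger, TagSet {tagset}: {value}%")
print("Hybrid gain with V2:", hybrid_gain("V2"))
print("Hybrid gain with V3:", hybrid_gain("V3"))
```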
Conclusion
1. Addition of the ‘UNK’ tag leads towards a more meaningful tagging process
2. ‘UCSC TagSet Version 3’ results in decreased tagger accuracy due to its high level of descriptiveness
3. The hybrid approach gives a higher POS tagging accuracy than the solely HMM-based approach for the Sinhala language
Future Work
The hybrid POS tagging approach proposed in this research is based on bi-gram transition probabilities. To further improve the tagging results, it can be extended to use tri-gram transition probabilities.
Integrating a named entity recognizer and a morphological analyzer with the hybrid tagger may help to boost the tagger accuracy.
References
[1] M. Jayasuriya and A. R. Weerasinghe, “Learning a stochastic part of speech tagger for Sinhala,” 2013 Int. Conf. Adv. ICT Emerg. Reg., pp. 137–143, 2013.
[2] D. Kumar and G. S. Josan, “Part of Speech Taggers for Morphologically Rich Indian Languages: A Survey,” Int. J. Comput. Appl., vol. 6, no. 5, pp. 1–9, 2010.
[3] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Hidden Markov Model Based Part Of Speech Tagger for Sinhala Language,” vol. 3, no. 3, pp. 1–23, 2014.
[4] Wikipedia, ‘Sinhalese language’, 2015. [Online]. Available: http://en.wikipedia.org/wiki/Sinhalese_language. [Accessed: 28-Mar-2015].
[5] T. Fernando and A. Weerasinghe, “A Morphological Parser for Sinhala Verbs,” Icter.Org.
[6] R. Tsarfaty, D. Seddah, S. Kübler, and J. Nivre, “Parsing Morphologically Rich Languages: Introduction to the Special Issue,” Comput. Linguist., vol. 39, no. 1, pp. 15–22, 2013.
[7] Wikipedia, ‘Language’, 2015. [Online]. Available: http://en.wikipedia.org/wiki/Language. [Accessed: 28-Mar-2015].
[8] R. Mooney, ‘Part-Of-Speech Tagging, Sequence Labeling and Hidden Markov Models (HMMs)’, University of Texas at Austin.
[9] H. Tseng, D. Jurafsky, and C. Manning, “Morphological features help POS tagging of unknown words across language varieties,” Proc. Fourth SIGHAN Work. Chinese Lang. Process., pp. 32–39, 2005.
[10] D. L. Herath and A. R. Weerasinghe, “A stochastic part of speech tagger for Sinhala,” 2013 Int. Conf. Adv. ICT Emerg. Reg., pp. 137–143, 2013.
Example of Tagging a Sentence
අපි සල්, ම් හා පහන් සගන පන්ේ යමු.
අපි_PRP සල්_NNN ,_, ම්_NNN හා_CC පහන්_NNN
සගන_VP පන්ේ_NNN යමු_RP ._.
Enhancing the size of the annotated corpus
New input data taken from different newspapers are used
Hybrid tagger based on “UCSC TagSet Version 2”
New input data are tagged using the selected hybrid tagger
Tagged output is added to the training data set
The unchanged test data set is re-tagged with the enhanced training set.
Change in Hybrid Tagger accuracy with the addition of newly tagged data
Tagger accuracy decreases with the addition of tagger output
Reason: the hybrid tagger used to enhance the training set has a tagging error of 27.86%
Solution: have a group of linguistic experts manually inspect the tagger output before adding it to the annotated corpus
Data
Data collection phase
Implementation of HMM & Hybrid taggers
Improving the size of the annotated corpus
“LTRL/UCSC POS TAG SET FOR SINHALA” developed in 2007
UCSC Sinhala Tagged Corpus V1
New 120 articles from “UCSC 10M Words Sinhala Corpus”
Re-annotation of the corpus with the improved tag set
“UCSC NEW SINHALA TAGSET” developed by LTRL of UCSC in 2015
Data
Corpus analysis
UCSC Annotated Corpus Version 1
Total no. of words which do not fall into any tag category: 3,989
No. of distinct words which do not fall into any tag category: 759
Data
Solution from “UCSC TagSet Version 3”
sidu (සිදු) – QVB
lak(ලක්) – QVB
path(පත්) - PAVB
QVB - Question Word in Kriya Mula
PAVB - Adjective in Kriya Mula
Aims & Objectives
Research Question
How does tagging accuracy differ between the hybrid POS tagging approach and the solely HMM-based stochastic tagging approach when unknown words are present?
Sentence Boundary Disambiguation (SBD)
A common problem in many languages
Person names beginning with initials, acronyms and abbreviations make sentence boundary identification more challenging
Solution:
A separate list of person name initials and commonly used acronyms and abbreviations is maintained in our preprocessing step
This resulted in increased tagger accuracy
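A minimal sketch of list-based sentence boundary disambiguation, assuming whitespace-separated tokens in which an initial or abbreviation keeps its trailing period. The function name, the set of end marks and the toy English tokens are assumptions for illustration; the actual list used in the preprocessing step is not shown in the slides.

```python
from typing import Iterable, List, Set

SENTENCE_END = {".", "?", "!"}  # assumed sentence-final punctuation marks


def split_sentences(tokens: Iterable[str], non_terminal: Set[str]) -> List[List[str]]:
    """Group tokens into sentences, ignoring periods on listed initials and abbreviations."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for token in tokens:
        current.append(token)
        ends_with_mark = token[-1:] in SENTENCE_END
        if ends_with_mark and token not in non_terminal:
            sentences.append(current)
            current = []
    if current:
        # Keep any trailing tokens that were not closed by an end mark.
        sentences.append(current)
    return sentences


# Hypothetical list of initials and abbreviations, and a toy token stream.
non_terminal = {"Dr.", "Mr.", "A.", "R."}
tokens = ["Dr.", "A.", "R.", "Silva", "arrived", ".", "He", "left", "."]
for sentence in split_sentences(tokens, non_terminal):
    print(" ".join(sentence))
```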
Challenges involved in the POS Tagging
Some words can act as more than one part of speech, e.g. නා:
දිය නා - a verb
නා ගෙ - an adjective
Handling unknown words, i.e. words which are not in the training set
Upper Bound for our POS Tagging Accuracy
Words which do not fall into any tag category = 3,989
Total no. of words in the corpus = 100,917
Total no. of words that can be tagged precisely in manual tagging = 100,917 - 3,989 = 96,928
Maximum accuracy in manual approach = (96,928/100,917)*100% = 96.05%
In other words, tagging error present in manual approach = 3.95%
Emission probability P(wi|ti)
The emission probability is the likelihood of a given word occurring with a particular tag (assuming that the word depends only on its tag)
Calculated by dividing the number of times the given word appears with a particular tag in the corpus by the total number of times that tag appears in the corpus
e.g. නා_JJ ගෙ_NNN මලින්_NNN පිරී_VNF ඇල_VFM ._.
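The counting rule for emission probabilities can be sketched as below, assuming the corpus is stored as lines of `word_TAG` tokens as in the example sentence. The function names and the placeholder words `w1` to `w5` are assumptions; no smoothing is applied.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split a 'word_TAG word_TAG ...' line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, tag = token.rsplit("_", 1)  # split on the last underscore only
        pairs.append((word, tag))
    return pairs


def emission_probabilities(
    sentences: Iterable[List[Tuple[str, str]]]
) -> Dict[Tuple[str, str], float]:
    """P(word | tag) = count(word tagged with tag) / count(tag)."""
    tag_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            tag_counts[tag] += 1
            pair_counts[(word, tag)] += 1
    return {(w, t): n / tag_counts[t] for (w, t), n in pair_counts.items()}


# Placeholder words with the tag pattern of the example sentence.
corpus = [
    parse_tagged("w1_JJ w2_NNN w3_NNN w4_VNF w5_VFM ._."),
    parse_tagged("w2_NNN w5_VFM ._."),
]
emission = emission_probabilities(corpus)
print(emission[("w2", "NNN")])  # w2 accounts for 2 of the 3 NNN tokens
```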
Transition probability P(ti|ti-1)
The bi-gram transition probability is the probability of a tag given the previous tag
Calculated by dividing the number of times the tag sequence ti-1, ti appears by the total number of times the tag ti-1 appears in the corpus
e.g. නා_JJ ගෙ_NNN මලින්_NNN පිරී_VNF ඇල_VFM ._.
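A matching sketch for bi-gram transition probabilities, using the tag sequence of the example sentence. The sentence-start pseudo-tag `<s>` and the function name are assumptions introduced so that the first tag of a sentence also has a predecessor; no smoothing is applied.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

START = "<s>"  # assumed pseudo-tag marking the start of a sentence


def transition_probabilities(
    tag_sequences: Iterable[List[str]]
) -> Dict[Tuple[str, str], float]:
    """P(tag | previous) = count(previous, tag) / count(previous)."""
    tag_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for tags in tag_sequences:
        tag_counts[START] += 1
        previous = START
        for tag in tags:
            tag_counts[tag] += 1
            bigram_counts[(previous, tag)] += 1
            previous = tag
    return {(p, t): n / tag_counts[p] for (p, t), n in bigram_counts.items()}


# Tag sequence of the example sentence.
sequences = [["JJ", "NNN", "NNN", "VNF", "VFM", "."]]
transition = transition_probabilities(sequences)
print(transition[("JJ", "NNN")])   # 1.0
print(transition[("NNN", "NNN")])  # 0.5
print(transition[("NNN", "VNF")])  # 0.5
```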
Hidden Markov Model
The main goal of the HMM is to find the most probable tag sequence t1…tn given the word sequence w1…wn, i.e. the sequence for which P(t1…tn | w1…wn) is maximum.
Apply Bayes' rule: P(X|Y) = [ P(Y|X) * P(X) ] / P(Y)
Remove the denominator P(w1…wn), as it is the same for all tag sequences
Apply the likelihood and transition assumptions
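The derivation outlined on these slides can be written out as follows. This is the standard bi-gram HMM formulation, with $t_0$ assumed to be a sentence-start marker; it is offered as a reconstruction and not as the exact equations shown in the presentation.

```latex
\begin{aligned}
\hat{t}_{1:n}
  &= \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n}) \\
  &= \arg\max_{t_{1:n}} \frac{P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})}{P(w_{1:n})}
     && \text{(Bayes' rule)} \\
  &= \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n})\, P(t_{1:n})
     && \text{(the denominator does not depend on the tags)} \\
  &\approx \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
     && \text{(likelihood and transition assumptions)}
\end{aligned}
```

The two factors in the final line are the emission probability P(wi|ti) and the bi-gram transition probability P(ti|ti-1).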
Tagging With Hidden Markov Model
Viterbi Algorithm
Find the best possible POS tag path,
given a sequence of words
For the task of decoding
A Dynamic Programming Algorithm
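A compact sketch of Viterbi decoding for a bi-gram HMM, including the back-tracing step. It works in log space and replaces zero probabilities with a small floor value; both choices, the function signature and the toy probabilities are assumptions for illustration, not the tagger implemented in this research.

```python
import math
from typing import Dict, List, Sequence, Tuple


def viterbi(
    words: Sequence[str],
    tags: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[Tuple[str, str], float],
    emit_p: Dict[Tuple[str, str], float],
    floor: float = 1e-12,
) -> List[str]:
    """Return the most probable tag sequence for the given words."""
    if not words:
        return []

    def log_p(p: float) -> float:
        # Floor zero probabilities so that the logarithm is always defined.
        return math.log(max(p, floor))

    # best[i][t]: best log probability of any tag path ending in tag t at position i.
    best = [{t: log_p(start_p.get(t, 0.0)) + log_p(emit_p.get((words[0], t), 0.0))
             for t in tags}]
    back: List[Dict[str, str]] = [{}]

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log_p(trans_p.get((p, t), 0.0)))
            best[i][t] = (best[i - 1][prev]
                          + log_p(trans_p.get((prev, t), 0.0))
                          + log_p(emit_p.get((words[i], t), 0.0)))
            back[i][t] = prev

    # Back-trace from the best final tag to recover the whole path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy model with two tags and made-up probabilities.
tags = ["NNN", "VFM"]
start_p = {"NNN": 0.8, "VFM": 0.2}
trans_p = {("NNN", "NNN"): 0.3, ("NNN", "VFM"): 0.7,
           ("VFM", "NNN"): 0.6, ("VFM", "VFM"): 0.4}
emit_p = {("fish", "NNN"): 0.6, ("fish", "VFM"): 0.4,
          ("swim", "NNN"): 0.1, ("swim", "VFM"): 0.7}
print(viterbi(["fish", "swim"], tags, start_p, trans_p, emit_p))
```

Dynamic programming keeps only the best path into each tag at each position, so the cost grows with the sentence length times the square of the number of tags, not with the number of possible tag sequences.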
[Chart: accuracy rate per tag, UCSC TagSet Version 1 vs. UCSC TagSet Version 2]
UCSC TagSet Version 1
TAG - Description
CC - Conjunction
DET - Determiner
FRW - Foreign Word
JJ - Adjective
JVB - Adjective in Kriya Müla
NNF - Common Noun Feminine
NNM - Common Noun Masculine
NNN - Common Noun Neuter
NNPA - Proper Noun Animate
NNPI - Proper Noun Inanimate
NVB - Noun in Kriya Müla
POST - Postposition
PRP - Pronoun
QFNUM - Number Quantifier
RB - Adverb
RP - Particle
SYM - Not Classified
UH - Interjection
VFM - Verb Finite Main
VNF - Verb Non Finite
VNN - Verbal Non Finite Noun
VP - Verb Participle
UCSC TagSet Version 2
TAG - Description
CC - Conjunction
DET - Determiner
FRW - Foreign Word
JJ - Adjective
JVB - Adjective in Kriya Müla
NNF - Common Noun Feminine
NNM - Common Noun Masculine
NNN - Common Noun Neuter
NNPA - Proper Noun Animate
NNPI - Proper Noun Inanimate
NVB - Noun in Kriya Müla
POST - Postposition
PRP - Pronoun
QFNUM - Number Quantifier
RB - Adverb
RP - Particle
SYM - Not Classified
UH - Interjection
VFM - Verb Finite Main
VNF - Verb Non Finite
VNN - Verbal Non Finite Noun
VP - Verb Participle
UNK - Unknown (Tag is unknown)
UCSC TagSet Version 3
TAG - Description
CC - Conjunction
CMVNF - Present Participle Verb Non Finite
DET - Determiner
FRW - Foreign Word
JJ - Adjective
JVB - Adjective in Kriya Müla
NNF - Common Noun Feminine
NNM - Common Noun Masculine
NNN - Common Noun Neuter
NPF - Proper Noun Feminine
NPM - Proper Noun Masculine
NPN - Proper Noun Neuter
NVB - Noun in Kriya Müla
PAVB - Participle Adjective in Kriya Mula
PAVNF - Past Participle Verb Non Finite
POST - Postposition
PRP - Pronoun
PPVB - Past Participle in Kriya Mula
PRVNF - Present Participle Verb Non Finite
QFNUM - Number Quantifier
QVB - Question Word in Kriya Mula
RB - Adverb
RP - Particle
SYM - Not Classified
UH - Interjection
VFM - Verb Finite Main
VNN - Verbal Non Finite Noun
VP - Verb Participle
UNK - Unknown (Tag is unknown)