Presentation - ICT for Emerging Regions

Authors
• N.A.K.B.D.Gunasekara
• Mr. W.V.Welgama
• Dr. A.R.Weerasinghe
Overview
• Introduction
• Literature Review
• Aims & Objectives
• Methodology
• Design & Implementation
• Results & Evaluation
• Conclusion
• Future Work
Introduction
• What is Part Of Speech Tagging?
The process of assigning a corresponding POS tag, such as noun, verb or preposition, to every token in the text.
e.g. ඇය (Pronoun) ගීයක් (Noun) සෙමින් (Adverb) මුමුණයි (Verb).
Introduction
• Motivation
• An important preprocessing task in many NLP areas, such as:
  - Information retrieval: stemming, selection of high content words
  - Word sense disambiguation
  - Speech synthesis (e.g. Text-to-Speech)
  - Speech recognition
  - Machine translation
Literature Review
• Different POS tagging approaches
POS Tagging
  Supervised
    - Rule-based
    - Stochastic: HMM, CRF, MEMM
    - Neural
  Unsupervised
    - Rule-based
    - Stochastic: Baum-Welch
    - Neural
Literature Review
• Related Work
• Hidden Markov Model Based Part of Speech Tagger for Sinhala Language (2014)
  - HMM with N-gram probabilities
  - 90% accuracy using a test set of 1024 words
• Learning a Stochastic Part of Speech Tagger for Sinhala (2013)
  - HMM with tri-gram probabilities
  - 62% accuracy for both known & unknown words
• A Stochastic Part of Speech Tagger for Sinhala (2004)
  - HMM with bi-gram probabilities
  - Tagging error below 60% when the unknown word percentage is 100%
Aims & Objectives
• Aims of the Research
• To find out whether a hybrid approach that incorporates both stochastic and rule-based approaches gives better POS tagging accuracy than a solely stochastic approach for the Sinhala language.
• To analyze how Sinhala POS tagging can be improved by improving the tag set:
  - “UCSC TagSet Version 1”: “LTRL/UCSC POS TAG SET FOR SINHALA” developed in 2007
  - “UCSC TagSet Version 2”: an improved version of “UCSC TagSet Version 1”
  - “UCSC TagSet Version 3”: “UCSC NEW SINHALA TAGSET” developed by LTRL of UCSC in 2015
Aims & Objectives
• Objectives
• To carry out a comparative analysis between the HMM-based approach and the hybrid approach, which incorporates both stochastic and rule-based approaches.
• To re-annotate the corpus with improved versions of the tag set:
  - “UCSC Annotated Corpus Version 1”, annotated with “UCSC TagSet Version 1”: collected from LTRL of UCSC
  - “UCSC Annotated Corpus Version 2”, annotated with “UCSC TagSet Version 2”: contribution of this research
  - “UCSC Annotated Corpus Version 3”, annotated with “UCSC TagSet Version 3”: contribution of this research
Methodology
1. Implementation of the HMM tagger
2. Integration of the stemmer
3. Extension of the HMM tagger into a hybrid tagger
4. Evaluation of the taggers using different tag set versions
5. Comparative analysis of the two POS tagging approaches
Design & Implementation
• Architecture of the HMM Tagger
• Incorporation of the Stemmer
• Architecture of the Hybrid Tagger
• Tag guessing using Suffix Rules
Architecture of the HMM Tagger
[Figure: architecture of the HMM tagger. Components: POS tag annotated corpus; Trainer (Probability Calculator, Transition & Emission Probabilities); testing input; Tokenization; Preprocessor; Tagger (Viterbi Path finder, Back tracer); Stemmer (takes an unknown word, returns the stem of the unknown word); Tagger Output.]
Incorporation of the Stemmer
• First approach: in both the training phase and the tagging phase
  - Training phase: before calculating the emission probabilities
  - Tagging phase: to stem the input given to the tagger
[Figure: the training set words ගසෙන්, ගෙට and ගෙකට are stemmed to ගෙ, giving the emission probability PE(ගෙ|NNN); the unseen word ගෙක් is passed through the Stemmer and also yields ගෙ.]
Incorporation of the Stemmer
• Issues in integrating the stemmer in the training phase:
  - Changes in part of speech due to stemming
  - Changes in meaning due to stemming
  e.g.
  වැසියන් - appropriate tag is NNM
  වැසි - appropriate tag is NNN
• Second approach: only in the tagging phase
  - Increases the tagger accuracy
[Figure: the unseen word වැසියන් is passed to the Stemmer, which returns වැසි.]
Architecture of the Hybrid Tagger
[Figure: architecture of the hybrid tagger. Components: POS tag annotated corpus; Trainer (Probability Calculator, Transition & Emission Probabilities); testing input; Tokenization; Preprocessor; Tagger (Viterbi Path finder, Back tracer); Stemmer; Tagger Output. For each unknown word there is a decision "Stem is known?": YES uses the stem, NO leads to tag guessing using morphological rules.]
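The unknown-word branch of the hybrid architecture can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `resolve_unknown`, `toy_stem`, `toy_guess` and the transliterated words are all hypothetical stand-ins for the real stemmer, suffix rules and vocabulary.

```python
def resolve_unknown(word, known_words, stem, guess_tag_by_suffix):
    """Decide how the tagger should treat `word`.

    Returns ("known", word) if the word was seen in training,
    ("stem", s) if only its stem s was seen (the YES branch),
    ("tag", t) if a tag has to be guessed from rules (the NO branch).
    """
    if word in known_words:
        return ("known", word)
    s = stem(word)
    if s in known_words:
        return ("stem", s)
    return ("tag", guess_tag_by_suffix(word))


# Hypothetical helpers, for illustration only.
def toy_stem(word, endings=("kata", "ta", "k")):
    """Strip the first matching ending; a placeholder for a real stemmer."""
    for ending in endings:
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)]
    return word


def toy_guess(word):
    """Placeholder suffix rule: words ending in 'un' are guessed as NNM."""
    return "NNM" if word.endswith("un") else "NNN"


KNOWN = {"gasa", "potha"}  # made-up training vocabulary
print(resolve_unknown("gasata", KNOWN, toy_stem, toy_guess))
print(resolve_unknown("minisun", KNOWN, toy_stem, toy_guess))
```

Keeping the stemmer and the rule-based guesser as parameters mirrors the figure, where both are separate components that the tagger consults only for unknown words.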
Tag guessing using Suffix Rules
• Affix: a morpheme that is attached to the stem/root of a word to form a new word
  - Prefix: an element that is placed at the beginning of a root word
  - Suffix: an element that is placed at the end of a root word
e.g. Suffixes in the Sinhala language
In nouns : කම - මනුේෙ කම, අහංකාර කම
In adjectives: මික - කාර් මික, ධාර් මික
In verbs : මින් - නට මින්, කර මින්
Tag guessing using Suffix Rules
• Categorize all open class words by their POS tags into separate files
e.g.
All NNN words in the annotated corpus are written into a file called NNN.txt
Tag guessing using Suffix Rules
• Extract the common suffixes in each category
  - Common suffix: a suffix counts as common only if it occurs in more than five distinct words
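The extraction step can be sketched as below. The "more than five distinct words" threshold comes from the slide; the maximum suffix length, the function name `common_suffixes` and the made-up words are assumptions, and the sketch measures suffixes in Unicode code points, which may not match how Sinhala suffixes were actually segmented.

```python
from collections import defaultdict


def common_suffixes(words, max_len=3, min_words=6):
    """Return {suffix: number of distinct words ending with it}.

    Only suffixes that end at least `min_words` (i.e. more than five)
    distinct words are kept. `max_len` is an assumed cut-off.
    """
    words_by_suffix = defaultdict(set)
    for word in set(words):
        # Keep at least one character of the word as its stem.
        for n in range(1, min(max_len, len(word) - 1) + 1):
            words_by_suffix[word[-n:]].add(word)
    return {
        suffix: len(members)
        for suffix, members in words_by_suffix.items()
        if len(members) >= min_words
    }


# Made-up words: six end in "ku", only five end in "ta".
sample = [f"w{i}ku" for i in range(6)] + [f"x{i}ta" for i in range(5)]
found = common_suffixes(sample)
print(sorted(found))
```

In this toy input "ku" qualifies while "ta" does not, because "ta" ends only five distinct words.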
Tag guessing using Suffix Rules
• Calculate the probability of each listed suffix within its tag category
e.g.
No. of occurrences of “කු” as a suffix in an NNM word = 180
Total no. of distinct words tagged as NNM in the training set = 1,457
Probability that the suffix “කු” appears in a word tagged as NNM = 180/1457 = 0.12
Tag guessing using Suffix Rules
• Create one file that includes all the common suffixes, each tagged with the tag that has the highest probability
• When the hybrid tagger comes to a previously unseen word, it analyses the word’s suffix and predicts a tag
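A minimal sketch of the two steps above: score each suffix per tag, keep the highest-probability tag per suffix, then look up the suffix of an unseen word. The function names, the longest-suffix-first lookup, the `default` fallback tag and the transliterated words are assumptions made for illustration, not details given in the slides.

```python
def suffix_probability(suffix, tag_words):
    """Share of the distinct words of one tag that end with `suffix`."""
    distinct = set(tag_words)
    return sum(word.endswith(suffix) for word in distinct) / len(distinct)


def build_suffix_rules(words_by_tag, suffixes_by_tag):
    """Map each suffix to the (tag, probability) pair with the highest probability."""
    rules = {}
    for tag, suffixes in suffixes_by_tag.items():
        for suffix in suffixes:
            p = suffix_probability(suffix, words_by_tag[tag])
            if suffix not in rules or p > rules[suffix][1]:
                rules[suffix] = (tag, p)
    return rules


def guess_tag(word, rules, default="NNN"):
    """Try the longest suffix first (an assumption); fall back to `default`."""
    for n in range(len(word) - 1, 0, -1):
        hit = rules.get(word[-n:])
        if hit is not None:
            return hit[0]
    return default


# The slide's own figure for the suffix "කු" in NNM words:
print(round(180 / 1457, 2))  # 0.12

# Made-up, transliterated data for the rest of the pipeline.
words_by_tag = {
    "NNM": ["minisaku", "gowiyaku", "lamaya"],
    "NNN": ["gasaku", "potha", "mesaya", "gala"],
}
rules = build_suffix_rules(words_by_tag, {"NNM": ["ku"], "NNN": ["ku"]})
print(guess_tag("sebalaku", rules))
```

Here "ku" ends two of three NNM words but only one of four NNN words, so the rule file would list "ku" under NNM.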
Results & Evaluation
• HMM Tagger based on UCSC Tagset Version 1
• HMM Tagger based on UCSC Tagset Version 2
• HMM Tagger based on UCSC Tagset Version 3
• Hybrid Tagger based on UCSC Tagset Version 2
• Hybrid Tagger based on UCSC Tagset Version 3
• Summary of Results
Results & Evaluation
Evaluation of taggers
                Total words    Distinct words
Training set    75,830         14,027
Test set        25,087         6,954
HMM Tagger based on UCSC Tagset Version 1
• “LTRL/UCSC POS TAG SET FOR SINHALA” developed in 2007
• Contains 22 tags
Total no. of words in the input = 25,087
No. of correctly tagged words = 17,689
Accuracy of the tagger = 70.51%
HMM Tagger based on UCSC Tagset Version 2
• New tag called “UNK” for all the words which do not fall into any tag category
Total no. of words in the input = 25,087
No. of correctly tagged words = 17,688
Accuracy of the tagger = 70.51%
Difference in accuracy rates between Tagset Version 1 and Tagset Version 2
• With the addition of the UNK tag:
  - Accuracy of the question mark tag increased to 100%
  - Accuracies of 10 tag categories increased
HMM Tagger based on UCSC Tagset Version 3
• “UCSC NEW SINHALA TAGSET” developed by LTRL of UCSC in 2015
• Contains 29 tags including the UNK tag
Total no. of words in the input = 25,087
No. of correctly tagged words = 17,548
Accuracy of the tagger = 69.95%
Hybrid Tagger based on UCSC Tagset Version 2
• Increased overall accuracy
• Increased accuracy rates of open class tag categories
  - New words are often added to open class categories
  - The hybrid approach is a good solution for the unknown word problem
Total no. of words in the input = 25,087
No. of correctly tagged words = 18,098
Accuracy of the tagger = 72.14%
Hybrid Tagger based on UCSC Tagset Version 3
• Increased overall accuracy
Total no. of words in the input = 25,087
No. of correctly tagged words = 17,657
Accuracy of the tagger = 70.38%
Summary of Results
                    HMM Tagger    Hybrid Tagger
TagSet Version 2    70.51%        72.14%
TagSet Version 3    69.95%        70.38%
• Overall accuracy of the hybrid tagger is higher than that of the HMM tagger.
• The increment in hybrid tagger accuracy is higher when it is used with “UCSC Tagset Version 2”.
• “UCSC Tagset Version 3” is at a more descriptive level, so the number of collisions among tag categories is high.
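The accuracies in the summary follow directly from the reported counts (correctly tagged words out of 25,087 test words). The short sketch below recomputes them and the hybrid tagger's gain for each tag set; the variable and function names are illustrative only.

```python
TOTAL_TEST_WORDS = 25_087

# Correctly tagged word counts as reported on the result slides.
correct = {
    ("HMM", "V2"): 17_688,
    ("Hybrid", "V2"): 18_098,
    ("HMM", "V3"): 17_548,
    ("Hybrid", "V3"): 17_657,
}


def accuracy(n_correct, total=TOTAL_TEST_WORDS):
    """Tagging accuracy as a percentage."""
    return 100.0 * n_correct / total


for tagset in ("V2", "V3"):
    hmm = accuracy(correct[("HMM", tagset)])
    hybrid = accuracy(correct[("Hybrid", tagset)])
    print(f"TagSet {tagset}: HMM {hmm:.2f}%  Hybrid {hybrid:.2f}%  "
          f"gain {hybrid - hmm:.2f} points")
```

On these counts the gain is about 1.63 percentage points with TagSet Version 2 and about 0.43 with TagSet Version 3, which is the difference the second bullet refers to.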
Conclusion
1. The addition of the ‘UNK’ tag leads towards a more meaningful tagging process
2. ‘UCSC TagSet Version 3’ results in decreased tagger accuracy due to its high level of descriptiveness
3. The hybrid approach gives a higher POS tagging accuracy than the solely HMM-based approach for the Sinhala language
Future Work
• The hybrid POS tagging approach proposed in this research is based on bi-gram transition probabilities. To further improve the tagging results, it can be extended to use tri-gram transition probabilities.
• Integrating a named entity recognizer and a morphological analyzer with the hybrid tagger may help to boost the tagger accuracy.
References
[1] M. Jayasuriya and A. R. Weerasinghe, “Learning a stochastic part of speech tagger for Sinhala,” 2013 Int. Conf. Adv. ICT Emerg. Reg., pp. 137–143, 2013.
[2] D. Kumar and G. S. Josan, “Part of Speech Taggers for Morphologically Rich Indian Languages: A Survey,” Int. J. Comput. Appl., vol. 6, no. 5, pp. 1–9, 2010.
[3] A. J. P. M. P. Jayaweera and N. G. J. Dias, “Hidden Markov Model Based Part Of Speech Tagger for Sinhala Language,” vol. 3, no. 3, pp. 1–23, 2014.
[4] Wikipedia, “Sinhalese language,” 2015. [Online]. Available: http://en.wikipedia.org/wiki/Sinhalese_language. [Accessed: 28-Mar-2015].
[5] T. Fernando and A. Weerasinghe, “A Morphological Parser for Sinhala Verbs,” Icter.Org.
[6] R. Tsarfaty, D. Seddah, S. Kübler, and J. Nivre, “Parsing Morphologically Rich Languages: Introduction to the Special Issue,” Comput. Linguist., vol. 39, no. 1, pp. 15–22, 2013.
[7] Wikipedia, “Language,” 2015. [Online]. Available: http://en.wikipedia.org/wiki/Language. [Accessed: 28-Mar-2015].
[8] R. Mooney, “Part-Of-Speech Tagging, Sequence Labeling and Hidden Markov Models (HMMs),” University of Texas at Austin.
[9] H. Tseng, D. Jurafsky, and C. Manning, “Morphological features help POS tagging of unknown words across language varieties,” Proc. Fourth SIGHAN Work. Chinese Lang. Process., pp. 32–39, 2005.
[10] D. L. Herath and A. R. Weerasinghe, “A stochastic part of speech tagger for Sinhala,” 2013 Int. Conf. Adv. ICT Emerg. Reg., pp. 137–143, 2013.
Example of Tagging a Sentence
අපි සල්, ම් හා පහන් සගන පන්ේ යමු.
අපි_PRP සල්_NNN ,_, ම්_NNN හා_CC පහන්_NNN
සගන_VP පන්ේ_NNN යමු_RP ._.
Enhancing the size of the annotated corpus
• New input data taken from different newspapers are used
• The hybrid tagger based on “UCSC TagSet Version 2” is selected
• New input data are tagged using the selected hybrid tagger
• Tagged output is added to the training data set
• The unchanged test data set is re-tagged with the enhanced training set
Change in Hybrid Tagger accuracy with the addition of newly tagged data
• Tagger accuracy decreases with the addition of tagger output
• Reason: the hybrid tagger used to enhance the training set has a tagging error of 27.86%
• Solution: have a group of linguistic experts manually inspect the tagger output before adding it to the annotated corpus
Data
Data collection phase
• Implementation of HMM & Hybrid taggers
  - “LTRL/UCSC POS TAG SET FOR SINHALA” developed in 2007
  - UCSC Sinhala Tagged Corpus V1
• Improving the size of the annotated corpus
  - New 120 articles from “UCSC 10M Words Sinhala Corpus”
• Re-annotation of the corpus with the improved tag set
  - “UCSC NEW SINHALA TAGSET” developed by LTRL of UCSC in 2015
Data
Corpus analysis
• UCSC Annotated Corpus Version 1
  - Total no. of words which do not fall into any tag category: 3,989
  - No. of distinct words which do not fall into any tag category: 759
Data
Solution from “UCSC TagSet Version 3”
sidu (සිදු) - QVB
lak (ලක්) - QVB
path (පත්) - PAVB
QVB - Question Word in Kriya Mula
PAVB - Adjective in Kriya Mula
Aims & Objectives
• Research Question
How does the accuracy of hybrid POS tagging compare with that of a solely HMM-based stochastic tagging approach when unknown words are present?
Sentence Boundary Disambiguation (SBD)
• A common problem in many languages
• Person names that begin with initials, acronyms and abbreviations make sentence boundary identification more challenging
• Solution:
  - A separate list of person name initials, commonly used acronyms and abbreviations is maintained in our preprocessing step
• Resulted in increased tagger accuracy
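The list-based solution could look roughly like the sketch below. It assumes the text has already been tokenized with full stops as separate tokens; the function name `split_sentences`, the `non_terminal` set and the English example are hypothetical, and the slides do not say how the actual preprocessing step applies its list.

```python
def split_sentences(tokens, non_terminal):
    """Split a token list into sentences at '.' tokens.

    A '.' is NOT treated as a sentence boundary when the token before it
    is in `non_terminal` (initials, acronyms, abbreviations).
    """
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        follows_listed_item = i > 0 and tokens[i - 1] in non_terminal
        if token == "." and not follows_listed_item:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final full stop
        sentences.append(current)
    return sentences


tokens = ["Dr", ".", "A", ".", "R", ".", "Weerasinghe", "spoke", ".",
          "He", "left", "."]
sentences = split_sentences(tokens, non_terminal={"Dr", "A", "R"})
print(len(sentences))
```

A limitation of this simple rule is that a sentence which really ends with a listed item would not be split, so a real list would need to be curated with that in mind.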
Challenges involved in POS Tagging
• Some words can take more than one part of speech
  - දිය නා - a verb
  - නා ගෙ - an adjective
• Handling unknown words which are not in the training set
Upper Bound for our POS Tagging Accuracy
Words which do not fall into any tag category = 3,989
Total no. of words in the corpus = 100,917
Total no. of words that can be tagged precisely in manual tagging = 96,928
Maximum accuracy in manual approach = (96,928/100,917)*100% = 96.05%
In other words, tagging error present in manual approach = 3.95%
Emission probability P(wi|ti)
• The emission probability indicates the likelihood that a given word is tagged with a particular tag (assuming that the word depends only on its tag)
• Calculated by dividing the number of times the tag occurs with the given word in the corpus by the total number of times the tag occurs in the corpus
e.g. නා_JJ ගෙ_NNN මලින්_NNN පිරී_VNF ඇල_VFM ._.
Transition probability P(ti|ti-1)
• The bi-gram transition probability indicates the probability of a tag given the previous tag
• Calculated by dividing the number of times the tag sequence ti-1, ti occurs by the total number of times the tag ti-1 occurs in the corpus
e.g. නා_JJ ගෙ_NNN මලින්_NNN පිරී_VNF ඇල_VFM ._.
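Both estimates are plain relative frequencies, so they can be sketched with a few counters. The code below applies the two definitions to the single example sentence from the slides; the function name `estimate` is illustrative, and the sketch uses no smoothing or sentence-start marker, which a real tagger would probably need.

```python
from collections import Counter


def estimate(tagged_sentences):
    """Relative-frequency estimates from (word, tag) sentences.

    emission[(tag, word)]       = count(tag with word) / count(tag)
    transition[(prev_tag, tag)] = count(prev_tag, tag) / count(prev_tag)
    """
    tag_count, emit_count, bigram_count = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for word, tag in sentence:
            tag_count[tag] += 1
            emit_count[(tag, word)] += 1
        bigram_count.update(zip(tags, tags[1:]))
    emission = {key: c / tag_count[key[0]] for key, c in emit_count.items()}
    transition = {key: c / tag_count[key[0]] for key, c in bigram_count.items()}
    return emission, transition


# The example sentence from the slides, as (word, tag) pairs.
sentence = [("නා", "JJ"), ("ගෙ", "NNN"), ("මලින්", "NNN"),
            ("පිරී", "VNF"), ("ඇල", "VFM"), (".", ".")]
emission, transition = estimate([sentence])

print(emission[("NNN", "ගෙ")])      # NNN occurs twice, once with this word
print(transition[("JJ", "NNN")])    # JJ occurs once, followed by NNN
print(transition[("NNN", "NNN")])   # NNN occurs twice, followed by NNN once
```

With only one sentence the numbers are trivially small counts (0.5, 1.0 and 0.5 here); on the full training set the same ratios are computed over all 75,830 words.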
Hidden Markov Model
• The main goal of the HMM is to find the most probable tag sequence t1…tn given the word sequence w1…wn, i.e. the sequence for which P(t1…tn | w1…wn) is maximal.
• Apply Bayes' rule: P(X|Y) = [ P(Y|X) * P(X) ] / P(Y)
• Remove the denominator P(W), as it is the same for all tag sequences
• Apply the likelihood and transition assumptions
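Written out, the steps listed above give the standard bi-gram HMM tagging objective. The formulas on the original slides did not survive extraction, so the following is a reconstruction from the listed steps rather than a copy of the slide; t_0 denotes an assumed sentence-start marker.

```latex
\begin{align*}
\hat{t}_{1 \dots n}
  &= \arg\max_{t_1 \dots t_n} P(t_1 \dots t_n \mid w_1 \dots w_n) \\
  &= \arg\max_{t_1 \dots t_n}
     \frac{P(w_1 \dots w_n \mid t_1 \dots t_n)\, P(t_1 \dots t_n)}
          {P(w_1 \dots w_n)}
     && \text{(Bayes' rule)} \\
  &= \arg\max_{t_1 \dots t_n}
     P(w_1 \dots w_n \mid t_1 \dots t_n)\, P(t_1 \dots t_n)
     && \text{(drop } P(W)\text{)} \\
  &\approx \arg\max_{t_1 \dots t_n}
     \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
     && \text{(likelihood and transition assumptions)}
\end{align*}
```

The two factors in the last line are the emission probability P(wi|ti) and the bi-gram transition probability P(ti|ti-1) defined earlier.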
Tagging With Hidden Markov Model
Viterbi Algorithm
• Finds the best possible POS tag path, given a sequence of words
• Used for the task of decoding
• A dynamic programming algorithm
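A compact sketch of Viterbi decoding with back-tracing for a bi-gram HMM is given below. It is a generic textbook formulation, not the authors' code: the function signature, the use of log probabilities, the `floor` value for unseen events and the transliterated toy model are all assumptions.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words`.

    start_p[tag], trans_p[(prev_tag, tag)] and emit_p[(tag, word)] are
    probabilities; missing entries fall back to `floor` (a crude stand-in
    for proper unknown-word handling).
    """
    if not words:
        return []

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[i][tag] = (log-probability of the best path ending in tag, backpointer)
    best = [{t: (logp(start_p, t) + logp(emit_p, (t, words[0])), None)
             for t in tags}]
    for word in words[1:]:
        previous, column = best[-1], {}
        for t in tags:
            prev_tag, score = max(
                ((p, previous[p][0] + logp(trans_p, (p, t))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + logp(emit_p, (t, word)), prev_tag)
        best.append(column)

    # Back-trace from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


# Toy model with made-up probabilities and transliterated words.
tags = ["PRP", "NNN", "VFM"]
start_p = {"PRP": 0.8, "NNN": 0.2}
trans_p = {("PRP", "NNN"): 0.7, ("PRP", "VFM"): 0.3,
           ("NNN", "VFM"): 0.8, ("NNN", "NNN"): 0.2}
emit_p = {("PRP", "mama"): 1.0, ("NNN", "potha"): 0.9,
          ("VFM", "potha"): 0.1, ("VFM", "kiyawami"): 1.0}
print(viterbi(["mama", "potha", "kiyawami"], tags, start_p, trans_p, emit_p))
```

Working in log space avoids numerical underflow on long sentences, and each column keeps one backpointer per tag, so the back-trace is a single pass from the last word to the first.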
Accuracy Rate per Tag
[Figure: accuracy rate per tag, UCSC TagSet Version 1 vs. UCSC TagSet Version 2]
UCSC TagSet Version 1
TAG      Description
CC       Conjunction
DET      Determiner
FRW      Foreign Word
JJ       Adjective
JVB      Adjective in Kriya Müla
NNF      Common Noun Feminine
NNM      Common Noun Masculine
NNN      Common Noun Neuter
NNPA     Proper Noun Animate
NNPI     Proper Noun Inanimate
NVB      Noun in Kriya Müla
POST     Postposition
PRP      Pronoun
QFNUM    Number Quantifier
RB       Adverb
RP       Particle
SYM      Not Classified
UH       Interjection
VFM      Verb Finite Main
VNF      Verb Non Finite
VNN      Verbal Non Finite Noun
VP       Verb Participle
UCSC TagSet Version 2
TAG      Description
CC       Conjunction
DET      Determiner
FRW      Foreign Word
JJ       Adjective
JVB      Adjective in Kriya Müla
NNF      Common Noun Feminine
NNM      Common Noun Masculine
NNN      Common Noun Neuter
NNPA     Proper Noun Animate
NNPI     Proper Noun Inanimate
NVB      Noun in Kriya Müla
POST     Postposition
PRP      Pronoun
QFNUM    Number Quantifier
RB       Adverb
RP       Particle
SYM      Not Classified
UH       Interjection
VFM      Verb Finite Main
VNF      Verb Non Finite
VNN      Verbal Non Finite Noun
VP       Verb Participle
UNK      Unknown (Tag is unknown)
UCSC TagSet Version 3
TAG      Description
CC       Conjunction
CMVNF    Present Participle Verb Non Finite
DET      Determiner
FRW      Foreign Word
JJ       Adjective
JVB      Adjective in Kriya Müla
NNF      Common Noun Feminine
NNM      Common Noun Masculine
NNN      Common Noun Neuter
NPF      Proper Noun Feminine
NPM      Proper Noun Masculine
NPN      Proper Noun Neuter
NVB      Noun in Kriya Müla
PAVB     Participle Adjective in Kriya Mula
PAVNF    Past Participle Verb Non Finite
POST     Postposition
PRP      Pronoun
PPVB     Past Participle in Kriya Mula
PRVNF    Present Participle Verb Non Finite
QFNUM    Number Quantifier
QVB      Question Word in Kriya Mula
RB       Adverb
RP       Particle
SYM      Not Classified
UH       Interjection
VFM      Verb Finite Main
VNN      Verbal Non Finite Noun
VP       Verb Participle
UNK      Unknown (Tag is unknown)