Stemming, tagging and chunking

Text analysis short of
parsing
Word-based analysis
• Whereas parsing gives a full syntactic
analysis, sometimes it is sufficient to have
less detailed information
• In many applications we are more
interested in words
• But what do we mean by “word”?
Words
• Naïve definition of a word: a sequence of
characters separated from other such
sequences by spaces
• But punctuation marks are usually
attached to words
• Though not all punctuation marks are
word-delimiters, e.g. possessive
apostrophe, hyphen
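A minimal sketch of the difference between the naïve definition and one that detaches punctuation while keeping word-internal apostrophes and hyphens. The pattern and the function names are illustrative assumptions, not a standard tokenizer:

```python
import re

# Assumed pattern: a word may contain internal apostrophes or hyphens;
# any other non-space, non-word character becomes a token of its own.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

def naive_tokens(text):
    """Naïve definition: split on whitespace only."""
    return text.split()

def regex_tokens(text):
    """Detach punctuation, but keep possessives and hyphenated words whole."""
    return TOKEN_RE.findall(text)

sentence = "John's well-known book, on the table."
print(naive_tokens(sentence))   # punctuation stays attached: 'book,' and 'table.'
print(regex_tokens(sentence))   # ',' and '.' are separate tokens
```

Whether "well-known" should come out as one token or two is a design decision; this sketch keeps it as one.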
Words
• We may want to treat hyphenated and
compound words as one word, or two
• By the same token we may want to treat
word sequences as if they were a single
word
• In addition, a given “word” can have
different word forms, depending on
inflections, or even conventions of
orthography
Tokenization
• The simplest form of analysis is to reduce
different word forms to a common token
• Also called “normalization”
• Useful, for example, if you want to count how
many times a given word occurs in a text
• Or if you want to search for texts containing
certain words (e.g. Google)
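The counting use case can be sketched as follows, assuming that normalization means nothing more than lowercasing (real systems usually do more). The helper names are hypothetical:

```python
import re
from collections import Counter

def normalize(token):
    # Assumption: normalization here is just case folding.
    return token.lower()

def word_counts(text):
    """Count normalized tokens; punctuation is ignored by the pattern."""
    tokens = re.findall(r"\w+", text)
    return Counter(normalize(t) for t in tokens)

counts = word_counts("The dog saw the other dog. THE END")
print(counts["the"])   # "The", "the" and "THE" are counted together
print(counts["dog"])
```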
Stemming
• Stemming is the particular case of
tokenization which reduces inflected forms
to a single base form or stem
• (Recall our discussion of stem ~ base form
~ dictionary form ~ citation form)
• Stemming algorithms are basic
string-handling algorithms, which depend on
rules which identify affixes that can be
stripped
Stemming
• As we know, morphology can be less than
straightforward, so a stemmer has to
“know” about rules such as consonant
doubling, y→i, etc.
• Also has to know about irregularities
• And to avoid overgeneration
• For this it probably needs a dictionary
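A toy illustration of these points, not a real stemmer: suffix-stripping rules, the reversal of consonant doubling and y→i, a small table of irregular forms, and a dictionary check to block overgeneration. The word lists and function names are made up for the example:

```python
# Hypothetical data: a real stemmer would need far larger tables.
IRREGULAR = {"went": "go", "mice": "mouse"}
LEXICON = {"run", "stop", "carry", "try", "walk", "hope",
           "sing", "red", "bus", "go", "mouse"}

def candidates(word):
    """Yield possible stems for a word, in the order the rules are tried."""
    for suffix in ("ies", "ied"):
        if word.endswith(suffix):
            yield word[:-3] + "y"            # undo y -> i: carries -> carry
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix):
            base = word[:-len(suffix)]
            yield base                       # plain stripping: walked -> walk
            if len(base) >= 2 and base[-1] == base[-2]:
                yield base[:-1]              # undo doubling: running -> run
            yield base + "e"                 # restore dropped e: hoping -> hope

def stem(word):
    word = word.lower()
    if word in IRREGULAR:                    # irregular forms are listed
        return IRREGULAR[word]
    if word in LEXICON:                      # already a base form: sing stays sing
        return word
    for cand in candidates(word):
        if cand in LEXICON:                  # dictionary check blocks overgeneration
            return cand
    return word                              # unknown word: leave it unchanged

for w in ["running", "stopped", "carries", "tried", "hoping", "sing", "went"]:
    print(w, "->", stem(w))
```

Without the dictionary check, the "ing" rule would reduce "sing" to "s" and the "s" rule would reduce "bus" to "bu".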
Stemming
• Best known stemming algorithm for
English is Martin Porter’s stemmer,
published in 1980
• Original use was in information retrieval
• In computational terms, it is really just a
sophisticated string-handling algorithm
• In linguistic terms, it is interesting in that it
captures generalisations about English
morphology
Word categories
• A.k.a. parts of speech (POSs)
• Important and useful to identify words by
their POS
– To distinguish homonyms
– To enable more general word searches
• POS familiar (?) from school and/or
language learning (noun, verb, adjective,
etc.)
Word categories
• Recall that we distinguished
– open-class categories (noun, verb, adjective,
adverb)
– closed-class categories (preposition,
determiner, pronoun, conjunction, …)
• While the big four are fairly clear-cut, it is
less obvious exactly what and how many
closed-class categories there may be
POS tagging
• Labelling words for POS can be done by
dictionary lookup and/or some sort of process
• Identifying POS can be seen as a prerequisite to
parsing, and/or a result of morphological
analysis in its own right
• However, there are some differences:
– Parsers often work with the simplest set of word
categories, subcategorized by feature (or attribute-value) schemes
– Indeed the parsing procedure may contribute to the
disambiguation of homonyms
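A minimal sketch of tagging by dictionary lookup alone, which also shows why lookup is not enough for homonyms. The lexicon is hypothetical, the tag labels are illustrative, and "UNK" is an invented label for unknown words:

```python
# Hypothetical lexicon: each word form maps to every tag it can carry.
LEXICON = {
    "john": ["NPN"],
    "saw": ["VB", "NCN"],    # verb (past of "see") or noun (the tool)
    "the": ["AT"],
    "book": ["NCN", "VB"],   # noun, or verb as in "book a table"
    "on": ["PRP"],
    ".": ["PNC"],
}

def lookup_tags(tokens, unknown="UNK"):
    """Pair each token with all the tags the lexicon allows for it."""
    return [(tok, LEXICON.get(tok.lower(), [unknown])) for tok in tokens]

def first_tag_format(tagged):
    """Render as word/TAG, blindly taking the first listed tag."""
    return " ".join(f"{tok}/{tags[0]}" for tok, tags in tagged)

tagged = lookup_tags(["John", "saw", "the", "book", "."])
print(first_tag_format(tagged))
print([tok for tok, tags in tagged if len(tags) > 1])   # words still ambiguous
```

Taking the first listed tag happens to be right for this sentence; in general, choosing among the candidates is the job of the further process that lookup must be combined with.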
POS tagging
• POS tagging, per se, aims to identify word-category information somewhat independently of
sentence structure …
• … and typically uses rather different means
• POS tags are generally shown as labels on
words:
John/NPN saw/VB the/AT book/NCN on/PRP the/AT table/NCN ./PNC
• We’ll return to tagging in detail, but first let’s
mention …
Chunking
• Like parsing except that it aims only to
identify major constituents
• And does not attempt to identify structure,
either internal (within the chunk) or
external (between chunks)
• Chunking will leave some parts of the text
unanalysed
• Example:
[NP [NP G.K. Chesterton ], [NP [NP author ] of [NP [NP The Man ] who was [NP Thursday ] ] ] ]
Chunking
• Chunks can be represented like tags or
like parse trees
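The two representations can be sketched side by side. The slides do not name a tagging scheme, so the B/I/O labels below (begin, inside, outside) are an assumption, chosen because they are one common convention. The data structure and function names are hypothetical:

```python
# Each item is (chunk_label, words); a label of None marks unchunked material.
chunks = [("NP", ["I"]), (None, ["saw"]),
          ("NP", ["the", "big", "dog"]), (None, ["."])]

def as_tags(chunks):
    """Tag-like view: one chunk tag per word."""
    out = []
    for label, words in chunks:
        for i, word in enumerate(words):
            if label is None:
                out.append((word, "O"))                   # outside any chunk
            else:
                prefix = "B-" if i == 0 else "I-"         # begin / inside
                out.append((word, prefix + label))
    return out

def as_tree(chunks, root="S"):
    """Tree-like view: bracket each chunk under a single root node."""
    parts = []
    for label, words in chunks:
        text = " ".join(words)
        parts.append(f"({label}: {text})" if label else text)
    return f"({root}: " + " ".join(parts) + " )"

print(as_tags(chunks))
print(as_tree(chunks))
```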
Chunk parser
• A “chunk” is a continuous non-overlapping
sequence of words
• Chunker finds such sequences, often using
tagged text as input
• Chunk rules can be as simple as regular
expressions
• Chunkers can allow embedding, but typically
only to a shallow level
• Another example:
(S: (NP: I) saw (NP: the big dog) . )
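A minimal sketch of a chunker whose single rule is a regular expression over tags, run on tagged text. The rule, the tag names PN (pronoun) and AJ (adjective), and the function name are assumptions made for the example:

```python
import re

# Hypothetical NP rule, written over tags rather than words:
# a lone pronoun, or an optional article, any adjectives, then one or more nouns.
NP_RULE = re.compile(r"<PN>|(?:<AT>)?(?:<AJ>)*(?:<NCN>)+")

def chunk(tagged, rule=NP_RULE, label="NP"):
    """Return a flat list mixing plain words and (label, [words]) chunks."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Record where each token's tag starts inside tag_string.
    offset_to_index, offset = {}, 0
    for index, (_, tag) in enumerate(tagged):
        offset_to_index[offset] = index
        offset += len(tag) + 2               # +2 for the angle brackets
    offset_to_index[offset] = len(tagged)

    result, done = [], 0
    for match in rule.finditer(tag_string):
        start = offset_to_index[match.start()]
        end = offset_to_index[match.end()]
        result.extend(word for word, _ in tagged[done:start])   # left unanalysed
        result.append((label, [word for word, _ in tagged[start:end]]))
        done = end
    result.extend(word for word, _ in tagged[done:])
    return result

tagged = [("I", "PN"), ("saw", "VB"), ("the", "AT"),
          ("big", "AJ"), ("dog", "NCN"), (".", "PNC")]
print(chunk(tagged))
```

The chunks found are non-overlapping and there is no embedding; "saw" and "." are simply passed through, matching the point that chunking leaves some of the text unanalysed.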