From Textual Information to Numerical Vectors


From Textual Information to Numerical Vectors
Chapters 2.7-2.13
Presented by Aaron Hagan
Text Mining
• Supplements the human reader with automatic
systems that are undeterred by the text explosion. It
involves analyzing a large collection of documents to
discover previously unknown information.
• The information might be relationships or patterns
that are buried in the document collection and which
would otherwise be extremely difficult, if not
impossible, to discover.
What is Covered
• Part-of-speech tagging classifies words into
categories such as noun, verb or adjective
• Word sense disambiguation identifies the meaning of
a word, given its usage, from among the multiple
meanings that the word may have
• Parsing performs a grammatical analysis of a
sentence. Shallow parsers identify only the main
grammatical elements in a sentence, such as noun
phrases and verb phrases, whereas deep parsers
generate a complete representation of the
grammatical structure of a sentence
Motivation
• Up until now we have been dealing with individual
words and simple-minded (though useful) notions of
which sequences of words are likely.
• Now we turn to the study of how words
– Are clustered into classes
– Group with their neighbors to form phrases and sentences
– Depend on other words
• Interesting notions:
– Word order
– Constituency
– Grammatical relations
• Today: syntactic word classes – part of speech tagging
Part-Of-Speech Tagging
• At this step, the text has already been broken into
tokens and sentences.
• If no linguistic analysis is necessary, one might proceed
directly to feature generation in which the “features” will be
obtained from the tokens.
• If the goal is more specific, such as recognizing names of
people, places, and organizations, it is usually desirable to
perform additional linguistic analyses of the text to extract
more sophisticated features.
• Find POS for each token.
• Words are organized into grammatical classes or parts of
speech.
• English: nouns, verbs, adjectives, adverbs, prepositions,
conjunctions.
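
As a quick illustration of this step, here is a minimal sketch using the NLTK library (the choice of toolkit is an assumption of this example; the slides do not prescribe one):

```python
# Minimal POS-tagging sketch with NLTK (assumes: pip install nltk).
import nltk

# One-time model downloads (names as in classic NLTK releases).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Book that flight.")   # tokenization step
tagged = nltk.pos_tag(tokens)                      # Penn Treebank tags
print(tagged)
# Expected output along the lines of:
# [('Book', 'VB'), ('that', 'DT'), ('flight', 'NN'), ('.', '.')]
# (exact tags depend on the trained model)
```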
History of POS Tagging
• Research on part-of-speech tagging has been closely
tied to corpus linguistics. The first major corpus of
English for computer analysis was the Brown Corpus,
developed at Brown University by Henry Kucera and
W. Nelson Francis in the mid-1960s.
• Consists of about 1,000,000 words of running English
prose text, made up of 500 samples from randomly
chosen publications. Each sample is 2,000 words.
• In the mid-1980s, researchers in Europe began to use
hidden Markov models (HMMs) to disambiguate
parts of speech, when working to tag the
Lancaster-Oslo/Bergen (LOB) Corpus of British
English. HMMs involve counting cases (such as from
the Brown Corpus) and making a table of the
probabilities of certain sequences.
CORPUS
• CORPUS OF CONTEMPORARY AMERICAN ENGLISH (COCA)
• The first large, balanced corpus of contemporary American English.
• The corpus contains more than 385 million words of text, including 20
million words each year from 1990-2008, and it is equally divided among
spoken, fiction, popular magazines, newspapers, and academic texts.
• The interface allows you to search for exact words or phrases, wildcards,
lemmas, part of speech, or any combinations of these. You can search for
surrounding words (collocates) within a ten-word window (e.g. all nouns
somewhere near chain, all adjectives near woman, or all verbs near key).
• The corpus also allows you to easily limit searches by frequency and
compare the frequency of words, phrases, and grammatical constructions,
in at least two main ways:
– By genre: comparisons between spoken, fiction, popular magazines,
newspapers, and academic, or even between sub-genres (or domains),
such as movie scripts, sports magazines, newspaper editorial, or
scientific journals
– Over time: compare different years from 1990 to the present time
Penn Treebank Tag Set
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NP Proper noun, singular
15. NPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html
Assigning POS to Tokens
• It is possible to tag POS manually; ideally we want an
automated system to identify POS.
• The most successful databases are ones generated
automatically by machine-learning algorithms
from annotated corpora.
– Example:
• The Wall Street Journal is well suited for certain types of
data, but may not be ideal for something like email messages.
• There has been a lot of military funding for tasks such as
processing voluminous news sources.
• There is not much support for generating large training
corpora in other domains.
Part-Of-Speech Dictionaries
• Dictionaries showing word-POS correspondence can be useful.
• Difficult, due to several parts of speech being tied to one word.
– Example:
• Bore – noun - a tiresome person
• Bore – verb - to pierce with a turning or twisting movement of a
tool
– Example
• Book/VB that/DT flight/NN
• Tagging is a type of disambiguation
– Book can be NN or VB
– Can I read a book on this flight?
– That can be a DT or complementizer
– My travel agent said that there would be a meal on this flight
• The goal of POS tagging is to determine which of these possibilities is
realized in a particular text instance.
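
A hypothetical sketch of such a dictionary lookup, showing why lookup alone cannot resolve the tag (the word list and tag sets here are illustrative, not drawn from a real lexicon):

```python
# Hypothetical word -> possible-POS dictionary (illustrative only).
TAG_DICT = {
    "book":   {"NN", "VB"},   # "read a book" vs. "book that flight"
    "that":   {"DT", "IN"},   # determiner vs. complementizer
    "flight": {"NN"},
}

def possible_tags(word):
    """Return every Penn Treebank tag the word may carry."""
    return TAG_DICT.get(word.lower(), set())

for w in "Book that flight".split():
    print(w, possible_tags(w))
# "book" and "that" stay ambiguous; a tagger must pick one per context.
```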
Approaches to POS Tagging
• Rule-based Approach
– Uses handcrafted sets of rules to tag input
sentences
• Statistical approaches
– Use training corpus to compute probability of a tag
in a context
• Hybrid systems (e.g. Brill’s transformation-based learning)
ENGTWOL (ENGlish TWO Level analysis) Rule-Based Tagger
A Two-stage architecture
• Use lexicon FST (dictionary) to tag each word
with all possible POS
• Apply hand-written rules to eliminate tags.
• The rules eliminate tags that are inconsistent
with the context, and should reduce the list of
POS tags to a single POS per word.
ENGTWOL Adverbial-that Rule
Given input “that”
• If the next word is adj, adverb, or quantifier, and
following that is a sentence boundary, and the
previous word is not a verb like “consider” which
allows adjs as object complements,
• Then eliminate non-ADV tags,
• Else eliminate ADV tag
• I consider that odd. (that is NOT ADV)
• It isn’t that strange. (that is an ADV)
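
A rough Python rendering of this rule (a sketch only: real ENGTWOL constraints operate over a full two-level lexicon, and the list of complement-taking verbs below is a placeholder assumption):

```python
# Each token is (word, set-of-candidate-tags); the rule prunes "that" at i.
ADJ_ADV_QUANT = {"JJ", "RB", "CD"}            # adjective / adverb / quantifier
COMPLEMENT_VERBS = {"consider", "deem"}       # placeholder verb list

def adverbial_that_rule(tokens, i):
    """Eliminate candidate tags for 'that' at position i, per the rule above."""
    word, tags = tokens[i]
    next_tags = tokens[i + 1][1] if i + 1 < len(tokens) else set()
    boundary_follows = i + 2 >= len(tokens) or tokens[i + 2][0] in {".", "!", "?"}
    prev_word = tokens[i - 1][0].lower() if i > 0 else ""
    if next_tags & ADJ_ADV_QUANT and boundary_follows \
            and prev_word not in COMPLEMENT_VERBS:
        tokens[i] = (word, {"ADV"})           # eliminate non-ADV tags
    else:
        tokens[i] = (word, tags - {"ADV"})    # eliminate the ADV tag

# "It isn't that strange."  -> 'that' keeps only ADV
# "I consider that odd."    -> 'that' loses its ADV reading
```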
Det-Noun Rule:
• If an ambiguous word follows a determiner,
tag it as a noun
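
The same style of rule sketched in Python, using the token representation from the previous example (again purely illustrative):

```python
def det_noun_rule(tokens, i):
    """If an ambiguous word at position i follows a determiner, tag it NN."""
    word, tags = tokens[i]
    if i > 0 and tokens[i - 1][1] == {"DT"} and len(tags) > 1:
        tokens[i] = (word, {"NN"})

# e.g. "the can": 'can' starts as {NN, MD, VB}, pruned to {NN} after 'the'.
```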
Does it work?
• This approach does work and produces
accurate results.
• What are the drawbacks?
– Extremely labor-intensive
Statistical Tagging
• Statistical (or stochastic) taggers use a training
corpus to compute the probability of a tag in a
context.
• For a given word sequence, Hidden Markov
Model (HMM) taggers choose the tag sequence
that maximizes
P(word | tag) * P(tag | previous n tags)
• An HMM tagger chooses the tag t_i for word w_i that
is most probable given the previous tag t_{i-1}:
t_i = argmax_j P(t_j | t_{i-1}, w_i)
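
A toy rendering of this decision rule as code (all probabilities are invented for illustration, and a real HMM tagger would use Viterbi decoding over the whole sequence rather than this greedy left-to-right choice):

```python
# Toy bigram HMM decision: t_i = argmax_t P(w_i | t) * P(t | t_prev).
# Every number below is made up for the sake of the example.
P_WORD_GIVEN_TAG = {
    ("the", "DT"): 0.50,
    ("can", "NN"): 0.002, ("can", "MD"): 0.10, ("can", "VB"): 0.005,
}
P_TAG_GIVEN_PREV = {
    ("DT", "<s>"): 0.40,
    ("NN", "DT"): 0.45, ("MD", "DT"): 0.001, ("VB", "DT"): 0.005,
}
TAGSET = ("DT", "NN", "MD", "VB")

def greedy_tag(words):
    tags, prev = [], "<s>"                    # <s> marks sentence start
    for w in words:
        best = max(TAGSET, key=lambda t: P_WORD_GIVEN_TAG.get((w, t), 0.0)
                                       * P_TAG_GIVEN_PREV.get((t, prev), 0.0))
        tags.append(best)
        prev = best
    return tags

print(greedy_tag(["the", "can"]))  # ['DT', 'NN']: the noun reading wins
```

This reproduces the "the can" intuition on the next slide: after a determiner, the noun transition is so much more likely that it outweighs the higher emission probability of the modal reading of "can".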
HMM Example
• For example, once you've seen an article such as
'the', perhaps the next word is a noun 40% of the
time, an adjective 40%, and a number 20%.
– a program can decide that "can" in "the can" is far more
likely to be a noun than a verb or a modal. The same
method can of course be used to benefit from knowledge
about following words.
• More advanced ("higher order") HMMs learn the
probabilities not only of pairs, but triples or even
larger sequences. So, for example, if you've just seen
an article and a verb, the next item may be very likely
a preposition, article, or noun, but much less likely
another verb.
Statistical POS Tagging (Example)
• Use probability theory for POS tagging.
• Suppose, with no context, we just want to
know given the word “flies” whether it should
be tagged as a noun or as a verb.
• We use conditional probability for this: we
want to know which is greater
PROB(N | flies) or PROB(V | flies)
• Note definition of conditional probability
PROB(a | b) = PROB(a & b) / PROB(b)
– Where PROB(a & b) is the probability of the two
events a and b occurring simultaneously
Calculating POS for “flies”
We need to know which is greater:
• PROB(N | flies) = PROB(flies & N) / PROB(flies)
• PROB(V | flies) = PROB(flies & V) / PROB(flies)
• Use a corpus as the reference for estimating these
probabilities.
Corpus to Estimate
1,273,000 words; 1000 uses of flies; 400 flies in N
sense; 600 flies in V sense
PROB(flies) ≈ 1000/1,273,000 = .0008
PROB(flies & N) ≈ 400/1,273,000 = .0003
PROB(flies & V) ≈ 600/1,273,000 = .0005
Our best guess is that flies is a V:
PROB(V | flies) = PROB(V & flies) / PROB(flies)
= .0005/.0008 = .625
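
The same estimate as a small computation, with counts taken directly from the slide:

```python
# Corpus counts from the slide: 1,273,000 words; 1,000 occurrences of
# "flies", 400 tagged N and 600 tagged V.
total, flies, flies_n, flies_v = 1_273_000, 1_000, 400, 600

p_flies = flies / total                          # ~0.0008
p_n_given_flies = (flies_n / total) / p_flies    # PROB(N | flies) = 0.4
p_v_given_flies = (flies_v / total) / p_flies    # PROB(V | flies) = 0.6

print(p_n_given_flies, p_v_given_flies)
# The slide's figure of .625 comes from dividing the rounded values
# .0005/.0008; the exact ratio is 600/1000 = 0.6. Either way, V wins.
```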
Phrase Recognition
• Once tokens have been assigned POS tags, the
next step is to group individual tokens into
units, called phrases.
• The idea is to create a “partial parse” of a
sentence as a step toward identifying the
“named entities” occurring in it.
• Text parsing systems are supposed to scan a
text and mark the beginning and end of
phrases.
Phrase Recognition
• There are a number of conventions for marking,
but the most common is:
– Mark a word inside a phrase with I-
• Can be extended with a code for the phrase type: I-NP, I-VP,
etc.
– Mark a word at the beginning of a phrase adjacent to
another phrase with B-
• Can be extended with a code for the phrase type: B-NP, B-VP,
etc.
– Mark a word outside any phrase with O
• Look for a particular sequence of words that
occurs frequently enough in the corpora.
• A simple statistical approach looks at
multiword tokens.
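
A minimal sketch of reading such IOB tags back into phrases (the tagged sentence below is invented for illustration; tag names follow the convention just described):

```python
# Invented IOB-tagged sentence: (token, IOB-tag) pairs.
tagged = [("Johnson", "B-NP"), ("was", "B-VP"), ("replaced", "I-VP"),
          ("at", "O"), ("XYZ", "B-NP"), ("Corp", "I-NP"),
          ("by", "O"), ("Smith", "B-NP")]

def iob_to_phrases(pairs):
    """Group consecutive B-/I- tokens of the same phrase type."""
    phrases, current = [], None
    for token, tag in pairs:
        if tag.startswith("B-"):              # a new phrase begins
            current = (tag[2:], [token])
            phrases.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)          # extend the open phrase
        else:                                 # "O" or an inconsistent I- tag
            current = None
    return [(kind, " ".join(words)) for kind, words in phrases]

print(iob_to_phrases(tagged))
# [('NP', 'Johnson'), ('VP', 'was replaced'), ('NP', 'XYZ Corp'), ('NP', 'Smith')]
```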
Named Entity Recognition
• A specialization of phrase finding.
• One particular kind of noun-phrase finding is the
recognition of certain types of proper noun
phrases, specifically persons, organizations,
and locations.
• These recognizers are important for
intelligence applications.
• (More on this in chapter 6).
Parsing into Phrases
• A full parse of a sentence is usually done only in
the most sophisticated kinds of text processing.
• Each word in the sentence has a relation to all
the other words and to the main functions
(subject, object, etc.) in the sentence.
• There are many different kinds of parses, each
associated with a linguistic theory of the
language.
Context-Free Parses
• A tree of nodes in which the leaf nodes are words of
a sentence, the phrases into which the words are
grouped are internal nodes, and there is one top
node at the root of the tree, which has the label S.
• There are a number of algorithms for producing such a tree
from the words of a sentence, with considerable
research on constructing parsers from a statistical
analysis of treebanks of sentences parsed by hand.
• A full parse provides information that phrase
identification or partial parsing cannot provide.
Parse Tree Example
From the linear order of phrases in a partial parse of the
sentence below, one might wrongly conclude that Johnson
replaced Smith. The full parse tree, reconstructed here in
bracketed form from the slide’s tree diagram, makes the
structure explicit:

Johnson was replaced at XYZ Corp by Smith.

(S (NP (PNOUN Johnson))
   (VP (AUX was) (PPART replaced)
       (PP (PREP at) (PNOUN XYZ) (PNOUN Corp))
       (PP (PREP by) (PNOUN Smith))))
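
The bracketed parse above can also be manipulated programmatically, for example with NLTK's Tree class (assuming nltk is installed; the agent-finding logic below is a simplification tailored to this one sentence):

```python
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (PNOUN Johnson))"
    "   (VP (AUX was) (PPART replaced)"
    "       (PP (PREP at) (PNOUN XYZ) (PNOUN Corp))"
    "       (PP (PREP by) (PNOUN Smith))))")

# The tree attaches the "by" PP to the VP, so Smith is the agent of
# "replaced", not the other way around as the word order might suggest.
for pp in parse.subtrees(lambda t: t.label() == "PP"):
    if pp[0].leaves() == ["by"]:
        print("agent:", " ".join(pp[1].leaves()))   # agent: Smith
```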
Feature Generation
• The reason for the linguistic processing is to
identify features that can be useful for text
mining.
• Features that might be useful in identifying
the POS include: whether the first letter is
capitalized (indicating a proper noun), whether all the
characters are digits, periods, or commas
(marking a number), and whether the characters alternate
case (usually an abbreviation).
• A dictionary of the possible parts of speech
for a token can also serve as a feature.
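
A minimal sketch of extracting these surface features from a token (the feature names are this example's own, not from the text):

```python
def token_features(token):
    """Surface features of the kinds listed above (illustrative names)."""
    return {
        "init_cap": token[:1].isupper(),          # proper-noun hint
        "numeric": all(c.isdigit() or c in ".," for c in token),  # number hint
        "mixed_case": any(c.isupper() for c in token[1:])
                      and any(c.islower() for c in token),  # abbreviation hint
    }

print(token_features("Johnson"))  # init_cap only
print(token_features("1,273"))    # numeric only
print(token_features("PhD"))      # init_cap and mixed_case
```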
Feature Vector
• The feature vector for each instance is paired
with a class label, so that a classifier can learn the assignment.
• Feature Vector Examples:
– Classifying periods as End-Of-Sentence.
– Identifying tokens as instances of titles, such as
“Doctor” or “President”
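
For instance, a hypothetical feature vector for deciding whether a period ends a sentence (the feature choices and title list are assumptions of this sketch):

```python
# Features a classifier could use to label a period as end-of-sentence.
TITLES = {"dr", "doctor", "mr", "mrs", "prof", "president"}  # illustrative

def period_features(prev_token, next_token):
    return {
        "prev_is_title": prev_token.lower().rstrip(".") in TITLES,
        "prev_is_single_letter": len(prev_token.rstrip(".")) == 1,
        "next_capitalized": next_token[:1].isupper(),
    }

print(period_features("Dr", "Smith"))
# title before, capital after: probably NOT end-of-sentence
print(period_features("flight", "The"))
# ordinary word before, capital after: probably end-of-sentence
```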
Summary
• Part-of-Speech Tagging
– is an important step in Natural Language Analysis.
– is robust and fast.
– works with 95-97% accuracy.
• Parsing (= full syntax analysis)
– is more error-prone than PoS-Tagging.
– is important to get to the meaning of a sentence.
References / Applications
• http://www.cis.upenn.edu/~treebank/
• The Penn Treebank Project annotates naturally occurring
text for linguistic structure. Most notably, it
produces skeletal parses showing rough syntactic and
semantic information: a bank of linguistic trees.
• http://www.americancorpus.org/
• http://ucrel.lancs.ac.uk/claws/
• Stanford Natural Language Processing Group: http://nlp.stanford.edu/software/tagger.shtm