Natural Language Understanding
Natural Language Processing (3a)
Zhao Hai 赵海
Department of Computer Science and Engineering
Shanghai Jiao Tong University
2010-2011
[email protected]
1
Outline
Lexicons and Lexical Analysis
Finite State Models and Morphological Analysis
Collocation
2
Lexicons and Lexical Analysis (202)
Finite State Models and Morphological
Analysis (1)
Morphemes
Morphemes are the smallest meaningful units of language;
they are typically word stems or affixes.
For example, the word “books” can be divided into two
morphemes: ‘book’ and ‘s’, where ‘s’ marks the plural.
3
Lexicons and Lexical Analysis (203)
Finite State Models and Morphological
Analysis (2)
Morphology (1)
Morphology is generally divided into two types:
1.
Inflectional morphology covers the variant forms of nouns,
adjectives and verbs owing to changes in:
Person (first, second, third); Number (singular, plural);
Tense (present, future, past); Gender (masculine, feminine, neuter).
4
Lexicons and Lexical Analysis (204)
Finite State Models and Morphological
Analysis (3)
Morphology (2)
2.
Derivational morphology is the formation of a new word
by addition of an affix, but it also includes cases of
derivation without an affix:
disenchant (V) + -ment → disenchantment (N);
reduce (V) + -tion → reduction (N);
record (V) → record (N);
progress (N) → progress (V).
5
Lexicons and Lexical Analysis (205)
Finite State Models and Morphological
Analysis (4)
Morphology (3)
Most morphological analysis programs tend to deal only with
inflectional morphology, and assume that derivational variants
will be listed separately in the lexicon.
One exception is the Alvey Natural Language Toolkit
morphological analyzer (Russell, G., S. Pulman, G. Ritchie, and
A. Black. 1986. A dictionary and morphological analyser for
English. In Proceedings of the 11th COLING Conference, pp. 277-279).
6
Lexicons and Lexical Analysis (206)
Finite State Models and Morphological
Analysis (5)
Morphological Analyzer
A morphological analyzer must be able to undo the spelling
rules for adding affixes. For example, the analyzer must be able
to interpret “moved” as ‘move’ plus ‘ed’. For English, a few
rules cover the generation of plurals and other inflections such as
verb endings.
The main problem is where a rule has exceptions, which have
to be listed explicitly, or where it is not clear which rule applies,
if any.
7
Lexicons and Lexical Analysis (207)
Finite State Models and Morphological
Analysis (6)
Analysis of Plurals
The following word-stems obey regular rules for the
generation of plurals:
CHURCHES → CHURCH + ES; SPOUSES → SPOUSE + S;
FLIES → FLY + IES; PIES → PIE + S.
The remaining word-stems are irregular:
MICE → MOUSE; FISH → FISH; ROOVES → ROOF + VES;
BOOK ENDS → BOOK END + S;
LIEUTENANTS GENERAL → LIEUTENANT (+S) GENERAL.
8
Lexicons and Lexical Analysis (208)
Finite State Models and Morphological
Analysis (7)
Analysis of Inflectional Variants
The following word-stems obey regular rules:
LODGING → LODGE + ING; BANNED → BAN + NED;
FUMED → FUME + D; BREACHED → BREACH + ED;
TAKEN → TAKE + N.
The following word-stems are irregular:
TAUGHT → TEACH; FOUGHT → FIGHT;
TOOK → TAKE.
9
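To make the spelling-rule idea concrete, here is a minimal Python sketch of a rule-based analyzer that strips a few inflectional suffixes and consults an explicit exception table for irregular forms. It is not the Alvey analyzer; the rule list, the exception table and the tiny lexicon are illustrative assumptions, not a complete description of English.

```python
# Minimal sketch of a rule-based inflectional analyzer (illustrative only).
# Regular spelling rules are tried in order; irregular forms are listed
# explicitly, as the slides suggest.

IRREGULAR = {                       # exception table
    "mice":   ("mouse", "+s"),
    "taught": ("teach", "+ed"),
    "fought": ("fight", "+ed"),
    "took":   ("take",  "+ed"),
}

# (suffix to strip, stem ending to restore, morpheme tag)
RULES = [
    ("ches", "ch", "+s"),   # churches -> church + s
    ("ies",  "y",  "+s"),   # flies    -> fly + s
    ("es",   "e",  "+s"),   # spouses  -> spouse + s
    ("s",    "",   "+s"),   # books    -> book + s
    ("ed",   "e",  "+ed"),  # moved    -> move + ed
    ("ed",   "",   "+ed"),  # breached -> breach + ed
    ("ing",  "e",  "+ing"), # lodging  -> lodge + ing
    ("ing",  "",   "+ing"), # knocking -> knock + ing
]

# A toy lexicon of stems; a real analyzer would use a full dictionary.
LEXICON = {"mouse", "teach", "fight", "take", "church", "fly", "spouse",
           "book", "move", "breach", "lodge", "knock"}

def analyse(word):
    """Undo the spelling rules: return (stem, suffix) analyses of `word`."""
    word = word.lower()
    if word in IRREGULAR:
        return [IRREGULAR[word]]
    analyses = []
    for suffix, stem_end, tag in RULES:
        if word.endswith(suffix):
            stem = word[:len(word) - len(suffix)] + stem_end
            if stem in LEXICON and (stem, tag) not in analyses:
                analyses.append((stem, tag))
    return analyses or [(word, "")]   # no rule applied: return the word itself

print(analyse("moved"))      # [('move', '+ed')]
print(analyse("churches"))   # [('church', '+s')]
print(analyse("mice"))       # [('mouse', '+s')]
```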
Lexicons and Lexical Analysis (209)
Finite State Models and Morphological
Analysis (8)
Finite State Transducers (FSTs) (1)
Finite-state transducers (FSTs) are automata for which each
transition has an output label in addition to the more familiar
input label.
Transducers transform (transduce) input strings into output
strings. The output symbols come from a finite set, usually called
the output alphabet.
Since the input and output alphabets are frequently the same,
there is usually no distinction between them, and only the
input label is given.
10
Lexicons and Lexical Analysis (210)
Finite State Models and Morphological
Analysis (9)
Finite State Transducers (FSTs) (2)
Definition: A finite-state transducer (FST) is a 5-tuple M =
(Q, Σ, E, i, F), where Q is a finite set of states, i ∈ Q is the
initial state, F ⊆ Q is a set of final states, Σ is a finite alphabet,
and E ⊆ Q × (Σ ∪ {ε}) × Σ* × Q is the set of transitions (arcs).
Σ* is the set of all possible words over Σ:
Σ* = {v | v = v1v2…vn for n ≥ 1 and vi ∈ Σ for all 1 ≤ i ≤ n} ∪ {ε}
11
Lexicons and Lexical Analysis (211)
Finite State Models and Morphological
Analysis (10)
Finite State Transducers (FSTs) (3)
Definition: Further, we define the state transition function
δ : Q × (Σ ∪ {ε}) → 2^Q (the power set of Q) as follows:
δ(p, a) = { q ∈ Q | ∃ v ∈ Σ* : (p, a, v, q) ∈ E },
and the emission function λ : Q × (Σ ∪ {ε}) × Q → 2^Σ* is
defined as:
λ(p, a, q) = { v ∈ Σ* | (p, a, v, q) ∈ E }
12
Lexicons and Lexical Analysis (212)
Finite State Models and Morphological
Analysis (11)
Finite State Transducers (FSTs) (4)
Ex.: Let M = (QM, ΣM, EM, iM, FM) be an FST, where QM = {0, 1, 2},
ΣM = {a, b, c}, EM = {(0, a, b, 1), (0, a, c, 2)}, iM = 0 and
FM = {1, 2}.
M transduces a to b or a to c. Note that
for visualizing transducers we use the
colon to separate the input and output
labels of a transduction.
13
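The 5-tuple definition and the example M can be written down directly as a short Python sketch. The class and method names below are my own choices (not taken from any particular FST toolkit), and epsilon-input arcs are ignored for brevity.

```python
# A direct rendering of the 5-tuple definition M = (Q, Sigma, E, i, F),
# with transitions written as (source, input_label, output_string, target).
# Epsilon-input arcs are ignored here to keep the sketch short.

class FST:
    def __init__(self, states, alphabet, transitions, initial, finals):
        self.Q, self.Sigma, self.E = states, alphabet, transitions
        self.i, self.F = initial, finals

    def delta(self, p, a):
        """State transition function: states reachable from p reading a."""
        return {q for (p2, a2, v, q) in self.E if (p2, a2) == (p, a)}

    def lam(self, p, a, q):
        """Emission function: output strings on arcs from p to q reading a."""
        return {v for (p2, a2, v, q2) in self.E if (p2, a2, q2) == (p, a, q)}

    def transduce(self, word):
        """All outputs produced for `word` along paths ending in a final state."""
        configs = {(self.i, "")}                 # (current state, output so far)
        for a in word:
            configs = {(q, out + v)
                       for (p, out) in configs
                       for q in self.delta(p, a)
                       for v in self.lam(p, a, q)}
        return {out for (q, out) in configs if q in self.F}

# The example M above: it transduces a to b or a to c.
M = FST(states={0, 1, 2},
        alphabet={"a", "b", "c"},
        transitions={(0, "a", "b", 1), (0, "a", "c", 2)},
        initial=0,
        finals={1, 2})

print(M.transduce("a"))   # {'b', 'c'}
```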
Lexicons and Lexical Analysis (213)
Finite State Models and Morphological
Analysis (12)
A Simple FST
Ex.: Morphological analysis for the word “happy” and its
derived forms: happy → happy; happier → happy+er;
happiest → happy+est
14
Lexicons and Lexical Analysis (214)
Finite State Models and Morphological
Analysis (13)
Specification for the Simple FST
Arcs labeled by a single letter have that letter as both the input
and the output. Nodes that are double circles indicate success
states, that is, acceptable words.
The dashed link, indicating a jump, is not formally necessary
but is useful for showing the break between the processing of the
root form and the processing of the suffix.
No input is represented by the empty symbol ε.
15
Lexicons and Lexical Analysis (215)
Finite State Models and Morphological
Analysis (14)
A Fragment of an FST
This FST accepts the following words, which all start with t: tie (state 4), ties (10), trap (7),
traps (10), try (11), tries (15), to (16), torch (19), torches (15), toss (21), and tosses (15). In
addition, it outputs tie, tie+s, trap, trap+s, try, try+s, to, torch, torch+s, toss, toss+s.
16
Lexicons and Lexical Analysis (216)
Finite State Models and Morphological
Analysis (15)
Specification for the Fragment of an FST (1)
The entire lexicon can be encoded as an FST that encodes all
the legal input words and transforms them into morphemic
sequences.
The FSTs for the different suffixes need only be defined once,
and all root forms that allow that suffix can point to the same
node.
17
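The following Python sketch illustrates the shared-suffix idea on a toy scale: the plural suffix entry is defined once and every root that allows it points to that single entry. The data structures and the crude y-to-i spelling adjustment are assumptions made for brevity, not the FST of the figure.

```python
# Toy lexicon transducer: the plural suffix is described once (SUFFIXES)
# and every root that allows it points to the same entry (LEXICON).
# Output follows the slides: surface word -> root + morpheme tag.

SUFFIXES = {"plural": [("s", "+s"), ("es", "+s")]}   # defined only once

LEXICON = {                      # root -> suffix class it may combine with
    "tie":   "plural",
    "trap":  "plural",
    "try":   "plural",
    "torch": "plural",
    "toss":  "plural",
}

def transduce(word):
    """Map a surface word to its morphemic sequence, e.g. torches -> torch+s."""
    if word in LEXICON:                          # a bare root, e.g. "trap"
        return word
    for root, suffix_class in LEXICON.items():
        for surface, morpheme in SUFFIXES[suffix_class]:
            # crude spelling adjustment so that "tries" matches try + es
            candidates = {root + surface, root.rstrip("y") + "i" + surface}
            if word in candidates:
                return root + morpheme
    return None                                  # not in this tiny lexicon

for w in ["ties", "tries", "torches", "tosses", "trap"]:
    print(w, "->", transduce(w))
# ties -> tie+s, tries -> try+s, torches -> torch+s, tosses -> toss+s, trap -> trap
```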
Lexicons and Lexical Analysis (217)
Finite State Models and Morphological
Analysis (16)
Specification for the Fragment of an FST (2)
Words that share a common prefix (such as torch, toss, and so
on) also can share the same nodes, greatly reducing the size of
the network.
Note that you may pass through acceptable states along the
way when processing a word.
18
Lexicons and Lexical Analysis (218)
Finite State Models and Morphological
Analysis (17)
References
J. Hopcroft and J. Ullman. 1979. Introduction to Automata Theory,
Languages and Computation. Addison-Wesley Series in Computer
Science. Addison-Wesley, Reading, Massachusetts.
M. Mohri. 1997. Finite-state transducers in language and
speech processing. Computational Linguistics 23.
19
Lexicons and Lexical Analysis (219)
Collocation (1)
Definition
A collocation is an expression consisting of two or more
words that correspond to some conventional way of saying things.
For example,
noun phrases: strong tea; weapons of mass destruction;
phrasal verbs: to make up;
other phrases: the rich and powerful.
20
Lexicons and Lexical Analysis (220)
Collocation (2)
Compositionality
We call a natural language expression compositional if the
meaning of the expression can be predicted from the meaning of
the parts.
Collocations are characterized by limited compositionality, in
which there is usually an element of meaning added to the
combination.
For example, in the case of strong tea, strong has acquired the
meaning ‘rich in some active agent’, which is closely related to,
but slightly different from, the basic sense ‘having great physical
strength’.
21
Lexicons and Lexical Analysis (221)
Collocation (3)
Non-Compositionality
Idioms are the most extreme examples of non-compositionality.
For instance, the idioms to kick the bucket and to hear it through
the grapevine only have an indirect historical relationship to the
meanings of the parts of the expression.
Most collocations exhibit milder forms of non-compositionality,
like the expression international best practice. It is very nearly a
systematic composition of its parts, but still has an element of added
meaning.
22
Lexicons and Lexical Analysis (222)
Collocation (4)
Other Terms
There is considerable overlap between the concept of
collocation and notions like term, technical term, and
terminological phrase.
The above three terms are commonly used when collocations
are extracted from technical domains (in a process called
terminology extraction).
23
Lexicons and Lexical Analysis (223)
Collocation (5)
Applications (1)
Collocations are important for a number of applications:
natural language generation (to make sure that the output
sounds natural and mistakes like powerful tea or to take a
decision are avoided);
computational lexicography (to automatically identify the
important collocations to be listed in a dictionary entry);
24
Lexicons and Lexical Analysis (224)
Collocation (6)
Applications (2)
parsing (so that preference can be given to parses with natural
collocations)
corpus linguistic research (for instance, the study of social
phenomena like the reinforcement of cultural stereotypes through
language).
25
Lexicons and Lexical Analysis (225)
Collocation (7)
Frequency (1)
Surely the simplest method for finding collocations in a text
corpus is counting.
If two words occur together a lot, then that is evidence that
they have a special function that is not simply explained as the
function that results from their combination.
26
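A counting step of this kind is only a few lines of code. The sketch below assumes the corpus is already available as a plain list of tokens; the token list used here is a small stand-in, not the New York Times data.

```python
from collections import Counter

# Toy token list standing in for a large corpus such as the New York Times.
tokens = ("in the new york area the new york times said "
          "that the plan was of the essence").split()

# Count every bigram (pair of adjacent words) and show the most frequent ones.
bigram_counts = Counter(zip(tokens, tokens[1:]))
for (w1, w2), freq in bigram_counts.most_common(5):
    print(freq, w1, w2)
```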
Lexicons and Lexical Analysis (226)
Collocation (8)
Frequency (2)
The table shows the bigrams (sequences of two adjacent words)
that are most frequent in the New York Times corpus, together
with their frequencies. Except for New York, all of these
bigrams are pairs of function words.
A function word is a word which has no lexical meaning,
and whose sole function is to express grammatical relationships,
such as prepositions, articles, and conjunctions.
27
Lexicons and Lexical Analysis (227)
Collocation (9)
Frequency (3)
But just selecting the most frequently occurring bigrams is not
very interesting. Justeson and Katz (1995) therefore pass the
candidate phrases through a part-of-speech filter which only lets
through those patterns that are likely to be “phrases”.
28
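As an illustration of such a filter, the sketch below tags a sentence with NLTK's default part-of-speech tagger and keeps only adjective-noun and noun-noun bigrams. This is a simplified stand-in: the pattern set in Justeson and Katz (1995) is larger (see the following slides), and the choice of tagger and example sentence are assumptions.

```python
# Requires: pip install nltk, plus nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger") for the tokenizer and tagger.
import nltk
from collections import Counter

def coarse(tag):
    """Collapse Penn Treebank tags into the A / N classes used by the filter."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    return "X"

ALLOWED = {("A", "N"), ("N", "N")}   # a subset of the Justeson-Katz patterns

text = "The New York Stock Exchange reported strong support for the plan."
tagged = nltk.pos_tag(nltk.word_tokenize(text))

candidates = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (coarse(t1), coarse(t2)) in ALLOWED:
        candidates[(w1.lower(), w2.lower())] += 1

print(candidates.most_common())
```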
Lexicons and Lexical Analysis (228)
Collocation (10)
Frequency (4)
29
Lexicons and Lexical Analysis (229)
Collocation (11)
Frequency (5)
Each pattern in the table is followed by an example from the
text, which is used as a test set. In these patterns A refers to an
adjective, P to a preposition, and N to a noun.
The next table shows the most highly ranked phrases after
applying the filter. The results are surprisingly good. There are
only 3 bigrams that we would not regard as non-compositional
phrases: last year, last week, and first time.
30
Lexicons and Lexical Analysis (230)
Collocation (12)
Frequency (6)
31
Lexicons and Lexical Analysis (231)
Collocation (13)
Frequency (7)
The bigram York City is an artefact of the way we have
implemented the Justeson and Katz filter. The full implementation
would search for the longest sequence that fits one of the
part-of-speech patterns and would thus find the longer phrase
New York City.
The twenty highest ranking phrases containing strong and
powerful all have the form A N (where A is either strong or
powerful). They have been listed in the following table.
32
Lexicons and Lexical Analysis (232)
Collocation (14)
Frequency (8)
33
Lexicons and Lexical Analysis (233)
Collocation (15)
Frequency (9)
Given the simplicity of the method, these results are
surprisingly accurate. For example, they give evidence that
strong challenge and powerful computers are correct whereas
powerful challenge and strong computers are not.
However, we can also see the limits of a frequency-based
method. The nouns man and force are used with both adjectives
(strong force occurs further down the list with a frequency of 4).
34
Lexicons and Lexical Analysis (234)
Collocation (16)
Frequency (10)
Neither strong tea nor powerful tea occurs in the New York Times
corpus.
However, searching the larger corpus of the World Wide Web
we find 799 examples of strong tea and 17 examples of powerful
tea (the latter mostly in the computational linguistics literature on
collocations), which indicates that the correct phrase is strong
tea.
35
Lexicons and Lexical Analysis (235)
Collocation (17)
Frequency (11)
Justeson and Katz’ method of collocation discovery is
instructive in that it demonstrates an important point.
A simple quantitative technique (the frequency filter)
combined with a small amount of linguistic knowledge (the
importance of parts of speech) goes a long way.
Later we will use a stop list that excludes words whose most
frequent tag is not a verb, noun or adjective.
36
Lexicons and Lexical Analysis (236)
Collocation (18)
Mean and Variance (1)
Frequency-based search works well for fixed phrases. But
many collocations consist of two words that stand in a more
flexible relationship to one another.
Consider the verb knock and one of its most frequent
arguments, door. Here are some examples of knocking on or at a
door from our corpus:
37
Lexicons and Lexical Analysis (237)
Collocation (19)
Mean and Variance (2)
She knocked on his door.
They knocked at the door.
100 women knocked on Donaldson’s door.
A man knocked on the metal front door.
38
Lexicons and Lexical Analysis (238)
Collocation (20)
Mean and Variance (3)
The words that appear between knocked and door vary and the
distance between the two words is not constant so a fixed phrase
approach would not work here.
But there is enough regularity in the patterns to allow us to
determine that knock is the right verb to use in English for this
situation, not hit, beat or rap.
39
Lexicons and Lexical Analysis (239)
Collocation (21)
Mean and Variance (4)
To simplify matters we only look at fixed-phrase collocations
in most cases, and usually at just bigrams.
We define a collocational window (usually a window of 3 to 4
words on each side of a word), and we enter every word pair in
that window as a collocational bigram. Then we proceed to do our
calculations as usual on this larger pool of bigrams.
40
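A short sketch of the windowing step: every ordered word pair at most three positions apart is entered into the pool of candidate bigrams, together with the offset at which it was seen. The window size and the example sentence are illustrative.

```python
from collections import defaultdict

# Enter every word pair within a 3-word window into the pool of candidate
# bigrams, remembering the offset (distance) at which the pair was seen.

def windowed_pairs(tokens, window=3):
    pairs = defaultdict(list)                 # (w1, w2) -> list of offsets
    for i, w1 in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                pairs[(w1, tokens[i + d])].append(d)
    return pairs

sentence = "she knocked on his door".split()
pairs = windowed_pairs(sentence)
print(pairs[("knocked", "door")])             # [3]: door is 3 words after knocked
```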
Lexicons and Lexical Analysis (240)
Collocation (22)
Mean and Variance (5)
Using a three word collocational window to capture bigrams at a distance
41
Lexicons and Lexical Analysis (241)
Collocation (23)
Mean and Variance (6)
The mean- and variance-based methods described here look at
the pattern of varying distances between two words.
One way of discovering the relationship between knocked and
door is to compute the mean and variance of the offsets (signed
distances) between the two words in the corpus.
42
Lexicons and Lexical Analysis (242)
Collocation (24)
Mean and Variance (7)
The mean is simply the average offset. For the four examples
above, the offsets between knocked and door are 3, 3, 5 and 5,
so the mean offset is:
(3 + 3 + 5 + 5) / 4 = 4.0
This assumes a tokenization of Donaldson’s as three words:
Donaldson, apostrophe, and s.
43
Lexicons and Lexical Analysis (243)
Collocation (25)
Mean and Variance (8)
The variance measures how much the individual offsets
deviate from the mean. We estimate it as follows:
s² = [(d1 − μ)² + (d2 − μ)² + … + (dn − μ)²] / (n − 1)
where n is the number of times the two words co-occur, di is the
offset for co-occurrence i, and μ is the mean.
44
Lexicons and Lexical Analysis (244)
Collocation (26)
Mean and Variance (9)
As is customary, we use the standard deviation s, the
square root of the variance, to assess how variable the offset
between two words is. The standard deviation for the four
examples of knocked / door in the above case is:
s = √[((3 − 4)² + (3 − 4)² + (5 − 4)² + (5 − 4)²) / 3] ≈ 1.15
45
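The arithmetic for the four knocked / door examples can be checked with a few lines of Python; the offsets 3, 3, 5, 5 follow from the tokenization described above, and the variance uses n − 1 as in the formula.

```python
from math import sqrt

# Offsets between "knocked" and "door" in the four example sentences,
# with Donaldson's tokenized as Donaldson + apostrophe + s.
offsets = [3, 3, 5, 5]

n = len(offsets)
mean = sum(offsets) / n                                      # 4.0
variance = sum((d - mean) ** 2 for d in offsets) / (n - 1)   # 4/3
std_dev = sqrt(variance)

print(mean, variance, round(std_dev, 2))                     # 4.0 1.333... 1.15
```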
Lexicons and Lexical Analysis (245)
Collocation (27)
Mean and Variance (10)
The mean and standard deviation characterize the
distribution of distances between two words in a corpus. We
can use this information to discover collocations by looking for
pairs with low standard deviation.
We can also explain the information that variance gets at in
terms of peaks in the distribution of one word with respect to
another.
46
Lexicons and Lexical Analysis (246)
Collocation (28)
Mean and Variance (11)
The variance of strong with respect to opposition is small
47
Lexicons and Lexical Analysis (247)
Collocation (29)
Mean and Variance (12)
Because of this greater variability we get a mean that lies
between positions -1 and -2 (-1.45), and a larger standard
deviation.
48
Lexicons and Lexical Analysis (248)
Collocation (30)
Mean and Variance (13)
The high standard deviation for this pair indicates an essentially
random distribution of distances.
This indicates that for and strong don’t form interesting collocations.
49
Lexicons and Lexical Analysis (249)
Collocation (31)
Mean and Variance (14)
50
Lexicons and Lexical Analysis (250)
Collocation (32)
Mean and Variance (15)
If the mean is close to 1.0 and the standard deviation low, as is
the case for New York, then we have the type of phrase that
Justeson and Katz’ frequency-based approach will also discover.
If the mean is much greater than 1.0, then a low standard
deviation indicates an interesting phrase.
51
Lexicons and Lexical Analysis (251)
Collocation (33)
Mean and Variance (16)
High standard deviation indicates that the two words of the
pair stand in no interesting relationship, as demonstrated by the
four high-variance pairs.
More interesting are the cases in between, word pairs that have
large counts for several distances in their collocational
distribution.
52
Lexicons and Lexical Analysis (252)
Collocation (34)
References
J. S. Justeson and S. M. Katz. 1995. Technical terminology:
some linguistic properties and an algorithm for identification
in text. Natural Language Engineering 1.
M. A. K. Halliday. 1966. Lexis as a linguistic level. In C. E.
Bazell, J. C. Catford, M. A. K. Halliday, and R. H. Robins (eds.),
In memory of J. R. Firth. London: Longmans.
F. Smadja. 1993. Retrieving collocations from text: Xtract.
Computational Linguistics 19.
53
Lexicons and Lexical Analysis (253)
Assignments (7)
1. Pick a document in which your name occurs (an email, a university
transcript or a letter). Does Justeson and Katz’s filter identify your name as
a collocation?
2. We used the World Wide Web as an auxiliary corpus above because neither
strong tea nor powerful tea occurred in the New York Times. Modify
Justeson and Katz’s method so that it uses the World Wide Web as a
resource of last resort. Take a mainstream search engine as your tool.
54