Introduction to Linguistics and its role in Natural Language Processing
CS598 DNR FALL 2005
Machine Learning in Natural Language
Introduction: Part 3
Linguistics Essentials
(The role of Linguistics in NLP)
1
Introduction
This is not a class in NLP – but we want to discuss how to
make progress in natural language understanding
Introduce basic linguistics concepts.
Basic terminology
Discuss the levels of analysis used in NLP
Problems associated with each level.
2
Comprehension
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in
England. He is the same person that you read about in the book, Winnie the
Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When
Chris was three years old, his father wrote a poem about him. The poem was
printed in a magazine for others to read. Mr. Robin then wrote a book. He
made up a fairy tale land where Chris lived. His friends were animals. There
was a bear called Winnie the Pooh. There was also an owl and a young pig,
called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin
made them come to life with his words. The places in the story were all near
Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to
read about Christopher Robin and his animal friends. Most people don't know
he is a real person who is grown now. He has written two books of his own.
They tell what it is like to be famous.
1. Who is Christopher Robin?
2. When was Winnie the Pooh written?
3. What did Mr. Robin do when Chris was three years old?
4. Where did young Chris live?
5. Why did Chris write two books of his own?
Other motivating problems: Entailment; Translation; Generation…
3
Introduction
Discuss the levels of analysis used in NLP
Problems associated with each level.
For each level of Linguistics Analysis we will ask:
What are the problems here?
What would we consider as a solution?
4
Levels of Analysis
In traditional linguistics people talk about several levels of
analysis, or types of linguistic knowledge.
Morphology: How words are constructed
Syntax: Structural relation between words
Semantics: The meaning of words and of combinations of words
Pragmatics: How a sentence is used? What's its purpose?
Discourse (sometimes distinguished as a subfield of Pragmatics):
Relationships between sentences; global context.
5
Morphology
Morphology: How words are constructed; prefixes & Suffixes
The simple cases are:
kick, kicks, kicked, kicking
But other cases may be
sit, sits, sat, sitting
Not just as simple as adding and deleting certain endings, as in:
gorge, gorgeous
good, goods
arm, army
This might be very different in other languages...
(Problems; solutions)
6
Syntax
Syntax: Structural relationship between words.
The main issues here are structural ambiguities, as in:
I saw the Grand Canyon flying to New York.
or
Time flies like an arrow.
The sentence can be interpreted as a
Metaphor: time passes quickly, but also
Declaratively: Insects have an affinity for arrows
Imperative: measure the time of the insects.
Key issue: Often syntax doesn't tell us much about meaning.
Plastic cat food can cover
7
Semantics
Semantics: The meaning of words and of combinations of words.
Some key issues here:
Lexical ambiguities:
I walked to the bank
{of the river / to get money}.
The bug in the room {was probably planted by spies /
flew out the window}.
Compositionality: The meaning of phrases/sentences as a
function of the meaning of words in them.
(Problems; Solutions)
8
Pragmatics/Discourse
Pragmatics: How a sentence is used; its purpose.
E.g.: Rules of conversation:
Can you tell me what time it is
Could I have the salt
Discourse: Relations between sentences; global context.
An important example here is the problem of co-reference:
When Chris was three years old, his father wrote
a poem about him.
“Chicago?”
(Running towards an agent in an airport; Ticket Agency)
9
Morphology and Part-of-Speech
Words are related by morphological processes such as
forming plural forms from singular forms: dog...dogs
adding prefixes and suffixes: conceive...inconceivable
Importance?
It makes language more predictable.
It allows us to handle new words which are outside our vocabulary.
Understanding morphology may support generalization to unknown words.
However, Morphology may be tricky.
Not always as simple as stripping common prefixes and suffixes.
preempt....... empt ?
gorgeous.... like a gorge?
apply........... like an apple?
old.............. oldly?
Mrs. .......... plural of Mr.
atomic......... not Tom-like
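To make the "not always as simple as stripping suffixes" point concrete, here is a minimal sketch of a naive suffix stripper. The suffix list and the function name naive_stem are assumptions made for illustration only; this is not a real stemming algorithm.

```python
# Hypothetical suffix list, chosen only to illustrate the slide's examples.
SUFFIXES = ["ing", "ous", "ed", "ly", "s", "y"]

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["kicks", "kicked", "kicking", "sat", "gorgeous", "goods", "army"]:
    print(word, "->", naive_stem(word))
```

The regular forms of "kick" all reduce to the same root, but "sat" is never related to "sit", and "gorgeous", "goods" and "army" are wrongly reduced to "gorge", "good" and "arm".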
10
Morphological Processes
Inflectional forms:
Words generated share the same basic meaning and part of speech.
Words are generated by systematic modifications of the root forms.
kick,kicks,kicked, kicking
Derivational forms:
Words generated may have different meaning and part of speech.
friend...friendly; wide...widely; hard...hardly
Is there a problem to solve here? What would you consider a
solution?
11
Part of Speech
The part-of-speech of words in a sentence plays an important role in all recent
work in natural language; it is necessary for reading the literature and the corpora.
Part of speech (POS) is a way to categorize words based on a particular
syntactic (and often semantic) function they take in the sentence.
Sometimes called syntactic or grammatical categories.
Important POS:
Nouns: typically refer to people, animals and “things”.
Verbs: express the action in the sentence.
Adjectives: describe properties of nouns.
Children eat sweet candy
Data / Demo
Children: Noun - group of people.
eat: Verb - describes what people do with candy.
sweet: Adj.- a property of candy
candy: Noun - a particular type of food
Other basic parts of speech: adverb, article, pronoun, conjunction …
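As a rough illustration of what assigning parts of speech means computationally, the sketch below tags the example sentence by dictionary lookup. The lexicon, the "UNK" label and the function name tag are assumptions for illustration; a practical tagger must also use context, since many words can take more than one part of speech.

```python
# Hypothetical hand-built lexicon covering only the slide's example sentence.
LEXICON = {"children": "Noun", "eat": "Verb", "sweet": "Adj", "candy": "Noun"}

def tag(sentence):
    """Tag each word by dictionary lookup; unknown words get 'UNK'."""
    return [(word, LEXICON.get(word.lower(), "UNK")) for word in sentence.split()]

tagged = tag("Children eat sweet candy")
print(tagged)
```

A lookup like this cannot decide whether "flies" in "Time flies like an arrow" is a noun or a verb, which is why tagging is treated as a disambiguation problem.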
12
Part of Speech (cont.)
Useful sub-categorization of POS into two types:
Open class words:
A constantly changing set; new words are often introduced into the language.
nouns, verbs, adjectives and adverbs
Closed class words:
A relatively stable set; new words are rarely introduced into the language.
articles, pronouns, prepositions, conjunctions.
It is therefore easier to deal with closed class words.
Articles: a, an, the
Pronouns: I, you, me, we, he, she, him, her, it, them, they
Prepositions: to, for, with, between, at, of
Demonstratives: this, that, these, those
Quantifiers: some, every, most, any, both
Conjunctions: and, or, but
13
Closed class words (not so easy)
Articles pose a lot of difficulty for language generation.
Most noun phrases start with an article:
a newspaper, an apple, the movie
But, there are many exceptions,
The bowl was full of rice.
I go to college.
She went on vacation.
He fell asleep in class.
*The bowl was full of apple.
*I go to university.
*She went on trip.
*He fell asleep in room.
14
Closed class words (not so easy-II)
Other closed class words that are hard to deal with: prepositions & particles.
Prepositions represent relations: time, location, modification, complements.
Particles are prepositions that follow verbs to create new verb forms.
He threw the cookies up the chimney
vs.
He threw up the cookies
And sometimes, it can be ambiguous:
He passed out
But also
He put the book on the table
He gave the book to Mary
He walked up the stairs
He looked over the paper.
Other problems with prepositions include attachments, which will be discussed
later when we discuss semantics.
Problems? Solutions?
POS? Disambiguation? Text Correction?
15
Nouns
Nouns refer to entities in the world, which represent objects,
places, concepts, people, events:
dog, city, idea, marathon
Count nouns: describe specific objects or sets of objects (above)
Mass nouns: describe composites or substances,
dirt, water, garbage, deer.
Pronouns are a special class of nouns that refer to a person or a
"thing" that is salient in the context of use.
After Mary had arrived in the village, she looked for a hotel.
Relative Pronouns are pronouns like:
who, which, that
The man who saw Elvis..
The UFO that landed in Toledo ...
The Rolling Stones concert, which I attended, ...
16
Nouns (cont.)
Nouns can be objects of verbs or subjects of verbs:
Children eat sweet candy
Children: Subject; candy: Object
Proper nouns are names like
Mary, Smith, United States, IBM, Little Rock.
Nouns have modifiers. They can be modified by:
adjectives: words that attribute qualities to objects.
wet, loud, happy, funny
or by
noun modifiers:
dog food, tin can, song book.
In this case we can talk about the head noun which represents the main concept, e.g.,
"food" in "dog food".
A noun is usually embedded in a noun phrase.
A syntactic unit of the sentence in which information about the noun is gathered.
The noun is the head of the noun phrase.
In addition to the noun we may find in a noun phrase an article: "The tree", and an
adjective: "The tall tree".
Problems, Solutions?
Identification? Why do we need to solve it?
How to evaluate it?
17
Verbs
Verbs: Words that represent actions, commands or assertions.
Main verbs: walk, eat, believe, claim, ask
Auxiliary verbs: be, do, have
Modal verbs: will, can, could
Verbs can be
transitive: they take a complement, as in:
eat an apple; read a book; sing a song
intransitive: verbs that do not take complements, as in:
she laughed; he slept; I lied
18
Verbs (cont.)
Verbs have morphological forms:
Base:                walk     be      go
Present:             walks    is      goes
Past:                walked   was     went
Present Participle:  walking  being   going
Past Participle:     walked   been    gone
19
Verbs (cont.)
Verbs can be Active or Passive.
The passive voice form consists of a form of "to be" followed by
the past participle.
Active: I saw Elvis.  Passive: Elvis was seen by me.
Active: I will find him.  Passive: He will be found by me.
Active: I have found him.  Passive: He has been found by me.
The roles are reversed in actives and passives.
John killed Sam:
subject is killer, direct object is victim
Sam was killed by John:
subject is victim, object of "by" is killer
Some verbs take indirect objects, e.g.
I gave Mary the book
vs.
I gave the book to Mary.
Mary: indirect object;
book: direct object
20
Verbs (cont.)
Prepositions and Particles are important in the context of verbs.
When they appear as Particles they create new verb forms.
Sometimes, we need to know the meaning of the sentence to
decide if a word is a preposition or a particle.
She ran up the hill
She ran up the bill
21
Verb Phrases
The verb phrase is the syntactic unit that organizes all elements
of the sentence that depend syntactically on the verb.
The Verb is the head of the verb phrase.
An Adverb is an element of the verb phrase which specifies
place, time, manner, degree
She often travels to Las Vegas.
She allegedly committed perjury.
She started her career off impressively.
22
Verb Sub-categorization
This is a categorization of verbs according to the types of complements they
take.
Complements of a verb are different syntactic means that verbs can exploit
to express related entities.
The set of complements that a verb can appear with is called its
subcategorization frame.
Examples
Verbnet
23
Sub-categorization Frames
Intransitive: NP (subject)
The woman walked
Transitive: NP (subject) NP (object)
John loves Mary
Dbl obj Construction: NP (subject) NP (indirect object) NP (direct object)
Mary gave John flowers
NP (subject) NP (object) PP (location)
She put the book on the table
Clause complement: NP (subject) NP (object) that clause
She told me that Gary is coming.
Reflexive Verbs: NP (subject) Reflexive Pronoun (object)
She introduced herself
Complements of verbs can be either
Obligatory arguments (subject, object, direct object)
She put the book on the table
or
Optional (like a PP or a subordinate clause, e.g., a "that" clause).
She gave her presentation on the stage.
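One possible way to represent subcategorization frames in a program is a table from verbs to the complement sequences they allow after the subject. The sketch below is an assumption-laden illustration: the FRAMES inventory, the category labels and the function name licensed are hypothetical and not taken from any lexical resource.

```python
# Hypothetical frame inventory for verbs used on the slides; each frame lists
# the complements that follow the subject NP.
FRAMES = {
    "walk": [()],                          # The woman walked
    "love": [("NP",)],                     # John loves Mary
    "give": [("NP", "NP"), ("NP", "PP")],  # gave John flowers / gave the book to Mary
    "put": [("NP", "PP")],                 # put the book on the table
    "tell": [("NP", "that-clause")],       # told me that Gary is coming
}

def licensed(verb, complements):
    """True if one of the verb's frames matches this complement sequence."""
    return tuple(complements) in FRAMES.get(verb, [])

print(licensed("put", ["NP", "PP"]))  # both complements present
print(licensed("put", ["NP"]))        # the obligatory PP is missing
```

This table only models obligatory complements; optional ones, such as the PP in "She gave her presentation on the stage", would need to be handled separately.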
24
Syntactic and Semantic Regularities
Subcategorization frames capture syntactic regularities.
There are also semantic regularities, usually called selectional restrictions or
preferences.
E.g., "bark" prefers dogs as subjects
"eat" prefers edible things as objects.
Sentences that violate selectional preferences sound odd.
The cat barked all night.
I eat philosophy every day.
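Selectional preferences can be sketched as a check of an argument's semantic class against the class a verb prefers. The tiny ontology in CLASSES and the function name sounds_odd are assumptions for illustration; only the two preferences themselves come from the slide.

```python
# Hypothetical toy ontology: word -> semantic classes it belongs to.
CLASSES = {
    "dog": {"animal", "dog"},
    "cat": {"animal"},
    "apple": {"edible"},
    "philosophy": {"abstract"},
}

# Preferences from the slide: (verb, role) -> preferred semantic class.
PREFERENCES = {("bark", "subject"): "dog", ("eat", "object"): "edible"}

def sounds_odd(verb, role, word):
    """True if `word` violates the verb's selectional preference for `role`."""
    preferred = PREFERENCES.get((verb, role))
    return preferred is not None and preferred not in CLASSES.get(word, set())

print(sounds_odd("bark", "subject", "cat"))       # The cat barked all night.
print(sounds_odd("eat", "object", "philosophy"))  # I eat philosophy every day.
print(sounds_odd("eat", "object", "apple"))
```

These are preferences, not hard constraints, so a realistic system would score violations instead of rejecting the sentence outright.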
Last word about verbs:
Gerunds are present participles that function as nouns.
sleeping bags; drinking fountain; moving sale;
26
Syntax
Words in a sentence are not randomly strung together in a sequence.
Words are organized in phrases and arranged in particular word order.
Syntax is the study of regularities and laws of word order and phrase
structure.
In English, we cannot determine the meaning of the sentence from the
meaning of the words alone; the word order matters:
Peter gave Mary a book.
Mary gave Peter a book.
The basic word order in English is: Subject-Verb-Object
This holds for declarative sentences,
The children should eat spinach
but the order changes to express a particular "mood":
Interrogative (question): Should the children eat spinach? [Try on demos]
Imperative (command, request): Eat spinach!
27
Rewrite Rules
The regularities of word order are captured using rewrite rules.
The symbol on the left of the rule can be re-written as the set of symbols on
the right.
S --> NP VP
NP --> John, garbage
VP --> laughed, smells
This set of rewrite rules can produce the following sentences:
John laughed
Garbage laughed
John smells
Garbage smells
Symbols that cannot be decomposed are called terminal symbols.
Symbols that can be decomposed are called nonterminals.
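The toy rewrite rules above can be run mechanically. The following is a minimal sketch, assuming the three rules S --> NP VP, NP --> John, garbage and VP --> laughed, smells; the names RULES and generate are illustrative, and the function only terminates for grammars without recursion.

```python
from itertools import product

# Toy grammar: each nonterminal maps to its alternative right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["John"], ["garbage"]],
    "VP": [["laughed"], ["smells"]],
}

def generate(symbol):
    """Return every word sequence that `symbol` can be rewritten into."""
    if symbol not in RULES:  # terminal symbol: cannot be decomposed
        return [[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # Expand each right-hand-side symbol, then combine the alternatives.
        for parts in product(*(generate(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results

sentences = [" ".join(words) for words in generate("S")]
print(sentences)
# ['John laughed', 'John smells', 'garbage laughed', 'garbage smells']
```

Note that the grammar happily generates "garbage laughed": rewrite rules capture word order, not plausibility.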
An intuitive way to represent a sentence structure is as a tree, in which each
nonterminal node represents the application of a rewrite rule.
The following example presents a tree representation of the sentence
John walked the dog with fleas.
28
Rewrite Rules
This is produced using a set of rewrite rules that we call the
Grammar: A formal specification of the structures allowable in a language.
A grammar that can produce this tree is:
S --> NP VP
NP --> Det NP
NP --> Det noun PP
NP --> ADJ NP
NP --> noun NP
NP --> noun PP
NP --> noun
VP --> V NP PP
VP --> V NP
VP --> V PP
VP --> V
PP --> Prep NP
[Figure: parse tree of "John walked the dog with the fleas", with node labels S, NP, VP, V, N, Det, PP, P]
But, the same grammar can also produce other trees. E.g.,
the one that means that the fleas helped John walk the
dog.
That is, the grammar is not enough.
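The ambiguity can be checked mechanically by enumerating every tree the grammar assigns to the sentence. The sketch below is a brute-force enumerator, not an efficient parsing algorithm; the LEXICON mapping words to tags and the function names parses and expand are assumptions added for illustration.

```python
# Phrase-level rules of the grammar above.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "NP"], ["Det", "noun", "PP"], ["ADJ", "NP"],
           ["noun", "NP"], ["noun", "PP"], ["noun"]],
    "VP": [["V", "NP", "PP"], ["V", "NP"], ["V", "PP"], ["V"]],
    "PP": [["Prep", "NP"]],
}

# Hypothetical lexicon (not on the slide): word -> part-of-speech tag.
LEXICON = {"John": "noun", "walked": "V", "the": "Det",
           "dog": "noun", "with": "Prep", "fleas": "noun"}

def parses(symbol, words):
    """Return every tree that derives `words` from `symbol`."""
    if symbol not in GRAMMAR:  # part-of-speech tag: must match a single word
        if len(words) == 1 and LEXICON.get(words[0]) == symbol:
            return [(symbol, words[0])]
        return []
    trees = []
    for rhs in GRAMMAR[symbol]:
        trees.extend((symbol, *kids) for kids in expand(rhs, words))
    return trees

def expand(rhs, words):
    """Yield tuples of subtrees, one per symbol in `rhs`, that cover `words`."""
    if len(rhs) == 1:
        for tree in parses(rhs[0], words):
            yield (tree,)
        return
    # The first symbol takes a non-empty prefix; the rest cover the remainder.
    for cut in range(1, len(words) - len(rhs) + 2):
        for first in parses(rhs[0], words[:cut]):
            for rest in expand(rhs[1:], words[cut:]):
                yield (first, *rest)

sentence = "John walked the dog with the fleas".split()
trees = parses("S", sentence)
# A VP with three children (V NP PP) attaches the PP to the verb;
# otherwise the PP sits inside the object NP.
attachments = {"VP" if len(tree[2]) == 4 else "NP" for tree in trees}
print(len(trees), sorted(attachments))
```

Both attachments appear among the trees, so the grammar alone cannot say whether the dog or the walking involved the fleas; choosing between them needs information beyond the rewrite rules.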
30
Parsing
A parsing technique is a method for determining the structure of a sentence
with respect to a given grammar.
A parser is a computer program that determines the structure of the
sentence. Not to be confused with a program that induces the grammar.
Lexical vs. non-lexical grammar: many grammars today are lexicalized in that
the rewrite rules include specific words.
Notice that rewrite rules can be applied recursively. This is important, since
it allows simple nonterminals to expand to a large number of words. This
allows for the generation of many long-term dependencies, e.g., between
subjects and verbs, and is a source of difficulties in NLP.
A shallow parse is a parse of the sentence at a shallow level: only one or two
levels above the non-terminals. This is considered an easier task that, quite
often, can be more robust.
There are multiple grammar formalisms. What we showed here is a
constituent-based formalism, but there exist others.
31
Semantics
Semantics: the study of the meaning of language. Can be decomposed into:
Lexical semantics: the study of the meaning of individual words
Global semantics: how the meanings of individual words are combined into
the meaning of sentences (or more).
One approach to lexical semantics is to study how word meanings are related
to each other. To study this, words can be organized into lexical hierarchies
(as done in WordNet).
32
Lexical Semantics
Hypernym: a word with a more general sense.
hypernym(cat)=animal.
Hyponym: a word with a more specific sense.
Antonym: a word having opposite meaning.
antonym(hot)=cold.
Meronym: part-of.
meronym(tree)=leaf.
Synonym: same meaning
Homonyms: words that are written the same way but represent different
words.
Bank (river, finance); suit (law, set of garments)
Polysemy: word with two senses that are related
Branch: natural subdivision of a plant;
separate but dependent part of an organization.
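The relations on this slide can be stored as simple lookup tables. This is a hand-built sketch and not the WordNet database or its API; the entry linking "animal" to "organism" and the function names hypernym_chain and is_a are assumptions added to show how a hierarchy is followed upward.

```python
# Relations from the slide, plus one hypothetical entry (marked below).
HYPERNYM = {"cat": "animal", "animal": "organism"}  # "animal" entry is an assumption
MERONYM = {"tree": "leaf"}
ANTONYM = {"hot": "cold"}

def hypernym_chain(word):
    """Follow hypernym links upward until no more general word is listed."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def is_a(specific, general):
    """True if `general` is reachable from `specific` through hypernym links."""
    return general in hypernym_chain(specific)[1:]

print(hypernym_chain("cat"))
print(is_a("cat", "organism"), is_a("animal", "cat"))
```

Keying the tables by word form is a simplification: homonyms such as "bank" would need one entry per sense.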
33
Lexical Semantics
When we move to global semantics, the natural problem is:
How to use the meaning of single words to produce a meaning of a
sentence?
This is a hard problem, since natural language does not obey the principle of
compositionality.
E.g., the word white refers to different colors in the following expressions:
white paper; white hair; white skin; white wine
There are problems of idioms and the scope of words in the sentence that
make this even harder.
Multi-word expressions
34
Pragmatics
One of the important issues studied here is that of discourse analysis.
A central problem there is the resolution of anaphoric relations.
An example:
Mary helped the other passenger out of the cab. The man had asked her to help
him because of his foot injury.
Anaphoric relations hold between Noun Phrases that refer to the same thing
in the world.
In the above example, there are quite a few ways to resolve the identity of
"the man","him" and "his foot".
This issue is important in many applications, in particular in information
extraction -- where there is a need to keep track of participants.
The Reference problem vs. the Co-reference problem.
35
Summary
Linguistics is subdivided traditionally into
Phonetics (physical sounds of the language; consonants, vowels, intonation)
Phonology (how sounds are mentally represented),
Morphology,
Syntax,
Semantics and
Pragmatics.
Most of the work within the statistics and learning-based approaches to
natural language is done in the areas of Syntax, Semantics, and some
Pragmatics and this will be our main concern in this course as well.
Phonetics is also studied using related methods, within the Speech
community, and the techniques we will present in this course could be used
there, as well as in Morphology and Discourse analysis.
36