I256: Applied Natural Language Processing


I256
Applied Natural Language
Processing
Fall 2009
Lecture 2
• Python
• Related fields
• Linguistic essentials
Barbara Rosario
Today
• Announcements
– I admitted all the students on the waiting list. Tele-Bears should
reflect the change by today.
– Any questions/concerns about the class?
– Homework due next Tuesday September 8 at 12:30
• Make sure you are all set to start with Python & NLTK
– Office hours (Room 6)
• Today: Gopal at 2
• Wednesday 3-4: Gopal (if there is a request, let him know)
• Thursday: Barbara at 2
– Some (light) readings for Thursday
• Python
• Related fields
• Linguistic essentials
Python
Python - Simple yet powerful
The zen of python : http://www.python.org/dev/peps/pep-0020/
• Very clear, readable syntax
• Strong introspection capabilities
– http://www.ibm.com/developerworks/library/l-pyint.html (recommended)
• Intuitive object orientation
• Natural expression of procedural code
• Full modularity, supporting hierarchical packages
• Exception-based error handling
• Very high level dynamic data types
• Extensive standard libraries and third party modules for virtually every task
– Excellent functionality for processing linguistic data.
– NLTK is one such extensive third party module.
Source : python.org
Python (built-in types)
• Numeric types
– plain integers: implemented as long in C, at least 32 bits of precision (try: sys.maxint)
– long integers (unlimited precision)
– floating point numbers
– complex numbers
• Sequences
– Strings (immutable)
– Lists (mutable)
– Tuples (immutable)
• Mappings
– Dictionary
• File objects
• Classes
• Instances
• Exceptions
Source : python.org
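The numeric types above can be tried out directly. The slides target Python 2; the sketch below assumes Python 3, where plain and long integers are merged into a single int type and sys.maxint is replaced by sys.maxsize. The variable names are only illustrative.

```python
import sys

# Python 3 has a single int type with unlimited precision
# (Python 2's separate "plain" and "long" integers were merged).
big = 2 ** 100
print(type(big).__name__, big)

# sys.maxint no longer exists in Python 3; sys.maxsize is the closest analogue
# (the largest value a container index can take on this platform).
print(sys.maxsize)

ratio = 7 / 2            # true division gives a float in Python 3
z = complex(1, 2) * 1j   # complex arithmetic: (1+2j) * j = -2+1j
print(ratio, z)
```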
Python (Lists and tuples)
LISTS
• More than an ‘array’.
• Hold arbitrary objects and expand/collapse dynamically.
• Defined using standard array-like syntax
>>> mylist = ['nlp', 42577, 256, 'applied_nlp']
>>> mylist[3]
'applied_nlp'
>>> mylist[-1]
'applied_nlp'
>>> mylist[1:3]
[42577, 256]
TUPLE
• A tuple is an immutable list. Cannot be changed once created.
>>> mytuple = ('nlp', 42577, 256, 'applied_nlp')
>>> mytuple[3]
'applied_nlp'
>>> mytuple[3] = 'blahblah'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
Source : python.org
A few methods of a list li
• len(li)
• li.append('something')
• li.extend([list])
• li.insert(index, 'value')
• li.index('nlp')
• li.remove('nlp')
• li = li + [list]
• …
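The list methods above can be exercised in a few lines. This is a small sketch with made-up values (written so that it also runs under Python 3); the comments say what each call does.

```python
li = ["nlp", 42577, 256]

li.append("applied_nlp")      # add one element at the end
li.extend(["fall", 2009])     # add every element of another list
li.insert(1, "i256")          # insert before position 1
position = li.index("nlp")    # first position of a value
li.remove(42577)              # delete the first matching value
li = li + ["lecture2"]        # concatenation builds a new list

print(len(li), position, li)
```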
Python (Strings)
• Provides many string manipulation methods
>>> print "uc" + "berkeley"
ucberkeley
>>> li = ['a', 'b', 'c', 'd']
>>> s = ";".join(li)
>>> s
'a;b;c;d'
>>> s.split(";")
['a', 'b', 'c', 'd']
• Strings can be subscripted (indexed)
– Can use some list style methods
>>> mystring = "jolly good"
>>> mystring[1:5]
'olly'
• String formatting (the % operator)
>>> print "this is a %s course" % ("NLP")
this is a NLP course
>>> print "this is a %s course in fall%d" % ("NLP", 9)
this is a NLP course in fall9
>>> print "this is a %(course)s course" % {'course': "NLP"}
this is a NLP course
Source : python.org
A few methods of a string str
• len(str)
• str.capitalize()
• str.count(sub[, start[, end]])
• str.find(sub[, start[, end]])
• str.replace(old, new[, count])
• str.strip([chars])
• str.split([sep[, maxsplit]])
• …
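A short sketch of the string methods listed above, applied to a made-up sentence. It assumes Python 3, where print is a function; the method calls themselves behave the same way in Python 2.

```python
text = "  applied natural language processing  "

clean = text.strip()                      # drop leading/trailing whitespace
words = clean.split()                     # split on runs of whitespace
title = clean.capitalize()                # upper-case the first character only
n_a = clean.count("a")                    # non-overlapping occurrences of "a"
where = clean.find("language")            # index of the first match, -1 if absent
short = clean.replace("natural language", "NL", 1)   # replace at most one match
summary = "%s has %d words" % (short, len(words))    # % formatting, as on the slide
print(summary)
```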
Python (Mapping objects)
• A mapping object maps hashable values to arbitrary objects.
• Mappings are mutable objects.
• There is currently only one standard mapping type, the dictionary.
• Creating dictionaries
– comma-separated list of key: value pairs within braces
>>> mydict = {'nlp': 42577, 256: 'applied_nlp'}
>>> mydict[256]
'applied_nlp'
– Using the constructor of the built-in dict class
dict(one=2, two=3)
dict({'one': 2, 'two': 3})
dict(zip(('one', 'two'), (2, 3)))
dict([['two', 3], ['one', 2]])
Source : python.org
A few methods of a dictionary d
• len(d)
• d[key]
• d[key] = value
• del d[key]
• key in d
• d.clear()
• d.copy()
• d.get(key[, default])
• d.items()
• d.iteritems()
• …
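Dictionaries are the natural tool for counting words, a task that comes up constantly in NLP. The sketch below uses a made-up sentence and a few of the methods above. It assumes Python 3, where d.items() returns a view and d.iteritems() no longer exists.

```python
tokens = "the cat sleeps in the box and the child eats".split()

counts = {}
for tok in tokens:
    # get() returns the default (0) for words not seen yet
    counts[tok] = counts.get(tok, 0) + 1

has_cat = "cat" in counts      # membership test on keys
del counts["and"]              # remove one entry

# Sort by descending count, then alphabetically to break ties.
by_freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(by_freq[:3])
```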
Submission for assignment 1
For Assignment 1 (see also web site)
• create a file LastNameFirstName_assignment1.py
• This is the main file where all your code will reside.
• We will evaluate each question/sub-question as
$ python LastNameFirstName_assignment1.py question1
$ python LastNameFirstName_assignment1.py question1.1
• Add logic to your code based on the command line argument (process your command
line argument string) and output accordingly. The command line arguments in Python
are accessed through the sys.argv list. You can also use the getopt module.
• Make sure you include this header information at the beginning of your code
#! /usr/bin/env python
#author: 'Your name'
#email = 'your email address'
#python_version = 'python version you are using'
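One possible way to organize the command-line logic is a dictionary that maps each argument to a function. This is only a sketch: the function names, the returned strings and the usage message are hypothetical, and the assignment does not require this particular layout.

```python
import sys

def question1():
    return "answer to question 1"      # placeholder output

def question1_1():
    return "answer to question 1.1"    # placeholder output

# Map the command-line argument to the function that answers it.
HANDLERS = {"question1": question1, "question1.1": question1_1}

def main(argv):
    """Return the text to print; argv[0] is the script name, argv[1] the question."""
    if len(argv) != 2 or argv[1] not in HANDLERS:
        return "usage: %s [%s]" % (argv[0], "|".join(sorted(HANDLERS)))
    return HANDLERS[argv[1]]()

if __name__ == "__main__":
    print(main(sys.argv))
```

Keeping the dispatch in main(argv) rather than reading sys.argv everywhere makes it easy to try each question from the interpreter, for example main(["script.py", "question1"]).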
For question on the homework, please email [email protected]
email your assignment to [email protected] and [email protected]
Related Fields
• NLP
• Linguistics
– All about languages
• Computational Linguistics
– Using computational methods to learn more about how language
works
• Speech Recognition
– Mapping audio signals to text
– Two components: acoustic models and language models
– Language models belong to the domain of statistical NLP
• Cognitive Science
– Figuring out how the human brain works, including language
Linguistics essentials
• Important distinction:
– study of language structure (grammar)
– study of meaning (semantics)
• Grammar
– Phonology (the study of sound systems and abstract
sound units).
– Morphology (the formation and composition of words)
– Syntax (the rules that determine how words combine
into sentences)
• Semantics
– The study of the meaning of words (lexical semantics)
and fixed word combinations (phraseology), and how
these combine to form the meanings of sentences
http://en.wikipedia.org/wiki/Linguistics
Linguistics sub-fields
• Discourse analysis
– concerned with the structure of texts and
conversations
• Pragmatics
– concerned with how meaning is transmitted
based on a combination of linguistic
competence, non-linguistic knowledge, and
the context of the speech act.
Linguistics sub-fields
• Evolutionary linguistics
– origins of language
• Historical linguistics
– explores language change
• Sociolinguistics
– looks at the relation between linguistic variation and social structures
• Psycholinguistics
– explores the representation and functioning of language in the mind
• Neurolinguistics
– looks at the representation of language in the brain
• Language acquisition
– how children acquire their first language and how children and adults
acquire and learn their second and subsequent languages
• And others:
– for an overview see http://en.wikipedia.org/wiki/Linguistics
Adapted from http://en.wikipedia.org/wiki/Linguistics
Linguistics essentials
• This course:
• Some grammar
• Mostly “semantics”
Grammar: words
• Words of a language are grouped into classes to reflect
similar syntactic behaviors
• Syntactic or grammatical categories (aka parts of speech)
– Nouns (people, animals, concepts)
– Verbs (actions, states)
– Adjectives
– Prepositions
– Determiners
• Open or lexical categories (nouns, verbs, adjectives)
– Large number of members, new words are commonly added
• Closed or functional categories (prepositions,
determiners)
– Few members, clear grammatical use
Grammar: words
• Word categories are related by
morphological processes
– -s for plural nouns
– -ed for verbs’ past forms
– Next class
– Why important for NLP?
– More important for some languages
• English regular verbs have 4 forms (at most 8 in
irregular verbs)
• Finnish verbs have 10,000 forms
Grammatical categories
• Nouns typically refer to entities in the
world like people, animals, things, ideas.
• Types of inflection
– Number
– Gender
– Case (nominative, genitive, accusative,
dative)
• Pronouns: variables to refer to an entity
previously mentioned
Grammatical categories: Verbs
• Usually denote an action (bring, read), an
occurrence (decompose, glitter), or a state of
being (exist, stand).
• Depending on the language, a verb may vary in
form according to many factors, possibly
including its tense, aspect, mood and voice.
• It may also agree with the person, gender, and/or
number of some of its arguments (subject, object,
etc.)
Verbs’ factors
• Tense: time of the action
– Present, past, future
• Mood: signals modality (possibility and
necessity)
– Realis mood
• The state is known (John is sick)
– Irrealis mood
• Indicates that a certain situation or action is not known to
have happened as the speaker is talking.
• John may/must be sick
Verbs’ factors
• Aspect
– Defines the temporal flow (or lack thereof) in
the event or state.
– Habitual aspect
• I eat, I have eaten, I ate, I had eaten
– Progressive, or continuous, aspect
• I am eating, I have been eating, I was eating, I had
been eating
Verbs’ factors
• Voice
– Describes the relationship between the action
(or state) that the verb expresses and the
participants identified by its arguments
(subject, object, etc.).
– Active voice: when the subject is the agent or
actor of the verb (the cat ate the mouse)
– Passive voice: when the subject is the patient,
target or undergoer of the action (the mouse
was eaten by the cat)
Other grammatical categories
• Adverbs
• Prepositions
– In, on, over, at
• Coordinating Conjunctions
– Link 2 sentences
• and, or, but…
• She bought or leased the car
• Subordinating Conjunctions
• That, because, if…
• She said that she would lease a car
Phrase structure
• Words are organized in phrases
• Phrases: grouping of words that are
clumped as a unit
• Syntax: study of the regularities and
constraints of word order and phrase
structure
Major phrase types
• Sentence (S) (whole grammatical unit).
Normally rewrites as a subject noun phrase
and a verb phrase
• Noun phrase (NP): phrase whose head is a
noun or a pronoun, optionally accompanied
by a set of modifiers
– Head is the word that determines the syntactic
type of the phrase
– The smart student of physics with long hair
• The: determiner
• smart: adjective
• student: head noun
• of physics: complement (prepositional phrase)
• with long hair: (post) modifier (prepositional phrase)
Major phrase types
• Prepositional phrases (PP)
– Headed by a preposition and containing an NP
• She is [on the computer]
• They walked [to their school]
• Verb phrases (VP)
– Phrase whose head is a verb
• [Getting to school on time] was a struggle
• He [was trying to keep his temper]
• That woman [quickly showed me the way to hide]
Phrase structure grammar
• Syntactic analysis of sentences
– (Ultimately) to extract meaning:
• Mary gave Peter a book
• Peter gave Mary a book
• Rewrite rules
– Category → category* (i.e. the symbol on the
left side can be rewritten as the sequence of
symbols on the right side)
– Start symbol is S (for sentence)
Phrase structure grammar
Rules
• S → NP VP
• NP → AT NN
• NP → NP PP
• VP → VP PP
• VP → VP
• PP → IN NP
Lexicon
• AT → the
• NN → child
• NN → cat
• NN → box
• VP → sleep
• VP → eat
• IN → in
• IN → of
Example sentences
• The cat sleeps
• The cat sleeps in the box
• NO: The cat hopes she can sleeps in the box
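To see how rewrite rules produce a sentence, the sketch below applies a sequence of rules from this slide, one at a time, starting from S. It is only an illustration: the RULES table and the helper names are made up, and the rule VP → sleeps is an assumption added so that the example sentence "the cat sleeps" can be derived (the lexicon lists the base form "sleep").

```python
# Rewrite rules as name -> (left-hand side, right-hand side).
# "VP -> sleeps" is an assumption; the slide's lexicon has "sleep".
RULES = {
    "S -> NP VP": ("S", ["NP", "VP"]),
    "NP -> AT NN": ("NP", ["AT", "NN"]),
    "AT -> the": ("AT", ["the"]),
    "NN -> cat": ("NN", ["cat"]),
    "VP -> sleeps": ("VP", ["sleeps"]),
}

def rewrite_leftmost(symbols, rule_name):
    """Apply one rule to the leftmost occurrence of its left-hand side."""
    lhs, rhs = RULES[rule_name]
    i = symbols.index(lhs)     # raises ValueError if the rule does not apply
    return symbols[:i] + rhs + symbols[i + 1:]

def derive(rule_names, start="S"):
    """Return every intermediate string of symbols in the derivation."""
    steps = [[start]]
    for name in rule_names:
        steps.append(rewrite_leftmost(steps[-1], name))
    return steps

steps = derive(["S -> NP VP", "NP -> AT NN", "AT -> the",
                "NN -> cat", "VP -> sleeps"])
for s in steps:
    print(" ".join(s))
```

Each printed line is one step of the derivation, from the start symbol down to a string made only of words.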
Context free grammars
• The rewrite rules depend solely on the
category and not on any surrounding
context: Context Free Grammar
• Main problems:
– Identify these grammars for natural languages
(linguistics)
– Given the grammar, identify the phrase
structures of sentences (NLP, parsing)
Phrase structure parsing
• Parsing: the process of reconstructing the
derivation(s) or phrase structure trees that
give rise to a particular sequence of words
• A parse is a phrase structure tree
– New art critics write reviews with computers
Phrase structure parsing &
ambiguity
• The children ate the cake with a spoon
• PP Attachment Ambiguity
• Why is it important for NLP?
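The ambiguity can be made concrete by counting parse trees. The sketch below is a minimal CKY-style chart that counts the trees a small grammar assigns to a sentence. The grammar and lexicon here are assumptions written for this one example (they are not the grammar from the earlier slide), and every rule is binary so the chart stays simple.

```python
from collections import defaultdict

# Hypothetical toy grammar: (parent, left child, right child).
BINARY = [
    ("S", "NP", "VP"),
    ("NP", "AT", "NN"),
    ("NP", "NP", "PP"),    # PP attaches to the noun phrase
    ("VP", "VBD", "NP"),
    ("VP", "VP", "PP"),    # PP attaches to the verb phrase
    ("PP", "IN", "NP"),
]
LEXICON = {"the": "AT", "a": "AT", "children": "NN", "cake": "NN",
           "spoon": "NN", "ate": "VBD", "with": "IN"}

def count_parses(sentence, root="S"):
    """Count the distinct parse trees rooted in `root` for the sentence."""
    words = sentence.lower().split()
    n = len(words)
    # chart[i][j][X] = number of trees with root X spanning words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        if w in LEXICON:
            chart[i][i + 1][LEXICON[w]] += 1
    for width in range(2, n + 1):              # shorter spans are filled first
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point between the children
                for parent, left, right in BINARY:
                    chart[i][j][parent] += (chart[i][k].get(left, 0)
                                            * chart[k][j].get(right, 0))
    return chart[0][n].get(root, 0)

print(count_parses("The children ate the cake with a spoon"))
```

Under this grammar the sentence gets two trees: one where "with a spoon" modifies "ate" (the spoon is the instrument) and one where it modifies "the cake". A parser has to choose between them, which is why attachment ambiguity matters for NLP.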
Semantics
• Semantics is the study of the meaning of
words, constructions and utterances
1. Study of the meaning of individual words
(lexical semantics)
2. Study of how meanings of individual
words are combined into the meaning of
sentences (or larger units)
Lexical semantics
• How words are related to each other
• Hyponymy
– scarlet, vermilion, carmine, and crimson are all
hyponyms of red
• Hypernymy
• Antonymy (opposite)
– Male, female
• Meronymy (part of)
– Tire is a meronym of car
• Etc..
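These relations can be stored as simple lookup tables. The sketch below encodes only the examples from this slide; the table and function names are hypothetical, and a real lexical resource would be far larger.

```python
# Relations from the slide, stored as word -> related word.
HYPERNYM_OF = {"scarlet": "red", "vermilion": "red",
               "carmine": "red", "crimson": "red"}
WHOLE_OF = {"tire": "car"}    # meronym -> the whole it is part of

def hyponyms(word):
    """Hyponymy is the inverse of hypernymy: collect words whose hypernym is `word`."""
    return sorted(w for w, h in HYPERNYM_OF.items() if h == word)

def is_meronym(part, whole):
    """True if `part` is recorded as a part of `whole`."""
    return WHOLE_OF.get(part) == whole

print(hyponyms("red"))
print(is_meronym("tire", "car"), is_meronym("car", "tire"))
```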
Semantics: beyond individual words
• Once we have the meaning of the
individual words, we need to assemble
them to get the meaning of the whole
sentence
• Hard because natural language does not
obey the principle of compositionality by
which the meaning of the whole can be
predicted by the meanings of the parts
Semantics: beyond individual words:
complications
• Collocations
– White skin, white wine, white hair
• Idioms: meaning is opaque
– Kick the bucket
• Scope
– Everyone didn’t go to the movie
1. Everyone’s scope is over not (i.e. not one person
went to the movie)
2. Negation not has scope over everyone (at least one
person didn’t go)
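The two readings can be written in first-order logic, which makes the difference in scope explicit. The predicate name went is an assumption chosen for this illustration, with x ranging over the people under discussion.

```latex
% Reading 1: "everyone" has scope over "not" (nobody went)
\forall x \, \neg \mathit{went}(x)

% Reading 2: "not" has scope over "everyone" (at least one person did not go)
\neg \forall x \, \mathit{went}(x) \;\equiv\; \exists x \, \neg \mathit{went}(x)
```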
Semantics: beyond individual words
• Discourse
• Anaphoric relations
– Mary helped Peter get out of the cab. He
thanked her. [He and Peter are the same
person, her and Mary too]
Next class
• Syntax of words
• Morphology
• Stemming
– Collapse related morphological forms to the original
lexeme
– Sit, sits, sitting, sat → lexeme: sit
• Tokenization
– Divide text into units (words, numbers etc)
• Word segmentation
– For languages with no spaces between words
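As a preview, stemming and tokenization can each be approximated in a few lines. This is a toy sketch with hypothetical names, not the method of any particular library: the suffix list and the irregular-form table cover only the "sit" example above, and the tokenizer is a single regular expression.

```python
import re

IRREGULAR = {"sat": "sit"}    # irregular forms need a lookup table
SUFFIXES = ("ing", "s")       # toy suffix list, far from complete

def toy_stem(word):
    """Map a word form to a stem by table lookup or suffix stripping."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # undo consonant doubling: "sitt" -> "sit"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

def tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print([toy_stem(w) for w in ["Sit", "sits", "sitting", "sat"]])
print(tokenize("The cat sleeps in the box."))
```

Rules this simple break quickly on real text (for example, the stemmer would turn "class" into "clas", and the tokenizer splits "didn't" into three pieces), which is part of the motivation for the next class.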