771Lec01-Overview

Download Report

Transcript 771Lec01-Overview

CSCE 771
Natural Language Processing
Lecture 1
Overview
Topics

Overview
Readings: Chapters 1,2
January 14, 2013
Overview
Pragmatic issues
Course Plans
 Foundation for research
Today



–2–
Challenge of 2001’s HAL
Areas of Research
Examples of Language Processing
CSCE 771 Spring 2013
NLP Why Should You Care?
Two trends
An enormous amount of knowledge is now available in
machine readable form as natural language text
2.
Conversational agents are becoming an important form of
human-computer communication
Much of human-human communication is now mediated by
computers
1.
–3–
Slide from: Speech and Language Processing Jurafsky and Martin
CSCE 771 Spring 2013
Commercial World
Lot’s of exciting stuff going on…
Powerset
–4–
Slide from: Speech and Language Processing Jurafsky and Martin
CSCE 771 Spring 2013
Commercial World
Lot’s of exciting stuff going on…
–5–
CSCE 771 Spring 2013
Google Translate
–6–
Slide from: Speech and Language Processing Jurafsky and Martin
CSCE 771 Spring 2013
Google Translate
–7–
Slide from: Speech and Language Processing Jurafsky and Martin
CSCE 771 Spring 2013
Web Q/A
–8–
Slide from: Speech and Language Processing Jurafsky and Martin
CSCE 771 Spring 2013
HAL 9000 of 2001: A Space Odyssey
A scene from Arthur Clarke and Stanley Kubrick’s 2001
DAVE:
Open the pod bay doors, HAL.
HAL:
I’m sorry Dave, I’m afraid I can’t do that.
Notes on Context:


–9–
HAL is the main computer on the spaceship
HAL is paranoid and decides to kill off the crew
CSCE 771 Spring 2013
Clarke a little too Optimistic
 We don’t have a HAL today in 2009.
 How close are we?




Computers replaced bank tellers (in many instances)
But the NASA computers don’t talk yet
Microsoft XP/Vista’s voice commands
Adobe Reader reading PDF documents
 But can they understand spoken commands?
– 10 –
CSCE 771 Spring 2013
Challenges in developing HAL
So what are the major challenges in developing HAL?
 Speech recognition
 Natural Language understanding
 Information retrieval
 Information extraction
 Inference
 Speech generation
– 11 –
CSCE 771 Spring 2013
Samples of Language Processing
Text processing (in Unix)
 wc – word count
 grep regexpr files - print lines in the files that match re
 find
More knowledgeable processing
 spelling checking/correcting
 grammar checking
 Information retrieval

– 12 –
Find all documents on decomposition by David Parnas
CSCE 771 Spring 2013
Even More knowledgeable processing
 Information extraction

Reading the “online” Wall Street Journal
 What was the dividend paid by GM last year?

USC Handbook
 How many hours does it take to get a PhD in CSE?
 Machine translation





– 13 –
The spirit is willing but the body is weak.
To Russian: Sprit охотно готово но тело слабо.
Back to English: Vodka is good but the meat is rotten. (Rich 86)
Babelfish - http://world.altavista.com/tr
Back to English: Sprit is willingly prepared but body weakly.
CSCE 771 Spring 2013
Even Deeper Understanding
Email access over the phone


Respond to commands “list all emails from Bob”
Read email message 8
 Text to speech
Assistants

– 14 –
Agents reading the net summarizing a topic
CSCE 771 Spring 2013
Subcategories of Knowledge in S&L
 Phonetics/phonology
 Morphology – shape and behavior of words in
contexts
 Syntax – the legitimate sequences of words
 Semantics – the meanings of words, phrases,
sentences and documents
 Pragmatics – the appropriate use of language –
politeness, direct/indirectness
 Discourse conventions – correctly structuring
conversations
– 15 –
CSCE 771 Spring 2013
Ambiguity: I made her duck.
1. .
2. .
3. .
4. .
5. .
– 16 –
CSCE 771 Spring 2013
Word Ambiguity
Her – who is this?
Made

Verb with meanings: 1) create
2) cook
3) force
Duck


Noun: the waterfowl, the food
Verb
So how do we resolve this sentence?
– 17 –
CSCE 771 Spring 2013
Turing Test
Computer simulate intelligence
– 18 –
http://en.wikipedia.org/wiki/Turing_test
CSCE 771 Spring 2013
The Chinese room
John Searle's 1980 paper Minds, Brains, and Programs
proposed an argument against the Turing Test
known as the "Chinese room" thought experiment.
Searle argued that software (such as ELIZA) could
pass the Turing Test simply by manipulating
symbols of which they had no understanding.
Without understanding, they could not be described as
"thinking" in the same sense people do.
Loebner Prize – competition since 1991 to best attempt
at passing Turing Test
– 19 –
http://en.wikipedia.org/wiki/Turing_test
CSCE 771 Spring 2013
Loebner Prize
The prizes for each year include:
$2,000 for the most human seeming of all bots for that
year - awarded every year
$25,000 for the first bot that judges cannot distinguish
from a real human in a text-only based Turing Test
(awarded once only)
$100,000 to the first bot that judges cannot distinguish
from a real human in a Turing Test that includes
deciphering and understanding text, visual, auditory
(and tactile?) input.
http://en.wikipedia.org/wiki/Loebner_prize
http://www.loebner.net/Prizef/loebner-prize.html
– 20 –
CSCE 771 Spring 2013
Finite Automata arose in the 1950’s
1936 Turing’s model of algorithmic computation
1943 McCulloch-Pitts model of the neuron
1951, 1956 Kleene first introduced finite automata and
regular expressions
1959 Rabin and Scott - Nondeterministic finite automata
1968 Thompson first to compile regular expressions into
an editor for text searching
– 21 –
CSCE 771 Spring 2013
Key Concepts #1 Formal Language
A formal language is a set of strings (finite) from a finite
alphabet.
Key Concept #1: A model that can both recognize and
generate all and only the strings of a formal language
acts as a definition of the language.
L(re) = L(Mnfa)
Formal languages are not the same as natural
languages.
Linguists are generally more interested Generative
Grammars, CS are more interested in recognizing.
– 22 –
CSCE 771 Spring 2013
Formal Languages
Alphabet: Σ (finite set of symbols)
Strings:


s = c1c2 … cn (finite sequence of characters)
Length | s | = n
Language:

a language is a set of strings
Example languages over Σ = {a, b, c}
– 23 –
CSCE 771 Spring 2013
Regular Expressions
.
– 24 –
CSCE 771 Spring 2013
Regular Expression Examples
– 25 –
CSCE 771 Spring 2013
Finite Automata to recognize a Language
– 26 –
CSCE 771 Spring 2013
CSCE 531 – Overview in one slide
% flex lang.l
// lex.yy.c
% bison lang.y
// lang.c
% gcc lex.yy.c lang.c –o parse
Input source
program
% parse input
lang.l
FLEX
lang.y
BISON
lex.yy.c
yylex()
lang.c
yyparse()
Executable Program
– 27 –
CSCE 771 Spring 2013
Regular Expressions in Unix tools
Ken Thompson regular expressions in ed  ex  vi


Reg-expr  NFA then simulate
Global pattern match command
 g/Unix/s/Unix/UNIX/g
 g/re/print == grep
– 28 –
CSCE 771 Spring 2013
Grep family
Global match Regular Expression and Print (GREP)



grep [uU]nix f1 f2 … fn
egrep pat files
// efficient NFADFA, then execute
fgrep pat files
// fixed grep for fixed strings
Find for searching directories (not really reg expr)


– 29 –
find dir –name pat
// search for files with name matching pat
find dir -exec grep pat {}
//search in files for the pattern pat
CSCE 771 Spring 2013
Editing scripts
Create a script of editing commands then execute with
ex file1 < edScript
Example:
1,$s/[uU]nix/UNIX/g
1,$s/langauge/language/g
g/^$/d
// delete empty lines ^=start of line $=end
…
w
q
– 30 –
CSCE 771 Spring 2013
Other Unix regular expression Based Tools
 sed (stream editor)
 awk
 Perl – scripting language
 Python
 Ruby
 reg_comp, reg_exec in C
– 31 –
CSCE 771 Spring 2013
Python String constants
http://docs.python.org/2/library/stdtypes.html
string.ascii_letters string.ascii_lowercase
string.ascii_uppercase string.digits - The string '0123456789'.
string.hexdigits - The string '0123456789abcdefABCDEF'.
string.letters - The specific value is updated when locale.setlocale()
is called.
string.lowercase
string.octdigits - The string '01234567'.
string.punctuation - String of ASCII characters which are
considered punctuation
string.printable
string.uppercase
string.whitespace
– 32 –
CSCE 771 Spring 2013
String Method Examples
s = "i think 771 is going great!"
print s.capitalize( )
#center( width[, fillchar])
print ':'+ s.center(44, '.') + ':‘
#count( sub[, start[, end]])
print s.count("in")
print s.count("in", 13)
print s.count("in", 3)
print s.count("in", 13, 22)
print s.count("in", 13, 15)
#decode( [encoding[, errors]])
#encode( [encoding[,errors]])
#endswith( suffix[, start[, end]])
– 33 –
CSCE 771 Spring 2013
expandtabs( [tabsize])
find( sub[, start[, end]])
index( sub[, start[, end]]) Like find(), but raise
ValueError when the substring is not found.
isalnum( )
isalpha( )
isdigit( )
– 34 –
CSCE 771 Spring 2013
rpartition( sep)
rsplit( [sep [,maxsplit]])
rstrip( [chars])
split( [sep [,maxsplit]])
splitlines( [keepends])
startswith( prefix[, start[, end]])
strip( [chars]) swapcase( )
title( )
translate( table[, deletechars])
upper( )
zfill( width)
– 35 –
CSCE 771 Spring 2013
Python re — Regular expressions
• http://docs.python.org/library/re.html
• re — Regular expression module
•
Operators (special characters)
Lookahead / lookbehind
Search vs match
•
re module contents
•
•
– 36 –
CSCE 771 Spring 2013
Python Regular Expressions
http://docs.python.org/2/library/re.html
– 37 –
CSCE 771 Spring 2013
Fundamental Re Operators in Python
RegExpr matches
c
A|B
Matches either re A or re B
AB
matches re A followed by re B
A*
matches 0 or more repetitions of the re A
(A)
– 38 –
matches the single character c
Matches re A, i.e. The re inside the
parentheses
CSCE 771 Spring 2013
Other Operators in Python
RegExpr
'.'
Matches
(Dot.) In the default mode, this matches any
character except a newline. …
“A +”
“A ?”
“ A{m} ”
“A{m,n}”
“ \c ”
Quoted character
“[chars]”
character class
– 39 –
CSCE 771 Spring 2013
Greedy Operators in Python
– 40 –
CSCE 771 Spring 2013
Non Greedy Operators in Python
RegExpr
Matches
“ A*? ”
“ A+? ”
“ A?? ”
“ A{m,n}? ”
– 41 –
CSCE 771 Spring 2013
Groups
The actual text that matches a re in parentheses is a group
can be referred to later
Meaning of special character
(A)
(?P<name>A)
(?: A)
(?P=name)
\number
Matches re A, and indicates the start
and end of a group
Matches A and names the group “name”
A non-capturing version of regular parentheses
Matches whatever text was matched by the
earlier group named name.
Matches the contents of the group of that
number.
Example: (?P<frst> [a-z]{3}) (?P=frst)
– 42 –
CSCE 771 Spring 2013
Group related
Meaning of special character
( ?# … )
A comment
(?= A)
lookahead assertion
(?! A)
negative lookahead assertion.
(?<= A)
lookbehind assertion
(?<!...)
(?(id/name)yes-pattern|no-pattern)
– 43 –
CSCE 771 Spring 2013
Positional special characters
Meaning of special character
'^'
'$'
– 44 –
(Caret.) Matches the start of the string
Matches the end of the string or just before the
newline at the end of the string
CSCE 771 Spring 2013
Positional special characters
\A Matches only at the start of the string.
\b Matches the empty string, but only at the beginning or
end of a word.
\B
\d matches any decimal digit --- \D any non-digit character
\s matches any whitespace character, equivalent to
[ \t\n\r\f\v]
--- \S
\w matches any alphanumeric character and the
underscore --- \W
\Z Matches only at the end of the string
– 45 –
CSCE 771 Spring 2013
re Module - Matching vs Searching
import re
re.match(pattern, line)
re.search(pattern, line)
>>> re.match("c", "abcdef") # No match
>>> re.search("c", "abcdef") # Match
<_sre.SRE_Match object at ...>
– 46 –
CSCE 771 Spring 2013
re.compile
re.compile(pattern[, flags])
prog = re.compile(pattern)
result = prog.match(string)
– 47 –
CSCE 771 Spring 2013
Python’s Raw String Format
What regular expression matches the two character
pattern “\\”?
•
Re = “\\\\”
Sometimes it simplifies patterns to disable the ‘\’. The
“raw” modifier changes the interpretation of ‘\’ in
regular expressions.
For instance
“\n” is an regular expression matches one character the
newline
r“\n” is a regular expression with two characters ‘\’ and
‘n’
– 48 –
CSCE 771 Spring 2013
Natural Language Toolkit
• http://nltk.org/
• interfaces to over 50 corpora and
• lexical resources such as WordNet
• suite of text processing libraries for
•
•
•
•
•
•
– 49 –
classification,
tokenization,
stemming,
tagging,
parsing, and
semantic reasoning.
CSCE 771 Spring 2013
Installing NLTK
http://nltk.org/install.html
Windows 32-bit binary installation
1. Install Python:
http://www.python.org/download/releases/2.7.3/
2. Install Numpy (optional):
http://sourceforge.net/projects/numpy/files/NumPy/1.6.
2/numpy-1.6.2-win32-superpack-python2.7.exe
3. Install NLTK: http://pypi.python.org/pypi/nltk
4. Install PyYAML: http://pyyaml.org/wiki/PyYAML
5. Test installation: Start>Python27, then type import nltk
– 50 –
CSCE 771 Spring 2013
Installing NLTK Data
http://nltk.org/nltk_data/
Run the Python interpreter and type the commands:
>>> import nltk
>>> nltk.download()
– 51 –
CSCE 771 Spring 2013
– 52 –
CSCE 771 Spring 2013
– 53 –
CSCE 771 Spring 2013
Eliza
1966 Weizenbaum – program that chatted simulating a
Rogerian psychologist
User: Men are all alike.
Eliza: IN WHAT WAY?
User: They are always bugging us about something.
Eliza: CAN THINK OF A SPECIFIC EXAMPLE
…
http://en.wikipedia.org/wiki/Eliza
http://code.google.com/p/nltk/source/browse/trunk/nltk/
nltk/chat/eliza.py?r=8479
– 54 –
CSCE 771 Spring 2013
Links and References
Eliza



http://i5.nyu.edu/~mm64/x52.9265/january1966.html
http://www-ai.ijs.si/eliza/eliza.html
http://www.strout.net/info/coding/python/ai/therapist.py
Turing Test

– 55 –
http://www.abelard.org/turpap/turpap.htm
CSCE 771 Spring 2013
IBM’s Watson
http://en.wikipedia.org/wiki/Watson_%28computer%29
– 56 –
CSCE 771 Spring 2013
Watson Architecture
http://en.wikipedia.org/wiki/Watson_%28computer%29
– 57 –
CSCE 771 Spring 2013
The Face of Watson
https://www.youtube.com/watch?v=WIKM732oEek
Text to Speech
– 58 –
CSCE 771 Spring 2013