I256: Applied Natural Language Processing

I256
Applied Natural Language
Processing
Fall 2009
Lecture 1
Introduction
Barbara Rosario
Introductions
• Barbara Rosario
– iSchool alumna (class of 2005)
– Intel Labs
• Gopal Vaswani
– iSchool master's student (class of 2010)
• You?
Today
• Introductions
• Administrivia
• What is NLP
• NLP Applications
• Why is NLP difficult
• Corpus-based statistical approaches
• Course goals
• What we’ll do in this course
Administrivia
• http://courses.ischool.berkeley.edu/i256/f09/index.html
• Books:
– Foundations of Statistical NLP, Manning and Schuetze, MIT press
– Natural Language Processing with Python, Bird, Klein & Loper,
O'Reilly (also available online)
– See Web site for additional resources
• Work:
– 4 or 5 individual coding assignments (Python & NLTK, the Natural
Language Toolkit)
– Final group project
– Participation
• Office hours:
– Barbara: Thursday 2:00-3:00 in Room 6
– Gopal: Tuesday 2:00-3:00 in Room 6 (to be confirmed)
Administrivia
• Communication:
– My email: [email protected]
– Gopal : [email protected]
– Mailing list: [email protected]
• Send an email to [email protected] with “subscribe i256”
in the body
• Through intranet
– Announcements: webpage and/or mailing list and/or Bspace
(TBA)
– Public discussion: Bspace(?)
• Related course: Statistical Natural Language
Processing, Spring 2009, CS 288
– http://www.cs.berkeley.edu/~klein/cs288/sp09/
– Instructor: Dan Klein
– Much more emphasis on statistical algorithms
• Questions?
Natural Language Processing
• Fundamental goal: deep understanding of
broad language
– Not just string processing or keyword
matching!
• End systems that we want to build:
– Ambitious: speech recognition, machine
translation, question answering…
– Modest: spelling correction, text
categorization…
Slide taken from Klein’s course: UCB CS 288 spring 09
Example: Machine Translation
NLP applications
• Text Categorization
– Classify documents by topics, language, author, spam filtering,
information retrieval (relevant, not relevant), sentiment
classification (positive, negative)
• Spelling & Grammar Corrections
• Information Extraction
• Speech Recognition
• Information Retrieval
– Synonym Generation
• Summarization
• Machine Translation
• Question Answering
• Dialog Systems
– Language generation
Why NLP is difficult
• An NLP system needs to answer the question
“who did what to whom”
• Language is ambiguous
– At all levels: lexical, phrase, semantic
– Iraqi Head Seeks Arms
• Word sense is ambiguous (head, arms)
– Stolen Painting Found by Tree
• Thematic role is ambiguous: tree is agent or location?
– Ban on Nude Dancing on Governor’s Desk
• Syntactic structure (attachment) is ambiguous: is the ban or
the dancing on the desk?
– Hospitals Are Sued by 7 Foot Doctors
• Semantics is ambiguous : what is 7 foot?
Why NLP is difficult
• Language is flexible
– New words, new meanings
– Different meanings in different contexts
• Language is subtle
– He arrived at the lecture
– He chuckled at the lecture
– He chuckled his way through the lecture
– **He arrived his way through the lecture
• Language is complex!
Why NLP is difficult
• MANY hidden variables
– Knowledge about the world
– Knowledge about the context
– Knowledge about human communication techniques
• Can you tell me the time?
• Problem of scale
– Many (infinite?) possible words, meanings, contexts
• Problem of sparsity
– Statistical analysis is very difficult because most things
(words, concepts) have never been seen before
• Long range correlations
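The sparsity problem above can be sketched on a toy example. This is only an illustration under stated assumptions: both texts are invented, and the whitespace tokenizer is a stand-in for a real one. Even in this tiny sample, most word types occur once, and most of the held-out words were never seen in the "training" text.

```python
from collections import Counter

# Toy "training" and "held-out" texts (invented for illustration).
train_text = "the kids decorate the cake with the frosting and the kids eat the cake"
test_text = "the supernova swallowed the planet"


def tokenize(text):
    """Lowercase and split on whitespace; a real tokenizer would do more."""
    return text.lower().split()


train_counts = Counter(tokenize(train_text))

# Word types that occur exactly once in the training text.
hapaxes = sorted(word for word, count in train_counts.items() if count == 1)

# Held-out tokens that never appeared in the training text.
test_tokens = tokenize(test_text)
unseen = [word for word in test_tokens if word not in train_counts]
unseen_rate = len(unseen) / len(test_tokens)

print("types seen once:", hapaxes)
print("unseen held-out tokens:", unseen, "rate:", unseen_rate)
```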
Why NLP is difficult
• Key problems:
– Representation of meaning
– Language presupposes knowledge about the
world
– Language only reflects the surface of
meaning
– Language presupposes communication
between people
Meaning
• What is meaning?
– Physical referent in the real world
– Semantic concepts, characterized also by relations.
• How do we represent and use meaning?
– I am Italian
• From lexical database (WordNet)
• Italian = a native or inhabitant of Italy; Italy = republic in southern
Europe [..]
– I am Italian
• Who is “I”?
– I know she is Italian/I think she is Italian
• How do we represent “I know” and “I think”?
• Does this mean that I is Italian? What does it say about the “I” and
about the person speaking?
– I thought she was Italian
• How do we represent tenses?
Corpus-based statistical
approaches to tackle NLP problems
– How can a machine understand these
differences?
• Decorate the cake with the frosting
• Decorate the cake with the kids
– Rule-based approaches, i.e. hand-coded syntactic
constraints and preference rules:
• The verb decorate requires an animate being as agent
• The object cake is formed by any of the following inanimate
entities (cream, dough, frosting…)
– Such approaches have been shown to be time-consuming
to build, do not scale up well, and are very
brittle to new, unusual, or metaphorical uses of language
• To swallow requires an animate being as agent/subject and a
physical object as object
– I swallowed his story
– The supernova swallowed the planet
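A minimal sketch of why such hand-coded rules are brittle. The word lists and the `accepts` helper are hypothetical, written only to mirror the "swallow" rule above; they are not taken from any real system.

```python
# Hand-coded lexicon (hypothetical): which nouns count as animate or physical.
ANIMATE = {"i", "kids", "dog"}
PHYSICAL = {"cake", "pill", "planet", "frosting"}

# Selectional restrictions per verb: (allowed subjects, allowed objects).
RULES = {
    "swallowed": (ANIMATE, PHYSICAL),
}


def accepts(subject, verb, obj):
    """Return True if the (subject, verb, object) triple satisfies the hand-coded rule."""
    allowed_subjects, allowed_objects = RULES[verb]
    return subject.lower() in allowed_subjects and obj.lower() in allowed_objects


literal = accepts("I", "swallowed", "pill")               # fits the rule
metaphor = accepts("I", "swallowed", "story")             # object is not physical
astronomy = accepts("supernova", "swallowed", "planet")   # subject is not animate

print(literal, metaphor, astronomy)
```

Both of the perfectly ordinary sentences from the slide are rejected, and every fix means adding yet another rule or lexicon entry by hand.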
Corpus-based statistical
approaches to tackle NLP problems
• A Statistical NLP approach seeks to solve these
problems by automatically learning lexical and
structural preferences from text collections
(corpora)
• Statistical models are robust, generalize well
and behave gracefully in the presence of errors
and new data.
• So:
– Get large text collections
– Compute statistics over those collections
– (The bigger the collections, the better the statistics)
Corpus-based statistical
approaches to tackle NLP problems
• Decorate the cake with the frosting
• Decorate the cake with the kids
• From (labeled) corpora we can learn that:
– #(kids are subject/agent of decorate) > #(frosting is subject/agent of decorate)
• From (unlabeled) corpora we can learn that:
– #(“the kids decorate the cake”) >> #(“the frosting decorates the cake”)
– #(“cake with frosting”) >> #(“cake with kids”)
– etc.
• Given these “facts” we then need a statistical model
for the attachment decision
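The counts above can be sketched on a toy unlabeled corpus. The sentences, the `phrase_count` and `attachment` helpers, and the decision rule are all assumptions made for illustration. Comparing two raw counts is a crude heuristic, not the statistical model the slide calls for.

```python
# Toy unlabeled corpus (invented sentences).
toy_corpus = [
    "the kids decorate the cake",
    "the kids decorate the tree",
    "we ate cake with frosting",
    "she baked a cake with frosting",
    "the kids like cake",
]


def phrase_count(phrase, corpus):
    """Count occurrences of the phrase as a contiguous word sequence in the corpus."""
    target = phrase.split()
    n = len(target)
    total = 0
    for sentence in corpus:
        words = sentence.split()
        total += sum(words[i:i + n] == target for i in range(len(words) - n + 1))
    return total


def attachment(noun, pp_noun, verb, corpus):
    """Guess whether 'with <pp_noun>' goes with the noun or with the verb."""
    noun_score = phrase_count(f"{noun} with {pp_noun}", corpus)  # e.g. "cake with frosting"
    verb_score = phrase_count(f"{pp_noun} {verb}", corpus)       # e.g. "kids decorate"
    if noun_score > verb_score:
        return "noun"
    if verb_score > noun_score:
        return "verb"
    return "unknown"


print(attachment("cake", "frosting", "decorate", toy_corpus))
print(attachment("cake", "kids", "decorate", toy_corpus))
```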
Corpus-based statistical approaches
to tackle NLP problems
• Topic categorization: classify the document
into semantic topics
Document 1 (Topic = sport)
The U.S. swept into the Davis Cup final
on Saturday when twins Bob and Mike
Bryan defeated Belarus's Max Mirnyi
and Vladimir Voltchkov to give the
Americans an unsurmountable 3-0 lead
in the best-of-five semi-final tie.
Document 2 (Topic = disaster)
One of the strangest, most relentless
hurricane seasons on record reached
new bizarre heights yesterday as the
plodding approach of Hurricane
Jeanne prompted evacuation orders
for hundreds of thousands of
Floridians and high wind warnings
that stretched 350 miles from the
swamp towns south of Miami to the
historic city of St. Augustine.
Corpus-based statistical approaches
to tackle NLP problems
• Topic categorization: classify the document
into semantic topics
Document 1 (sport)
The U.S. swept into the Davis
Cup final on Saturday when twins
Bob and Mike Bryan …
Document 2 (disasters)
One of the strangest, most
relentless hurricane seasons on
record reached new bizarre heights
yesterday as….
• From (labeled) corpora we can learn that:
– #(sport documents containing the word Cup) > #(disaster documents
containing the word Cup), which can serve as a feature
• We then need a statistical model for the topic
assignment
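One possible statistical model for the topic assignment is naive Bayes over word counts, sketched below. The slide does not say which model is meant, so this choice is an assumption. The training snippets are shortened, lowercased paraphrases of the two example documents plus two invented ones, and `train` and `classify` are illustrative helper names.

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus (shortened and partly invented for illustration).
labeled_docs = [
    ("sport", "the u.s. swept into the davis cup final"),
    ("sport", "the twins won the cup tie"),
    ("disaster", "hurricane jeanne prompted evacuation orders"),
    ("disaster", "high wind warnings stretched for miles"),
]


def train(docs):
    """Collect per-topic word counts, per-topic document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for label, text in docs:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def classify(text, word_counts, label_counts, vocab):
    """Return the topic with the highest naive Bayes log-score (add-one smoothing)."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(labeled_docs)
print(classify("the cup final", *model))
print(classify("hurricane warnings", *model))
```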
Corpus-based statistical
approaches to tackle NLP problems
• Feature extraction (usually linguistically
motivated)
• Statistical models
• Data (corpora, labels, linguistic
resources)
Goals of this Course
• Learn about the problems and possibilities of natural
language analysis:
– What are the major issues?
– What are the major solutions?
• At the end you should:
– Agree that language is difficult, interesting and important
– Be able to assess language problems
• Know which solutions to apply when, and how
• Feel some ownership over the algorithms
– Be able to use software to tackle some NLP tasks
– Know language resources
– Be able to read papers in the field
What We’ll Do in this Course
• Linguistic Issues
– What is the range of language phenomena?
– What are the knowledge sources that let us
disambiguate?
– What representations are appropriate?
• Applications
• Software (Python and NLTK)
• Statistical Modeling Methods
What We’ll Do in this Course
• Read books, research papers and tutorials
• Final project
– Your own ideas, or choose from some suggestions I will
provide
– We’ll talk later during the course about ideas, methods,
etc., but come talk to me if you already have some
ideas
• Learn Python
• Learn/use NLTK (Natural Language ToolKit) to
try out various algorithms
Python
Python - Simple yet powerful
The Zen of Python: http://www.python.org/dev/peps/pep-0020/
• Very clear, readable syntax
• Strong introspection capabilities
– http://www.ibm.com/developerworks/linux/library/l-pyint.html
(recommended)
• Intuitive object orientation
• Natural expression of procedural code
• Full modularity, supporting hierarchical packages
• Exception-based error handling
• Very high level dynamic data types
• Extensive standard libraries and third party modules for virtually every task
– Excellent functionality for processing linguistic data.
– NLTK is one such extensive third party module.
Source: python.org
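A small sketch of what "introspection" and "high level dynamic data types" look like in practice. It uses only built-in functions; the example string is arbitrary.

```python
text = "Natural Language Processing"

# Introspection: ask the str type which public methods it offers.
string_methods = [name for name in dir(str) if not name.startswith("_")]

# Look a method up by name at run time, then call it.
has_split = hasattr(text, "split")
split_method = getattr(text, "split")
words = split_method()

# Ask an object for its type.
type_name = type(words).__name__

print(has_split, words, type_name)
print(len(string_methods), "public string methods")
```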
NLTK
• NLTK defines an infrastructure that can be used to build NLP programs in Python.
• It provides basic classes for representing data relevant to natural language
processing.
• Standard interfaces for performing tasks such as part-of-speech tagging, syntactic
parsing, and text classification.
• Standard implementations for each task which can be combined to solve complex
problems.

Language processing task   NLTK modules                 Functionality
Accessing corpora          nltk.corpus                  standardized interfaces to corpora and lexicons
String processing          nltk.tokenize, nltk.stem     tokenizers, sentence tokenizers, stemmers
Collocation discovery      nltk.collocations            t-test, chi-squared, point-wise mutual information
Part-of-speech tagging     nltk.tag                     n-gram, backoff, Brill, HMM, TnT
Classification             nltk.classify, nltk.cluster  decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking                   nltk.chunk                   regular expression, n-gram, named-entity
Parsing                    nltk.parse                   chart, feature-based, unification, probabilistic, dependency

(This is not the complete list.)
Source: nltk.org

Resources:
• Download at http://www.nltk.org/download
• Getting started with NLTK: Chapter 1
• NLP and NLTK talk at Google: http://www.youtube.com/watch?v=keXW_5-llD0
Topics
• Text corpora & other resources
• Words (morphology, tokenization, stemming, part-of-speech, WSD,
collocations, lexical acquisition, language models)
• Syntax: chunking, PCFG & parsing
• Statistical models (esp. for classification)
• Applications
– Text classification
– Information extraction
– Machine translation
– Semantic Interpretation
– Sentiment Analysis
– QA / Summarization
– Information retrieval
Next Assignment
• Due before next class Tue Sep 1
– No turn-in
• Download and install Python and NLTK
• Download the NLTK Book Collection, as
described at the beginning of chapter 1 of the
book Natural Language Processing with Python
• Readings:
– Chapter 1 of the book Natural Language
Processing with Python
– Chapter 3 of Foundations of Statistical NLP
• Next class:
– Linguistic Essentials
– Python Introduction