NLTK (Natural Language Tool Kit) http://www.nltk.org/

Unix for Poets
(without Unix)
Unix → Python
Homework #4
• No need to buy the book
– Free online at http://www.nltk.org/book
• Read Chapter 1
– http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html
• Start with exercise 22 and go as far as you can
– Exercise 23: Solve however you like
– (no need to use for and if)
• Due Tuesday at sunrise
– Send email to [email protected]
Installing
• Chapter 01: pp. 1 - 4
– Python
– NLTK
– Data
George Miller’s Example: Erode
• Exercise: Use “erode” in a sentence:
– My family erodes a lot.
Definition
• to eat into or away; destroy by slow consumption or disintegration
Examples
– Battery acid had eroded the engine.
– Inflation erodes the value of our money.
• Miller’s Conclusion:
– Dictionary examples are more helpful than defs
George Miller: Chomsky’s Mentor & Wordnet
Introduction to Programming
Traditional (Start with Definitions)
• Constants: 1
• Variables: x
• Objects:
– lists, strings, arrays, matrices
• Expressions: 1+x
• Statements: Side Effects
– print 1+x;
• Conditionals:
– If (x<=1) return 1;
• Iteration: for loops
• Functions
• Recursion
• Streams
Non-Traditional (Start with Examples)
• Recursion
def fact(x):
    if(x <= 1): return 1
    else: return x * fact(x-1)
• Streams:
– Unix Pipes
• Briefly mentioned
– Everything else
Python
def fact(x):
    if(x <= 1): return 1
    else: return x * fact(x-1)
def fact2(x):
    result = 1
    for i in range(x):
        result *= (i+1)
    return result
• Exercise: Fibonacci in Python
– Recursion
– Iteration
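One possible solution sketch for the exercise, following the shape of fact and fact2 above. The names fib and fib2 and the convention fib(0) = 0, fib(1) = 1 are assumptions; the slide does not fix them. Written for Python 3.

```python
def fib(n):
    """Recursive version: follows the definition directly (exponential time)."""
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

def fib2(n):
    """Iterative version: carries the last two values forward (linear time)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib(i) for i in range(10)])   # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
print([fib2(i) for i in range(10)])  # same list
```

The recursive version recomputes the same subproblems many times, so the iterative one is the better choice for large n.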
Flatten: List → String
>>> def flatten(list):
        # First: list[0]; Rest: list[1:len(list)]
        if(len(list) == 1): return list[0]
        else: return list[0] + ' ' + flatten(list[1:len(list)])
flatten = split⁻¹
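The recursive flatten above can also be written with the built-in str.join, and str.split undoes it, which is the sense in which flatten is the inverse of split. A minimal sketch, assuming no token contains a space; the name flatten_join is hypothetical. Written for Python 3.

```python
def flatten_join(words):
    """Join a list of strings with single spaces."""
    return ' '.join(words)

sent1 = ['Call', 'me', 'Ishmael', '.']
flat = flatten_join(sent1)
print(flat)                      # Call me Ishmael .
print(flat.split(' ') == sent1)  # True: split recovers the original list
```

Unlike the recursive version, join also accepts an empty list and returns an empty string.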
Python Objects
Lists
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> type(sent1)
<type 'list'>
>>> sent1[0]                  # First
'Call'
>>> sent1[1:len(sent1)]       # Rest
['me', 'Ishmael', '.']
Strings
>>> sent1[0]
'Call'
>>> type(sent1[0])
<type 'str'>
>>> sent1[0][0]
'C'
>>> sent1[0][1:len(sent1[0])]
'all'
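Indexing and slicing behave the same way on lists and strings, so "first" and "rest" can be written once and applied to both. A small sketch; the helper names first and rest are hypothetical, and seq[1:] is shorthand for seq[1:len(seq)]. Written for Python 3.

```python
def first(seq):
    """First element of a list, or first character of a string."""
    return seq[0]

def rest(seq):
    """Everything after the first element; same type as the input."""
    return seq[1:]

sent1 = ['Call', 'me', 'Ishmael', '.']
print(first(sent1), rest(sent1))        # Call ['me', 'Ishmael', '.']
print(first(sent1[0]), rest(sent1[0]))  # C all
```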
Types & Tokens
Polymorphism
Polymorphism (From Wikipedia)
[Figures: Tokens vs. Types]
FreqDist
Concordances
URLs (Chapter 3)
HTML
Works with almost any URL!
>>> import nltk
>>> from urllib import urlopen
>>> url="https://jshare.johnshopkins.edu/kchurch4/public_html/teaching/103/Lecture07/WebProgramming/javascript_example_with_sounds.html"
>>> def url2text(url):
        html = urlopen(url).read()          # fetch the page
        raw = nltk.clean_html(html)         # strip the HTML markup
        tokens = nltk.word_tokenize(raw)    # split the text into tokens
        return nltk.Text(tokens)
>>> text=url2text(url)
>>> text.concordance('Nonsense')
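In NLTK 3, nltk.clean_html is no longer implemented and points users to an HTML parser such as BeautifulSoup instead. The sketch below shows the same pipeline (strip markup, tokenize, find a word in context) using only the Python 3 standard library and an inline HTML string, so it runs without a network connection. The names html2tokens and concordance, the sample HTML, and the regular-expression tokenizer are all assumptions made for illustration; they are cruder than nltk.word_tokenize and nltk.Text.concordance.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html2tokens(html):
    """Strip markup, then split into word and punctuation tokens."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    raw = " ".join(parser.chunks)
    return re.findall(r"\w+|[^\w\s]", raw)

def concordance(tokens, word, width=3):
    """Each occurrence of word with up to `width` tokens of context per side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            hits.append(" ".join(tokens[max(0, i - width): i + width + 1]))
    return hits

# Hypothetical sample page, not the URL from the slide.
sample_html = ("<html><head><style>p {color: red}</style></head><body>"
               "<p>Pure Nonsense, said the poet.</p>"
               "<script>var x = 1;</script></body></html>")
tokens = html2tokens(sample_html)
print(tokens)
print(concordance(tokens, "nonsense", width=2))
```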
An Equivalence Relation (=R)
• A Partition of S ≡ Set of Subsets of S
– Mutually Exclusive & Exhaustive
• Equivalence Classes ≡ A Partition such that
– All the elements in a class are equivalent (with respect to =R)
– No element from one class is equivalent to an element from another
• Example: Partition integers into evens & odds
• Even integers: 2,4,6…
• Odd integers: 1,3,5…
– x =R y ⇔ x has the same parity as y
• Three Properties
– Reflexive: a =R a
– Symmetric: a =R b ⇒ b =R a
– Transitive: a =R b & b =R c ⇒ a =R c
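The three properties can be checked by brute force on a small set, and the equivalence classes then fall out as a partition. A sketch for the parity example, assuming the finite set 1..8 stands in for the integers; same_parity is a hypothetical name. Written for Python 3.

```python
from itertools import product

def same_parity(x, y):
    """x =R y: x and y are both even or both odd."""
    return x % 2 == y % 2

S = range(1, 9)

reflexive = all(same_parity(a, a) for a in S)
symmetric = all(same_parity(b, a)
                for a, b in product(S, S) if same_parity(a, b))
transitive = all(same_parity(a, c)
                 for a, b, c in product(S, S, S)
                 if same_parity(a, b) and same_parity(b, c))

# Build the equivalence classes: each element joins the class of the
# first representative it is equivalent to, or starts a new class.
classes = []
for x in S:
    for cls in classes:
        if same_parity(x, cls[0]):
            cls.append(x)
            break
    else:
        classes.append([x])

print(reflexive, symmetric, transitive)  # True True True
print(classes)                           # [[1, 3, 5, 7], [2, 4, 6, 8]]
```

The two classes are mutually exclusive and together cover S, which is what makes them a partition.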
>>> from nltk.corpus import wordnet as wn
>>> for s in wn.synsets('car'): print s.lemma_names
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']
Word Net (Ch2): An Equivalence Relation
>>> for s in wn.synsets('car'): print flatten(s.lemma_names) + ': ' + s.definition
car auto automobile machine motorcar: a motor vehicle with four wheels; usually
propelled by an internal combustion engine
car railcar railway_car railroad_car: a wheeled vehicle adapted to the rails of railroad
car gondola: the compartment that is suspended from an airship and that carries
personnel and the cargo and the power plant
car elevator_car: where passengers ride up and down
cable_car car: a conveyance for passengers or freight on a cable railway
Synonymy: An Equivalence Relation?
Comments
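One way to probe the question is to test the properties on the 'car' synsets listed above. The sketch below assumes two words count as synonyms when they share at least one synset; that definition is an assumption, not something the slides state, and the check is restricted to these five synsets rather than all of WordNet. Written for Python 3.

```python
# Lemma lists copied from the wn.synsets('car') output above.
synsets = [
    ['car', 'auto', 'automobile', 'machine', 'motorcar'],
    ['car', 'railcar', 'railway_car', 'railroad_car'],
    ['car', 'gondola'],
    ['car', 'elevator_car'],
    ['cable_car', 'car'],
]

def synonyms(a, b):
    """True if a and b share at least one of the synsets above."""
    return any(a in s and b in s for s in synsets)

# A partition puts every element in exactly one class; 'car' is in all five.
memberships = sum('car' in s for s in synsets)
print(memberships)                    # 5

# Transitivity fails within this data: auto ~ car and car ~ gondola,
# but auto and gondola share no synset.
print(synonyms('auto', 'car'))        # True
print(synonyms('car', 'gondola'))     # True
print(synonyms('auto', 'gondola'))    # False
```

Under this reading, synsets overlap through ambiguous words, so synonymy over word forms is not an equivalence relation, although it may behave like one over word senses.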
A Partial Order (≤R)
• Powerset({x,y,z})
– Subsets ordered by inclusion
– a ≤R b ⇔ a ⊆ b
• Three properties
– Reflexive:
• a ≤ a
– Antisymmetric:
• a ≤ b & b ≤ a ⇒ a = b
– Transitive:
• a ≤ b & b ≤ c ⇒ a ≤ c
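The powerset example can be checked directly, since Python's <= on sets means "is a subset of". A brute-force sketch over Powerset({x,y,z}); the helper name powerset is hypothetical. Written for Python 3.

```python
from itertools import combinations, product

def powerset(items):
    """All subsets of items, as frozensets."""
    items = list(items)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

P = powerset({'x', 'y', 'z'})

reflexive = all(a <= a for a in P)
antisymmetric = all(a == b
                    for a, b in product(P, P) if a <= b and b <= a)
transitive = all(a <= c
                 for a, b, c in product(P, P, P) if a <= b and b <= c)

print(len(P), reflexive, antisymmetric, transitive)  # 8 True True True

# "Partial" means some pairs are not comparable in either direction.
x, y = frozenset({'x'}), frozenset({'y'})
print(x <= y, y <= x)                                # False False
```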
Wordnet: A Partial Order
>>> for h in wn.synsets('car')[0].hypernym_paths()[0]:
        print h.lemma_names
['entity']
['physical_entity']
['object', 'physical_object']
['whole', 'unit']
['artifact', 'artefact']
['instrumentality', 'instrumentation']
['container']
['wheeled_vehicle']
['self-propelled_vehicle']
['motor_vehicle', 'automotive_vehicle']
['car', 'auto', 'automobile', 'machine', 'motorcar']
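The path printed above is a chain from the most general synset to the most specific, so "is a kind of" can be read off by position. A sketch using the first lemma of each synset as a label; the function name is_a is hypothetical, and the check covers only this one path. WordNet as a whole contains many paths that branch, which is why the full hierarchy is a partial order and not a total one. Written for Python 3.

```python
# First lemma of each synset on the path printed above, root first.
path = ['entity', 'physical_entity', 'object', 'whole', 'artifact',
        'instrumentality', 'container', 'wheeled_vehicle',
        'self-propelled_vehicle', 'motor_vehicle', 'car']

def is_a(specific, general):
    """specific ≤ general: general sits at or above specific on this path."""
    return path.index(general) <= path.index(specific)

print(is_a('car', 'motor_vehicle'))  # True
print(is_a('car', 'entity'))         # True
print(is_a('container', 'car'))      # False
```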
Help
>>> s = wn.synsets('car')[0]
>>> s.name
'car.n.01'
>>> s.pos
'n'
>>> s.lemmas
[Lemma('car.n.01.car'), Lemma('car.n.01.auto'),
Lemma('car.n.01.automobile'),
Lemma('car.n.01.machine'),
Lemma('car.n.01.motorcar')]
>>> s.examples
['he needs a car to get to work']
>>> s.definition
'a motor vehicle with four wheels; usually propelled
by an internal combustion engine'
>>> s.hyponyms()[0:3]
[Synset('stanley_steamer.n.01'),
Synset('hardtop.n.01'), Synset('loaner.n.02')]
>>> s.hypernyms()
[Synset('motor_vehicle.n.01')]
CFGs: Context-Free Grammars (Ch8)
Ambiguity
• The Chomsky Hierarchy
– Type 0 > Type 1 > Type 2 > Type 3
– Recursively Enumerable > CS > CF > Regular
• Examples
– Type 3: Regular (Finite State):
• Grep & Regular Expressions
• Right-Branching: A → a A
• Left-Branching: B → B b
– Type 2: Context-Free (CF):
• Center-Embedding: C → … → x C y
• Parenthesis Grammars: <expr> → ( <expr> )
• w w^R (a string followed by its reverse)
– Type 1: Context-Sensitive (CS): w w (a string followed by a copy of itself)
– Type 0: Recursively Enumerable
– Beyond Type 0: Halting Problem
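The examples in the hierarchy differ in how much a recognizer has to remember. The sketch below gives plain Python membership tests for one language of each kind; these are illustrative recognizers, not grammars or parsers. The regular language a+b+ and all four function names are assumptions chosen for the example. Written for Python 3.

```python
import re

def is_regular_example(s):
    """Type 3: one or more a's then one or more b's; no counting needed."""
    return re.fullmatch(r'a+b+', s) is not None

def is_balanced(s):
    """Type 2: nested parentheses; needs an unbounded depth counter."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:       # a ')' with no matching '('
                return False
        else:
            return False        # only parentheses are in the alphabet
    return depth == 0

def is_w_wr(s):
    """Type 2: w followed by its reverse (an even-length palindrome)."""
    return len(s) % 2 == 0 and s == s[::-1]

def is_w_w(s):
    """Type 1 on the slide: w followed by an exact copy of w."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s[:half] == s[half:]

print(is_regular_example('aaabb'), is_regular_example('abab'))  # True False
print(is_balanced('(()())'), is_balanced('(()'))                # True False
print(is_w_wr('abba'), is_w_wr('abab'))                         # True False
print(is_w_w('abab'), is_w_w('abba'))                           # True False
```

The reversed copy in w w^R matches a stack's last-in, first-out order, which is why it is context-free while the straight copy w w is not.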
Summary
Chapter 1
• NLTK (Natural Lang Toolkit)
– Unix for Poets without Unix
– Unix → Python
• Object-Oriented
– Polymorphism:
• “len” applies to lists, sets, etc.
• Ditto for: +, help, print, etc.
• Types & Tokens
– “to be or not to be”
– 6 tokens & 4 types
• FreqDist: sort | uniq -c
• Concordances
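The Chapter 1 ideas can be tried on "to be or not to be" with the standard library alone. In this sketch collections.Counter stands in for nltk.FreqDist (in NLTK 3, FreqDist is a subclass of Counter), so no NLTK install is assumed. Written for Python 3.

```python
from collections import Counter

tokens = "to be or not to be".split()
types = set(tokens)
print(len(tokens), len(types))   # 6 4  (tokens, types)

# Frequency distribution: the Python analogue of sort | uniq -c
freq = Counter(tokens)
print(freq.most_common(2))       # [('to', 2), ('be', 2)]

# Polymorphism: len applies to strings, lists, sets and Counters alike.
print(len("to be"), len(tokens), len(types), len(freq))  # 5 6 4 4
```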
Chapters 2-8
• Chapter 3: URLs
• Chapter 2
– Equivalence Relations:
• Parity
• Synonymy (?)
– Partial Orders:
• Wordnet Ontology
• Chapter 8: CF Parsing
– Chomsky Hierarchy
• CS > CF > Regular