Transcript pptx

Lecture 1: Introduction
Methods in Computational
Linguistics II
Queens College
Methods in Computational
Linguistics II
• 2nd semester of a two semester course
providing instruction in
– The basics of computer science and
programming (via python)
– An introduction to techniques in computational
linguistics
1
My background
• Research
– Speech Synthesis, and Recognition
– Prosody (Intonation)
– Speech Segmentation
– Non-native speech
– Political speech, and other paralinguistics
• Computer Science professor at Queens
and CUNY GC.
• Worked at IBM and Google
2
Your Background
• Name.
• What are your research interests in
linguistics?
• How do you expect computational linguistics
to fit into your work?
– Are there techniques or applications that you are
particularly looking to learn
• Programming background?
– 1 semester? more?
• Are you simultaneously taking Language
Technologies
3
Outline
• NLTK
– Overview
– Major Capabilities
• Searching and Sorting.
–
–
–
–
Linear (Sequential) search
Binary Search
Insertion sort
MergeSort
• Course Policies
• Syllabus Review
4
NLTK
• Natural Language Toolkit.
• A set of utilities in python that facilitate the
processing of text.
5
NLTK Functionality
•
•
•
•
•
•
•
•
Accessing corpora
String processing
Collocation discovery
Part of speech tagging
Classification and Clustering
Evaluation Metrics
Chunking
Parsing
6
NLTK Functionality
• Semantic interpretation
– first order logic, lambda calculus, model
checking
• Probability and estimation
• WordNet Browsing
• Chatbots
7
NLTK as a resource
• This range of functionality is quite broad,
and not necessarily cohesive.
• However, there are resources and tools
(functions and objects) that underpin most
major computational linguistics tasks.
8
Major Computational Linguistics
Tasks
• Syntax
•
– Tagging
•
– Parsing
•
• Semantics
•
– Information Extraction
•
– Semantic Role
•
Labeling
•
• Phonology
• Sentence Processing
• Segmentation
Summarization
Speech Recognition
Speech Synthesis
Information Retrieval
Sentiment Analysis
Authorship studies
Co-reference
resolution
9
NLTK Resources
• NLTK also contained lexical material
–
–
–
–
–
–
–
–
–
–
Project Gutenberg
WordNet
Penn Treebank (subset)
Named Entity Recognition data
Inaugural addresses
Sentiment data
Names corpus
Switchboard (subset)
TIMIT
Webtext
10
Quick Assignment
• Methods I used NLTK.
• Homework 0
– Make sure that NLTK is installed and working
correctly
– Install matplotlib to use nltk’s graphing
functions.
• “Due” asap.
11
One Question Pop Quiz
• Solve for p
12
Math
• Computational Linguistics requires a notquite-trivial amount of math.
• Statistics and probabilistic modeling form the
pillars underlying these computational
techniques.
• This involves counting and algebra.
• Machine learning governs the classification
and clustering techniques that CL makes
heavy use of.
– Requires calculus, statistics, linear algebra.
13
Math in this course
• Overview of probability.
– Next class
• Algebra for evaluation, some common
features
• Statistics for Naïve Bayes classification
• Entropy in Decision Trees
14
Outline
• NLTK
– Overview
– Major Capabilities
• Searching and Sorting.
–
–
–
–
Linear (Sequential) search
Binary Search
Insertion sort
MergeSort
• Course Policies
• Syllabus Review
15
Data Structures, Algorithms, etc.
• In computer science, there is a tight
relationship between data structures and
algorithms
• In general, the more complex the data
structure
– the more general or flexible the data and
relationships that can be represented
– the faster algorithms can run
16
Searching and Sorting
• Searching and sorting is a frequent
example of the relationship between
algorithm runtimes, and data structuring.
• Search: identify the location of a value, x,
in a list, A.
• Sort: manipulate a list A, such that the
values in A are increasing. A[i] <= A[i+1]
17
Sequential Search
def search(A, x):
for i in xrange(len(A)):
if A[i] == x:
return i
return -1
18
How long does sequential search
take to run?
• Best case?
• Worst case?
• Average case?
19
Binary Search
• If the list A is in increasing order, large
chunks of the list can be be ignored.
def search(A, x):
top = len(A)
bottom = 0
while bottom < top:
mid = (top + bottom) / 2
if A[mid] < x:
bottom = mid + 1
elif A[mid] > x:
top = mid
else:
return mid
return -1
20
How long does binary search take
to run?
• Best Case?
• Worst Case?
• Average Case?
21
Improvement of Binary Search
• Binary search is a significant improvement
– log n < n
• However, Binary search requires that A is
sorted.
• How long does it take to sort an Array and
how does this impact the total runtime?
22
Insertion Sort
• Sort the list [5, 2, 4, 6, 1, 3]
def insertionSort(A):
for j in xrange(1, len(A)):
key = A[j]
i = j - 1
while i > -1 and A[i] >
key:
A[i + 1] = A[i]
i = i - 1
A[i + 1] = key
23
How long does Insertion sort take
to run?
• Best Case?
• Worst Case?
• Average Case?
24
Can we sort faster?
• Yes.
• This requires recursion.
• We’ll come back to this, but here is a first
example.
25
Merge Sort
def mergeSort(A):
if len(A) == 1:
return A
mid = len(A) / 2
Abottom = mergeSort(A[1:mid])
Atop = mergeSort(A[mid + 1:len(A)])
return merge(Abottom, Atop)
26
Merge
def merge(A, B):
C = []
i = 0
j = 0
A.append(float('inf'))
B.append(float('inf'))
for k in xrange(len(A) + len(B)):
if A[i] < B[j]:
C.append(A[i])
i = i + 1
else:
C.append(B[j])
j = j + 1
return C
27
How long does Merge Sort take to
run?
• Hint: This is a (much) harder question.
• Best Case?
• Worst Case?
• Average Case?
28
Comparison of run times
Sorting
0
n*log(n)
Searching
n
log(n)
How much searching do you need to do to make it
worth sorting?
29
Class Structure and Policies
• Course website:
– http://eniac.cs.qc.cuny.edu/andrew/methods2/syllabus.html
• Email list
– Banner does not have an email function
– Put your email address on the sign up sheet.
30