Introduction

LING 575
Week 1: 1/08/08
1
Plan for today
• General information
• Course plan
• HMM and n-gram tagger (recap)
• EM and forward-backward algorithm
2
Before next time
• Select papers that you’d like to present
– Reply to the 1st message at GoPost by noon Saturday
• Read M&S 9.3.3
– Remember to hand in your questions next time.
3
General information
4
General info
• Course url: http://courses.washington.edu/ling575x
– Syllabus (incl. slides, assignments, and papers): updated every week.
– GoPost:
– CollectIt:
• Please check your email at least once a day.
5
Office hour
• Email:
– Email address: [email protected]
– Subject line should include “ling575”
– The 48-hour rule: it works both ways
• Office hours:
– Time: Fri 10:30-11:30am
– Location: Padelford A-210G
6
Slides
• The slides will be online before class if possible.
• The final version will be uploaded a few hours after class.
7
Prerequisites
• CS 326 (Data Structures) or equivalent:
• Stat 391 (Prob. and Stats for CS) or equivalent: Basic concepts in
probability and statistics
• Programming in Perl, C, C++, Java, or Python
• LING570
• LING572
• Being comfortable with formulas
8
Grades for LING575
No midterm or final exams.
Graded:
• Assignments (5): 45-60%
• Presentation: 15-25%
Not graded:
• Reading: 5-10%
• Class participation: 10-20%
9
Assignments
• Assignments:
– Due at 2:30pm on Tuesdays
– 1% penalty for each hour after the due date. Nothing
accepted after 4 days.
– Submit via CollectIt
• Reading:
– Papers should be read before class.
– Bring at least two questions to class.
– Your answers will be checked but not graded.
10
Presentation
• Select your week by noon this Saturday (1/12) by replying to the GoPost message:
– first come, first served
• If, for whatever reason, the week you selected no longer works for you, it is your responsibility to find someone to switch with.
• For your week, email Fei the slides by noon on the Monday (i.e., the day before your presentation).
– 1% penalty for each hour after the due date.
11
Patas
• If you need a patas account, email [email protected] right away to get one.
• The directory for LING575:
~/dropbox/07-08/575x/
– hw1/, hw2/, ….: Assignments and solutions
– hmm/: A pre-existing HMM package
– misc_slides/: Solutions to exams and misc slides that are not on the course url.
12
Course plan
13
Machine learning
• Supervised learning: LING572
• Semi-supervised learning:
– Some annotated data, plus a large amount of unannotated data
– Ex: self-training, co-training, transductive SVM
• Unsupervised learning:
– There is no annotated data
– Ex: EM
14
Unsupervised learning
• No annotated data
• But the knowledge has to come from somewhere.
– Dictionary / lexicon
– Seed examples
–…
→ We choose unsupervised POS tagging as a case study.
15
Supervised POS tagging
• It is a sequence labeling problem.
• Statistical approach:
– Sequence labeling algorithms: HMM, MEMM, CRF,
…
– Classification algorithms: decision tree, naïve Bayes,
MaxEnt, SVM, Boosting, ….
• Most unsupervised POS tagging algorithms use EM to
estimate HMM parameters.
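As a recap of how an HMM tagger labels a sequence, decoding can be sketched as below. This is a minimal Viterbi sketch for a bigram HMM, not the HMM package mentioned earlier; the names `viterbi` and `_log` and all toy probabilities are made up for illustration.

```python
import math

def _log(p):
    """Log of a probability, with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a bigram HMM."""
    # best[i][t]: log-probability of the best tag path ending in tag t at position i
    best = [{} for _ in words]
    back = [{} for _ in words]
    for t in tags:
        best[0][t] = _log(start_p.get(t, 0.0)) + _log(emit_p[t].get(words[0], 0.0))
    for i in range(1, len(words)):
        for t in tags:
            # Pick the previous tag that maximizes the path score into t
            prev, score = max(
                ((p, best[i - 1][p] + _log(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda x: x[1],
            )
            best[i][t] = score + _log(emit_p[t].get(words[i], 0.0))
            back[i][t] = prev
    # Follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy parameters (assumed values, for illustration only)
tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {
    "DT": {"NN": 0.9, "VB": 0.1},
    "NN": {"VB": 0.6, "NN": 0.3, "DT": 0.1},
    "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"book": 0.6, "flight": 0.4},
    "VB": {"book": 1.0},
}

print(viterbi("book the flight".split(), tags, start_p, trans_p, emit_p))
```

In supervised tagging the three probability tables would be estimated by counting over annotated data; in the unsupervised setting they have to be estimated without tags.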
16
Major approaches to unsupervised tagging
• All assume a large amount of unannotated data
• Approach #1: use EM to estimate HMM
– No lexicon
– With full lexicon
– With filtered lexicon
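A rough sketch of Approach #1: EM (forward-backward) re-estimating HMM parameters from untagged sentences. How the lexicon enters the model is an assumption here: a word may only be emitted by the tags the lexicon lists for it, and passing `lexicon=None` gives the no-lexicon case. All names (`init_emissions`, `forward_backward`, `em_step`) and the toy corpus are hypothetical, and the code uses unscaled probabilities, so it only suits short sentences.

```python
import math

def init_emissions(vocab, states, lexicon=None):
    """Uniform emissions; with a lexicon, a tag may emit only words that list it."""
    emit = {}
    for s in states:
        allowed = [w for w in vocab
                   if lexicon is None or s in lexicon.get(w, states)]
        emit[s] = {w: 1.0 / len(allowed) for w in allowed}
    return emit

def forward_backward(obs, states, start, trans, emit):
    """E-step for one sentence: likelihood, state posteriors, transition posteriors."""
    n = len(obs)
    alpha = [dict.fromkeys(states, 0.0) for _ in range(n)]
    beta = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for s in states:
        alpha[0][s] = start[s] * emit[s].get(obs[0], 0.0)
    for i in range(1, n):
        for s in states:
            alpha[i][s] = emit[s].get(obs[i], 0.0) * sum(
                alpha[i - 1][p] * trans[p][s] for p in states)
    for i in range(n - 2, -1, -1):
        for s in states:
            beta[i][s] = sum(trans[s][q] * emit[q].get(obs[i + 1], 0.0)
                             * beta[i + 1][q] for q in states)
    z = sum(alpha[n - 1].values())  # sentence likelihood
    gamma = [{s: alpha[i][s] * beta[i][s] / z for s in states} for i in range(n)]
    xi = [{(p, q): alpha[i][p] * trans[p][q] * emit[q].get(obs[i + 1], 0.0)
           * beta[i + 1][q] / z for p in states for q in states}
          for i in range(n - 1)]
    return z, gamma, xi

def normalize(counts):
    """Turn expected counts into probabilities; all-zero rows are left as they are."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total > 0 else dict(counts)

def em_step(corpus, states, start, trans, emit):
    """One EM iteration; the returned log-likelihood is under the input parameters."""
    c_start = dict.fromkeys(states, 0.0)
    c_trans = {p: dict.fromkeys(states, 0.0) for p in states}
    c_emit = {s: {} for s in states}
    loglik = 0.0
    for obs in corpus:
        z, gamma, xi = forward_backward(obs, states, start, trans, emit)
        loglik += math.log(z)
        for s in states:
            c_start[s] += gamma[0][s]
            for i, w in enumerate(obs):
                c_emit[s][w] = c_emit[s].get(w, 0.0) + gamma[i][s]
        for step in xi:
            for (p, q), v in step.items():
                c_trans[p][q] += v
    return (normalize(c_start),
            {p: normalize(c_trans[p]) for p in states},
            {s: normalize(c_emit[s]) for s in states},
            loglik)

# Toy setup (assumed data, for illustration only)
states = ["DT", "NN", "VB"]
corpus = [["the", "book"], ["the", "flight"], ["book", "the", "flight"]]
lexicon = {"the": {"DT"}, "book": {"NN", "VB"}, "flight": {"NN"}}
vocab = sorted({w for sent in corpus for w in sent})
start = dict.fromkeys(states, 1.0 / len(states))
trans = {p: dict.fromkeys(states, 1.0 / len(states)) for p in states}
emit = init_emissions(vocab, states, lexicon)

logliks = []
for _ in range(5):
    start, trans, emit, ll = em_step(corpus, states, start, trans, emit)
    logliks.append(ll)
print(logliks)
```

Because EM never moves a zero probability away from zero, emissions ruled out by the lexicon stay ruled out, and the log-likelihood should not decrease from one iteration to the next.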
17
Major approaches (cont)
• Approach #2: clustering the words based on
– distributional cues
– morphological cues
• Approach #3: cross-lingual approach:
– It requires parallel data
– Seeds are created by projecting POS info
from one language to the other.
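The projection step in Approach #3 can be sketched as follows, assuming word alignments are already available as (source index, target index) pairs. The function name `project_tags`, the alignment format, and the English-French example are assumptions for illustration; real systems also have to handle noisy and many-to-many alignments.

```python
def project_tags(src_tags, alignment, tgt_len):
    """Copy POS tags from source words to aligned target words.

    alignment: list of (source_index, target_index) pairs.
    Unaligned target words get None; the first alignment link for a target word wins.
    """
    tgt_tags = [None] * tgt_len
    for s, t in alignment:
        if tgt_tags[t] is None:
            tgt_tags[t] = src_tags[s]
    return tgt_tags

# Hypothetical example: "the book" (DT NN) aligned to "le livre"
print(project_tags(["DT", "NN"], [(0, 0), (1, 1)], 2))
```

The projected tags are noisy seeds rather than gold labels, which is why they are used to start learning rather than taken as the final output.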
18
Major approaches (cont)
• Approach #4: Prototype learning:
– It requires a small number of prototypes: e.g.,
“book” is a noun, “the” is a determiner.
– Prototypes would help to label other words.
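One way prototypes could help label other words, sketched under a strong assumption: each word takes the tag of the prototype whose left/right contexts look most similar (a distributional cue). The slides do not say how prototype information is propagated, so the similarity measure, the function names, and the toy corpus below are all hypothetical.

```python
import math
from collections import Counter

def context_vectors(sentences):
    """Count the left and right neighbors of every word."""
    vecs = {}
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            v = vecs.setdefault(padded[i], Counter())
            v["L=" + padded[i - 1]] += 1
            v["R=" + padded[i + 1]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def label_by_prototype(word, vecs, prototypes):
    """Give `word` the tag of its most similar prototype word."""
    best = max(prototypes, key=lambda p: cosine(vecs[word], vecs[p]))
    return prototypes[best]

# Toy data (assumed, for illustration only)
sentences = [["the", "book", "fell"], ["the", "flight", "left"],
             ["a", "book", "fell"], ["a", "flight", "left"]]
prototypes = {"book": "NN", "the": "DT"}
vecs = context_vectors(sentences)
print(label_by_prototype("flight", vecs, prototypes))
print(label_by_prototype("a", vecs, prototypes))
```

Here "flight" shares its left contexts with the prototype "book", and "a" shares its contexts with the prototype "the", so each inherits the matching prototype's tag.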
19
In this course
• We will
– discuss the papers in each category
– explore various methods aimed at improving the state of the art.
• Compared to last year’s LING573, this course focuses
– more on machine learning
– less on search and rule writing
20