Ruifang's Lecture on NLP Tools
Introduction to NLP
Tools
09/23/2003
1
Motivation
• Machine Translation
– From English to French
• What’s needed?
2
Motivation Cont’d (1)
• Syntactic parser
• Part-Of-Speech Tagger
– Example: NP -> adj noun
• Morphological Analyzer
– Example: “tools” -> “tool”
“Who is he?” -> “Who is he ?”
• Semantic Analyzer
– Word sense disambiguation (“wash dishes”)
– Choose the correct translation
3
Motivation Cont’d (2)
• Lexicons
– The information of the word
How many senses? What are the possible translations
of the word?
• Corpus
– Useful for learning a tool
– Useful for evaluation
4
Outline
• Lexicons
• Text corpora
• Morphological tools
• Part-Of-Speech(POS) taggers
• Syntactic parsers
• Semantic knowledge bases and semantic
parser
• Speech tools
5
Lexicons
• Definition
– A repository for words
• Lexicons in LDC (Linguistic Data
Consortium)
– creating and sharing linguistic resources: data,
tools and standards.
• CELEX
• WordNet
6
CELEX
• Dutch Center for Lexical Information
• Lexical databases of English, Dutch and German
• 21,000 nouns, 8,000 adjectives and 6,000 verbs
• English:
– English Orthography, Lemmas
– English Phonology, Lemmas
– English Morphology, Lemmas
– English Syntax, Lemmas
– English Frequency, Lemmas
– English Orthography, Wordforms
– English Phonology, Wordforms
– English Morphology, Wordforms
– English Frequency, Wordforms
– English Corpus Types
– English Frequency, Syllables
7
WordNet
• A database of lexical relations
• Inspired by current psycholinguistic
theories of human lexical memory
• Synset: a set of synonyms, representing one
underlying lexical concept
– Example:
• fool {chump, fish, fool, gull, mark, patsy, fall guy,
sucker, schlemiel, shlemiel, soft touch, mug}
• Relations link the synsets: hypernym, HasMember, Member-Of, Antonym, etc.
8
WordNet Cont’d
• Example
pu-erh.cs.utexas.edu$ wn bike -partn
Part Meronyms of noun bike
2 senses of bike
Sense 1
motorcycle, bike
HAS PART: mudguard, splashguard
Sense 2
bicycle, bike, wheel
HAS PART: bicycle seat, saddle
HAS PART: bicycle wheel
HAS PART: chain
HAS PART: coaster brake
HAS PART: handlebar
HAS PART: mudguard, splashguard
HAS PART: pedal, treadle, foot lever
HAS PART: sprocket, sprocket wheel
• Example
pu-erh.cs.utexas.edu$ wn bike
Information available for noun bike
-hypen            Hypernyms
-hypon, -treen    Hyponyms & Hyponym Tree
-synsn            Synonyms (ordered by frequency)
-partn            Has Part Meronyms
-meron            All Meronyms
-famln            Familiarity & Polysemy Count
-coorn            Coordinate Sisters
-simsn            Synonyms (grouped by similarity of meaning)
-hmern            Hierarchical Meronyms
-grepn            List of Compound Words
-over             Overview of Senses
Information available for verb bike
-hypev            Hypernyms
-hypov, -treev    Hyponyms & Hyponym Tree
-synsv            Synonyms (ordered by frequency)
-famlv            Familiarity & Polysemy Count
-framv            Verb Frames
-simsv            Synonyms (grouped by similarity of meaning)
-grepv            List of Compound Words
-over             Overview of Senses
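To make the synset idea concrete, here is a minimal sketch of synsets linked by a HAS PART relation, filled with the data shown in the `wn bike -partn` output on this slide. The `Synset` class, the `SENSES` table and `part_meronyms` are hypothetical names for illustration only; they are not the WordNet database format or its API.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One lexical concept: a set of synonyms plus links to other synsets."""
    words: list
    parts: list = field(default_factory=list)  # HAS PART (part meronym) links


# Toy data copied from the slide's `wn bike -partn` output (not the real database).
SENSES = {
    "bike": [
        Synset(["motorcycle", "bike"],
               parts=[Synset(["mudguard", "splashguard"])]),
        Synset(["bicycle", "bike", "wheel"],
               parts=[
                   Synset(["bicycle seat", "saddle"]),
                   Synset(["bicycle wheel"]),
                   Synset(["chain"]),
                   Synset(["coaster brake"]),
                   Synset(["handlebar"]),
                   Synset(["mudguard", "splashguard"]),
                   Synset(["pedal", "treadle", "foot lever"]),
                   Synset(["sprocket", "sprocket wheel"]),
               ]),
    ]
}


def part_meronyms(word):
    """Return one (synonyms, parts) pair per sense of the word."""
    return [(s.words, [p.words for p in s.parts]) for s in SENSES.get(word, [])]


for number, (synonyms, parts) in enumerate(part_meronyms("bike"), start=1):
    print("Sense", number, ", ".join(synonyms))
    for part in parts:
        print("  HAS PART:", ", ".join(part))
```

Other relations (hypernym, antonym, and so on) could be stored the same way, as extra link lists on each synset.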
9
Corpus
• Definition
– Collections of text and speech
• LDC
• Penn Treebank
• DSO
• Hansard
10
Some of the Top Corpora from LDC
• TIPSTER
– Information Retrieval, Data Extraction datasets
– TIPSTER project, TREC project
• TIMIT Acoustic-Phonetic Continuous Speech Corpus
– A corpus of read speech designed to provide speech data for the
acquisition of acoustic-phonetic knowledge
– Useful for the development and evaluation of automatic speech
recognition systems
• ECI (European Corpus Initiative Multilingual Corpus): multilingual
electronic text corpus
• NTIMIT
– A phonetically balanced, continuous-speech, telephone-bandwidth
speech database
11
Penn Treebank
• A collection of corpora
• Tagged with POS, Syntactic roles,
predicate/argument structure, dysfluency
annotation
• How are they made?
– Hand correction of the output of an errorful automatic
process
• 3 million words
– 1 million words tagged with predicate/argument
structure for extracting semantic knowledge
12
Penn Treebank Cont’d
• Corpora
– Wall Street Journal
– ATIS (Air Travel Information System)
– Brown Corpus
– IBM Manual Sentences
– Library of America Texts: Mark Twain, Henry Adams, Herman Melville ...
– MUC-3 Messages
• Example:
( (S (NP-SBJ Rally 's)
(VP operates
and
franchises
(NP (NP (QP about 160)
fast-food restaurants)
(PP-LOC throughout
(NP the U.S))))
Seeking/VBG to/TO block/VB
[ the/DT investors/NNS ]
from/IN buying/VBG
[ more/JJR shares/NNS ]
./.
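The POS-tagged line in the example above uses a `word/TAG` format with square brackets marking noun chunks. A minimal sketch of reading that format into (word, tag) pairs follows; `parse_tagged` is a hypothetical helper name, and the code assumes tokens are separated by whitespace as on the slide.

```python
def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs, skipping chunk brackets."""
    pairs = []
    for token in text.split():
        if token in ("[", "]"):
            continue  # brackets only mark chunk boundaries
        # Split on the LAST slash so that the token "./." gives (".", ".").
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sentence = ("Seeking/VBG to/TO block/VB [ the/DT investors/NNS ] "
            "from/IN buying/VBG [ more/JJR shares/NNS ] ./.")
for word, tag in parse_tagged(sentence):
    print(word, tag)
```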
13
DSO
• Word Sense Corpus
– Contains sentences in which about 192,800
word occurrences have been tagged with
WordNet senses
– Taken from the Brown corpus and the Wall
Street Journal corpus
– 121 nouns and 70 verbs
14
Hansard
• Official records (Hansards) of the 36th Canadian
Parliament, in both English and French
• 1.3 million pairs of aligned sentences of English
and French
– Example
• Comme il est 14 h 30, la Chambre s'ajourne jusqu'à lundi
prochain, à 11 heures, conformément au paragraphe
24(1) du Règlement.
• It being 2.30 p.m., the House stands adjourned until Monday
next at 11 a.m., pursuant to Standing Order 24(1).
• Useful for Machine Translation
15
Morphological Tools
• PC-KIMMO
– A two-level morphological parser
• Porter Stemmer
• Penn Treebank Tokenizer
– Separate document into words
– “dog?” -> “dog ?”
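The “dog?” -> “dog ?” behaviour can be sketched with one regular expression. This is only an illustration of the idea, not the Penn Treebank Tokenizer itself, which also has to deal with abbreviations, contractions and quotes; the `tokenize` function is a hypothetical name.

```python
import re


def tokenize(text):
    """Separate common punctuation marks from words, then split on whitespace."""
    # Surround each punctuation mark with spaces so it becomes its own token.
    spaced = re.sub(r"([?!.,;:])", r" \1 ", text)
    return spaced.split()


print(tokenize("dog?"))        # ['dog', '?']
print(tokenize("Who is he?"))  # ['Who', 'is', 'he', '?']
```

A rule this simple would also break "U.S." into pieces, which is one reason real tokenizers need more than a single pattern.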
16
Porter Stemmer
• Simple algorithm that uses a set of cascaded rewrite
rules
– Example
• ATIONAL -> ATE (relational -> relate)
• Stem:
– The main morpheme of the word, supplying the main
meaning
• Fast
• Used very widely in Information Retrieval
– Run stemmer on keywords and the words in the
documents
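The rewrite-rule idea can be sketched with a handful of suffix rules. This is a simplified illustration, not the full Porter algorithm: the real stemmer runs several such rule sets one after another and checks a "measure" condition on the remaining stem before rewriting. The names `RULES` and `apply_rules` are made up for this sketch.

```python
# A few suffix rewrite rules in the style of the ATIONAL -> ATE example.
# Assumption: the stem-length ("measure") condition of the real algorithm is omitted.
RULES = [
    ("ational", "ate"),
    ("ization", "ize"),
    ("fulness", "ful"),
]


def apply_rules(word, rules=RULES):
    """Apply the first rule whose suffix matches; leave other words unchanged."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


print(apply_rules("relational"))  # relate
print(apply_rules("dog"))         # dog
```

Without the omitted condition, a short word such as "rational" would be rewritten to "rate", which shows why the real rules also look at what is left of the stem.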
17
Part-Of-Speech(POS) Taggers
• Part-Of-Speech: noun, verb, pronoun, etc.
• Brill’s Tagger
• HMM Tagger
• MXPOST
18
Brill’s Tagger
• Transformation-Based Learning (TBL) tagger
• /projects/nlp/brill-pos-tagger
• First labels every word with its most-likely tag
• Then uses learned TBL rules to correct mistakes
– Example:
• Change NN to VB when the previous tag is TO
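The two stages can be sketched as follows: assign each word its most likely tag, then apply a transformation such as the NN -> VB rule above. The tiny `MOST_LIKELY_TAG` table is a made-up assumption for illustration, and the sketch applies a given rule rather than learning rules from a corpus, so it is not Brill's implementation.

```python
# Hypothetical most-likely-tag table; a real tagger estimates this from a corpus.
MOST_LIKELY_TAG = {"he": "PRP", "wants": "VBZ", "to": "TO", "race": "NN"}


def initial_tags(words):
    """Stage 1: label every word with its most likely tag (default NN)."""
    return [MOST_LIKELY_TAG.get(word.lower(), "NN") for word in words]


def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Stage 2: change from_tag to to_tag when the previous tag is prev_tag."""
    corrected = list(tags)
    for i in range(1, len(corrected)):
        if corrected[i] == from_tag and corrected[i - 1] == prev_tag:
            corrected[i] = to_tag
    return corrected


words = ["He", "wants", "to", "race"]
tags = initial_tags(words)                    # ['PRP', 'VBZ', 'TO', 'NN']
print(apply_rule(tags, "NN", "VB", "TO"))     # ['PRP', 'VBZ', 'TO', 'VB']
```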
19
HMM Tagger
• Also called Maximum Likelihood Tagger
• Xerox PARC's HMM tagger:
ftp://parcftp.xerox.com/pub/tagger/
• Choose the tag sequence with the maximum
probability given the words seen.
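Finding the most probable tag sequence under an HMM is usually done with the Viterbi algorithm. The sketch below assumes a bigram HMM with start, transition and emission probabilities; all the numbers are invented for illustration and do not come from the Xerox tagger or any corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p][t] * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Toy model (assumed numbers).
TAGS = ["TO", "NN", "VB"]
START = {"TO": 0.2, "NN": 0.5, "VB": 0.3}
TRANS = {
    "TO": {"TO": 0.0, "NN": 0.1, "VB": 0.9},
    "NN": {"TO": 0.3, "NN": 0.4, "VB": 0.3},
    "VB": {"TO": 0.3, "NN": 0.5, "VB": 0.2},
}
EMIT = {"TO": {"to": 1.0}, "NN": {"race": 0.6}, "VB": {"race": 0.4}}

print(viterbi(["to", "race"], TAGS, START, TRANS, EMIT))  # ['TO', 'VB']
```

Although "race" is more likely a noun in isolation in this toy model, the high TO -> VB transition probability makes the verb reading win.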
20
MXPOST: Maximum Entropy POS
Tagger
• Maximum Entropy Model is a framework for
integrating many information sources (called
features) for classification
• Each candidate tag is a class
• Given the features of a word (the surrounding words, its
morphological features, the surrounding tags, etc.),
decide which class it belongs to.
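A minimal sketch of the classification step: extract features for one word, then turn summed feature weights into a probability for each candidate tag. The feature names and the weights are assumptions made up for this illustration; MXPOST's actual feature set and trained weights are not shown in these slides.

```python
import math


def extract_features(words, i, prev_tag):
    """Collect features of word i: the word, a suffix, its neighbours, the previous tag."""
    word = words[i]
    prev_word = words[i - 1].lower() if i > 0 else "<s>"
    next_word = words[i + 1].lower() if i + 1 < len(words) else "</s>"
    return [
        "word=" + word.lower(),
        "suffix2=" + word[-2:].lower(),
        "prev_word=" + prev_word,
        "next_word=" + next_word,
        "prev_tag=" + prev_tag,
    ]


def tag_distribution(features, weights, tags):
    """p(tag | features): normalized exponential of the summed feature weights."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in features) for t in tags}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}


# Hypothetical weights; a real model learns these from tagged training data.
WEIGHTS = {("prev_tag=TO", "VB"): 2.0, ("word=race", "NN"): 1.0}

features = extract_features(["to", "race"], 1, prev_tag="TO")
distribution = tag_distribution(features, WEIGHTS, ["NN", "VB"])
print(max(distribution, key=distribution.get))  # VB
```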
21
Syntactic Parsers
• Collins’ Parser
• XTAG
• MXPOST: Maximum Entropy Parser
22
Collins’ Parser
• Context-free Grammar
• Uses frequencies to resolve ambiguities
• To get some idea of this kind of parser:
– Web-based Chart parser
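One simple way frequencies can resolve an ambiguity is a plain probabilistic context-free grammar: estimate each rule's probability from counts, score each candidate parse as the product of its rule probabilities, and keep the higher-scoring parse. This is only a sketch of that general idea with invented counts; Collins' parser uses a richer lexicalized model that is not reproduced here.

```python
from collections import Counter
from math import prod


def rule_probabilities(rule_counts):
    """Relative frequency estimate: count(lhs -> rhs) / count(lhs)."""
    lhs_totals = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]] for rule, count in rule_counts.items()}


def parse_probability(rules_used, probs):
    """Score a parse as the product of the probabilities of its rules."""
    return prod(probs[rule] for rule in rules_used)


# Hypothetical treebank counts for a prepositional-phrase attachment ambiguity.
COUNTS = {
    ("VP", ("V", "NP", "PP")): 30,
    ("VP", ("V", "NP")): 70,
    ("NP", ("NP", "PP")): 20,
    ("NP", ("Det", "N")): 80,
}
PROBS = rule_probabilities(COUNTS)

# Only the rules on which the two candidate parses differ are listed.
verb_attachment = [("VP", ("V", "NP", "PP")), ("NP", ("Det", "N"))]
noun_attachment = [("VP", ("V", "NP")), ("NP", ("NP", "PP")), ("NP", ("Det", "N"))]

scores = {
    "verb attachment": parse_probability(verb_attachment, PROBS),
    "noun attachment": parse_probability(noun_attachment, PROBS),
}
print(max(scores, key=scores.get))  # verb attachment
```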
23
XTAG
• An on-going project to develop a wide-coverage
grammar for English
• using a lexicalized Tree Adjoining Grammar (TAG)
formalism
– Mildly context-sensitive grammar
• consists of a parser, an X-windows grammar
development interface and a morphological
analyzer.
• /projects/nlp/xtag/
24
XTAG Cont’d
25
Semantic Knowledge Bases and
Semantic Parser
• Analyze what the text says
• WordNet
• Penn Treebank
• Web-based Semantic Parser
26
WordNet
• Represents lexical relations
• Useful in word sense disambiguation
27
Penn Treebank
Predicate: fool(Kris)
28
Semantic Parser
• A web-based chart parser enriched with
semantic constraints
• Example:
– Input: My dog has fleas.
– Output: has(my(dog),fleas)
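To show what the output form means, here is a toy mapping from the example sentence to its nested predicate. It handles only four-word "determiner noun verb noun" sentences and is not a chart parser; `to_logical_form` is a hypothetical name and says nothing about how the web-based parser works internally.

```python
def to_logical_form(sentence):
    """Map a 'determiner noun verb noun' sentence to verb(determiner(noun),noun)."""
    words = sentence.rstrip(".").lower().split()
    if len(words) != 4:
        raise ValueError("this sketch only handles four-word sentences")
    determiner, subject, verb, obj = words
    return f"{verb}({determiner}({subject}),{obj})"


print(to_logical_form("My dog has fleas."))  # has(my(dog),fleas)
```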
29
Speech Tools
• ISIP
• EPOS
• CSLU Toolkit
30
ISIP
• ISIP(Institute for Signal and Information
Processing) public domain speech
recognition system
• Open research software
• Online courses, tutorials, dictionaries,
databases
• Build your own speech recognition system
31
EPOS
• a language-independent, rule-driven Text-to-Speech (TTS) system
• supports several main speech generation
algorithms
32
CSLU Toolkit
• Basic framework and tools for people to build,
investigate and use interactive language systems
• speech recognition, natural language
understanding, speech synthesis and facial
animation technologies
• Easy to use; has spread from higher education into
homes
33
Thanks!
34