Corpus Processing and NLP
Madrid 2010
Kilgarriff: Corpus Processing and NLP
What is NLP?
• Natural Language Processing
– natural language vs. computer languages
• Other names
– Computational Linguistics
• emphasizes scientific not technological
– Language Engineering
• official European Union term, ca 1996-99
– Human Language Technology (HLT)
• preferred EU and US Government term
– Language Technology
NLP and linguistics
[Diagram: LING <-> NLP]
LING <-> NLP: supply ideas, interpret results, test theories, expose gaps
NLP: plus turn into technology
Example: regular morphology
LINGUISTICS:
– Rules: stems -> inflected forms
NLP:
– program the rules
– apply rules to a lexicon of stems
– Is the output correct? Errors?
LINGUISTICS:
– refine the theory
Needed for: web search, spell-checkers, machine
translation, speech recognition systems etc.
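The loop above (write rules, apply them to a lexicon, inspect the output) can be sketched in Python. This is a minimal illustration, assuming two toy spelling rules for regular English verbs; the function name inflect, the rules and the four-stem LEXICON are hypothetical and not taken from any particular system. The irregular stem "go" is included on purpose, to show the kind of error that is passed back to linguistics.

```python
# Toy rules for regular English verb inflection (an assumption for illustration).
def inflect(stem):
    """Return the regular 3rd-person, -ing and past forms of a verb stem."""
    if stem.endswith("e"):
        # arrive -> arrives, arriving, arrived
        return {"3sg": stem + "s", "ing": stem[:-1] + "ing", "past": stem + "d"}
    # help -> helps, helping, helped
    return {"3sg": stem + "s", "ing": stem + "ing", "past": stem + "ed"}

# Hypothetical lexicon of stems; "go" is irregular and exposes a gap in the rules.
LEXICON = ["help", "arrive", "visit", "go"]

if __name__ == "__main__":
    for stem in LEXICON:
        print(stem, inflect(stem))
```

Checking the printed forms by hand is the "Is the output correct?" step: the rules produce "goed" for "go", which shows that the theory needs a treatment of irregular verbs.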
Application areas
• web search
– Basic search
– Filtering results
• spelling and grammar checking
• machine translation (MT)
• talking to computers
– speech processing as well
• information extraction (IE)
– finding facts in a database of documents; populating a
database, answering questions
How can NLP make better dictionaries?
By pre-processing a corpus:
• tokenization
• sentence splitting
• lemmatization
• POS-tagging
• parsing
Each step builds on its predecessors
Tokenization
“identifying the words”
from:
He didn't arrive.
to:
He
did
n’t
arrive
.
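A minimal sketch of a tokenizer that produces this kind of output, using Python's standard re module. The pattern and the name tokenize are assumptions made for illustration; splitting "didn't" into "did" and "n't" follows the convention shown in the example, and real tokenizers handle many more cases.

```python
import re

# Alternatives are tried left to right at each position:
#   1. the clitic n't (straight or curly apostrophe)
#   2. a word that is immediately followed by n't (so "didn't" -> "did" + "n't")
#   3. any other run of word characters
#   4. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"n['’]t\b|\w+(?=n['’]t\b)|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic and punctuation tokens."""
    return TOKEN_RE.findall(text)

if __name__ == "__main__":
    for token in tokenize("He didn't arrive."):
        print(token)
```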
Automatic tokenization
• Western writing systems
– easy! space is separator
• Chinese, Japanese, some other writing
systems
– do not use a word separator
– hard
• like POS-tagging (below)
Why isn't space=separator enough
(even for English)?
• What is a space?
– linebreaks, paragraph breaks, tabs
• Punctuation
– characters do not form parts of words but may
be attached to words (with no spaces)
• brackets, quotation marks
• Hyphenation
– is co-op one word or two? is well-managed?
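The three problems can be seen in a few lines of Python. This is a sketch under stated assumptions: the sample text, the function name tokenize and its split_hyphens option are all invented for illustration, and the option only makes the hyphenation decision explicit rather than settling it.

```python
import re

# Invented sample containing a tab, a newline, brackets and hyphenated words.
text = "The co-op (well-managed)\tclosed.\n"

# Splitting on the space character alone misses tabs and line breaks.
by_space = text.split(" ")

# Splitting on any whitespace still leaves punctuation attached to words.
by_whitespace = text.split()

def tokenize(text, split_hyphens=False):
    """Separate punctuation from words; hyphen handling is a policy choice."""
    word = r"\w+" if split_hyphens else r"\w+(?:-\w+)*"
    return re.findall(word + r"|[^\w\s]", text)

if __name__ == "__main__":
    print(by_space)
    print(by_whitespace)
    print(tokenize(text))
    print(tokenize(text, split_hyphens=True))
```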
Sentence splitting
“identifying the sentences”
from:
He didn't arrive.
to:
He
did
n’t
arrive
.
to:
<s>
He
did
n’t
arrive
.
</s>
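A minimal sketch of this step in Python, assuming the input is already tokenized and that ".", "!" and "?" always end a sentence. That assumption is too simple for real text (abbreviations such as "Dr." break it), and the function names are hypothetical.

```python
def split_sentences(tokens, enders=(".", "!", "?")):
    """Group a flat list of tokens into sentences, closing at end punctuation."""
    sentences = []
    current = []
    for token in tokens:
        current.append(token)
        if token in enders:
            sentences.append(current)
            current = []
    if current:
        # Keep any trailing tokens that lack final punctuation.
        sentences.append(current)
    return sentences

def to_vertical(sentences):
    """Render sentences one token per line, wrapped in <s> ... </s>."""
    lines = []
    for sentence in sentences:
        lines.append("<s>")
        lines.extend(sentence)
        lines.append("</s>")
    return "\n".join(lines)

if __name__ == "__main__":
    tokens = ["He", "did", "n't", "arrive", ".", "She", "left", "."]
    print(to_vertical(split_sentences(tokens)))
```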
Lemmatization
Mapping from text-word to lemma
help (verb)
text-word    to    lemma
help               help (v)
helps              help (v)
helping            help (v)
helped             help (v)
Lemmatization
Mapping from text-word to lemma
help (verb), help (noun), helping (noun)
text-word    to    lemma
help               help (v), help (n)
helps              help (v), help (n)**
helping            help (v), helping (n)
helped             help (v)
helpings           helping (n)
**help (n): usually a mass noun, but part of the compound "home help", which is a count noun taking the "s" ending.
Lemmatization
Dictionary entries are for lemmas, so lemmatization is required for a match between text-word and dictionary-word.
Lemmatization
• Searching by lemma
– English: little inflection
– French: 36 forms per verb
– Finno-Ugric: 2000.
• Not always wanted:
– English royalty
• singular: kings and queens
• plural royalties: payments to authors
Automatic lemmatization
• Write rules:
– if word ends in "ing", delete "ing";
– if the remainder is verb lemma, add to
list of possible lemmas
• If detailed grammar available, use it
• full lemma list is also required
– Often available from dictionary companies
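The "ing" rule can be sketched in Python as follows. The slide gives only the "ing" rule; the "ed" and "s" rules, the tiny lemma lists and the name candidate_lemmas are assumptions added for illustration. The lemma lists stand in for the full lemma list that a real system needs.

```python
# Toy lemma lists (assumptions); a real system needs a full lemma list.
VERB_LEMMAS = {"help", "visit", "arrive"}
NOUN_LEMMAS = {"help", "helping", "relative"}

# Each rule: (suffix to delete, part of speech the remainder must have).
# An empty suffix means "the word itself may already be a lemma".
RULES = [("ing", "v"), ("ed", "v"), ("s", "v"), ("s", "n"), ("", "v"), ("", "n")]

def candidate_lemmas(word):
    """Return every (lemma, pos) pair licensed by a rule and the lemma lists."""
    candidates = []
    for suffix, pos in RULES:
        if not word.endswith(suffix):
            continue
        remainder = word[:len(word) - len(suffix)]
        lemmas = VERB_LEMMAS if pos == "v" else NOUN_LEMMAS
        if remainder in lemmas and (remainder, pos) not in candidates:
            candidates.append((remainder, pos))
    return candidates

if __name__ == "__main__":
    for word in ["help", "helps", "helping", "helped", "helpings"]:
        print(word, candidate_lemmas(word))
```

The output is a list of possibilities, not a decision: choosing between help (v) and helping (n) for "helping" needs context, which is what POS-tagging supplies.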
Part-of-speech (POS) tagging
“identifying parts of speech”
from:
He didn't arrive.
to:
…
to:
<s>
He        PNP    personal pronoun
did       VVD    past tense verb
n’t       XNOT   not
arrive    VV     base form of verb
.         C      punctuation
</s>
Tagsets
• The set of part-of-speech tags to choose between
– Basic: noun, verb, pronoun …
– Advanced: examples - CLAWS English tagset
• NN2 plural noun
• VVG -ing form of lexical verb
• Based on linguistics of the language.
POS-tagging: why?
• Use grammar when searching
– Nouns modified by buckle
– Verbs that buckle is object of
POS-tagging: how?
• Big topic for computational linguistics
– well understood
– taggers available for major languages
• Some taggers use lemmatized input, others do not
• Methods
– constraint-based: set of rules of the form
if previous word is "the" and VERB is one of the
possibilities, delete VERB
– Statistical:
• Machine learning from tagged corpus
• Various methods
• Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
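The constraint-based rule quoted above can be sketched in Python. This is an illustration only: the data layout (each word paired with a set of possible tags), the basic tag names and the function name apply_the_rule are assumptions, and a real constraint-based tagger applies many such rules.

```python
def apply_the_rule(tokens):
    """If the previous word is "the" and VERB is one possibility, delete VERB.

    tokens is a list of (word, set_of_possible_tags) pairs.
    """
    result = []
    previous_word = None
    for word, tags in tokens:
        tags = set(tags)  # copy, so the input is not modified
        if (previous_word is not None
                and previous_word.lower() == "the"
                and "VERB" in tags
                and len(tags) > 1):  # never delete the last remaining tag
            tags.discard("VERB")
        result.append((word, tags))
        previous_word = word
    return result

if __name__ == "__main__":
    ambiguous = [("the", {"DET"}), ("buckle", {"NOUN", "VERB"}), ("broke", {"VERB"})]
    print(apply_the_rule(ambiguous))
```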
Parsing
• Find the structure:
– Phrase structure (trees)
[tree diagram over: The cat sat on the mat]
– Dependency structure (links)
[link diagram over: The cat sat on the mat]
Automatic parsing
• Big topic
– see Jurafsky and Martin or other NLP
textbook
• Many methods too slow for large
corpora
• Sketch Engine usually uses “shallow
parsing”
– Patterns of POS-tags
– Regular expressions
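The idea of matching patterns of POS-tags with regular expressions can be sketched in Python. This is not Sketch Engine's own grammar formalism: the basic tag names, the noun-phrase pattern NP and the function find_chunks are assumptions made for illustration. The tag sequence is written out as a string such as "<DET><NOUN>", and an ordinary regular expression is run over it.

```python
import re

def find_chunks(tagged, pattern):
    """Return the word sequences whose tag sequence matches pattern.

    tagged is a list of (word, tag) pairs; tags must not contain < or >.
    """
    # Encode the tags as one string and remember where each token starts.
    tag_string = ""
    starts = []
    for _, tag in tagged:
        starts.append(len(tag_string))
        tag_string += "<" + tag + ">"
    chunks = []
    for match in re.finditer(pattern, tag_string):
        first = starts.index(match.start())   # token index of the match start
        length = match.group(0).count("<")    # number of tokens matched
        chunks.append([word for word, _ in tagged[first:first + length]])
    return chunks

# Hypothetical shallow pattern: optional determiner, any adjectives, one or more nouns.
NP = r"(?:<DET>)?(?:<ADJ>)*(?:<NOUN>)+"

if __name__ == "__main__":
    tagged = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
              ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]
    print(find_chunks(tagged, NP))
```

Because the pattern is a plain regular expression, it runs in a single pass over the tags, which is why this kind of shallow parsing scales to large corpora where full parsing may not.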
Regular expressions
• Search for any pattern
• Very useful in lots of places
• Exercises
– http://www.sketchengine.co.uk/exercises/regex
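As a small illustration of searching for a pattern, the following Python sketch uses the standard re module. The sample sentence and the two patterns are invented for this example and are not taken from the exercises linked above.

```python
import re

# Invented sample text.
text = "She helps; he helped; they are helping with second helpings."

# Every word that starts with "help" followed by at least one more letter.
help_forms = re.findall(r"\bhelp\w+", text)

# Every word that ends in "ing" (the word boundary excludes "helpings").
ing_words = re.findall(r"\b\w+ing\b", text)

if __name__ == "__main__":
    print(help_forms)
    print(ing_words)
```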
Summary
• What is NLP?
• How can it help?
– Tokenizing
– Sentence splitting
– Lemmatizing
– POS-tagging
– Parsing
Exercise
• A sentence of your language
• A tagset of your language
• Tokenize
• For each word, decide
– What is the lemma (doesn’t apply in Chinese)
– Which tag applies
Word         Lemma       Tag
Visiting     visit       VVG
relatives    relative    NN2
…