tokenisation

Transcript tokenisation

CLINT
Tokenisation
March 2006
Introduction to Computational
Linguistics
1
Information Food Chain
Inference
↑
Knowledge Representation
↑
Meaning Extraction
↑
Semantic Relationships
↑
Chunking (noun phrases; verb phrases)
↑
Part of Speech Annotation
↑
Paragraph and sentence identification
↑
Tokenisation
↑
Raw Text
Start with a Corpus
• A corpus is an organised body of materials
from language that is used as a basis for
empirical studies.
• Corpora are classified according to
– Representativeness
– Medium
– Language
– Information Content
– Structure
Examples of Corpora
• Project Gutenberg: public domain text
resources. http://www.promo.net/pg
• Brown Corpus: a tagged corpus of about
1M words, put together at Brown University
in the 1960s and 70s
• Penn Treebank: a corpus of parsed
sentences based on text from the WSJ
• Canadian Hansards: bilingual (English and
French) corpus of the proceedings of the
Canadian parliament.
Low Level Issues
• Preprocessing: getting rid of junk such as
whitespace, images, certain formatting
information etc.
• Normalisation: deciding on standard
character representations; adopting upper
or lower case (or both)
• Tokenisation
Tokenisation
• Tokenisation is a process which divides
input text into individual units called
tokens.
• Tokens are normally taken to be indivisible
by the next level of analysis, but they can
be associated with various kinds of
information.
• An example of such information is the type
of the token: word, punctuation, number
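A minimal sketch of a tokeniser that attaches a type to each token, using the three types named above. The pattern names and the exact character classes are assumptions made for illustration, not a definitive scheme.

```python
import re

# Hypothetical token classes. Order matters: earlier alternatives
# win when several could match at the same position.
TOKEN_PATTERN = re.compile(r"""
    (?P<number>\d+(?:[.,]\d+)*)   # 2006, 22.50, 1,000
  | (?P<word>[^\W\d_]+)           # a run of letters
  | (?P<punctuation>[^\w\s])      # one non-alphanumeric, non-space character
""", re.VERBOSE)

def tokenise(text):
    """Return a list of (token, type) pairs; whitespace is skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

for token, token_type in tokenise("Tokenisation began in March 2006."):
    print(token, token_type)
```

Each token stays indivisible for the next level of analysis, while the type travels with it as extra information.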
What counts as a word?
• Words are quite tricky to define.
• The standard definition: a string of
contiguous alphanumeric characters with
space on either side; may include
hyphens and apostrophes but no other
punctuation marks (Kucera and Francis
1967).
• It is easy to find exceptions.
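One possible reading of the standard definition can be sketched as a regular expression applied to whitespace-separated chunks. The pattern below is an assumption: it allows hyphens and apostrophes only between alphanumeric characters, which the definition does not spell out.

```python
import re

# Alphanumeric runs, optionally joined by single hyphens or apostrophes.
WORD = re.compile(r"[^\W_]+(?:['-][^\W_]+)*")

def classify(text):
    """Pair each whitespace-separated chunk with True if it fits the definition."""
    return [(chunk, WORD.fullmatch(chunk) is not None) for chunk in text.split()]

for chunk, is_word in classify("won't cost $22.50 on Wednesday."):
    print(chunk, is_word)
```

Chunks such as $22.50 and Wednesday. fail the test, which is exactly the kind of exception the definition runs into.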
Problems Identifying Words
VfB Stuttgart scored twice in quick success-
ion early in the second half on their way to a
deserved 2-1 victory over Manchester United in
the Champions League on Wednesday.
(example from Mary Dalrymple, University of
London)
• VfB Stuttgart, Manchester United
• succession
• 2-1
• Wednesday
Problems Identifying Words
Problems Involving Spaces
• Lack of spaces between words
Lebensversicherungsgesellschaftsangestellter
(life insurance company employee)
Ix-Xemx
• The presence of spaces may not indicate
a word break
Coca Cola; +356 21 456 457
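A minimal sketch of one way to handle spaces that do not mark a word break: look up adjacent token pairs in a list of known multiword units. The MULTIWORD set is a hypothetical lexicon invented for this example.

```python
# Hypothetical lexicon of two-token units that should stay together.
MULTIWORD = {("Coca", "Cola"), ("Manchester", "United")}

def merge_multiword(tokens):
    """Merge adjacent tokens that form a known multiword unit."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD:
            merged.append(" ".join(pair))
            i += 2          # skip both halves of the unit
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_multiword("I drank Coca Cola".split()))
```

A lookup like this only covers units already in the lexicon; open-ended cases such as telephone numbers would need pattern-based rules instead.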
Problems Involving
Special Characters
• Words often include non-alphanumeric
characters which are actually part of the
word.
$22.50; www.di-ve.com.mt; BSc. IT; :-)
• Words are often terminated by punctuation
which is not part of the word.
• Sometimes, terminating punctuation is part
of the word.
Periods
• In general, punctuation marks attach to
words, and can be removed. However
there are special cases:
• Most periods mark end of sentence
• Others mark abbreviations, e.g. "e.g.",
"Wash."
• Note that when an abbreviation occurs at
the end of a sentence there is only one
period.
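The special cases above can be sketched with a small abbreviation list: a final period is split off as its own token unless the chunk is a known abbreviation. The ABBREVIATIONS set is a hypothetical stand-in for a real abbreviation lexicon.

```python
# Hypothetical abbreviation list; a real tokeniser would need a much larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Wash."}

def split_final_period(chunk):
    """Detach a trailing period unless the chunk is a known abbreviation."""
    if len(chunk) > 1 and chunk.endswith(".") and chunk not in ABBREVIATIONS:
        return [chunk[:-1], "."]
    return [chunk]

def tokenise(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(split_final_period(chunk))
    return tokens

print(tokenise("He moved to Wash. last year."))
```

Because an abbreviation at the end of a sentence carries only one period, this sketch would leave such a sentence without a separate end-of-sentence token; resolving that needs sentence boundary information.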
Apostrophe
• English contractions such as won't or I'll
count as one word according to the classic
definition
• However there are reasons for wanting
two separate tokens – such as interaction
with grammar rules (S → NP VP)
• Penn Treebank splits such contractions
into two words.
Apostrophe
• This sometimes leaves odd words
For example isn’t yields is + n't
• 's is ambiguous
– Abbreviation for is (he's strange)
– Possessive (John's car)
• Word-final apostrophe is ambiguous
– end of quotation
– possessive of word ending in s
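A minimal sketch of splitting contractions into two tokens, in the spirit of the Penn Treebank convention mentioned earlier. The clitic list is an assumption, and the pattern expects the plain ASCII apostrophe rather than a typographic one.

```python
import re

# Assumed list of English clitics; the lazy stem lets n't win over a bare 't.
CLITIC = re.compile(r"^(?P<stem>.+?)(?P<clitic>n't|'s|'ll|'re|'ve|'d|'m)$",
                    re.IGNORECASE)

def split_contraction(word):
    """Return [stem, clitic] for a contraction, or [word] otherwise."""
    m = CLITIC.match(word)
    return [m.group("stem"), m.group("clitic")] if m else [word]

for word in ["isn't", "I'll", "he's", "John's", "cat"]:
    print(word, split_contraction(word))
```

The split itself says nothing about whether 's is a form of is or a possessive; both he's and John's come out the same way, so the ambiguity is passed on to later processing.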
Exercise
• How is the apostrophe used in Maltese?
• How should a Maltese tokeniser deal with
it?
Hyphen
• Issue: do sequences of words joined by hyphens
count as one word or more?
• Typesetting hyphens (at end of line) and
hyphens in measure phrases (35-year-old)
are usually removed.
• Typesetting hyphens can be ambiguous
• Lexical hyphens are usually kept
hi-fi
• Hyphens – standing alone – are used as
punctuation.
• Texts are often inconsistent in usage of hyphens
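Removing typesetting hyphens can be sketched as a single substitution that rejoins a word split across a line break. This is an illustrative rule only; the function name is made up for this example.

```python
import re

def join_line_end_hyphens(text):
    """Rejoin words that were hyphenated across a line break."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

print(join_line_end_hyphens("quick success-\nion early"))
print(join_line_end_hyphens("a new hi-\nfi system"))
```

The second call shows the ambiguity: when a lexical hyphen happens to fall at the end of a line, the same rule wrongly produces hifi, so a dictionary check or similar evidence is needed to decide.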
Case
• Types vs. Tokens
– How many tokens in the following sentence:
The cat chased the rat on the table
– How many types?
• Tokenisation should correctly identify word
types, i.e.
– Tokens of the same type should be identified
– Tokens of different type should be distinguished
• Case representation of ordinary words must be
standardised.
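The type and token counts for the example sentence can be worked out in a few lines. This sketch assumes whitespace splitting is good enough for this sentence, and shows how the number of types depends on whether case is standardised.

```python
sentence = "The cat chased the rat on the table"

tokens = sentence.split()
types_case_sensitive = set(tokens)               # "The" and "the" differ
types_case_folded = {t.lower() for t in tokens}  # "The" and "the" merge

print(len(tokens))                # 8 tokens
print(len(types_case_sensitive))  # 7 types if case is kept
print(len(types_case_folded))     # 6 types after lowercasing
```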
Case
• Heuristics
– Map first character of a sentence to standard
case
– Map all words in titles to lowercase
• Problems
– Identification of sentence boundaries
– Identification of proper names
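A minimal sketch of the first heuristic, assuming the input is already a list of tokens for exactly one sentence (which presupposes that sentence boundaries have been found). The function name is hypothetical.

```python
def lowercase_sentence_start(tokens):
    """Lowercase the first token of a sentence; leave the rest untouched."""
    if not tokens:
        return []
    return [tokens[0].lower()] + list(tokens[1:])

print(lowercase_sentence_start(["The", "cat", "chased", "the", "rat"]))
print(lowercase_sentence_start(["Manchester", "United", "won"]))
```

The second call illustrates the proper name problem: a name at the start of a sentence is lowercased along with ordinary words.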
Normalisation
• Character representations.
• Converting all letters to lower or upper
case
• Removing punctuation
• Removing accent marks and other
diacritics from letters
• Expanding abbreviations
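Three of these steps (case folding, removing diacritics, removing punctuation) can be sketched with the Python standard library. The function names are made up for this example, and only ASCII punctuation is removed.

```python
import string
import unicodedata

def strip_diacritics(text):
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalise(text):
    """Strip diacritics, lowercase, and remove ASCII punctuation."""
    text = strip_diacritics(text).lower()
    return text.translate(str.maketrans("", "", string.punctuation))

print(normalise("Schütze's Café!"))
```

Each step discards information, so whether to apply it depends on the task; removing the apostrophe here, for instance, merges the possessive with the plain word.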
Further Normalisation
• Stemming: are eats and eating different
words?
• They are two different wordforms that
have the same stem, eat, but different
suffixes, -s and -ing
• Stemming versus full morphological
analysis.
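A deliberately naive sketch of stemming by suffix stripping, covering only the two suffixes from the example. The suffix list and the minimum stem length are assumptions; this is not an established stemming algorithm.

```python
# Only the two suffixes from the example; longer suffix is tried first.
SUFFIXES = ("ing", "s")

def naive_stem(word):
    """Strip a known suffix if at least three characters would remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for word in ["eats", "eating", "eat"]:
    print(word, naive_stem(word))
```

Stripping suffixes this way maps eats and eating to the same stem, but it knows nothing about word structure: running would become runn, which is where full morphological analysis comes in.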
Summary
• The tokenisation problem interacts with design
decisions at different levels concerning
– Handling of non alphanumeric characters
– Case
– Punctuation
• Typically many of these problems are dealt with
by hand crafting special rules which match a
particular case.
• Such rules are often built out of regular
expressions.
Sources
Foundations of Statistical Natural Language
Processing, Manning and Schütze, MIT Press,
1999
