
Computational Linguistics II /
Language Technology
Lecture, Summer Semester 2011
(M-GSW-10)
Prof. Dr. Udo Hahn
Lehrstuhl für Computerlinguistik
Institut für Germanistische Sprachwissenschaft
Friedrich-Schiller-Universität Jena
http://www.julielab.de
Classical Pipeline for NLP Analysis
[Figure: analysis pipeline for the sentence "A severe infection ended the pregnancy."
Morphology (stemming, lemmatization), drawing on a Lexicon, yields e.g. "end + ed (PastTense)".
Syntax (parsing: syntactic analysis), drawing on a Grammar, yields the phrase structure.
Semantics (semantic interpretation), drawing on a Domain Ontology, yields a Termination Event
with cause Infection (I-degree: severe) and process Pregnancy.]
A severe infection ended the pregnancy.
2
Sample NLP Analysis for
Information Extraction
Syntactic Analysis & Semantic Tagging:
  "A severe infection ended the pregnancy."
  SUBJ: Phrase: a severe infection; Class: <Disease>
  VERB: Term: ended; Root: "end" or "terminate"; Mode: active, affirmative
  OBJ:  Phrase: the pregnancy; Class: <Process>
Information Extraction Rule:
  CONCEPT TYPE: Process Termination Event
  CONSTRAINTS:
    SUBJ: Class: <Disease> (extract: Cause)
    VERB: Root: "end" or "terminate"; Mode: active
    OBJ:  Class: <Process> (extract: PhysProcess)
Extracted Template:
  [ Process Termination Event
    [ Cause : a severe infection ]
    [ PhysProcess : pregnancy ] ]
3
Morphological and Lexical Layer
• What is a word, lexically speaking?
• “go”, “blue”, “gene”, “take off”, “get rid of”,
“European Union”, “on behalf of”
• Are these reasonable lexical entries?
– Angela Merkel, Pope Benedict, Barack Obama,
Friedrich Schiller, King Juan Carlos I
• Morphology of words
– Inflection
• “activate”, “activates”, “activated”
end + ed (PastTense)
– Derivation
• “active”, “activate”, “activation”, “activity”
– Compounding
• “brain activity”, “hyper-activity”
4
Syntactic Layer
• What is a phrase?
• “a severe infection”, “the pregnancy”
• *"infection ended the" (not a well-formed phrase)
• How do phrases combine to form a clause or
a sentence?
[Parse tree for "a severe infection ended the pregnancy":
 S → NP (Det "a", Adj "severe", N "infection") + VP (V "ended", NP (Det "the", N "pregnancy"))]
5
Semantic Layer
• What is the meaning of a word?
– “go”, “blue”, “gene”, “take off”, “get rid of”,
“European Union”, “on behalf of”
– relational or semantic primitives?
• What is the meaning of a phrase or sentence?
[Semantic representation of "a severe infection ended the pregnancy":
 a Termination Event with cause Infection (I-degree: severe) and process Pregnancy]
6
Lexical Background Knowledge
• Lexicon
– Part of speech information
• “rain” can be a Verb (VB) or a Noun (NN)
– Inflection patterns (classification)
• “rain” – “rained” vs. “go” – “went”
– Syntactic features
• (in)transitivity of verbs (direct object required or not)
• Head-modifier specs
– Head nouns may have a determiner, several adjectives, a
quantifier as pre-modifiers; NP or PP as post-modifiers
– Semantic features
• Semantic types and relations
– “aspirin” is a “drug”, “drugs” are “man-made artifacts”
• Selection constraints
– [Drugs] "cure" [Diseases], [MedPers] "treat" [Patient]
7
Syntactic
Background Knowledge
• Grammar (syntax of NL)
– Set of rules or constraints
• S → NP VP, NP → Det Noun, VP → Verb NP
• "Number" of HeadNoun determines "Number" of
Determiner:
– "the drugs", *"a drugs"
8
Semantic
Background Knowledge
• Semantics (meaning representation of NL)
– Set of rules or constraints
• S → NP1 VP, NP1,2 → Det Noun, VP → Verb NP2
• IF
– a) the head noun of NP1 denotes a Disease &
– b) the head noun of NP2 denotes a Process &
– c) the main verb denotes a termination event
• THEN
– The following proposition can be instantiated:
» TERMINATE ( Disease, Process )
» TERMINATE ( “infection”, “pregnancy” )
9
Two Paradigms for NLP
• Symbolic Specification Paradigm
– Manual acquisition procedures
– Lab-internal activities
– Intuition and (few!) subjectively generated examples drive
progress based on individual (competence) judgments
• "I have a system that parses all of my nineteen sentences!"
• Empirical (Learning) Paradigm
– Automatic acquisition procedures
– Community-wide sharing of common knowledge and
resources
– Large and 'representative' data sets drive progress
according to experimental standards
• "The system was tested on 1.7 million words taken from the
WSJ segment of the MUC-7 data set and produced 4.9%
parsing errors, thus yielding a statistically significant 1.6%
improvement over the best result by parser X on the same
data set & a 40.3% improvement over the baseline system!"
10
Symbolic Specification Paradigm
• Manual rule specification
– Source: linguist's intuition
• Manual lexicon specification
– Source: linguist's intuition
• Each lab has its own (home-grown) set
of NLP software
– Hampers reusability
– Limits scientific progress
– Waste of human and monetary resources
(we "burnt" thousands of Ph.D. students all
over the world)
11
Shortcomings of the “Classical”
Linguistic Approach
• Huge amounts of background knowledge req.
– Lexicons (approx. 100,000 – 150,000 entries)
– Grammars (>> 15,000 – 20,000 rules)
– Semantics (>> 15,000 – 20,000 rules)
• As the linguistic and conceptual coverage of
classical linguistic systems increases (slowly),
it still remains insufficient; systems also reveal ‘spurious’ ambiguity, and, hence, tend to
become overly “brittle” and unmaintainable
• More fail-soft behavior is required at the
expense of … ? (e.g., full-depth understanding)
12
Empirical Paradigm
• Large repositories of language data
– Corpora (plain or annotated, i.e., enriched by meta-data)
• Large, community-wide shared repositories of
language processing modules
– Tokenizers, POS taggers, chunkers, NE recognizers, ...
• Shared repositories of machine learning algos
• Shallow analysis rather than deep understanding
• Automatic acquisition of linguistic knowledge
– Applying ML algos to train linguistic processors by using
large corpora rather than manual intuition
• Large, community-wide self-managed, task-oriented
competitions, comparative evaluation rounds
• Change of mathematics:
– Statistics rather than algebra and logic
13
Paradigm Shift – We Exchanged our Textbooks...
14
Pipeline for NLP Analysis (revisited)
[Figure: the same pipeline, now with trainable components and resources, for
"A severe infection ended the pregnancy."
Morphology (stemmer, lemmatizer), using a Lexicon, yields "end + ed (PastTense)".
Syntax (POS tagger, chunker, parser), trained on a POS/Tree Bank, yields POS and BIO chunk tags:
A/Det/B severe/Adj/I infection/NN/I ended/Vb/O the/Det/B pregnancy/NN/I.
Semantics (NE recognizer, proposition analyzer), using a Prop Bank and a Domain Ontology, yields
Infection: Disease, Pregnancy: Process, Termination (Pregnancy, Infection),
i.e. a Termination Event with cause Infection (I-degree: severe) and process Pregnancy.]
A severe infection ended the pregnancy.
15
Core NLP Technologies
• POS Tagging
• Chunking & Partial Parsing
• Named Entity
Recognition/Interpretation
16
POS Tagging
A/DET severe/ADJ infection/NOUN ended/VERB the/DET pregnancy/NOUN ./ST
17
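An off-the-shelf tagger reproduces this kind of analysis; below is a minimal sketch using NLTK (an assumption, not a tool named on this slide), whose tagger outputs Penn Treebank labels rather than the coarse labels above.

    # Minimal POS-tagging sketch with NLTK (assumes the 'punkt' and
    # 'averaged_perceptron_tagger' resources have been downloaded).
    import nltk

    tokens = nltk.word_tokenize("A severe infection ended the pregnancy.")
    print(nltk.pos_tag(tokens))
    # e.g. [('A', 'DT'), ('severe', 'JJ'), ('infection', 'NN'),
    #       ('ended', 'VBD'), ('the', 'DT'), ('pregnancy', 'NN'), ('.', '.')]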
Penn Treebank Tag Set (45 tags in total)
Tag   Description           Examples
.     sentence terminator   . ! ?
DT    determiner            all an many such that the them these this
JJ    adjective, numeral    first oiled separable battery-powered
NN    common noun           cabbage thermostat investment
PRP   personal pronoun      herself him it me one oneself theirs they
IN    preposition           among out within behind into next
VB    verb (base form)      ask assess assign begin break bring
VBD   verb (past tense)     asked assessed assigned began broke
WP    WH-pronoun            that what which who whom
18
Transformation Rules
for Tagging [Brill, 1995]
• Initial State: Based on a number of features,
guess the most likely POS tag for a given word:
– die/DET Frau/NOUN ,/COMMA die/DET singt/VFIN
• Learn transformation rules to reduce errors:
– Change DET to PREL whenever the preceding word
is tagged as COMMA
• Apply learned transformation rules:
– die/DET Frau/NOUN,/COMMA die/PREL singt/VFIN
19
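A minimal sketch (not from the slides) of how such learned transformation rules are applied to an initially tagged sequence; the encoding of a rule as a (from-tag, to-tag, previous-tag) triple is an assumption made for illustration.

    # Apply Brill-style transformation rules to an initial tagging.
    def apply_rules(tagged, rules):
        for from_tag, to_tag, prev_tag in rules:
            for i in range(1, len(tagged)):
                word, tag = tagged[i]
                if tag == from_tag and tagged[i - 1][1] == prev_tag:
                    tagged[i] = (word, to_tag)   # rewrite the erroneous tag
        return tagged

    initial = [("die", "DET"), ("Frau", "NOUN"), (",", "COMMA"),
               ("die", "DET"), ("singt", "VFIN")]
    rules = [("DET", "PREL", "COMMA")]           # change DET to PREL after a COMMA
    print(apply_rules(initial, rules))
    # [..., (',', 'COMMA'), ('die', 'PREL'), ('singt', 'VFIN')]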
First 20 Transformation Rules
[table of the first 20 learned rules not reproduced in this transcript]
Taken from: Brill (1995), Transformation-Based Error-Driven Learning
20
Towards Statistical Models of
Natural Language Processing …
21
Letter-based Language Models
• Shannon's Game
• Guess the next letter:
  W … Wh … Wha … What … What d … What do … What do you think the next letter is?
29
Word-based Language Models
• Shannon's Game
• Guess the next letter:
  What do you think the next letter is?
• Guess the next word:
  We … We are … We are now … We are now entering … We are now entering statistical … We are now entering statistical territory
36
Approximating
Natural Language Words
• zero-order approximation:
letter sequences are independent of
each other and all equally probable:
• xfoml rxkhrjffjuj zlpwcwkcy
ffjeyvkcqsghyd
37
Approximating
Natural Language Words
• first-order approximation:
letters are independent, but occur
with the frequencies of English text:
• ocro hli rgwr nmielwis eu ll
nbnesebya th eei alhenhtppa oobttva
nah
38
Approximating
Natural Language Words
• second-order approximation:
the probability that a letter appears
depends on the previous letter
• on ie antsoutinys are t inctore st bes
deamy achin d ilonasive tucoowe at
teasonare fuzo tizin andy tobe seace
ctisbe
39
Approximating
Natural Language Words
• third-order approximation:
the probability that a certain letter
appears depends on the two
previous letters
• in no ist lat whey cratict froure birs
grocid pondenome of demonstures
of the reptagin is regoactiona of cre
40
Approximating
Natural Language Words
• Higher frequency trigrams for different languages:
  – English: THE, ING, ENT, ION
  – German:  EIN, ICH, DEN, DER
  – French:  ENT, QUE, LES, ION
  – Italian: CHE, ERE, ZIO, DEL
  – Spanish: QUE, EST, ARA, ADO
41
Zipf's Law
[Figure: word distribution compared with a simple Zipf distribution (~1/n); number of words: 70;
texts from: http://www.gutenberg.org/dirs/etext04/8effi10.txt]
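A minimal sketch (not from the slides) of checking Zipf's law on a text: count word frequencies and compare the observed rank-frequency values with a simple 1/rank prediction. The local file name is an assumption standing in for the Gutenberg text referenced above.

    # Compare empirical rank-frequency counts with a 1/rank Zipf prediction.
    from collections import Counter
    import re

    text = open("effi_briest.txt", encoding="utf-8").read().lower()   # assumed local copy
    freqs = Counter(re.findall(r"[a-zäöüß]+", text))
    ranked = freqs.most_common()
    top = ranked[0][1]
    for rank, (word, freq) in enumerate(ranked[:20], start=1):
        zipf_pred = top / rank            # Zipf: f(rank) is roughly f(1) / rank
        print(f"{rank:2d} {word:15s} observed={freq:5d} zipf~{zipf_pred:7.1f}")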
42
Terminology
• Sentence: unit of written language
• Utterance: unit of spoken language
• Word Form: the inflected form that appears
literally in the corpus
• Lemma: lexical forms having the same stem,
part of speech, and word sense
• Types (V): number of distinct words that
might appear in a corpus (vocabulary size)
• Tokens (NT): total number of words in a
corpus (note: V < NT)
• Types seen so far (T): number of distinct
words seen so far in corpus (note: T < V < NT)
43
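A minimal sketch (not from the slides) of the type/token distinction on a toy corpus.

    # Token vs. type counts for a toy corpus.
    tokens = "the dog saw the cat and the cat saw the dog".split()
    types = set(tokens)
    print("tokens (N):", len(tokens))   # 11
    print("types  (V):", len(types))    # 5  (V <= N)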
Word-based Language Models
• A model that enables one to compute the
probability, or likelihood, of a sentence S,
P(S).
• Simple: Every word follows every other
word with equal probability (0-gram)
– Assume |V| is the size of the vocabulary V
– Likelihood of sentence S of length n is
1/|V| × 1/|V| … × 1/|V|
– If English has 100,000 words, the probability
of each next word is 1/100000 = .00001
44
Relative Frequency vs.
Conditional Probability
• Smarter: Relative Frequency
probability of each next word is related to word
frequency within a corpus (unigram)
• Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)
• Assumes probability of each word is independent of probabilities
of other words
• Even smarter: Conditional Probability
Look at probability given previous words (n-gram)
• Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)
• Assumes probability of each word is dependent on probabilities
of previous words
45
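A minimal sketch (not from the slides) of estimating unigram and bigram probabilities from corpus counts and scoring a word sequence with both models; the toy corpus is invented for illustration.

    from collections import Counter

    corpus = "the dog barks . the dog sleeps . a cat sleeps .".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n = len(corpus)

    def p_unigram(w):
        return unigrams[w] / n                    # relative frequency

    def p_bigram(w, prev):
        return bigrams[(prev, w)] / unigrams[prev]  # conditional probability

    # unigram score: P(the) * P(dog) * P(barks)
    print(p_unigram("the") * p_unigram("dog") * p_unigram("barks"))
    # bigram score: P(dog | the) * P(barks | dog)
    print(p_bigram("dog", "the") * p_bigram("barks", "dog"))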
Generalization of Conditional
Probability via Chain Rule
• Conditional Probability
– P(A1,A2) = P(A1) · P(A2|A1)
• The Chain Rule generalizes to multiple events
– P(A1, …,An) =
P(A1) × P(A2|A1) × P(A3|A1,A2) × … × P(An|A1…An-1)
• Examples:
– P(the dog) = P(the) × P(dog | the)
– P(the dog bites) = P(the) × P(dog | the) × P(bites| the dog)
46
Relative Frequencies and
Conditional Probabilities
• Relative word frequencies are better than
equal probabilities for all words
– In a corpus with 10K word types, each word
would have P(w) = 1/10K
– Does not match our intuitions that different
words are more likely to occur
• (e.g. “the” vs. “shop” vs. “aardvark”)
• Conditional probability is more useful
than individual relative word frequencies
• dog may be relatively rare in a corpus
• but if we see barking, P(dog|barking) may be large
47
Probability for a Word String
• In general, the probability of a complete
string of words w1n = w1…wn is
P(w1n) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1…wn-1)
       = ∏k=1..n P(wk | w1…wk-1)
• But this approach to determining the
probability of a word sequence gets to be
computationally very expensive
48
Markov Assumption (basic idea)
• How do we compute P(wn|w1n-1)?
• Trick: Instead of P(rabbit|I saw a), we use
P(rabbit|a).
– This lets us collect statistics in practice via a
bigram model: P(the barking dog) =
P(the|<start>) × P(barking|the) × P(dog|barking)
49
Markov Assumption (the very idea)
• Markov models are the class of
probabilistic models that assume that we
can predict the probability of some future
unit without looking too far into the past
– Specifically, for N=2 (bigram):
– P(w1n) ≈ ∏k=1..n P(wk|wk-1); w0 := <start>
• Order of a Markov model: length of prior
context
– bigram is first order, trigram is second
order, …
50
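A minimal sketch (not from the slides) of the first-order (bigram) Markov approximation; the probability table is invented purely for illustration.

    # Toy bigram model: P(word | previous word); values are invented.
    P = {
        ("<start>", "the"): 0.4,
        ("the", "barking"): 0.05,
        ("barking", "dog"): 0.5,
    }

    def sentence_prob(words):
        prob, prev = 1.0, "<start>"
        for w in words:
            prob *= P.get((prev, w), 0.0)   # unseen bigrams get probability 0 here
            prev = w
        return prob

    print(sentence_prob(["the", "barking", "dog"]))   # 0.4 * 0.05 * 0.5 = 0.01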
Statistical HMM-based Tagging
[Brants, 2000]
• State transition probability: Likelihood of a tag
immediately following n other tags
– P1(Tagi | Tagi-1 ... Tagi-n)
• State emission probability: Likelihood of a
word given a tag
– P2(Wordi | Tagi)
• die/DET Frau/NOUN ,/COMMA die/DET or PREL
singt/VFIN
51
Trigrams for Tagging
• State transition probabilities (trigrams):
– P1(DET | COMMA NOUN) = 0.0007
– P1(PREL | COMMA NOUN) = 0.01
• State emission probabilities:
– P2( die | DET) = 0.7
– P2( die | PREL) = 0.2
• Compute probabilistic evidence for the tag being
  – DET:  P1 · P2 = 0.0007 · 0.7 = 0.00049
  – PREL: P1 · P2 = 0.01 · 0.2  = 0.002
• die/DET Frau/NOUN ,/COMMA die/PREL singt/VFIN
52
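A minimal sketch (not from the slides) that reproduces the arithmetic above: the tag with the larger product of transition and emission probability wins, which is why the second "die" ends up tagged PREL.

    # Probabilities taken from the slide.
    p_trans = {"DET": 0.0007, "PREL": 0.01}   # P1(tag | preceding tags COMMA, NOUN)
    p_emit  = {"DET": 0.7,    "PREL": 0.2}    # P2("die" | tag)

    scores = {tag: p_trans[tag] * p_emit[tag] for tag in p_trans}
    print(scores)                       # {'DET': 0.00049, 'PREL': 0.002}
    print(max(scores, key=scores.get))  # PREL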
Inside (most) POS Taggers
• Lexicon look-up routines
• Morphological processing (not only
deflection!)
• Unknown word handler, if lexicon look-up fails
(based on statistical information)
• Ambiguity ranking (priority selection)
53
Chunking
Arginine methylation of STAT1 modulates IFN induced transcription
54
Chunking
[Arginine methylation] of [STAT1] modulates [IFN induced transcription]
55
Shallow Parsing
[Arginine methylation of STAT1]NP [modulates]VP [IFN induced transcription]NP
56
Shallow Parsing
[ [Arginine methylation]NP [of STAT1]PP ]NP
[Arginine methylation of STAT1]NP [modulates]VP [IFN induced transcription]NP
57
Shallow Parsing
[ [IFN induced]AP [transcription]N ]NP
[ [Arginine methylation]NP [of STAT1]PP ]NP
[Arginine methylation of STAT1]NP [modulates]VP [IFN induced transcription]NP
58
Deep Parsing
[ [IFN induced]AP [transcription]N ]NP
[ [[Arginine]N [methylation]N]NP [[of]P [STAT1]N]PP ]NP
[ [Arginine methylation]NP [of STAT1]PP ]NP
[Arginine methylation of STAT1]NP [ [modulates]V [IFN induced transcription]NP ]VP
59
Deep Parsing
[ [[IFN]N [induced]A]AP [transcription]N ]NP
[ [IFN induced]AP [transcription]N ]NP
[ [[Arginine]N [methylation]N]NP [[of]P [STAT1]N]PP ]NP
[ [Arginine methylation]NP [of STAT1]PP ]NP
[Arginine methylation of STAT1]NP [ [modulates]V [IFN induced transcription]NP ]VP
60
Chunking Principles
• Goal: divide a sentence into a sequence of
chunks (a kind of phrase)
• Chunks are non-overlapping regions of a text
– [I] saw [a tall man] in [the park]
• Chunks are non-exhaustive
– not all words of a sentence are included in
chunks
• Chunks are non-recursive
– a chunk does not contain other chunks
• Chunks are mostly base NP chunks
  – [ [the synthesis]NP-base of [long enhancer transcripts]NP-base ]NP-complex
61
The Shallow Syntax Pipeline
Tagging
Chunking
Parsing
62
BIO Format for Base NPs
63
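The table on this slide is not reproduced in the transcript; as a rough illustration of the B(egin)/I(nside)/O(utside) labels for base NPs, here is a minimal sketch. The bracketed input format and the helper name are assumptions, not taken from the slide.

    def to_bio(chunked):
        # chunked: list of items, each either a plain token (outside any chunk)
        # or a list of tokens forming one base NP
        bio = []
        for item in chunked:
            if isinstance(item, list):                       # a base NP chunk
                bio.append((item[0], "B-NP"))
                bio.extend((tok, "I-NP") for tok in item[1:])
            else:                                            # token outside all chunks
                bio.append((item, "O"))
        return bio

    print(to_bio([["A", "severe", "infection"], "ended", ["the", "pregnancy"], "."]))
    # [('A', 'B-NP'), ('severe', 'I-NP'), ('infection', 'I-NP'), ('ended', 'O'),
    #  ('the', 'B-NP'), ('pregnancy', 'I-NP'), ('.', 'O')]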
A Simple Chunking Technique
• Simple chunkers usually ignore lexical
content
– Only need to look at part-of-speech tags
• Basic steps in chunking
– Chunking / Unchunking
– Chinking
– Merging / Splitting
64
Regular Expression Basics
• "|"    OR operator (explicit OR-ing)
  – "[a|e|i|o|u]" matches any occurrence of vowels
• "[abc]"  matches any occurrence of either "a", "b" or "c" (implicit OR-ing)
  – "gr[ae]y" matches "grey" or "gray" (but not "graey")
• "."    matches an arbitrary character
  – "d.g" matches "dag", "dig", "dog", "dkg", …
• "?"    the preceding expression/character may or may not occur
  – "colou?r" matches "colour" and "color"
• "+"    the preceding expression occurs at least once
  – "(ab)+" matches "ab", "abab", "ababab", …
• "*"    the preceding expression occurs zero times or arbitrarily often
  – "(ab)*" matches "", "ab", "abab", "ababab", …
65
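A minimal sketch (not from the slides) of trying these patterns with Python's re module.

    import re

    print(bool(re.fullmatch(r"gr[ae]y", "grey")))      # True
    print(bool(re.fullmatch(r"gr[ae]y", "graey")))     # False
    print(bool(re.fullmatch(r"colou?r", "color")))     # True
    print(bool(re.fullmatch(r"(ab)+", "ababab")))      # True
    print(bool(re.fullmatch(r"(ab)*", "")))            # True  ('*' also matches zero occurrences)
    print(re.findall(r"d.g", "dag dig dog dkg"))       # ['dag', 'dig', 'dog', 'dkg']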
Chunking
• Define a regular expression that matches the
sequences of tags in a chunk
– <DT>? <JJ>* <NN.?>
• Chunk all matching subsequences
– A/DT red/JJ car/NN ran/VBD on/IN the/DT street/NN
– [A/DT red/JJ car/NN] ran/VBD
on/IN [the/DT street/NN]
• If matching subsequences overlap, the first
one gets priority
• Unchunking is the opposite of chunking
66
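A minimal sketch (not from the slides) of this chunking rule with NLTK's regular-expression chunker; the grammar string follows the tag pattern given above.

    import nltk

    grammar = "NP: {<DT>?<JJ>*<NN.?>}"     # the chunk pattern from the slide
    chunker = nltk.RegexpParser(grammar)

    tagged = [("A", "DT"), ("red", "JJ"), ("car", "NN"), ("ran", "VBD"),
              ("on", "IN"), ("the", "DT"), ("street", "NN")]
    tree = chunker.parse(tagged)
    print(tree)
    # the two matching subsequences are grouped into chunks:
    # (NP A/DT red/JJ car/NN) and (NP the/DT street/NN)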
Chinking
• A chink is a subsequence of the text that is
not a chunk
• Define a regular expression that matches the
sequences of tags in a chink
– ( <VB.?> | <IN> )+
• Chunk anything that is not a matching subsequence
– A/DT red/JJ car/NN ran/VBD on/IN the/DT street/NN
– [A/DT red/JJ car/NN]
ran/VBD on/IN [the/DT street/NN]
chink
67
Merging
• Combine adjacent chunks into a single chunk
• Define a regular expression that matches the
sequences of tags on both sides of the point
to be merged
– Merge a chunk ending in “JJ” with a chunk
starting with “NN”, i.e. left: <JJ>, right: <NN.>
• Chunk all matching subsequences
– [A/DT red/JJ ] [ car/NN] ran/VBD
on/IN the/DT street/NN
– [A/DT red/JJ car/NN] ran/VBD
on/IN the/DT street/NN
• Splitting is the opposite of merging
68
What are Named Entities?
• Names of persons
– Dr. Jonathan Peeko, Professor Johnson
• Names of companies or organizations
– Sony, United Nations, Texas Instruments, General Motors
• Names of locations
  – Paris, San Francisco, Rocky Mountains, Yellowstone Park
• Date and time expressions
  – Feb 17, 1973; 4.40 p.m.; 16.40 Uhr; autumn 2000; last year
(named entities are intentionally excluded from the lexicon)
• Addresses
– 7 Ugly Way, Wolverhampton UH0 1Q5
– [email protected]
• Names of proteins or genes or diseases,
– chloramphenicol acetyltransferase, NF-kappa B, SARS
• Measure expressions
– 420 kp, 21 l/m2, 37%, 900€
69
GATE: NER – Examples (1/3)
70
GATE: NER – Examples (2/3)
71
GATE: NER – Examples (3/3)
72
Pasta – NER Examples
Staphylococcus aureus enterotoxin A ( SEA )
belongs to a subgroup of the staphylococcal
superantigens that utilizes Zn2+ in the high
affinity interaction with MHC class II molecules.
A high affinity metal binding site was described
previously in SEA cocrystallized with Cd2+ in
which the metal ion was octahedrally coordinated, involving the N-terminal serine .
73
Pasta – NER Examples
Staphylococcus aureus enterotoxin A ( SEA )
belongs to a subgroup of the staphylococcal
superantigens that utilizes Zn2+ in the high
SPECIES
affinity interaction
with MHC class II molecules.
A high affinity metal binding site was described
previously in SEA cocrystallized with Cd2+ in
which the metal ion was octahedrally coordinated, involving the N-terminal serine .
74
Pasta – NER Examples
Staphylococcus aureus enterotoxin A ( SEA )
belongs to a subgroup of the staphylococcal
superantigens that utilizes Zn2+ in the high
affinity interaction with MHC class II molecules.
A high affinity metal binding site was described
previously in SEA cocrystallized with Cd2+ in
which the metal ion was octahedrally coordinated, involving the N-terminal serine .
[entity highlighted on this slide: PROTEIN]
75
Named Entity
Recognition & Interpretation
<DRUG> Thalidomide </DRUG> was found to be
highly effective in managing the <TISSUE>
cutaneous </TISSUE> manifestations of
<DISEASE> leprosy </DISEASE> (<DISEASE>
erythema nodosum leprosum </DISEASE>) and
even to be superior to <DRUG> aspirin </DRUG>
(<DRUG> acetylsalicyclic acid </DRUG>) in
controlling <DISEASE> leprosy-associated fever
</DISEASE>
76
Two Types of NER Methods
Human Knowledge Engineering
• rule based
• developed by experienced language engineers
• based on human intuition
• requires only a small amount of plain training data
• development can be very time consuming
• some changes may be hard to accommodate
(Supervised) Machine Learning Systems
• use statistics or other machine learning techniques
• developers do (almost) not need linguistic expertise
• fully automatic
• requires large amounts of annotated training data
• annotators are cheap (but you get what you pay for!)
• some changes may require re-annotation of the entire training corpus
78
Naïve NER Method: List Look-up
• Recognize entities stored in given lists
• (gazetteers, e.g., online phone directories, yellow pages)
• Advantages:
• simple, fast, language independent, easy to
retarget (just create lists)
• Disadvantages:
• impossible to enumerate all names and name
variants, collection and maintenance of lists
79
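A minimal sketch (not from the slides) of naive list look-up against a tiny gazetteer; the entries are invented for illustration, and a realistic gazetteer would need far more entries and better multi-word matching.

    # Naive gazetteer look-up: prefer longer matches over a small dictionary of names.
    gazetteer = {
        ("Texas", "Instruments"): "ORGANIZATION",
        ("Paris",): "LOCATION",
        ("Sony",): "ORGANIZATION",
    }

    def lookup(tokens):
        entities, i = [], 0
        while i < len(tokens):
            for length in (2, 1):                       # try the longer match first
                cand = tuple(tokens[i:i + length])
                if cand in gazetteer:
                    entities.append((" ".join(cand), gazetteer[cand]))
                    i += length
                    break
            else:
                i += 1                                  # no entry starts here
        return entities

    print(lookup("Sony and Texas Instruments opened offices in Paris".split()))
    # [('Sony', 'ORGANIZATION'), ('Texas Instruments', 'ORGANIZATION'), ('Paris', 'LOCATION')]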
NER by Pattern Recognition
• Names often have internal structure; these components can be either stored
or guessed, e.g., for "Location" we have
RegEx-style constraints such as:
Capitalized Word + {City, Forest, Center, River}
which yields: Sherwood Forest, Manchester City, Rhine River
Capitalized Word + {Street, Boulevard, Avenue, Road}
which yields: Portobello Street, Washington Avenue
80
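A minimal sketch (not from the slides) of such a pattern written as a Python regular expression.

    import re

    # Capitalized word followed by a location keyword, as in the rules above.
    location_pattern = re.compile(
        r"\b[A-Z][a-z]+ (?:City|Forest|Center|River|Street|Boulevard|Avenue|Road)\b")

    text = "They walked from Portobello Street through Sherwood Forest to the Rhine River."
    print(location_pattern.findall(text))
    # ['Portobello Street', 'Sherwood Forest', 'Rhine River']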
NER by Expressive Rules
• Context-sensitive rules of the kind:
A → B \ C / D
– A is a set of attribute-value expressions and
optional score, the attributes refer to elements
of the input token feature vector
– B, C, D are sequences of attribute-value pairs
and regular expressions; variables are also
supported
– B and D are left and right context, respectively,
and can be empty (hint: read backwards!)
Example: [syn=NP, sem=ORG] (0.9) →
  \ [norm="university"], [token="of"],
  [sem=REGION|COUNTRY|CITY] / ;
81
NER by Machine Learning
• NE task is frequently broken down into two parts:
– Recognizing the entity boundaries
– Classifying the entities in the NE categories
• Features are at least as important as the choice of
the ML method
– Simple pattern matching of orthographic features:
capitalization, punctuation marks, numerical symbols
– Windows for lexical features (e.g., “Mr.” for persons)
– Affix features ("-ase" for proteins, "-ectomy" for
medical procedures, etc.)
– POS info (and chunks)
• Major Approaches (ML is a study topic on its own!)
– Maximum Entropy [Chieu & Ng, 2002]
– Hidden Markov Models [Bikel et al., 1999]
– Support Vector Machines [Takeuchi & Collier, 2002]
82
Resources for NLP
• Empirical (Learning) Paradigm for NLP
• Types of Resources
– Language data (plain, annotated)
– Systems for acquiring and maintaining
language data
– Computational lexicons and ontologies
– NLP Core Engines
– NLP Application Systems
– Machine Learning Resources
• Methodological Issues of NLP
Resources
83
Language Data
• Plain language data
– Just text or speech
• ASCII, pdf, HTML/SGML
• Annotated language data
– Enriched by linguistic meta-data
• Linguistic annotation languages (XML)
84
Plain Language Data
• Mixed text collections
– British National Corpus (BNC)*
– Brown Corpus* / LOB Corpus*
• Newspaper collections
– Wall Street Journal
– IdS-Korpora
• The Web
* have POS annotations as well
85
British National Corpus (BNC)
• 100M word collection (some 4,050 texts)
of 20th century British English
• Written part (90%)
  – Regional and national newspapers
  – Specialist periodicals and journals (various genres)
  – Academic books and popular fiction
  – Letters, memoranda, school and university essays
• Spoken part (10%)
– Informal conversations (different ages, regions, social
classes)
– Formal business and government meetings
– Radio shows and phone-ins
• http://www.natcorp.ox.ac.uk/
86
British National Corpus (BNC)
• Encoding based on ‚Guidelines of the Text
Encoding Initiative‘ (TEI),
– using ISO standard 8879 (SGML: Standard
Generalized Markup Language)
• Whole collection is POS-tagged
– using the CLAWS tagger for the C5 tag set (C7 is
much more elaborate)
– Error rate: 1.7%
– Tagging ambiguity for 4.7% of all tags
87
Brown Corpus
• 1M word collection (500 texts) of Standard
American English
• Written texts only
– Press (reportage, editorials, reviews)
– Religion, skills and hobbies, popular lore, belles
lettres
– Fiction (mystery, science, adventure, romance)
• Fully tagged version exists
88
Lancaster-Oslo/Bergen Corpus
• 1M word collection (500 texts) of Standard
British English (counterpart of Brown)
• Written texts only
– Press (reportage, editorials, reviews)
– Religion, skills and hobbies, popular lore, belles
lettres
– Fiction (mystery, science, adventure, romance)
• Fully tagged version exists
89
Language Data Repositories
• Linguistic Data Consortium
– „Catalog“ option
– „LDC Online" provides you with a guest
account
http://www.ldc.upenn.edu/
90
Language Data Repositories
• Linguist List
– Open Language Archives Community
– „Text & Computer Tools“ button
• Texts and Corpora
– „Language Resources“ button
• Texts and Corpora
http://linguistlist.org/olac
91
Language Data Repositories
• European Language Resources Association
(ELRA)
– „R&D Catalog“ option
– Spoken LRs
  • Telephone recordings
  • Desktop/microphone recordings
  • Broadcast resources
  • Speech related resources
– Written LRs
• Corpora
• Mono- and multilingual lexicons
– (Domain-specific) Terminological resources
– Multimodal/multimedia LRs
http://www.elra.info/
92
Language Data Repositories
• Natural Language Software Registry
  – Annotation tools
  – Evaluation tools
  – Language Resources
  – Multimedia
  – Multimodality
  – NLP Development Aid
  – Spoken Language
  – Written Language
http://registry.dfki.de
93
Annotated Language Data
• Levels of annotation
– Formal text structure processing
• Paragraphs, sentences, tokens
– Syntactic mark-up
• Parts of speech
• Shallow syntactic structures: chunks
• Deep syntactic structures: parses
– Semantic mark-up
• Named entities
• Propositions, predicate-argument structures
– Discourse mark-up
• Referential relations
• Rhetorical relations
94
Annotation Styles
• In-line annotation
– Mark-ups appear as integral part of the
original text
• This is an <XMLTag> in-line </XMLTag>
annotation
• Stand-off annotation
– Mark-ups appear distinct from the original
text (e.g., in a different window)
• This is a stand-off annotation
– <XMLTag StartChar: 11, XMLTag EndChar: 19>
95
General Language Corpora for
Syntactic Annotation
• Penn Treebank (U Penn)
– language: English (general language)
– text genre: mostly newspaper articles (Wall Street
Journal)
– size: 1,200,000 (annotated) tokens
– Syntactic tagging based on set of 45 tags
– Syntactic phrase structures (parse trees) based on
Government-Binding grammar
– No named entity annotation
– But propositional annotation: PropBank
http://www.cis.upenn.edu/~treebank/
96
General Language Corpora for
Proposition Annotation
• PropBank (U Penn)
– language: English (general language)
– text genre: financial newspaper articles (Wall
Street Journal)
– size: 300,000 (annotated) tokens
– proposition format:
• [ subject - predicate - object ]
– “semantic” counterpart of Penn Treebank
http://www.cis.upenn.edu/~ace/
97
General Language Corpora
for Discourse Analysis
• RST Corpus (ISI/USC, USA)
– language: English
– size: 385 documents, i.e., 176,000 tokens;
21,789 elementary discourse units (EDUs)
– text genre: newspaper articles (Wall Street
Journal)
– Rhetorical Structure Theory (RST)
• 90 coherence relations
98
Penn TreeBank: Sizes and Genres
[table of treebank sizes and genres not reproduced in this transcript]
99
Penn TreeBank POS Tag Set
100
PTB POS Annotation Process
• Four annotators: Grad students of linguistics
• Comparison of two annotation styles on a 16,000
word sample:
– „Tagging“:
• completely manual annotation
– „Correcting“:
• automatical POS tagging and subsequent manual correction
• Inter-annotator disagreement:
  – "Tagging": 7.2%
  – "Correcting": 4.1%
• Comparison of accuracy with benchmark version (disagreement):
  – "Tagging": 5.4%
  – "Correcting": 4.0%
101
Illustration of the „Correcting“ Mode
• Training of annotators took 15h
• Annotation speed (after one month of training):
> 3000 words/h
• Twice as fast as "Tagging"!
102
Syntactic Annotation of PTB
• Correction of false automatic parser output
as provided by the Fidditch parser (Hindle
1989):
– Outputs only one analysis per sentence
– No attachments when parser is unsure about
attachment decision
– Alternative solution: decomposition of sentence
structure into sets of partial trees
→ partial sentence structure description
– Good lexicon and grammar coverage
• Task of annotators is mainly to „glue“ (i.e.,
to attach) partial phrase structure trees
– Less time-consuming than re-bracketing the
entire parser output
103
Penn Treebank Phrasal Tag Set
104
Partially bracketed output from Fidditch
105
Automatic simplification of the output from Fidditch
106
After "Correcting" by the annotators
107
TiGer Corpus
• 0.9M word collection (50K sentences) of German-language newspaper articles (FR)
• http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
• morphological, POS, parse tree tagging
• Treebank query tool TiGer Search
108
TiGer Corpus
109
TiGer Search (NP)
110
STTS Tag Set for German (1/2)
• ADJA (attributive adjective): [das] große [Haus]
• ADJD (adverbial or predicative adjective): [er fährt] schnell; [er ist] schnell
• ADV (adverb): schon, bald, doch
• APPR (preposition; left part of circumposition): in [der Stadt], ohne [mich]
• APPRART (preposition with article): im [Haus], zur [Sache]
• APPO (postposition): [ihm] zufolge, [der Sache] wegen
• APZR (right part of circumposition): [von jetzt] an
• ART (definite or indefinite article): der, die, das; ein, eine, ...
• CARD (cardinal number; ordinal numbers are tagged as ADJA): zwei [Männer], [im Jahre] 1994
• FM (foreign-language material): [Er hat das mit "] A big fish [" übersetzt]
• ITJ (interjection): mhm, ach, tja
• KOUI (subordinating conjunction with "zu" and infinitive): um [zu leben], anstatt [zu fragen]
• KOUS (subordinating conjunction with clause): weil, daß, damit, wenn, ob
• KON (coordinating conjunction): und, oder, aber
• KOKOM (comparative conjunction): als, wie
• NN (common noun): Tisch, Herr, [das] Reisen
• NE (proper noun): Hans, Hamburg, HSV
• PDS (substituting demonstrative pronoun): dieser, jener
• PDAT (attributive demonstrative pronoun): jener [Mensch]
• PIS (substituting indefinite pronoun): keiner, viele, man, niemand
• PIAT (attributive indefinite pronoun without determiner): kein [Mensch], irgendein [Glas]
• PIDAT (attributive indefinite pronoun with determiner): [ein] wenig [Wasser], [die] beiden [Brüder]
• PPER (non-reflexive personal pronoun): ich, er, ihm, mich, dir
• PPOSS (substituting possessive pronoun): meins, deiner
• PPOSAT (attributive possessive pronoun): mein [Buch], deine [Mutter]
111
STTS Tag Set for German (2/2)
• PRELS (substituting relative pronoun): [der Hund ,] der
• PRELAT (attributive relative pronoun): [der Mann ,] dessen [Hund]
• PRF (reflexive personal pronoun): sich, einander, dich, mir
• PWS (substituting interrogative pronoun): wer, was
• PWAT (attributive interrogative pronoun): welche [Farbe], wessen [Hut]
• PWAV (adverbial interrogative or relative pronoun): warum, wo, wann, worüber, wobei
• PAV (pronominal adverb): dafür, dabei, deswegen, trotzdem
• PTKZU ("zu" before infinitive): zu [gehen]
• PTKNEG (negation particle): nicht
• PTKVZ (separated verb particle): [er kommt] an, [er fährt] rad
• PTKANT (answer particle): ja, nein, danke, bitte
• PTKA (particle with adjective or adverb): am [schönsten], zu [schnell]
• TRUNC (first element of a truncated compound): An- [und Abreise]
• VVFIN (finite verb, full): [du] gehst, [wir] kommen [an]
• VVIMP (imperative, full): komm [!]
• VVINF (infinitive, full): gehen, ankommen
• VVIZU (infinitive with "zu", full): anzukommen, loszulassen
• VVPP (past participle, full): gegangen, angekommen
• VAFIN (finite verb, auxiliary): [du] bist, [wir] werden
• VAIMP (imperative, auxiliary): sei [ruhig !]
• VAINF (infinitive, auxiliary): werden, sein
• VAPP (past participle, auxiliary): gewesen
• VMFIN (finite verb, modal): dürfen
• VMINF (infinitive, modal): wollen
• VMPP (past participle, modal): gekonnt, [er hat gehen] können
• XY (non-word containing special characters): 3:7, H2O, D2XW3
112
Penn Proposition (Prop) Bank (2000 – )
• Predicate/argument structure (PAS) along
syntactic subcategorization frames
• Focus on verbs (events) and their syntactic
arguments (participants)
– later phases: nominalizations, adjectives and
prepositions
• Linguistic heritage:
– Verb classes for the English language (Levin 1993)
– with focus on semantic considerations (semantic
or theta roles)
• Large coverage is a major goal
113
Example for Propositions (PPB)
"Bush met Blair", "Bush and Blair met", "Bush met with Blair", "Bush and Blair had a meeting"
(related verbs shown: battle, wrestle, join, debate, consult)
Proposition: meet(Bush, Blair)
Frameset: meet(Somebody1, Somebody2)
...
"When Bush met Blair on Thursday they discussed the stabilization of Iraq."
meet(Bush, Blair)
discuss([Bush, Blair], stabilize(X, Iraq))
114
Penn Treebank Sentence
Analysts have been expecting a GM-Jaguar pact that would give the U.S. car maker an eventual 30% stake in the British company.

(S (NP-SBJ Analysts)
   (VP have
       (VP been
           (VP expecting
               (NP (NP a GM-Jaguar pact)
                   (SBAR (WHNP-1 that)
                         (S (NP-SBJ *T*-1)
                            (VP would
                                (VP give
                                    (NP the U.S. car maker)
                                    (NP (NP an eventual (ADJP 30 %) stake)
                                        (PP-LOC in (NP the British company))))))))))))
115
Penn PropBank Sentence
(S Arg0 (NP-SBJ Analysts)
   (VP have
       (VP been
           (VP expecting
               Arg1 (NP (NP a GM-Jaguar pact)
                        (SBAR (WHNP-1 that)
                              (S Arg0 (NP-SBJ *T*-1)
                                 (VP would
                                     (VP give
                                         Arg2 (NP the U.S. car maker)
                                         Arg1 (NP (NP an eventual (ADJP 30 %) stake)
                                                  (PP-LOC in (NP the British company))))))))))))

expect(Analysts, GM-J pact)
give(GM-J pact, 30% stake, US car maker)
116
PPB Annotation Principles
• Search for the most frequently used predicates (verbs)
in the PTB
• Survey of the „usage“ of a certain predicate
– Considering the number of evidences in the corpus
– Selection of roles which
• occur frequently
• are „semantically“ necessary
– Indexing of roles (arguments) according to the (Arg0 ... Arg5)
scheme yields distinct framesets for a verb
• Arg0: prototypical agent
• Arg1: prototypical patient or theme
• Arg2-5: no systematic generalization applies
• Propositional annotation is based on a sentence‘s PTB
parse structure and the availability of the framesets
• Additional annotation of verbs by temporal, aspectual
and voice information (ArgMs)
117
PPB Annotation Principles:
Framesets
• Frames for more than
3,300 verbs exist
• 4,500 framesets exist
indicating an average
polysemy rate of 1.36
• Classical Zipfian
distribution for framesets:
‚go‘ has 20 FSs, ‚come‘,
‚get‘, ‚make‘, ‚take‘, etc.
more than a dozen,
2,581 out of 3,342 verbs
have only a single one
118
PPB Annotation Principles (cont.)
• Extraction of all sentences which contain a
given verb
• 1st run: automatic tagging
http://www.cis.upenn.edu/~josephr/TIDES/index.html#lexicon
• 2nd run: “Double blind hand correction”
– Basically carried out by linguistics students
(undergraduates)
– Tagging tool highlights discrepancies
• 3rd run: “Salomonization”
– Judge’s decision (by project leader?)
– approximately 5% of the verbs are affected
119
PPB Inter-Annotator Agreement
• P(A) probability of interannotator agreement
• P(E) agreement expected by chance
• ArgM a set of adjunct-like arguments every verb can
take in addition to semantic roles from its roleset
120
Different Meanings of a Verb
121
Semantically Related Verbs –
Meta Frames
122
PPB Annotation Statistics
• Training time for PropBank annotators: +/- 3 days
– Less than for syntactic (bracketing) annotators
• Semi-automatic pre-annotation by already existing
frames (VerbNet – a generalization of Levin classes)
• Speed statistics
– 25 verb frames per week
– 50 (!?) predicates per person and hour
• average inter-annotator agreement: < 80%
– Still, variance ranges between 60% and 100%
• There exists an arbiter „gold standard“
– Agreement between annotators and gold standard ranges
between 45% and 100%
• The larger the potential number of arguments for a
verb, the higher the likelihood of disagreement
123
SALSA Corpus
• Saarbrücken Lexical Semantics
Annotation and Acquisition Project
• Provision of a large lexical-semantic resource for predicate-argument structure in German
• Improvement of semantic processing at the level of predicate-argument structure
• http://www.coli.uni-saarland.de/projects/salsa/
124
SALSA Goals
• Provision of a lexical-semantic resource (corpus + lexicon) for German with information about:
  – word senses at the level of a frame-semantic classification of predicates
  – semantic roles and syntactic realization patterns
• Development of methods for
  – automatic acquisition of lexical-semantic information
  – evaluation and application of lexical-semantic resources
125
SALSA Foundations
• Berkeley FrameNet database
• TIGER corpus (Saarbrücken/Stuttgart/Potsdam):
126
SALSA Annotation on TiGer Syntax
127
Sublanguage Corpora
• GENIA (U Tokyo)
– language: English (biomedical sublanguage)
– text genre: biology articles (Medline bibliographic
database)
– size: 2,000 annotated abstracts (18,500 sentences,
491,000 tokens)
• selected from a MeSH term search of “Human”, “Blood
Cells” and “Transcription Factors”
– POS tagging based on PTB tag set
– Syntactic phrase structures (beta version); PTB-style
treebank (200 abstracts only)
– Named entity annotation based on a subset of
substances (peptides, amino acids, DNA), biological
locations (organisms, tissues) involved in reactions of proteins (GENIA ontology); 100,000 bio annotations
128
Demo of Genia
Example:
„Preincubation of cells with 1,25-(OH)2D3
augmented IL-1 beta mRNA levels only
in U-937 and HL-60 cells.“
129
POS Annotation in Genia
Preincubation/NN of/IN cells/NNS
with/IN 1,25-(OH)2D3/NN
augmented/VBD IL-1/NN beta/NN
mRNA/NN levels/NNS only/RB in/IN
U-937/NN and/CC HL-60/NN cells/NNS
./.
130
Syntactic Annotation in Genia
131
Named Entity Annotation in
Genia
132
Infrastructure Requirements
• Definition of Description Languages for
– Tagging/NER: Tag Set (Syntactic, Semantic)
– Chunking/Parsing: Grammar Format
– Proposition Analysis: Proposition Format, Ontology
(Concept System, Relation Types)
– Discourse Analysis: Reference and rhetorical
relations
• Manual Creation of Corpora
– Training Coders in Applying Description Languages
– Test of Coder Reliability
• Benefit:
– Solid Foundation for Supervised Learning
133
Trained Automata
134
Introspection vs. Annotation
• Classical paradigm (introspection):
  – manual rule formulation
  – manual lexicon specification
• Empirical paradigm (induction):
  – automatic rule learning
  – automatic lexicon learning
• Annotations as meta data
  – the basis for machine learning methods
135
Medical Sublanguage vs. General Language
• Medical language as a sublanguage
  – (ad hoc) abbreviations and acronyms (o.B., V.a., COPD)
  – (idiosyncratic) measure units (mmHg, mm Hg)
  – variable forms of enumeration patterns (1., 2., ..., a), b), ...)
  – Latin-/Greek-based terminology (ulcus ventriculi)
• However: less complexity and variation than
general language
• Expect standard general-language-trained
off-the-shelf POS taggers to perform ‘ok’
• Statistically significant performance gain for
biomedical POS taggers when trained on
dedicated biomedical corpora (Wermter &
Hahn, 2004)
136
Resources for NLP
• Empirical (Learning) Paradigm for NLP
• Types of Resources
  – Language data (plain, annotated)
  – Systems for rendering language meta data
  – Computational lexicons and ontologies
  – NLP Core Engines
  – NLP Application Systems
  – Machine Learning Resources
• Methodological Issues of NLP
Resources
137
Systems for Rendering
Language Meta Data
• Software infrastructure which supports the manual
annotation processes at all levels
• Easy adaptation to user-defined annotation languages
• Visualization component
– In-line vs. stand-off
– Semantics of colors
– Graphical overlay structures
• Team support mechanisms wrt annotation
– Comparison of annotator pairs/groups
– Consensus seeking
– Built-in quality evaluation schemes (annotator agreement)
• Software engineering standards
– Version control (of annotation software)
– Change history (of annotation products)
138
Systems for Rendering
Language Meta Data
• Wordfreak
• MMax
• …
http://www.ldc.upenn.edu/annotation
139
Wordfreak
• JAVA-based language annotation tool
• Plug-in architecture allows easy
extension of Wordfreak’s functionality
http://sourceforge.wordfreak.net/api/index.html
http://sourceforge.wordfreak.net
140
Wordfreak Screenshot
• English Treebanking annotation using the
Tree Table component
141
Wordfreak Screenshot
• Chinese Treebanking using the Text
Viewer component
142
Wordfreak Screenshot
• MUC named entity annotation using the
Concordance Viewer component
143
Wordfreak Screenshot
• ACE named entity and co-reference
annotation for Arabic using the Text
Viewer component
144
MMax II
•
•
•
•
•
European Media Lab (EML), Heidelberg
Stand-off annotation
Arbitrarily many levels of annotation
Graphical rendering of relations between markables
Permanent user-definable and attribute-dependent
markable visualization
• Downloadable evaluation version with a key expiring
after a given timestamp (full version now open-source)
• Read the MMax Quick Start Guide
http://www.eml-research.de/english/research/nlp/download/mmax.php#mmax2
145
http://mmax2.sourceforge.net/
MMAX II Screenshot
• Set relation to indicate coreference
relation
146
MMAX II Screenshot
• pointer relation to link a bridging
expression to its bridging antecedent
147
Resources for NLP
• Empirical (Learning) Paradigm for NLP
• Types of Resources
  – Language data (plain, annotated)
  – Systems for rendering language data
  – Computational lexicons and ontologies
  – NLP Core Engines
  – NLP Application Systems
  – Machine Learning Resources
• Methodological Issues of NLP
Resources
148
Computational Lexicons,
Terminologies & Ontologies
• Computational Lexicons
– Language-specific information (English, Spanish, German, etc.),
cover common-sense knowledge
– Cover, at best, all linguistic description levels for a lexical item
but usually don’t
– Undetermined towards formalization, yet electronically available
• Terminologies
– Language-independent (though verbally encoded!), cover
domain-specific, expert-level knowledge
– Cover lexico-semantic information only (semantic relations)
– Informal, computational issues are (usually) of no concern
• Ontologies
– Language-independent, cover domain-specific, expert-level
knowledge
– cover conceptual information (semantic relations, semantic
integrity constraints, rules, etc.)
– Formal specifications, computational issues are a major concern
– Formal reasoning: inferences
149
Examples: Computational Lexicons,
Terminologies & Ontologies
• Computational Lexicons
– WordNet (English) & EuroWordNet
– GermaNet
– FrameNet
• (Biomedical) Terminologies
– Unified Medical Language System (UMLS)
– GENIA Ontology
– Open Biological Ontologies (OBO)
• Gene Ontology (GO)
• Ontologies
– Formal reasoning (for text understanding)
150
WordNet
• English WordNet (V3.0)
– semantic (relation) lexicon of English
(general language)
• no morphology!, no syntax!, no etymology
– groupings of words into sets of synonyms
(synsets)
– English definitions for lexical entries/synsets
(glosses)
– defines semantic relations between synsets
– covers (base forms of) nouns, verbs,
adjectives, adverbs
– Size: more than 155,000 lexical entries
http://wordnet.princeton.edu/
151
WordNet
• EuroWordNet
  – Portuguese, Spanish, Catalan, Basque
  – French
  – Italian
  – German (not fully freely available)
  – Russian
  – ...
• Global WordNet
  – Arabic
  – Mandarin Chinese
  – Hindi
  – ...
  http://globalwordnet.org
152
WordNet SynSets and Glosses
• Nouns
S: (n) jump, leap (a sudden and decisive increase) "a jump in attendance"
direct hyponym | full hyponym
S: (n) quantum leap, quantum jump (a sudden large increase or advance) "this may not
insure success but it will represent a quantum leap from last summer„
direct hypernym | inherited hypernym| sister term derivationally related form
S: (n) leap, jump, saltation (an abrupt transition) "a successful leap from college to the major leagues"
S: (n) jump ((film) an abrupt transition from one scene to another)
S: (n) startle, jump, start (a sudden involuntary movement) "he awoke with a start"
S: (n) jump, parachuting (descent with a parachute) "he had done a lot of parachuting in the army"
S: (n) jump, jumping (the act of jumping; propelling yourself off the ground) "he advanced in a series
of jumps"; "the jumping was unexpected"
153
WordNet Synsets and Glosses
• Verb
S: (v) jump, leap, bound, spring (move forward by leaps and bounds) "The horse bounded across
the meadow"; "The child leapt across the puddle"; "Can you jump over the fence?"
S: (v) startle, jump, start (move or jump suddenly, as if in surprise or alarm) "She startled when I
walked into the room"
S: (v) jump (make a sudden physical attack on) "The muggers jumped the woman in the fur coat"
S: (v) jump (increase suddenly and significantly) "Prices jumped overnight"
S: (v) leap out, jump out, jump, stand out, stick out (be highly noticeable)
S: (v) jump (enter eagerly into) "He jumped into the game"
S: (v) rise, jump, climb up (rise in rank or status) "Her new novel jumped high on the bestseller list"
S: (v) jump, leap, jump off (jump down from an elevated point) "the parachutist didn't want to jump";
"every year, hundreds of people jump off the Golden Gate bridge"; "the widow leapt into
the funeral pyre"
S: (v) derail, jump (run off or leave the rails) "the train derailed because a cow was standing on the
tracks"
S: (v) chute, parachute, jump (jump from an airplane and descend with a parachute)
S: (v) jump, leap (cause to jump or leap) "the trainer jumped the tiger through the hoop"
S: (v) jumpstart, jump-start, jump (start (a car engine whose battery is dead) by connecting it to
another car's battery)
S: (v) jump, pass over, skip, skip over (bypass) "He skipped a row in the text and so the sentence
was incomprehensible"
S: (v) leap, jump (pass abruptly from one state or topic to another) "leap into fame"; "jump to a
conclusion"; "jump from one thing to another"
S: (v) alternate, jump (go back and forth; swing back and forth between two states or conditions)
154
WordNet Relations
• Nouns
– Hypernyms
• „Y is a hypernym (more general term) of X, if every X is a
(kind of) Y“
– Hyponyms
• „Y is a hyponym (more specific term) of X, if every Y is a
(kind of) X“
• Y is a hyponym of X ⇔ X is a hypernym of Y
– Coordinate terms
• „Y is a coordinate term of X, if X and Y share a
hypernym“
– Holonyms
• „Y is a holonym (whole) of X, if (every/some?) X is a part
of Y“
– Meronyms
• „Y is a meronym (part) of X, if (every/some?) Y is a part
of X“
• Y is a meronym of X ⇔ X is a holonym of Y
155
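A minimal sketch (not from the slides) of querying these relations through the WordNet interface shipped with NLTK (assumes the WordNet data has been downloaded).

    from nltk.corpus import wordnet as wn

    dog = wn.synsets("dog", pos=wn.NOUN)[0]   # first noun sense of "dog"
    print(dog.definition())                   # the gloss
    print(dog.hypernyms())                    # more general synsets
    print(dog.hyponyms()[:3])                 # more specific synsets
    print(dog.part_meronyms())                # parts of a dog
    print(dog.member_holonyms())              # groups a dog is a member of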
WordNet Relations
• Verbs
– Hypernyms
• „the verb Y is a hypernym (more general term) of the
verb X, if the activity X is a (kind of) Y“
– e.g., travel to movement
– Troponyms
• „the verb Y is a troponym of the verb X, if the activity
Y is doing X in some manner“
– e.g., lisp to talk
– Entailment
• „the verb Y is entailed by the verb X, if by doing X
you must be doing Y“
– e.g., snoring by sleeping
– Coordinate terms
• „Y is a coordinate verb of X, if X and Y share a
hypernym“
156
WordNet Relations
• Adjectives
– Related nouns
– Participle of verb
• Adverbs
– Root adjectives
157
WordNet V3.0 Statistics
POS     Unique Strings   SynSets   Word-Sense Pairs (word/synset pairs)
Noun    117,100          82,100    146,300
Verb    11,500           13,800    25,000
Adj     21,500           18,200    30,000
Adv     4,500            3,600     5,600
Total   155,300          117,700   206,900
http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html
158
WordNet V3.0 Statistics
POS     Monosemous Words/Senses   Polysemous Words   Polysemous Senses
Noun    101,900                    16,000             44,400
Verb    6,300                      5,300              18,800
Adj     16,500                     5,000              14,400
Adv     3,700                      700                1,800
Total   128,400                    27,900             79,500
http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html
159
WordNet V3.0 Statistics
POS     Average polysemy* (incl. monosemous words)   Average polysemy* (excl. monosemous words)
Noun    1.24                                          2.79
Verb    2.17                                          3.57
Adj     1.40                                          2.71
Adv     1.25                                          2.50
* number of synsets that contain the word
http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html
160
GermaNet (V 6.0)
• German WordNet (93,500 lexemes)
  – 72,000 nouns
  – 13,000 verbs
  – 8,600 adjectives
• 69,600 synsets
  – 53,800 nouns
  – 9,000 verbs
  – 6,000 adjectives
• 82,000 relations
  – 75,000 hypernym/hyponym
  – 5,000 holonym/partonym
http://www.sfs.uni-tuebingen.de/GermaNet/
161
GermaNet (V 5.3)
162
FrameNet
• English FrameNet
– semantic frames of English (script-style)
• no morphology!, no syntax!, no etymology
– English-style, semi-formal definitions for lexical
entries
– Statistics (Version 1.3)
• 11,000 lexical units
• 1,050 semantic frames
• 135,000 example sentences for frames (taken from
the British National Corpus [BNC] and US newswire)
http://framenet.icsi.berkeley.edu/
http://framenet.icsi.berkeley.edu/index.php?option=com_content&task=view&id=17881&Itemid=66/
• Try out: FrameGrapher
163
FrameNet Entry
FrameNet Data Search for jump
Lexical unit search results: Closest match is jump...
Lexical Unit   Frame
jump.v         Self_motion
jump.v         Traversing
jump.v         Change_position_on_a_scale
jump.v         Attack
jumper.n       Clothing
jumping.a      Lively_place
jumpsuit.n     Clothing
164
FrameNet Entry (cont.)
Self_motion
Definition: The Self_mover, a living being, moves under its own power in a directed fashion, i.e. along
what could be described as a Path, with no separate vehicle.
FEs:
Core:
Area [Area]
Semantic Type Location
Area is used for expressions which describe a general area in which motion takes place when the
motion is understood to be irregular and not to consist of a single linear path. Note that this FE should
not be used for cases when the same phrase could be used with the same meaning with a non-motion
target, since these should be annotated with the Place FE.
Direction [dir]
The direction that the Self_mover heads in during the motion.
Goal [Goal]
Semantic Type Goal
Goal is used for any expression which tells where the Self_mover ends up as a result of the
motion.
Path [Path]
Semantic Type Path
Path is used for any description of a trajectory of motion which is neither a Source nor a Goal. This
includes "middle of path'' expressions.
Self_mover [SMov] Semantic Type Sentient
Self_mover is the living being which moves under its own power. Normally it is expressed as an
external argument.
Source [Src]
Semantic Type Source
Source is used for any expression which implies a definite starting-point of motion. In prepositional
phrases, the prepositional object expresses the starting point of motion. With particles, the starting point of motion is understood from context.
165
(Biomedical) Terminologies
• Sublanguages: domain-specific
• Relational Encoding
– Is-a
– Part-of
166
UMLS –
Unified Medical Language System
• http://www.nlm.nih.gov/pubs/factsheets/umls.html
– Purpose: clinical coding, billing, document retrieval, …
– Umbrella system made up of more than 100
terminologies
– Size: 2,000,000 terms; 900,000 concepts, 12,000,000
relations
– Content: (almost) the whole of (clinical) medicine
– Lexical semantics: thesaurus relations for taxonomies,
partonomies, also other light-weight semantics
(approximately 80 additional relation types)
– Basic and variant word forms, and (quite complex) NPs
– (English) Specialist Lexicon uses conceptual grounding of UMLS for NLP applications
167
UMLS Thesauri
[Figure: UMLS as an umbrella over many source vocabularies and their domains, e.g.
SNOMED (clinical repositories), OMIM (genetic knowledge bases), MeSH (biomedical literature),
NCBI Taxonomy (model organisms), GO (genome annotations), UWDA (anatomy), and other subdomains.]
168
UMLS Tables
Concept 1                relation          Concept 2
RIGHT-SIDE-OF-HEART      narrower_rel      HEART
LEFT-SIDE-OF-HEART       part_of           HEART
ANGINA-PECTORIS          has_location      HEART
HEART                    has_part          HEART-ATRIUM
HEART                    has_part          MITRAL-VALVE
WALL-OF-HEART            part_of           HEART
BRONCHIAL-TUBERCULOSIS   has_location      BRONCHI
BRONCHIAL-TUBERCULOSIS   narrower_rel      TUBERCULOSIS
SARCOMA                  sibling           CARCINOMA
LENS-CRYSTALLINE         part_of           EYE
ACUTE-MYELOID-LEUKEMIA   has_location      BONE-MARROW
RIGHT-HAND               is_a              HAND
ALLERGIC-REACTION        associated_with   DERMATITIS-ATOPIC
LUNG                     broader_rel       ATELECTASIS
(anatomical and pathological concepts)
169
Open Biological Ontologies (OBO)
http://obo.sourceforge.net
• Coverage:
• Anatomy (cells, human,
model organisms, etc.)
• Chemical entities
• Experimental conditions
• Genomics, proteomics
• …
• Structured controlled
vocabularies (thesauri)
• Basic Relations: is-a, part-of
• OBO entry: ID, concept name,
textual definition, synonyms
170
OBO Statistics
(Dec 2009)
• More than 60 OBO ontologies
• about 50% of them contain more than 1000 terms:
• 2 x > 25 000 terms: NCI Thesaurus, FMA (Human Anatomy)
• 5 x 10 000-25 000: GO (total), disease ontology, MeSH
“ontology”, mouse anatomy stages, ChEBI (chemicals)
• 18 x 1000 -10 000 terms: molecule role (chemicals, protein
by function), human, mouse, fly, fish anatomy (some:
developmental anatomy), phenotype ontology, tissue
ontology, sequence ontology
• Less than 1000 terms: cell ontology, pathway ontology, MGED
(Microarray Gene Expression Database), relationship ontology
(amongst others)
• Rapidly growing! – check out every day (o.k., week is also fine)
http://www.obofoundry.org
171
Gene Ontology (GO)
• Purpose: Data annotation and integration for genes and
gene products (cross-species)
• Coverage: Three ontologies in one for molecular biology
• cellular component: location of a gene product, within
(sub)cellular structures and macromolecular complexes, e.g.,
nucleus or ribosome
• molecular function: the tasks performed by individual gene
products at the biochemical level, e.g., enzyme or transporter
• biological process: biological goals to which a gene product
contributes; that process is accomplished by ordered assemblies
of molecular functions, e.g., mitosis or cell growth
• 20,500 categories (95.6% with verbal definitions)
• 2 base relations; 30,500 relation instances
  • specific/general (88%): mitotic chromosome is-a chromosome
  • part/whole (12%): telomere part-of chromosome
172
http://www.geneontology.org
Snapshot of GO
[Figure: fragment of the GO graph with edges labelled I (is-a relation) and P (part-of relation)]
173
GENIA Ontology
• Purpose:
– biological named entity annotation (prerequisite
subroutine for text mining)
• Coverage:
– Cell signaling reactions in human
• substances involved in biochemical reactions
• biological locations where substances are found &
reactions take place (e.g., organisms, tissues, cells)
• 45 categories
– Informal semantics:
verbal “scope notes” as an informal phrasing of
the categories’ meaning (no definitory axioms)
• 1 base relation (is-a), 44 relation instances
http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/genia-ontology.html
174
The Complete GENIA Ontology
175
GENIA Ontology Scope Notes
“An individual member of a group of non-complex proteins, e.g., STAT1, STAT2, STAT3, or a (non-complex) protein not regarded as a member of a particular group.”
Examples: NF kappa B, CD28, IL-3, GTP-bound p21ras, HIV-1 tat, protein kinase C, (…)

“A family or a group of proteins, e.g., STATs”
Examples: antibodies, transcription factors, cytokines, cytosolic protein, T-cell receptors, DNA binding protein, (…)
176
General Shortcomings
• Category descriptions, at best, are verbally
defined
• Relations are usually undefined, their
names appeal to human/expert intuition
• (Almost) No attempt at interoperability
• Lots of unlinked fragments (still a long way
to go to some sort of ‘Bio-UMLS’)
177
Ontologies
• Formal Reasoning
• Conceptual Computation
178
Why Conceptualize?
• Nomenclatures, thesauri, ontologies, …
• “Mapping problem” due to term variation
– Natural language → domain knowledge
179
“Mapping Problem” (1/2)
Problem: Mapping a textual occurrence of a bio entity
(text token, term) to its ontological category (type)
• Orthographic variations
– Hyphens, slashes, spaces (e.g., NF-KB, NF KB,
NF/KB, NFKB)
– Upper/lower cases (e.g., NF-KB, NF-kb)
– Spelling variations (e.g., tumour vs. tumor,
oestrogen vs. estrogen, alpha vs. a)
• Lexical and phrasal variations
– Acronyms (e.g., RAR vs. retinoic acid receptor)
– Different reductions (e.g., SB2 gene vs. SB2,
thyroid hormone receptor vs. thyroid receptor)
180
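A crude but common way to neutralize the orthographic part of this mapping problem is to normalize both the text token and the dictionary entry before comparison. A minimal sketch, under the assumption that hyphens, slashes, spaces and case are the only variation handled (spelling variants and acronyms would need additional machinery):

```python
import re

def normalize(term: str) -> str:
    """Crude orthographic normalization: lower-case and drop hyphens,
    slashes and whitespace, so 'NF-KB', 'NF kb' and 'NF/KB' all collapse
    to the same key ('nfkb')."""
    return re.sub(r"[-/\s]+", "", term.lower())

variants = ["NF-KB", "NF KB", "NF/KB", "NFKB", "NF-kb"]
print({normalize(v) for v in variants})   # {'nfkb'}
```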
“Mapping Problem” (2/2)
• Semantic variations (n:m token-type relations)
– n:1 Synonyms (e.g., in FlyBase: EST-6 vs.
Esterase 6 vs. carboxyl ester hydrolase)
– 1:m Ambiguity as polysemy (e.g., ‘per’ in
FlyBase: period gene vs. clock gene)
181
Why Is Bio Terminology So Hard?
V-SNARE = Vesicle SNARE
SNARE = SNAP Receptor
SNAP = Soluble NSF Attachment Protein
NSF = N-Ethylmaleimide-Sensitive Fusion Protein
N-Ethylmaleimide = Maleic acid N-ethylimide
Fully expanded: Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
182
Why Conceptualize?
• Nomenclatures, thesauri, ontologies, …
• “Mapping problem” due to term variation
– Natural language → domain knowledge
• “Structure computing” on knowledge structures
– Lexical look-up
– Relational navigation (general-specific, is-a) – see the sketch after this slide
– Formal reasoning (inferencing)
183
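“Relational navigation” boils down to following labelled edges in the knowledge structure, e.g. collecting all transitive is-a ancestors of a concept. A minimal sketch over a small illustrative is-a graph:

```python
# Toy is-a hierarchy (illustrative only).
IS_A = {
    "Homo sapiens": ["Primates"],
    "Primates": ["Mammalia"],
    "Mammalia": ["Animals"],
    "Animals": ["Eukaryotes"],
}

def ancestors(concept):
    """Relational navigation: all concepts reachable via is-a edges."""
    result, stack = set(), [concept]
    while stack:
        for parent in IS_A.get(stack.pop(), []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

print(ancestors("Homo sapiens"))
# {'Primates', 'Mammalia', 'Animals', 'Eukaryotes'}
```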
“Structure Computing”
How Things Got Started …
[Figure: tree-of-life fragment with Eubacteria, Archaea and Eukaryotes (Green plants (Green Algae, Higher Plants), Fungi, Animals … Mammalia, Primates, Homo sapiens)]
Tree of life web project: http://tolweb.org/tree/phylogeny.html
184
“Structure Computing”
… and Where We Are Heading to
[Figure: the same tree-of-life fragment, now with the taxonomic links made explicit as is-a relations (e.g., Homo sapiens is-a Primates)]
185
Why Conceptualize?
• Nomenclatures, thesauri, ontologies, …
• “Mapping problem” due to term variation
– Natural language → domain knowledge
• “Structure computing” on knowledge structures
– Lexical look-up
– Relational navigation (general-specific, is-a)
– Formal reasoning (inferencing)
• Bio view: data annotation & data integration
186
Bio View:
Swiss-Prot and GO Terms
GO terms attached to a Swiss-Prot entry:
Function: required for T-cell proliferation and other activities crucial to the regulation of the immune response
Location: secreted protein
http://www.expasy.org/sprot/
187
Ontologies and Data Annotation
[Diagram: an ontology mediating the annotation of a fact database (UniProt) and a literature collection (PubMed)]
188
Ontologies and Data Integration
[Diagram: an ontology integrating a literature collection (PubMed) with several fact databases (UniProt and the Yeast, FlyBase and Mouse databases)]
189
Why Conceptualize?
• Nomenclatures, thesauri, ontologies, …
• “Mapping problem” due to term variation
– Natural language → domain knowledge
• “Structure computing” on knowledge structures
– Lexical look-up
– Relational navigation (general-specific, is-a)
– Formal reasoning (inferencing)
• Bio view: data annotation & data integration
• NLP view: text-based content management
– Category classification (IR)
– Semantic interpretation (IE, TM)
190
NLP view:
Two Text-Based CM Paradigms
Information Retrieval,
Document Classification
Information Extraction,
Text Mining
191
Information Extraction
Source text:
“Thalidomide was found to be highly effective in managing the cutaneous manifestations of leprosy (erythema nodosum leprosum) and even to be superior to aspirin (acetylsalicylic acid) in controlling leprosy-associated fever.”

Extracted template 1:
Disease: leprosy
Drug: Thalidomide
Effective-for: Thalidomide, cutaneous manifestations of leprosy

Extracted template 2:
Disease: leprosy-associated fever
Drug: Thalidomide, Aspirin
Effective-for: [ Thalidomide > Aspirin ], leprosy-associated fever
192
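The two templates extracted from the Thalidomide sentence can also be written down as plain records; the field names below simply mirror the slide and do not follow any fixed IE schema:

```python
# Extracted templates from the Thalidomide/leprosy sentence as dictionaries.
templates = [
    {
        "Disease": "leprosy",
        "Drug": ["Thalidomide"],
        "Effective-for": ("Thalidomide", "cutaneous manifestations of leprosy"),
    },
    {
        "Disease": "leprosy-associated fever",
        "Drug": ["Thalidomide", "Aspirin"],
        # 'Thalidomide > Aspirin' on the slide: Thalidomide ranked above Aspirin.
        "Effective-for": (["Thalidomide", "Aspirin"], "leprosy-associated fever"),
    },
]
for t in templates:
    print(t["Disease"], "->", t["Drug"])
```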
Ontologies for Information Extraction
S1 A mitochondrion provides the cell
with energy in the form of ATP.
S2 The organelle possesses its own
genetic material which is inherited
maternally.
S3 The ATP synthesizing enzyme
ATP synthase is located in the
inner membrane.
193
Ontologies for Information Extraction
S1 A mitochondrion provides the cell
with energy in the form of ATP.
is-a
S2 The organelle possesses its own
genetic material which is inherited
maternally.
S3 The ATP synthesizing enzyme
ATP synthase is located in the
inner membrane.
194
Ontologies for Information Extraction
S1 A mitochondrion provides the cell
with energy in the form of ATP.
S2 The mitochondrion possesses its own
genetic material which is inherited
maternally.
S3 The ATP synthesizing enzyme
ATP synthase is located in the
inner membrane.
195
Ontologies for Information Extraction
S1 A mitochondrion provides the cell
with energy in the form of ATP.
S2 The mitochondrion possesses its own
genetic material which is inherited
maternally.
part-of
S3 The ATP synthesizing enzyme
ATP synthase is located in the
inner membrane.
196
Conceptual Normalization
S1 A mitochondrion provides the cell
with energy in the form of ATP.
S2 The mitochondrion possesses its own
genetic material which is inherited
maternally.
S3 The ATP synthesizing enzyme
ATP synthase is located in the
mitochondrial inner membrane.
197
Semantic Interpretation
“Normalized” Text Level
S1 A mitochondrion provides the cell
with energy in the form of ATP.
Propositional Level
• Provide [mitoch., cell, energy]
S2 The mitochondrion possesses its own
genetic material which is inherited
maternally.
• Possess [mitoch., gen. material]
S3 The ATP synthesizing enzyme
ATP synthase is located in the
mitochondrial inner membrane.
• Synthesize [ATP synthase, ATP]
• Located-in [ATP synthase, mitoch. inner membrane]
198
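The propositional level can likewise be written down as predicate–argument tuples; a minimal sketch with a toy query helper (names follow the slide, the helper is made up):

```python
# Propositions from the mitochondrion passage as predicate-argument tuples.
propositions = [
    ("Provide",    ("mitochondrion", "cell", "energy")),
    ("Possess",    ("mitochondrion", "genetic material")),
    ("Synthesize", ("ATP synthase", "ATP")),
    ("Located-in", ("ATP synthase", "mitochondrial inner membrane")),
]

def about(entity):
    """All propositions mentioning a given entity."""
    return [(pred, args) for pred, args in propositions if entity in args]

print(about("ATP synthase"))
```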
Reasoning on Medical
Ontologies
1. Taxonomy
„is-a“
[Figure: Left Thumb is-a Thumb, Thumb is-a Finger]
199
Reasoning on Medical
Ontologies
1. Taxonomy
„is-a“
2. Mereology
„part-of“
[Figure: Thumbnail, with a part-of link added]
200
Reasoning on Medical
Ontologies
1. Taxonomy
„is-a“
2. Mereology
„part-of“
[Figure: Thumbnail part-of Thumb]
201
Reasoning on Medical
Ontologies
1. Taxonomy
„is-a“
2. Mereology
„part-of“
[Figure: Thumbnail part-of Thumb, Thumb part-of Hand]
202
Reasoning on Bio Ontologies
1. Taxonomy
„is-a“
[Figure: ATPase is-a Enzyme, Enzyme is-a Protein]
203
Reasoning on Bio Ontologies
1. Taxonomy
„is-a“
2. Mereology
„part-of“
[Figure: Metaphase]
204
Reasoning on Bio Ontologies
1. Taxonomy
„is-a“
2. Mereology
„part-of“
[Figure: Metaphase part-of Mitosis; Cell Cycle also shown]
205
Reasoning on Bio Ontologies
1. Taxonomy
„is-a“
2. Mereology
„part-of“
[Figure: Metaphase part-of Mitosis, Mitosis part-of Cell Cycle]
206
Reasoning on Bio Ontologies
1. Taxonomy
„is-a“
2. Mereology
„part-of“
[Figure: Metaphase part-of Mitosis, Mitosis part-of Cell Cycle, plus the inferred Metaphase part-of Cell Cycle]
207
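A minimal sketch of the mereological reasoning built up on the preceding slides: part-of is treated as transitive, so Metaphase part-of Mitosis and Mitosis part-of Cell Cycle yield Metaphase part-of Cell Cycle (and likewise Thumbnail part-of Hand). Graph and helper are illustrative only:

```python
# Toy mereology: direct part-of edges as on the slides.
PART_OF = {
    "Metaphase": ["Mitosis"],
    "Mitosis": ["Cell Cycle"],
    "Thumbnail": ["Thumb"],
    "Thumb": ["Hand"],
}

def all_wholes(part):
    """Transitive closure of part-of: everything `part` is (indirectly) part of."""
    wholes, stack = set(), [part]
    while stack:
        for whole in PART_OF.get(stack.pop(), []):
            if whole not in wholes:
                wholes.add(whole)
                stack.append(whole)
    return wholes

print(all_wholes("Metaphase"))   # {'Mitosis', 'Cell Cycle'}
print(all_wholes("Thumbnail"))   # {'Thumb', 'Hand'}
```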
Ontology Design Workflow
• Select a set of foundational relations
• Define the ground axioms for these relations
• Establish constraints across these relations
• Define a set of formal properties induced by these relations
• Introduce the basic categories & classify the
relevant kinds of domain entities accordingly
• Elicit the dependencies and interrelations
among the basic categories
208
Fundamental Distinctions
• Universals (classes, types, concepts)
vs. particulars (instances, tokens, concrete
& countable entities in the world which exist
in space and time)
• Continuants (entities which endure, or
continue to exist, through time while
undergoing different sorts of changes)
• e.g., molecule, cell, membrane, organ
vs. occurrents (processes, events – entities which unfold themselves in successive temporal phases)
• e.g., ion transport, cell division, breathing
209
Upper Ontologies
DOLCE
210
Relations Ontology (RelO)
For all x, y in M: xRy and yRx → x = y (antisymmetry)
[Figure: the RelO relation hierarchy, grouped into foundational, spatial, temporal and participation relations]
C part-of C1 & C1 has-part C
Class Relations!
211
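The axiom fragment on the slide can be spelled out in standard notation. A hedged reconstruction: antisymmetry as stated, plus transitivity, which part-of is usually required to satisfy in RelO-style ontologies (the latter is added for completeness, not read off the slide):

```latex
% Antisymmetry of a relation R over a set M (as stated on the slide):
\forall x, y \in M:\ (x\,R\,y \wedge y\,R\,x) \rightarrow x = y

% Transitivity, the other property usually required of part-of
% (added here for completeness):
\forall x, y, z:\ (x\,R\,y \wedge y\,R\,z) \rightarrow x\,R\,z
```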
Resources for NLP
• Empirical (Learning) Paradigm for NLP
• Types of Resources
– Language data (plain, annotated)
– Systems for rendering language data
– Computational lexicons and ontologies
– NLP Core Engines
– NLP Application Systems
– Machine Learning Resources
• Methodological Issues of NLP
Resources
212
NLP Core Engines
• OpenNLP Tools
– Part of the Open Source initiative
– Sentence splitter, tokenizer, POS tagger, shallow
and full parser, NE recognizer
– All components based on maximum entropy model
• http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
http://opennlp.sourceforge.net
213
Maximum Entropy Model
[http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html]
– A statistical framework for modeling the behavior of random processes
– A sample of output (i.e., incomplete knowledge!) from a statistical process is available
– Goal: predict the future behavior of the process from this sample as accurately as possible
– Technically:
• To select a model from a set C of allowed probability distributions, choose the model p* from C with maximum entropy H(p)
• Conditional entropy: measures the uniformity of the conditional distribution p(y|x) (formulas reconstructed after this slide)
• ~p denotes observations from the sample data, p denotes values from the statistical model
214
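The formulas shown as images on the original slide are the standard ones from the Berger tutorial linked above; reconstructed here, with \tilde{p} the empirical distribution estimated from the sample and p the model distribution:

```latex
% Conditional entropy of the model p, weighted by the empirical
% distribution \tilde{p}(x) over contexts x:
H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, \log p(y \mid x)

% Maximum entropy principle: among all models in the constraint set C,
% pick the one with maximum conditional entropy:
p^{*} = \operatorname*{argmax}_{p \in C} H(p)
```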
Resources for NLP
• Empirical (Learning) Paradigm for NLP
• Types of Resources
– Language data (plain, annotated)
– Systems for rendering language data
– Computational lexicons and ontologies
– NLP Core Engines
– NLP Application Systems
– Machine Learning Resources
• Methodological Issues of NLP
Resources
215
NLP Application Systems
• Document Retrieval
– LUCENE
• Information Extraction
– GATE
216
LUCENE
– Document retrieval system
• Ranked search
• Powerful querying
– Phrase queries,
– Wildcard queries
– Proximity queries
• Sorting by any field
– JAVA implementation, Open Source Project
http://lucene.apache.org/java/docs/index.html
217
GATE
– Information extraction workbench
• Annotation tool
• IE engine, including sentence segmenter,
tokenizer, POS tagger, parser, NE recognizer, coreference resolver, template extractor (IE)
– Free Open Source framework
– JAVA implementation under UIMA
http://gate.ac.uk
218
GATE Screenshots
[Six screenshot slides, 219–224]
Resources for NLP
• Empirical (Learning) Paradigm for NLP
• Types of Resources
– Language data (plain, annotated)
– Systems for rendering language data
– Computational lexicons and ontologies
– NLP Core Engines
– NLP Application Systems
– Machine Learning Resources
• Methodological Issues of NLP
Resources
225
Machine Learning Resources
• CMU Machine Learning Software Packages
http://www.cs.cmu.edu/project/ai-repository/ai/areas/learning/systems/0.html
• David Aha’s Machine Learning Page
http://home.earthlink.net/~dwaha/research/machine-learning.html#software
• …
• Machine Learning Resources (Meta Page)
http://www.sciencemag.org/feature/data/compsci/machine_learning.dtl
226
Computational Linguistics Repositories
• Linguistic Data Consortium (LDC)
– http://www.ldc.upenn.edu/
• European Language Resources Association (ELRA)
– http://www.elra.info/
227
System Competitions (Shared Tasks)
• Core linguistic functionality
– ParsEval, SemEval, RTE, …
• Document retrieval
– TREC, CLEF
• Information extraction
– MUC, ACE
• Text summarization
– SUMMAC
• Domain dependence
– BioCreative, BioNLP Shared Tasks, CALBC
• Speech recognition
228
System Competitions (1/3)
1. A (trustworthy, fair, objective) organizer is constituted
• Define the topic of the challenge
• Select texts, formats, etc.
• Provide the competition software
2. Preparation of the gold standard (ground truth)
• Split into
• Training set (70/90)
• Test set (30/10)
229
System Competitions (2/3)
3. Release of the training set (duration: 3–6 weeks)
• Participants train their systems on the training set
• They compare their own results against the development set
• At the end of the training phase, each participant freezes n optimal system states (frozen system)
4. Release of the test set (duration: 2–3 days)
• The frozen system is run on the test set
230
System Competitions (3/3)
5. Submission of results to the organizer
6. Evaluation of the test-set run by the organizer
• Results are compared against the gold standard
• Standardized metrics for quality measurement (precision, recall, F-score; see the sketch after this slide)
7. Comparison and ranking of all participants by the organizer
• anonymously (if desired)
231
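The standard metrics named in step 6 follow directly from true-positive, false-positive and false-negative counts; a minimal sketch computing the balanced F-score (F1), with purely illustrative counts:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard IE/IR evaluation metrics from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative counts only:
print(precision_recall_f1(tp=80, fp=20, fn=40))   # (0.8, 0.666..., 0.727...)
```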
NLP System Competitions: Speech Recognition
232
NLP System Competitions: Information Extraction
233
Resources for NLP
• Empirical (Learning) Paradigm for NLP
• Types of Resources
– Language data (plain, annotated)
– Systems for rendering language data
– Computational lexicons and ontologies
– NLP Core Engines
– NLP Application Systems
– Machine Learning Resources
• Methodological Issues of NLP
Resources
234
Methodological Issues
• Training set vs. Test set
– sample size, representativeness
• Annotation metrics
– Inter-annotator consistency
– Intra-annotator consistency
– κ and similar agreement coefficients – not enough on their own (see the sketch after this slide)
– What is a correct match?
• Full vs. partial overlap
• Relevance
• Ground truth
– … how silver or even bronze is it, really ?
• Competitions and evaluation rounds are
vital for measuring progress
235
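Inter-annotator consistency is typically reported with a chance-corrected coefficient such as Cohen's κ; a minimal sketch for two annotators assigning categorical labels (the label values are illustrative, not tied to any particular annotation tool):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative annotations of five tokens by two annotators:
a = ["GENE", "GENE", "PROTEIN", "CELL", "GENE"]
b = ["GENE", "PROTEIN", "PROTEIN", "CELL", "GENE"]
print(round(cohen_kappa(a, b), 3))   # 0.688
```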
Methodological Issues
• Language specifics
– General language (newspaper) vs. sublanguages
(e.g., life sciences)
– Modalities (spoken vs. written language)
– Mediality (text plus graphical objects, photos, etc.)
• Software engineering
– Version control
– Interoperability
• middleware (e.g., UIMA)
• Accessibility
– Proprietary vs. Open source
236