AX - Institute of Formal and Applied Linguistics

Building Sub-Corpora Suitable for Extraction
of Lexico-Syntactic Information
Ondřej Bojar, Institute of Formal and Applied Linguistics, ÚFAL MFF, Charles University, Prague
Conclusion
• Designed and implemented a new scripting language, AX, for selecting sentences based on linguistically motivated criteria.
• Prepared an AX script (15 filters and 21 rules) to demonstrate its utility. The script selects sentences suitable for extraction of Czech verb frames.
The extraction of frames itself has not been performed yet, but the utility of the selection can be illustrated by the improvement in the accuracy of the Czech Collins parser (measured by the number of verb occurrences with correctly assigned daughters). When the linguistically motivated selection is combined with a filter selecting only short sentences, an accuracy of 73.2 % can be achieved. Approximately 10 % of input sentences pass both filters.

Syntactic Analysis Needs Lexicons
[Diagram labels: Automatic Extraction of Lexico-Syntactic Data; Lexicon of Syntactic Behavior]

A Dependency Treebank
Treebanks contain the required syntactic information but cover too few lexemes in too few situations. For instance, the Prague Dependency Treebank (PDT; 1.5 million tokens in 98,263 sentences) covers only 5,400 Czech verbs out of an estimated total of 40,000. Only 500 verbs occur in the PDT more than 50 times. Therefore the PDT is not sufficient as a source of valencies of Czech verbs.

A Big Corpus
(with morphological information only)
Corpora without syntactic
annotation contain enough
examples but many of the
sentences available are too
complex to extract the
syntactic information automatically.
Proposed solution (AX): first “pick nice examples”, then extract the lexico-syntactic information traditionally.

AX Overall Scheme
Word forms with ambiguous morphological information are still represented as single feature structures. For example, the Czech word form má can serve either as a possessive pronoun or as a finite verb. A single feature structure can hold both of the variants:
[cat-pron, lemma-"můj", case-nom, gend-fem, num-sg
|cat-verb, lemma-"mít", tense-pres, person-third, gend-masc, num-sg]
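
To make the idea of one structure holding several variants concrete, here is a minimal Python sketch. It is not AX itself: representing a feature structure as a list of dictionaries, and the helper names `unify_variant` and `unify`, are assumptions made only for this illustration. Unifying the structure with a constraint keeps the variants that are compatible with it.

```python
def unify_variant(variant, constraint):
    """Unify one variant with a constraint; return the merged dict or None on a clash."""
    merged = dict(variant)
    for feature, value in constraint.items():
        # setdefault adds a missing feature, otherwise returns the existing value
        if merged.setdefault(feature, value) != value:
            return None
    return merged


def unify(structure, constraint):
    """Keep only the variants of an ambiguous structure compatible with the constraint."""
    survivors = []
    for variant in structure:
        merged = unify_variant(variant, constraint)
        if merged is not None:
            survivors.append(merged)
    return survivors


# The word form "má" from the text: one structure, two variants.
MA = [
    {"cat": "pron", "lemma": "můj", "case": "nom", "gend": "fem", "num": "sg"},
    {"cat": "verb", "lemma": "mít", "tense": "pres", "person": "third",
     "gend": "masc", "num": "sg"},
]

as_verb = unify(MA, {"cat": "verb"})   # only the "mít" variant survives
as_plural = unify(MA, {"num": "pl"})   # no variant survives
```

An empty result corresponds to a failed unification, which is the situation in which a filter or a rule would not match the word form.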
AX (automatic extraction) is a new scripting language designed to make the following tasks easy:
• Dealing with (ambiguous) morphological information.
• Partial parsing and grammatically consistent simplification of sentences (if needed to check for more complex phenomena).
• Selection of sentences to keep, based on linguistic criteria (both morphological and syntactic ones).
• Print-out of the simplified version of selected sentences. If the script was prepared carefully, this can already be the desired lexico-syntactic information.

Filters are used to strike out sentences not suitable for further analysis or for extracting the lexico-syntactic information:
• Early filters decide according to simple criteria, such as too many punctuation marks.
• Later filters make use of the results of partial syntactic analysis and reject sentences after a more sophisticated decision, such as discovering noun phrases ordered in a manner where syntactic ambiguity is very common and would spoil the observed verb frame.
• Filters are expressed as regular expressions of feature structures.

Rulesets are used to perform partial syntactic analysis of the sentence. Rulesets may produce several possible “readings” of the sentence. Some of the readings may be rejected by the following filters. (Formally, a reading is a sequence of feature structures.)

Sample AX Script and AX Rule
Input for the script is a sentence represented as a sequence of feature structures corresponding one to one to the input word forms. An AX script is an arbitrary sequence of filters and rulesets.

[Diagram: Arbitrary texts augmented with morphological information (not necessarily disambiguated); AX; Subcorpus of Nice Examples]

The sample script has three steps:
• A filter rejects sentences with strange symbols.
• A ruleset combines aux. + main verb. This might be ambiguous, so several readings can be generated.
• A filter rejects sentences with two main verbs.

Outcome for the three sample sentences:
• Sentence 1 was rejected by the first filter.
• Sentence 2 was accepted: reading 1 was rejected, reading 2 passed the last filter, and the final sequence of feature structures will be printed out.
• Sentence 3 was rejected because none of its readings (readings 1 to 3) passed the second filter.

Observed verbs with all the daughters recognized correctly:
                                In all the sentences    In the selected sentences only
Sentences of any length         55 %                    65 %
Sentences with up to 10 words   68 %                    73 %
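
The control flow of such a script (a filter, a ruleset, then another filter) can be sketched in Python. This is only an illustration, not the AX implementation: the names `run_script`, `no_strange_symbols`, `combine_aux_main` and `at_most_one_main_verb` are hypothetical, word forms are simplified to unambiguous dictionaries, and the toy ruleset ignores agreement and the contents of the gap.

```python
MAIN_VERB_CATS = {"complex_verb", "verb_participle"}  # assumed category names


def no_strange_symbols(reading):
    """Filter: keep the reading only if it contains no symbol tokens."""
    return all(fs["cat"] != "symbol" for fs in reading)


def combine_aux_main(reading):
    """Toy ruleset: merge each aux/participle pair, one new reading per pair."""
    results = []
    for i, aux in enumerate(reading):
        if aux["cat"] != "verb" or aux["lemma"] != "být":
            continue
        for j, main in enumerate(reading):
            if main["cat"] != "verb_participle":
                continue
            combined = {"cat": "complex_verb", "lemma": main["lemma"]}
            lo, hi = min(i, j), max(i, j)
            results.append(reading[:lo] + [combined] + reading[lo + 1:hi] + reading[hi + 1:])
    return results or [reading]  # nothing to combine: keep the reading unchanged


def at_most_one_main_verb(reading):
    """Filter: reject readings that still contain two main verbs."""
    return sum(fs["cat"] in MAIN_VERB_CATS for fs in reading) <= 1


SCRIPT = [
    ("filter", no_strange_symbols),
    ("ruleset", combine_aux_main),
    ("filter", at_most_one_main_verb),
]


def run_script(sentence, script):
    """Run filters and rulesets in order; return the readings that survive."""
    readings = [sentence]
    for kind, step in script:
        if kind == "filter":
            readings = [r for r in readings if step(r)]
        else:  # a ruleset may turn one reading into several
            readings = [new for r in readings for new in step(r)]
        if not readings:
            return []  # the sentence is rejected
    return readings


accepted = run_script(
    [{"cat": "pron", "lemma": "já"}, {"cat": "verb", "lemma": "být"},
     {"cat": "verb_participle", "lemma": "psát"}, {"cat": "noun", "lemma": "dopis"}],
    SCRIPT,
)
rejected = run_script(
    [{"cat": "verb", "lemma": "být"}, {"cat": "verb_participle", "lemma": "psát"},
     {"cat": "conj", "lemma": "a"}, {"cat": "verb_participle", "lemma": "číst"}],
    SCRIPT,
)
```

In this sketch the first sentence yields a single surviving reading, while the second produces two readings that are both rejected by the last filter, so the sentence as a whole is dropped.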
Full text, acknowledgement and the list of references in the proceedings of ESSLLI Student Session, Vienna, 2003.
A Sample AX Rule:
rule "combine main and aux. verb parts":
  combined \gap [cat-trace]
  -->
    aux {gap: ![cat-verb]*} main
  | main {gap: ![cat-verb]*} aux
  :: # unification requirements follow
  aux = [cat-verb, lemma-"být"],
  main = [cat-verb_participle],
  aux.person = main.person,
  aux.number = main.number,
  combined = [cat-complex_verb],
  combined.lemma = main.lemma,
  combined.person = main.person,
  combined.number = main.number
end
The parts of the rule, in the order in which they appear:
• Output (before -->): replace the matching region with the content of the variable combined, the region labelled gap, and an extra feature structure trace marking the former location of the second part of the complex verb (if useful in further analysis).
• Input pattern (after -->): find a subsequence in the input reading that matches the given regular expression. Assign words to the variables aux and main and mark the region between them with the label gap.
• aux = ..., main = ...: restrict which words can be assigned to the input variables aux and main.
• aux.person = main.person, aux.number = main.number: ensure grammatical agreement between the words assigned to the variables aux and main.
• combined = ..., combined.lemma, combined.person, combined.number: fill (restrict by unifying) the output variable combined with features relevant for further analysis.
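
As a rough illustration of what the rule does, the following Python sketch applies the same steps to a reading. It is an approximation under stated assumptions, not the AX semantics: feature structures are plain, unambiguous dictionaries, the function names are hypothetical, and unification is approximated by treating a missing feature as compatible with any value.

```python
def is_aux(fs):
    return fs.get("cat") == "verb" and fs.get("lemma") == "být"


def is_main(fs):
    return fs.get("cat") == "verb_participle"


def compatible(a, b, feature):
    """Approximate unification: a missing value agrees with anything."""
    va, vb = a.get(feature), b.get(feature)
    return va is None or vb is None or va == vb


def combine_main_and_aux(reading):
    """Return one new reading for every aux/main pair the rule matches."""
    results = []
    for i in range(len(reading)):
        for j in range(i + 1, len(reading)):
            gap = reading[i + 1:j]
            if any(fs.get("cat") == "verb" for fs in gap):
                continue  # the gap must match ![cat-verb]*
            # try both orders: aux ... main and main ... aux
            for aux, main in ((reading[i], reading[j]), (reading[j], reading[i])):
                if not (is_aux(aux) and is_main(main)):
                    continue
                if not (compatible(aux, main, "person")
                        and compatible(aux, main, "number")):
                    continue
                combined = {"cat": "complex_verb", "lemma": main["lemma"]}
                for feature in ("person", "number"):
                    value = main.get(feature, aux.get(feature))
                    if value is not None:
                        combined[feature] = value
                trace = {"cat": "trace"}  # marks where the second part used to be
                results.append(reading[:i] + [combined] + gap + [trace] + reading[j + 1:])
    return results


# "jsem včera psal": auxiliary, adverb in the gap, participle
reading = [
    {"cat": "verb", "lemma": "být", "person": "first", "number": "sg"},
    {"cat": "adv", "lemma": "včera"},
    {"cat": "verb_participle", "lemma": "psát", "number": "sg"},
]
new_readings = combine_main_and_aux(reading)
```

For this input the sketch returns one reading: the combined verb, the adverb from the gap, and a trace. If the auxiliary and the participle disagreed in number, no reading would be produced.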