
CorpEus, a ‘web as corpus’ tool
designed for the
agglutinative nature of Basque
I. Leturia, A. Gurrutxaga (1)
I. Alegria, A. Ezeiza (2)
WAC3 – September 15-16, 2007 – Louvain-la-Neuve
(1) Elhuyar R&D, Usurbil, Basque Country
(2) IXA Group, University of the Basque Country, Donostia, Basque Country
Contents
• Motivation
• Problems with Basque language
• Our approach
• CorpEus, a ‘web as corpus’ tool for Basque
• EusBila, a search service for Basque
• Evaluation
Motivation
• No doubt corpora are necessary:
– for linguistic research
– for language normalization
– for developing language technologies
• But many corpora are exclusively used for
these purposes
• They are not made publicly available and
searchable through the Internet
Motivation
• For Basque, it is essential to have corpora
available for querying
– Standardization of Basque started only in 1968
– Many rules, words and spellings have been changing
since; still, every now and then new rules are released
by the Academy of Basque Language
– It was not taught in schools until the seventies and in
universities until the eighties
– No decision as to the correct word or spelling has yet
been taken in many areas or words
– Even written production abounds with misspellings,
errors, uncertainties, etc.
Motivation
• The Basque-speaking community needs corpora
– Teachers
– Writers
– Technical text producers
– Dictionary makers
– Translators
– Students
– Academics in the field of standardization
• Basque is not a language rich in corpora
– The few that exist are small and out of date
Motivation
• Only corpora available (I):
– XX. mendeko euskararen corpusa:
• Academy of the Basque language
• 4.6 million words
• Balanced
• Literary texts
• Twentieth century
• http://www.euskaracorpusa.net/XXmendea/Konts_arrunta_fr.html
Motivation
• Only corpora available (II):
– Ereduzko prosa gaur:
• University of the Basque Country
• 23.8 million words
• Literary and press texts regarded as “reference”
• 2000 - 2005
• http://www.ehu.es/euskaraorria/euskara/ereduzkoa/araka.html
Motivation
• Only corpora available (III):
– Zientzia eta teknologiaren corpusa:
• Elhuyar Foundation and the IXA Group of the
University of the Basque Country
• 7.6 million words
• Texts on science and technology
• 1990 - 2002
• http://www.ztcorpusa.net
Motivation
• Only corpora available (IV):
– Klasikoen gordailua:
• Susa publishing house
• 10.7 million words
• Non-tagged
• Classic texts
• http://klasikoak.armiarma.com/corpus.htm
Motivation
• But we do have the Internet
– Huge repository of texts
– Constantly updated
• A tool for querying the Internet as if it
were a Basque corpus would be very
interesting
Motivation
• But there are also disadvantages:
– Not linguistically tagged:
• Always some uncertainty
• Variants and misspellings will not appear when
looking for a word
– It never shows everything, only what is in the
first results returned by search engines
– The Internet is often considered non-representative
– The Internet is full of redundancy
Motivation
• Nevertheless, we thought that the benefits
far exceeded the disadvantages
• We embarked on a project to build a ‘web
as corpus’ tool for Basque
Problems with Basque language
• Similar services exist:
– WebConc (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi)
– WebCorp (http://www.webcorp.org.uk/)
– KWiCFinder (http://www.kwicfinder.com)
• But these rely on search engines
• Search engines don’t work well for Basque
Problems with Basque language
• Looking for conjugations and inflections
– Basque is an agglutinative language
• A given lemma yields many different word forms
– lan (“work”): lana (“the work”), lanak (“works” or “the
works”), lanari (“to the work”), lanei (“to the works”),
lanaren (“of the work”), lanen (“of the works”)…
– Looking only for the exact given word, or the
word plus an “s” for the plural, is not enough
– Wildcards are not an appropriate solution
• Looking for lan* would also return forms of the
words lanabes (“tool”), lanbro (“fog”)…
Problems with Basque language
• Language discrimination
– No search engine offers the possibility of
returning only pages in Basque
– Big problem when looking for:
• Technical words that exist also in other languages:
anorexia, sulfuroso, byte, allegro, sistema, energia…
• Short words: katu (“cat”), ur (“water”)…
• Proper nouns: Egipto, Newton, Pluton…
– Many non-Basque results are returned, often
no Basque results at all
Problems with Basque language
• Lack of knowledge about the language
– Status of language:
• Late standardization
• Still many changes in words and rules
• Late teaching in schools and universities
• Many non-standardised areas or words
• Many misspellings and errors in written production
– A word might be incorrect but appear often in
the web
– The user might think it is correct, without
knowing that a more appropriate word exists
Our approach
• Looking for conjugations and inflections:
Morphological query expansion (I)
– Morphological generator created by the IXA
Group of the University of the Basque Country
– We obtain all the forms of a given lemma
– We ask the search engine for all of them using
an OR operator
– etxe (“house”) => etxe OR etxea OR etxeak
OR etxeari OR etxeek OR …
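The expansion step can be sketched in a few lines of Python. This is only an illustration: the function name and the hard-coded list of forms are assumptions, since in CorpEus the forms come from the IXA morphological generator, which is not reproduced here.

```python
def build_or_query(forms):
    """Join the word forms of a lemma into a single OR query."""
    # Drop duplicate forms while keeping their original order.
    unique_forms = list(dict.fromkeys(forms))
    return " OR ".join(unique_forms)


# Hypothetical generator output for the lemma "etxe" ("house").
etxe_forms = ["etxe", "etxea", "etxeak", "etxeari", "etxeek"]
query = build_or_query(etxe_forms)
print(query)
```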
Our approach
• Looking for conjugations and inflections:
Morphological query expansion (II)
– Minor problems:
• Each search engine API limits the number of words
or the length of the search phrase
– we had to discover the limits by trial and error
• Due to these limits, true lemmatised search is
impossible
– we looked in a corpus for the most frequent cases,
numbers, tenses, etc. among the declensions and
conjugations of words
– only those forms of the word are sent in the query
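One way to fit the expanded query inside such limits is sketched below in Python. It is a simplification under stated assumptions: it ranks individual forms by frequency (the slides rank cases, numbers, tenses, etc.), and the function name, the frequencies and the limit values are invented for illustration, since the real limits were found by trial and error and are not given here.

```python
def select_forms(forms_with_freq, max_terms, max_chars):
    """Keep the most frequent forms that fit within the query limits.

    forms_with_freq: list of (form, corpus_frequency) pairs.
    max_terms / max_chars: hypothetical limits of a search engine API.
    """
    ranked = sorted(forms_with_freq, key=lambda pair: pair[1], reverse=True)
    selected = []
    for form, _freq in ranked:
        candidate = " OR ".join(selected + [form])
        # Stop at the first form that would exceed either limit.
        if len(selected) + 1 > max_terms or len(candidate) > max_chars:
            break
        selected.append(form)
    return selected


# Forms of "lan" ("work") with made-up frequencies, for illustration only.
lan_forms = [("lan", 40), ("lana", 50), ("lanak", 30),
             ("lanaren", 20), ("lanari", 10)]
print(select_forms(lan_forms, max_terms=3, max_chars=100))
```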
Our approach
• Language discrimination:
Language-filtering words (I)
– We looked in a corpus for the most frequent
words in Basque
– We include them in the search phrase using an
AND operator
Our approach
• Language discrimination:
Language-filtering words (II)
– Minor problems (I):
• The most frequent words in Basque exist in other
languages too
• Several language-filtering words had to be used
– the more of these, the more we gain in precision (fewer
non-Basque pages returned) but the more we lose in recall
(more Basque pages left out), and vice versa
– we chose precision and include four filtering words
– if few results are returned, the user can try again with
fewer filtering words to increase recall
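A minimal Python sketch of the filtering step follows. The filter words are common Basque words picked for illustration; the slides do not say which four words CorpEus actually uses, and the parenthesised AND/OR syntax is an assumption about the search engine's query language.

```python
# Illustrative filter words only; not necessarily the ones used by CorpEus.
FILTER_WORDS = ["eta", "da", "ez", "du"]


def add_language_filter(search_phrase, filter_words, n_filters=4):
    """Append the first n_filters filtering words with AND.

    More filtering words favour precision; fewer favour recall.
    """
    chosen = filter_words[:n_filters]
    parts = [f"({search_phrase})"] + chosen
    return " AND ".join(parts)


strict = add_language_filter("etxe OR etxea", FILTER_WORDS)
relaxed = add_language_filter("etxe OR etxea", FILTER_WORDS, n_filters=2)
print(strict)
print(relaxed)
```

Lowering `n_filters` corresponds to the retry option described above, where the user trades precision for recall.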
Our approach
• Language discrimination:
Language-filtering words (III)
– Minor problems (II):
• In bilingual pages, the searched word can be in a piece of text
that is not in Basque
– we use LangId, a free language identifier developed by the IXA
Group of the University of the Basque Country
– it is applied to some context around the word to see whether the
word is in a piece of text in Basque
– it does not work well with small contexts, but if the context is too
big, pieces in other languages can be included
– we start with quite a broad context and progressively reduce its
length until the minimum length at which LangId works properly
– if at any point LangId says the context is in Basque, we stop and
show the occurrence
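The shrinking-context check can be sketched as follows in Python. The `is_basque` callback is a hypothetical stand-in for LangId, whose real interface is not described in the slides; the toy detector, the radii and the halving step are assumptions chosen only to make the sketch runnable.

```python
def occurrence_is_basque(text, start, end, is_basque,
                         max_radius=200, min_radius=25):
    """Test ever smaller contexts around text[start:end].

    Returns True as soon as one context is identified as Basque,
    False once the context would drop below min_radius characters.
    """
    radius = max_radius
    while radius >= min_radius:
        context = text[max(0, start - radius):end + radius]
        if is_basque(context):
            return True
        radius //= 2  # narrow the window and try again
    return False


# Toy detector for the example only; it is NOT LangId.
BASQUE_WORDS = {"hau", "nire", "etxea", "da", "eta", "oso", "handia"}


def toy_is_basque(context):
    tokens = context.lower().split()
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in BASQUE_WORDS)
    return hits / len(tokens) > 0.5


spanish = "la casa es muy grande y bonita " * 10
basque = "hau nire etxea da eta oso handia da"
page = spanish + basque + " " + spanish
pos = page.index("etxea")
print(occurrence_is_basque(page, pos, pos + len("etxea"), toy_is_basque))
```

In this bilingual toy page the wide contexts are dominated by Spanish, and only the narrowest window is recognised as Basque.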
Our approach
• Lack of knowledge about the language:
Variant suggestion (I)
– EDBL, lexical database created by the IXA
Group of the University of the Basque Country
– Each word is linked to its variants, common
errors, old spellings, etc.
– When a user enters a word, its standard form
or variants are suggested
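As a rough illustration of the lookup, the Python sketch below uses a plain in-memory index. It does not reflect the real structure of EDBL; the function names and the single sample entry ("euskera" as a variant of the standard "euskara") are assumptions for the example.

```python
from collections import defaultdict


def build_variant_index(entries):
    """entries: iterable of (standard_form, [variant, ...]) pairs."""
    to_standard = {}
    to_variants = defaultdict(list)
    for standard, variants in entries:
        for variant in variants:
            to_standard[variant] = standard
            to_variants[standard].append(variant)
    return to_standard, to_variants


def suggest(word, to_standard, to_variants):
    """Suggest the standard form for a variant, or variants for a standard form."""
    if word in to_standard:
        return {"standard": to_standard[word], "variants": []}
    return {"standard": None, "variants": list(to_variants.get(word, []))}


# Hypothetical sample entry; a real lexical database holds many more.
to_standard, to_variants = build_variant_index([("euskara", ["euskera"])])
print(suggest("euskera", to_standard, to_variants))
print(suggest("euskara", to_standard, to_variants))
```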
Our approach
• Lack of knowledge about the language:
Variant suggestion (II)
– Partly alleviates one of the problems caused by the
web not being linguistically tagged:
• in a tagged corpus, variants would be assigned the
correct lemma and would appear when looking for
the lemma
• with our approach, the user can obtain the variants
too
CorpEus
• System architecture:
– User enters word
– Query the EDBL for variants
– Query morphological generator to obtain
conjugations and inflections
– Query APIs of search engines
– Download pages
– Find occurrences of the forms of the word
– Query LangId for the language each occurrence is in
– Show KWiCs and counts
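[Figure: system architecture. The user sends a word to CorpEus. CorpEus gets variants from EDBL (IXA), inflections and conjugations from the morphological generator (IXA), sends the search phrase to the search engines’ APIs and receives URLs, fetches the web pages at those URLs from the WWW, and sends the occurrence contexts to LangId (IXA), which returns their language.]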
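The "find occurrences" and "show KWiCs" steps of the architecture can be sketched as follows in Python. This is a simplified illustration, not the CorpEus implementation: the function name, the context width and the regular-expression matching are assumptions.

```python
import re


def kwic_lines(text, forms, width=30):
    """Return (left context, matched form, right context) tuples."""
    alternatives = "|".join(re.escape(form) for form in forms)
    # Word boundaries keep "etxe" from matching inside "etxea".
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines


sample = "Gure etxea handia da. Etxeak ikusi ditut."
for left, form, right in kwic_lines(sample, ["etxe", "etxea", "etxeak"]):
    print(f"{left:>30} [{form}] {right}")
```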
CorpEus
• Features (I):
– Lemma-based search
– Language-filtered search
– Variant suggestion
CorpEus
• Features (II):
– Ambiguous or unrecognised words:
• The user chooses the analysis upon which to base
the morphological generation
CorpEus
• Features (III):
– Search for more than one word:
• Lemma-based search performed for all of them
• Occurrences of any of the words are shown
CorpEus
• Features (IV):
– Noun phrase or term searching:
• Enclosing various terms in double quotes
• Morphological generation applied to last word
• Thus, proper lemma-based search for whole noun
phrases or terms (in Basque, only the last
component of the noun phrase or term is inflected)
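A small Python sketch of phrase expansion, where only the last word is passed to the generator. The `generate_forms` callback is a hypothetical stand-in for the morphological generator, and the toy form list for "berri" ("new") is given for illustration only.

```python
def expand_phrase(phrase, generate_forms):
    """Expand a quoted phrase by inflecting only its last word."""
    words = phrase.split()
    if not words:
        return ""
    prefix = " ".join(words[:-1])
    quoted = []
    for form in generate_forms(words[-1]):
        full = f"{prefix} {form}".strip()
        quoted.append(f'"{full}"')
    return " OR ".join(quoted)


# Toy generator: a fixed table instead of a real morphological generator.
TOY_FORMS = {"berri": ["berri", "berria", "berriak"]}


def toy_generator(word):
    return TOY_FORMS.get(word, [word])


print(expand_phrase("etxe berri", toy_generator))
```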
CorpEus
• Features (V):
– Different ordering criteria:
• Order in which pages arrive (default)
• Form of searched word
• Context after the word
• Context before the word
– Ordered on the fly as they arrive
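Ordering results as they arrive can be done by inserting each new line into an already sorted list. The Python sketch below uses `bisect.insort` for this; the class name and the sort keys are assumptions (in particular, sorting the context before the word by its nearest word first is a common concordancer convention, not something the slides specify).

```python
import bisect

# Each line is a (left context, form, right context) tuple.
SORT_KEYS = {
    "form": lambda line: line[1].lower(),
    "after": lambda line: line[2].lower(),
    # Assumption: compare the words nearest to the keyword first.
    "before": lambda line: [w.lower() for w in reversed(line[0].split())],
}


class SortedConcordance:
    def __init__(self, criterion=None):
        self.criterion = criterion  # None means arrival order
        self._entries = []          # (sort key, arrival number, line)
        self._count = 0

    def add(self, line):
        if self.criterion is None:
            key = self._count
        else:
            key = SORT_KEYS[self.criterion](line)
        # The arrival number breaks ties, so lines are never compared.
        bisect.insort(self._entries, (key, self._count, line))
        self._count += 1

    def lines(self):
        return [entry[2] for entry in self._entries]


by_form = SortedConcordance("form")
for kwic in [("a ", "etxeak", " x"), ("b ", "etxea", " y"), ("c ", "etxe", " z")]:
    by_form.add(kwic)
print([line[1] for line in by_form.lines()])
```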
CorpEus
• Features (VI):
– Analysis of the words:
• Possible lemmas and POSs of the forms of the
searched word are shown in a floating box
• Different colours:
– Light green: correct word, unambiguous
– Dark green: variant, unambiguous
– Light yellow: correct word, ambiguous
– Dark yellow: variant, ambiguous
– Red: unrecognised word
CorpEus
• Features (VII):
– Count charts:
• Word forms
• Possible lemma or POS
• Word before or after
• Lemma of word before or after
•…
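The surface-level counts behind such charts can be sketched with `collections.Counter`. This Python example covers only word forms and the neighbouring words; counts by lemma or POS would need a morphological analyser, which is not sketched here. The function name and sample lines are assumptions.

```python
from collections import Counter


def count_charts(kwic):
    """Count word forms and the words right before and after them.

    kwic: list of (left context, form, right context) tuples.
    Punctuation attached to neighbouring words is not stripped.
    """
    forms = Counter(form.lower() for _left, form, _right in kwic)
    before = Counter(left.split()[-1].lower()
                     for left, _form, _right in kwic if left.split())
    after = Counter(right.split()[0].lower()
                    for _left, _form, right in kwic if right.split())
    return forms, before, after


sample_kwic = [
    ("Gure ", "etxea", " handia da"),
    ("bi ", "etxea", " ikusi"),
    ("gure ", "Etxeak", " handiak"),
]
forms, before, after = count_charts(sample_kwic)
print(forms.most_common())
print(before.most_common())
```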
CorpEus
• Features (VIII):
– Many textual content file types:
• HTML
• XML
• RSS
• TXT
• PDF
• DOC
• RTF
• PPT
• XLS
• …
– Parallel downloading of pages to avoid blocking
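Parallel downloading can be sketched with a thread pool from Python's standard library. The slides do not say how CorpEus implements it, so the function names, the worker count and the timeout below are assumptions.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_with_urllib(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch_with_urllib, max_workers=8):
    """Fetch all URLs concurrently so one slow page does not block the rest.

    Returns a dict mapping each URL to its content, or to the
    exception raised while fetching it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:
                results[url] = error
    return results
```

Because results are collected with `as_completed`, pages can be processed in the order they arrive rather than the order they were requested.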
CorpEus
• Demo: http://www.corpeus.org
EusBila
• Search engines don’t work well for Basque
• We decided to build a search service for
Basque based on the principles of CorpEus:
– API based
– Lemma-based search
– Language-filtered search
– Variant suggestion
• But return URLs and snippets, not KWiCs
or charts
EusBila
• Problem: limit of calls per day of the APIs
– Google: 1,000 calls per day
– Yahoo!: 5,000 calls per day
– Windows Live Search: 10,000 calls per day
• The limits can be enough for a corpus tool,
but not for a general use search service
• Microsoft recently raised its limit to 25,000
calls per day and also launched an
unlimited-use commercial license
EusBila
• We published a paper at iNEWS07 (Improving
Non-English Web Searching), a workshop
at SIGIR’07 (July 2007, Amsterdam)
• It aroused interest, as it is a cost-effective
web search solution that can be used by
other minority languages with few
resources
EusBila
• Launch:
– By Eleka Ingeniaritza Linguistikoa
– Under commercial name Elebila
– October 2007
EusBila
• Demo: EusBila
Evaluation
• The methodology used in EusBila and
CorpEus was evaluated for the iNEWS07
paper on EusBila
• We evaluated:
– Gain in recall due to morphological query
expansion
– Gain in precision due to language-filtering
words
– Loss in recall due to language-filtering words
Evaluation
• Indicator for precision: percentage of
results that were actually in Basque
• Indicator for recall: estimated hit counts
returned by the API
• Compared Windows Live Search’s API with
EusBila using this same API
• The words for the evaluation were taken
from the search logs of a very popular
science portal in Basque
Evaluation
• Gain in recall due to morphological query expansion (I)
– Condition: language-filtering words not applied; words: only Basque
– Measured variable: hit counts
– Result: 89.43% increase
• Gain in recall due to morphological query expansion (II)
– Condition: language-filtering words applied; words: any kind
– Measured variable: hit counts
– Result: 40.19% increase
• Gain in precision due to language-filtering words
– Condition: morphological query expansion not applied; words: any kind
– Measured variable: % of results in Basque
– Result: 70.55 points increase, from 27.19% to 97.74%
• Loss in recall due to language-filtering words
– Condition: morphological query expansion not applied; words: only Basque
– Measured variable: hit counts
– Result: decrease of between 6.48% and 57.69%, depending on the number of
language-filtering words*
* The number of filtering words can optionally be reduced to increase the recall when few results are returned
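The two indicators can be written down as simple formulas. The Python sketch below is an illustration with assumed function names; the hit counts and language judgments in the example are made up, and only the 27.19% and 97.74% precision figures come from the evaluation.

```python
def percent_in_basque(is_basque_flags):
    """Precision indicator: share of returned results that are in Basque."""
    return 100.0 * sum(is_basque_flags) / len(is_basque_flags)


def relative_change(baseline_hits, new_hits):
    """Recall indicator: percentage change in estimated hit counts."""
    return 100.0 * (new_hits - baseline_hits) / baseline_hits


def point_change(before_pct, after_pct):
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


# Made-up example: 3 of 4 returned results judged to be in Basque.
print(percent_in_basque([True, True, False, True]))
# Made-up hit counts: 200 without expansion, 300 with it.
print(relative_change(200, 300))
# Reported precision figures: 27.19% without filtering, 97.74% with it.
print(round(point_change(27.19, 97.74), 2))
```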