
Language and Tools for Lexical
Resource Management
Asanee Kawtrakul (1)
Aree Thunkijjanukij (2)
Preeda Lertpongwipusana (1)
Poonna Yospanya (1)
(1) Department of Computer Engineering, Faculty of Engineering,
(2) Thai National AGRIS center
Kasetsart University
23 January 2003 APAN-Fukuoka
Acknowledgement
• JIRCAS: Japan International Research
Center for Agricultural Sciences
• Organizing committee
• Kasetsart University
Outline
• Background & Motivation
• Problems in Lexical Resource
Preparation
• Requirements for Lexical Resource
Management
• Proposed Language and tools
• Conclusion and Next steps
Background and Motivation
• Thailand is an agriculture-based country
– with rich knowledge and data in the agricultural field
• A great quantity of agricultural information
is scattered across unstructured and unrelated
texts
– Skimming, digesting, and integrating it become
essential
• Knowledge is spread around the world
– Knowledge discovery without language barriers is also
needed
The Basic Idea behind..
[Figure: system overview. Components shown: Internet; Gathering Module; Agricultural Document collection; Indexing and Clustering Module; Data Cube; Summarization Module; Translation Module; Graphical User Interface]
Textual Data as an Input
Let us focus on Canada’s agricultural products. In
1998, there were 1,216 registered commercial egg
producers in Canada. Ontario produced 39.8% of
all eggs in Canada, Quebec was second with
16.6%. The western provinces have a combined
egg production of 35.6% and the eastern
provinces have a combined production of 8.0%.
Courtesy of Agriculture and Agri-Food Canada, http://www.agr.ca/cb
Summarization and Translation as a Result

Category  Exporter  Year  Month     Price  Unit
Paddy     Thailand  2002  January   300    Dollars/Ton
Paddy     Thailand  2002  February  285    Dollars/Ton

ประเภท      ผู้ส่งออก     ปี     เดือน       ราคา    หน่วย
ข้าวเปลือก   ประเทศไทย   2545  มกราคม     13,625  บาทต่อเกวียน
ข้าวเปลือก   ประเทศไทย   2545  กุมภาพันธ์   14,340  บาทต่อเกวียน

(Thai version of the same table: paddy, Thailand, year 2545 B.E. = 2002, January and February, prices in baht per kwian)
The Development of Agricultural System for
Knowledge Acquisition and Dissemination
• 5-year project (2001-2005)
• A collaboration between:
– Thai National AGRIS center:
• Providing the bilingual thesaurus (AGROVOC)
– Department of Computer Engineering:
• Developing NLP techniques for searching, summarization, and
translation, including tools for lexical resource management
• Funded by Kasetsart University Research
and Development Institute
[Figure: system architecture. Components shown: Acquisition System; Linguist/Domain Expert; Very Large Corpus; Rules; Thesaurus; Lexicon; Linguistic Knowledge Base; Document Indexing & Clustering; Intelligent Search Engine (with Translation, with Summarization); Document Warehouse; Gathering Module; Internet/Intranet]
Thai Agricultural Thesaurus
• The English vocabulary totals 27,531 terms
• Only 10,280 terms have been translated into Thai
(excluding scientific names)
• Scientific names were not translated
– e.g. Oryza (genus) sativa (species) for rice, or a
family name
Problems in Hand-Coded Thesaurus
• Scalability
• Reliability and Coherence
• Rigidity
• Cost
[Figure: fragments of a hand-coded thesaurus. Terms shown: Foods; Processed Products; Bakery Product; Canned Products; Dietetic Foods; Dried Products; Frozen Foods; Frozen Products; Fermented Foods; Fermented Products; Fermented Fish; Alcoholic Beverage; milk; Products; Local Product. Fermented Foods and Fermented Fish appear more than once in the figure.]
Commercial Vegetables: The September index, at 107, was up
1.9 percent from last month but 3.6 percent below September
1998. Price increases for lettuce, tomatoes, broccoli, and celery
more than offset price decreases for onions, carrots, and cucumbers.
[Figure: the terms Commercial Vegetable, tomatoes, Broccoli, Carrots, Cucumbers, shown with the two views below]
User Category
  Commercial Vegetable
    broccoli
    carrot
    tomato
Keyword Assigned: tomato, tomatoes
Expert Domain (entry for "tomatoes")
  BT VEGETABLES
    BROCCOLI (type=leaf vegetable; color=green)
    SWEET PEPPER (type=fruit vegetable; color=red, green, yellow)
    TOMATOES (type=fruit vegetable; color=red, yellow)
      NT CHERRY TOMATOES (type=fruit vegetable; color=red)
      RT LYCOPERSICON ESCULENTUM (type=taxonomic; color=red)
        BT SOLANACEAE
          NT CAPSICUM
          NT NICOTIANA
Other Major Problems (1)
• Access to textual information
– Language variation:
• Many ways to express the same idea
Ex: "thinning flower" is expressed as "deblossoming"
    "thinning branch" is expressed as "pruning"
– How can the computer know that the words a
person uses are related to the words found in
stored text?
Ex: user: thinning branch
    computer: pruning
Requirement (1)
• Access to textual information
– Need intelligent browsing from
related concept to related concept,
rather than matching occurrences of
stemmed character strings
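The requirement above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CONCEPTS` table and the helper names `related_terms` and `search` are hypothetical, and only the two example pairs come from the slides.

```python
# Hypothetical concept table: each concept groups expressions with the same meaning.
# The two entries mirror the slide examples; the structure itself is an assumption.
CONCEPTS = {
    "flower_thinning": {"thinning flower", "deblossoming"},
    "branch_thinning": {"thinning branch", "pruning"},
}


def normalize(phrase):
    """Lower-case the phrase and collapse repeated whitespace."""
    return " ".join(phrase.lower().split())


def related_terms(phrase):
    """Return every expression that shares a concept with `phrase` (itself included)."""
    key = normalize(phrase)
    result = set()
    for members in CONCEPTS.values():
        if key in members:
            result |= members
    return result


def search(query, documents):
    """Match documents on any expression of the query's concept, not on the raw string."""
    terms = related_terms(query) or {normalize(query)}
    return [doc for doc in documents if any(term in doc.lower() for term in terms)]


docs = ["Pruning improves the yield of fruit trees.", "Harvest time is in May."]
hits = search("thinning branch", docs)
```

Here the query "thinning branch" reaches the document that only mentions "pruning", which plain string matching would miss.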
Other Major Problems (2)
• Transforming from unstructured
to structured information
Requirement (2)
• Need an application-based frame for
product prices
– Knowledge representation in table form
– Consisting of attributes and their values
Attributes  Values
Category    Paddy
Exporter    Thailand
Price       300
Unit        Dollars/Ton
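As a rough illustration, such a frame could be modelled as a record with named slots. The class name `PriceFrame` and the `missing` helper are hypothetical; the attribute names and the filled values are the ones in the table.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PriceFrame:
    """Application-based frame for a product price (slot names taken from the slide)."""
    category: Optional[str] = None
    exporter: Optional[str] = None
    price: Optional[float] = None
    unit: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of slots that extraction has not filled yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Fully filled frame, as in the table.
paddy = PriceFrame(category="Paddy", exporter="Thailand", price=300, unit="Dollars/Ton")

# Partially filled frame: the empty slots tell the analyzer what is still to be found.
partial = PriceFrame(category="Paddy", price=285)
```

Keeping empty slots explicit is one simple way to flag values that still have to be supplied from context.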
Problems in Translation: Pragmatic and
Semantic
"The September All Farm Products Index was 97 percent of its 1990-92
base, down 1.0 percent from the August index and 2.0 percent
below the September 1998 Index"
Annotations on the slide:
– "97 percent of its 1990-92 base": 0.97 * averagePrice of the years 1990-1992
– "September": of which year?
– "August": Year 1997
– "2.0 percent below the September 1998 Index": down 0.02 * price(September 1998)
• Using Ontology
"Year 1990-1992" meaning
Product  Year
A        1990  1991  1992
B        -     -     -
C        -     -     -
D        -     -     -
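The arithmetic behind the "1990-92 base" reading can be made concrete with a small sketch. The yearly prices below are invented placeholders, and the function names are hypothetical; the only point is that the base is an average over the three years, which a translator must know to interpret "97 percent".

```python
def base_average(prices_by_year, start, end):
    """Average price over the base period, inclusive of both end years."""
    years = range(start, end + 1)
    return sum(prices_by_year[year] for year in years) / len(years)


def index_value(current_price, base_price):
    """Express the current price as a percentage of the base-period average."""
    return 100.0 * current_price / base_price


# Hypothetical yearly average prices (not from the source).
prices = {1990: 100.0, 1991: 110.0, 1992: 90.0}

base = base_average(prices, 1990, 1992)
september_price = 0.97 * base  # "97 percent of its 1990-92 base"
september_index = index_value(september_price, base)
```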
Requirement (3)
• The lexicon should have semantic
constraints between lexical entities and
restrictions on usage categories
Summary of Problems Related to the Lexicon
• In terms of coverage
– Extensional coverage, i.e., the number of entries
– Intensional coverage, i.e., the number of information
fields
• In terms of the semantic domain covered by the
application
– Meaning interpretation with respect to objects, subject
matter, topics of discourse, and pragmatic interpretation
• The user category, with reference to the
intended system users
– Commercial products vs plant products vs family
products
One Solution
• Encode world knowledge in the
structures attached to each
lexical item, which needs both
a language and tools
The Design of Lexicon: Requirement
Specification
• Macrostructure: lexicon structure in terms of
relations between lexical entries
– i.e. hierarchical taxonomies, which are characteristic
of thesauri of semantically related word families
• Microstructure: types of information for each
entry
– Pronunciation or phonemic transcription
– Syntactic properties
– Meaning
– Pragmatics of their use in real context and language
Microstructure (cont’d)
• A lexical entity could contain slots/scripts for
each specific domain, and needs an intelligent
analyzer with language understanding
– Supports information extraction
– Supplies missing values
Lexical Resource Management Language
• which is able to:
– Handle heterogeneity of linguistic
knowledge structures.
– Handle exceptions and
inconsistencies of natural languages.
– Provide an intuitive means to store
and manipulate both linguistic and
world knowledge.
Language Features
• The language is designed to provide:
– Support for heterogeneous structures.
– Sufficient provisions to handle exceptions and
inconsistencies of natural languages (achieved
through the +/- operators).
– Deduction of knowledge from rules.
– Detection and prevention of potential integrity
violations.
Language and Tools Specification
Requirements
• Flexibility – almost any structure can be
defined in this model.
• Extensibility – extending a structure is
simple.
• Maturability – structure reformation and
deformation are supported.
• Integrity – meta-relations help prevent
malformed or semantically ill-formed data entries.
• Dealing with inconsistencies is feasible.
Some Syntactic Elements
• Knowledge manipulations are achieved
through these primitives:
– def is used to define structures not already
existing.
– redef changes aspects of existing structures.
– undef removes specified structures from the
knowledge base.
– ret is used to retrieve structures from the
knowledge base.
Examples
• Hierarchies: tree structures representing
generalization semantics, or classes, of
atoms.
thing
  animate
    human
    animal
  inanimate
A semantic tree represented by a hierarchy structure
Usage Examples
• Defining a hierarchy
– def thing(animate(human+animal)+inanimate).
• Adding the ‘plant’ and ‘vehicle’ concepts
– def animate(plant+vehicle).
• Reparenting the ‘vehicle’ concept
– redef animate(vehicle) inanimate(vehicle).
• Removing the ‘human’ concept
– undef human. (provided that there is only a single
instance of ‘human’)
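The effect of these statements can be mimicked with a small Python model. This is a sketch under assumptions, not the language's implementation: the `Hierarchy` class and its method names are hypothetical, and removing a node's whole subtree on `undefine` is an assumed behaviour that the slides do not specify.

```python
from typing import Dict, List, Optional


class Hierarchy:
    """Concept tree stored as child -> parent links (hypothetical model)."""

    def __init__(self, root: str) -> None:
        self.parent: Dict[str, Optional[str]] = {root: None}

    def define(self, parent: str, *children: str) -> None:
        """Rough analogue of `def parent(a+b)`: add new nodes under an existing one."""
        if parent not in self.parent:
            raise KeyError(f"unknown parent: {parent}")
        for child in children:
            if child in self.parent:
                raise ValueError(f"already defined: {child}")
            self.parent[child] = parent

    def redefine(self, node: str, new_parent: str) -> None:
        """Rough analogue of `redef old(node) new(node)`: move a node and its subtree."""
        if node not in self.parent or new_parent not in self.parent:
            raise KeyError("unknown node")
        if new_parent in self.subtree(node):
            raise ValueError("cannot move a node under its own descendant")
        self.parent[node] = new_parent

    def undefine(self, node: str) -> None:
        """Rough analogue of `undef node`; this sketch also drops the node's subtree."""
        for name in self.subtree(node):
            del self.parent[name]

    def children(self, node: str) -> List[str]:
        return sorted(c for c, p in self.parent.items() if p == node)

    def subtree(self, node: str) -> List[str]:
        result = [node]
        for child in self.children(node):
            result.extend(self.subtree(child))
        return result

    def is_a(self, node: Optional[str], ancestor: str) -> bool:
        """True if `ancestor` is `node` itself or lies on its path to the root."""
        while node is not None:
            if node == ancestor:
                return True
            node = self.parent[node]
        return False


h = Hierarchy("thing")
h.define("thing", "animate", "inanimate")
h.define("animate", "human", "animal")   # def thing(animate(human+animal)+inanimate).
h.define("animate", "plant", "vehicle")  # def animate(plant+vehicle).
h.redefine("vehicle", "inanimate")       # redef animate(vehicle) inanimate(vehicle).
h.undefine("human")                      # undef human.
```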
Usage Examples (2)
• Defining case frames for verbs
– First, we need to define meta-relations for
words belonging to the sub-hierarchy ‘verb’.
– def meta case(verb, sub:thing).
– def meta case(verb, sub:thing, obj:thing).
– Then, we define case frames for several
verbs.
– def case(eat, sub:human+animal, obj:food).
– def case(fly, sub:bird-penguin). (here, we
emphasize the use of +/- operators)
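One way to read the +/- operators is as class inclusion and exclusion over the hierarchy. The sketch below follows that reading; it is an interpretation, not the language's defined semantics. The `PARENT` table (including `sparrow` and the placement of `bird` and `food`) and the helper names are hypothetical.

```python
# Hypothetical child -> parent links; only the class names used in the case
# frames (thing, human, animal, food, bird, penguin) come from the slides.
PARENT = {
    "animate": "thing",
    "food": "thing",
    "human": "animate",
    "animal": "animate",
    "bird": "animal",
    "penguin": "bird",
    "sparrow": "bird",
}


def is_a(node, ancestor):
    """True if `ancestor` is `node` itself or one of its ancestors."""
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False


def satisfies(filler, include, exclude=()):
    """`a+b-c` read as: under a or under b, but not under c."""
    allowed = any(is_a(filler, cls) for cls in include)
    blocked = any(is_a(filler, cls) for cls in exclude)
    return allowed and not blocked


# Each role maps to (included classes, excluded classes).
CASE_FRAMES = {
    "eat": {"sub": (("human", "animal"), ()), "obj": (("food",), ())},  # sub:human+animal, obj:food
    "fly": {"sub": (("bird",), ("penguin",))},                          # sub:bird-penguin
}


def check(verb, **roles):
    """Validate role fillers against the verb's case frame."""
    frame = CASE_FRAMES[verb]
    return all(
        role in frame and satisfies(filler, *frame[role])
        for role, filler in roles.items()
    )
```

With this reading, `check("fly", sub="sparrow")` is accepted while `check("fly", sub="penguin")` is rejected, which is the exception that the `-` operator expresses.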
Hierarchy & Set
[Figure: a hierarchy rooted at c1 (nodes c2, w1 to w7, p1) and a set c3 (members f1 to f4)]
Defining a Hierarchy
def c1(“w1”(“w3”)+c2(“w4”)+“w2”).
def “w5”+“w6” under “w4”.
def “p1”(“w7”) under “w2”.
Resulting hierarchy:
c1
  w1
    w3
  c2
    w4
      w5
      w6
  w2
    p1
      w7
Manipulating the Hierarchy
redef “w4” under “w2”.
undef “w1”.
[Figure: the hierarchy (c1, w1, w3, w2, c2, w4, w5, p1, w6, w7) with these two operations applied]
Defining a Set
def c3{[f1]+[f2]+[f3]}.
def [f4] in c3.
[Figure: set c3 with members f1, f2, f3, f4]
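Defining a Relation
def meta r1(c2, c3).
  Template defined.
def r1(“w4”, [f1]).
  Relation defined.
def r1(“w1”, [f3]).
  Constraint violated.
  Definition not allowed.
[Figure: hierarchy class c2 (w4, w5, w6) and set c3 (f1, f2, f3, f4), with the relation r1 from w4 to f1, a link labelled "inherited" (r1’), and w1 outside c2]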
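The integrity check shown above can be imitated in Python as follows. This is a sketch under assumptions: the data layout and function names are hypothetical, and reading the "inherited" label as "descendants of w4 inherit r1" is an interpretation of the figure rather than something the slides state.

```python
# Hierarchy from the earlier slides, as child -> parent links.
PARENT = {
    "c2": "c1", "w1": "c1", "w2": "c1",
    "w3": "w1", "w4": "c2", "w5": "w4", "w6": "w4",
}
SETS = {"c3": {"f1", "f2", "f3", "f4"}}

META = {}       # relation name -> (hierarchy class, set name)
RELATIONS = []  # accepted (relation name, word, feature) triples


def is_under(node, ancestor):
    """True if `ancestor` is `node` itself or one of its ancestors."""
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False


def def_meta(name, cls, set_name):
    """Register a template: first argument must be under `cls`, second in `set_name`."""
    META[name] = (cls, set_name)
    return "Template defined."


def def_relation(name, word, feature):
    """Accept the relation only if both arguments satisfy the template."""
    cls, set_name = META[name]
    if not is_under(word, cls) or feature not in SETS[set_name]:
        return "Constraint violated. Definition not allowed."
    RELATIONS.append((name, word, feature))
    return "Relation defined."


def holds(name, word, feature):
    """Assumed inheritance: a relation defined on a node also holds for its descendants."""
    return any(
        n == name and f == feature and is_under(word, w)
        for n, w, f in RELATIONS
    )


messages = [
    def_meta("r1", "c2", "c3"),
    def_relation("r1", "w4", "f1"),
    def_relation("r1", "w1", "f3"),  # w1 is not under c2, so this is rejected
]
```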
Synset & Surrogates
• A synset is an unnamed set identified by its
unique ID.
• Members of a synset are considered
synonymous, with different degrees of synonymy.
• A distance graph is automatically constructed
within a synset, with surrogates being
representatives of synset members.
• Entities with identical features are attached to
the same surrogate.
Synset & Surrogates (figure)
[Figure: surrogate network internally constructed for synset#1. Surrogates s1 to s5 stand for the members w1, w2, w3, w4, w6 and p1, p2, p3, each labelled with features drawn from f1 to f4]
Synset & Multilingual Lexicon
• Synset members are not confined to a single
language; entities from different
languages may belong to the same synset.
• A distance matrix is computed from the number of
differing features for each pair of surrogates.
• Traversal from a word to its nearest words
is handled by the system, so we can determine the
words with potentially the closest semantics.
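A feature-difference distance of this kind can be sketched as the size of the symmetric difference between two feature sets. The surrogate names and feature assignments below are hypothetical, and the actual system may weight or count features differently.

```python
from itertools import combinations

# Hypothetical surrogates: each stands for all entities with this exact feature set.
SURROGATES = {
    "s1": frozenset({"f1", "f2"}),
    "s2": frozenset({"f1", "f4"}),
    "s3": frozenset({"f3", "f4"}),
}


def distance(a, b):
    """Number of features that appear in exactly one of the two sets."""
    return len(a ^ b)


def distance_matrix(surrogates):
    """Symmetric matrix of pairwise distances, keyed by (name, name)."""
    matrix = {(name, name): 0 for name in surrogates}
    for x, y in combinations(surrogates, 2):
        d = distance(surrogates[x], surrogates[y])
        matrix[(x, y)] = d
        matrix[(y, x)] = d
    return matrix


def nearest(name, surrogates):
    """Surrogates at minimal distance from `name` (ties are all returned, sorted)."""
    matrix = distance_matrix(surrogates)
    others = [n for n in surrogates if n != name]
    best = min(matrix[(name, n)] for n in others)
    return sorted(n for n in others if matrix[(name, n)] == best)


matrix = distance_matrix(SURROGATES)
```

Because the distance depends only on features, words from different languages attached to the same surrogate are at distance 0 from each other.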
Expected Result
tomatoes
“Fruit vegetable”, red
Keyword Generated: Sweet pepper, Tomatoes, Cherry Tomatoes
[Animation frames: the Expert Domain thesaurus (BT VEGETABLES through NT CAPSICUM, NICOTIANA) is revealed entry by entry as each keyword is generated]
tomatoes
“Plant in same family”
Keyword Generated: Capsicum, Nicotiana
[Animation frames: the same Expert Domain thesaurus, with the NT entries under SOLANACEAE revealed one at a time]
User Category: Commercial Vegetable (broccoli, carrot, tomato)
Keyword Assigned: tomato, tomatoes
tomato
Keyword Generated: Tomato, Tomatoes, Cherry Tomatoes
Expert Domain: the same thesaurus entry for "tomatoes" (BT VEGETABLES through NT CAPSICUM, NICOTIANA)
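The first two expected results can be reproduced with a toy lookup over the Expert Domain entries. The dictionary layout and the function names `by_features` and `same_family` are hypothetical; the terms, attribute values, and BT/NT/RT links are those shown on the slides.

```python
# Expert Domain entries from the slides, as attribute records.
THESAURUS = {
    "BROCCOLI": {"type": "leaf vegetable", "color": {"green"}},
    "SWEET PEPPER": {"type": "fruit vegetable", "color": {"red", "green", "yellow"}},
    "TOMATOES": {"type": "fruit vegetable", "color": {"red", "yellow"}},
    "CHERRY TOMATOES": {"type": "fruit vegetable", "color": {"red"}},
}

# TOMATOES -> RT LYCOPERSICON ESCULENTUM -> BT SOLANACEAE, collapsed to one link here.
FAMILY_OF = {"TOMATOES": "SOLANACEAE"}
# NT entries listed under SOLANACEAE on the slide.
FAMILY_MEMBERS = {"SOLANACEAE": ["CAPSICUM", "NICOTIANA"]}


def by_features(vegetable_type, color):
    """Terms whose type matches and whose colour set contains `color`."""
    return [
        term
        for term, attrs in THESAURUS.items()
        if attrs["type"] == vegetable_type and color in attrs["color"]
    ]


def same_family(term):
    """Other narrower terms of the family that `term` is related to, if any."""
    return FAMILY_MEMBERS.get(FAMILY_OF.get(term, ""), [])


fruit_red = by_features("fruit vegetable", "red")
family = same_family("TOMATOES")
```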
Conclusion and Next steps
• This is a preliminary introduction to the
language, showing a few of its many
possibilities.
• Structures not described in detail here
have not yet been firmly specified. These
structures are rules, maps, and contexts,
which are incorporated to extend the
potential for handling deductions,
multilingual operations, domain-dependent
retrievals, etc.
Next Steps
• Revise the idea
• Continue the implementation
– Aligner tool
– GUI tools for thesaurus maintenance
• Short-term: address language variability problems by
exploiting available knowledge sources with available
techniques
• Long-range: the approach needs high-quality language
understanding, i.e., automatic thesaurus construction
– System of Agricultural Information
Summarization and Translation
Thank you