20091014b_treehouse

thai-language.com
Glenn Slayden
October 14, 2009
Agenda
• Background and history
• Site surface demonstration
• Database ontology
• Database technology
• Data Entry demonstration
• Future directions
• Q&A: throughout, please
Overarching Motivation
• Long-term objectives:
– Increase linguistic rigor
– Publish any new work
– Maintain popular accessibility
– Build community
Historical Parchment - 1997
More Parchment - 2001
Site Demonstration
Database? What Database?
• How big is a monolingual dictionary?
• 100,000 words x 300 bytes/entry = 30 MB
• How much memory in a modern server? 32 GB.
• That’s about 1/10th of 1% (0.00094)
• SQL? MySQL? PostgreSQL? Not indicated.
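The ratio quoted above can be checked with a few lines of arithmetic. This is a minimal sketch in Python, used here only for illustration (the site itself is described as C#-based). The 30 MB and 32 GB figures come from the slide. Decimal units and the helper name `memory_fraction` are assumptions.

```python
# Figures from the slide; decimal units (1 MB = 10**6 bytes) are an assumption.
DICTIONARY_BYTES = 30 * 10**6   # estimated size of the dictionary data
SERVER_RAM_BYTES = 32 * 10**9   # memory in the server

def memory_fraction(data_bytes: int, ram_bytes: int) -> float:
    """Return the share of RAM that the data would occupy."""
    return data_bytes / ram_bytes

fraction = memory_fraction(DICTIONARY_BYTES, SERVER_RAM_BYTES)
print(f"{fraction:.5f}")  # 0.00094
print(f"{fraction:.2%}")  # 0.09%
```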
Case Study
October 13, 2009 – 64-bit web server – 32 GB RAM
Server Memory Utilization
n.b. this entire pie chart represents 10% of total memory
In-memory is the way to go
• For performance
• For ease and speed of development
• Easy refactoring
• LINQ – C# “language-integrated query”
• Have a flexible and powerful object-model without worrying about relational mapping
• Completely avoid OR/M (object-relational mapping) “impedance mismatch” issues
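As a rough illustration of the in-memory approach, the sketch below keeps plain objects in a list and queries them directly, with no mapping layer in between. It is written in Python as a stand-in for a C# LINQ query. The class fields, the sample records, and the function name `glosses_in_category` are all hypothetical placeholders, not the site's actual object model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    # Hypothetical fields; the real object model is not shown in the talk.
    headword: str
    gloss: str
    category: str

# Placeholder records standing in for dictionary data held in memory.
ENTRIES = [
    Entry("word-b", "gloss for b", "noun"),
    Entry("word-a", "gloss for a", "noun"),
    Entry("word-c", "gloss for c", "verb"),
]

def glosses_in_category(entries, category):
    """Filter, sort, and project in one expression, much as a LINQ query would."""
    ordered = sorted(entries, key=lambda e: e.headword)
    return [e.gloss for e in ordered if e.category == category]

print(glosses_in_category(ENTRIES, "noun"))  # ['gloss for a', 'gloss for b']
```

Because the query runs over ordinary objects, renaming a field or adding a new one is an ordinary refactoring step rather than a schema migration.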
thai-language.com Ontology
• Disclaimer and warning
– Internal names of programming objects are no longer intended to have any relationship to the corresponding linguistic terms. On the following slides, please treat these names as opaque monikers.
thai-language.com Ontology
Entry
Definition
Phrase
Category
These colors correspond (roughly) to data-entry screen colors in DBEdit
The most basic
Lucky Decision
• ...that turned out to be incredibly valuable:
– Heterogeneous objects are assigned ID numbers
within mutually exclusive ranges
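One way to picture the ID-range idea: if each kind of object draws its IDs from its own range, the kind can be recovered from the bare number. The sketch below assumes this, in Python for illustration. The object names come from the ontology slide, but the range boundaries and the function name `kind_of` are invented for the example.

```python
from bisect import bisect_right

# Hypothetical range starts; the real boundaries are not given in the talk.
ID_RANGES = [
    (1, "Entry"),
    (1_000_000, "Definition"),
    (2_000_000, "Phrase"),
    (3_000_000, "Category"),
]
_STARTS = [start for start, _ in ID_RANGES]

def kind_of(object_id: int) -> str:
    """Return the object kind whose ID range contains object_id."""
    if object_id < _STARTS[0]:
        raise ValueError(f"ID {object_id} is below every known range")
    # Find the last range whose start is <= object_id.
    index = bisect_right(_STARTS, object_id) - 1
    return ID_RANGES[index][1]

print(kind_of(5))          # Entry
print(kind_of(1_000_000))  # Definition
print(kind_of(2_500_000))  # Phrase
```

With ranges like these, a single ID is enough to route a lookup or a link to the right kind of object without carrying a separate type tag.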
Scary Picture with Clouds In It
Data Entry Demonstration
Future directions
• Track provenance of entries and changes
• Separate out meta-information in English senses
• Move towards community curatorship while
maintaining asset value
– Requires reputation-granting authority
• Refine and formalize dictionary statement of
purpose (i.e. to prevent hijacking)
Technology Changes
• In 2009, optimizing a language dictionary
database for size is not necessary
• Detailed fields should be generously deployed
• Exception to the in-memory model:
– Comprehensive change version tracking may
warrant database storage
– This is necessary for community curatorship
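A minimal sketch of what database-backed change tracking might look like, using Python's built-in `sqlite3` module purely for illustration. The table layout, column names, and function names are assumptions, not the site's design. The point is only that every edit becomes an append-only row recording who changed what.

```python
import sqlite3

def open_change_log(path=":memory:"):
    """Open (or create) a database holding an append-only change log."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS change_log (
               version    INTEGER PRIMARY KEY AUTOINCREMENT,
               object_id  INTEGER NOT NULL,
               field      TEXT NOT NULL,
               old_value  TEXT,
               new_value  TEXT,
               editor     TEXT NOT NULL,
               changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn

def record_change(conn, object_id, field, old_value, new_value, editor):
    """Append one change and return its version number."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO change_log (object_id, field, old_value, new_value, editor) "
            "VALUES (?, ?, ?, ?, ?)",
            (object_id, field, old_value, new_value, editor),
        )
    return cursor.lastrowid

def history(conn, object_id):
    """Return all recorded changes for one object, oldest first."""
    rows = conn.execute(
        "SELECT version, field, old_value, new_value, editor "
        "FROM change_log WHERE object_id = ? ORDER BY version",
        (object_id,),
    )
    return rows.fetchall()
```

The live dictionary objects could still be served from memory. Only the audit trail would need to live on disk.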
An integrated DELPH-IN style
computational-analytical grammar
• Associate a rigorous HPSG feature structure
with each sense
• Display MRS and tree on dictionary page for
compounds and sentences.
• Ability to designate gold standard parse trees
and attestation provenance
• Live interface for LKB/PET-style parser to
provide arbitrary parsing
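To give a flavor of what attaching a feature structure to a sense involves, here is a deliberately simplified sketch of feature-structure unification over nested Python dictionaries. It leaves out types and re-entrancy, both of which a real HPSG or DELPH-IN grammar relies on. The feature names (`HEAD`, `POS`, `COUNT`) and the function name `unify` are hypothetical.

```python
def unify(a, b):
    """Unify two untyped feature structures given as nested dicts.

    Returns the merged structure, or None when two atomic values conflict.
    Simplification: no types and no re-entrancy (structure sharing).
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_value in b.items():
            if key in result:
                merged = unify(result[key], b_value)
                if merged is None:
                    return None  # conflict somewhere below this feature
                result[key] = merged
            else:
                result[key] = b_value
        return result
    # Atomic values unify only when they are equal.
    return a if a == b else None

# Hypothetical feature structure attached to one dictionary sense.
sense = {"HEAD": {"POS": "noun"}}
constraint = {"HEAD": {"POS": "noun", "COUNT": "mass"}}

print(unify(sense, constraint))
# {'HEAD': {'POS': 'noun', 'COUNT': 'mass'}}
print(unify(sense, {"HEAD": {"POS": "verb"}}))
# None
```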
Thanks for Coming!