The Universities’ Collection Databases
 ”The Universities’ Collection Databases” denotes all databases
developed by the Unit for digital documentation at the Arts
Faculty, University of Oslo.
 The databases contain data from archaeology, anthropology,
botany, zoology, numismatics, history, art history and
lexicography.
 The databases are accessible via specially developed end user
applications and via the WWW.
The Universities' Collection databases
This presentation gives an overview of
 A common user interface
 Samples from some of the databases
Implementation
 The databases are implemented in Oracle 8.1.7, without using any specific
object-oriented features
 The object types (and the table structures) are defined in a common
meta database
 All databases are accessed via a common framework
 The common framework gets design and structure information from
the meta database. All queries are generated automatically on the
basis of the information in the meta database.
 Each user is granted access via a user database
 The user interface program checks the meta database for new versions
of modules and upgrades itself automatically via the net.
 New databases are added regularly
 A WWW version is being developed
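The idea that queries are generated from the meta database can be illustrated with a minimal sketch. The metadata layout, the table and column names (ARTIFACT, ARTIFACT_TYPE, COUNTY) and the function build_query are assumptions made for illustration; they are not the framework's actual schema or API.

```python
# Hypothetical metadata: object type -> table and searchable columns.
META = {
    "artifact": {
        "table": "ARTIFACT",
        "columns": ["ID", "ARTIFACT_TYPE", "COUNTY"],
    },
}


def build_query(object_type, criteria):
    """Build a parameterised SELECT from the metadata for one object type."""
    meta = META[object_type]
    unknown = set(criteria) - set(meta["columns"])
    if unknown:
        raise ValueError(f"columns not in metadata: {sorted(unknown)}")
    sql = f"SELECT {', '.join(meta['columns'])} FROM {meta['table']}"
    if criteria:
        # Named bind variables (:name), the placeholder style Oracle uses.
        where = " AND ".join(f"{col} = :{col.lower()}" for col in criteria)
        sql += f" WHERE {where}"
    params = {col.lower(): value for col, value in criteria.items()}
    return sql, params


sql, params = build_query("artifact", {"ARTIFACT_TYPE": "ring", "COUNTY": "Akershus"})
print(sql)
print(params)
```

Because the statement is built only from names found in the metadata, a new database can be supported by adding a metadata entry rather than writing new query code.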
The users have their
personal navigator
for quick access to
databases of interest.
Each database has
an associated object
type
The users can add
their own folders
or categories to
the navigator
Choose a database
(archaeological
artifacts and finds)
Search for
the artifact
type ”ring”
Click on a
column title to
sort the result
grid
Drag and drop a
column title to
group the rows in
the grid
9 rings found
in the county
'Akershus'
Double click to view
detailed information
(show the object
viewer)
The artifacts found
together with the
selected ring (in the
same find event)
The users can export the
data as HTML, Excel or
according to the users’
predefined report templates
The result
grid exported
to Excel
The users can
define report
templates
Drag and drop result rows onto a
predefined report template of the
corresponding object type to
create a report
The report is ready to
be printed
Click to save
pointers to selected
rows in the result
grid
A list can hold pointers to a
manually selected set of
objects or a dynamic set (query
defined). The pointers can be
of a single object type or of
different types. In the latter
case the type of the list will be
the common supertype
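How a list of mixed pointers could be given a common supertype is sketched below. The type hierarchy and the names SUPERTYPE, ancestors and list_type are hypothetical; the actual object types are defined in the meta database and are not shown in the presentation.

```python
# Hypothetical type hierarchy: each object type maps to its direct supertype.
SUPERTYPE = {
    "ring": "artifact",
    "coin": "artifact",
    "artifact": "object",
    "place_name": "object",
    "object": None,
}


def ancestors(object_type):
    """Return the type itself followed by its supertypes, nearest first."""
    chain = []
    while object_type is not None:
        chain.append(object_type)
        object_type = SUPERTYPE[object_type]
    return chain


def list_type(pointer_types):
    """Return the nearest supertype shared by all pointers in a list."""
    types = list(pointer_types)
    common = ancestors(types[0])
    for t in types[1:]:
        other = set(ancestors(t))
        common = [a for a in common if a in other]
    return common[0]


print(list_type(["ring", "ring"]))        # ring
print(list_type(["ring", "coin"]))        # artifact
print(list_type(["ring", "place_name"]))  # object
```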
Click on the list icon to see the
content of a list. In the system a
stored list is just a (sub) database
and can be queried.
Additional pointers
can be added to an
existing list
Click on the explorer icon
to get an overview of users
and data sources
(databases and stored
lists)
Select another
database (here: place
names excerpts)
Click to see both the
result grid and the
object viewer
Click to switch
windows
Display the object
corresponding to the next
row in the result grid
The users can
create and store
their personal
result grid design
The tree structure reflects
the structure of the object
type as defined in the meta
database
The users can create
and store their
personal query form
design
Linguistic and lexicographic
applications
 Lexicographic archives
 Lexical databases
 Dictionary databases
 Editing tools
 The Meta Dictionary - a tool for the field linguist or
lexicographer
 The Norwegian Dictionary project
 Text corpus tools
Lexical archives
 The database for the traditional word slip collection of
the Norwegian Dictionary project
 Main collection: 2 900 000 facsimiles
 Regional collection: 187 000 facsimiles
 The database is linked to the Meta Dictionary
Head word
Part of speech
Literature references
Place of utterance
Facsimiles
Morphological databases
 Lists with lemmata and inflected forms for
the two Norwegian written languages
(bokmål, nynorsk)
 Basis for a two-level morpho-syntactic tagger
 Produced in collaboration with the Text
Laboratory at the Arts Faculty, University of Oslo
 Bokmål: 156,000 lemmata, 1.2 million
inflected forms
 Nynorsk: 123,000 lemmata, 896,000 inflected
forms
 The databases are linked to the Meta
Dictionary
Lemma
Paradigm codes and
generated inflected
forms
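Generating inflected forms from a lemma and a paradigm code can be sketched as follows. The code name "n1", the suffix table and the function inflect are invented for illustration; the real paradigm codes of the morphological databases are not given in the presentation.

```python
# Hypothetical paradigm table: code -> suffixes appended to the lemma.
# The code name "n1" and the suffix set are illustrative only.
PARADIGMS = {
    # Neuter noun: indef. sg., def. sg., indef. pl., def. pl.
    "n1": ["", "et", "", "a"],
}


def inflect(lemma, paradigm_code):
    """Generate the inflected forms of a lemma from its paradigm code."""
    return [lemma + suffix for suffix in PARADIGMS[paradigm_code]]


print(inflect("hus", "n1"))  # ['hus', 'huset', 'hus', 'husa']
```

Storing only the lemma and its code keeps the lemma list small, while the full set of inflected forms can be regenerated whenever a paradigm is corrected.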
Dictionary databases
 Database tools for two major Norwegian
dictionaries
 The entire process from editing to camera-ready
manuscript
 The tools are integrated in the common
framework
 The manuscripts are linked to the Meta
Dictionary
The dictionary
entry
Fields for different
information categories
Graphical representation
of the definition
structure
The editing tools are, for
the time being, not
part of the common
framework
A WYSIWYG
presentation
of the entry
The entries can be
viewed in their
running context
The program generates the
head word part of the entry
based on the lemma and
part of speech marking
Navigation
buttons
A set of entries (or the
entire manuscript) can be
typeset in the PDF format
and presented on the
screen.
The entries are exported
from the database as XML
documents, converted via
TeX and DVI to PDF, and
sent back to the user.
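The first step of such a pipeline, turning an exported XML entry into TeX source, might look like the sketch below. The element names (entry, head, pos, def) and the LaTeX layout are assumptions; the actual export format and typesetting macros are not described in the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry markup; the element names are illustrative only.
ENTRY_XML = (
    "<entry><head>hus</head><pos>n.</pos>"
    "<def>building for people to live in</def></entry>"
)

TEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
}


def tex_escape(text):
    """Escape the most common characters with a special meaning in TeX."""
    return "".join(TEX_SPECIALS.get(ch, ch) for ch in text)


def entry_to_tex(xml_text):
    """Convert one XML entry to a line of LaTeX source."""
    entry = ET.fromstring(xml_text)
    head = tex_escape(entry.findtext("head", default=""))
    pos = tex_escape(entry.findtext("pos", default=""))
    definition = tex_escape(entry.findtext("def", default=""))
    return rf"\textbf{{{head}}} \textit{{{pos}}} {definition}\par"


print(entry_to_tex(ENTRY_XML))
```

The generated source could then be compiled to DVI and converted to PDF with standard TeX tools; which tools the unit used is not stated.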
The Norwegian Dictionary
 A national dictionary project (nynorsk)
 To be finished in 12 volumes by the year 2014
 DOK is developing the software solutions
 The dictionary manuscript is linked to the
Meta Dictionary
Graphical
representation of
the entry
The full text
based on the
structure of the
entry
Each part of the
dictionary entry
has its own data
entry form
Data entry form
for the head
word part
The entry text is
updated
automatically
Skard’s dictionary
 Defines the 1938 orthography
 32 000 entries
 The dictionary is linked to the Meta
Dictionary
The Meta Dictionary
 A tool for systematising weakly normalized
languages and a tool for the development of
the Norwegian Dictionary (NO2014)
 Interlinks different lexical databases
 521 000 headwords (NO2014)
 The backbone of the NO2014 project
924 slips about
the word ”hus”
(house)
Word forms/lemmata
written in different dialects
and/or according to
changing orthographies
Word compound
analysis
Object viewer
according to the
type of lexical
resource (here
slips)
Links to other
lexical
resources
Tool for fast
normalization of
the head words in
the Meta
Dictionary
Each project
assistant has to
normalize 300
entries a day
All links are
manually
checked
Norwegian (Nynorsk) electronic text corpus
Background
 Editorial requirements for NO2014
 Design and implementation
Unit for digital documentation, DOK
Work began in August 2002 and will continue
according to the tasks assigned to the unit by
NO2014 for one year
Long-term goals
 The definitive corpus for New Norwegian for
lexicography and for other domains using
electronic resources
 A corpus access system that can be reused for
other languages and text collections
 Incorporation of robust methods from
computational linguistics with the goal of creating
a linguistic workbench, over and above a corpus
workbench
Application Area
 Editorial work within NO2014
Headword selection
Choice of examples
Examples are catalogued in the Meta
Dictionary
Sense division
Firth: “You shall know a word by the company it
keeps.”
Aided by the refined collection of
examples
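Looking at the company a word keeps is commonly done with a keyword-in-context (KWIC) listing. The sketch below is a generic illustration of that technique, not the corpus system's own implementation; the function kwic and the sample sentence are invented.

```python
def kwic(tokens, keyword, width=2):
    """Return keyword-in-context lines: `width` tokens either side of each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Invented sample sentence, not taken from the corpus.
text = "Dei bygde eit hus ved elva og eit hus i byen"
for line in kwic(text.split(), "hus"):
    print(line)
```

Sorting such lines by the left or right neighbour groups similar contexts together, which is one way an editor can see candidate senses side by side.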
Integration with the Meta Dictionary
 Excerpta refined by
Methods from computational linguistics
Human interaction
 Eventually a selection will be made for
publication, but in the framework of the Meta
Dictionary, even those that were excluded from
publication will remain available for other
application areas
 Communication with the editing software through
the Meta Dictionary
Design
 Representative corpus based on specifications
produced by the EU language resources project,
LE-PAROLE
 SGML markup in accordance with PAROLE’s
specifications, based on TEI
 One-to-one mapping between the PAROLE
format and a database structure defined in Oracle.
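A one-to-one mapping between markup elements and database rows can be sketched as below. Everything here is a simplification made for illustration: the sample is XML rather than SGML, the element and attribute names are not taken from the PAROLE specification, the single element table is hypothetical, and an in-memory SQLite database stands in for Oracle.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; real PAROLE documents are SGML and richer.
DOC = (
    '<text id="t1"><p id="p1">'
    '<s id="s1">Eit hus</s><s id="s2">Ein ring</s>'
    "</p></text>"
)

db = sqlite3.connect(":memory:")  # in-memory SQLite stands in for Oracle here
db.execute(
    "CREATE TABLE element (id TEXT PRIMARY KEY, tag TEXT, parent_id TEXT, content TEXT)"
)


def store(element, parent_id=None):
    """Insert one row per markup element, keeping the link to its parent."""
    db.execute(
        "INSERT INTO element VALUES (?, ?, ?, ?)",
        (element.get("id"), element.tag, parent_id, (element.text or "").strip()),
    )
    for child in element:
        store(child, element.get("id"))


store(ET.fromstring(DOC))
rows = db.execute(
    "SELECT id, tag, parent_id, content FROM element ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

Because every element becomes exactly one row with a parent link, the original document structure can be rebuilt from the table.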
Status
 25,000,000+ words
Dag og Tid (newspaper)
21,000,000 words
Legacy data
approx. 5,000,000 words (literature)
 Existing agreements
Weekly deliveries from Dag og Tid
Samlaget (publishing house)
Syn og Segn (monthly magazine)
The next steps
 Application for access through the web, last
quarter of 2002
 Balancing the domains covered by the corpus:
continuous
 Stand-alone windows application
 Continuous incorporation of computational
linguistics methods for phrase identification and
extraction,
