Metadata as Infrastructure for Information Retrieval and Text Mining
Prof. Ray R. Larson
University of California, Berkeley
School of Information
March 2006
NaCTeM – Ray R. Larson
Overview
Metadata as Infrastructure
– What, Where, When and Who?
What are Entry Vocabulary Indexes?
– Notion of an EVI
– How are EVIs Built
Time Period Directories
– Mining Metadata for new metadata
Metadata as Infrastructure
The difference between memorization and
understanding lies in knowing the context
and relationships of whatever is of interest.
When setting out to learn about a new topic,
a well-tested practice is to follow the
traditional “5Ws and the H”: Who?, What?,
When?, Where?, Why?, and How?
Metadata as Infrastructure
The reference collections of paper-based libraries
provide a structured environment for resources,
with encyclopedias and subject catalogs,
gazetteers, chronologies, and biographical
dictionaries, offering direct support for at least
What, Where, When, and Who.
The digital environment does not yet provide an
effective, and easily exploited, infrastructure
comparable to the traditional reference library.
What?
Searching texts by topic, e.g. Dewey, LCSH, any subject
index, or category scheme applied to documents.
Two kinds of mapping in every search:
• Documents are assigned to topic categories, e.g. Dewey
• Queries have to map to topic categories, e.g. Dewey’s
Relativ Index from ordinary words/phrases to Decimal
Classification numbers.
Also mapping between topic systems, e.g. US Patent
classification and International Patent Classification.
‘What’ searches involve mapping to controlled vocabularies
[Diagram: Texts, Thesaurus/Ontology]
Start with a collection of documents.
Classify and index with a controlled vocabulary.
Or use a pre-indexed collection.
[Diagram label: Index]
Problem: Controlled vocabularies can be difficult for people to use.
For: “Wirtschaftspolitik”
In Library of Congress subj
Use: “Economic Policy”
“pass mtr veh spark ign eng”
[Diagram label: Index]
Solution: Entry Level Vocabulary Indexes.
“pass mtr veh spark ign eng” = “Automobile”
[Diagram labels: EVI, Index]
“What” and Entry Vocabulary
Indexes
EVIs are a means of mapping from user’s
vocabulary to the controlled vocabulary of a
collection of documents…
Building and Searching EVIs
User has a question but is unfamiliar with the domain he wants to search.
User selects a subject domain of interest.
– Domains to select from: Engineering, Medicine, Biology, Social science, etc.
Has an Entry Vocabulary Module been built?
– YES: Use an existing EVI.
– NO: Build an Entry Vocabulary Module (EVI).
Building an Entry Vocabulary Module (EVI)
– Download a set of training data from an Internet DB indexed with a controlled vocabulary.
– Extract terms (words and noun phrases) from titles and abstracts, using part-of-speech tagging for noun phrases.
– Build associations between extracted terms & controlled vocabularies.
Searching
– Map user’s query to ranked list of controlled vocabulary terms.
– User selects search terms from the ranked list of terms returned by the EVI.
Technical Details
Building an Entry Vocabulary Module (EVI)
– Download a set of training data from an Internet DB indexed with a controlled vocabulary.
– Extract terms (words and noun phrases) from titles and abstracts, using part-of-speech tagging for noun phrases.
– Build associations between extracted terms & controlled vocabularies.
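The building steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the toy training records, and the plain word tokenizer (standing in for part-of-speech tagging and noun-phrase extraction) are all assumptions. It counts, for a term t and a controlled vocabulary class C, how many training records fall into each of the four cells a, b, c, d used by the association measure.

```python
import re
from collections import Counter


def extract_terms(text):
    """Lowercased word tokens. Noun-phrase extraction via POS tagging is omitted."""
    return set(re.findall(r"[a-z]+", text.lower()))


def build_counts(records):
    """records: iterable of (title_or_abstract, [controlled vocabulary classes])."""
    n_records = 0
    term_counts, class_counts, pair_counts = Counter(), Counter(), Counter()
    for text, classes in records:
        terms = extract_terms(text)
        classes = set(classes)
        n_records += 1
        term_counts.update(terms)
        class_counts.update(classes)
        pair_counts.update((t, c) for t in terms for c in classes)
    return n_records, term_counts, class_counts, pair_counts


def contingency(term, cls, counts):
    """Return (a, b, c, d): records with t and C, t without C, C without t, neither."""
    n_records, term_counts, class_counts, pair_counts = counts
    a = pair_counts[(term, cls)]
    b = term_counts[term] - a
    c = class_counts[cls] - a
    d = n_records - a - b - c
    return a, b, c, d


# Hypothetical training records: (title, assigned index terms).
TRAINING = [
    ("Fuel economy of passenger automobiles", ["pass mtr veh spark ign eng"]),
    ("Automobile engine emissions", ["pass mtr veh spark ign eng"]),
    ("Economic policy in Germany", ["Economic policy"]),
    ("Monetary policy and inflation", ["Economic policy"]),
]

counts = build_counts(TRAINING)
print(contingency("policy", "Economic policy", counts))  # (2, 0, 0, 2)
```

Counting each term and class at most once per record keeps the four cells summing to the number of training records, which is what the contingency table assumes.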
Association Measure
        C     ¬C
t       a     b
¬t      c     d
Where t is the occurrence of a term and C is the
occurrence of a class in the training set
Association Measure
Maximum Likelihood ratio
$$W(C,t) = 2\,[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)]$$
where
$$\log L(p, k, n) = k \log p + (n - k) \log(1 - p)$$
and
$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$
Viz. Dunning
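As a hedged illustration, the likelihood ratio can be computed directly from the four contingency counts. The function names below are assumptions, and the convention 0·log(0) = 0 is added so that empty cells do not raise errors; the arithmetic otherwise follows the formula on this slide.

```python
import math


def log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n - k)*log(1 - p), taking 0*log(0) as 0."""
    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)
    return xlogy(k, p) + xlogy(n - k, 1 - p)


def association(a, b, c, d):
    """W(C, t) for the contingency counts a, b, c, d of a term t and a class C."""
    p1 = a / (a + b) if a + b else 0.0
    p2 = c / (c + d) if c + d else 0.0
    p = (a + c) / (a + b + c + d)
    return 2 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                - log_l(p, a, a + b) - log_l(p, c, c + d))


# A term spread evenly across classes carries no evidence;
# a term concentrated in class C scores higher.
print(association(10, 10, 10, 10))
print(association(30, 10, 10, 50))
```

Ranking all classes by W(C, t) for each query term gives the ranked list of controlled vocabulary terms that the EVI returns.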
Alternatively
Because the “evidence” terms in EVIs can
be considered a document, you can also use
IR techniques and use the top-ranked
classes for classification or query expansion
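One way to read this alternative is sketched below, under stated assumptions: the terms of all training records assigned to a class are pooled into one bag of words, and classes are ranked for a query with a simple TF-IDF style score. The weighting scheme, function names, and sample records are illustrative choices, not the method used in the project.

```python
import math
from collections import Counter


def build_class_documents(records):
    """Pool the terms of every training record assigned to a class into one bag."""
    docs = {}
    for terms, classes in records:
        for cls in classes:
            docs.setdefault(cls, Counter()).update(terms)
    return docs


def rank_classes(query_terms, docs):
    """Rank classes by summed (term frequency * inverse class frequency)."""
    n = len(docs)
    # Number of class documents in which each term occurs at least once.
    df = Counter(term for bag in docs.values() for term in bag)
    scores = {}
    for cls, bag in docs.items():
        length = sum(bag.values())
        score = sum((bag[t] / length) * math.log(1 + n / df[t])
                    for t in query_terms if t in bag)
        if score > 0:
            scores[cls] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical training records: (extracted terms, assigned classes).
RECORDS = [
    (["automobile", "engine", "emissions"], ["pass mtr veh spark ign eng"]),
    (["automobile", "fuel", "economy"], ["pass mtr veh spark ign eng"]),
    (["economic", "policy", "germany"], ["Economic policy"]),
]

class_docs = build_class_documents(RECORDS)
print(rank_classes(["automobile"], class_docs))
```

The top-ranked classes can then be offered to the user as search terms or used to expand the query.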
Find
Plutonium
In Arabic
Chinese
Greek
Japanese
Korean
Russian
Tamil
Digital library resources
Statistical association
W(C, t) = 2[logL(p1, a, a+b) ...
EVI example
User Query: “Automobile”
EVI 1 → Index term: “pass mtr veh spark ign eng”
EVI 2 → Index term: “automobiles” OR “internal combustion engines”
But why stop there?
[Diagram labels: EVI, Index]
“Which EVI do I use?”
[Diagram: multiple EVI and Index pairs]
EVI to EVIs
[Diagram labels: EVI2, with multiple EVI and Index pairs]
Why not treat language the same way?
Find Plutonium
In Arabic
Chinese
Greek
Japanese
Korean
Russian
Tamil
It is also difficult to move between different media forms
[Diagram: Texts, EVI, Thesaurus/Ontology, Numeric datasets]
Searching across data types
Different media can be linked indirectly via
metadata, but often (e.g. for socio-economic
numeric data series) you also need to specify
WHERE to get correct results
But texts associated with numeric data can be mapped as well…
[Diagram: Texts, EVI, Thesaurus/Ontology, EVI, captions, Numeric datasets]
EVI to Numeric Data example
[Diagram: numbered flow, steps 1 to 11. Labels: search interface 1, EVI, LCSH, online catalog, search results, marc, new query, numeric database, captions, numeric table, search interface 2]
But there are also geographic dependencies…
[Diagram: Texts, EVI, Thesaurus/Ontology, EVI, captions, Numeric datasets, Maps/Geo Data]
WHERE: Place names are
problematic…
Variant forms: St. Petersburg, Санкт Петербург,
Saint-Pétersbourg, . . .
Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvár.
Name changes: Bombay → Mumbai.
Homographs: Vienna, VA, and Vienna, Austria;
– 50 Springfields.
Anachronisms: No Germany before 1870
Vague, e.g. Midwest, Silicon Valley
Unstable boundaries: 19th century Poland;
Balkans; USSR
Use a gazetteer!
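A gazetteer addresses several of these problems by indexing every known name of a place and returning structured entries. The sketch below is a toy illustration only: the class names, fields, and approximate coordinates are assumptions and do not reflect the schema of any real gazetteer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Place:
    name: str
    country: str
    feature_type: str
    lat: float  # approximate, for illustration only
    lon: float
    variants: List[str] = field(default_factory=list)


class Gazetteer:
    """Toy gazetteer: resolves any known name to every matching place entry."""

    def __init__(self):
        self._index = {}

    def add(self, place):
        for name in [place.name, *place.variants]:
            self._index.setdefault(name.casefold(), []).append(place)

    def lookup(self, name):
        return list(self._index.get(name.casefold(), []))


gazetteer = Gazetteer()
gazetteer.add(Place("Cluj", "Romania", "city", 46.77, 23.59,
                    variants=["Klausenburg", "Kolozsvár"]))
gazetteer.add(Place("Mumbai", "India", "city", 19.08, 72.88, variants=["Bombay"]))
gazetteer.add(Place("Vienna", "Austria", "city", 48.21, 16.37))
gazetteer.add(Place("Vienna", "USA", "town", 38.90, -77.27))

# Homographs return several candidates; variant names resolve to one entry.
for place in gazetteer.lookup("Vienna"):
    print(place.name, place.country, place.lat, place.lon)
print(gazetteer.lookup("Klausenburg")[0].name)
```

Returning all candidates for a homograph leaves disambiguation to the caller, for example by map display or by restricting the search to a region.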
WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map.
Timebar
Zoom on map. Click on place for a list of records. Click on record to display text.
Catalogs and gazetteers should talk to each other!
Catalog
search
Gazetteer
search
Geographic sort / display of catalog search result.
So geographic search becomes part of the infrastructure
[Diagram: Texts, EVI, Thesaurus/Ontology, captions, Numeric datasets, Maps/Geo Data, Gazetteers]
WHEN: Search by time is also
weakly supported…
Calendars are the standard for time
But people use the names of events to refer to time
periods
Named time periods resemble place names in
being:
– Unstable: European War, Great War, First World War
– Multiple: Second World War, Great Patriotic War
– Ambiguous: “Civil war” in different centuries in
England, USA, Spain, etc.
Places have temporal aspects & periods have geographical aspects: when the Stone Age occurred varies by region
Similarity between place names
and period names
Suggests a similar solution: A gazetteer-like
Time Period Directory.
Gazetteer:
– Place name – Type – Spatial markers (Lat & long) – When
Time Period Directory:
– Period name – Type – Time markers (Calendar) – Where
Note the symmetry in the connections
between Where and When.
Solution - Time Period
Directories
Initial development involved mining the
Library of Congress Subject Authority file
for named time periods…
LC MARC Authorities Records
<USMARC>
<Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
<Fld670><a>Work cat.: 45053442: Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
<Fld670><a>Cath. encyc.</a><b>(Magdeburg: besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg: ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
</USMARC>
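Mining such a record for a named time period might look roughly like the sketch below. It is an assumption-laden illustration, not the project's pipeline: the function name, the regular expression, and the output keys (borrowed loosely from the time period directory components) are hypothetical, and only headings of the form "name, begin-end" in the chronological subdivision are handled.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the authority record shown above.
RECORD = """<USMARC>
<Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
</USMARC>"""

# Matches chronological subdivisions such as "Siege, 1550-1551" or "Siege, 1550-51".
DATE_RANGE = re.compile(r"^(?P<name>.*?),\s*(?P<begin>\d{3,4})-(?P<end>\d{2,4})$")


def mine_period(xml_text):
    """Return a dict describing the named period in a record, or None if absent."""
    root = ET.fromstring(xml_text)
    heading = root.find("Fld151")
    if heading is None:
        return None
    chrono = heading.findtext("y")
    if chrono is None:
        return None
    match = DATE_RANGE.match(chrono.strip())
    if match is None:
        return None
    begin, end = match.group("begin"), match.group("end")
    if len(end) < len(begin):
        # Expand an abbreviated end year, e.g. 1550-51 becomes 1551.
        end = begin[:len(begin) - len(end)] + end
    return {
        "periodID": root.findtext("Fld001", "").strip(),
        "periodName": match.group("name"),
        "place": heading.findtext("a"),
        "begin": int(begin),
        "end": int(end),
    }


print(mine_period(RECORD))
```

Records without a chronological subdivision simply yield None, so the same function can be run over a whole authority file to collect candidate periods.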
timePeriodEntry
Time Period Directory Instance
Contains components described below
- periodID
  Unique identifier
- periodName
  Period name, can be repeated for alternative names
  Information about language, script, transliteration scheme
  Source information and notes (where the period name was mentioned)
- descriptiveNotes
  Description of time period
- dates
  Calendar and date format
  Begin & end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing)
  Notes, sources
- periodClassification
  Period type, e.g. Period of Conflict, Art movement
  Can plug in different classification schemes
  Can be repeated for several classifications
- location
  Places associated with time period
  Contains both place name and entry to a gazetteer providing more specific place information like latitude / longitude coordinates
  Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic Names)
  Recently added coordinates for direct use
- relatedPeriod
  Related time periods
  periodID of related periods
  Information about relationship type (part-of, successor etc.)
  Can plug in different relationship type schemes
- entryMetadata
  Notes about creator / creation of instance
  Entry date
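The components listed above could be modelled along the following lines. This is only a sketch: the Python class and attribute names are assumptions derived from the component names, several sub-fields are simplified or omitted, and the example values (including reusing the LC record number as an identifier) are illustrative rather than taken from the actual directory.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PeriodName:
    name: str
    language: Optional[str] = None
    source: Optional[str] = None


@dataclass
class Dates:
    begin: str
    end: str
    calendar: str = "unspecified"
    notes: Optional[str] = None


@dataclass
class Location:
    place_name: str
    gazetteer_id: Optional[str] = None  # entry in an external gazetteer, if any
    lat: Optional[float] = None
    lon: Optional[float] = None


@dataclass
class RelatedPeriod:
    period_id: str
    relationship: str  # e.g. "part-of", "successor"


@dataclass
class TimePeriodEntry:
    period_id: str
    period_names: List[PeriodName]
    dates: Dates
    descriptive_notes: str = ""
    classifications: List[str] = field(default_factory=list)
    locations: List[Location] = field(default_factory=list)
    related_periods: List[RelatedPeriod] = field(default_factory=list)
    entry_metadata: dict = field(default_factory=dict)


# Illustrative entry built from the Magdeburg authority record.
entry = TimePeriodEntry(
    period_id="sh 00000613",  # hypothetical choice of identifier
    period_names=[PeriodName("Magdeburg (Germany)--History--Siege, 1550-1551",
                             source="LC subject authority record")],
    dates=Dates(begin="1550", end="1551"),
    classifications=["Period of Conflict"],
    locations=[Location("Magdeburg (Germany)")],
)

print(asdict(entry)["dates"])
```

Keeping names, classifications, locations, and related periods as lists mirrors the note that these components can be repeated.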
Time periods by named location
Catalog Search Result
Web Interface - Access by map
Zoomable interface gives access
to geographically focused info…
Web Interface - Access by timeline
Link initiates search of the
Library of Congress catalog
for all records relating to this
time period.
WHEN and WHAT
These named time periods are derived from Library of Congress catalog
subject headings and so can be used for catalog searching which finds books
on topics important for that time period
Time period directories link via the place (or time)
[Diagram: Texts, EVI, Thesaurus/Ontology, captions, Numeric datasets, Maps/Geo Data, Gazetteers, Time Period Directory, Time lines, Chronologies]
WHEN, WHERE and WHO
Catalog records found from a time period search commonly include
names of persons important at that time. Their names can be forwarded
to, e.g., biographies in the Wikipedia encyclopedia.
Place and time are broadly important across numerous tools
and genres including, e.g. Language atlases, Library catalogs,
Biographical dictionaries, Bibliographies, Archival finding
aids, Museum records, etc., etc.
Biographical dictionaries are heavy on place and time:
Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm
Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon,
Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv,
1970.
Life as a series of episodes involving Activity (WHAT),
WHERE, WHEN, and WHO else.
A new form of biographical dictionary would link to all
[Diagram: Biographical Dictionary, Texts, EVI, Thesaurus/Ontology, captions, Numeric datasets, Maps/Geo Data, Gazetteers, Time Period Directory, Time lines, Chronologies]
A Metadata Infrastructure
Learners
INTERMEDIA INFRASTRUCTURE
Facet    Authority Control          Special Display Tools
WHAT     Thesaurus                  Syndetic Structure
WHERE    Gazetteer                  Maps
WHEN     Time Period Directory      Timelines
WHO      Biographical Dictionary    Dossiers
CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers
RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages
[Additional diagram label: Text and Images]
Acknowledgements
Electronic Cultural Atlas Initiative project
This work was partially supported by the Institute
of Museum and Library Services through a
National Leadership Grant for Libraries, award
number LG-02-04-0041-04, Oct 2004 - Sept 2006
entitled “Supporting the Learner: What, Where,
When and Who” – See: http://ecai.org/imls2004
Michael Buckland, Fred Gey, Vivien Petras, Matt
Meiske, Kim Carl
Contact: [email protected]