Transcript Module 7a

Module 7a: Creating Controlled
Vocabularies
IMT530: Organization of Information Resources
Winter 2008
Michael Crandall
Steps in Constructing CVs
• Define your domain
• Gather concepts
– From user interviews, search logs, content
analysis, preexisting vocabularies
• Select your approach
• Extract terminology
• Control your terms
• Organize your terms
• Maintain, maintain, maintain
IMT530- Organization of Information Resources
Elements of Building CVs
• Select your approach
– Pre- or post-coordinated (sixteenth century lute music or sixteenth
century and lutes and music)
– Open or closed (indexers can add terms or not)
– Enumeration vs. synthesis (facets)
• Extract terms
– Warrant (from users or domain or both)
• Control terms
– Specificity (cats or Siamese cats?)
– Control of homographs (qualifications)
– Term consistency and word form (plurals, etc.)
– Multiword/phrase sequence and form (inverted, normal form?)
– Term definitions (scope notes)
– Syntax (citation order)
– Semantic factoring
• Organize terms
– Semantic relationships
Different Approaches
Pre- and Post-Coordination
• Pre-coordination involves creating terms
that combine multiple concepts (not
words) into a single term
• Post-coordination involves creating
terms that contain single concepts only,
not multiple ones
• Some authors refer to this as
“combination”, and say “pre-combined”
and “post-combined”
Single or Multiple Concept?
• Is “information retrieval systems” a
single concept or a multiple concept?
• Multiple concepts are often joined with
conjunction (and, or) or preposition
(in, of)
• Multiple concepts are often indicated in
subdivisions, which may be indicated by
a dash (--) or a comma (,)
• Bottom line is, it’s hard to tell in some
cases
Examples
Post-Coordinated Terms
Animal nutrition
Effects
Salt
Pre-Coordinated Term
Effects of salt on animal nutrition
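The difference between the two styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-document collection, the document IDs, the heading "Vitamins in animal nutrition", and the `search_and` helper are all invented for the example and are not part of the course material.

```python
# Hypothetical mini-collection; IDs and term assignments are made up.
post_coordinated_index = {
    "doc1": {"Animal nutrition", "Effects", "Salt"},
    "doc2": {"Animal nutrition", "Vitamins"},
}

pre_coordinated_index = {
    "doc1": {"Effects of salt on animal nutrition"},
    "doc2": {"Vitamins in animal nutrition"},  # invented heading
}


def search_and(index, *terms):
    """Return the IDs of documents indexed with every requested term."""
    wanted = set(terms)
    return sorted(doc for doc, assigned in index.items() if wanted <= assigned)


# Post-coordination: the searcher combines single concepts at query time.
post_hits = search_and(post_coordinated_index, "Salt", "Animal nutrition")

# Pre-coordination: the indexer already combined the concepts into one term.
pre_hits = search_and(pre_coordinated_index, "Effects of salt on animal nutrition")
```

Both searches find the same document; what differs is who does the combining (the searcher or the indexer) and when.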
More Examples
• More pre-coordinate terms
– France – Textile industries – Skilled
Personnel – Training (PRECIS)
– Plants – Nutrition – Genetic aspects
(LCSH)
• Pre-coordinate terms often have
subdivisions (the words that appear after
the hyphens above)
Advantages of Pre-Coordination
• All of the concepts that may apply to indexing a single
document may appear in a single term
• Multiple concepts have the context and meaning
embedded in syntactic order and constructions
– they may make more sense
– they are more precise
– different syntax means that concepts with different
meanings can be represented using the same simple
concepts, e.g.:
Art by children
Art for children
Art about children
Children and art
Children in art

Music industry
Music for industry
Industrial uses of music
Music about industry
Advantages of Pre-coordination
• More terms are available for indexers to use
to express the subjects of documents
• A multiple-concept search returns a list of
terms to select from (not a list of document
representations with those words in them)
• Thus, a user is able to browse all the topics
available to get an overview of what is
available
Sample Display
• Accidents -- 29 Related LC Subjects Accidents
• Accidents Aeronautics Military United States
• Accidents Aeronautics Statistics Periodicals
• Accidents Aerosols
• Accidents Agricultural Laborers United States
• Accidents Agriculture
• Accidents Agriculture Abstracts
• Accidents Agriculture Bibliography
• Accidents Agriculture Research United States
Disadvantages of Pre-Coordination
• Must create more terms, more costly to
create
• Often, complex rules for combination are
needed to create pre-coordinated terms.
Result: the cost of the CV is increased,
training for CV designers is longer and
more difficult, and the possibility of error
is increased
• Makes for a long term list
Disadvantages of Pre-Coordination
• Long strings of terms may not be
interpretable by users, for example:
– Ethnic groups – young people – ethnic
identity – psychotherapy – cultural aspects
• In manual systems, access is limited to
the first concept listed; only with online
keyword access are the other
embedded concepts accessible.
Advantages of Post-coordination
• Vocabulary is short, because each
concept is only represented once
• Rules for creation of terms are often
simpler
• Simple, thus easier to construct, thus
less costly
• Terms are shorter and easier to read
and understand
• In a manual system, individual concepts
may be more accessible
Disadvantages of Post-coordination
• Does not allow for subtle distinctions in
meaning
– Art and children vs. children in art vs. art by
children
– Music in industry vs. industry in music vs.
music industry
• May have to assign a lot of headings for
a single document, thus relying on
searching mechanisms to put them
together
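The loss of distinction can be shown with a small sketch. It assumes, purely for illustration, that post-coordinate indexing stores an unordered set of single-concept terms per document; the `post_coordinate` helper is hypothetical.

```python
def post_coordinate(*concepts):
    """Index a subject as an unordered set of single-concept terms."""
    return frozenset(concepts)


# Three different subjects from the slide above.
art_and_children = post_coordinate("Art", "Children")
children_in_art = post_coordinate("Children", "Art")
art_by_children = post_coordinate("Art", "Children")

# All three collapse to the same set of terms, so the relationship
# between the concepts (and, in, by) is no longer recorded.
indistinguishable = art_and_children == children_in_art == art_by_children

# Pre-coordinated headings keep the syntax, so they stay distinct.
pre_coordinated = {"Art and children", "Children in art", "Art by children"}
```

A Boolean AND search on the post-coordinated terms would retrieve documents on all three subjects together, and the searcher would have to sort them out by reading the results.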
Disadvantages of Post-Coordination
• A multiple-concept search frequently
results in a list of document
representations with those words in
them; these results are not grouped
according to similarity, but are often
listed in a random order.
• The results list of document
representations does not give the user
an overview of the subject area covered
by the words entered in the search
Sample Display
RESULTS OF BOOLEAN “AND” SEARCH ON natural AND
disaster
• Mapping vulnerability : disasters, development, and people /
edited by Greg Bankoff, Georg Frerks, D
• Understanding the economic and financial impacts of natural
disasters / Charlotte Benson, Edward J.
• Cultures of disaster : society and natural hazards in the
Philippines / Greg Bankoff
• Malaria control during mass population movements and natural
disasters / Peter B. Bloland and Holly
• Hurricane! : coping with disaster : progress and challenges since
Galveston, 1900 / Robert Simpson,
• The use of earth observing satellites for hazard support
[electronic resource] : assessments & scena
• The vulnerability of cities : natural disasters and social resilience
/ Mark Pelling
Open and Closed Controlled
Vocabularies
• An open vocabulary is one in which an
indexer may add a term at any time if
they need it; this is pretty rare in traditional
indexing (but common in folksonomies)
• A closed vocabulary is one in which an
indexer may not add terms at all.
Term additions are controlled by the
creators of the CV, not by indexers
Synthesis and Enumeration
• The synthesis & enumeration attribute
has to do with how a controlled
vocabulary is set up to operate and with
where and at what point term creation
happens
• The creation of terms may be restricted
to the CV designer in some cases; in
other cases, indexers have some
flexibility in creating new terms by using
a technique called “synthesis”
Enumeration
• An enumerated vocabulary is simply a list of
terms. Indexers look at the list, select a term,
and use it for indexing
• If a term is not present to index a particular
document, then the indexer has to either ask
the CV designer to add a term, or they are
stuck
• Many enumerated vocabularies are also
closed vocabularies
• Enumerative vocabularies came first in history
– it probably didn’t occur to anyone that there
could be any other way!
Example of Enumeration
• Sample list of enumerated terms:
– bowls
– plastic bowls
– wood bowls
– wood chairs
– steel chairs
– wood bookshelves
– steel bookshelves
• Note that if an indexer had a document on “steel
bowls”, that term is not available. The indexer using
this vocabulary would have to either assign “bowls”
(not specific), or would have to ask the CV designer to
add the term “steel bowls”
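The indexer's situation with this list can be sketched as follows. This is a minimal illustration, not a real indexing tool: the `assign_term` helper is hypothetical, and treating the last word of a phrase as its broader term is a simplifying assumption that happens to work for this sample list.

```python
# The enumerated list from the slide above; indexers may only pick from it.
ENUMERATED_TERMS = {
    "bowls",
    "plastic bowls",
    "wood bowls",
    "wood chairs",
    "steel chairs",
    "wood bookshelves",
    "steel bookshelves",
}


def assign_term(wanted):
    """Return (term, note) for a subject, using only terms already listed."""
    if wanted in ENUMERATED_TERMS:
        return wanted, "exact match"
    # Assumption: the last word of the phrase names a broader concept.
    broader = wanted.split()[-1]
    if broader in ENUMERATED_TERMS:
        return broader, "broader term only; ask the CV designer for the specific term"
    return None, "no usable term; ask the CV designer to add one"


steel_bowls = assign_term("steel bowls")
```

For "steel bowls" the sketch falls back to "bowls"; for a subject such as "steel desks" (not in the list under any form) it returns no term at all, which is the "stuck" case described earlier.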
Advantages of Enumeration
• Enumerated vocabularies are often easy
to use because there are fewer rules for
indexers (just look up your term, write it
down, and move on!)
• All possible terms appear in the
vocabulary, so it is easy to search and
display all possible terms
Disadvantages of Enumeration
• Some terms are not available for the indexer
to use; some combinations simply are not
there
• List of terms may become very long (the
Library of Congress Classification, a highly
enumerated classification scheme, has 46
volumes!)
• Terms may be repeated over and over
– Wood bowls
– Wood chairs
– Wood bookshelves
– Wood cabinets
– Wood structures
Synthesis
• Synthesis is a technique developed in the 20th
century as a means of saving space and time
in CV creation, and of extending flexibility to
the indexer
• In a synthetic system, tables containing single
terms are created by the CV designer and
indexers follow rules to combine the terms
from different tables to create a new term
• We’ll look at this in more detail in a couple
weeks when we discuss faceted classification
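As a rough sketch of the idea, the enumerated furniture terms used earlier can be rebuilt from two small tables and one combination rule. The table names, the `synthesize` helper, and the "material first, then object" citation order are assumptions made for this illustration only.

```python
from itertools import product

# Tables of single terms, maintained by the CV designer.
MATERIALS = ["plastic", "wood", "steel"]
OBJECTS = ["bowls", "chairs", "bookshelves"]


def synthesize(material, obj):
    """Combine one term from each table: material first, then object."""
    if material not in MATERIALS or obj not in OBJECTS:
        # The indexer may combine terms but may not add to the tables.
        raise ValueError("term not in table; only the CV designer may add it")
    return f"{material} {obj}"


# Six table entries are enough to generate nine combined terms.
all_terms = [synthesize(m, o) for m, o in product(MATERIALS, OBJECTS)]
steel_bowls = synthesize("steel", "bowls")
```

Unlike the enumerated list, "steel bowls" is available here without asking the CV designer, while a term missing from the tables is still rejected.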
Synthesis and Enumeration
vs. Pre- & Post-coordination
• The relationship between enumeration &
synthesis and pre- & post- coordination is not
one-to-one!
• Some enumerated vocabularies are pre-coordinate; others are post-coordinate
• Most synthetic vocabularies are pre-coordinate, but it is possible for a synthetic
vocabulary to be post-coordinate, particularly
where it is exposed to end users
– Where indexers assign terms from facets, the user
has no control over coordination, but where a user
can select and combine facets, it’s post-coordinate
Synthesis and Enumeration vs.
Open and Closed Vocabularies
• All synthetic systems are open to a
limited extent because indexers may
combine simple terms to create new
longer terms - but are closed if indexers
may not add new terms to tables
• Synthetic systems are completely open
if an indexer is allowed to add terms to
the tables, add new tables, and add new
rules for term synthesis
Extracting Terminology
Sources and Origins of
Terminology
• Where do you get terms for a controlled
vocabulary?
• The choice of sources for terminology can be
guided by an explicit statement of warrant
• Making a conscious decision about
warrant demonstrates that as a CV
designer you are aware of the different
possibilities and have made considered
choices
Warrant
• Warrant is “the authority that is used to
justify decisions about what is included
in a system,” (Clare Beghtol)
• Types of warrant:
– Literary warrant
– User warrant
– Scholarly warrant
– Cultural warrant (Beghtol, 2002)
Literary & User warrant
• Literary Warrant
– terms or organization reflect or are taken
directly from resources themselves; this
includes dictionaries, encyclopedias, etc.
on a topic
• User (aka Use, Enquiry) Warrant
– terms or organization reflect use; user
terminology may (or may not) be taken
directly from logs of system use or from
personal interactions with users
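One way user warrant might be gathered from system logs is to count how often users type each query. The sketch below is an assumption-laden illustration: the log entries are invented, and the `candidate_terms` helper with its `min_count` threshold is hypothetical rather than a method prescribed in the course.

```python
from collections import Counter

# Hypothetical search log; real logs would need more cleaning than this.
search_log = [
    "siamese cats",
    "Siamese cats",
    "cat food",
    "siamese cats ",
    "kittens",
    "cat food",
]


def candidate_terms(queries, min_count=2):
    """Count normalized queries and keep those seen at least min_count times."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return [(term, n) for term, n in counts.most_common() if n >= min_count]


candidates = candidate_terms(search_log)
```

The output is only a list of candidate concepts with counts; deciding which of them become controlled terms, and in what form, is still the CV designer's job.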
Scholarly & Cultural Warrant
• Scholarly Warrant
– terms or organization reflect the opinions of a
panel of human experts
• Cultural Warrant
– terms or organization derived from cultural practice
or understanding; for example, Dewey and LCSH
reflect American/Western cultural bias; Colon
Classification reflects Indian/Eastern cultural bias
(this also can be partly a function of literary
warrant…)
Questions?
• If not, take a break!!!
Exercise 7a
• Today we are starting a multiple-part
exercise in building controlled
vocabularies
• The first step is extracting concepts from
a defined domain
• Form your groups and work through the
two parts of exercise 7a
• Keep a copy of your concept lists to use
in the next exercise