Indexing Language


Final Exam Review
SIMS 202
Profs. Hearst & Larson
UC Berkeley SIMS
Fall 2000
Final Exam

Monday Dec 11
– 9:30-12:30
– Room 202 and 205

Bring
– Pens/pencils
– Calculator
– Notes/Books (optional)
Final Exam

Topics
– Comprehensive, but
– Emphasis on materials since the midterm

Types of questions
– Similar to those on homeworks and the
midterm, but less time-consuming
– Probably a design problem.
Relationships among
Language, Concepts, and Categories
Symbols and Language
Abstract concepts are difficult to
express in a computer.
 Combinations of abstract concepts
are even more difficult to express:

– time
– shades of meaning
– social and psychological concepts
– causal relationships
Symbols and Language
As the man walks the cavorting dog, thoughts
arrive unbidden of the previous spring, so unlike
this one, in which walking was marching and
dogs were baleful sentinels outside unjust halls.
What is the relation between the symbols and the meaning?
Symbols and Language


Language only hints at meaning.
Most meaning of text lies within our minds
and common understanding.
– “How much is that doggy in the window?”
» how much: social system of barter and trade (not the
size of the dog)
» “doggy” implies childlike, plaintive, probably cannot do
the purchasing on their own
» “in the window” implies behind a store window, not
really inside a window, requires notion of window
shopping
Lexical Relations

Conceptual relations link concepts

Lexical relations link words


How do they differ?
How are they similar?
Major Lexical Relations
Synonymy
 Polysemy
 Metonymy
 Hyponymy/Hyperonymy
 Meronymy
 Antonymy

Relationships among Meanings

Homonymy: same word, different meanings
– bank (river bank) vs bank (financial institution)

Polysemy: same word, different senses of
meaning
– slightly different concepts expressed similarly
– bank (institution vs building)

Synonyms: different words, related senses
of meanings
– different ways to express similar concepts
– jail, prison, penitentiary
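The three relations above can be sketched with a toy lexicon that maps each word to the concepts it can express. This is only an illustration: the concept labels and the helper names `is_ambiguous` and `synonyms` are invented here, and the sketch does not try to separate homonymy from polysemy.

```python
# Toy lexicon: each word maps to the set of concepts (senses) it can express.
# The concept labels are invented for this sketch.
LEXICON = {
    "bank": {"river_edge", "financial_institution", "bank_building"},
    "jail": {"place_of_confinement"},
    "prison": {"place_of_confinement"},
    "penitentiary": {"place_of_confinement"},
}

def is_ambiguous(word):
    """True if one word form carries several senses (homonymy or polysemy)."""
    return len(LEXICON[word]) > 1

def synonyms(word):
    """Other words that share at least one sense with `word`."""
    senses = LEXICON[word]
    return sorted(w for w, s in LEXICON.items() if w != word and s & senses)

print(is_ambiguous("bank"))
print(synonyms("jail"))
```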
Defining Category Membership
Necessary and Sufficient Conditions:
– (This used to be a very influential
definition of category membership; it is
ok for math and logic but out-of-date
for human categories)
– Every condition must be met.
– No other conditions can be required.
Can category membership be
crisply defined?
What are the necessary and
sufficient conditions for
something to be a game?
Properties of Categorization

Family Resemblance
– Members of a category may be related to one
another without all members having any property in
common.
» Instead, they may share a large subset of traits.
» Some attributes are more likely given that others have been
seen.
– Example: feathers, wings, twittering, ...
» Likely to be a bird, but not all features apply to “emu”
» Unlikely to see an association with “barks”
Properties of Categorization

Centrality
– Some members of a category may be
“better examples” than others.
» Example: robins vs. chickens vs. emus
» Example: soccer vs. gambling vs. hopscotch
Properties of Categorization

Characteristic Features
– Perceived degree of category membership has
to do with which features define the category.
– Members usually do not have ALL the necessary
features, but have some subset.
– Those members that have more of the central
features are seen as more central members.
– People have conceptions of typical members.
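One rough way to picture family resemblance and centrality is to score each candidate by the share of a category's characteristic features it has. The feature sets below are assumptions made up for illustration, and `typicality` is a hypothetical helper, not a model taken from the lecture.

```python
# Hypothetical characteristic features of the category "bird".
BIRD_FEATURES = {"feathers", "wings", "flies", "twitters", "small"}

# Hypothetical feature sets for candidate members.
ANIMALS = {
    "robin": {"feathers", "wings", "flies", "twitters", "small"},
    "chicken": {"feathers", "wings", "small"},
    "emu": {"feathers", "wings"},
    "dog": {"barks", "fur"},
}

def typicality(features, category=BIRD_FEATURES):
    """Fraction of the category's characteristic features that a member has."""
    return len(features & category) / len(category)

# Members with more of the characteristic features rank as more central.
ranked = sorted(ANIMALS, key=lambda name: typicality(ANIMALS[name]), reverse=True)
print(ranked)
```

No single feature is required for membership here; a member simply scores higher the more features it shares.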
Three Psychologically Primary Levels
SUPERORDINATE: animal, furniture
BASIC LEVEL: dog, chair
SUBORDINATE: terrier, rocker
Children take longer to learn superordinate
Superordinate not associated with mental
images or motor actions
How related to
– Hyponymy
– Hyperonymy
Characteristics of Basic-level
Categories
Language
– People name things more readily at basic level.
– Name learned earliest in childhood.
– Languages have simpler names at basic level.
– Sounds like the “real name”.
– Name used more frequently.
» Strange to call a dime a coin, a metal object
– Names used in neutral context.
» There’s a dog on the porch.
» There’s a terrier on the porch.
Characteristics of Basic-level
Categories
Concepts
– Things perceived more holistically at the basic level
(rather than by parts).
– People interact with basic and more specific levels
similarly.
– Things are remembered more readily at basic level.
– Folk biological categories correspond accurately to
scientific biological categories only at the basic
level.
Metadata
Metadata Topics



What is metadata?
Controlled vocabularies / indexing languages
Metadata standards
– Dublin Core
– XML
– etc


Thesaurus creation and use
Classification structure
– Descriptors vs subject headings
– Hierarchies vs facets
Metadata

Metadata is:
– “data about data” (the term as used in database
systems)
– Information about Information
– Structures and Languages for the Description of
Information Resources and their elements
(components or features)
– “Metadata is information on the organization of
the data, the various data domains, and the
relationship between them” (Baeza-Yates p. 142)
Type of Metadata systems and
standards







Naming and ID systems – URLs, ISBNs
Bibliographic description – MARC, Dublin Core, TEI, etc.
Music – SMDL
Images and objects – CIMI, VRA Core Categories
Numeric Data – DDI, SDSM
Geospatial Data – FGDC
Collections – EAD
Types of Indexing Languages
Uncontrolled Keyword Indexing
 Indexing Languages

– Controlled, but not structured

Thesauri
– Controlled and Structured

Classification Systems
– Controlled, Structured, and Coded

Faceted Classification Systems
Controlled Vocabularies

Vocabulary control is the attempt to
provide a standardized and consistent
set of terms (such as subject
headings, names, classifications, etc.)
with the intent of aiding the searcher
in finding information.
What is a “Controlled
Vocabulary”



“The greatest problem of today is how to teach
people to ignore the irrelevant, how to refuse to
know things, before they are suffocated. For too
many facts are as bad as none at all.” (W.H.
Auden)
Similarly, there are too many ways of expressing
or explaining the topic of a document.
Controlled vocabularies are sets of Rules for topic
identification and indexing, and a THESAURUS,
which consists of “lead-in vocabulary” and a
limited and selective “Indexing Language”
sometimes with special coding or structures.
Uses of Controlled
Vocabularies




Library Subject Headings, Classification
and Authority Files.
Commercial Journal Indexing Services and
databases
Yahoo, and other Web classification
schemes
Online and Manual Systems within
organizations
– SunSolve
– MacArthur
Indexing Languages


An index is a systematic guide designed to
indicate topics or features of documents in
order to facilitate retrieval of documents
or parts of documents.
An Indexing language is the set of terms
used in an index to represent topics or
features of documents, and the rules for
combining or using those terms.
The Indexing Process
Concept identification
 term selection (via thesaurus)
 term assignment

Application: The Indexing
Process (Manual)
[Flowchart: the manual indexing process. Adapted from ISO 5963, p.5]
Steps shown: Start; Examine Document and Identify Significant Concepts; Consider First Concept; Select Preferred Term; Consider Preferred Term; Consider Each of These Terms; Consider any associated terms in Thesaurus (NT, BT); Select Alternative term to represent Concept; Establish Term Denoting Concept; Admit New Term Into Thesaurus; Assign Terms to Document; End.
Decision points (YES/NO): Does Thesaurus contain term for Concept? Preferred Term? Can Concept be expressed combining terms? Would Concept be better represented by one of these terms? Prefer Alternative Term(s)? Is Term suitable? Is There Another Concept?
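A simplified sketch of the manual indexing process may help fix the idea. It covers only three of the flowchart's decisions (the thesaurus contains a term, the concept can be expressed by combining terms, a new term must be established) and skips the associated-terms (NT, BT) check. The thesaurus contents and the function name `index_document` are assumptions for illustration.

```python
# Hypothetical thesaurus: entry term -> preferred term (an entry maps to
# itself when it is already the preferred term).
THESAURUS = {
    "prisons": "prisons",
    "jails": "prisons",
    "libraries": "libraries",
}

def index_document(concepts, thesaurus):
    """Return (assigned terms, new terms proposed for the thesaurus)."""
    assigned, proposed = [], []
    for concept in concepts:
        if concept in thesaurus:
            # The thesaurus has a term: use its preferred form.
            terms = [thesaurus[concept]]
        else:
            parts = concept.split()
            if len(parts) > 1 and all(p in thesaurus for p in parts):
                # Express the concept by combining existing terms.
                terms = [thesaurus[p] for p in parts]
            else:
                # No suitable term: establish a new one.
                proposed.append(concept)
                terms = [concept]
        for term in terms:
            if term not in assigned:
                assigned.append(term)
    return assigned, proposed

terms, new_terms = index_document(
    ["jails", "prisons libraries", "metadata"], THESAURUS
)
print(terms, new_terms)
```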
Metadata Standards
The problem

Proliferation of the forms of names
– Different names for the same person
– Different people with the same names
Bibliographic Description
MARC (Machine Readable Cataloging)
 DUBLIN CORE

– Warwick Framework for Dublin Core
Metadata
GILS (Government Information
Locator Service)
 RFC 1807 (Format for Bibliographic
Records)
 RDF (Resource Description Framework)

Images and Objects
Categories for the Description of
Works of Art (Getty Art Institute)
 Consortium for the Computer
Interchange of Museum Information
(CIMI)
 RLG REACH Element Set (for Shared
Description of Museum Objects)
 VRA Core Categories (Visual
Resources Association)

Collection Level Descriptors
EAD (Encoded Archival Description)
 Z39.50 Profile for Access to Digital
Collections
 RSLP Collection Description
(Research Support Libraries
Programme)

Dublin Core
Simple metadata for describing
internet resources.
 For “Document-Like Objects”
 15 Elements.

Dublin Core Elements








Title
Creator
Subject
Description
Publisher
Other Contributors
Date
Resource Type







Format
Resource
Identifier
Source
Language
Relation
Coverage
Rights Management
Source
Label: SOURCE
 The work, either print or electronic,
from which this resource is derived,
if applicable. For example, an html
encoding of a Shakespearean sonnet
might identify the paper version of
the sonnet from which the electronic
version was transcribed.

The Same Item in Different
Metadata Systems
ISBD
 Dublin Core
 RFC 1807
 TEI Header
 MARC Record

ISBD Punctuation

Title Proper (GMD) = Parallel title : other
title info / First statement of
responsibility ; others. -- Edition
information. -- Material. -- Place of
Publication : Publisher Name, Date. -- Material designation and extent ;
Dimensions of item. -- (Title of Series /
Statement of responsibility). -- Notes. -- Standard numbers: terms of availability
(qualifications).
Bibliographic Record

Introduction to cataloging and
classification / Bohdan S. Wynar. --
8th ed. / Arlene G. Taylor. --
Englewood, Colo. : Libraries Unlimited,
1992. -- (Library science text series).
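As a rough sketch, the ISBD pattern can be applied mechanically to produce this record. The function below handles only the areas this example uses (title, edition, publication, series); its name and parameters are invented for illustration and it is not a full ISBD implementation.

```python
def isbd_description(title, responsibility, edition, edition_responsibility,
                     place, publisher, date, series):
    """Join bibliographic areas with ISBD punctuation (simplified subset)."""
    areas = [
        f"{title} / {responsibility}",
        f"{edition} / {edition_responsibility}",
        f"{place} : {publisher}, {date}",
        f"({series})",
    ]
    # Areas are separated by ". -- " and the description ends with a full stop.
    return ". -- ".join(areas) + "."

record = isbd_description(
    title="Introduction to cataloging and classification",
    responsibility="Bohdan S. Wynar",
    edition="8th ed.",
    edition_responsibility="Arlene G. Taylor",
    place="Englewood, Colo.",
    publisher="Libraries Unlimited",
    date="1992",
    series="Library science text series",
)
print(record)
```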
MARC Record (display)























ID:DCLC9124851-B
RTYP:c
ST:p FRN: MS:c EL: AD:06-20-91
CC:9110 BLT:am
DCF:a CSC:
MOD:
SNR:
ATC: UD:04-11-92
CP:cou
L:eng
INT:
GPC:
BIO:
FIC:0
CON:b
PC:s
PD:1992/
REP:
CPI:0 FSI:0
ILC:a
II:1
MMD:
OR:
POL:
DM:
RR:
COL:
EML:
GEN: BSE:
010 9124851
020 0872878112 (cloth)
020 0872879674 (paper)
040 DLC$cDLC$dDLC
050 00 Z693$b.W94 1991
082 00 025.3$220
100 1 Wynar, Bohdan S.
245 10 Introduction to cataloging and classification /$cBohdan S. Wynar.
250 8th ed. /$bArlene G. Taylor.
260 Englewood, Colo. :$bLibraries Unlimited,$c1992.
300 xvii, 633 p. :$bill. ;$c24 cm.
440 0 Library science text series
504 Includes bibliographical references (p. 591-599) and index.
650 0 Cataloging.
650 0 Subject cataloging.
650 0 Classification$xBooks.
630 00 Anglo-American cataloguing rules.
700 10 Taylor, Arlene G.,$d1941-
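In the display above, "$" followed by one character marks the start of a subfield. A minimal sketch of splitting one field's data into subfields follows. It assumes the display convention used on this slide, where text before the first "$" is subfield a, and `parse_subfields` is a hypothetical helper rather than part of any MARC library.

```python
def parse_subfields(data):
    """Split MARC field data such as 'A /$cB' into (code, value) pairs.

    '$' followed by one character starts a subfield; text before the
    first '$' is treated as subfield 'a'.
    """
    first, *rest = data.split("$")
    subfields = [("a", first)] if first else []
    for chunk in rest:
        if chunk:  # skip an empty chunk left by a trailing '$'
            subfields.append((chunk[0], chunk[1:]))
    return subfields

title = parse_subfields(
    "Introduction to cataloging and classification /$cBohdan S. Wynar."
)
imprint = parse_subfields("Englewood, Colo. :$bLibraries Unlimited,$c1992.")
print(title)
print(imprint)
```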
Conditions of Authorship?


Single person or single corporate entity
Unknown or anonymous authors
– Fictitiously ascribed works




Shared responsibility
Collections or editorially assembled works
Works of mixed responsibility (e.g.
translations)
Related Works
Name Authority Files
ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91
Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
Different names for the
same person
Name Authority Files
ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
Name authority files
ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91
Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)
Different people writing with the same name
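A small lookup table is enough to sketch how authority records handle both problems. The data is a subset of the records shown above with dates dropped for brevity; the dictionary layout and the function name `resolve` are assumptions for illustration, not how a real authority system stores its data.

```python
# Authorized headings and a few of their "see from" variants (4XX fields),
# taken from the records above with dates omitted.
SEE_FROM = {
    "Creasey, John": ["Cooke, M. E.", "Gill, Patrick", "Hope, Brian"],
    "Butler, William Vivian": ["Butler, W. V.", "Marric, J. J."],
}
# "See also" links (5XX fields) between separately established headings.
SEE_ALSO = {
    "Marric, J. J.": ["Creasey, John"],
}

def resolve(name):
    """Return the sorted authorized headings a searcher should be led to."""
    headings = {
        heading
        for heading, variants in SEE_FROM.items()
        if name == heading or name in variants
    }
    headings.update(SEE_ALSO.get(name, []))
    return sorted(headings)

print(resolve("Gill, Patrick"))   # one person, many names
print(resolve("Marric, J. J."))   # one name, two people
```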
Other Types of Controlled
Vocabularies
Gazetteers (Geographic Names)
 Code lists (e.g. LC Language Codes)
 Subject Heading Lists
 Classification Schemes
 Thesauri

What is SGML/XML?

A. SGML stands for Standard
Generalized Markup Language
– XML stands for eXtensible Markup
Language

B. What it is NOT:
– Not a visual document description
– Not an application specific markup
– Not proprietary
What is SGML/XML?

What it is:
– An international standard (SGML- ISO
8879:1986)
– A generic language for describing the structure
of documents, and markup that can be used for
those documents
– Intended for generating markup for content
rather than form elements

XML is a simplified subset of SGML (being
established by W3C)
XML


Extensible Markup Language
– a simplification of SGML, the Standard
Generalized Markup Language
– instead of a fixed set of format-oriented tags
like HTML, XML allows you to create the
schema -- whatever set of tags are needed --
for your information type or application
– this makes any XML instance “self-describing”
and easily understood by computers and
people
Version 1.0 ratified by W3C in 2/98; backed by
Microsoft, Sun, Netscape, many others
Source Dr. Robert J Glushko
HTML Airline Schedule Seen
“By Computer”
<Title>Airline Schedule</Title>
<Body>
<H2>Flight Information</H2>
<H3>United Airlines #200</H3>
<UL><LI>San Francisco
<LI>9:30 AM
<LI>Honolulu
<LI>12:30 PM
<LI>$368.50
</UL></Body>
Source Dr. Robert J Glushko
Airline Schedule in XML
<TransportSchedule Type="Airline">
<Segment Id="United Airlines #200">
<Origin>San Francisco</Origin>
<DepartTime>9:30 AM</DepartTime>
<Destination>Honolulu</Destination>
<ArriveTime>12:30 PM</ArriveTime>
<Price Currency="USD">368.50</Price>
</Segment>
</TransportSchedule>
Source Dr. Robert J Glushko
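To see what “self-describing” buys a program, here is a short sketch that reads the XML schedule with Python's standard `xml.etree.ElementTree` module. The dictionary keys are choices made for this example; the element and attribute names come from the slide.

```python
import xml.etree.ElementTree as ET

SCHEDULE = """
<TransportSchedule Type="Airline">
  <Segment Id="United Airlines #200">
    <Origin>San Francisco</Origin>
    <DepartTime>9:30 AM</DepartTime>
    <Destination>Honolulu</Destination>
    <ArriveTime>12:30 PM</ArriveTime>
    <Price Currency="USD">368.50</Price>
  </Segment>
</TransportSchedule>
"""

root = ET.fromstring(SCHEDULE)
segment = root.find("Segment")
price = segment.find("Price")

# Each value is found by name, not by its position in a list.
flight = {
    "id": segment.get("Id"),
    "origin": segment.findtext("Origin"),
    "destination": segment.findtext("Destination"),
    "price": float(price.text),
    "currency": price.get("Currency"),
}
print(flight)
```

With the HTML version, a program would have to guess that the third list item is the destination and the fifth is the price.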
SGML/XML Structure

An SGML document consists of three
parts:
– The SGML Declaration
– The Document Type Definition (DTD)
– The Document Instance

An XML document requires only the
document instance, but for effective
processing a DTD is important.
Document Type Definitions

The DTD describes the structural
elements and "shorthand" markup for a
particular document type. It defines:
– Names of "legal" elements
– How many times elements can appear
– The order of elements in a document
– Whether markup can be omitted (SGML only)
– Contents of elements (i.e., nested structures)
– Attributes associated with elements
– Names of "entities"
– Short-hand conventions for element tags (SGML only)
DTD Components
The major components of a DTD are:
– Entity Declarations
– Element Declarations
– Attribute Declarations
Thesauri

A Thesaurus is a collection of
selected vocabulary (preferred terms
or descriptors) with links among
Synonymous, Equivalent, Broader,
Narrower and other Related Terms
Thesauri (cont.)

Examples:
– The ERIC Thesaurus of Descriptors
– The Art and Architecture Thesaurus
– The Medical Subject Headings (MESH)
of the National Library of Medicine
Why develop a thesaurus?

To provide a conceptual structure or
“space” for a body of information
– To make it possible to adequately
describe the topical contents of
informational objects at an appropriate
level of generality or specificity
– To provide enhanced search capabilities
and to improve the effectiveness of
searching (i.e., to retrieve most of the
relevant material without too much
irrelevant material).
Why develop a thesaurus?

To provide vocabulary (or
terminological) control.
– When there are several possible terms
designating a single concept, the
thesaurus should lead the indexer or
searcher to the appropriate concept,
regardless of the terms they start with.
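Vocabulary control can be sketched as a small data structure in which lead-in terms point to a preferred term and preferred terms carry broader, narrower and related links. The entries and the helper names `preferred` and `expand` are assumptions made up for illustration.

```python
# Hypothetical thesaurus fragment. USE points a lead-in term to its
# preferred term; BT/NT/RT are broader, narrower and related terms.
THESAURUS = {
    "correctional institutions": {"NT": ["prisons"], "RT": ["criminal justice"]},
    "prisons": {"BT": ["correctional institutions"]},
    "jails": {"USE": "prisons"},
    "penitentiaries": {"USE": "prisons"},
    "criminal justice": {"RT": ["correctional institutions"]},
}

def preferred(term):
    """Follow a USE reference, if any, to the preferred term."""
    return THESAURUS[term].get("USE", term)

def expand(term):
    """Preferred term plus its narrower terms, for a broader search."""
    root = preferred(term)
    return [root] + THESAURUS[root].get("NT", [])

print(preferred("jails"))
print(expand("correctional institutions"))
```

Whichever synonym the indexer or searcher starts with, both end up at the same preferred term.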
Preliminary considerations




What is used now?
– Continue using an existing thesaurus?
– Ad hoc modification of existing thesaurus?
– Develop a new well-structured thesaurus?
What is the scope and complexity of the
subject field?
What kind of retrieval objects or data will
be dealt with?
How exhaustive and specific is the desired
description of objects?
Preliminary Considerations

The scope and complexity of the field will
provide some indication of the scope and
complexity of the thesaurus.
– It is better to plan for a larger and more
comprehensive system than a smaller system
that rapidly will become inadequate as the
database grows.

Development of a good thesaurus requires
a major intellectual effort as well as
clerical operations like data entry and
production of sorted lists.
Development of a Thesaurus
Term Selection.
 Merging and Development of Concept
Classes.
 Definition of Broad Subject Fields
and Subfields.
 Development of Classificatory
structure
 Review, Testing, Application, Revision.

Flow of Work in Thesaurus
Construction
[Flowchart: flow of work in thesaurus construction. Based on Soergel, pp 327-333]
Boxes shown, roughly in flow order: Select Sources; Select Terms; Record Selected Terms; Sort Terms; Merge identical Terms; Merge Terms in Same Concept class; Define Broad Subject Fields; Sort Terms into Broad Subject Fields; Define Subfields within one Subject Field; Work out detailed structure of the Subject Field; Select Preferred Terms; Improve Class Structure; Assign codes; Print Classified Index and review; Discuss with Experts and Users; Select descriptors and checklist items; Assign Notation; Produce Full Thesaurus and Check references; Review and Test; Revise as needed.
Decision points (Yes/No): All Subfields of Broad Subject finished? All Broad Subjects finished? Many Modifications?
2. Merging and Development
of Concept Classes


Sort Term DB into
alphabetical order.
First Round: Merge
information for
identical terms -- possibly pulling info
from additional
sources.

Second Round:
Merge synonyms or
terms in the same
concept class.
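The two merging rounds can be sketched as follows. The term records, the synonym decisions and the function names are all hypothetical; in practice the second round depends on intellectual judgments that a program can only record, not make.

```python
from collections import defaultdict

# Hypothetical term records gathered from several sources: (term, source).
RECORDS = [
    ("Prisons", "source A"),
    ("jails", "source B"),
    ("prisons", "source B"),
    ("Libraries", "source A"),
]
# Hypothetical synonym decisions made by the thesaurus builder.
CONCEPT_CLASS = {"jails": "prisons"}

def merge_identical(records):
    """First round: sort alphabetically, then merge identical terms."""
    merged = defaultdict(set)
    for term, source in sorted(records, key=lambda r: r[0].lower()):
        merged[term.lower()].add(source)
    return dict(merged)

def merge_synonyms(merged, concept_class):
    """Second round: fold synonyms into one entry per concept class."""
    classes = defaultdict(lambda: {"terms": set(), "sources": set()})
    for term, sources in merged.items():
        key = concept_class.get(term, term)
        classes[key]["terms"].add(term)
        classes[key]["sources"] |= sources
    return dict(classes)

first = merge_identical(RECORDS)
second = merge_synonyms(first, CONCEPT_CLASS)
print(sorted(first))
print(sorted(second))
```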
3. Definition of Broad Subject
Fields and Subfields


Define Broad
Subject fields and
sort terms into
these broad fields
Define subfields
within each broad
field and sort
terms into these
subfields.

Work out the
detailed structure
– Select Preferred Terms
– Merge information for
terms in the same concept
class

Repeat these steps
– for each subfield within a
broad field
– and for each broad field
– Until all terms have been
consolidated and
preferred terms selected
4. Development of
Classificatory Structure


Produce preliminary
version of
classified index and
update the working
database.
Improve
classificatory
structure

Reality check:
produce and
distribute a version
of the classified
index. Distribute
to users/experts.
5. Final Stages
 Review
 Testing
 Application
 Revision
Thesaurus Revision and
Updates

There will always be new concepts,
products, or expressions that need to
be added to the thesaurus.
– Set a regular schedule of reviews and
revisions.
– Collect complaints, problems, etc. and
fold into revision of the thesaurus
Hierarchical vs. Faceted
(Subject Heading vs. Descriptor)
Category Systems
Assigning
Headings vs. Descriptors

Subject headings
– assign one (or a
few) complex
heading(s) to the
document

Descriptors
– Mix and match
How would we describe recipes using
each technique?
Subject Heading
WILSONLINE
vs.
– Athletes
– Athletes -- Health & Hygiene
– Athletes -- Nutrition
– Athletes -- Physical Exams
– …
– Athletics
– Athletics -- Administration
– Athletics -- Equipment -- Catalogs
– …
– Sports -- Accidents and injuries
– Sports -- Accidents and injuries -- prevention
Descriptor
ERIC
– Athletes
– Athletic Coaches
– Athletic Equipment
– Athletic Fields
– Athletics
– …
– Sports psychology
– Sportsmanship
Subject Headings vs. Descriptors


Subject headings
– Describe the contents of an entire document
– Designed to be looked up in an alphabetical index
» Look up document under its heading
– Few (1-5) headings per document

Descriptors
– Describe one concept within a document
– Designed to be used in Boolean searching
» Combine to describe the desired document
– Many (5-25) descriptors per document
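The “mix and match” use of descriptors in Boolean searching can be sketched with sets. The documents and their descriptors are invented for illustration, as are the function names `search_all` and `search_any`.

```python
# Hypothetical documents indexed with ERIC-style descriptors.
DOCS = {
    "doc1": {"Athletes", "Nutrition"},
    "doc2": {"Athletes", "Athletic Equipment"},
    "doc3": {"Nutrition", "Children"},
}

def search_all(*descriptors):
    """Boolean AND: documents that carry every given descriptor."""
    wanted = set(descriptors)
    return sorted(doc for doc, terms in DOCS.items() if wanted <= terms)

def search_any(*descriptors):
    """Boolean OR: documents that carry at least one given descriptor."""
    wanted = set(descriptors)
    return sorted(doc for doc, terms in DOCS.items() if wanted & terms)

print(search_all("Athletes", "Nutrition"))
print(search_any("Athletes", "Nutrition"))
```

With subject headings the combination is made in advance by the indexer (for example a single heading for athletes and nutrition); with descriptors the searcher combines single concepts at query time.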
Hierarchical Classification
– Each category is successively broken
down into smaller and smaller
subdivisions
– No item occurs in more than one
subdivision
– Each level divided out by a “character of
division”. Also known as a feature.
» Example: distinguish Literature based on:



Language
Genre
Time Period
Hierarchical Classification
Literature
English
French
Spanish
...
... Prose Poetry Drama ... Prose Poetry Drama ...
...
16th 17th 18th 19th
16th 17th 18th 19th
Labeled Categories for
Hierarchical Classification

LITERATURE
– 100 English Literature
» 110 English Prose
111 English Prose 16th Century
112 English Prose 17th Century
113 English Prose 18th Century
...
» 120 English Poetry
121 English Poetry 16th Century
122 English Poetry 17th Century
...
» 130 English Drama
131 English Drama 16th Century
…
– 200 French Literature
Faceted Classification
Create a separate, free-standing list
for each characteristic of division
(feature).
 Combine features to create a
classification.

Faceted Classification along
with Labeled Categories

A Language
– a English
– b French
– c Spanish



B Genre
– a Prose
– b Poetry
– c Drama


C Period
– a 16th Century
– b 17th Century
– c 18th Century
– d 19th Century


Aa English Literature
AaBa English Prose
AaBaCa English Prose 16th Century
AbBbCd French Poetry 19th Century
BcCd Drama 19th Century
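Because each facet is a free-standing list, building and reading a notation is mechanical. The sketch below uses the three facet lists from this slide; the function names `decode` and `encode` and the two-characters-per-facet parsing rule are assumptions for illustration.

```python
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}

def decode(notation):
    """Turn a notation such as 'AbBbCd' into {facet name: value}."""
    result = {}
    for i in range(0, len(notation), 2):
        facet, value = notation[i], notation[i + 1]
        name, values = FACETS[facet]
        result[name] = values[value]
    return result

def encode(**chosen):
    """Build a notation from facet values, e.g. encode(Genre='Drama')."""
    code = ""
    for facet, (name, values) in FACETS.items():
        if name in chosen:
            by_label = {label: key for key, label in values.items()}
            code += facet + by_label[chosen[name]]
    return code

print(decode("AbBbCd"))
print(encode(Genre="Drama", Period="19th Century"))
```

Any subset of facets can be combined, which a strict hierarchy cannot do without enumerating every combination in advance.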
Important Question:
How to use both types of
classification structures?
How to look through them?
 How to use them in search?

Design of Information
Architecture
Web Site Design Issues
Iteration is the Key to UI Design
Design
Evaluate
Prototype
Iteration earlier in the design
process is more cost-effective
Design Process: Discovery
Discovery
Conceptualization
Preliminary Design
Design
Implementation
Assess needs
– understand client’s
expectations
– determine scope of
project
– characteristics of
users
Design Process: Conceptualization
Begin defining site
Discovery
Conceptualization
Preliminary Design
– Take results from
discovery and
visualize solutions
– Early information
design
Design
Implementation
Slide by Mark Newman
Design Process: Preliminary Design
Generate multiple (3-5) designs
Discovery
Conceptualization
Preliminary Design
– one will be selected
for development
– navigation design
– early graphic design
Design
Implementation
Slide by Mark Newman
Design Process: Preliminary Design

Activities
– Sketching designs
– Creating mock-ups
– Quick and rough

Deliverables
– Schematics (a.k.a. templates)
– Site maps
– Mock-ups
– Presentations
Slide by Mark Newman
Design Process: Design
Iteration
Discovery
Conceptualization
Design
Evaluate
Preliminary Design
Design
Implementation
Prototype
• iteration at the level of
development process
• And within design stage
Slide by Mark Newman
Design Process: Implementation

Discovery
Conceptualization
Preliminary Design
Design
Prepare design for
handoff
– Create final
deliverable
– Specifications and
prototypes
– As much detail as
possible
Implementation
Slide by Mark Newman
Why Do We Prototype?

Get feedback on our design faster
– saves money
Experiment with alternative designs
 Fix problems before code is written
 Keep the design centered on the user

Fidelity in Prototyping
Fidelity refers to the level of detail
High fidelity
– prototypes look like the final product
Low fidelity
– artist's renditions with many details
missing
Slide by James Landay
Low-fidelity Sketches
Slide by James Landay
Low-fidelity Sketches
Slide by James Landay
Database Systems
Terms and Concepts

Database:
– A collection of similar records with
relationships between the records.
(Rowley)
– A Database is a collection of stored
operational data used by the application
systems of some particular enterprise.
(C.J. Date)
DBMS Benefits
Minimal Data Redundancy
Consistency of Data
Integration of Data
Sharing of Data
Ease of Application Development
Uniform Security, Privacy, and
Integrity Controls
 Data Accessibility and
Responsiveness
 Data Independence
 Reduced Program Maintenance






Database Components
DBMS
===============
Design tools
Database
Database contains:
User’s Data
Metadata
Indexes
Application Metadata
Table Creation
Form Creation
Query Creation
Report Creation
Procedural
language
compiler (4GL)
=============
Run time
Form processor
Query processor
Report Writer
Language Run time
Application
Programs
User
Interface
Applications
Kroenke, Database
Processing
Terms and Concepts

Records
– The set of values for all attributes of a
particular entity
– AKA “tuples” or “rows” in relational
DBMS

File
– Collection of records
– Usually a physical file on OS
– May also be a “logical file” like a
“Relation” or “Table” in relational DBMS
Terms and Concepts

Key
– an attribute or set of attributes used to
identify or locate records in a file

Primary Key
– an attribute or set of attributes that
uniquely identifies each record in a file
Terms and Concepts

Data Independence
– Physical representation and location of data and
the use of that data are separated
» The application doesn’t need to know how or where
the database has stored the data, but just how to
ask for it.
» Moving a database from one DBMS to another
should not have a material effect on application
program
» Recoding, adding fields, etc. in the database
should not affect applications
Terms and Concepts

Metadata
– Data about data
» In DBMS means all of the characteristics
describing the attributes of an entity, E.G.:




name of attribute
data type of attribute
size of the attribute
format or special characteristics
– Characteristics of files or relations
» name, content, notes, etc.
Design
Determination of the needs of the
organization
 Development of the Conceptual Model
of the database

– Typically using Entity-Relationship
diagramming techniques
Construction of a Data Dictionary
 Development of the Logical Model

Entity

An Entity is an object in the real
world (or even imaginary worlds)
about which we want or need to
maintain information
– Persons (e.g.: customers in a business,
employees, authors)
– Things (e.g.: purchase orders, meetings,
parts, companies)
Employee
Attributes

Attributes are the significant
properties or characteristics of an
entity that help identify it and
provide the information needed to
interact with it or use it. (This is the
Metadata for the entities.)
[Diagram: Employee entity with attributes Birthdate, Age, Name (First, Middle, Last), SSN, Projects]
Relationships

Relationships are the associations
between entities. They can involve
one or more entities and belong to
particular relationship types
Relationships
[Diagrams: Student -- Attends -- Class; Supplier, Project and Part joined by "Supplies project parts"]
Mapping to a Relational Model




Each entity in the ER Diagram becomes a
relation.
A properly normalized ER diagram will
indicate where intersection relations for
many-to-many mappings are needed.
Relationships are indicated by common
columns (or domains) in tables that are
related.
We will examine the tables for the Acme
Widget Company derived from the ER
diagram
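As a minimal sketch of the mapping, the Student/Attends/Class relationship from the earlier slide can be turned into three tables using Python's built-in `sqlite3` module. The column names and sample rows are assumptions for illustration; they are not the Acme Widget Company tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE class (class_id INTEGER PRIMARY KEY, title TEXT);
-- Intersection relation for the many-to-many "Attends" relationship.
CREATE TABLE attends (
    student_id INTEGER REFERENCES student(student_id),
    class_id INTEGER REFERENCES class(class_id),
    PRIMARY KEY (student_id, class_id)
);
INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO class VALUES (10, 'SIMS 202');
INSERT INTO attends VALUES (1, 10), (2, 10);
""")

# The relationship is followed through the common columns.
rows = conn.execute("""
    SELECT s.name, c.title
    FROM student s
    JOIN attends a ON a.student_id = s.student_id
    JOIN class c ON c.class_id = a.class_id
    ORDER BY s.name
""").fetchall()
print(rows)
```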
Normalization


Normalization theory is based on the
observation that relations with certain
properties are more effective in inserting,
updating and deleting data than other sets
of relations containing the same data
Normalization is a multi-step process
beginning with an “unnormalized” relation
– Hospital example from Atre, S. Data Base: Structured Techniques for
Design, Performance, and Management.
Normalization
First normal form: functional dependency of nonkey attributes on the primary key; atomic values only
Second normal form: full functional dependency of nonkey attributes on the primary key
Third normal form: no transitive dependency between nonkey attributes
Boyce-Codd and higher: all determinants are candidate keys; single multivalued dependency
Relational Algebra Operations
Select
 Project
 Product
 Union
 Intersect
 Difference
 Join
 Divide
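Three of these operations can be sketched over relations held as lists of dictionaries. This is an illustration of what the operators do, not how a DBMS implements them; the sample relations and column names are made up.

```python
def select(relation, predicate):
    """Select: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, columns):
    """Project: keep only the named columns, dropping duplicate rows."""
    result = []
    for row in relation:
        reduced = {c: row[c] for c in columns}
        if reduced not in result:
            result.append(reduced)
    return result

def join(left, right, column):
    """Join: combine rows that agree on the shared column."""
    return [{**a, **b} for a in left for b in right if a[column] == b[column]]

SUPPLIERS = [{"sno": 1, "city": "Oakland"}, {"sno": 2, "city": "Berkeley"}]
SUPPLIES = [
    {"sno": 1, "part": "bolt"},
    {"sno": 1, "part": "nut"},
    {"sno": 2, "part": "bolt"},
]

# Cities of suppliers that supply bolts: join, then select, then project.
bolt_cities = project(
    select(join(SUPPLIERS, SUPPLIES, "sno"), lambda row: row["part"] == "bolt"),
    ["city"],
)
print(bolt_cities)
```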

Effectiveness and Efficiency
Issues for DBMS
Focus on the relational model
 Any column in a relational database
can be searched for values.
 To improve efficiency indexes using
storage structures such as BTrees
and Hashing are used
 But many useful functions are not
indexable and require complete scans
of the database

Advantages of RDBMS
Possible to design complex data
storage and retrieval systems with
ease (and without conventional
programming).
 Support for ACID transactions

– Atomic
– Consistent
– Isolated
– Durable
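Atomicity can be seen in a short sketch with Python's built-in `sqlite3` module: either both updates of a transfer are applied or neither is. The table, the CHECK constraint and the function name `transfer` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('a', 100), ('b', 0)")
conn.commit()

def transfer(amount):
    """Move `amount` from account a to account b as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = 'b'",
                (amount,),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = 'a'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(60)
failed = transfer(60)  # would overdraw account a, so nothing is applied
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to account b succeeds first, but it is undone when the debit violates the constraint.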
Advantages of RDBMS
Support for very large databases
 Automatic optimization of searching
(when possible)
 RDBMS have a simple view of the
database that conforms to much of
the data used in businesses.
 Standard query language (SQL)

Disadvantages of RDBMS



Until recently, no support for complex
objects such as documents, video, images,
spatial or time-series data. (ORDBMS are
adding support for these.)
Often poor support for storage of complex
objects. (Disassembling the car to park it
in the garage)
Still no efficient and effective integrated
support for things like text searching
within fields.
Study hard, and good luck!
Thank you for all the great work!