LIS 397.1 Introduction to Research in Library and

Download Report

Transcript LIS 397.1 Introduction to Research in Library and

LIS 386.13
Information Technologies
and the
Information Professions
Introduction to Databases
R. E. Wyllys
Copyright © 2000 by R. E. Wyllys
Last revised 2003 Jun 25
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Lesson Objectives
• You will
– Understand the distinctions among flat-file
databases (DBs), network DBs, hierarchical DBs,
relational DBs, and text-oriented DBs
– Become aware that relational DBs have a
characteristic known as the level of normalization
and that the secret of attaining a satisfactory level
of normalization in a DB lies in making its tables
simple.
– Become aware that modern relational DBs use a
standardized language for commands, known as
Structured Query Language (SQL)
– Understand that multiuser databases face
problems of security and privacy
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
The world is
full of files
of data and
information,
which
databases
can help
you find
and
manage.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Examples of Databases
• What does the word "database" mean?
– Nowadays we usually think it means a computer-stored set
of information
– However, databases can exist in many forms. Examples:
•
•
•
•
•
•
•
•
•
•
Electronic data: text, visual images, audio images, numbers
Sheets of paper in folders in a vertical file
A book (think of it as a collection of sentences and illustrations)
Books in a collection (e.g., a library)
Sets of 3"x5" cards containing notes
Blueprints
Maps and other forms of geographic information systems
Core samples from oil wells
Blood samples in a medical laboratory
DNA samples in a forensic laboratory
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Examples of Databases
• What do these examples have in common?
– Sets of data and information composed of, and/or
represented by: bits; or alphanumeric symbols; or
lines and shapes in drawings, pictures, and maps;
or audio recordings; or video recordings; or realia
(i.e., actual substances)
– At least one means by which the sets of data and
information are organized in order to facilitate
access to individual desired sets
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Examples of Databases
• Consider the provisions for access to individual
pieces of information in the following examples:
– Phone book. Contains a collection of several
independent (discrete) databases, each consisting of
names together with corresponding phone numbers:
• White-pages personal listings, arranged alphabetically by
surname and within surname by first names
• White-pages corporate listings, arranged alphabetically
• Blue-pages governmental listings: primary arrangement
alphabetical by type of government (city, county, state, federal),
secondary arrangement alphabetical by agency within type of
government, tertiary arrangement alphabetical by office within
agency
• Yellow-pages listings: primary arrangement by type of
business, secondary arrangement alphabetically by company
within type of business, plus various special groupings (e.g.,
restaurants by ethnic type)
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Examples of Databases
• Provisions for access to individual pieces of
information, cont'd:
– Organizational membership directory, usually
consisting of names together with corresponding postal
addresses, telephone numbers, and email addresses
• Typically contains listings by surname and first names, plus
groupings by regions (e.g., states, countries) and by
membership in special interest groups (SIGs), arranged
alphabetically by name within regional groups and SIGs
– Dictionary, consisting of words with corresponding
definitions, and in some cases, lists of synonyms
and/or antonyms
• Primary collection is individual words arranged alphabetically
• May contain separate sections (e.g., geographical names,
biographical names, abbreviations, proofreaders' marks)
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Examples of Databases
• Provisions for access to individual pieces of
information, cont'd:
– Thesaurus
• Primary arrangement by broad concepts (themes), with
subgroupings of sets of closely related words (often arranged by
type of speech, e.g., nouns, verbs), each set sharing a
subconcept of the primary concept; sometimes includes
antonyms of the primary concept and/or selected subconcepts
– Book (non-fiction)
• Table of contents
– Provides access to chapters (and sometimes to subchapters)
dealing with broad topics that are aspects of the overall subject(s) of
the book
• Index
– Organizes narrow concepts by names, terms, subterms, etc.
– Provides pointers from terms to relevant locations in text of
book
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Examples of Databases
• The foregoing examples illustrate some ways
of organizing information in DBs, whether
computerized or non-computerized: viz.,
– An intrinsic index provides information organized by
and with the entry or record; or
– A separate index can point to the location of the
information; or
– Records (i.e., basic packages of information) can
contain retrieval tags (access tags, labels, etc.) that
identify them and that can be searched for; or
– (Worst Case) Records can be sought via exhaustive
search (by humans or computer programs)
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Computerized Databases
• Consist of
– Bits, organized into bytes, which in turn are
organized into sequences or strings of bytes
– Fields: sets of bytes that represent information
– Records: sets of fields that are associated by
sharing relevance to some entity
– Files: sets of records sharing relevance to a
particular type of entity
• Databases typically consist of one or more
sets of related files
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Computerized Databases
• Of special interest are Relational
Databases (RDBs) and programs that
manage them, known as Relational
Database Management Systems
(RDBMSs).
– Note: The word "relational" is often omitted
nowadays, since almost all well known DBMSs
(e.g., IBM DB2, MS Access, MS SQL Server,
Oracle, Sybex) are RDBMSs.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Introduction to RDBs
• In discussing relational databases, we
use synonymously the words
– File, table, relation
– Record and row
– Field, column, attribute
• Note: Discussions of RDB theory tend to prefer the words
italicized above
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Introduction to RDBs
• Definition:
– A relational database is a set of one or more
tables that together embody information about a
set of related concepts and entities.
– If, as is usually the case, a relational database has
more than one table, the tables are connected
(related) in the following way:
• It is possible to move from any one table in the RDB to
any other table in the RDB via a chain of columns (i.e.,
fields, attributes) shared in pairwise fashion by
successive tables.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Introduction to RDBs
• The picture below shows 3 tables, with a total of
12 attributes (i.e., 12 distinct columns). The top
and middle tables share Attribute 3; the middle and
bottom tables share Attribute 7.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Database Applications
• The word "application" has a special technical
meaning in the field of database management.
• Definition:
– A database application is a combination of
• A relational database-management system (RDBMS)
• A relational database (RDB)
• Associated menus, data-entry forms, and report
forms
• Documentation (e.g., manuals) for the users.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Database Applications
• A database application is a package designed
to facilitate a particular real-world function (or
a set of related functions): e.g., looking up
books in a library catalog, or handling a sales
transaction in a store.
– Note: An application may include more than one
RDB, and/or it may include a "stray" table or two,
so long as such additions serve the basic function
and make the whole package more convenient for
humans to use.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Other Types of DBs
• Types of Databases (other than RDBs)
– Flat-file (spreadsheet)
– Hierarchical
– Network
– Text-oriented
– Geographic information systems
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Flat File DBs
• Flat-file DBs are like the DBs you can
construct in a spreadsheet, i.e., all the
information in the DB is in one file
consisting of one array of rows and
columns. For example:
SSN
123-45-6789
987-65-4321
567-89-0123
Surname
Doe
Fulano
Roe
First Name(s)
Jane Q.
Juan
Richard Rodney
Telephone Number
512-555-1234
210-543-9876
512-987-6431
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Flat File DBs
• Flat-file databases (spreadsheet style)
– Advantages
• Simple
• Suitable for small numbers of records with few
attributes
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Flat File DBs
• Flat-file databases (spreadsheet style)
– Disadvantages
• Likely to include repetitions of data
• Multi-valued attributes (e.g., multiple authors, multiple phone
numbers) require repetitions of accompanying data (see the
"Fulano" entries in the example below)
• Changes in data are difficult to implement
• Deletion and insertion anomalies are common
• Often lead to too much information in one table
SSN
123-45-6789
987-65-4321
987-65-4321
567-89-0123
Surname
Doe
Fulano
Fulano
Roe
First Name(s)
Jane Q.
Juan
Juan
Richard Rodney
Telephone Number
512-555-1234
210-543-9876
512-234-5678
512-987-6431
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Hierarchical DBs
• Hierarchical databases
– Are based on a classification scheme (a
taxonomy). Any element (node) in the database is
linked only to the elements directly above it and
directly below it.
– Typically require custom programming
– The original computer-based databases were
designed for banking. Hierarchical databases
were appropriate for such purposes: e.g.,
individual accounts can be grouped by family or
business; sets of accounts, grouped by branch;
accounts in different branches, grouped by city;
accounts in different cities, grouped by state.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Hierarchical DBs
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Network DBs
• Network databases
– Permit links among all components,
without restrictions to hierarchical
levels; i.e., elements can be linked to
other elements anywhere in the
database, not just those directly
above and below
– Can be extremely difficult to manage
– The World-Wide Web is a very large
example of a network database.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Network DBs
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Text-Oriented DBs
• Text-oriented DBs are, as their name suggests,
DBs that have special features for handling text:
e.g., abilities
– To search for specified strings of characters
• With or without matching the cases of the characters
• While using wildcards, i.e., symbols that will match any one
character or any sequence of characters
– To search on pairs, triples, etc., of words and phrases,
using
• Boolean logic
• Proximity logic (e.g. both words must be in same sentence, or
in same paragraph, or in same section, or within n words of
each other)
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Text-Oriented DBs
• Abilities of text-oriented DBs, cont'd
– To rank search results by weights assigned to the terms
used in the search
– To maintain thesauri of near synonyms and to allow
searches by near synonyms of original query terms
– To maintain, for selected words or phrases, indexes of
their locations in files
• Although commercial text-oriented DBs exist (e.g.,
Lexis-Nexis and Dialog), the only text-oriented
DBMSs for microcomputers that I know of are
askSam, DB/Textworks, and Isys. (From personal
use, I can give askSam high marks for power and
ease of use.)
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Geographic Information Systems
• Geographic information systems (GISs) manage data
based on geographic coordinates, usually obtained
through the use of the Geographic Positioning System
(GPS).
• GISs typically integrate a variety of data relevant to
such concerns as: atmosphere, ecology,
environment, geology, natural resources, pollution,
transportation, utilities, and characteristics and
behaviors of populations (e.g., socioeconomic status,
crimes, diseases, vehicular traffic).
• The integrated data are displayed through the use of
maps, which may include many overlapping, partly
transparent layers.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Advantages of RDBs
• Advantages of relational databases
–
–
–
–
Cut down on needless repetition of information
Ensure more accuracy
Facilitate updating and deletion of information.
Design avoids problems that occur with flat
files, e.g., insertion and deletion anomalies
• "Anomalies" are errors that occur when information
is added to or deleted from existing records, or
when whole records are added to or removed from
a DB.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Relational Databases
• In a RDB, the information content of a
table does not depend on either
– The order of the rows; or
– The order of the columns
• In other words, the rows and columns of a
table can be rearranged at will without
affecting the table's information content
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Primary Keys
• In a RDB, each table
– Must have a primary key (unique identifier)
– Must have no duplicate rows
• A primary key is
– A data attribute (column), or a combination of
attributes, that uniquely identifies each record
in the table.
– A simple key consists of a single attribute
– A composite key consists of two or more
attributes
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Primary Keys
• Primary Key
– Provides unique way to identify each record
– Can be obvious from the structure of the table.
If there is no easy natural choice, you can add
a column containing a unique identifier.
– May consist of the entire record (especially
with two-column tables, which occur often in
the development of RDBs)
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Other Types of Keys
• Secondary Key
– A column that is used to aid in the retrieval of
information from a table. A secondary key is not
required to have unique values in each of its rows,
• Foreign Key
– A column that is used to aid in the retrieval of
information from one table (i.e., is a secondary
key) and that is also the primary key in another
table. Foreign keys are a major tool in RDBs.
– All foreign keys are secondary keys, but not all
secondary keys are foreign keys.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Types of Fields
•
•
•
•
•
•
Character
Numeric
Logical
Image
Object
LBO (Large Binary Object)
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Types of Fields
• Character: Alphanumeric and punctuation.
Character fields can include numerals, but
calculations cannot be performed on the
contents of the field even though they look
like numbers.
– Examples: Names, product codes, SSNs,
ZIPCodes
– Sometimes further differentiated between memo
fields (short) and text fields (long)
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Types of Fields
• Numeric: Arithmetic operations can be
performed on the contents of the field
– Numbers (sometimes further differentiated
among integers [short, long], fixed-decimal
point numbers, floating-point numbers)
– Dates (addition and subtraction only)
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Types of Fields
• General Rule for Numeric Fields: Never
assign a field a numeric data type unless
you intend to perform mathematical
calculations on its contents.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Types of Fields
• Logical: True-False (T, F), Boolean (0, 1),
Yes-No (Y, N)
• Image: Used for relatively small graphics files
(e.g., under 100KB)
• Object: An object is a package of data
together with program code that performs
certain functions on the data
• Large Binary Object: Used for very large
items, such as multi-megabyte graphic,
audio, and video files. The RDB typically
merely stores pointers to the locations of
these large files.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
What Is SQL?
• Structured Query Language (SQL)
– Is pronounced either "S, Q, L" or "sequel"
– Is a widely used standard set of commands and
syntax for doing things with RDBMSs
– Is used especially for query and retrieval
– Includes commands for defining RDBs, conducting
transactions, storing data, etc.
– Is implemented in all major RDBMSs
• SQL does not handle all the practical details
involved in using a RDB, so every RDBMS
has additional features unique to it
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
History of SQL
• A prototype DB query and retrieval language
was developed by IBM in the early 1970s as
"Sequel"
• Other RDB developers and users recognized
the advantage of a standard language for
manipulating RDBs
• This led to ANSI adoption of SQL as an
enhanced RDB language based on Sequel
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Multiuser DBs
• You are probably accustomed to using large
Internet-based DBs as well as a DBMS on an
individual microcomputer. In the latter case you
are the only user of the DBMS and the DBs it
handles, and you are also the manager of the
DBMS.
• In using a DBMS on an individual computer, you
are unlike managers of multiuser DBMSs and
DBs, in that you do not have to worry about
questions of privacy and security—unless, of
course, you are connected to the Internet or
allow other people to use your computer.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Multiuser DBs
• You need to become aware that privacy
and security are major concerns of
multiuser DBMSs and DBs. With such
DBMSs and DBs, much time and effort
must be devoted to providing protection
against violations of privacy and
confidentiality, and against outright theft of
data.
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
Computerized Databases—
They can
help to
save you
from this
kind of
work!
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
References
• Hernandez, Michael J. Database Design for Mere Mortals: A
Hands-On Guide to Relational Database Design. Reading,
MA: Addison-Wesley; 1997. 440 p. ISBN:0-201-69471-9. [A
thorough, middle-level, readily understandable discussion of
RDBs.]
• Kawasaki, Guy. Database 101: A Database Primer for the Rest
of Us. Berkeley, CA: Peachpit Press; 1991. 154 p. ISBN:0938151-52-5. [An exceptionally readable, elementary-level
introduction to computerized databases. Unfortunately, it is now
out-of-print, but some libraries may have a copy.]
• Kroenke, David M. Database Processing: Fundamentals,
Design and Implementation. Englewood Cliffs, NJ: PrenticeHall; 2000. 601 p. ISBN: 0-13-084816-6. [An advanced-level
presentation of RDB theory and practice, but quite readable for
that level.]
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
References, cont'd
• Meadow, Charles T.; Boyce, Bert R.; Kraft, Donald H. Text
Information Retrieval Systems. 2nd ed. San Diego, CA:
Academic Press; 2000. ISBN:0-12-487405-3. [Without
question, the authoritative source on text-oriented
databases and text information retrieval.]
• Monmonier, Mark. Spying with Maps: Surveillance
Technologies and the Future of Privacy. Chicago, IL:
University of Chicago Press; 2002. ISBN0-226-53427-8.
[An excellent introduction to geographic information
systems, as well as to a variety of surveillance methods
and their impacts on privacy.]
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions
References, cont'd
• Taylor, Allen G. Database Development for Dummies.
Foster City, CA: IDG Books Worldwide; 2000. ISBN:07645-0752-4. [Like other "for Dummies" books, this
covers the basics well and is quite readable, though some
may find its attempts at humor to be excessive.]
• Taylor, Allen G. SQL for Dummies. Foster City, CA: IDG
Books Worldwide; 2001. ISBN:0-7645-0737-0. [For
comments, see the previous reference.]
School of Information - The University of Texas at Austin
LIS 386.13, Information Technologies & the Information Professions