Lecture 7 - Pravin Shetty > Resume

Download Report

Transcript Lecture 7 - Pravin Shetty > Resume

Lecture No. 7
Database - Structured and Unstructured
Database. Lecture 7 / 1
Objectives of this Lecture
• This lecture will be focussed mostly on database as a
means of storing and retrieving data.
• The term ‘database’ is always associated with the term
‘database management system’ (DBMS) which indicates that
not only is data stored but it is controlled and managed by
software.
• You will most likely be using Microsoft Access for your
assignment. This DBMS is aimed at the lower end of data
volumes and processing.
• You will also use Oracle in your laboratory work. This is a
medium to large client-server DBMS
Database. Lecture 7 / 2
Database. Lecture 7 / 3
Some Aspects of Data
A major benefit of Computing is Storage and Retrieval of Data
We need to have answers to these questions WHAT are we going to produce as ‘output’ ?
WHAT data is to be stored ?
WHAT is the level of detail (Name, Address, Height …?)
HOW long is to be retained - and is it ‘volatile’?
WHAT form is it to be stored - on line, off line ?
WHO is going to access it ?
Database. Lecture 7 / 4
Some Aspects of Data
HOW often will it be accessed ?
WHAT security of access is to be applied ?
WHAT are the starting volumes - what is the growth or
decay factor ?
WHAT response time is expected ?
HOW accurate is the data content ?
HOW current is it ? (e.g. on line sales retailing)
Database. Lecture 7 / 5
Information / Data
A General Definition:
DATA - raw (unprocessed or partly processed) facts which
represent the state of entities (things) which have
occurred
INFORMATION - data which has been processed into a form
USEFUL TO THE USER
What is Information to one user may be Data to another user.
Database. Lecture 7 / 6
Audit Trail
General Definition:
‘The presence of data processing media and procedures
which allow any and / or all transaction(s) to be traced
through ALL STAGES of processing’
This infers that the following devices / techniques are in place:
1. A logging device which ‘traps’ all transactions
2. Some way of tagging each transaction so that it can be
identified
3. Some way of retrieving the required transaction(s)
4. Some way of archiving - what is the required period ?
5. Control procedures and processes to ensure integrity
Database. Lecture 7 / 7
Data Base
A Database is a shared collection of Inter-Related data
designed to meet the needs of multiple types of users and
applications.
This implies that multiple user VIEWS can be defined
Data stored is independent of the programs which use it
Data is structured to provide a basis for future applications
DATABASE = Stored Collection of Related Data
May be physically distributed
Database. Lecture 7 / 8
Database Management System
A DBMS is SOFTWARE which provides access to the
database in an integrated and controlled manner
A DBMS must contain :
1. Data Definition and Structure capabilities
2. Data Manipulation capabilities
Database. Lecture 7 / 9
Data Definition and Manipulation
Data Definition Language (DDL)
used to describe data at the database level
Schema level - complete database description
Sub-Schema level - user views (restricted)
Data Manipulation Language (DML)
Provides for
Create
Update
Delete
Modify
Report
Insert
Retrieve (extract)
Drop
Calculation
capabilities
Database. Lecture 7 / 10
Three Level Architecture
1. External schema
- User Views
2. Conceptual schema - Total database
description
3. Internal schema
- Physical database
Database. Lecture 7 / 11
The Many Faces of Database
Databases can be:
1. Transaction Intensive
2. Decision Support
3. Mixed-Load
4. Small databases
5. Very Large Database
(VLDB)
6. Non Traditional
ATM’s Checkouts
Browsing for trends
Combination of both
Few thousand records
Many millions or trillions
of records (Banks)
- Weather bureau, flight plans
Computer Aided Design data
-
Database. Lecture 7 / 12
The Many Faces of Database
• They can be:
Data Warehouses
Data Marts
• How is a database size measured ?
There are a number of ‘measurements’
Raw data size
Total database size
Total usable disk space size (which includes media
protection such as mirroring)
Database. Lecture 7 / 13
The Many Faces of Database
Hardware
Database
Raw Data
Total Disk
HP9000
Oracle
100GB
643GB
Digital 8400
Oracle
100GB
361GB
IBM SP2
DB2/6000
100GB
377GB
NCR5100
Teradata
100GB
880GB
NCR5100
Teradata
1,000GB
3,280GB
Database. Lecture 7 / 14
DBMS Requirements
Querying Capabilities
Data Displays (Presentation)
Data entry
Data Validation
Data Deletion
Committing Procedures (of changes)
AND Data Integrity, Security, Consistency and Concurrency
Capabilities
Database. Lecture 7 / 15
Important Database Features
•
•
•
•
•
Data Integrity
Data Independence
Referential Integrity - Relational Database Model
Concurrency Control - Multiple Users
Consistency
- multi users
- distributed database
- replicated database
- partitioned database
- mobile database
• Recovery from failure (Transaction and Media)
• Security
Database. Lecture 7 / 16
File Processing (non database)
Purchasing
Program
Billing Program
Customers
Accounts
Receivable
Accounts Payable
Vendors
Invoice
Buyers
Inventory
Sales Order
Processing
Customers Inventory
Vendors
Payroll
Employee
Database. Lecture 7 / 17
File Processing With Database Technology
Orders Department
Program
A
Program
B
Program
C
Order Filling
system
Customers Inventory
Master
Accounting Department
Program
A
Program
B
Invoicing
system
Back
Orders
Inventory
Pricing
Customer
Master
Database. Lecture 7 / 18
InterRelated File Outline
Salesperson
Buyers
Sales
Stats
Inventory
Purchase
Order
Customer
Accounts
Receivable
Vendor
Accounts
Payable
General
Ledger
Database. Lecture 7 / 19
Conceptual Data Model
customer
order
invoice
product
raw material
work order
vendor
Database. Lecture 7 / 20
User Views
customer
order
customer
customer
order
product
invoice
vendor
raw material
Database. Lecture 7 / 21
Database Models - Hierarchical
owner/parent
owner
child /parent
member
child
child/parent
Database. Lecture 7 / 22
Data Base Models - Network
‘set’ of data
owner
member
owner
member
‘set’ of data
Note: Only linked sets can be accessed
Database. Lecture 7 / 23
Data Base Models - Relational
table
A
table
B
table
C
table
D
table
E
Any table(s) can be joined to any other table(s), provided
there is a means of effecting the join
Primary key / Foreign key concept. Data redundancy
No fixed linkages
Database. Lecture 7 / 24
2 Relations
EMPNUM
3
7
11
15
18
NAME
JONES
SMITH
ADAMS
NGUYEN
PHAN
Date of Birth
16-05-1956
23-09-1965
11-08-1972
23-10-1964
16-11-1976
DEPTNUM
605
432
201
314
201
Relation (Table) Name : EMP
Relation Schema: EMP(empnum,name,date of birth,deptnum)
DEPTNUM
201
314
432
605
DEPTNAME
Production
Finance
Information Systems
Administration
Relation (Table) Name : DEPT
Relation Schema: DEPT(deptnum, deptname)
Database. Lecture 7 / 25
Definition of a
Relational Database
• A relational database is a collection of relations or
two-dimensional tables.
Database
Table Name: EMP
EMPNO
ENAME
JOB
7839
7698
7782
7566
KING
BLAKE
CLARK
JONES
EXECUTIVE
MANAGER
MANAGER
MANAGER
DEPTNO
10
30
10
20
Table Name: DEPT
DEPTNO
10
20
30
40
DNAME
ACCOUNTING
RESEARCH
SALES
OPERATIONS
LOC
CLAYTON
CAULFIELD
PENINSULA
GIPPSLAND
Database. Lecture 7 / 26
Data Models
Model of
system
in client’s
mind
Entity model of
client’s model
Table model
of entity model
Server
Tables on disk
Database. Lecture 7 / 27
Communicating with a RDBMS
Using SQL
SQL statement
is entered
SQL> SELECT loc
FROM dept;
Statement is sent to
database
Database
Data is displayed
LOC
------------CLAYTON
CAULFIELD
PENINSULA
GIPPSLAND
Database. Lecture 7 / 28
Relational Database Management
System
Server
User tables
Data
dictionary
Database. Lecture 7 / 29
Relational DataBase
Data is represented in ROW and COLUMN form (matrix)
(attribute)
Collections of related data ---> TABLES (relations)
1 or more tables
----> DATA BASE
ATTRIBUTES are generally static
ROWS are DYNAMIC and Time-Varying
The number of Attributes = DEGREE of a table
The number of Rows
= CARDINALITY of a table
Database. Lecture 7 / 30
Some RDB Considerations
•
•
•
•
•
•
•
Data is held in tables
No order of data in tables - row or attribute
Concept of Foreign Key - Primary Key relationship
Data Typing - including nulls
Query Access - insert, update, delete, retrieval
Indexing on candidate (and Primary) keys
Integrity Constraints
Attribute value ranges
Referential Integrity
Entity Integrity
User Defined Integrity
• Set retention constraints
Database. Lecture 7 / 31
Some RDB Considerations
• Domain constraints
• User defined ‘Rules’ e.g. no booking of rooms for lectures
after 4.00pm Fridays (also known as Business Rules)
• Recovery procedures. Used to restore a database after a
failure
• No explicit linkages between tables
• Linking or embedding database operations in a procedural
language (Cobol, C ..)
• Databases may be distributed across similar or different
DBMS’s
• Security features
Database. Lecture 7 / 32
Data Description Language
Used to describe data at the Database level
Terms Used
Structure and Attributes
Schema : Complete description of the database using DDL
SubSchema : Describes data in the database as it is
‘known’ to individual programs(processes) or users
A segment or part of logical data record(s) required
is commonly known as a VIEW
Database. Lecture 7 / 34
Data Manipulation Language
Language (commands and syntax) used to cause transfers of
data from the Database and the Operating Environment and
vice versa
Variety of Models - Access, DB2, dBASEV, MySQL, SQLServer
VisualDataBase, DataBoss, Ingres,
Oracle, Informix ....
Windows versions provide Icons and Menu options which are
translated by the DBMS software to Database manipulation
commands
Typical commands: get, put, replace, seek, update,delete,
insert, drop, find, modify
Database. Lecture 7 / 35
Application Systems
users(ATM’s)
C programs
users
users
users
Cobol
D
B
M
S
Access
Database. Lecture 7 / 36
DBMS Components
users
Database
Management
System
Retrieval
Update
Program
Language
Interface
Application
Programs
Database
Utility
Programs
Database. Lecture 7 / 37
A Practical Development Scenario
In 198n , a Company decided to develop and introduce a
Payroll system using database technology. It looked this
this :
Payroll System
Payroll
Data
Database. Lecture 7 / 38
A Practical Development Scenario
The Company grew in numbers and expertise and decided
to introduce a Personnel System. The ‘new’ design was this
Payroll System
System
Payroll
Payroll
Data
Personnel System
Personnel
Data
Database. Lecture 7 / 39
A Practical Development Scenario
In the next few years, these components were added
Payroll System
System
Payroll
Payroll
Data
Job History
Employee
Tracing Data
Personnel System
Personnel
Data
Labor Distribution
Labor
Analysis Data
Database. Lecture 7 / 40
Advantages of DataBase
• Reduced Data Redundancy
• Data Integrity
• Data Independence
• Data Security
• Data Consistency
• Easier use of Data via DBMS Tools (Query languages,
4GL's)
Database. Lecture 7 / 41
Disadvantages of Database
• Complexity
• Expense
• Vulnerability
• Size of - disk storage, processor memory
• Training Costs
• Compatibility
• Technology Lock In
Database. Lecture 7 / 42
Database. Lecture 7 / 43
Data Types
• Used to augment an attribute description and to provide a
means of Integrity
’Normal’ data types are :
Character (or text)
Numeric - Integer, Decimal, Money, Float
(in Access Long and Short Integer,
Decimal with options of a number of ‘decimal’
places)
Date - Standard date format - Access has various forms
Logical - Yes/No True/False Exists/Does Not Exist
Database. Lecture 7 / 44
Attribute Size
In many cases this is set by the Data Definition facilities
e.g.. Date, Short Integer, Long Integer, Logical,Number
Others are set by the Designer:
Number of Characters, Size of a ‘Decimal Number’
Access allows for
a Default value
Duplicates/ No Duplicates allowed
Primary Key nomination
Indexing
‘Required’ Status of an Attribute
Database. Lecture 7 / 45
Integrity Examples
Primary Key - Must have a value - not null
- Must not be duplicated in the same table
Attribute values must exist: (or not)
student record:
student identity number - must exist, not duplicated, must be a
‘valid’ number
student name
- must exist, may be duplicated
student course code
- must exist, must be a valid course
subject code
- if enrolled, must exist, code must be
a valid code
subject result
- dependent on time. May be null. If it
exists must be a valid grade and
mark
Database. Lecture 7 / 46
Integrity Examples
Questions:
1. Do the same ‘constraints’ exist in Excel, Word ?
2. Should they ?
3. What are alternatives for embedding the same or similar
controls
4. Are spreadsheets less ‘reliable’ than databases ?
Database. Lecture 7 / 47
Queries
A Query acts of the base table or tables of a database and
returns a subset of this data.
A Query normally returns a ‘restricted set’ of attributes (and
their current values’ - this is the ‘User View’ of the database
A Query normally has some criteria
e.g. salaries > $50,000
outstandings > 30 days
date (of some event =, > or < some designated date)
Criteria can be linked : event A OR event B,
event A AND event B
Database. Lecture 7 / 48
Reports
A more formal output of data from base tables and in many
cases produces high volume.
The design is formatted ;
Page numbering, Headings, Footers,
Page breaks, Page or Item totals and sub-totals
Reports can be ‘criteria based’ and include calculations
(derived data)
Report content can be imported from another source, or
exported to another target.
Reports can contain Exception Full Graphic Hypertext
information and may be hard copy, electronic copy,
Database. Lecture 7 / 49
A Puzzle
How can these dots be joined by 4 straight lines without
lifting a pencil (or pen) from the surface ?
Database. Lecture 7 / 50
A Puzzle
1
2
3
No, that’s 5 lines
4
5
Database. Lecture 7 / 51
A Puzzle
1
5
4
2
That’s worse - it’s
6 lines
6
3
Database. Lecture 7 / 52
A Puzzle
1
How about this ?
Database. Lecture 7 / 53
A Sunburnt Country
This could be a commentary of the Summer weather
I love a sunburnt country
A land of sweeping plains,
Of ragged mountain ranges,
Of droughts and flooding rains
I love her far horizons,
I love her jewel-sea,
Her beauty and her terror The wide brown land for me !
Database. Lecture 7 / 54
Database. Lecture 7 / 55
Non Structured Data
• Up to date, we have studied some of the methods of file
organisation associated with efficient data retrieval as
embodied in the relational data base model.
• However, although industry has invested large amounts of
money in relational database applications, the greater
percentage of data access requirements are made of nonstructured data sources (a classic one being the World Wide
Web).
• This lecture will uncover some of the techniques which
locate and release data of this non-structured nature.
Database. Lecture 7 / 56
Information Retrieval
Non - Structured
i.e. Occurrence of 'Attribute Values'
neither regular nor regulated
Processing Requirements
- Mainly to reveal/release textual information
- Information searches on 'key terms'
- May be further processed by computer systems
(e.g. Spreadsheets, Mathematical Models,
Simulations)
Database. Lecture 7 / 57
Information Retrieval
User Expectations - All 'relevant' references will be found and
released
(Compare to structured DBMS closure theory)
Note: Approximately 70 - 80% of 'Management Information' is from
Non-structured Databases
Database. Lecture 7 / 58
Information Retrieval
Consider the following :
When I do count the clock that tells the time,
And see the brave day sunk in hideous night,
When I behold the violet past prime,
And sable curls all silvered o'er with white:
When lofty trees I see barren of leaves,
Which erst from heat did canopy the herd
And summer's green all girded up in sheaves
Database. Lecture 7 / 59
Information Retrieval
Borne on the bier with white and bristly beard:
Then of thy beauty do I question make
That thou amongst the wastes of time must go,
Since sweets and beauties do themselves forsake,
And die as fast as they see others grow,
And nothing against time's scythe can make defence
Save breed to brave him when he takes thee hence
Shakespeare, Sonnet No.12
Database. Lecture 7 / 60
Information Retrieval
Or , consider this:
Now is the winter of our discontent
Made glorious summer by this sun of York
And all the clouds that loured upon our house
In the deep bosom of the ocean buried
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments
Our stern alarums changed to merry greetings
Our dreadful marches to delightful measures
Shakespeare: Soliloquy ? ? ?
Database. Lecture 7 / 61
Information Retrieval
Finally, another extract:
The peasants who survived the plague found themselves in
many cases afflicted by fresh burdens, for with fewer people
to work the land, overlords demanded a standstill in wages
and a return to feudal duties in full.
But, with the shortage of labour, workers naturally expected
to be valued more highly and to be given better pay and
more freedom.
Europe 14/15th Century
A history of the world - Rjunstead ( perhaps Enterprise Bargaining ?)
Database. Lecture 7 / 62
Information Retrieval
Definition: An Information Retrieval System stores items of
Information which need to be:
Processed
Searched
Retrieved
Analysed, Condensed, Explained and Sent to User
Populations
• Some Concerns:
Data Base Management
Decision Support (as with other Information Systems)
Database. Lecture 7 / 63
Information Retrieval
Requirements:
• Must have efficient storage organisations
• Rapid search procedures
• Effective dissemination and user interaction
Database. Lecture 7 / 64
Information Retrieval
Information Retrieval Systems are used to handle data in
various forms such as
• Bibliograpic
• Textual
• Audio
• Video
Database. Lecture 7 / 65
Information Retrieval
• Allied Topics
Information Theory
Probability Theory
Computational Semantics
Programming Theory
Algebra
Database. Lecture 7 / 66
Information Retrieval
Measurement of Usefulness:
1. Currency
2. Completeness
Some queries:
1. Excess Information
2. Obsolete (?) Information
Database. Lecture 7 / 67
Information Retrieval
Resource Management
Growth rates
Is it Important ?
Up to 1800, the rate of publication doubling
every 50 years
1800 to 1996 > 100,000 scientific publications
1996
> 400,000 scientific publications
2000
> 680,000 scientific publications
Impact of World Wide Web
plus the translation factor
Database. Lecture 7 / 68
Information Retrieval
Some Information Organisation Problems
Volume expansion is not evenly distributed for all topics
Location of related items
Relationship methods (trace, link, chain)
Database. Lecture 7 / 69
Information Retrieval
Key Functions of Information Retrieval Systems
Input
Requests
Process
Indexing
Language
Output
Information
Items
(mapping)
Organised for
Search Formulation
Request Representation
Indexing
Processes
Information Representation
Database. Lecture 7 / 70
Information Retrieval
Item No
Author
Title
Topic
1
Ash
2
Brown
3
Jones
4
Reynolds
5
Smith
Aspects of
Computerised
Information
Retrieval
A Survey
A History
The State of
of users of of Computer the Art of
Information Systems
Retrieval
Retrieval
Systems
Users of
New
Retrieval
Systems
Computer
Information
Retrieval
Systems
Information
Retrieval
Users
Retrieval
System
Users
Computer
Systems
Information
Retrieval
Systems
Database. Lecture 7 / 75
Information Retrieval
Inverted File Organisation Sample
Related Information Items
Topic
Computer
Information
1
1
Retrieval
Systems
Users
1
1
3
2
4
2
4
4
3
2
5
5
5
Inverted
Index
Additions and Deletions require Index modifications
Database. Lecture 7 / 76
Information Retrieval
Inverted Files in a Commercially Available I.R.S.
Boolean logic : associative terms queries
Operators: AND OR NOT
Functions used: Set Intersection
Set Union
Set Difference
Database. Lecture 7 / 77
Information Retrieval
Database. Lecture 7 / 78
Information Retrieval
Item No
Author
Title
Topic
1
Ash
2
Brown
3
Jones
4
Reynolds
5
Smith
Aspects of
Computerised
Information
Retrieval
A Survey
A History
The State of
of users of of Computer the Art of
Information Systems
Retrieval
Retrieval
Systems
Users of
New
Retrieval
Systems
Computer
Information
Retrieval
Systems
Information
Retrieval
Users
Retrieval
System
Users
Computer
Systems
Information
Retrieval
Systems
Database. Lecture 7 / 79
Information AND Retrieval
Based on the data given:-
1. Use Inverted Index to Locate the reference Numbers for the
term “Information”
Set 1
2. Ditto for the term “Retrieval”
3. Intersect of Set 1 and Set 2
Set 2
Set 3
4. Use set 3 to retrieve documents identified in this set.
Database. Lecture 7 / 80
Information OR Retrieval
Create Set 1 as before on the occurrences of the term
“Information”
Create Set 2 as before on the occurrences of the term
“Retrieval”
Combine Set 1 and Set 2
Set 3
This is known as Set Union
Database. Lecture 7 / 81
Information NOT Retrieval
Perform the formation of Set 1 and Set 2 based on the
occurrence of the term “Information” and also the term
“Retrieval”
Remove any reference from Set 1 which is included in Set 2
to produce a final set, Set 3
This is known as Set Difference
And this is more Information. It’s
informing me it’s time to stop.
Database. Lecture 7 / 82
Additional Features
Document Reference No.
Paragraph No.
Sentence No.
Word No
Document Hierarchy
Retrieval (345 1 2 5)
Could also be a Word Count from the beginning of text
(distance indicator)
Retrieval (345 13) interpreted as Document 345, 13 words
from Beginning of Text (B.O.T)
Database. Lecture 7 / 85
Some Commercial Inverted File
Systems
1. Dialog
2. Stairs
Uses Select and Combine
Utilities to create and maintain database
OnLine retrieval system AQUARIUS - includes
a dictionary
3. Bibliographic Retrieval Services System (BRS)
Based on Stairs
4. Medlars system (National Library of Medicine)
Uses Index File Posting File Data File
5. Orbit system
6. Lexis system
Database. Lecture 7 / 86
B+ Tree Example
Document Clustering
Network
Catalog Hardware Morpheu
Review Synonym
Number Publication
Apparatus
Biomedicine
Statistics
Encyclopedia
Stem
Structure
Grammar
File
Search for Query = ‘File’
Database. Lecture 7 / 88
String Matching - Boyer and Moore
Based on 1. Analysis of query pattern
2. Auxiliary table
3. Character mapping
1 2 3 4 5 6 7 8 9 10 11
LECTUREROOM
L occurs in position 1
E occurs in positions 2 and 7
C occurs in position 3
T occurs in position 4
U occurs in position 5
R occurs in positions 6 and 8
O occurs in positions 9 and 10 M occurs in position 11
Matching starts <Right> character and progressively shifts
left on match
Database. Lecture 7 / 89
Information Management
Data retrieval - an application package
TITAN: Museums, Births Deaths and Marriages Registries
Art Galleries, Market Research, News Services,
Hansard, Reference Libraries
Platforms: Unix, AIX, SunO/S, PCs to Mainframes
Method - Signature Files - A bit string for each record in the
data file. Uses Indexed terms
Database. Lecture 7 / 90
Information Management
2 stages necessary:
1. Create descriptor for each Indexed term
2. Superimpose each term descriptor (exclusive OR-ing)
Professional ---> 100100 .......... (32 bits)
term
Computing ---> 010100 ..........
descriptors
Magazine
---> 000101 ..........
Record descriptor 110101 (inclusive OR)
Record descriptors ---> record descriptor file + Pointers
Database. Lecture 7 / 91