File Structures

Transcript File Structures

C HAPT E R 3
Chapter
9
DataBase & DBMS
J. Glenn
Glenn Brookshear
Brookshear
J.
蔡文能
1
Copyright © 2009 Pearson Education, Inc.
Slide 8-1
Chapter 9: Database Systems
•
•
•
•
•
9.5 Traditional File Structures
9.1 Database Fundamentals
9.2 The Relational Model
9.3 Object-Oriented Databases
9.4 Maintaining Database Integrity
• 9.6 Data Mining
• 9.7 Social Impact of Database Technology
2
Copyright © 2009 Pearson Education, Inc.
Slide 8-2
Data Structure vs. File Structure
• Data Structure : How to arrange data in memory
• File Structure : How to arrange data in Disk and/or
any other secondary storage
• DataBase and DataBase Management System
– Users do NOT have to care about how to store data in a file.
DBMS will handle the detail.
– Users can use SQL (Structured Query Language) to access
the DataBase in an interactive command and/or through a
program (embeded SQL)
RAM Disk
3
Copyright © 2009 Pearson Education, Inc.
Slide 8-3
File Types
• Sequential file: one accessed in a serial manner
from beginning to end. E.g. audio, video, text,
programs.
• Text file: sequential file in which each logical
record is a single character.
ASCII: 1 byte/char
Unicode: 2 bytes/char
4
Copyright © 2009 Pearson Education, Inc.
Slide 8-4
Sequential Files
• Sequential file: A file whose contents can only
be read in order
– Reader must be able to detect end-of-file (EOF)
– Data can be stored in logical records, sorted by a
key field
• Greatly increases the speed of batch updates
5
Copyright © 2009 Pearson Education, Inc.
Slide 8-5
Text Files
• Simple file structure.
• Extendable to more complex file structures
using markup languages (XHTML, HTML).
• XHTML, HTML control the display of the file
on a monitor.
• XML is a standard for markup languages.
6
Copyright © 2009 Pearson Education, Inc.
Slide 8-6
Converting data from two’s
complement notation into ASCII for
storage in a text file
7
Copyright © 2009 Pearson Education, Inc.
Slide 8-7
Figure 9.15 A procedure for merging two
sequential files
8
Copyright © 2009 Pearson Education, Inc.
Slide 8-8
Figure 9.16
Applying the merge
algorithm (Letters
are used to represent
entire records.
The particular letter
indicates the value
of the record’s
key field.)
9
Copyright © 2009 Pearson Education, Inc.
Slide 8-9
XML
• Example
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
http://www.w3schools.com/xml/xml_whatis.asp
10
Copyright © 2009 Pearson Education, Inc.
Slide 8-10
Indexed File
key
pointer
Brown
P1
Johnson
P2
Jones
P3
Smith
P4
Watson
P5
Index file
Data item
for Brown
Data file
Index: A list of key values and the location
of their associated records
11
Copyright © 2009 Pearson Education, Inc.
Slide 8-11
An inverted file
12
Copyright © 2009 Pearson Education, Inc.
Slide 8-12
A file with a partial index
13
Copyright © 2009 Pearson Education, Inc.
Slide 8-13
Figure 9.17 Opening an
indexed file
14
Copyright © 2009 Pearson Education, Inc.
Slide 8-14
Hashing
• In hashing the index file is replaced by a hash
function.
– The storage space is divided into buckets.
– Each record has a key field. Each record is stored in the
bucket corresponding to the hash of its key.
– A hash function computes a bucket number for each key
value.
• Advantage: no index table needed.
• Disadvantages:
i) hash function needs careful
design;
ii) unpredictable performance
Copyright © 2009 Pearson Education, Inc.
15
Slide 8-15
Terminology
• Bucket: section of the data storage area.
• Key: identifier for a block of information.
• Hash function: takes as input a key and outputs
a bucket number.
• Collision: two keys yield the same bucket
number.
16
Copyright © 2009 Pearson Education, Inc.
Slide 8-16
Hash Functions
• Hash Function Requirements
– Easily and quickly computed.
– Values evenly spread over the bucket numbers.
– What can go wrong: bucket number computed from 1st and
3rd characters of a name:
Brown, Brook, Broom, Broadhead, Biot, Bloom, …
Examples of Hash Functions
 Mid square: compute (key x key) and
set bucket number = middle digits.
 Extraction: select digits from certain positions within the key.
 Divide key by number of buckets and use the remainder.
17
Copyright © 2009 Pearson Education, Inc.
Slide 8-17
Figure 9.18 Hashing the key field value 25X3Z
to one of 41 buckets
18
Copyright © 2009 Pearson Education, Inc.
Slide 8-18
Figure 9.19 The rudiments of a hashing system, in
which each bucket holds those records
that hash to that bucket number
19
Copyright © 2009 Pearson Education, Inc.
Slide 8-19
Collisions in Hashing
• Collision: The case of two keys hashing to the same
bucket
– Clustering problem: Poorly designed hashing function can
have uneven distribution of keys into buckets
– Collision also becomes a problem when there aren’t enough
buckets (probability greatly increases as load factor (% of
buckets filled) approaches 75%)
– Solution: somewhere between 50% and 75% load factor,
increase number of buckets and rehash all data
20
Copyright © 2009 Pearson Education, Inc.
Slide 8-20
Handling bucket overflow
21
Copyright © 2009 Pearson Education, Inc.
Slide 8-21
A large file partitioned into buckets
to be accessed by hashing
22
Copyright © 2009 Pearson Education, Inc.
Slide 8-22
The role of an operating
system when accessing a file
System calls ?
23
Copyright © 2009 Pearson Education, Inc.
Slide 8-23
Maintaining a file’s order by means of
a File Allocation Table (FAT)
24
Copyright © 2009 Pearson Education, Inc.
Slide 8-24
Information Required on a Hard Drive to Load an OS
• Startup BIOS (POST , Load MBR)
• Master Boot Record (MBR)
– Master Boot Program
– Partition Table (16 bytes * 4)
• OS Boot Record (Boot Sector)
– Loads the first program file of the OS
• Boot Loader Program
– Begins process of loading OS into memory
25
Copyright © 2009 Pearson Education, Inc.
Slide 8-25
How Data Is Logically Stored on a Floppy Disk
• All floppy and hard disk drives are divided into
tracks and sectors
• Tracks are concentric circles on a disk
• Sector
– Always 512 bytes
– Physical organization of a disk
– BIOS manages disk as sectors
• Cluster (file allocation unit)
– Group of sectors
– Logical organization of a disk
– OS views disk as a list of clusters
26
Copyright © 2009 Pearson Education, Inc.
Slide 8-26
The Boot Record
• Track 0, sector 1 of a floppy disk
• Contains basic information about how the
disk is organized
• Includes bootstrap program, which can be
used to boot from the disk
• Uniform layout and content of boot record
allows any version of DOS or Windows to
read any DOS or Windows disk
27
Copyright © 2009 Pearson Education, Inc.
Slide 8-27
The File Allocation Table (FAT)
• Lists the location of files on disk in a one-column
table
• Floppy disk FAT is 12 bits wide, called FAT12
• Each entry describes how a cluster on the disk is used
• A bad cluster on the disk will be marked in the FAT
28
Copyright © 2009 Pearson Education, Inc.
Slide 8-28
The Root Directory
• Lists all the files assigned to this table
• Contains a fixed number of entries
• Some items included are:
– Filename and extension
– Time and date of creation or last update
– File attributes
– First cluster number
29
Copyright © 2009 Pearson Education, Inc.
Slide 8-29
How a Hard Drive is Logically Organized to Hold Data
• Low-level format
– Creates tracks and sectors, done at factory
• Partition the hard drive (FDISK.EXE)
– Creates partition table at the beginning of drive
• High-level format
– Done by OS for each logical drive
• Master Boot Record (MBR) is the first 512 bytes of a
hard drive
– Master boot program (446 bytes) calls boot program to load
OS
– Partition table
• Description, Location, Size
30
Copyright © 2009 Pearson Education, Inc.
Slide 8-30
Partitions and Logical Drives
31
Copyright © 2009 Pearson Education, Inc.
Slide 8-31
FAT16
• Supported by DOS and all versions of
Windows
• Uses 16 bits for each cluster entry
• As the size of the logical drive increases,
FAT16 cluster size increases dramatically
32
Copyright © 2009 Pearson Education, Inc.
Slide 8-32
FAT32
• Became available with Windows 95 OSR2
• Used 32 bits per FAT entry, although only 28
bits were used to hold cluster numbers
• More efficient than FAT16 in terms of cluster
size
33
Copyright © 2009 Pearson Education, Inc.
Slide 8-33
NTFS
• Supported by Windows NT/2000/XP
• Provides greater security
• Used a database called the master file table
(MFT) to locate files and directories
• Supports large hard drives
34
Copyright © 2009 Pearson Education, Inc.
Slide 8-34
Comparing FAT16, FAT32, NTFS
35
Copyright © 2009 Pearson Education, Inc.
Slide 8-35
Figure 9.1: A file versus a database organization
(1/2)
36
Copyright © 2009 Pearson Education, Inc.
Slide 8-36
Figure 9.1: A file versus a database organization
(2/2)
37
Copyright © 2009 Pearson Education, Inc.
Slide 8-37
Figure 9.2: The conceptual layers of a database
38
Copyright © 2009 Pearson Education, Inc.
Slide 8-38
Schemas
• Schema: A description of the structure of an
entire database, used by database software to
maintain the database
• Subschema: A description of only that portion
of the database pertinent to a particular user’s
needs, used to prevent sensitive data from
being accessed by unauthorized personnel
39
Copyright © 2009 Pearson Education, Inc.
Slide 8-39
Database Management Systems
• Database Management System (DBMS): A software
layer that manipulates a database in response to
requests from applications
• Distributed Database: A database stored on multiple
machines
– DBMS will mask this organizational detail from its users
• Data independence: The ability to change the
organization of a database without changing the
application software that uses it
40
Copyright © 2009 Pearson Education, Inc.
Slide 8-40
Database Models
• Database model: A conceptual view of a
database
– Relational database model
– Object-oriented database model
41
Copyright © 2009 Pearson Education, Inc.
Slide 8-41
Relational Database Model
• Relation: A rectangular table
– Attribute: A column in the table
– Tuple: A row in the table
• Relational Design
– Avoid multiple concepts within one relation
• Can lead to redundant data
• Deleting a tuple could also delete necessary but
unrelated information
42
Copyright © 2009 Pearson Education, Inc.
Slide 8-42
Improving a Relational Design
• Decomposition: Dividing the columns of a
relation into two or more relations, duplicating
those columns necessary to maintain
relationships
– Lossless or nonloss decomposition: A “correct”
decomposition that does not lose any information
43
Copyright © 2009 Pearson Education, Inc.
Slide 8-43
Figure 9.3: A relation containing employee
information
44
Copyright © 2009 Pearson Education, Inc.
Slide 8-44
Figure 9.4: A relation containing redundancy
45
Copyright © 2009 Pearson Education, Inc.
Slide 8-45
Figure 9.5 An employee database
consisting of three relations
46
Copyright © 2009 Pearson Education, Inc.
Slide 8-46
Figure 9.6 Finding the departments in which
employee 23Y34 has worked
47
Copyright © 2009 Pearson Education, Inc.
Slide 8-47
Figure 9.7: A relation and a proposed
decomposition
48
Copyright © 2009 Pearson Education, Inc.
Slide 8-48
Relational Operations
• Select: Choose rows
• Project: Choose columns
• Join: Assemble information from two or more
relations
49
Copyright © 2009 Pearson Education, Inc.
Slide 8-49
Figure 9.8: The SELECT operation
50
Copyright © 2009 Pearson Education, Inc.
Slide 8-50
Figure 9.9: The PROJECT operation
51
Copyright © 2009 Pearson Education, Inc.
Slide 8-51
Figure 9.10: The JOIN operation
52
Copyright © 2009 Pearson Education, Inc.
Slide 8-52
Figure 9.11: Another example of the JOIN
operation
53
Copyright © 2009 Pearson Education, Inc.
Slide 8-53
Figure 9.12 An application of the JOIN
operation
54
Copyright © 2009 Pearson Education, Inc.
Slide 8-54
Figure 9.13: The associations between objects in
an object-oriented database
60
Copyright © 2009 Pearson Education, Inc.
Slide 8-60
Maintaining Database Integrity (1/2)
• Transaction: A sequence of operations that must all
happen together
– Example: transferring money between bank accounts
• Transaction log: A non-volatile record of each
transaction’s activities, built before the transaction is
allowed to execute
– Commit point: The point at which a transaction has been
recorded in the log
– Roll-back: The process of undoing a transaction
61
Copyright © 2009 Pearson Education, Inc.
Slide 8-61
Maintaining database integrity (2/2)
• Simultaneous access problems
– Incorrect summary problem
– Lost update problem
• Locking = preventing others from accessing
data being used by a transaction
– Shared lock: used when reading data
– Exclusive lock: used when altering data
62
Copyright © 2009 Pearson Education, Inc.
Slide 8-62
表格正規化簡介 (1/2)
• 第一階正規化 (1NF):
– 關聯式資料庫(relational database)要求各個資料表皆須符合第一正規化
(first normal form, 簡稱1NF)，如下：
– 第一正規化：每一個欄位只准有一個值。
– 關聯式資料庫要求每個資料表都要有一個主鍵(primary key)，用來識別每
一個tuple。
– 主鍵可以是資料表中的某一個欄位，也可以由幾個欄位組成。
– 關聯式資料庫對主鍵欄位另外有個要求，即實體完整性(entity integrity)：
組成主鍵的任何欄位值，都不可以是Null。
• 第二階正規化 (2NF)：
– 資料表要滿足第一正規化，而且所有欄位都完全功能相依於主鍵。
• 功能相依(functionally dependence)：
– 由attribute X的值可以決定一個唯一的attribute Y的值，簡寫成X→Y。
• 完全功能相依(full functional dependence)：
– 如果attribute Y功能相依於attribute X，但是並不功能相依於attribute X的任
何子集，則稱attribute Y完全功能相依於attribute X。
63
Copyright © 2009 Pearson Education, Inc.
Slide 8-63
表格正規化簡介 (2/2)
X→Y，Y→Z的關係稱做遞移功能相依(transitive functional
dependence)，即Z遞移功能相依於X。
• 第三階正規化(third normal form，簡稱3NF)
– 資料表要滿足第二正規化，而且所有欄位都不可遞移功能相依於主鍵。
• 第三階正規化解決的問題
– 解決了新增資料的問題
– 解決了刪除資料的問題
– 解決了修改資料的問題
64
Copyright © 2009 Pearson Education, Inc.
Slide 8-64
Data Mining (資料探勘)
• Data Mining: The area of computer science that deals
with discovering patterns in collections of data
• Data warehouse: A static data collection to be mined
– Data cube: Data presented from many perspectives to
enable mining
• Data Mining Strategies
–
–
–
–
–
–
Class description
Class discrimination
Cluster analysis
Association analysis
Outlier analysis
Sequential pattern analysis
65
Copyright © 2009 Pearson Education, Inc.
Slide 8-65
Data mining(資料探勘)主要功能與技術
功能
技術
適用領域
關聯性 (Association)
案例庫推理/集合理論/統計
菜籃分析
時間序列 (Sequence)
類神經網路/統計
利率預測
分類 (Classification)
基因演算/類神經網路/統計/
模糊邏輯案例推理/決策樹
客戶評鑑分類
公式 (Modeling)
基因規劃/基因演算/迴歸
銷售預測
群組 (Clustering)
類神經網路/模糊邏輯/
基因演算/統計
市場區隔
66
Copyright © 2009 Pearson Education, Inc.
Slide 8-66
Social Impact of Database Technology
• Problems
– Massive amounts of personal data are being
collected
• Often without knowledge or meaningful consent of
affected people
– Data merging produces new, more invasive
information
– Errors are widely disseminated and hard to correct
• Remedies
– Existing legal remedies often difficult to apply
– Negative publicity may be more effective
67
Copyright © 2009 Pearson Education, Inc.
Slide 8-67
Thank You!
DataBase
Management System
謝謝捧場
[email protected]
蔡文能
68
Copyright © 2009 Pearson Education, Inc.
Slide 8-68