Transcript ppt



Indexing Structures for Files

1



Basic Concepts

Indexing mechanisms used to speed up access to
desired data without having to scan entire table
based on a search key

Search Key
 an attribute used to look up records in a file.
2




An index file consists of records (called index entries) of
the form
search-key


Index Structure
pointer

Index entries
 Search key value and a pointer to a row having that
value
 The values in the index are ordered.

Index files are typically much smaller than the original file

When a file is modified, every index on the file must be
updated
 Updating indices imposes overhead on database
modification.
3


Index Evaluation Metrics


Indexing techniques evaluated on basis of:
 Access types (queries) supported efficiently.
records with a specified value in the attribute
or records with an attribute value falling in a specified
range of values.
 Access/search
time
 Insertion
time
 Deletion time
 Space overhead

4



Index Classification


primary index:
 is specified on the ordering key field of an ordered file,
where every record has a unique value for that field.
 The index has the same ordering as the one of the file.

clustering index:
 is specified on the ordering field of an ordered file.
 The index has the same ordering as the one of the file.
 An ordered file can have at most one primary index or one
clustering index, but not both.

secondary index:
 is specified on any nonordering field of the file.
 The index has different ordering than the one of the file.
 A file can have several secondary indices in addition to its
primary/clustering index.
5



Primary Indices

Primary index is specified on the ordering key
field of an ordered file.

There is one index entry (or index record) in the
index file for each block in the data file.
 Each index entry has the value of the primary
key field for the first record in a block.

The total number of entries in the index file is the
same as the number of disk blocks in the data file.

The index file for a primary index needs fewer
blocks than does the data file.
6



Primary Indices


7



Primary Indices


Finding a record is efficient – do a binary search

Records insertion and deletion is a major problem.
We can avoid the problem by:
 Using an unordered overflow file, or
 Using a linked list of overflow records.
8


Primary Indices

Index (sequential)
10
continuous 20
30
33
40
50
60
39
31
35
36
32
38
34
free space
70
80
90

overflow area
(not sequential)
9



Sparse Vs. Dense Indices

dense index
 has index entry for every record in the file.

sparse (nondense) index
 has index entries for only some of the searchkey values.

A primary index is sparse (nondense) index.
10



Sparse Vs. Dense Indices
Id
Name

Dept
Sparse primary
index sorted
on Id
Ordered file sorted on Id

11
Dense secondary
index sorted
on Name


Sparse Vs. Dense Indices

Ashby, 25, 3000
22
Basu, 33, 4003
25
Bristow, 30, 2007
30
Ashby
33
Cass
Cass, 50, 5004
Smith
Daniels, 22, 6003
40
Jones, 40, 6003
44
Sparse primary
index on Name
44
Smith, 44, 3000
50
Tracy, 44, 5004
Dense secondary
Ordered file on Name index on Age

12



Dense Indices


Pro:
 Very efficient in locating a record given a key,
if fits in the memory
 Can tell if any record exists without accessing
file

Con:
 if too big and doesn’t fit into the memory, will
be expense when used to find a record given its
key
13



Sparse Indices


Sparse index contains index records for only some
search-key values.
 Some keys in the data file will not have an entry
in the index file
 Applicable when records are sequentially ordered
on search-key (ordered files)
 Normally keeps only one key per data block

To locate a record with search-key value K we:
 Find index record with largest search-key value 
K
 Search file sequentially starting at the record to
which the index record points
14


Sparse Indices
Ordered File
Sparse/Primary Index
10
20
10
30
50
70
30
40
90
110
130
150
50
60
70
80
90
100
170
190
210
230


15



Sparse Indices

Less space (can keep more of index in memory)
 Support multi-level indexing structure

Less maintenance overhead for insertions and
deletions.
16



Index Update: Deletion

If deleted record was the only record in the file with
its particular search-key value, the search-key is
deleted from the index also.
 Single-level index deletion:
 Dense indices

deletion of search-key is similar to file record deletion.
 Sparse
indices
If an entry for the search key exists in the index, it is
deleted by replacing the entry in the index with the next
search-key value in the file (in search-key order).
If the next search-key value already has an index entry,
the entry is deleted instead of being replaced.

17


Dense Index: Deletion
10
20
10
20
30
40
30
40
50
60
50
60
70
80


70
80
18


Dense Index: Deletion

delete record 30
10
20
10
20
40 30
40
30 40
40
50
60
50
60
70
80

70
80
19


Sparse Index: Deletion
10
20
10
30
50
70
30
40
50
60
90
110
130
150


70
80
20


Sparse Index: Deletion

delete record 40
10
20
10
30
50
70
30
40
50
60
90
110
130
150

70
80
21


Sparse Index: Deletion

delete record 30
10
20
10
40 30
50
70
30 40
40
50
60
90
110
130
150

70
80
22


Sparse Index: Deletion

delete records 30 & 40
10
20
10
50 30
70 50
70
30
40
50
60
90
110
130
150

70
80
23


Index Update: Insertion


Single-level index insertion:
 Perform a lookup using the search-key value
appearing in the record to be inserted.
 Dense indices
if the search-key value does not appear in the index,
insert it.
 Sparse

indices
if index stores an entry for each block of the file, no
change needs to be made to the index unless a new
block is created.
In this case, the first search-key value appearing in the
new block is inserted into the index.
24


Sparse Index: Insertion

10
20
10
30
40
60
30
40
50
60

25


Sparse Index: Insertion

insert record 34
10
20
10
30
40
60
30
34
40
50
60

26


Sparse Index: Insertion

insert record 15
10
20 15
10
20 30
40
60
30 20
30
40
50
• Illustrated: Immediate reorganization 60
• Variation:
– insert new block (chained file)
– update index

27


Sparse Index: Insertion

insert record 25
10
30
40
60
10
20
25
30
overflow blocks
(reorganize later...)
40
50
60

28


Dense Index: Insertion




Similar
Often more expensive . . .
29


Duplicate keys

10
10
10
20
20
30
30
30
40
45

30


Duplicate keys

Dense index
10
10
10
10
10
20
10
20
20
30
20
30
30
30


30
30
40
45
31


Duplicate keys
Sparse index, one way?
careful if looking
for 20 or 30!



10
10
10
10
20
30
10
20
20
30
40
30
30
40
45
32


Duplicate keys


Sparse index, another way? (clustering index)
– place first new key from block
should
this be
40?
10
20
30
30
10
10
10
20
20
30
30
30
40
45

33


Clustering Indices




A clustering index can be used when the field (the
clustering field) is non-key, and the data file is sorted by
the clustering field.
A file can have at most one primary index or one clustering
index, but not both.

A clustering file is also an ordered file with two fields:
 Clustering field
 pointer to the first block that has a record with that
value for its clustering field.

There is one entry in the clustering index for each distinct
value of the clustering field (rather than for every record).
 Sparse index (nondense)
34


Clustering Indices



A clustering index on the
DEPNo ordering nonkey
field of an EMPLOYEE
file.
35


Clustering Indices


Record insertion and deletion still cause problems
 a solution; cluster of contiguous blocks

Good for range searches
 Use location mechanism to locate index entry at
start of range
 This locates first data record.


Subsequent data records are contiguous if index
is clustered (not so if unclustered)
36


Clustering Indices



Clustering index with
a separate block cluster
for each group of
records that share the
same value for the
clustering field.
37


Secondary Indices


Secondary index:
 is specified on any nonordering field of the file.
 Non-ordering field can be a key (unique) or a non-key
(duplicates)


The index has different ordering than the one of the
file.

A file can have several secondary indices in addition to
its primary index.

there is one index entry for each data record

index record points either to the block in which the
record is stored, or to the record itself

Hence, such an index is dense
38



Secondary Indices


A secondary index usually needs more storage
space and longer search time than does a primary
index.
 It has larger number of entries.

Sequential scan using primary index is efficient,
but a sequential scan using a secondary index is
expensive
 each record access may fetch a new block from
disk
39


Secondary Indices



A dense secondary index (with block
pointers) on a nonordering KEY field.
40


Secondary Indices



A dense secondary index (with
record pointers) on a nonordering non-key field.
41


Index Types and Indexing Fields

Also, review Table 14.2.
Data file ordered
by indexing field
Indexing field is key
Indexing field is nonkey


Primary Index
Clustering Index
42
Data file not
ordered by
indexing field
Secondary index
(Key)
Secondary index
(NonKey)


Multilevel Indices


To search the index faster we can create an index for
the index.

A multilevel index considers the index file as an
ordered file and creates a primary index for the first
level
 outer index – a sparse index of primary index
 inner index – the primary index file
The above process can be repeated for a higher level
if the previous level needs more than one block of
disk storage.
 Read EXAMPLE 3


43


Multilevel Indices


44


B+-Tree Index



A B+-tree, of order f (fan-out --- maximum node capacity),
is a rooted tree satisfying the following:
 All paths from root to leaf are of the same length
(balanced tree)

Each non-leaf node (except the root) has between f/2
and up to f tree pointers (f-1 key values).

A leaf node has between f/2 and f-1 data pointers
(plus a pointer for sibling node).

If the root is not a leaf, it has at least 2 children.

If the root is a leaf (that is, there are no other nodes in
the tree), it can have between 0 and f-1 values.
45



B+-Tree Non-leaf Node Structure

Ki are the search-key values, K1  K2  K3  …  Kf-1
 all keys in the subtree to which P1 points are  K1.
 all keys in the subtree to which Pf points are  Kf-1.
 for 2  i  f-1, all keys in the subtree to which Pi points
have values  Ki-1 and  Ki.

Pi are pointers to children nodes (tree nodes).
46



B+-Tree Leaf Node Structure






for i = 1, 2, …, f-1, pointer Pri is a data pointer, that either
points to a
 file record with search-key value Ki, or
 block of record pointers that point to records having
search-key value Ki. (if search-key is not a key)
Pnext points to next leaf node in search-key order.
Within each leaf node, K1  K2  K3  …  Kf-1
If Li, Lj are leaf nodes and i  j, then
 Li’s search-key values  Lj’s search-key values
47


57
81
95
To record
with key 57
To record
with key 81
To record
with key 95

Sample Leaf Node
48

From non-leaf node
to next leaf
in sequence


95
81
57
to keys
 57


Sample Non-Leaf Node
to keys
57  k  81
to keys
81  k  95
49
to keys
 95


50
110
130
179
11
35
Root
180
200
150
156
179
120
130
100
101
110
30
35
3
5
11

Example of a B+-Tree

f=4


Number of pointers/keys for B+-Tree 

Non-leaf
30
Leaf
30
35
min. node
120
150
180
Full node
3
5
11
f=4
51



Observations about B+-Trees


In a B+-tree, data pointers are stored only at the
leaf nodes of the tree
 hence, the structure of leaf nodes differs from
the structure of internal nodes.

The leaf nodes have an entry for every value of the
search field, along with a data pointer to the
record.

Some search field values from the leaf nodes are
repeated in the internal nodes.
52


B+-Trees: Search






Let a be a search key value and T the pointer to the
root of the tree that has f pointer.
Search(a, T)
If T is non-leaf node:
 for the first i that satisfy a  Ki, 1  i  f-1

call Search(a, Pi),
 else call Search(a, Pf).
Else
//T is a leaf node
 if no value in T equals a, report not found.
 else if Ki in T equals a, follow pointer Pri to
read the record/block.
53


B+-Trees: Search






In processing a query, a path is traversed in the tree
from the root to some leaf node.
If there are n search-key values in the file,
 the path is no longer than log f/2(n) (worst
case).
With 1 million search key values and f = 100, at
most log50(1000000) = 4 nodes are accessed in a
lookup.
Contrast this with a balanced binary tree with 1
million search key values -- around 20 nodes are
accessed in a lookup.
54


B+-Trees: Insertion





Find the leaf node in which the search-key value
would appear
If the search-key value is found in the leaf node,
 add the record to main file and if necessary
 add to the block a pointer to the record
If the search-key value is not there,
 add the record to the main file and then:
 If there is room in the leaf node, insert (keyvalue, pointer) pair in the leaf node
 Otherwise, split the node along with the new
(key-value, pointer) entry
55


B+-Trees: Insertion

Splitting a node:
 take the f (search-key value, pointer) pairs
(including the one being inserted) in sorted order.
 place the first (f+1)/2 in the original node x, and
the rest in a new node y.
 let k be the largest key value in x.
 insert (k, y) in the parent node in their correct
sequence.
 If the parent is full
 the entries in the parent node up to Pj, where j =
(f+1)/2 are kept, while the jth search value is
moved to the parent, no replicated.
 A new internal node will hold the entries from
Pj+1 to the end of the entries in the node.


56



B+-Trees: Insertion

The splitting of nodes proceeds upwards till a node
that is not full is found.

In the worst case the root node may be split
increasing the height of the tree by 1.
57




Insertion – Example 3
Insert key = 31
f=4

30
31
32
3
5
11
11
32

58


Insert key = 7
f=4
11
30
31
3
57
11
3
5
5
31



Insertion – Example 3
59


Insert key = 160
f=4
60
180
200
160
179
150
156
179
179
120
140
179
156
100



Insertion – Example 3


New root, insert 45
25
30
32
40
20
25
10
12
1
2
3
3
12
25
61
40
45
new root
f=4
32



Insertion – Example 3




62



B+-Trees: Deletion


Find the record to be deleted, and remove it from
the main file and from the bucket (if present).

Remove (search-key value, pointer) from the leaf
node.

If the node has too few entries due to the removal,
and the entries in the node and a sibling fit into a
single node, then
 Insert all the search-key values in the two nodes
into a single node (the one on the left), and
delete the other node.
63


B+-Trees: Deletion

 Delete
the pair (Ki-1, Pi), where Pi is the pointer to
the deleted node, from its parent, recursively
using the above procedure.


Otherwise, if the node has too few entries due to the
removal, and the entries in the node and a sibling
DO NOT fit into a single node, then
 Redistribute the pointers between the node and a
sibling such that both have more than the
minimum number of entries.
 Update the corresponding search-key value in the
parent of the node.
64



B+-Trees: Deletion


The node deletions may cascade upwards till a node
which has f/2 or more pointers is found.

If the root node has only one pointer after deletion, it
is deleted and the sole child becomes the root.
65



Merge with Sibling

Delete 45

45
50
40
50
20
10
40
50
f=4
66



Redistribute Keys

Delete 40

35
40
50
30
35
15
10
35 30
50
f=4
67



Non-leaf Merging
Delete 37
f=4
22


68
40
45
30
37
25
26
30
20
22
10
14
1
3
3
14
22
30
26 30
37
new root




69


Extra Reading



Read Examples 1 to 7.
70
