Transcript + 1

CSE2132
Database Systems
Week 11 Lecture
Indexes
Indexes 11. 1
Indexes
Consider
SELECT *FROM EMPLOYEE
WHERE EMP_ID = 'E9'
Assuming EMP_ID is unique we expect to retrieve 1 row.
How many records did we have to access in order to retrieve that 1 row?
P1
E1 - Jones
E2 - Smith
E3 - Wong
P2
E4 - White
E5 - Bloggs
P3
E7 - Chen
E9 - Green
Indexes 11. 2
Indexes an Overview
The minimum amount of data transfer between Secondary Storage and
Main Memory is 1 page.
Therefore the cost of accessing Emp_Id = 'E9' is measured using the
number of pages we had to access to retrieve the record we required.
In the case of an unordered file the number of accesses using a linear
search will average n/2 - where n = the number of pages.
n=3
BA (Blocks Accessed) = n/2 = 3/2 = 2 accesses (1.5)
The minimum possible number of data base accesses will be equal to the
number of rows retrieved (or less if the rows are in the same
block/page).
i.e 1 row = 1 page (block) accessed
Indexes 11. 3
Indexes an Overview
The use of index may aid in this BUT indexes have their own overheads
Indexes may use up to 50% of the allocated file space of the data base
How can we reduce the cost of index use and the space used by
indexes ?
1. Use efficient file access methods.
e.g. If we have an ordered file we can use a binary search rather than
just a linear search.
2. Be wise when choosing to create an index.
Indexes 11. 4
An Example
r=30,000 records
blocksize=1,024 bytes (or page size)
rec length=100 bytes
BF = Blocking Factor = 1,024/100 = 10
Number of Blocks = r/BF = 30,000/10 = 3,000 blocks (or pages)
Using a Linear Search
BA = n/2 = 3,000/2 = 1,500 (on average)
Using a Binary Search (if the file is ordered)
BA=log2 n
BA=log2 3,000=12
1 record still need 12 accesses (This is a maximum)
Indexes 11. 5
Hashing - The Number of Accesses
BA = 1 - depending on the blocking factor and the fill factor
1 record retrieved = 1 access - optimal
10 records retrieved = 10 accesses
But what if they are 10 consecutive rows?
BF = 10
BA = 1 or 2 if the records are stored consecutively rather than 10
required for a hashed file organization
Indexes 11. 6
The Situation When Using Indexes
Indexes rely on a key value for access.
If we do not have a key we must sequentially search the file.
An index is an auxiliary file that makes it more efficient to search for a
record in the data file .
An Index is called an ACCESS PATH on a field.
An example of a simple index structure is
<key(field-value),pointer-to-page>
ordered by the field value.
Indexes 11. 7
Index Operation
The pages can be moved around and the index address will still be
correct providing the address contained in the directory is updated
The pointers provide direct access to the data records
Optimally there should be very few accesses to traverse the index file
as all of the accesses are overheads
BA = BA (index) + 1
To retrieve the
data record
Indexes 11. 8
One Way Indexes Reduce Accesses
Example 1 from Elmasri & Navathe (p108)
Binary Search
Data file 3000 blocks
BA=log23000=12
Using an Index to retrieve a data record
the number of accesses is :BAtotal = BAi+BAd = (log 2 n)+1
Size of index record - assume key = 9 bytes eg. SSN V+P
V - length of key
P - length of pointer to data value
= 9 + 6 = 15 bytes
Blocking Factor
BFi=Blocksize / Indexrecordsize
= 1024/15 = 68 records per block
Number of Index Blocks = 30,000/68 = 442
Indexes 11. 9
Total Accesses are Then
BA
i+d
= (log
2
442) + 1
= 9 + 1 = 10 accesses
Which is less than the 12 accesses using the ordered
file alone.
Nb: A non-ordered data file is assumed
and a dense index.
Indexes 11. 10
Another Way Indexes Reduce Accesses
The data file may be ordered and therefore our index may be sparsely
populated, i.e. only 1 index entry per block of data records - therefore
less index records
P1
P2
P3
E1 Jones
E4 White
E7 Chen
E2
Smith
E5 Bloggs
E8
Brown
E3
Wong
E6 West
E9
Green
Our index would only need to contain 3 entries instead of 9 for the
dense index - the number of index entries depends on the blocking
factor of the data file. An unordered data file requires 30,000 index
entries with one index entry per data record. The ordered file requires
only one index entry per block i.e. 3000 index entries.
Indexes 11. 11
Total Accesses are Then
3000 index records/68 = 45 index blocks : ordered data file
BA = (log 2 45) + 1
BA = 5 + 1 = 6 accesses for an indexed ordered file
NOTE: Index files may be independent of the data file
The index can be created and dropped independently of the data file
The index file can be of any file organization(in Oracle it is a B+Tree)
There can be any number of index files for a data file
Indexes 11. 12
Index Terminology
The attributes on which the index is/are built is/are called the
INDEXED FIELD/S
- if these attributes are built on the PRIMARY KEY
then this index is called the PRIMARY INDEX
- if an index is built on any other attributes it is called a
SECONDARY INDEX
and the attribute values may be non- unique
A SIMPLE CLUSTERING INDEX (This is not an option in Oracle)
- the index is built on a non-unique attribute and includes one index
entry for each distinct value of the attribute. The index entry points to
the first data block that contains records with that attribute value. The
underlying file must be ordered on the chosen non-unique attribute.
Indexes 11. 13
A Primary Index on an Ordered File
(ISAM)
ENO
INDEX FILE
PRIMARY
KEY VALUE
1
13
25
39
53
BLOCK
POINTER
.
.
.
.
.
This assumes ENO
is unique and is being
used as a primary key.
It is a non dense index as
the data file is ordered on
the index key.
DATA FILE
ENAME SALARY BIRTDATE
1
3
13
14
25
26
39
41
Indexes 11. 14
A Dense Secondary Index on Non Data File Ordering
Column
INDEX FILE
ENAME
Allen
INDEX
FIELD VALUE
Aaron
Abbot
Adams
Akers
Allen
BLOCK
POINTER
.
.
.
.
.
This assumes ENAME
has been nominated as
an alternative key so an
index entry is required
for every record.The
data file has been
ordered on empno
Anderson
Alexander
Adams
Alfred
Abbot
Akers
Aaron
DATA FILE
DNO EMPNO SALARY
1
3
7
9
13
14
16
17
25
26
29
30
33
35
39
41
Indexes 11. 15
An Clustering Index on a Non_Key Column
INDEX FILE
CLUSTERING
FIELD VALUE
1
2
3
4
5
BLOCK
POINTER
.
.
.
.
.
It has been decided to order the
data file on the department
number.
This is sometimes termed a
clustering index but it is
different to a cluster in Oracle
DNO
1
1
1
2
2
3
3
3
DATA FILE
NAME
EMPNO SALARY
4
8
27
15
11
3
3
4
4
5
5
5
5
Indexes 11. 16
Consider Which Type of Index ?
SELECT * FROM EMP WHERE E# = ‘E1’
SELECT * FROM EMP WHERE DEPT = ‘D1’
SELECT * FROM EMP WHERE ENAME = ‘ABLE’
SELECT * FROM EMP WHERE ENAME LIKE ‘A%’
SELECT * FROM EMP WHERE DEPT = ‘D2’
AND ENAME > ‘CAIN’
E#
ENAME
DEPT
E1
E2
E3
E4
ABLE
ADAMS
BLAKE
BROCK
D1
D1
D1
D2
Indexes 11. 17
Multi Level Indexes
- because a single level index is an ordered file we can create a
non-dense index to an index
i.e A Second Level Index
We can repeat this process of leveling until the highest level of our
index can fit into main memory
- probably 1 page in size - this will reduce the number of I/O's
in traversing the index by 1
This concept of a number of index levels decreasing in breadth is
a tree structure
Indexes 11. 18
Indexes as Tree structures
Tree structures have certain properties we can take advantage of :1. The number of pointers at one level will determine the number of
values stored at the next level
We can predict the number of levels and therefore the number of
accesses required for our data file.
If a node has p data fields it has p + 1 pointers to p + 1 nodes each of
which has p values i.e A node which can fit 2 data values per index
node has 3 pointers to three nodes each with two data values and 3
pointers which can point to 18 data values .
Indexes 11. 19
Indexes as Tree structures
<
=
<
=
>
>
>
<
=
>
>
>
<
=
>
>
A 3 level index with 2 data values per node could point to 18 data
values.
Indexes 11. 20
Indexes as Tree structures
2. If a tree structure is balanced and self-maintaining then the
number of accesses to the data file is constant
The number of accesses using a multi-level index is :-
BA = (log
b
n) + 1
- where b is the branching factor and n is the number of pages
NOTE : The number of accesses will decrease as n becomes smaller
(i.e use a non-dense index) and as the branching factor b increases
Indexes 11. 21
Indexes as Tree structures
(Example 3. Elmasri and Navathe Ch 5)
V = 9 bytes + a 6 byte pointer thus
Index entry Ri = 15 bytes
The blocking factor = 1024/15 = 68
This is known as the fanout for a multi-level index
The number of first level blocks = 30,000 / 68 = 442 blocks
The number of second level blocks = 442 / 68 = 7 blocks
The number of third level blocks = 7 / 68 = 1 block
BA = T(number of levels) + 1 = 3 + 1 = 4
or
BA = (log68 3000) + 1 = 3 + 1 = 4
Index size = b1 + b2 + b3
= 442 + 7 + 1 = 450 blocks
Indexes 11. 22
Index Usage
An index is used to optimize data retrieval
Key
E1
E20
E40
P1
E1
E2
E10
P2
E20
E30
P3
E40
E45
E56
Page #
1
2
3
SELECT *
FROM EMP
WHERE E# = 'E4'
Total Number of Accesses = Index Accesses+1 Data Access = 1 + 1 = 2
Average Number of Serial Accesses for a Table with 3 pages is n/2 = 2
Indexes 11. 23
The Query Optimizer
p. 210, 242-247 Hoffer et.al. 6th Edn.
Ch 18 Date
The Query Optimizer may decide against using an index
In some cases the optimizer will use only the index to answer a query
and will not access any data pages
SELECT COUNT(*)
FROM EMP
Some Relational Products force the use of an index even though it
may be inefficient to do so
1. To support a PK (in Oracle if a column is made a primary key an
index is created)
2. To implement clustering
Indexes 11. 24
Good and Bad Candidates for Indexes
Create indexes on columns used in predicates :. Read only and frequently accessed tables if > 3 pages
. The columns of a predicate in frequently executed transactions
. High update tables can also use indexes if > 6 pages
. columns used in joins are candidates for indexes
. columns in which aggregates are frequently calculated
. use indexes on FK if using RI - will work out integrity violations or
cascade quickly
Avoid creating indexes on :. attributes with a small number of unique values i.e. Gender M,F
although the Oracle BITMAP index is suitable in this situation
. keep indexes down to a reasonable number on high update
attributes(tables) 2 or 3 if possible
Indexes 11. 25
Other Issues for Indexes
The above criteria are not always mutually exclusive - therefore must
decide on index usage based on the most important requirements. There
are tricks which can be used to enhance the use of indexes :. place index or index level in memory
. place indexes on a fast device
. place indexes on a separate device from the data they reference
. use multiple column indexes where possible
i.e INDEX C1, C2, C4
SELECT *
FROM EMP
SELECT *
WHERE C1 = 'A' - will use both FROM EMP
AND C2 = 'B'
treats this as WHERE C1 = 'A' -will use only one
AND C5 = 'E'
a substring
AND C4 = 'D'
AND C5 = 'E'
Indexes 11. 26
An Example
Oracle block maybe 2048 bytes (they vary with operating system)
Number of EMPS = 30,000
48 bytes overhead per index block
Blocking Factor of the index or keys per index page
= 2000/(key length + 6)= 2000/(9+6)= 133
- in theory the branching factor will be less than 133 as we will want
to include free space
b1 = number of records / 133 = 30,000/ 133 = 225
b2 = b1 / 133 = 2
b3 = b2 / 133 < 1
t = 3 levels of index
data
or t = log 133 30000 = 2.1 = 3
(= log 10 30000 / log 10 133)
Maximum Number of Records with a three level index with
Blocking Factor as indicated = 133 x 133 x 133 = 2.5 million
Indexes 11. 27