Transcript lesson21

Performing Indexing and
Full-Text Searching
Lesson 21
Skills Matrix
Indexes
• Indexes can potentially speed resultset
creation during a query, but only when a
WHERE clause specifies an effective search
argument. Indexes always slow data entry.
• You must balance these competing goals to
keep your database performing optimally.
Heaps
• Tables with no clustered index in place are called
heaps.
• SQL Server stores tables on disk by allocating one
extent (eight contiguous 8 KB pages) at a time in the
database file.
• When one extent fills with data, another is allotted.
• These extents, however, aren’t physically next to each
other in the database file; they’re scattered about.
• That is part of what makes data access on a heap so
slow: SQL Server needs to access an index allocation
map (IAM) to find various extents for the table it’s
searching.
Sysindexes Table
• This system table is used to store index
information: every table in your database has
an entry in the sysindexes table whether or not
the particular table has an index in place.
• The First IAM column tells SQL Server exactly
where to find the first index allocation map
(IAM) page in the database.
• This process of scanning the IAM page, and
then scanning each extent of the table for the
record needed, is called a table scan.
Clustered Indexes
• Clustered indexes physically rearrange the data
that users insert in your tables.
• The arrangement of a clustered index on disk can
be compared to that in a dictionary, because they
both use the same storage paradigm.
• It might help to visualize an index in SQL Server as
an upside-down tree.
– In fact, the index structure is called a B-tree
(balanced-tree) structure.
Clustered Index
• The beginning point of an index traversal is the
root page of the clustered index.
• Each intermediate level of the clustered index
grows initially from the root, to two pages, to
four pages, to eight pages, and so on.
• Because an index is organized as a balanced
inverted tree, each move from one index page
eliminates half of the data to be scanned.
Clustered Index
• When you perform a query on a column that is part of a
clustered index (by using a SELECT statement), SQL
Server must refer to the sysindexes table, where every
table has a record.
• Tables with a clustered index have a value of 1 in the
indid column (unlike heaps, which have a value of 0).
• Once the record has been located, SQL Server looks at
the root column, which contains the location of the root
page of the clustered index.
• When SQL Server locates the root page of the index, it
begins to search for your data.
Clustered Index
• Each page in the clustered index has pointers, or links,
to the index page just before it and the index page just
after it.
• Having these links built into the index pages eliminates
the need for the IAM pages that heaps require.
• SQL Server then looks through each intermediate-level
page, where it may be redirected to another
intermediate-level page or, finally, to the leaf level.
• The leaf level in a clustered index is the end
destination—the data you requested in your SELECT
query.
Selectivity
• Selectivity is the number of duplicate values
in a column; low selectivity means a column
has many duplicate values.
Modifying Data with a Clustered Index
• Modifying data with a clustered index is the
same—you use standard INSERT, UPDATE ,
and DELETE statements.
• What makes this process intriguing is the
way SQL Server has to store your data: it
must be physically rearranged to conform to
the clustered index parameters.
Modifying Data with a Clustered Index
• On a heap, the data are inserted at the end
of the table, which is the bottom of the last
data page.
• If there is no room on any of the data pages,
SQL Server allocates a new extent and starts
filling it with data.
Modifying Data with a Clustered Index
• Because you’ve told SQL Server to physically
rearrange your data by creating a clustered
index, SQL Server no longer has the freedom to
stuff data wherever room exists.
• The data must be placed in order physically.
• To help SQL Server accomplish this task
efficiently, you need to leave a little room at the
end of each data page on a clustered index.
This blank space is referred to as the fill factor.
Nonclustered Index
• Like its clustered cousin, the nonclustered index is
a B-tree structure having a root page, intermediate
levels, and a leaf level.
• However, two major differences separate the index
types.
– The first is that the leaf level of the nonclustered
index doesn’t contain the actual data; it contains
pointers to the data that is stored in data pages.
– The second big difference is that the
nonclustered index doesn’t physically rearrange
the data.
Nonclustered Index
• When you search for data on a table with a
nonclustered index, SQL Server first queries the
sysindexes table, looking for a record that contains
your table name and a value in the indid column
from 2 to 251 (0 denotes a heap, and 1 is for a
clustered index).
• Once SQL Server finds this record, it looks at the
root column to find the root page of the index (just
like it did with a clustered index).
• Once SQL Server has the location of the root page,
it can begin searching for your data.
Clustered and Nonclustered Indexes
Covered Index and Included Columns
• Covered index:
– The index that contains all output fields required
by the operation performed on that index.
– The base table need not be accessed.
• Included columns
– You can extend the functionality of nonclustered
indexes by adding nonkey columns to the leaf
level of the nonclustered index.
– By including nonkey columns, you can create
nonclustered indexes that cover more queries.
Nonkey Columns
• When using nonkey columns, you need to keep a
few guidelines in mind:
– Nonkey columns can be included only in
nonclustered indexes.
– Columns can’t be defined in both the key
column and the INCLUDE list.
– Column names can’t be repeated in the
INCLUDE list.
– Nonkey columns can’t be dropped from a table
unless the index is dropped first.
Nonkey Columns
– You must have at least one key column defined
with a maximum of 16 key columns.
– You can have only up to a maximum of 1,023
included columns.
– The only changes allowed to nonkey columns
are to nullability (from NULL to NOT NULL, and
vice versa), and increasing the length of
varbinary, varchar, or nvarchar columns.
Setting Relational Options
• PAD_INDEX
– When the PAD_INDEX option is set to ON, the
percentage of free space that is specified by
the fill factor is applied to the intermediatelevel pages of the index.
– When this is OFF, the intermediate-level
pages are filled to the point at which there is
room for one new record.
Setting Relational Options
• FILLFACTOR
– Specifies how full the database engine should make
each page during index creation or rebuild.
– Valid values are from 0 to 100.
– Values of 0 and 100 are the same in that they both
tell the database engine to fill the page to capacity,
leaving room for only one new record.
– Any other value specifies the amount of space to
use for data; for instance, a fill factor of 70 tells the
database engine to fill the page to 70 percent full
with 30 percent free space.
Collations
• Collations are used within databases to
display and store an international character
set, based on business requirements.
• When returning data, you have the ability to
retrieve the data in a collation type different
from how it was stored.
• When working with these multiple collations,
you can invoke the COLLATE keyword and
then specify the collation type you prefer to
use.
Collations
• You can use the COLLATE keyword in various
ways and at several levels:
– COLLATE on database creation
– COLLATE on table creation
– COLLATE by casting or expression
– Collations supported by SQL Server
Filtered Index
• SQL Server 2008 includes a new method of
indexing known as Filtered Indexes.
• These indexes work by providing a WHERE
clause like other SQL commands to only
select some rows of data.
• The query optimizer will try to use a filtered
index whenever the WHERE clause of a
SELECT statement matches the WHERE
clause of the index.
Filtered Statistics
• A related new feature in SQL Server 2008 is
Filtered Statistics.
• Query execution plans can now use statistics
filtered by a WHERE clause.
• Filtered statistics are created along with
filtered indexes although you can manually
create filtered statistics using the CREATE
STATISTICS command.
Partitioned Index
• It’s usually best to partition a table and then create an
index on the table so that SQL Server can partition the
index for you based on the partition function and schema
of the table.
• However, you can partition indexes separately.
• This is useful in the following cases:
– The base table isn’t partitioned.
– Your index key is unique but doesn’t contain the
partition column of the table.
– You want the base table to participate in collocated
JOINs with other tables using different JOIN columns.
Partitioned Index
• If you decide you need to partition your index
separately, then you need to keep the
following in mind:
– The arguments of the partition function for
the table and index must have the same data
type.
– Your table and index must define the same
number of partitions.
– The table and index must have the same
partition boundaries.
Spatial Data
• SQL Server 2008 and later versions support spatial data.
This includes support for a planar spatial data type,
geometry, which supports geometric data—points, lines,
and polygons—within a Euclidean coordinate system.
• The geography data type represents shapes on the
Earth's surface, such as an area of land.
• A spatial index on a geography column maps the
geographic data to a two-dimensional, non-Euclidean
space.
• A spatial index is defined on a table column that contains
spatial data (a spatial column).
• Each spatial index refers to a finite space.
Primary Index
• A primary key ensures that each of the records in
your table is unique in some way.
• It does this by creating a special type of index
called a unique index.
• An index is ordinarily used to speed up access to
data by reading all of the values in a column and
keeping an organized list of where the record that
contains that value is located in the table.
Primary Index
• A unique index not only generates that list, but it
also prohibits duplicate values from being stored in
the index.
• If a user tries to enter a duplicate value in the
indexed field, the unique index will return an error,
and the data modification will fail.
• When a column can be used as a unique identifier
for a row (such as an identity column), it is referred
to as a surrogate or candidate key.
Full-text Indexes
• You perform full-text searching through a
completely separate program that runs as a
service, called the SQL Server FullText Search
service, or msftesq, and that can be used to index
all sorts of information from most of the BackOffice
(or even non-Microsoft) products.
• Full-text indexes are created using SQL Server
tools, such as Management Studio, but they are
maintained by the FullText Search service and
stored on disk as files separate from the database.
Noise Words
• You need to consider reducing meaningless
successful hits by itemizing those common
words or phrases in a file for use by Full-Text
Search.
• In SQL Server 2005 this is a text file located,
by default, in the ftdata folder. Use Notepad
or any similar text editor to add or delete
noise words.
Summary
• In this lesson, you first learned how SQL Server
accesses and stores data when no index is in place.
• Without a clustered index, the table is called a heap,
and the data are stored on a first-come, first-served
basis.
• When accessing this data, SQL Server must perform a
table scan, which means SQL Server must read every
record in the table to find the data you’re seeking.
• This can make data access slow on larger tables; but
on smaller tables that are about one extent in size,
table scans can be faster than indexing.
Summary
• Next you learned how to accelerate data access by
using indexes.
• The first index you looked at was the clustered
index.
• This type of index physically rearranges the data in
the database file.
• This property makes the clustered index ideal for
columns that are constantly being searched for
ranges of data and that have low selectivity,
meaning several duplicate values.
Summary
• Next came nonclustered indexes.
• These indexes don’t physically rearrange the data
in the database but rather create pointers to the
actual data.
• This type of index is best suited to high-selectivity
tables (few duplicate values) where single records
are desired rather than ranges.
• Then you learned how to create indexes using SQL
Server Management Studio.
Summary
• Finally, you found that full-text searching
could greatly enhance SELECT queries by
allowing you to find words or phrases within
your text fields.
• With this newfound knowledge about
indexing, you’ll be able to speed up data
access for your users.
Summary for Certification Examination
• Know the difference between clustered and
nonclustered indexes.
• Clustered indexes physically rearrange the
data in a table to match the definition of the
index.
• Nonclustered indexes are separate objects
in the database that refer to the original
table, without rearranging the table in any
way.
Summary for Certification Examination
• Understand full-text indexing. Understand what fulltext indexing is for and how to manage it.
• Full-text indexing runs as a separate service and is
used to search columns of text for phrases, instead
of just single words.
• You have to repopulate the index occasionally to
keep it up to date with the underlying table.
• SQL Server can do this for you automatically if you
want.
Summary for Certification Examination
• Know the relational options for creating
indexes.
• In the “Setting Relational Options” section,
you learned about the different relational
options and what they do.
• Familiarize yourself with these options for
the exam.