Transcript item-name
SQL:1999 Advanced Querying
Decision-Support Systems
Data Warehousing
Data Analysis and OLAP
This also apply to SQL: 2003 that is just a minor revision of
SQL:1999
1
Database System Concepts 4th Edition
22.1
©Silberschatz, Korth and Sudarshan
Decision Support Systems
Decision-Support systems are used to make business
decisions often based on data collected by on-line transactionprocessing systems.
Examples of business decisions:
what items to stock?
What insurance premium to change?
Who to send advertisements to?
Examples of data used for making decisions
Retail sales transaction details
Customer profiles (income, age, sex, etc.)
2
Database System Concepts 4th Edition
22.2
©Silberschatz, Korth and Sudarshan
Decision-Support Systems: Overview
A data warehouse archives information gathered from multiple
sources, and stores it under a unified schema, at a single site.
Important for large businesses which generate data from multiple
divisions, possibly at multiple sites
Data may also be purchased externally
Data analysis tasks are simplified by specialized tools and SQL
extensions
Example tasks
For each product category and each region, what were the total sales
in the last quarter and how do they compare with the same quarter
last year
As above, for each product category and each customer category
Statistical analysis packages (e.g., : S++) can be interfaced with
databases
Partial support through OLAP Functions
Data mining seeks to discover knowledge automatically in the form
of statistical rules and patterns from Large databases.
3
Database System Concepts 4th Edition
22.3
©Silberschatz, Korth and Sudarshan
Data Warehousing
A data warehouse is a repository of information gathered
from multiple sources.
4
Database System Concepts 4th Edition
22.4
©Silberschatz, Korth and Sudarshan
Data Warehousing (Cont.)
Provides a single consolidated interface to data
Data stored for an extended period, providing access to
historical data
Data/updates are periodically downloaded form online
transaction processing (OLTP) systems.
Typically, download happens each night.
Data may not be completely up-to-date, but is recent enough for
analysis.
Running large queries at the warehouse ensures that OLTP
systems are not affected by the decision-support workload.
5
Database System Concepts 4th Edition
22.5
©Silberschatz, Korth and Sudarshan
Issues in Building a Warehouse
When and how to gather data.
Source driven: data source initiates data transfer
Destination driven: warehouse initiates data transfer
What schema to use.
Schema integration
Cleaning and conversion of incoming data
What data to summarize.
Raw data may be too large to store on-line
Aggregate values (totals/subtotals) often suffice
Queries on raw data can often be transformed by query optimizer to use
aggregate values
How to propagate updates.
Date at warehouse is a view on source data
Efficient view maintenance techniques required
6
Database System Concepts 4th Edition
22.6
©Silberschatz, Korth and Sudarshan
Data-Warehouse Architecture
7
Database System Concepts 4th Edition
22.7
©Silberschatz, Korth and Sudarshan
Star Schema For A Data Warehouse
8
Database System Concepts 4th Edition
22.8
©Silberschatz, Korth and Sudarshan
Online Analytical Processing
The operation of changing the dimensions used in a cross-tab is
called pivoting.
An OLAP system provides other functionality as well. For
instance, the analyst may wish to see a cross-tab on item-name
and color for a fixed value of size, for example, large, instead of
the sum across all sizes. Such an operation is referred to as
slicing. The operation is sometimes called dicing, particularly
when values for multiple dimensions are fixed.
The operation of moving from finer-granularity data to a coarser
granularity is called a rollup.
The opposite operation - that of moving from coarser-granularity
data to finer-granularity data – is called a drill down.
9
Database System Concepts 4th Edition
22.9
©Silberschatz, Korth and Sudarshan
Relational Representation of the Data in
Figure 22.1
10
Database System Concepts 4th Edition
22.10
©Silberschatz, Korth and Sudarshan
Cross Tabulation of sales by item-name
and color
The table above is an example of a cross-tabulation(or cross-
tab) also referred to as a pivot-table. In general, a cross-table is
a table where values for one attribute form the row headers,
values for another attribute form the column headers, and the
values in an individual cell are derived as follows.
A cross tab with summary rows/columns can be represented by
introducing a special value all to represent subtotals.
Database System Concepts 4th Edition
22.11
11
©Silberschatz, Korth and Sudarshan
Hierarchies on Dimensions
12
Database System Concepts 4th Edition
22.12
©Silberschatz, Korth and Sudarshan
Cross Tabulation of sales With Hierarchy
on item-name
13
Database System Concepts 4th Edition
22.13
©Silberschatz, Korth and Sudarshan
Three-Dimensional Data Cube
14
Database System Concepts 4th Edition
22.14
©Silberschatz, Korth and Sudarshan
OLAP Implementation
The earliest OLAP systems used multidimensional arrays in
memory to store data cubes, and are referred to as
mutidimensional OLAP (MOLAP) systems.
Hybrid systems, which store some summaries in memory and
store the base data and other summaries in a relational
database, are called hybrid OLAP (HOLAP) systems.
15
Database System Concepts 4th Edition
22.15
©Silberschatz, Korth and Sudarshan
Data Analysis (Cont.)
Rollup: Moving from finer-granularity data to a coarser
granularity by means of aggregation.
Drill down: Moving from coarser-granularity data finer-
granularity data.
Proposed extensions to SQL, such as the cube operation
help to support generation of summary data
The following query generates the previous table.
select color, size, sum (number)
from sales
groupby color, size with cube
16
Database System Concepts 4th Edition
22.16
©Silberschatz, Korth and Sudarshan
Data Analysis (Cont.)
Figure shows the combinations of dimensions size, color, price
In general computing cube operation with n groupby columns gives
2nd different groupby combinations.
17
Database System Concepts 4th Edition
22.17
©Silberschatz, Korth and Sudarshan
SQL:1999 Extended GroupBy
also supports generalizations of the group by constructs, using the cube and
rollup constructs. A representative use of the cube construct is;
select item-name, color, size, sum(number)
from sales
group by cube(item-name, color, size)
This query computes the union of eight different groupings of the
sales relation:
{ (item-name, color, size), (item-name, color), (item-name, size),
(color, size), (item-name), (color), (size), ( ) }
Where ( ) denotes an empty group by list.
For each grouping, the result contains the null value for attributes not
present in the grouping. For instance, with occurrences of all replaced by
null, can be computed by the query
select item-name, color, sum(number)
from sales
group by cube(item-name, color)
18
Database System Concepts 4th Edition
22.18
©Silberschatz, Korth and Sudarshan
Extended Group By (Cont.)
A representative rollup construct is
select item-name, color, size, sum(number)
from sales
group by rollup(item-name, color, size)
Here only four grouping are generated:
{ (item-name, color, size), (item-name, color), (item-name), ( ) }
Rollup can be used to generate aggregates at multiple levels of a
hierarchy on a column. For instance, we have a table itemcategory(itemname, category) giving the category of each item. Then the query
select category, item-name, sum(number)
from sales, category
where sales.item-name = itemcategory.item-name
group by rollup(category, item-name)
would give a hierarchical summary by item-name and by category.
19
Database System Concepts 4th Edition
22.19
©Silberschatz, Korth and Sudarshan
Extended Group By (Cont.)
Multiple rollups and cubes can be used in a single group by clause.For
instances, the following query
select item-name, color, size, sum(number)
from sales
group by rollup(item-name), rollup(color, size)
generates the groupings
{ (item-name, color, size), (item-name, color), (item-name), (color, size),
(color), ( ) }
The function grouping can be applied on an attribute; it returns 1 if the value
is a null value representing all, and returns 0 in all other cases. Consider the
following query:
select item-name, color, size, sum(number),
grouping(item-name) as item-name-flag,
grouping(color) as color-flag,
grouping(size) as size-flag,
from sales
group by cube(item-name, color, size)
20
Database System Concepts 4th Edition
22.20
©Silberschatz, Korth and Sudarshan
Extended Aggregation: OLAP Functions
Rank, row_number and other functions based on explicit order
Many new statistical functions: see, e.g., DB2 UDB's High-Function
Business Intelligence in e-business---Red Book
http://www.redbooks.ibm.com/redbooks/SG246546.html
STDDEV, etc.
CORRELATION,
COVARIANCE.
Regression functions: Y = a X + b
REGR_SLOPE -> a
REGR_INTERCEPT -> b
Windows on aggregates:
Physical windows: based on the number of rows in the window
Logical windows: based on the value span of the window.
21
Database System Concepts 4th Edition
22.21
©Silberschatz, Korth and Sudarshan
Ranking
Ranking is done in conjunction with an order by specification. Suppose we
are given a relation student-marks(student-id, marks) which stores the
marks obtained by each student. The following query gives the rank of each
student.
select student-id,
rank( ) over (order by (marks) desc) as s-rank
from student-marks
An extra order by clause is needed to get them in sorted order, as shown
below.
select student-id,
rank ( ) (order by (marks) desc) as s-rank
from student-marks
order by s-rank
22
Database System Concepts 4th Edition
22.22
©Silberschatz, Korth and Sudarshan
Partition By
Ranking can be done within partition of the data. The following query then
gives the rank of students within each section:
student-marks(student-id, marks);
student- section(student-id, section)
select student-id, section,
rank( ) over (partition by section order by marks desc)
as sec-rank
from student-marks, student-section
where student-marks.student-id = student-section.student-id
order by section, sec-rank
dense_rank(): no holes after ties
row_number(): provide row numbering given a specific partitioning and
ordering of rows.
23
Database System Concepts 4th Edition
22.23
©Silberschatz, Korth and Sudarshan
Ranking (Cont.)
For a given constant n, the ranking the function ntile(n) takes the tuples
in each partition in the specified order, and divides them into n buckets
with qual numbers of tuples. For instance, we an sort employees by
salary, and use ntile(3) to find which range (bottom third, middle third, or
top third) each employee is in, and compute the total salary earned by
employees in each range:
select threetile, sum(salary)
from (
select salary, ntile(3)
over (order by (salary) as threetile
from employee) as s
group by threetile
SQL:1999 permits the user to specify where they should occur by using
nulls first or nulls last, for instance
select student-id,
rank ( ) over (order by marks desc nulls last) as s-rank
from student-marks
24
Database System Concepts 4th Edition
22.24
©Silberschatz, Korth and Sudarshan
Aggregates on Windows
An example of window query is that, given sales values for each date, calculates
for each date the average of the sales on that day, the previous day, and the next
day; such moving average queries are used to smooth out random variations.
In contrast to group by, the same tuple can exist in multiple windows. Suppose
we are given a relation transaction(account-number, date-time, value),
select account-number, date-time,
sum(value) over (partition by account-number
order by date-time
range unbounded preceding)
This query returns a new sum for each new tuple—cumulative sum!
An actual window can also be specified: e.g.
range 10 rows preceding
range 30 minutes preceding
25
Database System Concepts 4th Edition
22.25
©Silberschatz, Korth and Sudarshan